SOPBench

Overview

SOPBench is an executable evaluation framework for language agents that must follow standard operating procedures, policies, and constraints while taking actions and making tool calls. It provides executable environments, SOP graphs, and rule-based verifiers rather than relying on post-hoc prose judgments.

Why it matters

It matters because many serious harness failures are failures of procedure following, not merely failures of raw competence. SOPBench makes that procedural layer measurable.

Distinctive trait

Its distinctive trait is oracle-style rule-based verification over procedure graphs. That gives it a cleaner promotion signal than vague “did the agent seem responsible?” grading.

Relationships

Read SOPBench with tau-bench, appworld, evaluation-and-review-loops, and the broader map in rl-gyms-and-executable-environments-for-ai-harnesses.

Agent Harness Wiki

Browse

SOPBench

Overview

Why it matters

Distinctive trait

Relationships

Graph View

Table of Contents

Backlinks