SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints

Source: arXiv
Authors: Zekun Li, Shinda Huang, Jiangtian Wang, Nathan Zhang, Antonis Antoniades, Wenyue Hua, Kaijie Zhu, Sirui Zeng, Chi Wang, William Yang Wang, Xifeng Yan
Date: 2025-03-11
Primary category: cs.CL
All categories: cs.CL, cs.AI

Abstract

SOPBench builds executable environments, SOP graphs, and rule-based verifiers to measure whether agents actually follow procedures and constraints. For a workflow-evolution control plane, this is the kind of benchmark substrate that can drive promotion decisions from externally verified results rather than letting the agent grade its own homework.
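To make the idea of a rule-based verifier over an SOP graph concrete, here is a minimal sketch. It is not SOPBench's implementation: the action names, the `SOP_PREREQUISITES` mapping, and the `verify_trajectory` function are all hypothetical, and the SOP graph is assumed to be a simple map from each action to the actions that must have been completed before it.

```python
# Hypothetical SOP graph (an assumption for illustration, not taken from SOPBench):
# each action maps to the set of actions that must be completed before it.
SOP_PREREQUISITES = {
    "verify_identity": set(),
    "check_balance": {"verify_identity"},
    "transfer_funds": {"verify_identity", "check_balance"},
}


def verify_trajectory(trajectory, prerequisites):
    """Check an agent's action sequence against the SOP graph.

    Returns a list of (step_index, action, reason) tuples.
    An empty list means no violation was found.
    """
    violations = []
    completed = set()
    for step, action in enumerate(trajectory):
        if action not in prerequisites:
            violations.append((step, action, "unknown action"))
            continue
        missing = prerequisites[action] - completed
        if missing:
            reason = "missing prerequisites: " + ", ".join(sorted(missing))
            violations.append((step, action, reason))
            # A non-compliant action is not counted as completed,
            # so actions that depend on it are flagged as well.
            continue
        completed.add(action)
    return violations


if __name__ == "__main__":
    compliant = ["verify_identity", "check_balance", "transfer_funds"]
    skipped = ["verify_identity", "transfer_funds"]
    print(verify_trajectory(compliant, SOP_PREREQUISITES))
    print(verify_trajectory(skipped, SOP_PREREQUISITES))
```

Because the check is a deterministic function of the recorded trajectory, its verdict does not depend on the agent's own account of what it did, which is the property that makes it usable as an external gate.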
