SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints
Source: arXiv
Authors: Zekun Li, Shinda Huang, Jiangtian Wang, Nathan Zhang, Antonis Antoniades, Wenyue Hua, Kaijie Zhu, Sirui Zeng, Chi Wang, William Yang Wang, Xifeng Yan
Date: 2025-03-11
Primary category: cs.CL
All categories: cs.CL, cs.AI
Abstract
SOPBench builds executable environments, SOP graphs, and rule-based verifiers to measure whether agents actually follow procedures and constraints. For a workflow-evolution control plane, this is precisely the sort of benchmark substrate that can drive promotion decisions from external, rule-based checks rather than letting the agent grade its own homework.
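To make the idea of a rule-based verifier over an SOP graph concrete, here is a minimal sketch. It is not SOPBench's actual API: the `SOPGraph` class, the `verify_trajectory` and `should_promote` functions, and the banking-style action names are all hypothetical, and the sketch assumes an SOP can be reduced to "each action has a set of actions that must come before it".

```python
from dataclasses import dataclass, field


@dataclass
class SOPGraph:
    # Hypothetical representation: each action maps to the set of
    # actions that must already have been executed before it.
    prerequisites: dict[str, set[str]] = field(default_factory=dict)


def verify_trajectory(sop: SOPGraph, trajectory: list[str]) -> list[str]:
    """Return violation messages; an empty list means the trajectory complies."""
    violations = []
    done: set[str] = set()
    for step, action in enumerate(trajectory):
        # Prerequisites that have not been executed yet at this step.
        missing = sop.prerequisites.get(action, set()) - done
        if missing:
            violations.append(
                f"step {step}: '{action}' called before {sorted(missing)}"
            )
        done.add(action)
    return violations


def should_promote(sop: SOPGraph, trajectories: list[list[str]]) -> bool:
    """Promote only if every recorded trajectory passes the rule-based check."""
    return all(not verify_trajectory(sop, t) for t in trajectories)


# Illustrative SOP (assumed, not taken from the benchmark).
sop = SOPGraph({
    "check_balance": {"verify_identity"},
    "transfer_funds": {"verify_identity", "check_balance"},
})

good = ["verify_identity", "check_balance", "transfer_funds"]
bad = ["check_balance", "transfer_funds"]  # skips identity verification

print(verify_trajectory(sop, good))
print(verify_trajectory(sop, bad))
print(should_promote(sop, [good, bad]))
```

The point of the design is that the pass/fail signal comes from the recorded action sequence and fixed rules, not from the agent's own report of what it did, which is what makes it usable as a promotion gate.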