τ-bench
Overview
τ-bench evaluates tool-agent-user interaction in dynamic real-world domains with policy rules, domain-specific APIs, and simulated user conversation. It is one of the cleanest tool-centric environments for studying whether agents behave consistently and follow rules.
Why it matters
It matters because many harness failures are not navigation failures but policy and interaction failures. τ-bench puts that problem in the foreground.
Distinctive trait
Its distinctive trait is end-state grading over multi-turn tool use plus domain rules, rather than just checking whether the agent touched the right API in the right order once.
Relationships
Read τ-bench with appworld, proxy-state-based-evaluation, sopbench if mentioned through related queries, and rl-gyms-and-executable-environments-for-ai-harnesses.