Proxy State-Based Evaluation
Overview
Proxy State-Based Evaluation is a scalable reward and grading approach for multi-turn tool-calling agents that replaces fully deterministic backends with structured scenarios, proxy-state tracking, and LLM judges. It aims to preserve state-based evaluation without paying the full cost of hand-built deterministic worlds.
Why it matters
It matters because deterministic backends are expensive, and many harnesses will need something less brittle if they want wide coverage without a small civil-service devoted to simulator upkeep.
Distinctive trait
Its distinctive trait is verifiable-enough proxy state: structured scenario constraints plus state tracking and judging rather than freeform grader sentiment.
Relationships
Read Proxy State-Based Evaluation with tau-bench, evaluation-and-review-loops, appworld, and the gym-design discussion in rl-gyms-and-executable-environments-for-ai-harnesses.