Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents

Source: arXiv
Authors: Yun-Shiuan Chuang, Chaitanya Kulkarni, Alec Chiu, Avinash Thangali, Zijie Pan, Shivani Shekhar, Yirou Ge, Yixi Li, et al.
Date: 2026-02-18
Primary category: cs.AI
All categories: cs.AI

Abstract

This paper matters because it relaxes the assumption that a rigorous agent benchmark requires a fully deterministic backend. Its proxy state-based evaluation framework combines structured scenario specifications, LLM state tracking, and judges to provide scalable, verifiable reward for multi-turn tool-calling agents.
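
To make the idea concrete, here is a minimal sketch of how a proxy-state evaluation loop could be wired together. It is an illustration under stated assumptions, not the paper's implementation: the names `Scenario`, `track_state`, `judge`, and `evaluate`, and the tool names `cancel_order` and `update_address`, are all hypothetical. In the framework described above, state tracking and judging are performed by LLMs; this sketch swaps in small deterministic stand-ins so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable

State = dict[str, object]
ToolCall = tuple[str, dict]


@dataclass
class Scenario:
    """Hypothetical scenario spec: a starting state and the expected end state."""
    initial_state: State
    expected_state: State


def track_state(state: State, call: ToolCall) -> State:
    """Deterministic stand-in for an LLM state tracker (assumption).

    Infers the next proxy state from one tool call without a real backend.
    """
    name, args = call
    new_state = dict(state)  # copy so earlier states are not mutated
    if name == "cancel_order":
        new_state[f"order:{args['order_id']}"] = "cancelled"
    elif name == "update_address":
        new_state["address"] = args["address"]
    return new_state


def judge(final_state: State, scenario: Scenario) -> float:
    """Deterministic stand-in for a judge (assumption).

    Returns 1.0 only if every expected field matches the final proxy state.
    """
    matched = all(
        final_state.get(key) == value
        for key, value in scenario.expected_state.items()
    )
    return 1.0 if matched else 0.0


def evaluate(
    scenario: Scenario,
    calls: list[ToolCall],
    tracker: Callable[[State, ToolCall], State] = track_state,
) -> float:
    """Replay the agent's tool calls over the proxy state, then score the result."""
    state = dict(scenario.initial_state)
    for call in calls:
        state = tracker(state, call)
    return judge(state, scenario)


scenario = Scenario(
    initial_state={"order:42": "open"},
    expected_state={"order:42": "cancelled"},
)
reward_success = evaluate(scenario, [("cancel_order", {"order_id": 42})])
reward_failure = evaluate(scenario, [])
```

Passing the tracker as a parameter keeps the loop independent of how state is inferred, so an LLM-backed tracker could replace the stub without changing `evaluate`.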
