AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents

Source: arXiv
Authors: Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong
Date: 2024-01-24
Primary category: cs.CL
All categories: cs.CL, cs.AI, cs.LG

Abstract

AgentBoard is better read as an analytical evaluation board than as a gym: it spans multiple environments and reports progress metrics for multi-turn LLM agents. It is a useful reminder that not every harness substrate needs to be trainable; some serve as interpretive score surfaces that summarize agent behavior across many tasks.
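
To make the idea of a score surface concrete, here is a minimal sketch in Python. It is not AgentBoard's actual API or metric definition: the `Episode` record, the `progress_rate` and `score_board` helpers, and the environment names are hypothetical, and the sketch assumes progress can be measured as the fraction of annotated subgoals an agent reaches in an episode.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Episode:
    """One finished agent run (hypothetical record, not AgentBoard's schema)."""
    environment: str       # illustrative environment name
    subgoals_total: int    # number of annotated subgoals for the task
    subgoals_reached: int  # how many of them the agent satisfied


def progress_rate(ep: Episode) -> float:
    """Fraction of subgoals reached; 1.0 means the task was fully solved."""
    if ep.subgoals_total <= 0:
        raise ValueError("subgoals_total must be positive")
    return ep.subgoals_reached / ep.subgoals_total


def score_board(episodes):
    """Aggregate mean progress and success rate per environment."""
    grouped = defaultdict(list)
    for ep in episodes:
        grouped[ep.environment].append(ep)

    board = {}
    for env, eps in grouped.items():
        rates = [progress_rate(ep) for ep in eps]
        board[env] = {
            "episodes": len(eps),
            "mean_progress": sum(rates) / len(rates),
            # An episode counts as a success only if every subgoal was reached.
            "success_rate": sum(r == 1.0 for r in rates) / len(rates),
        }
    return board


# Illustrative data only; these are not results from the paper.
episodes = [
    Episode("env_a", subgoals_total=4, subgoals_reached=4),
    Episode("env_a", subgoals_total=4, subgoals_reached=2),
    Episode("env_b", subgoals_total=5, subgoals_reached=0),
]
board = score_board(episodes)
for env, row in sorted(board.items()):
    print(env, row)
```

Nothing in this sketch updates the agent or the environments; it only reads finished episodes and summarizes them. Reporting mean progress next to success rate is what makes the surface interpretive: two agents with the same success rate can still differ in how far they get on the tasks they fail.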