BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems
Source: arXiv
Authors: Wei Wang, Dan Zhang, Tao Feng, Boyan Wang, Jie Tang
Date: 2024-08-28
Primary category: cs.CL
All categories: cs.CL
Abstract
BattleAgentBench is designed to measure coordination itself, rather than single-agent competence that merely looks like multi-agent ability. It evaluates models in three kinds of settings: navigation, paired execution, and harder collaborative or competitive scenarios. This breakdown gives a more granular picture of where multi-agent language models fail. Its main importance here is evidentiary: it shows that even strong models still exhibit a large gap between local competence and robust teamwork.