Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
Source: arXiv
Authors: Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, et al.
Date: 2024-09-12
Primary category: cs.AI
All categories: cs.AI
Abstract
Windows Agent Arena adapts the real-OS agent benchmark idea to Windows and is designed for large-scale parallel evaluation. It matters because general “computer use” is too broad a target to measure unless the environment can be run in parallel and graded cheaply.