Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

Source: arXiv
Authors: Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, et al.
Date: 2024-09-12
Primary category: cs.AI
All categories: cs.AI

Abstract

Windows Agent Arena adapts the real-OS benchmark idea to Windows and is designed for large-scale parallel evaluation. It matters because general “computer use” is too broad a capability to benchmark unless the environment can be run in parallel and graded cheaply.
