VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
Source: arXiv
Authors: Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou
Date: 2024-01-24
Primary category: cs.LG
All categories: cs.LG, cs.CL, cs.CV
Abstract
VisualWebArena extends realistic web-agent evaluation to visually grounded tasks, where text alone is insufficient. This matters because many serious computer-use harnesses will need multimodal grounding, not just DOM text and polite optimism.