Agent Harness Wiki
Tag: benchmark
38 items with this tag.
Apr 15, 2026
another-harness and Atropos
comparison
benchmark
tool-execution
work-management
Apr 15, 2026
Gas City Live Ops, Benchmarks, and Sandboxes
gas-city
benchmark
work-management
context-engineering
Apr 11, 2026
Atropos
tool-execution
orchestration
benchmark
Apr 11, 2026
EvoSkills
memory
work-management
benchmark
Apr 11, 2026
JudgeFlow
benchmark
error-recovery
work-management
Apr 11, 2026
RobustFlow
benchmark
context-engineering
work-management
Apr 11, 2026
WorfBench
benchmark
work-management
orchestration
Apr 11, 2026
WorfEval
benchmark
work-management
orchestration
Apr 11, 2026
another-harness resume-recover environment
benchmark
tool-execution
work-management
Apr 11, 2026
Prompt Optimization Open Questions: Evaluation, Transfer, Robustness, and Benchmarking
survey
comparison
benchmark
context-engineering
Apr 11, 2026
RL Gyms and Executable Environments for AI Harnesses
survey
benchmark
tool-execution
work-management
Apr 10, 2026
AgentBoard
benchmark
orchestration
work-management
Apr 10, 2026
AgentGym
benchmark
orchestration
tool-execution
Apr 10, 2026
AppWorld
benchmark
tool-execution
work-management
Apr 10, 2026
BrowserGym
benchmark
tool-execution
orchestration
Apr 10, 2026
ComputerRL
benchmark
tool-execution
orchestration
Apr 10, 2026
EnterpriseBench Corecraft
benchmark
tool-execution
work-management
Apr 10, 2026
GAIA
benchmark
tool-execution
orchestration
Apr 10, 2026
MLGym
benchmark
orchestration
work-management
Apr 10, 2026
OSWorld
benchmark
tool-execution
orchestration
Apr 10, 2026
Proxy State-Based Evaluation
benchmark
tool-execution
work-management
Apr 10, 2026
SOPBench
benchmark
tool-execution
work-management
Apr 10, 2026
SWE-Gym
benchmark
code-quality
work-management
Apr 10, 2026
τ-bench
benchmark
tool-execution
work-management
Apr 10, 2026
VisualWebArena
benchmark
tool-execution
orchestration
Apr 10, 2026
WebArena
benchmark
tool-execution
orchestration
Apr 10, 2026
WebCanvas
benchmark
tool-execution
work-management
Apr 10, 2026
WebShop
benchmark
tool-execution
work-management
Apr 10, 2026
Windows Agent Arena
benchmark
tool-execution
orchestration
Apr 10, 2026
WorkArena++
benchmark
tool-execution
work-management
Apr 10, 2026
WorkArena
benchmark
tool-execution
work-management
Apr 10, 2026
another-harness Atropos environment schema
comparison
benchmark
tool-execution
work-management
Apr 10, 2026
another-harness evaluator-discipline environment
benchmark
tool-execution
work-management
Apr 10, 2026
another-harness work-item closure environment
benchmark
tool-execution
work-management
Apr 10, 2026
Neural-Native Programming Research Program
benchmark
formal-methods
program-synthesis
mechanistic-interpretability
Apr 10, 2026
Neural-Native Programming via Direct Interfaces to Transformer Internal Layers
survey
comparison
program-synthesis
benchmark
safety
Apr 09, 2026
Harness Quality Comparison
comparison
benchmark
code-quality
Apr 07, 2026
Harness Decision Matrix
comparison
benchmark
code-quality