Evaluation and Review Loops
Definition
Evaluation and review loops are the mechanisms by which a harness checks whether the worker agent actually achieved the goal instead of merely composing a plausible success story. They can be automated tests, browser checks, separate evaluator agents, or human-in-the-loop PR triage. The important feature is adversarial distance from the original worker.
Representative patterns
codex-cli appears in this corpus through OpenAI’s self-review and agent-review loop: implement, request review, absorb criticism, repeat, then inspect diffs in a multi-agent supervisor surface. claude-code pushes the separation further by giving evaluation work to distinct roles with explicit pass/fail criteria and live-system tooling, then extending that into CI and automated review workflows. In the Yegge line, review becomes operational governance: PR sheriffs, maintainers, and merge strategies that keep swarm output from becoming a landfill.
What strong loops require
- Explicit acceptance criteria, often externalized into durable files or checklists.
- Tool access to reality: browser automation, logs, screenshots, metrics, or local repro commands.
- Structural independence between builder and reviewer, even if both are agents.
- A workflow that routes failure back into the next iteration instead of letting it dissolve into vague “looks good” prose.
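The four requirements above can be sketched as a small loop. This is a minimal illustration, not any particular harness's implementation: `Criterion`, `Review`, `run_loop`, the toy worker, and the `world` dictionary are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Criterion:
    """One explicit acceptance criterion (hypothetical structure)."""
    name: str
    check: Callable[[dict], bool]  # inspects observed state, not the worker's claim


@dataclass
class Review:
    passed: bool
    failures: list = field(default_factory=list)


def review(state: dict, criteria: list) -> Review:
    """Reviewer: evaluates observed state against every criterion."""
    failures = [c.name for c in criteria if not c.check(state)]
    return Review(passed=not failures, failures=failures)


def run_loop(worker, observe, criteria, max_iters: int = 3):
    """Alternate build and review; failed criteria become the next input."""
    feedback = []
    result = Review(passed=False, failures=["no iterations run"])
    for iteration in range(1, max_iters + 1):
        worker(feedback)                       # builder acts on prior failures
        result = review(observe(), criteria)   # reviewer reads state independently
        if result.passed:
            return iteration, result
        feedback = result.failures             # route failure into the next pass
    return max_iters, result


# Toy stand-ins (assumptions for illustration only).
world = {"tests_green": False, "lint_clean": False}


def toy_worker(feedback):
    if not feedback:
        world["tests_green"] = True   # first attempt fixes only the tests
    for name in feedback:
        world[name] = True            # later attempts address named failures


criteria = [
    Criterion("tests_green", lambda s: s["tests_green"]),
    Criterion("lint_clean", lambda s: s["lint_clean"]),
]
iterations, final_review = run_loop(toy_worker, lambda: dict(world), criteria)
print(iterations, final_review)
```

The reviewer never sees the worker's own account of what it did. It only reads state through `observe`, which is one simple way to get structural independence even when both sides are agents.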
Executable environments
The newer benchmark literature adds a more concrete substrate for review loops: executable environments with state-based grading. rl-gyms-and-executable-environments-for-ai-harnesses collects the main families, but the practical lesson is already clear. AppWorld, SWE-Gym, OSWorld, and related systems do not merely evaluate final prose; they evaluate world state, test outcomes, or rubric satisfaction after a multi-step interaction trace. That is much closer to the kind of evidence a learning or promotion loop can safely consume.
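As a rough illustration of state-based grading, the sketch below grades a toy environment by its final state and ignores the agent's closing message. `ToyEnv`, `grade_state`, and the file contents are hypothetical; they are not the API of AppWorld, SWE-Gym, or OSWorld.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Hypothetical executable environment: an in-memory file store."""
    files: dict = field(default_factory=dict)
    trace: list = field(default_factory=list)

    def step(self, action: str, path: str, content: str = "") -> None:
        self.trace.append((action, path))  # keep the interaction trace
        if action == "write":
            self.files[path] = content
        elif action == "delete":
            self.files.pop(path, None)
        else:
            raise ValueError(f"unknown action: {action}")


def grade_state(env: ToyEnv, expected: dict) -> dict:
    """Grade the final world state, not what the agent says about it."""
    wrong = sorted(p for p in expected if env.files.get(p) != expected[p])
    extra = sorted(p for p in env.files if p not in expected)
    return {"passed": not wrong and not extra,
            "missing_or_wrong": wrong,
            "unexpected": extra}


expected = {"README.md": "hello"}

# Episode 1: a multi-step trace that ends in the right state.
good_env = ToyEnv()
good_env.step("write", "README.md", "hello")
good_env.step("write", "tmp.txt", "scratch")
good_env.step("delete", "tmp.txt")
good_report = grade_state(good_env, expected)

# Episode 2: the agent claims success, but the state disagrees.
bad_env = ToyEnv()
bad_env.step("write", "README.md", "helo")
agent_claim = "Done, all files are in place."  # never consulted by the grader
bad_report = grade_state(bad_env, expected)

print(good_report)
print(bad_report)
```

The second episode fails even though its closing message sounds confident, because the grader consumes only the environment state.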
Review as training signal
on-policy-self-distillation raises the bar for what a review loop should emit. A useful evaluator should not merely return pass or fail; it should preserve compiler errors, runtime exceptions, failed-test traces, reviewer comments, judge rationales, and user follow-up replies in a form that can condition later agent behavior. Even when no weight update happens, these richer artifacts improve recovery and future context assembly.
This does not remove the need for adversarial distance. A self-distilled teacher is still the same model with more context, so independent tests and reviewers remain the authority; their feedback simply becomes more reusable.
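One way to picture a richer review artifact is a structured record that keeps the error type, message, and trace alongside the verdict. The sketch below is an assumption about shape only: `ReviewArtifact`, `run_check`, and the field names are hypothetical and do not come from on-policy-self-distillation.

```python
import json
import traceback
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ReviewArtifact:
    """Hypothetical record: the verdict plus the evidence behind it."""
    check: str
    passed: bool
    error_type: Optional[str] = None
    message: Optional[str] = None
    trace: Optional[str] = None
    reviewer_notes: list = field(default_factory=list)


def run_check(name: str, fn) -> ReviewArtifact:
    """Run one check and preserve the failure detail instead of a bare False."""
    try:
        fn()
    except Exception as exc:
        return ReviewArtifact(
            check=name,
            passed=False,
            error_type=type(exc).__name__,
            message=str(exc),
            trace=traceback.format_exc(),
        )
    return ReviewArtifact(check=name, passed=True)


def failing_check():
    # Stand-in for a real test; the message is illustrative.
    raise ValueError("expected 3 rows, got 2")


artifact = run_check("row_count", failing_check)
artifact.reviewer_notes.append("Likely an off-by-one in the pagination loop.")

# Serialized form that a later iteration could load into its context.
context_line = json.dumps(asdict(artifact))
print(context_line)
```

The verdict still comes from the independent check. The record only makes that verdict reusable by later iterations.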
Simulatability tests
agentic-imodels suggests another review primitive: evaluate whether an agent can answer held-out operational questions from an artifact representation alone. In that paper the artifact is a fitted model’s __str__ output, but the pattern generalizes to tool outputs, issue summaries, trace digests, and failure reports. If an evaluator cannot reconstruct the relevant behavior from the artifact, the artifact is not agent-readable in any operational sense.
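A toy version of the simulatability test might look like the following. Everything here is a hypothetical stand-in: `LinearModel` replaces a real fitted model, and `answer_from_artifact` is a regex parser standing in for an agent that sees only the artifact text. It is not the procedure from the agentic-imodels paper.

```python
import re
from dataclasses import dataclass


@dataclass
class LinearModel:
    """Toy fitted model whose __str__ is the artifact under test."""
    slope: float
    intercept: float

    def predict(self, x: float) -> float:
        return self.slope * x + self.intercept

    def __str__(self) -> str:
        return f"y = {self.slope} * x + {self.intercept}"


def answer_from_artifact(artifact: str, x: float) -> float:
    """Stand-in for an agent: it sees only the text, never the model."""
    match = re.fullmatch(r"y = (\S+) \* x \+ (\S+)", artifact)
    if match is None:
        raise ValueError("artifact is not readable")
    return float(match.group(1)) * x + float(match.group(2))


def simulatability(model, artifact, answerer, questions, tol=1e-9) -> float:
    """Fraction of held-out questions answered correctly from the artifact."""
    correct = 0
    for x in questions:
        try:
            guess = answerer(artifact, x)
        except ValueError:
            continue  # an unreadable artifact counts as a wrong answer
        if abs(guess - model.predict(x)) <= tol:
            correct += 1
    return correct / len(questions)


model = LinearModel(slope=2.0, intercept=1.0)
held_out = [0.0, 1.5, -3.0]

readable_score = simulatability(model, str(model), answer_from_artifact, held_out)
opaque_score = simulatability(model, "<LinearModel object>", answer_from_artifact, held_out)
print(readable_score, opaque_score)
```

The same scoring function applies to any artifact type. Only the answerer and the question set change.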
Main trade-off
Good review loops cost more in tokens, time, and operator design. They also add coordination overhead. But without them, long-running systems drift toward premature victory, hidden regressions, and PR pileups. This is why evaluation belongs inside harness-engineering rather than being bolted on at release time as an afterthought.
Related pages
Read with harness-engineering, claude-code, codex-cli, and work-management-primitives. This concept also explains much of the ranking logic in harness-quality-comparison and the evaluation column in harness-architecture-comparison. The gym-style extension of this idea is rl-gyms-and-executable-environments-for-ai-harnesses, and the new project anchor is software-verification-testing-environment-research-program.
The architecture synthesis for treating verifiers and evidence as harness objects is in agent-facing-verifier-environment-architecture.
Advanced evidence primitives
Evaluation loops can integrate richer evidence beyond simple pass/fail tests:
- Metamorphic relations: Useful as a primary review technique for non-deterministic LLM pipelines, catching fact-conflicting hallucinations without strict ground-truth oracles.
- Agentic coverage-guided fuzzing: Systems like FLARE and WhiteFox show that agents can drive fuzzing campaigns, systematically exploring edge cases and returning concrete execution paths as evidence.
- Property-based testing (PBT): Serves as a continuous contract primitive, validating invariants on generated code across diverse inputs.
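The metamorphic and property-based items can be combined in one small sketch: a relation between outputs on related inputs, checked over randomly generated cases with no ground-truth oracle. The functions below are hypothetical and use only the standard library; a real setup would more likely rely on a dedicated library such as Hypothesis, and an LLM pipeline would take the place of the deterministic function used here.

```python
import random


def normalize_tags(tags):
    """Hypothetical function under review: trim, lowercase, dedupe, sort."""
    return sorted({t.strip().lower() for t in tags if t.strip()})


def buggy_normalize_tags(tags):
    """Deliberately flawed variant: keeps first-seen order instead of sorting."""
    seen = []
    for t in tags:
        t = t.strip().lower()
        if t and t not in seen:
            seen.append(t)
    return seen


def check_permutation_invariance(fn, trials=200, seed=0):
    """Metamorphic relation: reordering the input must not change the output.

    Returns None if the relation held on every trial, otherwise a concrete
    counterexample pair (original, shuffled).
    """
    rng = random.Random(seed)  # fixed seed so a failure can be reproduced
    vocab = ["Bug", "bug ", "feature", "DOCS", "docs", " "]
    for _ in range(trials):
        tags = [rng.choice(vocab) for _ in range(rng.randint(0, 6))]
        shuffled = tags[:]
        rng.shuffle(shuffled)
        if fn(tags) != fn(shuffled):
            return tags, shuffled
    return None


good_result = check_permutation_invariance(normalize_tags)
bad_result = check_permutation_invariance(buggy_normalize_tags)
print(good_result)
print(bad_result)
```

No expected output is hard-coded anywhere. The check needs only the relation itself, and a failure comes back as a concrete input pair that can be fed to the next iteration.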