Agent-Facing Verifier and Testing Harness Architecture

Question

If a harness treats verifiers, tests, proofs, traces, and reviewer gates as first-class environment objects, what is the minimum object model, state machine, and primitive set that makes them usable by an agent rather than merely visible to a human operator?

Short answer

The model is not merely “APIs over pytest.” It is a structured evidence surface with specification objects, evidence ledgers, promotion gates, and regression memory, coupled to a state machine that separates builder, verifier, and reviewer roles without flattening them into one actor.

What the evidence says

From the harness survey literature

The modern harness survey frames an agent harness as six governance functions: Execution (E), Tool Registry (T), Context (C), State Store (S), Lifecycle Hooks (L), and Evaluation Interface (V). The critical additions for verification are S and V, provided they are treated as structured objects rather than incidental logging. The state store S is where evidence survives, and the evaluation interface V is where evidence becomes legible enough to drive further action.

From CodeTracer

CodeTracer shows that raw agent run directories are insufficient. Heterogeneous traces must be normalized into typed records (action, observation, diff, verification) and then indexed into a hierarchical trace tree with exploration nodes and state-changing nodes. The important step for harness design is that failure-onset localization is not a human convenience; it is a structured signal that can be fed back into the agent as a reflective replay prefix, recovering failed runs deterministically under the same budget.

From another-harness

The local prototypes in another-harness-work-item-closure-environment and another-harness-evaluator-discipline-environment validate a narrower but real version of the same idea: the environment is not a test runner but a frozen contract with role-specific permissions. Builder episodes may not approve completion. Evaluator episodes may not rewrite deliverables. This is the verifier-as-environment-object in concrete form.

From SWE-Gym and AppWorld

SWE-Gym pairs builder training with verifier training on the same trajectory substrate. AppWorld uses state-based evaluation with collateral-damage checks. The lesson is that the environment must reward not only the final artifact but also the trace of interaction that produced it, and that verifier interaction traces are themselves learnable.
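
A state-based check with a collateral-damage guard can be sketched in a few lines. This is an illustrative sketch only, not AppWorld's actual evaluator: the function name `evaluate_state_change`, the flat dictionary snapshots, and the result keys are all assumptions made for the example.

```python
from typing import Any

_MISSING = object()  # sentinel so added or removed keys count as changes


def evaluate_state_change(
    before: dict[str, Any],
    after: dict[str, Any],
    expected: dict[str, Any],
) -> dict[str, Any]:
    """Grade an episode from environment snapshots rather than agent output.

    `expected` maps state keys to the values the task should produce.
    Any other key that changed is reported as collateral damage.
    """
    all_keys = before.keys() | after.keys()
    changed = {
        key
        for key in all_keys
        if before.get(key, _MISSING) != after.get(key, _MISSING)
    }
    goal_met = all(after.get(key, _MISSING) == value for key, value in expected.items())
    collateral = sorted(changed - set(expected))
    return {
        "goal_met": goal_met,
        "collateral_damage": collateral,
        "passed": goal_met and not collateral,
    }


# Hypothetical episode: the task was to archive one email, but the agent
# also changed an account balance along the way.
before = {"inbox_count": 3, "calendar": ["standup"], "balance": 100}
after = {"inbox_count": 2, "calendar": ["standup"], "balance": 90}
report = evaluate_state_change(before, after, expected={"inbox_count": 2})
```

The point of the split result is that an agent can distinguish "goal not reached" from "goal reached with side effects", which a single pass/fail bit hides.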

Candidate primitives

1. Specification surface

A first-class specification object that narrows natural-language intent into something checkable.

Kind	Granularity	Example
Assertion	Line/function	`assert_equals(out, expected)`
Contract	Module boundary	Pre/post conditions, type invariants
Property	System behavior	`forall lists l: reverse(reverse(l)) == l`
Temporal property	Multi-step trace	`eventually(always(reconciled))`
Theorem statement	Formal claim	`\sum_{i=1}^{n} i = n(n+1)/2`
Reviewer checklist	Human gate	Acceptance artifact with pass/fail items

The harness does not require all of these at once. It requires that whatever kind is chosen, the specification object is addressable, versioned, and referenced by evidence records, not buried in prose.
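
One way to make "addressable, versioned, and referenced" concrete is a small in-memory library keyed by a reference triple. The sketch below is an assumption-laden illustration: `SpecRef`, `Specification`, `SpecificationLibrary`, and the dotted address format are hypothetical names, not part of any existing harness.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecRef:
    kind: str      # e.g. "assertion", "contract", "property"
    address: str   # hypothetical addressing scheme
    version: int


@dataclass(frozen=True)
class Specification:
    ref: SpecRef
    statement: str  # the checkable statement, stored as text

    def content_hash(self) -> str:
        payload = f"{self.ref.kind}|{self.ref.address}|{self.ref.version}|{self.statement}"
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class SpecificationLibrary:
    """Stores specifications by reference; existing versions are never overwritten."""

    def __init__(self) -> None:
        self._specs: dict[SpecRef, Specification] = {}

    def add(self, spec: Specification) -> SpecRef:
        if spec.ref in self._specs:
            raise ValueError(f"{spec.ref} already exists; publish a new version instead")
        self._specs[spec.ref] = spec
        return spec.ref

    def resolve(self, ref: SpecRef) -> Specification:
        return self._specs[ref]

    def latest(self, kind: str, address: str) -> Specification:
        candidates = [
            spec for ref, spec in self._specs.items()
            if ref.kind == kind and ref.address == address
        ]
        if not candidates:
            raise KeyError((kind, address))
        return max(candidates, key=lambda spec: spec.ref.version)


library = SpecificationLibrary()
v1 = library.add(Specification(
    SpecRef("property", "list.reverse_involution", 1),
    "forall lists l: reverse(reverse(l)) == l",
))
v2 = library.add(Specification(
    SpecRef("property", "list.reverse_involution", 2),
    "forall finite lists l: reverse(reverse(l)) == l",
))
```

Because evidence records would point at a `SpecRef` rather than at prose, a later reader can tell exactly which version of a claim a given run was checked against.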

2. Evidence ledger

A durable append-only record of what was run, against what specification, in what environment, producing what result.

evidence_record:
  record_id: sha256(content)
  spec_ref: {kind, address, version}
  run_context: {command, env_hash, agent_turn}
  result: {status, counterexample, diff, log_ref}
  feedback: {kind, payload_ref, teacher_context_ref, credit_scope, used_for_training}
  artifact_hash: sha256(files_at_time)
  reviewer_decision: {pending | accepted | rejected | waived}
  limitation: optional
  timestamp: iso8601

This is the harness-level equivalent of what swe-gym and appworld do inside their evaluators, but lifted into a reusable ledger so that evidence survives beyond one benchmark run.

The feedback field is the addition suggested by on-policy-self-distillation: the verifier should preserve not only the outcome, but the explanatory payload that might later condition a teacher model, reviewer replay, or adapter-training job. used_for_training must remain explicit so evaluation evidence does not quietly become a hidden training set.
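
A minimal sketch of such a ledger follows, assuming records are plain dictionaries and that `record_id` is the SHA-256 of a canonical JSON encoding of the remaining fields. The class name `EvidenceLedger`, the helper `content_id`, and the method names are hypothetical. Reviewer decisions made later would be appended as new records rather than edited in place.

```python
import copy
import hashlib
import json
from datetime import datetime, timezone


def content_id(content: dict) -> str:
    """Hash a canonical JSON encoding so equal content yields an equal id."""
    encoded = json.dumps(content, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


class EvidenceLedger:
    """Append-only store: records can be added and read, never mutated."""

    def __init__(self) -> None:
        self._records: list[dict] = []

    def append(self, spec_ref, run_context, result, artifact_hash,
               feedback=None, limitation=None) -> str:
        content = {
            "spec_ref": spec_ref,
            "run_context": run_context,
            "result": result,
            # Training use is opt-in: default to an explicit False.
            "feedback": feedback or {"used_for_training": False},
            "artifact_hash": artifact_hash,
            "reviewer_decision": "pending",
            "limitation": limitation,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        record = {"record_id": content_id(content), **content}
        self._records.append(record)
        return record["record_id"]

    def records_for(self, spec_ref: dict) -> list[dict]:
        # Deep copies keep callers from editing stored history.
        return [copy.deepcopy(r) for r in self._records if r["spec_ref"] == spec_ref]

    def training_candidates(self) -> list[dict]:
        return [
            copy.deepcopy(r) for r in self._records
            if r["feedback"].get("used_for_training") is True
        ]


ledger = EvidenceLedger()
spec = {"kind": "property", "address": "list.reverse_involution", "version": 1}
rid = ledger.append(
    spec_ref=spec,
    run_context={"command": "pytest -q", "env_hash": "env-abc", "agent_turn": 7},
    result={"status": "passed", "counterexample": None, "diff": None, "log_ref": None},
    artifact_hash="artifact-123",
)
```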

3. Promotion gate

A state object that tracks where a piece of work sits in the acceptance pipeline.

promotion_state:
  artifact_ref: {path, commit, version}
  tests: {passed, failed, skipped, record_refs}
  proofs: {proved, pending, counterexample, record_refs}
  reviews: {open, approved, changes_requested, dismissed}
  regression_checks: {all_passing, known_failures, record_refs}
  overall: {proposed | tested | reviewed | proved | rejected | waived | accepted}

A promotion gate is not a CI badge. It is a structured state machine with computed fields so that an agent can query it, reason about it, and act on blocked transitions.
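
To illustrate what "act on blocked transitions" could look like, the sketch below computes machine-readable blockers for a requested target state. The field names in `PromotionSnapshot`, the reason strings, and the specific blocking rules are assumptions chosen for the example, not a prescribed policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromotionSnapshot:
    tests_passed: int
    tests_failed: int
    reviews_approved: int
    reviews_changes_requested: int
    proofs_pending: int
    proofs_counterexample: int
    known_failures: int


ORDERED_TARGETS = ("tested", "reviewed", "proved")


def blockers(snapshot: PromotionSnapshot, target: str) -> list[str]:
    """Return the reasons the gate cannot advance to `target` (empty if it can)."""
    if target not in ORDERED_TARGETS:
        raise ValueError(f"unknown target state: {target}")
    reasons: list[str] = []
    # Every target at or beyond "tested" needs clean test evidence.
    if snapshot.tests_passed == 0:
        reasons.append("no_passing_tests")
    if snapshot.tests_failed > 0:
        reasons.append("failing_tests")
    if snapshot.known_failures > 0:
        reasons.append("open_regressions")
    if target in ("reviewed", "proved"):
        if snapshot.reviews_approved == 0:
            reasons.append("no_review_approval")
        if snapshot.reviews_changes_requested > 0:
            reasons.append("changes_requested")
    if target == "proved":
        if snapshot.proofs_pending > 0:
            reasons.append("proofs_pending")
        if snapshot.proofs_counterexample > 0:
            reasons.append("proof_counterexample")
    return reasons


snapshot = PromotionSnapshot(
    tests_passed=12, tests_failed=0,
    reviews_approved=0, reviews_changes_requested=1,
    proofs_pending=2, proofs_counterexample=0,
    known_failures=0,
)
```

Returning reasons instead of a boolean is the design choice that matters here: each reason names something the agent can go and fix or escalate.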

4. Regression memory

Prior failures and counterexamples promoted into reusable specification objects.

regression_item:
  spec_ref: {address, version}  # the test or property that failed
  counterexample: {input, trace_ref}
  first_seen: timestamp
  last_reproduced: timestamp
  status: {open | fixed | accepted | regression_detected}
  linked_spec_refs: [...]  # other specs that this failure validates

This is adjacent to memory-persistence but scoped to evidence rather than general context. It answers the question: “if this worked before, why should I believe it works now?”
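
The lifecycle of a regression item can be sketched as a pair of pure functions: one promotes a failure into memory, the other folds in the result of a later re-run. The names `promote_failure` and `record_rerun`, the flattened fields, and the status rules are assumptions for illustration. Each update returns a new item so earlier versions stay intact.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RegressionItem:
    spec_address: str
    spec_version: int
    counterexample_input: str
    first_seen: str        # ISO 8601 timestamp
    last_reproduced: str   # ISO 8601 timestamp
    status: str = "open"   # open | fixed | accepted | regression_detected


def promote_failure(spec_address: str, spec_version: int,
                    counterexample_input: str, seen_at: str) -> RegressionItem:
    """Turn a failing run into a reusable regression item."""
    return RegressionItem(spec_address, spec_version, counterexample_input,
                          first_seen=seen_at, last_reproduced=seen_at)


def record_rerun(item: RegressionItem, reproduced: bool, at: str) -> RegressionItem:
    """Return an updated copy of `item` after re-running its counterexample."""
    if reproduced:
        # A failure that returns after being fixed is a regression.
        status = "regression_detected" if item.status == "fixed" else item.status
        return replace(item, last_reproduced=at, status=status)
    if item.status in ("open", "regression_detected"):
        return replace(item, status="fixed")
    return item


opened = promote_failure("list.reverse_involution", 1, "[1, 2, 3]", "2024-01-01T00:00:00+00:00")
fixed = record_rerun(opened, reproduced=False, at="2024-01-02T00:00:00+00:00")
regressed = record_rerun(fixed, reproduced=True, at="2024-01-09T00:00:00+00:00")
```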

5. Trace tree node

A normalized node in the hierarchical trace model suggested by CodeTracer.

trace_node:
  node_id: uuid
  kind: exploration | state_change | verification | rollback
  action: {tool_call, raw_command}
  observation: {stdout, stderr, dom, screenshot, return_value}
  diff: {files_changed, patch_ref}
  verification: {spec_ref, evidence_record_ref, pass|fail|unknown}
  children: [trace_node_refs]
  parent: trace_node_ref | null
  failure_onset: bool  # true if this is the earliest error-critical step

The trace tree outlives the run that produced it and remains navigable for replay or diagnosis.
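
As a rough illustration of how the failure_onset flag could feed a replay prefix, the sketch below walks a tree in pre-order and returns the path of node ids from the root to the first flagged node. This is not CodeTracer's algorithm; `TraceNode` is reduced to the fields the traversal needs, and `path_to_failure_onset` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TraceNode:
    node_id: str
    kind: str  # exploration | state_change | verification | rollback
    failure_onset: bool = False
    children: list["TraceNode"] = field(default_factory=list)


def path_to_failure_onset(root: TraceNode) -> Optional[list[str]]:
    """Return node ids from the root to the first failure-onset node, or None."""
    stack = [(root, [root.node_id])]
    while stack:
        node, path = stack.pop()
        if node.failure_onset:
            return path
        # Push children in reverse so the leftmost child is visited first.
        for child in reversed(node.children):
            stack.append((child, path + [child.node_id]))
    return None


tree = TraceNode("root", "exploration", children=[
    TraceNode("read_docs", "exploration"),
    TraceNode("edit_config", "state_change", children=[
        TraceNode("run_tests", "verification", failure_onset=True),
    ]),
])
prefix = path_to_failure_onset(tree)
```

The returned prefix is the part of the run a replay would keep; everything after the onset node is what the agent would be asked to redo.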

Proposed object model

Environment
├── specification_library
│   ├── assertions[]
│   ├── contracts[]
│   ├── properties[]
│   ├── theorems[]
│   └── reviewer_checklists[]
├── evidence_ledger
│   └── evidence_records[]
├── promotion_gates
│   └── promotion_states[]
├── regression_memory
│   └── regression_items[]
└── trace_forest
    └── trace_trees[]
        └── trace_nodes[]

Every top-level object is versioned and hash-addressed. The ledger and regression memory are append-only. The promotion gates are mutable state machines with versioned transitions. The trace forest is a write-once, read-many index of historical runs.

State machine: the promotion pipeline

stateDiagram-v2
    [*] --> Proposed: agent submits work
    Proposed --> Tested: evidence ledger shows passing tests
    Tested --> Reviewed: reviewer gate approved
    Reviewed --> Proved: formal lane completes proof
    Reviewed --> Rejected: reviewer requests changes
    Proved --> Accepted: promotion gate updates
    Rejected --> Proposed: agent revises and resubmits
    Tested --> Rejected: tests fail after revision
    Proved --> Waived: operator overrides proof requirement
    Waived --> Accepted: promotion gate updates
    Accepted --> [*]

Important invariants:

A builder may not transition its own promotion gate beyond Proposed.
An evaluator may not transition a gate beyond Tested; only a reviewer or formal verifier may advance further.
A waiver is logged as an evidence_record with a limitation statement so that later audit trails can reason about it.
Rejection returns to Proposed, not to a generic “failed” sink, so the state machine preserves retry context.
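
The diagram and invariants above can be encoded as a transition table with per-edge role permissions. The edges below follow the diagram; the role assigned to each edge is an assumption where the text does not state it (in particular the `harness` role for the final promotion and the `operator` role for waivers), and `transition` and `TransitionDenied` are hypothetical names.

```python
class TransitionDenied(Exception):
    """Raised when an edge does not exist or the role may not take it."""


# (current, target) -> roles allowed to trigger the transition.
TRANSITIONS: dict[tuple[str, str], frozenset[str]] = {
    ("Proposed", "Tested"): frozenset({"evaluator"}),
    ("Tested", "Reviewed"): frozenset({"reviewer"}),
    ("Tested", "Rejected"): frozenset({"evaluator"}),
    ("Reviewed", "Proved"): frozenset({"formal_verifier"}),
    ("Reviewed", "Rejected"): frozenset({"reviewer"}),
    ("Rejected", "Proposed"): frozenset({"builder"}),
    ("Proved", "Waived"): frozenset({"operator"}),
    ("Proved", "Accepted"): frozenset({"harness"}),
    ("Waived", "Accepted"): frozenset({"harness"}),
}


def transition(current: str, target: str, role: str) -> str:
    """Return the new state, or raise if the edge or the role is not allowed."""
    allowed = TRANSITIONS.get((current, target))
    if allowed is None:
        raise TransitionDenied(f"no edge {current} -> {target}")
    if role not in allowed:
        raise TransitionDenied(f"role {role!r} may not move {current} -> {target}")
    return target


def allowed_targets(current: str, role: str) -> list[str]:
    """List the states this role can move the gate to from `current`."""
    return sorted(
        target for (source, target), roles in TRANSITIONS.items()
        if source == current and role in roles
    )


state = transition("Proposed", "Tested", role="evaluator")
```

Exposing `allowed_targets` alongside `transition` lets an agent ask what it may do before it tries, which is the difference between a queryable gate and one that only fails after the fact.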

How this relates to existing harness components

Harness component	Verifier-environment mapping
Session container	Promotion gate owns session-scoped evidence
Prompt assembly	Specification objects feed into context engineering
Tool execution	Evidence records instrument tool calls
Durable memory	Evidence ledger and regression memory
Evaluators	Promotion gates with evaluator-discipline rules
Resume	Trace trees preserve state across interruptions
Work objects	Promotion gates become the canonical work state
Review loops	Structured reviewer gates with recorded decisions
Formal lanes	Specification surfaces and proof-checked transitions
Evolution	Regression memory informs learned test selection

Open design questions

Specification authoring authority. Who may add a specification object? The agent, the operator, a separate requirements agent, or an inferred specification mined from trajectories? Each choice changes the trust model.
Trace tree granularity versus compression. CodeTracer demonstrates full tree indexing, but at scale this may exceed context-window or storage budgets. How should a harness summarize trace trees without losing the failure-onset signal?
Formal lane integration depth. Should the harness treat a theorem prover as just another tool in the evidence ledger, or as a privileged transition in the promotion gate? The latter is cleaner but harder to generalize across tools.
Anti-gaming at the evidence level. If evidence records are durable and hash-addressed, an agent could replay old successful evidence to mask a regression. The ledger needs freshness checks, environment hash binding, and possibly adversarial re-run sampling — similar to the hardening already done in another-harness-evaluator-discipline-environment.
Waiver semantics and governance. A waiver is a decision to accept without evidence. It must be explicit, attributed, and reversible. But it also opens an attack surface: an agent that learns to petition for waivers instead of producing evidence. How should waiver patterns be detected and surfaced?
Distillable feedback governance. If evidence records can become training examples, who approves that use? The harness needs consent, privacy filters, poisoning defenses, and adapter/checkpoint scoping before treating user follow-ups or reviewer comments as model-update material.
Regression item decay. Not all historical failures deserve permanent memory. Some were fixed, some were accepted as expected behavior, and some were symptoms of transient environment drift. The regression memory needs a principled eviction or demotion policy rather than unbounded accumulation.
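
For the anti-gaming question in particular, one possible shape of a freshness check is sketched here. It assumes evidence records are dictionaries with the artifact_hash, run_context.env_hash, and timestamp fields from the ledger schema; the function name `stale_reasons`, the reason strings, and the 24-hour default are hypothetical choices, and adversarial re-run sampling is left out.

```python
from datetime import datetime, timedelta, timezone


def stale_reasons(
    record: dict,
    current_artifact_hash: str,
    current_env_hash: str,
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
) -> list[str]:
    """Return why a record cannot count as current evidence (empty if it can)."""
    reasons: list[str] = []
    # Evidence only speaks for the exact artifact it was produced against.
    if record["artifact_hash"] != current_artifact_hash:
        reasons.append("artifact_hash_mismatch")
    # Binding to the environment hash blocks replay across environments.
    if record["run_context"]["env_hash"] != current_env_hash:
        reasons.append("env_hash_mismatch")
    recorded_at = datetime.fromisoformat(record["timestamp"])
    if now - recorded_at > max_age:
        reasons.append("too_old")
    return reasons


old_record = {
    "artifact_hash": "artifact-v1",
    "run_context": {"command": "pytest -q", "env_hash": "env-abc", "agent_turn": 3},
    "timestamp": "2024-01-01T00:00:00+00:00",
}
check_time = datetime(2024, 1, 3, tzinfo=timezone.utc)
replayed = stale_reasons(old_record, "artifact-v2", "env-abc", now=check_time)
```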

Bottom line

The agent-facing verifier environment is not a bag of testing tools. It is a structured substrate for specifications, evidence, promotion, and regression, governed by a state machine that enforces role separation between builder, evaluator, reviewer, and formal verifier. The primitives are not exotic: ledgers, state machines, trees, and hashes. What matters is that they are exposed as addressable objects rather than hidden inside CI logs or human dashboards.

The closest existing precedent is the combination of swe-gym (trajectory + verifier training), appworld (state-based grading), another-harness-work-item-closure-environment (frozen contract + role isolation), and CodeTracer (hierarchical trace indexing). The proposed architecture unifies these into a single object model that a harness can query, inspect, and learn from.

Read this with software-verification-testing-environment-research-program, formal-methods-for-agent-harnesses, evaluation-and-review-loops, work-management-primitives, agent-harness-anatomy, self-evolving-workflows, another-harness-work-item-closure-environment, another-harness-evaluator-discipline-environment, swe-gym, appworld, and the raw CodeTracer note under raw/papers/code-tracer-towards-traceable-agent-states.md.

See also: cobalt-tla, leetproof

Agent Harness Wiki
