Harness Decision Matrix

Purpose

This page converts the qualitative comparisons in harness-quality-comparison and harness-architecture-comparison into a decision table suitable for design choice rather than mere literary appreciation. The numbers are provisional and literature-grounded, not benchmark scripture.

Scoring rubric

Use a 1-5 scale with 0.5 increments. Weighted total = sum(weight × score / 5).

Criterion	Weight	Meaning
Architecture cleanliness and legibility	20	Protocol clarity, session model clarity, and repo/system-of-record discipline
State continuity	20	Memory persistence, resumability, and recovery after context loss
Evaluation and review rigor	20	Explicit QA loops, evaluator separation, and reality-bearing verification
Work primitives and orchestration power	15	Richness of work representation and coordination model
Operational trustworthiness	15	Safety controls, permissions, and stability or migration risk
Surface breadth and reuse	10	Coherent operation across CLI, IDE, API, web, or messaging surfaces

Matrix

Harness	Arch	State	Eval	Orch	Trust	Surf	Total
claude-code	4.5	5.0	5.0	3.5	4.0	4.5	89.5
codex-cli	5.0	3.5	4.0	3.5	4.5	5.0	84.0
hermes-agent	3.5	5.0	3.0	3.0	4.5	5.0	78.5
gas-town	2.5	4.0	3.5	5.0	2.5	2.0	66.5
gas-city	3.0	3.5	3.0	5.0	2.0	3.0	65.0
openclaw	3.0	4.0	2.0	2.5	1.5	5.0	58.0

Reading the matrix

claude-code wins the present-tense overall score because the corpus treats it as the strongest on resumable artifacts, evaluator separation, and long-running task recovery. codex-cli remains the best architectural specimen: if the question is what shape a new harness core should have, Codex is the cleanest answer.

hermes-agent ranks lower overall only because the current corpus puts less weight on explicit evaluator loops than on persistent memory and multi-surface continuity. If the goal were long-term personal usefulness rather than design purity, Hermes would rise. gas-town and gas-city dominate the orchestration column for the obvious industrial reasons, while openclaw demonstrates that breadth without strong trust boundaries is a poor bargain.

Design verdict for another-harness

The matrix supports the design thesis already sketched in new-harness-design-notes:

borrow core architecture from codex-cli
borrow persistence and skill accumulation from hermes-agent
borrow evaluator discipline from claude-code
borrow explicit work objects and orchestration direction from gas-town and gas-city

Cautions

These scores are not laboratory measurements. They are a disciplined reduction of the current wiki corpus. If the source base becomes more quantitative, the matrix should be recomputed rather than fondly defended.

Read this beside harness-quality-comparison, harness-architecture-comparison, evaluation-and-review-loops, and new-harness-design-notes.

Agent Harness Wiki

Explorer

Harness Decision Matrix

Purpose

Scoring rubric

Matrix

Reading the matrix

Design verdict for another-harness

Cautions

Graph View

Table of Contents

Backlinks

Agent Harness Wiki

Explorer

Harness Decision Matrix

Purpose

Scoring rubric

Matrix

Reading the matrix

Design verdict for another-harness

Cautions

Related pages

Graph View

Table of Contents

Backlinks