Harness Decision Matrix

Purpose

This page converts the qualitative comparisons in harness-quality-comparison and harness-architecture-comparison into a decision table suitable for design choice rather than mere literary appreciation. The numbers are provisional and literature-grounded, not benchmark scripture.

Scoring rubric

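Score each criterion on a 1-5 scale in 0.5 increments. Weighted total = sum(weight × score / 5), taken over the six criteria. The weights sum to 100, so totals range from 20 (every criterion at 1) to 100 (every criterion at 5).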

| Criterion | Weight | Meaning |
|---|---|---|
| Architecture cleanliness and legibility | 20 | Protocol clarity, session model clarity, and repo/system-of-record discipline |
| State continuity | 20 | Memory persistence, resumability, and recovery after context loss |
| Evaluation and review rigor | 20 | Explicit QA loops, evaluator separation, and reality-bearing verification |
| Work primitives and orchestration power | 15 | Richness of work representation and coordination model |
| Operational trustworthiness | 15 | Safety controls, permissions, and stability or migration risk |
| Surface breadth and reuse | 10 | Coherent operation across CLI, IDE, API, web, or messaging surfaces |
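
The rubric can be sketched as a small function. This is a minimal illustration, not part of the wiki tooling: the short keys (arch, state, and so on) and the helper names validate_score and weighted_total are assumptions made for the example; only the weights and the formula come from the rubric above.

```python
from typing import Mapping

# Weights copied from the rubric table; they sum to 100.
# The short key names are hypothetical labels for the six criteria.
WEIGHTS = {
    "arch": 20,
    "state": 20,
    "eval": 20,
    "orch": 15,
    "trust": 15,
    "surf": 10,
}


def validate_score(score: float) -> None:
    """Reject scores outside the 1-5 scale or off the 0.5 grid."""
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"score {score} is outside the 1-5 scale")
    if (score * 2) % 1 != 0:
        raise ValueError(f"score {score} is not a multiple of 0.5")


def weighted_total(scores: Mapping[str, float]) -> float:
    """Return sum(weight * score / 5) over all rubric criteria."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the rubric criteria")
    for value in scores.values():
        validate_score(value)
    return sum(WEIGHTS[name] * scores[name] / 5 for name in WEIGHTS)


# Bounds of the scale: every criterion at 1, then every criterion at 5.
floor = weighted_total({name: 1.0 for name in WEIGHTS})
ceiling = weighted_total({name: 5.0 for name in WEIGHTS})
print(floor, ceiling)  # 20.0 100.0
```

Because the weights sum to 100, the total reads as a percentage of the maximum attainable score, with a floor of 20 rather than 0.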

Matrix

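| Harness | Arch | State | Eval | Orch | Trust | Surf | Total |
|---|---|---|---|---|---|---|---|
| claude-code | 4.5 | 5.0 | 5.0 | 3.5 | 4.0 | 4.5 | 89.5 |
| codex-cli | 5.0 | 3.5 | 4.0 | 3.5 | 4.5 | 5.0 | 84.0 |
| hermes-agent | 3.5 | 5.0 | 3.0 | 3.0 | 4.5 | 5.0 | 78.5 |
| gas-town | 2.5 | 4.0 | 3.5 | 5.0 | 2.5 | 2.0 | 66.5 |
| gas-city | 3.0 | 3.5 | 3.0 | 5.0 | 2.0 | 3.0 | 65.0 |
| openclaw | 3.0 | 4.0 | 2.0 | 2.5 | 1.5 | 5.0 | 58.0 |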
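
The Total column can be rechecked mechanically whenever a score changes. The sketch below copies the rows from the matrix and recomputes each total with the rubric formula; the names MATRIX and find_mismatches are hypothetical helpers for this example, not existing tooling.

```python
# Criterion weights in column order: Arch, State, Eval, Orch, Trust, Surf.
WEIGHTS = (20, 20, 20, 15, 15, 10)

# Rows copied from the matrix: six scores, then the published total.
MATRIX = {
    "claude-code": ((4.5, 5.0, 5.0, 3.5, 4.0, 4.5), 89.5),
    "codex-cli": ((5.0, 3.5, 4.0, 3.5, 4.5, 5.0), 84.0),
    "hermes-agent": ((3.5, 5.0, 3.0, 3.0, 4.5, 5.0), 78.5),
    "gas-town": ((2.5, 4.0, 3.5, 5.0, 2.5, 2.0), 66.5),
    "gas-city": ((3.0, 3.5, 3.0, 5.0, 2.0, 3.0), 65.0),
    "openclaw": ((3.0, 4.0, 2.0, 2.5, 1.5, 5.0), 58.0),
}


def weighted_total(scores):
    """Apply the rubric formula: sum(weight * score / 5)."""
    return sum(w * s / 5 for w, s in zip(WEIGHTS, scores))


def find_mismatches(matrix, tolerance=1e-9):
    """Return {harness: (published, recomputed)} for rows that disagree."""
    mismatches = {}
    for name, (scores, published) in matrix.items():
        recomputed = weighted_total(scores)
        if abs(recomputed - published) > tolerance:
            mismatches[name] = (published, recomputed)
    return mismatches


mismatches = find_mismatches(MATRIX)
ranking = sorted(
    MATRIX, key=lambda name: weighted_total(MATRIX[name][0]), reverse=True
)
print(mismatches)  # {} : every published total matches the formula
print(ranking)
```

For the table as published, every total agrees with the formula and the recomputed ranking matches the row order.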

Reading the matrix

claude-code wins the present-tense overall score because the corpus treats it as the strongest on resumable artifacts, evaluator separation, and long-running task recovery. codex-cli remains the best architectural specimen: if the question is what shape a new harness core should have, Codex is the cleanest answer.

hermes-agent ranks lower overall mainly because the current corpus shows it putting less weight on explicit evaluator loops (Eval 3.0) than on persistent memory and multi-surface continuity (State 5.0, Surf 5.0), while the rubric gives Eval a full 20 points. If the goal were long-term personal usefulness rather than design purity, Hermes would rise. gas-town and gas-city dominate the orchestration column (5.0 each) for the obvious industrial reasons, while openclaw (Surf 5.0, Trust 1.5) demonstrates that breadth without strong trust boundaries is a poor bargain.
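
The claim that Hermes would rise under a different goal can be tested by reweighting. The sketch below is illustrative only: the PERSONAL_USE weights are invented for this example (they shift weight toward State, Trust, and Surf and are not taken from the corpus), and the rank helper is hypothetical. Scores are copied from the matrix.

```python
# Column order throughout: Arch, State, Eval, Orch, Trust, Surf.
RUBRIC = (20, 20, 20, 15, 15, 10)        # weights from the scoring rubric
PERSONAL_USE = (10, 30, 10, 10, 20, 20)  # ASSUMED weights, illustration only

# Scores copied from the matrix.
SCORES = {
    "claude-code": (4.5, 5.0, 5.0, 3.5, 4.0, 4.5),
    "codex-cli": (5.0, 3.5, 4.0, 3.5, 4.5, 5.0),
    "hermes-agent": (3.5, 5.0, 3.0, 3.0, 4.5, 5.0),
    "gas-town": (2.5, 4.0, 3.5, 5.0, 2.5, 2.0),
    "gas-city": (3.0, 3.5, 3.0, 5.0, 2.0, 3.0),
    "openclaw": (3.0, 4.0, 2.0, 2.5, 1.5, 5.0),
}


def weighted_total(weights, scores):
    """Apply the rubric formula: sum(weight * score / 5)."""
    return sum(w * s / 5 for w, s in zip(weights, scores))


def rank(weights):
    """Return (harness, total) pairs sorted from highest to lowest total."""
    if sum(weights) != 100:
        raise ValueError("weights must sum to 100 to stay comparable")
    totals = {name: weighted_total(weights, s) for name, s in SCORES.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for label, weights in (("rubric", RUBRIC), ("personal use", PERSONAL_USE)):
    print(label, rank(weights))
```

Under these assumed weights hermes-agent moves from third place (78.5) to second (87.0), ahead of codex-cli (84.0), while claude-code stays first (90.0). Other weightings would give other orderings, so the result shows sensitivity to the weights rather than a verdict.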

Design verdict for another-harness

The matrix supports the design thesis already sketched in new-harness-design-notes:

Cautions

These scores are not laboratory measurements. They are a disciplined reduction of the current wiki corpus. If the source base becomes more quantitative, the matrix should be recomputed rather than fondly defended.

Read this beside harness-quality-comparison, harness-architecture-comparison, evaluation-and-review-loops, and new-harness-design-notes.