LLM-readable spec.md Files Research

Context: research pass for /Users/ericfode/wiki.

Question: what kind of spec.md files work best for LLM agents, what excellent specs look like, and how those specs should be verified.

Sources checked

Synthesis

An excellent spec.md for LLM work is not a product essay. It is a compact operating contract for agents, humans, tests, and sometimes provers.

It needs to do four jobs at once:

  1. Orient the model quickly.
  2. Constrain implementation choices.
  3. Define what counts as success.
  4. Provide enough structure for verification.

The best specs are therefore closer to a hybrid of prompt, requirements set, test plan, and traceability matrix than to a traditional narrative design doc.

What works best for LLMs

  • Stable top-level location: the agent should know where the canonical contract lives. If subdocuments exist, spec.md should be the table of contents and authority map.
  • Predictable headings: models benefit from repeated section shapes. Avoid clever prose hierarchy.
  • Explicit terms: include a glossary for domain nouns, state-machine names, role names, event names, and artifact names.
  • Requirement identifiers: use stable IDs such as REQ-001, INV-003, AC-002, NFR-004, TBR-001.
  • One thought per requirement: compound paragraphs are bad input for both models and tests.
  • Concrete examples and counterexamples: these are often more steerable than prose instructions.
  • Non-goals and forbidden moves: LLMs overbuild when negative space is not stated.
  • Verification next to the claim: every normative claim should name at least one verification mode.
  • Real gates: commands, tests, linters, browser checks, formal builds, replay commands, or explicit human review steps.
  • Named uncertainty: use TBR or NEEDS CLARIFICATION rather than burying unresolved choices in smooth prose.
  • Progressive disclosure: keep the canonical file concise; push deep protocol details, UI flows, or proofs into linked files.

Proposed top-level shape

# Spec: <system or feature>
 
## Status
- owner
- status
- created / updated
- canonical source
- target readers
 
## Objective
- problem
- success outcome
- why now
 
## Context
- relevant repo paths
- upstream documents
- existing constraints
- what the agent should inspect first
 
## Glossary
- domain terms
- state names
- actor names
- artifact names
 
## Scope
- in scope
- out of scope
- forbidden implementation moves
 
## System Model
- actors
- entities
- state machine
- events
- data contracts
- external interfaces
 
## Behavioral Requirements
- REQ-001: ...
- REQ-002: ...
 
## Invariants
- INV-001: ...
- INV-002: ...
 
## User / Operator Scenarios
- scenario
- priority
- independent test
- acceptance cases
 
## Verification Matrix
| ID | Claim | Method | Evidence | Owner | Status |
 
## Operational Gates
- test command
- build command
- lint command
- replay / browser / formal command
 
## Examples and Counterexamples
- positive examples
- negative examples
- edge cases
 
## Open Questions
- TBR-001: ...
- TBR-002: ...
 
## Change Log
- date, decision, reason

Verification ladder

  1. Structural lint: file exists, headings present, IDs unique, no duplicate requirement IDs, no unresolved TBR without owner/date.
  2. Requirements-quality lint: flag vague terms, compound requirements, missing subject/predicate, missing verification method, implementation leakage, and undefined glossary terms.
  3. Traceability check: every normative ID appears in the verification matrix; every acceptance test points back to at least one requirement or invariant.
  4. Executable tests: unit, integration, browser, replay, or CLI tests prove concrete behavior.
  5. Negative and metamorphic tests: the spec should say what must fail or remain invariant under transformation.
  6. Cross-agent comprehension eval: give the spec to an agent in a fresh context and ask it to produce an implementation plan, test plan, and risk list; score whether it preserves the intended constraints.
  7. Implementation replay: compare the final diff, artifacts, and evidence against the spec and verification matrix.
  8. Formal slice: for small but central invariants, translate into Lean, TLA+, Alloy, Dafny, or another checker. Do not pretend the whole spec is formal unless it actually is.

Ideas worth building

  • spec doctor: parses spec.md, extracts normative claims, finds ambiguity, and emits a verification matrix.
  • spec score: reports verifiability, traceability, and falsifiability scores.
  • spec mutate: creates bad implementations or bad tests to see whether the spec catches them.
  • spec replay: checks whether an implementation branch actually discharged the acceptance surface.
  • spec ask: runs a fresh-agent comprehension probe and compares the plan to a rubric.
  • spec lattice: converts prose into a claim lattice with refinements, evidence, contradictions, and open obligations.
  • spec formalize: promotes selected invariants into a prover-facing artifact.

Bottom line

The best spec.md files for LLMs are short enough to load, structured enough to parse, specific enough to constrain, and verifiable enough to argue with. A spec that cannot produce tests, checks, counterexamples, or proof obligations is only context. Useful context is not the same as a specification.