Spec Dataset Evolution Research Project

Question

How can we gather a large public dataset of spec.md and spec-like software specification files, preserve references to their actual repositories, and study how those specs change over time: before/after the AI era, under project pressure, with changing popularity, and in relation to the code they are meant to constrain?

This extends llm-readable-spec-files from “what makes a good spec for agents” into an empirical corpus project: what specs actually exist in public software, how they evolve, and whether they behave like contracts or decorative prose. It connects to context-engineering, work-management-primitives, and evaluation-and-review-loops because useful specs are durable context objects, work-management contracts, and evaluation surfaces at once. Decorative prose is cheaper to write; it also has the structural integrity of a soufflé in CI.

Private corpus and wiki ingest

The working private corpus now lives at https://github.com/ericfode/spec-dataset-evolution-corpus, with a local checkout at /Users/ericfode/src/spec-dataset-evolution-corpus. It contains the private raw-file research archive, provenance manifests, dedup outputs, analysis frames, and authored deep-dive dossiers.

Public wiki ingestion is tracked separately in spec-deep-dive-wiki-ingest-project on Kanban board spec-deep-dive-wiki-ingest; its aggregate public-safe navigation page is spec-deep-dive-index. That project should synthesize deep dives into wiki pages without dumping wholesale raw copied specs into the public wiki.

Output contract

The research project should produce:

  1. A versioned dataset of public spec artifacts.
  2. Stable public references for every artifact: repository URL, commit SHA, file path, blob URL, raw URL where available, and retrieval timestamp.
  3. A lineage table showing how each logical spec file changes across commits, renames, moves, deletions, rewrites, and reactivations.
  4. Repo-level features: stars, forks, languages, created date, activity, releases, issues/PRs where available, and license status.
  5. Spec-level features: size, section structure, requirement-like content, change magnitude, volatility, AI-era labels, and AI-related content signals.
  6. Code/spec connectedness features: ratios, direct references, path proximity, co-change, issue/PR linkage, symbol linkage, test linkage, and graph metrics.
  7. Analysis notebooks/reports answering the research questions below with descriptive and statistical evidence.

Core research questions

RQ1: How many public software specs exist, and what forms do they take?

Measure exact spec.md files plus broader spec-like files: requirements, technical specs, PRDs, RFCs, ADRs, architecture/design documents, API contracts, OpenAPI/GraphQL/Proto schemas, and agent-native spec directories such as .kiro/specs or .agent-os/specs when they contain requirements/design/tasks.

Report both:

  • total public occurrences, including forks/templates; and
  • unique normalized documents after deduplication.
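One way to produce both counts is to hash a normalized form of each file and group occurrences by that hash. The sketch below is only an illustration: the normalization rules (Unicode NFC, LF line endings, trailing-whitespace stripping) and the function names are assumptions, not the project's chosen canonicalization, and near-duplicate clustering is not shown.

```python
import hashlib
import unicodedata
from collections import Counter


def normalize_text(raw: bytes) -> str:
    """Hypothetical normalization: NFC, LF newlines, no trailing whitespace."""
    text = raw.decode("utf-8", errors="replace")
    text = unicodedata.normalize("NFC", text)
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).strip("\n")


def normalized_sha256(raw: bytes) -> str:
    """Hash of the normalized text, used as the exact-duplicate key."""
    return hashlib.sha256(normalize_text(raw).encode("utf-8")).hexdigest()


def occurrence_report(files: list[bytes]) -> dict[str, int]:
    """Return total occurrences and unique normalized documents."""
    clusters = Counter(normalized_sha256(raw) for raw in files)
    return {
        "total_occurrences": sum(clusters.values()),
        "unique_normalized_documents": len(clusters),
    }


sample = [
    b"# Spec\n\nMust return 200.\n",
    b"# Spec\r\n\r\nMust return 200.   \r\n",  # fork copy: CRLF and trailing spaces
    b"# Spec\n\nMust return 404.\n",
]
report = occurrence_report(sample)
```

Here the fork copy collapses into the same cluster as the original, so the sample yields three occurrences and two unique documents.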

RQ2: How much do specs change over their lifetimes?

Measure revisions, line/word churn, normalized edit distance, heading/section changes, semantic change, rewrite events, spec half-life, time to first major change, dormancy, reactivation, deletion, and replacement.
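Two of these measures, line churn and normalized edit distance, can be sketched with the standard library. This is a minimal sketch under stated assumptions: the function names are hypothetical, the edit distance is computed over whole lines rather than characters or tokens, and dividing by the longer revision's line count is one normalization choice among several.

```python
import difflib


def line_churn(old: str, new: str) -> dict[str, int]:
    """Count added and deleted lines between two spec revisions."""
    old_lines, new_lines = old.splitlines(), new.splitlines()
    added = deleted = 0
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deleted += i2 - i1
        if tag in ("replace", "insert"):
            added += j2 - j1
    return {"added": added, "deleted": deleted}


def normalized_line_edit_distance(old: str, new: str) -> float:
    """Line-level Levenshtein distance divided by the longer line count."""
    a, b = old.splitlines(), new.splitlines()
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, line_a in enumerate(a, start=1):
        curr = [i]
        for j, line_b in enumerate(b, start=1):
            cost = 0 if line_a == line_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(a), len(b))


v1 = "# Spec\nMust log in.\nMust log out.\n"
v2 = "# Spec\nMust log in with SSO.\nMust log out.\nMust audit logins.\n"
churn = line_churn(v1, v2)
distance = normalized_line_edit_distance(v1, v2)
```

In this example one line is rewritten and one is appended, which gives two added lines, one deleted line, and a normalized distance of 0.5. Semantic change and rewrite detection would need separate, audited definitions.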

RQ3: Were specs born before or after the AI era?

Use explicit timing labels rather than folk astrology:

  • pre_chatgpt: first spec commit before 2022-11-30.
  • chatgpt_transition: first spec commit from 2022-11-30 through 2023-06-30.
  • copilot_chat_era: first spec commit from 2023-07-01 through 2024-05-31.
  • agentic_spec_era: first spec commit on or after 2024-06-01.
  • unknown: insufficient evidence to date the first spec commit.

Sensitivity cutoffs:

  • GitHub Copilot technical preview: 2021-06-29.
  • GPT-4 launch: 2023-03-14.

Keep timing separate from AI-generation claims. A file committed after ChatGPT is not therefore generated by ChatGPT; dates are not provenance.
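The timing labels reduce to a small date lookup. The sketch below copies the boundaries from the label definitions above; the function and constant names are hypothetical, and treating a sensitivity cutoff as "on or after" is an assumption. It labels timing only and says nothing about how a file was written.

```python
from datetime import date
from typing import Optional

# Boundaries copied from the era label definitions.
CHATGPT_TRANSITION_START = date(2022, 11, 30)
COPILOT_CHAT_START = date(2023, 7, 1)
AGENTIC_SPEC_START = date(2024, 6, 1)

# Alternate cutoffs for sensitivity analysis.
SENSITIVITY_CUTOFFS = {
    "copilot_technical_preview": date(2021, 6, 29),
    "gpt4_launch": date(2023, 3, 14),
}


def ai_era_label(first_spec_commit: Optional[date]) -> str:
    """Map the first spec commit date to a timing label."""
    if first_spec_commit is None:
        return "unknown"
    if first_spec_commit < CHATGPT_TRANSITION_START:
        return "pre_chatgpt"
    if first_spec_commit < COPILOT_CHAT_START:
        return "chatgpt_transition"
    if first_spec_commit < AGENTIC_SPEC_START:
        return "copilot_chat_era"
    return "agentic_spec_era"


def sensitivity_flags(first_spec_commit: Optional[date]) -> dict[str, bool]:
    """Assumption: a spec counts as 'after' a cutoff if born on or after it."""
    if first_spec_commit is None:
        return {}
    return {
        name: first_spec_commit >= cutoff
        for name, cutoff in SENSITIVITY_CUTOFFS.items()
    }


label = ai_era_label(date(2023, 6, 30))
flags = sensitivity_flags(date(2022, 1, 15))
```

Keeping the sensitivity flags separate from the primary label makes it cheap to rerun an analysis under an alternate cutoff without relabeling the corpus.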

RQ4: What pressures are associated with spec changes?

Model temporal associations with:

  • popularity pressure: stars, forks, star bursts;
  • development pressure: code churn, large refactors, new modules;
  • collaboration pressure: contributors, new contributors, issue/PR bursts;
  • release pressure: tags/releases, major releases, pre-release freezes;
  • dependency/ecosystem pressure: package/dependency changes;
  • security/compliance pressure: CVEs, advisories, SECURITY.md, auth changes;
  • AI-feature pressure: LLM/agent/prompt/model dependencies or spec sections.

The output should say “associated with” unless the design supports causal claims. A lagged regression is not a small causal god.
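As a first descriptive pass, a pressure series can be shifted against spec churn and correlated. The sketch below is deliberately crude and is not the planned panel model: the function names and the toy monthly series are invented for illustration, and a Pearson correlation at a lag is an association, nothing more.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation; raises if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def lagged_association(pressure: list[float], spec_churn: list[float], lag: int) -> float:
    """Correlate pressure in bucket t - lag with spec churn in bucket t."""
    if len(pressure) != len(spec_churn):
        raise ValueError("series must cover the same time buckets")
    if lag < 0 or lag >= len(pressure) - 1:
        raise ValueError("lag leaves too few paired buckets")
    return pearson(pressure[: len(pressure) - lag], spec_churn[lag:])


# Toy monthly buckets (invented): spec churn trails code churn by one month.
code_churn = [10, 50, 20, 80, 30, 60]
spec_churn = [0, 1, 5, 2, 8, 3]
same_month = lagged_association(code_churn, spec_churn, lag=0)
one_month_lag = lagged_association(code_churn, spec_churn, lag=1)
```

In the toy data the one-month lag correlates far more strongly than the same-month pairing, which is the kind of pattern a lagged panel model would then need to test with repo fixed effects and controls.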

RQ5: How much spec exists relative to code?

Compute repo-level and module-level ratios:

  • spec file count / code file count;
  • spec LOC / code LOC;
  • spec tokens / code tokens;
  • spec files per KLOC;
  • executable contract ratio for schemas/contracts;
  • test/spec ratios;
  • docs/code ratios as a comparison baseline.

Exclude generated, vendored, lockfile, minified, compiled, and large binary-ish artifacts from primary ratios.
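A few of these ratios, with the exclusion step applied first, might look like the sketch below. The exclusion globs, the `kind` labels, and the input record shape are all assumptions for illustration; a real run would need a vetted generated/vendored detector rather than a handful of patterns.

```python
from fnmatch import fnmatchcase

# Hypothetical exclusion patterns. Directory globs here only match at the
# repository root; nested vendor directories would need extra patterns.
EXCLUDED_GLOBS = ("vendor/*", "node_modules/*", "dist/*", "*.min.js", "*.lock", "*_pb2.py")


def is_excluded(path: str) -> bool:
    return any(fnmatchcase(path, glob) for glob in EXCLUDED_GLOBS)


def spec_code_ratios(files: list[dict]) -> dict[str, float]:
    """files: records with 'path', 'kind' ('spec' or 'code'), and 'loc'."""
    kept = [f for f in files if not is_excluded(f["path"])]
    spec = [f for f in kept if f["kind"] == "spec"]
    code = [f for f in kept if f["kind"] == "code"]
    spec_loc = sum(f["loc"] for f in spec)
    code_loc = sum(f["loc"] for f in code)
    if not code or code_loc == 0:
        raise ValueError("no countable code; ratios are undefined")
    return {
        "spec_files_per_code_file": len(spec) / len(code),
        "spec_loc_per_code_loc": spec_loc / code_loc,
        "spec_files_per_kloc": len(spec) / (code_loc / 1000),
    }


repo_files = [
    {"path": "docs/spec.md", "kind": "spec", "loc": 200},
    {"path": "src/app.py", "kind": "code", "loc": 1500},
    {"path": "src/api.py", "kind": "code", "loc": 500},
    {"path": "vendor/lib/big.py", "kind": "code", "loc": 90000},  # excluded
    {"path": "static/app.min.js", "kind": "code", "loc": 1},  # excluded
]
ratios = spec_code_ratios(repo_files)
```

Without the exclusion step, the single vendored file would push the spec/code LOC ratio from 0.1 down to roughly 0.002, which is why the filter belongs before the division rather than after.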

RQ6: How connected are specs to code?

Measure explicit and implicit linkage:

  • direct links from specs to source files, modules, symbols, API endpoints, config keys, CLI flags, and schemas;
  • code comments/docstrings linking back to specs, RFCs, ADRs, issues, or requirement IDs;
  • path proximity between specs and implementation modules;
  • same-commit and rolling-window spec/code co-change;
  • issue/PR links and labels connecting spec changes to implementation work;
  • test/spec linkage and contract-test evidence;
  • documentation graph connectivity, orphan-spec ratio, and graph centrality.
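The co-change measures are the easiest of these to make concrete. The sketch below computes a same-commit rate and a rolling-window rate over a commit list; the `is_spec` and `is_code` predicates, the suffix set, and the seven-day window are hypothetical stand-ins for the project's real classifiers and parameters.

```python
from datetime import datetime, timedelta
from pathlib import PurePosixPath

CODE_SUFFIXES = {".py", ".ts", ".go"}  # hypothetical; a real list would be longer


def is_spec(path: str) -> bool:
    return PurePosixPath(path).name.lower() == "spec.md" or path.startswith("specs/")


def is_code(path: str) -> bool:
    return PurePosixPath(path).suffix in CODE_SUFFIXES


def co_change_rates(commits, window: timedelta) -> dict[str, float]:
    """commits: list of (timestamp, [paths]). Rates are per spec-touching commit."""
    spec_times = [t for t, paths in commits if any(is_spec(p) for p in paths)]
    code_times = [t for t, paths in commits if any(is_code(p) for p in paths)]
    if not spec_times:
        raise ValueError("no spec-touching commits")
    same_commit = sum(
        1
        for _, paths in commits
        if any(is_spec(p) for p in paths) and any(is_code(p) for p in paths)
    )
    windowed = sum(
        1 for s in spec_times if any(abs(c - s) <= window for c in code_times)
    )
    return {
        "same_commit": same_commit / len(spec_times),
        "rolling_window": windowed / len(spec_times),
    }


history = [
    (datetime(2024, 1, 1), ["docs/spec.md", "src/auth.py"]),
    (datetime(2024, 1, 5), ["docs/spec.md"]),
    (datetime(2024, 1, 6), ["src/auth.py", "tests/test_auth.py"]),
    (datetime(2024, 3, 1), ["specs/login.md"]),
]
rates = co_change_rates(history, window=timedelta(days=7))
```

In this history one of three spec commits also touches code, while two of three have a code commit within a week. The gap between the two rates is itself informative: it separates specs edited alongside code from specs edited near it.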

Dataset units

  • repo: one public repository at a forge host.
  • artifact_occurrence: one file path at one repository commit.
  • content_object: raw and normalized bytes keyed by hash.
  • spec_lineage: one logical spec document across renames/moves/deletions.
  • spec_revision: one commit that changes a spec lineage.
  • spec_snapshot: a material version of a spec at a commit.
  • repo_time_bucket: monthly or quarterly repo pressure features.
  • spec_code_edge: typed evidence connecting specs, code, tests, commits, issues, PRs, sections, symbols, or releases.

Discovery scope

Positive discovery patterns

Exact/path patterns:

  • spec.md, SPEC.md, specs.md, specification.md
  • requirements.md, software-requirements.md, functional-spec.md
  • prd.md, product-requirements.md
  • design.md, technical-design.md, architecture.md
  • rfcs/**, adr/**, docs/specs/**, specs/**, requirements/**
  • .kiro/specs/**, .agent-os/specs/**, agent/coding-assistant spec folders

Content signals:

  • “requirements”, “acceptance criteria”, “non-functional requirements”
  • “technical specification”, “software requirements specification”
  • “user stories”, “edge cases”, “success criteria”
  • requirement IDs such as REQ-001, INV-001, AC-001
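These content signals can be extracted with plain substring and regex matching. The sketch below is a rough illustration, not the spec classifier: the phrase list and ID prefixes come from the bullets above, while the function name, the case-insensitive matching, and the output shape are assumptions.

```python
import re

SIGNAL_PHRASES = [
    "requirements",
    "acceptance criteria",
    "non-functional requirements",
    "technical specification",
    "software requirements specification",
    "user stories",
    "edge cases",
    "success criteria",
]

# Matches IDs shaped like REQ-001, INV-001, AC-001.
REQUIREMENT_ID = re.compile(r"\b(?:REQ|INV|AC)-\d+\b")


def content_signals(text: str) -> dict[str, list[str]]:
    """Return matched phrases and distinct requirement IDs.

    Phrases overlap by design: "non-functional requirements" also
    triggers "requirements", so counts should not be summed naively.
    """
    lowered = text.lower()
    phrases = [p for p in SIGNAL_PHRASES if p in lowered]
    ids = sorted(set(REQUIREMENT_ID.findall(text)))
    return {"phrases": phrases, "requirement_ids": ids}


doc = (
    "## Acceptance Criteria\n"
    "REQ-001: The API must return 200.\n"
    "AC-001: Verified by a contract test.\n"
    "REQ-001 is referenced again here.\n"
)
signals = content_signals(doc)
```

For the sample document this finds the single phrase "acceptance criteria" and the two distinct IDs AC-001 and REQ-001. Signals like these are inputs to a score, and the score's precision still has to be checked against a manual audit set.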

Negative discovery patterns

Exclude or downrank:

  • test specs such as *.spec.ts, *.spec.js, *.spec.rb;
  • packaging specs such as RPM *.spec, *.nuspec, and *.podspec files, unless a separate package-spec stratum is intentionally studied;
  • vendored/generated docs;
  • changelogs, READMEs, legal docs, and API schemas unless classified as spec-like by content or format-specific rules.
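Taken together, the positive and negative path patterns suggest a first-pass path filter in which negatives win. The sketch below is one hypothetical arrangement: the return labels and the rule ordering are assumptions, the `*.spec` glob stands in for RPM-style packaging specs, and content-based classification (for READMEs, changelogs, and schemas) is not shown.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

POSITIVE_NAMES = {
    "spec.md", "specs.md", "specification.md",
    "requirements.md", "software-requirements.md", "functional-spec.md",
    "prd.md", "product-requirements.md",
    "design.md", "technical-design.md", "architecture.md",
}
POSITIVE_DIRS = (
    "rfcs", "adr", "docs/specs", "specs", "requirements",
    ".kiro/specs", ".agent-os/specs",
)
# Test specs and packaging specs; "*.spec" is an assumed RPM-style glob.
NEGATIVE_GLOBS = ("*.spec.ts", "*.spec.js", "*.spec.rb", "*.spec", "*.nuspec", "*.podspec")


def classify_path(path: str) -> str:
    """Return 'exclude_or_downrank', 'candidate', or 'unmatched'."""
    lowered = path.lower()
    name = PurePosixPath(lowered).name
    # Negative patterns are checked first so specs/button.spec.ts is not kept.
    if any(fnmatchcase(name, glob) for glob in NEGATIVE_GLOBS):
        return "exclude_or_downrank"
    if name in POSITIVE_NAMES:
        return "candidate"
    if any(f"/{d}/" in f"/{lowered}" for d in POSITIVE_DIRS):
        return "candidate"
    return "unmatched"


examples = {
    p: classify_path(p)
    for p in [
        "SPEC.md",
        ".kiro/specs/login/requirements.md",
        "docs/specs/auth/overview.md",
        "specs/button.spec.ts",
        "packaging/tool.spec",
        "README.md",
    ]
}
```

An "unmatched" path is not a rejection; it just means the path alone carries no signal and the file falls through to content-based rules.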

Metadata schema sketch

Every artifact occurrence should include at least:

{
  "artifact_id": "stable occurrence id",
  "content_sha256": "raw content hash",
  "normalized_sha256": "normalized text hash",
  "forge": "github|gitlab|gitea|sourcehut|other",
  "host": "github.com",
  "repo_full_name": "owner/repo",
  "repo_url": "https://github.com/owner/repo",
  "repo_id": "platform native id if available",
  "default_branch": "main",
  "commit_sha": "immutable commit sha",
  "file_path": "docs/spec.md",
  "file_url": "https://github.com/owner/repo/blob/<sha>/docs/spec.md",
  "raw_url": "raw immutable URL if available",
  "license_spdx": "MIT|Apache-2.0|NOASSERTION",
  "repo_created_at": "timestamp",
  "file_first_commit_at": "timestamp|null",
  "file_last_commit_at": "timestamp|null",
  "doc_type": "requirements|technical_spec|prd|architecture|rfc|adr|api_contract|unknown",
  "spec_likeness_score": 0.0,
  "ai_era_label": "pre_chatgpt|chatgpt_transition|copilot_chat_era|agentic_spec_era|unknown",
  "era_basis": "file_first_commit_at|repo_created_at|first_seen_at|unknown",
  "ai_signal_flags": ["mentions_claude", "kiro_path"],
  "dedup_cluster_id": "cluster id",
  "redistribution_status": "allowed|metadata_only|review_required",
  "secret_scan_status": "clean|flagged|quarantined"
}
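The reference fields for a GitHub occurrence can be derived from three inputs: repository name, commit SHA, and file path. The sketch below assumes GitHub's blob URL layout shown in the schema and the raw.githubusercontent.com layout for raw files; the function name is hypothetical, the 40-hex check assumes SHA-1 object IDs, and other forges would need their own adapters.

```python
import re
from urllib.parse import quote

FULL_SHA = re.compile(r"^[0-9a-f]{40}$")  # assumes SHA-1 commit IDs


def github_reference_urls(repo_full_name: str, commit_sha: str, file_path: str) -> dict[str, str]:
    """Build commit-pinned URLs; refuses branch names and abbreviated SHAs."""
    if not FULL_SHA.match(commit_sha):
        raise ValueError("commit_sha must be a full 40-character hex SHA")
    encoded_path = quote(file_path, safe="/")
    return {
        "repo_url": f"https://github.com/{repo_full_name}",
        "file_url": f"https://github.com/{repo_full_name}/blob/{commit_sha}/{encoded_path}",
        "raw_url": f"https://raw.githubusercontent.com/{repo_full_name}/{commit_sha}/{encoded_path}",
    }


placeholder_sha = "a" * 40  # illustrative only, not a real commit
urls = github_reference_urls("owner/repo", placeholder_sha, "docs/spec.md")
```

Rejecting anything that is not a full SHA is the point of the check: a URL pinned to a branch name such as main silently changes meaning as the branch moves, which would defeat the stable-reference requirement.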

Pipeline architecture

  1. Query registry and slicer.
  2. Forge adapters: GitHub first, GitLab second, then Gitea/Forgejo/Codeberg, SourceHut, and Bitbucket if useful.
  3. Repo candidate queue.
  4. File candidate queue.
  5. Immutable fetcher keyed by commit SHA.
  6. Content-addressed raw/normalized store.
  7. Spec classifier and taxonomy scorer.
  8. Dedup and near-dedup clustering.
  9. Git lineage miner.
  10. Revision/snapshot extractor.
  11. Repo pressure feature collector.
  12. Code/spec connectedness analyzer.
  13. Privacy/license/compliance gate.
  14. DuckDB/Parquet export and analysis notebooks.
  15. Manual audit harness.

Statistical analysis plan

  • Descriptive distributions: corpus size, doc types, languages, stars, licenses, AI-era labels, spec size, churn, and connectedness.
  • Survival analysis: time to first major spec change.
  • Negative binomial or quasi-Poisson models: spec churn counts.
  • Logistic models: probability of a major spec change in a future window.
  • Lagged panel models: code churn, star bursts, issue/PR pressure, and release windows preceding spec changes.
  • Event studies: star bursts, releases, security events, dependency changes, AI feature additions.
  • Matched comparisons: high-star vs low-star repos; pre-AI vs AI-era specs; specs with direct code linkage vs orphan specs.
  • Robustness checks: alternate AI cutoffs, fork/template filtering, generated-file exclusion, small-repo winsorization, and manual audit labels.
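For the survival-analysis item, a Kaplan-Meier estimate of "time to first major spec change" is a natural starting point, since many specs will not have changed by the end of observation and must be treated as censored. The sketch below is a plain-Python illustration with invented durations; the function name is hypothetical, and a real analysis would more likely use an established statistics library.

```python
def kaplan_meier(durations: list[float], observed: list[bool]) -> list[tuple[float, float]]:
    """Return (time, survival probability) at each time with an event.

    durations: days from spec birth to first major change, or to the end
    of observation. observed: True if the major change was seen, False if
    the spec was censored (no major change by the end of observation).
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        # Both events and censored specs leave the risk set after time t.
        at_risk -= leaving
    return curve


# Invented example: five specs, two of them censored.
days = [30, 60, 60, 90, 120]
changed = [True, True, False, True, False]
curve = kaplan_meier(days, changed)
```

With these inputs the estimated probability of a spec surviving without a major change drops to about 0.8 at day 30, 0.6 at day 60, and 0.3 at day 90. Dropping the censored specs instead would overstate how quickly specs change.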

Validation plan

  • Manual audit set for spec detection precision and recall.
  • Manual audit set for lineage reconstruction across renames/moves/deletions.
  • Manual audit set for minor/moderate/major/rewrite change labels.
  • Edge audit for spec-code links: direct path, symbol, semantic, issue/PR, and co-change evidence.
  • Reproducibility checks: same query registry, same commit SHA, same normalized hashes.
  • Bias report: search result caps, forge coverage, public-only bias, license restrictions, fork/template inflation, and missing historical star data.
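The spec-detection audit reduces to a confusion-matrix calculation once each audited file has a classifier decision and a human label. The sketch below assumes boolean labels and a hypothetical function name; note that recall is only meaningful if the audit set also samples files the classifier rejected.

```python
def precision_recall(predicted: list[bool], audited: list[bool]) -> dict[str, float]:
    """predicted: classifier says spec. audited: human label says spec."""
    if len(predicted) != len(audited):
        raise ValueError("label lists must be the same length")
    tp = sum(1 for p, a in zip(predicted, audited) if p and a)
    fp = sum(1 for p, a in zip(predicted, audited) if p and not a)
    fn = sum(1 for p, a in zip(predicted, audited) if not p and a)
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("precision or recall is undefined for this audit set")
    return {"precision": tp / (tp + fp), "recall": tp / (tp + fn)}


# Invented audit sample of six files.
classifier_says_spec = [True, True, True, False, False, True]
human_says_spec = [True, False, True, True, False, True]
scores = precision_recall(classifier_says_spec, human_says_spec)
```

In this sample there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75.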

Kanban design

Board: spec-dataset-evolution.

High-level DAG:

  1. SPEC-DATA-00 / t_39d3ad17: project anchor and acceptance contract. Done.
  2. Parallel discovery/methodology tasks, ready after anchor:
    • SPEC-DATA-01 / t_76046bef: taxonomy and scoring.
    • SPEC-DATA-02 / t_f1b2fc83: query registry and discovery strategy.
    • SPEC-DATA-03 / t_0ba9fcc1: schema and storage design.
    • SPEC-DATA-04 / t_d0fcc4f7: licensing/privacy/compliance gate.
  3. Pilot implementation tasks:
    • SPEC-DATA-05 / t_7a1d6c82: GitHub pilot adapter.
    • SPEC-DATA-06 / t_8c5c2d92: GitLab/public forge pilot adapter.
    • SPEC-DATA-07 / t_7ba6769b: dedup/canonicalization.
    • SPEC-DATA-08 / t_7f0570b6: lineage/revision miner.
  4. Analysis feature tasks:
    • SPEC-DATA-09 / t_dbd9c936: AI-era labels and repo pressure timeline.
    • SPEC-DATA-10 / t_092f8c62: code/spec ratio and connectedness features.
    • SPEC-DATA-11 / t_07cd26c6: statistical analysis plan and notebook skeleton.
  5. Gates:
    • SPEC-DATA-12 / t_aed8c299: manual audit harness.
    • SPEC-DATA-13 / t_b5c6851b: pilot crawl report. Now also depends on the deep-repo review below.
    • SPEC-DATA-14 / t_d29a8854: reviewer synthesis and next-build recommendation.
  6. Deep whole-repository exploration wave, added after the user clarified that file-level crawling is insufficient:
    • SPEC-REPO-00 / t_2d3d9d2f: deep-repo protocol anchor. Done.
    • SPEC-REPO-01 / t_1baaadc9: exact spec.md high-signal repositories.
    • SPEC-REPO-02 / t_90c5bff9: requirements / PRD / design / architecture repos.
    • SPEC-REPO-03 / t_9b7d3af9: AI-native spec directories such as .kiro/specs and .agent-os/specs.
    • SPEC-REPO-04 / t_6a30a73e: spec-driven development / Spec Kit influenced repos.
    • SPEC-REPO-05 / t_8cffbb6d: RFC- and ADR-heavy repos.
    • SPEC-REPO-06 / t_966e73be: executable contract repos: OpenAPI, GraphQL, Proto, TLA+, Alloy, Dafny, Lean, etc.
    • SPEC-REPO-07 / t_c21079b7: mature pre-AI long-lived repos.
    • SPEC-REPO-08 / t_71d87a22: recent AI-era fast-growing repos.
    • SPEC-REPO-09 / t_c4f61036: low-star and small-repo counter-sample.
    • SPEC-REPO-10 / t_50b78ec2: GitLab and non-GitHub public forges.
    • SPEC-REPO-11 / t_9ef23081: forks, templates, and duplicate-spec inflation.
    • SPEC-REPO-12 / t_0093bf70: negative controls and spec-poor code-heavy repos.
    • SPEC-REPO-13 / t_2d6bc0ce: aggregate deep repo dossiers and seed empirical corpus.
    • SPEC-REPO-14 / t_482593d1: adversarial review of deep-repo evidence quality.

Each Kanban task should point back to this page and return structured metadata: outputs, tables, sources, acceptance_status, risks, and next_tasks. Deep repo scouts must also write whole-repo dossiers under /Users/ericfode/.hermes/gateway-scratch/spec-dataset-evolution/repo-dossiers/<task-key>/.
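A lightweight check on the returned metadata could catch incomplete task results before they reach a gate. The sketch below only verifies that the six named fields are present; the function names, the example values, and the value types are assumptions, since this page does not define them.

```python
from pathlib import PurePosixPath

REQUIRED_FIELDS = (
    "outputs", "tables", "sources", "acceptance_status", "risks", "next_tasks",
)
DOSSIER_ROOT = PurePosixPath(
    "/Users/ericfode/.hermes/gateway-scratch/spec-dataset-evolution/repo-dossiers"
)


def missing_fields(result: dict) -> list[str]:
    """Return required metadata fields absent from a task result."""
    return [field for field in REQUIRED_FIELDS if field not in result]


def dossier_dir(task_key: str) -> PurePosixPath:
    """Directory where a deep repo scout writes its whole-repo dossiers."""
    return DOSSIER_ROOT / task_key


# Hypothetical result payload; every value here is a placeholder.
example_result = {
    "outputs": ["dossier.md"],
    "tables": [],
    "sources": ["https://github.com/owner/repo"],
    "acceptance_status": "pass",
    "risks": [],
    "next_tasks": ["SPEC-REPO-13"],
}
problems = missing_fields(example_result)
target = dossier_dir("SPEC-REPO-01")
```

A presence check is the weakest useful validation; once field types and allowed acceptance_status values are settled, they would belong in the same function.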

Operational repair notes

A Kanban runtime repair was recorded on 2026-05-05 after a schema error in a globally enabled plugin affected the default workers for SPEC-DATA-21 and SPEC-DATA-JCODE, and with them the downstream task SPEC-DATA-22. The raw incident note is preserved at spec-dataset-evolution-kanban-error-repair-2026-05-05.

Key state after the repair:

  • Basis plugin local source commit 0061d32 made basis_reduce_spec and basis_validate_packet schemas compatible with OpenAI Codex by removing top-level anyOf from tool parameter schemas.
  • SPEC-DATA-JCODE / t_63d0e80e completed and produced the jcode-first seed pack.
  • SPEC-DATA-21 / t_383dad01 was manually closed after deterministic artifact verification; its gate recommendation is hold_bounded_staged_expansion.
  • SPEC-DATA-22 / t_977f775e was correctly blocked before crawl work because human primary/dual/adjudicated labels remain unresolved.

The remaining blocked task is therefore an intentional research gate, not a Kanban runtime failure.

Public deep-dive ingest wave

The private corpus is now being rendered into public-safe wiki synthesis through spec-deep-dive-wiki-ingest-project. The aggregate map is spec-deep-dive-index.

The first priority case pages and the first cohort pages preserve repo URLs, corpus-relative source paths, caveats, and metadata-only publication boundaries rather than exporting raw corpus bodies.

Definition of done for the design phase

The design phase is complete when the Kanban board contains the project anchor, parallel research/design lanes, pilot crawl tasks, analysis tasks, audit gates, a deep whole-repository scout wave with aggregation/review, and a final synthesis task, all linked by dependencies and pointed at this wiki page.