Spec Deep-Dive: Aggregate Index
Question
What did the private spec-dataset-evolution-corpus deep-dive wave show once
whole repositories, cohort scouts, and public-safety gates were considered
together?
This page is the public-safe aggregate index for spec-deep-dive-wiki-ingest-project and the deep-dive companion to spec-dataset-evolution-research-project. It uses the private corpus repository as a private-source pointer and evidence store, and it releases no raw corpus files. The public unit here is synthesis plus provenance: URLs, commits, corpus-relative paths, counts, caveats, and links to focused wiki pages.
Short answer
The deep-repo wave confirms that “software specification” is not one document
shape. Exact spec.md files, requirements documents, .kiro/specs, Spec Kit
scaffolds, RFC/ADR corpora, OpenAPI/Proto/Smithy/AsyncAPI contracts, formal
specification languages, and negative-control documentation all have to be kept
as separate artifact families. Collapse them into one bucket and the analysis
becomes elegantly wrong, which is still wrong.
Across the aggregate private-corpus evidence, the scout wave produced 55 selected dossier occurrences across 51 unique public repositories, from 400 candidate rows. It also preserved 335 rejected/deferred rows and 10 failed-search or metadata-limit rows, so discovery failures remain evidence instead of vanishing into the carpet.
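The counts above partition the candidate rows. A minimal check, assuming the three outcome buckets are mutually exclusive (the variable names are illustrative, not corpus field names):

```python
# Counts quoted in this page; variable names are illustrative only.
candidate_rows = 400
selected_occurrences = 55
rejected_or_deferred = 335
failed_search_or_metadata_limit = 10

# Assumption: every candidate row lands in exactly one outcome bucket.
accounted = selected_occurrences + rejected_or_deferred + failed_search_or_metadata_limit
assert accounted == candidate_rows

# Share of candidate rows that became selected dossier occurrences.
selection_rate = selected_occurrences / candidate_rows
print(f"selected {selection_rate:.2%} of candidate rows")
```

The selection rate describes this scout wave only; it is not a prevalence estimate.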
Source basis
The source paths below are corpus-relative inside the private repository
https://github.com/ericfode/spec-dataset-evolution-corpus at observed HEAD
4659608 unless otherwise stated.
| Claim scope | Private corpus source | Public upstream reference | Evidence fields used | Caveat |
|---|---|---|---|---|
| Ingest boundary and public-safety policy | PROJECT_BRIEF.md; reports/deep-dives/SPEC-REPO-16/COMPLIANCE_EXPORT_GATE.md; spec-deep-dive-wiki-ingest-project | private corpus repo URL only | publication mode, source routing, raw-export gate, excerpt policy | The private corpus is evidence, not a public raw-content release. |
| Aggregate dossier counts and cohort yield | reports/AGGREGATE.md; reports/deep-dives/AGGREGATE.md; data/aggregate_repo_records.jsonl | 51 public repositories represented by selected dossier records | selected occurrences, candidate classes, scout directories, Markdown/JSON artifacts read, parse errors | Aggregate counts summarize selected scouts; they are not prevalence estimates for all public repositories. |
| Artifact-class distribution | reports/AGGREGATE.md; data/corpus_file_manifest.jsonl; data/aggregate_repo_records.jsonl | public repo URLs in aggregate rows | normalized doc types, inventory samples, authority origin, raw-inclusion status | File-class counts include occurrence evidence and must be separated from independent authority. |
| Code/spec connectedness | data/connectedness_features.jsonl; data/aggregate_repo_records.jsonl; per-cohort dossiers | repo URLs and inspected commits preserved in child pages | direct links, reverse backrefs, path/symbol mentions, test linkage, same-commit co-change | Connectedness is evidence of coupling, not proof of causality or governance success. |
| Template and duplicate lineage | reports/deep-dives/SPEC-REPO-20/TEMPLATE_LINEAGE_MODEL.md; reports/deep-dives/SPEC-REPO-20/template_lineage_mapping.jsonl | Spec Kit, Kiro, Agent OS, prompt-template, and tutorial/fork repos named in child pages | template family, authority origin, generated scaffold, edited project contract, independent authority weight | Template descendants are adoption evidence unless reviewed as independent project authority. |
| Search and platform limits | reports/deep-dives/SPEC-REPO-01..12/index.md; sources.md; candidates.jsonl | GitHub, GitLab, Codeberg, SourceHut, Bitbucket, and public clone URLs where available | failed searches, HTTP 401/403, rate-limit notes, host-specific fallback method, rejected candidates | These failures constrain recall; they must not be silently converted into complete-search claims. |
Navigation map
| Page | Role in the ingest wave | Main artifact family |
|---|---|---|
| spec-deep-dive-wiki-ingest-project | Source map, citation style, public-safety gate, and Kanban task graph | ingest contract |
| spec-deep-dive-case-jcode | Priority calibration case for a post-LLM coding-agent harness | distributed spec surfaces and high connectedness |
| spec-deep-dive-case-droidagent | Priority case for agent-generated mobile-GUI behavioral scenarios | requirements/tasks/reports/scripts as replayable behavior specs |
| spec-deep-dive-case-j8-ambiguity | Negative-evidence trail separating j8agent ambiguity from J8Spec control value | namespace ambiguity and pre-AI executable-spec control |
| spec-deep-dive-cohort-exact-spec-md-and-standards | Exact spec.md, mature standards, and protocol repositories | normative standards and implementation-backed spec files |
| spec-deep-dive-cohort-agent-native-spec-kit-kiro | Agent-native workflow specs, Spec Kit, Kiro, templates, and lineage | generated scaffolds versus edited project contracts |
| spec-deep-dive-cohort-rfc-adr-executable-contracts | RFC/proposal governance and executable/formal contracts | governance prose, API schemas, IDLs, TLA+/Dafny-like formal artifacts |
Aggregate findings
1. Cohort yield is heterogeneous by design
The aggregate wave read 12 scout directories, 79 top-level Markdown artifacts, and 70 JSON artifacts, with zero parse errors reported. It emitted 55 aggregate records. The main cohort yield table is:
| Cohort family | Selected records | Candidate rows | What the cohort teaches |
|---|---|---|---|
| Exact lower-case spec.md | 5 | 130 | spec.md is a precise seed, not the definition of a specification. |
| Requirements / PRD / design / architecture | 4 | 10 | Code-rich repos need section/path classification and license review. |
| AI-native .kiro / Agent OS specs | 4 | 49 | Requirements/design/tasks packets expose project-local agent contracts. |
| Spec Kit / spec-driven development | 5 | 63 | Templates, commands, hooks, skills, and consumers form a control plane. |
| RFC / ADR / proposals | 5 | 17 | Governance corpora are spec-primary but often implementation-external. |
| Executable contracts | 5 | 9 | OpenAPI, Proto, Smithy, AsyncAPI, and similar files are machine-readable specs. |
| Mature pre-AI baselines | 4 | 8 | Standards and protocol projects provide long-lived comparison cases. |
| Recent AI-era fast-growing repos | 5 | 11 | Agentic product repos show small but highly connected spec islands. |
| Low-star / small-repo counter-sample | 5 | 69 | Co-born specs, missing licenses, and SRS recall holes appear early. |
| Non-GitHub forges | 4 | 11 | Host-specific adapters are required; GitHub assumptions do not port. |
| Forks/templates/tutorials | 5 | 13 | Duplicate inflation is a lineage problem, not just a hash problem. |
| Negative controls | 4 | 10 | Docs/tests/manuals can be code-connected without being product specs. |
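To make the heterogeneity concrete, the table can be reduced to a selected-to-candidate ratio per cohort. This is a rough sketch using only the numbers in the table; the ratio reflects how each scout searched and selected, so it should not be read as a quality or prevalence score.

```python
# (cohort family, selected records, candidate rows), copied from the table above.
COHORTS = [
    ("Exact lower-case spec.md", 5, 130),
    ("Requirements / PRD / design / architecture", 4, 10),
    ("AI-native .kiro / Agent OS specs", 4, 49),
    ("Spec Kit / spec-driven development", 5, 63),
    ("RFC / ADR / proposals", 5, 17),
    ("Executable contracts", 5, 9),
    ("Mature pre-AI baselines", 4, 8),
    ("Recent AI-era fast-growing repos", 5, 11),
    ("Low-star / small-repo counter-sample", 5, 69),
    ("Non-GitHub forges", 4, 11),
    ("Forks/templates/tutorials", 5, 13),
    ("Negative controls", 4, 10),
]

# Column totals should match the aggregate figures quoted earlier on this page.
total_selected = sum(selected for _, selected, _ in COHORTS)
total_candidates = sum(candidates for _, _, candidates in COHORTS)

# Selected-to-candidate ratio per cohort, sorted from lowest to highest.
yields = sorted(
    ((name, selected / candidates) for name, selected, candidates in COHORTS),
    key=lambda pair: pair[1],
)

for name, ratio in yields:
    print(f"{ratio:6.1%}  {name}")
```

Broad seeds such as exact spec.md sit at the low end because they surface many candidates per selected record, while narrow cohorts such as executable contracts sit at the high end.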
2. Artifact classes are not interchangeable
The aggregate artifact-class inventory is dominated by requirements, RFCs, and
technical specs, but it also includes architecture docs, exact spec.md, Proto,
OpenAPI/API contracts, AsyncAPI, schema/config contracts, agent-native Kiro
packets, PRDs, ADRs, and formal/executable surfaces. The high-count classes
reported in the aggregate include:
| Normalized artifact class | Aggregate count | Interpretation |
|---|---|---|
| requirements | 627 | Largest prose-spec family; needs false-positive controls for dependency manifests and loose docs. |
| rfc | 385 | Governance records; often spec-primary but not app-local implementation specs. |
| technical_spec | 350 | Broad technical-design/spec family; needs path and authority labels. |
| architecture | 61 | Can be normative, explanatory, or historical depending on connectedness evidence. |
| proto_contract | 48 | Machine-readable contract surface; codegen/test linkage matters. |
| api_contract | 41 | API definitions belong in the spec corpus even when not Markdown. |
| exact_spec_md | 34 | Useful discovery handle, not an artifact ontology. |
| schema_or_config_contract | 28 | Config/schema files can carry contractual meaning. |
| asyncapi_contract | 24 | Executable contract family with raw-export caveats in this pass. |
| agent_native_kiro | 20 | Agent-native requirements/design/tasks packets; important AI-era stratum. |
3. Connectedness is widespread, but it is evidence, not metaphysics
Across aggregate records, connectedness flags were common: test_linkage appears
in 50 records, path_or_symbol_mentions in 43, cochange in 42, direct_links
in 29, and reverse_backrefs in 21. Those flags support graph features for
llm-readable-spec-files and evaluation-and-review-loops, but they should
remain typed evidence. Same-commit co-change says “these moved together,” not
“this spec governed that implementation by divine right.”
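One way to keep the flags typed is to tally each one separately instead of folding them into a single "connected" boolean. A minimal sketch follows; the flag names are the ones reported above, while the record layout and the two sample records are invented for illustration.

```python
from collections import Counter

# Flag names as reported in the aggregate; the record layout is hypothetical.
FLAGS = (
    "test_linkage",
    "path_or_symbol_mentions",
    "cochange",
    "direct_links",
    "reverse_backrefs",
)


def count_flags(records):
    """Count how many records carry each connectedness flag, one tally per flag."""
    tally = Counter()
    for record in records:
        flags = record.get("connectedness", {})
        for flag in FLAGS:
            if flags.get(flag):
                tally[flag] += 1
    return tally


# Two made-up records, not corpus data.
sample = [
    {"repo": "example/a", "connectedness": {"test_linkage": True, "cochange": True}},
    {"repo": "example/b", "connectedness": {"test_linkage": True, "direct_links": False}},
]
tally = count_flags(sample)
print(dict(tally))
```

Because each flag keeps its own count, a downstream reader can still tell co-change evidence apart from an explicit link.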
4. Spec/code ratios need role labels
The aggregate ratio distribution was available for 51 records: min 0.0000, p25
0.0319, median 0.1337, p75 3.4141, and max 3008.3182, with four
zero-high-confidence-spec records. That range is meaningful only when each repo’s
role is preserved:
- spec-as-product repositories such as RFC, KEP, EIP, or OpenTelemetry-style standards can have enormous ratios;
- code-rich implementation repos often have small but meaningful spec islands;
- executable-contract repos need separate treatment because schemas are both specifications and build inputs;
- negative controls prove that manuals, tests, and dependency files can be strongly code-connected while still not being high-confidence specs.
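The effect of dropping role labels can be sketched with a grouped median. The role names and ratio values below are made up for illustration and are not corpus values; the point is only that a pooled statistic hides the spread between roles.

```python
from collections import defaultdict
from statistics import median

# Made-up (role, spec/code ratio) pairs for illustration; not corpus values.
rows = [
    ("spec_as_product", 250.0),
    ("spec_as_product", 40.0),
    ("code_rich_implementation", 0.03),
    ("code_rich_implementation", 0.12),
    ("executable_contract", 1.5),
    ("negative_control", 0.0),
]


def medians_by_role(rows):
    """Group ratios by role label and return the median ratio for each role."""
    by_role = defaultdict(list)
    for role, ratio in rows:
        by_role[role].append(ratio)
    return {role: median(values) for role, values in by_role.items()}


pooled = median(ratio for _, ratio in rows)  # one number for everything
per_role = medians_by_role(rows)             # one number per role

print("pooled median:", pooled)
for role, value in per_role.items():
    print(f"{role}: {value}")
```

In this toy data the pooled median lands between the roles and describes none of them, which is the failure mode the role labels are meant to prevent.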
5. Raw export still fails closed
The normalized redistribution summary in the aggregate was 34 allowed, 11
review_required, 6 metadata_only, 3 allowed_pending_policy_review, and 1
allowed_with_obligations. That is not a license to publish raw content from the
private corpus. Wiki pages in this wave use synthesis, metadata, public URLs,
commits, representative paths, and caveats. Any raw-content export still depends
on the fail-closed gate: license status, redistribution status, secret/PII/internal
scan status, excerpt policy, and human audit labels.
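A fail-closed gate of this kind can be sketched as a check that passes only when every gate is present and explicitly cleared. The gate keys mirror the list above, but the key spellings, the passing values, and the record layout are all assumptions rather than the corpus's actual schema.

```python
# Gate keys mirror the list above; passing values are hypothetical placeholders.
PASSING_VALUES = {
    "license_status": {"cleared"},
    "redistribution_status": {"allowed"},
    "scan_status": {"clean"},          # secret/PII/internal scan
    "excerpt_policy": {"approved"},
    "human_audit_label": {"approved"},
}


def may_export_raw(record: dict) -> bool:
    """Return True only if every gate is present and holds a passing value.

    Missing keys, None, and unrecognized values all block export (fail closed).
    """
    return all(
        record.get(gate) in passing for gate, passing in PASSING_VALUES.items()
    )


fully_cleared = {
    "license_status": "cleared",
    "redistribution_status": "allowed",
    "scan_status": "clean",
    "excerpt_policy": "approved",
    "human_audit_label": "approved",
}
# "allowed" redistribution alone is not enough: the other gates are missing.
only_redistribution = {"redistribution_status": "allowed"}

print(may_export_raw(fully_cleared))
print(may_export_raw(only_redistribution))
```

The design choice is that the default answer is "no": a record that was never scanned or never audited is treated the same as one that failed.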
6. Discovery failures are part of the result
Several scouts recorded unavailable or rate-limited discovery surfaces:
unauthenticated GitHub code search returned HTTP 401 or 403 in multiple lanes;
Hermes web search was unavailable where Firecrawl was not configured; GitLab,
Codeberg, SourceHut, Bitbucket, and self-hosted GitLab required host-specific
methods. The correct dataset behavior is to store these as discovery_event and
negative_evidence rows, not to pretend the search frontier was complete.
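As a rough illustration, a failed search can be written out as a row just like a successful one. The field names, outcome labels, query string, and fallback name below are hypothetical; only the idea of storing the failure comes from the scouts' practice described above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DiscoveryEvent:
    """One search attempt, kept whether or not it produced candidates (hypothetical layout)."""
    host: str
    method: str
    query: str
    http_status: Optional[int]  # None when the search surface could not be reached at all
    outcome: str
    fallback_method: Optional[str] = None


def classify_outcome(http_status: Optional[int]) -> str:
    """Map an HTTP status to a coarse outcome label (labels are assumptions)."""
    if http_status is None:
        return "unavailable"
    if http_status == 401:
        return "auth_required"
    if http_status == 403:
        return "forbidden_or_rate_limited"
    if 200 <= http_status < 300:
        return "ok"
    return "other_failure"


def record_attempt(host, method, query, http_status, fallback_method=None) -> str:
    """Serialize an attempt as one JSONL line so failures stay in the dataset."""
    event = DiscoveryEvent(
        host, method, query, http_status, classify_outcome(http_status), fallback_method
    )
    return json.dumps(asdict(event), sort_keys=True)


# Illustrative values only: an unauthenticated code search answered with HTTP 403.
line = record_attempt(
    "github.com", "code_search", "spec.md", 403, fallback_method="clone_and_scan"
)
print(line)
```

Keeping the fallback method on the same row lets a later reader see both that the primary surface failed and how recall was partially recovered.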
Dataset-design carry-forward
The next corpus build should preserve these schema separations:
- repo_dossier/repo_context alongside artifact rows.
- artifact_form separate from authority_origin.
- template_lineage separate from independent project authority.
- discovery_event and negative_evidence rows for failed searches and fallback methods.
- clone_and_history_coverage for full, blobless, sparse, current-tree-only, and rate-limited retrieval.
- Typed spec_code_edge evidence: direct link, reverse backref, path proximity, symbol/endpoint mention, test linkage, codegen linkage, co-change, and issue/PR/release linkage.
- Per-record compliance gates and raw-content export policy.
- AI-era timing labels separate from AI-generation claims.
- Negative-control strata in every crawl batch.
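A few of these separations can be sketched as record types. This is one possible shape under stated assumptions, not the corpus schema: the class names, field names, and example values are hypothetical, and only the separation of artifact_form, authority_origin, template_lineage, and typed spec_code_edge kinds is taken from the list above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class EdgeKind(Enum):
    """Typed spec/code edge evidence, one member per kind named in the list above."""
    DIRECT_LINK = "direct_link"
    REVERSE_BACKREF = "reverse_backref"
    PATH_PROXIMITY = "path_proximity"
    SYMBOL_OR_ENDPOINT_MENTION = "symbol_or_endpoint_mention"
    TEST_LINKAGE = "test_linkage"
    CODEGEN_LINKAGE = "codegen_linkage"
    CO_CHANGE = "co_change"
    ISSUE_PR_RELEASE_LINKAGE = "issue_pr_release_linkage"


@dataclass(frozen=True)
class SpecCodeEdge:
    spec_path: str
    code_path: str
    kind: EdgeKind


@dataclass
class ArtifactRow:
    repo_url: str
    path: str
    artifact_form: str                      # what the file is
    authority_origin: str                   # where its authority comes from
    template_lineage: Optional[str] = None  # template family, if any
    edges: List[SpecCodeEdge] = field(default_factory=list)

    def edge_kinds(self):
        """Return the distinct kinds of evidence attached to this artifact."""
        return {edge.kind for edge in self.edges}


# Hypothetical example: a generated scaffold that only co-changes with code.
row = ArtifactRow(
    repo_url="https://example.invalid/org/repo",
    path="specs/001-feature/spec.md",
    artifact_form="exact_spec_md",
    authority_origin="generated_scaffold",
    template_lineage="spec_kit",
)
row.edges.append(SpecCodeEdge(row.path, "src/feature.py", EdgeKind.CO_CHANGE))
print(sorted(kind.value for kind in row.edge_kinds()))
```

Keeping artifact_form and authority_origin as separate fields means a template descendant can be counted as adoption evidence without being promoted to independent project authority.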
Remaining gates
- spec-deep-dive-wiki-ingest-project still routes a final public-safety review through SPEC-WIKI-06 before commit/push publication.
- The manual/adjudicated label gate from the broader spec-dataset-evolution-research-project remains unresolved for raw public export.
- Historical pressure timelines are future work. Current stars, forks, and recent churn are pressure markers, not longitudinal causal evidence.
- Non-GitHub forge coverage needs proper adapters before anyone says “complete” with a straight face.