Spec Deep-Dive: Aggregate Index

Question

What did the private spec-dataset-evolution-corpus deep-dive wave show once whole repositories, cohort scouts, and public-safety gates were considered together?

This page is the public-safe aggregate index for spec-deep-dive-wiki-ingest-project and the deep-dive companion to spec-dataset-evolution-research-project. It treats the private corpus repository as a source pointer and evidence store, but it does not release copied corpus files: raw corpus bodies stay private. The public unit here is synthesis plus provenance: URLs, commits, corpus-relative paths, counts, caveats, and links to focused wiki pages.

Short answer

The deep-repo wave confirms that “software specification” is not one document shape. Exact spec.md files, requirements documents, .kiro/specs, Spec Kit scaffolds, RFC/ADR corpora, OpenAPI/Proto/Smithy/AsyncAPI contracts, formal specification languages, and negative-control documentation all have to be kept as separate artifact families. Collapse them into one bucket and the analysis becomes elegantly wrong, which is still wrong.

Across the aggregate private-corpus evidence, the scout wave produced 55 selected dossier occurrences across 51 unique public repositories, from 400 candidate rows. It also preserved 335 rejected/deferred rows and 10 failed-search or metadata-limit rows, so discovery failures remain evidence instead of vanishing into the carpet.

Source basis

The source paths below are corpus-relative inside the private repository https://github.com/ericfode/spec-dataset-evolution-corpus at observed HEAD 4659608 unless otherwise stated.

| Claim scope | Private corpus source | Public upstream reference | Evidence fields used | Caveat |
|---|---|---|---|---|
| Ingest boundary and public-safety policy | PROJECT_BRIEF.md; reports/deep-dives/SPEC-REPO-16/COMPLIANCE_EXPORT_GATE.md; spec-deep-dive-wiki-ingest-project | private corpus repo URL only | publication mode, source routing, raw-export gate, excerpt policy | The private corpus is evidence, not a public raw-content release. |
| Aggregate dossier counts and cohort yield | reports/AGGREGATE.md; reports/deep-dives/AGGREGATE.md; data/aggregate_repo_records.jsonl | 51 public repositories represented by selected dossier records | selected occurrences, candidate classes, scout directories, Markdown/JSON artifacts read, parse errors | Aggregate counts summarize selected scouts; they are not prevalence estimates for all public repositories. |
| Artifact-class distribution | reports/AGGREGATE.md; data/corpus_file_manifest.jsonl; data/aggregate_repo_records.jsonl | public repo URLs in aggregate rows | normalized doc types, inventory samples, authority origin, raw-inclusion status | File-class counts include occurrence evidence and must be separated from independent authority. |
| Code/spec connectedness | data/connectedness_features.jsonl; data/aggregate_repo_records.jsonl; per-cohort dossiers | repo URLs and inspected commits preserved in child pages | direct links, reverse backrefs, path/symbol mentions, test linkage, same-commit co-change | Connectedness is evidence of coupling, not proof of causality or governance success. |
| Template and duplicate lineage | reports/deep-dives/SPEC-REPO-20/TEMPLATE_LINEAGE_MODEL.md; reports/deep-dives/SPEC-REPO-20/template_lineage_mapping.jsonl | Spec Kit, Kiro, Agent OS, prompt-template, and tutorial/fork repos named in child pages | template family, authority origin, generated scaffold, edited project contract, independent authority weight | Template descendants are adoption evidence unless reviewed as independent project authority. |
| Search and platform limits | reports/deep-dives/SPEC-REPO-01..12/index.md; sources.md; candidates.jsonl | GitHub, GitLab, Codeberg, SourceHut, Bitbucket, and public clone URLs where available | failed searches, HTTP 401/403, rate-limit notes, host-specific fallback method, rejected candidates | These failures constrain recall; they must not be silently converted into complete-search claims. |

| Page | Role in the ingest wave | Main artifact family |
|---|---|---|
| spec-deep-dive-wiki-ingest-project | Source map, citation style, public-safety gate, and Kanban task graph | ingest contract |
| spec-deep-dive-case-jcode | Priority calibration case for a post-LLM coding-agent harness | distributed spec surfaces and high connectedness |
| spec-deep-dive-case-droidagent | Priority case for agent-generated mobile-GUI behavioral scenarios | requirements/tasks/reports/scripts as replayable behavior specs |
| spec-deep-dive-case-j8-ambiguity | Negative-evidence trail separating j8agent ambiguity from J8Spec control value | namespace ambiguity and pre-AI executable-spec control |
| spec-deep-dive-cohort-exact-spec-md-and-standards | Exact spec.md, mature standards, and protocol repositories | normative standards and implementation-backed spec files |
| spec-deep-dive-cohort-agent-native-spec-kit-kiro | Agent-native workflow specs, Spec Kit, Kiro, templates, and lineage | generated scaffolds versus edited project contracts |
| spec-deep-dive-cohort-rfc-adr-executable-contracts | RFC/proposal governance and executable/formal contracts | governance prose, API schemas, IDLs, TLA+/Dafny-like formal artifacts |

Aggregate findings

1. Cohort yield is heterogeneous by design

The aggregate wave read 12 scout directories, 79 top-level Markdown artifacts, and 70 JSON artifacts, with zero parse errors reported. It emitted 55 aggregate records. The main cohort yield table is:

| Cohort family | Selected records | Candidate rows | What the cohort teaches |
|---|---|---|---|
| Exact lower-case spec.md | 5 | 130 | spec.md is a precise seed, not the definition of a specification. |
| Requirements / PRD / design / architecture | 4 | 10 | Code-rich repos need section/path classification and license review. |
| AI-native .kiro / Agent OS specs | 4 | 49 | Requirements/design/tasks packets expose project-local agent contracts. |
| Spec Kit / spec-driven development | 5 | 63 | Templates, commands, hooks, skills, and consumers form a control plane. |
| RFC / ADR / proposals | 5 | 17 | Governance corpora are spec-primary but often implementation-external. |
| Executable contracts | 5 | 9 | OpenAPI, Proto, Smithy, AsyncAPI, and similar files are machine-readable specs. |
| Mature pre-AI baselines | 4 | 8 | Standards and protocol projects provide long-lived comparison cases. |
| Recent AI-era fast-growing repos | 5 | 11 | Agentic product repos show small but highly connected spec islands. |
| Low-star / small-repo counter-sample | 5 | 69 | Co-born specs, missing licenses, and SRS recall holes appear early. |
| Non-GitHub forges | 4 | 11 | Host-specific adapters are required; GitHub assumptions do not port. |
| Forks/templates/tutorials | 5 | 13 | Duplicate inflation is a lineage problem, not just a hash problem. |
| Negative controls | 4 | 10 | Docs/tests/manuals can be code-connected without being product specs. |
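As a consistency check, the per-cohort numbers can be summed and compared with the headline totals (55 selected, 400 candidates, 335 rejected/deferred, 10 failed-search rows). The sketch below hard-codes the table values; the `selection_share` helper is an illustrative name, not a field in the corpus, and the shares it prints describe scout selection only, not prevalence.

```python
# (cohort family, selected records, candidate rows), copied from the cohort yield table.
COHORTS = [
    ("Exact lower-case spec.md", 5, 130),
    ("Requirements / PRD / design / architecture", 4, 10),
    ("AI-native .kiro / Agent OS specs", 4, 49),
    ("Spec Kit / spec-driven development", 5, 63),
    ("RFC / ADR / proposals", 5, 17),
    ("Executable contracts", 5, 9),
    ("Mature pre-AI baselines", 4, 8),
    ("Recent AI-era fast-growing repos", 5, 11),
    ("Low-star / small-repo counter-sample", 5, 69),
    ("Non-GitHub forges", 4, 11),
    ("Forks/templates/tutorials", 5, 13),
    ("Negative controls", 4, 10),
]

selected_total = sum(selected for _, selected, _ in COHORTS)
candidate_total = sum(candidates for _, _, candidates in COHORTS)

# Headline totals from the short answer: selected + rejected/deferred + failed = candidates.
assert selected_total == 55
assert candidate_total == 400
assert selected_total + 335 + 10 == candidate_total


def selection_share(selected: int, candidates: int) -> float:
    """Fraction of a cohort's candidate rows that became selected dossier records."""
    return selected / candidates


for name, selected, candidates in COHORTS:
    share = selection_share(selected, candidates)
    print(f"{name}: {selected}/{candidates} = {share:.1%}")
```

The wide spread in shares reflects how differently each cohort was seeded, which is one more reason not to read the counts as population estimates.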

2. Artifact classes are not interchangeable

The aggregate artifact-class inventory is dominated by requirements, RFCs, and technical specs, but it also includes architecture docs, exact spec.md, Proto, OpenAPI/API contracts, AsyncAPI, schema/config contracts, agent-native Kiro packets, PRDs, ADRs, and formal/executable surfaces. The high-count classes reported in the aggregate include:

| Normalized artifact class | Aggregate count | Interpretation |
|---|---|---|
| requirements | 627 | Largest prose-spec family; needs false-positive controls for dependency manifests and loose docs. |
| rfc | 385 | Governance records; often spec-primary but not app-local implementation specs. |
| technical_spec | 350 | Broad technical-design/spec family; needs path and authority labels. |
| architecture | 61 | Can be normative, explanatory, or historical depending on connectedness evidence. |
| proto_contract | 48 | Machine-readable contract surface; codegen/test linkage matters. |
| api_contract | 41 | API definitions belong in the spec corpus even when not Markdown. |
| exact_spec_md | 34 | Useful discovery handle, not an artifact ontology. |
| schema_or_config_contract | 28 | Config/schema files can carry contractual meaning. |
| asyncapi_contract | 24 | Executable contract family with raw-export caveats in this pass. |
| agent_native_kiro | 20 | Agent-native requirements/design/tasks packets; important AI-era stratum. |

3. Connectedness is widespread, but it is evidence, not metaphysics

Across aggregate records, connectedness flags were common: test_linkage appears in 50 records, path_or_symbol_mentions in 43, cochange in 42, direct_links in 29, and reverse_backrefs in 21. Those flags support graph features for llm-readable-spec-files and evaluation-and-review-loops, but they should remain typed evidence. Same-commit co-change says “these moved together,” not “this spec governed that implementation by divine right.”
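One way to keep the flags typed is to validate them against a closed set of edge kinds before counting. This is a minimal sketch: the flag names come from the paragraph above, but the `EdgeKind` enum, the record layout, and the example repositories are assumptions made for illustration.

```python
from collections import Counter
from enum import Enum


class EdgeKind(Enum):
    """Connectedness flag names as reported in the aggregate."""
    TEST_LINKAGE = "test_linkage"
    PATH_OR_SYMBOL_MENTIONS = "path_or_symbol_mentions"
    COCHANGE = "cochange"
    DIRECT_LINKS = "direct_links"
    REVERSE_BACKREFS = "reverse_backrefs"


# Hypothetical records; the real layout of aggregate_repo_records.jsonl is not shown here.
records = [
    {"repo": "example/a", "flags": ["test_linkage", "cochange"]},
    {"repo": "example/b", "flags": ["direct_links", "test_linkage", "path_or_symbol_mentions"]},
    {"repo": "example/c", "flags": []},
]


def flag_counts(rows):
    """Count how many records carry each flag, at most once per record."""
    counts = Counter()
    for row in rows:
        for flag in set(row["flags"]):
            # EdgeKind(flag) raises ValueError for an unknown flag, so untyped
            # evidence cannot slip into the counts.
            counts[EdgeKind(flag)] += 1
    return counts


counts = flag_counts(records)
for kind in EdgeKind:
    print(kind.value, counts[kind])
```

Nothing in this structure encodes "governs" or "caused": a `cochange` edge stays a co-change observation until some other, separately typed evidence says more.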

4. Spec/code ratios need role labels

The aggregate ratio distribution was available for 51 records: min 0.0000, p25 0.0319, median 0.1337, p75 3.4141, and max 3008.3182, with four records that have zero high-confidence specs. That range is meaningful only when each repo’s role is preserved:

  • spec-as-product repositories such as RFC, KEP, EIP, or OpenTelemetry-style standards can have enormous ratios;
  • code-rich implementation repos often have small but meaningful spec islands;
  • executable-contract repos need separate treatment because schemas are both specifications and build inputs;
  • negative controls prove that manuals, tests, and dependency files can be strongly code-connected while still not being high-confidence specs.
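To illustrate why a pooled summary misleads, the sketch below compares a pooled median with per-role medians. The role labels paraphrase the bullets above and the ratio values are invented for the example; none of them are corpus measurements.

```python
from collections import defaultdict
from statistics import median

# (assumed role label, spec/code ratio): made-up values for illustration only.
records = [
    ("spec_as_product", 120.0),
    ("spec_as_product", 3000.0),
    ("code_rich_implementation", 0.02),
    ("code_rich_implementation", 0.05),
    ("code_rich_implementation", 0.10),
    ("executable_contract", 1.5),
    ("negative_control", 0.0),
]


def median_by_role(rows):
    """Group ratios by role label and return the median of each group."""
    by_role = defaultdict(list)
    for role, ratio in rows:
        by_role[role].append(ratio)
    return {role: median(values) for role, values in by_role.items()}


pooled = median(ratio for _, ratio in records)
per_role = median_by_role(records)

print("pooled median:", pooled)
for role, value in per_role.items():
    print(role, value)
```

In this toy data the pooled median lands inside the code-rich group and says nothing about the standards-style repositories, which is the failure mode the role labels are meant to prevent.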

5. Raw export still fails closed

The normalized redistribution summary in the aggregate was 34 allowed, 11 review_required, 6 metadata_only, 3 allowed_pending_policy_review, and 1 allowed_with_obligations. That is not a license to publish raw content from the private corpus. Wiki pages in this wave use synthesis, metadata, public URLs, commits, representative paths, and caveats. Any raw-content export still depends on the fail-closed gate: license status, redistribution status, secret/PII/internal scan status, excerpt policy, and human audit labels.
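A fail-closed gate can be sketched as a check that passes only when every gate field is present and holds an explicitly approved value. The gate names follow the list above; the field keys and the passing values other than `allowed` are assumptions for this example, not the corpus schema.

```python
# Gate fields named in the text; the key spellings are assumed.
REQUIRED_GATES = (
    "license_status",
    "redistribution_status",
    "scan_status",
    "excerpt_policy",
    "human_audit",
)

# Assumed passing values. Anything not listed here fails, including
# review_required, metadata_only, and allowed_pending_policy_review.
PASSING = {
    "license_status": {"cleared"},
    "redistribution_status": {"allowed"},
    "scan_status": {"clean"},
    "excerpt_policy": {"raw_permitted"},
    "human_audit": {"approved"},
}


def raw_export_allowed(record: dict) -> bool:
    """Return True only if every gate is present with an approved value."""
    # record.get() yields None for a missing gate, and None is never in a
    # passing set, so missing metadata blocks export instead of permitting it.
    return all(record.get(gate) in PASSING[gate] for gate in REQUIRED_GATES)


complete = {
    "license_status": "cleared",
    "redistribution_status": "allowed",
    "scan_status": "clean",
    "excerpt_policy": "raw_permitted",
    "human_audit": "approved",
}
pending = dict(complete, redistribution_status="allowed_pending_policy_review")
unaudited = {k: v for k, v in complete.items() if k != "human_audit"}

print(raw_export_allowed(complete), raw_export_allowed(pending), raw_export_allowed(unaudited))
```

The design choice is the allowlist: a new or misspelled status value blocks export by default, so the 34 `allowed` records still cannot ship without the other four gates.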

6. Discovery failures are part of the result

Several scouts recorded unavailable or rate-limited discovery surfaces: unauthenticated GitHub code search returned HTTP 401 or 403 in multiple lanes; Hermes web search was unavailable where Firecrawl was not configured; GitLab, Codeberg, SourceHut, Bitbucket, and self-hosted GitLab required host-specific methods. The correct dataset behavior is to store these as discovery_event and negative_evidence rows, not to pretend the search frontier was complete.
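A discovery_event row might look like the sketch below. Only the row-type name comes from the text; the field names, outcome vocabulary, and the `may_claim_complete` helper are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class DiscoveryEvent:
    """One attempted discovery surface and what happened (assumed fields)."""
    host: str
    method: str
    outcome: str  # assumed vocabulary: "ok", "http_error", "rate_limited", "unavailable"
    http_status: Optional[int] = None
    fallback_method: Optional[str] = None


def to_jsonl(events) -> str:
    """Serialize events as JSON Lines so failures are stored next to successes."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in events)


def may_claim_complete(events) -> bool:
    """A search frontier counts as complete only if every attempt succeeded."""
    events = list(events)
    return bool(events) and all(e.outcome == "ok" for e in events)


events = [
    DiscoveryEvent("github.com", "code_search_unauthenticated", "http_error", http_status=403),
    DiscoveryEvent("codeberg.org", "host_specific_listing", "ok"),
]

print(to_jsonl(events))
print(may_claim_complete(events))
```

An empty event list also returns False, so a lane that recorded nothing cannot be reported as fully searched.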

Dataset-design carry-forward

The next corpus build should preserve these schema separations:

  1. repo_dossier / repo_context alongside artifact rows.
  2. artifact_form separate from authority_origin.
  3. template_lineage separate from independent project authority.
  4. discovery_event and negative_evidence rows for failed searches and fallback methods.
  5. clone_and_history_coverage for full, blobless, sparse, current-tree-only, and rate-limited retrieval.
  6. Typed spec_code_edge evidence: direct link, reverse backref, path proximity, symbol/endpoint mention, test linkage, codegen linkage, co-change, and issue/PR/release linkage.
  7. Per-record compliance gates and raw-content export policy.
  8. AI-era timing labels separate from AI-generation claims.
  9. Negative-control strata in every crawl batch.
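Several of these separations can be made concrete as distinct fields on an artifact row. The following is a rough sketch under assumed names: the field spellings, example values, and the `counts_as_independent_authority` rule are illustrative, not the corpus schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArtifactRow:
    """Hypothetical artifact row keeping the separations in distinct fields."""
    repo_url: str
    path: str
    artifact_form: str                         # e.g. "exact_spec_md", "proto_contract"
    authority_origin: str                      # e.g. "project_authored", "generated_scaffold"
    template_family: Optional[str] = None      # lineage, tracked apart from authority
    reviewed_as_independent: bool = False      # set only after human review
    ai_era_timing: Optional[str] = None        # when the artifact appeared
    ai_generation_claim: Optional[bool] = None # None means no claim is made
    negative_control: bool = False


def counts_as_independent_authority(row: ArtifactRow) -> bool:
    """Template descendants count only after review; negative controls never do."""
    if row.negative_control:
        return False
    return row.template_family is None or row.reviewed_as_independent


scaffold = ArtifactRow(
    repo_url="https://example.invalid/org/repo",  # placeholder URL
    path="specs/001-feature/spec.md",
    artifact_form="exact_spec_md",
    authority_origin="generated_scaffold",
    template_family="spec_kit",
    ai_era_timing="post_llm",
)

print(counts_as_independent_authority(scaffold))
```

Keeping `ai_era_timing` and `ai_generation_claim` as separate fields means a recent artifact is never implicitly labeled as AI-generated.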

Remaining gates

  • spec-deep-dive-wiki-ingest-project still routes a final public-safety review through SPEC-WIKI-06 before commit/push publication.
  • The manual/adjudicated label gate from the broader spec-dataset-evolution-research-project remains unresolved for raw public export.
  • Historical pressure timelines are future work. Current stars, forks, and recent churn are pressure markers, not longitudinal causal evidence.
  • Non-GitHub forge coverage needs proper adapters before anyone says “complete” with a straight face.

See also