Spec Deep-Dive Wiki Ingest Project
Question
How should the private spec-dataset-evolution-corpus repository be turned into
public-safe wiki knowledge without dumping raw copied specs into the public wiki?
This is a continuation of spec-dataset-evolution-research-project. The private corpus is a working archive; the wiki should receive synthesis, source-grounded deep dives, cohort maps, and explicit caveats. Raw specs stay in the private repository unless an export gate says otherwise. Neat archive, clean window.
Current private corpus
Private GitHub repository:
https://github.com/ericfode/spec-dataset-evolution-corpus
Local checkout:
/Users/ericfode/src/spec-dataset-evolution-corpus
Inspected state for SPEC-WIKI-00:
- Local corpus checkout git status --short: clean.
- Local corpus checkout git rev-parse --short HEAD: 4659608.
- reports/deep-dives/: 207 files, including 103 Markdown files, 87 JSON files, and 17 JSONL files.
- reports/AGGREGATE.md and reports/deep-dives/AGGREGATE.md: aggregate digest sha256[:12] = 76c912ddb1e3.
- data/corpus_file_manifest.jsonl: 1,741 artifact occurrence rows, 1,676 private raw-file copies, and 65 metadata-only or hard-quarantined raw rows.
- data/aggregate_repo_records.jsonl: 55 selected dossier occurrences across 51 unique public repositories.
- data/connectedness_features.jsonl: 55 connectedness feature rows.
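The inspected-state figures above (file counts by type and a 12-character SHA-256 prefix) could be re-derived with a short script. This is a minimal sketch, not the tooling actually used for SPEC-WIKI-00: the helper names are hypothetical, and it assumes the digest is taken over the raw bytes of a single file, which the inspected-state notes do not spell out.

```python
import hashlib
from collections import Counter
from pathlib import Path


def count_by_suffix(root: Path) -> Counter:
    """Count regular files under root, keyed by lowercase file suffix."""
    return Counter(p.suffix.lower() for p in root.rglob("*") if p.is_file())


def digest_prefix(data: bytes, length: int = 12) -> str:
    """Return the first `length` hex characters of the SHA-256 digest of data."""
    return hashlib.sha256(data).hexdigest()[:length]


if __name__ == "__main__":
    # Assumption: the script is run from the root of the corpus checkout.
    deep_dives = Path("reports/deep-dives")
    if deep_dives.is_dir():
        counts = count_by_suffix(deep_dives)
        print(sum(counts.values()), dict(counts))
    aggregate = Path("reports/AGGREGATE.md")
    if aggregate.is_file():
        print(digest_prefix(aggregate.read_bytes()))
```

Recording the counts and digest prefix in a handoff lets a later task confirm it is reading the same corpus state before citing aggregate numbers.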
The private repository is intentionally not a public release. Its raw files retain upstream licenses and copyright context; the public wiki should not copy them wholesale. The publication boundary is therefore: synthesis and provenance out; raw corpus files stay in.
Source map
The source map below is the contract for downstream SPEC-WIKI-* tasks. Paths are
corpus-relative unless an absolute project scratch path is explicitly named.
| Source surface | What it proves | Public-safe use | Hard limits |
|---|---|---|---|
| PROJECT_BRIEF.md in the project scratch directory | Board purpose, intended outputs, source surfaces, and publication boundary. | Cite as the local project brief in Kanban handoffs; use it to keep the ingest scope stable. | Do not treat it as evidence about upstream repos. |
| reports/deep-dives/jcode.md and reports/deep-dives/jcode-analysis.json | Priority case study for 1jehuang/jcode, including source basis, spec inventory, churn, pressure, and code/spec connectedness. | Synthesize a case page; preserve https://github.com/1jehuang/jcode and cited commit/path evidence. | Do not paste the dossier wholesale; do not copy private raw corpus files. |
| reports/deep-dives/droidagent.md | Priority case study for coinse/droidagent, including retrieval notes, repo chronology, requirements, architecture, and memory model. | Synthesize a case page with repo URL, inspected paths, and caveats. | Keep runtime/setup details descriptive; do not launder raw repo text into the wiki. |
| reports/deep-dives/j8-agent.md | J8/J8Spec ambiguity trail, including the distinction between j8agent and j8spec/j8spec. | Create a public-safe ambiguity/correction section that names the namespace uncertainty and final inspected repo. | Do not overstate identity resolution beyond the dossier evidence. |
| reports/deep-dives/SPEC-REPO-01..12/ | Whole-repo scout cohorts: exact spec.md, requirements/design, agent-native specs, Spec Kit, RFC/ADR, executable contracts, mature baselines, AI-era repos, low-star samples, non-GitHub forges, templates/forks, and negative controls. | Use index.md, sources.md, candidates.jsonl, and per-repo dossier Markdown/JSON pairs to build cohort pages. Preserve repo URLs, commit SHAs, file paths, selected/rejected status, and negative evidence. | Do not flatten cohorts into one generic “spec” label; do not omit rejected candidates and search failures when they shape interpretation. |
| reports/deep-dives/AGGREGATE.md and reports/AGGREGATE.md | Aggregate counts, cohort yield, artifact classes, duplicate/template clusters, ratio regimes, connectedness patterns, pressure patterns, schema changes, and case-study carry-forward list. | Use for the public deep-dive index and cohort summaries. Prefer aggregate numbers and short paraphrases. | Do not treat aggregate snippets as a substitute for per-dossier source checks on individual claims. |
| data/corpus_file_manifest.jsonl | Artifact occurrence metadata: cohort, doc type, repo URL, inspected commit, private raw path, normalized cluster, authority origin, and raw-inclusion status. | Use for counts, source path preservation, doc-type summaries, and fail-closed raw-export checks. | Never publish raw bytes from corpus/by_repo/; private raw paths are provenance, not public content. |
| data/aggregate_repo_records.jsonl | Repo-level records with artifact inventories, compliance fields, clone/history coverage, connectedness, pressure, ratios, and source-path provenance. | Use as the main structured source for per-repo evidence tables. Include repo_url, repo_full_name, source dossier path, compliance status, and caveats. | Do not collapse nested compliance fields into a single permissive label. Missing or ambiguous fields fail closed. |
| data/connectedness_features.jsonl | Typed connectedness flags and missingness for direct links, reverse backrefs, path/symbol mentions, test linkage, and co-change. | Use for connectedness summaries and evidence-type tables. | Do not claim causality from connectedness flags; they are evidence categories, not proofs of governance. |
| reports/deep-dives/SPEC-REPO-16/COMPLIANCE_EXPORT_GATE.md and compliance_export_gate.* | Public raw-content export gate design and per-record compliance policy. | Use to decide whether a downstream page may include a quoted excerpt at all. | Any missing license, flagged scan, internal/private-LAN signal, or unclear raw-content right means metadata-only or review-required. |
| reports/deep-dives/SPEC-REPO-17/PRESSURE_EVIDENCE_REVIEW.md | Pressure metric caveats and causality lint. | Use to phrase pressure claims as associations with evidence limits. | No historical pressure timeline claim without a proper time-series source. |
| reports/deep-dives/SPEC-REPO-20/TEMPLATE_LINEAGE_MODEL.md and template_lineage_* | Template, fork, tutorial, and generated-scaffold lineage model. | Use to separate canonical authority from copied or generated descendants. | Do not count template descendants as independent evidence without a lineage caveat. |
| queries/spec-dataset-evolution-research-project.md | Public research frame, dataset units, discovery patterns, schema sketch, and Kanban design. | Link from all public pages as the project root. | Do not rewrite the public research question in each page; link back and state the local slice. |
Downstream task source routing
| Downstream task | Primary sources | Expected public output |
|---|---|---|
| SPEC-WIKI-01 priority cases | jcode.md, jcode-analysis.json, droidagent.md, j8-agent.md | Case-study page or sections for jcode, DroidAgent, and the J8/J8Spec ambiguity trail. |
| SPEC-WIKI-02 exact spec.md and mature standards | SPEC-REPO-01, SPEC-REPO-07, SPEC-REPO-18, SPEC-REPO-19, aggregate records | Cohort page distinguishing exact spec.md, standards/protocol repos, and mature baselines. |
| SPEC-WIKI-03 agent-native / Spec Kit / Kiro | SPEC-REPO-03, SPEC-REPO-04, SPEC-REPO-08, SPEC-REPO-11, SPEC-REPO-20 | Cohort page for .kiro, .agent-os, .specify, Spec Kit, templates, and generated scaffolds. |
| SPEC-WIKI-04 RFC / ADR / executable contracts | SPEC-REPO-05, SPEC-REPO-06, SPEC-REPO-18, SPEC-REPO-19, connectedness features | Cohort page separating prose governance corpora from machine-readable contracts. |
| SPEC-WIKI-05 aggregate index | AGGREGATE.md, aggregate_repo_records.jsonl, corpus_file_manifest.jsonl, all cohort pages | Public-safe index page spec-deep-dive-index with source map, cohort links, top findings, caveats, and remaining gates. |
| SPEC-WIKI-06 public-safety review | All wiki pages produced by the ingest wave plus SPEC-REPO-16 compliance gate | Review page or handoff confirming lint, public-safety policy adherence, and push verification. |
Naming conventions
Use stable, boring names. Boring names are a gift to future grep.
- Project anchor: queries/spec-deep-dive-wiki-ingest-project.md.
- Aggregate public index: queries/spec-deep-dive-index.md.
- Priority case pages, if split out:
  - queries/spec-deep-dive-case-jcode.md
  - queries/spec-deep-dive-case-droidagent.md
  - queries/spec-deep-dive-case-j8-ambiguity.md
- Cohort pages, if split out:
  - queries/spec-deep-dive-cohort-exact-spec-md-and-standards.md
  - queries/spec-deep-dive-cohort-agent-native-spec-kit-kiro.md
  - queries/spec-deep-dive-cohort-rfc-adr-executable-contracts.md
- Raw wiki notes are optional and should only summarize a source; never store raw private corpus files under raw/. If a raw note is needed, name it raw/articles/spec-deep-dive-<topic>-source-note-2026-05-05.md and keep it paraphrased.
- Every new content page must be added to index and must use type: query while it lives under queries/.
- Use the visible page title form Spec Deep-Dive: <topic> for pages in this ingest wave.
Citation style
Each page produced from the private corpus should contain a Source basis section
near the top with this shape:
| Claim scope | Private corpus source | Public upstream reference | Evidence fields used | Caveat |
|---|---|---|---|---|
| One sentence naming the claim | Corpus-relative path such as reports/deep-dives/SPEC-REPO-08/openai__codex.json | Repo URL plus commit/file path when available | repo_url, inspected_commit, file_path, doc_type, connectedness.flags, compliance.* | Missingness, rate limit, license, or ambiguity note |
Rules:
- Prefer corpus-relative paths over absolute local paths in public prose.
- Preserve public repo URLs exactly when they are available.
- Preserve commit SHAs and file paths when the source record provides them.
- Cite source rows by dossier path plus row identity when a JSONL row is the evidence. Example: data/aggregate_repo_records.jsonl row for openai/codex, sourced from reports/deep-dives/SPEC-REPO-08/openai__codex.json.
- Treat negative evidence as citable evidence: failed GitHub code search, API rate limits, missing license fields, partial clone coverage, and rejected candidates should appear when they constrain the claim.
- Keep llm-readable-spec-files, context-engineering, work-management-primitives, and evaluation-and-review-loops as conceptual links, not substitutes for corpus evidence.
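The citation rules above can be made mechanical by rendering each Source basis row from a structured record. The following is a hedged sketch: the function name is hypothetical, the record keys repo_url, inspected_commit, file_path, and doc_type are taken from the evidence-field column of the table shape above, and the exact layout of real corpus records is an assumption.

```python
def source_basis_row(claim: str, record: dict, dossier_path: str, caveat: str = "") -> str:
    """Render one Markdown table row for a Source basis section."""
    # Public prose should use corpus-relative paths, so reject absolute ones.
    if dossier_path.startswith("/"):
        raise ValueError("use a corpus-relative path, not an absolute local path")

    repo_url = record.get("repo_url")
    commit = record.get("inspected_commit")
    # Preserve the repo URL exactly; append the commit only when the record has one.
    upstream = repo_url or "not recorded"
    if repo_url and commit:
        upstream = f"{repo_url} @ {commit}"

    evidence_keys = ("repo_url", "inspected_commit", "file_path", "doc_type")
    fields_used = ", ".join(k for k in evidence_keys if record.get(k)) or "none"

    # Missing evidence is stated as a caveat rather than smoothed over.
    if not caveat:
        missing = [k for k in ("repo_url", "inspected_commit") if not record.get(k)]
        caveat = "missing: " + ", ".join(missing) if missing else "none recorded"

    cells = (claim, dossier_path, upstream, fields_used, caveat)
    return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"
```

Generating rows this way keeps negative evidence visible by default: a record without a commit yields an explicit "missing" caveat instead of an empty cell.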
Excerpt policy
Default publication mode is synthesis plus metadata. Direct quotation is an exception, not a decorating habit.
Allowed:
- Short excerpts from public upstream files when needed to explain a classification or ambiguity.
- Quoted field names, file paths, repo names, command names, and stable IDs.
- Aggregate numbers, paraphrased findings, and evidence tables derived from private JSON/JSONL records.
Limits:
- No wholesale raw corpus files in the wiki.
- No copied private corpus/by_repo/ file bodies.
- No excerpt from a record with metadata_only, review_required, flagged secret or PII scan, unclear license, internal/private-LAN signal, or missing scan evidence unless a human reviewer explicitly approves it in a later task.
- Keep ordinary excerpts to at most 25 words each and at most three excerpts per upstream repository on a single page. If the claim needs more than that, the page should summarize and link to the public upstream repository instead.
- Every excerpt must have a nearby citation row naming corpus-relative path, public repo URL, file path, and commit or retrieval basis when available.
- Do not publish secrets, tokens, private emails, private local-only URLs, raw PII, or screenshots of private corpus content. Placeholder-looking secrets must still be classified as placeholder before quotation.
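The numeric limits above (25 words per excerpt, three excerpts per upstream repository per page) lend themselves to an automated check. This is a minimal sketch under stated assumptions: the function name is hypothetical, excerpts are supplied as (repo URL, text) pairs for a single page, and words are counted by whitespace splitting, which the policy does not define.

```python
from collections import Counter

MAX_WORDS = 25
MAX_EXCERPTS_PER_REPO = 3


def excerpt_violations(excerpts: list[tuple[str, str]]) -> list[str]:
    """Check (repo_url, excerpt_text) pairs from one page against the limits."""
    problems = []
    per_repo = Counter()
    for repo_url, text in excerpts:
        per_repo[repo_url] += 1
        words = len(text.split())  # assumption: whitespace-delimited word count
        if words > MAX_WORDS:
            problems.append(f"{repo_url}: excerpt has {words} words (limit {MAX_WORDS})")
    for repo_url, count in per_repo.items():
        if count > MAX_EXCERPTS_PER_REPO:
            problems.append(f"{repo_url}: {count} excerpts (limit {MAX_EXCERPTS_PER_REPO})")
    return problems
```

A check like this covers only the countable limits; compliance status, citation rows, and secret or PII classification still need the record-level gates and human review.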
Public-safety ingest policy
Wiki pages may include:
- synthesis of the deep-dive dossiers;
- repository URLs, commit SHAs, paths, and stable source references;
- short excerpts only under the excerpt policy above;
- caveats about missing metadata, API limits, license review, clone coverage, and manual-audit status;
- links back to spec-dataset-evolution-research-project, llm-readable-spec-files, context-engineering, work-management-primitives, and evaluation-and-review-loops.
Wiki pages must not include:
- bulk raw spec contents;
- unreviewed long copied excerpts;
- private raw corpus file bodies;
- claims that a post-ChatGPT timestamp proves AI generation;
- current-star snapshots presented as historical pressure timelines;
- flattened “spec” labels that erase templates, executable contracts, RFCs, negative controls, and package/test lookalikes;
- compliance shortcuts that treat allowed_pending_policy_review as unrestricted raw-export permission.
Fail-closed publication gates:
- If license or redistribution status is missing or unclear, publish metadata and synthesis only.
- If secret/PII/internal scan status is flagged, missing, or only spot-checked, publish metadata and synthesis only unless the page explicitly records why a short excerpt is safe.
- If clone/history coverage is partial, preserve the limitation next to any churn, pressure, or connectedness claim.
- If a repo is a fork, template, tutorial, translation, generated scaffold, or downstream Spec Kit consumer, label its authority origin before using it as independent evidence.
- If a finding depends on failed or rate-limited search, cite the failure as negative evidence and avoid recall-completeness claims.
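The fail-closed gates above can be sketched as a decision function in which anything missing or ambiguous resolves to the restrictive outcome. The field names (license_status, scan_status, export_status, clone_coverage, authority_origin, search_failed) and their values are hypothetical placeholders rather than the corpus schema, and the sketch is stricter than the policy in one respect: it has no path for a page-level justification of a short excerpt.

```python
BLOCKED = "metadata_and_synthesis_only"
ALLOWED = "short_excerpt_allowed"

# Hypothetical labels for descendants that need an authority-origin caveat.
DERIVED_ORIGINS = {
    "fork", "template", "tutorial", "translation",
    "generated_scaffold", "spec_kit_consumer",
}


def publication_mode(record: dict) -> str:
    """Return ALLOWED only when every gate is explicitly satisfied."""
    # Missing keys fall through to BLOCKED because .get() returns None.
    if record.get("license_status") != "clear":
        return BLOCKED
    if record.get("scan_status") != "clean":  # flagged, missing, or spot-checked
        return BLOCKED
    # allowed_pending_policy_review is deliberately not treated as permission.
    if record.get("export_status") != "approved":
        return BLOCKED
    return ALLOWED


def required_caveats(record: dict) -> list[str]:
    """List the caveats that must sit next to claims drawn from this record."""
    caveats = []
    if record.get("clone_coverage") != "full":
        caveats.append("clone/history coverage is partial or unrecorded")
    if record.get("authority_origin") in DERIVED_ORIGINS:
        caveats.append(f"authority origin: {record['authority_origin']}")
    if record.get("search_failed"):
        caveats.append("search failed or was rate-limited; no recall-completeness claim")
    return caveats
```

The design choice is that the permissive result is reachable only through an explicit positive value on every gate, so an empty or partially filled record can never unlock an excerpt.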
Acceptance criteria for downstream pages
A downstream ingest page is acceptable only when all of the following are true:
- It has valid YAML frontmatter for the wiki directory it lives in.
- It contains at least two outbound wikilinks and is listed in index if it is a content page under queries/, concepts/, entities/, or comparisons/.
- It has a Source basis section with corpus-relative paths, public repo URLs, and commit/file-path evidence where available.
- It records compliance status or explicitly says that raw export is blocked or unreviewed.
- It marks negative evidence and uncertainty rather than smoothing them into confident prose.
- It does not include wholesale raw corpus files or long copied source passages.
- It distinguishes at least the relevant artifact class: exact spec.md, requirements/design, agent-native spec, template/control plane, RFC/ADR, executable contract, or negative control.
- It separates timing labels from AI-generation claims.
- It runs scripts/lint-wiki.sh after edits and records the result in the Kanban handoff.
- If it creates or edits a public ingest wave page, it updates log with the page names and validation result.
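Several of the structural criteria above can be pre-checked before running scripts/lint-wiki.sh. The sketch below is illustrative and does not describe what that script does: the function name is hypothetical, it assumes wikilinks use the [[target]] form and that frontmatter is a leading block delimited by --- lines, and it checks frontmatter with a regular expression instead of a YAML parser.

```python
import re

CONTENT_DIRS = ("queries/", "concepts/", "entities/", "comparisons/")
WIKILINK = re.compile(r"\[\[[^\[\]]+\]\]")  # assumption: [[target]] link syntax


def page_problems(path: str, text: str, index_text: str) -> list[str]:
    """Return structural problems for one wiki page; an empty list means none found."""
    problems = []
    # Frontmatter: expect the page to open with a '---' delimited block.
    match = re.match(r"---\n(.*?)\n---\n", text, re.DOTALL)
    if not match:
        problems.append("missing YAML frontmatter")
    elif path.startswith("queries/") and not re.search(
        r"^type:\s*query\s*$", match.group(1), re.MULTILINE
    ):
        problems.append("pages under queries/ must use type: query")
    if len(WIKILINK.findall(text)) < 2:
        problems.append("fewer than two outbound wikilinks")
    if "Source basis" not in text:
        problems.append("missing Source basis section")
    # Content pages must be listed in the index by their slug.
    slug = path.rsplit("/", 1)[-1].removesuffix(".md")
    if path.startswith(CONTENT_DIRS) and slug not in index_text:
        problems.append("content page not listed in index")
    return problems
```

The semantic criteria (negative evidence, artifact class, timing versus AI-generation claims) are not covered here and remain a reviewer judgment.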
Kanban task graph
Board: spec-deep-dive-wiki-ingest
| Task | ID | Purpose |
|---|---|---|
| SPEC-WIKI-00 | t_06d43bca | Source map and public-safety ingest policy |
| SPEC-WIKI-01 | t_34564330 | Priority case studies: jcode, DroidAgent, J8/J8Spec |
| SPEC-WIKI-02 | t_8f345b83 | Exact spec.md and mature standards cohort |
| SPEC-WIKI-03 | t_23bdfe0a | Agent-native / Spec Kit / Kiro cohort |
| SPEC-WIKI-04 | t_a42f722e | RFC / ADR / executable-contract cohort |
| SPEC-WIKI-05 | t_81257281 | Aggregate deep-dive index and cross-links |
| SPEC-WIKI-06 | t_c84d9ec2 | Public-safety review, lint, commit, push |
Dependency shape:
SPEC-WIKI-00
├─ SPEC-WIKI-01
├─ SPEC-WIKI-02
├─ SPEC-WIKI-03
└─ SPEC-WIKI-04
↓
SPEC-WIKI-05
↓
SPEC-WIKI-06
SPEC-WIKI-01 priority case pages
The first public-safe case-study wave produced three focused pages:
| Page | Main private corpus sources | Public-safe emphasis |
|---|---|---|
| spec-deep-dive-case-jcode | reports/deep-dives/jcode.md, reports/deep-dives/jcode-analysis.json, reports/jcode_first_calibration_seed.md | Post-LLM coding-agent harness; distributed spec surfaces; high code/spec connectedness; raw export remains review-required. |
| spec-deep-dive-case-droidagent | reports/deep-dives/droidagent.md | Agent-generated mobile-GUI behavioral scenarios; task/report/script connectedness; weak ordinary repo-history signal. |
| spec-deep-dive-case-j8-ambiguity | reports/deep-dives/j8-agent.md plus the corrected jcode dossier/seed | Negative evidence for J8 Agent; j8agent namespace collision; j8spec/j8spec as a useful pre-AI executable-spec control, not the corrected priority target. |
These pages intentionally publish synthesis and source metadata only. They do not copy raw private corpus bodies or long upstream passages.
Cohort pages
The first cohort synthesis pages are:
| Page | Main private corpus sources | Public-safe emphasis |
|---|---|---|
| spec-deep-dive-cohort-exact-spec-md-and-standards | SPEC-REPO-01, SPEC-REPO-07, SPEC-REPO-18, SPEC-REPO-19, aggregate records | Exact spec.md as a discovery signal; mature standards/protocol repositories; formal and executable specification surfaces. |
| spec-deep-dive-cohort-agent-native-spec-kit-kiro | SPEC-REPO-03, SPEC-REPO-04, SPEC-REPO-08, SPEC-REPO-11, SPEC-REPO-20 | Agent-native spec directories, Spec Kit / .specify, .kiro/specs, prompt-template families, and template-lineage caveats. |
| spec-deep-dive-cohort-rfc-adr-executable-contracts | SPEC-REPO-05, SPEC-REPO-06, SPEC-REPO-18, SPEC-REPO-19, SPEC-REPO-16 export gate | RFC/proposal governance records versus OpenAPI/Proto/Smithy/AsyncAPI/GraphQL/Thrift/TLA+/Dafny executable or formal contracts. |
Aggregate index
The SPEC-WIKI-05 aggregate page is spec-deep-dive-index. It links the
priority cases, cohort pages, aggregate source-basis counts, artifact-family
taxonomy, connectedness flags, compliance posture, and remaining gates. It uses
reports/AGGREGATE.md, reports/deep-dives/AGGREGATE.md,
data/aggregate_repo_records.jsonl, data/corpus_file_manifest.jsonl,
data/connectedness_features.jsonl, SPEC-REPO-16, and SPEC-REPO-20 as
private-corpus evidence while preserving the publication boundary: synthesis and
metadata out; raw corpus bodies stay private.
Acceptance criteria
The ingest project is complete when:
- The wiki has a public-safe deep-dive index page: spec-deep-dive-index.
- Priority case studies and cohorts have either dedicated pages or clearly named sections with source paths and caveats.
- The original spec-dataset-evolution-research-project links to the private corpus and the ingest index.
- The wiki log records each ingest wave.
- scripts/lint-wiki.sh passes.
- The final wiki commit is pushed and HEAD == origin/main is verified.
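The final criterion, verifying that HEAD equals origin/main, can be scripted by comparing the two resolved commit SHAs. This is a sketch with hypothetical function names; it relies only on git rev-parse, and it assumes the local origin/main ref is current, which holds after a successful push or a fresh git fetch.

```python
import subprocess
from typing import Callable


def git_rev_parse(ref: str) -> str:
    """Resolve a ref to a commit SHA with `git rev-parse` in the current directory."""
    result = subprocess.run(
        ["git", "rev-parse", ref], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def head_matches_remote(
    resolve: Callable[[str], str] = git_rev_parse, remote_ref: str = "origin/main"
) -> bool:
    """Return True when HEAD and the remote-tracking ref point at the same commit."""
    return resolve("HEAD") == resolve(remote_ref)
```

The resolver is injectable so the comparison can be exercised without a repository; in the wiki checkout, calling head_matches_remote() with no arguments uses git directly.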
Open gates
- Raw public export remains blocked by the manual/adjudicated label gate from SPEC-DATA-22.
- The private corpus may be used as evidence, but publication still needs the fail-closed export checks: license, redistribution, secret/PII/internal scans, excerpt policy, and human audit labels.
- Historical pressure timelines remain future work; current stars and sampled co-change are not longitudinal pressure series.