Wiki Index
Content catalog. Every wiki page is listed under its type with a one-line summary. Read this first to find relevant pages for any query. Last updated: 2026-04-21 | Total pages: 136
News
- nightly-src-projects-desk-2026-04-21 — First nightly local-projects desk: safe publishable work across the active src tree, from harness control planes to game/UI/runtime experiments.
Entities
- aflow — Workflow-search system that optimizes code-represented agent graphs with MCTS and execution feedback.
- agent-workflow-memory — External procedural-memory system that induces reusable workflows from past trajectories.
- agentboard — Analytical evaluation board for multi-turn agents with progress metrics across many task settings.
- agentevolver — Broader self-evolving-agent framework built around self-questioning, self-navigation, and self-attribution.
- agentgym — Multi-environment suite for evolving LLM agents across diverse tasks rather than one narrow world.
- appworld — Controllable multi-app world with state-based grading for interactive coding and tool-use agents.
- atommem — Learnable memory-control system that decomposes memory management into atomic operations.
- atropos — Hermes-facing RL environment and rollout substrate for multi-turn tool-calling agent tasks.
- autodspy — RL-driven DSPy pipeline constructor that optimizes modules, signatures, and execution strategies.
- autoflow — Natural-language workflow-generation system that makes agent procedures explicit and iterable.
- browsergym — Unified gym-like substrate that standardizes action and observation surfaces across web-agent benchmarks.
- claude-code — Anthropic’s coding agent and harness research program for long-running, evaluator-driven development.
- codex-app-server — The durable protocol layer that lets Codex span CLI, IDE, web, and app clients.
- codex-cli — OpenAI’s terminal coding agent with an App Server architecture and strong repo-legibility discipline.
- compiled-memory — Instruction-compilation system that rewrites agent guidance from validated experience.
- computer-rl — Distributed RL infrastructure for training desktop and computer-use agents at scale.
- dspy — LM-program compilation and optimization framework that turns prompt engineering into modular program engineering.
- dspy-assertions — Contract-bearing extension of DSPy that adds computational constraints and self-repair loops.
- dyflow — Runtime workflow-adaptation system that revises procedures from intermediate feedback.
- enterprisebench-corecraft — High-fidelity enterprise RL environment with rubric-based rewards and transfer-focused evaluation.
- evoskills — Skill-generation system with a co-evolving verifier lane for autonomous improvement.
- expel — Experiential-learning system that distills reusable lessons from prior tasks.
- gaia — Broad benchmark for general AI assistants requiring reasoning, browsing, multimodality, and tool use.
- gas-city — Modular successor to Gas Town, oriented around composable orchestration primitives and Wasteland federation.
- gas-town — Steve Yegge’s multi-agent coding factory built around the MEOW stack and durable work objects.
- gepa — Reflective prompt-evolution system that learns from traces and preserves Pareto-diverse candidates.
- graph-of-skills — Dependency-aware retrieval layer for large executable skill libraries.
- hermes-agent — Persistent self-improving agent centered on searchable memory, skills, and multi-surface continuity.
- judgeflow — Block-level workflow-diagnosis system for targeted repair and promotion decisions.
- mathcode — Terminal mathematical coding agent that translates natural-language problems into Lean proofs with reusable theorem and axiom stores.
- memento-skills — Self-evolving agent framework that treats skills as writable memory and learns by rewriting them.
- memskill — System that turns memory procedures into evolvable skills.
- mermaidflow — Safety-constrained workflow-search system over statically structured Mermaid graphs.
- metaagent — Self-evolving agent framework centered on tool meta-learning and durable capability growth.
- metaclaw — Continual-learning agent platform that combines fast skill synthesis with slower policy optimization.
- mlgym — Gym framework for AI-research agents working on open-ended machine-learning tasks.
- openclaw — Ecosystem-first persistent agent runtime with broad integrations and a large public skill marketplace.
- opro — In-context black-box optimizer that proposes new candidates from scored history.
- osworld — Real-computer benchmark environment for open-ended multimodal agents across operating systems.
- promptagent — Planning-based prompt optimizer that searches prompt states via reflective tree search.
- promptbreeder — Evolutionary prompt optimizer that co-evolves task prompts and mutation prompts.
- proxy-state-based-evaluation — Scalable reward and grading approach for tool-calling agents without fully deterministic backends.
- reflexion — Verbal-reinforcement-learning system that writes reflective feedback into episodic memory.
- rlprompt — Canonical RL-on-prompts method that optimizes discrete prompt text for frozen language models.
- robustflow — Workflow-generation system optimized for invariance under paraphrase and noisy instructions.
- sage — RL framework for accumulating and reusing skills across sequential rollouts.
- sammo — Symbolic compile-time prompt-program optimizer built around structure-aware transformations.
- severa — Verified-synthesis framework for self-evolving agents under hard formal constraints.
- skillfoundry — Skill-library construction system that mines validated skills from heterogeneous resources.
- skillx — Hierarchical skill-knowledge-base system built from trajectories and execution feedback.
- sop-agent — Procedure-externalization system that turns SOPs into pseudocode and decision graphs.
- sopbench — Executable benchmark for agents following standard operating procedures, constraints, and tool-use rules.
- swe-gym — Executable software-engineering training environment for agents and verifiers over real codebases.
- tau-bench — Benchmark for multi-turn tool-agent-user interaction under domain rules and dynamic conversation.
- tempera — Runtime prompt-editing system that adapts instruction phrases, exemplars, and verbalizers per query.
- textgrad — Textual-autograd framework for optimizing compound AI systems through language feedback.
- trace2skill — Distillation system that turns trajectory-local lessons into transferable skills.
- visualwebarena — Realistic multimodal web benchmark for visually grounded browsing tasks.
- webarena — Realistic multi-domain web environment for autonomous long-horizon browser tasks.
- webcanvas — Online web-agent benchmark framework that stays live under interface drift.
- webshop — Early grounded web-interaction environment with real products and RL-compatible task structure.
- windows-agent-arena — Scalable Windows-specific environment for evaluating multimodal OS agents.
- worfbench — Graph-aware benchmark that evaluates workflow-generation quality in terms of workflow structure.
- worfeval — Evaluation layer paired with WorfBench for structural, partial, and downstream workflow scoring.
- workarena — Enterprise knowledge-work benchmark built on BrowserGym for routine professional web tasks.
- workarena-plus-plus — More compositional and reasoning-heavy extension of WorkArena for enterprise workflows.
Concepts
- agent-harness-anatomy — Structural breakdown of session state, tools, memory, validation, and coordination layers in modern agent harnesses.
- automation-and-background-work — How serious harnesses schedule, dispatch, and review agent work outside a live chat turn.
- context-engineering — How harnesses manage visibility, resets, compaction, and handoff artifacts across long-running work.
- evaluation-and-review-loops — Why serious harnesses separate building from checking and route failure into iteration.
- formal-cognition-loop — The architecture that routes problems into formal space, solves there, and then reifies checked witnesses back into implementation space.
- formal-methods-for-agent-harnesses — Why harness reliability increasingly looks like intent formalization plus checkable acceptance surfaces.
- harness-engineering — The discipline of making agents effective by shaping repos, tools, feedback loops, and invariants.
- instruction-layering — Why durable repo, project, user, and policy instructions need explicit scope instead of one giant prompt.
- memory-persistence — Patterns for preserving project state, personal recall, and durable design intent across sessions.
- neural-native-programming — Model-facing latent IR design for direct read/write interfaces into transformer internals.
- non-hierarchical-coordination-patterns — Coordination patterns that let agents work together without collapsing everything into a manager tree.
- orchestration-topologies — When subagents, session teams, or swarm structures are the right coordination shape.
- partial-order-trace-semantics — Why concurrent and branching harness work wants pomsets or other partial-order models instead of a single serial transcript.
- probabilistic-epistemic-updates — How richer belief/update layers can refine simpler harness quotients without discarding them.
- safety-and-permissions — How harnesses bound tool execution, approvals, trust, and blast radius.
- self-evolving-workflows — When workflows, skills, or instruction kernels become versioned learning artifacts rather than static setup.
- sybil-resistance-and-local-trust — Why multiplayer harness networks should prefer local trust evidence and sybil-resistant identity over scalar global reputation.
- theorem-proving-as-cognitive-kernel — Why proof assistants can serve as active reasoning workspaces rather than mere post-hoc verifiers.
- fission-fusion-orchestration — Dynamic coalition orchestration with stable identities, split/merge teams, and information-scoped leadership.
- work-management-primitives — The task objects and state machines that let agents resume, coordinate, and verify work coherently.
Comparisons
- harness-architecture-comparison — Side-by-side comparison of session models, memory substrates, work graphs, and execution surfaces.
- harness-decision-matrix — Weighted scoring matrix for choosing what to borrow from each major harness family.
- harness-quality-comparison — Qualitative comparison of rigor, persistence, evaluation discipline, and orchestration style across major harnesses.
Queries
- nightly-src-projects-desk-2026-04-21 — First nightly local-projects desk summarizing the safe publishable work currently moving across the src tree.
- another-harness-and-atropos — Fit analysis for whether a thinner Codex-native harness should adopt Atropos now, later, or not at all, including why current run history stays derived rather than canonical.
- another-harness-atropos-environment-schema — Concrete repo-artifact-first episode and reward schema for a later Atropos sidecar in another-harness.
- another-harness-model-docs-drift-checker — Why the repo’s first Lean-backed docs/model drift fence targets the attempt-vs-stream grounding distinction instead of pretending to compare everything.
- another-harness-resume-recover-environment — First executable recovery family in another-harness, separating honest re-orientation from resumed work that can actually return to a reviewed state.
- another-harness-evaluator-discipline-environment — First live evaluator-side environment prototype under another-harness’s Atropos sidecar design.
- another-harness-work-item-closure-environment — First live builder-side environment prototype under another-harness’s Atropos sidecar design.
- attention-and-attribution-views-for-llm-harnesses — Honest UI guidance for attention, attribution, and what can actually be shown about model focus.
- arxiv-round-two-formal-semantics-for-agent-harnesses — Targeted arXiv scouting on formal methods, epistemic updates, and partial-order semantics for harness theory.
- arxiv-self-evolving-workflows-for-codex-control-plane — ArXiv map of workflow search, evaluator loops, skill evolution, and memory compilation for Codex-native control planes.
- arxiv-under-explored-coordination-strategies — Verified arXiv pass on coordination strategies that still look thinner than manager-worker and debate loops.
- commitment-governance-semantics-for-multiplayer-harness — Concrete commitment, case, and governance primitives for a sovereignty-preserving multiplayer harness.
- codex-app-server-provider-vs-runtime-bridge — Why Codex app-server currently belongs in Hermes as a plugin-level runtime bridge rather than as a primary provider transport.
- context-assembly-visualization-for-harnesses — Design memo for showing assembled context, source trust, and influence without collapsing them into one score.
- formal-core-agent-architecture — Synthesis of how to put a formalization gate and witness-first reasoning at the core of agent cognition.
- gas-city-but-its-just-codex — Deep dive on the repo’s current ledger, formula, gRPC, operator, UI, and formal structure around Codex-native execution.
- gas-city-control-plane-and-authority-split — Focused rendering of the repo’s intended three-service authority split and the current sidecar/runtime duplication seam.
- gas-city-live-ops-benchmarks-and-sandboxes — Operational tour of checkpoints, benchmarks, sandboxes, and the repo’s current live center of gravity.
- gas-city-operator-policy-and-formal-bridge — Focused rendering of the typed operator-policy runtime and the newer recipe/workflow/policy bridge work.
- grounding-moldable-operations-studio-ideas-in-real-research — Concrete HCI, provenance, security, and distributed-systems research that makes the studio ideas implementable rather than merely tasteful.
- high-impact-artifacts-for-multiplayer-harness-design — Prioritized inspection list of the pages and sources that most strongly constrain multiplayer harness design.
- how-to-build-a-multiplayer-harness-network — Implementation ordering and adapter strategy for a federated multiplayer harness that other harnesses can jack into.
- node-card-and-minimum-adapter-contract — Concrete node-card document and minimum honest adapter interface by which foreign harnesses can join the collaboration fabric.
- legacy-distributed-systems-ideas-for-moldable-operations-studio — Old distributed-systems control-plane ideas that still look oddly underused in developer-facing harnesses.
- moldable-operations-studio-architecture-spec — A concrete state-model and projection spec for turning the harness into a moldable operations studio.
- moldable-operations-studio-schema-pass — Concrete event, object, checkpoint, view, and promotion schemas for the moldable operations studio.
- moldable-operations-studio-wireframes — Concrete screen models and interaction loops for the wallboard, graph, evidence, queue, canvas, and pocket surfaces.
- multiplayer-agent-harnesses-and-p2p-networks — Research synthesis on local-first, peer-to-peer, and multiplayer control-plane ideas for human-plus-agent collaboration.
- neural-native-programming-research-program — Kill-happy staged experiment plan with promotion gates, benchmark order, and no-go criteria for neural-native programming.
- neural-native-programming-via-direct-interfaces-to-transformer-internal-layers — Research synthesis on typed latent IRs, activation-level interfaces, and execution-first evaluation for neural-native programming.
- sovereign-identity-and-observed-goals-schema-pass — Concrete schema patch for sovereign identity, portable attestations, commitments, goal hypotheses, and governance objects.
- sovereignty-and-observed-goals-ledgers-for-multiplayer-harnesses — Multi-round deep-dive on replacing scalar reputation with sovereign identity, commitments, provenance, and inferred-goal hypotheses.
- new-harness-design-notes — Synthesis notes on combining Codex cleanliness, Hermes learning loops, Anthropic evaluators, Gas City orchestration, and a formalization plane.
- non-hierarchical-agent-orchestration — Direct answer to the question of what to use instead of a default manager hierarchy.
- non-linear-interface-options-for-next-harness — ArXiv-backed surface ideas for moving beyond the flat transcript into graphs, checkpoints, runtime overlays, and generated control panels.
- open-questions-in-prompt-optimization-and-language-programs — Umbrella map of the main open questions in prompt optimization, language programs, and DSPy-style systems, with fan-out into three research clusters.
- prompt-optimization-and-dspy-follow-ups — Map of RL prompt optimization, prompt-program systems, and the early research line following DSPy.
- research-on-open-questions-in-prompt-optimization-and-language-programs — Question-by-question research map covering the ten cross-cutting problems in prompt optimization, evaluators, transfer, memory, constraints, and release engineering.
- prompt-optimization-eval-transfer-robustness-open-questions — Open questions memo on prompt-program evaluation validity, transfer across models, robustness under shift, and missing benchmark designs.
- prompt-optimizer-regimes-for-harnesses — Regime map for when to use runtime editing, RL over programs, black-box search, evolution, or planning in prompt optimization.
- prompt-optimization-timeline-and-harness-lessons — Chronological map of prompt optimization and concrete design lessons for agent harnesses.
- prompt-program-architecture-plans-for-another-harness-and-gas-city — Repo-grounded architecture plans translating the ten prompt-program questions into concrete stances for another-harness and gas-city-but-its-just-codex.
- prompt-program-deployment-open-questions — Open research questions on deploying, adapting, constraining, and operating optimized prompt artifacts inside long-lived harnesses.
- prompt-program-representation-and-optimizer-open-questions — Open questions on prompt-program representations, module granularity, assertions, credit assignment, and optimizer regime selection.
- rl-gyms-and-executable-environments-for-ai-harnesses — Map of browser, desktop, tool-use, coding, and research-agent gym substrates for harness evaluation and training.
- sci-fi-audit-for-moldable-operations-studio — Science-fiction control-room and distributed-cognition ideas translated into harness primitives and cautions.
- web-patterns-for-non-linear-harness-interfaces — Broader web-system patterns for moldable views, provenance-rich traces, and durable mission-control surfaces around harness work.