Wiki Index

Content catalog. Every wiki page is listed under its type with a one-line summary. Read this first to find relevant pages for any query. Last updated: 2026-04-21 | Total pages: 136

News

nightly-src-projects-desk-2026-04-21 — First nightly local-projects desk: safe publishable work across the active src tree, from harness control planes to game/UI/runtime experiments.

Entities

aflow — Workflow-search system that optimizes code-represented agent graphs with MCTS and execution feedback.
agent-workflow-memory — External procedural-memory system that induces reusable workflows from past trajectories.
agentboard — Analytical evaluation board for multi-turn agents with progress metrics across many task settings.
agentevolver — Broader self-evolving-agent framework built around self-questioning, self-navigation, and self-attribution.
agentgym — Multi-environment suite for evolving LLM agents across diverse tasks rather than one narrow world.
appworld — Controllable multi-app world with state-based grading for interactive coding and tool-use agents.
atommem — Learnable memory-control system that decomposes memory management into atomic operations.
atropos — Hermes-facing RL environment and rollout substrate for multi-turn tool-calling agent tasks.
autodspy — RL-driven DSPy pipeline constructor that optimizes modules, signatures, and execution strategies.
autoflow — Natural-language workflow-generation system that makes agent procedures explicit and iterable.
browsergym — Unified gym-like substrate that standardizes action and observation surfaces across web-agent benchmarks.
claude-code — Anthropic’s coding agent and harness research program for long-running, evaluator-driven development.
codex-app-server — The durable protocol layer that lets Codex span CLI, IDE, web, and app clients.
codex-cli — OpenAI’s terminal coding agent with an App Server architecture and strong repo-legibility discipline.
compiled-memory — Instruction-compilation system that rewrites agent guidance from validated experience.
computer-rl — Distributed RL infrastructure for training desktop and computer-use agents at scale.
dspy — LM-program compilation and optimization framework that turns prompt engineering into modular program engineering.
dspy-assertions — Contract-bearing extension of DSPy that adds computational constraints and self-repair loops.
dyflow — Runtime workflow-adaptation system that revises procedures from intermediate feedback.
enterprisebench-corecraft — High-fidelity enterprise RL environment with rubric-based rewards and transfer-focused evaluation.
evoskills — Skill-generation system with a co-evolving verifier lane for autonomous improvement.
expel — Experiential-learning system that distills reusable lessons from prior tasks.
gaia — Broad benchmark for general AI assistants requiring reasoning, browsing, multimodality, and tool use.
gas-city — Modular successor to Gas Town, oriented around composable orchestration primitives and Wasteland federation.
gas-town — Steve Yegge’s multi-agent coding factory built around the MEOW stack and durable work objects.
gepa — Reflective prompt-evolution system that learns from traces and preserves Pareto-diverse candidates.
graph-of-skills — Dependency-aware retrieval layer for large executable skill libraries.
hermes-agent — Persistent self-improving agent centered on searchable memory, skills, and multi-surface continuity.
judgeflow — Block-level workflow-diagnosis system for targeted repair and promotion decisions.
mathcode — Terminal mathematical coding agent that translates natural-language problems into Lean proofs with reusable theorem and axiom stores.
memento-skills — Self-evolving agent framework that treats skills as writable memory and learns by rewriting them.
memskill — System that turns memory procedures into evolvable skills.
mermaidflow — Safety-constrained workflow-search system over statically structured Mermaid graphs.
metaagent — Self-evolving agent framework centered on tool meta-learning and durable capability growth.
metaclaw — Continual-learning agent platform that combines fast skill synthesis with slower policy optimization.
mlgym — Gym framework for AI-research agents working on open-ended machine-learning tasks.
openclaw — Ecosystem-first persistent agent runtime with broad integrations and a large public skill marketplace.
opro — In-context black-box optimizer that proposes new candidates from scored history.
osworld — Real-computer benchmark environment for open-ended multimodal agents across operating systems.
promptagent — Planning-based prompt optimizer that searches prompt states via reflective tree search.
promptbreeder — Evolutionary prompt optimizer that co-evolves task prompts and mutation prompts.
proxy-state-based-evaluation — Scalable reward and grading approach for tool-calling agents without fully deterministic backends.
reflexion — Verbal-reinforcement-learning system that writes reflective feedback into episodic memory.
rlprompt — Canonical RL-on-prompts method that optimizes discrete prompt text for frozen language models.
robustflow — Workflow-generation system optimized for invariance under paraphrase and noisy instructions.
sage — RL framework for accumulating and reusing skills across sequential rollouts.
sammo — Symbolic compile-time prompt-program optimizer built around structure-aware transformations.
severa — Verified-synthesis framework for self-evolving agents under hard formal constraints.
skillfoundry — Skill-library construction system that mines validated skills from heterogeneous resources.
skillx — Hierarchical skill-knowledge-base system built from trajectories and execution feedback.
sop-agent — Procedure-externalization system that turns SOPs into pseudocode and decision graphs.
sopbench — Executable benchmark for agents following standard operating procedures, constraints, and tool-use rules.
swe-gym — Executable software-engineering training environment for agents and verifiers over real codebases.
tau-bench — Benchmark for multi-turn tool-agent-user interaction under domain rules and dynamic conversation.
tempera — Runtime prompt-editing system that adapts instruction phrases, exemplars, and verbalizers per query.
textgrad — Textual-autograd framework for optimizing compound AI systems through language feedback.
trace2skill — Distillation system that turns trajectory-local lessons into transferable skills.
visualwebarena — Realistic multimodal web benchmark for visually grounded browsing tasks.
webarena — Realistic multi-domain web environment for autonomous long-horizon browser tasks.
webcanvas — Online web-agent benchmark framework that stays live under interface drift.
webshop — Early grounded web-interaction environment with real products and RL-compatible task structure.
windows-agent-arena — Scalable Windows-specific environment for evaluating multimodal OS agents.
worfbench — Graph-aware benchmark for evaluating workflow-generation quality as workflow structure.
worfeval — Evaluation layer paired with WorfBench for structural, partial, and downstream workflow scoring.
workarena — Enterprise knowledge-work benchmark built on BrowserGym for routine professional web tasks.
workarena-plus-plus — More compositional and reasoning-heavy extension of WorkArena for enterprise workflows.

Concepts

agent-harness-anatomy — Structural breakdown of session state, tools, memory, validation, and coordination layers in modern agent harnesses.
automation-and-background-work — How serious harnesses schedule, dispatch, and review agent work outside a live chat turn.
context-engineering — How harnesses manage visibility, resets, compaction, and handoff artifacts across long-running work.
evaluation-and-review-loops — Why serious harnesses separate building from checking and route failure into iteration.
formal-cognition-loop — The architecture that routes problems into formal space, solves there, and then reifies checked witnesses back into implementation space.
formal-methods-for-agent-harnesses — Why harness reliability increasingly looks like intent formalization plus checkable acceptance surfaces.
harness-engineering — The discipline of making agents effective by shaping repos, tools, feedback loops, and invariants.
instruction-layering — Why durable repo, project, user, and policy instructions need explicit scope instead of one giant prompt.
memory-persistence — Patterns for preserving project state, personal recall, and durable design intent across sessions.
neural-native-programming — Model-facing latent IR design for direct read/write interfaces into transformer internals.
non-hierarchical-coordination-patterns — Serious coordination patterns for agents that do not collapse everything into a manager tree.
orchestration-topologies — When subagents, session teams, or swarm structures are the right coordination shape.
partial-order-trace-semantics — Why concurrent and branching harness work wants pomsets or other partial-order models instead of a single serial transcript.
probabilistic-epistemic-updates — How richer belief/update layers can refine simpler harness quotients without discarding them.
safety-and-permissions — How harnesses bound tool execution, approvals, trust, and blast radius.
self-evolving-workflows — When workflows, skills, or instruction kernels become versioned learning artifacts rather than static setup.
sybil-resistance-and-local-trust — Why multiplayer harness networks should prefer local trust evidence and sybil-resistant identity over scalar global reputation.
theorem-proving-as-cognitive-kernel — Why proof assistants can serve as active reasoning workspaces rather than mere post-hoc verifiers.
fission-fusion-orchestration — Dynamic coalition orchestration with stable identities, split/merge teams, and information-scoped leadership.
work-management-primitives — The task objects and state machines that let agents resume, coordinate, and verify work coherently.

Comparisons

harness-architecture-comparison — Side-by-side comparison of session models, memory substrates, work graphs, and execution surfaces.
harness-decision-matrix — Weighted scoring matrix for choosing what to borrow from each major harness family.
harness-quality-comparison — Qualitative comparison of rigor, persistence, evaluation discipline, and orchestration style across major harnesses.

Queries

nightly-src-projects-desk-2026-04-21 — First nightly local-projects desk summarizing the safe publishable work currently moving across the src tree.
another-harness-and-atropos — Fit analysis for whether a thinner Codex-native harness should adopt Atropos now, later, or not at all, including why current run history stays derived rather than canonical.
another-harness-atropos-environment-schema — Concrete repo-artifact-first episode and reward schema for a later Atropos sidecar in another-harness.
another-harness-model-docs-drift-checker — Why the repo’s first Lean-backed docs/model drift fence targets the attempt-vs-stream grounding distinction instead of pretending to compare everything.
another-harness-resume-recover-environment — First executable recovery family in another-harness, separating honest re-orientation from resumed work that can actually return to reviewed.
another-harness-evaluator-discipline-environment — First live evaluator-side environment prototype under another-harness’s Atropos sidecar design.
another-harness-work-item-closure-environment — First live builder-side environment prototype under another-harness’s Atropos sidecar design.
attention-and-attribution-views-for-llm-harnesses — Honest UI guidance for attention, attribution, and what can actually be shown about model focus.
arxiv-round-two-formal-semantics-for-agent-harnesses — Targeted arXiv scouting on formal methods, epistemic updates, and partial-order semantics for harness theory.
arxiv-self-evolving-workflows-for-codex-control-plane — ArXiv map of workflow search, evaluator loops, skill evolution, and memory compilation for Codex-native control planes.
arxiv-under-explored-coordination-strategies — Verified arXiv pass on coordination strategies that still look thinner than manager-worker and debate loops.
commitment-governance-semantics-for-multiplayer-harness — Concrete commitment, case, and governance primitives for a sovereignty-preserving multiplayer harness.
codex-app-server-provider-vs-runtime-bridge — Why Codex app-server currently belongs in Hermes as a plugin-level runtime bridge rather than as a primary provider transport.
context-assembly-visualization-for-harnesses — Design memo for showing assembled context, source trust, and influence without collapsing them into one score.
formal-core-agent-architecture — Synthesis of how to put a formalization gate and witness-first reasoning at the core of agent cognition.
gas-city-but-its-just-codex — Up-to-date deep dive on the repo’s current ledger, formula, gRPC, operator, UI, and formal structure around Codex-native execution.
gas-city-control-plane-and-authority-split — Focused rendering of the repo’s intended three-service authority split and the current sidecar/runtime duplication seam.
gas-city-live-ops-benchmarks-and-sandboxes — Operational tour of checkpoints, benchmarks, sandboxes, and the repo’s current live center of gravity.
gas-city-operator-policy-and-formal-bridge — Focused rendering of the typed operator-policy runtime and the newer recipe/workflow/policy bridge work.
grounding-moldable-operations-studio-ideas-in-real-research — Concrete HCI, provenance, security, and distributed-systems research that makes the studio ideas implementable rather than merely tasteful.
high-impact-artifacts-for-multiplayer-harness-design — Prioritized inspection list of the pages and sources that most strongly constrain multiplayer harness design.
how-to-build-a-multiplayer-harness-network — Implementation ordering and adapter strategy for a federated multiplayer harness that other harnesses can jack into.
node-card-and-minimum-adapter-contract — Concrete node-card document and minimum honest adapter interface by which foreign harnesses can join the collaboration fabric.
legacy-distributed-systems-ideas-for-moldable-operations-studio — Old distributed-systems control-plane ideas that still look oddly underused in developer-facing harnesses.
moldable-operations-studio-architecture-spec — A concrete state-model and projection spec for turning the harness into a moldable operations studio.
moldable-operations-studio-schema-pass — Concrete event, object, checkpoint, view, and promotion schemas for the moldable operations studio.
moldable-operations-studio-wireframes — Concrete screen models and interaction loops for the wallboard, graph, evidence, queue, canvas, and pocket surfaces.
multiplayer-agent-harnesses-and-p2p-networks — Research synthesis on local-first, peer-to-peer, and multiplayer control-plane ideas for human-plus-agent collaboration.
neural-native-programming-research-program — Kill-happy staged experiment plan with promotion gates, benchmark order, and no-go criteria for neural-native programming.
neural-native-programming-via-direct-interfaces-to-transformer-internal-layers — Research synthesis on typed latent IRs, activation-level interfaces, and execution-first evaluation for neural-native programming.
sovereign-identity-and-observed-goals-schema-pass — Concrete schema patch for sovereign identity, portable attestations, commitments, goal hypotheses, and governance objects.
sovereignty-and-observed-goals-ledgers-for-multiplayer-harnesses — Multi-round deep dive on replacing scalar reputation with sovereign identity, commitments, provenance, and inferred-goal hypotheses.
new-harness-design-notes — Synthesis notes on combining Codex cleanliness, Hermes learning loops, Anthropic evaluators, Gas City orchestration, and now a formalization plane.
non-hierarchical-agent-orchestration — Direct answer to the question of what to use instead of a default manager hierarchy.
non-linear-interface-options-for-next-harness — ArXiv-backed surface ideas for moving beyond the flat transcript into graphs, checkpoints, runtime overlays, and generated control panels.
open-questions-in-prompt-optimization-and-language-programs — Umbrella map of the main open questions in prompt optimization, language programs, and DSPy-style systems, with fan-out into three research clusters.
prompt-optimization-and-dspy-follow-ups — Map of RL prompt optimization, prompt-program systems, and the early research line following DSPy.
research-on-open-questions-in-prompt-optimization-and-language-programs — Question-by-question research map covering the ten cross-cutting problems in prompt optimization, evaluators, transfer, memory, constraints, and release engineering.
prompt-optimization-eval-transfer-robustness-open-questions — Open questions memo on prompt-program evaluation validity, transfer across models, robustness under shift, and missing benchmark designs.
prompt-optimizer-regimes-for-harnesses — Regime map for when to use runtime editing, RL over programs, black-box search, evolution, or planning in prompt optimization.
prompt-optimization-timeline-and-harness-lessons — Chronological map of prompt optimization and concrete design lessons for agent harnesses.
prompt-program-architecture-plans-for-another-harness-and-gas-city — Repo-grounded architecture plans translating the ten prompt-program questions into concrete stances for another-harness and gas-city-but-its-just-codex.
prompt-program-deployment-open-questions — Open research questions on deploying, adapting, constraining, and operating optimized prompt artifacts inside long-lived harnesses.
prompt-program-representation-and-optimizer-open-questions — Open questions on prompt-program representations, module granularity, assertions, credit assignment, and optimizer regime selection.
rl-gyms-and-executable-environments-for-ai-harnesses — Map of browser, desktop, tool-use, coding, and research-agent gym substrates for harness evaluation and training.
sci-fi-audit-for-moldable-operations-studio — Science-fiction control-room and distributed-cognition ideas translated into harness primitives and cautions.
web-patterns-for-non-linear-harness-interfaces — Broader web-system patterns for moldable views, provenance-rich traces, and durable mission-control surfaces around harness work.
