Wiki Index

Content catalog. Every wiki page is listed under its type with a one-line summary. Read this first to find relevant pages for any query. Last updated: 2026-04-21 | Total pages: 136

News

  • nightly-src-projects-desk-2026-04-21 — First nightly local-projects desk: safe publishable work across the active src tree, from harness control planes to game/UI/runtime experiments.

Entities

  • aflow — Workflow-search system that optimizes code-represented agent graphs with MCTS and execution feedback.
  • agent-workflow-memory — External procedural-memory system that induces reusable workflows from past trajectories.
  • agentboard — Analytical evaluation board for multi-turn agents with progress metrics across many task settings.
  • agentevolver — Broader self-evolving-agent framework built around self-questioning, self-navigation, and self-attribution.
  • agentgym — Multi-environment suite for evolving LLM agents across diverse tasks rather than one narrow world.
  • appworld — Controllable multi-app world with state-based grading for interactive coding and tool-use agents.
  • atommem — Learnable memory-control system that decomposes memory management into atomic operations.
  • atropos — Hermes-facing RL environment and rollout substrate for multi-turn tool-calling agent tasks.
  • autodspy — RL-driven DSPy pipeline constructor that optimizes modules, signatures, and execution strategies.
  • autoflow — Natural-language workflow-generation system that makes agent procedures explicit and iterable.
  • browsergym — Unified gym-like substrate that standardizes action and observation surfaces across web-agent benchmarks.
  • claude-code — Anthropic’s coding agent and harness research program for long-running, evaluator-driven development.
  • codex-app-server — The durable protocol layer that lets Codex span CLI, IDE, web, and app clients.
  • codex-cli — OpenAI’s terminal coding agent with an App Server architecture and strong repo-legibility discipline.
  • compiled-memory — Instruction-compilation system that rewrites agent guidance from validated experience.
  • computer-rl — Distributed RL infrastructure for training desktop and computer-use agents at scale.
  • dspy — LM-program compilation and optimization framework that turns prompt engineering into modular program engineering.
  • dspy-assertions — Contract-bearing extension of DSPy that adds computational constraints and self-repair loops.
  • dyflow — Runtime workflow-adaptation system that revises procedures from intermediate feedback.
  • enterprisebench-corecraft — High-fidelity enterprise RL environment with rubric-based rewards and transfer-focused evaluation.
  • evoskills — Skill-generation system with a co-evolving verifier lane for autonomous improvement.
  • expel — Experiential-learning system that distills reusable lessons from prior tasks.
  • gaia — Broad benchmark for general AI assistants requiring reasoning, browsing, multimodality, and tool use.
  • gas-city — Modular successor to Gas Town, oriented around composable orchestration primitives and Wasteland federation.
  • gas-town — Steve Yegge’s multi-agent coding factory built around the MEOW stack and durable work objects.
  • gepa — Reflective prompt-evolution system that learns from traces and preserves Pareto-diverse candidates.
  • graph-of-skills — Dependency-aware retrieval layer for large executable skill libraries.
  • hermes-agent — Persistent self-improving agent centered on searchable memory, skills, and multi-surface continuity.
  • judgeflow — Block-level workflow-diagnosis system for targeted repair and promotion decisions.
  • mathcode — Terminal mathematical coding agent that translates natural-language problems into Lean proofs with reusable theorem and axiom stores.
  • memento-skills — Self-evolving agent framework that treats skills as writable memory and learns by rewriting them.
  • memskill — System that turns memory procedures into evolvable skills.
  • mermaidflow — Safety-constrained workflow-search system over statically structured Mermaid graphs.
  • metaagent — Self-evolving agent framework centered on tool meta-learning and durable capability growth.
  • metaclaw — Continual-learning agent platform that combines fast skill synthesis with slower policy optimization.
  • mlgym — Gym framework for AI-research agents working on open-ended machine-learning tasks.
  • openclaw — Ecosystem-first persistent agent runtime with broad integrations and a large public skill marketplace.
  • opro — In-context black-box optimizer that proposes new candidates from scored history.
  • osworld — Real-computer benchmark environment for open-ended multimodal agents across operating systems.
  • promptagent — Planning-based prompt optimizer that searches prompt states via reflective tree search.
  • promptbreeder — Evolutionary prompt optimizer that co-evolves task prompts and mutation prompts.
  • proxy-state-based-evaluation — Scalable reward and grading approach for tool-calling agents without fully deterministic backends.
  • reflexion — Verbal-reinforcement-learning system that writes reflective feedback into episodic memory.
  • rlprompt — Canonical RL-on-prompts method that optimizes discrete prompt text for frozen language models.
  • robustflow — Workflow-generation system optimized for invariance under paraphrase and noisy instructions.
  • sage — RL framework for accumulating and reusing skills across sequential rollouts.
  • sammo — Symbolic compile-time prompt-program optimizer built around structure-aware transformations.
  • severa — Verified-synthesis framework for self-evolving agents under hard formal constraints.
  • skillfoundry — Skill-library construction system that mines validated skills from heterogeneous resources.
  • skillx — Hierarchical skill-knowledge-base system built from trajectories and execution feedback.
  • sop-agent — Procedure-externalization system that turns SOPs into pseudocode and decision graphs.
  • sopbench — Executable benchmark for agents following standard operating procedures, constraints, and tool-use rules.
  • swe-gym — Executable software-engineering training environment for agents and verifiers over real codebases.
  • tau-bench — Benchmark for multi-turn tool-agent-user interaction under domain rules and dynamic conversation.
  • tempera — Runtime prompt-editing system that adapts instruction phrases, exemplars, and verbalizers per query.
  • textgrad — Textual-autograd framework for optimizing compound AI systems through language feedback.
  • trace2skill — Distillation system that turns trajectory-local lessons into transferable skills.
  • visualwebarena — Realistic multimodal web benchmark for visually grounded browsing tasks.
  • webarena — Realistic multi-domain web environment for autonomous long-horizon browser tasks.
  • webcanvas — Online web-agent benchmark framework that stays live under interface drift.
  • webshop — Early grounded web-interaction environment with real products and RL-compatible task structure.
  • windows-agent-arena — Scalable Windows-specific environment for evaluating multimodal OS agents.
  • worfbench — Graph-aware benchmark for evaluating workflow-generation quality at the level of workflow structure.
  • worfeval — Evaluation layer paired with WorfBench for structural, partial, and downstream workflow scoring.
  • workarena — Enterprise knowledge-work benchmark built on BrowserGym for routine professional web tasks.
  • workarena-plus-plus — More compositional and reasoning-heavy extension of WorkArena for enterprise workflows.

Concepts

  • agent-harness-anatomy — Structural breakdown of session state, tools, memory, validation, and coordination layers in modern agent harnesses.
  • automation-and-background-work — How serious harnesses schedule, dispatch, and review agent work outside a live chat turn.
  • context-engineering — How harnesses manage visibility, resets, compaction, and handoff artifacts across long-running work.
  • evaluation-and-review-loops — Why serious harnesses separate building from checking and route failure into iteration.
  • fission-fusion-orchestration — Dynamic coalition orchestration with stable identities, split/merge teams, and information-scoped leadership.
  • formal-cognition-loop — The architecture that routes problems into formal space, solves there, and then reifies checked witnesses back into implementation space.
  • formal-methods-for-agent-harnesses — Why harness reliability increasingly looks like intent formalization plus checkable acceptance surfaces.
  • harness-engineering — The discipline of making agents effective by shaping repos, tools, feedback loops, and invariants.
  • instruction-layering — Why durable repo, project, user, and policy instructions need explicit scope instead of one giant prompt.
  • memory-persistence — Patterns for preserving project state, personal recall, and durable design intent across sessions.
  • neural-native-programming — Model-facing latent IR design for direct read/write interfaces into transformer internals.
  • non-hierarchical-coordination-patterns — Serious coordination patterns for agents that do not collapse everything into a manager tree.
  • orchestration-topologies — When subagents, session teams, or swarm structures are the right coordination shape.
  • partial-order-trace-semantics — Why concurrent and branching harness work wants pomsets or other partial-order models instead of a single serial transcript.
  • probabilistic-epistemic-updates — How richer belief/update layers can refine simpler harness quotients without discarding them.
  • safety-and-permissions — How harnesses bound tool execution, approvals, trust, and blast radius.
  • self-evolving-workflows — When workflows, skills, or instruction kernels become versioned learning artifacts rather than static setup.
  • sybil-resistance-and-local-trust — Why multiplayer harness networks should prefer local trust evidence and sybil-resistant identity over scalar global reputation.
  • theorem-proving-as-cognitive-kernel — Why proof assistants can serve as active reasoning workspaces rather than mere post-hoc verifiers.
  • work-management-primitives — The task objects and state machines that let agents resume, coordinate, and verify work coherently.

Comparisons

Queries