Prompt Optimization Research and DSPy Follow-Ups

Goal

Survey research on RL training prompts and prompting systems while explicitly excluding papers whose main contribution is a new base model architecture, then identify the research line that directly follows or operationalizes DSPy's thesis.

Executive read

The literature is best understood as a sequence of widening optimization surfaces:

  1. optimize a prompt embedding or prefix
  2. optimize a discrete instruction string
  3. optimize example selection and reasoning scaffolds
  4. optimize a prompt program or LM pipeline

That last step is where DSPy becomes important. It changes the object of optimization from a single prompt to a structured language-model program, which places it much closer to instruction-layering, context-engineering, and self-evolving-workflows than to ordinary prompt hacking.

The main literature clusters

1. Prompt parameters, but not yet programs

These papers still treat the prompt itself as the primary optimization object.

  • Prefix-Tuning (2021) — continuous prompt optimization for frozen models. Important as the clean early statement that the prompt surface itself can be trained.
  • Black-Box Tuning for Language-Model-as-a-Service (2022) — prompt optimization when the model is only accessible through an API.
  • RLPrompt (2022) — the clearest direct paper on reinforcement learning over discrete prompts.
  • TEMPERA (2022) — test-time prompt editing via reinforcement learning.

This cluster is useful, but it remains relatively local: the optimizer is still shaping one prompt surface rather than a larger reasoning workflow. A toy sketch of the RLPrompt-style setup follows.
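To make the cluster concrete, here is a minimal sketch of the RL-over-discrete-prompts idea: a tabular softmax policy over prompt tokens trained with REINFORCE. Everything here is an assumption for illustration; in particular, `task_reward` stands in for the real signal, which in RLPrompt is the downstream performance of a frozen LM run with the sampled prompt.

```python
import math
import random

# Toy vocabulary and reward, both assumptions for illustration. In RLPrompt
# the reward is the downstream accuracy of a frozen LM run with the prompt.
VOCAB = ["classify", "the", "sentiment", "carefully", "step", "answer"]
PROMPT_LEN = 4
LR, STEPS = 0.1, 200

def task_reward(prompt_tokens):
    return random.random()  # stand-in for a real downstream metric

# Tabular softmax policy: one preference score per (position, token).
theta = [[0.0] * len(VOCAB) for _ in range(PROMPT_LEN)]

def sample_prompt():
    tokens, dists = [], []
    for pos in range(PROMPT_LEN):
        exps = [math.exp(s) for s in theta[pos]]
        probs = [e / sum(exps) for e in exps]
        idx = random.choices(range(len(VOCAB)), weights=probs)[0]
        tokens.append(idx)
        dists.append(probs)
    return tokens, dists

baseline = 0.0
for _ in range(STEPS):
    tokens, dists = sample_prompt()
    reward = task_reward([VOCAB[i] for i in tokens])
    baseline = 0.9 * baseline + 0.1 * reward  # running baseline reduces variance
    for pos, (chosen, probs) in enumerate(zip(tokens, dists)):
        for tok, p in enumerate(probs):
            # REINFORCE: gradient of log-softmax is 1[tok == chosen] - p
            grad = (1.0 if tok == chosen else 0.0) - p
            theta[pos][tok] += LR * (reward - baseline) * grad

best = " ".join(VOCAB[max(range(len(VOCAB)), key=lambda t: theta[pos][t])]
                for pos in range(PROMPT_LEN))
print("learned prompt:", best)
```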

2. Search, selection, and critique over natural-language prompts

This cluster broadens the optimization object from a token string to an iterative editing/search process.

  • Automatic Prompt Engineer / Large Language Models Are Human-Level Prompt Engineers (2022) — uses the model itself to propose and score instruction candidates.
  • Active Prompting with Chain-of-Thought (2023) — optimizes which exemplars deserve reasoning traces and enter the prompt.
  • Automatic Prompt Optimization with “Gradient Descent” and Beam Search (2023) — ProTeGi/APO turns critique into a prompt-editing loop; a sketch of that loop closes out this cluster.
  • Large Language Models as Optimizers (2023) — casts prompt search as black-box optimization performed in-context by the LM.
  • PromptAgent (2023) — treats prompt improvement as strategic planning over edits and failures.
  • Promptbreeder (2023) — evolutionary search over prompts and even over mutation prompts.
  • GEPA (2025) — argues reflective prompt evolution can outperform RL in at least some settings.

The running joke in this area is that explicit RL exists, yet a surprising amount of the strongest prompt optimization work ended up preferring evolution, planning, beam search, or textual gradients.
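For concreteness, a minimal sketch of the ProTeGi/APO pattern. The `llm` and `score` callables are hypothetical placeholders (any completion function and any per-example metric); the structure is the point: critique acts as a "textual gradient", rewriting is the update step, and beam search keeps the best candidates.

```python
# Hypothetical interfaces: llm(prompt_text) -> completion string, and
# score(llm, instruction, example) -> 0/1 on a labeled dev example.
def textual_gradient_step(instruction, failures, llm):
    critique = llm(
        "Here is a task prompt and examples it got wrong.\n"
        f"Prompt: {instruction}\nFailures: {failures}\n"
        "Explain briefly why the prompt fails on these examples."
    )  # the "textual gradient": natural-language criticism
    return llm(
        f"Prompt: {instruction}\nCritique: {critique}\n"
        "Rewrite the prompt to fix the critique. Return only the new prompt."
    )  # the "update step": edit the prompt along the critique

def optimize(instruction, devset, llm, score, beam_width=4, rounds=5):
    beam = [instruction]
    for _ in range(rounds):
        candidates = list(beam)
        for inst in beam:
            failures = [ex for ex in devset if not score(llm, inst, ex)]
            if failures:
                candidates.append(textual_gradient_step(inst, failures[:3], llm))
        # Beam search: keep the candidates that score best on the dev set.
        candidates.sort(key=lambda i: sum(score(llm, i, ex) for ex in devset),
                        reverse=True)
        beam = candidates[:beam_width]
    return beam[0]
```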

3. Prompting systems rather than prompts

This is the more consequential cluster for harness design.

  • Reflexion (2023) — verbal reinforcement learning through self-critique and memory, without weight updates.
  • DSPy (2023) — treats LM pipelines as compilable programs whose instructions and demonstrations can be optimized.
  • TextGrad (2024) — optimization substrate for compound AI systems using textual gradients.
  • Symbolic Prompt Program Search / SAMMO (2024) — structure-aware compile-time optimization for prompt programs.
  • Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together (2024) — argues that modular LM systems may want joint prompt and model adaptation.

This is the point where the prompt becomes a program artifact. For a serious agent harness, that makes these papers more strategically relevant than a long tail of one-shot prompt tricks. A minimal DSPy sketch follows.
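To show what "prompt as program artifact" means in practice, here is a minimal DSPy sketch: a typed signature, a module, and a teleprompter that "compiles" demonstrations into the program. Names follow DSPy's public API circa 2024, which has since evolved; the metric, model name, and training example are assumptions.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Model choice is an assumption; newer DSPy versions use dspy.LM(...) instead.
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))

class GenerateAnswer(dspy.Signature):
    """Answer the question in one short sentence."""
    question = dspy.InputField()
    answer = dspy.OutputField()

class QA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        return self.generate(question=question)

def exact_match(example, pred, trace=None):  # hypothetical metric
    return example.answer.lower() == pred.answer.lower()

trainset = [dspy.Example(question="What is the capital of France?",
                         answer="Paris").with_inputs("question")]

# "Compilation": the teleprompter searches for demonstrations that maximize
# the metric and bakes them into the program's prompts as an artifact.
compiled_qa = BootstrapFewShot(metric=exact_match).compile(QA(), trainset=trainset)
print(compiled_qa(question="What is the capital of Italy?").answer)
```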

What follows DSPy specifically

The direct DSPy paper trail is real, but currently thinner than the broader optimizer/programming ecosystem around it.

A. Direct extensions inside the DSPy line

  • DSPy Assertions (2023) — adds computational constraints and self-refining compilation strategies. This is the cleanest direct follow-up because it extends DSPy’s programming model rather than merely using it; a short sketch of the assertion idea follows this list.
  • AutoDSPy (2025) — frames full DSPy pipeline construction as an RL problem, selecting modules and prompt templates automatically. Conceptually, this is the sharpest direct answer to your “RL training prompts and prompting systems” angle.
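A minimal sketch of the assertion idea, hedged: the exact activation API (e.g. wrapping the module with a backtracking handler) has shifted across DSPy releases, and the length constraint here is an arbitrary example.

```python
import dspy

class ShortAnswer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.answer = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        pred = self.answer(question=question)
        # Soft constraint: on failure, DSPy backtracks and re-runs the module
        # with the failure message injected into the prompt context.
        dspy.Suggest(len(pred.answer.split()) <= 20,
                     "Keep the answer under 20 words.")
        # dspy.Assert behaves the same way but raises after max retries.
        return pred
```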

B. Empirical evaluations of DSPy optimizers

  • A Comparative Study of DSPy Teleprompter Algorithms… (2024) — compares COPRO, MIPRO, BootstrapFewShot, BootstrapFewShot with Optuna, and KNN few-shot on how well their optimized evaluators align with human judgments.
  • Analyzing LLM Instruction Optimization for Tabular Fact Verification (2026) — studies COPRO, MIPROv2, and SIMBA across text-only and tool-using prompting regimes.

These are valuable because they move the discussion from “DSPy sounds tidy” to “which DSPy optimizers behave well under an actual task family?” The sketch below shows what such a comparison looks like mechanically.
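Concretely, these papers amount to holding the program and metric fixed while swapping the teleprompter, roughly as below. This reuses the QA program, exact_match metric, and trainset from the earlier DSPy sketch; devset is an assumed held-out split, and constructor/compile arguments are indicative only, since they vary across DSPy versions.

```python
from dspy.teleprompt import BootstrapFewShot, COPRO, MIPROv2

optimizers = {
    "bootstrap": BootstrapFewShot(metric=exact_match),  # demo selection
    "copro": COPRO(metric=exact_match),                 # instruction search
    "miprov2": MIPROv2(metric=exact_match),             # joint instructions + demos
}

results = {}
for name, opt in optimizers.items():
    compiled = opt.compile(QA(), trainset=trainset)
    results[name] = sum(exact_match(ex, compiled(question=ex.question))
                        for ex in devset) / len(devset)
print(results)
```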

C. Application papers that treat DSPy as the prompt-program substrate

  • Dspy-based Neural-Symbolic Pipeline to Enhance Spatial Reasoning in LLMs (2024) — combines DSPy with ASP-based symbolic reasoning.
  • Is It Time To Treat Prompts As Code? (2025) — multi-use-case study across guardrails, hallucination detection, code generation, routing, and prompt evaluation.
  • Optimizing LLM Prompt Engineering with DSPy Based Declarative Learning (2026) — recent application-facing paper positioning DSPy as a scalable alternative to manual prompt engineering.

These papers matter less as theory and more as evidence that DSPy’s basic thesis travels across use cases.

My synthesis

1. RL is only one branch of the story

If you ask specifically for “RL training prompts”, the direct hits are mainly RLPrompt, TEMPERA, and now AutoDSPy in the DSPy line. But the broader research arc did not converge on RL alone. It spread into:

  • textual-gradient systems
  • evolutionary prompt search
  • planning-based prompt search
  • compiler-style optimization of LM programs

So if the real question is “how are people learning prompts and prompt systems?” the answer is wider than reinforcement learning.

2. DSPy matters because it changes the optimization unit

The most important conceptual move in DSPy is not merely “better prompt tuning”. It is that the prompt surface becomes a typed, decomposed, compiler-facing program. That makes DSPy unusually relevant to harness-engineering because serious harnesses increasingly need editable, optimizable instruction artifacts rather than one giant monolithic prompt.

3. The follow-up literature is still early

The direct paper trail after DSPy is modest compared with the amount of practice happening in the library itself. What has appeared so far is exactly what one would expect in an early systems line:

  • one clean language extension (DSPy Assertions)
  • a few optimizer comparison papers
  • several case studies
  • adjacent optimizer frameworks trying similar ideas from other angles

In other words, the thesis is spreading faster than the citation graph has had time to become grand and baroque.

Best reading queue

If you want the shortest high-signal queue, I would read these in order:

  1. RLPrompt — the clean direct RL-on-prompts reference.
  2. TEMPERA — test-time RL prompt editing.
  3. Reflexion — the verbal-RL branch that matters more for agent systems than raw policy-gradient elegance.
  4. DSPy — the shift from prompt to prompt program.
  5. DSPy Assertions — the first serious extension of that programming model.
  6. SAMMO — compile-time symbolic search over prompt programs.
  7. TextGrad — optimization substrate for compound AI systems.
  8. AutoDSPy — explicit RL automation of DSPy pipeline construction.
  9. Comparative Study of DSPy Teleprompter Algorithms — which optimizer behaves best is a different question from whether the framework is elegant.
  10. GEPA — the clearest recent pushback against the idea that RL should dominate this area.

Bottom line

The literature does support a real line of work on RL training prompts and prompting systems, but the more interesting trajectory is that the field keeps escaping the single-prompt setting. The current frontier is much closer to optimizing structured prompt programs, retrieval-and-reasoning pipelines, and self-revising instruction systems. DSPy is one of the clearest names on that frontier, and the research that follows it is best read as early work on compiler-like optimization for language-model programs.

This note bears most directly on instruction-layering, context-engineering, self-evolving-workflows, and harness-engineering. It also sits adjacent to prompt-optimizer-regimes-for-harnesses, new-harness-design-notes, and arxiv-self-evolving-workflows-for-codex-control-plane.