Prompt Optimization Open Questions: Evaluation, Transfer, Robustness, and Benchmarking
Goal
Identify the highest-value open questions in the prompt-optimization / prompting-systems literature around evaluation, transfer, robustness, and benchmark design, with an emphasis on DSPy-style language programs and adjacent optimizers such as TextGrad, SAMMO, PromptAgent, Reflexion, and GEPA.
Executive read
The literature has produced many optimizers, but the evaluation stack is still immature. The recurring problems are: proxy-metric mismatch, benchmark overfitting, weak cross-model transfer, fragility under prompt perturbations and distribution shift, and too few shared benchmarks for multi-stage prompt programs.
Open questions
1. Are prompt optimizers improving the real task, or merely the proxy metric or judge?
Most optimizers search against a development metric, an LLM judge, or a task-specific scorer. That is necessary, but it leaves the core evaluation question open: how often does optimization improve the thing we actually care about rather than a convenient proxy?
Anchor papers
- A Comparative Study of DSPy Teleprompter Algorithms for Aligning Large Language Models Evaluation Metrics to Human Evaluation (2024) — one of the few papers that explicitly evaluates optimizer outputs against human labels.
- Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs (2024) — makes downstream-metric optimization central, but also highlights how little direct module-level supervision exists.
- To Write or to Automate Linguistic Prompts, That Is the Question (2026) — shows that optimizer wins are task- and model-dependent, and that “automatic beats manual” is not a stable conclusion.
What would help answer it
- A benchmark suite with blinded human evaluation on held-out data, not only LLM-judge scores.
- Cross-judge evaluation: optimize on judge A, test on judge B and on humans (a minimal harness sketch follows this list).
- Reporting of calibration and failure modes, not only average task accuracy.
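To make the cross-judge protocol concrete, here is a minimal sketch of the evaluation loop: select a prompt using one judge, then report held-out scores under a second judge (and, in a real study, blinded human labels). The optimizer, both judges, and the echo-style "model" below are toy placeholders, not any particular library's API.

```python
# Minimal cross-judge evaluation sketch (illustrative only). The optimizer,
# both judges, and the echo-style "model" are toy placeholders, not any
# particular library's API.
from statistics import mean

def judge_a(prediction: str, example: dict) -> float:
    # Lenient proxy judge: substring match against the gold answer.
    return float(example["gold"].lower() in prediction.lower())

def judge_b(prediction: str, example: dict) -> float:
    # Stricter held-out judge: exact match.
    return float(prediction.strip().lower() == example["gold"].lower())

def optimize_prompt(dev_set, metric):
    # Stand-in for MIPROv2 / GEPA / APE: pick the candidate with the best dev score.
    candidates = ["Answer concisely: ", "Think step by step, then answer: "]
    def run(prompt, ex):
        # Fake "model": echoes the gold answer, verbosely for odd-length prompts.
        return ex["gold"] if len(prompt) % 2 == 0 else ex["gold"] + " (maybe)"
    scored = [(mean(metric(run(p, ex), ex) for ex in dev_set), p) for p in candidates]
    return max(scored)[1], run

dev = [{"q": "2+2?", "gold": "4"}, {"q": "Capital of France?", "gold": "Paris"}]
test = [{"q": "3+3?", "gold": "6"}, {"q": "Capital of Peru?", "gold": "Lima"}]

prompt, run = optimize_prompt(dev, judge_a)  # optimize against judge A only
report = {
    "judge_a_test": mean(judge_a(run(prompt, ex), ex) for ex in test),
    "judge_b_test": mean(judge_b(run(prompt, ex), ex) for ex in test),
    # Human scores would come from blinded annotation of the same frozen outputs.
}
print(prompt, report)
```

The point is the reporting shape: because the prompt is selected by judge A alone, any gap between the judge-A, judge-B, and human scores on held-out data is evidence of proxy mismatch rather than real task improvement.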
2. How much of optimizer performance is real algorithmic quality versus search-budget advantage?
Prompt optimizers vary wildly in optimization budget: number of rollouts, candidate prompts, token usage, mini-batch evaluations, and wall-clock time. Right now, many comparisons are not cost-normalized enough to tell whether a method is truly better or simply allowed to search more.
Anchor papers
- GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning (2025) — explicitly argues it beats GRPO and MIPROv2 while using far fewer rollouts.
- Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs (2024) — uses surrogate modeling and stochastic minibatching precisely because prompt-program evaluation is expensive.
- Adaptive Prompt Structure Factorization (2026) — claims lower optimization cost via factor-level credit assignment.
- Analyzing LLM Instruction Optimization for Tabular Fact Verification (2026) — shows optimizer rankings depend on prompting regime, model family, and tool use.
What would help answer it
- A cost-controlled benchmark with fixed token budget, fixed number of optimizer steps, fixed labeled examples, and fixed base model (a budget-capped harness is sketched after this list).
- Pareto-front reporting over quality, cost, latency, and prompt length growth.
- Multiple random seeds and optimizer restarts, so variance is visible instead of hidden.
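A sketch of what cost normalization could look like in practice: every optimizer runs behind the same budget-capped model wrapper, and each configuration is repeated over several seeds so variance is reported rather than hidden. The BudgetedLM wrapper, the random-search stand-in optimizer, and the stub model call are assumptions for illustration, not a real harness; a real study would put MIPROv2, GEPA, TextGrad, and so on behind the same interface.

```python
# Cost-normalized comparison sketch (illustrative only). BudgetedLM, the
# random-search stand-in optimizer, and the stub model call are assumptions;
# a real study would wrap MIPROv2, GEPA, TextGrad, etc. behind the same
# budgeted interface.
import random
from statistics import mean, stdev

class BudgetedLM:
    """Counts calls and tokens; raises once the shared budget is exhausted."""
    def __init__(self, max_calls=200, max_tokens=20_000):
        self.calls, self.tokens = 0, 0
        self.max_calls, self.max_tokens = max_calls, max_tokens

    def __call__(self, prompt: str) -> str:
        if self.calls >= self.max_calls or self.tokens >= self.max_tokens:
            raise RuntimeError("budget exhausted")
        self.calls += 1
        self.tokens += len(prompt.split())  # crude token proxy
        return "stub answer"                # placeholder for a real model call

def random_search_optimizer(lm, dev_set, rng):
    # Stand-in optimizer: proposes random prompt variants until the budget runs out.
    best, best_score = "baseline prompt", 0.0
    try:
        while True:
            candidate = f"variant {rng.random():.3f} -- answer the question:"
            score = mean(float("answer" in lm(candidate + ex["q"])) for ex in dev_set)
            if score > best_score:
                best, best_score = candidate, score
    except RuntimeError:
        pass
    return best, best_score

dev = [{"q": " q1?"}, {"q": " q2?"}]
results = []
for seed in range(5):  # multiple seeds so variance is visible, not hidden
    lm = BudgetedLM(max_calls=50, max_tokens=5_000)
    _, score = random_search_optimizer(lm, dev, random.Random(seed))
    results.append((score, lm.calls, lm.tokens))

scores = [r[0] for r in results]
print(f"score {mean(scores):.3f} +/- {stdev(scores):.3f}, "
      f"calls {[r[1] for r in results]}, tokens {[r[2] for r in results]}")
```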
3. How well do optimized prompts and prompt programs transfer across models, providers, and model versions?
Cross-model transfer is still unresolved. A prompt optimized for one model often degrades on another, and even within one family, a model upgrade can invalidate earlier prompt-tuning work. For production systems, this is one of the biggest practical obstacles.
Anchor papers
- PromptBridge: Cross-Model Prompt Transfer for Large Language Models (2025) — directly frames the problem as model drifting and proposes a calibration-light transfer method.
- To Write or to Automate Linguistic Prompts, That Is the Question (2026) — different prompt strategies win on different model configurations.
- Is It Time To Treat Prompts As Code? (2025) — optimized prompts improved some tasks, but did not let a cheaper model inherit the stronger model’s gains.
- Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together (2024) — suggests prompt-only adaptation may be insufficient for portability.
What would help answer it
- A transfer matrix benchmark: optimize on one model, freeze the prompt/program, and evaluate across other families and later model versions (a toy transfer-matrix loop is sketched after this list).
- Separate zero-shot transfer, light calibration, and full re-optimization conditions.
- Reporting of which prompt sections or modules transfer, rather than only whole-program scores.
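A toy version of the transfer matrix, assuming nothing beyond placeholder model callables: optimize a prompt against each source model, freeze it, and score it on every target model. A real study would swap in actual provider clients, a real optimizer, and separate zero-shot, light-calibration, and full re-optimization conditions.

```python
# Transfer-matrix sketch (illustrative only): optimize a prompt against one
# "model", freeze it, and score it on every other model. The model callables,
# the candidate prompts, and the scorer are placeholders for real providers,
# a real optimizer, and a real metric.
from statistics import mean

def make_fake_model(name: str):
    # Stand-in for a provider client; answers correctly only if its name is hinted at.
    def model(prompt: str, question: str) -> str:
        return "correct" if name in prompt else "wrong"
    return model

models = {name: make_fake_model(name) for name in ["model_a", "model_b", "model_c"]}
dev = [{"q": "q1", "gold": "correct"}, {"q": "q2", "gold": "correct"}]

def optimize_for(model):
    # Stand-in optimizer: picks the best of a few candidate prompts for this model.
    candidates = [f"You are {name}. Answer: " for name in models] + ["Answer: "]
    return max(candidates,
               key=lambda p: mean(float(model(p, ex["q"]) == ex["gold"]) for ex in dev))

def evaluate(prompt, model):
    return mean(float(model(prompt, ex["q"]) == ex["gold"]) for ex in dev)

# Rows: model the prompt was optimized on. Columns: model it is evaluated on.
matrix = {}
for src, src_model in models.items():
    frozen_prompt = optimize_for(src_model)  # zero-shot transfer condition
    matrix[src] = {dst: evaluate(frozen_prompt, dst_model)
                   for dst, dst_model in models.items()}

for src, row in matrix.items():
    print(src, row)
```

In this toy setup the matrix comes out perfectly diagonal (each prompt only works on the model it was tuned for), which is exactly the brittle-transfer pattern the benchmark is meant to quantify.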
4. Do optimized prompts generalize under distribution shift, prompt paraphrase, and formatting noise?
Much of prompt optimization is still evaluated on relatively clean i.i.d. task splits. But real deployments face subpopulation shifts, paraphrases, punctuation changes, reordered instructions, and user formatting noise. The field still does not know which optimizers are genuinely stable under those shifts.
Anchor papers
- Robust Prompt Optimization for Large Language Models Against Distribution Shifts (2023) — directly shows standard prompt optimization is vulnerable to subpopulation shift.
- PromptRobust (2023) — large benchmark on adversarial prompts at character, word, sentence, and semantic levels.
- Benchmarking Prompt Sensitivity in Large Language Models (2025) — introduces PromptSET and shows prompt sensitivity is still hard to predict.
- When Punctuation Matters (2025) — large unified comparison of robustness methods across 8 models and 52 tasks.
What would help answer it
- Benchmarks that combine source/target subpopulation splits with prompt-level perturbation suites.
- Robustness reporting as degradation curves, not only one before/after score (see the degradation-curve sketch after this list).
- Held-out perturbation families so methods cannot overfit a fixed attack set.
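A minimal sketch of degradation-curve reporting: apply each perturbation family at increasing intensity and record score versus intensity, instead of a single before/after number. The two perturbation families (character typos and punctuation edits) and the stand-in scorer are invented for illustration; a real suite would plug in PromptRobust-style perturbations and actual task accuracy.

```python
# Degradation-curve sketch (illustrative only). The perturbation families and
# the scorer are toy placeholders.
import random
from statistics import mean

def perturb_typos(text: str, intensity: float, rng) -> str:
    # Replace a fraction of characters proportional to intensity.
    chars = list(text)
    for i in range(len(chars)):
        if rng.random() < intensity:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz ")
    return "".join(chars)

def perturb_punctuation(text: str, intensity: float, rng) -> str:
    # Drop commas at moderate intensity and replace periods at high intensity.
    out = text.replace(",", "") if intensity > 0.3 else text
    return out.replace(".", "!") if intensity > 0.6 else out

def score(perturbed_prompt: str) -> float:
    # Stand-in for running the optimized prompt program and computing task
    # accuracy; here the "score" simply rewards keeping the instruction intact.
    reference = "answer the question, then cite the passage."
    overlap = sum(a == b for a, b in zip(perturbed_prompt, reference))
    return overlap / len(reference)

families = {"typos": perturb_typos, "punctuation": perturb_punctuation}
prompt = "answer the question, then cite the passage."
intensities = [0.0, 0.1, 0.2, 0.4, 0.8]
rng = random.Random(0)

curves = {
    name: [mean(score(fn(prompt, x, rng)) for _ in range(20)) for x in intensities]
    for name, fn in families.items()
}
for name, curve in curves.items():
    print(name, [round(v, 3) for v in curve])
```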
5. Are structured prompt-program optimizers actually more robust than monolithic prompt editing?
DSPy, SAMMO, TextGrad, MPO, and aPSF all imply that prompts should be treated as structured programs with modules, sections, or semantic factors. That representation feels right, but direct evidence that structure reliably improves robustness, transfer, and interpretability is still thin.
Anchor papers
- DSPy (2023) — establishes the LM-program view.
- SAMMO (2024) — explicit structure-aware optimization.
- Modular Prompt Optimization (2026) — section-local textual gradients rather than whole-prompt rewriting.
- Adaptive Prompt Structure Factorization (2026) — interventional factor-level scoring and updates.
- TextGrad (2024) — optimization over compound AI systems, not only one prompt string.
What would help answer it
- Apples-to-apples comparisons between monolithic and modular representations on the same tasks and budgets.
- Instrumented benchmarks that record edit locality, prompt growth, failure localization, and recovery under perturbation (a minimal log format is sketched after this list).
- End-to-end tasks where module boundaries are meaningful: retrieval, planning, tool choice, critique, and answer formatting.
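One way to instrument monolithic and modular optimizers uniformly is a shared edit log. The record fields below (module name, character counts, before/after dev scores) are illustrative assumptions, but something in this shape would make edit locality and prompt growth directly comparable across representations.

```python
# Instrumentation sketch (illustrative only): a shared log format so monolithic
# and modular optimizers can be compared on edit locality and prompt growth,
# not just final accuracy. Field names are assumptions.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class EditRecord:
    step: int
    representation: str        # "monolithic" or "modular"
    module: str                # e.g. "retrieval", "planner", or "whole_prompt"
    chars_before: int
    chars_after: int
    dev_score_before: float
    dev_score_after: float

@dataclass
class RunLog:
    optimizer: str
    edits: list = field(default_factory=list)

    def record(self, **kwargs):
        self.edits.append(EditRecord(step=len(self.edits), **kwargs))

    def prompt_growth(self) -> int:
        return sum(e.chars_after - e.chars_before for e in self.edits)

    def edit_locality(self) -> float:
        # Fraction of edits confined to a single named module.
        local = sum(e.module != "whole_prompt" for e in self.edits)
        return local / max(len(self.edits), 1)

log = RunLog(optimizer="toy_modular_optimizer")
log.record(representation="modular", module="retrieval",
           chars_before=220, chars_after=260,
           dev_score_before=0.61, dev_score_after=0.66)
log.record(representation="modular", module="answer_formatting",
           chars_before=180, chars_after=181,
           dev_score_before=0.66, dev_score_after=0.70)

print(json.dumps([asdict(e) for e in log.edits], indent=2))
print("growth:", log.prompt_growth(), "locality:", log.edit_locality())
```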
6. How robust are prompt programs in agentic settings with retrieved context, tools, and hidden instruction conflicts?
Prompt robustness is not only about punctuation. In tool-using or retrieval-heavy systems, the bigger issue is often instruction hierarchy: which instruction wins when user text, retrieved text, tool output, and developer prompt disagree? Optimized prompt programs may even become more attackable if they overfit to surface cues.
Anchor papers
- Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection (2024) — benchmark for instruction-priority failures under prompt injection.
- InjecAgent (2024) — extends the problem to tool-integrated agents.
- DSPy Assertions (2023) — points toward explicit constraints as part of the solution surface.
- Is It Time To Treat Prompts As Code? (2025) — includes guardrail and router settings where instruction robustness matters directly.
What would help answer it
- Agent benchmarks that mix normal tasks with indirect prompt injections in retrieved documents and tool outputs.
- Joint reporting of task success, attack success, false refusal, and cost (see the aggregation sketch after this list).
- Comparisons between plain prompting, assertion/constrained prompting, and optimized prompt-program variants.
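A sketch of the joint-reporting idea: every benchmark episode emits one record, and the same aggregation is applied to plain prompting, assertion-constrained prompting, and optimized prompt-program variants. The episode fields and the toy data are assumptions for illustration, not any existing benchmark's schema.

```python
# Joint-reporting sketch (illustrative only): one record per episode so task
# success, attack success, false refusal, and cost can be aggregated together.
# The episode runner is a placeholder; field names are assumptions.
from statistics import mean

episodes = [
    # Each dict would be produced by running the agent on one benchmark episode.
    {"injected": False, "task_success": True,  "followed_injection": False, "refused": False, "tokens": 1200},
    {"injected": True,  "task_success": True,  "followed_injection": False, "refused": False, "tokens": 1900},
    {"injected": True,  "task_success": False, "followed_injection": True,  "refused": False, "tokens": 2100},
    {"injected": False, "task_success": False, "followed_injection": False, "refused": True,  "tokens": 800},
]

def aggregate(episodes):
    injected = [e for e in episodes if e["injected"]]
    benign = [e for e in episodes if not e["injected"]]
    return {
        "task_success_rate": mean(e["task_success"] for e in episodes),
        "attack_success_rate": mean(e["followed_injection"] for e in injected) if injected else 0.0,
        "false_refusal_rate": mean(e["refused"] for e in benign) if benign else 0.0,
        "mean_tokens": mean(e["tokens"] for e in episodes),
    }

# The same aggregation would be run for plain prompting, assertion-constrained
# prompting, and optimized prompt-program variants, then compared side by side.
print(aggregate(episodes))
```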
7. What is a fair benchmark for comparing automated prompt optimization to human prompt engineering?
Current comparisons are often asymmetric. Human experts may use domain knowledge and iterative craft with little or no labeled data, while automated methods may search directly over gold splits or validation sets. Without controlling for labels, human time, and optimizer budget, claims about “expert-level prompt optimization” remain hard to interpret.
Anchor papers
- To Write or to Automate Linguistic Prompts, That Is the Question (2026) — states this fairness problem explicitly.
- Large Language Models Are Human-Level Prompt Engineers / APE (2022) — early automated-prompt-search anchor.
- A Comparative Study of DSPy Teleprompter Algorithms… (2024) — useful for controlled optimizer-to-optimizer comparison, but still on one evaluation regime.
- Is It Time To Treat Prompts As Code? (2025) — shows mixed gains across use cases rather than a universal automation win.
What would help answer it
- A benchmark with explicit resource regimes: zero-label expert prompting, small-label automatic optimization, larger-label automatic optimization, and hybrid human+optimizer.
- Reporting of human authoring time, domain expertise, label count, optimizer tokens, and model-call budget (a reporting schema is sketched after this list).
- Cross-model retesting to see whether the winning prompt strategy is durable or merely benchmark-local.
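A sketch of what explicit resource accounting might look like: one record per regime, with labels, human time, optimizer tokens, model calls, held-out score, and a cross-model retest reported side by side. The regime names and every number below are invented for illustration only.

```python
# Resource-regime sketch (illustrative only): make resource accounting explicit
# so "expert vs. optimizer" claims are interpretable. All names and numbers
# here are made up for illustration.
from dataclasses import dataclass

@dataclass
class RegimeResult:
    regime: str                # e.g. "zero-label expert", "small-label optimizer"
    labeled_examples: int
    human_minutes: float       # authoring / iteration time spent by the expert
    optimizer_tokens: int      # tokens consumed by automatic search
    model_calls: int
    test_score: float          # score on a held-out split, shared across regimes
    cross_model_score: float   # same prompt re-tested on a second model family

results = [
    RegimeResult("zero-label expert prompting",    0, 90.0,       0,    0, 0.71, 0.68),
    RegimeResult("small-label auto optimization", 50,  5.0, 400_000, 1200, 0.74, 0.61),
    RegimeResult("hybrid expert + optimizer",     50, 60.0, 150_000,  500, 0.77, 0.72),
]

print(f"{'regime':<32} {'labels':>6} {'min':>6} {'tokens':>8} {'calls':>6} {'score':>6} {'xfer':>6}")
for r in results:
    print(f"{r.regime:<32} {r.labeled_examples:>6} {r.human_minutes:>6.0f} "
          f"{r.optimizer_tokens:>8} {r.model_calls:>6} {r.test_score:>6.2f} "
          f"{r.cross_model_score:>6.2f}")
```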
Bottom line
The field does not lack optimizer ideas; it lacks hard, shared answers to whether those optimizers generalize, transfer, and remain robust outside their home benchmark. The most useful next wave of work would be benchmark-heavy: human-anchored evaluation, cost-normalized optimizer comparisons, cross-model transfer matrices, and adversarial/context-shift suites for full prompt programs rather than single prompt strings.
Related pages
This note extends prompt-optimization-and-dspy-follow-ups, prompt-optimization-timeline-and-harness-lessons, and dspy. It also sits close to evaluation-and-review-loops, instruction-layering, context-engineering, and self-evolving-workflows.