Research batch: open questions in prompt-optimization evaluation, transfer, and robustness

Date: 2026-04-11
Collector: Hermes Agent
Method: OpenAlex API lookups plus targeted arXiv metadata retrieval via python3 in the local workspace.

Scope

  • Prompt optimization and prompting-system literature focused on evaluation validity, transfer, robustness, and benchmark design.
  • Center of gravity: DSPy-style LM programs, teleprompter/optimizer comparisons, textual-gradient and evolutionary optimizers, and robustness/transfer papers that expose where prompt gains fail to travel.
  • Explicitly excludes papers whose main contribution is a new base model architecture.

Evaluation and benchmarking papers

  1. Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs (2024)

  2. A Comparative Study of DSPy Teleprompter Algorithms for Aligning Large Language Models Evaluation Metrics to Human Evaluation (2024)

    • URL: https://arxiv.org/abs/2412.15298
    • Note: direct optimizer comparison inside DSPy against human-labeled hallucination annotations rather than only an internal proxy metric.
  3. Analyzing LLM Instruction Optimization for Tabular Fact Verification (2026)

  4. To Write or to Automate Linguistic Prompts, That Is the Question (2026)

    • URL: https://arxiv.org/abs/2603.25169
    • Note: systematic comparison of expert-written prompts, base DSPy signatures, and GEPA-optimized signatures across tasks and model configurations; explicitly notes the fairness asymmetry between labeled-data-driven optimization and expert prompt writing.
  5. GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning (2025)

    • URL: https://arxiv.org/abs/2507.19457
    • Note: claims higher quality with far fewer rollouts than GRPO and MIPROv2, which makes search-budget normalization central to any fair optimizer comparison; a minimal budget-matching sketch follows this list.
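
The budget point above generalizes: optimizer comparisons only mean something when every method is scored at the same rollout budget, with its proxy (dev) score reported next to a frozen held-out score. Below is a minimal sketch of that bookkeeping; run_optimizer and score_on_heldout are hypothetical placeholders, not any listed paper's API.

```python
# Hedged sketch: score several prompt optimizers at one shared rollout budget.
# `run_optimizer` and `score_on_heldout` are hypothetical placeholders; real
# optimizers (MIPROv2, GEPA, GRPO-style RL) expose their budgets differently.
from dataclasses import dataclass
from typing import Callable

@dataclass
class BudgetedResult:
    name: str
    rollouts_used: int
    dev_score: float      # score on the optimizer's own dev/proxy metric
    heldout_score: float  # score on a frozen held-out set the optimizer never saw

def compare_at_budget(
    optimizers: dict[str, Callable[[int], tuple[object, int, float]]],
    budget: int,
    score_on_heldout: Callable[[object], float],
) -> list[BudgetedResult]:
    """Run each optimizer under the same rollout cap and report dev vs. held-out
    scores side by side, so proxy overfitting is visible rather than hidden."""
    results = []
    for name, run_optimizer in optimizers.items():
        program, rollouts_used, dev_score = run_optimizer(budget)
        results.append(
            BudgetedResult(name, rollouts_used, dev_score, score_on_heldout(program))
        )
    # Rank by held-out score; a large dev/held-out gap flags proxy overfitting.
    return sorted(results, key=lambda r: r.heldout_score, reverse=True)
```

Reporting rollouts_used alongside the cap also exposes optimizers that stop early, which is the flip side of GEPA's efficiency claim.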

Transfer and portability papers

  1. PromptBridge: Cross-Model Prompt Transfer for Large Language Models (2025)

  2. Is It Time To Treat Prompts As Code? A Multi-Use Case Study For Prompt Optimization Using DSPy (2025)

    • URL: https://arxiv.org/abs/2507.03620
    • Note: reports a useful negative result alongside the positive ones; optimized prompts helped some tasks, but a cheaper model did not inherit the gains simply by reusing the optimized prompt.
  3. Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together (2024)

Robustness papers

  1. Robust Prompt Optimization for Large Language Models Against Distribution Shifts (2023)

  2. PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts (2023)

  3. Benchmarking Prompt Sensitivity in Large Language Models (2025)

    • URL: https://arxiv.org/abs/2502.06065
    • Note: introduces Prompt Sensitivity Prediction and PromptSET, showing that the effect of even small prompt variations remains hard to predict and benchmark; a minimal perturbation-probe sketch follows this list.
  4. When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs (2025)

  5. Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection (2024)

  6. InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents (2024)
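
Most of the robustness papers above probe the same mechanic: hold the task fixed, vary only the surface form of the prompt, and watch the score move. The sketch below (referenced from the PromptSET note) is a minimal version of that probe; call_model is a hypothetical stand-in for the actual LM client, and the perturbations are illustrative, not any paper's protocol.

```python
# Hedged sketch: estimate prompt sensitivity from a few surface-level rewrites.
# `call_model(prompt) -> str` is a hypothetical placeholder for the real LM client.
from typing import Callable

def perturbations(prompt: str) -> dict[str, str]:
    """Trivial rewrites that should not change the task but often change the score."""
    return {
        "original": prompt,
        "no_trailing_punct": prompt.rstrip(".!:"),
        "newline_sentences": prompt.replace(". ", ".\n"),
        "double_spaces": prompt.replace(" ", "  "),
        "all_caps": prompt.upper(),
    }

def sensitivity_probe(
    prompt: str,
    examples: list[tuple[str, str]],  # (input, expected answer) pairs
    call_model: Callable[[str], str],
) -> dict[str, float]:
    """Accuracy per perturbation; the max-min spread is the sensitivity estimate."""
    scores = {}
    for name, variant in perturbations(prompt).items():
        correct = sum(
            call_model(f"{variant}\n\nInput: {x}").strip().lower() == y.strip().lower()
            for x, y in examples
        )
        scores[name] = correct / max(len(examples), 1)
    return scores
```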

Structured prompt-program papers relevant to the benchmark-design question

  1. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines (2023)

  2. DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines (2023)

    • URL: https://arxiv.org/abs/2312.13382
    • Note: adds assertions and self-refinement strategies, making reliability constraints part of the optimization/evaluation surface.
  3. Prompts As Programs: A Structure-Aware Approach to Efficient Compile-Time Prompt Optimization (SAMMO) (2024)

  4. TextGrad: Automatic “Differentiation” via Text (2024)

  5. Modular Prompt Optimization: Optimizing Structured Prompts with Section-Local Textual Gradients (2026)

  6. Adaptive Prompt Structure Factorization: A Framework for Self-Discovering and Optimizing Compositional Prompt Programs (2026)

    • URL: https://arxiv.org/abs/2604.06699
    • Note: factorized prompt programs plus interventional factor-level scoring; directly relevant to module-level credit assignment and cost-aware benchmarking. A minimal multi-stage program sketch follows this list.
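
For orientation, the sketch below shows the kind of multi-stage LM program these papers operate on: two typed stages composed in one module, then compiled against a small labeled set with a task metric. It assumes a recent DSPy release and uses commonly documented names (dspy.ChainOfThought, dspy.Predict, dspy.BootstrapFewShot), but DSPy's API shifts between versions, so treat it as illustrative rather than canonical; the model string is a placeholder.

```python
# Hedged sketch of a two-stage DSPy-style LM program; exact APIs vary by DSPy version.
import dspy

# Placeholder model name; any LM configured here will do.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class SummarizeThenAnswer(dspy.Module):
    """Stage 1 condenses the context; stage 2 answers from the summary alone."""
    def __init__(self):
        super().__init__()
        self.summarize = dspy.ChainOfThought("context -> summary")
        self.answer = dspy.Predict("summary, question -> answer")

    def forward(self, context, question):
        summary = self.summarize(context=context).summary
        return self.answer(summary=summary, question=question)

def exact_match(example, prediction, trace=None):
    return example.answer.strip().lower() == prediction.answer.strip().lower()

# The tiny labeled set below is exactly the asymmetry flagged earlier: the optimizer
# sees labels and spends rollouts that a hand-written prompt never benefits from.
trainset = [
    dspy.Example(
        context="Paris is the capital of France.",
        question="What is the capital of France?",
        answer="Paris",
    ).with_inputs("context", "question"),
]

optimizer = dspy.BootstrapFewShot(metric=exact_match)  # MIPROv2 or GEPA slot in similarly
compiled = optimizer.compile(SummarizeThenAnswer(), trainset=trainset)
```

Module-level credit assignment, as in the factorization paper above, asks which of the two stages an optimizer's edit actually improved; the structured program is what makes that question well posed.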

Working synthesis

  • The main open issue is no longer whether prompts can be optimized at all; it is how to evaluate prompt-program optimizers without conflating proxy overfitting and search-budget advantages with genuine end-task improvement.
  • Transfer is now a first-class empirical problem: optimized prompts often do not survive model changes, prompt-format perturbations, or subpopulation shift without extra calibration.
  • Robustness needs to be measured at multiple layers: prompt wording, prompt format, data distribution, instruction hierarchy, and tool-mediated prompt injection.
  • Benchmark design is lagging behind optimizer design. The literature has many new optimizers, but still too few shared suites that equalize search budget, labels, model families, and adversarial/context-shift conditions.
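
Read the last bullet as a checklist of axes a shared suite would have to pin down per run. The dataclass below is purely illustrative (hypothetical field names, not an existing benchmark's schema), but it makes the equalization concrete.

```python
# Hedged sketch: per-run metadata a budget- and shift-aware benchmark would record
# so that optimizer comparisons stay interpretable. Field names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class OptimizerRunReport:
    optimizer: str                  # e.g. "MIPROv2", "GEPA", "hand-written baseline"
    rollout_budget: int             # LM/metric calls allowed during search
    labeled_examples: int           # supervision the optimizer was allowed to see
    optimization_model: str         # model used while searching
    evaluation_model: str           # model used at test time (transfer if different)
    perturbation_suite: list[str] = field(default_factory=list)  # wording/format/injection shifts
    dev_score: float = 0.0          # proxy metric the optimizer targeted
    heldout_score: float = 0.0      # frozen end-task metric
```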