A Survey on LLM Inference-Time Self-Improvement
Techniques that enhance inference through increased computation at test time have recently gained attention. In this survey, we investigate the current state of LLM Inference-Time Self-Improvement from three different perspectives: Independent Self-Improvement, focusing on enhancements via decoding or sampling methods; Context-Aware Self-Improvement, leveraging additional context or a datastore; and Model-Aided Self-Improvement, achieving improvement through model collaboration. We provide a comprehensive review of recent relevant studies, contribute an in-depth taxonomy, and discuss challenges and limitations, offering insights for future research.
Introduction. The capabilities of large language models (LLMs) have advanced dramatically in recent years (Achiam et al., 2023; Team et al., 2023). These advancements have largely been driven by scaling up model training compute (Kaplan et al., 2020; Brown et al., 2024), with investments in larger models, extensive pretraining datasets, and enhanced alignment techniques (Ouyang et al., 2022; Bai et al., 2022a,b; Rafailov et al., 2023). Recently, scaling computation at inference time to improve task performance has gained attention (Snell et al., 2024), e.g., increasing test-time compute (i.e., model thinking time) (OpenAI, 2024) and scaling inference compute through repeated sampling (Brown et al., 2024). Test-time capabilities enable smaller models to replace larger ones by trading size for extra inference computation, and they pave the way for self-improvement with minimal human supervision (Brown et al., 2024). Self-improvement approaches at inference time offer researchers a new set of opportunities to keep pushing the boundaries of AI models beyond scaling model size and training data.
Discussion / Conclusion. Inference-time self-improvement methods excel in reasoning (§2.5, §3.1), enable faithful generation (§2.2, §3.3), increase speed via parallelism (§2.4), and more, all without updating model parameters or requiring additional training. Despite these advancements, several challenges remain. In this section, we discuss considerations for method selection and outline potential directions for future research:
Maintenance: Methods that depend on an external datastore (§3) or model (§4) require ongoing maintenance, as these components need to be updated over time. In contrast, independent methods (§2) do not require this level of maintenance, as they operate solely on the decoding process.
Trade-Offs in Inference Costs: Methods that scale up inference-time computation, such as sampling with multiple generations (§2.5), generally take more time at inference than methods that directly manipulate the decoding process.
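To make the inference-cost trade-off concrete, the following minimal sketch shows one common form of sampling with multiple generations: draw several answers and keep the most frequent one. It is an illustration under stated assumptions, not the method of any particular surveyed paper. The names `majority_vote` and `make_toy_sampler` are hypothetical, and the toy sampler stands in for a real LLM call. The point is only that the number of model calls, and hence inference time, grows linearly with the number of samples.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple


def majority_vote(sample_fn: Callable[[], str], n_samples: int) -> Tuple[str, int]:
    """Draw n_samples answers and return (most common answer, model calls used).

    sample_fn is a hypothetical stand-in for one stochastic LLM generation.
    """
    answers: List[str] = [sample_fn() for _ in range(n_samples)]
    # most_common(1) returns the answer with the highest count; ties go to
    # the answer that was seen first.
    winner, _count = Counter(answers).most_common(1)[0]
    # Each sample is one full generation, so cost scales linearly with n_samples.
    return winner, n_samples


def make_toy_sampler(correct: str, wrong: List[str], p_correct: float,
                     seed: int) -> Callable[[], str]:
    """Build a toy sampler (assumption: answers are independent draws)."""
    rng = random.Random(seed)

    def sample() -> str:
        # Return the correct answer with probability p_correct,
        # otherwise a uniformly chosen wrong answer.
        return correct if rng.random() < p_correct else rng.choice(wrong)

    return sample


if __name__ == "__main__":
    sampler = make_toy_sampler("42", ["41", "43", "44"], p_correct=0.6, seed=0)
    for n in (1, 5, 25):
        answer, calls = majority_vote(sampler, n)
        print(f"n_samples={n:>2}  answer={answer}  model_calls={calls}")
```

A method that manipulates the decoding process directly would, by contrast, typically produce a single generation, which is why the multi-sample approach is the more expensive side of this trade-off.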