From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models

Paper · arXiv 2406.16838 · Published June 24, 2024
Test-Time Compute

One of the most striking findings in modern research on large language models (LLMs) is that scaling up compute during training leads to better results. However, less attention has been given to the benefits of scaling compute during inference. This survey focuses on these inference-time approaches. We explore three areas under a unified mathematical formalism: token-level generation algorithms, meta-generation algorithms, and efficient generation. Token-level generation algorithms, often called decoding algorithms, operate by sampling a single token at a time or constructing a token-level search space and then selecting an output. These methods typically assume access to a language model’s logits, next-token distributions, or probability scores. Meta-generation algorithms work on partial or full sequences, incorporating domain knowledge, enabling backtracking, and integrating external information. Efficient generation methods aim to reduce token costs and improve the speed of generation. Our survey unifies perspectives from three research communities: traditional natural language processing, modern LLMs, and machine learning systems.

Introduction. One of the most striking findings in modern research on large language models (LLMs) is that, given a model and dataset of sufficient scale, scaling up the compute used at training time leads to better final results (Kaplan et al., 2020; Hoffmann et al., 2022). However, there is another, lesser-mentioned scaling phenomenon, where adopting more sophisticated methods or scaling compute at inference time (Jones, 2021) can result in substantially better outputs from LLMs. This survey focuses on these approaches by exploring three connected themes: token-level generation algorithms, meta-generation algorithms, and efficient generation. Token-level generation algorithms, often called decoding algorithms, have a rich history in natural language processing, ranging from classical greedy decoding and beam search to modern sampling algorithms such as nucleus (Holtzman et al., 2020) and η-sampling (Hewitt et al., 2022). These methods operate by sampling one token at a time or constructing a token-level search space.
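To make the token-level view concrete, the following is a minimal sketch of two of the selection rules named above, greedy decoding and nucleus (top-p) sampling, applied to a single next-token distribution. It is an illustration under stated assumptions rather than the survey's implementation: the function names, the `top_p` default, and the toy distribution are hypothetical, and a real system would obtain `probs` from a language model's logits.

```python
import random


def greedy_token(probs):
    """Return the index of the highest-probability token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def nucleus_token(probs, top_p=0.9, rng=random):
    """Sample from the smallest set of top-ranked tokens whose total mass reaches top_p."""
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in ranked:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # random.choices renormalizes the weights, so the nucleus is sampled
    # in proportion to the original probabilities.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]


# Hypothetical next-token distribution over a four-token vocabulary.
toy_probs = [0.5, 0.3, 0.15, 0.05]
```

With `top_p=0.7` the nucleus for `toy_probs` contains only the first two tokens, so the low-probability tail is never sampled; greedy decoding corresponds to always choosing index 0.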

Discussion / Conclusion. Finally, we return to the question that we posed in the introduction: why are sophisticated generation algorithms needed at all? For example, we might imagine that simply sampling once from the model’s unmodified output distribution, y ∼ p_θ(y|x), is sufficient. We offer some takeaways based on our survey. We surveyed generation algorithms for language models. We motivated generation algorithms, formalized their goals, and provided a unified treatment of three themes: token-level generation algorithms, meta-generation algorithms, and efficient generation. Our survey brings together past research from the decoding, LLM reasoning, and machine learning systems communities, and identifies directions for future work.
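As a point of reference for the baseline mentioned above, sampling once from the unmodified distribution can be sketched as ancestral sampling: each token is drawn from the model's next-token distribution until an end-of-sequence token appears. The sketch below is an assumption-laden toy, not the survey's code: `TOY_MODEL` is a hypothetical hand-written table that conditions only on the previous token and ignores the prompt x, and the function names are illustrative.

```python
import random

# Hypothetical toy "model": next-token distributions keyed by the previous token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.7, "dog": 0.3},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def ancestral_sample(model, rng, max_len=10):
    """Draw one sequence by sampling each token from the unmodified next-token distribution."""
    tokens, prev = [], "<s>"
    for _ in range(max_len):
        dist = model[prev]
        prev = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if prev == "</s>":
            break
        tokens.append(prev)
    return tokens


def sequence_prob(model, tokens):
    """Probability of a full sequence: the product of next-token probabilities, including end-of-sequence."""
    prob, prev = 1.0, "<s>"
    for tok in tokens + ["</s>"]:
        prob *= model.get(prev, {}).get(tok, 0.0)
        prev = tok
    return prob


sample = ancestral_sample(TOY_MODEL, random.Random(0))
```

Here `sequence_prob` makes the factorization explicit: the probability of a sampled sequence is the product of the per-token probabilities used to draw it. The algorithms surveyed modify or build on this single-sample baseline.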