On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
Recent reinforcement learning (RL) techniques have yielded impressive reasoning improvements in language models, yet it remains unclear whether post-training truly extends a model’s reasoning ability beyond what it acquires during pre-training. A central challenge is the lack of control in modern training pipelines: large-scale pre-training corpora are opaque, mid-training is often underexamined, and RL objectives interact with unknown prior knowledge in complex ways. To resolve this ambiguity, we develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training. Our approach employs synthetic reasoning tasks with explicit atomic operations, parseable step-by-step reasoning traces, and systematic manipulation of training distributions. We evaluate models along two axes: extrapolative generalization to more complex compositions and contextual generalization across surface contexts. Using this framework, we reconcile competing views on RL’s effectiveness. We show that: 1) RL produces true capability gains (as measured by pass@128) only when pre-training leaves sufficient headroom and when RL data target the model’s edge of competence, that is, tasks at the boundary that are difficult but not yet out of reach.
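For readers unfamiliar with the pass@128 metric mentioned above, the following is a minimal sketch of how pass@k is commonly estimated from n sampled solutions per task, of which c are correct. It uses the standard unbiased combinatorial estimator, 1 - C(n-c, k) / C(n, k). This is an assumption about the metric's computation: the text above does not state which estimator is used, and the function name `pass_at_k` and its parameters are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n samples, c of which are correct.

    Illustrative helper (not taken from the paper): returns the probability
    that at least one of k samples, drawn without replacement from the n
    generated samples, is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect samples, every size-k subset
    # necessarily contains at least one correct sample.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of size-k subsets that contain no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to the plain success rate c / n. With a large k such as 128, a task counts as solved if any of the sampled attempts succeeds, which is why the metric is often read as a measure of what a model can do at all rather than what it does reliably.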
Introduction. Recent advances in reinforcement learning (RL) have led to significant improvements in the reasoning capabilities of language models (LMs) [DeepSeek-AI et al., 2025, OpenAI et al., 2024]. Yet despite this progress, a fundamental conceptual question remains unresolved: does post-training truly extend a model’s reasoning ability beyond what is acquired during pre-training? The literature offers conflicting views: some work characterizes RL as a capability refiner [Yue et al., 2025, Wu et al., 2025, Shao et al., 2025, Yeo et al., 2025], while other work presents evidence of substantial reasoning gains beyond pre-training [Wen et al., 2025, Yuan et al., 2025, Sun et al., 2025a]. A major source of this discrepancy is that prior analyses rely on uncontrolled training environments. Modern LMs are pre-trained on massive, opaque internet corpora whose composition is fundamentally unknown. As a result, we cannot ascertain which reasoning primitives the base model has already internalized.
Discussion / Conclusion. In this work, we presented a controlled investigation into how pre-training and post-training jointly determine the reasoning capabilities of language models. By disentangling the contributions of each stage, our study clarifies the causal mechanisms through which RL enhances or fails to enhance reasoning generalization. Using fully controllable synthetic reasoning tasks and process-level evaluations, we demonstrated that genuine reasoning improvements through post-training arise only when key reasoning primitives are established during pre-training. Together, these results refine our understanding of reasoning development in language models and provide actionable guidance for constructing data curricula, designing reward functions, and allocating compute across training stages.