Does gradually tightening token budgets beat fixed budget training?
Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.
Existing length-control approaches for reasoning use fixed token budgets during training. Train Long, Think Short instead proposes a curriculum: start with generous budgets and gradually tighten them. The intuition is that learning has two phases, exploration (discovering effective strategies) and compression (distilling those strategies into concise traces), and that these phases have different budget needs.
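The generous-then-tight idea can be made concrete with a small schedule function. The sketch below is an illustration only: the stepwise exponential decay, the name `token_budget`, and every parameter value are assumptions, not details taken from Train Long, Think Short.

```python
def token_budget(step: int,
                 initial_budget: int = 1024,
                 final_budget: int = 256,
                 decay: float = 0.8,
                 interval: int = 100) -> int:
    """Return the token budget for a training step (hypothetical schedule).

    The budget starts at `initial_budget`, is multiplied by `decay` once
    every `interval` steps, and never drops below `final_budget`.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must be in (0, 1)")
    if interval <= 0:
        raise ValueError("interval must be positive")
    decayed = initial_budget * decay ** (step // interval)
    return max(final_budget, int(decayed))


# Budgets at a few checkpoints of an assumed 1000-step run.
schedule = [token_budget(s) for s in range(0, 1000, 100)]
```

Under these assumptions, a fixed-budget baseline corresponds to setting `initial_budget` equal to `final_budget`, which makes the function constant and keeps the comparison at the same final budget.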
The reward function balances three signals: task correctness via verifier feedback, length efficiency, and formatting adherence via structural tags. The curriculum controls the length-efficiency signal, which becomes progressively more demanding as training proceeds.
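A three-signal reward of this kind might be combined as in the following minimal sketch. The tag names (`<think>`, `<answer>`), the weights, and the linear shape of the length penalty are all assumptions chosen for illustration; the note does not specify them.

```python
import re

# Assumed structural format: a reasoning block followed by an answer block.
_FORMAT = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)


def length_reward(n_tokens: int, budget: int) -> float:
    """1.0 within budget, decaying linearly to 0.0 at twice the budget."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    if n_tokens <= budget:
        return 1.0
    return max(0.0, 1.0 - (n_tokens - budget) / budget)


def format_reward(text: str) -> float:
    """1.0 if the whole output follows the assumed tag structure, else 0.0."""
    return 1.0 if _FORMAT.fullmatch(text.strip()) else 0.0


def total_reward(text: str, n_tokens: int, is_correct: bool, budget: int,
                 w_correct: float = 1.0, w_length: float = 0.5,
                 w_format: float = 0.25) -> float:
    """Weighted sum of correctness, length efficiency, and formatting.

    `is_correct` stands in for the verifier's verdict; `budget` is the
    current token budget, which a curriculum would shrink over training.
    """
    return (w_correct * float(is_correct)
            + w_length * length_reward(n_tokens, budget)
            + w_format * format_reward(text))
```

Because only `budget` changes over training in this sketch, the same output earns a lower length reward later in the run, which is one way the length signal could become more demanding.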
Across five benchmarks (GSM8K, MATH500, SVAMP, College Math, GSM+), curriculum-based training consistently outperforms fixed-budget baselines at the same final budget, achieving higher accuracy and significantly improved token efficiency. The key is that the generous early phase allows the model to explore diverse solution strategies without being penalized for verbosity, then the tightening phase forces compression of only the strategies that actually work.
This connects to the broader overthinking cluster. "Why does chain of thought accuracy eventually decline with length?" describes an inverted-U relationship between chain length and accuracy; the curriculum approach may naturally navigate this inverted-U by allowing the model to find the peak during exploration and then descend toward conciseness. And given the findings in "Can minimal reasoning chains match full explanations?", the compression phase is not sacrificing quality: it is removing the filler that makes no real progress.
The deeper principle is that exploration and exploitation require different resource allocations, and temporal scheduling of these allocations (generous first, tight later) outperforms any fixed compromise. This generalizes beyond token budgets to task ordering.
Cognitive science grounding: The CURIOUS algorithm (Colas et al.) provides the developmental robotics foundation for this principle. In open-ended environments, autonomous agents that bias attention toward goals maximizing absolute learning progress naturally self-organize a developmental curriculum — focusing sequentially on goals of increasing complexity, and importantly, refocusing on goals that are being forgotten. The robustness to distracting goals and changing body properties suggests the learning-progress signal is a more general curriculum principle than task difficulty alone. The connection to RL reasoning training: the "generous early, tight later" budget curriculum may succeed precisely because early generosity maximizes learning progress (many strategies discovered per token), while later tightening maximizes efficiency (compression without new discovery needed).
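Goal selection driven by absolute learning progress can be sketched as follows. This is a simplified illustration rather than the CURIOUS implementation: the windowed competence estimate, the epsilon mixing with a uniform distribution, and all names and values are assumptions.

```python
def absolute_learning_progress(history: list[float], window: int = 2) -> float:
    """Absolute change in mean competence between the two latest windows."""
    if len(history) < 2 * window:
        return 0.0
    recent = sum(history[-window:]) / window
    older = sum(history[-2 * window:-window]) / window
    return abs(recent - older)


def goal_probabilities(histories: dict[str, list[float]],
                       epsilon: float = 0.2,
                       window: int = 2) -> dict[str, float]:
    """Sample goals in proportion to absolute learning progress.

    A share `epsilon` of the probability mass stays uniform so that goals
    with no measured progress are still visited occasionally.
    """
    lp = {g: absolute_learning_progress(h, window) for g, h in histories.items()}
    n = len(lp)
    total = sum(lp.values())
    if total == 0:
        return {g: 1.0 / n for g in lp}
    return {g: epsilon / n + (1 - epsilon) * v / total for g, v in lp.items()}


# Made-up competence histories (success rates), for illustration only.
probs = goal_probabilities({
    "mastered": [1.0, 1.0, 1.0, 1.0],   # no change: low priority
    "improving": [0.2, 0.2, 0.6, 0.6],  # competence rising
    "forgotten": [0.9, 0.9, 0.5, 0.5],  # competence falling
})
```

Taking the absolute value is what lets a goal whose competence is dropping receive as much attention as one that is improving, matching the refocusing-on-forgotten-goals behaviour described above.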
Backward transfer extends curriculum from temporal budgets to task ordering: Omni-Thinker's BWT-guided scheduling reveals that what matters for multi-task RL is not just how much compute each task receives, but which tasks come first. Structured domains (math, coding) decrease output entropy while creative domains (writing, dialogue) increase it. Training structured tasks first, then creative tasks, preserves both capabilities. Training creative tasks first risks having the later structured training collapse the entropy that creative training expanded. The ordering effect is predictable from backward transfer measurements, giving practitioners a principled scheduling criterion. See "Does training order reshape how models handle different task types?". Together with the token-budget curriculum, this suggests RL training benefits from curriculum design along multiple dimensions simultaneously: budget generosity over time and task-type ordering across the training run.
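Backward transfer can be computed from a matrix of evaluation scores. The sketch below uses the standard continual-learning definition of BWT (the average change in earlier-task scores after the final training stage); whether Omni-Thinker uses exactly this formula is an assumption, and the scores in the example are made-up placeholders, not results.

```python
def backward_transfer(scores: list[list[float]]) -> float:
    """Average change on earlier tasks after all training stages.

    scores[i][j] is the score on task j measured after training stage i,
    where stage i trains on task i. Negative values indicate forgetting.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two training stages")
    if any(len(row) != n for row in scores):
        raise ValueError("scores must be a square matrix")
    deltas = [scores[n - 1][j] - scores[j][j] for j in range(n - 1)]
    return sum(deltas) / len(deltas)


# Placeholder score matrices for two candidate orderings (not measured data).
orderings = {
    "structured_then_creative": [[0.80, 0.30], [0.78, 0.70]],
    "creative_then_structured": [[0.70, 0.20], [0.55, 0.80]],
}

# A hypothetical scheduling rule: prefer the ordering that forgets least.
best = max(orderings, key=lambda name: backward_transfer(orderings[name]))
```

With these placeholder numbers the structured-first ordering has the less negative BWT, which is the pattern the paragraph above describes; real measurements would be needed to confirm it for any given task mix.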
Inquiring lines that use this note as a source 11
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How do thinking tokens exhibit diminishing returns beyond a critical threshold?
- Can budget-tightening curricula improve reasoning efficiency more than fixed budgets?
- Why does curriculum learning with tight budgets beat fixed-budget approaches?
- How should token budgets be allocated when prompt-inference coupling matters?
- How does constraint complexity relate to optimal reasoning token budgets?
- Why do reasoning models reduce effort despite having token budget remaining?
- Does parallel token spending always beat sequential spending at the same budget?
- How does reasoning accuracy degrade when token budgets exceed critical thresholds?
- How should token budgets be set to prevent runaway oscillation during inference?
- Why do frontier models remain cost-effective despite higher token prices in production?
- Why does parallel thinking outperform sequential thinking under fixed token budgets?
Related concepts in this collection 7
- Does training order reshape how models handle different task types?
  Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized.
  extends curriculum from temporal budgets to task ordering: BWT-guided scheduling is a second curriculum dimension
- Why does chain of thought accuracy eventually decline with length?
  Explores why longer reasoning chains don't always improve answers, and how the optimal length shifts based on task difficulty and model capability.
  supports: curriculum navigates the inverted-U naturally
- Can minimal reasoning chains match full explanations?
  Does removing all explanatory text from chain-of-thought reasoning preserve accuracy? This tests whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained.
  validates: compression phase removes filler without losing quality
- Can we reward reasoning steps without human annotation?
  Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?
  connects: curriculum approach may achieve similar efficiency gains through a simpler mechanism
- Can language models improve themselves without any external training data?
  Explores whether two language models playing against each other, one generating questions and one solving them, can create a self-improving loop. Matters because it would eliminate dependence on human-labeled datasets.
  SQLM's proposer-solver dynamic creates an emergent curriculum analogous to the budget curriculum: the proposer automatically calibrates problem difficulty to the solver's frontier (neither too easy nor too hard), producing the same explore-then-compress dynamic but through adversarial generation rather than temporal budget scheduling
- Can adaptive guidance from solution traces reduce reward sparsity in RL?
  When reinforcement learning struggles with hard problems due to sparse rewards and zero-advantage rollouts, does providing partial solution traces as adaptive guidance help the model learn more efficiently? This matters because standard RL wastes compute on unsolvable problems.
  GHPO operationalizes adaptive curriculum via a different lever: instead of tightening budget over time, it provides solution-trace guidance calibrated to problem difficulty, converting zero-advantage rollouts into learning signal
- Can reinforcement learning optimize therapy dialogue in real time?
  Can RL systems trained on working alliance scores recommend therapy topics that improve clinical outcomes during live sessions? This explores whether validated clinical constructs can serve as reward signals for dialogue optimization.
  R2D2's three-level architecture (backbone RL to content-enriched to personalized) mirrors the curriculum principle in a clinical domain: progressive specialization from general therapeutic strategies to disorder-specific to patient-personalized policies
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
- Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models
- The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
- Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
- Train Long, Think Short: Curriculum Learning for Efficient Reasoning
- The Invisible Leash: Why RLVR May Not Escape Its Origin
- Rethinking Thinking Tokens: LLMs as Improvement Operators
Original note title
curriculum budgets that start generous and gradually tighten outperform fixed-budget rl for reasoning efficiency