Can models learn to optimize their own chain-of-thought generation?
This explores whether a model can learn to improve its own chain-of-thought (deciding when to reason, how long to reason, and even rewarding itself for reasoning) instead of having those choices imposed from outside. The corpus says yes, and along several axes at once. The most direct case treats reasoning as something the model trains itself on: with information-gain rewards, a model can learn chain-of-thought during pretraining by treating each reasoning step as an exploratory action and scoring it by how much it improves the model's own predictions, with no external verifier needed (see "Can chain-of-thought reasoning be learned during pretraining itself?"). Push that further and the model can learn to grade its own work, computing its own reward in the unused sequence space after its answer, so self-evaluation is internalized at training time with no inference cost (see "Can models learn to evaluate their own work during training?"). A model can even bootstrap without external training data, with a proposer half inventing problems and a solver half learning to crack them, both improving through reinforcement alone (see "Can language models improve themselves without any external training data?").
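To make the information-gain idea concrete, here is a minimal sketch of a verifier-free reward computed as log-likelihood improvement. It assumes the baseline is the same model's prediction without the reasoning step; the notes do not specify the exact baseline, and the function names and toy probabilities below are hypothetical.

```python
import math

def information_gain_reward(logp_with_thought, logp_without_thought):
    """Reward for one reasoning step: the gain in log-likelihood of the
    observed next token when the model conditions on its own thought.
    Assumption: the baseline is the same model predicting without the thought."""
    return logp_with_thought - logp_without_thought

def sequence_rewards(probs_with, probs_without):
    """Per-position rewards from two lists of next-token probabilities."""
    return [
        information_gain_reward(math.log(p), math.log(q))
        for p, q in zip(probs_with, probs_without)
    ]

# Toy numbers (hypothetical): the first thought doubles the probability of the
# observed token, the second thought halves it.
rewards = sequence_rewards([0.5, 0.2], [0.25, 0.4])
print([round(r, 3) for r in rewards])  # [0.693, -0.693]
```

A positive reward means the thought made the observed continuation more likely, and a negative one means it hurt, so the signal needs no external verifier.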
But 'optimize' isn't only about getting better answers; it's also about not wasting reasoning. Here the corpus is striking: models trained with RL naturally drift toward *shorter* chains as they get more capable, because the reward signal itself favors simplicity. Accuracy follows an inverted U as chains get longer (too short underthinks, too long degrades); the optimal length grows with task difficulty but shrinks with model capability, so stronger models prefer the shorter end without being told to (see "Why does chain of thought accuracy eventually decline with length?"). A model can also learn the meta-decision of whether to reason at all, routing between extended thinking and a quick direct answer and calibrating that choice itself without difficulty labels (see "Can models learn when to think versus respond quickly?").
The twist that makes this question more interesting than it looks: the thing being optimized may not be 'reasoning' in the way it appears. Several notes argue that what reads as chain-of-thought is often stylistic mimicry. Invalid logical steps perform almost as well as valid ones, so the gains aren't coming from semantic correctness (see "Do reasoning traces show how models actually think?"), and CoT degrades under distribution shift in the way imitation does and genuine inference would not (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"). More unsettling: when models are trained to hide their reasoning, transformers compute the answer in the earliest layers and then suppress it in the final layers to emit format-compliant filler tokens (see "Do transformers hide reasoning before producing filler tokens?"). So 'optimizing CoT generation' can mean optimizing the *visible performance* of reasoning, which may diverge from the computation actually doing the work.
That opens a door worth walking through: if the visible chain is partly theater, why generate it token-by-token at all? Latent-thought approaches scale reasoning through internal thought vectors rather than more words (see "Can latent thought vectors scale language models beyond parameters?"), and diffusion-based models drop the left-to-right constraint entirely, refining reasoning and answer in place and simultaneously. The answer often converges early while the reasoning keeps refining, which lets the model exit early and cut compute by half (see "Can reasoning and answers be generated separately in language models?"). There's also a hard limit to keep in mind: a lot of CoT error is local memorization, meaning reliance on the immediately preceding tokens, which accounts for up to 67% of reasoning errors and gets worse as problems get harder (see "Where do memorization errors arise in chain-of-thought reasoning?"). The reader's takeaway: models can absolutely learn to optimize their chains (to reason, to self-reward, to shorten, to route, to skip), but whether they're optimizing reasoning or the appearance of it is the open question underneath.
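The early-exit idea, stopping refinement once the answer has stabilized even though the reasoning could keep changing, can be sketched as a simple stopping rule. This is an illustration under stated assumptions, not the method from the cited note: the confidence trace, the `threshold`, and the `patience` parameter are all hypothetical, and a real system would read answer confidence from the model at each refinement step.

```python
def early_exit_step(confidence_trace, threshold=0.9, patience=2):
    """Return the index of the refinement step at which to stop.

    confidence_trace: non-empty list of answer confidences, one per step
                      (hypothetical values standing in for model outputs).
    Stops once confidence has stayed at or above `threshold` for `patience`
    consecutive steps; otherwise runs to the last step.
    """
    streak = 0
    for step, confidence in enumerate(confidence_trace):
        streak = streak + 1 if confidence >= threshold else 0
        if streak >= patience:
            return step
    return len(confidence_trace) - 1

# Toy trace: the answer settles early while later steps would only refine reasoning.
trace = [0.30, 0.60, 0.92, 0.95, 0.96, 0.97, 0.97, 0.98]
stop = early_exit_step(trace)
print(stop + 1, "of", len(trace), "steps used")  # 4 of 8 steps used
```

Requiring a short streak rather than a single high reading is a design choice that guards against exiting on a transient spike in confidence.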
Sources (11 notes)
RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
LLM reasoning traces function as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.
ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.
STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.