Can we predict when a model will develop thinking behaviors?

This explores whether we can anticipate when a model starts to reason — whether 'thinking' is a capability that suddenly appears, or something already latent that training merely switches on at a predictable moment.

Read as a forecasting question (can we predict the point at which a model develops reasoning behavior?), the corpus reframes it in a surprising way. The more striking answer isn't about *predicting an emergence event*; it is that there may be no emergence event to predict. Several notes converge on the idea that reasoning is already sitting inside base models, latent, waiting to be elicited rather than built. One survey finds that five independent methods (RL steering, critique fine-tuning, decoding changes, sparse-autoencoder feature steering, and RLVR) all unlock the *same* dormant capability, and concludes that post-training selects reasoning rather than creating it (see 'Do base models already contain hidden reasoning ability?'). If that's right, 'when does thinking develop' becomes 'when do we choose to switch it on.'

That reframing sharpens what training actually contributes: timing, not capability. RL post-training is described as a *deployment optimizer*: it teaches a model *when* to spend reasoning effort, with a hybrid setup recovering 91% of the performance gains using just 12% of the tokens (see 'Does RL teach reasoning or just when to use it?'). The same flavor of result shows up in models that learn to route between deep thinking and quick answers on their own, without anyone labeling which problems are hard (see 'Can models learn when to think versus respond quickly?'). So the predictable thing isn't the birth of reasoning; it's the model learning a policy for when to use it.

There's also a quieter, more unsettling thread: thinking behavior often *hurts* before it helps, and only training flips the sign. Vanilla models given a thinking mode use it for counterproductive self-doubt until RL redirects that exact mechanism into useful gap analysis (see 'Does extended thinking help or hurt model reasoning?'). Asking a model to think first degrades general performance until RL, judging only the final response, teaches it to make those thoughts pay off (see 'Why does asking models to think first hurt performance?'). Even then, more thinking isn't monotonically better: on one reported benchmark, accuracy fell from 87.3% to 70.3% as thinking tokens climbed from ~1,100 to ~16K, with models overthinking easy problems and underthinking hard ones (see 'Does more thinking time always improve reasoning accuracy?'). The trajectory is predictable in shape (initial harm, then benefit, then diminishing returns), but it's a training curve, not a switch.

The genuinely interesting twist is whether you can *measure* thinking as it forms rather than guess at it. A 'deep-thinking ratio' measures the proportion of tokens whose predictions are substantially revised across the model's layers, and that internal signal correlates with accuracy well enough to drive a cheaper test-time strategy (see 'Can we measure how deeply a model actually reasons?'). That's closer to a real predictor: an observable internal marker of reasoning effort. But it comes with a warning from the skeptics: visible reasoning traces can be pure stylistic mimicry, with invalid traces still producing correct answers, meaning the *appearance* of thinking isn't proof of the function (see 'Do reasoning traces actually cause correct answers?'). So if you want to predict thinking, you'll want internal measures, not the visible chain-of-thought.
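As a rough illustration of what such an internal measure could look like, here is a minimal sketch of a deep-thinking-ratio style computation. It is not the cited note's implementation. It assumes you already have per-layer next-token logits (for example, from projecting intermediate hidden states through the output head), and it assumes a token counts as "deep-thinking" when its intermediate prediction only settles close to the final-layer prediction in the later part of the network. The function name `deep_thinking_ratio` and the parameters `js_threshold` and `late_fraction` are hypothetical.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence (natural log) along the last axis.
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)), axis=-1)
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)), axis=-1)
    return 0.5 * (kl_pm + kl_qm)

def deep_thinking_ratio(layer_logits, js_threshold=0.1, late_fraction=0.5):
    """Fraction of tokens whose prediction settles late in depth.

    layer_logits: array of shape (num_layers, num_tokens, vocab_size),
    with the final layer last. Requires num_layers >= 2.
    Thresholds are illustrative assumptions, not published values.
    """
    probs = softmax(layer_logits)
    num_layers = probs.shape[0]
    # Divergence of every layer's prediction from the final layer's.
    div = js_divergence(probs, probs[-1][None, :, :])  # (layers, tokens)
    unsettled = div >= js_threshold
    # Index of the last layer still far from the final prediction (-1 if none).
    layer_idx = np.arange(num_layers)[:, None]
    last_unsettled = np.where(unsettled, layer_idx, -1).max(axis=0)
    # Relative depth (0 to 1) at which the prediction stays settled.
    settle_depth = (last_unsettled + 1) / (num_layers - 1)
    return float((settle_depth >= late_fraction).mean())

# Toy example: 4 layers, 2 tokens, vocabulary of 3.
early = np.array([5.0, 0.0, 0.0])
late = np.array([0.0, 0.0, 5.0])
stable_token = np.stack([early, early, early, early])    # never revised
revised_token = np.stack([early, early, early, late])    # revised at the last layer
layer_logits = np.stack([stable_token, revised_token], axis=1)  # (4, 2, 3)

print(deep_thinking_ratio(layer_logits))  # 0.5
```

In the toy input, one token keeps the same prediction at every layer and the other changes its prediction only at the final layer, so half of the tokens count as deep-thinking under these assumed thresholds.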

Two adjacent findings stretch the question further. Reasoning may be plantable *earlier* than anyone assumed: chain-of-thought trained directly into pretraining with an information-gain reward lifts reasoning performance by ~19% (see 'Can chain-of-thought reasoning be learned during pretraining itself?'), pushing the 'when' upstream. And models develop a kind of self-knowledge about their own behaviors without being trained to report it (see 'Can language models describe their own learned behaviors?'), which hints that the cleanest predictor of a model's thinking tendencies might eventually be the model itself.

Sources (10 notes)

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL teach reasoning or just when to use it?

Pre-training acquires reasoning capability; RL teaches efficient deployment. A hybrid model combining base reasoning with thinking model steering recovered 91% of performance gains using only 12% of tokens, suggesting RL acts as a deployment optimizer rather than a capability creator.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Why does asking models to think first hurt performance?

Prompting models to think before responding degrades performance on general tasks. RL training with judges evaluating only responses teaches models to generate thoughts that actually improve outputs across diverse task types, not just math.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Can we measure how deeply a model actually reasons?

Deep-thinking ratio (DTR) measures the proportion of tokens whose predictions undergo significant revision across model layers, correlating robustly with accuracy across AIME, HMMT, and GPQA benchmarks. Think@n, a test-time strategy using DTR, matches self-consistency performance while reducing inference costs.

Do reasoning traces actually cause correct answers?

R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, proving traces are not causally necessary—they correlate with answers via learned formatting, not functional reasoning.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
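
The "log-likelihood improvement" reward can be illustrated with a small sketch. It assumes the reward for a sampled thought is the log-probability of the observed next token given the context plus the thought, minus the log-probability of the same token given the context alone. The cited method's exact baseline and any normalization are not specified in this note, so treat this as an assumption; the function name `information_gain_reward` is hypothetical.

```python
import math

def information_gain_reward(logp_with_thought: float,
                            logp_without_thought: float) -> float:
    """Verifier-free reward: how much a thought improves the
    log-likelihood of the observed next token (assumed form)."""
    return logp_with_thought - logp_without_thought

# Example: the thought raises the next token's probability from 0.25 to 0.5.
reward = information_gain_reward(math.log(0.5), math.log(0.25))
print(round(reward, 4))  # 0.6931
```

A positive value means the thought made the observed continuation more likely, and a negative value means it made it less likely, so no external answer checker is needed to score the thought.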

Can language models describe their own learned behaviors?

LLMs fine-tuned on datasets exhibiting specific behaviors accurately describe those behaviors without any training to self-report. This suggests behavioral regularities are encoded and accessible in ways that factual knowledge often is not.
