Do models genuinely reason harder on difficult tasks or just appear to?

This explores whether the extra 'thinking' models do on hard problems reflects genuine deeper computation, or whether it's surface behavior — longer traces and confident-looking reasoning that don't actually correspond to harder work happening inside the model.

The corpus suggests the honest answer is: sometimes genuinely, often not, and the two are surprisingly hard to tell apart from the outside. The most deflating evidence is that reasoning traces can be stylistic theater. When researchers fed models logically invalid or corrupted reasoning steps, performance held up almost as well as with valid ones (see "Do reasoning traces show how models actually think?"), meaning the visible chain-of-thought is closer to a persuasive performance than a faithful readout of the computation that produced the answer. In the same spirit, many models that appear to 'reason' about constraints are actually just defaulting to the harder, safer option: remove the constraints and twelve of fourteen models get *worse*, by up to 38.5 percentage points, because they were never evaluating anything; they were exploiting a conservative bias that happens to look like reasoning (see "Are models actually reasoning about constraints or just defaulting conservatively?").

What makes this trickier is that effort and difficulty are often mismatched in the wrong direction. Models can detect how hard a question is (difficulty is linearly decodable from their hidden states *before* they start reasoning), yet they fail to act on that signal and overthink easy questions anyway (see "Can models recognize question difficulty before they reason?"). And more thinking is not more reasoning: push thinking tokens from ~1,100 to ~16K and accuracy can fall from 87.3% to 70.3%, because models overthink the easy and underthink the genuinely hard (see "Does more thinking time always improve reasoning accuracy?"). So a long trace on a hard problem may signal flailing rather than depth.
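To make "linearly decodable" concrete, here is a minimal sketch of a linear probe. Everything in it is an assumption for illustration: the hidden states are synthetic vectors with a planted difficulty direction, and the helper names (`make_synthetic_states`, `fit_linear_probe`, `probe_predict`) are hypothetical, not taken from the cited work. With a real model, the rows of `hidden` would be activations captured before reasoning begins.

```python
import numpy as np

def make_synthetic_states(n, dim, seed):
    """Synthetic stand-in for pre-reasoning hidden states (assumption, not real data).

    Difficulty in [0, 1] is planted along one fixed direction, plus unit Gaussian noise.
    """
    direction = np.random.default_rng(0).normal(size=dim)  # same direction for every call
    direction /= np.linalg.norm(direction)
    rng = np.random.default_rng(seed)
    difficulty = rng.uniform(0.0, 1.0, size=n)
    hidden = rng.normal(size=(n, dim)) + 8.0 * difficulty[:, None] * direction
    return hidden, difficulty

def _with_bias(hidden):
    """Append a constant column so the probe can learn an intercept."""
    return np.hstack([hidden, np.ones((hidden.shape[0], 1))])

def fit_linear_probe(hidden, difficulty, ridge=1e-3):
    """Ridge-regularised least squares: difficulty ~ hidden @ w + b."""
    X = _with_bias(hidden)
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ difficulty)

def probe_predict(hidden, weights):
    """Apply the fitted probe to new hidden states."""
    return _with_bias(hidden) @ weights

# Fit on one sample, evaluate on a held-out sample.
train_h, train_d = make_synthetic_states(2000, 32, seed=1)
test_h, test_d = make_synthetic_states(500, 32, seed=2)
weights = fit_linear_probe(train_h, train_d)
held_out_corr = np.corrcoef(probe_predict(test_h, weights), test_d)[0, 1]
print(f"held-out correlation: {held_out_corr:.2f}")
```

A high held-out correlation is what "linearly decodable" means in practice. It only shows that the information is present in the representation; it says nothing about whether the model uses it, which is exactly the gap the paragraph above describes.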

There's also a question of what 'hard' even means. Models don't break at a complexity threshold so much as at an unfamiliarity boundary: they succeed on a reasoning chain, long or short, if they've seen similar instances, and fail on novel ones regardless of length (see "Do language models fail at reasoning due to complexity or novelty?"). That points toward instance-matching dressed as reasoning rather than a general algorithm that scales effort with difficulty. Relatedly, easy and hard problems reinforce *different* internal features during training: easy ones reward answer shortcuts and suppress deliberation, while hard ones activate genuine reasoning only on rare successes. As a result, identical accuracy gains can hide opposite internal changes (see "What reasoning features does each difficulty level reinforce?").

But it isn't all illusion, and the most interesting work is the attempt to measure real effort directly. The 'deep-thinking ratio' (DTR) is the proportion of tokens whose predictions are substantially revised as they pass through the model's layers, a signature of computation actually being reworked rather than echoed, and it correlates robustly with accuracy across hard math and science benchmarks (see "Can we measure how deeply a model actually reasons?"). That suggests genuine differential effort is real and *detectable*, just not from trace length. Other work shows the capability is latent in base models and merely elicited by training rather than created (see "Do base models already contain hidden reasoning ability?"), and that models can be taught to route between deep thinking and quick answers based on need (see "Can models learn when to think versus respond quickly?"). Both imply that the 'harder reasoning' is a real internal mode that can be switched on, not just narrated.
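One way to picture a layer-wise revision measure is the sketch below. It is an assumed operationalisation, not the published definition: it takes each token's next-token distribution at every layer, finds the depth at which that distribution settles to within a Jensen-Shannon divergence threshold of the final layer, and counts the token as "deep" if it settles in the later part of the stack. The function names, `threshold`, and `late_fraction` are all hypothetical choices for illustration.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) along the last axis."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m), axis=-1)
    kl_qm = np.sum(q * np.log(q / m), axis=-1)
    return 0.5 * (kl_pm + kl_qm)

def deep_thinking_ratio(layer_probs, threshold=0.1, late_fraction=0.5):
    """Fraction of tokens whose prediction is still being revised in late layers.

    layer_probs: array of shape (num_layers, num_tokens, vocab_size) holding the
    next-token distribution read out at each layer (assumed to be given).
    """
    num_layers = layer_probs.shape[0]
    final = layer_probs[-1]
    # Distance of every layer's prediction from the final-layer prediction.
    dist = js_divergence(layer_probs, final[None, :, :])   # (layers, tokens)
    unsettled = dist > threshold
    # Settle depth = index of the first layer after the last unsettled one
    # (0 means the prediction already matched the final one at the first layer).
    layer_number = np.arange(1, num_layers + 1)[:, None]
    settle_depth = np.max(np.where(unsettled, layer_number, 0), axis=0)
    deep = settle_depth >= late_fraction * num_layers
    return float(np.mean(deep))

# Toy example: 4 layers, 2 tokens, vocabulary of 2.
steady = np.array([0.9, 0.1])
flipped = np.array([0.1, 0.9])
toy = np.stack([
    np.stack([steady, steady]),   # layer 0
    np.stack([steady, steady]),   # layer 1
    np.stack([steady, steady]),   # layer 2
    np.stack([steady, flipped]),  # final layer: token 1 is revised at the end
])
toy_dtr = deep_thinking_ratio(toy)
print(toy_dtr)  # 0.5: one of the two tokens settles only at the last layer
```

In the toy input, the first token never changes and the second flips at the final layer, so half the tokens count as deep. The point of the design is that the measure ignores how many tokens were produced; a short trace with heavy late-layer revision scores higher than a long one that merely echoes early predictions.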

The thing you didn't know you wanted to know: the appearance of reasoning and the substance of reasoning are *measured by completely different instruments*. Trace length, confidence, and even logical validity of the visible steps are poor proxies — they can be faked or are simply uncorrelated with the answer. The real signal lives in the layer-wise revision of internal predictions. So 'does it reason harder, or just appear to?' isn't one question — it's two, and a model can score high on the appearance while the genuine-effort needle barely moves (and, occasionally, the reverse).

Sources (9 notes)

Do reasoning traces show how models actually think?

LLM reasoning traces function as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Can models recognize question difficulty before they reason?

Linear probes successfully decode difficulty from the representations of large reasoning models (LRMs) before reasoning begins, yet the models still overthink simple questions. This reveals an action-commitment failure rather than a perception failure.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

What reasoning features does each difficulty level reinforce?

Easy problems reinforce answer shortcuts while suppressing deliberation; hard problems activate reasoning features only on rare success; medium difficulty strengthens both simultaneously. Identical accuracy gains can reflect opposite internal changes.

Can we measure how deeply a model actually reasons?

Deep-thinking ratio (DTR) measures the proportion of tokens whose predictions undergo significant revision across model layers, correlating robustly with accuracy across AIME, HMMT, and GPQA benchmarks. Think@n, a test-time strategy using DTR, matches self-consistency performance while reducing inference costs.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
