Why do models automatically adjust reasoning length to problem difficulty?
This explores whether models genuinely scale their reasoning to match how hard a problem is — and the corpus mostly pushes back on that premise.
Longer traces aren't a thermostat tracking difficulty. Controlled maze experiments find that trace length correlates with difficulty only on problems close to what the model saw in training; push the problem out-of-distribution and the link breaks entirely. What looks like "thinking harder" is largely the model recalling how long similar training examples were (Does longer reasoning actually mean harder problems?). A companion finding reframes failure the same way: reasoning collapses not at some complexity threshold but at instance-level novelty, because models fit patterns from specific instances rather than running a general algorithm (Do language models fail at reasoning due to complexity or novelty?).
The deeper twist is that models can perceive difficulty; they just don't act on it. Linear probes can decode a question's difficulty from a reasoning model's hidden states *before* it writes a single token, yet the model still overthinks easy questions. That's an action-commitment failure, not a perception failure (Can models recognize question difficulty before they reason?). So the signal exists internally; the behavior just doesn't follow it. You see the cost of that disconnect when models pour redundant steps into ill-posed questions with missing premises that a non-reasoning model would simply flag as unanswerable: training rewards producing reasoning steps but never teaches a model when to stop (Why do reasoning models overthink ill-posed questions?).
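To make the probing idea concrete, here is a minimal sketch of a linear probe, assuming difficulty is encoded along a single direction in the hidden state. Everything in it is a stand-in: the vectors are synthetic noise plus a hypothetical "difficulty direction", not real model activations, and `make_synthetic_states`, `train_probe`, and `probe_accuracy` are illustrative names rather than anything from the cited work. It only shows what "a linear probe can decode difficulty" means mechanically.

```python
import math
import random


def make_synthetic_states(n, dim, rng):
    """Synthetic stand-ins for pre-generation hidden states (NOT real activations).

    Assumption: hard questions shift the state by +2 along a fixed unit
    direction and easy questions by -2, on top of unit Gaussian noise.
    """
    direction = [1.0 / math.sqrt(dim)] * dim  # hypothetical difficulty direction
    states, labels = [], []
    for i in range(n):
        hard = i % 2  # alternate easy (0) and hard (1) examples
        sign = 1.0 if hard else -1.0
        states.append([rng.gauss(0.0, 1.0) + sign * 2.0 * d for d in direction])
        labels.append(hard)
    return states, labels


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(states, labels, steps=200, lr=0.1):
    """Fit logistic regression (the linear probe) with full-batch gradient descent."""
    dim, n = len(states[0]), len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(states, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, states, labels):
    """Fraction of examples where the probe's sign matches the difficulty label."""
    correct = 0
    for x, y in zip(states, labels):
        predicted_hard = sum(wi * xi for wi, xi in zip(w, x)) + b > 0
        correct += int(predicted_hard == bool(y))
    return correct / len(labels)


rng = random.Random(0)
train_x, train_y = make_synthetic_states(200, 8, rng)
test_x, test_y = make_synthetic_states(100, 8, rng)
w, b = train_probe(train_x, train_y)
accuracy = probe_accuracy(w, b, test_x, test_y)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

High held-out accuracy here only says the label is linearly readable from the state. It says nothing about whether the model uses that information when deciding how long to reason, which is exactly the gap the paragraph above describes.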
Why does any difficulty-tracking show up at all, then? When it does, it tends to be an emergent byproduct of reward, not a designed feature. Accuracy follows an inverted-U against reasoning length: optimal length rises with task difficulty but falls as the model gets more capable, and RL training naturally drifts toward shorter chains as models improve, so simplicity emerges from the reward signal rather than being trained in explicitly (Why does chain of thought accuracy eventually decline with length?). Push past the sweet spot and accuracy actually drops: one benchmark fell from 87.3% to 70.3% as thinking tokens grew from ~1,100 to ~16K, the classic pattern of overthinking easy problems and underthinking hard ones (Does more thinking time always improve reasoning accuracy?).
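One way to see how an inverted-U with these two trends can arise is a toy model, sketched below. It is an illustrative assumption, not the model or data from the cited notes: a problem of total `difficulty` is split evenly across `n_steps`, each step succeeds with a probability that falls as its share of the load rises relative to `capability`, and every extra step carries a small independent chance of error (`step_noise`). All names and the functional form are hypothetical.

```python
import math


def toy_accuracy(n_steps, difficulty, capability, step_noise=0.02):
    """Toy accuracy for a chain of n_steps (an assumed form, not a fitted curve).

    log(accuracy) = n_steps * log(1 - step_noise) - difficulty**2 / (n_steps * capability**2)
    The first term penalises long chains; the second penalises chains too
    short to break the problem into manageable steps.
    """
    per_step_load = difficulty / (n_steps * capability)
    log_acc = n_steps * (math.log(1.0 - step_noise) - per_step_load ** 2)
    return math.exp(log_acc)


def best_length(difficulty, capability, max_steps=200):
    """Chain length (in steps) that maximises toy_accuracy."""
    return max(
        range(1, max_steps + 1),
        key=lambda n: toy_accuracy(n, difficulty, capability),
    )


for difficulty, capability in [(2.0, 1.0), (4.0, 1.0), (4.0, 2.0)]:
    n_star = best_length(difficulty, capability)
    print(f"difficulty={difficulty}, capability={capability}: best length = {n_star} steps")
```

Under these assumptions the optimum sits near `difficulty / (capability * sqrt(-log(1 - step_noise)))`, so it grows with difficulty and shrinks with capability, and accuracy falls off on both sides of it. The sketch is only meant to show that the qualitative pattern needs nothing more than a per-step error cost competing with a per-step load cost.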
The failure isn't usually too little compute; it's disorganized compute. Reasoning models "wander like tourists," exploring invalid paths and abandoning promising ones prematurely, so success probability decays exponentially with problem depth rather than being rescued by longer traces (Why do reasoning models abandon promising solution paths?; Why do reasoning LLMs fail at deeper problem solving?). And more context can hurt outright: padding inputs to just 3,000 tokens, well below the context limit, dropped accuracy from 92% to 68% (Does reasoning ability actually degrade with longer inputs?).
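The exponential decay with depth has a simple back-of-the-envelope form. As a sketch, assume each step of a solution is valid independently with the same probability; then solving a problem that needs `depth` correct steps in a row succeeds with that probability raised to the power `depth`. The independence assumption and the 0.9 figure are illustrative, not taken from the cited notes.

```python
def success_probability(per_step_validity, depth):
    """Chance that `depth` consecutive steps are all valid.

    Assumes steps are independent and equally reliable, which is a
    simplification made here for illustration.
    """
    return per_step_validity ** depth


for depth in (5, 20, 50):
    print(f"depth {depth:>2}: {success_probability(0.9, depth):.3f}")
# depth  5: 0.590
# depth 20: 0.122
# depth 50: 0.005
```

Even with 90% per-step reliability, medium-depth problems stay workable while deep ones become nearly hopeless, and adding more unreliable steps to the trace does not change the exponent's base.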
The most interesting corner is what it takes to make difficulty-adjustment real rather than incidental. The capability seems to already be latent: five independent methods (RL steering, critique tuning, decoding tweaks, SAE feature steering, and RLVR) all elicit reasoning that base models already contain, suggesting post-training selects reasoning rather than creating it (Do base models already contain hidden reasoning ability?). Building on that, one approach explicitly trains a model to *route* between extended thinking and a quick answer using decoupled RL, learning calibrated mode-selection without ever being handed difficulty labels (Can models learn when to think versus respond quickly?). The takeaway: genuine length-to-difficulty matching is something you have to deliberately train *for*, because left to default training it gets approximated by memorized trace lengths and conservative defaults, which only look like adaptive reasoning from the outside (Are models actually reasoning about constraints or just defaulting conservatively?).
Sources (12 notes)
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed if the model was trained on similar instances.
Linear probes successfully decode difficulty from LRM representations before reasoning begins, yet models still overthink simple questions. This reveals an action-commitment failure rather than a perception failure.
Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting that viable solution paths are found but abandoned prematurely.
Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.