How do reasoning-related features behave when trained on near-impossible problems?
This explores what happens inside reasoning models — specifically which internal features get strengthened or weakened — when the training problems are so hard the model almost never solves them, and why that regime behaves differently from training on easy or medium problems.
The question is not whether the model gets the answer right but which internal features the training reinforces. The sharpest finding in the corpus is that difficulty isn't a single dial that makes learning more or less of the same thing. Easy, medium, and hard problems reinforce different, sometimes *opposite*, internal changes: easy problems strengthen answer shortcuts and actively suppress deliberation, medium problems strengthen both shortcuts and reasoning together, and hard problems activate genuine reasoning features only on the rare occasions the model happens to succeed (see: What reasoning features does each difficulty level reinforce?). That last clause is the crux of your question. On near-impossible problems the reasoning-reinforcing signal is real but starved: it fires only on the sparse successes. More generally, two training runs with identical accuracy gains can be carving the model's internals in completely different directions.
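To make "real but starved" concrete, here is a minimal sketch. It assumes a GRPO-style setup in which each problem gets a group of sampled rollouts with binary rewards, and each rollout's advantage is its reward minus the group mean. The notes do not specify that setup, and the names `group_advantages` and `chance_of_any_signal` are hypothetical, chosen only for illustration.

```python
def group_advantages(rewards):
    """Center rewards on the group mean (assumed GRPO-style baseline)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def chance_of_any_signal(pass_rate, group_size):
    """Probability that a group has mixed outcomes, i.e. nonzero advantages.

    Assumes rollouts succeed independently with probability `pass_rate`.
    """
    all_fail = (1.0 - pass_rate) ** group_size
    all_pass = pass_rate ** group_size
    return 1.0 - all_fail - all_pass


if __name__ == "__main__":
    # Every rollout fails: all advantages are zero, so nothing is reinforced.
    print(group_advantages([0, 0, 0, 0]))
    # One rare success: only now is the successful trace pushed up.
    print(group_advantages([0, 0, 1, 0]))
    # The share of groups that carry any signal shrinks with the pass rate.
    for p in (0.5, 0.1, 0.01):
        print(p, chance_of_any_signal(p, group_size=8))
```

Under these assumptions a problem the model almost never solves contributes a gradient only in the occasional group that contains a success, which is one way to read the claim that the signal fires only on sparse successes.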
Why is the success signal so sparse on hard problems? Because reasoning models don't search systematically; they wander. Their exploration lacks validity, effectiveness, and necessity, so success probability drops *exponentially* with problem depth. Medium problems stay solvable, but deep ones become catastrophically harder, which means the model almost never lands the rare correct trace that would reinforce real reasoning (see: Why do reasoning LLMs fail at deeper problem solving?). Worse, the part of the gradient that actually teaches reasoning is concentrated in a small minority of tokens: only ~20% are high-entropy 'forking points' where a real decision happens, and those carry essentially all the learning signal (see: Do high-entropy tokens drive reasoning model improvements?). On a near-impossible problem the model rarely reaches those forks correctly, so the signal that would sculpt reasoning features barely registers.
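Both ideas in this paragraph can be sketched in a few lines. The block below is an illustration under stated assumptions, not the method used in the cited notes. It models the exponential drop as independent steps that each succeed with the same probability (a simplification), and it picks "forking" tokens as the top 20% of positions by next-token entropy, mirroring the ~20% figure. All function names are hypothetical.

```python
import math


def success_probability(step_accuracy, depth):
    """Chance that all `depth` steps are right, assuming independent steps."""
    return step_accuracy ** depth


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_token_mask(distributions, top_fraction=0.2):
    """Mark the top `top_fraction` of positions by entropy as forking points."""
    entropies = [token_entropy(d) for d in distributions]
    keep = max(1, round(top_fraction * len(entropies)))
    # Rank positions from highest to lowest entropy and keep the first `keep`.
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    chosen = set(ranked[:keep])
    return [i in chosen for i in range(len(entropies))]


if __name__ == "__main__":
    # With 90% per-step accuracy, deeper problems get rapidly harder.
    for depth in (5, 20, 50):
        print(depth, success_probability(0.9, depth))
    # Five positions; only the most uncertain one is flagged as a fork.
    dists = [[1.0], [0.5, 0.5], [0.9, 0.1], [0.99, 0.01], [1.0]]
    print(forking_token_mask(dists))
```

A training loop that wanted to update only on forking tokens could multiply each position's loss by this mask. Whether that matches the cited result depends on details the notes do not give.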
Here's the part you might not expect: training on the impossibly hard end may not be teaching new reasoning at all. Across five independent methods, post-training appears to *select* reasoning that's already latent in the base model rather than create it; the bottleneck is elicitation, not capability acquisition (see: Do base models already contain hidden reasoning ability?). And when models fail on hard instances, the failure is usually not about complexity in the abstract but about instance-level *unfamiliarity*. Models fit instance-based patterns rather than general algorithms, so a problem feels 'impossible' precisely when it's far from anything seen in training (see: Do language models fail at reasoning due to complexity or novelty?). That reframes 'near-impossible' as 'far out-of-distribution' more than 'intrinsically deep.'
That reframing has a strange downstream consequence for how we read the model's behavior. Out of distribution, the usual signals decouple: chain-of-thought length stops tracking difficulty and instead reflects how close the problem is to a remembered training schema (see: Does longer reasoning actually mean harder problems?). So a model thrashing on a near-impossible problem may produce long traces not because it's reasoning harder, but because it's pattern-matching against the wrong recalled template. And some of what looks like a reasoning collapse on hard problems turns out to be an *execution* ceiling: text-only models can't carry out enough steps even when they know the algorithm, but given tools they clear the supposed cliff (see: Are reasoning model collapses really failures of reasoning?).
The thread tying these together, and the thing worth taking away, is that the hard-problem regime is the one where appearances and internals diverge most. Models can decode a problem's difficulty in their hidden states before they even start, then override that perception and overthink anyway (see: Can models recognize question difficulty before they reason?). They also keep generating reasoning steps for genuinely unanswerable questions, because training rewarded producing steps but never taught when to stop (see: Why do reasoning models overthink ill-posed questions?). So training on near-impossible problems doesn't reliably build deeper reasoning: it sparsely reinforces real reasoning features, drowns them in shortcut and overthinking signals, and produces outward behavior (long traces, confident effort) that no longer reflects what's happening inside.
Sources (9 notes)
Easy problems reinforce answer shortcuts while suppressing deliberation; hard problems activate reasoning features only on rare success; medium difficulty strengthens both simultaneously. Identical accuracy gains can reflect opposite internal changes.
Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed if the model was trained on similar instances.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
Linear probes successfully decode difficulty from LRM representations before reasoning begins, yet models still overthink simple questions. This reveals an action-commitment failure rather than a perception failure.
Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.