INQUIRING LINE

How do reasoning-related features behave when trained on near-impossible problems?

This explores what happens inside reasoning models — specifically which internal features get strengthened or weakened — when the training problems are so hard the model almost never solves them, and why that regime behaves differently from training on easy or medium problems.


This explores what happens inside reasoning models when they're trained on problems they can barely solve: not whether they get the answer right, but which internal features the training reinforces. The sharpest finding in the corpus is that difficulty isn't a single dial that makes learning more or less of the same thing. Easy, medium, and hard problems reinforce *opposite* internal changes: easy problems strengthen answer shortcuts and actively suppress deliberation, medium problems strengthen both shortcuts and reasoning together, and hard problems activate genuine reasoning features only on the rare occasions the model happens to succeed (see "What reasoning features does each difficulty level reinforce?"). That last clause is the crux of your question. On near-impossible problems the reasoning-reinforcing signal is real but starved: it fires only on the sparse successes, so two training runs with identical accuracy gains can be carving the model's internals in completely different directions.

Why is the success signal so sparse on hard problems? Because reasoning models don't search systematically; they wander. Success probability drops *exponentially* with problem depth, since the models lack validity, effectiveness, and necessity in how they explore. Medium problems stay solvable, but deep ones become catastrophically harder, which means the model almost never lands the rare correct trace that would reinforce real reasoning (see "Why do reasoning LLMs fail at deeper problem solving?"). Worse, the part of the gradient that actually teaches reasoning is concentrated in a small minority of tokens: only ~20% are high-entropy 'forking points' where a real decision happens, and those carry essentially all the learning signal (see "Do high-entropy tokens drive reasoning model improvements?"). On a near-impossible problem the model rarely reaches those forks correctly, so the signal that would sculpt reasoning features barely registers.
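The arithmetic behind that sparsity can be sketched in a few lines. This is a toy model, not the cited paper's analysis: it assumes every reasoning step succeeds independently with the same probability, and the names `step_success`, `depth`, and `rollouts`, along with the value 0.9, are illustrative assumptions.

```python
def trace_success_prob(step_success: float, depth: int) -> float:
    """Probability that all `depth` steps succeed, assuming independent steps."""
    return step_success ** depth


def prob_any_success(p_trace: float, rollouts: int) -> float:
    """Probability that at least one of `rollouts` independent samples succeeds."""
    return 1.0 - (1.0 - p_trace) ** rollouts


if __name__ == "__main__":
    # Hypothetical per-step success rate; only the trend across depths matters.
    for depth in (2, 8, 32, 128):
        p = trace_success_prob(0.9, depth)
        print(f"depth={depth:4d}  p_trace={p:.6f}  "
              f"p_any_of_16={prob_any_success(p, 16):.6f}")
```

Under these assumptions the per-trace success rate shrinks geometrically with depth, and the chance that a batch of sampled rollouts contains even one rewarded trace shrinks with it. That is the sense in which the reasoning signal is starved on deep problems.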

Here's the part you might not expect: training on the impossibly hard end may not be teaching new reasoning at all. Across five independent methods, post-training appears to *select* reasoning that's already latent in the base model rather than create it; the bottleneck is elicitation, not capability acquisition (see "Do base models already contain hidden reasoning ability?"). And when models fail on hard instances, the failure is usually not about complexity in the abstract: it's instance-level *unfamiliarity*. Models fit instance-based patterns rather than general algorithms, so a problem feels 'impossible' precisely when it's far from anything seen in training (see "Do language models fail at reasoning due to complexity or novelty?"). That reframes 'near-impossible' as 'far out-of-distribution' more than 'intrinsically deep.'

That reframing has a strange downstream consequence for how we read the model's behavior. Out of distribution, the usual signals decouple: chain-of-thought length stops tracking difficulty and instead reflects how close the problem is to a remembered training schema (see "Does longer reasoning actually mean harder problems?"). So a model thrashing on a near-impossible problem may produce long traces not because it's reasoning harder, but because it's pattern-matching against the wrong recalled template. And some of what looks like a reasoning collapse on hard problems turns out to be an *execution* ceiling: text-only models can't carry out enough steps even when they know the algorithm; give them tools and they clear the supposed cliff (see "Are reasoning model collapses really failures of reasoning?").

The thread tying these together, and the thing worth taking away, is that the hard-problem regime is the one where appearances and internals diverge most. Models can decode a problem's difficulty in their hidden states before they even start, then override that perception and overthink anyway (see "Can models recognize question difficulty before they reason?"); they keep generating reasoning steps for genuinely unanswerable questions because training rewarded producing steps but never taught when to stop (see "Why do reasoning models overthink ill-posed questions?"). So training on near-impossible problems doesn't reliably build deeper reasoning: it sparsely reinforces real reasoning features, drowns them in shortcut and overthinking signals, and produces outward behavior (long traces, confident effort) that no longer reflects what's happening inside.


Sources (9 notes)

What reasoning features does each difficulty level reinforce?

Easy problems reinforce answer shortcuts while suppressing deliberation; hard problems activate reasoning features only on rare success; medium difficulty strengthens both simultaneously. Identical accuracy gains can reflect opposite internal changes.

Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
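A minimal sketch of what selecting forking tokens by entropy could look like. The note does not specify the selection procedure, so the per-sequence top-fraction rule, the function names, and the default `keep_fraction=0.2` are assumptions made for illustration, not the paper's implementation.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_mask(distributions, keep_fraction=0.2):
    """Return a 0/1 mask marking the highest-entropy positions in a sequence.

    `distributions` holds one next-token distribution per position.
    Assumption: the top `keep_fraction` of positions by entropy are kept.
    """
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(keep_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    top = set(ranked[:k])
    return [1 if i in top else 0 for i in range(len(entropies))]


def masked_mean_loss(per_token_loss, mask):
    """Average the loss over masked-in positions only."""
    kept = [loss for loss, m in zip(per_token_loss, mask) if m]
    return sum(kept) / len(kept)
```

In a training loop, a mask like this would zero out the loss contribution of low-entropy positions so that only the forking tokens receive gradient.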

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed if the model was trained on similar instances.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Can models recognize question difficulty before they reason?

Linear probes successfully decode difficulty from LRM representations before reasoning begins, yet models still overthink simple questions. This reveals an action-commitment failure rather than a perception failure.
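For readers unfamiliar with probing, here is a minimal sketch of a linear probe, assuming difficulty is a binary label and hidden states are plain vectors. The data is synthetic and every name is hypothetical; it shows only the mechanics (a logistic-regression readout fitted on fixed representations), not the cited experiment.

```python
import math
import random


def make_synthetic_states(n, seed=0):
    """Synthetic stand-in for hidden states: the label shifts the first axis."""
    rng = random.Random(seed)
    states, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)  # 1 = "hard", 0 = "easy" (hypothetical labels)
        x = [rng.gauss(2.0 * y - 1.0, 0.5), rng.gauss(0.0, 1.0)]
        states.append(x)
        labels.append(y)
    return states, labels


def train_linear_probe(states, labels, lr=0.5, epochs=200):
    """Fit logistic-regression weights by full-batch gradient descent."""
    dim, n = len(states[0]), len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(states, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted prob minus label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_predict(w, b, x):
    """Predict 1 if the linear readout is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

A probe that reaches high held-out accuracy shows the information is linearly readable from the representation. It says nothing about whether the model acts on it, which is the gap this note describes.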

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-model mechanist. The question: **Do models trained on near-impossible problems develop genuine reasoning features, or just sparse shortcuts and overthinking patterns?** Treat the following as dated claims (2022–2026) needing re-test.

**What a curated library found — and when:**
Findings span 2022–2026; note these are perishable constraints, not current truth.
- Difficulty is not a single dial: easy problems suppress deliberation, medium reinforce both shortcuts and reasoning, hard problems activate real reasoning only on rare successes (~2025–2026).
- Reasoning models search randomly, not systematically; success probability drops exponentially with depth, starving the reasoning signal on near-impossible problems (~2025).
- Only ~20% of tokens are high-entropy 'forking points' carrying the learning signal; on hard problems the model rarely reaches those forks correctly (~2025–2026).
- Post-training *selects* latent reasoning in base models rather than creating it; the bottleneck is elicitation, not acquisition (~2025–2026).
- Hard-problem failures are instance-level unfamiliarity, not task-level complexity; models fit patterns, not algorithms (~2025–2026).
- Chain-of-thought trace length reflects training-distribution proximity, not problem difficulty; long traces on near-impossible problems may signal pattern-matching against wrong templates (~2025).
- Performance collapses on hard problems are often execution failures (text-only limits), not reasoning failures; tools relax the cliff (~2025).

**Anchor papers (verify; mind their dates):**
- 2025-05 arXiv:2505.20296 *Reasoning LLMs are Wandering Solution Explorers*
- 2025-06 arXiv:2506.01939 *Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning*
- 2025-09 arXiv:2509.07339 *Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity*
- 2026-02 arXiv:2602.06176 *Large Language Model Reasoning Failures*

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For every claim above, assess whether newer models (o1, o3 variants), search/sampling methods (beam search, tree search, process reward models), tools (code executors, symbolic engines), or multi-agent orchestration have since RELAXED or OVERTURNED it. Separate the durable question (e.g., *do models reason or pattern-match?*) from the perishable limitation (e.g., *text-only models cannot carry deep reasoning*). Cite what resolved each constraint; state plainly where it still holds.

(2) **SURFACE THE STRONGEST CONTRADICTING OR SUPERSEDING WORK** from the last ~6 months. If newer papers show systematic search *does* emerge, or that hard-problem training *does* build generalizable reasoning, or that trace length *does* correlate with genuine reasoning under new evaluation, flag it explicitly.

(3) **PROPOSE TWO RESEARCH QUESTIONS** that assume the regime may have moved:
   - If execution and elicitation were no longer bottlenecks, what would distinguish genuine reasoning from sophisticated pattern-matching?
   - Under what training or inference regime does instance-level unfamiliarity stop predicting failure?

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
