INQUIRING LINE

Why do models show performative reasoning on easy tasks but genuine reasoning on hard ones?

This explores whether models actually 'fake' reasoning on simple tasks while genuinely reasoning on hard ones — and the corpus suggests the premise itself is shaky: traces look performative across the board, but what changes with difficulty is how much real work the model needs to do behind them.


This reads as a question about a difficulty gradient — models seemingly going through the motions on easy problems and doing real work on hard ones. The corpus complicates that story in a useful way: the visible reasoning trace is mostly performance regardless of difficulty, but the underlying computation it sits on top of scales with how unfamiliar the problem is.

Start with the uncomfortable finding that reasoning traces are largely theater everywhere. Multiple notes show that the words a model writes while 'thinking' don't faithfully reflect what produced the answer: invalid logical steps perform nearly as well as valid ones (Do reasoning traces show how models actually think?); deliberately corrupted traces teach as well as correct ones and sometimes generalize better (Do reasoning traces need to be semantically correct?); and the intermediate tokens carry no special execution semantics, being generated like any other output and correlating with answers through learned formatting, not functional logic (Do reasoning traces actually cause correct answers?). Reflection rarely changes an initial answer; it's mostly confirmatory (Can we actually trust reasoning model outputs?). So 'performative on easy tasks' isn't a special failure mode; it's the default character of the trace.

What actually varies with difficulty is the work underneath. One reframing says reasoning failures aren't about complexity at all but about instance-level novelty: a model fits patterns from similar training instances, so a chain of any length succeeds when the problem resembles something seen before and breaks at the novelty boundary (Do language models fail at reasoning due to complexity or novelty?). Easy tasks tend to be familiar: pattern-match and the answer falls out, so the trace is decorative. Hard or novel tasks force the model past memorized scaffolding into genuinely transferable procedure, the kind drawn from broad procedural knowledge in pretraining rather than narrow fact recall (Does procedural knowledge drive reasoning more than factual retrieval?). Another angle: many apparent 'collapses' on hard problems are execution failures, not reasoning failures. The model knows the algorithm but can't carry out the steps at scale in text, and tool access removes the cliff (Are reasoning model collapses really failures of reasoning?).

There's also a sharper inversion of your premise hiding in the corpus. Models tend to overthink easy problems and underthink hard ones: accuracy is non-monotonic in thinking tokens, peaking and then declining as the model spends extra words on problems that didn't need them (Does more thinking time always improve reasoning accuracy?). That's the opposite of efficient: the most performative, padded reasoning often shows up on the easy cases. And some 'success' on constrained hard problems turns out to be a conservative default rather than reasoning at all. Most models do worse when constraints are removed, meaning they were leaning on a heuristic, not evaluating the problem (Are models actually reasoning about constraints or just defaulting conservatively?).
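To put rough numbers on that decline, the figures quoted in the thinking-time note under Sources (about 1,100 versus about 16K thinking tokens, 87.3% versus 70.3% accuracy) can be worked through directly. The short Python sketch below is only arithmetic on those two reported endpoints. The variable names are illustrative, the token counts are approximate in the source, and two points cannot show where the accuracy peak actually sits.

```python
# Endpoints as reported in the source note; the token counts are approximate there.
tokens_low, tokens_high = 1_100, 16_000   # thinking tokens
acc_low, acc_high = 87.3, 70.3            # benchmark accuracy, in percent

# Absolute drop in percentage points, and the same drop relative to the starting accuracy.
drop_pp = acc_low - acc_high
relative_drop = drop_pp / acc_low

# How much more thinking the longer setting used.
token_ratio = tokens_high / tokens_low

print(f"accuracy drop: {drop_pp:.1f} percentage points")   # 17.0
print(f"relative drop: {relative_drop:.1%}")                # 19.5%
print(f"token budget:  {token_ratio:.1f}x larger")          # 14.5x
```

Read this way, roughly fourteen times more thinking tokens came with about a fifth of the original accuracy lost, which is the sense in which extra narration can cost rather than help.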

The deeper resolution: reasoning capability already lives latent in base model activations, and post-training selects rather than creates it (Do base models already contain hidden reasoning ability?). That work can also happen in hidden states without being verbalized at all (Can models reason without generating visible thinking tokens?). So the genuine reasoning isn't really 'in' the visible trace on hard tasks either; it's in the compute the model is forced to recruit. The promising direction is teaching models to route: to spend extended thinking only when difficulty warrants it and answer directly otherwise, without needing difficulty labels (Can models learn when to think versus respond quickly?). Which suggests the real question isn't 'why is reasoning fake on easy tasks' but 'why do models narrate at all when the answer is already cheap.'
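The routing idea can be pictured at inference time as a two-mode dispatcher. The Python sketch below is a toy under stated assumptions, not the Thinkless method or its DeGRPO training: `route`, `p_think`, `RoutedAnswer`, the threshold, and the stub policy are all hypothetical names introduced for illustration. In a learned router, `p_think` would be the model's own trained estimate that extended thinking pays off; here it is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RoutedAnswer:
    mode: str    # "direct" or "think"
    answer: str


def route(
    prompt: str,
    p_think: Callable[[str], float],            # hypothetical: learned P(extended thinking helps)
    answer_direct: Callable[[str], str],        # cheap path: answer immediately
    answer_with_thinking: Callable[[str], str], # expensive path: long trace, then answer
    threshold: float = 0.5,                     # assumed cutoff, not taken from any paper
) -> RoutedAnswer:
    """Pick one mode per prompt, then run only that mode."""
    if p_think(prompt) >= threshold:
        return RoutedAnswer("think", answer_with_thinking(prompt))
    return RoutedAnswer("direct", answer_direct(prompt))


def stub_p_think(prompt: str) -> float:
    # Placeholder policy for the demo only: treats long prompts as "hard".
    # A real router would learn this signal rather than count words.
    return 0.9 if len(prompt.split()) > 12 else 0.1


result = route(
    "What is 2 + 2?",
    p_think=stub_p_think,
    answer_direct=lambda p: "4",
    answer_with_thinking=lambda p: "4 (after a long trace)",
)
print(result.mode, result.answer)  # direct 4
```

The design point is that the mode decision is made once, up front, and the unused path is never run, so cheap questions do not pay for narration they did not need.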


Sources (12 notes)

Do reasoning traces show how models actually think?

LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Do reasoning traces actually cause correct answers?

R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, which indicates that traces are not causally necessary: they correlate with answers via learned formatting, not functional reasoning.

Can we actually trust reasoning model outputs?

Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning researcher evaluating whether models truly reason differently across difficulty levels. The question: do models perform reasoning (generate convincing traces) on easy tasks while executing genuine reasoning (computational work) on hard ones—or is this dichotomy already obsolete?

What a curated library found—and when (findings span 2024–2026; dated claims, not current truth):
• Reasoning traces are largely performative everywhere: invalid logical steps and deliberately corrupted traces perform nearly as well as valid ones; intermediate tokens correlate with answers through learned formatting, not functional logic (2025).
• Reasoning failures track instance-level novelty, not task complexity—models pattern-match on familiar problems (easy or hard) and break at the unfamiliarity boundary, forcing use of broad procedural knowledge rather than narrow memorization (2024–2025).
• Accuracy is non-monotonic in thinking tokens; models often overthink easy problems and underthink hard ones, peaking then declining as token count rises (2025).
• Genuine reasoning capability is latent in base model activations and hidden states; post-training selects rather than creates it; extended computation can occur without verbalization (2025–2026).
• Models should learn to route: spend extended thinking only when difficulty warrants it, without needing explicit difficulty labels (2025).

Anchor papers (verify; mind their dates):
• arXiv:2411.12580 (2024) – Procedural Knowledge in Pretraining Drives Reasoning
• arXiv:2604.15726 (2026) – LLM Reasoning Is Latent, Not the Chain of Thought
• arXiv:2509.07339 (2025) – Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity
• arXiv:2603.29025 (2026) – The Model Says Walk: How Surface Heuristics Override Implicit Constraints

Your task:
(1) RE-TEST THE DICHOTOMY. Each claim above assumes a fixed model regime. Have architectural changes (RNNs, hierarchical memory, continuous latent compute), new training methods (e.g., reinforcement learning on routing decisions), or evaluation harnesses (e.g., adversarial prompt injections, hidden constraint tests) since relaxed or inverted the easy/hard boundary? Cite what changed the regime, and say plainly where the dichotomy still appears to hold.
(2) Surface the strongest CONTRADICTING work from the last ~6 months—papers claiming models *do* reason differently by difficulty, or that visible traces *are* functionally faithful, or that routing doesn't matter. Flag disagreements on mechanism (heuristic vs. procedural, latent vs. verbalized).
(3) Propose 2 research questions that assume the regime has moved: e.g., "If latent reasoning is primary, what role does verbalization play in multi-step tasks requiring human feedback?" or "If routing is learnable, do models converge on the same routing policy across diverse domains?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
