What reasoning features does each difficulty level reinforce?
When models train on problems of different difficulty, do they build the same internal reasoning machinery or different kinds? This matters because accuracy gains alone hide what's actually being learned.
Reward curves and advantage magnitudes tell you whether training is improving accuracy, but they are silent about what kind of reasoning is being reinforced. Reading reinforcement learning with verifiable rewards (RLVR) through a Temporal Sparse Autoencoder (T-SAE), which extracts sparse reasoning features from activations along the reasoning trajectory, exposes a structured story that the scalar signals hide. Difficulty does not just change how much the model learns; it changes which internal features get strengthened and which get suppressed.
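As a rough illustration of what "extracting sparse features along the trajectory" can look like, here is a minimal sketch. It assumes a plain ReLU sparse-autoencoder encoder rather than the temporal variant, and every name in it (`sae_encode`, `trajectory_feature_profile`, the random weights and shapes) is a hypothetical stand-in, not the method's actual implementation.

```python
import numpy as np

def sae_encode(acts, W_enc, b_enc):
    """ReLU encoder of a plain sparse autoencoder.

    acts:  (T, d_model) activations, one row per token position.
    Returns (T, n_features) non-negative feature activations.
    """
    return np.maximum(acts @ W_enc + b_enc, 0.0)

def trajectory_feature_profile(acts, W_enc, b_enc):
    """Mean activation of each feature over one reasoning trajectory."""
    return sae_encode(acts, W_enc, b_enc).mean(axis=0)

# Random stand-ins for real model activations and trained SAE weights.
rng = np.random.default_rng(0)
T, d_model, n_features = 12, 16, 64
acts = rng.normal(size=(T, d_model))
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
b_enc = np.full(n_features, -1.0)  # negative bias keeps most features off

codes = sae_encode(acts, W_enc, b_enc)
profile = trajectory_feature_profile(acts, W_enc, b_enc)
active_fraction = float((codes > 0).mean())
```

The per-trajectory `profile` vector is the kind of object that can be compared before and after training; a scalar reward carries no such per-feature breakdown.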
The breakdown: easy problems mainly reinforce direct-answer and basic-computation features while actively suppressing deliberative-reasoning features — the model learns to shortcut because shortcutting works. Hard problems activate reasoning-related features, but those features become useful only on the rare successful trajectory, so most hard-sample updates do not consolidate them. Medium-difficulty problems provide a balanced signal, strengthening both computation and multi-step reasoning features at once. The same accuracy gain can therefore correspond to opposite internal changes depending on the difficulty of the data producing it.
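The comparison described above can be sketched as a small audit: measure how each feature family's mean activation shifts between a pre-training and post-training checkpoint, separately per difficulty bucket. This is only a sketch under assumptions. The feature-family index sets, the function names, and the toy numbers are all made up to mirror the qualitative pattern in the text; they are not measurements.

```python
import numpy as np

def feature_shift(before, after):
    """Per-feature change in mean activation: (n_traj, F) arrays -> (F,)."""
    return after.mean(axis=0) - before.mean(axis=0)

def summarize(shift, groups):
    """Average the per-feature shift inside each named feature family."""
    return {name: float(shift[idx].mean()) for name, idx in groups.items()}

# Hypothetical feature families; a real audit would use labeled SAE features.
groups = {"direct_answer": [0, 1], "deliberative": [2, 3]}

# Toy activations (4 trajectories x 4 features), constructed by hand to
# illustrate the pattern described in the text. Not experimental data.
before = np.ones((4, 4))
after_by_difficulty = {
    "easy":   before + np.array([0.5, 0.5, -0.5, -0.5]),   # shortcut up, deliberation down
    "medium": before + np.array([0.5, 0.5, 0.5, 0.5]),     # both families strengthened
    "hard":   before + np.array([0.0, 0.0, 0.125, 0.125]), # weak consolidation
}

report = {
    level: summarize(feature_shift(before, after), groups)
    for level, after in after_by_difficulty.items()
}
```

Reading `report` per difficulty level, rather than a single accuracy delta, is what lets two runs with similar accuracy gains be told apart.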
Why it matters: benchmark improvement is an ambiguous summary statistic. Two RLVR runs can post similar accuracy gains while one has built multi-step reasoning machinery and the other has sharpened answer-shortcutting and let deliberation atrophy. The feature-level view is what distinguishes them, and it is the basis for difficulty-adaptive interventions that target feature consolidation directly (e.g., feature-guided training signals). The connection to interpretability work is direct: this is the same SAE-feature lens that lets you steer or read reasoning, now used to audit what a training regime is silently rewarding. The limitation is that T-SAE features are themselves a learned, imperfect decomposition, so the "reasoning feature" labels are interpretive, not ground truth.
Inquiring lines that use this note as a source (9)
- How do task difficulty and skill type interact in model performance?
- Why do models overthink easy problems and underthink difficult ones?
- Why do difficult problems force models to develop reasoning strategies?
- How does a challenger's escalating difficulty function as curriculum?
- Why do medium-difficulty problems produce more stable learning gains?
- How do difficulty metrics relate to the true value of training examples?
- How do reasoning-related features behave when trained on near-impossible problems?
- Do models genuinely reason harder on difficult tasks or just appear to?
- How does question difficulty and breadth affect what models learn to reason?
Related concepts in this collection (4)
- Can we trigger reasoning without explicit chain-of-thought prompts?
  This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.
  same SAE-feature methodology; that note steers reasoning features, this one audits which features a difficulty regime reinforces
- Why do medium-difficulty problems teach reasoning better than hard ones?
  Does harder always mean better for learning? This explores why easy and extremely hard samples produce weak training signals in RLVR, while medium-difficulty problems drive the strongest improvements.
  the behavioral inverted-U whose mechanistic basis this note supplies: medium difficulty strengthens both feature families
- Do overly hard RLVR samples actually harm model capabilities?
  Explores whether training on problems beyond a model's competence band causes active regression rather than mere learning failures. Investigates whether group-relative normalization amplifies accidental successes into harmful shortcuts.
  the feature view explains the degeneration: hard-sample reasoning features consolidate only on rare success, leaving shortcut features dominant
- Do high-entropy tokens drive reasoning model improvements?
  Explores whether only a small fraction of tokens (those with high entropy at decision points) actually matter for improving reasoning performance in language models, and whether training on them alone could work as well as full training.
  complementary fine-grained lens on where RLVR's effect concentrates: tokens there, features here
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
- Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and correctness in LLMs
- Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
- Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning
- Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges
- Base Models Know How to Reason, Thinking Models Learn When
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
Original note title
different difficulty levels selectively reinforce or suppress distinct reasoning features invisible from advantage signals alone