What reasoning features does each difficulty level reinforce?

When models train on problems of different difficulty, do they build the same internal reasoning machinery or different kinds? This matters because accuracy gains alone hide what's actually being learned.

Synthesis note · 2026-05-28 · sourced from RLVR

Reward curves and advantage magnitudes tell you whether training is improving accuracy, but they are silent about what kind of reasoning is being reinforced. Reading RLVR (reinforcement learning with verifiable rewards) through a Temporal Sparse Autoencoder (T-SAE), which extracts sparse reasoning features from activations along the reasoning trajectory, exposes a structure that the scalar signals hide. Difficulty does not just change how much the model learns; it changes which internal features get strengthened and which get suppressed.
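To make "sparse features along the reasoning trajectory" concrete, here is a minimal sketch of the encoding step only. It assumes a generic top-k sparse autoencoder encoder applied to each token's activation vector and then averaged over the trajectory; the note does not specify the T-SAE architecture or its training objective, so the function names, the top-k rule, and the mean-pooling are illustrative assumptions rather than the source's method.

```python
import numpy as np

def encode_sparse(acts, W_enc, b_enc, k):
    """Encode per-token activations into sparse features (hypothetical helper).

    acts:  (T, d_model) activations for the T tokens of one reasoning trajectory
    W_enc: (d_model, n_features) encoder weights; b_enc: (n_features,) bias
    k:     number of features kept per token (k >= 1), an assumed sparsity rule
    Returns a (T, n_features) array with at most k nonzero entries per row.
    """
    pre = np.maximum(acts @ W_enc + b_enc, 0.0)  # ReLU pre-activations
    if k < pre.shape[1]:
        # Indices of the (n_features - k) smallest entries in each row.
        drop = np.argpartition(pre, -k, axis=1)[:, :-k]
        np.put_along_axis(pre, drop, 0.0, axis=1)  # zero everything but the top k
    return pre

def trajectory_profile(acts, W_enc, b_enc, k):
    """Average each feature over the trajectory to get one profile per rollout."""
    return encode_sparse(acts, W_enc, b_enc, k).mean(axis=0)
```

A profile like this is one vector per rollout, which is what allows feature activity to be compared across problems of different difficulty.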

The breakdown: easy problems mainly reinforce direct-answer and basic-computation features while actively suppressing deliberative-reasoning features — the model learns to shortcut because shortcutting works. Hard problems activate reasoning-related features, but those features become useful only on the rare successful trajectory, so most hard-sample updates do not consolidate them. Medium-difficulty problems provide a balanced signal, strengthening both computation and multi-step reasoning features at once. The same accuracy gain can therefore correspond to opposite internal changes depending on the difficulty of the data producing it.
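The comparison described above can be sketched as a per-bucket audit of feature shifts. This is a hedged illustration, not the source's procedure: it assumes difficulty is defined by a problem's pass rate, that the 0.2 and 0.8 thresholds are arbitrary placeholders, and that per-problem feature profiles have been measured at a checkpoint before and after training. The names `difficulty_bucket`, `family_shift_by_bucket`, and the family labels are hypothetical.

```python
import numpy as np

def difficulty_bucket(pass_rate, low=0.2, high=0.8):
    """Label a problem by pass rate; both thresholds are assumed, not from the note."""
    if pass_rate >= high:
        return "easy"
    if pass_rate <= low:
        return "hard"
    return "medium"

def family_shift_by_bucket(pass_rates, before, after, families):
    """Mean change in feature activation per feature family and difficulty bucket.

    pass_rates: (N,) pass rate of each problem
    before, after: (N, F) mean feature activations per problem at two checkpoints
    families: dict mapping a family name to a list of feature indices
    Returns {bucket: {family: mean shift}}; empty buckets are omitted.
    """
    delta = np.asarray(after, dtype=float) - np.asarray(before, dtype=float)
    buckets = np.array([difficulty_bucket(p) for p in pass_rates])
    result = {}
    for bucket in ("easy", "medium", "hard"):
        mask = buckets == bucket
        if not mask.any():
            continue
        result[bucket] = {
            name: float(delta[mask][:, idx].mean())
            for name, idx in families.items()
        }
    return result
```

Under the pattern this note describes, the easy bucket would show a positive shift for computation-type features and a negative one for deliberation-type features, while the medium bucket would show positive shifts for both. The family assignments themselves inherit the interpretive uncertainty of the feature labels.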

Why it matters: benchmark improvement is an ambiguous summary statistic. Two RLVR runs can post similar accuracy gains while one has built multi-step reasoning machinery and the other has sharpened answer-shortcutting and let deliberation atrophy. Only the feature-level view distinguishes them, and it is the basis for difficulty-adaptive interventions that target feature consolidation directly (e.g., feature-guided training signals). The connection to interpretability work is direct: this is the same SAE-feature lens used to steer or read reasoning, now used to audit what a training regime is silently rewarding. The limitation is that T-SAE features are themselves a learned, imperfect decomposition: the "reasoning feature" labels are interpretive, not ground truth.

Related concepts in this collection 4

Can we trigger reasoning without explicit chain-of-thought prompts?
This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.
Relation: same SAE-feature methodology; that note steers reasoning features, this one audits which features a difficulty regime reinforces.

Why do medium-difficulty problems teach reasoning better than hard ones?
Does harder always mean better for learning? This explores why easy and extremely hard samples produce weak training signals in RLVR, while medium-difficulty problems drive the strongest improvements.
Relation: the behavioral inverted-U whose mechanistic basis this note supplies: medium difficulty strengthens both feature families.

Do overly hard RLVR samples actually harm model capabilities?
Explores whether training on problems beyond a model's competence band causes active regression rather than mere learning failures. Investigates whether group-relative normalization amplifies accidental successes into harmful shortcuts.
Relation: the feature view explains the degeneration: hard-sample reasoning features consolidate only on rare success, leaving shortcut features dominant.

Do high-entropy tokens drive reasoning model improvements?
Explores whether only a small fraction of tokens (those with high entropy at decision points) actually matter for improving reasoning performance in language models, and whether training on them alone could work as well as full training.
Relation: complementary fine-grained lens on where RLVR's effect concentrates: tokens there, features here.

Original note title

different difficulty levels selectively reinforce or suppress distinct reasoning features invisible from advantage signals alone
