SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Model Architecture and Internals

Do overly hard RLVR samples actually harm model capabilities?

Explores whether training on problems beyond a model's competence band causes active regression rather than mere learning failures. Investigates whether group-relative normalization amplifies accidental successes into harmful shortcuts.

Synthesis note · 2026-05-28 · sourced from RLVR
What does reward learning actually do to model reasoning?

The damage from over-hard RLVR samples is not merely that the model fails to improve; it is active regression. When almost every rollout on a problem fails, the rare success is unlikely to be a genuinely good solution. It is more often a shortcut: an answer reached by skipping necessary computation, or a lucky guess. Group-relative normalization then treats that one trajectory as the high-advantage exemplar of the group and reinforces it. The model learns the shortcut, not the reasoning.
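The amplification step can be made concrete with a small sketch. It assumes binary verifier rewards (1 for a passed check, 0 otherwise) and a GRPO-style advantage that standardizes each reward by the group's mean and population standard deviation. The function name, the group size of 16, and the `eps` stabilizer are illustrative choices, not details taken from the note's sources.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one rollout group.

    Assumption: population standard deviation plus a small eps;
    real implementations may differ in both choices.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Over-hard prompt: 16 rollouts, exactly one (possibly accidental) success.
sparse = [1.0] + [0.0] * 15
adv_sparse = group_relative_advantages(sparse)

# Prompt inside the competence band: 8 of 16 rollouts succeed.
dense = [1.0] * 8 + [0.0] * 8
adv_dense = group_relative_advantages(dense)

print(f"lone success advantage:   {adv_sparse[0]:.3f}")
print(f"each failure (sparse):    {adv_sparse[1]:.3f}")
print(f"each success (dense):     {adv_dense[0]:.3f}")
```

Under these assumptions, a lone success in a group of G rollouts receives an advantage of about sqrt(G - 1), roughly 3.87 for G = 16, while each success in the half-solved group receives about 1. The single trajectory on the over-hard prompt is therefore weighted several times more heavily than any one of the well-supported successes, whether or not it reflects sound reasoning.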

The behavioral signature is concrete: answer repetition, skipping computation that the problem requires, and other degenerate patterns that look like reasoning collapse. More troubling, these effects do not stay local to the hard problems. They degrade the model's pre-existing capabilities, the things it could already do before training pushed it past its competence band. The internal-feature analysis corroborates this: hard problems activate reasoning-related features, but those features become useful only on the rare successful trajectory, so most of the gradient on hard samples reinforces the wrong activations.

Why it matters: the finding identifies a specific corruption channel rather than a generic "training instability." The villain is the interaction between a sparse-success reward landscape and group-relative normalization, which together turn statistical noise (an accidental success) into a learning target. This sharpens the case against naively harvesting hard examples and connects RLVR difficulty to the broader pattern where verifiable-reward training rewards trajectories that pass the check without doing the work. A defender might counter that some hard problems are exactly where capability frontiers expand. That counterpoint holds only when successful trajectories are sampled densely enough to outvote the shortcuts, which over-hard samples by definition fail to provide.
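One way to act on the density condition is to screen prompts by their empirical pass rate before they enter a training batch. The sketch below is a hypothetical filter, not a method described in the note's sources: the function names and the 0.2 and 0.8 thresholds are assumptions chosen purely for illustration.

```python
def pass_rate(rewards):
    """Fraction of rollouts in a group that passed the verifier."""
    return sum(1 for r in rewards if r > 0) / len(rewards)

def in_competence_band(rewards, low=0.2, high=0.8):
    """Keep a prompt only if successes are neither too rare nor near-universal.

    Assumption: `low` and `high` are illustrative thresholds, not tuned values.
    """
    return low <= pass_rate(rewards) <= high

# Hypothetical rollout groups of 16 samples each.
groups = {
    "easy": [1] * 16,               # every rollout succeeds: no learning signal
    "mid": [1] * 6 + [0] * 10,      # successes dense enough to outvote shortcuts
    "over_hard": [1] + [0] * 15,    # one success, possibly accidental
}

kept = [name for name, rewards in groups.items() if in_competence_band(rewards)]
print(kept)
```

With these assumed thresholds only the "mid" group is kept. The lower bound is the part that matters for the argument here: it excludes prompts whose single success would otherwise become the group's high-advantage exemplar.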

Original note title

overly hard rlvr samples induce degenerate behaviors and amplify shortcut trajectories degrading prior capability