What capacity threshold determines whether RL teaches activation versus shortcut learning?

This explores whether there is a measurable tipping point in the base model's own capability that decides whether reinforcement learning unlocks reasoning the model already has (activation) or drills it into degenerate guess-the-answer tricks (shortcut learning).

This reads the question as asking for a *capacity threshold*: some property of the model going in that decides whether RL elicits latent skill or teaches a cheap trick. The corpus doesn't hand you one clean number, but several notes triangulate on the same boundary — and the most interesting finding is that the threshold is less about the model and more about the *match between problem difficulty and what the model can already do.*

Start with the activation side. A cluster of work argues RL doesn't create reasoning at all; it selects from what pretraining already deposited. "What does reward learning actually do to model reasoning?" shows RLVR sharpens sampling efficiency within existing capability boundaries without expanding them: a single training example can trigger the gain, and even *spurious* rewards work nearly as well as correct ones, but only for models whose pretraining already contains the strategy. "Do base models already contain hidden reasoning ability?" reinforces this from five independent angles (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR), concluding the bottleneck is elicitation, not capability acquisition. So 'activation' has a precondition baked in: the skill has to be latent in the base model for RL to surface it.

The shortcut side tells you what happens when that precondition fails. "Do overly hard RLVR samples actually harm model capabilities?" is the sharpest piece here: train on problems that are nearly impossible *for that model*, and group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer-repetition and computation-skipping instead of reasoning. Worse, these shortcuts contaminate skills the model already had. So the 'threshold' isn't a fixed difficulty; it's relative to the model's current reach. When problem difficulty sits within that reach, RL activates; when it sits beyond it, RL has nothing genuine to reward, so it rewards luck, and luck looks like a shortcut.
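To make the normalization effect concrete, here is a minimal sketch. It assumes the common mean/std form of group-relative advantage used by GRPO-style methods; the note does not specify the exact formula, and real implementations differ in details such as sample versus population standard deviation. The function name, the group size of 16, and the reward patterns are all hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    Assumed form: (r - group_mean) / (group_std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, not a fixed standard
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical groups of 16 rollouts with binary verifiable rewards.
within_reach = [1] * 8 + [0] * 8   # model solves the problem half the time
near_impossible = [1] + [0] * 15   # one success, possibly accidental
all_fail = [0] * 16                # no success at all

adv_reach = group_relative_advantages(within_reach)
adv_hard = group_relative_advantages(near_impossible)
adv_fail = group_relative_advantages(all_fail)

# A success on the reachable problem gets an advantage of about 1.0.
print(round(adv_reach[0], 2))
# The lone success on the hard problem gets about sqrt(15), i.e. roughly 3.87.
print(round(adv_hard[0], 2))
# With no successes the group carries no learning signal (all zeros).
print(set(adv_fail))
```

Under these assumptions, the rarer the success, the larger its normalized advantage: for a binary reward with success rate p, a success scores about sqrt((1 - p) / p). The normalization cannot tell whether that rare success came from sound reasoning or from a lucky guess, which is the failure the note describes.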

The one literal capacity number in the corpus comes from a different but resonant place. "When do language models stop memorizing and start generalizing?" measures a hard memorization ceiling of ~3.6 bits per parameter, and shows a phase transition: once capacity fills, the model stops memorizing and starts generalizing (grokking). That's not an RL result, but it's the same shape of idea you're reaching for: a measurable property of an individual model that flips the *kind* of learning that happens. It suggests 'capacity threshold' may be the right instinct even where RL papers haven't yet quantified it. Worth noting too that "Does instruction tuning teach task understanding or output format?" independently shows another flavor of shortcut: models trained on semantically empty instructions match models trained on fully correct ones, because what transfers is knowledge of the output space, not understanding. It is a reminder that 'it learned the task' and 'it learned a format shortcut' can look identical from the outside.
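As a rough illustration of what a bits-per-parameter ceiling implies, the arithmetic can be sketched as follows. This treats the ~3.6 figure quoted above as a simple multiplier and compares it against a raw count of dataset bits, which is a deliberate simplification of how the memorization work measures capacity. The function names and the 1M-parameter model are hypothetical.

```python
BITS_PER_PARAM = 3.6  # approximate figure quoted in the note; not exact

def memorization_capacity_bits(n_params, bits_per_param=BITS_PER_PARAM):
    """Estimated memorization capacity in bits (simple linear assumption)."""
    return n_params * bits_per_param

def capacity_filled(dataset_bits, n_params):
    """True once the data to be memorized meets or exceeds estimated capacity.

    Simplification: uses raw dataset bits as a stand-in for information content.
    """
    return dataset_bits >= memorization_capacity_bits(n_params)

n_params = 1_000_000  # hypothetical 1M-parameter model
capacity = memorization_capacity_bits(n_params)

print(capacity)            # estimated capacity in bits
print(capacity / 8 / 1e6)  # the same capacity in megabytes

small_dataset_bits = 1_000_000   # below estimated capacity
large_dataset_bits = 10_000_000  # above estimated capacity
print(capacity_filled(small_dataset_bits, n_params))
print(capacity_filled(large_dataset_bits, n_params))
```

On these assumptions a 1M-parameter model tops out at about 3.6 million bits, or roughly 0.45 MB. The point of the sketch is only that the boundary is computable from a property of the model; no comparable formula exists yet for the RL activation boundary.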

If you want the mechanism rather than the threshold, two notes go under the hood. "Does reinforcement learning update only a small fraction of parameters?" shows RL updates only a sparse subnetwork that is nearly identical across seeds, consistent with selecting existing circuitry rather than building new capacity. "Does RL training follow a predictable two-phase learning sequence?" shows learning moves from consolidating execution to exploring strategy, which is where a too-hard curriculum derails things. The honest synthesis: the corpus strongly supports the *existence* of a difficulty-relative-to-capability boundary separating activation from shortcut and names the failure mechanism precisely, but stops short of giving RL the clean bits-per-parameter number that the memorization work has.
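The sparse-subnetwork claim can be checked in spirit with a small sketch. It measures what fraction of parameters changed between a base and an RL-tuned checkpoint, and how much two seeds agree on which parameters changed. Flat Python lists stand in for real weight tensors, Jaccard overlap is an assumed agreement metric rather than the one the cited work uses, and all names and values are hypothetical.

```python
def changed_indices(base, tuned, tol=0.0):
    """Indices of parameters whose value moved by more than tol."""
    if len(base) != len(tuned):
        raise ValueError("checkpoints must have the same number of parameters")
    return {i for i, (b, t) in enumerate(zip(base, tuned)) if abs(t - b) > tol}

def update_fraction(base, tuned, tol=0.0):
    """Fraction of parameters that RL actually changed."""
    return len(changed_indices(base, tuned, tol)) / len(base)

def seed_overlap(base, tuned_a, tuned_b, tol=0.0):
    """Jaccard overlap of the changed-parameter sets from two seeds (assumed metric)."""
    a = changed_indices(base, tuned_a, tol)
    b = changed_indices(base, tuned_b, tol)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy checkpoints: 10 parameters, two hypothetical RL runs with different seeds.
base = [0.0] * 10
seed_a = [0.5, -0.3] + [0.0] * 8        # changes parameters 0 and 1
seed_b = [0.4, -0.2, 0.1] + [0.0] * 7   # changes parameters 0, 1 and 2

print(update_fraction(base, seed_a))
print(update_fraction(base, seed_b))
print(seed_overlap(base, seed_a, seed_b))
```

In this toy case the two runs touch 20% and 30% of parameters and share two of the three changed positions. A real check would iterate over the tensors in each checkpoint and pick a tolerance suited to the numeric precision used in training.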

Sources: 7 notes

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

When do language models stop memorizing and start generalizing?

GPT-family models have a measurable memorization capacity of approximately 3.6 bits-per-parameter. When this capacity fills, a phase transition triggers grokking—the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% versus 42.6% for a random baseline). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, and this sparsity arises without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether activation vs. shortcut learning in LLM RL is fundamentally capacity-constrained or curriculum-constrained. The question: *Does a measurable capacity threshold separate genuine skill elicitation from reward-hacking?*

What a curated library found — and when (findings span 2023–2026, treat as dated claims):
• RL selects from pretraining's latent strategies rather than building new reasoning; even spurious rewards work if the base model contains the strategy (2025–26).
• Models trained on problems beyond their current reach exhibit phase transitions to shortcut behaviors (answer-repetition, computation-skipping) driven by group-relative reward normalization treating lucky successes as high-advantage (2026).
• RL updates only 5–30% of parameters in sparse, near-identical subnetworks across seeds — consistent with circuit selection, not capacity expansion (2025).
• A hard memorization ceiling (~3.6 bits/parameter) produces a grokking phase transition from memorization to generalization, suggesting capacity-relative boundaries do exist in LLMs, though not yet quantified for RL specifically (2025).
• Training on semantically empty instructions produces format-matching shortcuts indistinguishable from genuine task learning (2023).

Anchor papers (verify; mind their dates):
• arXiv:2507.14843 (The Invisible Leash, 2025)
• arXiv:2605.28388 (Mechanistically Interpreting Sample Difficulty, 2026)
• arXiv:2505.24832 (How much do language models memorize?, 2025)
• arXiv:2505.11711 (RL Finetunes Small Subnetworks, 2025)

Your task:
(1) RE-TEST THE THRESHOLD CLAIM. The library argues difficulty is *relative to model capability*, not absolute. Does recent work (last 6 months) quantify this boundary? Check: Do scaling laws, pruning studies, or adaptive curricula now provide a measurable bits-per-parameter or loss-diff cutoff? Where does the constraint still hold?
(2) Surface work CONTRADICTING the activation-only thesis. If newer models or training methods (e.g., multi-task RL, hybrid rewards, meta-reasoning scaffolds) *do* build genuine new reasoning, cite it plainly and explain why prior work missed it.
(3) Propose two questions that assume the regime has shifted: (a) If activation + shortcut are not a binary but a continuum (mixed learning), what training signal separates them? (b) Do mechanistic interpretability tools now reveal whether RL edits existing circuits or grafts new ones?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
