What capacity threshold determines whether RL teaches activation versus shortcut learning?
This explores whether there's a measurable tipping point — in the base model's own capability — that decides if reinforcement learning unlocks reasoning the model already has (activation) versus drilling it into degenerate guess-the-answer tricks (shortcut learning).
This reads the question as asking for a *capacity threshold*: some property of the model going in that decides whether RL elicits latent skill or teaches a cheap trick. The corpus doesn't hand you one clean number, but several notes triangulate on the same boundary — and the most interesting finding is that the threshold is less about the model and more about the *match between problem difficulty and what the model can already do.*
Start with the activation side. A cluster of work argues RL doesn't create reasoning at all; it selects from what pretraining already deposited. "What does reward learning actually do to model reasoning?" shows RLVR sharpens sampling efficiency within existing capability boundaries without expanding them: a single training example can trigger the gain, and even *spurious* rewards work nearly as well as correct ones, but only for models whose pretraining already contains the strategy. "Do base models already contain hidden reasoning ability?" reinforces this from five independent angles (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR), concluding the bottleneck is elicitation, not capability acquisition. So 'activation' has a precondition baked in: the skill has to be latent in the base model for RL to surface it.
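To make "sharpens sampling efficiency without expanding boundaries" concrete, here is a minimal sketch using the standard unbiased pass@k estimator. It assumes the claim is measured with pass@k, which the notes here don't state, and every sample count below is invented for illustration rather than taken from any paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: chance that at least one of k samples,
    drawn without replacement from n samples with c correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k hits a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts (not from the source): correct samples out of n = 100.
n = 100
cases = {
    "latent skill, base model": 5,    # rare successes before RL
    "latent skill, after RL": 40,     # same problem, sharpened sampling
    "no latent skill, base model": 0, # nothing to select from
    "no latent skill, after RL": 0,   # RL cannot sharpen what is absent
}

for label, c in cases.items():
    print(f"{label}: pass@1={pass_at_k(n, c, 1):.3f}  pass@64={pass_at_k(n, c, 64):.3f}")
```

Under these made-up numbers, RL moves pass@1 a lot on the problem the base model could already occasionally solve, while pass@64 barely changes, and the problem with no latent solution stays at zero either way. That is the pattern "selection, not creation" predicts.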
The shortcut side tells you what happens when that precondition fails. "Do overly hard RLVR samples actually harm model capabilities?" is the sharpest piece here: train on problems that are nearly impossible *for that model*, and group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer-repetition and computation-skipping instead of reasoning. Worse, these shortcuts contaminate skills the model already had. So the 'threshold' isn't a fixed difficulty; it's relative to the model's current reach. Within that reach, RL activates; beyond it, RL has nothing genuine to reward, so it rewards luck, and luck looks like a shortcut.
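The normalization effect can be sketched in a few lines. This is a simplified GRPO-style advantage, assuming binary rewards, a population standard deviation, and a small `eps` for stability; real implementations differ in these details, and the group sizes are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward by its group's mean and std.
    Assumption: population std plus eps; libraries vary on this."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical groups of 16 rollouts on one prompt each.
too_hard = [1.0] + [0.0] * 15      # one lucky success out of 16
matched = [1.0] * 8 + [0.0] * 8    # problem within the model's reach
impossible = [0.0] * 16            # no successes at all

adv_hard = max(group_relative_advantages(too_hard))
adv_matched = max(group_relative_advantages(matched))
adv_impossible = max(group_relative_advantages(impossible))

print(f"lucky success on a too-hard prompt: {adv_hard:.2f}")   # about 3.87
print(f"success on a well-matched prompt:   {adv_matched:.2f}") # about 1.00
print(f"all-fail group:                     {adv_impossible:.2f}")
```

With a success rate p, the winning rollout's advantage is roughly sqrt((1 - p) / p), so the rarer the success, the harder the update pushes toward whatever that one trajectory happened to do. If the success came from guessing or skipping computation, that is what gets amplified.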
The one literal capacity number in the corpus comes from a different but resonant place. "When do language models stop memorizing and start generalizing?" measures a hard memorization ceiling of about 3.6 bits per parameter for GPT-family models, and shows a phase transition: once capacity fills, the model stops memorizing and starts generalizing (grokking). That's not an RL result, but it's the same shape of idea you're reaching for: a measurable property of an individual model that flips the *kind* of learning that happens. It suggests 'capacity threshold' may be the right instinct even where RL papers haven't yet quantified it. Worth noting too that "Does instruction tuning teach task understanding or output format?" independently shows another flavor of shortcut: models trained on semantically empty instructions match those trained on fully correct ones, because what transfers is knowledge of the output space, not understanding. It's a reminder that 'it learned the task' and 'it learned a format shortcut' can look identical from the outside.
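For a sense of scale, the 3.6 bits-per-parameter figure turns into a back-of-envelope capacity estimate. This is only arithmetic on the quoted number; the parameter count and the dataset size are hypothetical, and treating a dataset as a known number of bits is a simplifying assumption.

```python
BITS_PER_PARAM = 3.6  # figure quoted in the memorization note

def memorization_capacity_bits(n_params, bits_per_param=BITS_PER_PARAM):
    """Rough memorization ceiling in bits for a model with n_params parameters."""
    return n_params * bits_per_param

def capacity_megabytes(n_params):
    """Same ceiling in decimal megabytes (8 bits per byte, 1e6 bytes per MB)."""
    return memorization_capacity_bits(n_params) / 8 / 1e6

def exceeds_capacity(dataset_bits, n_params):
    """True once the data carries more information than the model could memorize.
    Assumption: the dataset's information content is known in bits."""
    return dataset_bits > memorization_capacity_bits(n_params)

n_params = 1_000_000_000   # hypothetical 1B-parameter model
dataset_bits = 8 * 10**10  # hypothetical 10 GB of incompressible data

print(f"capacity: {capacity_megabytes(n_params):.0f} MB")
print(f"data exceeds capacity: {exceeds_capacity(dataset_bits, n_params)}")
```

The point of the comparison is that this threshold is computable from the model alone. The RL notes describe a boundary of the same kind but give no analogous formula.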
If you want the mechanism rather than the threshold, two notes go under the hood. "Does reinforcement learning update only a small fraction of parameters?" shows RL updates only a sparse subnetwork that is nearly identical across seeds, which is consistent with selecting existing circuitry rather than building new capacity. "Does RL training follow a predictable two-phase learning sequence?" shows learning moves from consolidating execution to exploring strategy, which is where a too-hard curriculum derails things. The honest synthesis: the corpus strongly supports the *existence* of a difficulty-relative-to-capability boundary separating activation from shortcut and names the failure mechanism precisely, but stops short of giving RL the clean bits-per-parameter number that the memorization work has.
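The sparse-subnetwork claim is something you could check on your own checkpoints. A minimal sketch, assuming parameters are flattened into plain lists of floats and using Jaccard overlap as the similarity measure (the notes don't say which measure was used); the weights below are toy values.

```python
def updated_indices(base, tuned, tol=1e-8):
    """Indices whose value moved by more than tol between two checkpoints."""
    return {i for i, (b, t) in enumerate(zip(base, tuned)) if abs(t - b) > tol}

def update_fraction(base, tuned, tol=1e-8):
    """Share of parameters that RL actually changed."""
    return len(updated_indices(base, tuned, tol)) / len(base)

def jaccard(a, b):
    """Overlap of two index sets: |a & b| / |a | b| (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy flattened weights (hypothetical): a base model and two RL runs with different seeds.
base   = [0.10, -0.20, 0.30, 0.40, -0.50, 0.60, 0.70, -0.80, 0.90, 1.00]
seed_a = [0.10, -0.20, 0.35, 0.40, -0.50, 0.60, 0.70, -0.75, 0.90, 1.00]
seed_b = [0.10, -0.20, 0.33, 0.40, -0.50, 0.60, 0.70, -0.78, 0.90, 1.05]

frac_a = update_fraction(base, seed_a)
frac_b = update_fraction(base, seed_b)
overlap = jaccard(updated_indices(base, seed_a), updated_indices(base, seed_b))

print(f"updated fraction: seed A {frac_a:.0%}, seed B {frac_b:.0%}")
print(f"overlap of updated sets across seeds: {overlap:.2f}")
```

A low update fraction together with high overlap across seeds is the signature the note describes: the same small set of weights gets adjusted regardless of initialization noise.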
Sources (7 notes)
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
GPT-family models have a measurable memorization capacity of approximately 3.6 bits per parameter. When this capacity fills, a phase transition triggers grokking: the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, reaching 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Across seven RL algorithms and ten LLM families, RL updates only about 5–30% of parameters, a sparsity that emerges without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.