How do sparse parameter updates enable when-not-how training to work?

This explores why reinforcement learning touches so few of a model's weights — and how that sparseness lines up with the idea that RL teaches a model *when* to use abilities it already has, rather than *how* to do something new.

The corpus points to a tidy mechanical story. When you run RL on a language model, only about 5–30% of parameters actually change, and not randomly. Across seven algorithms and ten model families, nearly the same subnetworks are updated from one random seed to the next, and those updates are nearly full-rank rather than confined to a thin low-rank slice (see "Does reinforcement learning update only a small fraction of parameters?"). That consistency is the tell: if RL were installing genuinely new procedures, you'd expect it to rewrite weights broadly and variably. Instead it behaves like a gating operation on capacity that's already present.
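"Sparse" and "full-rank" are independent properties, which is easy to lose track of. The sketch below is a toy illustration, not the measurement procedure from the cited work: it compares a weight matrix before and after a hypothetical update and reports the share of entries that changed alongside the rank of the difference. The matrices and the names `changed_fraction`, `delta`, and `matrix_rank` are all assumptions made for the example.

```python
from fractions import Fraction


def changed_fraction(before, after):
    """Share of entries that differ between two same-shaped matrices."""
    pairs = [(b, a) for rb, ra in zip(before, after) for b, a in zip(rb, ra)]
    return sum(1 for b, a in pairs if a != b) / len(pairs)


def delta(before, after):
    """Element-wise update: after minus before."""
    return [[a - b for b, a in zip(rb, ra)] for rb, ra in zip(before, after)]


def matrix_rank(matrix):
    """Rank via Gaussian elimination, using exact Fraction arithmetic."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rank])]
        rank += 1
    return rank


# Toy 4x4 "weight matrix" (hypothetical values).
before = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]

# Sparse update: only the 4 diagonal entries move, yet the delta is full-rank.
sparse_after = [[v + (1 if i == j else 0) for j, v in enumerate(row)]
                for i, row in enumerate(before)]

# Dense low-rank update: every entry moves, but the delta is rank 1.
u = [1, 2, 3, 4]
low_rank_after = [[v + u[i] for v in row] for i, row in enumerate(before)]

print(changed_fraction(before, sparse_after), matrix_rank(delta(before, sparse_after)))
print(changed_fraction(before, low_rank_after), matrix_rank(delta(before, low_rank_after)))
```

In this toy case the first update touches 25% of entries and has rank 4, while the second touches every entry and has rank 1. The RL finding described above corresponds to the first pattern: few parameters moved, but the movement is not confined to a low-dimensional slice.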

What is that gate doing? The mechanics note argues the dominant move is *negative*: RL suppresses wrong trajectories rather than amplifying correct ones, following a two-phase pattern of consolidating procedure first, then exploring strategy (see "What actually changes inside a model during RL training?"). A complementary finding sharpens the picture: within the first epoch, RL converges on a single dominant format that already existed in the pretraining distribution while collapsing the alternatives, and which format wins depends on model scale, not necessarily on which one performs best (see "Does RL training collapse format diversity in pretrained models?"). So the small, structured edit isn't adding knowledge; it's selecting among behaviors the base model could already produce and routing the model toward one of them. That is precisely 'when, not how.'
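One way to make "collapsing the alternatives" concrete is to label sampled outputs by format and track the entropy of that label distribution. The sketch below is a minimal illustration under that assumption. The format labels and counts are invented toy data, not results from the cited experiments, and `format_entropy` and `dominant_share` are hypothetical helper names.

```python
import math
from collections import Counter


def format_entropy(labels):
    """Shannon entropy (in bits) of the format labels of sampled outputs."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def dominant_share(labels):
    """Fraction of samples that use the single most common format."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Toy data (hypothetical): the base model spreads over four formats,
# while the RL-tuned model has mostly settled on one of them.
base_samples = ["format_a"] * 4 + ["format_b"] * 4 + ["format_c"] * 4 + ["format_d"] * 4
rl_samples = ["format_b"] * 15 + ["format_a"] * 1

print(format_entropy(base_samples), dominant_share(base_samples))
print(format_entropy(rl_samples), dominant_share(rl_samples))
```

A drop in entropy together with a rise in the dominant share is the signature described above. Note that the winning format is one the base model already produced, which is the sense in which the update selects rather than adds.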

The flip side is what sparse updates *can't* do, which is just as instructive. When you probe RL-tuned models on out-of-distribution variants (the N-1 test sets), performance drops sharply compared to in-distribution problems, suggesting RL sharpened template-matching and memorization rather than installing a transferable reasoning procedure (see "Do fine-tuned language models actually learn optimization procedures?"). A small, surgical weight change is enough to teach *when* to fire a known pattern, but not enough to teach *how* to reason through something genuinely novel. The two findings fit together: sparse-and-effective for selection, brittle for true generalization.

There's a deeper reason 'when' matters more than raw capacity. Reasoning models keep beating non-reasoning models no matter how much inference compute you throw at the weaker model, because training instilled a protocol that makes the extra tokens *productive*; the gap is about deployment structure, not headroom (see "Can non-reasoning models catch up with more compute?"). Learning when to spend reasoning effort is the thing training actually buys, and it's a low-dimensional skill, so a sparse update is the right-sized tool for it.

The lateral payoff: the same 'edit little, redirect a lot' principle shows up wherever the field separates routing from knowledge. Tuning only the singular values of weight matrices yields composable expert vectors that mix at inference without stepping on each other (see "Can models dynamically activate expert skills at inference time?"). Proxy-tuning leaves base weights entirely untouched yet closes most of the alignment gap, shifting reasoning and style while preserving stored knowledge in the lower layers (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). And models even sparsify their own activations adaptively when a task turns unfamiliar, as a selective filter rather than a breakdown (see "Do language models sparsify their activations under difficult tasks?"). Across all of these, the lesson is the same one the question is circling: capability lives in the dense pretrained substrate, and the cheap, sparse intervention is mostly about *governing* it, deciding when each ability gets to speak.
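The proxy-tuning case shows most plainly how behavior can be redirected with no weight edit at all. The sketch below assumes the usual decoding-time formulation, in which the large base model's next-token logits are offset by the difference between a small tuned "expert" and its untuned "anti-expert" counterpart. The note itself does not spell out the formula, so treat this as an illustration. The three-token vocabulary and all logit values are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def proxy_tuned_logits(base, expert, anti_expert):
    """Shift the base model's logits by (expert - anti_expert), token by token."""
    return [b + (e - a) for b, e, a in zip(base, expert, anti_expert)]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical next-token logits over a 3-token vocabulary.
base = [2.0, 1.0, 0.0]         # large untuned model prefers token 0
expert = [0.0, 3.0, 0.0]       # small tuned model strongly prefers token 1
anti_expert = [0.0, 0.0, 0.0]  # small untuned model is indifferent

shifted = proxy_tuned_logits(base, expert, anti_expert)
print(argmax(softmax(base)), argmax(softmax(shifted)))
```

The base model's parameters never change here; only the distribution it samples from is steered. When the expert and anti-expert agree, the offset vanishes and the base model's own preferences pass through untouched, which fits the claim that stored knowledge is preserved.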

Sources (8 notes)

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

What actually changes inside a model during RL training?

RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic ML researcher re-testing claims about sparse RL updates in LLMs. The question remains open: *why* do sparse parameter edits suffice to redirect model behavior, and what are their true limits?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat each as a hypothesis, not settled fact.

• Only 5–30% of parameters change under RL; updates are full-rank, not low-rank, and converge identically across seeds and model families, suggesting *selection* not *installation* of capability (~2025).
• RL collapses pretraining's format diversity into a single dominant template within epoch 1; which template wins depends on scale, not performance (~2025).
• RL sharpens memorization and template-matching; N-1 OOD tests show sharp performance cliffs, indicating sparse updates buy routing without true generalization (~2026).
• Reasoning-centric training instills *deployment protocol* (when to spend compute), a low-dimensional skill; non-reasoning models cannot match reasoning models even with unlimited inference (~2025).
• Routing-only approaches (singular-value tuning, proxy-tuning, activation sparsification) preserve base knowledge while governing its use; sparsification itself acts as an OOD filter (~2025–2026).

Anchor papers (verify; mind their dates):
- arXiv:2505.11711 (2025-05): RL finetunes small subnetworks
- arXiv:2504.07912 (2025-04): RL amplifies pretraining behaviors
- arXiv:2603.03415 (2026-03): OOD sparsification as adaptive mechanism
- arXiv:2501.06252 (2025-01): Transformer² self-adaptive routing

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, probe whether newer models (Claude 4, o3, etc.), scaling laws, multi-epoch RL, or evolved evals have *relaxed* or *overturned* the brittleness claim. Separate the durable insight (sparse edits likely do govern routing) from the perishable limit (OOD failure is *inevitable* at this update budget). Cite what relaxes it; flag what still holds.
(2) **Surface strongest CONTRADICTING work from the last ~6 months.** Has anyone shown sparse RL updates *do* transfer, or that the template-collapse finding reverses under different data or schedules? Name papers.
(3) **Propose 2 research questions that assume the regime may have shifted:** e.g., "Can continuous multi-modal RL (vision + language) overcome template brittleness?" or "Does adapter-style routing outperform weight sparsity for generalization?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
