How does pretrained knowledge constrain what adaptation strategies can achieve?

This explores a ceiling that runs underneath most fine-tuning and RL work: adaptation mostly reorganizes, selects, or risks damaging what pretraining already laid down — it rarely manufactures genuinely new capability.

The question is whether adaptation creates capability or merely reshuffles what pretraining already deposited. The corpus comes down hard on the second reading, and from several independent directions. The clearest statement is that base models already carry the abilities we think we're teaching: five separate mechanisms (RL steering, critique tuning, decoding tweaks, feature steering, and RLVR) all turn out to *elicit* reasoning that's already latent in base-model activations rather than build it (note: Do base models already contain hidden reasoning ability?). Follow that thread and post-training looks less like education and more like routing: RL teaches a model *when* to deploy reasoning, not *how* to reason, with hybrid models recovering ~91% of the gains by controlling token routing alone (note: Does RL post-training create reasoning or just deploy it?).

If adaptation selects rather than creates, then the pretraining distribution sets the menu. A striking demonstration: RL doesn't explore freely. Within the first epoch it amplifies a single dominant output format already present in pretraining and suppresses the alternatives, and which format 'wins' depends on model scale, not necessarily on which format performs best (note: Does RL training collapse format diversity in pretrained models?). The strategy can only promote options the base model came with. The same boundary shows up in imitation learning from a different angle: agents trained on static expert demonstrations are capped by what the dataset's curators imagined. They never learn from their own failures because they never interact with an environment during training, so nothing outside the demonstrated scenarios reaches them (note: Can agents learn beyond what their training data shows?).

The second way pretrained knowledge constrains adaptation is darker: aggressive weight updates don't just fail to add capability, they actively corrupt what's stored. Direct fine-tuning damages knowledge held in the lower layers, which is exactly why proxy-tuning, which leaves base weights untouched and shifts only the output distribution, closes 88-91% of the alignment gap while *beating* direct fine-tuning on knowledge tasks (note: Can decoding-time tuning preserve knowledge better than weight fine-tuning?). Fine-tuning can also weaken the link between a model's reasoning and its answers: after tuning, chains-of-thought become more performative, with early termination, paraphrasing, and filler substitution leaving final answers unchanged more often (note: Does fine-tuning disconnect reasoning steps from final answers?). So the tension is real: the heavier your intervention, the more pretrained structure you put at risk.
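One way to picture "shifting only the output distribution" is as logit arithmetic at decoding time. The sketch below is a minimal illustration under stated assumptions, not the method as published: it assumes the shift is the difference between a small tuned model's logits and the same small model's untuned logits, added to the frozen base model's logits. The function names and the three-token vocabulary are hypothetical.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_distribution(base_logits, tuned_small_logits, untuned_small_logits):
    """Shift the frozen base model's logits by (tuned - untuned) from a small model.

    Assumption: the offset form is illustrative. No base weights are read or
    modified here; only the next-token distribution changes.
    """
    shifted = [
        b + (t - u)
        for b, t, u in zip(base_logits, tuned_small_logits, untuned_small_logits)
    ]
    return softmax(shifted)


# Hypothetical next-token logits over a 3-token vocabulary.
base = [2.0, 1.0, 0.0]
small_tuned = [0.0, 1.5, 0.0]
small_untuned = [0.5, 0.5, 0.0]

probs = proxy_tuned_distribution(base, small_tuned, small_untuned)
print(probs)
```

In this toy case the base model alone prefers token 0, while the shifted distribution prefers token 1. If the small model's tuned and untuned logits are identical, the offset vanishes and the base distribution is returned unchanged.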

The most interesting work treats this constraint as a *design problem* rather than a wall. If forgetting is really a misallocation, with task lessons written into weights where they overwrite pretrained knowledge, then route them elsewhere: Fast-Slow Training keeps parameter updates minimal and pushes task-specific learning into optimized prompts, reaching equivalent performance 1.4–3x faster with far less forgetting (note: Can splitting adaptation into two channels reduce forgetting?). Singular-value tuning composes specialist 'expert' vectors by adjusting only the singular values of the existing weight matrices, mixing them at inference without interference and specializing continually without clobbering the base (note: Can models dynamically activate expert skills at inference time?). And when you genuinely must embed new domain knowledge, *how* you reward matters: RLAG internalizes coherent knowledge structures better than supervised fine-tuning by rewarding explanation quality alongside answer accuracy rather than token-level correctness (note: Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?).
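The singular-value idea can be sketched in a few lines. This is a toy illustration under assumptions, not the published implementation: it assumes a weight matrix already factored as U · diag(s) · Vᵀ, treats each expert as a vector z that rescales the singular values, and mixes experts with a simple weighted sum. The 2x2 factors, the expert vectors, and the mixing weights are all hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def diag(values):
    """Build a square diagonal matrix from a list of values."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def adapt(u, s, vt, z):
    """Rebuild the weight matrix with singular values rescaled by expert vector z.

    U and V^T stay frozen; only the per-singular-value scales change.
    """
    scaled = [si * zi for si, zi in zip(s, z)]
    return matmul(matmul(u, diag(scaled)), vt)


def mix(experts, weights):
    """Combine expert vectors with a weighted sum (an assumed mixing rule)."""
    n = len(experts[0])
    return [sum(w * e[i] for w, e in zip(weights, experts)) for i in range(n)]


# Hypothetical frozen factors of a 2x2 base weight matrix.
U = [[0.0, 1.0], [1.0, 0.0]]
S = [3.0, 1.0]
VT = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical expert vectors, one scale per singular value.
z_math = [1.2, 1.0]
z_code = [1.0, 0.5]

z_mixed = mix([z_math, z_code], [0.5, 0.5])
W_base = adapt(U, S, VT, [1.0, 1.0])
W_adapted = adapt(U, S, VT, z_mixed)
print(W_base)
print(W_adapted)
```

An all-ones expert vector reproduces the base matrix exactly, which is the sense in which the base is never overwritten: each specialization is a small vector stored beside the frozen factors.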

The quiet payoff here: the field's center of gravity is shifting from weights to context. If pretraining sets a hard ceiling on what's *in* the model, the cheapest gains come from better eliciting and orchestrating it. Two examples: agents that store verbal reflections as episodic memory and improve without any weight update (note: Can agents learn from failure without updating their weights?), and context 'playbooks' that accumulate task knowledge through curation loops, posting gains of +10.6% on agentic tasks and +8.6% on finance with no labeled supervision and nothing touched in the weights at all (note: Can context playbooks prevent knowledge loss during iteration?). Pretrained knowledge doesn't just constrain adaptation; it's increasingly the argument for not adapting the weights in the first place.
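To make the weight-free pattern concrete, here is a minimal sketch of a context store that grows through small curated additions rather than full rewrites. It is an illustration only: the `Playbook` class, the `reflect` stub, and the task names are hypothetical, and a real system would have a model write the reflection from environment feedback.

```python
class Playbook:
    """Append-only context store; the model's weights are never touched."""

    def __init__(self):
        self.entries = []

    def curate(self, lessons):
        """Merge new lessons as small deltas, skipping duplicates."""
        seen = {entry.lower() for entry in self.entries}
        added = 0
        for lesson in lessons:
            text = lesson.strip()
            if text and text.lower() not in seen:
                self.entries.append(text)
                seen.add(text.lower())
                added += 1
        return added

    def render(self):
        """Format the playbook as text to prepend to the next prompt."""
        return "\n".join(f"- {entry}" for entry in self.entries)


def reflect(task, succeeded):
    """Hypothetical stand-in for a model-written self-diagnosis."""
    if succeeded:
        return []
    return [f"Task '{task}' failed: re-check tool arguments before retrying."]


playbook = Playbook()
episodes = [("book flight", False), ("book flight", False), ("file report", True)]
for task, succeeded in episodes:
    playbook.curate(reflect(task, succeeded))

print(playbook.render())
```

Because `curate` only appends, earlier lessons are never compressed away by a later rewrite; the repeated failure adds nothing new and the successful episode adds no lesson at all.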

Sources (11 notes)

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Can context playbooks prevent knowledge loss during iteration?

The ACE framework treats contexts as evolving playbooks using generation-reflection-curation loops rather than full rewrites. This prevents knowledge loss from compression and detail erosion, achieving +10.6% on agentic tasks and +8.6% on finance without labeled supervision.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating whether pretrained knowledge truly sets an inescapable ceiling on what adaptation can achieve, or whether recent methods have begun to circumvent or dissolve that constraint.

What a curated library found — and when (findings span 2023–2026; treat as dated claims):
• Base models already contain latent reasoning; five distinct adaptation mechanisms (RL steering, critique tuning, decoding, feature steering, RLVR) merely *elicit* rather than build capability (~2025).
• RL post-training amplifies a single dominant pretraining format within epoch one, constrained by model scale not task performance (~2025).
• Direct fine-tuning corrupts lower-layer knowledge; proxy-tuning (output-distribution shifts only) preserves pretrained structure better and outperforms on knowledge tasks (~2025).
• Fine-tuning degrades chain-of-thought faithfulness; reasoning becomes performative after tuning (~2025).
• Weight-free adaptation (episodic verbal memory, context curation) improves performance without parameter updates; context curation reports +10.6% on agentic tasks and +8.6% on finance (~2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2504.07912 (2025-04) — Echo Chamber: RL amplifies pretraining distributions
• arXiv:2501.06252 (2025-01) — Transformer2: self-adaptive via singular-value composition
• arXiv:2605.12484 (2026-05) — Learning, Fast and Slow: minimal-update adaptation
• arXiv:2510.04618 (2025-10) — Agentic Context Engineering: context as evolving playbooks

Your task:
(1) RE-TEST THE ELICITATION-VS.-CREATION BOUNDARY. For each mechanism cited, determine whether newer post-training schedules, multi-stage training regimes, or compositional fine-tuning (LoRA variants, sparse updates) can credibly *add* capability beyond elicitation. Where does elicitation still hold? What concrete evidence would overturn it?
(2) SURFACE DISAGREEMENT: What work from the last 6 months contradicts the "weights-to-context shift" thesis? Do any papers show weight adaptation recovering ground against context methods, or hybrid regimes outperforming pure context approaches?
(3) PROPOSE TWO FORWARD QUESTIONS assuming the regime *has* moved: (a) If context truly dominates, what role remains for sparse or modular weight updates? (b) Can test-time adaptation + caching close the gap between eliciting pretraining and learning genuinely novel knowledge?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
