How does pretrained knowledge constrain what adaptation strategies can achieve?
This explores a ceiling that runs underneath most fine-tuning and RL work: adaptation mostly reorganizes, selects, or risks damaging what pretraining already laid down — it rarely manufactures genuinely new capability.
The underlying question is whether adaptation creates capability or merely reshuffles what pretraining already deposited. The corpus comes down hard on the second reading, and from several independent directions. The clearest statement is that base models already carry the abilities we think we're teaching: five separate mechanisms (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all turn out to *elicit* reasoning that's already latent in base-model activations rather than build it (see: Do base models already contain hidden reasoning ability?). Follow that thread and post-training looks less like education and more like routing: RL teaches a model *when* to deploy reasoning, not *how* to reason, with hybrid models recovering 91% of the gains by controlling token routing alone (see: Does RL post-training create reasoning or just deploy it?).
If adaptation selects rather than creates, then the pretraining distribution sets the menu. A striking demonstration: RL doesn't explore freely. Within the first epoch it amplifies a single dominant output format already present in pretraining and suppresses the alternatives, and which format 'wins' depends on model scale, not necessarily on which format performs best (see: Does RL training collapse format diversity in pretrained models?). The strategy can only promote options the base model came with. The same boundary shows up in imitation learning from a different angle: agents trained on static expert demonstrations are capped by what the dataset's curators imagined, and they cannot learn from their own failures because they never interact with an environment during training (see: Can agents learn beyond what their training data shows?).
The second way pretrained knowledge constrains adaptation is darker: aggressive weight updates don't just fail to add, they actively corrupt what's stored. Direct fine-tuning damages knowledge held in the lower layers, which is exactly why proxy-tuning, which leaves base weights untouched and shifts only the output distribution, closes 88-91% of the alignment gap while *beating* direct fine-tuning on knowledge tasks (see: Can decoding-time tuning preserve knowledge better than weight fine-tuning?). Fine-tuning can also weaken the link between a model's reasoning and its answers: after tuning, chains-of-thought become more performative, with early termination, paraphrasing, and filler substitution leaving final answers unchanged more often than before (see: Does fine-tuning disconnect reasoning steps from final answers?). So the tension is real: the heavier your intervention, the more pretrained structure you put at risk.
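To make the "shifts only the output distribution" idea concrete, here is a minimal sketch of a decoding-time logit offset. It assumes the commonly described form of proxy-tuning, in which the large base model's next-token logits are shifted by the difference between a small tuned model and its untuned counterpart; that detail is not stated in the notes here, and the function name and all logit values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_logits(base_large, tuned_small, untuned_small):
    """Shift the large base model's logits by (tuned - untuned) from the small pair.

    No weights of the large model are modified; only its output scores move.
    """
    return [b + (t - u) for b, t, u in zip(base_large, tuned_small, untuned_small)]


# Hypothetical next-token logits over a 3-token vocabulary.
base_large = [2.0, 1.0, 0.0]
tuned_small = [0.5, 1.5, 0.0]
untuned_small = [1.0, 0.5, 0.0]

shifted = proxy_tuned_logits(base_large, tuned_small, untuned_small)
probs = softmax(shifted)
```

In this toy case the base model prefers token 0, but the small pair's difference nudges the combined distribution toward token 1, while whatever the large model stores in its weights is left exactly as it was.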
The most interesting work treats this constraint as a *design problem* rather than a wall. If forgetting is really a misallocation, with task lessons being written into weights where they overwrite pretrained knowledge, then route them elsewhere: Fast-Slow Training keeps parameter updates minimal and pushes task-specific learning into optimized prompts, reaching equivalent performance 1.4–3x faster with substantially less forgetting (see: Can splitting adaptation into two channels reduce forgetting?). Singular-value tuning adjusts only the singular values of the existing weight matrices, producing composable 'expert' vectors that mix at inference without interference and allow continual specialization without clobbering the base (see: Can models dynamically activate expert skills at inference time?). And when you genuinely must embed new domain knowledge, *how* you reward matters: RLAG internalizes coherent knowledge structures better than supervised fine-tuning by rewarding explanation quality alongside answer accuracy, rather than token-level correctness (see: Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?).
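The singular-value idea can be sketched in a few lines. This is an illustration under stated assumptions, not the published method: it assumes a frozen decomposition W = U diag(sigma) V^T is already available, that each expert is a vector z that rescales sigma elementwise, and that experts are combined by a simple weighted sum. The helper names, the 2x2 factors, and the expert vectors are all hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def rotation(theta):
    """2x2 rotation matrix, used here as a stand-in orthogonal factor."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def diag(values):
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def adapt(u, sigma, vt, z):
    """Rebuild W' = U diag(sigma * z) V^T; only z is task-specific."""
    scaled = [s * zi for s, zi in zip(sigma, z)]
    return matmul(matmul(u, diag(scaled)), vt)


def mix(expert_vectors, weights):
    """Weighted sum of expert vectors, one entry per singular value."""
    size = len(expert_vectors[0])
    return [sum(w * z[i] for w, z in zip(weights, expert_vectors))
            for i in range(size)]


# Hypothetical frozen factors of a 2x2 base weight matrix.
U = rotation(0.3)
VT = transpose(rotation(-0.7))
SIGMA = [2.0, 0.5]

# Hypothetical expert vectors (assumed to be learned elsewhere).
z_math = [1.2, 1.0]
z_code = [1.0, 0.6]

w_base = adapt(U, SIGMA, VT, [1.0, 1.0])   # z of all ones recovers the base matrix
z_mixed = mix([z_math, z_code], [0.5, 0.5])
w_mixed = adapt(U, SIGMA, VT, z_mixed)
```

Because U and V^T never change, each expert costs one number per singular value, and dropping back to a vector of ones restores the base weights exactly.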
The quiet payoff here: the field's center of gravity is shifting from weights to context. If pretraining sets a hard ceiling on what's *in* the model, the cheapest gains come from better eliciting and orchestrating it: agents that store verbal reflections as episodic memory and improve without any weight update (see: Can agents learn from failure without updating their weights?), and context 'playbooks' that accumulate task knowledge through curation loops, posting gains of +10.6% on agentic tasks and +8.6% on finance with no labeled supervision and nothing touched in the weights at all (see: Can context playbooks prevent knowledge loss during iteration?). Pretrained knowledge doesn't just constrain adaptation; it's increasingly the argument for not adapting the weights in the first place.
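The weight-free improvement loop described above can be sketched as follows. This is a toy illustration, assuming a binary success signal and a list of uncompressed text reflections as the only state that persists across episodes; `run_with_reflection`, `toy_attempt`, and `toy_reflect` are hypothetical stand-ins for an agent, its environment, and its self-diagnosis step.

```python
from typing import Callable, List, Tuple


def run_with_reflection(
    attempt: Callable[[List[str]], Tuple[str, bool]],
    reflect: Callable[[str], str],
    max_episodes: int = 3,
) -> Tuple[bool, List[str]]:
    """Retry a task, carrying verbal reflections forward instead of weight updates."""
    memory: List[str] = []  # episodic memory: plain text, kept uncompressed
    for _ in range(max_episodes):
        trajectory, success = attempt(memory)  # success is a binary signal
        if success:
            return True, memory
        memory.append(reflect(trajectory))  # store a self-diagnosis of the failure
    return False, memory


def toy_attempt(memory: List[str]) -> Tuple[str, bool]:
    """Stand-in agent: succeeds only once a stored note mentions the unit check."""
    if any("check units" in note for note in memory):
        return "converted units, answer accepted", True
    return "skipped unit conversion, answer rejected", False


def toy_reflect(trajectory: str) -> str:
    """Stand-in self-diagnosis; a real agent would generate this text itself."""
    return f"Failed after: {trajectory}. Next time: check units."


solved, notes = run_with_reflection(toy_attempt, toy_reflect)
```

The first episode fails, the failure is written down, and the second episode succeeds by reading the note. Nothing in the loop touches model parameters; all of the learning lives in `memory`.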
Sources (11 notes)
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.
Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
The ACE framework treats contexts as evolving playbooks using generation-reflection-curation loops rather than full rewrites. This prevents knowledge loss from compression and detail erosion, achieving +10.6% on agentic tasks and +8.6% on finance without labeled supervision.