Can decoding-time tuning preserve knowledge better than weight fine-tuning?
Explores whether applying alignment signals at inference time rather than modifying model weights can better preserve the factual knowledge learned during pretraining while still achieving alignment goals.
Proxy-tuning fine-tunes a small model, then applies the difference between the small tuned and small untuned models' predictions to shift a large untuned model's outputs at decoding time. The large model's parameters are never modified. The method closes 91% of the performance gap between Llama-2-13B and its directly tuned chat version, and 88% of the corresponding gap for the 70B model.
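The shift described above can be sketched on a toy vocabulary. This is a minimal illustration, not the authors' implementation: it assumes the shift is applied by adding the tuned-minus-untuned difference to the large model's logits, and that all three models share one vocabulary. The function name `proxy_tuned_probs` and every logit value are made up for the example.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(base_large, tuned_small, untuned_small):
    """Shift the large model's logits by the small models' tuned-minus-untuned
    difference, then renormalise. The large model itself is never updated."""
    shifted = [
        b + (t - u)
        for b, t, u in zip(base_large, tuned_small, untuned_small)
    ]
    return softmax(shifted)


# Hypothetical next-token logits over a 4-token vocabulary.
base_large = [2.0, 1.0, 0.5, 0.0]     # large untuned model
untuned_small = [1.0, 1.0, 1.0, 1.0]  # small untuned model
tuned_small = [1.0, 3.0, 1.0, 1.0]    # small tuned model now favours token 1

probs = proxy_tuned_probs(base_large, tuned_small, untuned_small)
print([round(p, 3) for p in probs])
```

In this toy case the small tuned model only changed its opinion about token 1, so only that token's logit moves in the large model's distribution; everything the large model already preferred among the other tokens keeps its relative ordering.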
The critical finding: on knowledge-intensive tasks, proxy-tuning sometimes surpasses direct instruction-tuning. This is because direct fine-tuning modifies model weights, and some of those modifications overwrite pretrained knowledge. As explored in Why does reasoning training help math but hurt medical tasks?, reasoning and knowledge may rely on different network mechanisms, so weight modification risks corrupting the knowledge storage that proxy-tuning leaves intact.
Proxy-tuning primarily promotes reasoning and stylistic tokens. Analysis of the token-level distributional shift shows the largest influence on tokens associated with reasoning patterns and output style, consistent with evidence that "alignment mainly affects style rather than knowledge." This aligns with two related notes, Does instruction tuning teach task understanding or output format? and Can imitating ChatGPT fool evaluators into thinking models improved?: what fine-tuning actually changes is the output distribution, not capability. Proxy-tuning achieves this distributional change without touching the model weights that encode knowledge.
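A token-level shift analysis of this kind can be approximated by comparing, for each generated token, the probability the base model gave it with the probability it received after proxy-tuning. The sketch below is a hedged illustration only: the helper `probability_shifts`, the example tokens, and all probabilities are hypothetical, and the exact metric used in the original analysis may differ.

```python
def probability_shifts(tokens, base_probs, proxy_probs):
    """Pair each generated token with the change in its probability
    (proxy-tuned minus base), sorted from largest to smallest shift."""
    shifts = [
        (tok, p_proxy - p_base)
        for tok, p_base, p_proxy in zip(tokens, base_probs, proxy_probs)
    ]
    return sorted(shifts, key=lambda pair: pair[1], reverse=True)


# Hypothetical per-token probabilities for one generated answer.
tokens = ["Paris", "Therefore", ",", "is"]
base_probs = [0.90, 0.05, 0.40, 0.60]   # large untuned model
proxy_probs = [0.91, 0.55, 0.45, 0.62]  # same model under proxy-tuning

ranked = probability_shifts(tokens, base_probs, proxy_probs)
for token, shift in ranked:
    print(f"{token!r}: {shift:+.2f}")
```

With these made-up numbers the discourse marker "Therefore" moves the most while the factual token "Paris" barely moves, which is the pattern the style-versus-knowledge claim would predict.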
For domain adaptation, proxy-tuning Llama-2-13B using CodeLlama-7B produces a 17-32% improvement on coding benchmarks. The small expert provides the distributional guidance; the large base model provides the knowledge. An optional hyperparameter controls the amount of guidance, enabling runtime trade-offs between different generation attributes.
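One natural way to expose such a guidance knob is to scale the expert-minus-untuned logit difference by a coefficient before adding it to the base model's logits. The sketch below assumes that form; the name `alpha`, the function `guided_probs`, and the toy logits are all illustrative rather than taken from the source.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_probs(base_large, expert_small, untuned_small, alpha=1.0):
    """Add alpha times the expert-minus-untuned logit difference to the
    large model's logits. alpha is an assumed guidance-strength knob."""
    shifted = [
        b + alpha * (e - u)
        for b, e, u in zip(base_large, expert_small, untuned_small)
    ]
    return softmax(shifted)


# Hypothetical logits over a 3-token vocabulary; token 1 is "code-like".
base_large = [2.0, 1.0, 0.0]
untuned_small = [1.0, 1.0, 1.0]
expert_small = [1.0, 2.5, 1.0]

# Probability of the code-like token as guidance strength increases.
sweep = {
    alpha: guided_probs(base_large, expert_small, untuned_small, alpha)[1]
    for alpha in (0.0, 0.5, 1.0, 2.0)
}
for alpha, prob in sweep.items():
    print(f"alpha={alpha}: P(token 1) = {prob:.3f}")
```

Under this assumption, `alpha = 0` leaves the base model's distribution untouched and larger values lean harder on the small expert, so the trade-off can be chosen per request at inference time.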
This constitutes a fifth paradigm alongside the four covered in How do knowledge injection methods trade off flexibility and cost?: decoding-time adaptation. It has zero training cost on the target model and fully preserves that model's knowledge, but it requires access to base model logits at inference time.
ARGS (Alignment as Reward-Guided Search) provides a complementary inference-time method. Instead of applying a distributional shift from a tuned proxy, ARGS adjusts the model's predictions at each decoding step using a reward signal directly. It has two components: reward-guided scoring, which assigns scores to possible continuations, and token selection, which picks a continuation from the scored candidates. A tunable weight controls the trade-off between semantic relevance and alignment criteria; setting it to zero recovers standard maximum-likelihood decoding. ARGS enables rapid personalized alignment without retraining: different users can have different reward functions applied at inference time. Together, proxy-tuning (distributional shift from an expert delta) and ARGS (reward-guided decoding) suggest a design space where multiple axes of adaptation (domain knowledge, user preferences, task constraints) can each be applied at decoding time through complementary mechanisms. See Can user preferences be learned from just ten questions? for how per-user reward functions can be efficiently constructed.
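The two ARGS components described above, scoring and selection, can be sketched as a single decoding step. This is a simplified illustration under stated assumptions: candidates are the top-k tokens by log-probability, the score is log-probability plus a weighted reward, and selection is greedy. The function `args_step`, the reward functions, and the toy log-probabilities are hypothetical.

```python
def args_step(log_probs, reward_fn, context, weight=1.0, top_k=3):
    """Pick the next token by combining model log-probability with a reward.

    log_probs: dict mapping candidate token -> log-probability.
    reward_fn: callable (context, token) -> float, e.g. a per-user reward.
    weight:    trade-off between likelihood and reward; 0 gives greedy decoding.
    """
    # Scoring is restricted to the k most likely tokens.
    candidates = sorted(log_probs, key=log_probs.get, reverse=True)[:top_k]
    scores = {
        tok: log_probs[tok] + weight * reward_fn(context, tok)
        for tok in candidates
    }
    # Greedy selection over the reward-adjusted scores.
    return max(scores, key=scores.get)


# Hypothetical next-token log-probabilities.
log_probs = {"sure": -0.5, "no": -1.0, "maybe": -2.0, "never": -3.0}


def cautious_user(context, token):
    """Made-up reward for a user who prefers hedged answers."""
    return 1.0 if token == "maybe" else 0.0


def blunt_user(context, token):
    """Made-up reward for a user who prefers direct refusals."""
    return 1.0 if token == "no" else 0.0


greedy_choice = args_step(log_probs, cautious_user, "Can you help?", weight=0.0)
cautious_choice = args_step(log_probs, cautious_user, "Can you help?", weight=2.0)
blunt_choice = args_step(log_probs, blunt_user, "Can you help?", weight=2.0)
print(greedy_choice, cautious_choice, blunt_choice)
```

Swapping the reward function changes the selected token without touching the model, which is the sense in which different users can receive different alignment at inference time.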
Inquiring lines that use this note as a source (128)
This note is a source for these synthesized inquiries.
- Can communication problems and optimization problems be addressed with the same alignment approaches?
- Why does RLHF alignment reduce the diversity of viewpoints in AI output?
- Does self-conditioning improve belief-behavior alignment better than external priors?
- Does alignment training make AI incapable of warranted urgency?
- Why does even 0.1 percent poisoned training data persist through alignment?
- Could superposed decoding algorithms maintain multi-task representation during generation?
- Why does fine-tuning for continuous space cause catastrophic forgetting?
- Can distillation methods extract directional guidance that scalar RL cannot access?
- What techniques work best for injecting domain knowledge at training time?
- What quality of curated data is minimally sufficient for alignment?
- Can instruction tuning succeed without explicit task understanding?
- Can a single AI system optimize multiple alignment dimensions simultaneously?
- Can alignment training be redesigned to permit warranted alarm?
- How does optimizing for accuracy during training degrade downstream reasoning quality?
- Does correct model behavior guarantee internal alignment of learned objectives?
- What role does terminal goal guarding play in model misalignment?
- How do early layers preserve unbiased information while late layers conform?
- Does the model learn depth-wise drift as an explicit strategy?
- How do different training objectives shift whether models over-predict or under-predict?
- How can safety-aligned parameters be protected during user-specific fine-tuning?
- How does distributional distance from pre-training relate to model difficulty?
- How does weight sharing compound the advantages of deeper model designs?
- What makes asymmetric distillation effective for converting pretrained diffusion models?
- How do training-time and inference-time knowledge injection techniques compare?
- How much alignment data does a language model actually need to specialize well?
- When should model isolation be preferred over weight-averaging approaches?
- Can bidirectional model updating between humans and AI reduce misalignment?
- Can fine-tuning ever teach semantic inference instead of amplifying training shortcuts?
- Can test-time compute on smaller models replace larger model inference?
- Can gradient approximation at equilibrium replace backpropagation through time in practice?
- Why does context information fail to override prior training associations?
- Why do small training data contaminations persist through alignment for most attack types?
- Does keyword priming explain why pre-training poisoning persists through alignment?
- How does behavioral fine-tuning differ from factual knowledge encoding in models?
- Why does the distinction between functional and causal grounding matter for AI alignment?
- Why does fine-tuning fail to remove temporal contamination from pretraining?
- When should full-parameter post-training be used instead of LoRA adaptation?
- How do layer-wise versus parameter-wise merging strategies affect information retention?
- Why does pure numeric ID indexing force models to learn from scratch?
- Why did prior multi-token prediction methods fail during fine-tuning?
- What distinguishes training-time entropy collapse from test-time variance inflation?
- What are the computational trade-offs between training-time vs inference-time consistency correction?
- What makes utility-weighted training backfire in machine learning systems?
- Can alignment methods like DPO exploit or correct these surface feature biases?
- Why does KTO skip supervised fine-tuning while DPO cannot?
- Can alignment methods model loss aversion without creating unintended sophistry?
- Can reward-guided decoding replace weight fine-tuning for personalized alignment?
- Why is offline knowledge distillation preferred when in-session signals matter?
- Why does long CoT training optimize for structural coherence over content correctness?
- Can expert vectors learned offline transfer across multiple model architectures?
- What deployment tradeoffs emerge between single-pass and multi-pass inference adaptation?
- How does pretrained knowledge constrain what adaptation strategies can achieve?
- Do all semantic steering effects follow predictable patterns based on feature alignment?
- Why does training order matter across different domain types?
- What training cost tradeoffs exist between fine-tuning and other knowledge injection methods?
- Why does weight space search reduce robustness to prompt perturbations better than prompt engineering?
- How does data quality mismatch create reasoning degradation in supervised fine-tuning?
- Why does post-training suppress alignment faking in some models but amplify it in others?
- How much task-similar finetuning data does test-time training actually need?
- Why does monological training prevent models from overriding statistical priors?
- How does RLHF alignment training reduce multi-turn conversational capability?
- How do trained weights differ from a stored library or text?
- What happens when alignment targets measure only the preferred dimension of entangled properties?
- Can steering vectors be combined with other compression techniques?
- Does gradient-based influence estimation identify which alignment examples actually matter most?
- What specific behavioral patterns should alignment examples target for maximum effect?
- Does representational density emerge from training data exposure during pretraining?
- Can alignment training create systematic blind spots in threat detection systems?
- Can model training address failures that really originate in harness gaps?
- Can prompt-based debiasing work if biases are embedded in pretraining?
- Does pretraining poisoning at scale persist through instruction alignment?
- How do retrieval and fine-tuning trade off flexibility against training cost?
- How should skill libraries coordinate with gradient-based weight optimization?
- Can goal information injected at inference time replace goal-conditioned training?
- What alignment procedures cause different models to share the same output distribution?
- How does upstream value embedding differ from downstream alignment patches?
- How do pre-training and distillation enable minimal routing signals to work?
- What mechanism transfers explicit memories into parametric model weights?
- How should training data be constructed to preserve teacher-student information gaps?
- How does KL regularization prevent both forgetting and adaptation loss?
- What makes two timescales better than one for minimizing weight movement?
- Can memory-based adaptation and gradient fine-tuning operate on complementary timescales?
- Does the pretrained prior actually constrain what internalized search can discover?
- How much does pretraining quality affect the modularity of fine-tuned models?
- Does importance sampling actually recover capabilities lost to hard sample training?
- Can population-level distributions shift usefully even when individual prediction fails?
- Do alignment benchmarks measure actual bias removal or only verbal compliance?
- Can mechanistic interpretability tools decode the biases alignment training conceals?
- Can decoder-only models become effective text encoders with training?
- What makes a learned consolidation rule lossy and where does contamination enter?
- What happens to representational structure during model pretraining phases?
- What mechanisms cause overly hard samples to degrade prior model performance?
- How can distillation preserve uncertainty expression instead of optimizing it away?
- Can categorical correctness signals stop dense optimizers from finding loopholes?
- How do models develop dense representations for familiar training data?
- What limits the capacity of context-based fast adaptation channels?
- Can models recover knowledge with completely unrelated retraining tasks?
- How does in-weights adaptation create spurious forgetting in models?
- What makes task alignment more fragile than underlying knowledge retention?
- Do long-term memory modules outperform consolidation into fast weights?
- How do sparse parameter updates enable when-not-how training to work?
- Does token-level loss aggregation help aligned models differently?
- What alignment properties emerge when the reward model disappears?
- Can weak models supervise the alignment of stronger models effectively?
- How does constitutional alignment compare to RLHF in removing human annotation costs?
- What makes principle-response mutual information sufficient for behavioral alignment?
- Why do unified models still inherit data-distribution biases from training?
- Does alignment compound cultural bias that started during pretraining?
- How does supervised fine-tuning degrade chain-of-thought faithfulness over time?
- Why does in-weight memorization fail compared to tool-based fact access?
- What causes overfitting when forcing new facts into model weights?
- Why does parameter-efficient tuning scaling fail to improve finetuning performance?
- Does pretraining data size matter less than base model scale for finetuning?
- Which finetuning method works best across different task and data regimes?
- Can AI-assisted alignment eventually solve fairness at scale?
- Can decoding-time prompting strategies fully replace diversity-focused training methods?
- How does representational density emerge from training data familiarity?
- How does Easy Consistency Tuning accelerate consistency model training from diffusion checkpoints?
- Can training order and structure shape what networks retain and learn?
- Can we unlearn memorized text by finetuning only high-gradient weights?
- How do newly learned facts become accessible after gradient updates?
- Does latent density emerge during pretraining from training data familiarity?
- Why does adaptation concentrate in low-dimensional subspaces of weights or representations?
- What makes representation interventions more efficient than weight perturbations for finetuning?
- How much performance is lost when converting pretrained checkpoints versus training from scratch?
- Can time-awareness live in model parameters instead of retrieval?
- What is the accuracy cost of enforcing temporal causality inside model parameters?
- Does finetuning facts into weights overwrite existing model capabilities?
Related concepts in this collection (6)
- How do knowledge injection methods trade off flexibility and cost?
  When and how should domain knowledge enter an AI system? This explores the speed, training cost, and adaptability trade-offs across four injection paradigms, and when each approach suits different deployment constraints.
  proxy-tuning adds a fifth paradigm: decoding-time adaptation
- Why does reasoning training help math but hurt medical tasks?
  Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
  explains why proxy-tuning preserves knowledge: it doesn't modify lower layers
- Can imitating ChatGPT fool evaluators into thinking models improved?
  Explores whether fine-tuning weaker models on ChatGPT outputs creates an illusion of capability gains. Investigates why human raters and automated judges fail to detect that imitation improves style but not underlying factuality or reasoning.
  proxy-tuning exploits the same style/knowledge distinction but productively
- Does supervised fine-tuning actually improve reasoning quality?
  While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
  proxy-tuning may avoid the SFT accuracy trap by not modifying weights
- Can models precompute answers before users ask questions?
  Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?
  both are inference-time adaptation methods that avoid weight modification: proxy-tuning applies a distributional shift at decoding time, sleep-time compute pre-computes inferences between interactions; together they suggest a design space where adaptation, knowledge preservation, and latency optimization all operate at inference time rather than training time
- Can models dynamically activate expert skills at inference time?
  Can language models efficiently discover and compose task-specific capabilities on the fly without modifying base weights? This explores whether test-time adaptation through expert vector composition outperforms fixed fine-tuning approaches.
  complementary decoding-time adaptation: proxy-tuning applies a distributional shift from expert delta, SVF composes compact expert singular vectors via two-pass dispatch; both preserve base weights but SVF enables composable multi-skill adaptation while proxy-tuning uses a single expert signal
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Tuning Language Models by Proxy
- Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs
- An Emulator for Fine-Tuning Large Language Models using Small Language Models
- AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts
- Foundations of Large Language Models
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
- Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels
Original note title
proxy tuning at decoding time preserves pretrained knowledge better than direct fine-tuning by applying the tuning signal as a distributional shift