Can models dynamically activate expert skills at inference time?
Can language models efficiently discover and compose task-specific capabilities on the fly without modifying base weights? This note explores whether test-time adaptation through expert vector composition outperforms fixed fine-tuning approaches.
Transformer2 introduces Singular Value Fine-tuning (SVF): instead of updating full weight matrices, or even adding low-rank adapters, SVF decomposes each weight matrix by singular value decomposition and tunes only its singular values. The result is a set of compact expert vectors that are inherently composable: they can be mixed dynamically at inference without interfering with one another.
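In symbols, the idea can be sketched as follows. The notation here (U, Σ, V, σ, z) is assumed for illustration and does not come from this note; z stands for the trainable expert vector, with one entry per singular value of an m × n weight matrix W.

```latex
W = U \Sigma V^{\top}, \qquad \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \qquad r = \min(m, n)

W' = U \, \operatorname{diag}(\sigma \odot z) \, V^{\top} = \sum_{i=1}^{r} z_i \, \sigma_i \, u_i v_i^{\top}, \qquad z \in \mathbb{R}^{r}
```

Under this reading only z is trained, while U, Σ, and V stay frozen, so setting every entry of z to 1 recovers the base weight W.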
The inference mechanism has two passes:
- First pass (dispatch): The model executes on the input and observes its own test-time behavior, gathering information about what skills the current problem requires.
- Second pass (adaptation): The framework combines available expert vectors based on the first-pass analysis, providing a targeted modification to the base weights specifically tailored to the task.
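The two passes above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the framework's implementation: the expert vectors, the keyword cues, and the function names `dispatch` and `compose` are all hypothetical, and a keyword counter stands in for the model's own first-pass analysis.

```python
# Toy stand-in for two-pass inference. All names and numbers are hypothetical.

# Hypothetical expert vectors: one scale factor per singular value (r = 3 here).
EXPERTS = {
    "math": [1.2, 0.9, 1.0],
    "code": [0.8, 1.3, 1.0],
}

# Hypothetical keyword cues standing in for the model's first-pass analysis.
CUES = {
    "math": ("integral", "prove", "sum"),
    "code": ("function", "bug", "compile"),
}


def dispatch(prompt):
    """First pass: score how strongly each skill is needed; weights sum to 1."""
    words = prompt.lower().split()
    scores = {skill: sum(w in cues for w in words) for skill, cues in CUES.items()}
    total = sum(scores.values())
    if total == 0:
        # No cue matched: fall back to a uniform mix of all experts.
        return {skill: 1.0 / len(scores) for skill in scores}
    return {skill: score / total for skill, score in scores.items()}


def compose(weights):
    """Second pass: blend the expert vectors into one adapted vector."""
    r = len(next(iter(EXPERTS.values())))
    return [sum(weights[k] * EXPERTS[k][i] for k in EXPERTS) for i in range(r)]


weights = dispatch("fix the bug in this function")
z_adapted = compose(weights)
print(weights, z_adapted)
```

In a real system the blended vector would then rescale the singular values of the base weights before the second forward pass; here it is only printed.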
Three adaptation strategies are available, and performance improves monotonically as a strategy is given more access to test-time conditions, so each deployment can choose the tradeoff that suits it.
The key properties that make this work:
- Compositionality: SVF expert vectors combine naturally because they operate on orthogonal singular value dimensions. LoRA adapters, by contrast, modify rank-k subspaces that may interfere when composed.
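- Efficiency: SVF trains far fewer parameters than LoRA while outperforming it. It learns one scale factor per singular value, i.e. min(m, n) values for an m × n matrix, versus r(m + n) for a rank-r LoRA adapter. Expert vectors are therefore compact enough to store many specializations.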
- Continual learning: New expert modules can be developed offline and added without catastrophic forgetting, because the base model weights are never modified — only the singular value modulation changes.
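The efficiency and compositionality claims can be checked on a small example. The sketch below is illustrative only: the 4096 × 4096 shape and rank 16 are assumed numbers, the LoRA count uses the standard r(m + n) formulation, and the factors `U`, `s`, `Vt` are written by hand rather than computed by an SVD routine so the code has no dependencies. The function names are hypothetical.

```python
# Sketch of two properties claimed above. Shapes, rank, and values are illustrative.

def svf_params(m, n):
    """Trainable scalars for one m x n matrix under SVF: one per singular value."""
    return min(m, n)


def lora_params(m, n, rank):
    """Trainable scalars for one m x n matrix under LoRA (factors m x r and r x n)."""
    return rank * (m + n)


def adapted_weight(U, s, Vt, z):
    """Rebuild W' with entries sum_i U[a][i] * z[i] * s[i] * Vt[i][b]."""
    rows, cols = len(U), len(Vt[0])
    return [
        [sum(U[a][i] * z[i] * s[i] * Vt[i][b] for i in range(len(s))) for b in range(cols)]
        for a in range(rows)
    ]


# Efficiency: assumed 4096 x 4096 layer, assumed LoRA rank 16.
print(svf_params(4096, 4096), lora_params(4096, 4096, 16))

# Compositionality: hand-written orthogonal factors of a 2 x 2 weight matrix.
U = [[0, 1], [1, 0]]
s = [3.0, 2.0]
Vt = [[1, 0], [0, 1]]

z_a = [1.5, 1.0]  # hypothetical expert A
z_b = [1.0, 0.5]  # hypothetical expert B
z_mix = [0.5 * a + 0.5 * b for a, b in zip(z_a, z_b)]

W_a = adapted_weight(U, s, Vt, z_a)
W_b = adapted_weight(U, s, Vt, z_b)
blended = [[0.5 * x + 0.5 * y for x, y in zip(ra, rb)] for ra, rb in zip(W_a, W_b)]

# Mixing expert vectors gives the same matrix as mixing the adapted weights.
# Equality is exact here because every value is exactly representable in binary.
assert adapted_weight(U, s, Vt, z_mix) == blended
```

Because every expert rescales the same fixed set of singular directions, the adapted weight is linear in the expert vector, which is why blending vectors and blending adapted matrices agree in this example.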
The neuroscience parallel is deliberate: the brain activates specific regions depending on the task and dynamically reconfigures its functional networks in response to changing demands. Transformer2 operationalizes this for LLMs.
The deeper principle: the capabilities required for many downstream tasks already exist within pretrained models. The bottleneck is not knowledge but activation, that is, knowing when to deploy which capability. This aligns with the note "Does RL teach reasoning or just when to use it?" and extends it to the architecture level: self-adaptation is about routing to existing capabilities, not creating new ones.
Inquiring lines that use this note as a source (67)
This note is a source for these synthesized inquiries.
- When does knowledge activation fail across different model architectures?
- Could superposed decoding algorithms maintain multi-task representation during generation?
- Does task superposition explain how models learn from multiple in-context trajectories?
- Can prompting inject new knowledge into already-trained AI models?
- What techniques work best for injecting domain knowledge at training time?
- Can language models learn to form ad-hoc conventions through training?
- Can prompting unlock compositional skills that pretraining already learned?
- Can symbolic mechanisms improve transformer compositional abilities?
- Does narrow reallocation to remaining tasks constitute genuine adaptation?
- Why does full multi-task fine-tuning perform worse than sequential training?
- Can dynamic instance-specific prompt selection solve the generalization problem across tasks?
- Can demo placement be tuned as a task-specific hyperparameter?
- How do training-time and inference-time knowledge injection techniques compare?
- How does over-specialization create capability cliffs outside target domains?
- Can prompt optimization inject new knowledge into language models?
- How does candidate-conditional activation differ from static embedding-based feature crosses?
- What architectural changes would let language models develop genuine functional competence?
- How do ensemble methods apply within a single model?
- How do trait adapters interact with different base model architectures?
- What skills can large models identify and organize about their own abilities?
- Can a single model trained on two tasks predict untrained decision tasks?
- When should full-parameter post-training be used instead of LoRA adaptation?
- Can prompt optimization or fine-tuning inject knowledge models do not already contain?
- What performance trade-offs emerge when composing multiple independently trained model capabilities?
- What knowledge can prompt optimization actually activate in trained models?
- Why does the gap between theoretical expressiveness and learned capability matter?
- Can smaller models achieve domain expertise through focused RL training?
- Why do singular value experts compose better than low-rank adapter subspaces?
- Can expert vectors learned offline transfer across multiple model architectures?
- What deployment tradeoffs emerge between single-pass and multi-pass inference adaptation?
- How does pretrained knowledge constrain what adaptation strategies can achieve?
- What training cost tradeoffs exist between fine-tuning and other knowledge injection methods?
- How much task-similar finetuning data does test-time training actually need?
- Does sparse parameter updating improve test-time training's computational cost?
- Does training on granular tasks beat training on the full function calling problem?
- Can users adapt their competencies to match how AI actually operates?
- How can interpretability methods account for shifting representational density across task conditions?
- Can activation steering vectors compress reasoning without retraining models?
- Can individual skills improve through reuse and accumulate experience across tasks?
- What happens to base model capabilities when you apply finetuning?
- How should skill libraries coordinate with gradient-based weight optimization?
- Can single-hop knowledge automatically compose into multi-hop capability?
- Can extracted skills transfer effectively across different domains and model architectures?
- Where does skill extraction fail compared to genuine model adaptation?
- How do transformers stitch together learned behaviors when adapting to new tasks?
- Can training on diverse related tasks be more efficient than task-specific training?
- Why does scaling data and model size improve compositional generalization?
- Why does specializing to one task make future task learning harder?
- Can memory-based adaptation and gradient fine-tuning operate on complementary timescales?
- Can models adapt and combine search strategies beyond their training algorithm?
- How does mixture of experts enable flexible capacity sharing between modalities?
- Can dense models partially address modality friction without full expert specialization?
- How much does pretraining quality affect the modularity of fine-tuned models?
- Can specialized components replace single fully-trained models in deployment?
- Can models recover knowledge with completely unrelated retraining tasks?
- How do sparse mixture-of-experts models resolve modality capacity competition?
- Do text-space skills transfer learning across different frontier models?
- How do sparse parameter updates enable when-not-how training to work?
- How do task frequency and complexity interact with model capacity during training?
- Can decoding-time prompting strategies fully replace diversity-focused training methods?
- What does next-token prediction tell us about compositional linguistic competence?
- How does the inference steps dial compare to test-time compute trade-offs in language models?
- How do semantic features in representations become steerable task-specific directions?
- Why does adaptation concentrate in low-dimensional subspaces of weights or representations?
- Can modular expert decomposition extend beyond time into other causal dimensions?
- Does ternary weight quantization simplify deployment of mixture of experts?
- What makes mixture-of-experts routing learn token-level specialization effectively?
Related concepts in this collection (4)
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  Transformer2 operationalizes this at the architecture level via test-time expert composition
- How do knowledge injection methods trade off flexibility and cost?
  When and how should domain knowledge enter an AI system? This explores the speed, training cost, and adaptability trade-offs across four injection paradigms, and when each approach suits different deployment constraints.
  SVF occupies a new position: lightweight training, dynamic inference, composable
- Can isolating task-specific parameters prevent multi-task fine-tuning interference?
  Explores whether identifying and protecting task-specific parameter regions can prevent the performance degradation that occurs when fine-tuning models on multiple tasks simultaneously. This matters because it could enable safe multi-task adaptation without sacrificing individual task performance.
  SVF achieves similar goals through singular value decomposition rather than region identification
- Can decoding-time tuning preserve knowledge better than weight fine-tuning?
  Explores whether applying alignment signals at inference time rather than modifying model weights can better preserve the factual knowledge learned during pretraining while still achieving alignment goals.
  Complementary inference-time adaptation: proxy-tuning uses a single expert distributional shift, SVF composes multiple expert vectors; both avoid base weight modification but SVF provides finer-grained multi-skill composition via orthogonal singular value dimensions
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Transformer2: Self-adaptive LLMs
- Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence
- Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance
- An Emulator for Fine-Tuning Large Language Models using Small Language Models
- QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration
- Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models
- Exploring Format Consistency for Instruction Tuning
- Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
Original note title
self-adaptive LLMs compose expert vectors at inference via two-pass singular value fine-tuning