Can language models discover new expertise through collaborative weight search?
Can model experts be composed through particle swarm optimization in weight space without training? This explores whether collaborative search can discover capabilities that no individual expert possesses.
Model composition has two dominant approaches: learn-to-fuse (train components to glue experts together — data-heavy, rigid) and model arithmetic (weight operations with strong assumptions like lion_indoors = lion_outdoors + dog_indoors - dog_outdoors — assumption-heavy, manual). MODEL SWARMS proposes a third way: collaborative search in weight space inspired by particle swarm optimization.
Each LLM expert is a particle with a location (its model weights) and a velocity (a direction in weight space). Velocity is iteratively updated by three forces: inertia (the tendency to keep moving in the current direction), personal best (attraction toward the best location this particle has found), and global best/worst (attraction toward the best location, and repulsion from the worst location, found across all particles). Each particle then takes a step along its updated velocity.
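The update described above can be sketched on small NumPy vectors standing in for flattened expert weights. This is a minimal illustration, not the MODEL SWARMS implementation: the function name `swarm_search`, the coefficient values, the step size, and the toy utility are all assumptions made for the example.

```python
import numpy as np

def swarm_search(experts, utility, steps=50, inertia=0.3, c_personal=0.3,
                 c_global=0.3, c_worst=0.1, step_size=0.5, seed=0):
    """Toy particle-swarm search; all coefficients are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    x = np.array(experts, dtype=float)      # locations: one row per expert
    v = np.zeros_like(x)                    # velocities start at rest
    scores = np.array([utility(w) for w in x])
    p_best, p_best_score = x.copy(), scores.copy()
    g_best, g_best_score = x[scores.argmax()].copy(), scores.max()
    g_worst, g_worst_score = x[scores.argmin()].copy(), scores.min()
    for _ in range(steps):
        # One random factor per particle and per force keeps the search stochastic.
        r_v, r_p, r_g, r_w = rng.random((4, len(x), 1))
        v = (inertia * r_v * v                     # inertia
             + c_personal * r_p * (p_best - x)     # pull toward personal best
             + c_global * r_g * (g_best - x)       # pull toward global best
             - c_worst * r_w * (g_worst - x))      # push away from global worst
        x = x + step_size * v                      # step along the velocity
        scores = np.array([utility(w) for w in x])
        improved = scores > p_best_score
        p_best[improved] = x[improved]
        p_best_score[improved] = scores[improved]
        if scores.max() > g_best_score:
            g_best, g_best_score = x[scores.argmax()].copy(), scores.max()
        if scores.min() < g_worst_score:
            g_worst, g_worst_score = x[scores.argmin()].copy(), scores.min()
    return g_best, float(g_best_score)

# Hypothetical utility: higher is better, peaking at a hidden target vector.
target = np.array([1.0, -2.0, 0.5])

def toy_utility(w):
    return -float(np.sum((w - target) ** 2))

initial = np.random.default_rng(1).normal(size=(5, 3))
best, best_score = swarm_search(initial, toy_utility)
```

The search only ever calls `utility` on candidate locations, so no gradient is needed. By construction the returned score can never be lower than the best initial expert's score.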
Three key properties make this distinctive:
1. Training-free. No loss function, gradient descent, or backpropagation. Composition requires only 200 examples as a validation signal, an amount a training-based approach would consume in barely three batches.
2. Assumption-free. No manual specification of how experts should compose. The swarm automatically discovers better adapted experts through collaborative search.
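3. Any adaptation objective. The utility function can be anything that scores a candidate model: dataset performance, reward model scores, human interests. This flexibility is structural: it comes from the search treating the utility as a black-box score, not from tuning parameters.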
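The third property can be illustrated with a sketch in which every objective is reduced to the same signature: weights in, scalar score out. The names `accuracy_utility`, `reward_utility`, and the toy stand-ins below are hypothetical and exist only for this example; a real setup would evaluate an actual model on a small validation set.

```python
from typing import Callable, Sequence

Weights = Sequence[float]
Utility = Callable[[Weights], float]

def accuracy_utility(predict, examples) -> Utility:
    """Utility = fraction of (input, label) validation examples answered correctly."""
    def utility(weights: Weights) -> float:
        correct = sum(predict(weights, x) == y for x, y in examples)
        return correct / len(examples)
    return utility

def reward_utility(generate, reward, prompts) -> Utility:
    """Utility = mean reward assigned to the responses generated for each prompt."""
    def utility(weights: Weights) -> float:
        return sum(reward(p, generate(weights, p)) for p in prompts) / len(prompts)
    return utility

# Toy stand-ins (assumptions): a two-parameter threshold "model" and a trivial reward.
def toy_predict(weights, x):
    return int(weights[0] * x + weights[1] > 0)

def toy_generate(weights, prompt):
    return "yes" if weights[0] > 0 else "no"

def toy_reward(prompt, response):
    return 1.0 if response == "yes" else 0.0

examples = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
acc = accuracy_utility(toy_predict, examples)
rew = reward_utility(toy_generate, toy_reward, ["a", "b"])

print(acc([1.0, 0.0]), acc([-1.0, 0.0]))  # 1.0 0.0
print(rew([1.0, 0.0]), rew([-1.0, 0.0]))  # 1.0 0.0
```

Because both factories return a plain `Utility`, a search loop can swap objectives without any other change.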
The most interesting finding is correctness emergence: new capabilities appear that no initial expert had. Questions where all experts initially answered incorrectly are answered correctly by post-swarm experts. This is not transfer — it is genuinely new capability discovered through search in weight space.
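One way to measure correctness emergence as defined here is to count the questions that every initial expert missed and that a post-swarm expert answers correctly. The sketch below assumes per-question correctness has already been recorded as booleans; the function name `correctness_emergence` and the sample data are illustrative, and comparing against a single post-swarm expert is a simplification.

```python
def correctness_emergence(initial_correct, final_correct):
    """Return (emerged question indices, emergence rate).

    initial_correct: one list of booleans per initial expert, indexed by question.
    final_correct:   one list of booleans for a post-swarm expert.
    """
    n_questions = len(final_correct)
    # Questions that no initial expert answered correctly.
    unsolved = [q for q in range(n_questions)
                if not any(expert[q] for expert in initial_correct)]
    # Of those, the ones the post-swarm expert now gets right.
    emerged = [q for q in unsolved if final_correct[q]]
    rate = len(emerged) / len(unsolved) if unsolved else 0.0
    return emerged, rate

# Made-up data: two initial experts, four questions.
initial = [
    [True, False, False, False],
    [False, True, False, False],
]
final = [True, True, True, False]

emerged, rate = correctness_emergence(initial, final)
print(emerged, rate)  # [2] 0.5
```

Questions 2 and 3 were missed by both initial experts; the post-swarm expert solves question 2, so half of the previously unsolved questions count as emerged.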
Practical results: 17.6% improvement in LLM-as-judge scores, 17.0% in factuality, 70.8% human win rate against initial experts (96% on best domains). MODEL SWARMS also drastically reduces sensitivity to minor prompt changes — improving robustness through weight-space optimization rather than prompt engineering.
Token swarms extend the approach to cross-architecture composition by operating on token probability distributions rather than weights.
Inquiring lines that use this note as a source (21)
- Can a single AI system optimize multiple alignment dimensions simultaneously?
- Can bilevel autoresearch discover new search mechanisms for the inner research loop?
- How does latent space diffusion enable evolutionary search in high dimensions?
- Can accelerated sampling techniques from image generation speed up evolutionary search?
- How do experts select which other experts to trust?
- How do ensemble methods apply within a single model?
- How do expert priors constrain human researchers from exploring novel concepts?
- Why do singular value experts compose better than low-rank adapter subspaces?
- Can expert vectors learned offline transfer across multiple model architectures?
- How many particles and iterations does optimal expert discovery require?
- Can token probability distributions extend swarm composition across different model architectures?
- How does joint backpropagation differ from training separate ensemble models?
- How should skill libraries coordinate with gradient-based weight optimization?
- Can models adapt and combine search strategies beyond their training algorithm?
- How does mixture of experts enable flexible capacity sharing between modalities?
- Why does gradient descent discover compositional structure without explicit pressure?
- How do sparse mixture-of-experts models resolve modality capacity competition?
- Can evolutionary search unlock problems that best-of-n selection cannot solve?
- Can the same problem be solved by multiple evolutionary search strategies?
- Does ternary weight quantization simplify deployment of mixture of experts?
- What makes mixture-of-experts routing learn token-level specialization effectively?
Related concepts in this collection (4)
- Can models dynamically activate expert skills at inference time?
  Can language models efficiently discover and compose task-specific capabilities on the fly without modifying base weights? This explores whether test-time adaptation through expert vector composition outperforms fixed fine-tuning approaches.
  Transformer2/SVF: inference-time expert composition via singular value decomposition; MODEL SWARMS operates in full weight space rather than decomposed space
- Can AI systems discover better neural architectures than humans?
  Can multi-agent LLM systems, when structured with genetic programming, discover novel neural network designs that outperform human-engineered architectures? This matters because it could automate a critical bottleneck in AI research.
  LLM-based evolutionary search for architectures; MODEL SWARMS uses evolutionary search for model adaptation
- Can evolutionary search beat sampling and revision at inference time?
  Can LLMs evolve populations of solutions through recombination and selection to outperform simpler inference strategies? This matters because it could reveal whether biological-inspired search improves planning without formal problem definitions.
  Mind Evolution: evolutionary operators applied to outputs; MODEL SWARMS applies evolutionary operators to weights
- Can diversity optimization improve quality during language model training?
  Standard RL training assumes quality and diversity trade off, with diversity optimization potentially hurting performance. Does explicitly rewarding semantic diversity during reinforcement learning actually improve output quality alongside diversity?
  DARLING: diversity optimization in output space; MODEL SWARMS achieves diversity via multi-particle exploration in weight space
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence
- Language Modeling by Language Models
- How Should We Meta-Learn Reinforcement Learning Algorithms?
- Fine-tuning Large Language Model for Automated Algorithm Design
- QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration
- AlphaEvolve: A coding agent for scientific and algorithmic discovery
- Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
- Transformer2: Self-adaptive LLMs
Original note title
swarm intelligence in weight space discovers adapted language model experts without training through collaborative search guided by utility functions