Can expert vectors learned offline transfer across multiple model architectures?

This explores whether the reusable 'expert' components some methods learn ahead of time — skill-specific weight adjustments, adapters, vectors — stay portable, or whether they're locked to the one model they were trained on.

The honest answer from the corpus: transfer across *different* architectures is largely unproven. The same research, though, shows why portability is hard and where the natural seams that *would* allow it actually live.

Start with what 'expert vectors' even are. Transformer² tunes only the singular values inside a model's weight matrices, producing small expert vectors that can be mixed at inference without interfering with one another (see "Can models dynamically activate expert skills at inference time?"). These are learned offline and composed live, but they are defined relative to one model's specific weight matrices, so a vector tuned for model A has no obvious meaning inside model B's differently shaped weights. That is the core obstacle: most 'experts' are coordinates in a particular model's parameter space, not free-floating skills.
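A minimal sketch of why such a vector is tied to its host matrix, assuming the expert is applied by rescaling singular values element-wise (our reading of the singular-value tuning described above). The function `apply_expert`, the toy matrices, and the vectors `z_math` and `z_code` are hypothetical names for illustration, not Transformer²'s actual code or API. Mixing is shown as a plain weighted average, which is also an assumption.

```python
import numpy as np


def apply_expert(weight, z):
    """Rescale the singular values of `weight` by the expert vector `z`.

    Hypothetical helper: computes U @ diag(s * z) @ Vt.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    if z.shape != s.shape:
        raise ValueError(
            f"expert has {z.shape[0]} entries, "
            f"but this matrix has {s.shape[0]} singular values"
        )
    # Broadcasting scales each column of U by the adjusted singular value.
    return (u * (s * z)) @ vt


rng = np.random.default_rng(0)
w_a = rng.normal(size=(6, 4))  # toy weight matrix standing in for model A
w_b = rng.normal(size=(8, 5))  # differently shaped matrix standing in for model B

# Two toy experts for model A: one scale factor per singular value.
z_math = np.array([1.2, 1.0, 0.8, 1.0])
z_code = np.array([1.0, 0.9, 1.0, 1.1])

identity = apply_expert(w_a, np.ones(4))                  # all-ones vector changes nothing
adapted = apply_expert(w_a, z_math)                       # one expert applied
mixed = apply_expert(w_a, 0.5 * z_math + 0.5 * z_code)    # assumed mixing rule

# The same vector cannot even be applied to model B's matrix.
try:
    apply_expert(w_b, z_math)
    transfer_error = None
except ValueError as exc:
    transfer_error = str(exc)
```

Even when two matrices happen to share a shape, their singular directions differ, so the same scaling vector would amplify unrelated directions in the second model.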

The more interesting lateral finding is that skills *do* seem to live in transferable structures within a model, which is the precondition any cross-architecture transfer would need. Pruning experiments show that networks naturally carve compositional tasks into isolated subnetworks, and that pretraining makes this modular structure *more consistent across architectures and domains* (see "Do neural networks naturally learn modular compositional structure?"). Similarly, length generalization rides on specific attention heads that get reused across related tasks, scaffolding that pretrained models already contain (see "Can length generalization transfer between different related tasks?"). So expertise is genuinely modular and reusable, but the reuse demonstrated so far is *within* a model, not ported between two different ones.

The corpus also suggests two escape routes from the weight-coordinate problem. One is to stop encoding the expert in weights at all: proxy-tuning applies an expert's *distributional shift at decoding time*, leaving base weights untouched (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"), and episodic-memory agents adapt continually through stored experience with zero parameter updates (see "Can agents learn continuously from experience without updating weights?"). Expertise held as an output-space signal or a memory module is far less architecture-bound than expertise baked into singular values. The other route is discovery rather than transfer: swarms of model 'particles' search weight space collaboratively and find composed experts that can answer questions none of the starting experts could, using only ~200 examples and no gradients (see "Can language models discover new expertise through collaborative weight search?").
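The first route, shifting the output distribution at decoding time, can be sketched as follows. This assumes the commonly described form of proxy-tuning, in which the difference between a small tuned 'expert' and its untuned 'anti-expert' is added to the base model's logits. The anti-expert term is not spelled out in the text above, and the function names and toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(base_logits, expert_logits, anti_expert_logits):
    """Shift the base model's next-token logits by (expert - anti_expert).

    Only output logits are combined; no model's weights are read or changed.
    """
    if not (len(base_logits) == len(expert_logits) == len(anti_expert_logits)):
        raise ValueError("all three models must score the same vocabulary")
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, anti_expert_logits)
    ]
    return softmax(shifted)


# Toy next-token logits over a four-token vocabulary (made-up numbers).
base = [2.0, 1.0, 0.5, 0.1]         # large base model
expert = [0.5, 2.5, 0.3, 0.0]       # small tuned model
anti_expert = [0.6, 0.5, 0.3, 0.0]  # the same small model before tuning

before = softmax(base)
after = proxy_tuned_probs(base, expert, anti_expert)
```

In this sketch the only thing the three models must share is the vocabulary the logits are defined over, not layer shapes or weight geometry, which illustrates why an output-space expert is less architecture-bound than a singular-value one.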

There's a cautionary thread too: experts that look transferable may just be converging on shared training data rather than carrying real, distinct skill. Different LLMs independently produce near-identical outputs, an 'Artificial Hivemind' that arises from overlapping pretraining and alignment (see "Do different AI models actually produce diverse outputs?"). So apparent cross-model compatibility can be an artifact of homogeneity, not evidence that a learned expert genuinely ported its capability. The takeaway: the bottleneck on cross-architecture transfer isn't whether skills are modular (they demonstrably are). It's that we mostly store experts as coordinates in one model's geometry, and the most promising portable forms keep the expertise *outside* the weights entirely.

Sources 7 notes

Can models dynamically activate expert skills at inference time?

Transformer2 demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can length generalization transfer between different related tasks?

Models trained jointly on related tasks reuse the same attention heads to handle length generalization, allowing shorter tasks to extrapolate beyond their training length. Pretrained models already contain this reusable computational scaffolding.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Can language models discover new expertise through collaborative weight search?

PSO-inspired swarms of LLM particles moving through weight space discover composed experts with new capabilities—including answering questions all initial experts failed on—using only 200 validation examples and no gradient-based training.

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tasked with re-testing whether expert vectors learned offline truly transfer across model architectures. This remains an open question.

What a curated library found — and when (dated claims, not current truth): These findings span 2023–2026 and capture a regime before the latest scaling/architectural shifts:
• Expert vectors are typically defined as coordinates in one model's weight-matrix geometry (singular values, parameter space), making cross-architecture transfer unproven (Transformer², 2025-01).
• Modular task decomposition is real and consistent across architectures during pretraining, but demonstrated reuse so far is *within* a single model, not between two different ones (2023–2025).
• Length generalization transfers via shared attention-head scaffolding within related tasks, not yet shown to port across heterogeneous architectures (2025-06).
• Non-weight-encoded expertise (proxy-tuning at decode time, memory modules, swarm-discovered experts) shows lower architecture-binding than singular-value experts (~200-example discovery with zero gradients; 2024–2025).
• Apparent cross-model output similarity may reflect Artificial Hivemind convergence on shared pretraining, not true portable expertise (2025-10).

Anchor papers (verify; mind their dates):
• arXiv:2301.10884 (2023-01) — structural compositionality baseline.
• arXiv:2501.06252 (2025-01) — Transformer² singular-value expert vectors.
• arXiv:2410.11163 (2024-10) — swarm-intelligence expert discovery.
• arXiv:2510.22954 (2025-10) — Artificial Hivemind homogeneity risk.

Your task:
(1) RE-TEST EACH CONSTRAINT. For weight-coordinate binding, check whether recent models (o1, Grok-2, Claude Opus) or new training methods (mixture-of-depths, dynamic routing, cross-model distillation) have relaxed the geometry-lock. For modular transfer, investigate whether parameter-sharing schemes, adapter fusion, or LoRA-composition across architectures now work reliably. Separate the durable question (can we port task-specific reasoning?) from perishable limits (weight-matrix coordinates are immobile).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months. Look for papers that *do* show cross-architecture expert transfer, or that abandon weight-encoded experts entirely for language-based skill vectors, latent thought vectors, or world-model routing.
(3) Propose 2 research questions that ASSUME the regime has moved: (a) If swarm-discovered or memory-encoded experts scale to 10B+ parameter models, does cross-architecture transfer emerge as a side effect? (b) Can expert *linearity* (e.g., skill vectors in residual stream space rather than weight space) bypass architecture-specificity?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
