Does task superposition explain how models learn from multiple in-context trajectories?
This explores whether the 'task superposition' phenomenon — models holding several in-context tasks at once — is the same mechanism that lets models learn from multiple worked-through trajectories in their context, or whether those are two different stories.
The short answer the corpus suggests is: probably not. Task superposition and in-context learning from trajectories are adjacent but distinct phenomena, and conflating them obscures more than it explains. Task superposition (see "Can LLMs handle multiple tasks at once during inference?") describes a model holding several complete, computationally distinct tasks in parallel during inference. But the same finding notes that autoregressive decoding collapses this superposition to a single task right after the first token. So superposition is a fleeting representational state, not a learning process: it explains what a model can momentarily represent, not how it absorbs and generalizes from examples laid out in context.
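One way to picture the collapse is as a Bayesian update over tasks. The sketch below is a toy illustration, not the mechanism the cited note measures: it assumes the next-token distribution is a weighted mixture over two hypothetical tasks, and that emitting a token reweights the tasks by how well each one predicted it. The task names, tokens, and probabilities are all invented for illustration.

```python
def posterior_after_token(prior, task_token_probs, token):
    """Bayes update of task weights after one token has been emitted."""
    unnormalized = {
        task: prior[task] * task_token_probs[task].get(token, 0.0)
        for task in prior
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("token has zero probability under every task")
    return {task: weight / total for task, weight in unnormalized.items()}


# Hypothetical setup: the prompt mixes two tasks applied to the word "cat".
prior = {"translate_to_french": 0.5, "uppercase": 0.5}

# Assumed first-token distributions for each task (made-up numbers).
first_token_probs = {
    "translate_to_french": {"chat": 0.9, "CAT": 0.1},
    "uppercase": {"CAT": 0.95, "chat": 0.05},
}

# Once "chat" is emitted, almost all of the weight moves to one task.
posterior = posterior_after_token(prior, first_token_probs, "chat")
```

With these made-up numbers, a single token moves the task weights from an even split to roughly 95/5. The more distinct the tasks' first tokens are, the sharper this collapse becomes in the toy model.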
The trajectory question is answered more directly elsewhere. What actually drives in-context learning of sequential, decision-making behavior is trajectory burstiness (see "Why do trajectories matter more than individual examples for in-context learning?"): the context needs full or partial trajectories drawn from the same environment, not scattered isolated examples. That's a claim about the *structure* of what you put in the prompt, where order and coherence matter, rather than about superposed task representations. The model generalizes across very different tasks without any weight update, but the lever is the shape of the demonstrations, not parallel task-holding.
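The contrast between bursty and scattered context can be sketched as two ways of sampling from a pool of recorded episodes. This is a minimal sketch under stated assumptions: the `Trajectory` type, the `env_id` label, and both helper functions are hypothetical names introduced here, not taken from the cited note.

```python
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    env_id: str  # which environment the episode came from (hypothetical label)
    steps: list  # ordered (observation, action, reward) tuples


def bursty_context(pool, query_env, n_trajectories, rng):
    """Pick whole trajectories that all come from the query's environment."""
    same_env = [t for t in pool if t.env_id == query_env]
    if len(same_env) < n_trajectories:
        raise ValueError("not enough trajectories from this environment")
    return rng.sample(same_env, n_trajectories)


def scattered_context(pool, n_steps, rng):
    """Baseline for contrast: isolated steps drawn from any environment."""
    all_steps = [(t.env_id, step) for t in pool for step in t.steps]
    return rng.sample(all_steps, n_steps)


# Toy pool: 3 environments, 4 episodes each, 5 steps per episode.
pool = [
    Trajectory(
        env_id=f"maze_{e}",
        steps=[(f"obs_{i}", f"act_{i}", 0.0) for i in range(5)],
    )
    for e in range(3)
    for _ in range(4)
]

rng = random.Random(0)
bursty = bursty_context(pool, "maze_1", n_trajectories=2, rng=rng)
scattered = scattered_context(pool, n_steps=10, rng=rng)
```

Both contexts hold the same number of steps, but only the bursty one preserves step order within an episode and keeps every example tied to the query's environment.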
There's a more compelling mechanistic candidate hiding in the corpus: shared attention machinery. Length generalization transfers across related tasks because models reuse the same attention heads, and pretraining already lays down this reusable scaffolding (see "Can length generalization transfer between different related tasks?"). Pair that with the finding that networks decompose compositional problems into modular subnetworks, again amplified by pretraining (see "Do neural networks naturally learn modular compositional structure?"). Together these sketch a picture where learning from many trajectories looks less like superposing whole tasks and more like routing context through pre-built, composable computational parts, closer to inference-time expert composition (see "Can models dynamically activate expert skills at inference time?") than to a quantum-like superposition.
The corpus also pushes back on the assumption that all trajectories teach the same way. Differential trajectory processing shows successes and failures should be handled asymmetrically (successes as concrete demonstrations, failures as abstracted lessons) and that uniform consolidation actually degrades learning (see "Should successful and failed episodes be processed differently?"). That's hard to square with a flat superposition account, where tasks coexist undifferentiated. And a sharper caution: in-context information often loses to strong parametric priors entirely, meaning trajectories in the prompt may not be 'learned from' at all when training associations dominate (see "Why do language models ignore information in their context?").
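The asymmetric handling of successes and failures can be sketched as a small memory-building step. This is an illustrative sketch, not SkillRL's implementation: the episode format, `build_memory`, and the placeholder `last_action_lesson` abstraction are all assumptions made here for the example.

```python
def build_memory(episodes, abstract_failure):
    """Keep successful episodes verbatim; reduce failed ones to short lessons."""
    memory = {"demonstrations": [], "lessons": []}
    for episode in episodes:
        if episode["success"]:
            # Successes stay concrete: the full step sequence is stored.
            memory["demonstrations"].append(episode["steps"])
        else:
            # Failures are abstracted: only a compact lesson is stored.
            memory["lessons"].append(abstract_failure(episode))
    return memory


def last_action_lesson(episode):
    """Placeholder abstraction that blames the final action.

    A real system would more plausibly use a model-written summary.
    """
    last_observation, last_action = episode["steps"][-1]
    return f"Avoid '{last_action}' when you see '{last_observation}'."


# Hypothetical episodes as (observation, action) pairs.
episodes = [
    {
        "success": True,
        "steps": [("door closed", "open door"), ("door open", "walk through")],
    },
    {"success": False, "steps": [("door closed", "walk through")]},
]

memory = build_memory(episodes, last_action_lesson)
```

Storing failures as one-line lessons instead of full step sequences is also where the context savings would come from in a setup like this.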
So the honest synthesis is that task superposition and trajectory-based in-context learning answer different questions. Superposition tells you a model briefly entertains multiple tasks before committing; trajectory learning is governed by demonstration structure, reused attention heads, modular composition, and how success-versus-failure signal is processed. If you came hoping superposition was the unifying explanation, the interesting surprise is that the more durable mechanism appears to be compositional reuse of pretrained structure — not holding everything at once.
Sources (7 notes)
Large language models represent multiple complete, computationally distinct tasks simultaneously during inference—a macroscopic phenomenon separate from feature-level superposition. However, autoregressive decoding forces convergence to a single task after the first token, preventing practical multi-task generation.
In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.
Models trained jointly on related tasks reuse the same attention heads to handle length generalization, allowing shorter tasks to extrapolate beyond their training length. Pretrained models already contain this reusable computational scaffolding.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.