Can length generalization transfer between different related tasks?
Can a model trained on longer sequences in one task learn to handle longer inputs in a related task without explicit training? This matters for understanding how neural networks reuse computational strategies across problems.
The "Extrapolation by Association" paper demonstrates a specific mechanism for out-of-distribution generalization: length generalization — the ability to handle longer inputs than seen during training — can transfer from one task to another.
The setup: train multiple related tasks jointly, where an "auxiliary task" uses longer inputs and a "main task" uses shorter inputs. The finding: the main task generalizes to the length of the longer auxiliary task, even though it was never trained at that length. This works across arithmetic operations, string transformations, and maze navigation — diverse algorithmic domains sharing an underlying structural similarity.
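The setup above can be sketched as a data mixture. This is a minimal illustration, not the paper's actual pipeline: the two string tasks (`reverse` as the main task, `copy` as the auxiliary task), the alphabet, and the function names are all hypothetical stand-ins chosen to show the length asymmetry.

```python
import random

MAIN_TASK = "reverse"  # hypothetical main task, trained on short inputs only
AUX_TASK = "copy"      # hypothetical auxiliary task, trained on longer inputs


def make_example(task, length, rng):
    """Build one example of a toy string task at the given input length."""
    text = "".join(rng.choice("abcd") for _ in range(length))
    target = text[::-1] if task == MAIN_TASK else text
    return {"task": task, "input": text, "target": target}


def build_joint_train_set(n_per_task, main_max_len, aux_max_len, seed=0):
    """Mix short main-task examples with longer auxiliary-task examples."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_per_task):
        data.append(make_example(MAIN_TASK, rng.randint(1, main_max_len), rng))
        data.append(make_example(AUX_TASK, rng.randint(1, aux_max_len), rng))
    rng.shuffle(data)
    return data


def build_main_eval_set(n, main_max_len, aux_max_len, seed=1):
    """Main-task examples at lengths the main task never saw in training."""
    rng = random.Random(seed)
    return [
        make_example(MAIN_TASK, rng.randint(main_max_len + 1, aux_max_len), rng)
        for _ in range(n)
    ]
```

Under these assumptions, transfer would show up as high main-task accuracy on `build_main_eval_set(...)`, whose lengths lie strictly above the main task's training range but within the auxiliary task's range.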
The mechanistic evidence is correlational but specific: length generalization transfer correlates with the reuse of the same attention heads between tasks. This suggests the model doesn't learn separate length-handling circuitry per task. Instead, it appears to develop shared computational infrastructure that handles the length dimension, and this infrastructure transfers because the related tasks route through the same attention heads.
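One way to make "reuse of the same attention heads" concrete is an overlap score between the heads that matter most for each task. The sketch below is an assumption for illustration, not the metric the paper reports: it takes per-head importance scores (for example from ablating each head, supplied here as plain nested lists indexed by layer and head) and computes the Jaccard overlap of the top-k heads for two tasks. The function names are hypothetical.

```python
def top_k_heads(importance, k):
    """Return the set of (layer, head) pairs with the k largest scores."""
    scored = [
        (score, (layer, head))
        for layer, row in enumerate(importance)
        for head, score in enumerate(row)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return {index for _, index in scored[:k]}


def head_overlap(importance_a, importance_b, k):
    """Jaccard overlap between the top-k heads of two tasks (0.0 to 1.0)."""
    heads_a = top_k_heads(importance_a, k)
    heads_b = top_k_heads(importance_b, k)
    return len(heads_a & heads_b) / len(heads_a | heads_b)


# Toy scores for a 2-layer, 2-head model (made-up numbers).
main_scores = [[0.9, 0.1], [0.8, 0.0]]
aux_scores = [[0.7, 0.2], [0.1, 0.6]]
overlap = head_overlap(main_scores, aux_scores, k=2)  # one shared head out of three
```

Under this assumed metric, the claim would be that task pairs with higher overlap show stronger length generalization transfer.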
The pretrained-model finding extends this further: pretrained language models already exhibit similar transfer effects, suggesting that pretraining equips models with "reusable computational scaffolding" that facilitates extrapolation in downstream settings. The scaffolding is not task-specific — it is a general capability for processing longer sequences that was acquired during pretraining and can be activated by fine-tuning on related tasks.
This connects to "Do base models already contain hidden reasoning ability?" through a shared principle: pretraining installs capabilities that later training surfaces rather than creates. In the pretrained setting, the base model already has the computational scaffolding for length handling; the auxiliary task merely activates it for the main task.
The connection to "Do neural networks naturally learn modular compositional structure?" is direct: attention head reuse across tasks is a specific instance of modular subnetwork sharing. The decomposition into reusable modules happens naturally, and pretraining encourages it. This is the compositional generalization thesis applied to the length dimension.
If scaling alone can enable compositional generalization, as "Can neural networks learn compositional skills without symbolic mechanisms?" asks, then length generalization may follow the same scaling trajectory: more data and larger models would produce more transferable attention head circuits. The practical implication: training on a diverse set of related tasks at varying lengths may be more efficient than training each task independently at the target length.
Inquiring lines that use this note as a source (11)
- Does extended exoskeleton use eventually produce meaningful skill transfer?
- Does task superposition explain how models learn from multiple in-context trajectories?
- Does compositional generalization emerge suddenly or improve smoothly with scale?
- Can backward transfer measurements reliably predict optimal multi-task training order?
- Can expert vectors learned offline transfer across multiple model architectures?
- Can scaling alone create compositional generalization without explicit binding mechanisms?
- Can the joint-training principle extend beyond memorization and generalization pairs?
- What makes data augmentation an implicit form of contraction learning?
- Can we predict out-of-distribution generalization without access to downstream tasks?
- How do neural networks decompose tasks into modular subnetworks that transfer?
- What makes recurrent depth enable compositional generalization across tasks?
Related concepts in this collection (3)
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  shared principle: pretraining installs capabilities that later training surfaces
- Do neural networks naturally learn modular compositional structure?
  Explores whether neural networks decompose compositional tasks into distinct subroutines without explicit symbolic design. This challenges the longstanding view that neural networks are fundamentally non-compositional.
  attention head reuse is a concrete instance of modular subnetwork sharing
- Can neural networks learn compositional skills without symbolic mechanisms?
  Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.
  length generalization may share the same scaling dynamics
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Extrapolation by Association: Length Generalization Transfer in Transformers
- Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges
- Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
- Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models
- In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss
- When More is Less: Understanding Chain-of-Thought Length in LLMs
- Scaling can lead to compositional generalization
- From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence
Original note title
length generalization transfers across related tasks via shared attention head reuse — pretraining provides reusable computational scaffolding