Why do decoder-only models underperform as text encoders?
Decoder-only LLMs use causal attention, which limits each token to seeing only prior context. This note explores whether removing that constraint can make them competitive universal encoders without architectural redesign.
LLM2Vec (arXiv:2404.05961) identifies a specific architectural reason for the slow adoption of decoder-only LLMs as text encoders: causal attention restricts each token's representation to information from itself and the tokens before it. At any layer, the representation of the token at position i is influenced only by positions 0 through i, never by later positions. While this is necessary for autoregressive generation, it is suboptimal for text embeddings, which need to capture information from the entire input sequence.
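A toy illustration of the masking effect described above, offered as a sketch rather than anything from the paper: a single attention head with identity query/key/value projections (an assumption made purely to keep the example short). Editing only the last token leaves the first token's output untouched under the causal mask, but changes it once the mask is removed.

```python
import math

def attention(x, causal):
    """Single-head self-attention with identity Q/K/V projections (toy assumption)."""
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        # With the causal mask, position i can only see itself and earlier positions.
        visible = range(i + 1) if causal else range(n)
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d) for j in visible]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * x[j][k] for w, j in zip(weights, visible))
            for k in range(d)
        ])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edited = [[1.0, 0.0], [0.0, 1.0], [-2.0, 3.0]]  # only the last token differs

# Does the FIRST token's representation stay the same after editing the LAST token?
first_causal_same = attention(tokens, causal=True)[0] == attention(edited, causal=True)[0]
first_bidir_same = attention(tokens, causal=False)[0] == attention(edited, causal=False)[0]
print(first_causal_same, first_bidir_same)  # True False
```

In a real decoder-only model the same asymmetry holds at every layer, which is why early-token representations carry no information about the rest of the sequence.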
The fix is a surprisingly simple three-step unsupervised transformation:
- Enable bidirectional attention (remove the causal mask)
- Masked next token prediction (adapt to the bidirectional regime)
- Unsupervised contrastive learning (align representations for similarity)
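The second step, masked next token prediction (MNTP), can be sketched as a data-preparation routine. In LLM2Vec, a masked token at position i is predicted from the logits at position i - 1, which keeps the objective close to the next-token prediction the model was pretrained on. The function name `mntp_examples`, the `MASK` placeholder, and the masking probability below are hypothetical choices for illustration, not the paper's code or settings.

```python
import random

MASK = "_"  # placeholder mask token (an assumption for this sketch)

def mntp_examples(tokens, mask_prob=0.2, seed=0):
    """Mask random tokens and pair each with the position whose logits predict it."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets = []  # (position whose logits are scored, original token to recover)
    for i in range(1, len(tokens)):  # position 0 has no previous position to predict from
        if rng.random() < mask_prob:
            masked[i] = MASK
            # The loss for the token at position i is read from the logits at i - 1.
            targets.append((i - 1, tokens[i]))
    return masked, targets

sentence = "causal attention limits what each token can see".split()
masked, targets = mntp_examples(sentence, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

During training, the masked sequence would be fed through the model with bidirectional attention enabled, and cross-entropy would be computed only at the positions listed in `targets`.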
Applied to models from 1.3B to 8B parameters, the unsupervised recipe sets a new unsupervised state of the art on MTEB, and combining it with supervised contrastive learning achieves state of the art on MTEB among models trained only on publicly available data. Word-level tasks see the largest margin over encoder-only models, and none of these results rely on synthetic GPT-4 data.
The finding bears on the debate over embedding retrieval architectures. As the related note "Do embedding dimensions fundamentally limit retrievable document combinations?" argues, single-vector retrieval faces hard geometric limits, so embedding quality matters within those constraints. LLM2Vec suggests that the main bottleneck on representation quality in decoder-only models is the causal mask rather than model size or training data: removing the mask, followed by light adaptation, makes much more of the pretrained model's representational capacity available for encoding.
LLM2Vec's contrastive learning step is relevant to the related note "Do vector embeddings actually measure task relevance?": it aligns representations for similarity rather than association, which may help address the semantic-versus-relevance gap at the encoder level.
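To make the contrastive step concrete, here is a minimal sketch of an in-batch contrastive (InfoNCE) loss of the kind used in SimCSE-style training, where two views of the same sentence are positives and the other sentences in the batch are negatives. The toy vectors, the `info_nce` name, and the temperature value are assumptions for illustration; this is not the paper's implementation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(view_a, view_b, temperature=0.05):
    """Mean cross-entropy of selecting view_b[i] as the match for view_a[i]."""
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        top = max(logits)  # log-sum-exp with max subtraction for stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Two "views" per sentence; in SimCSE these come from two dropout-perturbed passes.
view_a = [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.2]]
view_b = [[0.9, 0.2], [0.2, 0.9], [-0.9, 0.1]]

aligned = info_nce(view_a, view_b)
mismatched = info_nce(view_a, view_b[1:] + view_b[:1])  # positives paired wrongly
print(aligned < mismatched)  # True
```

Minimizing this loss pulls the two views of each sentence together and pushes different sentences apart, which is the sense in which the step aligns representations for similarity.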
Inquiring lines that use this note as a source (5)
This note is a source for these synthesized inquiries.
- Does activation masking prevent the decoder from taking interpretability shortcuts?
- Do decoder-only models have inherent architectural limits for non-sequential information?
- Does bidirectional attention improve language models as universal encoders?
- Can decoder-only models become effective text encoders with training?
- How does causal multimodal modeling differ from encoder-decoder architectures?
Related concepts in this collection (2)
- Do embedding dimensions fundamentally limit retrievable document combinations?
  Can single-vector embeddings represent any top-k document subset a user might need? Research using communication complexity theory suggests there are hard geometric limits independent of training data or model architecture.
  Relation to this note: better encoders work within these limits but cannot escape them.
- Do vector embeddings actually measure task relevance?
  Vector embeddings rank semantic similarity, but RAG systems need topical relevance. When these diverge, as with king/queen versus king/ruler, does similarity-based retrieval fail in production?
  Relation to this note: causal masking may contribute to association-over-relevance by limiting contextual scope.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
- The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
- Large Language Model Programs
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
- Multi-Token Attention
- Everything Everywhere All At Once: LLMs Can In-context Learn Multiple Tasks In Superposition
- System 2 Attention (is something you might need too)
Original note title
causal attention inherently limits decoder-only models as text encoders — enabling bidirectional attention transforms them into competitive universal encoders