Does bidirectional attention improve language models as universal encoders?

This note asks whether removing the left-to-right "causal mask", so that the model can look both forward and backward across a sentence, makes decoder-only LLMs better at producing the embeddings used for search, retrieval, and clustering. The corpus gives a direct and surprisingly clean answer: yes. The LLM2Vec work (see "Why do decoder-only models underperform as text encoders?") shows that what held decoder-only models back as text encoders was not their size but causal attention itself. Because each position in a decoder-only model sees only the tokens to its left, the representation of an early word never "knows" about the words that follow it. That is the opposite of what an encoder needs, where a good vector should summarize the whole passage. Switch on bidirectional attention, add a short round of masked next-token prediction and unsupervised contrastive learning, and these models reach state-of-the-art results on the standard embedding benchmark (MTEB). The bottleneck was architectural, not a matter of scale.
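The effect of the causal mask on an early token can be seen in a toy example. The sketch below is an illustration only, not LLM2Vec's code: it assumes a single attention head with identity projections (queries, keys, and values all equal the input vectors), and the two three-token sequences are made up. It compares the first token's output when only the last token changes.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, causal):
    """Toy single-head self-attention with identity projections (Q = K = V = x).

    x is a list of token vectors. With causal=True, position i may attend
    only to positions 0..i; with causal=False, it attends to every position.
    """
    dim = len(x[0])
    out = []
    for i, q in enumerate(x):
        visible = x[: i + 1] if causal else x
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in visible]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, visible)) for d in range(dim)])
    return out

# Two hypothetical 3-token sequences that differ only in the last token.
seq_a = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
seq_b = [[1.0, 0.0], [0.5, 0.5], [1.0, -1.0]]

causal_a = self_attention(seq_a, causal=True)
causal_b = self_attention(seq_b, causal=True)
bidir_a = self_attention(seq_a, causal=False)
bidir_b = self_attention(seq_b, causal=False)

# Under the causal mask, the first token's output ignores what follows it.
first_token_same_causal = causal_a[0] == causal_b[0]
# With the mask removed, the same token's output depends on the later token.
first_token_same_bidir = bidir_a[0] == bidir_b[0]

print("causal mask, first token unchanged:", first_token_same_causal)
print("bidirectional, first token unchanged:", first_token_same_bidir)
```

In a real model the same switch is a change to the attention mask inside every layer, and, as the note says, the model still needs a short adaptation phase before the unmasked representations become useful.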

What makes this interesting is *why* causal attention is such a liability, and here the corpus lets you go sideways into the mechanics of attention. One line of work argues that transformers don't store knowledge as retrievable records at all; knowledge lives as a continuous flow of activations that exists only in the act of generation (see "Do transformer models store knowledge or generate it continuously?"). A model built to *generate* the next token left-to-right is optimized for producing text in the moment, not for compressing meaning into a single fixed vector, which is precisely the job of an encoder. Bidirectional attention is a way of repurposing a generation engine into a summarization engine.

There's a cautionary thread too. Soft attention has a structural bias: it systematically over-weights repeated and context-prominent tokens regardless of whether they're relevant (see "Does transformer attention architecture inherently favor repeated content?"). Letting attention see in both directions doesn't automatically fix that bias; it can give prominent-but-irrelevant material more chances to dominate the representation. So "bidirectional" is a genuine improvement for encoding, but it inherits attention's other quirks rather than curing them.
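One simple mechanism consistent with this bias is that softmax attention mass adds up across positions, so a token that appears several times collects weight from every occurrence. The sketch below is a toy illustration under that assumption, not the analysis from the cited note; the token names and query-key scores are hypothetical.

```python
import math

def attention_mass(scores_by_token):
    """Softmax over all positions, then sum the weights per distinct token.

    scores_by_token maps a token label to the list of query-key scores
    for each position where that token occurs.
    """
    flat = [(tok, s) for tok, scores in scores_by_token.items() for s in scores]
    m = max(s for _, s in flat)
    exps = [(tok, math.exp(s - m)) for tok, s in flat]
    total = sum(e for _, e in exps)
    mass = {}
    for tok, e in exps:
        mass[tok] = mass.get(tok, 0.0) + e / total
    return mass

# Hypothetical scores: each "relevant" occurrence scores higher than
# each "distractor" occurrence.
once = attention_mass({"relevant": [2.0], "distractor": [1.0]})
repeated = attention_mass({"relevant": [2.0], "distractor": [1.0] * 5})

print("distractor appears once:", once)
print("distractor appears five times:", repeated)
```

With a single occurrence the relevant token holds most of the attention mass; with five occurrences the distractor holds more in total, even though no individual score changed. A bidirectional mask widens the set of visible positions, so it can only add to the number of repeats a given query sees.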

The broader lesson: the field keeps finding that decoder-only LLMs have latent capabilities locked behind the constraints of how they were built and trained, not how big they are. Depth turns out to matter more than width for small models (see "Does depth matter more than width for tiny language models?"); unused sequence space after the end-of-text token can be repurposed to teach self-evaluation (see "Can models learn to evaluate their own work during training?"); and here, flipping the attention mask unlocks an entirely different use. "Universal encoder" is less a new model you have to train from scratch and more a capability you can unlock, with a short adaptation phase, in one you already have.

Sources (5 notes)

Why do decoder-only models underperform as text encoders?

LLM2Vec's unsupervised three-step recipe (bidirectional attention + masked prediction + contrastive learning) achieves SOTA on MTEB. The research shows that causal masking, not model size, is the representation bottleneck when decoder-only models are used as encoders.

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.

Does transformer attention architecture inherently favor repeated content?

Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention—regenerating context to remove irrelevant material—can interrupt this mechanism.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher evaluating whether bidirectional attention truly unlocks decoder-only LLMs as universal text encoders—or whether the constraint has been relaxed by other means (scaling, training method, or inference technique) since early 2024.

What a curated library found—and when (dated claims, not current truth):
Findings span Nov 2023–Oct 2025. Key constraints reported:
- Causal attention fundamentally limits decoder-only models' ability to produce coherent sentence embeddings; bidirectional masking + contrastive learning recovers MTEB state-of-the-art (arXiv:2404.05961, Apr 2024).
- Transformers store knowledge as continuous activation flows, not retrievable records; generation-optimized models compress meaning poorly for fixed-point encoding (referenced ~2024).
- Soft attention systematically over-weights repeated/context-prominent tokens regardless of relevance; bidirectionality does not cure this structural bias (referenced ~2024).
- Depth > width for sub-billion parameter models, and post-EOS token space can internalize self-evaluation (arXiv:2402.14905 Feb 2024; arXiv:2507.20252 Jul 2025).

Anchor papers (verify; mind their dates):
- arXiv:2404.05961 (LLM2Vec, Apr 2024): Bidirectional attention + contrastive learning.
- arXiv:2405.00208 (Primer on Inner Workings, Apr 2024): Mechanistic explanation of attention and residual flows.
- arXiv:2507.20252 (Post-Completion Learning, Jul 2025): Unused token space for self-improvement.
- arXiv:2510.27062 (Consistency Training, Oct 2025): Instruction-following robustness.

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, judge whether larger models, training methods (instruction tuning, contrastive objectives), inference tooling (memory, caching, multi-query attention), or newer evaluation protocols have since relaxed or overturned the claim. Separate the durable question ("Is causal masking a fundamental bottleneck for encoding?") from the perishable limitation ("Current off-the-shelf decoder-only models fail MTEB"). Cite what resolved it—or plainly state where the constraint still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months that might argue bidirectionality is not the main lever, or that encoding capability has emerged differently.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "Has instruction tuning or scale eliminated the need for explicit bidirectional retraining?" or "Do hybrid prefix-bidirectional masks outperform full bidirectionality?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
