Can decoder-only models become effective text encoders with training?

This explores whether models built to generate text left-to-right (decoder-only LLMs like the GPT family) can be retrofitted into good text encoders — systems that produce dense vector representations for search and similarity — and what kind of training it takes.

The corpus gives a clear yes, with a sharp diagnosis of *why* these models start out bad. The most direct answer comes from work on LLM2Vec (see "Why do decoder-only models underperform as text encoders?"): the thing holding these models back as encoders isn't their size or their pretraining, it's the *causal mask*. Because each token in a decoder-only model may attend only to the tokens on its left, only the final position ever 'sees' the full sentence, and every earlier token's representation is built without its right-hand context. A good whole-text embedding needs exactly that context. Flip on bidirectional attention, add a short bout of masked next-token prediction and contrastive learning, and the same model reaches state-of-the-art unsupervised results on the MTEB embedding benchmark. The surprising takeaway: a limitation many assumed was baked into the weights was largely imposed by the attention pattern.
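To make the masking argument concrete, here is a minimal sketch in plain Python. It is not LLM2Vec's code: the function names `attention_weights` and `visible_counts` and the uniform toy scores are illustrative assumptions, and the sketch only shows which positions each token can attend to under a causal versus a bidirectional mask.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, causal):
    """Return per-row attention weights; row i is what position i attends to.

    scores: square list-of-lists of raw attention scores (toy values here).
    causal: if True, position i is blocked from every position j > i.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            blocked = causal and j > i
            # A blocked score of -inf becomes a weight of exactly 0.0.
            row.append(float("-inf") if blocked else scores[i][j])
        weights.append(softmax(row))
    return weights


def visible_counts(weights):
    """Count how many positions each token actually attends to."""
    return [sum(1 for w in row if w > 0.0) for row in weights]


# Toy 4-token sequence with uniform scores (an assumption for illustration).
scores = [[0.0] * 4 for _ in range(4)]
causal = attention_weights(scores, causal=True)
bidirectional = attention_weights(scores, causal=False)

print("causal:       ", visible_counts(causal))
print("bidirectional:", visible_counts(bidirectional))
```

Under the causal mask the first token attends to one position and only the last token attends to all four; with the mask removed, every token attends to the whole sequence, which is the property a pooled sentence embedding benefits from.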

That reframes 'with training' as less about teaching the model new knowledge and more about *unlocking representations it already had*. There's supporting evidence that the raw material is sitting there before attention even runs: analysis of static embeddings shows they already encode rich semantic structure (valence, concreteness, even taboo) and function like genuine lexical entries (see "Do transformer static embeddings actually encode semantic meaning?"). So the encoder conversion isn't building meaning from scratch; it's reorganizing access to signal the network already carries.

The 'with training' part also comes with a caution that the corpus surfaces from a different corner. Not all fine-tuning is benign: directly tuning a model's weights can corrupt knowledge stored in its lower layers, which is why decoding-time methods like proxy-tuning preserve pretrained knowledge better (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). The lesson for encoder conversion is that *how* you train matters: lightweight, targeted adaptation (as LLM2Vec uses) is more likely to preserve the model's strengths than heavy weight surgery. This connects to how knowledge lives in these models at all. Transformer residual streams seem to carry knowledge as continuous *flow* rather than fixed storage (see "Do transformer models store knowledge or generate it continuously?"), which helps explain both why representations are extractable and why aggressive retraining can disturb them.
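For readers unfamiliar with decoding-time tuning, the following is a rough sketch of the logit arithmetic, assuming the commonly described proxy-tuning formulation in which the difference between a small tuned 'expert' and its untuned 'anti-expert' is added to the large base model's next-token logits. The function name `proxy_tuned_distribution` and the three-token toy logits are hypothetical; no real model is involved.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_distribution(base_logits, expert_logits, antiexpert_logits):
    """Shift the base model's logits by (expert - anti-expert), then normalize.

    The base model's weights are never modified; only its output
    distribution is steered at decoding time.
    """
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)
    ]
    return softmax(shifted)


# Toy next-token logits over a 3-token vocabulary (illustrative values).
base = [2.0, 1.0, 0.0]        # large untuned model prefers token 0
expert = [0.0, 2.0, 0.0]      # small tuned model prefers token 1
antiexpert = [0.0, 0.0, 0.0]  # small untuned model has no preference

tuned = proxy_tuned_distribution(base, expert, antiexpert)
# If the expert and anti-expert agree, the base distribution is unchanged.
unchanged = proxy_tuned_distribution(base, antiexpert, antiexpert)

print("steered choice:", tuned.index(max(tuned)))
print("base choice:   ", unchanged.index(max(unchanged)))
```

The design point is that all adaptation lives in the small pair of models, so whatever the large model stores in its lower layers is left untouched.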

Worth knowing before you get too optimistic: a better encoder is still a *text* encoder, and text has a ceiling. One thread argues that text-only models are 'Plato's cave' systems, because language strips out the physics, geometry, and causality of the world it describes (see "Are text-only language models fundamentally limited by abstraction?"). A related argument holds that form-only training can't recover grounded meaning at all (see "Can language models learn meaning from text patterns alone?"). Converting a decoder into an encoder makes the representations more useful, but it doesn't escape those limits; it just gives you cleaner access to whatever the text already contained.

So the honest synthesis is: yes, decoder-only models become effective encoders with surprisingly little training, because the barrier was architectural (causal masking) not representational — but the gains are about *unlocking and reorganizing* existing signal, the training method has to be gentle enough not to corrupt it, and the resulting encoder inherits the same grounding limits as any text-only system.

Sources (6 notes)

Why do decoder-only models underperform as text encoders?

LLM2Vec's unsupervised three-step process (bidirectional attention + masked next-token prediction + contrastive learning) achieves state-of-the-art unsupervised performance on MTEB. The research shows that causal masking, not model size, is the representation bottleneck when decoder-only models are used as encoders.

Do transformer static embeddings actually encode semantic meaning?

Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.

Are text-only language models fundamentally limited by abstraction?

Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.

Can language models learn meaning from text patterns alone?

Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking whether decoder-only LLMs can become effective text encoders. A curated library of arXiv papers (2023–present) found the following — treat these as dated claims to be re-tested, not current truth:

What a curated library found — and when:
• Causal masking, not model size or pretraining, is the primary bottleneck; flipping to bidirectional attention + masked-token prediction + contrastive learning recovers state-of-the-art embedding performance (~2024, LLM2Vec).
• Static embeddings already encode rich semantic structure (valence, concreteness, taboo) before attention runs; encoder conversion unlocks rather than teaches (~2025).
• Direct weight tuning corrupts knowledge in lower layers; lightweight, decoding-time adaptation (proxy-tuning) preserves pretrained knowledge better (~2024).
• Transformer residual streams carry knowledge as continuous flow, not fixed storage, which explains both extractability and vulnerability to aggressive retraining (~2024).
• Text-only encoders inherit grounding limits of their source (Plato's cave problem); form-alone training cannot recover grounded meaning (~2023–2024).

Anchor papers (verify; mind their dates):
• arXiv:2404.05961 — LLM2Vec (Apr 2024)
• arXiv:2405.00208 — A Primer on Inner Workings (Apr 2024)
• arXiv:2508.12863 — Word Meanings in Transformers (Aug 2025)
• arXiv:2508.10003 — Semantic Structure in Embeddings (Aug 2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding, judge whether newer models, training methods (e.g., continued pretraining, supervised fine-tuning on embedding tasks), tooling (SentenceTransformers, open-source benchmarks), or evaluation (MTEB, domain-specific datasets) have since relaxed or overturned it. Separate the durable question (can the capability be unlocked?) from perishable limitations (e.g., does causal masking still dominate on >1T-token models? do lightweight methods still hold at scale?).
(2) Surface the strongest contradicting or superseding work from the last ~6 months. If newer papers show that knowledge flow, grounding, or fine-tuning dynamics differ significantly, name them.
(3) Propose 2 research questions that assume the regime may have moved: e.g., whether multimodal pretraining (arXiv:2603.03276) dissolves text-only grounding limits, or whether test-time adaptation (arXiv:2410.08020) enables per-task encoder specialization without weight surgery.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
