Can knowledge encoded in model representations fail to influence generation?
This explores whether a model can hold something in its internal representations — a correct answer, a piece of context, a latent skill — and still not let it shape the words it actually generates.
The corpus says yes, repeatedly, and through several distinct mechanisms: knowledge that demonstrably exists inside a model's hidden states can fail to reach the output. The most striking case is that models can compute the *correct* answer in their early layers and then actively suppress it before generation. Logit-lens analysis of models trained with hidden chain-of-thought shows the right answer forming in layers 1–3, only to be overwritten in the final layers to produce format-compliant filler tokens. The reasoning stays fully recoverable from lower-ranked predictions but never surfaces in the text (see "Do transformers hide reasoning before producing filler tokens?"). So the gap between "the model knows" and "the model says" isn't hypothetical; it's measurable.
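To make the logit-lens idea concrete, here is a minimal sketch on toy data. It decodes each layer's hidden state through the output (unembedding) matrix and tracks where the correct token ranks. Everything in it is an assumption made for illustration: the three-token vocabulary, the 2-dimensional hidden states, the `UNEMBED` matrix, and the helper names are hypothetical, and the final layer normalization that real logit-lens code usually applies before unembedding is omitted. It is not the analysis pipeline used in the cited note.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden_states, unembed, vocab):
    """Decode every layer's hidden state through the unembedding matrix.

    hidden_states: one vector per layer (each of length d)
    unembed: one row per vocabulary token (each of length d)
    Returns, per layer, a list of (token, probability) sorted by probability.
    """
    per_layer = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        probs = softmax(logits)
        ranked = sorted(zip(vocab, probs), key=lambda tp: tp[1], reverse=True)
        per_layer.append(ranked)
    return per_layer


def rank_of(token, ranked):
    """1-based rank of `token` in a ranked (token, probability) list."""
    return [t for t, _ in ranked].index(token) + 1


# Hypothetical toy setup: "7" is the correct answer, "..." is a filler token.
VOCAB = ["7", "...", "the"]
UNEMBED = [
    [1.0, 0.0],  # "7" reads the first hidden dimension
    [0.0, 1.0],  # "..." reads the second hidden dimension
    [0.2, 0.2],  # "the" weakly reads both
]
HIDDEN = [
    [2.0, 0.0],  # early layer: answer direction dominates
    [2.0, 1.0],  # middle layer: filler direction starts to grow
    [1.5, 3.0],  # final layer: filler direction overwrites the answer
]

layers = logit_lens(HIDDEN, UNEMBED, VOCAB)
top_tokens = [ranked[0][0] for ranked in layers]
answer_ranks = [rank_of("7", ranked) for ranked in layers]

print(top_tokens)    # ['7', '7', '...']
print(answer_ranks)  # [1, 1, 2]
```

In this toy run the answer is the top prediction in the first two layers and drops to rank 2 at the final layer, which is the shape of the pattern described above: overwritten at the surface, still recoverable from lower-ranked predictions.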
A second, more mundane mechanism is interference. When a model's parametric training associations are strong, they can override information sitting right in the context window: the model generates outputs inconsistent with what it was just told, because prior knowledge dominates in-context knowledge. Notably, textual prompting alone can't fix this; the corpus reports that *causal intervention in the representations themselves* is required to make the context win (see "Why do language models ignore information in their context?"). That reframes your question: it's not only that encoded knowledge fails to influence generation, but that competing encoded knowledge can crowd it out.
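The source does not say which intervention is used, so the following is only a sketch of one simple kind of representation-level intervention: projecting a direction out of a hidden state. It assumes, purely for illustration, that the memorized prior and the in-context fact occupy two separate directions of a 2-dimensional hidden state; `PRIOR_DIR`, `UNEMBED`, and the example vocabulary are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def decode(h, unembed, vocab):
    """Return the token whose unembedding row has the highest logit for h."""
    logits = [dot(row, h) for row in unembed]
    best = max(range(len(vocab)), key=logits.__getitem__)
    return vocab[best]


def ablate_direction(h, direction):
    """Remove the component of h that lies along `direction`."""
    coeff = dot(h, direction) / dot(direction, direction)
    return [x - coeff * d for x, d in zip(h, direction)]


# Hypothetical toy setup: the context states a counterfactual ("Rome"),
# while the memorized prior ("Paris") is encoded more strongly.
VOCAB = ["Paris", "Rome"]
UNEMBED = [
    [1.0, 0.0],  # "Paris" reads the prior direction
    [0.0, 1.0],  # "Rome" reads the context direction
]
PRIOR_DIR = [1.0, 0.0]

h = [3.0, 1.0]  # prior component 3.0, context component 1.0

before = decode(h, UNEMBED, VOCAB)
after = decode(ablate_direction(h, PRIOR_DIR), UNEMBED, VOCAB)

print(before)  # Paris
print(after)   # Rome
```

The context information was present in `h` the whole time; it only determines the output once the competing prior component is removed. In a real model the relevant directions are not axis-aligned and would have to be estimated first.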
The flip side is just as interesting: much capability is latent and simply *unelicited*. Base models already contain reasoning ability that minimal training unlocks; five independent methods (RL steering, critique fine-tuning, decoding changes, SAE feature steering, RLVR) all elicit reasoning that was already present in activations, suggesting post-training selects rather than creates (see "Do base models already contain hidden reasoning ability?"). A companion view holds that RL post-training teaches *when* to reason, not *how*: the strategies pre-exist as activation vectors before any RL touches them (see "Does RL post-training create reasoning or just deploy it?"). The bottleneck, in other words, is elicitation: knowledge can sit in the representations indefinitely without a trigger to route it into output.
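One common way to turn "a strategy pre-exists as an activation vector" into something operational is a difference-of-means steering vector: average the activations recorded while the behavior is present, subtract the average recorded while it is absent, and add a scaled copy of the result to a new hidden state. The sketch below assumes that technique; the cited notes do not confirm that this is how their vectors were obtained, and the activation values, `alpha`, and function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(with_behavior, without_behavior):
    """Difference of mean activations: behavior present minus behavior absent."""
    mu_pos = mean_vector(with_behavior)
    mu_neg = mean_vector(without_behavior)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(h, vector, alpha):
    """Add a scaled steering vector to a hidden state."""
    return [x + alpha * v for x, v in zip(h, vector)]


# Hypothetical activations from one layer of a base model.
REASONING_ACTS = [[1.0, 2.0], [1.0, 4.0]]  # prompts where step-by-step reasoning appeared
PLAIN_ACTS = [[1.0, 0.0], [1.0, 2.0]]      # prompts where it did not

vector = steering_vector(REASONING_ACTS, PLAIN_ACTS)
steered = steer([1.0, 0.5], vector, alpha=1.5)

print(vector)   # [0.0, 2.0]
print(steered)  # [1.0, 3.5]
```

Nothing is trained here: the vector is read off activations the model already produces, which matches the framing that elicitation selects a direction that already exists rather than creating a capability.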
This is why a growing line of work argues that reasoning should be studied as hidden-state trajectory formation, not as surface text. The visible chain-of-thought is only a *partial interface* onto a latent process, and faithfulness tests show the words don't reliably mirror the computation underneath (see "Where does LLM reasoning actually happen during generation?"). There's even an architectural escape hatch: diffusion LLMs with bidirectional attention can refine reasoning embedded directly in masked positions, decoupling it from the left-to-right generation that forces autoregressive models to either emit or bury their intermediate work (see "Can reasoning and answers be generated separately in language models?").
The useful boundary here: prompting and post-training can *reorganize and route* what's already encoded, but neither injects what isn't there. Prompt optimization activates existing knowledge and hits a hard ceiling at the edge of the training distribution (see "Can prompt optimization teach models knowledge they lack?"). So the picture has two failure directions worth holding together: encoded knowledge that exists but stays silent (suppression, interference, missing elicitation), and the absence of knowledge, which no representation-level trick can conjure. The interesting frontier is the first: closing the gap between what the activations know and what the tokens say.
Sources (7 notes)
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
Evidence from CoT faithfulness tests, feature steering, and layer analysis suggests latent-state dynamics drive reasoning, while surface chain-of-thought serves as a partial interface. Hidden reasoning processes should be the default focus of study.
ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.