Do transformers hide reasoning before producing filler tokens?

Explores whether language models compute correct answers in early layers but then deliberately overwrite them with filler tokens in later layers, suggesting reasoning and output formatting are separable processes.

Synthesis note · 2026-02-23 · sourced from Cognitive Models Latent

When transformers are trained to solve reasoning tasks with filler (hidden) characters replacing explicit CoT tokens, a striking pattern emerges through logit lens analysis:

Layers 1-3: Correct numerical tokens from the reasoning computation appear as top predictions. The model is performing the actual computation in these early layers.

Layer 3 transition: Filler tokens begin appearing among top-ranked predictions, competing with the computational results.

Final layer: Filler tokens dominate top predictions; correct computational tokens are relegated to rank-2 or lower. The model has overwritten the intermediate reasoning representations with format-compliant output tokens.
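
The layer-by-layer pattern above can be sketched with a toy logit lens, which projects each layer's hidden state through the unembedding matrix and ranks the vocabulary. This is a minimal illustration, not the analysis behind this note: the vocabulary, the unembedding matrix, and the hidden states are all hypothetical numbers chosen by hand to mimic the described pattern, and a real logit lens would usually also apply the final layer norm before projecting.

```python
# Toy logit-lens sketch. All numbers are synthetic and hand-picked to mimic
# the pattern described above; they are NOT measurements from any model.

VOCAB = ["7", "3", "."]  # "." plays the role of the filler token (assumption)

# Hypothetical unembedding matrix: one row per vocabulary token, d_model = 3.
UNEMBED = [
    [1.0, 0.0, 0.0],  # "7"
    [0.0, 1.0, 0.0],  # "3"
    [0.0, 0.0, 1.0],  # "."
]

# Hypothetical residual-stream states at one filler position, one per layer.
HIDDEN_BY_LAYER = [
    [2.0, 0.5, 0.1],  # layer 1: the computed digit dominates
    [2.5, 0.4, 0.8],  # layer 2
    [2.4, 0.3, 2.0],  # layer 3: the filler starts to compete
    [2.2, 0.2, 4.0],  # final layer: filler on top, digit at rank 2
]


def logit_lens(hidden, unembed, vocab):
    """Project one hidden state through the unembedding and rank the vocabulary."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    order = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
    return [vocab[i] for i in order]


rankings = [logit_lens(h, UNEMBED, VOCAB) for h in HIDDEN_BY_LAYER]
for layer, ranked in enumerate(rankings, start=1):
    print(f"layer {layer}: {ranked}")
```

In this toy setup the digit "7" leads in the early layers and drops to rank 2 at the final layer, where the filler takes rank 1.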

The hidden computations are fully recoverable by examining lower-ranked tokens during decoding. The model performs the reasoning, stores the results in its representations, then actively overwrites them to produce the expected output format. The mechanism likely involves induction heads — pattern-copying circuits that learn to overwrite based on training distribution patterns.
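
Recovering the hidden computation from lower-ranked tokens can be sketched as a decoding rule that skips the filler and takes the best remaining candidate. The helper name `top_non_filler`, the filler symbol, and the example rankings are all hypothetical; this is one plausible reading of "examining lower-ranked tokens", not a confirmed procedure from the source.

```python
FILLER = "."  # assumed filler symbol, for illustration only


def top_non_filler(ranked_tokens, filler=FILLER):
    """Return (rank, token) of the best-ranked non-filler token, rank 1-indexed.

    Returns None if every candidate is the filler.
    """
    for rank, tok in enumerate(ranked_tokens, start=1):
        if tok != filler:
            return rank, tok
    return None


# Hypothetical final-layer rankings at three filler positions (synthetic data).
final_layer_rankings = [
    [".", "7", "3"],
    [".", "3", "7"],
    [".", "1", "0"],
]

recovered = [top_non_filler(r) for r in final_layer_rankings]
print(recovered)
```

Each recovered entry pairs the hidden token with the rank it was found at, so a rank of 2 at every position would match the "rank-2 or lower" pattern described above.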

This finding has two important implications. First, it provides mechanistic evidence for the note "Why does reasoning training help math but hurt medical tasks?", with a twist: the computation happens in the early layers, and the overwriting happens in the later layers. The functional separation is computation in early layers and formatting in late layers, not simply knowledge-down/reasoning-up.

Second, it demonstrates a distinction between instance-adaptive and parallelizable computation. Instance-adaptive CoT requires caching subproblem solutions within token outputs — later tokens depend on earlier results. This dependency structure is incompatible with parallel filler token computation. The hidden computation in filler tokens works for tasks where the full solution can be computed in a single forward pass, but not for problems requiring sequential dependency between reasoning steps.
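
The difference in dependency structure can be illustrated with two toy tasks. Both tasks are hypothetical stand-ins chosen for this sketch, not tasks named in the source: in the first, every sub-check depends only on the input, so the checks could in principle run side by side; in the second, each step consumes the previous step's output, so the intermediate result has to be stored somewhere before the next step can start.

```python
from itertools import combinations


def any_triple_sums_to_zero(xs, mod=10):
    """Parallelizable: each triple is checked independently of every other triple."""
    return any((a + b + c) % mod == 0 for a, b, c in combinations(xs, 3))


def iterated_map(x, steps, mod=10):
    """Sequential: step k cannot start until step k-1 has produced its result."""
    trace = []
    for _ in range(steps):
        x = (3 * x + 1) % mod  # the new value depends on the previous value
        trace.append(x)
    return trace


print(any_triple_sums_to_zero([1, 2, 7, 4]))  # 1 + 2 + 7 = 10, which is 0 mod 10
print(iterated_map(1, 3))                     # 1 -> 4 -> 3 -> 0
```

The first function has no ordering constraint between its sub-checks, which is the kind of structure the note associates with filler-token computation. The second has a chain of dependencies, which is the kind the note associates with instance-adaptive CoT.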

This connects to the CoT faithfulness literature: if models can compute correct answers without explicit reasoning tokens, the explicit CoT chain is not necessarily the mechanism producing the answer. The overwriting pattern suggests the model has two separable processes, computation and expression, that may not align. See "Do language models actually use their reasoning steps?"


Related concepts in this collection


Why does reasoning training help math but hurt medical tasks? Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
refines: early layers compute, late layers format; the separation is functional, not just knowledge vs reasoning
Do language models actually use their encoded knowledge? Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
the overwriting mechanism explains HOW encoded information fails to influence generation: later layers actively suppress it
Do language models actually use their reasoning steps? Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
hidden computation explains why CoT can be unfaithful: the model may use a different internal computation path
Does chain of thought reasoning actually explain model decisions? When language models show their reasoning steps in agentic pipelines, does the quality of those steps predict or explain the quality of final outputs? This matters for trusting and debugging AI systems.
the computation-expression separation extends to agentic pipelines
What mechanism enables models to retrieve from long context? Do attention heads specialize in retrieving relevant information from long context windows, and if so, what makes them universal across models and necessary for factual generation?
retrieval heads complement hidden filler reasoning: early-layer hidden computations produce intermediate results that retrieval heads access during generation. The filler overwrite pattern also explains why specialized retrieval heads are necessary: if intermediate representations are overwritten, the model must retrieve from earlier positions via these sparse attention heads
