SYNTHESIS NOTE

Do transformers hide reasoning before producing filler tokens?

Explores whether language models compute correct answers in early layers but then deliberately overwrite them with filler tokens in later layers, suggesting reasoning and output formatting are separable processes.

Synthesis note · 2026-02-23 · sourced from Cognitive Models Latent

When transformers are trained to solve reasoning tasks with filler (hidden) characters replacing explicit CoT tokens, a striking pattern emerges through logit lens analysis:

Layers 1-3: Correct numerical tokens from the reasoning computation appear as top predictions. The model is performing the actual computation in these early layers.

Layer 3 transition: Filler tokens begin appearing among top-ranked predictions, competing with the computational results.

Final layer: Filler tokens dominate top predictions; correct computational tokens are relegated to rank-2 or lower. The model has overwritten the intermediate reasoning representations with format-compliant output tokens.
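The layer-by-layer pattern above is what a logit lens reads off: each layer's hidden state is projected through the unembedding matrix, and the resulting token ranking is inspected. The following is a minimal sketch of that idea in plain Python. The three-token vocabulary, the 2-dimensional hidden states, and the `UNEMBED` weights are invented toy values chosen to mimic the described pattern, not measurements from the source, and the final layer norm that a real logit lens usually applies is omitted.

```python
# Toy logit lens: project each layer's hidden state onto the vocabulary
# and track how the rank of the answer token and the filler token changes.
# All numbers below are illustrative assumptions, not data from the source.

VOCAB = ["7", "3", "."]          # "7" = correct computed token, "." = filler
ANSWER_ID, FILLER_ID = 0, 2

# Hypothetical unembedding matrix: one row of weights per vocabulary token.
UNEMBED = [
    [1.0, 0.0],   # "7"
    [0.3, 0.3],   # "3"
    [0.0, 1.0],   # "."
]

# Hypothetical hidden states at one position: layers 1, 2, 3, and final.
HIDDEN_STATES = [
    [1.0, 0.1],
    [1.0, 0.2],
    [1.0, 0.9],
    [0.8, 1.5],
]

def logit_lens(hidden_states, unembed):
    """Return, for each layer, token ids sorted from highest to lowest logit."""
    rankings = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        order = sorted(range(len(logits)), key=lambda t: logits[t], reverse=True)
        rankings.append(order)
    return rankings

def rank_of(token_id, ranking):
    """1-based rank of a token within one layer's ranking."""
    return ranking.index(token_id) + 1

rankings = logit_lens(HIDDEN_STATES, UNEMBED)
answer_ranks = [rank_of(ANSWER_ID, r) for r in rankings]
filler_ranks = [rank_of(FILLER_ID, r) for r in rankings]

for layer, (a, f) in enumerate(zip(answer_ranks, filler_ranks), start=1):
    print(f"layer {layer}: answer rank {a}, filler rank {f}")
```

In this toy setup the answer token holds rank 1 through the third layer while the filler token climbs, and the two swap places at the final layer, which is the shape of the pattern the note describes.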

The hidden computations are fully recoverable by examining lower-ranked tokens during decoding. The model performs the reasoning, stores the results in its representations, then actively overwrites them to produce the expected output format. The mechanism likely involves induction heads — pattern-copying circuits that learn to overwrite based on training distribution patterns.
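Recovering the hidden computation from lower-ranked tokens can be sketched as a decoding pass that skips filler tokens when reading the ranking. The vocabulary, the logits, and the helper name `decode_hidden` are hypothetical and exist only to illustrate the idea. This is not the procedure used in the source.

```python
# Sketch: read the best non-filler token at each position of the final layer.
# Vocabulary and logits are made-up toy values (assumptions for illustration).

VOCAB = ["0", "1", "2", "."]
FILLER_IDS = {3}

# Hypothetical final-layer logits at three filler positions.
LOGITS = [
    [0.1, 2.0, 0.3, 3.5],
    [1.9, 0.2, 0.4, 3.1],
    [0.2, 0.1, 2.2, 2.9],
]

def decode_hidden(logits_per_position, filler_ids):
    """Return (top_token_id, best_non_filler_token_id) for every position."""
    decoded = []
    for logits in logits_per_position:
        order = sorted(range(len(logits)), key=lambda t: logits[t], reverse=True)
        top = order[0]
        # First token in rank order that is not a filler token, if any.
        hidden = next((t for t in order if t not in filler_ids), None)
        decoded.append((top, hidden))
    return decoded

decoded = decode_hidden(LOGITS, FILLER_IDS)
surface = "".join(VOCAB[top] for top, _ in decoded)
hidden = "".join(VOCAB[h] for _, h in decoded)

print("surface output:", surface)
print("hidden tokens: ", hidden)
```

The surface output consists only of filler tokens, while the rank-2 tokens carry the intermediate values.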

This finding has two important implications. First, it provides mechanistic evidence for "Why does reasoning training help math but hurt medical tasks?", with a twist: the computation happens in earlier layers, while the overwriting happens in later layers. The functional separation is computation-in-early-layers, formatting-in-late-layers, not simply knowledge-down/reasoning-up.

Second, it demonstrates a distinction between instance-adaptive and parallelizable computation. Instance-adaptive CoT requires caching subproblem solutions within token outputs — later tokens depend on earlier results. This dependency structure is incompatible with parallel filler token computation. The hidden computation in filler tokens works for tasks where the full solution can be computed in a single forward pass, but not for problems requiring sequential dependency between reasoning steps.
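The difference in dependency structure can be made concrete with two toy tasks. Both are hypothetical examples chosen for illustration and are not taken from the source. In the first, every sub-check depends only on the input, so the checks could in principle run side by side. In the second, each step needs the previous step's result, which is the kind of dependency that has to be cached in emitted tokens.

```python
from itertools import combinations

def pair_checks(nums, target):
    """Parallelizable: each check reads only the input, never another check."""
    return [a + b == target for a, b in combinations(nums, 2)]

def follow_pointers(next_index, start, steps):
    """Instance-adaptive: step i cannot begin until step i-1 has finished."""
    pos = start
    trace = [pos]
    for _ in range(steps):
        pos = next_index[pos]   # depends on the result of the previous step
        trace.append(pos)
    return trace

# Independent checks: any ordering (or simultaneous evaluation) gives the same result.
found = any(pair_checks([1, 4, 6, 9], 10))

# Sequential chain: the visited positions are only known by walking them in order.
trace = follow_pointers([2, 0, 3, 1], start=0, steps=3)

print(found, trace)
```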

This connects to the CoT faithfulness literature: if models can compute correct answers without explicit reasoning tokens, the explicit CoT chain is not necessarily the mechanism producing the answer. The overwriting pattern suggests the model has two separable processes, computation and expression, that may not align. See "Do language models actually use their reasoning steps?"

Original note title

transformers perform hidden reasoning computations in earlier layers then overwrite intermediate representations with filler tokens in later layers