Do transformers hide reasoning before producing filler tokens?
Explores whether language models compute correct answers in early layers but then deliberately overwrite them with filler tokens in later layers, suggesting reasoning and output formatting are separable processes.
When transformers are trained to solve reasoning tasks with filler (hidden) characters replacing explicit CoT tokens, a striking pattern emerges through logit lens analysis:
Layers 1-3: Correct numerical tokens from the reasoning computation appear as top predictions. The model is performing the actual computation in these early layers.
Layer 3 transition: Filler tokens begin appearing among top-ranked predictions, competing with the computational results.
Final layer: Filler tokens dominate top predictions; correct computational tokens are relegated to rank-2 or lower. The model has overwritten the intermediate reasoning representations with format-compliant output tokens.
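The layer-by-layer reading above can be sketched with a toy logit lens. This is a minimal illustration, not the analysis code behind the finding: the vocabulary, the unembedding matrix, and the per-layer hidden states are all hypothetical numbers chosen to mimic the described pattern, and the final layer norm that a real logit lens usually applies before unembedding is omitted.

```python
from typing import Dict, List, Sequence


def logit_lens_ranking(
    hidden: Sequence[float],
    unembed: Sequence[Sequence[float]],
    vocab: Sequence[str],
) -> List[str]:
    """Project one layer's hidden state onto the vocabulary and
    return the tokens sorted from highest to lowest logit."""
    # One logit per vocabulary entry: dot product of hidden state and that row.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    order = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
    return [vocab[i] for i in order]


# Hypothetical toy setup (assumption): "." is the filler token,
# "7" is the correct intermediate result, "3" is a distractor.
VOCAB = ["7", ".", "3"]
UNEMBED = [
    [1.0, 0.0, 0.0],  # direction read out as "7"
    [0.0, 1.0, 0.0],  # direction read out as "."
    [0.0, 0.0, 1.0],  # direction read out as "3"
]
# Made-up hidden states at one filler position, one per inspected layer.
HIDDEN_BY_LAYER: Dict[str, List[float]] = {
    "layer 1": [2.0, 0.1, 0.5],
    "layer 3": [2.0, 1.8, 0.5],
    "final layer": [2.0, 3.0, 0.5],
}

for name, hidden in HIDDEN_BY_LAYER.items():
    print(name, logit_lens_ranking(hidden, UNEMBED, VOCAB))
```

In this toy data the numerical token stays strongly represented in every layer; what changes is that the filler direction grows until it outranks it, which is the sense of "overwriting" at the level of top predictions.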
The hidden computations are fully recoverable by examining lower-ranked tokens during decoding. The model performs the reasoning, stores the results in its representations, then actively overwrites them to produce the expected output format. The mechanism likely involves induction heads — pattern-copying circuits that learn to overwrite based on training distribution patterns.
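One plausible reading of "examining lower-ranked tokens during decoding" is a modified greedy decode that skips the filler token whenever it is ranked first. The sketch below assumes that reading; the function name `decode_skipping_filler`, the vocabulary, and the logits are hypothetical and only illustrate the idea.

```python
from typing import List, Sequence


def decode_skipping_filler(
    logits_per_position: Sequence[Sequence[float]],
    vocab: Sequence[str],
    filler: str = ".",
) -> List[str]:
    """For each position, return the highest-ranked token that is not
    the filler token (falls back to the filler if nothing else exists)."""
    recovered = []
    for logits in logits_per_position:
        order = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
        token = next((vocab[i] for i in order if vocab[i] != filler), filler)
        recovered.append(token)
    return recovered


# Hypothetical final-layer logits (assumption) for three positions.
VOCAB = ["1", "2", "3", "."]
FINAL_LOGITS = [
    [0.5, 2.0, 0.1, 3.0],  # filler ranked first, "2" ranked second
    [2.5, 0.2, 0.3, 3.1],  # filler ranked first, "1" ranked second
    [0.1, 0.2, 4.0, 1.0],  # "3" already ranked first
]

print(decode_skipping_filler(FINAL_LOGITS, VOCAB))
```

Ordinary greedy decoding on the same logits would emit the filler at the first two positions, so the intermediate results are visible only when the ranking below the top token is inspected.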
This finding has two important implications. First, it provides mechanistic evidence for Why does reasoning training help math but hurt medical tasks?, with a twist: the computation happens in the earlier layers, and the overwriting happens in the later layers. The functional separation is computation in early layers and formatting in late layers, not simply knowledge in lower layers and reasoning in higher layers.
Second, it demonstrates a distinction between instance-adaptive and parallelizable computation. Instance-adaptive CoT requires caching subproblem solutions within token outputs — later tokens depend on earlier results. This dependency structure is incompatible with parallel filler token computation. The hidden computation in filler tokens works for tasks where the full solution can be computed in a single forward pass, but not for problems requiring sequential dependency between reasoning steps.
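The difference in dependency structure can be shown with a toy contrast. This is an analogy under stated assumptions, not a model of transformer computation: both functions and their arithmetic are hypothetical, chosen only so that one set of intermediates depends on the input alone and the other depends on the previous intermediate.

```python
from typing import List, Sequence


def parallel_steps(inputs: Sequence[int]) -> List[int]:
    """Each intermediate depends only on the original input, so all of
    them could be computed at once (the filler-token-friendly case)."""
    return [x * x for x in inputs]


def sequential_steps(inputs: Sequence[int], start: int = 0) -> List[int]:
    """Each intermediate depends on the previous one, so earlier results
    must be cached before later ones can be computed (the CoT case)."""
    state = start
    trace = []
    for x in inputs:
        state = state * 2 + x  # needs the cached previous result
        trace.append(state)
    return trace


print(parallel_steps([1, 2, 3]))    # [1, 4, 9]
print(sequential_steps([1, 2, 3]))  # [1, 4, 11]
```

In the second function, no entry of the trace can be produced without the one before it, which is the kind of dependency that explicit reasoning tokens can carry and that parallel filler positions cannot.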
This connects to the CoT faithfulness literature: if models can compute correct answers without explicit reasoning tokens, the explicit CoT chain is not necessarily the mechanism producing the answer. The overwriting pattern suggests the model has two separable processes, computation and expression, that may not align. See Do language models actually use their reasoning steps?
Inquiring lines that use this note as a source (183)
This note is a source for these synthesized inquiries.
- How do models integrate conflicting signals in reasoning tasks?
- Can input augmentation and rephrasing compensate for smaller model limitations?
- How do transformers perform multi-hop reasoning across distant training documents?
- Could superposed decoding algorithms maintain multi-task representation during generation?
- Does sentence-level granularity capture enough structure for complex reasoning tasks?
- Do modern architectures in NLP and vision rely on dot products intentionally?
- What tokens do RL-trained summarizers learn to keep for ranking?
- Can this principle apply to other intermediate text generation tasks?
- Do models learn different sophistry strategies for QA versus code generation?
- How do verbose and concise reasoning occupy different regions in activation space?
- How do soft thought tokens differ from decoded assistant outputs?
- Can structured prompting reliably force models to enumerate preconditions?
- Can AI output be tokenized without decoupling from the thought processes behind it?
- How do models signal knowledge gaps through token probability?
- Does information stored in neural networks necessarily influence generation decisions?
- Do language models learn surface patterns instead of underlying linguistic principles?
- Can better prompting fix structural disruptions in artificial text generation?
- Why do language models produce plausible outputs over accurate failure reports?
- Does changing decoding procedure reveal hidden chain-of-thought paths?
- How does token-by-token generation constrain a model's ability to plan ahead?
- How does error propagation limit transformer performance on complex tasks?
- Can symbolic mechanisms improve transformer compositional abilities?
- Why do language models produce verbose reasoning when asked to think step by step?
- Does text-only evaluation hide reasoning collapse that tool use could repair?
- Can autoregressive models learn faithful translation to logical representations without semantic loss?
- Can language models reason without relying on learned semantic patterns?
- Can simple diagnostic tests predict language model performance in production complexity?
- How does the discrete token bottleneck prevent gradient flow in language model control?
- What happens when confident language masks uncertainty in AI outputs?
- Why do correct reasoning traces in language models tend to be shorter?
- Why do language models substitute parametric knowledge over retrieved context mid-reasoning?
- Why do language models imitate reasoning form without abstract inference capability?
- How much does prompt format shape what reasoning strategy a model uses?
- Do tokens beyond a critical threshold actually improve reasoning quality?
- How do planning and backtracking sentences control reasoning traces?
- What makes some tokens carry disproportionate information about answers?
- Does next-token prediction alone produce genuine functional language competence?
- What computational role do intermediate tokens actually play in transformers?
- Does the DeepSeek R1 single token insertion represent genuine reasoning?
- Do reasoning models show the same answer-maintenance pattern that diffusion models exhibit?
- Why do transformer models still miss implicit discourse relations in anxiety detection?
- Does thinking-token overuse actually degrade reasoning accuracy in practice?
- Do representations in models causally influence text generation?
- How does fluent text output trigger misleading cognitive attributions in readers?
- Do sparse arithmetic circuits explain all language model reasoning abilities?
- How does cognitive load explain linguistic patterns in both deception and incorrect reasoning?
- Why do generative and discriminative language model procedures disagree?
- What reliable traces do generative processes actually leave in finished text?
- Why do standard transformers fail on problems requiring serial algorithmic reasoning?
- Can correct model outputs prove that semantic meaning rather than surface patterns drove the response?
- Why do language models fail at grounding and inference?
- Why do temporal reasoning patterns matter more than final answers?
- Can targeted interventions on attention heads bridge the encoding-generation gap?
- How do we distinguish knowledge encoding from knowledge usage in models?
- Can correct outputs mask reliance on surface heuristics rather than deep understanding?
- Does more thinking always help large language models or sometimes hurt?
- Why do language models tend to elaborate and expand rather than compress information?
- How do lower network layers compress facts versus higher reasoning layers?
- Do latent sequence vectors outperform per-token latent iterative computation for reasoning?
- Does activation masking prevent the decoder from taking interpretability shortcuts?
- Does encoded knowledge in language models actually influence what they generate?
- Could graph neural networks fundamentally outperform transformers on structured reasoning?
- Does encoding information in LM representations guarantee it influences output?
- When does encoded knowledge fail to influence language model generation?
- Do language models actively adopt false beliefs under sustained conversational pressure?
- What hidden computations happen inside transformer layers during reasoning?
- Can we decode what individual circuits inside transformers are doing?
- Do reasoning models perform genuine logical evaluation or pattern matching?
- Why might encoded world knowledge fail to actually influence language model outputs?
- Why do models learn reasoning form instead of actual abstract inference?
- Why do language models struggle with formal logical reasoning and joins?
- Why does distillation transfer reasoning patterns with few examples?
- Why does input embedding magnitude affect perturbation sensitivity in transformers?
- Can derivational traces be distinguished from stylistic mimicry of reasoning?
- How much does input format shape what reasoning strategy a model develops?
- Can synthesized explanations be more auditable than winning-chain explanations?
- Why are truthfulness and honesty mechanistically separate in language models?
- How can prompting help models gather information before attempting reasoning?
- Can token efficiency come from stopping before reflection?
- What separates pattern matching from genuine language understanding?
- Can models hide their reasoning in continuous space rather than natural language?
- Why do reasoning models verbalize reasoning shortcuts less than necessary?
- How does training data format shape whether models reason in parallel or sequentially?
- How much of a model's reasoning tokens are unnecessary for reaching the final answer?
- Why do language models generate reasoning tokens after internally deciding the answer?
- Can language models reason without relying on surface level pattern matching?
- What makes deductive reasoning so brittle in language models overall?
- Why do explicit linguistic markers override semantic computation in models?
- Is gradient behavior in language functional or a sign of ambiguity?
- Can models compress reasoning chains without external teacher supervision?
- Why does output alignment fail to catch internally incoherent reasoning?
- Why do reasoning models confidently generate wrong answers instead of abstaining?
- What happens to reasoning accuracy when models use more thinking tokens?
- Why do reasoning models reduce effort despite having token budget remaining?
- Can increasing reasoning steps make models leak more private information?
- Does anonymizing reasoning traces harm the quality of model outputs?
- Can models learn when to think versus answer directly?
- Can deliberate corruption of reasoning traces harm out of distribution generalization?
- What separates knowledge from reasoning in neural network layers?
- Why do format and structure matter more than actual content in reasoning?
- How much does test-time compute improve reasoning without more tokens?
- Can transformers reason beyond fixed architectural depth limits?
- How does layer removal affect transformers compared to ResNets?
- Why do language models prefer certain response styles regardless of what the prompt asks?
- Why does SFT reduce reasoning quality even when improving domain accuracy?
- Why do smaller models favor code formats while larger models prefer natural language?
- Can language models accurately evaluate the quality of their own reasoning?
- Can bounded-depth transformers solve inherently sequential problems?
- Does SFT degrade reasoning quality while improving domain accuracy?
- How do recursive language models rethink where to store reasoning?
- Can language models keep secrets and control information strategically?
- How early in token generation does the reasoning mode activate?
- How do output format constraints compare to input exemplar brittleness?
- Can reasoning in free text then formatting separately recover performance?
- What sparse mechanistic structures drive reasoning traces in language models?
- How does fluent output mask the mythic function of a system?
- Can latent reasoning achieve the same substitution without tokens?
- Can language models perform genuine symbolic reasoning without semantic grounding?
- Can models be trained to explain instead of imitate answers?
- Why do AI outputs lack the stable content of written sentences?
- Can inserted errors in reasoning drafts produce predictable downstream effects?
- Why does representation recycling of MI-peak tokens improve reasoning accuracy?
- Why are receiver attention heads narrower in reasoning models than base models?
- Can attribute decomposition improve other interactive reasoning tasks beyond clinical questioning?
- What explains the contextual variability of knowledge in transformers?
- How does oral transmission of knowledge resemble transformer generation?
- Does attention bias explain grounding failure in language models?
- How do induction heads learn to overwrite computational representations?
- Why do language models produce unfaithful chain of thought explanations?
- Can knowledge encoded in model representations fail to influence generation?
- Why does concise reasoning maintain accuracy with far fewer tokens?
- Does more thinking always improve language model accuracy?
- Does supervised fine-tuning improve reasoning or just response formatting?
- How do pretrained language models represent inferential patterns versus lexical and positional cues?
- Why do language models struggle with backward reasoning compared to forward?
- Can cognitive scaffolding replace tool-based reasoning augmentation in language models?
- What role do humans play in converting language model outputs into meaningful events?
- How do reasoning-invariant tokens dilute learning signals in uniform averaging?
- Can models maintain reasoning-output coupling while improving domain accuracy?
- How do generative PRMs ensure their reasoning actually influences judgment instead of decorating outputs?
- Can models internally identify which tokens matter most for reasoning?
- Can verifier output replace ground-truth answers as the asymmetric information source?
- Does reasoning happen in hidden space or in generated tokens?
- Does next-token prediction actually explain how human thought works?
- Can interventions on model components prove mechanism without explaining encoding?
- Do models cache intentions about response topics before generating the first token?
- Can entropy signatures alone detect whether context was model-generated or externally prefilled?
- How does the enaction paradigm explain introspective anomaly detection in large language models?
- Do models verbalize their implicit knowledge when that knowledge influences their output?
- Why do higher network layers capture procedural knowledge but lower layers store facts?
- How do meta-tokens help models learn when to generate reasoning versus commit predictions?
- How does predictive accuracy on future tokens differ from correctness on labeled answers?
- Can format adaptation alone explain why reasoning enrichment improves instruction following?
- What distinguishes first-order from second-order agency in language models?
- What quality filters distinguish useful reasoning enrichment from shallow repetition?
- Why does uniform averaging across all tokens dilute the reasoning signal?
- What evidence shows that reasoning chains encode token-level functional structure?
- Does reasoning style transfer matter more than solution correctness in distillation?
- Why do thinking models execute longer tasks than standard language models?
- Do reasoning models need to verbalize doubt to correct their own mistakes?
- Can trained models encode programs more complex than their data-generating process?
- Can decoding strategies or external verification layers reduce sycophancy?
- Does CoT reasoning actually cause the outputs that follow it?
- Why do language models overthink simple questions when given extra time?
- Does the token prediction framing actually capture what human reasoning does?
- Can standard next-token prediction capture complex multi-step human reasoning directly?
- Do transformer architectures structurally bias models toward short-term optimization?
- Is premature decision-making a form of underthinking in transformer models?
- Does token-level loss aggregation help aligned models differently?
- Can models learn to optimize their own chain-of-thought generation?
- How does token-level interaction like ColBERT overcome commutativity constraints?
- What geometric structure do language models actually use during inference?
- How do semantic and symbolic reasoning capabilities differ in language models?
- What is the comprehension-generation asymmetry in language models?
- Can this whole-artifact principle apply to other generative tasks?
- Does premature confidence signal flawed reasoning in language models?
- Do language models need words to think or just latent structure?
- Why does latent-level prediction beat token-level prediction for reasoning?
- What does next-token prediction tell us about compositional linguistic competence?
- How do early-prefix tokens control the generation of entire continuations?
- Why do language models use remaining tokens to rationalize instead of reconsider?
- How does tool-based reasoning expand what language models can do?
- How does evaluation setting affect measured reasoning capabilities in language models?
Related concepts in this collection (5)
- Why does reasoning training help math but hurt medical tasks?
  Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
  refines: early layers compute, late layers format; the separation is functional, not just knowledge vs reasoning
- Do language models actually use their encoded knowledge?
  Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
  the overwriting mechanism explains HOW encoded information fails to influence generation: later layers actively suppress it
- Do language models actually use their reasoning steps?
  Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
  hidden computation explains why CoT can be unfaithful: the model may use a different internal computation path
- Does chain of thought reasoning actually explain model decisions?
  When language models show their reasoning steps in agentic pipelines, does the quality of those steps predict or explain the quality of final outputs? This matters for trusting and debugging AI systems.
  the computation-expression separation extends to agentic pipelines
- What mechanism enables models to retrieve from long context?
  Do attention heads specialize in retrieving relevant information from long context windows, and if so, what makes them universal across models and necessary for factual generation?
  retrieval heads complement hidden filler reasoning: early-layer hidden computations produce intermediate results that retrieval heads access during generation. The filler overwrite pattern explains why specialized retrieval heads are necessary: if intermediate representations are overwritten, the model must retrieve from earlier positions via these sparse attention heads
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Understanding Hidden Computations in Chain-of-Thought Reasoning
- Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability
- Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
- Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity
- Chain-of-Thought Reasoning Without Prompting
- Implicit Chain of Thought Reasoning via Knowledge Distillation
- A Primer on the Inner Workings of Transformer-based Language Models
Original note title
transformers perform hidden reasoning computations in earlier layers then overwrite intermediate representations with filler tokens in later layers