Where does LLM reasoning actually happen during generation?

Does multi-step reasoning emerge from visible chain-of-thought text, hidden layer dynamics, or simply more computation? Three competing hypotheses make different predictions and can be empirically tested.

Synthesis note · 2026-04-20 · sourced from Cognitive Models Latent

The field studies "LLM reasoning" without agreeing on what the primary object of study is. Three views coexist but make incompatible predictions:

H2 (surface CoT): Multi-step reasoning is primarily mediated by explicit surface chain-of-thought. The chain IS the reasoning. For H2 to hold, surface traces must provide the most stable causal leverage. But ordinary CoT is often useful without being reliably faithful, and its role varies sharply across tasks.

H0 (generic serial compute): Most apparent reasoning gains are better explained by generic serial compute than by any privileged representational object. More tokens = more FLOPs, regardless of what those tokens say. For H0 to hold, matched serial compute must explain most gains. But extra budget alone cannot explain why specific internal states, features, or trajectories can predict or alter reasoning behavior.

H1 (latent-state trajectories): Multi-step reasoning is primarily mediated by latent-state trajectories, with surface CoT serving only as a partial interface. Task-relevant commitment arises in hidden-state dynamics that are only partly verbalized, or not verbalized at all.

The difficulty is that recent methods typically move several factors at once: CoT prompting changes both visible traces and compute allocation; latent reasoning methods change both hidden-state dynamics and compute budget; test-time scaling changes compute and usually changes the output path. Without designs that explicitly disentangle these three factors, experimental results cannot distinguish which hypothesis they support.
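One way to picture a design that separates the three factors is a full factorial grid in which each factor is varied while the other two are held fixed. The sketch below is only an illustration: the factor levels (`faithful_cot`, `steered`, `matched_budget`, and so on) and the function names are hypothetical and do not come from the source.

```python
from itertools import product

# Hypothetical factor levels, chosen for illustration only.
SURFACE = ("faithful_cot", "corrupted_cot", "no_cot")   # what the visible trace says
LATENT = ("unperturbed", "steered")                     # whether hidden states are intervened on
COMPUTE = ("matched_budget", "extra_budget")            # serial compute available


def design_grid():
    """Return every combination of the three factors (full factorial design)."""
    return [
        {"surface": s, "latent": l, "compute": c}
        for s, l, c in product(SURFACE, LATENT, COMPUTE)
    ]


def single_factor_contrasts(grid):
    """Return (factor, cond_a, cond_b) for pairs that differ in exactly one factor.

    Only these pairs isolate one factor; any other comparison moves
    two or more factors at once and cannot separate the hypotheses.
    """
    pairs = []
    for i, a in enumerate(grid):
        for b in grid[i + 1:]:
            differing = [k for k in a if a[k] != b[k]]
            if len(differing) == 1:
                pairs.append((differing[0], a, b))
    return pairs


grid = design_grid()
contrasts = single_factor_contrasts(grid)
```

Under these assumed levels, contrasts on `surface` bear on H2, contrasts on `latent` bear on H1, and contrasts on `compute` bear on H0. A comparison such as plain CoT prompting versus direct answering changes both `surface` and `compute`, so it does not appear among the single-factor contrasts.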

The paper argues H1 should be the default working hypothesis — not as a task-independent verdict, but because the strongest evidence currently available points toward latent-state dynamics as having the most stable causal leverage. The recommendation: treat latent-state dynamics as the default object of study and design evaluations that explicitly separate surface traces, latent states, and serial compute.

This framework organizes several existing findings. The note "Do language models actually use their reasoning steps?" weakens the H2 assumption empirically: if surface traces are not causally faithful, they cannot be the primary reasoning medium. The note "Does chain-of-thought reasoning reflect genuine thinking or performance?" shows H2 failing specifically on easy tasks, where the answer is determined before CoT begins, while H1 and H0 remain viable. And in "Can we trigger reasoning without explicit chain-of-thought prompts?", direct latent intervention provides causal evidence for H1 that neither H2 nor H0 can explain.
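The claim that an answer can be determined before CoT begins suggests a simple probe: truncate the chain at each step and check whether the final answer changes. The sketch below is a minimal illustration under stated assumptions, not the method used in the notes cited above. `answer_fn` is a hypothetical wrapper that returns a model's final answer given a question and a CoT prefix; the two stub models exist only to show the extremes.

```python
from typing import Callable, Sequence


def truncation_sensitivity(
    answer_fn: Callable[[str, Sequence[str]], str],
    question: str,
    cot_steps: Sequence[str],
) -> float:
    """Fraction of proper CoT prefixes whose answer differs from the full-CoT answer.

    0.0 means the answer never depended on the visible chain;
    1.0 means every truncation changed the answer.
    """
    if not cot_steps:
        return 0.0
    full_answer = answer_fn(question, cot_steps)
    changed = sum(
        answer_fn(question, cot_steps[:k]) != full_answer
        for k in range(len(cot_steps))  # prefixes of length 0 .. n-1
    )
    return changed / len(cot_steps)


# Stub "models" (hypothetical) illustrating the two extremes.
def ignores_cot(question, steps):
    return "42"  # answer is fixed before any CoT is read


def needs_full_cot(question, steps):
    return "42" if len(steps) == 3 else "unknown"


steps = ["step 1", "step 2", "step 3"]
score_decorative = truncation_sensitivity(ignores_cot, "q", steps)
score_dependent = truncation_sensitivity(needs_full_cot, "q", steps)
```

A low score on a task is the pattern the text describes for easy problems: the surface chain can be removed without changing the outcome, which is hard to reconcile with H2 but leaves H1 and H0 open.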

Additional evidence converges from multiple angles. In "Why does reasoning training help math but hurt medical tasks?", the layer separation provides architectural grounding for H1: reasoning is a latent higher-layer process, not a surface token-generation phenomenon. In "Why do language models fail to act on their own reasoning?", even when the surface trace (the rationale) is correct, the latent computation (action selection) diverges, a behavioral signature of the surface-latent disconnect that H1 predicts. And "Can we measure how deeply a model actually reasons?" supplies an H1-native measurement tool: DTR tracks latent computational depth per token rather than surface trace properties, and it outperforms surface-level metrics as an accuracy predictor.
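To give a feel for what a per-token latent depth measurement could look like, the sketch below computes a simple proxy: the layer index after which a token's predicted distribution stays close to the final-layer distribution. This is an illustrative stand-in and not DTR's actual definition, which this note does not spell out. It assumes per-layer next-token distributions are already available (for example, by projecting intermediate hidden states through the output head); `settling_depth`, `tol`, and the toy distributions are all hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def settling_depth(layer_dists, tol=0.05):
    """Earliest layer index from which every later layer stays within `tol` of the final layer.

    layer_dists: one probability distribution per layer, for a single token.
    A larger value means the prediction kept shifting until deeper in the network.
    """
    final = layer_dists[-1]
    depth = 0
    for i, dist in enumerate(layer_dists):
        if js_divergence(dist, final) > tol:
            depth = i + 1  # prediction had not settled yet at layer i
    return depth


# Toy per-layer distributions over a two-word vocabulary (illustrative only).
settled_early = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
settled_late = [[0.5, 0.5], [0.5, 0.5], [0.1, 0.9]]

depth_early = settling_depth(settled_early)
depth_late = settling_depth(settled_late)
```

The design choice here is to read depth from hidden-layer predictions rather than from the text of the trace, which is what makes a measure of this kind H1-native in the sense used above.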

The sharpest implication: the field's default assumption (H2) may be distorting research priorities. If the reasoning object is latent, then benchmarks that evaluate chains, faithfulness metrics that read traces, and interpretability methods that parse CoT are all measuring a secondary phenomenon.

Related concepts in this collection 8

Do language models actually use their reasoning steps? Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
empirical evidence weakening H2
Does chain-of-thought reasoning reflect genuine thinking or performance? When language models generate step-by-step reasoning, are they actually thinking through problems or just producing text that looks like reasoning? This matters for understanding whether extended reasoning tokens add real computational value.
difficulty-dependent H2 failure
Can models reason without generating visible thinking tokens? Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
H1 implementations
Does chain-of-thought reasoning reveal genuine inference or pattern matching? Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
theoretical argument against H2
Can we trigger reasoning without explicit chain-of-thought prompts? This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.
causal evidence for H1
Why does reasoning training help math but hurt medical tasks? Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
provides layer-level mechanistic grounding for H1: reasoning localizes to higher layers as a latent process, not as surface token generation
Can we measure how deeply a model actually reasons? What if reasoning quality isn't about length or confidence, but about how much a model's predictions shift across its internal layers? Can tracking these shifts reveal genuine thinking versus pattern-matching?
an H1-native measurement: DTR measures latent computational depth rather than surface trace properties
Why do language models fail to act on their own reasoning? LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?
behavioral evidence for the latent-surface disconnect: models produce correct surface reasoning but act on latent computations that don't follow it
