Should LLM reasoning be studied as latent state trajectories rather than surface text?
This explores whether the real action in LLM reasoning happens in hidden internal states rather than the words the model writes out — and whether researchers should study it that way.
The corpus comes down fairly hard on "yes, mostly." The central note argues that reasoning operates primarily through hidden-state trajectories, with the surface chain-of-thought serving only as a partial, sometimes unfaithful interface to what is actually happening inside (see "Where does LLM reasoning actually happen during generation?"). The evidence comes from chain-of-thought faithfulness tests, feature steering, and layer-by-layer analysis, all of which suggest the written reasoning can diverge from the computation that produced the answer.
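One way to make "trajectory" concrete is to treat the per-layer hidden states at a single token position as a path and measure how it moves from layer to layer. The sketch below is a toy in pure Python, not the method of any note in the corpus: `hidden_states` is a hypothetical list of per-layer vectors (in practice these would be read out of a model's intermediate activations), and `trajectory_stats` and `cosine` are illustrative helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def trajectory_stats(hidden_states):
    """Summarize a layer-by-layer path through hidden space.

    hidden_states: list of vectors, one per layer (hypothetical toy input).
    Returns the total path length, the Euclidean step size between
    consecutive layers, and the cosine similarity between consecutive layers.
    """
    steps, sims = [], []
    for prev, curr in zip(hidden_states, hidden_states[1:]):
        steps.append(math.dist(prev, curr))
        sims.append(cosine(prev, curr))
    return {"path_length": sum(steps), "step_sizes": steps, "cosine_sims": sims}

# Made-up 2-D "hidden states" across four layers, for illustration only.
toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
stats = trajectory_stats(toy)
```

In this toy input the first layer transition leaves the state unchanged, the second rotates it to an orthogonal direction, and the third only rescales it. A summary like this says where the representation moves, which is the kind of question the surface text alone cannot answer.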
What makes this more than a single claim is how other corners of the collection independently point at the same gap between surface text and underlying mechanism. Mechanistic interpretability work finds that "understanding" isn't one thing but three coexisting tiers (features as directions, factual world-state connections, and compact circuits), layered as a patchwork rather than a clean hierarchy (see "Do language models understand in fundamentally different ways?"). That's a portrait you can only see by looking at internal structure, not output. Similarly, models default to surface-level shortcuts on theory-of-mind tasks rather than genuinely tracking mental states, and hybrid architectures that force explicit belief tracking outperform the LLM alone, which suggests the surface answer hides a shallower internal process than the text implies (see "Do large language models genuinely simulate mental states?").
The latent-state lens also reframes failure. If reasoning is a trajectory through hidden space, then failure is a trajectory that wanders. One note describes reasoning LLMs as "wandering explorers" whose search lacks validity, effectiveness, and necessity, so success probability drops exponentially as problems get deeper (see "Why do reasoning LLMs fail at deeper problem solving?"). That's a state-dynamics story, not a text story. Entailment work shows models keying off whether a hypothesis was memorized rather than whether the premise supports it: the surface output says "entailment," but the internal process is retrieval, not inference (see "Do LLMs predict entailment based on what they memorized?"). Likewise, when semantic content is stripped from a task, performance collapses even with the correct rules in context, suggesting the underlying machinery is semantic association, not symbolic manipulation (see "Do large language models reason symbolically or semantically?").
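The exponential relationship between depth and success can be illustrated with a simple independence model. This is an assumption of the sketch, not something the note specifies: if every reasoning step succeeds with the same probability `p_step`, independently of the others, and a problem needs `depth` steps in sequence, then the chance of an all-correct chain is `p_step ** depth`. The function name `chain_success` is illustrative.

```python
def chain_success(p_step, depth):
    """Probability that all `depth` sequential steps succeed.

    Assumes independent steps with an identical per-step success
    probability `p_step` (a simplifying assumption for illustration).
    """
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return p_step ** depth

# Print how a fixed per-step reliability compounds over deeper problems.
for depth in (1, 5, 10, 20, 50):
    print(depth, round(chain_success(0.95, depth), 3))
```

Under this model a step that is right 95% of the time gives roughly a 60% chance of an error-free chain at depth 10 and under 8% at depth 50, which matches the qualitative picture of medium problems staying solvable while deep ones become far harder.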
But the corpus doesn't license ignoring the text entirely, and this is the part you might not expect. Some notes show that intervening on the surface *changes* the trajectory, meaning text isn't just a readout but a partial control surface. Structured argument prompts (forcing models to name warrants and backing) catch reasoning failures that plain chain-of-thought lets slide (see "Can structured argument prompts make LLM reasoning more rigorous?"). And diffusion LLMs blur the line further: they embed reasoning directly into masked positions refined alongside the answer, so "reasoning" and "answer" stop being a clean before-and-after sequence and become parallel axes of a single refinement process (see "Can reasoning and answers be generated separately in language models?"). That hints the surface-vs-latent split is itself architecture-dependent, not a universal law.
So the sharper takeaway: studying latent trajectories isn't just a better microscope; it dissolves several puzzles that look mysterious at the text level. Why do different models show distinct strategic "personalities" tied to game type rather than raw depth (see "Do large language models use one reasoning style or many?")? Why is causal reasoning reliably stronger than temporal reasoning (see "Why do LLMs handle causal reasoning better than temporal reasoning?")? These read as quirks of output until you treat them as signatures of where the internal trajectory has been well-grooved by training and where it hasn't. The text is the shadow and the trajectory is the thing casting it, but poking the shadow can still move the object.
Sources (10 notes)
Evidence from CoT faithfulness tests, feature steering, and layer analysis suggests latent-state dynamics drive reasoning, while surface chain-of-thought serves as a partial interface. Hidden reasoning processes should be the default focus of study.
Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.
ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.
Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, indicating that they respond to memorized propositions rather than premise-hypothesis relationships.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.
ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.
Analysis of 22 LLMs across behavioral game theory reveals three dominant profiles: GPT-o1 uses minimax reasoning, DeepSeek-R1 uses trust-based reasoning, and GPT-o3-mini uses belief-anticipation. Performance correlates with game structure, not raw reasoning depth.
ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.