What internal mechanisms explain LLM reasoning and representation limits?
This explores what's happening inside LLMs that produces their reasoning and representation limits — not just *that* they fail, but the internal architecture and mechanisms behind those failures.
The focus here is the internal machinery (representations, circuits, latent dynamics) that explains why LLMs reason and represent the world the way they do, and where that machinery hits walls. The corpus converges on a striking starting point: what a model does on the surface and what it does inside are decoupled. Two models can produce identical answers through radically different internal structures, and pushing one metric (accuracy) reliably degrades others (faithfulness, calibration), so behavior alone tells you almost nothing about mechanism (see "What actually happens inside a language model?" and "What actually happens inside the minds of language models?"). That decoupling is the reason "it got the right answer" is a weak claim about understanding.
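A minimal sketch of that decoupling, assuming a toy two-layer linear model rather than any real LLM: re-expressing the hidden layer in a different basis leaves every output unchanged while the hidden activations differ. The weights, the change-of-basis matrix `R`, and the helper names are all hypothetical and chosen only for illustration.

```python
# Two toy linear "models" with identical input-output behavior but
# different hidden representations. All values are made up for illustration.

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Multiply matrix a by matrix b (both lists of rows)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def run(w_in, w_out, x):
    """Return (hidden activations, output) for a two-layer linear model."""
    h = matvec(w_in, x)
    return h, matvec(w_out, h)

# Model A: input -> hidden -> output.
W1_A = [[1.0, 2.0], [0.0, 1.0]]
W2_A = [[1.0, -1.0]]

# An invertible change of basis for the hidden layer and its inverse.
R = [[0.0, 2.0], [1.0, 0.0]]
R_INV = [[0.0, 1.0], [0.5, 0.0]]

# Model B: same function, hidden layer expressed in the new basis.
W1_B = matmul(R, W1_A)
W2_B = matmul(W2_A, R_INV)

x = [3.0, 4.0]
hidden_a, out_a = run(W1_A, W2_A, x)
hidden_b, out_b = run(W1_B, W2_B, x)

print("outputs:", out_a, out_b)        # same behavior
print("hidden: ", hidden_a, hidden_b)  # different internals
```

Real networks are nonlinear, so this only illustrates the logical point: matching outputs do not pin down the internal representation.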
The most concrete limit shows up as a split between knowing and doing. Models can state a principle correctly, fail to apply it, and even recognize their own failure, a combination the cited note describes as incompatible with human cognition (see "Can LLMs understand concepts they cannot apply?"). One study measured it directly: ~87% accuracy explaining concepts versus ~64% executing them, a gap of roughly 23 percentage points, which it frames as a structural disconnect between instruction and execution pathways rather than a knowledge gap (see "Can language models understand without actually executing correctly?"). The interesting implication is that fluency and competence run on partly separate internal circuitry.
Why does the reasoning itself break down? Several notes point inward rather than at the text. Reasoning seems to live primarily in hidden-state trajectories, with the visible chain-of-thought acting as only a partial, sometimes unfaithful interface to what's actually happening (see "Where does LLM reasoning actually happen during generation?"). And the reasoning that does happen is associative, not symbolic: strip the familiar semantics out of a task and performance collapses even when the correct rules are sitting right there in context, because the model is leaning on token associations and parametric commonsense, not formal manipulation (see "Do large language models reason symbolically or semantically?"). On harder problems, models wander instead of searching systematically, so success probability drops exponentially with problem depth (see "Why do reasoning LLMs fail at deeper problem solving?"). You can even predict where this fails from first principles: treat the model as an autoregressive probability machine, and tasks whose target outputs are low-probability (counting letters, reciting the alphabet backwards) get systematically harder regardless of logical simplicity (see "Can we predict where language models will fail?").
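Two of the claims above are easy to make concrete with arithmetic. The sketch below assumes, purely for illustration, that each reasoning step succeeds independently with a fixed probability, which is one simple way to get exponential decay with depth and is not necessarily the model used in the cited note. It also uses the standard autoregressive factorization, where a sequence's log-probability is the sum of its per-token log-probabilities. The token probabilities are invented numbers, not measurements.

```python
import math

def success_probability(per_step: float, depth: int) -> float:
    """Chance that all `depth` steps succeed, assuming independent steps."""
    return per_step ** depth

def sequence_log_prob(token_probs):
    """Log-probability of a sequence: sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)

# Exponential decay with depth under the independence assumption.
for depth in (1, 5, 10, 20):
    print(depth, round(success_probability(0.9, depth), 4))

# Hypothetical per-token probabilities for a familiar vs. an unfamiliar target.
familiar = [0.9, 0.9, 0.9, 0.9]      # e.g. a frequently seen sequence
unfamiliar = [0.3, 0.3, 0.3, 0.3]    # e.g. the same items in an unusual order
print(sequence_log_prob(familiar) > sequence_log_prob(unfamiliar))
```

Even with a 90% per-step success rate, the chance of completing ten dependent steps falls to about 35% under this assumption, which is why depth is so costly.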
Understanding itself turns out to be layered rather than monolithic. Mechanistic interpretability finds three tiers: concepts encoded as directions in representation space, factual world-knowledge as connections, and genuine principles as compact circuits. The higher tiers don't replace the lower heuristics; they coexist with them, leaving a patchwork where real circuits and brittle shortcuts sit side by side (see "Do language models understand in fundamentally different ways?"). That patchwork is why a model can look principled on one input and shortcut-driven on the next. Pinning down which is which requires pairing representational analysis (where is the feature) with causal analysis (does it actually drive the output): correlation alone identifies candidate features but can't prove they matter (see "Can we understand LLM mechanisms with only representational analysis?").
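The pairing of representational and causal analysis can be sketched on a toy system. The example below is an assumption-laden illustration, not a real interpretability pipeline: the "model" has two hidden units, a Pearson correlation stands in for a probe, and zeroing a unit stands in for ablation. All function names and data values are hypothetical.

```python
import math

# Toy "model": two hidden units, but the output reads only unit 0.
# Unit 1 copies a labeled feature, so a probe finds it easily even though
# it has no effect on the output. All numbers are illustrative assumptions.

def hidden(x):
    return [x[0], x[1]]

def output(h):
    return 2.0 * h[0] + 0.0 * h[1]

def pearson(xs, ys):
    """Pearson correlation, used here as a stand-in for a linear probe."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)

def ablation_effect(inputs, unit):
    """Largest output change over the inputs when `unit` is zeroed out."""
    worst = 0.0
    for x in inputs:
        h = hidden(x)
        ablated = list(h)
        ablated[unit] = 0.0
        worst = max(worst, abs(output(h) - output(ablated)))
    return worst

inputs = [(1.0, 0.0), (2.0, 1.0), (0.5, 1.0), (3.0, 0.0)]
feature = [x[1] for x in inputs]  # the property we are looking for

# Representational step: is the feature readable from unit 1?
probe_score = pearson([hidden(x)[1] for x in inputs], feature)

# Causal step: does removing each unit change the output?
effect_unit1 = ablation_effect(inputs, 1)
effect_unit0 = ablation_effect(inputs, 0)

print("probe score for unit 1:", probe_score)
print("ablation effect, unit 1:", effect_unit1)
print("ablation effect, unit 0:", effect_unit0)
```

Here the probe alone would flag unit 1 as encoding the feature, while ablation shows it plays no role in the output. Only the two results together support a claim about mechanism.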
The quiet takeaway: many of these limits aren't bugs to be patched but consequences of the substrate, which is token-by-token prediction over learned associations. That framing also points at the escape routes the corpus is probing. If reasoning is hostage to token-level dynamics, one option is to move it up an abstraction level and reason over sentence embeddings in a language-agnostic space before decoding (see "Can reasoning happen at the sentence level instead of tokens?"). Another is to recognize that current methods only cover conventional problem-solving and miss whole modes of creative reasoning, which may explain phenomena like diversity collapse (see "Can LLMs reason creatively beyond conventional problem-solving?"). The limits and the proposed fixes are two views of the same mechanism question.
Sources (12 notes)
Research shows that LLMs can achieve the same output through different internal mechanisms, and improvements in one dimension like accuracy reliably degrade others like faithfulness and calibration. Internal structure matters even when behavior appears identical.
LLMs can achieve identical accuracy while maintaining radically different internal representations, and mechanisms that appear interpretable may not causally drive outputs. This decoupling means performance metrics alone mask crucial differences in how models actually work.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions indicates that this is not a knowledge deficit but a structural disconnect.
Evidence from CoT faithfulness tests, feature steering, and layer analysis suggests latent-state dynamics drive reasoning, while surface chain-of-thought serves as a partial interface. Hidden reasoning processes should be the default focus of study.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed these predictions on tasks such as reciting the alphabet backwards and counting letters.
Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.
Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.
Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.
Research identifies combinational, exploratory, and transformational reasoning as distinct creative modes grounded in cognitive science. Existing LLM reasoning methods address only conventional problem-solving, leaving creative paradigms unaddressed and potentially explaining diversity collapse in ideation.