What internal mechanisms explain LLM reasoning and representation limits?

This inquiry looks at what happens inside LLMs to produce their reasoning and representation limits: not just *that* they fail, but the internal architecture and mechanisms behind those failures.

The focus is the internal machinery (representations, circuits, latent dynamics) that explains why LLMs reason and represent the world the way they do, and where that machinery hits walls. The corpus converges on a striking starting point: what a model does on the surface and what it does inside are decoupled. Two models can produce identical answers through radically different internal structures, and pushing one metric (accuracy) reliably degrades others (faithfulness, calibration), so behavior alone tells you almost nothing about mechanism (What actually happens inside a language model?; What actually happens inside the minds of language models?). That decoupling is the reason "it got the right answer" is a weak claim about understanding.
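The decoupling between behavior and mechanism can be made concrete with a toy case. The sketch below hand-wires two tiny threshold networks that both compute XOR but use different hidden features. The networks, their weights, and the names `model_a` and `model_b` are hypothetical illustrations and are not taken from the cited notes.

```python
def step(z: float) -> int:
    """Threshold unit: fires when its input is positive."""
    return 1 if z > 0 else 0


def model_a(x1: int, x2: int):
    # Hypothetical network A: hidden units encode OR and AND.
    hidden = (step(x1 + x2 - 0.5), step(x1 + x2 - 1.5))
    output = step(hidden[0] - hidden[1] - 0.5)  # OR and not AND
    return output, hidden


def model_b(x1: int, x2: int):
    # Hypothetical network B: hidden units encode "only x1" and "only x2".
    hidden = (step(x1 - x2 - 0.5), step(x2 - x1 - 0.5))
    output = step(hidden[0] + hidden[1] - 0.5)  # either one fires
    return output, hidden


for x1 in (0, 1):
    for x2 in (0, 1):
        out_a, hid_a = model_a(x1, x2)
        out_b, hid_b = model_b(x1, x2)
        print((x1, x2), "outputs:", out_a, out_b, "hidden:", hid_a, hid_b)
```

Both functions agree on every input, yet their hidden states differ on some of them, so a behavioral test on outputs alone could not tell the two apart.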

The most concrete limit shows up as a split between knowing and doing. Models can state a principle correctly, fail to apply it, and even recognize their own failure, a pattern human cognition doesn't produce (Can LLMs understand concepts they cannot apply?). One study measured it directly: about 87% accuracy explaining concepts versus about 64% executing them, which it frames as a structural disconnect between instruction and execution pathways rather than a knowledge gap (Can language models understand without actually executing correctly?). The interesting implication is that fluency and competence run on partly separate internal circuitry.

Why does the reasoning itself break down? Several notes point inward rather than at the text. Reasoning seems to live primarily in hidden-state trajectories, with the visible chain-of-thought acting as only a partial, sometimes unfaithful interface to what's actually happening (Where does LLM reasoning actually happen during generation?). And the reasoning that does happen is associative, not symbolic: strip the familiar semantics out of a task and performance collapses even when the correct rules are sitting right there in context, because the model is leaning on token associations and parametric commonsense, not formal manipulation (Do large language models reason symbolically or semantically?). On harder problems, models wander instead of searching systematically, so success probability drops exponentially with problem depth (Why do reasoning LLMs fail at deeper problem solving?). You can even predict where this fails from first principles: treat the model as an autoregressive probability machine, and low-probability targets (counting letters, reciting the alphabet backwards) get systematically harder regardless of logical simplicity (Can we predict where language models will fail?).
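The exponential drop with depth can be illustrated with a deliberately simplified model. Assuming every reasoning step succeeds independently with the same probability (an assumption made here for illustration, not the cited note's own analysis), a chain of `depth` steps succeeds with probability `per_step_accuracy ** depth`. The function name and the 0.95 figure are hypothetical.

```python
def chain_success_probability(per_step_accuracy: float, depth: int) -> float:
    """Probability that all `depth` steps succeed.

    Simplifying assumption: steps are independent and equally reliable.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must lie in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return per_step_accuracy ** depth


# Hypothetical per-step accuracy of 0.95, chosen only for illustration.
for depth in (1, 5, 10, 20, 40):
    probability = chain_success_probability(0.95, depth)
    print(f"depth={depth:>2}  success probability={probability:.3f}")
```

Under these assumptions a step that is right 95% of the time still leaves a 40-step chain succeeding only about one time in eight, which is one way to read the gap between medium and deep problems.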

Understanding itself turns out to be layered rather than monolithic. Mechanistic interpretability finds three tiers: concepts encoded as directions in representation space, factual world-knowledge as connections, and genuine principles as compact circuits. The higher tiers don't replace the lower heuristics; they coexist with them, leaving a patchwork where real circuits and brittle shortcuts sit side by side (Do language models understand in fundamentally different ways?). That patchwork is why a model can look principled on one input and shortcut-driven on the next. Pinning down which is which requires pairing representational analysis (where is the feature?) with causal analysis (does it actually drive the output?), because correlation alone identifies candidate features but can't prove they matter (Can we understand LLM mechanisms with only representational analysis?).
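A toy example shows why the two kinds of analysis have to be paired. In the sketch below, a synthetic "model" carries two hidden features that both track the input, but its output reads only one of them. Correlation flags both as candidates; mean-ablation shows that only one drives the output. The feature names `driver` and `bystander`, the readout weight, and the data are all hypothetical and stand in for real probing and intervention tooling.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def readout(hidden):
    # Hypothetical model head: it uses only the "driver" feature.
    return 2.0 * hidden["driver"]


rng = random.Random(0)
inputs = [rng.uniform(-1.0, 1.0) for _ in range(500)]
# Both features track the input; "bystander" is a noisy copy the head ignores.
states = [{"driver": x, "bystander": x + rng.gauss(0.0, 0.05)} for x in inputs]
outputs = [readout(h) for h in states]


def representational_score(name):
    """Representational analysis: how well the feature correlates with the output."""
    return abs(pearson([h[name] for h in states], outputs))


def causal_effect(name):
    """Causal analysis: mean output change when the feature is mean-ablated."""
    mean_value = sum(h[name] for h in states) / len(states)
    total = 0.0
    for h in states:
        ablated = dict(h)
        ablated[name] = mean_value
        total += abs(readout(ablated) - readout(h))
    return total / len(states)


for name in ("driver", "bystander"):
    print(name, round(representational_score(name), 3), round(causal_effect(name), 3))
```

Both features score highly on the correlational measure, but ablating `bystander` leaves the output untouched. A correlation-only study would report two relevant features, and only the intervention separates them.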

The quiet takeaway: many of these limits aren't bugs to be patched but consequences of the substrate, which is token-by-token prediction over learned associations. That framing also points at the escape routes the corpus is probing. If reasoning is hostage to token-level dynamics, one option is to move it up an abstraction level and reason over sentence embeddings in a language-agnostic space before decoding (Can reasoning happen at the sentence level instead of tokens?). Another is to recognize that current methods only cover conventional problem-solving and miss whole modes of creative reasoning, which may explain phenomena like diversity collapse (Can LLMs reason creatively beyond conventional problem-solving?). The limits and the proposed fixes are two views of the same mechanism question.

Sources (12 notes)

What actually happens inside a language model?

Research shows that LLMs can achieve the same output through different internal mechanisms, and improvements in one dimension like accuracy reliably degrade others like faithfulness and calibration. Internal structure matters even when behavior appears identical.

What actually happens inside the minds of language models?

LLMs can achieve identical accuracy while maintaining radically different internal representations, and mechanisms that appear interpretable may not causally drive outputs. This decoupling means performance metrics alone mask crucial differences in how models actually work.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them because their instruction and execution pathways are dissociated. The gap between 87% accuracy in explanations and 64% in actions indicates a structural disconnect rather than a knowledge deficit.

Where does LLM reasoning actually happen during generation?

Evidence from CoT faithfulness tests, feature steering, and layer analysis suggests latent-state dynamics drive reasoning, while surface chain-of-thought serves as a partial interface. Hidden reasoning processes should be the default focus of study.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed the predictions on tasks such as reciting the alphabet backwards and counting letters.

Do language models understand in fundamentally different ways?

Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.

Can LLMs reason creatively beyond conventional problem-solving?

Research identifies combinational, exploratory, and transformational reasoning as distinct creative modes grounded in cognitive science. Existing LLM reasoning methods address only conventional problem-solving, leaving creative paradigms unaddressed and potentially explaining diversity collapse in ideation.
