How do lower network layers compress facts versus higher reasoning layers?

This explores what the corpus reveals about how computation is distributed across a transformer's depth — whether early layers do something fact-like and compressive while later layers do something reasoning-like — and where the literal premise of the question holds up.

The corpus complicates the tidy picture of a depth-wise split, with lower layers compressing facts and upper layers reasoning. The sharpest evidence comes from logit-lens work showing that models trained with hidden chain-of-thought compute the correct answer in layers 1–3, then actively suppress that representation in the final layers to emit format-compliant filler (see "Do transformers hide reasoning before producing filler tokens?"). So early layers aren't just storing facts to be reasoned over later; they can carry finished reasoning that upper layers overwrite. Depth isn't a clean pipeline from 'retrieve' to 'reason': the answer can already be present early and get buried, recoverable only from lower-ranked token predictions.
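To make the logit-lens idea concrete, here is a minimal sketch in plain Python. It projects each layer's hidden state through an unembedding matrix and records where the answer token ranks at that depth. The function name `logit_lens_ranks`, the toy matrices, and the three-token vocabulary are all illustrative assumptions, not taken from the cited work. A real logit lens would usually also apply the model's final layer norm before unembedding, which is omitted here.

```python
def logit_lens_ranks(hidden_by_layer, unembed, target_id):
    """For each layer, project the hidden state through the unembedding
    matrix and return the rank of target_id (0 = top prediction)."""
    ranks = []
    for h in hidden_by_layer:
        # logits[v] = dot(h, unembed[v]) for every vocabulary row v
        logits = [sum(hi * wi for hi, wi in zip(h, row)) for row in unembed]
        # rank = number of tokens scoring strictly higher than the target
        ranks.append(sum(1 for x in logits if x > logits[target_id]))
    return ranks


# Hypothetical toy data: 3-token vocabulary, 2-dimensional hidden states.
UNEMBED = [
    [1.0, 0.0],    # token 0: the "answer" token
    [0.0, 1.0],    # token 1: a "filler" token
    [-1.0, -1.0],  # token 2: unrelated
]
HIDDEN = [
    [2.0, 0.5],  # early layer: answer direction dominates
    [1.5, 1.0],  # middle layer: answer still on top
    [0.5, 2.0],  # final layer: filler direction dominates
]

ranks = logit_lens_ranks(HIDDEN, UNEMBED, target_id=0)
print(ranks)  # [0, 0, 1]
```

In this toy setup the answer token is the top prediction in the first two layers and drops to second place in the last one. That matches the pattern described above: the answer is present early, then buried, yet still recoverable from a lower-ranked prediction.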

If you zoom out from layers to the broader question of how models compress, the corpus suggests compression is the default mode everywhere, not a lower-layer specialty. LLMs aggressively maximize statistical compression, capturing broad category structure but discarding the fine-grained, context-sensitive distinctions humans preserve (see "Do LLMs compress concepts more aggressively than humans do?"). That's a useful reframe: the 'fact compression' the question imagines isn't a neutral storage step, it's a lossy bet on what to keep. And the model's own internals seem to know which parts matter: reasoning chains encode token-level functional importance, with symbolic-computation tokens preferentially preserved and grammar or meta-discourse pruned first (see "Which tokens in reasoning chains actually matter most?"). A related finding shows only about 20% of tokens are high-entropy 'forking points' that actually drive learning (see "Do high-entropy tokens drive reasoning model improvements?"). Compression and reasoning, in other words, are entangled all the way down.
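The 'forking points' idea can be sketched by ranking positions by the Shannon entropy of their next-token distribution and keeping the top fraction. This is only an illustration under stated assumptions: the helper names `token_entropy` and `forking_positions` are hypothetical, the distributions are made up, and the default `fraction=0.2` simply mirrors the roughly 20% figure quoted above rather than any published selection procedure.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_positions(dists, fraction=0.2):
    """Return the indices of the top `fraction` of positions by entropy."""
    entropies = [token_entropy(d) for d in dists]
    k = max(1, round(fraction * len(dists)))
    order = sorted(range(len(dists)), key=lambda i: entropies[i], reverse=True)
    return sorted(order[:k])


# Hypothetical next-token distributions for a 5-token reasoning chain.
DISTS = [
    [0.97, 0.01, 0.01, 0.01],     # near-deterministic continuation
    [0.25, 0.25, 0.25, 0.25],     # maximally uncertain: a forking point
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]

print(forking_positions(DISTS))  # [1]
```

With five positions and a 20% budget, only the uniform distribution at index 1 is selected. Every other position is close to deterministic and would be treated as low-information under this criterion.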

The entanglement runs the other direction too: the machinery that does reasoning turns out to be good at compression. A reasoning model's raw thinking trace, used directly as shortened context, beats most purpose-built compression methods, so the same mechanism that produces reasoning also produces usable input compression (see "Can a reasoning model's thinking trace compress context effectively?"). That undercuts the premise of a strict division of labor. Rather than 'low = compress facts, high = reason,' the picture is more recursive: reasoning is itself a compression operation over evidence.

For a different cut at where abstraction lives, Meta's Large Concept Model abandons token-level processing entirely and reasons over whole-sentence embeddings in a language-agnostic space before decoding (see "Can reasoning happen at the sentence level instead of tokens?"). That's a structural argument that the 'reasoning layer' might be better placed above tokens altogether: the fact-vs-reasoning split people intuit by depth might be better engineered as a split by representational grain. The deeper surprise is that 'where facts live' and 'where reasoning happens' may not be separable coordinates in a transformer at all; the same layers, and the same compression instinct, do both.

Sources (6 notes)

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Do LLMs compress concepts more aggressively than humans do?

Using Rate-Distortion Theory on cognitive datasets, LLMs capture broad category structure but lose fine-grained distinctions humans preserve. LLMs maximize compression efficiency; humans trade compression for contextual meaning that enables situated action.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Can a reasoning model's thinking trace compress context effectively?

A reasoning model's raw thinking trace, used directly as shortened context, outperforms most dedicated compression methods without requiring specialized modules or compression-specific training. The mechanism that enables reasoning also produces usable input compression.

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.
