How do knowledge and reasoning circuits interfere in the same neural network?
This explores what happens inside a model when the part that stores facts and the part that does step-by-step reasoning have to share the same network — and how each one can corrupt the other.
The corpus suggests they are less cleanly separated rivals than overlapping tenants, and the overlap cuts both ways. One line of work finds a rough division of labor by depth: factual knowledge is retrieved in the lower layers, while reasoning adjustments happen in the higher ones (see: Why does reasoning training help math but hurt medical tasks?). That split offers an explanation for why training a model harder on reasoning can sharpen its math while quietly degrading knowledge-heavy domains like medicine: you're tuning the upper machinery in ways that can disturb the lower-layer retrieval.
The more vivid interference shows up when you trace an actual reasoning circuit. Models implement syllogistic logic through a content-independent three-stage mechanism (recite the premises, suppress the middle term, mediate to a conclusion), and that mechanism works across architectures. But additional attention heads carrying world knowledge lean on the process, nudging conclusions toward what *sounds* plausible rather than what logically follows, and this contamination gets *worse* at larger scale (see: How do language models perform syllogistic reasoning internally?). So the same stored knowledge that makes a model useful also pushes it toward logical fallacies: it can't fully quarantine 'what I know is usually true' from 'what follows here.'
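The pull between validity and plausibility can be made concrete with a toy scoring model. This is only a sketch: the scores, the `knowledge_weight` knob, and the `conclude` helper are hypothetical stand-ins, not the circuit traced in the cited work. It shows how an additive plausibility term can override a logically valid conclusion once it is weighted heavily enough.

```python
# Toy illustration only: the scores and the knowledge_weight knob are
# hypothetical, not measurements from the cited work.

def conclude(logic_scores, plausibility, knowledge_weight):
    """Return the candidate conclusion with the highest combined score."""
    combined = {
        candidate: logic_scores[candidate]
        + knowledge_weight * plausibility[candidate]
        for candidate in logic_scores
    }
    return max(combined, key=combined.get)


# Premises: "All mammals can walk. Whales are mammals."
# The valid conclusion is implausible; the plausible one is invalid.
LOGIC = {"whales can walk": 1.0, "whales cannot walk": 0.0}
PLAUSIBILITY = {"whales can walk": -1.0, "whales cannot walk": 1.0}

for weight in (0.0, 0.3, 0.8):
    print(weight, conclude(LOGIC, PLAUSIBILITY, weight))
```

With these made-up numbers the valid conclusion wins while `knowledge_weight` stays below 0.5 and loses above it. Raising the weight is a crude analogue of the stronger contamination reported at larger scale.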
Why do these circuits tangle rather than stay tidy? Networks do tend toward modularity: pruning studies show they spontaneously isolate compositional subroutines into separate subnetworks, and pretraining makes that separation more reliable (see: Do neural networks naturally learn modular compositional structure?). The catch is that modularity is partial and learned, not guaranteed. Where it's clean, knowledge and reasoning coexist; where it isn't, they bleed.
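The ablation logic behind that claim can be sketched with a tiny hand-built readout. Everything here is an assumption for illustration: the four hidden units, the two weight tables, and the helper names are invented, and real pruning studies learn masks over trained networks rather than zeroing hand-picked units. The sketch only shows what "ablations affect only their corresponding function" looks like next to the entangled case.

```python
# Toy illustration only: hand-written weights, not a trained network.

def forward(hidden, readout, ablated=frozenset()):
    """Compute one output per readout row, skipping ablated hidden units."""
    return [
        sum(w * h for i, (w, h) in enumerate(zip(row, hidden)) if i not in ablated)
        for row in readout
    ]


def ablation_effect(hidden, readout, ablated):
    """Absolute change in each output when the given units are removed."""
    base = forward(hidden, readout)
    cut = forward(hidden, readout, ablated)
    return [abs(b - c) for b, c in zip(base, cut)]


HIDDEN = [1.0, 2.0, 3.0, 4.0]
# Row 0 reads out "task A", row 1 reads out "task B".
MODULAR = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
ENTANGLED = [[0.5, 0.5, 0.25, 0.25], [0.25, 0.25, 0.5, 0.5]]

print(ablation_effect(HIDDEN, MODULAR, {0, 1}))    # only task A moves
print(ablation_effect(HIDDEN, ENTANGLED, {0, 1}))  # both tasks move
```

In the modular table, removing units 0 and 1 changes only task A. In the entangled table, the same ablation also shifts task B, which is the "bleed" case described above.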
The unsettling twist is that you usually can't see any of this from the outside. Two models can hit identical accuracy while running radically different internal machinery, and gains on one axis (accuracy) routinely cost you another (faithfulness, calibration) (see: What actually happens inside a language model?; What actually happens inside the minds of language models?). A network can even ace every benchmark while its internal representation is incoherent, the 'fractured entangled representation' problem, so a model that looks like it's reasoning may just be retrieving a memorized pattern that resembles reasoning (see: Can AI pass every test while understanding nothing?). That matters here because if you can't tell knowledge lookup apart from genuine inference behaviorally, you can't tell when one is masquerading as the other.
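A minimal sketch of the "same behavior, different internals" point, assuming two hand-written linear models (`model_a` and `model_b` are hypothetical, not taken from any cited note). Both return the same output for every input, yet their hidden vectors differ, so a test that only inspects outputs cannot tell them apart.

```python
# Toy illustration only: two hand-built models, not trained networks.

def model_a(x):
    """Hidden layer copies the inputs; the readout sums them."""
    hidden = [x[0], x[1]]
    return hidden, hidden[0] + hidden[1]


def model_b(x):
    """Hidden layer mixes the inputs; the readout uses only the first unit."""
    hidden = [x[0] + x[1], x[0] - x[1]]
    return hidden, hidden[0]


INPUTS = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.25]]

for x in INPUTS:
    hidden_a, out_a = model_a(x)
    hidden_b, out_b = model_b(x)
    print(x, out_a == out_b, hidden_a, hidden_b)
```

Telling the two apart requires looking at the hidden values themselves, which is the job that probing and ablation do on real models.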
The less expected point is that this interference isn't only a bug to suppress; it's also where capability comes from. Base models already hold latent reasoning ability that minimal training merely *elicits* rather than installs (see: Do base models already contain hidden reasoning ability?), which means reasoning is woven through the same weights that store knowledge in the first place. And one escape route from the contamination is architectural: interleaving reasoning steps with external lookups (querying a real source mid-chain) grounds the inference so stored priors can't silently steer it off course (see: Can interleaving reasoning with real-world feedback prevent hallucination?). The interference, in other words, may be the price of having reasoning and knowledge in one network at all, and the open design question is how much to separate them versus how to keep them honest while entangled.
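The grounding idea can be sketched as a loop that resolves each reasoning step against a table of facts. This is a toy in the spirit of interleaved reasoning and lookup, not the ReAct implementation: `SOURCE` stands in for an external source such as a search API, `PRIOR` is a deliberately wrong stored belief, and `resolve_chain` is a hypothetical helper.

```python
# Toy illustration only: SOURCE, PRIOR, and resolve_chain are hypothetical.

# Stand-in for an external source queried mid-chain.
SOURCE = {
    ("capital", "Australia"): "Canberra",
    ("region", "Canberra"): "Australian Capital Territory",
    ("region", "Sydney"): "New South Wales",
}

# A plausible-sounding but wrong stored belief.
PRIOR = {("capital", "Australia"): "Sydney"}

# The model's unaided memory: correct facts overridden by the wrong prior.
MEMORY = {**SOURCE, **PRIOR}


def resolve_chain(relations, start, facts):
    """Follow each relation in turn, resolving every step against `facts`."""
    entity, trace = start, []
    for relation in relations:
        entity = facts[(relation, entity)]
        trace.append((relation, entity))
    return entity, trace


QUESTION = ["capital", "region"]  # "Which region contains Australia's capital?"

print(resolve_chain(QUESTION, "Australia", MEMORY))  # ungrounded chain
print(resolve_chain(QUESTION, "Australia", SOURCE))  # lookup at every step
```

In the ungrounded run the wrong prior enters at the first step and the second step builds on it without complaint. Checking each step against the source stops the error before it propagates.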
Sources (8 notes)
Two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.
LLMs implement a content-independent three-stage reasoning mechanism—recitation, middle-term suppression, mediation—that works across architectures. However, additional attention heads encoding world knowledge systematically bias conclusions toward semantically plausible rather than logically valid answers, with contamination increasing at larger scales.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Research shows that LLMs can achieve the same output through different internal mechanisms, and improvements in one dimension like accuracy reliably degrade others like faithfulness and calibration. Internal structure matters even when behavior appears identical.
LLMs can achieve identical accuracy while maintaining radically different internal representations, and mechanisms that appear interpretable may not causally drive outputs. This decoupling means performance metrics alone mask crucial differences in how models actually work.
The Fractured Entangled Representation hypothesis holds that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) reduces hallucination and error propagation by injecting real-world feedback at each step. On knowledge-intensive tasks it mitigates failure modes common in pure chain-of-thought, and on interactive decision-making benchmarks it outperforms imitation and reinforcement learning baselines by 10 to 34 percentage points in absolute success rate.