What circuit mechanisms produce belief bias in syllogistic reasoning?

This explores what's actually happening inside the model when it gets fooled by a believable-but-illogical conclusion: the specific circuitry that lets world knowledge override logical form. The clearest answer in the corpus is mechanistic. Syllogistic reasoning runs on a content-independent three-stage circuit: recitation (restating the premises), middle-term suppression (dropping the shared term that links them), and mediation (drawing the conclusion). This same machinery shows up across architectures, so it looks like a genuine reasoning algorithm rather than a memorized trick. Belief bias enters through a *separate* set of attention heads that encode world knowledge and quietly tilt the conclusion toward what's semantically plausible instead of what's logically valid. The unsettling part: this contamination gets *worse* at larger scales, so scaling up doesn't wash out the bias, it amplifies it (see: How do language models perform syllogistic reasoning internally?).
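To make "logically valid regardless of content" concrete, here is a minimal sketch of a brute-force validity checker for syllogistic forms. It is not the circuit described in the notes and is not taken from any of the cited papers; it only illustrates the logical standard that belief bias departs from. The names `holds`, `is_valid`, `BARBARA`, and `UNDISTRIBUTED_MIDDLE` are hypothetical, and the sketch assumes modern semantics in which a term may be empty (no existential import).

```python
from itertools import product

# The three terms of a syllogism: subject, middle, predicate.
TERMS = ("S", "M", "P")

# Each region of the three-circle Venn diagram is a tuple of booleans
# saying whether an individual in that region belongs to S, M, and P.
REGIONS = list(product([False, True], repeat=3))


def holds(statement, inhabited):
    """Check one categorical statement against the inhabited regions."""
    quantifier, x, y = statement
    i, j = TERMS.index(x), TERMS.index(y)
    if quantifier == "all":        # All X are Y
        return not any(r[i] and not r[j] for r in inhabited)
    if quantifier == "no":         # No X are Y
        return not any(r[i] and r[j] for r in inhabited)
    if quantifier == "some":       # Some X are Y
        return any(r[i] and r[j] for r in inhabited)
    if quantifier == "some_not":   # Some X are not Y
        return any(r[i] and not r[j] for r in inhabited)
    raise ValueError(f"unknown quantifier: {quantifier}")


def is_valid(premises, conclusion):
    """Valid iff no choice of inhabited regions makes the premises true
    and the conclusion false (2**8 = 256 candidate models)."""
    for mask in product([False, True], repeat=len(REGIONS)):
        inhabited = [r for r, keep in zip(REGIONS, mask) if keep]
        premises_true = all(holds(p, inhabited) for p in premises)
        if premises_true and not holds(conclusion, inhabited):
            return False
    return True


# All M are P, all S are M, therefore all S are P.
BARBARA = ([("all", "M", "P"), ("all", "S", "M")], ("all", "S", "P"))

# All P are M, all S are M, therefore all S are P (undistributed middle).
UNDISTRIBUTED_MIDDLE = ([("all", "P", "M"), ("all", "S", "M")], ("all", "S", "P"))

print(is_valid(*BARBARA))               # True
print(is_valid(*UNDISTRIBUTED_MIDDLE))  # False
```

The checker never sees what S, M, and P stand for, so its verdict cannot depend on how believable the conclusion sounds. Filling the second form with "flowers", "things that need water", and "roses" yields a conclusion most readers would accept, yet the form is still invalid.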

Zoom out from the circuit and the behavior matches humans almost eerily well. Models reproduce the human belief-bias signature item-by-item: the same syllogisms that trip people up trip up the model, at comparable error rates, and the same pattern recurs on natural-language inference and the Wason selection task. That isomorphism across three independent tasks is the behavioral shadow of the circuit story: content and logical form aren't cleanly separable inside a transformer, they're entangled by architecture (see: Do language models show the same content effects humans do?). A complementary probe makes the point even sharper. Strip the familiar semantics out of a reasoning problem while leaving the logical rules intact, and performance collapses. The model was never manipulating symbols; it was riding token associations and parametric commonsense (see: Do large language models reason symbolically or semantically?).
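One common way to quantify a belief-bias signature like this is the 2x2 design from the human reasoning literature, which crosses validity with believability and compares how often conclusions are accepted in each cell. The sketch below assumes that design; it is not the scoring procedure of any cited note. The function names and the `toy_trials` list are hypothetical, and the trial values are made up purely to exercise the code, not results from any model or study.

```python
from collections import defaultdict


def acceptance_rates(trials):
    """Map each (valid, believable) cell to the fraction of accepted conclusions.

    Each trial is a tuple (valid, believable, accepted) of booleans.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [accepted, total]
    for valid, believable, accepted in trials:
        cell = counts[(valid, believable)]
        cell[0] += int(accepted)
        cell[1] += 1
    return {cell: acc / total for cell, (acc, total) in counts.items()}


def belief_and_logic_effects(rates):
    """Return (belief effect, logic effect) as differences in acceptance rate.

    Belief effect: believable minus unbelievable, averaged over validity.
    Logic effect: valid minus invalid, averaged over believability.
    All four cells must be present in `rates`.
    """
    belief = (rates[(True, True)] + rates[(False, True)]
              - rates[(True, False)] - rates[(False, False)]) / 2
    logic = (rates[(True, True)] + rates[(True, False)]
             - rates[(False, True)] - rates[(False, False)]) / 2
    return belief, logic


# Hypothetical toy trials: (valid, believable, accepted). Not real data.
toy_trials = [
    (True, True, True), (True, True, True),
    (True, False, True), (True, False, False),
    (False, True, True), (False, True, False),
    (False, False, False), (False, False, False),
]

rates = acceptance_rates(toy_trials)
belief_effect, logic_effect = belief_and_logic_effects(rates)
print(belief_effect, logic_effect)
```

A purely logical reasoner would show a logic effect of 1.0 and a belief effect of 0.0. A positive belief effect means believability is shifting acceptance independently of validity, which is the pattern the item-by-item comparison with humans looks for.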

Where does the bias come from in the first place? Not from instruction tuning. A causal experiment varying random seeds and cross-tuning found that models sharing a pretrained backbone carry a similar bias fingerprint regardless of what finetuning data they saw: biases are planted during pretraining and only nudged afterward (see: Where do cognitive biases in language models come from?). And it's not unique to syllogisms. The same flavor of error, driven by training-data statistics, shows up in causal reasoning, where models reproduce human failures like weak explaining-away and Markov violations on collider networks (see: Do large language models make the same causal reasoning mistakes as humans?). The common thread is that these aren't bugs in a logic engine; they're the predictable output of a system whose 'reasoning' is statistical pattern completion.

That reframing is where it gets interesting for anyone hoping to fix it. If chain-of-thought were genuine inference, you could reason your way past belief bias; but the corpus argues CoT is constrained imitation of reasoning *form*, reproducing familiar schemata rather than performing novel symbolic manipulation, which is why it fails predictably under distribution shift (see: Does chain-of-thought reasoning reveal genuine inference or pattern matching?; Why does chain-of-thought reasoning fail in predictable ways?). There's a layered geometry to it too: knowledge tends to live in lower network layers and reasoning adjustments in higher ones, which is why pushing harder on reasoning can actively degrade knowledge-heavy behavior (see: Why does reasoning training help math but hurt medical tasks?). One concrete lever does exist: training judges with RL to actually deliberate during evaluation measurably reduces their susceptibility to surface-feature biases (see: Can reasoning during evaluation reduce judgment bias in LLM judges?). That suggests belief bias may be suppressible through process, even if it's baked in at the source. The thing worth walking away with: the model has a real, transferable logic circuit *and* a world-knowledge circuit running in parallel, and belief bias is what you see when the second one wins.

Sources (9 notes)

How do language models perform syllogistic reasoning internally?

LLMs implement a content-independent three-stage reasoning mechanism—recitation, middle-term suppression, mediation—that works across architectures. However, additional attention heads encoding world knowledge systematically bias conclusions toward semantically plausible rather than logically valid answers, with contamination increasing at larger scales.

Do language models show the same content effects humans do?

LLMs show identical content-sensitivity patterns to humans on NLI, syllogisms, and Wason tasks, with belief-bias signatures matching human error rates item-by-item. This behavioral isomorphism across three independent tasks suggests that content and logical form are architecturally inseparable in transformer reasoning.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Do large language models make the same causal reasoning mistakes as humans?

LLMs show weak explaining away and Markov violations in collider networks, matching human error patterns exactly. This suggests shared mechanisms rooted in training data statistics rather than categorical reasoning inferiority.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

Why does reasoning training help math but hurt medical tasks?

A two-phase inference model shows that knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.

Can reasoning during evaluation reduce judgment bias in LLM judges?

Training judges with reinforcement learning to reason about evaluations—by converting judgment tasks into verifiable problems with synthetic data pairs—produces judges that think through their decisions rather than relying on exploitable surface features, directly mitigating authority, verbosity, position, and beauty bias.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher re-testing whether belief bias in LLM reasoning is still constrained by the circuit bottlenecks a curated library documented between 2022–2025.

What a curated library found — and when (findings span 2022–2025; treat as dated claims):
• Syllogistic reasoning uses a three-stage content-independent circuit (recitation → middle-term suppression → mediation), but belief bias enters via separate attention heads encoding world knowledge that override logical form (2024, arXiv:2408.08590).
• Belief bias *worsens* at larger scales, not improves — scaling doesn't dissolve the contamination (2024, arXiv:2408.08590).
• Models reproduce human belief-bias signatures item-by-item across syllogisms, natural-language inference, and Wason tasks; semantic content and logical form are entangled by architecture, not cleanly separable (2022–2023, arXiv:2207.07051, arXiv:2305.14825).
• Cognitive biases are planted during pretraining and only nudged by finetuning; models sharing a pretrained backbone carry similar bias fingerprints regardless of downstream tuning (2025, arXiv:2507.07186).
• Chain-of-thought is constrained imitation of reasoning form, not genuine inference; it fails under distribution shift and cannot reason past belief bias (2025, arXiv:2506.02878).
• RL-trained judges that deliberate during evaluation measurably reduce susceptibility to surface-feature biases, suggesting belief bias *is* suppressible through process (2025, arXiv:2505.10320).

Anchor papers (verify; mind their dates):
– arXiv:2408.08590 (2024): Reasoning Circuits in Language Models — the mechanistic core.
– arXiv:2506.02878 (2025): CoT Is Not True Reasoning — strongest challenge to CoT-as-fix narrative.
– arXiv:2507.07186 (2025): Planted in Pretraining — pretraining as bias source.
– arXiv:2505.10320 (2025): J1: Incentivizing Thinking via RL — the only concrete suppressant.

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, probe whether recent model scaling (o1, o3 family), improved reasoning architectures (test-time compute, search, learned verifiers), or new evaluation regimes have since relaxed the claim that bias worsens at scale, that CoT cannot transcend it, or that pretraining-planted biases resist finetuning. Distinguish the durable question (are logic and content separable?) from perishable limits (does scaling help?). Cite what moved it.
(2) Surface the strongest CONTRADICTING work from the last 3–6 months — especially any paper showing that stronger CoT, reasoning-optimized finetuning, or novel architectures *do* decouple content from logical form, or that belief bias plateaus or reverses at sufficient scale.
(3) Propose 2 research questions that assume the circuit regime may have shifted: e.g., "Does process-heavy test-time reasoning (search, critique, verification) hollow out the advantage of world-knowledge heads over logic heads?"; "Can targeted causal intervention in lower layers suppress pretraining-planted biases without finetuning?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
