SYNTHESIS NOTE

Why do LLMs fail at simple deductive reasoning?

LLMs excel at complex multi-hop reasoning across sentences but struggle with trivial deductions humans find obvious. What explains this counterintuitive reversal in capability?

Synthesis note · 2026-02-21 · sourced from Natural Language Inference

The "Minds vs. Machines" entailment benchmark reveals a non-obvious asymmetry: LLMs outperform humans on multi-hop reasoning tasks that require integrating information across multiple sentences and knowledge types, while humans outperform LLMs on tasks requiring simple deductive inference.

This reverses the common intuition that LLMs handle simple cases well and fail on complex ones. Instead, the more complex the multi-hop reasoning, the greater the LLM advantage over humans. Simple deductive steps, the kind of inference humans find trivially obvious, are precisely where LLMs are weakest.

The knowledge-type taxonomy matters. It distinguishes entity-grounded knowledge (facts about entities, verifiable externally), commonsense knowledge (implicit everyday reasoning, hard to articulate), and localized knowledge (context-specific, impossible to infer unless stated). LLMs handle entity-grounded reasoning better; humans handle commonsense inferences better.
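One way to make the comparison concrete is to tag each entailment item with its knowledge type and reasoning type, then compute the model-minus-human accuracy gap per group. The sketch below is only illustrative: the class names, field names, and the `accuracy_gap` helper are hypothetical, and the four records are invented placeholders that exercise the code. They are not results from the benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Knowledge(Enum):
    ENTITY_GROUNDED = "entity-grounded"
    COMMONSENSE = "commonsense"
    LOCALIZED = "localized"


class Reasoning(Enum):
    SIMPLE_DEDUCTIVE = "simple deductive"
    MULTI_HOP = "multi-hop"


@dataclass(frozen=True)
class Item:
    """One annotated entailment item (hypothetical schema)."""
    knowledge: Knowledge
    reasoning: Reasoning
    human_correct: bool
    model_correct: bool


def accuracy_gap(items, key):
    """Return {group: model accuracy - human accuracy}, grouped by key(item).

    A positive value means the model beat humans on that group.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [count, human hits, model hits]
    for item in items:
        bucket = totals[key(item)]
        bucket[0] += 1
        bucket[1] += item.human_correct
        bucket[2] += item.model_correct
    return {group: (m - h) / n for group, (n, h, m) in totals.items()}


# Invented placeholder records, NOT benchmark data.
items = [
    Item(Knowledge.ENTITY_GROUNDED, Reasoning.MULTI_HOP, False, True),
    Item(Knowledge.ENTITY_GROUNDED, Reasoning.MULTI_HOP, True, True),
    Item(Knowledge.COMMONSENSE, Reasoning.SIMPLE_DEDUCTIVE, True, False),
    Item(Knowledge.LOCALIZED, Reasoning.SIMPLE_DEDUCTIVE, True, True),
]

by_reasoning = accuracy_gap(items, key=lambda it: it.reasoning)
by_knowledge = accuracy_gap(items, key=lambda it: it.knowledge)
```

Grouping by a key function keeps the two axes separate, so the same items can be sliced by reasoning type or by knowledge type without restructuring the data.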

This connects to the inversion captured in "Does LLM grammatical performance decline with structural complexity?", but the failure mode is different. Grammatical complexity degrades LLM performance. Inferential complexity does not necessarily degrade it, and may even improve it relative to humans, who tire or miss steps in multi-step chains. Structural complexity and inferential complexity have different performance profiles.

The practical implication: LLM-assisted reasoning is best suited to complex multi-step inference that humans find cognitively taxing, not to simple first-order deductions that humans find trivial. The related note "Why do embedding contexts confuse LLM entailment predictions?" shows the specific class of simple inference where LLMs fail worst: entailments that humans find effortless.

Original note title

llms outperform humans at multi-hop reasoning in extended contexts but fail at simple deductive inference