Why do LLMs fail at simple deductive reasoning?
LLMs excel at complex multi-hop reasoning across sentences but struggle with trivial deductions humans find obvious. What explains this counterintuitive reversal in capability?
The "Minds vs. Machines" entailment benchmark reveals a non-obvious asymmetry: LLMs outperform humans on multi-hop reasoning tasks that require integrating information across multiple sentences and knowledge types, while humans outperform LLMs on tasks requiring simple deductive inference.
This reverses the common intuition that LLMs handle simple cases well and fail on complex ones. Instead, the more complex the multi-hop reasoning, the greater the LLM advantage over humans. Simple deductive steps, the kind of inference humans find trivially obvious, are precisely where LLMs are weakest.
The knowledge-type taxonomy matters. It distinguishes entity-grounded knowledge (facts about entities, verifiable externally), commonsense knowledge (implicit everyday reasoning, hard to articulate), and localized knowledge (context-specific, impossible to infer unless stated). LLMs handle entity-grounded reasoning better; humans handle commonsense inferences better.
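One way to make the asymmetry concrete is to tally accuracy per knowledge type for human and model predictions and look at the sign of the difference. The sketch below is illustrative only: the `Judgment` record, the `accuracy_gap` helper, and the toy records are all hypothetical names and placeholder values invented for this example, not data or code from the benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class KnowledgeType(Enum):
    # The three categories named in the taxonomy above.
    ENTITY_GROUNDED = "entity-grounded"
    COMMONSENSE = "commonsense"
    LOCALIZED = "localized"


@dataclass(frozen=True)
class Judgment:
    # Hypothetical record layout: one entailment item with two predictions.
    knowledge_type: KnowledgeType
    gold: bool   # gold label: does the premise entail the hypothesis?
    human: bool  # human annotator's prediction
    model: bool  # LLM's prediction


def accuracy_gap(judgments):
    """Return {knowledge_type: model_accuracy - human_accuracy}.

    Positive values mean the model did better on that category,
    negative values mean the human did better.
    """
    totals = defaultdict(int)
    human_hits = defaultdict(int)
    model_hits = defaultdict(int)
    for j in judgments:
        totals[j.knowledge_type] += 1
        human_hits[j.knowledge_type] += j.human == j.gold
        model_hits[j.knowledge_type] += j.model == j.gold
    return {k: (model_hits[k] - human_hits[k]) / totals[k] for k in totals}


# Placeholder records, invented purely to exercise the function.
toy = [
    Judgment(KnowledgeType.ENTITY_GROUNDED, True, False, True),
    Judgment(KnowledgeType.ENTITY_GROUNDED, False, False, False),
    Judgment(KnowledgeType.COMMONSENSE, True, True, False),
    Judgment(KnowledgeType.COMMONSENSE, True, True, True),
]
gaps = accuracy_gap(toy)
```

Grouping by category before comparing keeps the two effects from cancelling out in a single aggregate score. The same grouping could be applied to reasoning type (multi-hop versus simple deductive) if items carry that label.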
This connects to the inversion captured in "Does LLM grammatical performance decline with structural complexity?", but the failure mode is different. Grammatical complexity degrades LLM performance. Inferential complexity does not necessarily degrade it, and may even improve it relative to humans, who tire or miss steps in multi-step chains. Structural complexity and inferential complexity have different profiles.
The practical implication: the right use case for LLM-assisted reasoning is complex multi-step inference that humans find cognitively taxing, not simple first-order deductions that humans find trivial. The note "Why do embedding contexts confuse LLM entailment predictions?" identifies the specific class of simple inference where LLMs fail worst: trivial entailments that humans find effortless.
Inquiring lines that use this note as a source (6)
This note is a source for these synthesized inquiries.
- How do humans and LMs differ on multi-hop reasoning?
- What cognitive capacities do LLMs actually lack that commentary assumes they have?
- Why do LLMs struggle with negation and exception handling?
- What specific linguistic features cause LLMs to fail at trivial entailment?
- Can LLMs improve at simple deduction through different training approaches?
- Why do LLMs fail at counterfactual reasoning despite factual knowledge?
Related concepts in this collection (3)
- Does LLM grammatical performance decline with structural complexity?
  This explores whether LLMs fail uniformly at grammar or whether their failures follow a predictable pattern tied to input complexity. Understanding the relationship matters for deciding when LLM annotations are reliable.
  Structural complexity degrades LLMs; inferential multi-hop complexity does not follow the same pattern.
- Can non-reasoning models catch up with more compute?
  Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
  Complementary: reasoning capability is training-regime specific; this note adds that the regime shapes which *type* of task LLMs are better at than humans.
- Why do embedding contexts confuse LLM entailment predictions?
  Can language models distinguish between contexts that preserve versus cancel entailments? The study explores whether LLMs systematically fail to apply the semantic rules governing presupposition triggers and non-factive verbs.
  The specific trivial inferences where LLMs fail.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- Logical Reasoning in Large Language Models: A Survey
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- Large Language Model Reasoning Failures
- Do Large Language Models Latently Perform Multi-Hop Reasoning?
- Minds versus Machines: Rethinking Entailment Verification with Language Models
- Inductive or Deductive? Rethinking the Fundamental Reasoning Abilities of LLMs
- Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners
Original note title
llms outperform humans at multi-hop reasoning in extended contexts but fail at simple deductive inference