How much does the order of premises actually matter for reasoning?
When you rearrange the order of logical premises in a deduction task, does it change how well language models can solve it? This tests whether LLMs reason abstractly or process input sequentially.
LLMs are surprisingly brittle to the ordering of premises in deductive reasoning tasks, despite the fact that premise order does not alter the underlying logical task. The "Premise Order Matters" paper shows that permuting premise order can cause a performance drop of over 30%.
The key finding is directional: LLMs perform best when premises are presented in the same order as the context required in intermediate reasoning steps, that is, when the prompt mirrors the ground-truth proof sequence. When the model has to reorder the premises itself to construct the proof, accuracy drops sharply.
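One way to probe this directional effect is to generate prompt variants that hold the premises fixed and vary only their order, then score each variant by how far it departs from the proof order. The sketch below is illustrative rather than a reproduction of the paper's setup: the toy rule chain, the helper names `kendall_tau` and `build_prompt`, and the use of normalized Kendall tau as the order score are all assumptions made for this example.

```python
import random
from itertools import combinations


def kendall_tau(order, reference):
    """Normalized Kendall tau in [-1, 1].

    1.0 means `order` matches `reference` exactly, -1.0 means it is fully reversed.
    """
    position = {item: i for i, item in enumerate(order)}
    concordant = discordant = 0
    # Each pair (a, b) has a before b in the reference order.
    for a, b in combinations(reference, 2):
        if position[a] < position[b]:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 1.0


def build_prompt(facts, rules, question):
    """Render facts and if-then rules as a plain-text deduction prompt."""
    lines = [f"{fact} is true." for fact in facts]
    lines += [f"If {a}, then {b}." for a, b in rules]
    lines.append(question)
    return "\n".join(lines)


# Hypothetical toy task: the proof of D uses the rules in exactly this order.
proof_order = [("A", "B"), ("B", "C"), ("C", "D")]

shuffled = list(proof_order)
random.Random(0).shuffle(shuffled)  # fixed seed so the variant is repeatable

variants = {
    "forward": list(proof_order),
    "backward": list(reversed(proof_order)),
    "shuffled": shuffled,
}

for name, rules in variants.items():
    tau = kendall_tau(rules, proof_order)
    print(f"--- {name} (tau = {tau:+.2f}) ---")
    print(build_prompt(["A"], rules, "Is D true?"))
```

Every variant encodes the same logical task, so any accuracy difference between them when sent to a model is attributable to ordering alone.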
This brittleness suggests that LLM deductive reasoning operates not on abstract logical relations but on sequential pattern matching over the input. The model processes premises in order and builds intermediate representations that depend on that order. When the order does not match the proof structure, the model must implicitly reorder the premises, a capability it lacks or executes poorly.
The finding connects to multiple existing insights about surface-level processing:
As the note "Why do chain-of-thought examples fail across different conditions?" shows, order sensitivity is not unique to premises: it extends across the entire prompt structure. Both findings suggest that LLMs process prompts as sequential narratives, not as unordered logical structures.
As the note "Does training data format shape reasoning strategy more than domain?" shows, format alone can change reasoning behavior, and premise ordering is another format effect: the same logical content produces dramatically different performance depending on how it is presented. The 30% gap is comparable to the 7.5x format effect documented for training data.
The practical implication is that anyone constructing prompts for deductive reasoning tasks should order premises to match the expected proof sequence. This is trivial for a prompt designer who already knows the answer but impossible in production settings where the answer is unknown, which creates a fundamental deployment challenge for LLM deductive reasoning.
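A minimal sketch of what "ordering premises to match the proof sequence" involves, under a strong assumption: the premises have already been parsed into explicit rules of the form (antecedents, consequent). The function name `order_by_forward_chaining` and the rule encoding are hypothetical, chosen for this example. Note that the reordering is done by running forward chaining, which is the deduction itself, so the sketch illustrates why the fix presupposes that the proof structure is already available.

```python
def order_by_forward_chaining(facts, rules):
    """Return `rules` in the order in which forward chaining fires them.

    facts: iterable of atoms known to be true.
    rules: list of (antecedents, consequent), where antecedents is a tuple of atoms.
    Rules that never fire are appended at the end in their original order.
    """
    known = set(facts)
    remaining = list(rules)
    ordered = []
    progress = True
    while progress:
        progress = False
        # Iterate over a copy because fired rules are removed from `remaining`.
        for rule in list(remaining):
            antecedents, consequent = rule
            if all(atom in known for atom in antecedents):
                ordered.append(rule)
                remaining.remove(rule)
                known.add(consequent)
                progress = True
    return ordered + remaining


# Hypothetical scrambled premises; ("X",) -> "Y" is a distractor that never fires.
scrambled = [
    (("B",), "C"),
    (("C",), "D"),
    (("A",), "B"),
    (("X",), "Y"),
]

for antecedents, consequent in order_by_forward_chaining({"A"}, scrambled):
    print(f"If {' and '.join(antecedents)}, then {consequent}.")
```

The sketch keeps every rule that fires, not only those needed for a particular query, so it approximates a proof-aligned order rather than producing a minimal proof.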
Related concepts in this collection 3
- Why do chain-of-thought examples fail across different conditions?
  Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.
  order sensitivity extends from exemplars to premises; shared mechanism of sequential processing
- Does training data format shape reasoning strategy more than domain?
  What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
  premise ordering is a format effect comparable in magnitude
- Do large language models reason symbolically or semantically?
  Can LLMs follow explicit logical rules when those rules contradict their training knowledge? Testing whether reasoning operates independently of semantic associations reveals what computational mechanisms actually drive LLM multi-step inference.
  sequential processing rather than abstract logical manipulation
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Premise Order Matters in Reasoning with Large Language Models
- Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs
- Reasoning Strategies in Large Language Models: Can They Follow, Prefer, and Optimize?
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- First Try Matters: Revisiting the Role of Reflection in Reasoning Models
- Logical Reasoning in Large Language Models: A Survey
- Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning
- Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners
Original note title
premise ordering affects deductive reasoning performance by over 30 percent