Why does premise ordering shift syllogistic reasoning performance by over 30 percent?
This explores why simply reordering the premises in a logic problem — without changing the logic itself — can swing an LLM's accuracy by more than 30 percent, and what that reveals about how these models actually 'reason.'
The short answer the corpus points to: models aren't manipulating logic abstractly; they're pattern-matching against the sequences they saw during training. Accuracy peaks when the premises arrive in the same order as the steps of the ground-truth proof and collapses when they don't (see "How much does the order of premises actually matter for reasoning?"). The logic is identical either way; what changes is whether the surface form matches a familiar template.
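To make "the logic is identical either way" concrete, here is a minimal sketch using a toy chain of implications. The premises, the `entails` helper, and the `render` prompt format are all hypothetical illustrations, not the benchmark used in the cited note. A forward-chaining check reaches the same verdict for both orderings, while the rendered prompt text differs.

```python
# Hypothetical toy problem: each premise is an implication (antecedent, consequent).
# PREMISES is listed in the order a step-by-step proof of D from A would use them.
PREMISES = [("A", "B"), ("B", "C"), ("C", "D")]


def entails(premises, fact, goal):
    """Forward-chain over the implications until no new fact can be derived."""
    known = {fact}
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in premises:
            if antecedent in known and consequent not in known:
                known.add(consequent)
                changed = True
    return goal in known


def render(premises, fact, goal):
    """Turn a premise list into prompt text (an assumed, illustrative format)."""
    lines = [f"If {a}, then {b}." for a, b in premises]
    return " ".join(lines) + f" {fact} is true. Is {goal} true?"


forward = list(PREMISES)          # premises in proof order
reversed_order = PREMISES[::-1]   # same premises, reversed

# Same logical content: both orderings entail D from A.
print(entails(forward, "A", "D"), entails(reversed_order, "A", "D"))

# Different surface form: the two prompts are not the same string.
print(render(forward, "A", "D"))
print(render(reversed_order, "A", "D"))
```

A sound procedure treats the premise list as a set, so its answer cannot depend on ordering; only the surface string changes.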
That reframes the 30 percent drop as evidence of *imitation rather than inference*. Several notes converge on this from different angles. One shows that chain-of-thought works by constraining models to reproduce familiar reasoning shapes from training, and degrades predictably the moment the distribution shifts, which is the signature of mimicry, not capability (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"). Another lands the point even harder: logically *invalid* chain-of-thought exemplars perform nearly as well as valid ones, meaning it's the structural form of the reasoning, not its logical correctness, that drives the gains (see "Does logical validity actually drive chain-of-thought gains?"). If form is what matters, then rearranging the form (here, the premise order) should matter a lot. It does.
There's a complementary mechanistic story under the hood. When researchers traced how models actually execute a syllogism, they found a content-independent three-stage circuit (recitation, middle-term suppression, mediation). That circuit, however, is contaminated by separate attention heads encoding world knowledge, which bias conclusions toward what's *plausible* rather than what *follows* (see "How do language models perform syllogistic reasoning internally?"). So the reasoning machinery is real but fragile and entangled with surface cues. Premise order is exactly the kind of surface cue that nudges such a system off the rails: it doesn't break the logic, it disrupts the sequential scaffolding the circuit leans on.
The deeper pattern is that LLM reasoning failures track *familiarity*, not difficulty. Models break at instance-novelty boundaries rather than complexity thresholds: a reasoning chain succeeds if the model has seen similar instances, regardless of how hard the logic is (see "Do language models fail at reasoning due to complexity or novelty?"). A reordered premise set is, in effect, a less-familiar instance of the same problem, so performance sags. The same sensitivity shows up elsewhere as accuracy dropping just from padding the input with irrelevant tokens (see "Does reasoning ability actually degrade with longer inputs?"). Both are cases where something that *shouldn't* matter to the logic does matter to the model.
The thing worth walking away with: premise ordering isn't a quirky prompt-engineering footnote, it's a diagnostic. The 30 percent swing is a measurement of how much these models depend on the *shape and sequence* of a problem versus its actual logical content. If a model were genuinely doing abstract deduction, reordering the givens would be invisible to it — the way it's invisible to you. That it isn't tells you where the reasoning is really coming from.
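One way to read "premise ordering is a diagnostic" is as an order-invariance test: run the same problem under every premise permutation and compare accuracy on the proof order against accuracy overall. The sketch below assumes two hypothetical stand-in solvers. `single_pass_solver` is a toy that reads the premises once, left to right, so it depends on sequence; `fixpoint_solver` is order-invariant. Neither is a claim about how any real model works internally, and `order_swing` is an illustrative helper, not a published metric.

```python
from itertools import permutations

# Hypothetical toy problem, premises listed in proof order.
PREMISES = [("A", "B"), ("B", "C"), ("C", "D")]


def single_pass_solver(premises, fact, goal):
    """Toy order-sensitive solver: applies each premise once, left to right."""
    known = {fact}
    for antecedent, consequent in premises:
        if antecedent in known:
            known.add(consequent)
    return goal in known


def fixpoint_solver(premises, fact, goal):
    """Order-invariant solver: repeats passes until nothing new is derived."""
    known = {fact}
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in premises:
            if antecedent in known and consequent not in known:
                known.add(consequent)
                changed = True
    return goal in known


def order_swing(solver, premises, fact, goal, expected=True):
    """Return (accuracy on the given proof order, mean accuracy over all orderings)."""
    orders = list(permutations(premises))
    correct = [solver(list(order), fact, goal) == expected for order in orders]
    proof_order_acc = float(solver(list(premises), fact, goal) == expected)
    return proof_order_acc, sum(correct) / len(correct)


for solver in (single_pass_solver, fixpoint_solver):
    proof_acc, mean_acc = order_swing(solver, PREMISES, "A", "D")
    print(f"{solver.__name__}: proof order {proof_acc:.2f}, all orders {mean_acc:.2f}")
```

In this toy setup the single-pass solver succeeds only when the premises arrive in proof order (one of the six permutations), while the fixed-point solver is correct on all six. Swapping a real model's answers in for the solver would turn the same gap into the kind of measurement described above.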
Sources (6 notes)
How much does the order of premises actually matter for reasoning? Reordering premises in logical tasks drops LLM accuracy by more than 30 percent, even though the logic remains identical. Performance peaks when premises match the ground truth proof sequence, suggesting LLMs rely on sequential pattern matching rather than abstract logical manipulation.
Does chain-of-thought reasoning reveal genuine inference or pattern matching? CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts, the signature of imitation rather than capability emergence.
Does logical validity actually drive chain-of-thought gains? Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties, not logical validity, drive the gains. The model learns the form of reasoning, not genuine inference.
How do language models perform syllogistic reasoning internally? LLMs implement a content-independent three-stage reasoning mechanism (recitation, middle-term suppression, mediation) that works across architectures. However, additional attention heads encoding world knowledge systematically bias conclusions toward semantically plausible rather than logically valid answers, with contamination increasing at larger scales.
Do language models fail at reasoning due to complexity or novelty? LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Does reasoning ability actually degrade with longer inputs? FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.