SYNTHESIS NOTE

Does LLM math reasoning truly generalize or just pattern match?

This note explores whether high scores on math benchmarks reflect genuine reasoning ability or merely template familiarity. The question matters because it determines how much we should trust LLMs on novel numerical problems.

Synthesis note · 2026-06-03 · sourced from Reasoning Critiques

GSM8K's near-saturated scores suggest that LLM math reasoning has genuinely advanced. GSM-Symbolic tests whether that advance is real by regenerating the same questions from symbolic templates. The findings deflate the headline. Models show notable variance across different instantiations of the same question, so single-point accuracy is unreliable. Performance declines when only the numerical values change (proper-name changes hurt less), and it degrades further as question complexity rises. Most damning, GSM-NoOp, which adds a clause that is related to the question but irrelevant to the answer, causes large drops, exposing that models cannot reliably distinguish relevant from irrelevant information. The conclusion: reasoning here is probabilistic pattern-matching, not formal reasoning.
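A minimal sketch of what controlled symbolic perturbation could look like, assuming a made-up toy question rather than an actual GSM-Symbolic template. The function name `instantiate`, the name pool, the value ranges, and the wording of the irrelevant clause are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical name pool; not taken from any benchmark.
NAMES = ["Sophie", "Liam", "Ava", "Noah"]


def instantiate(rng, change_names=True, change_numbers=True, add_noop=False):
    """Return (question, answer) for one instantiation of a toy template.

    Each perturbation axis can be toggled separately, so a drop in accuracy
    can be attributed to names, to numbers, or to the irrelevant clause.
    """
    name = rng.choice(NAMES) if change_names else "Sophie"
    per_day = rng.randint(2, 20) if change_numbers else 8
    days = rng.randint(2, 9) if change_numbers else 5
    # Keep the amount given away below the total so the answer stays positive.
    given_away = rng.randint(1, per_day * days - 1) if change_numbers else 12

    parts = [f"{name} picks {per_day} apples a day for {days} days."]
    if add_noop:
        # Related to the story but irrelevant to the arithmetic (NoOp-style).
        parts.append("Five of the apples are a bit smaller than average.")
    parts.append(f"{name} then gives away {given_away} apples.")
    parts.append(f"How many apples does {name} have left?")

    # The gold answer depends only on the symbolic variables, never on the
    # name or on the irrelevant clause.
    answer = per_day * days - given_away
    return " ".join(parts), answer


if __name__ == "__main__":
    for seed in range(3):
        question, answer = instantiate(random.Random(seed), add_noop=True)
        print(question, "->", answer)
```

Because the random draws happen before the irrelevant clause is appended, calling `instantiate` with the same seed with and without `add_noop` yields the same numbers and the same gold answer, which isolates the effect of the added clause.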

The keeper is the diagnostic method (controlled symbolic perturbation) and the verdict: benchmark gains can reflect template familiarity rather than reasoning, and the fragility is structural, not a tuning gap.
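One way to act on that verdict is to report a distribution of accuracies over several regenerated question sets instead of a single number. The sketch below assumes a hypothetical `solver` callable that maps question text to an integer answer; the helper names and the toy data in the usage lines are illustrative and are not results from any model.

```python
import statistics
from typing import Callable, Dict, Sequence, Tuple

Item = Tuple[str, int]  # (question text, gold answer)


def accuracy(solver: Callable[[str], int], items: Sequence[Item]) -> float:
    """Fraction of items where the solver's answer matches the gold answer."""
    return sum(solver(q) == gold for q, gold in items) / len(items)


def accuracy_spread(
    solver: Callable[[str], int],
    variant_sets: Sequence[Sequence[Item]],
) -> Dict[str, float]:
    """Score the solver on each regenerated set and summarize the spread."""
    accs = [accuracy(solver, items) for items in variant_sets]
    return {
        "mean": statistics.mean(accs),
        "stdev": statistics.pstdev(accs),  # population std. dev. across sets
        "min": min(accs),
        "max": max(accs),
    }


if __name__ == "__main__":
    # Toy stand-in: a "solver" that always answers 4, scored on two tiny sets.
    always_four = lambda question: 4
    sets = [[("a", 4), ("b", 4)], [("a", 4), ("b", 5)]]
    print(accuracy_spread(always_four, sets))
```

A wide gap between `min` and `max` across sets that differ only in names or numbers is the kind of signal the note treats as evidence of template familiarity.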

This is a landmark anchor for the vault's reasoning-fragility cluster. It converges with "Do language models fail at reasoning due to complexity or novelty?" and "Do large language models reason symbolically or semantically?", and the number-sensitivity echoes "Do large language models actually perform iterative optimization?".


Related concepts in this collection 3


Do language models fail at reasoning due to complexity or novelty?
Explores whether reasoning-model failures stem from task complexity thresholds or from encountering unfamiliar instances. Tests whether scaling chain length actually addresses the root cause of reasoning breakdown.
Link: both pin reasoning failure on training-distribution familiarity rather than intrinsic complexity

Do large language models reason symbolically or semantically?
Can LLMs follow explicit logical rules when those rules contradict their training knowledge? Testing whether reasoning operates independently of semantic associations reveals what computational mechanisms actually drive LLM multi-step inference.
Link: GSM-NoOp is a semantics-decoupling stress test confirming this

Do large language models actually perform iterative optimization?
Explores whether LLMs execute genuine numerical procedures like Newton-Raphson or instead pattern-match to memorized solution templates when solving constrained optimization problems.
Link: number-sensitivity is the same memorized-template fallback


Original note title

LLM math reasoning is fragile pattern-matching — accuracy drops when only numbers change and irrelevant clauses derail it
