Does LLM math reasoning truly generalize or just pattern match?
This explores whether high scores on math benchmarks reflect genuine reasoning ability or merely template familiarity. The question matters because it determines how much we should trust LLMs on novel numerical problems.
GSM8K's near-saturated scores suggest that LLM math reasoning has genuinely advanced. GSM-Symbolic tests that claim by regenerating the same questions from symbolic templates, so that names and numbers vary while the underlying logic stays fixed. The findings deflate the headline. Accuracy varies noticeably across instantiations of the same question, so a single-point score is unreliable. Performance declines when only the numerical values change (changing proper names hurts less), and it degrades as question complexity rises. Most damning is GSM-NoOp: adding a clause that is related to the story but irrelevant to the answer causes large drops, which shows that models cannot reliably separate relevant from irrelevant information. The conclusion: what looks like reasoning here is probabilistic pattern-matching, not formal reasoning.
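A minimal sketch of the perturbation idea, for orientation only. The template text, the name list, the NoOp clause, and the helper `instantiate` are all hypothetical and are not taken from the GSM-Symbolic benchmark; the point is just that the ground truth is computed from the sampled symbols, so every variant shares the same logic.

```python
import random
from dataclasses import dataclass

# Hypothetical template in the spirit of GSM-Symbolic (not from the benchmark).
TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "{noop}{name} gives away {c} apples. How many apples does {name} have left?"
)
NAMES = ["Ava", "Liam", "Noor", "Mateo"]
# Related to the story but irrelevant to the answer (GSM-NoOp style).
NOOP_CLAUSE = "{d} of the apples are slightly smaller than average. "


@dataclass
class Instance:
    question: str
    answer: int


def instantiate(rng: random.Random, with_noop: bool = False) -> Instance:
    """Sample a name and numbers, then derive the answer from the symbols."""
    a = rng.randint(5, 60)
    b = rng.randint(5, 60)
    c = rng.randint(1, a + b)  # constraint keeps the answer non-negative
    d = rng.randint(1, 5)      # only used by the irrelevant clause
    noop = NOOP_CLAUSE.format(d=d) if with_noop else ""
    question = TEMPLATE.format(name=rng.choice(NAMES), a=a, b=b, c=c, noop=noop)
    return Instance(question=question, answer=a + b - c)


if __name__ == "__main__":
    plain = instantiate(random.Random(1))
    distracted = instantiate(random.Random(1), with_noop=True)
    print(plain.question, "->", plain.answer)
    print(distracted.question, "->", distracted.answer)
```

Seeding both calls identically yields the same name, numbers, and answer, so the only difference between the two questions is the irrelevant clause. A model that reasons over the relevant quantities should give the same answer to both.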
The keeper is twofold: the diagnostic method (controlled symbolic perturbation) and the verdict. Benchmark gains can reflect template familiarity rather than reasoning, and the fragility is structural, not a tuning gap.
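One way to act on the method is to report a distribution of scores over many resampled variants instead of one number. The sketch below assumes a solver is any callable from question text to an integer answer; `accuracy_spread`, `make_toy_set`, and `brittle_solver` are illustrative names, and the brittle solver is a toy stand-in that fails on large operands, not a model of any real LLM.

```python
import random
import re
import statistics
from typing import Callable, Dict, List, Tuple

Item = Tuple[str, int]  # (question text, ground-truth answer)


def accuracy(solver: Callable[[str], int], items: List[Item]) -> float:
    """Fraction of items the solver answers exactly right."""
    return sum(solver(q) == ans for q, ans in items) / len(items)


def accuracy_spread(
    solver: Callable[[str], int],
    make_set: Callable[[random.Random], List[Item]],
    n_sets: int = 50,
    seed: int = 0,
) -> Dict[str, float]:
    """Score the solver on n_sets independently sampled variants."""
    rng = random.Random(seed)
    scores = [accuracy(solver, make_set(rng)) for _ in range(n_sets)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


def make_toy_set(rng: random.Random, size: int = 20) -> List[Item]:
    """Toy benchmark variant: same question shape, freshly sampled numbers."""
    items = []
    for _ in range(size):
        a, b = rng.randint(1, 99), rng.randint(1, 99)
        items.append((f"What is {a} + {b}?", a + b))
    return items


def brittle_solver(question: str) -> int:
    """Toy stand-in for a model: correct only when both operands are small."""
    a, b = (int(tok) for tok in re.findall(r"\d+", question))
    return a + b if max(a, b) < 70 else -1


if __name__ == "__main__":
    print(accuracy_spread(brittle_solver, make_toy_set))
```

The gap between `min` and `max` is the quantity a single-point benchmark score hides.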
This is a landmark anchor for the vault's reasoning-fragility cluster. It converges with "Do language models fail at reasoning due to complexity or novelty?" and "Do large language models reason symbolically or semantically?", and its number-sensitivity finding echoes "Do large language models actually perform iterative optimization?".
Related concepts in this collection 3
- Do language models fail at reasoning due to complexity or novelty?
  Explores whether reasoning-model failures stem from task complexity thresholds or from encountering unfamiliar instances. Tests whether scaling chain length actually addresses the root cause of reasoning breakdown.
  both pin reasoning failure on training-distribution familiarity rather than intrinsic complexity
- Do large language models reason symbolically or semantically?
  Can LLMs follow explicit logical rules when those rules contradict their training knowledge? Testing whether reasoning operates independently of semantic associations reveals what computational mechanisms actually drive LLM multi-step inference.
  GSM-NoOp is a semantics-decoupling stress test confirming this
- Do large language models actually perform iterative optimization?
  Explores whether LLMs execute genuine numerical procedures like Newton-Raphson or instead pattern-match to memorized solution templates when solving constrained optimization problems.
  number-sensitivity is the same memorized-template fallback
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
- LLMs can implicitly learn from mistakes in-context
- An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- DecepChain: Inducing Deceptive Reasoning in Large Language Models
- A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis
- Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Original note title
LLM math reasoning is fragile pattern-matching — accuracy drops when only numbers change and irrelevant clauses derail it