An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems
In this paper, we introduce a systematic framework that goes beyond conventional methods to assess LLMs’ mathematical-reasoning robustness by stress-testing them on advanced math problems that are mathematically equivalent but vary linguistically and parametrically. These transformations allow us to measure the sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more accurate evaluation of their mathematical reasoning capabilities. Using this evaluation methodology, we created PutnamGAP, a new benchmark dataset with multiple mathematically-equivalent variations of competition-level math problems. With the new dataset, we evaluate multiple families of representative LLMs and examine their robustness. Across 18 commercial and open-source models, we observe sharp performance degradation on the variants. OpenAI’s flagship reasoning model, O3, scores 49% on the originals but drops by 4 percentage points on surface variants and by 10.5 percentage points on core-step-based variants, while smaller models fare far worse. Overall, the results show that the proposed evaluation methodology is effective for deepening our understanding of the robustness of LLMs and for generating new insights to further improve their mathematical reasoning capabilities.
Introduction. Motivation. Modern AI systems are increasingly entrusted with tasks that hinge on reasoning rather than pattern matching. Reliable progress therefore depends on precisely measuring an LLM’s reasoning capacity and its ability to generalize beyond memorized textual surface forms. Existing math-reasoning benchmarks, however, exhibit two critical weaknesses: (i) leakage-induced score inflation, since benchmark items rapidly seep into pre-training corpora, and (ii) limited robustness coverage, because today’s datasets are too small or lack controlled transformations that probe true generalization. Addressing these weaknesses is urgent if we aim to benchmark reasoning with the same rigor demanded in safety-critical domains such as healthcare or cybersecurity.
Benchmark inflation through training leakage. Recent studies show that public datasets, including GSM8K (Cobbe et al. 2021) and MATH (Hendrycks et al. 2021), have leaked into the web-scale corpora used to pre-train large language models.
Competition mathematics reveals the next robustness bottleneck.
Discussion / Conclusion. Key Findings.
Symbol-level perturbations cause substantial drops. Across the four surface variants (DL, DLC, DLM, and GS), merely renaming variables lowers accuracy by 3–5 pp on average; for example, GEMINI-2.5-PRO falls from 78.3% to 72.9% (–5.4 pp; see Table 1). This indicates that today’s SOTA models still rely on lexical “semantic anchors” rather than fully abstract proof structures.
Maintaining structure but resampling parameters is even harsher. The KERNEL VARIANT (KV) simultaneously resamples all mutable constants while preserving the original reasoning skeleton. Accuracy losses reach ≈10 pp; OPENAI O3 declines from 48.8% to 38.5% (–10.3 pp), showing that grasping a solution pattern does not automatically translate into parameter-invariant reasoning.
Conclusion & Future Work. In this paper, we introduced the Generalization–and–Perturbation (GAP) framework.
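To make the surface-variant idea and the reported percentage-point drops concrete, the following is a minimal Python sketch. It is not the pipeline used to build PutnamGAP; it only illustrates, under simplifying assumptions, how a variable-renaming variant might be produced and how a drop in accuracy is computed. The function names `rename_variables` and `drop_pp`, the sample problem, and the replacement names are hypothetical; the accuracy figures are those quoted in the key findings.

```python
import re
from typing import Mapping


def rename_variables(problem: str, mapping: Mapping[str, str]) -> str:
    """Rename whole-token variable names; the mathematics is left unchanged.

    Assumption: variables appear as standalone alphanumeric tokens. A real
    pipeline would need to parse the mathematics properly (for example, a
    variable called "a" would clash with the English article).
    """
    if not mapping:
        return problem
    # Try longer names first so that "xy" is matched before "x".
    names = sorted(mapping, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    # Lookarounds ensure we do not rewrite letters inside ordinary words.
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])(" + alternatives + r")(?![A-Za-z0-9_])"
    )
    # A single pass avoids renaming text that was itself just inserted.
    return pattern.sub(lambda m: mapping[m.group(1)], problem)


def drop_pp(original_acc: float, variant_acc: float) -> float:
    """Accuracy drop in percentage points, rounded to one decimal place."""
    return round(original_acc - variant_acc, 1)


# Hypothetical problem statement, used only for illustration.
problem = "Let x and n be positive integers with x^n + 1 divisible by n. Find max x."
variant = rename_variables(problem, {"x": "basevalue", "n": "exponent"})
print(variant)

# Figures quoted in the key findings.
print(drop_pp(78.3, 72.9))  # GEMINI-2.5-PRO, surface variant
print(drop_pp(48.8, 38.5))  # OPENAI O3, kernel variant
```

The renamed statement asks exactly the same mathematical question, so any change in a model's accuracy between `problem` and `variant` can be attributed to the surface form rather than to the mathematics.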