Can a single training example unlock mathematical reasoning?
Explores whether one example is enough to dramatically improve math problem-solving in language models, and whether learning continues after perfect memorization.
A single training example in RLVR is sufficient to produce a dramatic improvement in mathematical reasoning: MATH500 performance jumps from 36.0% to 73.6% for Qwen2.5-Math-1.5B. This matches the performance of training on the 1.2k-example DeepScaleR subset. Training on two examples slightly exceeds both the 1-shot and the 1.2k-example results (74.8%). The pattern replicates across model families (Qwen, Llama, DeepSeek), RL algorithms (GRPO, PPO), and different math examples.
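To make the setup concrete, here is a minimal Python sketch of the kind of signal a GRPO-style run could draw from one example: a binary, automatically checkable reward for each sampled answer, normalized within the sampling group. The function names, the exact-match check, the sample answers, and the use of the population standard deviation are illustrative assumptions, not the paper's implementation.

```python
from statistics import mean, pstdev
from typing import List


def verifiable_reward(predicted: str, reference: str) -> float:
    """Binary reward: 1.0 if the final answer matches the reference exactly, else 0.0."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of sampled answers for the single training problem.
reference_answer = "12"
sampled_answers = ["12", "13", "12", "7"]

rewards = [verifiable_reward(a, reference_answer) for a in sampled_answers]
mixed_advantages = group_relative_advantages(rewards)

# Once every sample is correct, the normalized advantages collapse to zero.
saturated_advantages = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Under these assumptions, a group in which every sampled answer is correct yields all-zero advantages, so the reward term alone no longer separates one sample from another.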
The most striking phenomenon is post-saturation generalization: training accuracy on the single example rapidly reaches 100%, yet test accuracy continues to improve for approximately 1,400 more training steps. The model has perfectly memorized its one example but keeps getting better at unseen problems. Even after eventual overfitting, when outputs on the training example become "incomprehensible multilingual gibberish mixed with correct solutions," performance on test problems stays strong and test outputs remain interpretable.
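One way to measure this gap from logged curves is sketched below: find the first step at which training accuracy saturates, then the later step at which test accuracy peaks. The function name and all curve values are hypothetical; the numbers are invented for illustration (chosen so the gap echoes the roughly 1,400 steps mentioned above) and are not taken from the paper.

```python
from typing import Sequence, Tuple


def post_saturation_window(
    steps: Sequence[int],
    train_acc: Sequence[float],
    test_acc: Sequence[float],
    saturation: float = 1.0,
) -> Tuple[int, int, int]:
    """Return (saturation_step, best_test_step, steps_between)."""
    if not steps or not (len(steps) == len(train_acc) == len(test_acc)):
        raise ValueError("curves must be non-empty and of equal length")

    # First logged step where training accuracy reaches the saturation level.
    sat_idx = next((i for i, a in enumerate(train_acc) if a >= saturation), None)
    if sat_idx is None:
        raise ValueError("training accuracy never reaches the saturation level")

    # Best test accuracy at or after saturation (earliest step wins on ties).
    best_idx = max(range(sat_idx, len(steps)), key=lambda i: test_acc[i])
    return steps[sat_idx], steps[best_idx], steps[best_idx] - steps[sat_idx]


# Invented illustrative curves, not measured results.
steps = [0, 100, 500, 1000, 1500, 2000]
train = [0.20, 1.00, 1.00, 1.00, 1.00, 1.00]
test = [0.30, 0.50, 0.60, 0.66, 0.70, 0.69]

sat_step, best_step, gap = post_saturation_window(steps, train, test)
```

A large `gap` with rising test accuracy is the signature described above; a gap of zero would mean test accuracy peaked as soon as the example was memorized.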
This finding is the extreme case of the question "Do base models already contain hidden reasoning ability?": one example is not teaching reasoning; it provides the minimal activation signal for the RL optimization process to reshape the sampling distribution. The entropy loss component encourages diverse output exploration, while the single training example acts as "implicit regularization," punishing explorations that fail on the learned data and thereby providing verification for exploration.
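The role of the entropy term can be illustrated with a toy objective: a policy loss minus a weighted entropy bonus, so that a more spread-out output distribution is favored when the policy loss is equal. This is a generic sketch of entropy regularization, not the paper's exact loss; the coefficient `beta`, the function names, and the two example distributions are assumptions.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def total_loss(policy_loss: float, probs: Sequence[float], beta: float = 0.01) -> float:
    """Policy loss minus an entropy bonus; beta is a hypothetical coefficient."""
    return policy_loss - beta * entropy(probs)


# Two toy next-token distributions over four candidates.
peaked = [0.97, 0.01, 0.01, 0.01]   # little exploration
spread = [0.25, 0.25, 0.25, 0.25]   # maximal exploration

# With the same policy loss, the more diverse distribution gets the lower total loss.
loss_peaked = total_loss(0.0, peaked)
loss_spread = total_loss(0.0, spread)
```

In this picture the entropy bonus keeps pushing toward diverse samples, while the reward on the one training example penalizes the samples that get it wrong.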
Cross-domain generalization also emerges: a single math example improves performance on problems from different mathematical subdomains. Self-reflection frequency increases spontaneously during training, with words like "rethink," "recheck," and "recalculate" appearing more frequently — the model develops metacognitive behaviors from a single data point.
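A simple way to track this kind of signal is to count how many sampled responses contain a reflection word. The sketch below is an assumption about how such a measurement could be done, not the paper's procedure; the function name and sample responses are hypothetical, and the word list is the three terms quoted above.

```python
import re
from typing import Iterable

REFLECTION_WORDS = ("rethink", "recheck", "recalculate")

# Matches any word that starts with one of the listed terms (e.g. "rechecking").
_PATTERN = re.compile(r"\b(?:" + "|".join(REFLECTION_WORDS) + r")", re.IGNORECASE)


def reflection_rate(responses: Iterable[str]) -> float:
    """Fraction of responses that contain at least one reflection word."""
    responses = list(responses)
    if not responses:
        return 0.0
    hits = sum(1 for text in responses if _PATTERN.search(text))
    return hits / len(responses)


# Hypothetical model outputs.
sample_responses = [
    "Let me recheck the sum before answering.",
    "The answer is 4.",
    "I should Rethink this step.",
]
rate = reflection_rate(sample_responses)
```

Computing `reflection_rate` at successive checkpoints would show whether the frequency rises during training.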
Compared with "Can models improve themselves on tasks without verifiable answers?", the 1-shot result pushes the minimum viable dataset even further: not 1,000 demonstrations, but one.
Inquiring lines that use this note as a source (37)
This note is a source for these synthesized inquiries.
- Why do single examples trigger large reasoning improvements in models?
- Can explicit numerical signals override learned linguistic defaults in fine-tuned models?
- Can prompting unlock compositional skills that pretraining already learned?
- Does selecting examples from multiple complexity levels outperform selecting only high-quality examples?
- Can models learn to select exemplars based on reasoning skills rather than complexity?
- Can diverse critiques on a single problem unlock reasoning without diverse problem sets?
- Does partial trace guidance work better than curriculum learning for hard problems?
- Can solution traces substitute for process-level reward signals in math reasoning?
- Why do medical and mathematical tasks require fundamentally different model capabilities?
- Why does curriculum learning with tight budgets beat fixed-budget approaches?
- Why does keyword priming require only three training exposures to establish?
- How do the three grokking phases connect to memorization capacity limits?
- Can data pruning strategies exploit the finite nature of memorization capacity?
- Do models with unfilled memorization capacity appear to generalize falsely?
- Does logical trace coherence guarantee valid mathematical reasoning?
- What information do numerical rewards fail to provide for reasoning tasks?
- Do current math benchmarks measure outcomes or rhetorical plausibility?
- How does a single training example trigger phase transitions in reasoning output?
- Can a single correct example seed exponential improvement in mathematical reasoning?
- How can one training example improve reasoning across thousands of unseen problems?
- Why does reasoning training improve math but hurt knowledge tasks?
- How do single training examples activate reasoning capabilities in language models?
- Can one training example activate mathematical reasoning in RL-trained models?
- How tight should a textual learning rate be before it prevents skill escape?
- Can one training example activate mathematical reasoning without reinforcement learning?
- Why do medium-difficulty problems produce more stable learning gains?
- Why does the order of training examples matter for what models learn?
- Does grokking in modular arithmetic follow the same three-phase learning trajectory?
- Can mathematical reasoning improvements transfer across problem subdomains?
- Why does reasoning transfer across different numbers but factual recall does not?
- How much of MATH-500 improvement comes from data contamination versus real reasoning gains?
- Do text-space skills transfer learning across different frontier models?
- Do few-shot examples improve in-context learning or add noise?
- What makes a good in-context learning example for a given task?
- What is the theoretical capacity limit before memorization saturates?
- How does the Learning Law explain why all examples should contribute equally?
- Can small demonstration sets unlock general reasoning without large question data?
Related concepts in this collection (4)
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  1-shot RLVR is the most extreme confirmation.
- Can models improve themselves on tasks without verifiable answers?
  Most self-improvement methods require verifiable correctness signals like math or code. Can models improve on open-ended instruction tasks where right answers aren't automatically checkable? And what minimal training is needed to unlock this?
  1-shot pushes the frontier far beyond 1000.
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  Post-saturation generalization shows the learning continues beyond the data.
- Does reflection in reasoning models actually correct errors?
  When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies.
  1-shot RLVR spontaneously increases self-reflection frequency.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Reinforcement Learning for Reasoning in Large Language Models with One Training Example
- LLMs can implicitly learn from mistakes in-context
- A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis
- rStar2-Agent: Agentic Reasoning Technical Report
- Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
- The Invisible Leash: Why RLVR May Not Escape Its Origin
- Spurious Rewards: Rethinking Training Signals in RLVR
Original note title
one training example is sufficient to activate mathematical reasoning in rlvr — post-saturation generalization continues after training accuracy reaches 100 percent