Does RLVR actually improve mathematical reasoning or just coherence?
RLVR post-training makes reasoning traces locally more consistent, but does this structural improvement translate to valid mathematical proofs? We investigate whether trace coherence is sufficient for correctness.
Reinforcement learning with verifiable rewards (RLVR) verifies only the final answer and distributes rewards uniformly across all tokens. Its impact on intermediate reasoning tokens, which are not directly incentivized, has not been formally studied. Classifying errors in intermediate steps with a First-Order Logic (FOL)-based error taxonomy shows that RLVR improves how well adjacent steps fit together without guaranteeing that the whole chain is sound.
RLVR post-training does improve trace coherence — the local consistency of reasoning steps as measured by error patterns. The improvement is strongest on problems where the base model fails but the RL-trained model succeeds. Reasoning traces become more internally consistent, with fewer identifiable logical errors between adjacent steps.
However, trace coherence is not trace validity. Coherence measures local consistency: each step follows plausibly from the previous one. Validity requires global logical soundness: the entire chain constitutes a correct mathematical proof. Coherent traces can be globally invalid, because a chain of locally plausible steps can still reach a wrong conclusion or follow a valid-seeming path that skips essential justification.
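The gap between the two properties can be made concrete with a toy model. The sketch below is illustrative only and is not the FOL-based taxonomy used in the study: the `Step` type, the two checker functions, and the example trace are all hypothetical. It treats a trace as locally coherent when each step builds on the conclusion of the step before it, and as globally valid only when every fact a step relies on is a given or was derived earlier.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One reasoning step: the facts it relies on and the fact it concludes."""
    uses: set[str]
    yields: str


def locally_coherent(trace: list[Step]) -> bool:
    """True if every step uses the conclusion of the step immediately before it."""
    return all(prev.yields in cur.uses for prev, cur in zip(trace, trace[1:]))


def globally_valid(trace: list[Step], givens: set[str], goal: str) -> bool:
    """True if each step relies only on established facts and the goal is reached."""
    known = set(givens)
    for step in trace:
        if not step.uses <= known:
            return False  # the step leans on something never established
        known.add(step.yields)
    return goal in known


# Hypothetical trace "proving" that n^2 = 4 (mod 8) for every even n.
givens = {"n is even"}
goal = "n^2 = 4 (mod 8)"
trace = [
    Step(uses={"n is even"}, yields="n = 2k"),
    Step(uses={"n = 2k"}, yields="n^2 = 4k^2"),
    # "k is odd" is never justified, so the conclusion does not follow in general.
    Step(uses={"n^2 = 4k^2", "k is odd"}, yields="n^2 = 4 (mod 8)"),
]

coherent = locally_coherent(trace)
valid = globally_valid(trace, givens, goal)
```

Every adjacent pair of steps links up, so the trace passes the local check, yet the third step depends on an unstated assumption and the global check fails. A checker that only looks at adjacent steps cannot see this kind of gap.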
This finding extends the pattern described in "What do models actually learn from chain-of-thought training?": RLVR, like long CoT training, optimizes for structural properties (local coherence) rather than semantic properties (global validity). The reward signal from final-answer verification creates pressure toward "traces that look right" rather than "traces that are right." Because the advantage is spread uniformly across tokens, the training signal cannot single out the critical reasoning junctures for improvement.
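A minimal sketch of why the signal is blind to individual steps, under stated assumptions: the exact-match verifier and the group-normalized advantage (in the style of GRPO-like methods) are illustrative choices, and the function names are hypothetical. The note itself only states that the final answer is verified and that the signal is uniform across tokens.

```python
from statistics import mean, pstdev


def outcome_reward(final_answer: str, reference: str) -> float:
    """Score a trace from its final answer only; intermediate steps are never read."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def token_advantages(
    rewards: list[float], lengths: list[int], eps: float = 1e-6
) -> list[list[float]]:
    """Normalize rewards within the group, then copy each scalar to every token."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [[(r - mu) / (sigma + eps)] * n for r, n in zip(rewards, lengths)]


# Three sampled traces for one problem whose reference answer is "42".
rewards = [outcome_reward(ans, "42") for ans in ["42", "41", " 42 "]]
lengths = [5, 3, 4]  # token count of each trace (assumed for illustration)
advantages = token_advantages(rewards, lengths)
```

Within each trace, every token receives the same advantage, so a token at a decisive inference step is pushed exactly as hard as a filler token in the same trace.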
In the terms of "Does chain-of-thought reasoning reveal genuine inference or pattern matching?", the coherence-validity gap is the RLVR-specific manifestation of the broader CoT-as-imitation pattern. The model learns the form of coherent reasoning (adjacent steps that fit together) without necessarily learning the substance (valid logical derivation).
The coherence-validity distinction maps directly onto the faithfulness framework in "Do language models actually use their reasoning steps?": RLVR's coherence improvement addresses neither of its two criteria, causal sufficiency and causal necessity. Improved local coherence means adjacent steps follow plausibly from each other (a structural property), but it does not establish that those steps are causally sufficient (removing them would degrade the answer) or causally necessary (no spurious steps are present). RLVR-improved traces may look more faithful while being no more causally grounded: the structural surface improves while the causal substance remains unverified.
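One way to probe causal grounding, as opposed to surface coherence, is a deletion test: drop one step, re-derive the answer, and see whether it changes. The sketch below is a toy illustration, not the procedure from the faithfulness work. `load_bearing_steps` is a hypothetical helper, and `toy_answer` is a stand-in for what would in practice be a model re-run on the edited trace.

```python
from typing import Callable, Sequence

Trace = Sequence[str]


def load_bearing_steps(trace: Trace, answer_fn: Callable[[Trace], str]) -> list[bool]:
    """For each step, report whether deleting it changes the final answer."""
    baseline = answer_fn(trace)
    flags = []
    for i in range(len(trace)):
        ablated = [s for j, s in enumerate(trace) if j != i]
        flags.append(answer_fn(ablated) != baseline)
    return flags


def toy_answer(trace: Trace) -> str:
    """Hypothetical stand-in for a model: sums every 'add N' step, ignores the rest."""
    total = sum(int(s.split()[1]) for s in trace if s.startswith("add "))
    return str(total)


trace = ["add 2", "note that 2 is prime", "add 3"]
flags = load_bearing_steps(trace, toy_answer)  # [True, False, True]
```

The middle step reads as a natural part of the trace but has no effect on the answer. A coherence measure that only inspects adjacent steps would not flag it, which is why improved coherence alone says nothing about causal grounding.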
Claims that RLVR "improves reasoning" should be examined carefully: what improves is trace coherence (perceived quality), not necessarily trace validity (actual mathematical correctness).
Inquiring lines that use this note as a source (32)
- Why do one-shot transparency studies miss the temporal reversal entirely?
- How much RLVR improvement comes from benchmark data memorization?
- Can clean benchmarks reveal true RLVR reasoning gains?
- Can RLVR expand a model's reasoning capabilities beyond its training ceiling?
- Why do current RLVR methods fail to expand reasoning capability beyond base model boundaries?
- How does Peircean Secondness differ from what RLHF actually provides?
- Does layer-wise prediction stabilization provide a stronger trace quality signal than confidence alone?
- Does logical trace coherence guarantee valid mathematical reasoning?
- Does reasoning trace style explain why RL post-training improves model reasoning?
- Why do corrupted traces maintain performance as well as correct traces?
- How does post-training on traces improve performance without semantic reasoning?
- Why does combining reasoning distillation with RLVR outperform either training stage alone?
- How does trace coherence differ from valid mathematical proof in practice?
- Does RLVR reward structure create pressure toward traces that look right?
- What role do high-entropy minority tokens play in RLVR?
- How does trace coherence differ from trace validity in reasoning?
- What limits RLVR effectiveness beyond mathematical and coding domains?
- Can a single correct example seed exponential improvement in mathematical reasoning?
- Does RLVR expand model capability or reorganize existing capability?
- Do high-entropy RLVR tokens correspond to MI-peak tokens during inference?
- What makes mathematically confident but incorrect answers resemble valid solution shapes?
- Can one training example activate mathematical reasoning in RL-trained models?
- How does vehicle causality differ from content causality in physical systems?
- Does trace length actually reflect problem difficulty or training proximity?
- What's the difference between RLHF, RLVR, and RLCF as training paradigms?
- Why do six different RLVR algorithms converge on similar performance levels?
- How does prolonged RL training differ from standard RLVR approaches?
- Why do certain tokens at certain difficulties drive most of RLVR's learning signal?
- How much of MATH-500 improvement comes from data contamination versus real reasoning gains?
- Does RLVR teach new reasoning or activate existing pretraining capabilities?
- Can combining SRL with RLVR outperform either method used alone?
- What types of math proofs benefit most from proof-by-contradiction framing?
Related concepts in this collection (6)
- What do models actually learn from chain-of-thought training?
  When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.
  RLVR coherence improvement is the same dynamic: structural over semantic
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  coherence-validity gap is RLVR-specific CoT-as-imitation
- Do reasoning traces actually cause correct answers?
  Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
  coherent traces invite anthropomorphic trust
- Do chain-of-thought traces actually help users understand model reasoning?
  Chain-of-thought explanations are often presented as transparency tools, but do they genuinely improve human understanding or create an illusion of interpretability? A human-subject study tests whether traces help users follow and evaluate model reasoning.
  coherence optimizes perceived quality, not actual validity
- Do language models actually use their reasoning steps?
  Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
  RLVR coherence addresses neither faithfulness criterion: local plausibility does not establish causal sufficiency or necessity of reasoning steps
- Does fine-tuning disconnect reasoning steps from final answers?
  When models are fine-tuned on specific domains, do their chain-of-thought steps become less causally connected to their outputs? Three experiments test whether reasoning chains remain functionally faithful after training.
  parallel phenomenon: SFT degrades faithfulness (reasoning steps causally disconnected from answers) while RLVR improves coherence without validity; both show that training can improve surface reasoning quality while leaving or worsening the causal grounding problem
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- The Invisible Leash: Why RLVR May Not Escape Its Origin
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- Spurious Rewards: Rethinking Training Signals in RLVR
- Reinforcement Learning for Reasoning in Large Language Models with One Training Example
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
- LLMs Can Easily Learn to Reason from Demonstrations. Structure, not content, is what matters!
- Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces
Original note title
rlvr improves trace coherence without guaranteeing trace validity — local consistency gains should not be mistaken for improved mathematical reasoning