Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training makes reasoning traces locally more consistent, but does this structural improvement translate to valid mathematical proofs? We investigate whether trace coherence is sufficient for correctness.

Synthesis note · 2026-02-22 · sourced from RLVR

RLVR verifies only the final answer and distributes rewards uniformly across all tokens. Its impact on intermediate reasoning tokens — which are not directly incentivized — has not been formally studied. Using a First-Order Logic (FOL)-based error taxonomy to classify errors in intermediate steps, the investigation reveals a nuanced picture.
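The credit-assignment setup described above can be sketched as follows. This is a minimal illustration, not the training code of any particular RLVR system: the function names `verify_final_answer` and `uniform_token_rewards`, the exact-match verifier, and the toy trace are all assumptions made for the example.

```python
from typing import List


def verify_final_answer(predicted: str, reference: str) -> float:
    """Hypothetical verifier: 1.0 if the final answer matches exactly, else 0.0."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def uniform_token_rewards(num_tokens: int, reward: float) -> List[float]:
    """Broadcast the single outcome-level reward to every token in the trace."""
    return [reward] * num_tokens


# Toy trace: the intermediate claim "x^2 = 5" is wrong for x = 2,
# but the final answer "4" is correct.
trace_tokens = ["x", "=", "2", "so", "x^2", "=", "5", "answer:", "4"]

reward = verify_final_answer(predicted="4", reference="4")
per_token = uniform_token_rewards(len(trace_tokens), reward)
```

In this toy case the erroneous intermediate tokens receive exactly the same signal as the correct final answer, which is why the effect on intermediate steps is an open question rather than something the reward guarantees.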

RLVR post-training does improve trace coherence — the local consistency of reasoning steps as measured by error patterns. The improvement is strongest on problems where the base model fails but the RL-trained model succeeds. Reasoning traces become more internally consistent, with fewer identifiable logical errors between adjacent steps.

However, trace coherence is not trace validity. Coherence measures local consistency — each step follows plausibly from the previous one. Validity implies global logical soundness — the entire chain constitutes a correct mathematical proof. Coherent traces can be globally invalid: a chain of locally plausible steps can still reach a wrong conclusion or contain a valid-seeming path that skips essential justification.
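A familiar textbook fallacy illustrates the gap. It is not taken from the source investigation, and whether a given step-level checker would flag it depends on the checker. Assume $a = b \neq 0$; each line looks like a routine algebraic move from the one before it, yet the chain as a whole is invalid because one step silently requires $a - b \neq 0$.

```latex
\begin{align*}
a &= b \\
a^2 &= ab \\
a^2 - b^2 &= ab - b^2 \\
(a + b)(a - b) &= b(a - b) \\
a + b &= b && \text{(divides by } a - b = 0 \text{, the skipped justification)} \\
2b &= b \\
2 &= 1
\end{align*}
```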

This finding extends the pattern described in "What do models actually learn from chain-of-thought training?": RLVR, like long CoT training, optimizes for structural properties (local coherence) rather than semantic properties (global validity). The reward signal from final-answer verification creates pressure toward "traces that look right" rather than "traces that are right." Because advantages are distributed uniformly across tokens, the model receives no signal that singles out the critical reasoning junctures for improvement.

Read alongside "Does chain-of-thought reasoning reveal genuine inference or pattern matching?", the coherence-validity gap is the RLVR-specific manifestation of the broader CoT-as-imitation pattern. The model learns the form of coherent reasoning (adjacent steps that fit together) without necessarily learning the substance (valid logical derivation).

The coherence-validity distinction maps directly onto the faithfulness framework in "Do language models actually use their reasoning steps?": RLVR's coherence improvement addresses neither of its criteria. Improved local coherence means adjacent steps follow plausibly from each other (a structural property), but does not establish that those steps are causally sufficient (removing them would degrade the answer) or causally necessary (no spurious steps are present). RLVR-improved traces may look more faithful while being no more causally grounded: the structural surface improves while the causal substance remains unverified.
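An intervention-style check makes the difference between looking connected and being causally connected concrete. The sketch below is an assumption-laden toy, not the procedure used in the cited note: `steps_affecting_answer` is a hypothetical helper, and `toy_answer_fn` stands in for re-querying a model with an edited trace.

```python
from typing import Callable, List


def steps_affecting_answer(
    steps: List[str],
    answer_fn: Callable[[List[str]], str],
) -> List[bool]:
    """For each step, report whether deleting it changes the final answer."""
    baseline = answer_fn(steps)
    return [
        answer_fn(steps[:i] + steps[i + 1:]) != baseline
        for i in range(len(steps))
    ]


def toy_answer_fn(steps: List[str]) -> str:
    """Toy stand-in for a model: the answer is the sum of the numeric steps."""
    total = sum(int(s) for s in steps if s.lstrip("-").isdigit())
    return str(total)


steps = ["3", "note that addition is commutative", "4"]
flags = steps_affecting_answer(steps, toy_answer_fn)
# flags == [True, False, True]: the middle step reads well but has no effect.
```

A coherence measure would have no reason to penalize the middle step, since it sits plausibly between its neighbours. Only the ablation shows that the answer does not depend on it.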

Claims that RLVR "improves reasoning" should be examined carefully: what improves is trace coherence (perceived quality), not necessarily trace validity (actual mathematical correctness).

Related concepts in this collection 6

What do models actually learn from chain-of-thought training? When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.
RLVR coherence improvement is the same dynamic: structural over semantic
Does chain-of-thought reasoning reveal genuine inference or pattern matching? Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
coherence-validity gap is RLVR-specific CoT-as-imitation
Do reasoning traces actually cause correct answers? Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
coherent traces invite anthropomorphic trust
Do chain-of-thought traces actually help users understand model reasoning? Chain-of-thought explanations are often presented as transparency tools, but do they genuinely improve human understanding or create an illusion of interpretability? A human-subject study tests whether traces help users follow and evaluate model reasoning.
coherence optimizes perceived quality, not actual validity
Do language models actually use their reasoning steps? Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
RLVR coherence addresses neither faithfulness criterion: local plausibility does not establish causal sufficiency or necessity of reasoning steps
Does fine-tuning disconnect reasoning steps from final answers? When models are fine-tuned on specific domains, do their chain-of-thought steps become less causally connected to their outputs? Three experiments test whether reasoning chains remain functionally faithful after training.
parallel phenomenon: SFT degrades faithfulness (reasoning steps causally disconnected from answers) while RLVR improves coherence without validity; both show that training can improve surface reasoning quality while leaving or worsening the causal grounding problem

Original note title

rlvr improves trace coherence without guaranteeing trace validity — local consistency gains should not be mistaken for improved mathematical reasoning
