Should reasoning benchmarks score final answers or reasoning traces?
Current reasoning benchmarks often credit plausible-looking reasoning steps even when final answers are wrong. Does measuring outcomes instead of traces reveal whether models actually solve problems, or does it miss important reasoning capability?
LR²Bench scores Exact Match on the final solution against deterministic CSP (constraint satisfaction problem) ground truth. It does not score the trace. This methodological choice is what produces the dramatic 20-23.6% result, and it is the choice most other reasoning benchmarks have been quietly avoiding. Trace-based evaluation asks whether the reasoning looks right, whether the reflective phrases are present, and whether the chain has the expected structure. It would have inflated the result by counting plausible-looking reflection as evidence of reflection. CSPs do not allow that inflation, because a constraint either holds or it doesn't.
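A minimal sketch of what answer-only Exact Match scoring could look like, assuming each model output has already been split into a trace and a final answer string. The `ModelOutput` type, the `normalize` rule, and the function names are illustrative assumptions, not the LR²Bench implementation.

```python
from dataclasses import dataclass


@dataclass
class ModelOutput:
    """Hypothetical container for one model response."""
    trace: str         # reasoning text; the scorer never reads it
    final_answer: str  # serialized final solution, e.g. a filled grid


def normalize(answer: str) -> str:
    # Assumed normalization: drop all whitespace and ignore letter case.
    return "".join(answer.split()).upper()


def exact_match(output: ModelOutput, gold: str) -> bool:
    # Only the final answer is compared against the ground truth.
    return normalize(output.final_answer) == normalize(gold)


def accuracy(outputs: list[ModelOutput], golds: list[str]) -> float:
    if len(outputs) != len(golds):
        raise ValueError("outputs and golds must have the same length")
    if not outputs:
        return 0.0
    hits = sum(exact_match(o, g) for o, g in zip(outputs, golds))
    return hits / len(outputs)
```

Because `trace` is never inspected, a long and reflective-sounding trace attached to a wrong answer earns nothing, while a correct answer with no trace at all earns full credit.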
The lesson generalizes. The note "Do reasoning traces actually cause correct answers?" argues the principle: derivational traces are stylistic mimicry of reasoning, not verified reasoning. The note "Does RLVR actually improve mathematical reasoning or just coherence?" argues the empirical version: training improves trace coherence without improving trace validity. LR²Bench operationalizes the methodological response: measure the outcome, not the trace, on tasks where the outcome is independently verifiable.
The harder corollary: many existing reasoning benchmarks are partly trace-evaluation in disguise. Examples include math benchmarks where partial-credit grading is permissive, multi-step reasoning tasks where intermediate steps can be "interpretation-credited" by graders, and dialogue tasks where helpfulness is judged on tone. All of these give credit for reflective appearance even when outcomes are wrong or absent. CSPs are valuable not because they are common in real applications but because they are epistemically clean: they isolate whether the model can do the thing, free from rhetorical credit.
For benchmark design more broadly, the LR²Bench template is: pick tasks with deterministic verifiers; measure final outcome; do not score the trace. Apply that template to a domain and the reasoning theater collapses into whatever reasoning is actually happening. Twenty percent on CSPs is the floor after the theater is removed. Benchmarks that produce higher numbers should explain how their design avoids re-introducing trace credit, and most cannot.
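The three-part template can be sketched as a small harness. The Latin-square task below is a stand-in chosen for brevity, not a task taken from LR²Bench, and `verify_latin_square` and `score` are hypothetical names. The sketch checks constraints directly instead of comparing against a stored solution; the two approaches agree whenever a puzzle has exactly one valid solution.

```python
from typing import Callable, Sequence

Grid = list[list[int]]
Clues = dict[tuple[int, int], int]  # (row, col) -> required value


def verify_latin_square(grid: Grid, clues: Clues) -> bool:
    """Deterministic verifier: every row and column must contain
    1..n exactly once, and every given clue must be respected."""
    n = len(grid)
    if n == 0 or any(len(row) != n for row in grid):
        return False
    expected = set(range(1, n + 1))
    rows_ok = all(set(row) == expected for row in grid)
    cols_ok = all({grid[r][c] for r in range(n)} == expected for c in range(n))
    clues_ok = all(grid[r][c] == v for (r, c), v in clues.items())
    return rows_ok and cols_ok and clues_ok


def score(
    solutions: Sequence[Grid],
    clue_sets: Sequence[Clues],
    verifier: Callable[[Grid, Clues], bool],
) -> float:
    """Fraction of final solutions the verifier accepts.
    No reasoning trace is passed in, so none can be credited."""
    if not solutions:
        return 0.0
    passed = sum(verifier(s, c) for s, c in zip(solutions, clue_sets))
    return passed / len(solutions)
```

Keeping the trace out of the scoring function's signature is the design choice that matters here: partial or rhetorical credit has no path into the reported number.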
Inquiring lines that use this note as a source (20)
This note is a source for these synthesized inquiries.
- Can corrupted reasoning traces be reliably distinguished from correct ones?
- Why do correct reasoning traces appear shorter than incorrect ones?
- How do surface correlations between narratives and answers mislead benchmark validity?
- Can reasoning traces serve purposes beyond producing the final answer itself?
- Why do corrupted traces maintain performance as well as correct traces?
- Which sentences in reasoning traces actually influence the final answer?
- Do current math benchmarks measure outcomes or rhetorical plausibility?
- Should benchmarks measure trace length or whether constraints were actually satisfied?
- How does trace coherence differ from trace validity in reasoning?
- Why do benchmark scores rise while reasoning quality declines?
- How does tool access change what we measure in reasoning tests?
- Why do final answers contradict what the thinking draft explicitly concluded?
- Do corrupted reasoning traces teach something different than pure success traces?
- What is the gap between benchmark performance and real workplace task completion?
- Why do reasoning traces mislead users into trusting wrong model answers?
- What makes a trajectory score interpretable across different interactive benchmarks?
- What evaluation methods actually measure reasoning versus execution capability?
- What reasoning tasks are actually checkable through process verification?
- What makes reasoning traces effective or ineffective for solving problems?
- Why do reasoning traces fail to accurately reflect model decision-making?
Related concepts in this collection (2)
- Do reasoning traces actually cause correct answers?
  Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
  principle: traces are mimicry, not verification
- Does RLVR actually improve mathematical reasoning or just coherence?
  RLVR post-training makes reasoning traces locally more consistent, but does this structural improvement translate to valid mathematical proofs? We investigate whether trace coherence is sufficient for correctness.
  empirical: training improves coherence not validity
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
- Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
- Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces
- Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
- Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Complex Logical Instruction Generation
Original note title
reflection benchmarks should be solution-verifiable not trace-verifiable — Exact Match on the answer cuts through reasoning theater