Why does outcome supervision fail for long reasoning chains?
This explores why grading only the final answer (outcome supervision) breaks down once a model's reasoning runs many steps — and what checking the steps themselves reveals instead.
The corpus points to a single underlying mismatch: in long traces, most failures live in the *process*, not the *answer*, so a signal that only sees the answer is blind to where things actually went wrong. The sharpest evidence is direct. Checking intermediate states and policy compliance during generation, rather than just the endpoint, raised task success from 32% to 87%, because the dominant failure mode turns out to be process violations, not wrong final outputs (see "Where do reasoning agents actually fail during long traces?"). Outcome supervision can't catch a mistake at step 12 of 40 if the model still stumbles into a plausible-looking answer, and it equally can't reward a sound process that happened to miss.
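The gap between the two signals can be made concrete with a toy sketch. Everything here is hypothetical and chosen for illustration: the task, the trace representation (a list of intermediate values), and the function names. It is not the verification setup used in the cited note, only a minimal instance of "check the steps, not just the endpoint."

```python
from typing import List, Optional


def outcome_reward(trace: List[int], expected_final: int) -> int:
    """Outcome supervision: one bit, computed from the last value only."""
    return 1 if trace[-1] == expected_final else 0


def first_process_violation(trace: List[int], reference: List[int]) -> Optional[int]:
    """Step-level check: index of the first intermediate value that
    disagrees with the reference, or None if every step matches."""
    for i, (got, want) in enumerate(zip(trace, reference)):
        if got != want:
            return i
    return None


# Toy task (hypothetical): compute 6 * 4, then divide by 2.
reference = [24, 12]   # correct intermediate values
lucky = [20, 12]       # wrong product, then a second slip that lands on 12

print(outcome_reward(lucky, expected_final=12))    # 1: the answer looks fine
print(first_process_violation(lucky, reference))   # 0: the first step was wrong
```

The outcome grader rewards the lucky trace in full, while the step-level check flags the very first step. The same asymmetry runs the other way for a sound trace that slips only at the end.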
The failures outcome supervision misses aren't random; they're structural. Reasoning models tend to *wander* (explore invalid paths) and *underthink* (abandon promising paths too early), and the striking part is that good solutions were often reachable but were dropped prematurely (see "Why do reasoning models abandon promising solution paths?"). A final-answer reward gives the model no gradient on "you were on the right track and quit," so the behavior that sinks long chains is exactly the behavior outcome supervision is silent about. The problem compounds as the reward signal thins out over a long trace: more steps means more places to go wrong, but still only one bit of feedback at the end.
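A back-of-the-envelope sketch shows how quickly that signal thins. It assumes every step succeeds independently with the same probability, which is a simplification introduced here rather than a claim from the notes, and the 0.98 per-step accuracy is an illustrative number only.

```python
def chain_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step is sound, ASSUMING independent steps."""
    return per_step_accuracy ** num_steps


def feedback_bits_per_step(num_steps: int) -> float:
    """Outcome supervision yields one pass/fail bit per trace,
    so the share available to each step shrinks as 1 / num_steps."""
    return 1.0 / num_steps


# Illustrative per-step accuracy of 0.98 (an assumption, not a measured value).
for n in (5, 20, 40):
    p_clean = chain_success_probability(0.98, n)
    print(f"{n:>2} steps: P(no process error) = {p_clean:.2f}, "
          f"feedback per step = {feedback_bits_per_step(n):.3f} bits")
```

Under these assumptions, a 40-step chain is error-free less than half the time, yet the grader still returns a single pass/fail bit for the whole trace.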
There's a deeper reason the outcome signal is weak, which several notes converge on: much of what looks like reasoning is pattern imitation, not inference. Chains succeed when the instance resembles training data and degrade predictably under distribution shift (see "Does chain-of-thought reasoning actually generalize beyond training data?" and "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"), and breakdowns track instance *novelty* rather than complexity or chain length (see "Do language models fail at reasoning due to complexity or novelty?"). Frontier models manage only about 20-23% on constraint-satisfaction problems that demand genuine backtracking (see "Can reasoning models actually sustain long-chain reflection?"). If the chain is scaffolding rather than load-bearing logic, and the finding that corrupted traces teach as well as correct ones suggests it often is (see "Do reasoning traces need to be semantically correct?"), then a correct final answer doesn't certify the path, and outcome supervision is rewarding the wrong thing without knowing it.
The fixes in the corpus all share a shape: get step-level signal without paying for human step annotations. Reverse-curriculum RL slides the start state backward from near-completion so that outcome feedback effectively exposes step-level failures (see "Can curriculum learning approximate expensive process supervision?"). Curriculum sequencing, with imitation first and verifiable-reward RL second, works because the imitation phase produces reasonable rollouts that *make the outcome reward informative*, which is a tacit admission that raw outcome reward on a cold model is too sparse to learn from (see "Does sequencing imitation then exploration training improve reasoning?").
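The reverse-curriculum idea can be sketched in a few lines. This is a minimal illustration of the sliding start state, not the R3 implementation: the demonstration format, the function names, and the `rollout_succeeds` callback (standing in for "run the policy from this prefix and check the final answer") are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Sequence


def reverse_curriculum_prefixes(demo: Sequence[str]) -> List[List[str]]:
    """Start states ordered from near-completion back to the empty prefix."""
    return [list(demo[:k]) for k in range(len(demo) - 1, -1, -1)]


def locate_weak_step(
    demo: Sequence[str],
    rollout_succeeds: Callable[[List[str]], bool],
) -> Optional[int]:
    """Walk the start state backward using only outcome feedback.

    Returns the index of the first step the policy could not supply
    itself, or None if it succeeds even from an empty prefix.
    """
    for prefix in reverse_curriculum_prefixes(demo):
        if not rollout_succeeds(prefix):
            # The policy handled every later step, so the newly
            # exposed step at index len(prefix) is the likely culprit.
            return len(prefix)
    return None


demo = ["parse", "set up equation", "solve", "report"]

# Hypothetical policy: succeeds only when "set up equation" is given to it.
def toy_rollout_succeeds(prefix: List[str]) -> bool:
    return len(prefix) > 1

print(reverse_curriculum_prefixes(demo)[0])          # ['parse', 'set up equation', 'solve']
print(locate_weak_step(demo, toy_rollout_succeeds))  # 1
```

Each stage still receives only a pass/fail outcome, but because consecutive stages differ by one step, the stage where success first drops points at a specific step.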
The twist worth leaving with: length itself is a red herring. Longer chains don't mean harder problems. Trace length mostly reflects how close the instance sits to training schemas, and it decouples from difficulty out of distribution (see "Does longer reasoning actually mean harder problems?"). Accuracy even follows an inverted-U in chain length, with stronger models preferring shorter chains (see "Why does chain of thought accuracy eventually decline with length?"). So outcome supervision doesn't fail *because* chains are long; it fails because length multiplies the hidden process errors it was never able to see in the first place (see "Why does chain-of-thought reasoning fail in predictable ways?").
Sources (12 notes)
Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain succeeds when the model was trained on similar instances, regardless of length.
DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.