Does training for better reasoning reduce an AI system's ability to abstain?

This explores whether optimizing a model to reason well comes with a hidden tax on its willingness to say 'I don't know' — and the corpus says yes, with a fairly clear mechanism for why.

Abstaining here means declining a question the model can't actually answer. The most direct evidence says training for reasoning erodes that ability: reasoning fine-tuning degrades abstention capacity by roughly 24 percent (see "Does reasoning fine-tuning make models worse at declining to answer?"). Models tuned for reasoning performance answer *more* questions, but with unwarranted confidence, because the training signal rewards producing a complete answer and systematically punishes 'I don't know.' Abstention isn't lost by accident; it's trained away, because the optimization target never valued it.

What makes this more than a one-paper finding is that the same reward structure shows up as a recurring failure pattern across the collection. Supervised fine-tuning, for instance, raises benchmark accuracy while cutting the measured quality of the reasoning steps (Information Gain) by nearly 39 percent: models learn to produce correct-looking final answers through post-hoc rationalization rather than genuine inference (see "Does supervised fine-tuning improve reasoning or just answers?"). The common thread: when you reward only the final answer, you get a model that always produces a final answer, confidently, whether or not it should. Abstention and reasoning honesty are casualties of the same blind spot in how we score success.

There's a cognitive-overreach version of this too. Push a model to 'think harder' and accuracy doesn't keep climbing: it peaks and then declines, because models overthink easy problems and underthink hard ones once you flood them with thinking tokens. In one result, raising thinking tokens from about 1,100 to about 16K dropped benchmark accuracy from 87.3% to 70.3% (see "Does more thinking time always improve reasoning accuracy?"). More reasoning effort doesn't translate into better calibration about *when* to stop or stay silent. Effort and good judgment about one's own limits turn out to be different things.

The interesting counterpoint is that this damage isn't inevitable: it's a property of *how* you train, not of reasoning itself. Some methods explicitly teach a model when to engage extended thinking versus answer quickly, routing between modes without collapsing into always-on reasoning (see "Can models learn when to think versus respond quickly?"). And RL training can redirect a model's extended thinking away from counterproductive self-doubt into productive gap analysis, suggesting the training signal mediates reasoning *quality*, not just quantity (see "Does extended thinking help or hurt model reasoning?"). If a reward can teach a model to second-guess itself usefully, it can in principle teach it to abstain. The 24 percent drop reflects a reward that simply never asked for that.
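The incentive argument can be made concrete with a small expected-value sketch. This is an illustration only, not the reward used in any of the cited notes: the function names, the `wrong_penalty` parameter, and the choice of zero reward for abstaining are all assumptions. It shows that when a wrong answer costs nothing, answering is never worse than abstaining at any confidence level, while a penalty for wrong answers creates a confidence threshold below which abstaining is the better choice.

```python
# Hypothetical scoring rule (an assumption for illustration):
#   correct answer -> +1, wrong answer -> -wrong_penalty, abstain -> 0

ABSTAIN_REWARD = 0.0


def expected_reward(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected reward for answering when the model is right with probability p_correct."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def best_action(p_correct: float, wrong_penalty: float = 0.0) -> str:
    """Pick the action with the higher expected reward (ties go to abstaining)."""
    if expected_reward(p_correct, wrong_penalty) > ABSTAIN_REWARD:
        return "answer"
    return "abstain"


def abstain_threshold(wrong_penalty: float) -> float:
    """Confidence at which answering and abstaining break even: k / (1 + k)."""
    return wrong_penalty / (1.0 + wrong_penalty)


if __name__ == "__main__":
    # Answer-only scoring: even a 10%-confident guess beats abstaining.
    print(best_action(0.1, wrong_penalty=0.0))
    # With a penalty equal to the reward, guessing below 50% confidence no longer pays.
    print(best_action(0.3, wrong_penalty=1.0))
    print(abstain_threshold(1.0))
```

Under these assumptions, the break-even confidence rises with the penalty, so the scoring rule itself decides how often abstention is worth it.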

The takeaway you might not have expected: abstention is a casualty of optimization targets, not of intelligence. A model that reasons more isn't a model that knows its limits better — those are separate capabilities, and current reasoning training buys the first while quietly selling off the second.

Sources (5 notes)

Does reasoning fine-tuning make models worse at declining to answer?

Models optimized for reasoning performance answer questions more often but express unwarranted confidence and fail to abstain appropriately. The training signal rewards complete answers, systematically punishing 'I don't know' responses.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
