Do base models and reasoning models fail in opposite directions on uncertainty?
This explores whether base (non-reasoning) models and reasoning models break in mirror-image ways when handling uncertainty — one collapsing toward safe defaults, the other overcommitting to elaborate but wrong inferences.
This reads the question as asking whether the two model types have *opposite* failure signatures around uncertainty, and the corpus does hint at a mirror. The cleanest evidence is a head-to-head test of inductive rule learning across four game-based tasks, where reasoning models scored below 25% on exception-based rules while plain non-reasoning models hit 55–65% (see "Why do reasoning models fail at exception-based rule inference?"). The reasoning models didn't fail by being timid; they failed by *over-reaching*: overgeneralizing, doing math where none was called for, and hallucinating constraints that weren't there. The chain-of-thought that helps elsewhere becomes a machine for manufacturing false certainty when the right move is to stay open to a counterexample.
The opposite tendency shows up in models that don't reason their way forward: faced with ambiguity, they retreat to the conservative option. Twelve of the fourteen models tested actually got *worse* when constraints were removed, by up to 38.5 percentage points, because they'd been quietly defaulting to the harder, safer-looking answer rather than evaluating the situation (see "Are models actually reasoning about constraints or just defaulting conservatively?"). A related passivity appears when models accept false premises they demonstrably know are wrong, going along with the framing instead of pushing back (see "Why do language models accept false assumptions they know are wrong?"). So one camp errs toward over-asserting structure, the other toward under-asserting it.
The most useful reframe in the collection is that 'uncertainty failure' isn't one axis but two opposite ones that can even live inside the same model. One note treats training-time *entropy collapse* (the model narrows too fast, stops exploring) and test-time *variance inflation* (the model is too scattered, too unstable) as a dual problem: the same broken exploration–exploitation balance, opposite symptoms, requiring separate fixes (see "Why do reasoning models fail differently at training versus inference?"). Reasoning models display both poles within a single solve: 'wandering' down invalid paths versus 'underthinking,' abandoning a good path too early (see "Why do reasoning models abandon promising solution paths?"). A decoding penalty on thought-switching mitigates the premature-abandonment side, improving accuracy on challenging math without retraining (see "Do reasoning models switch between ideas too frequently?").
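The thought-switching penalty described above can be pictured as a small adjustment to next-token logits at decoding time. The following is a minimal sketch, not the cited note's implementation: the function name `penalize_thought_switches`, the `penalty` and `window` parameters, and the toy token names are all assumptions made for illustration.

```python
def penalize_thought_switches(logits, switch_tokens, penalty=3.0,
                              steps_since_switch=0, window=50):
    """Return a copy of `logits` with thought-switch tokens down-weighted.

    logits: dict mapping token string -> raw logit (toy stand-in for a vocab).
    switch_tokens: set of tokens assumed to signal a change of approach.
    penalty: amount subtracted from each switch token's logit (hypothetical).
    window: number of steps after the last switch during which the penalty
            applies, so a fresh line of thought gets time to develop.
    """
    adjusted = dict(logits)  # never mutate the caller's logits
    if steps_since_switch < window:
        for token in switch_tokens:
            if token in adjusted:
                adjusted[token] -= penalty
    return adjusted


# Toy example: "Alternatively" would win greedily, but the penalty
# keeps the model on its current path early in a thought.
logits = {"Alternatively": 2.0, "so": 1.5, "the": 1.0}
early = penalize_thought_switches(logits, {"Alternatively"}, steps_since_switch=5)
late = penalize_thought_switches(logits, {"Alternatively"}, steps_since_switch=80)
print(max(early, key=early.get))  # switch token suppressed inside the window
print(max(late, key=late.get))    # switch allowed once the window has passed
```

Because the adjustment touches only the logits at inference time, it leaves the model weights alone, which is what makes this kind of fix possible without fine-tuning.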
What makes the 'opposite directions' picture more than a tidy story is calibration. Confidence turns out to be a load-bearing signal: highly confident models resist prompt rephrasing while low-confidence ones swing wildly (see "Does model confidence predict robustness to prompt changes?"). Standard RLHF post-training also actively *degrades* calibration, which is part of why reasoning-trained models can sound certain while being wrong. Using a model's own answer-span confidence as a reward signal reverses that degradation and strengthens step-by-step reasoning at the same time (see "Can model confidence work as a reward signal for reasoning?"). And rather than forcing a single deterministic answer, some designs let the model *hold* uncertainty explicitly, using stochastic latent transitions that represent a distribution over solutions instead of betting everything on one (see "Can stochastic latent reasoning help models explore multiple solutions?").
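The idea of using answer-span confidence to rank reasoning traces into synthetic preferences can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the RLSF method itself: confidence is taken here to be the geometric-mean probability of the answer tokens, and the names `answer_span_confidence`, `build_preference_pairs`, and `min_gap` are hypothetical.

```python
import math
from itertools import combinations


def answer_span_confidence(answer_logprobs):
    """Geometric-mean probability over the tokens of the final answer span.

    This particular definition is an assumption made for the sketch.
    """
    if not answer_logprobs:
        raise ValueError("answer span has no tokens")
    return math.exp(sum(answer_logprobs) / len(answer_logprobs))


def build_preference_pairs(traces, min_gap=0.1):
    """Turn sampled traces into (chosen, rejected) pairs ranked by confidence.

    traces: list of dicts with keys "text" and "answer_logprobs".
    min_gap: pairs whose confidences differ by less than this are skipped,
             since a near-tie carries little preference signal (hypothetical).
    """
    scored = [(answer_span_confidence(t["answer_logprobs"]), t["text"])
              for t in traces]
    pairs = []
    for (conf_a, text_a), (conf_b, text_b) in combinations(scored, 2):
        if abs(conf_a - conf_b) < min_gap:
            continue
        chosen, rejected = (text_a, text_b) if conf_a > conf_b else (text_b, text_a)
        pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs


# Toy traces for one prompt; log-probabilities are made up for illustration.
traces = [
    {"text": "trace A", "answer_logprobs": [math.log(0.9), math.log(0.9)]},
    {"text": "trace B", "answer_logprobs": [math.log(0.5)]},
    {"text": "trace C", "answer_logprobs": [math.log(0.85)]},
]
for pair in build_preference_pairs(traces):
    print(pair)
```

The resulting pairs have the shape a preference-optimization step would consume, and no human labels or external verifier are involved in producing them.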
The quiet payoff: the gap between base and reasoning models may be less about raw capability than about which way the post-training process pushed an already-present skill. Base models are shown to contain latent reasoning that minimal training merely *selects* rather than creates (see "Do base models already contain hidden reasoning ability?"). If that's right, 'failing in opposite directions on uncertainty' isn't two different machines. It's one machine tuned to over- or under-commit, which is a more fixable problem than it first appears.
Sources (10 notes)
Across four game-based tasks, reasoning models scored below 25% on exception rules versus 55–65% for non-reasoning models. Chain-of-thought introduces math overuse, overgeneralization, and hallucinated constraints that amplify errors in negative evidence recognition.
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.
Both failures stem from failed exploration-exploitation balance but occur at different timescales requiring structurally distinct interventions. Training-time fixes (entropy bonuses, critique diversity) cannot prevent inference-time variance inflation, and vice versa; both loops must be managed independently.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.