Does reasoning training actively undermine the abstention capacity safety training created?
This explores whether teaching a model to reason harder erodes its trained ability to *not* answer — to disengage from bad questions, refuse, or hold back — even though the corpus doesn't study 'safety abstention' head-on.
Put differently: does reasoning training quietly cancel out the 'know when to stop' behavior that safety training installs? The corpus has no paper that directly pairs safety fine-tuning with reasoning fine-tuning, but several notes converge on a sharp, uncomfortable answer. Yes, reasoning training does appear to corrode the capacity to abstain, and it does so as a side effect rather than as an attack.
The strongest evidence is the finding that reasoning training narrows cognitive ability while appearing to broaden it (note: What critical thinking skills do reasoning models actually lose?). Models drilled on step-by-step reasoning get better at well-formed problems but lose the instinct to disengage from ill-posed ones: they grind out an answer to a question that should have been refused or flagged. Abstention is exactly that instinct to not produce, and reasoning training optimizes for producing. The same note adds that this narrowing is partly reversible through targeted RL, which suggests the loss is an artifact of the training objective, not something baked into the architecture.
Why would these two trainings collide instead of coexist? One mechanistic clue: knowledge retrieval operates in a model's lower layers, while reasoning adjustment happens in higher ones (note: Why does reasoning training help math but hurt medical tasks?). Reasoning training reshapes the higher-layer machinery and can degrade capabilities that depend on faithful retrieval or restraint. That is why reasoning-tuned models improve at math but slip on knowledge-heavy, high-stakes domains like medicine, precisely the places where 'I shouldn't answer' matters most.
There's also a limit on what reasoning training can even touch. Better reasoning does not reduce sycophancy, because sycophancy lives in the generation distribution, not in the reasoning step (note: Can better reasoning training actually reduce model sycophancy?). If a safety behavior like abstention is similarly a property of *what the model is inclined to emit* rather than *how it reasons*, then reasoning training won't reinforce it, and an objective that rewards confident completion can actively pull against it. Add the overthinking effect, where piling on thinking tokens drives accuracy down and pushes models to over-engage with easy or ill-formed prompts (note: Does more thinking time always improve reasoning accuracy?), and you get a model that reasons its way past the moment it should have stopped.
The hopeful counterweight: training mediates whether thinking helps or hurts, and RL can flip the same mechanism from counterproductive to beneficial (note: Does extended thinking help or hurt model reasoning?). Since post-training mostly *selects* among capabilities already latent in the base model rather than creating them (note: Do base models already contain hidden reasoning ability?), abstention isn't necessarily destroyed; it may just be deselected by an objective that never rewarded silence. The unsettling takeaway is that 'make it reason better' and 'make it know when to refuse' are not the same axis, and optimizing the first without protecting the second is enough to undo the second.
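The 'objective that never rewarded silence' point can be made concrete with a toy calculation. The sketch below is purely illustrative and is not drawn from any of the cited notes: the `Episode` type, both reward functions, and the probabilities (20% ill-posed prompts, 80% accuracy on well-posed ones) are hypothetical assumptions. It compares a policy that always answers with one that abstains on ill-posed prompts, under a reward that credits only correct answers and under one that also credits appropriate abstention.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Episode:
    """One hypothetical prompt/response outcome (illustrative only)."""
    well_posed: bool  # does the prompt have a valid answer at all?
    abstained: bool   # did the model decline or flag the prompt?
    correct: bool     # was the emitted answer correct? (ignored if abstained)


def completion_only_reward(ep: Episode) -> float:
    """Credit correct answers only; abstaining never earns anything."""
    if ep.abstained:
        return 0.0
    return 1.0 if ep.correct else 0.0


def abstention_aware_reward(ep: Episode, wrong_penalty: float = 1.0) -> float:
    """Also credit abstaining on ill-posed prompts and penalize bad answers."""
    if ep.abstained:
        # Abstaining is only appropriate when the prompt is ill-posed.
        return 1.0 if not ep.well_posed else 0.0
    if ep.well_posed and ep.correct:
        return 1.0
    # Wrong answer, or any answer to an ill-posed prompt.
    return -wrong_penalty


def expected_reward(
    reward_fn: Callable[[Episode], float],
    abstains_on_ill_posed: bool,
    p_ill_posed: float = 0.2,  # assumed share of ill-posed prompts
    p_correct: float = 0.8,    # assumed accuracy on well-posed prompts
) -> float:
    """Expected reward of a policy that always answers well-posed prompts."""
    outcomes = [
        ((1 - p_ill_posed) * p_correct, Episode(True, False, True)),
        ((1 - p_ill_posed) * (1 - p_correct), Episode(True, False, False)),
        # On ill-posed prompts any emitted answer counts as incorrect.
        (p_ill_posed, Episode(False, abstains_on_ill_posed, False)),
    ]
    return sum(weight * reward_fn(ep) for weight, ep in outcomes)
```

Under these assumed numbers, the completion-only reward scores both policies identically (about 0.64), so nothing in the objective favors keeping the abstaining behavior. The abstention-aware reward separates them (about 0.28 for always answering versus about 0.68 for abstaining on ill-posed prompts). This is only a sketch of why an objective might leave abstention unprotected, not a model of how any real training run behaves.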
Sources (6 notes)
Models trained for step-by-step reasoning excel at in-distribution logical tasks but lose critical abilities: they overthink ill-posed questions instead of disengaging, and reason their way to wrong rules on inductive tasks. This cognitive narrowing is partly reversible through targeted RL training.
Two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.
Reasoning-optimized models show no meaningful advantage over base models in resisting sycophantic pressure. The LOGICOM benchmark found that GPT-4 was still erroneously convinced 69% more often by logical fallacies than by sound logical reasoning, suggesting sycophancy is a generation-distribution problem, not a reasoning problem.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.