Do models trained for safety over-refuse compared to models trained for reasoning?
This explores whether safety-tuned models refuse too much while reasoning-tuned models do the opposite. The corpus doesn't have a head-to-head bake-off, but it does have something more interesting: the two failure modes are mirror images of the same broken skill, knowing when to say no.
This reads the question as a contest between safety models that clam up and reasoning models that plow ahead. The most useful thing the corpus offers is that both are versions of one problem: calibrating *when to disengage*. No note here runs a direct refusal-rate benchmark of a safety model against a reasoning model, so if that exact comparison is what you need, the collection is thin. But the pieces around it are sharp.
On the safety side, refusal turns out to be less principled than 'over' or 'under' would suggest: it's situational and biased. In the one note that measures this, GPT-3.5's guardrails refuse the same request at different rates depending on who appears to be asking, declining more for younger, female, and Asian-American personas, and sycophantically declining to engage with political positions the user would likely disagree with (Do AI guardrails refuse differently based on who is asking?). Non-political signals such as sports fandom shift refusal sensitivity too. So safety-trained refusal isn't a clean dial you can call over-tuned; it's a sensitivity that bends to identity signals. 'Over-refusal' assumes the model refuses consistently, and it doesn't.
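To make the point about inconsistent refusal concrete, here is a minimal sketch of how a single pooled refusal rate can hide a per-persona gap. The persona labels and the log entries are made-up toy values for illustration only, not data from any of the notes, and `refusal_rates` and `rate_spread` are hypothetical helper names.

```python
from collections import defaultdict


def refusal_rates(records):
    """Map each persona to its refusal rate, given (persona, refused) pairs."""
    counts = defaultdict(lambda: [0, 0])  # persona -> [refusals, total requests]
    for persona, refused in records:
        counts[persona][0] += int(refused)
        counts[persona][1] += 1
    return {persona: r / n for persona, (r, n) in counts.items()}


def rate_spread(rates):
    """Gap between the most-refused and least-refused persona."""
    return max(rates.values()) - min(rates.values())


# Toy log (assumption): the same request sent under two invented personas.
toy_log = [
    ("persona_a", True), ("persona_a", False),
    ("persona_a", False), ("persona_a", False),
    ("persona_b", True), ("persona_b", True),
    ("persona_b", True), ("persona_b", False),
]

rates = refusal_rates(toy_log)
pooled = sum(refused for _, refused in toy_log) / len(toy_log)
```

In this toy log the pooled rate sits halfway between the two personas, so a label like 'over-refusing' would describe neither of them. The spread is the quantity that captures the identity-dependence.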
Reasoning models have the opposite pathology: they *under*-refuse. Handed an ill-posed question with a missing premise, reasoning models churn out long, redundant chains trying to answer the unanswerable, while plain non-reasoning models correctly flag it as broken and stop (Why do reasoning models overthink ill-posed questions?). Training rewarded producing reasoning steps but never taught the model when to disengage. You see the same overcommitment in how reasoning models wander down invalid paths and abandon good ones (Why do reasoning models abandon promising solution paths?), and in how chain-of-thought actively hurts on exception-based rules by manufacturing constraints and overgeneralizing (Why do reasoning models fail at exception-based rule inference?). The reasoning reflex doesn't know when to quit.
Put those together and the answer flips the premise: safety training produces refusal that's inconsistent and identity-dependent, while reasoning training produces models that struggle to refuse at all, that is, to say 'this question is broken, I'm not engaging.' They fail in opposite directions on the same axis. The thing neither gets natively is judgment about when to stop.
The forward-looking material suggests this is fixable as a routing problem rather than a fixed trait of the model. Models can be trained to choose between extended thinking and a quick response on their own, without difficulty labels, by decoupling the *decide-to-engage* step from the *answer* step (Can models learn when to think versus respond quickly?). And the broader case is that reasoning post-training mostly teaches a model *when* to deploy a capability it already latently has, not *how* (Does RL post-training create reasoning or just deploy it?). If 'when to reason' is learnable, 'when to refuse' and 'when to disengage' look like the same kind of learnable gate, which is the part both safety and reasoning training currently skip.
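The decide-then-answer split can be sketched as a small router. This is an illustrative toy under stated assumptions, not the Thinkless or DeGRPO method: `toy_decide` is a hand-written stand-in for a learned gate, the `[missing premise]` marker and the 12-word threshold are invented for the example, and the `DISENGAGE` mode is the extension this page argues for rather than something the cited notes implement.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Mode(Enum):
    DISENGAGE = "disengage"  # question is ill-posed: say so and stop
    QUICK = "quick"          # answer directly
    THINK = "think"          # spend an extended reasoning budget


@dataclass
class Routed:
    mode: Mode
    response: str


def route(question: str,
          decide: Callable[[str], Mode],
          answerers: dict[Mode, Callable[[str], str]]) -> Routed:
    """Choose a mode first, then hand the question to that mode's answerer."""
    mode = decide(question)
    return Routed(mode, answerers[mode](question))


def toy_decide(question: str) -> Mode:
    """Hypothetical gate; a trained model would learn this decision."""
    if "[missing premise]" in question:  # invented marker for ill-posed input
        return Mode.DISENGAGE
    # Invented heuristic: treat longer questions as needing extended thinking.
    return Mode.THINK if len(question.split()) > 12 else Mode.QUICK


TOY_ANSWERERS = {
    Mode.DISENGAGE: lambda q: "This question cannot be answered as posed.",
    Mode.QUICK: lambda q: f"quick answer to: {q}",
    Mode.THINK: lambda q: f"extended reasoning, then answer to: {q}",
}

short = route("What is 2 + 2?", toy_decide, TOY_ANSWERERS)
broken = route("[missing premise] How old is she?", toy_decide, TOY_ANSWERERS)
```

Keeping `decide` separate from the answerers is the design point: the gate can be evaluated, or trained, on whether it picked the right mode, independently of how good the eventual answer is.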
Sources (6 notes)
GPT-3.5 refuses requests at different rates for younger, female, and Asian-American personas, and sycophantically declines to engage with political positions users would disagree with. Sports fandom and other non-political signals also shift refusal sensitivity.
Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
Across four game-based tasks, reasoning models scored below 25% on exception rules versus 55–65% for non-reasoning models. Chain-of-thought introduces math overuse, overgeneralization, and hallucinated constraints that amplify errors in negative evidence recognition.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.