Do models trained for safety over-refuse compared to models trained for reasoning?

This explores whether safety-tuned models refuse too much while reasoning-tuned models do the opposite. The corpus has no head-to-head bake-off, but it does have something more interesting: the two failure modes are mirror images of the same broken skill, knowing when to say no.

Read as a contest, the question pits safety models that clam up against reasoning models that plow ahead. The most useful thing the corpus offers is that both are versions of one problem: calibrating *when to disengage*. No note here runs a direct refusal-rate benchmark of a safety model against a reasoning model, so if that exact comparison is what you need, the collection is thin. But the pieces around it are sharp.

On the safety side, refusal turns out to be less principled than 'over' or 'under' would suggest: it's situational and biased. In the one note that measures this, GPT-3.5's guardrails refuse the same request at different rates depending on who appears to be asking, declining more for younger, female, and Asian-American personas, and sycophantically declining to engage with political positions the user would likely disagree with (Do AI guardrails refuse differently based on who is asking?). So safety-trained refusal isn't a clean dial you can call over-tuned; it's a sensitivity that bends to identity signals. 'Over-refusal' assumes the model refuses consistently, and it doesn't.
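One way to make "refuses at different rates depending on who is asking" concrete is to log the same request under different persona prefixes and compare refusal fractions per persona. The sketch below is only an illustration of that bookkeeping, not the method of the cited note: the function names `refusal_rates` and `max_gap`, the persona labels, and the toy log are all assumptions made up for this example, and the numbers are not real measurements.

```python
from collections import defaultdict

def refusal_rates(records):
    """Return {persona: fraction of requests refused} from (persona, refused) pairs."""
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for persona, refused in records:
        totals[persona] += 1
        refusals[persona] += int(refused)
    return {p: refusals[p] / totals[p] for p in totals}

def max_gap(rates):
    """Largest difference in refusal rate between any two personas."""
    values = list(rates.values())
    return max(values) - min(values)

# Made-up toy log (hypothetical, not real data): the same request sent
# under two persona prefixes, with whether the model refused each time.
toy_log = [
    ("persona_a", True), ("persona_a", False),
    ("persona_a", False), ("persona_a", False),
    ("persona_b", True), ("persona_b", True),
    ("persona_b", False), ("persona_b", False),
]

rates = refusal_rates(toy_log)
print(rates)
print(max_gap(rates))
```

A gap near zero would be what a consistent refusal policy looks like under this measure; a large gap on identical requests is the identity-dependent behavior the paragraph describes. A real study would also need enough samples per persona to separate a true gap from noise.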

Reasoning models have the opposite pathology: they *under*-refuse. Handed an ill-posed question with a missing premise, reasoning models churn out long, redundant chains trying to answer the unanswerable, while plain non-reasoning models correctly flag it as broken and stop (Why do reasoning models overthink ill-posed questions?). Training rewarded producing reasoning steps but never taught the model when to disengage. You see the same overcommitment in how reasoning models wander down invalid paths and abandon good ones (Why do reasoning models abandon promising solution paths?), and in how chain-of-thought actively hurts on exception-based rules by manufacturing constraints and overgeneralizing (Why do reasoning models fail at exception-based rule inference?). The reasoning reflex doesn't know when to quit.

Put those together and the answer flips the premise: safety training produces refusal that's inconsistent and identity-dependent, while reasoning training, at least on ill-posed questions, produces models that struggle to refuse at all, to say 'this question is broken, I'm not engaging.' They fail in opposite directions on the same axis. The thing neither gets natively is judgment about when to stop.

The forward-looking material suggests this is fixable as a routing problem rather than a fixed personality trait. Models can be trained to choose between extended thinking and a quick response on their own, without difficulty labels, by decoupling the *decide-to-engage* step from the *answer* step (Can models learn when to think versus respond quickly?). And the broader case is that reasoning post-training mostly teaches a model *when* to deploy a capability it already latently has, not *how* (Does RL post-training create reasoning or just deploy it?). If 'when to reason' is learnable, 'when to refuse' and 'when to disengage' look like the same kind of learnable gate, which is the part both safety and reasoning training currently skip.
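The structural idea in that paragraph, a decide-to-engage step kept separate from the answer step, can be sketched as a small router. This is a toy under stated assumptions, not the Thinkless method or DeGRPO: the names `Routed`, `route`, and `toy_gate`, the `[missing premise]` marker, and the word-count threshold are all hypothetical, and the hand-written gate only marks the place where a learned policy would sit. The "decline" branch is this sketch's reading of the suggestion that disengaging could be one more output of the same gate.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Routed:
    mode: str    # "decline", "quick", or "think"
    answer: str

def route(question: str,
          gate: Callable[[str], str],
          quick: Callable[[str], str],
          think: Callable[[str], str]) -> Routed:
    """Pick a mode first, then produce the answer with the matching path."""
    mode = gate(question)  # the decision is made before any answering
    if mode == "decline":
        return Routed(mode, "This question cannot be answered as posed.")
    if mode == "quick":
        return Routed(mode, quick(question))
    if mode == "think":
        return Routed(mode, think(question))
    raise ValueError(f"unknown mode: {mode!r}")

def toy_gate(question: str) -> str:
    """Hypothetical placeholder for a learned gate; the rules are arbitrary."""
    if "[missing premise]" in question:
        return "decline"
    return "think" if len(question.split()) > 12 else "quick"

result = route(
    "What is 2 + 2?",
    toy_gate,
    quick=lambda q: "short answer",
    think=lambda q: "long answer",
)
print(result.mode)
```

Because `route` calls the gate before either answer path, a "decline" decision never spends any reasoning effort, which is the behavior the notes say reasoning models currently lack on ill-posed questions.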

Sources 6 notes

Do AI guardrails refuse differently based on who is asking?

GPT-3.5 refuses requests at different rates for younger, female, and Asian-American personas, and sycophantically declines to engage with political positions users would disagree with. Sports fandom and other non-political signals also shift refusal sensitivity.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Why do reasoning models fail at exception-based rule inference?

Across four game-based tasks, reasoning models scored below 25% on exception rules versus 55–65% for non-reasoning models. Chain-of-thought introduces math overuse, overgeneralization, and hallucinated constraints that amplify errors in negative evidence recognition.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking post-training alignment in LLMs. The question: do safety-trained models genuinely over-refuse, or do reasoning-trained models under-refuse, or is 'over/under' the wrong frame entirely?

What a curated library found — and when (dated claims, not current truth): Findings span July 2024 to December 2025.
• Safety guardrails refuse inconsistently, varying by perceived user demographics (younger, female, Asian-American personas trigger higher refusal rates) and political alignment signals, not by a stable principle (2024-07).
• Reasoning models chronically under-refuse: when given ill-posed questions with missing premises, they generate long chains justifying answers to unanswerable queries instead of disengaging (2025-04, 2025-05).
• Extended thinking models wander solution spaces without principled termination, exploring invalid paths and abandoning valid ones, and chain-of-thought actively degrades performance on inductive rule inference (2025-05, 2025-06).
• Post-training via RL teaches models *when* to deploy reasoning, not *how*; decoupled routing (choose reasoning vs. quick response without difficulty labels) separates the decide-to-engage gate from the answer (2025-05, 2025-10).

Anchor papers (verify; mind their dates):
• arXiv:2407.06866 (2024-07) — guardrail demographic sensitivity
• arXiv:2505.20296 (2025-05) — reasoning model solution wandering
• arXiv:2510.07364 (2025-10) — when vs. how in thinking model training
• arXiv:2601.00830 (2025-12) — chain-of-thought explanations underreport

Your task:
(1) RE-TEST THE FRAME. Evaluate whether newer models (post-o3, Gemini 3+, Claude 4.5) still show the identity-dependent guardrail drift and reasoning-model under-refusal documented here. Check if recent decoupled-routing methods (RL gatekeeping, mixture-of-experts scheduling) have narrowed the gap. Crucially: separate the durable question (can models learn stable *when-to-engage* heuristics?) from perishable constraints (current post-training regimes don't teach this).
(2) Surface the strongest DISAGREEMENT or SUPERSEDING work from the last 6 months. Look for papers arguing refusal inconsistency is a measurement artifact, or that reasoning *over*-refusal occurs in certain domains, or that unified post-training dissolves the safety/reasoning split entirely.
(3) Propose 2 research questions that assume the regime may have shifted: (a) Can a single shared gating mechanism (trained via preference data) calibrate both refusal and reasoning depth without retuning? (b) Do reasoning models trained explicitly on *dismissal-of-invalid-premises* recover the ability to disengage safely?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
