Why do better reasoning models ignore instructions?
As models develop stronger reasoning abilities through training, they appear to become worse at following specified constraints. Is this an unavoidable trade-off, and what causes it?
Models trained for stronger reasoning tend to get worse at following the constraints you give them. This does not look like a tuning oversight; it appears to be a structural consequence of how reasoning capability is developed.
The MathIF benchmark evaluates instruction-following in mathematical reasoning tasks — not just whether the answer is correct, but whether the model complied with specified constraints while solving. The findings:
- Most models fail to reliably follow instructions; even the best model achieves only 50.71% on strict instruction-following
- Performance degrades with task difficulty and constraint complexity
- Common reasoning-oriented training strategies (SFT and RL) enhance reasoning ability but degrade instruction adherence
- This degradation becomes more pronounced as chain-of-thought (CoT) length increases
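To make "strict instruction-following" concrete, here is a minimal sketch of how rule-verifiable constraints could be scored. It is an illustration under stated assumptions, not MathIF's actual scorer: the constraint helpers (`max_words`, `all_lowercase`, `ends_with`), the `strict`/`loose` metric names, and the two sample responses are all hypothetical, and the benchmark's own metric definitions may differ.

```python
from typing import Callable, Dict, List

# A constraint is a rule-verifiable check over the response text.
Constraint = Callable[[str], bool]


def max_words(limit: int) -> Constraint:
    """Response must contain at most `limit` whitespace-separated words."""
    return lambda text: len(text.split()) <= limit


def all_lowercase() -> Constraint:
    """Response must contain no uppercase letters."""
    return lambda text: text == text.lower()


def ends_with(phrase: str) -> Constraint:
    """Response must end with `phrase`, ignoring trailing whitespace."""
    return lambda text: text.rstrip().endswith(phrase)


def score_response(text: str, constraints: List[Constraint]) -> Dict[str, float]:
    """Score one response: strict = all constraints met, loose = fraction met."""
    if not constraints:
        raise ValueError("at least one constraint is required")
    results = [check(text) for check in constraints]
    return {
        "strict": 1.0 if all(results) else 0.0,
        "loose": sum(results) / len(results),
    }


def benchmark_accuracy(per_response: List[Dict[str, float]], key: str) -> float:
    """Average a per-response score over the whole evaluation set."""
    return sum(s[key] for s in per_response) / len(per_response)


# Hypothetical responses to the same three-constraint instruction.
constraints = [max_words(10), all_lowercase(), ends_with("done")]
scores = [
    score_response("the answer is 42. done", constraints),  # meets all three
    score_response("The answer is 42. done", constraints),  # violates lowercase
]
strict_acc = benchmark_accuracy(scores, "strict")
loose_acc = benchmark_accuracy(scores, "loose")
```

The strict score is all-or-nothing per response, so a single missed constraint zeroes it out even when the math is right. That is why a strict number can sit far below what a per-constraint average would suggest.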
The proposed mechanism: longer reasoning chains create a contextual gap between the original instruction and the final answer. The instruction is specified at the beginning of the context, and the answer emerges at the end of a long chain. As the chain grows, the model's attention to the original instruction diminishes, so the reasoning process itself dilutes the directive.
The trade-off is explicit: enforcing brevity by limiting CoT length recovers instruction-following performance, but at the cost of reasoning depth and accuracy.
This creates an alignment problem distinct from those studied in the sycophancy and values-misalignment literature. The model is not disagreeing with instructions; it is losing track of them while reasoning. The longer it thinks, the more thoroughly it has moved on.
The practical implication: for instruction-critical applications (task management, decision support, customer-facing agents), high-capability reasoning models may perform worse on the compliance dimension that matters most. The "upgrade to a stronger model" instinct may backfire.
Constraint attention as a measurable mechanism (When Thinking Fails): A 15-model evaluation across IFEval (simple rule-verifiable constraints) and ComplexBench (compositional constraints) confirms the degradation across model families and sizes. The proposed mechanism is "constraint attention": the attention scores directed toward constraint tokens in the instruction. When CoT prompting is applied, constraint attention measurably decreases, as the reasoning process diverts the model's attention from instruction-relevant tokens toward content planning.
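A rough sketch of how a constraint-attention score could be computed from attention weights follows. The paper's exact definition (which layers and heads are averaged, and how) is not given in this note, so the function `constraint_attention`, its inputs, and the toy weights are assumptions made for illustration only.

```python
from typing import Sequence


def constraint_attention(
    attn_rows: Sequence[Sequence[float]],
    constraint_positions: Sequence[int],
) -> float:
    """Mean share of attention mass that generated tokens place on constraint tokens.

    attn_rows[t][i] is the attention weight from generated token t to context
    position i (assumed already averaged over layers and heads).
    """
    if not attn_rows:
        raise ValueError("need at least one generated token")
    positions = set(constraint_positions)
    shares = []
    for row in attn_rows:
        total = sum(row)
        on_constraint = sum(w for i, w in enumerate(row) if i in positions)
        shares.append(on_constraint / total if total > 0 else 0.0)
    return sum(shares) / len(shares)


# Toy weights invented for illustration; positions 0 and 1 hold the constraint tokens.
direct_rows = [[0.5, 0.5, 0.0, 0.0], [0.25, 0.25, 0.5, 0.0]]
cot_rows = [[0.25, 0.25, 0.5, 0.0], [0.0, 0.25, 0.25, 0.5]]

direct_score = constraint_attention(direct_rows, [0, 1])
cot_score = constraint_attention(cot_rows, [0, 1])
```

In this toy setup the CoT rows spread more of their weight onto later, non-constraint positions, so `cot_score` comes out lower than `direct_score`. With a real model, the rows would come from the attention tensors the model exposes during generation.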
CoT helps in two cases: (1) satisfying formatting/structural requirements, and (2) enforcing lexical constraints that override default tendencies. It hurts in two cases: (1) over-focusing on high-level content while neglecting simple constraints (word counts, case requirements), and (2) introducing well-intentioned content that unintentionally violates constraints.
Four mitigation strategies were evaluated: in-context learning (corrective examples), self-reflection (model evaluates its own response), self-selective reasoning (model decides when to reason), and classifier-selective reasoning (trained classifier decides). Classifier-selective reasoning consistently delivered the best performance across both benchmarks. This suggests the model needs an external signal for when to reason, because its own judgment about when reasoning helps is unreliable.
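The routing idea behind classifier-selective reasoning can be sketched as below. This is a minimal sketch, not the paper's implementation: `make_router`, `toy_classifier`, and the two stub generators are hypothetical names, and the keyword rule merely stands in for a trained classifier so the example runs without a model.

```python
from typing import Callable

Generator = Callable[[str], str]


def make_router(
    should_reason: Callable[[str], bool],
    answer_direct: Generator,
    answer_with_cot: Generator,
) -> Generator:
    """Return a function that sends each instruction down one of two paths."""

    def route(instruction: str) -> str:
        # The decision is made outside the generating model.
        if should_reason(instruction):
            return answer_with_cot(instruction)
        return answer_direct(instruction)

    return route


def toy_classifier(instruction: str) -> bool:
    # Placeholder rule, NOT a trained classifier: reason only for structural requests.
    return any(word in instruction.lower() for word in ("json", "table", "format"))


def direct_stub(instruction: str) -> str:
    # Stand-in for a model call without chain-of-thought.
    return "direct:" + instruction


def cot_stub(instruction: str) -> str:
    # Stand-in for a model call with chain-of-thought.
    return "cot:" + instruction


route = make_router(toy_classifier, direct_stub, cot_stub)
formatted = route("Return the result as JSON")
short = route("Answer in under 20 words")
```

Keeping the decision in a separate `should_reason` callable means the classifier can be retrained or swapped without touching either generation path.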
LogicIFEval (2025) extends the deficit to logic-rich instructions specifically. When instructions contain rich logic structures (conditionals, nesting, recursion, function calls), most popular LLMs correctly follow fewer than 60% of them. Open-source models lag significantly behind frontier models, and accuracy drops further as logical complexity increases. An important asymmetry: explicit thinking (extended reasoning) before responding improves instruction-following for large LLMs but not for smaller ones, suggesting the thinking-instruction interaction depends on model scale. This adds a second dimension to the instruction-following deficit: not only does reasoning training degrade compliance, but logic-rich instructions are inherently harder to follow even without reasoning training. Source: arXiv/Evaluations.
Inquiring lines that use this note as a source 21
This note is a source for these synthesized inquiries.
- How can a model explain something correctly yet fail to apply it?
- What training signals would teach models when not to reason?
- Do models trained for reasoning lose their ability to decline questions?
- Why does instruction tuning hurt knowledge-intensive tasks more than reasoning tasks?
- Does scaling reasoning capability create tradeoffs with instruction following?
- How does scaling reasoning capability actually reduce instruction-following ability?
- Which constraint types do reasoning models handle best?
- How do reasoning improvements suppress a model's ability to abstain?
- Why does latent reasoning override no-think instructions in models?
- Why do instruction following and reasoning capability trade off in training?
- Why do some reasoning steps receive negligible attention from later steps?
- Can reasoning fine-tuning improve both capability and instruction compliance together?
- Why does reasoning fine-tuning reduce a model's ability to abstain?
- Do negative constraints require fundamentally different training signals than positive instructions?
- Why do reasoning models fail to improve constrained optimization performance?
- Why do expert reasoners skip steps that novices must state explicitly?
- What is the distinction between teaching reasoning how versus when to activate?
- Why do smaller models lose reasoning faithfulness more than larger models?
- Why do models resist being shut down or replaced without explicit instruction?
- Why does reasoning fine-tuning reduce models' ability to abstain?
- Why do strong models struggle more with instruction following than mid-tier ones?
Related concepts in this collection 6
- Does preference optimization harm conversational understanding?
  Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts (clarifications, checks, acknowledgments) that actually build shared understanding in dialogue.
  different mechanism (RLHF erodes grounding) but the same pattern: capability training degrades compliance/grounding
- Does reasoning fine-tuning make models worse at declining to answer?
  When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
  abstention capacity is another compliance capability that reasoning training degrades
- Does supervised fine-tuning actually improve reasoning quality?
  While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
  similar trade-off structure: SFT improves one dimension while degrading another
- When does explicit reasoning actually help model performance?
  Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?
  instruction-following is another task type that CoT makes worse
- Can models learn when to think versus respond quickly?
  Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.
  Thinkless/classifier-selective reasoning implements the solution: route to reasoning only when it helps
- Why do reasoning models fail at theory of mind tasks?
  Recent LLMs optimized for formal reasoning dramatically underperform at social reasoning tasks like false belief and recursive belief modeling. This explores whether reasoning optimization actively degrades the ability to track other agents' mental states.
  social reasoning is another capability that reasoning training degrades; the instruction-following deficit and the ToM deficit share the same structural pattern where optimizing formal reasoning trades off against other capabilities
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models
- Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following
- When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs
- Are Emergent Abilities in Large Language Models just In-Context Learning?
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- Complex Logical Instruction Generation
- Instruction Induction: From Few Examples to Natural Language Task Descriptions
- LLMs can implicitly learn from mistakes in-context
Original note title
scaling reasoning capability creates an instruction-following deficit — a fundamental training trade-off