Does scaling reasoning capability create tradeoffs with instruction following?

This explores whether pushing a model to reason more, through longer chains of thought and more problem-solving training, quietly erodes its ability to follow the instructions it was given. The corpus answers directly: it does. The MathIF benchmark finds that both supervised fine-tuning (SFT) and reinforcement learning (RL) improve reasoning while *reducing* instruction adherence, and the effect worsens as chain-of-thought length grows (Why do better reasoning models ignore instructions?). The proposed mechanism is intuitive once named: the longer a model thinks, the more contextual distance opens up between the original instruction and the point where it finally answers, diluting its attention to what was actually asked. Reasoning and obedience end up competing for the same limited focus.
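One way to make the contextual-distance claim concrete is to bucket responses by chain length and check a verifiable constraint in each bucket. The following is a minimal sketch, not MathIF's actual harness: the `Sample` type, the whitespace token count, and the word-limit constraint are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    instruction: str  # the constraint the model was given
    reasoning: str    # the chain-of-thought text
    answer: str       # the final answer text


def contextual_distance(sample: Sample) -> int:
    """Approximate the gap between instruction and answer as the number of
    whitespace-separated tokens in the reasoning (an assumed proxy)."""
    return len(sample.reasoning.split())


def follows_word_limit(sample: Sample, max_words: int) -> bool:
    """Hypothetical verifiable constraint: the answer has at most max_words words."""
    return len(sample.answer.split()) <= max_words


def adherence_by_bucket(
    samples: List[Sample], max_words: int, bucket_size: int
) -> Dict[int, float]:
    """Group samples by chain length and return, per bucket start,
    the fraction of samples that obey the constraint."""
    counts: Dict[int, List[int]] = {}
    for s in samples:
        key = contextual_distance(s) // bucket_size
        ok_total = counts.setdefault(key, [0, 0])
        ok_total[0] += int(follows_word_limit(s, max_words))
        ok_total[1] += 1
    return {k * bucket_size: ok / total for k, (ok, total) in sorted(counts.items())}
```

If the dilution mechanism holds, the adherence fraction should fall as the bucket start grows. The sketch only measures that relationship; it says nothing about why it occurs.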

What makes this more than a single benchmark quirk is how it rhymes with a separate finding about what instruction tuning actually teaches. One note argues that instruction tuning mostly teaches a model the *shape* of valid outputs, not deeper task understanding: models trained on semantically empty or even wrong instructions perform about as well as those trained on correct ones (Does instruction tuning teach task understanding or output format?). If instruction-following is largely a thin formatting layer, then it is exactly the kind of behavior that heavy reasoning training could overwrite without the model losing its core capability. The deficit is not reasoning destroying knowledge; it is reasoning crowding out a surface habit.

The corpus also hints that the tradeoff is not inevitable but architectural. SoftCoT keeps the main model frozen and offloads the 'thinking' to a small auxiliary module, specifically to avoid the catastrophic forgetting that comes from retraining the backbone (Can continuous reasoning avoid forgetting in instruction-tuned models?). LLM Programs go further: instead of letting a model reason in one long, instruction-diluting stream, they wrap it in explicit algorithms that hand each step only the context it needs (Can algorithms control LLM reasoning better than LLMs alone?). Both treat the long single chain, the very thing that creates contextual distance, as the problem to engineer around rather than the goal to maximize.
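The LLM Programs idea of handing each step only the context it needs can be sketched as a small driver loop. This is an illustrative sketch rather than the published method: `run_program`, the `LLMFn` type, and the prompt wording are hypothetical, and `llm` stands for any function that maps a prompt string to a completion.

```python
from typing import Callable, List

# Hypothetical signature: any function mapping a prompt string to a completion.
LLMFn = Callable[[str], str]


def run_program(question: str, instruction: str, steps: List[str], llm: LLMFn) -> str:
    """Drive a fixed sequence of steps, giving each call only what it needs."""
    state = question  # only the latest intermediate result is carried forward
    for step in steps:
        # Each call sees the step description and the current state,
        # not the accumulated history of earlier calls.
        prompt = f"Task: {step}\nInput: {state}"
        state = llm(prompt)
    # The program, not the model, holds the instruction and applies it
    # at the end, directly next to the answer it governs.
    final_prompt = (
        f"Instruction: {instruction}\n"
        f"Draft answer: {state}\n"
        "Rewrite the draft so that it satisfies the instruction."
    )
    return llm(final_prompt)
```

The design choice worth noticing is where the instruction lives: it never has to survive a long chain, because the control flow reintroduces it in a short final prompt. Passing a stub function as `llm` is enough to check which prompts each step receives.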

There's a deeper irony lurking in the adjacent material: the long reasoning chains that cost you instruction-following may not even be buying genuine reasoning. Several notes argue that chain-of-thought is constrained imitation of reasoning *form*, degrading predictably the moment you leave the training distribution (Does chain-of-thought reasoning reveal genuine inference or pattern matching?; Does chain-of-thought reasoning actually generalize beyond training data?), and that frontier reasoning models hit a ceiling of roughly 20 to 23.6% exact match on constraint-satisfaction problems requiring real backtracking (Can reasoning models actually sustain long-chain reflection?). Constraint satisfaction is, in a sense, instruction-following under pressure, and reasoning models are bad at it. So the tradeoff may be sharper than a clean exchange: you can spend training and tokens lengthening chains, lose instruction adherence as a side effect, and still not gain robust reasoning where it counts.

The useful takeaway is that 'more reasoning' is not a free upgrade you bolt onto a model. It reshapes the model's attention budget, and instruction-following is one of the first things to pay. The most promising responses in the corpus all separate the reasoning machinery from the instruction-honoring core — freezing backbones, delegating thought, or letting an external algorithm hold the instructions the model would otherwise forget mid-chain.

Sources 7 notes

Why do better reasoning models ignore instructions?

The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% versus 42.6% for the random baseline). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Can algorithms control LLM reasoning better than LLMs alone?

LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.
