Can format adaptation alone explain why reasoning enrichment improves instruction following?
This explores whether the boost reasoning gives to instruction-following is just the model learning the *shape* of good answers (format), or whether something deeper is also at play. The corpus splits sharply on this, and the most interesting reading is that 'format adaptation alone' explains a surprising amount, but not everything. The strongest evidence for the format-only view comes from instruction tuning research showing that models trained on semantically empty or even deliberately *wrong* instructions perform about as well as those given correct ones (see "Does instruction tuning teach task understanding or output format?"). What transfers isn't comprehension of the task; it's knowledge of what the output is supposed to look like. If the content of an instruction barely matters, then a lot of what we call 'better instruction following' is really the model getting calibrated to an output space.
The reasoning literature echoes this from a different angle. Several notes argue chain-of-thought works by reproducing the *form* of reasoning rather than performing genuine inference: illogical CoT exemplars score nearly as well as valid ones (see "Does logical validity actually drive chain-of-thought gains?"), and CoT degrades predictably once you push it outside its training distribution, which is the fingerprint of imitation, not real abstraction (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" and "Does chain-of-thought reasoning actually generalize beyond training data?"). A related finding on imitation models shows the same pattern at the macro scale: copying a stronger model's style fools human raters while closing no actual capability gap (see "Can imitating ChatGPT fool evaluators into thinking models improved?"). So 'reasoning enrichment improving instruction following' could plausibly be format dressing the whole way down.
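The comparison behind the illogical-versus-valid exemplar argument can be framed as a simple ablation. The sketch below is only an illustration: the function name `decompose_cot_gain` and the three accuracies passed to it are hypothetical placeholders, not figures from any of the notes. It splits a chain-of-thought gain into the part you already get from CoT-shaped exemplars with broken logic (format) and the part that valid logic adds on top.

```python
def decompose_cot_gain(acc_no_cot, acc_illogical_cot, acc_valid_cot):
    """Split a CoT accuracy gain into a format part and a validity part.

    format_gain:   gain from exemplars that look like reasoning but are illogical
    validity_gain: extra gain from exemplars whose reasoning is actually valid
    format_share:  fraction of the total gain attributable to format alone
    """
    format_gain = acc_illogical_cot - acc_no_cot
    validity_gain = acc_valid_cot - acc_illogical_cot
    total_gain = acc_valid_cot - acc_no_cot
    format_share = format_gain / total_gain if total_gain != 0 else float("nan")
    return {
        "format_gain": format_gain,
        "validity_gain": validity_gain,
        "format_share": format_share,
    }


# Placeholder accuracies chosen for illustration only (assumption, not data).
result = decompose_cot_gain(acc_no_cot=0.50, acc_illogical_cot=0.68, acc_valid_cot=0.70)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

A format share near 1.0 on an in-distribution benchmark is what the format-only view predicts; the decomposition says nothing about whether that share holds out of distribution.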
But the corpus also contains the counterevidence that keeps format-alone from being the full story. One striking result: transformers trained with hidden reasoning tokens actually *compute the correct answer in early layers*, then suppress it to emit format-compliant filler (see "Do transformers hide reasoning before producing filler tokens?"). Here format and computation are decoupled: the real work happens, and the format is a separate, almost cosmetic layer on top. That's a direct demonstration that 'looks like reasoning' and 'is doing reasoning' are two different things living in the same output.
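To make the 'computed early, suppressed late' pattern concrete, here is a toy logit-lens sketch. Everything in it is an assumption made for illustration: the three-token vocabulary, the identity unembedding matrix, and the hand-written hidden states are not taken from the cited work, and a real logit lens would normally apply the model's final layer norm before unembedding. The idea is just to project each layer's hidden state onto the vocabulary and track where the answer token ranks.

```python
import numpy as np

VOCAB = ["7", ".", "the"]          # hypothetical vocabulary: answer, filler, distractor
ANSWER_ID = VOCAB.index("7")


def logit_lens(hidden_states, unembed):
    """Project every layer's hidden state through the unembedding matrix.

    hidden_states: (n_layers, d_model), unembed: (d_model, vocab_size)
    returns per-layer logits of shape (n_layers, vocab_size)
    """
    return hidden_states @ unembed


def rank_of(logits, token_id):
    """Rank of a token in one layer's logits (0 means top prediction)."""
    return int(np.sum(logits > logits[token_id]))


# Toy setup (assumption): identity unembedding, so each hidden dimension maps to one token.
unembed = np.eye(3)
hidden_states = np.array([
    [2.0, 0.1, 0.0],   # early layer: answer direction dominates
    [2.5, 0.5, 0.0],   # middle layer: answer still on top
    [1.0, 3.0, 0.0],   # final layer: filler token pushed above the answer
])

per_layer_logits = logit_lens(hidden_states, unembed)
top_tokens = [VOCAB[int(np.argmax(row))] for row in per_layer_logits]
answer_ranks = [rank_of(row, ANSWER_ID) for row in per_layer_logits]

for layer, (tok, rank) in enumerate(zip(top_tokens, answer_ranks)):
    print(f"layer {layer}: top token {tok!r}, answer rank {rank}")
```

In this toy the emitted token at the final layer is the filler, yet the answer is still sitting at rank 1, which is the sense in which the reasoning stays recoverable from lower-ranked predictions.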
The cleanest rebuttal comes from training dynamics. The same 'thinking mode' machinery can be counterproductive in a vanilla model, inducing self-doubt that *degrades* answers, and then be flipped into genuinely useful gap-analysis by RL training (see "Does extended thinking help or hurt model reasoning?"). If reasoning were pure format, its sign wouldn't depend on training; the format is identical in both cases, yet the effect reverses. Add that procedural knowledge in pretraining drives reasoning generalization in a way factual lookup doesn't (see "Does procedural knowledge drive reasoning more than factual retrieval?"), and that reasoning quality is non-monotonic in length, with more tokens eventually *hurting* (see "Does more thinking time always improve reasoning accuracy?"), and you get effects that 'more format' can't account for.
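For a sense of scale on the length effect, the arithmetic below uses the two figures reported in the source note on thinking time (about 1,100 tokens at 87.3% accuracy, about 16K tokens at 70.3%). Reading '~16K' as 16,000 is an assumption, and two points only show the decline, not the full non-monotonic curve.

```python
# Figures as reported in the source note; "~16K" is read as 16,000 tokens (assumption).
short_tokens, short_acc = 1_100, 87.3
long_tokens, long_acc = 16_000, 70.3

token_multiplier = long_tokens / short_tokens   # how much longer the thinking budget is
point_drop = short_acc - long_acc               # accuracy lost, in percentage points
relative_drop = point_drop / short_acc          # accuracy lost, as a fraction of the start

print(f"thinking budget: x{token_multiplier:.1f}")
print(f"accuracy drop: {point_drop:.1f} points ({relative_drop:.1%} relative)")
```

If longer traces were only 'more format', a budget more than ten times larger should at worst leave accuracy flat; a drop of this size is hard to square with that.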
So the honest synthesis: format adaptation is a much larger share of the explanation than most people assume, and you should be suspicious whenever a reasoning trace 'improves' a model on familiar, in-distribution tasks — that's exactly where imitation of form is indistinguishable from the real thing. But it can't be the *whole* story, because the corpus shows cases where computation and format are physically separable, and where the identical format produces opposite results depending on training. The unexpected takeaway: the question isn't 'format or substance' — it's that current evaluations mostly can't tell them apart, which is why a model that has only learned the format can look exactly like one that learned to reason.
Sources (9 notes)
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, with reported scores of 43% versus 42.6% for a random baseline. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.