Does instruction tuning teach task understanding or output format?
This note explores whether models trained on instructions learn task semantics or merely learn to match output distributions. The question matters because it challenges assumptions about how fine-tuning improves model behavior.
"Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning" constructs two controls. First, simplified task definitions that strip all semantic content, leaving only output-space information (e.g., "output one of: A, B, C"). Second, delusive examples that contain incorrect input-output mappings. Models trained on either achieve performance comparable to models trained on full, correct instructions. On classification tasks in a low-resource setting, a random baseline achieves 42.6% exact-match versus instruction tuning's 43%.
The implication: instruction tuning primarily teaches the model to map its existing capabilities to the expected output format, not to understand or execute the task as described in the instruction. The semantic content of the instruction — what the task is, how to approach it, what constitutes a correct answer — appears largely irrelevant. What matters is the output distribution: how many classes, what format, what vocabulary.
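The two controls and the random baseline described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names, the toy label set, and the choice to guess uniformly over the label set are all assumptions made here for clarity.

```python
import random

def simplify_definition(labels):
    """Keep only the output space; drop every description of the task itself."""
    return "Output one of: " + ", ".join(labels)

def make_delusive(examples, labels, rng):
    """Swap each gold label for a different, incorrect label (hypothetical recipe)."""
    delusive = []
    for text, gold in examples:
        wrong = rng.choice([label for label in labels if label != gold])
        delusive.append((text, wrong))
    return delusive

def random_baseline(labels, n, rng):
    """Ignore the input entirely and guess uniformly within the output space."""
    return [rng.choice(labels) for _ in range(n)]

def exact_match(preds, golds):
    """Fraction of predictions that equal the gold answer exactly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# Toy data, invented for illustration only.
LABELS = ["A", "B", "C"]
EXAMPLES = [("first input", "A"), ("second input", "B"), ("third input", "C")]
```

If format knowledge alone carries most of the score, a model that only knows `LABELS` should land close to a model trained on the full instruction, which is the comparison the 42.6% versus 43% figures describe.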
This connects to a broader pattern. The note "Does training data format shape reasoning strategy more than domain?" reports a 7.5x stronger effect of format than of domain. The note "Can models pass tests while missing the actual grammar?" shows that correct outputs can mask reliance on surface heuristics. The instruction tuning finding adds a third case: even explicit instructions about the task are largely ignored in favor of format signals.
A complementary theory from "Are Emergent Abilities in Large Language Models just In-Context Learning?" (arXiv:2309.01809) offers a mechanistic explanation: instruction tuning enables "implicit in-context learning", mapping instructions to the form required for ICL rather than creating new functional abilities. The evidence: purported emergent abilities are explained by a combination of in-context learning, model memory, and linguistic knowledge. The model's sensitivity to minor prompt variations and its tendency to hallucinate are inconsistent with genuine emergent functional abilities but consistent with a model that maps prompts to ICL patterns. This reframes safety concerns: if prompts function as "training mechanisms" rather than as interfaces to inherent abilities, the risk lies in which ICL patterns exist, not in which abilities have "emerged."
The IT Survey (same source) documents the concern from the other direction: "there has been an intense criticism that IT only captures surface-level patterns and styles rather than comprehending and learning the task." Combined with the False Promise finding that model imitation captures style rather than factuality, a clear pattern emerges: fine-tuning-based adaptation, whether through imitation, instruction tuning, or domain SFT, preferentially captures distributional and formatting information while leaving underlying capabilities largely unchanged. On this reading, the capability bottleneck is in the base model, not the adaptation method.
Webson & Pavlick (2021) provide the prompting-level parallel. Evaluating 30+ manually written templates and 13 sets of target words, for 390+ prompts in total, they find that models learn just as fast from irrelevant or misleading templates as from instructive ones. Models are "much more sensitive to the choice of LM target words as opposed to the meaning of the instruction templates." Instruction-tuned models can be "too robust": they are less sensitive to prompt semantics than their non-IT equivalents, which suggests that IT trains a form of prompt-blindness. This holds from 235M to 175B parameters. The convergence is striking: the fine-tuning literature and the prompting literature arrive at the same conclusion from opposite directions. The semantic content of instructions is largely inert, and what transfers is format and output-space information.
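The template-by-target-word design can be pictured as a cross product: every template is paired with every set of target words, so 30 templates and 13 target-word sets give 30 × 13 = 390 prompts. The sketch below builds a much smaller grid of the same shape. The templates and target words are illustrative placeholders written for this note, not the ones used in the study, and `build_prompt_grid` is a hypothetical helper.

```python
from itertools import product

# Placeholder templates, one per category of instruction quality.
TEMPLATES = {
    "instructive": '{premise} Does this imply that "{hypothesis}"?',
    "irrelevant": '{premise} Is the weather nice today? "{hypothesis}"',
    "misleading": '{premise} Is this text about sports? "{hypothesis}"',
    "null": "{premise} {hypothesis}",
}

# Placeholder target-word sets: (word for "entailed", word for "not entailed").
TARGET_WORD_SETS = [("yes", "no"), ("cat", "dog"), ("no", "yes")]

def build_prompt_grid(premise, hypothesis):
    """Pair every template with every target-word set for a single input."""
    grid = []
    for (kind, template), targets in product(TEMPLATES.items(), TARGET_WORD_SETS):
        grid.append({
            "kind": kind,
            "prompt": template.format(premise=premise, hypothesis=hypothesis),
            "targets": targets,
        })
    return grid

grid = build_prompt_grid("A dog runs across the field.", "An animal is moving.")
```

Holding the template fixed while varying `targets`, and then the reverse, separates the two factors. The reported result is that performance tracks the target-word axis far more than the template axis.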
Inquiring lines that use this note as a source 122
This note is a source for these synthesized inquiries.
- Why do workers who understand AI generations learn more than those who only use output?
- Why does AI-improved task performance fail to transfer to independent work?
- Does extended exoskeleton use eventually produce meaningful skill transfer?
- Can granular sub-task training for function calling improve both open and proprietary models?
- Can benchmarks designed for shortcut learning detect heuristic override failures?
- Can explicit goal state scaffolding at inference time transfer to autonomous tracking through training?
- How does the knowing-doing gap widen as tasks become more complex?
- Do models learn different sophistry strategies for QA versus code generation?
- Can curated demonstrations compensate for smaller or simpler training environments?
- Does alignment training create bidirectional instruction and response mappings?
- Can instruction tuning succeed without explicit task understanding?
- Does correct model behavior guarantee internal alignment of learned objectives?
- How do training objectives shape what a world model actually learns?
- Can prompting unlock compositional skills that pretraining already learned?
- How much of the combinatorial task space must training data cover?
- Does narrow reallocation to remaining tasks constitute genuine adaptation?
- What execution feedback signals drive context updates without supervision labels?
- Can dynamic instance-specific prompt selection solve the generalization problem across tasks?
- How much does pretraining contribute to ToM performance versus task-specific training?
- Can demo placement be tuned as a task-specific hyperparameter?
- Does partial trace guidance work better than curriculum learning for hard problems?
- Do task-specific heuristics improve gradually or appear suddenly at scale?
- How do task difficulty and skill type interact in model performance?
- Does AI-assisted performance transfer to independent task completion?
- Can in-context learning replicate the timing effects that RL teaches models?
- Can fine-tuning ever teach semantic inference instead of amplifying training shortcuts?
- How does preference-based training compare to supervised fine-tuning for function calling?
- Does fine-tuning models for specific tasks destroy their ability to reason?
- Why does mixed instruction data sometimes hurt specific model capabilities?
- Does approaching human performance mean learning the same grammatical rules?
- What distinguishes instance seeds from full input-output exemplar requirements?
- Why do primacy effects peak at specific instruction densities?
- Does input length alone explain instruction density performance loss?
- Can structured output formats reduce instruction following degradation?
- Can correct outputs mask reliance on surface heuristics rather than deep understanding?
- Are instruction-tuned models more or less sensitive to prompt semantics than others?
- Do task-specific heuristics emerge because they compress well enough?
- Does highlighting input features reduce human over-reliance on machine outputs?
- How does behavioral fine-tuning differ from factual knowledge encoding in models?
- How should researchers evaluate whether correct model outputs reflect real structural learning?
- Can curriculum degradation of document quality accelerate policy learning?
- How does explicit exploratory prompting compare to fine-tuned reinforcement learning for in-context adaptation?
- Can a single model trained on two tasks predict untrained decision tasks?
- Do instruction-tuned models learn tasks or just output format distributions?
- What happens when you train user simulators instead of task agents?
- Does foundational model training or user priors more strongly shape final outputs?
- Why does the gap between theoretical expressiveness and learned capability matter?
- Why does KTO skip supervised fine-tuning while DPO cannot?
- Does fine-tuning actually change model capabilities or only output distribution?
- Why does instruction tuning hurt knowledge-intensive tasks more than reasoning tasks?
- Why do vector embeddings fail to measure task relevance in production RAG?
- Does scaling reasoning capability create tradeoffs with instruction following?
- How does scaling reasoning capability actually reduce instruction-following ability?
- Why does critique training produce deeper understanding than imitation training?
- How do instruction backtranslation and MAGPIE demonstrate self-generation principles?
- Does self-supervised process supervision work for domains with ambiguous correctness?
- How does task-oriented fine-tuning compare to preference tuning methods?
- Why do instruction following and reasoning capability trade off in training?
- How does task contamination differ from test set data leakage?
- How does data quality mismatch create reasoning degradation in supervised fine-tuning?
- How much task-similar finetuning data does test-time training actually need?
- What makes high-quality GUI instruction data different from general vision data?
- Does the Assistant Axis exist in pre-trained models before instruction tuning?
- Can reasoning fine-tuning improve both capability and instruction compliance together?
- Can a separate mediator layer improve intent understanding before task execution?
- Do negative constraints require fundamentally different training signals than positive instructions?
- Does training on granular tasks beat training on the full function calling problem?
- Can preference learning fix the rigid output format problem better than supervised training?
- Does format-based pretraining determine how models respond to reinforcement learning?
- How does training on correct answer form differ mechanistically from training on failure analysis?
- Can models maintain multiple task interpretations simultaneously before committing to a single policy?
- How do prompting and activation steering relate as compression strategies?
- Does minimal code engagement during vibe coding harm students' long-term programming comprehension?
- What distinguishes task-specific heuristics from genuine world models?
- What specific behavioral patterns should alignment examples target for maximum effect?
- Can activation sparsity patterns guide the selection of in-context learning demonstrations?
- Do instruction-tuned models prefer conversational over formal source language?
- How does reinforcement learning on outcomes reinforce template-matching rather than computation?
- Why does imitation learning alone plateau without outcome-based refinement?
- How do out-of-distribution tests reveal that optimization learning is memorization?
- How do complete multi-turn trajectories differ from isolated task examples?
- What is the gap between benchmark performance and real workplace task completion?
- Does pretraining poisoning at scale persist through instruction alignment?
- How do input-side defenses separate task methodological and framing intents?
- How does post-training shift models from passive prediction to on-policy action?
- What distinguishes data that generalizes broadly from task-specific memorization?
- Can we predict out-of-distribution generalization without access to downstream tasks?
- Can explicit reflection during AI-assisted work improve transfer of learning?
- Can extracted skills transfer effectively across different domains and model architectures?
- Where does skill extraction fail compared to genuine model adaptation?
- How do transformers stitch together learned behaviors when adapting to new tasks?
- Can training on diverse related tasks be more efficient than task-specific training?
- Why does specializing to one task make future task learning harder?
- Can we predict which tasks will decompose into modular subnetworks?
- What is the difference between changing model outputs versus changing internal representations?
- Can mechanistic interpretability tools decode the biases alignment training conceals?
- Can vector embeddings measure task relevance instead of semantic similarity?
- Can format adaptation alone explain why reasoning enrichment improves instruction following?
- What capacity threshold determines whether RL teaches activation versus shortcut learning?
- Why does supervised fine-tuning on diverse demonstrations expand exploration diversity compared to RL?
- How does stage-wise training scheduling resolve conflicts between constraint-following and creative tasks?
- Can trained models encode programs more complex than their data-generating process?
- What emergent behaviors do models develop when trained on underspecified pedagogical tasks?
- What makes task alignment more fragile than underlying knowledge retention?
- How does action-level decomposition differ from token-level imitation in supervision?
- Why does target probability matter more than task logical complexity?
- Do text-space skills transfer learning across different frontier models?
- What training regimes confound surface mechanisms with their actual causes?
- How do agents distinguish between evidence framing and instruction framing in practice?
- Do legitimate task signals exploit the same position and framing vulnerabilities as attacks?
- What makes principle-response mutual information sufficient for behavioral alignment?
- Why do estimates for task-level performance differ so much from full job automation timelines?
- How does annotation-based pretraining compare to self-supervised video masking for screen understanding?
- Why does identifying UI element types and locations enable downstream task learning?
- What makes a good in-context learning example for a given task?
- Which finetuning method works best across different task and data regimes?
- How do finetuning and pretraining improvements differ in their effects on model capabilities?
- How do task frequency and complexity interact with model capacity during training?
- Can intentional data-mixture design replace model scaling for rare task learning?
- Do sample-level similarities between pretraining and downstream tasks explain the frequency effect?
- How much performance is lost when converting pretrained checkpoints versus training from scratch?
- Why do strong models struggle more with instruction following than mid-tier ones?
Related concepts in this collection 5
- Does training data format shape reasoning strategy more than domain?
  What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
  format > domain at 7.5x; this adds format > instruction semantics
- Can models pass tests while missing the actual grammar?
  Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
  same mechanism in linguistic domain
- Can small models reason well by just learning output format?
  Does reasoning performance depend primarily on adapting how models express outputs rather than acquiring new knowledge? The Tina research tests this by applying LoRA to a 1.5B model during reasoning training.
  LoRA as format adapter aligns with IT as format teacher
- Does supervised fine-tuning actually improve reasoning quality?
  While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
  SFT raises accuracy because it teaches the output format, not because it improves reasoning
- Why do chain-of-thought examples fail across different conditions?
  Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.
  complementary evidence of format-over-substance: IT achieves accuracy through format matching alone, while CoT exemplar brittleness shows reasoning performance depends on surface exemplar properties (order, style, complexity) rather than semantic content
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning
- A Survey on Post-training of Large Language Models
- Exploring Format Consistency for Instruction Tuning
- Are Emergent Abilities in Large Language Models just In-Context Learning?
- LESS: Selecting Influential Data for Targeted Instruction Tuning
- Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
- Instruction Induction: From Few Examples to Natural Language Task Descriptions
- Foundations of Large Language Models
Original note title
instruction tuning teaches output format distribution not task understanding — simplified and delusive instructions achieve comparable performance