Do instruction-tuned models learn tasks or just output format distributions?

This explores a provocative claim — that 'instruction tuning' may teach models *where* to put their answer (the output space) rather than *how* to actually do the task — and what the corpus says about how deep that goes.

This explores whether instruction-tuned models genuinely understand the tasks they're given, or mostly learn the *shape* of the expected output. The corpus has a startlingly direct answer to the literal version of the question, and then a set of adjacent findings that suggest the same pattern shows up at every stage of post-training.

The sharpest evidence is a study where models trained on semantically empty or even deliberately *wrong* instructions performed about as well as models trained on full correct ones — 43% versus a 42.6% baseline Does instruction tuning teach task understanding or output format?. In other words, the content of the instruction was nearly irrelevant; what transferred was knowledge of the output space. A related finding makes the same point from the other direction: aligned models like Llama-3-Instruct will auto-regressively generate high-quality instructions when fed *only* the pre-query formatting tokens, no prompt needed Can aligned LLMs generate their own training data?. The format scaffolding alone is doing remarkable work — which is exactly what you'd expect if format, not task semantics, is what tuning installs.

What's striking is how this 'format over substance' pattern recurs beyond supervised instruction tuning. In reinforcement learning, RL doesn't teach new behavior so much as amplify one dominant format already latent in pretraining while suppressing the alternatives — and which format wins depends on model *scale*, not on which performs best Does RL training collapse format diversity in pretrained models?. And when researchers probe whether RL installs genuine reasoning procedures, models that look strong in-distribution collapse on slight out-of-distribution variants, revealing they've sharpened template-matching rather than learned a procedure Do fine-tuned language models actually learn optimization procedures?. The same story appears in latent computation: asked to run iterative numerical methods, models recognize a problem as template-similar and emit plausible-but-wrong values instead of actually executing the steps Do large language models actually perform iterative optimization?.

But the corpus doesn't let 'it's all just format' stand as the final word — it shows where the format/task boundary gets attacked. DPO works for small models precisely *because* it supplies explicit negative examples that target rigid output-format failures SFT can't fix Can small models match large models on function calling?. Checklist-based rewards decompose instruction quality into verifiable sub-criteria, which reduces overfitting to the superficial artifacts that fool holistic reward models — an attempt to make the signal reward substance over surface Can breaking down instructions into checklists improve AI reward signals?. And the fragility of pure instruction-following shows up directly: as you stack instructions, compliance degrades predictably, with even the best models hitting only 68% at high density How does instruction density affect model performance?.

The thing you didn't know you wanted to know: the question isn't really 'tasks *or* format' — it's that format-learning is the cheap, default thing post-training installs at every stage, and the interesting research is about which training signals (explicit negatives, decomposed verifiable rewards) can force a model past surface mimicry into something that survives an out-of-distribution test.

Sources 8 notes

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve comparable performance to those trained on full correct instructions, achieving 43% vs random baseline 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can aligned LLMs generate their own training data?

MAGPIE shows that aligned models like Llama-3-Instruct auto-regressively generate diverse, high-quality instructions when given only pre-query formatting tokens, without prompt engineering. 4M generated pairs matched human-curated datasets in quality and outperformed external sources in downstream fine-tuning.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.

How does instruction density affect model performance?

IFScale benchmark shows three degradation patterns: linear (small models), exponential (mid-range), and threshold decay (reasoning models maintain ~150 instructions then fail steeply). Even best models reach only 68% accuracy at maximum density.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher auditing whether instruction-tuned LLMs learn tasks or output-format distributions. This remains an open question despite recent work.

What a curated library found — and when (dated claims, not current truth): Findings span 2023–2025.

• Instruction content is nearly irrelevant to post-training performance: semantically empty or deliberately wrong instructions achieved 43% versus 42.6% baseline performance, suggesting format, not task semantics, transfers (2023).
• RL post-training amplifies a single dominant pretraining format rather than teaching new task behavior; which format wins depends on model scale, not performance (2025).
• Models collapse on slight out-of-distribution variants, revealing template-matching rather than learned procedures — even high-performing in-distribution models fail n−1 tests (2024).
• Instruction-following degrades predictably with instruction density; even best models hit only 68% compliance at high stacking (2025).
• DPO and checklist-based rewards (explicit negatives, decomposed verifiable sub-criteria) can force models past surface mimicry, suggesting the format/task boundary is attackable (2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2305.11383 (2023) — foundational evidence that instruction content barely matters
• arXiv:2504.07912 (2025) — RL amplifies pretraining distributions, not novel behavior
• arXiv:2507.18624 (2025) — checklists outperform reward models at aligning substance
• arXiv:2507.11538 (2025) — instruction density collapse

Your task:
(1) RE-TEST each constraint. For every finding above, probe whether newer models (o1, claude-opus-4), training methods (test-time scaling, multi-step reasoning curricula, longer context), or evaluation harnesses (stronger OOD benches, mechanistic probes) have since RELAXED or OVERTURNED it. Separate the durable question (do models learn task structure at all?) from the perishable limitation (small models can't resist format-fitting). What actually resolved each constraint?
(2) Surface the strongest contradicting or superseding work from the last 6 months. Does any paper show instruction content *does* matter, or that format-fitting is actually task learning in disguise?
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "If larger models *do* learn task structure, at what scale does it emerge?" or "Can we distinguish memorized output distributions from learned procedures mechanistically?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Do instruction-tuned models learn tasks or just output format distributions?

Sources 8 notes

Next inquiring lines