Why does training data format matter more than its domain content?
This explores a striking finding in the corpus: how you present training data (its structure, shape, and form) can steer a model's behavior more powerfully than what the data is actually about. The headline result is concrete and measurable: models trained on multiple-choice data learn to explore broadly (breadth-first), while models trained on free-form answers learn to dig deep (depth-first), and this format effect outweighs the domain effect by about 7.5 times (see "Does training data format shape reasoning strategy more than domain?"). In other words, whether your data is about medicine or math matters far less than whether it's shaped as a checklist or an essay.
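The source note for this result reports its effect sizes as Cohen's d (up to 1.5). As a reminder of what that statistic measures, here is a minimal sketch of the standard pooled-variance formula. The function name and the two score lists are hypothetical and invented purely for illustration; they are not data from the cited work.

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

# Hypothetical "breadth of exploration" scores for two training formats.
# These numbers are made up for the example.
mc_trained = [4, 5, 6, 7, 8]
free_form_trained = [2, 3, 4, 5, 6]

d = cohens_d(mc_trained, free_form_trained)
print(f"Cohen's d = {d:.2f}")
```

By the usual rule of thumb, a d above 0.8 already counts as a large effect, so a reported value of 1.5 means the two groups' distributions are well separated.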
Why would shape dominate content? A clue comes from thinking of language models as compression engines rather than fact-memorizers. A model trained only on text can compress images and audio better than specialized tools like PNG and FLAC, because it generalizes by learning the underlying *structure* of information, not the surface domain (see "Can text-trained models compress images better than specialized tools?"). If learning is fundamentally about absorbing structural patterns, then the format you feed in, the pattern itself, becomes the primary teaching signal, and the domain is almost incidental.
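The link between prediction and compression can be made concrete: an ideal arithmetic coder spends about -log2(p) bits on a symbol that the model predicted with probability p, so a better predictor of structure is a better compressor. The sketch below is a toy illustration under stated assumptions. It uses a simple adaptive byte-frequency model with add-one smoothing, not a language model and not the setup of the cited work, and the function name is hypothetical.

```python
import math
from collections import Counter

def ideal_code_length_bits(data: bytes) -> float:
    """Bits an ideal arithmetic coder would need when driven by an
    adaptive order-0 byte model with add-one (Laplace) smoothing."""
    counts = Counter()
    seen = 0
    total_bits = 0.0
    for byte in data:
        # Probability the model assigns to this byte before seeing it.
        p = (counts[byte] + 1) / (seen + 256)
        total_bits += -math.log2(p)
        # Update the model so later predictions adapt to the data.
        counts[byte] += 1
        seen += 1
    return total_bits

repetitive = b"ab" * 200            # highly structured input
bits = ideal_code_length_bits(repetitive)
raw_bits = 8 * len(repetitive)      # cost of storing the bytes unmodeled

print(f"modeled: {bits:.0f} bits, raw: {raw_bits} bits")
```

The model knows nothing about the input in advance. It shortens the encoding only by picking up the pattern as it reads, which is the sense in which learning structure and compressing are the same activity.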
The corpus shows this format-sensitivity isn't just a pretraining curiosity; it persists and even sharpens during later training. Reinforcement learning, surprisingly, doesn't blend formats. It picks one dominant format from pretraining and amplifies it within the first epoch while suppressing the alternatives, and which format wins depends on model scale, not necessarily on which format performs best (see "Does RL training collapse format diversity in pretrained models?"). So format isn't just an input choice; it's a property the training process actively selects on and collapses toward. Related work shows the same theme from the structure angle: organizing knowledge into a taxonomy lets a model reach 50% of full performance using just 0.3% of the data, because the model learns *where* a fact sits in a conceptual structure rather than memorizing raw text (see "Can organizing knowledge structures beat raw training data volume?").
The flip side is that format is also where the hidden costs live. Domain adaptation methods deliver visible gains while quietly degrading reasoning faithfulness and *format flexibility*, the very adaptability that lets a model switch presentation styles (see "How do domain training techniques actually reshape model behavior?"). And when training shape goes wrong, the damage is structural too: overly hard RLVR samples teach degenerate shortcuts like answer-repetition that then contaminate skills the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"), while the *order* in which you train task types mechanically reshapes a model's entropy: train structured tasks first and you avoid collapsing the open-ended creative capacity (see "Does training order reshape how models handle different task types?").
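The source note on overly hard RLVR samples attributes the shortcut problem to group-relative normalization, which treats a rare accidental success as a high-advantage trajectory. A minimal sketch of that arithmetic follows. It assumes a common GRPO-style form (subtract the group mean, divide by the group's population standard deviation) with binary rewards; the exact normalizer in the cited work may differ, and the function name and group sizes are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (assumed form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when every reward is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Nearly impossible problem: 1 lucky success among 16 sampled answers.
hard_group = [1.0] + [0.0] * 15
# Moderately hard problem: half of the 16 samples succeed.
mixed_group = [1.0] * 8 + [0.0] * 8

hard_adv = group_relative_advantages(hard_group)
mixed_adv = group_relative_advantages(mixed_group)

print(f"lucky success advantage:   {hard_adv[0]:.2f}")
print(f"typical success advantage: {mixed_adv[0]:.2f}")
```

Under these assumptions the lone success on the near-impossible problem gets a far larger advantage than a success on the balanced one, so whatever produced it, sound reasoning or a shortcut, is reinforced strongly.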
The thread that ties these together (and the thing you might not have known you wanted to know) is that a model is less a database of facts and more a learner of patterns. It absorbs the *form* of its lessons as deeply as the content, sometimes more so. This even shows up in alignment: 1,000 carefully shaped examples can be competitive with datasets orders of magnitude larger, because post-training activates existing capability through good form rather than pouring in new content (see "Can careful curation replace massive alignment datasets?"). The practical upshot is that how you arrange a lesson can matter more than what the lesson is about.
Sources (8 notes)
Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.
Chinchilla models trained exclusively on text achieve better compression rates on images and audio than FLAC and PNG by using their context window to adapt as task-specific compressors. This demonstrates that generalization operates through compression, not specialization.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
StructTuning achieves 50% of full-corpus performance using only 0.3% of training data by organizing chunks into auto-generated domain taxonomies. The model learns knowledge position within conceptual structures rather than raw text patterns, matching how students learn from textbooks.
Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
LIMA demonstrates that fine-tuning a strong pretrained model on 1,000 carefully curated examples achieves alignment performance competitive with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.