Why does training data format matter more than domain content?
This note explores why the shape of training data (multiple-choice vs. free-form, structured vs. raw, whichever format a model latches onto) appears to steer a model's behavior more powerfully than the subject the data covers. The sharpest result in the corpus puts a number on it: a model's reasoning strategy is shaped roughly 7.5 times more by the format it was trained on than by the domain. Multiple-choice data pushes models toward breadth-first exploration; free-form data produces depth-first reasoning (see: Does training data format shape reasoning strategy more than domain?). The content is almost incidental: it is the shape of the examples that installs the habit.
One reason format dominates is that training doesn't so much teach new knowledge as activate and amplify patterns already latent in the model. RL post-training, for instance, doesn't blend formats: it converges on a single dominant format inherited from pretraining and suppresses the alternatives within the first epoch, and which format wins depends on model scale, not necessarily on which one performs best (see: Does RL training collapse format diversity in pretrained models?). The same activation-not-construction story shows up in alignment: fine-tuning a strong base model on 1,000 carefully curated examples rivals training on datasets orders of magnitude larger, because post-training surfaces existing capability rather than building it (see: Can careful curation replace massive alignment datasets?). If training is mostly selecting among pre-existing behaviors, then the presentation of the data, the cue the model keys on, naturally outweighs the topic.
The deeper lesson is that models learn structure, not just text. StructTuning reaches 50% of full-corpus performance using 0.3% of the data by organizing chunks into a domain taxonomy, so the model learns where a fact sits in a conceptual map rather than memorizing raw strings, much like a student learning from a textbook's organization rather than its word count (see: Can organizing knowledge structures beat raw training data volume?). Relatedly, mapping items to discrete codes before embedding transfers across domains better than encoding text directly, because the discrete intermediate reduces surface text bias (see: Decoupling text from item representations via discrete codes is more transferable). In both cases, the organizing format carries the generalization, not the domain vocabulary.
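To make the discrete-code idea concrete, here is a minimal sketch of product quantization, the mechanism the VQ-Rec source note names. The codebooks, the 4-dimensional embedding, and the function name `pq_encode` are illustrative assumptions, not taken from the VQ-Rec implementation; in practice the codebooks would be learned and far larger.

```python
def pq_encode(vector, codebooks):
    """Map a dense vector to one discrete code per sub-vector."""
    num_subspaces = len(codebooks)
    sub_dim = len(vector) // num_subspaces
    codes = []
    for m, codebook in enumerate(codebooks):
        # Slice out the m-th sub-vector of the text embedding.
        sub = vector[m * sub_dim:(m + 1) * sub_dim]
        # Choose the nearest centroid by squared Euclidean distance.
        nearest = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(sub, codebook[k])),
        )
        codes.append(nearest)
    return codes


# Hypothetical toy setup: 2 sub-spaces, 2 centroids each (assumed values).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
item_text_embedding = [0.9, 0.8, 0.1, 0.7]

codes = pq_encode(item_text_embedding, codebooks)
print(codes)  # [1, 0]
```

Downstream components see only the integer codes, which index embedding tables, so the wording of the original item text no longer reaches the model directly. That is the sense in which the intermediate format, not the vocabulary, carries the transfer.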
Format also has a mechanical, almost physical effect on training dynamics. Structured tasks drive output entropy down while creative tasks push it up, and simply changing the training order so that structured tasks come first yields a reported 6.2% gain over joint training by preventing entropy collapse from wrecking open-ended ability (see: Does training order reshape how models handle different task types?). Push format too hard in the wrong direction and capabilities actively degrade: nearly-impossible RLVR samples teach degenerate shortcuts that contaminate skills the model already had (see: Do overly hard RLVR samples actually harm model capabilities?). And every domain-adaptation technique carries hidden costs: visible performance gains often come with quiet losses in reasoning faithfulness, capability transfer, and format flexibility (see: How do domain training techniques actually reshape model behavior?).
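The source note on overly hard RLVR samples attributes the damage to group-relative normalization, which treats a rare accidental success as a high-advantage trajectory. The arithmetic can be sketched as follows, assuming a GRPO-style advantage of (reward minus group mean) divided by group standard deviation, with binary rewards. The function name, the `eps` term, the use of population standard deviation, and the group sizes are assumptions for illustration; real implementations differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Moderately hard prompt: half of the 8 sampled answers are correct.
moderate = group_relative_advantages([1, 1, 1, 1, 0, 0, 0, 0])

# Nearly impossible prompt: 1 of 16 samples happens to hit the answer.
nearly_impossible = group_relative_advantages([1] + [0] * 15)

print(round(moderate[0], 2))           # 1.0
print(round(nearly_impossible[0], 2))  # 3.87
```

Under these assumptions the lone lucky sample gets nearly four times the advantage of a correct answer on the moderate prompt, so whatever produced it, including answer repetition or skipped computation, is reinforced disproportionately.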
The useful takeaway for anyone building with these models: if you want to change how a model thinks, redesign the shape of your examples, not just their subject matter. The flip side is a caution: format isn't free to copy across models. Teacher-refined data that's objectively higher quality can still degrade a student if it exceeds the student's learning frontier, so the right format is the one compatible with the model you're actually training (see: Does teacher-refined data always improve student model performance?).
Sources (9 notes)
Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
LIMA demonstrates that 1000 carefully curated examples fine-tuned on a strong pretrained model achieve competitive alignment performance with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.
StructTuning achieves 50% of full-corpus performance using only 0.3% of training data by organizing chunks into auto-generated domain taxonomies. The model learns knowledge position within conceptual structures rather than raw text patterns, matching how students learn from textbooks.
VQ-Rec demonstrates that mapping item text to discrete codes via product quantization, then to embeddings, improves cross-domain transfer compared to direct text encoding. The discrete intermediate reduces text bias and enables efficient per-domain fine-tuning.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.