Does knowledge structure matter more than knowledge volume for model training?
This explores whether how knowledge is organized during training (taxonomies, graphs, reasoning paths) does more work than simply how much data you pour in. The corpus comes down firmly on the side of structure.
Across the collection, structure keeps winning, often by startling margins. The clearest case: StructTuning reaches 50% of full-corpus performance using just 0.3% of the training data by sorting chunks into auto-generated domain taxonomies, so the model learns where a fact sits in a conceptual map rather than memorizing raw text (see "Can organizing knowledge structures beat raw training data volume?"). The framing is deliberately textbook-like: students don't read more pages, they learn the scaffolding the pages hang on.
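To make "where a fact sits in a conceptual map" concrete, here is a minimal sketch of tagging text chunks with a taxonomy path before training. It is not StructTuning's actual pipeline: the `Chunk` class, the `[Domain map: ...]` prefix format, and the hand-written paths are all assumptions made for illustration (in the described method the taxonomy is auto-generated).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Chunk:
    text: str
    path: Tuple[str, ...]  # hypothetical taxonomy path, root to leaf


def build_taxonomy(chunks: List[Chunk]) -> Dict:
    """Nest every chunk's path into a dict-of-dicts tree."""
    tree: Dict = {}
    for chunk in chunks:
        node = tree
        for label in chunk.path:
            node = node.setdefault(label, {})
    return tree


def to_training_example(chunk: Chunk) -> str:
    """Prefix the chunk with its position in the taxonomy (assumed format)."""
    location = " > ".join(chunk.path)
    return f"[Domain map: {location}]\n{chunk.text}"


chunks = [
    Chunk("Beta blockers reduce heart rate.",
          ("Medicine", "Cardiology", "Pharmacology")),
    Chunk("Atrial fibrillation is an irregular heart rhythm.",
          ("Medicine", "Cardiology", "Arrhythmia")),
]

taxonomy = build_taxonomy(chunks)
examples = [to_training_example(c) for c in chunks]
```

Each training example now carries its location in the tree, so two chunks under the same branch share an explicit structural signal instead of being unrelated strings.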
The same pattern shows up when you push structure further into graphs. A 32B model fine-tuned on 24,000 reasoning tasks derived from medical knowledge-graph paths reaches state-of-the-art performance across 15 medical domains; the authors argue compositional structure matters more than scale (see "Can knowledge graphs teach models deep domain expertise?"). And structure helps at inference too, not just training: externalizing reasoning into knowledge-graph triples gives a small model (GPT-4o mini) a 29% improvement on GAIA Level 3 tasks (see "Can structuring reasoning as knowledge graphs help smaller models solve complex tasks?"). Structure is doing the heavy lifting whether it's baked into the weights or scaffolded around them.
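As a rough picture of what "externalizing reasoning into knowledge-graph triples" can look like, the sketch below keeps intermediate facts in an explicit triple store and answers a two-hop question by querying it. The `TripleStore` class, relation names, and the `country_of_birth` helper are hypothetical and are not the KGoT implementation; in a real system the triples would be added step by step by the model and its tools.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


class TripleStore:
    def __init__(self) -> None:
        self.triples: Set[Triple] = set()

    def add(self, subject: str, relation: str, obj: str) -> None:
        self.triples.add((subject, relation, obj))

    def match(self, subject: Optional[str] = None,
              relation: Optional[str] = None,
              obj: Optional[str] = None) -> List[Triple]:
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (relation is None or t[1] == relation)
            and (obj is None or t[2] == obj)
        )


def country_of_birth(store: TripleStore, person: str) -> Optional[str]:
    """Two-hop lookup: person -> birth city -> country."""
    for _, _, city in store.match(subject=person, relation="born_in"):
        for _, _, country in store.match(subject=city, relation="capital_of"):
            return country
    return None


store = TripleStore()
store.add("Marie Curie", "born_in", "Warsaw")   # fact recorded at step 1
store.add("Warsaw", "capital_of", "Poland")     # fact recorded at step 2

answer = country_of_birth(store, "Marie Curie")
```

Because every intermediate fact is an inspectable triple, a wrong hop can be spotted and corrected without rerunning the whole chain, which is the quality-control benefit the source note points to.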
There's a deeper reason volume alone hits a ceiling. An analysis of 5 million pretraining documents found that reasoning generalizes from broad, transferable *procedural* knowledge (the how-to patterns), while factual recall depends on narrow, document-specific memorization (see "Does procedural knowledge drive reasoning more than factual retrieval?"). More data mostly buys more memorized facts; structured procedural exposure buys transferable reasoning. That's also why prompting can't rescue a model that lacks foundational knowledge: prompts only reorganize what's already in the training distribution and can't inject what was never there (see "Can prompt optimization teach models knowledge they lack?"). And it's why base models can be "unlocked" by minimal training: the capability is latent, waiting for the right elicitation rather than for more tokens (see "Do base models already contain hidden reasoning ability?").
If structure beats volume, then *how* you structure becomes the real lever, and here the corpus gets interestingly granular. RLAG embeds domain knowledge better than supervised fine-tuning because it rewards answer accuracy together with coherent explanation rather than token-level correctness, internalizing knowledge structures instead of surface strings (see "Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?"). DPO outperforms SFT for small models on function calling by supplying explicit negative examples (see "Can small models match large models on function calling?"). Even training *order* is structural: scheduling structured tasks before creative ones yields 6.2% gains over joint training by preventing entropy collapse from wrecking open-ended skills (see "Does training order reshape how models handle different task types?").
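The role of explicit negative examples is easiest to see in the DPO objective itself. The sketch below computes the standard per-pair DPO loss from summed log-probabilities of a chosen (correct) and a rejected (incorrect) response under the policy and a frozen reference model. The function name, the `beta` default, and the numbers are illustrative assumptions, not values from the function-calling study.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin)))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: no preference learned yet.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy raises the correct call and lowers the incorrect one (made-up numbers).
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The rejected response appears in the loss with the opposite sign, so the gradient pushes probability away from the specific wrong output as well as toward the right one. SFT on correct examples alone has no term of that kind.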
The honest caveat the collection insists on: structured methods aren't free. Every domain-adaptation technique has a conditional sweet spot, and visible gains often hide degradation in reasoning faithfulness, capability transfer, and format flexibility (see "How do domain training techniques actually reshape model behavior?"). Push structure the wrong way, for example by training on near-impossible RLVR samples, and the model learns degenerate shortcuts that contaminate skills it already had (see "Do overly hard RLVR samples actually harm model capabilities?"). So the sharper claim isn't "structure good, volume bad." It's that the binding constraint on a trained model is the *shape* of what it learns, and volume is a blunt, expensive way to buy a shape you could engineer directly, provided you get the engineering right.
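The source note on hard RLVR samples attributes the damage to group-relative normalization, which turns a rare accidental success into a very large advantage. A small worked example shows the arithmetic. It assumes binary rewards, a group of 8 rollouts, and normalization by the population standard deviation; the function name and the `eps` constant are illustrative, and real implementations may differ in these details.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and std of its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std: an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Near-impossible problem: 1 lucky success out of 8 rollouts.
hard = group_relative_advantages([1, 0, 0, 0, 0, 0, 0, 0])

# Moderately hard problem: 4 successes out of 8 rollouts.
moderate = group_relative_advantages([1, 1, 1, 1, 0, 0, 0, 0])

lucky_advantage = hard[0]       # about sqrt(7), roughly 2.65
typical_advantage = moderate[0]  # about 1.0
```

Under these assumptions the single lucky rollout gets well over twice the advantage of a success on the moderate problem, so whatever shortcut produced it (a guessed or repeated answer, skipped computation) is reinforced more strongly than sound reasoning on solvable problems.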
Sources (11 notes)
StructTuning achieves 50% of full-corpus performance using only 0.3% of training data by organizing chunks into auto-generated domain taxonomies. The model learns knowledge position within conceptual structures rather than raw text patterns, matching how students learn from textbooks.
Fine-tuning a 32B model on 24,000 reasoning tasks derived from medical knowledge graph paths produces state-of-the-art performance across 15 medical domains, demonstrating that structured knowledge composition matters more than scale.
Knowledge Graph of Thoughts (KGoT) achieves 29% improvement on GAIA Level 3 tasks using GPT-4o mini by externalizing reasoning into iteratively constructed KG triples. The approach improves transparency, reduces bias, and enables quality control over reasoning steps.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.