How does training format shape reasoning strategy more than content?
This explores why *how* training data is presented shapes a model's reasoning strategy more than the actual subject matter does. The cleanest evidence comes from a study showing that models trained on multiple-choice data adopt breadth-first exploration, while free-form training produces depth-first reasoning, and that this format effect is about 7.5 times stronger than the effect of domain content (see "Does training data format shape reasoning strategy more than domain?"). Presentation, in other words, leaves a deeper imprint on cognitive style than topic.
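The source note for this study reports the format effect as a Cohen's d of up to 1.5. As a rough illustration of what that statistic measures, the sketch below computes Cohen's d (the difference in group means divided by the pooled standard deviation) for two groups of scores. The function name `cohens_d`, the idea of a per-trace "breadth" score, and all of the numbers are assumptions invented for the example; they are not the study's metric or data.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical "breadth" scores per reasoning trace (for instance, how many
# distinct options a trace considers). Invented values, not study data.
multiple_choice_trained = [4.0, 5.0, 3.0, 5.0, 4.0]
free_form_trained = [2.0, 3.0, 2.0, 3.0, 2.0]

d = cohens_d(multiple_choice_trained, free_form_trained)
print(f"Cohen's d = {d:.2f}")
```

By common convention, a d of around 0.8 is already described as a large effect, so a reported value of up to 1.5 indicates groups whose score distributions overlap relatively little.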
Why would form dominate substance? A cluster of work suggests that what models learn from chain-of-thought (CoT) is the *shape* of reasoning, not its logical content. Illogical or structurally invalid CoT exemplars perform nearly as well as valid ones, which means the gains ride on structural pattern-matching rather than genuine inference (see "Does logical validity actually drive chain-of-thought gains?"). The synthesizing view is that CoT is constrained imitation: the model reproduces a reasoning *format* it has seen, which is exactly why format effects dominate content and why structurally invalid prompts still succeed (see "What makes chain-of-thought reasoning actually work?"). If reasoning is learned as a template, then the template you train on is the lever.
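To make the "same shape, different logic" comparison concrete, here is a minimal sketch of how a valid and an illogical exemplar can share an identical layout. The `build_prompt` and `skeleton` helpers, the Q / Step / A layout, and both exemplars are hypothetical and written only for illustration; they are not the prompts or protocol used in the cited work.

```python
def build_prompt(exemplars, question):
    """Render exemplars in a fixed Q / Step i / A layout, then pose the new question."""
    blocks = []
    for ex in exemplars:
        steps = "\n".join(
            f"Step {i}: {text}" for i, text in enumerate(ex["steps"], start=1)
        )
        blocks.append(f"Q: {ex['question']}\n{steps}\nA: {ex['answer']}")
    blocks.append(f"Q: {question}\nStep 1:")
    return "\n\n".join(blocks)


def skeleton(prompt):
    """Keep only each line's label (the text before the first colon)."""
    return [line.split(":", 1)[0] for line in prompt.splitlines()]


# Hypothetical exemplars: same question, same answer, same number of steps.
valid_exemplar = {
    "question": "A box holds 3 red and 4 blue pens. How many pens are there?",
    "steps": ["There are 3 red pens.", "There are 4 blue pens.", "3 + 4 = 7."],
    "answer": "7",
}
illogical_exemplar = {
    "question": "A box holds 3 red and 4 blue pens. How many pens are there?",
    "steps": [
        "There are 4 red pens.",
        "Blue pens are heavier than red pens.",
        "So 4 - 3 = 7.",
    ],
    "answer": "7",
}

new_question = "A shelf holds 5 mugs and 2 cups. How many items are there?"
prompt_valid = build_prompt([valid_exemplar], new_question)
prompt_illogical = build_prompt([illogical_exemplar], new_question)

# The two prompts differ in content but share the same structural skeleton.
print(skeleton(prompt_valid) == skeleton(prompt_illogical))
```

If the gains come from the template rather than the inference, the two prompts should steer a model similarly, because the only thing they have in common is the skeleton.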
This fits a larger picture in which training doesn't install reasoning so much as select and route it. Base models already carry latent reasoning capability, and many different interventions (RL, decoding tweaks, feature steering) just elicit what's already there (see "Do base models already contain hidden reasoning ability?"). One framing puts it bluntly: RL post-training teaches a model *when* to reason, not *how* (see "Does RL post-training create reasoning or just deploy it?"). Seen this way, training format isn't writing new reasoning skills; it's choosing which pre-existing strategy gets deployed, which is why a surface feature like answer format can swing behavior so hard.
The catch is that imitated form is brittle. Reasoning learned as format degrades predictably when you shift the task, length, or presentation away from the training distribution: models keep producing fluent traces while the underlying logic quietly fails (see "Does chain-of-thought reasoning actually generalize beyond training data?"). And format isn't the only thing that matters: a complementary line of work finds that *procedural* knowledge in pretraining (worked examples and methods, not isolated facts) is what actually transfers to new reasoning, suggesting the durable signal is closer to "how to do it" than to either topic or surface shape (see "Does procedural knowledge drive reasoning more than factual retrieval?").
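One simple way to look for this kind of brittleness is to score a model separately on in-distribution items and on items shifted in length or format, then compare accuracy per condition. The sketch below assumes evaluation results are already available as `(condition, is_correct)` pairs. The condition names and every record are placeholders invented for illustration and carry no empirical meaning.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Return the fraction of correct answers for each evaluation condition."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for condition, is_correct in records:
        totals[condition] += 1
        correct[condition] += int(is_correct)
    return {condition: correct[condition] / totals[condition] for condition in totals}


# Placeholder records (hypothetical, not measured results).
records = [
    ("in_distribution", True),
    ("in_distribution", True),
    ("in_distribution", False),
    ("in_distribution", True),
    ("longer_than_training", True),
    ("longer_than_training", False),
    ("longer_than_training", False),
    ("longer_than_training", False),
    ("unseen_format", False),
    ("unseen_format", True),
]

scores = accuracy_by_condition(records)
for condition, accuracy in scores.items():
    print(f"{condition}: {accuracy:.2f}")
```

Because fluent traces can hide failed logic, a breakdown like this is only informative if correctness is judged on the final answer or on verified steps, not on how plausible the trace reads.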
The unexpected takeaway: if you want to change *how* a model reasons, you may get more leverage from redesigning the format of its examples than from curating better content — but that same sensitivity means a model trained to a format is also trapped by it the moment the world stops looking like the training set.
Sources (7 notes)
"Does training data format shape reasoning strategy more than domain?" Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.
"Does logical validity actually drive chain-of-thought gains?" Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties, not logical validity, drive the gains. The model learns the form of reasoning, not genuine inference.
"What makes chain-of-thought reasoning actually work?" CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
"Do base models already contain hidden reasoning ability?" Five independent mechanisms (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
"Does RL post-training create reasoning or just deploy it?" Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
"Does chain-of-thought reasoning actually generalize beyond training data?" DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning, imitating reasoning form without valid underlying logic.
"Does procedural knowledge drive reasoning more than factual retrieval?" Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall, which depends on narrow, document-specific memorization of target facts.