Can training format itself shape what reasoning strategy a model learns?

This note explores whether the *shape* of training data (multiple-choice vs. free-form, the structural layout of examples) determines the reasoning strategy a model adopts, independent of what the data is actually about.

Put another way: does *how* training examples are presented matter more than *what* they contain? The corpus gives an unusually clean answer: yes, and the effect is large. One study found that training format shapes reasoning strategy roughly 7.5 times more strongly than domain content. Models trained on multiple-choice data drift toward breadth-first exploration (surveying options), while free-form training produces depth-first reasoning (committing to a line and following it down) (see "Does training data format shape reasoning strategy more than domain?"). The strategy a model reaches for isn't really about the subject matter; it's a fingerprint of the format it was steeped in.
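The source note for this finding reports the format effect as a Cohen's d of up to 1.5. As a reminder of what that statistic measures, here is a minimal sketch of the standard pooled-standard-deviation form of Cohen's d. The two score lists are hypothetical placeholders, not data from the study, and the function name is chosen only for illustration.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Standardized mean difference between two groups (pooled sample SD)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Hypothetical "breadth of exploration" scores, invented for illustration only.
multiple_choice_trained = [2, 4, 6]
free_form_trained = [1, 3, 5]

print(cohens_d(multiple_choice_trained, free_form_trained))  # 0.5
```

By the usual rule of thumb, a d of 0.8 already counts as large, so a value near 1.5 means the two groups differ by about one and a half pooled standard deviations.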

Why would presentation outweigh content? Because a lot of what reasoning training does is teach *structure*, not facts. Models trained on chain-of-thought tolerate having half their numbers corrupted (3.2% accuracy loss) but fall apart when you shuffle the *order* of steps (13.3% loss). What distills across demonstrations is the logical architecture, meaning how steps sequence and connect, not the correctness of their content (see "What do models actually learn from chain-of-thought training?"). The same theme shows up even more starkly elsewhere: deliberately corrupted, semantically irrelevant reasoning traces train models about as well as correct ones, suggesting traces act as computational scaffolding rather than meaningful argument (see "Do reasoning traces need to be semantically correct?"). If the content can be wrong and training still works, then format is doing the heavy lifting.
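The two ablations contrasted above, corrupting numbers versus shuffling step order, can be sketched as simple perturbations of a reasoning trace. This is an illustrative sketch only: the function names, the regex-based notion of a "number", and the example trace are assumptions, not the procedure used in the cited study.

```python
import random
import re

def corrupt_numbers(steps, fraction=0.5, seed=0):
    """Replace roughly `fraction` of the numbers with random values.

    Step order is preserved; only the content is damaged. A replacement
    may coincide with the original number by chance.
    """
    rng = random.Random(seed)

    def maybe_corrupt(match):
        if rng.random() < fraction:
            return str(rng.randint(0, 99))
        return match.group(0)

    return [re.sub(r"\d+", maybe_corrupt, step) for step in steps]

def shuffle_steps(steps, seed=0):
    """Keep every step's content intact but permute the order."""
    rng = random.Random(seed)
    shuffled = list(steps)
    rng.shuffle(shuffled)
    return shuffled

# Hypothetical chain-of-thought trace, invented for illustration.
trace = [
    "Step 1: Each box holds 12 apples.",
    "Step 2: 3 boxes hold 12 * 3 = 36 apples.",
    "Step 3: After selling 6, 36 - 6 = 30 remain.",
]

print(corrupt_numbers(trace))  # same structure, damaged content
print(shuffle_steps(trace))    # same content, damaged structure
```

The first perturbation leaves the sequencing of the argument intact, while the second keeps every statement accurate but breaks how the steps connect, which is the distinction the reported accuracy losses turn on.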

This connects to a broader claim running through the collection: post-training mostly *selects* reasoning that base models already latently possess rather than creating it from scratch (see "Do base models already contain hidden reasoning ability?"), with RL teaching a model *when* to reason rather than *how* (see "Does RL post-training create reasoning or just deploy it?"). If reasoning is being selected rather than built, then format is the selection pressure: the lens that decides which pre-existing strategy gets amplified. A striking version of this: RL training consistently converges on a *single* dominant format already present in pretraining within the first epoch, suppressing the alternatives, and which format wins depends on model scale, not necessarily on which one performs best (see "Does RL training collapse format diversity in pretrained models?"). Format isn't just an input; it's what training collapses toward.

The practical edge of this is that you can get reasoning behavior cheaply by targeting format directly. A 1.5B model with LoRA-only tuning performed competitively with far larger full-RL models, because what RL was teaching turned out to be output-format organization, not new knowledge. Reasoning and knowledge storage appear to be separable (see lora-based-reasoning-format-adaptation-achieves-competitive-reasonin). There's a cautionary flip side, though: when models learn the *form* of reasoning without the underlying logic, the form generalizes poorly. Chain-of-thought degrades predictably the moment you shift task, length, or format away from training, producing fluent-but-invalid reasoning: imitation of a shape rather than the thing the shape was supposed to encode (see "Does chain-of-thought reasoning actually generalize beyond training data?").
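To make "cheaply" concrete: LoRA freezes a weight matrix and trains only a low-rank update, so the number of trainable parameters per layer is far smaller than in full fine-tuning. The sketch below compares the two counts for a single weight matrix. The layer dimensions and rank are hypothetical values chosen for illustration, not the configuration from the cited note.

```python
def full_param_count(d_in, d_out):
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_in * d_out

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters for a rank-`rank` update B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_in + d_out)

# Hypothetical layer size and rank, for illustration only.
d_in, d_out, rank = 1024, 1024, 8

full = full_param_count(d_in, d_out)
lora = lora_param_count(d_in, d_out, rank)

print(full)         # 1048576
print(lora)         # 16384
print(lora / full)  # 0.015625
```

Under these assumed dimensions the adapter trains under 2% of the layer's parameters, which fits the picture of tuning adjusting how output is organized rather than rewriting stored knowledge.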

The takeaway: the strategy a model uses to think may be less a property of its intelligence than an echo of the worksheet format it was trained on. If you want a model to explore broadly versus dig deeply, the data's *layout* may be a more powerful lever than its subject, and possibly more powerful than the training algorithm you reach for.

Sources 8 notes

Does training data format shape reasoning strategy more than domain?

Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.

What do models actually learn from chain-of-thought training?

Controlled ablations show models tolerate 50% corrupted numbers (3.2% accuracy loss) but fail under step shuffling (13.3% loss). What distills across reasoning demonstrations is logical architecture—how steps sequence and connect—not factual accuracy.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
