How much does training data format shape what reasoning strategy emerges?

This explores whether the *format* of training data — multiple-choice vs. free-form, correct vs. corrupted traces — steers what kind of reasoning a model develops, more than the subject matter or the logic inside it.

The question is whether the shape of training data, meaning how problems and answers are presented, does more to determine a model's reasoning style than what the data is actually about. The corpus answers with an emphatic yes, and the cleanest evidence is direct: models trained on multiple-choice data adopt a breadth-first "scan the options" strategy, while free-form training produces depth-first chains, and the format effect outweighs the domain effect by roughly 7.5 to 1 (see: Does training data format shape reasoning strategy more than domain?). Presentation, not topic, sets the cognitive habit.
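The source note for this finding reports the multiple-choice versus free-form gap as a Cohen's d of up to 1.5. As a reminder of what that statistic measures, here is a minimal sketch of the standard pooled-standard-deviation formula. The function name and the toy "breadth" scores are hypothetical and are not taken from the underlying study, which may have computed its effect sizes differently.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Hypothetical per-problem "breadth" scores (illustrative numbers only).
mc_trained = [4.0, 5.0, 6.0, 5.0, 5.0]
free_form_trained = [3.0, 4.0, 5.0, 4.0, 4.0]

d = cohens_d(mc_trained, free_form_trained)
print(f"Cohen's d = {d:.2f}")  # about 1.41 for these toy numbers
```

By the usual rule of thumb, values above roughly 0.8 count as large, so a d near 1.5 means the two groups' score distributions overlap only modestly.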

The reason format dominates becomes clearer once you see how little of "reasoning" the training is actually creating. Several notes converge on the idea that base models already carry latent reasoning ability, and post-training mostly *selects* and *packages* it rather than installing it. Minimal interventions (RL steering, decoding tweaks, feature steering) all surface reasoning that pre-exists in activations (see: Do base models already contain hidden reasoning ability?), and RL post-training appears to teach *when* to deploy reasoning rather than *how* (see: Does RL post-training create reasoning or just deploy it?). If the capability is already there, then what training data does is largely formatting work, which is exactly why a 1.5B model with LoRA-only tuning can match larger full-parameter RL models by learning output *organization* instead of new knowledge (see: Can small models reason well by just learning output format?).

The unsettling corollary is that the *content* of reasoning traces matters far less than their *form*. Chain-of-thought exemplars that are logically invalid perform nearly as well as valid ones (see: Does logical validity actually drive chain-of-thought gains?), and traces that have been deliberately corrupted teach about as well as correct ones, sometimes generalizing *better* out of distribution (see: Do reasoning traces need to be semantically correct?). The model is learning the scaffolding and rhythm of step-by-step output, not the inferential substance. This is why format imprints so hard: it's the part the model can actually imitate.

But format-shaped reasoning is also brittle reasoning. Because the model absorbs the *form* without the underlying logic, it breaks predictably when the presentation shifts: DataAlchemy experiments show chain-of-thought degrading systematically under changes in task, length, and format, producing fluent but logically inconsistent output (see: Does chain-of-thought reasoning actually generalize beyond training data?). The same property that makes format a powerful lever for *shaping* reasoning makes it a fault line for *generalizing* it. It is worth contrasting this with the one ingredient that does seem to travel: broad procedural knowledge drawn from diverse pretraining documents, which transfers across problems in a way that format-mimicry and fact-memorization do not (see: Does procedural knowledge drive reasoning more than factual retrieval?).

If you want to go deeper into the mechanism, two notes zoom into where the format signal actually lives: only ~20% of tokens, the high-entropy "forking points", carry the reasoning learning signal (see: Do high-entropy tokens drive reasoning model improvements?), and reasoning verbosity turns out to be a single steerable direction in activation space (see: Can we steer reasoning toward brevity without retraining?). And for the limiting case (what happens when you strip task-specific format away entirely), Quiet-STaR shows reasoning can emerge as a side effect of predicting *any* text, format-free (see: Can models learn reasoning from predicting any text?). The takeaway the corpus leaves you with: training format isn't a cosmetic choice about how answers look; it's one of the strongest levers you have over how a model thinks, precisely because the model is imitating form more than it's reasoning from content.
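To make the "forking points" idea concrete, the sketch below computes per-position entropy over next-token distributions and flags the top 20% of positions. It is an illustration under stated assumptions, not the procedure from the cited work: the function names, the toy distributions, and the rank-based selection rule are all hypothetical, and the original method may choose its threshold differently.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_token_mask(distributions, top_fraction=0.2):
    """Flag the top_fraction of positions with the highest entropy.

    Assumption: selection is by rank within the sequence; at least one
    position is always selected.
    """
    entropies = [token_entropy(p) for p in distributions]
    k = max(1, round(top_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    selected = set(ranked[:k])
    return [i in selected for i in range(len(entropies))]


# Toy next-token distributions over a 3-token vocabulary at 5 positions.
toy = [
    [0.98, 0.01, 0.01],    # confident continuation
    [0.34, 0.33, 0.33],    # near-uniform: a plausible "forking point"
    [0.90, 0.05, 0.05],
    [0.97, 0.02, 0.01],
    [0.99, 0.005, 0.005],
]

mask = forking_token_mask(toy)
print(mask)  # [False, True, False, False, False]
```

In a training setup, a mask like this could be used to restrict the loss or gradient to the flagged positions, which is the kind of comparison the source note describes when it says training on the minority of tokens matches or exceeds full-gradient performance.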

Sources 11 notes

Does training data format shape reasoning strategy more than domain?

Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Can small models reason well by just learning output format?

A 1.5B parameter model with LoRA-only post-training matched larger full-parameter RL models on reasoning tasks, suggesting RL teaches output format organization rather than new factual knowledge. This efficiency indicates reasoning and knowledge storage are separable capabilities.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Can models learn reasoning from predicting any text?

Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI reasoning researcher. The question remains open: **Does training data format shape emerging reasoning strategy more than content does — and does this constraint still hold under 2025–2026 model scaling, inference-time scaling, and RL regimes?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025, anchored in a 7.5-to-1 format-over-domain effect. Key constraints cited:
• Format (multiple-choice vs. free-form) dominates domain content by ~7.5× in determining reasoning style; logically invalid chain-of-thought traces perform nearly as well as valid ones (~2023).
• Base models already possess latent reasoning; post-training (RL, LoRA) packages and selects it rather than installing it; RL teaches *when* to reason, not *how* (~2024–2025).
• Chain-of-thought degrades systematically under task/length/format shifts, suggesting format-learned reasoning is distribution-bounded and brittle (~2025).
• Only ~20% of tokens (high-entropy forking points) carry reasoning signal; verbosity occupies a single steerable activation direction (~2025).
• Quiet-STaR shows reasoning can emerge format-free, as a side effect of predicting any text (~2024).

Anchor papers (verify; mind their dates):
• arXiv:2307.10573 (2023-07): Invalid Logic, Equivalent Gains
• arXiv:2403.09629 (2024-03): Quiet-STaR
• arXiv:2504.15777 (2025-04): Tina (LoRA reasoning)
• arXiv:2508.01191 (2025-08): Is CoT a Mirage? Data Distribution Lens

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For every claim above, ask: Have newer models (o1-class, test-time scaling, post-training breakthroughs), new training methods (multi-turn RL, synthetic data generation, curriculum), tooling (advanced sampling harnesses), or evals overturned the 7.5× format dominance or the claim that content *barely* matters? Separate the durable question ("does format imprint reasoning style?") from the perishable limitation ("format outweighs content 7.5×"). Cite what relaxed or resolved each constraint, or state plainly where it still stands.
(2) **Surface contradicting/superseding work from the last ~6 months** that challenges the "format > content" synthesis — especially work showing content, domain diversity, or semantic structure *does* shape reasoning emergence in ways the library may have underweighted.
(3) **Propose 2 research questions assuming the regime has shifted:** e.g., under test-time scaling or synthetic-data-driven RL, does the format effect persist? Can adversarial or semantically rich data recover content's role?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
