INQUIRING LINE

How much does training data format influence reasoning strategy versus domain content?

This explores whether the *shape* of training data (multiple-choice vs. free-form, format and presentation) does more to determine how a model reasons than the actual subject matter it's trained on.


The corpus has a surprisingly crisp answer to this question: format dominates. Models trained on multiple-choice data adopt breadth-first exploration, while free-form training produces depth-first reasoning, and the format effect outweighs the domain effect by roughly 7.5 to 1 (see "Does training data format shape reasoning strategy more than domain?"). In other words, *how* a problem is presented teaches the model a habit of thinking that travels across whatever topic you throw at it.

The reason this isn't just a quirk becomes clearer when you look at where reasoning actually comes from. Reasoning ability seems to be drawn from broad, transferable procedural patterns picked up across many pretraining documents, unlike factual recall, which depends on narrowly memorizing specific source documents (see "Does procedural knowledge drive reasoning more than factual retrieval?"). If reasoning is a procedure rather than a fact, then it makes sense that the *form* of the data, meaning the structural template it presents, would imprint the procedure more strongly than the content does. Several lines of work go further and argue the reasoning is already latent in the base model: post-training selects and elicits it rather than creating it (see "Do base models already contain hidden reasoning ability?"), with RL teaching a model *when* to deploy reasoning rather than *how* to do it (see "Does RL post-training create reasoning or just deploy it?"). Under that view, training format is essentially a steering signal that picks which pre-existing strategy gets activated.

There's a sharp downside to format being this powerful, though. If models are imitating the *form* of reasoning rather than its underlying logic, they should break when the form shifts, and they do. Chain-of-thought degrades predictably under distributional shifts in task, length, and format, producing fluent but logically inconsistent reasoning (see "Does chain-of-thought reasoning actually generalize beyond training data?"). Reasoning accuracy also collapses just from longer inputs, well below the context limit, in a way that's task-agnostic (see "Does reasoning ability actually degrade with longer inputs?"). These are the symptoms you'd expect from a system that learned a presentational pattern rather than a content-general competence.

The lateral surprise here is how *local* the format signal turns out to be. Reasoning improvements during RLVR are concentrated in a small minority: only about 20% of tokens are high-entropy "forking points," and training on those alone matches full updates (see "Do high-entropy tokens drive reasoning model improvements?"). Reasoning verbosity is even a single linear direction in activation space you can steer without retraining (see "Can we steer reasoning toward brevity without retraining?"). So "format shapes strategy" cashes out concretely: format is nudging a handful of decision tokens and a few activation directions, not rewriting the model's knowledge. The flip side is that domain-specific adaptation methods each have narrow sweet spots and tend to carry hidden costs, with gains in one place quietly degrading reasoning faithfulness or format flexibility elsewhere (see "How do domain training techniques actually reshape model behavior?"), which is exactly why content-tuning struggles to compete with format for control over reasoning style.
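To make the "forking token" idea concrete, here is a minimal Python sketch, assuming per-position next-token probability distributions are already available. It ranks positions by Shannon entropy and keeps the top 20% as the only positions that contribute to the loss. The function names (`token_entropy`, `forking_token_mask`, `masked_mean_loss`), the toy distributions, and the rank-based cutoff are illustrative assumptions, not the procedure used in the cited work.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_token_mask(distributions, top_fraction=0.2):
    """Flag the highest-entropy positions (hypothetical rank-based cutoff)."""
    entropies = [token_entropy(d) for d in distributions]
    # Keep at least one position, otherwise roughly `top_fraction` of them.
    k = max(1, round(top_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(entropies))]


def masked_mean_loss(per_token_losses, mask):
    """Average the loss over the selected positions only."""
    selected = [loss for loss, flag in zip(per_token_losses, mask) if flag]
    return sum(selected) / len(selected)


# Toy data: five positions over a four-token vocabulary (made up for illustration).
toy_distributions = [
    [0.97, 0.01, 0.01, 0.01],      # confident continuation
    [0.25, 0.25, 0.25, 0.25],      # maximally uncertain: a "forking point"
    [0.90, 0.05, 0.03, 0.02],
    [0.70, 0.10, 0.10, 0.10],
    [0.99, 0.005, 0.003, 0.002],
]
mask = forking_token_mask(toy_distributions, top_fraction=0.2)
loss = masked_mean_loss([1.0, 2.0, 3.0, 4.0, 5.0], mask)
```

In this toy case only the uniform distribution at the second position is selected, so the masked loss equals that position's loss. A real pipeline would compute entropies from model logits and would likely choose the cutoff per batch rather than per sequence; both choices are left open here.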

The takeaway you didn't know you wanted: if you care about *how* a model reasons, you may get more leverage from changing the presentation of your training examples, or from steering an activation vector, than from feeding it more domain content. Approaches like learning rationales at the token level on arbitrary text (see "Can models learn reasoning from predicting any text?") lean into exactly this, treating reasoning as a format-level skill that emerges independent of any particular subject.
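As a rough illustration of what "steering an activation vector" can mean, the sketch below uses a difference-of-means construction: average the hidden activations from verbose examples, average those from concise examples, and add a scaled copy of the difference at inference time. This is a commonly used recipe and is assumed here for illustration; the notes do not confirm it is the exact method behind the cited result. The names `steering_vector` and `steer`, the scale `alpha`, and the three-dimensional toy activations are all hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(verbose_acts, concise_acts):
    """Difference of means, pointing from verbose toward concise activations."""
    mean_verbose = mean_vector(verbose_acts)
    mean_concise = mean_vector(concise_acts)
    return [c - v for c, v in zip(mean_concise, mean_verbose)]


def steer(hidden, direction, alpha=1.0):
    """Shift one hidden state along `direction`, scaled by `alpha`."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations from paired examples (values are made up for illustration).
verbose_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
concise_acts = [[0.0, 1.0, 2.0], [2.0, 1.0, 2.0]]

direction = steering_vector(verbose_acts, concise_acts)
steered = steer([2.0, 0.0, 2.0], direction, alpha=0.5)
```

No weights change in this procedure, which is the sense in which steering is training-free. In a real model the shift would be applied to a chosen layer's hidden states during generation, and both the layer and the scale would need tuning.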


Sources (10 notes)

Does training data format shape reasoning strategy more than domain?

Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall, which depends on narrow, document-specific memorization of target facts.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

How do domain training techniques actually reshape model behavior?

Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.

Can models learn reasoning from predicting any text?

Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-capability analyst. The question remains open: does training data *format* (multiple-choice vs. free-form vs. other structures) shape a model's reasoning *strategy* more powerfully than domain content does?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2024, with major updates in 2025–2026:
• Format effect outweighs domain effect by ~7.5:1; multiple-choice trains breadth-first, free-form trains depth-first (2025).
• Reasoning is procedural, not factual; it emerges from broad pretraining patterns, not narrow memorization (2024–2025).
• Base models possess latent reasoning; post-training and RL select *when* to deploy it, not *how* to reason (2025).
• Chain-of-thought degrades predictably under distributional shift (task, length, format); longer inputs degrade reasoning even far below context limits (2024–2025).
• Only ~20% of tokens are high-entropy "forking points" driving reasoning; steering activation directions can control verbosity without retraining (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2411.12580 (Nov 2024) – procedural knowledge in pretraining
• arXiv:2505.10185 (May 2025) – CoT encyclopedia; predicting reasoning paths
• arXiv:2506.01939 (Jun 2025) – high-entropy minority tokens
• arXiv:2508.01191 (Aug 2025) – CoT as distribution-bounded mirage

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above—especially the 7.5:1 ratio, the format-as-steering-signal thesis, and degradation under distributional shift—judge whether models released or trained in the last 6 months (e.g., reasoning-optimized variants, larger context windows, new RL schedules, or better data mixtures) have relaxed or overturned it. Separate the durable question (likely: does format still dominate content?) from the perishable limitation (possibly: does longer context or refined post-training eliminate the length-degradation cliff?). Cite what resolved or held.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—especially anything showing domain content *does* compete with format under certain conditions, or that reasoning is more content-entangled than a pure "latent + steering" view allows.
(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., "Can multi-modal or interleaved format training weaken the format-strategy coupling?" or "Does scaling context or model size invert the format/content ratio?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
