How much does training composition affect syntactic versus reasoning performance?

This note asks whether the mix of data and training signals you feed a model shapes its grip on output *form* (syntax, format, the shape of an answer) differently from its grip on actual *reasoning*. The corpus suggests the two are governed by largely separate levers.

This reads the question as: does training composition pull form and reasoning in different directions? The collection's recurring answer is yes, and the gap is wider than you'd expect. A striking thread is that many apparent 'reasoning' gains are really form being learned. Logically *invalid* chain-of-thought examples produce nearly the same accuracy boost as valid ones [Does logical validity actually drive chain-of-thought gains?], and a related line argues that chain-of-thought is constrained imitation of a reasoning *shape* rather than genuine inference [Does chain-of-thought reasoning reveal genuine inference or pattern matching?]. So composition that supplies the surface pattern of reasoning reliably teaches its syntax, while the underlying logic stays distribution-bound and degrades under shifts in task, length, or format [Does chain-of-thought reasoning actually generalize beyond training data?].

What *does* move real reasoning turns out to be a specific composition ingredient: procedural knowledge. An analysis of five million pretraining documents found that reasoning generalization rides on broad, transferable procedural material spread across many sources, whereas factual recall depends on narrow, document-specific memorization [Does procedural knowledge drive reasoning more than factual retrieval?]. That is the cleanest 'composition matters' result here: the *kind* of knowledge in the mix, not just its volume, determines whether you get transfer or rote lookup.

The trade-off cuts deeper when you look at where these capabilities live in the network. The cited work places knowledge retrieval in lower layers and reasoning adjustment in higher ones, which is why training that sharpens reasoning improves math but can actively *degrade* knowledge-heavy domains like medicine [Why does reasoning training help math but hurt medical tasks?]. So composition isn't a free lunch: shifting the mix toward reasoning is not neutral with respect to other competencies, and you can end up buying reasoning by spending knowledge.

The form-versus-reasoning split also shows up in *how* you train, not just in what the data contains. For small models on function calling, direct preference optimization (DPO) on pairs of correct and incorrect examples beats plain supervised fine-tuning (SFT), because the negative examples target the rigid *format* failures that SFT leaves unfixed [Can small models match large models on function calling?]. And at the token level, only ~20% of tokens, the high-entropy forking points, carry the reasoning learning signal, so reinforcement learning with verifiable rewards (RLVR) is really reshaping a small reasoning-critical minority while the rest is essentially form [Do high-entropy tokens drive reasoning model improvements?].
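The token-level claim can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the cited paper's implementation: it computes the entropy of each position's next-token distribution from raw logits, marks the top ~20% of positions as "forking" tokens, and averages a per-token loss over only those positions. The function names, the per-sequence selection rule, and the `keep_fraction=0.2` default are hypothetical choices made here to mirror the ~20% figure.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one position's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_mask(per_token_logits, keep_fraction=0.2):
    """Return a 0/1 mask marking the top `keep_fraction` of positions by entropy.

    Assumption: selection is done within a single sequence; a real training
    setup might rank tokens across a whole batch instead.
    """
    entropies = [token_entropy(row) for row in per_token_logits]
    n_keep = max(1, round(keep_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(entropies))]


def masked_mean(per_token_losses, mask):
    """Average the loss over the selected positions only."""
    kept = [loss for loss, m in zip(per_token_losses, mask) if m]
    return sum(kept) / len(kept)


# Toy example: five positions over a three-word vocabulary.
# Position 2 is uniform (maximally uncertain); the others are sharply peaked.
toy_logits = [
    [10.0, 0.0, 0.0],
    [0.0, 9.0, 0.0],
    [0.0, 0.0, 0.0],
    [8.0, 0.0, 0.0],
    [0.0, 0.0, 7.0],
]
toy_mask = high_entropy_mask(toy_logits, keep_fraction=0.2)
print(toy_mask)  # -> [0, 0, 1, 0, 0]
```

In this toy case only the uniform position is selected, so a loss averaged with `masked_mean` would ignore the four confident, form-like positions entirely.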

The thing you didn't know you wanted to know: syntactic competence is cheap and composition-robust. Models pick up the *form* of correct output from almost any reasonable mix, even from illogical exemplars. Reasoning is expensive, composition-sensitive, and can trade off against knowledge-heavy domains. Tellingly, rather than reasoning symbolically, models lean on semantic associations from their training distribution [Do large language models reason symbolically or semantically?], and 'compositional reasoning' often collapses into memorized subgraph matching that shatters on novel combinations [Do transformers actually learn systematic compositional reasoning?]. The form is in the data; the reasoning, mostly, is not.

Sources (9 notes)

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Why does reasoning training help math but hurt medical tasks?

Two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
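The preference objective this note refers to can be sketched in a few lines. This is a minimal sketch of the standard DPO loss for a single preference pair, not the cited work's training code. It assumes each input is the summed log-probability of a whole response under either the trainable policy or a frozen reference model; the function name `dpo_loss`, the `beta=0.1` default, and the example numbers are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    where each term is a sequence log-probability.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow for large |z|.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical numbers: the "rejected" response stands in for a malformed
# function call, the "chosen" one for a well-formed call.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)    # policy identical to reference
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)   # policy now prefers the chosen call
print(neutral > improved)  # -> True
```

The rejected response enters the loss explicitly, which is the property the note credits for fixing format failures: the policy is pushed away from the malformed output, not only toward the correct one.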

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tasked with re-testing whether training composition truly pulls syntactic and reasoning performance in opposite directions, or whether that constraint has shifted. The question remains: does the KIND and MIX of pretraining data systematically decouple form-learning from reasoning-generalization?

What a curated library found — and when (findings span May 2023 to August 2025, treat as dated claims):
• Logically invalid chain-of-thought examples boost accuracy nearly as much as valid ones; CoT is constrained imitation of reasoning shape, not genuine inference (2023–2025).
• Procedural knowledge — broad, transferable procedural material across many sources — drives reasoning generalization; factual recall depends on narrow, document-specific memorization (~2024–2025).
• Knowledge resides in lower network layers; reasoning in higher ones. Composition shifts toward reasoning can actively degrade knowledge-heavy domains like medicine (2025).
• Only ~20% of tokens (high-entropy forking points) carry the reasoning learning signal; RLVR reshapes this critical minority while the rest is form (2025).
• Compositional reasoning often reduces to memorized subgraph matching that fails on novel combinations; CoT reasoning is largely distribution-bound (2023–2025).

Anchor papers (verify; mind their dates):
• arXiv:2411.12580 (2024-11) Procedural Knowledge in Pretraining Drives Reasoning
• arXiv:2508.01191 (2025-08) Is Chain-of-Thought Reasoning a Mirage? A Data Distribution Lens
• arXiv:2506.01939 (2025-06) High-Entropy Minority Tokens Drive Effective Reinforcement Learning
• arXiv:2507.18178 (2025-07) Decoupling Knowledge and Reasoning in LLMs

Your task:
(1) RE-TEST THE FORM–REASONING TRADE-OFF. For each claim above, assess whether newer models (GPT-4o, Claude 3.5, o1-preview reasoning models), composition techniques (curriculum, mixture-of-experts, multi-task balancing), or new tooling (mechanistic interpretability probes, layer-wise LoRA steering) have since RELAXED or OVERTURNED the decoupling. Distinguish durable findings (e.g., knowledge–reasoning layer split) from perishable constraints (e.g., CoT as pure imitation). Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—any result showing syntactic and reasoning CAN scale together, or that composition doesn't trade off, or that reasoning emerges without explicit procedural data.
(3) Propose 2 new research questions that ASSUME the regime may have shifted: e.g., does mixture-of-experts composition now allow decoupled scaling? Can targeted procedural injection now fix distribution-bounded reasoning without sacrificing knowledge?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
