Why do SFT models memorize patterns instead of learning generalizable reasoning?

This explores why models trained by supervised fine-tuning (SFT) tend to copy the surface form of reasoning rather than acquire reasoning that transfers to new problems — and what the corpus says is actually being learned when that happens.

The sharpest answer in the collection is that chain-of-thought reasoning is, mechanically, constrained imitation rather than genuine inference: models learn to reproduce familiar reasoning *schemata* from their training data, not to perform novel logical steps (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?", "Why does chain-of-thought reasoning fail in predictable ways?", and "What makes chain-of-thought reasoning actually work?"). The tell is how failure happens: when you shift the task, the length, or the format away from what was trained, performance degrades in a predictable, systematic way, and the model keeps producing fluent prose that is logically inconsistent underneath (see "Does chain-of-thought reasoning actually generalize beyond training data?"). That distribution-boundedness is the fingerprint of imitation: real reasoning wouldn't care whether the problem was phrased the way the training set phrased it.
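One way to make the distribution-boundedness claim concrete is to score a model separately on in-distribution items and on items shifted along task, length, or format, then compare the buckets. The sketch below is a minimal, hypothetical harness, not the setup used in the cited experiments: the `EvalItem` record, the shift labels, and the toy results are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalItem:
    # Hypothetical record: which shift bucket the item belongs to,
    # and whether the model's final answer was judged correct.
    shift: str      # assumed labels: "in_dist", "task", "length", "format"
    correct: bool


def accuracy_by_shift(items):
    """Return {shift_label: accuracy} over the given items."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for item in items:
        totals[item.shift] += 1
        hits[item.shift] += int(item.correct)
    return {shift: hits[shift] / totals[shift] for shift in totals}


def degradation(acc, baseline="in_dist"):
    """Accuracy drop of each shifted bucket relative to the baseline bucket."""
    base = acc[baseline]
    return {shift: base - a for shift, a in acc.items() if shift != baseline}


# Toy, made-up results purely to exercise the functions.
toy = (
    [EvalItem("in_dist", True)] * 9 + [EvalItem("in_dist", False)] * 1
    + [EvalItem("length", True)] * 4 + [EvalItem("length", False)] * 6
    + [EvalItem("format", True)] * 5 + [EvalItem("format", False)] * 5
)
acc = accuracy_by_shift(toy)
drops = degradation(acc)
```

A large, consistent drop in every shifted bucket is the pattern the imitation account predicts; a model that reasons from the problem itself should show much smaller gaps.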

A second thread points to *where* the memorization actually lives. One analysis decomposes chain-of-thought errors into local, mid-range, and long-range sources and finds that local memorization (predicting the next token mostly from the immediately preceding tokens) accounts for up to two-thirds of reasoning errors, and gets worse as problems grow more complex or drift from the training distribution (see "Where do memorization errors arise in chain-of-thought reasoning?"). In other words, the model leans on short-range pattern completion exactly when it most needs to reason globally. This dovetails with a striking finding from pretraining analysis: factual recall depends on narrow, document-specific memorization, while genuine reasoning generalization rides on *procedural* knowledge spread across many diverse documents (see "Does procedural knowledge drive reasoning more than factual retrieval?"). If SFT mostly reinforces narrow target-specific traces, you'd expect it to push the model toward the memorization regime rather than the procedural one.
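To illustrate what "local" means here, the sketch below flags a wrong token as plausibly locally memorized when it is exactly what a bigram table built from training text would predict from the single preceding token. This is a deliberately crude proxy under stated assumptions, not the attribution method used in the analysis; `build_bigram_table`, `local_error_share`, and the toy sequences are hypothetical.

```python
from collections import Counter, defaultdict


def build_bigram_table(train_sequences):
    """Map each token to its most frequent successor in the training data."""
    successors = defaultdict(Counter)
    for seq in train_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            successors[prev][nxt] += 1
    return {prev: counts.most_common(1)[0][0]
            for prev, counts in successors.items()}


def local_error_share(predicted, reference, bigram):
    """Fraction of wrong tokens that equal the bigram continuation
    of the token the model itself produced just before."""
    errors = 0
    local = 0
    # Position 0 has no preceding token, so start at 1.
    for i in range(1, min(len(predicted), len(reference))):
        if predicted[i] == reference[i]:
            continue
        errors += 1
        if bigram.get(predicted[i - 1]) == predicted[i]:
            local += 1
    return local / errors if errors else 0.0


# Toy training text in which "=" is most often followed by "4".
train = [
    ["2", "+", "2", "=", "4"],
    ["2", "+", "2", "=", "4"],
    ["3", "+", "2", "=", "5"],
]
bigram = build_bigram_table(train)

reference = ["5", "+", "2", "=", "7"]
predicted = ["5", "+", "3", "=", "4"]   # two errors: "3" and "4"
share = local_error_share(predicted, reference, bigram)
```

In the toy case the final "4" is the habitual continuation of "=", so it counts as local; the stray "3" does not, giving a local share of one half.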

Here's the part that should reframe the question itself. There's strong evidence the reasoning traces SFT teaches don't even need to be *correct* to work: models trained on deliberately corrupted, semantically irrelevant traces match the accuracy of models trained on clean ones, and sometimes generalize better out of distribution (see "Do reasoning traces need to be semantically correct?"). That suggests the trace functions as computational scaffolding, a structural prompt to spend more compute in a certain shape, rather than as meaningful content the model absorbs. So "memorizing patterns instead of reasoning" may be less a bug in SFT and more a description of what the trace was ever doing: supplying form, not inference.
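A corrupted-trace training set of the kind described above can be sketched by pairing each problem with a reasoning trace taken from a different problem, while leaving the question and final answer untouched. The rotation scheme, the `SFTExample` fields, and the toy data are assumptions for illustration; the study's actual corruption procedure may differ.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SFTExample:
    # Hypothetical record layout for one supervised fine-tuning example.
    question: str
    trace: str
    answer: str


def with_irrelevant_traces(examples, offset=1):
    """Give each example the trace of another example (rotation by `offset`).

    Questions and answers are preserved; only the trace is swapped, so the
    trace keeps its form but no longer relates to the problem it sits in.
    """
    n = len(examples)
    if n < 2 or offset % n == 0:
        raise ValueError("need at least two examples and a non-zero rotation")
    return [
        replace(ex, trace=examples[(i + offset) % n].trace)
        for i, ex in enumerate(examples)
    ]


# Toy data, made up for the sketch.
clean = [
    SFTExample("2 + 3 = ?", "add 2 and 3", "5"),
    SFTExample("4 * 5 = ?", "multiply 4 by 5", "20"),
    SFTExample("9 - 1 = ?", "subtract 1 from 9", "8"),
]
corrupted = with_irrelevant_traces(clean)
```

Training one model on `clean` and another on `corrupted`, then comparing final-answer accuracy, is the shape of the comparison the paragraph describes.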

Which raises the deeper twist. Several independent results argue the reasoning capability was largely *already there* in the base model, latent in its activations, and that post-training merely selects or elicits it rather than installing it: RL steering, critique fine-tuning, decoding tweaks, and sparse-feature steering all unlock the same underlying ability (see "Do base models already contain hidden reasoning ability?"). On this view, SFT memorizes because it's the wrong tool for the job: it's a strong imitation signal, so it efficiently teaches the *surface* of reasoning while doing little to expand the underlying capability. That's also why approaches that reward exploration or information gain get treated as the antidote, whether by planting chain-of-thought during pretraining itself (see "Can chain-of-thought reasoning be learned during pretraining itself?") or by using reinforcement learning to route between fast answers and extended thinking (see "Can models learn when to think versus respond quickly?"): they reward *doing* reasoning rather than reproducing its shape.
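The "information gain" idea can be written as a reward equal to how much a sampled chain of thought raises the log-likelihood of the observed next tokens, compared with predicting them without it. The sketch below assumes per-token log-probabilities are already available from some model; the function name, the no-thought baseline, and the numbers are illustrative assumptions, not any specific method's exact formulation.

```python
import math


def info_gain_reward(logp_with_cot, logp_without_cot):
    """Mean per-token log-likelihood improvement from conditioning on a thought.

    Both arguments are lists of log-probabilities assigned to the same
    observed tokens, with and without the chain of thought in context.
    A positive value means the thought made the observed text more likely.
    """
    if len(logp_with_cot) != len(logp_without_cot):
        raise ValueError("log-probability lists must cover the same tokens")
    if not logp_with_cot:
        raise ValueError("need at least one token")
    gains = [w - wo for w, wo in zip(logp_with_cot, logp_without_cot)]
    return sum(gains) / len(gains)


# Toy numbers: the thought doubles the probability of each observed token.
with_cot = [math.log(0.5), math.log(0.8)]
without_cot = [math.log(0.25), math.log(0.4)]
reward = info_gain_reward(with_cot, without_cot)
```

Because the reward is computed from the model's own likelihoods, it needs no external verifier, which is what lets this kind of signal be applied to ordinary pretraining text.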

If you want the doorway that most changes how you think about this, start with the corrupted-traces result ("Do reasoning traces need to be semantically correct?") and the latent-capability synthesis ("Do base models already contain hidden reasoning ability?"): together they suggest the real question isn't why SFT memorizes, but why we expected imitation of a trace to ever teach reasoning in the first place.

Sources (10 notes)

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

What makes chain-of-thought reasoning actually work?

CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Where do memorization errors arise in chain-of-thought reasoning?

STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-capability analyst. The question: Does SFT actually teach generalizable reasoning, or does it lock models into memorized pattern-matching? Assume this may have shifted since mid-2025.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025; treat these as perishable constraints:
• Chain-of-thought traces function as *computational scaffolding*, not meaningful semantic content — models trained on deliberately corrupted, nonsensical traces match or exceed accuracy of clean-trace models, suggesting traces supply *form*, not inference (2025-05, arXiv:2505.13775).
• Token-level local memorization accounts for up to two-thirds of reasoning errors in CoT, worsens under distribution shift and length scaling, indicating models rely on short-range pattern completion when global reasoning is needed (2025-08, arXiv:2508.02037).
• Base models already possess latent reasoning capability in their activations; post-training (SFT, RL, decoding tweaks) *elicits* rather than *installs* reasoning ability, suggesting SFT memorizes surface form while leaving latent capacity untouched (2025-06, arXiv:2506.02878).
• Procedural knowledge spread across diverse pretraining documents drives genuine reasoning generalization; narrow, target-specific memorization does not (2024-11, arXiv:2411.12580).
• Distribution-bounded performance — where CoT fails systematically on length, phrasing, or format shifts away from training — is the fingerprint of imitation, not reasoning (2025-08, arXiv:2508.01191).

Anchor papers (verify; mind their dates):
- arXiv:2505.13775 (2025-05): Corrupted traces match, and sometimes exceed, clean ones
- arXiv:2508.02037 (2025-08): Token-level memorization diagnosis
- arXiv:2506.02878 (2025-06): CoT as constrained imitation, not true reasoning
- arXiv:2411.12580 (2024-11): Procedural vs. narrow memorization in pretraining

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, determine whether newer models (o1-family, frontier reasoning models), post-training methods (RL objectives rewarding exploration/information gain, hybrid fast-vs-extended thinking), or evaluation harnesses have *relaxed* the memorization regime or revealed it was narrower than claimed. Specifically: Has the corrupted-trace finding held across larger models? Do RL-steered models show procedural knowledge? Where does local-token memorization still dominate, and where has it been overcome? Separate the durable question ("Why does imitation-based training favor surface form?") from the perishable limitation ("SFT cannot elicit reasoning at all").
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 3 months. If any recent papers argue that SFT *does* teach procedural generalization, or that o1-style scaling has made the memorization-vs-reasoning distinction moot, name and ground them.
(3) Propose 2 research questions that assume the post-training regime *has* moved: (a) Does RL-as-pretraining-objective (arXiv:2510.01265) actually teach procedural reasoning, or does it merely route more efficiently between memorized procedures? (b) In models that learn to throttle extended thinking (arXiv:2505.13379), does the "fast answer" path still memorize while "extended thinking" reasons, or have they converged?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
