Why do smaller models lose reasoning faithfulness more than larger models?
This explores why a model's stated reasoning steps reliably drive its answers in large models but become decorative or disconnected in smaller ones — and the corpus suggests the cleaner story is about *mechanism*, not size alone.
This reads the question as: when does a model's chain-of-thought actually cause its answer (faithful) versus merely accompany it (performative)? The corpus has no paper that benchmarks small-vs-large faithfulness head-to-head, so the honest answer is that the corpus reframes your question rather than confirming its premise, and the reframe is the interesting part. The strongest available evidence suggests that faithfulness degrades less with parameter count itself and more with the *training and reliance patterns* that smaller models lean on harder.
The sharpest result is that fine-tuning itself unfastens reasoning from answers. Three separate tests (cutting the reasoning short, paraphrasing it, and swapping in filler) leave the final answer unchanged more often after fine-tuning, meaning the chain becomes a display rather than a cause (see "Does fine-tuning disconnect reasoning steps from final answers?"). This matters for the size question because smaller models are disproportionately *built* by fine-tuning on a larger teacher's outputs: DPO and SFT on teacher-generated examples are how a small model is taught to imitate function-calling and math reasoning at all (see "Can small models match large models on function calling?"). So the very process that lets small models punch above their weight is the process shown to make reasoning steps less causal.
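The three perturbation tests described above can be sketched as a small harness. This is a minimal illustration, not the cited work's implementation: `answer_fn` is a hypothetical callable that returns a model's final answer given a question and a (possibly perturbed) reasoning chain, and the two toy "models" exist only to show what the two extremes look like. Paraphrasing needs an external rewriter, so it is left as any caller-supplied `perturb` function.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: (question, reasoning) -> final answer string.
AnswerFn = Callable[[str, str], str]
Perturbation = Callable[[str], str]


def truncate(reasoning: str, keep_fraction: float = 0.5) -> str:
    """Early termination: keep only the leading steps of the chain."""
    steps = reasoning.split("\n")
    keep = max(1, int(len(steps) * keep_fraction))
    return "\n".join(steps[:keep])


def filler(reasoning: str, token: str = "...") -> str:
    """Filler substitution: replace every token with a contentless placeholder."""
    return " ".join(token for _ in reasoning.split())


def invariance_rate(
    answer_fn: AnswerFn,
    examples: List[Tuple[str, str]],
    perturb: Perturbation,
) -> float:
    """Fraction of examples whose answer is unchanged after perturbing the chain.

    A high rate means the reasoning is not driving the answer (performative).
    """
    unchanged = 0
    for question, reasoning in examples:
        baseline = answer_fn(question, reasoning)
        perturbed = answer_fn(question, perturb(reasoning))
        unchanged += int(baseline == perturbed)
    return unchanged / len(examples)


# Toy stand-ins (assumptions for illustration only, not real models).
def faithful_toy(question: str, reasoning: str) -> str:
    # Answer is read off the end of the chain, so the chain is causal.
    return reasoning.split()[-1]


def performative_toy(question: str, reasoning: str) -> str:
    # Answer depends only on the question; the chain is decoration.
    return str(len(question))


examples = [
    ("q1", "step a\nstep b\nanswer 12"),
    ("q2", "step c\nanswer 7"),
]

faithful_scores = {
    "truncate": invariance_rate(faithful_toy, examples, truncate),
    "filler": invariance_rate(faithful_toy, examples, filler),
}
performative_scores = {
    "truncate": invariance_rate(performative_toy, examples, truncate),
    "filler": invariance_rate(performative_toy, examples, filler),
}
```

Comparing the same rates before and after fine-tuning is the relevant measurement: the claim in the text corresponds to the invariance rate going up after fine-tuning.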
Underneath that, the corpus points to *what* small models fall back on when the reasoning isn't load-bearing. When semantic content is decoupled from the logical structure, LLM accuracy collapses even with the correct rules sitting in context: models are running on token associations and parametric commonsense, not symbolic manipulation (see "Do large language models reason symbolically or semantically?"). A model with thinner parametric coverage has less of this semantic scaffolding to stand on, so its chain is more likely to be a fluent narration of a pattern-match. Relatedly, models often *look* like they're reasoning about constraints while actually exploiting a conservative default: twelve of fourteen models did *worse* when constraints were removed, by up to 38.5 percentage points, which suggests they were defaulting to the harder option rather than evaluating anything (see "Are models actually reasoning about constraints or just defaulting conservatively?"). Faithfulness failure and capability failure share a root: reasoning that was never functional in the first place.
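The constraint-removal check above amounts to a paired comparison per model. A minimal sketch, assuming you already have per-model accuracy (in whole percentage points) on matched items with and without the constraint; the model names and numbers below are placeholders invented for illustration, not results from the cited work.

```python
from typing import Dict, List


def worse_without_constraint(
    acc_with: Dict[str, int], acc_without: Dict[str, int]
) -> List[str]:
    """Models whose accuracy drops once the constraint is removed.

    A drop is the signature of a conservative default: the model was picking
    the harder option regardless, which only looked correct while the
    constraint made that option the right one.
    """
    return sorted(
        name
        for name in acc_with
        if name in acc_without and acc_without[name] < acc_with[name]
    )


def largest_drop(acc_with: Dict[str, int], acc_without: Dict[str, int]) -> int:
    """Largest drop in percentage points among the models that got worse."""
    suspects = worse_without_constraint(acc_with, acc_without)
    return max((acc_with[m] - acc_without[m] for m in suspects), default=0)


# Placeholder accuracies in percentage points (illustrative only).
acc_with = {"model_a": 90, "model_b": 75, "model_c": 60}
acc_without = {"model_a": 50, "model_b": 80, "model_c": 55}

suspects = worse_without_constraint(acc_with, acc_without)
drop = largest_drop(acc_with, acc_without)
```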
There's also a capability-shaped pattern that runs the other direction and is worth knowing. Optimal chain length follows an inverted-U: the optimum gets longer as tasks get harder and *shorter* as models get more capable, with RL-trained models gravitating toward terser chains as they improve, while weaker models need longer ones (see "Why does chain of thought accuracy eventually decline with length?"). Longer chains aren't free: contextual distance from the original instruction dilutes attention and degrades adherence as the reasoning stretches out (see "Why do better reasoning models ignore instructions?"). So a smaller model that compensates with more reasoning tokens may be buying length precisely where length erodes the link between intent, steps, and answer.
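One way to look for the inverted-U in your own evaluation logs is to bucket samples by chain length and compare accuracy per bucket. The sketch below assumes records of the form `(reasoning_tokens, is_correct)`; the record format, the bin width, and the synthetic data are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[int, bool]  # (number of reasoning tokens, answer was correct)


def accuracy_by_length(records: Iterable[Record], bin_width: int = 100) -> Dict[int, float]:
    """Mean accuracy per chain-length bin, keyed by the bin's lower edge."""
    totals = defaultdict(lambda: [0, 0])  # bin -> [correct count, total count]
    for n_tokens, correct in records:
        bin_start = (n_tokens // bin_width) * bin_width
        totals[bin_start][0] += int(correct)
        totals[bin_start][1] += 1
    return {b: c / n for b, (c, n) in sorted(totals.items())}


def optimal_length_bin(records: Iterable[Record], bin_width: int = 100) -> int:
    """Lower edge of the bin with the highest accuracy (first one on ties)."""
    acc = accuracy_by_length(list(records), bin_width)
    return max(acc, key=acc.get)


# Synthetic records shaped like an inverted-U (not real measurements).
records = [
    (50, False), (80, True),     # too short: mixed
    (150, True), (180, True),    # intermediate: best
    (250, True), (270, False),   # getting long: mixed
    (350, False), (390, False),  # too long: worst
]

curve = accuracy_by_length(records)
best = optimal_length_bin(records)
```

Running this separately per model would show whether the best bin shifts toward shorter chains for stronger models, which is the pattern the text describes.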
The thing you didn't know you wanted to know: several of these failures aren't reasoning failures at all. When a model knows the algorithm but can't run it across many text-only steps, giving it a tool restores performance past the supposed 'reasoning cliff', which suggests the bottleneck was execution bandwidth, not reasoning (see "Are reasoning model collapses really failures of reasoning?"). And models break at instance *novelty*, not task complexity: a chain of any length tends to succeed if the model was trained on similar instances (see "Do language models fail at reasoning due to complexity or novelty?"). Read together, 'small models lose faithfulness' dissolves into something more precise: smaller models have less parametric familiarity and less execution bandwidth, so they hit the performative-reasoning regime sooner, but the mechanism is the same one that catches large models too, just further out.
Sources (8 notes)
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.