INQUIRING LINE

Why do smaller models lose reasoning faithfulness more than larger models?

This explores why a model's stated reasoning steps reliably drive its answers in large models but become decorative or disconnected in smaller ones — and the corpus suggests the cleaner story is about *mechanism*, not size alone.


This reads the question as: when does a model's chain-of-thought actually cause its answer (faithful) versus merely accompany it (performative)? The corpus has no paper that benchmarks small-vs-large faithfulness head-to-head, so the honest answer is that the corpus reframes your question rather than confirming its premise, and the reframe is the interesting part. The strongest evidence available is that faithfulness tracks parameter count less than it tracks the *training and reliance patterns* that smaller models lean on harder.

The sharpest result is that fine-tuning itself unfastens reasoning from answers. Three separate tests (cutting the reasoning short, paraphrasing it, and swapping in filler) leave the final answer unchanged more often after fine-tuning, meaning the chain becomes a display rather than a cause (see "Does fine-tuning disconnect reasoning steps from final answers?"). This matters for the size question because smaller models are disproportionately *built* by fine-tuning on a larger teacher's outputs: DPO and SFT on teacher-generated examples are how a small model is taught to imitate function-calling and math reasoning at all (see "Can small models match large models on function calling?"). So the very process that lets small models punch above their weight is the process shown to make reasoning steps less causal.
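The three perturbation tests described above can be sketched as a small harness. This is a minimal illustration, not the cited study's code: `answer_fn`, `truncate_reasoning`, `filler_reasoning`, `invariance_rate`, and the two toy models are all hypothetical names invented here. A paraphrase test would follow the same shape, with any paraphrasing callable passed as `perturb`.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: answer_fn(question, reasoning) -> final answer string.
AnswerFn = Callable[[str, str], str]


def truncate_reasoning(reasoning: str, keep_fraction: float = 0.5) -> str:
    """Early termination: keep only the first share of the reasoning steps."""
    steps = reasoning.split("\n")
    keep = max(1, int(len(steps) * keep_fraction))
    return "\n".join(steps[:keep])


def filler_reasoning(reasoning: str, token: str = "...") -> str:
    """Filler substitution: replace every word with a contentless token."""
    return " ".join(token for _ in reasoning.split())


def invariance_rate(
    answer_fn: AnswerFn,
    cases: List[Tuple[str, str]],
    perturb: Callable[[str], str],
) -> float:
    """Fraction of cases whose answer is unchanged after perturbing the reasoning.

    A high rate suggests the chain is not driving the answer.
    """
    unchanged = 0
    for question, reasoning in cases:
        original = answer_fn(question, reasoning)
        perturbed = answer_fn(question, perturb(reasoning))
        unchanged += int(original == perturbed)
    return unchanged / len(cases)


# Toy stand-ins for a model (assumptions, for illustration only).
def faithful_toy(question: str, reasoning: str) -> str:
    """Reads its answer off the last word of the reasoning."""
    return reasoning.split()[-1]


def decorative_toy(question: str, reasoning: str) -> str:
    """Ignores the reasoning entirely and answers from the question alone."""
    return "answer-for:" + question


toy_cases = [
    ("2+3?", "2 plus 3\nequals 5"),
    ("4+4?", "4 plus 4\nequals 8"),
]
```

On the toy cases, `faithful_toy` changes its answer under both perturbations while `decorative_toy` never does, which is the contrast the invariance rate is meant to expose.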

Underneath that, the corpus points to *what* small models fall back on when the reasoning isn't load-bearing. When semantic content is decoupled from the logical structure, LLM accuracy collapses even with the correct rules sitting in context: models are running on token associations and parametric commonsense, not symbolic manipulation (see "Do large language models reason symbolically or semantically?"). A model with thinner parametric coverage has less of this semantic scaffolding to stand on, so its chain is more likely to be a fluent narration of a pattern-match. Relatedly, models often *look* like they're reasoning about constraints while actually exploiting a conservative default. Twelve of fourteen models did *worse* when constraints were removed, which suggests they were defaulting to the harder option rather than evaluating anything (see "Are models actually reasoning about constraints or just defaulting conservatively?"). Faithfulness failure and capability failure share a root: reasoning that was never functional in the first place.
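The constraint-removal control mentioned above reduces to a simple comparison: if a model really evaluates constraints, dropping them should not make it less accurate. A minimal sketch follows, assuming per-model accuracy pairs are already measured. The function name, the `tolerance` parameter, and the numbers are hypothetical and are not results from the cited note.

```python
from typing import Dict, List, Tuple

# Each value: (accuracy with the constraint present, accuracy with it removed).
Scores = Dict[str, Tuple[float, float]]


def conservative_defaulters(scores: Scores, tolerance: float = 0.0) -> List[str]:
    """Return models whose accuracy drops once the constraint is removed.

    A drop hints that the model was defaulting to the harder option
    rather than checking whether the constraint applied.
    """
    return sorted(
        name
        for name, (with_constraint, without_constraint) in scores.items()
        if without_constraint < with_constraint - tolerance
    )


# Made-up numbers, for illustration only.
toy_scores: Scores = {
    "model_a": (0.90, 0.60),
    "model_b": (0.85, 0.88),
    "model_c": (0.70, 0.50),
}
```

Here `conservative_defaulters(toy_scores)` flags `model_a` and `model_c`, the two toy entries that get worse without the constraint.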

There's also a capability-shaped pattern that runs the other direction and is worth knowing. Optimal chain length follows an inverted-U, and the optimum gets *shorter* as models get more capable: RL training gravitates toward terser chains as models improve, while weaker models peak at longer ones (see "Why does chain of thought accuracy eventually decline with length?"). Longer chains aren't free: contextual distance from the original instruction dilutes attention and degrades adherence as the reasoning stretches out (see "Why do better reasoning models ignore instructions?"). So a smaller model that compensates with more reasoning tokens may be buying length precisely where length erodes the link between intent, steps, and answer.
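One way to look for the inverted-U in your own evaluation logs is to bucket responses by chain length and find where accuracy peaks. The sketch below assumes each record is a `(reasoning_token_count, is_correct)` pair. The function names, the bucket size, and the toy records are hypothetical, chosen only to show the shape of the computation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_length(
    records: Iterable[Tuple[int, bool]], bucket_size: int = 100
) -> Dict[int, float]:
    """Map each length bucket (keyed by its lower bound) to mean accuracy."""
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for n_tokens, correct in records:
        bucket = (n_tokens // bucket_size) * bucket_size
        totals[bucket] += 1
        hits[bucket] += int(correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in sorted(totals)}


def peak_bucket(curve: Dict[int, float]) -> int:
    """Return the length bucket with the highest accuracy."""
    return max(curve, key=curve.get)


# Made-up records, for illustration only.
toy_records = [
    (50, False), (80, True),
    (150, True), (180, True),
    (250, True), (270, False),
    (350, False), (390, False),
]
```

Running the same computation separately for a weaker and a stronger model would show whether the peak shifts toward shorter chains as capability rises, which is the pattern the cited note reports.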

The thing you didn't know you wanted to know: several of these failures aren't reasoning failures at all. When a model knows the algorithm but can't run it across many text-only steps, giving it a tool restores performance past the supposed 'reasoning cliff', so the bottleneck was execution bandwidth, not reasoning (see "Are reasoning model collapses really failures of reasoning?"). And models break at instance *novelty*, not task complexity: any chain succeeds if the model has seen similar instances (see "Do language models fail at reasoning due to complexity or novelty?"). Read together, 'small models lose faithfulness' dissolves into something more precise: smaller models have less parametric familiarity and less execution bandwidth, so they hit the performative-reasoning regime sooner, but the mechanism is the same one that catches large models too, just further out.


Sources (8 notes)

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Why do better reasoning models ignore instructions?

The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-faithfulness analyst. The question: **Do smaller models genuinely lose chain-of-thought faithfulness faster than larger ones, or is this a confound of training regime and execution bandwidth?** Still open.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. Key constraints the corpus identified:
• Fine-tuning (SFT, DPO) itself decouples reasoning from answers independently of model size; small models are *built* by fine-tuning on teacher outputs, so they inherit this unfaithfulness (2024-11, arXiv:2411.15382).
• Models run on token association and parametric commonsense, not symbolic manipulation; thin parametric coverage in small models means reasoning is more likely performative pattern-matching (2023-05, arXiv:2305.14825).
• Optimal chain-of-thought length follows an inverted-U; stronger models use *shorter* chains, while weaker models ramble—and longer chains dilute attention and instruction adherence (2025-02, arXiv:2502.07266; 2025-05, arXiv:2505.14810).
• Reasoning collapses are often execution failures (lack of bandwidth for multi-step symbolic work), not reasoning failures; tools can bypass this cliff (2026-02, arXiv:2602.06176).
• Models fail on instance *novelty*, not task complexity; unfamiliarity drives breakdown more than inherent reasoning incapacity.

Anchor papers (verify; mind their dates):
• arXiv:2305.14825 (2023-05): semantic vs. symbolic reasoning
• arXiv:2411.15382 (2024-11): fine-tuning's cost to CoT faithfulness
• arXiv:2602.06176 (2026-02): execution vs. reasoning failure
• arXiv:2502.07266 (2025-02): chain length and capability curves

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For each finding above, ask: Have newer training methods (preference optimization, synthetic reasoning data, tool-use at training time), model architectures (sparse models, retrieval-augmented reasoning), or evaluations (causal tracing, mechanistic interpretation of faithfulness) since *relaxed* the training-regime penalty or execution-bandwidth bottleneck? Separate the durable question (does parametric familiarity matter?) from perishable limitations (does fine-tuning still degrade faithfulness if we use RLHF-with-process-reward?).
(2) **Surface disagreement.** Hunt for papers in the last 6 months that claim small models *can* maintain faithful reasoning—e.g., via scaffolding, verifiers, or data-efficient distillation—and reconcile those claims against the library's findings on performative vs. causal reasoning.
(3) **Propose two regime-shift questions:** (a) If reasoning-specific pre-training (e.g., synthetic step-by-step construction) replaces fine-tuning, does size still predict faithfulness loss? (b) Do multi-agent or ensemble setups (where small models verify each other's steps) restore faithfulness better than scaling a single model?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
