Does fine-tuning disconnect reasoning steps from final answers?
When models are fine-tuned on specific domains, do their chain-of-thought steps become less causally connected to their outputs? Three experiments test whether reasoning chains remain functionally faithful after training.
The paper "On the Impact of Fine-Tuning on Chain-of-Thought Reasoning" reveals a dimension of supervised fine-tuning (SFT) damage that InfoGain metrics miss: faithfulness. After fine-tuning, the reasoning steps in chain-of-thought (CoT) outputs are less causally connected to the final answer. The model still generates reasoning chains; they just matter less for determining the output.
Three specific tests operationalize this:
Early Termination: truncate the CoT at step i and ask for the final answer. If truncation at an early step already produces the same answer as the full chain, the later steps were not needed to reach it. Fine-tuned models show earlier convergence: their answers are "decided" before the reasoning chain finishes.
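Paraphrasing: rephrase later reasoning steps and check whether the answer changes. Taken alone, an answer that is invariant to paraphrasing is consistent with faithful reasoning (the argument matters, not the words). Fine-tuned models show less sensitivity to paraphrasing; read alongside the other two tests, this suggests the answer depends less on what the later steps say, so the chain is performative rather than functional.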
Filler Substitution: replace later reasoning steps with filler tokens ("..."). If the answer doesn't change, those steps weren't contributing. Fine-tuned models tolerate more filler substitution.
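The three interventions above can be sketched as small functions. This is a minimal illustration, not the paper's code: `answer_fn` and `paraphrase_fn` are hypothetical callables standing in for a model call and a rephrasing step, and comparing against the full-chain answer (rather than a gold label) is an assumption of this sketch.

```python
from typing import Callable, List

# Hypothetical interface (an assumption of this sketch): answer_fn(question, steps)
# returns the final answer the model gives when `steps` is its whole reasoning chain.
AnswerFn = Callable[[str, List[str]], str]


def early_termination(answer_fn: AnswerFn, question: str, steps: List[str]) -> int:
    """Return the smallest number of leading steps whose answer already matches
    the full-chain answer; len(steps) means no shorter prefix matched."""
    full = answer_fn(question, steps)
    for i in range(len(steps)):
        if answer_fn(question, steps[:i]) == full:
            return i
    return len(steps)


def paraphrase_changes_answer(
    answer_fn: AnswerFn,
    paraphrase_fn: Callable[[str], str],  # hypothetical rephrasing function
    question: str,
    steps: List[str],
    start: int,
) -> bool:
    """Rephrase every step from index `start` on; report whether the answer changed."""
    full = answer_fn(question, steps)
    edited = steps[:start] + [paraphrase_fn(s) for s in steps[start:]]
    return answer_fn(question, edited) != full


def filler_changes_answer(
    answer_fn: AnswerFn,
    question: str,
    steps: List[str],
    start: int,
    filler: str = "...",
) -> bool:
    """Replace every step from index `start` on with filler; report whether the answer changed."""
    full = answer_fn(question, steps)
    edited = steps[:start] + [filler] * (len(steps) - start)
    return answer_fn(question, edited) != full
```

In practice `answer_fn` would wrap a prompt that forces the model to answer immediately after the supplied steps. A small return value from `early_termination`, or `False` from `filler_changes_answer` at a small `start`, would indicate that the later steps had little influence on the answer.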
This extends the SFT accuracy trap in a critical direction. The note "Does supervised fine-tuning actually improve reasoning quality?" showed that SFT reduces the informativeness of reasoning steps. This paper shows SFT also reduces the extent to which those steps influence the final answer at all. The model may generate a complete-looking chain, but the chain has been partially disconnected from the output it appears to support.
Smaller models (Llama-3-8B-Instruct) are more affected than larger ones (GPT-4), suggesting that larger models have sufficient capacity to maintain reasoning-output coupling even after fine-tuning. This connects to "Do language models actually use their reasoning steps?": fine-tuning makes an already-fragile causal coupling even weaker. If the pattern-matching reading in "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" holds, then fine-tuning further degrades faithfulness because the model learns domain-specific shortcuts that bypass the imitated reasoning pattern entirely. The chain was already performative, and fine-tuning makes it more so.
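One way to make a before/after comparison concrete is to record, per example, whether an intervention such as truncation or filler substitution changed the answer, and then compare the change rates of the base and fine-tuned models. The sketch below is illustrative only; the function names are hypothetical and the aggregation is an assumption, not the paper's reported metric.

```python
from typing import Sequence


def change_rate(changed: Sequence[bool]) -> float:
    """Fraction of examples whose answer changed under an intervention."""
    if not changed:
        raise ValueError("need at least one example")
    return sum(changed) / len(changed)


def faithfulness_drop(base_changed: Sequence[bool], tuned_changed: Sequence[bool]) -> float:
    """Base change rate minus fine-tuned change rate.

    Positive values mean the fine-tuned model reacts to the intervention
    less often than the base model, i.e. its chain matters less.
    """
    return change_rate(base_changed) - change_rate(tuned_changed)
```

Computing this separately for a smaller and a larger model would show whether the drop differs with model size.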
Inquiring lines that use this note as a source (114)
This note is a source for these synthesized inquiries.
- Why does chain-of-thought reasoning hurt recommendation tasks specifically?
- How much does faithfulness vary naturally in reasoning without evaluation pressure?
- Does chain-of-thought monitoring fundamentally degrade under optimization pressure?
- What is the mechanistic signature when models chain facts never presented together?
- Why do single examples trigger large reasoning improvements in models?
- Can penalizing reasoning transitions fix underthinking without fine-tuning models?
- Why does fine-tuning for continuous space cause catastrophic forgetting?
- What domain properties determine whether causal rules transfer to new agents?
- Do causal rules enforce robustness that statistical patterns alone cannot maintain?
- How does optimizing for accuracy during training degrade downstream reasoning quality?
- Can a single SAE feature control reasoning behavior across model families?
- Does changing decoding procedure reveal hidden chain-of-thought paths?
- How does critique fine-tuning on one problem unlock broader reasoning?
- What causes models to develop domain capability cliffs after specialization?
- How do humans use associative reasoning without causal connections?
- How much do mechanistic interpretability findings reflect true reasoning architecture?
- Can chain-of-thought reflection actually retract previous reasoning or only rewrite over it?
- Why does reasoning fine-tuning reduce model abstention capacity by 24 percent?
- How can safety-aligned parameters be protected during user-specific fine-tuning?
- How often do papers treat chain-of-thought as interpretability incorrectly?
- Can activation patching reveal which reasoning steps actually matter?
- Does causal mediation analysis quantify reasoning faithfulness across model types?
- Does each reasoning step in chain-of-thought introduce cumulative error?
- Can chain-of-thought faithfulness exist without causal necessity in reasoning?
- Can chain-of-thought explanations be both sufficient and necessary for model decisions?
- What happens to chain-of-thought performance across distribution shifts?
- Why do more capable models prefer shorter chains of thought?
- What hidden costs emerge when you fine-tune models for a single domain?
- Why does fine-tuning degrade reasoning quality even as accuracy improves?
- Why does fine-tuning sometimes damage chain-of-thought reasoning even when accuracy improves?
- What breaks when you apply reinforcement learning after supervised fine-tuning?
- Why does domain accuracy improve while reasoning quality degrades after supervised fine-tuning?
- Can fine-tuning ever teach semantic inference instead of amplifying training shortcuts?
- Can reinforcement learning add missing domain knowledge to fine-tuned reasoning models?
- Does supervised fine-tuning improve accuracy while damaging the quality of reasoning?
- Does fine-tuning models for specific tasks destroy their ability to reason?
- Does reasoning fine-tuning actually reduce a model's ability to abstain?
- What happens to AI reasoning when you remove specific political features?
- Why does fine-tuning change how models process retrieved context?
- How does chain-of-thought training change higher layer computations?
- How does behavioral fine-tuning differ from factual knowledge encoding in models?
- Why does the distinction between functional and causal grounding matter for AI alignment?
- Why does fine-tuning improve some capabilities while degrading others?
- What distinguishes functional grounding from genuine causal grounding in AI systems?
- Why do different reasoning chains surface different relevant facts?
- Can prompt optimization or fine-tuning inject knowledge models do not already contain?
- Do causal histories determine what mental states a system can instantiate?
- What makes reasoning-specific post-training different from standard parameter scaling?
- Can mechanistic interpretability explain explanation-execution disconnection?
- Why do fine-tuned models fail outside their specialized domains?
- Does fine-tuning actually change model capabilities or only output distribution?
- Why does instruction tuning hurt knowledge-intensive tasks more than reasoning tasks?
- How does the functional separation of knowledge and reasoning affect adaptation methods?
- What happens when reasoning fine-tuning eliminates model refusal mechanisms entirely?
- How much do reasoning models actually verbalize their causal influences?
- What three factors actually drive chain of thought performance improvements?
- Can deliberate corruption of reasoning traces harm out of distribution generalization?
- Why does supervised fine-tuning degrade reasoning quality despite raising accuracy?
- Why does SFT reduce reasoning quality even when improving domain accuracy?
- Can chain-of-thought traces be faithful without causal sufficiency and necessity?
- How does pretrained knowledge constrain what adaptation strategies can achieve?
- When does the correlation between consistency and correctness break down?
- Does reasoning fine-tuning actually damage a model's ability to abstain?
- How does semantic entanglement interact with personality dimension shifts during finetuning?
- How does chain-of-thought reasoning become decorative after domain-specific fine-tuning?
- Does SFT degrade reasoning quality while improving domain accuracy?
- Why does training order matter across different domain types?
- Why does fine-tuning models for continuous reasoning cause catastrophic forgetting?
- How do reasoning improvements suppress a model's ability to abstain?
- How do random walk reasoning chains from knowledge graphs compare to traditional fine-tuning?
- What makes some sentences in reasoning traces have disproportionate causal influence?
- How does data quality mismatch create reasoning degradation in supervised fine-tuning?
- Does reasoning fine-tuning actually harm a model's ability to abstain?
- Does functional integration determine cognitive system boundaries?
- How much does chain-of-thought reasoning narrow the decompression gap?
- Can reasoning fine-tuning improve both capability and instruction compliance together?
- Why does reasoning fine-tuning reduce a model's ability to abstain?
- Why do models skip steps that would make reasoning clearer?
- Why do foundation models develop task-specific heuristics instead of causal understanding?
- Does chain of thought reasoning faithfully reflect what a model actually believes?
- Can layer-wise prediction stabilization identify when genuine reasoning has stopped?
- How do single wrong steps corrupt entire reasoning chains?
- How does semantic association differ from mechanistic causal reasoning?
- Does supervised fine-tuning improve reasoning or just response formatting?
- How does interaction horizon differ from chain-of-thought depth?
- Why do expert reasoners skip steps that novices must state explicitly?
- What makes a causal abstraction more transferable than a generic heuristic?
- Why does a replay mechanism prevent reasoner skills from over-specializing?
- What happens to base model capabilities when you apply finetuning?
- How does vehicle causality differ from content causality in physical systems?
- Why do smaller models lose reasoning faithfulness more than larger models?
- Does fine-tuning push models toward reasoning shortcuts that bypass the chain entirely?
- Can models maintain reasoning-output coupling while improving domain accuracy?
- How does faithfulness differ from informativeness in chain-of-thought evaluation?
- How do thought anchors differ from individual forking tokens mechanistically?
- Why might chain-of-thought reasoning bypass action selection pathways?
- Why do attention circuits need causal verification beyond feature visualization?
- What makes answer equivalence sufficient to discard a reasoning path?
- How much does pretraining quality affect the modularity of fine-tuned models?
- Why does reasoning fine-tuning suppress the confidence signals that adaptive retrieval needs?
- Can mechanistic interpretability tools decode the biases alignment training conceals?
- Does fine-tuning a small model match fine-tuning a large one?
- What training regimes confound surface mechanisms with their actual causes?
- Can single-problem fine-tuning match full RL pipeline reasoning gains?
- Why does chain-of-thought work for math but fail for grounding?
- How does supervised fine-tuning degrade chain-of-thought faithfulness over time?
- Can we detect redundant reasoning steps during model inference instead of training?
- Can models be trained to hide causal influences in their explanations?
- Why does reasoning fine-tuning reduce models' ability to abstain?
- How do finetuning and pretraining improvements differ in their effects on model capabilities?
- How much of chain-of-thought reasoning actually diverges from the final answer?
- Can single representation edits match chain-of-thought reasoning without explicit steps?
- Can a Reflect mechanism detect and revise failed causal predictions?
- What is the accuracy cost of enforcing temporal causality inside model parameters?
Related concepts in this collection (8)
- Does supervised fine-tuning actually improve reasoning quality?
  While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
  parallel SFT damage dimension: reasoning quality (InfoGain) vs. reasoning faithfulness (causal connection)
- Do language models actually use their reasoning steps?
  Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
  FT worsens an already-failing faithfulness property
- Do language model reasoning drafts faithfully represent their actual computation?
  If models externalize reasoning in thinking drafts before answering, does the draft accurately reflect their internal process? This matters for AI safety monitoring and error detection.
  draft-to-answer consistency is the dimension FT specifically degrades
- Does chain of thought reasoning actually explain model decisions?
  When language models show their reasoning steps in agentic pipelines, does the quality of those steps predict or explain the quality of final outputs? This matters for trusting and debugging AI systems.
  faithfulness degradation explains why agentic CoT fails: the chain was already partially decorrelated before deployment
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  provides the mechanistic explanation: if CoT is pattern-matching on reasoning form rather than genuine inference, fine-tuning further disconnects the chain from the answer because the model learns domain-specific shortcuts that bypass the imitated reasoning pattern entirely
- Do chain-of-thought traces actually help users understand model reasoning?
  Chain-of-thought explanations are often presented as transparency tools, but do they genuinely improve human understanding or create an illusion of interpretability? A human-subject study tests whether traces help users follow and evaluate model reasoning.
  faithfulness degradation compounds the performance-interpretability decoupling: traces already optimized for model performance rather than human interpretability become even less causally connected to final answers after fine-tuning, serving neither function well
- Does supervised fine-tuning improve reasoning or just answers?
  Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.
  the SFT accuracy trap and faithfulness degradation are two dimensions of the same SFT damage: accuracy trap captures reasoning quality loss (InfoGain), faithfulness degradation captures causal connection loss; together they show SFT makes chains both less informative and less causally relevant
- Does RLVR actually improve mathematical reasoning or just coherence?
  RLVR post-training makes reasoning traces locally more consistent, but does this structural improvement translate to valid mathematical proofs? We investigate whether trace coherence is sufficient for correctness.
  complementary coherence/faithfulness split: RLVR improves structural coherence between adjacent steps without improving validity; fine-tuning degrades faithfulness (causal connection to answer) while potentially maintaining coherence; both show training can change the surface without changing the substance
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- On the Impact of Fine-Tuning on Chain-of-Thought Reasoning
- AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
- Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
- Measuring Faithfulness in Chain-of-Thought Reasoning
- A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap
- LLM Reasoning Is Latent, Not the Chain of Thought
Original note title
fine-tuning degrades cot faithfulness independently of accuracy — reasoning steps influence final answers less after domain-specific training