Why does removing spurious cues sometimes hurt model performance?
Removing a spurious feature usually improves a model, but in some cases performance gets worse. This note explores whether that failure is a fundamentally different problem from traditional shortcut learning.
The literature on shortcut learning describes models that latch onto spurious surface features correlated with labels: lexical-overlap heuristics in NLI, sparse heuristic circuits in arithmetic, content effects in syllogistic reasoning. The standard prescription is to remove the spurious feature: take out the cue, and performance recovers because the model is forced to use the intended computation.
The Heuristic Override Benchmark shows that this prescription does not apply to the phenomenon it measures. Removing the heuristic cue (the distance "50 meters") makes models worse, not better: twelve of fourteen models drop in accuracy when the cue is removed. This is the opposite of what shortcut learning predicts, and it signals that something different is happening.
The authors locate the difference structurally. Shortcut learning is about filtering: the model needs to ignore the spurious feature and attend to the relevant one. Heuristic override is about composing: the model needs to integrate two things — a salient surface cue and an unstated feasibility constraint — and prioritize the constraint when they conflict. Both signals are integral to the problem; neither is noise. Removing the cue does not clean the input; it removes one of the two ingredients the composition requires, leaving the model less able to make any decision at all.
This connects the failure to the classical frame problem rather than to feature-level shortcut learning. The challenge is enumerating which unstated conditions are relevant — not detecting and filtering distractors. The two failure modes need different benchmarks, different mitigations, and different theoretical accounts.
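The cue-removal test that separates the two failure modes can be sketched in a few lines of Python. This is only an illustrative sketch, not the benchmark's own code: the names `AblationResult`, `classify`, and `summarize` are hypothetical, and the accuracy values are made-up placeholders rather than reported results. The idea is simply to compare each model's accuracy with the cue present and with it removed, then read the sign of the change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AblationResult:
    """Accuracy of one model with the surface cue present and with it removed."""
    model: str
    acc_with_cue: float
    acc_without_cue: float


def classify(result: AblationResult, tol: float = 0.0) -> str:
    """Label the direction of the accuracy change after the cue is removed."""
    delta = result.acc_without_cue - result.acc_with_cue
    if delta > tol:
        # Removing the cue helps: the pattern shortcut learning predicts.
        return "filtering-like"
    if delta < -tol:
        # Removing the cue hurts: the pattern the note calls heuristic override.
        return "composition-like"
    return "inconclusive"


def summarize(results: list[AblationResult]) -> dict[str, int]:
    """Count how many models fall into each pattern."""
    counts = {"filtering-like": 0, "composition-like": 0, "inconclusive": 0}
    for r in results:
        counts[classify(r)] += 1
    return counts


# Made-up placeholder numbers for illustration, NOT results from the benchmark.
toy = [
    AblationResult("model_a", 0.40, 0.55),
    AblationResult("model_b", 0.60, 0.45),
    AblationResult("model_c", 0.50, 0.50),
]
print(summarize(toy))
```

The `tol` parameter is an assumed knob for ignoring changes too small to matter; in practice it would need to reflect the evaluation's noise level.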
Inquiring lines that use this note as a source (22)
This note is a source for these synthesized inquiries.
- Why do only two of fourteen models improve when problem constraints are removed?
- Why does aggregate accuracy fail as a metric for rare harmful cases?
- What makes the frame problem distinct from feature-level shortcuts?
- Why do benchmark designers treat content effects as confounds?
- How does removing a spurious cue change LLM performance?
- How do surface statistical regularities enable correct outputs while degrading robustness?
- Why does mixed instruction data sometimes hurt specific model capabilities?
- What happens to AI reasoning when you remove specific political features?
- Why do models fail under distribution shift if accuracy metrics stay high?
- What happens when you remove core political features from a deep model?
- Why can data filtering fail to remove transmitted behavioral traits?
- What makes correcting a false assumption harder than just detecting it?
- Why do different models respond differently to spurious rewards?
- Does debiasing training data actually solve the bias problem in machine learning?
- How do the six trap categories map onto detection difficulty?
- Can a rejected-edit buffer work like hard negatives in contrastive learning?
- Can false positives from input filtering be reduced without sacrificing defense?
- Can group-relative normalization be modified to resist shortcut trajectories?
- What features does a sample reinforce when it moves bands?
- What mechanisms cause overly hard samples to degrade prior model performance?
- Why do structure-targeted training negatives fail to fix the underlying problem?
- Do feature extraction methods systematically miss computationally important complex features?
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning
- Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
- Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration
- LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!
- Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
- Reasoning Can Hurt the Inductive Abilities of Large Language Models
- LLMs Get Lost In Multi-Turn Conversation
- LLMs can implicitly learn from mistakes in-context
Original note title
LLM heuristic override is structurally distinct from shortcut learning because removing the spurious cue degrades rather than improves performance