Do language models ignore goals when surface cues conflict?
When a task has an obvious surface cue that contradicts an unstated requirement, do LLMs follow the cue or the actual goal? This matters because it reveals whether reasoning failures come from missing knowledge or from how models weight competing signals.
The car-wash problem went viral in February 2026: "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?" Every frontier LLM tested recommended walking. The correct answer is to drive, because you cannot wash a car that is not at the car wash. A 53-model evaluation found 42 recommended walking on a single pass, with only 5 answering correctly across ten trials.
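The gap between single-pass and repeated-trial results can be made concrete with a small scoring sketch. It assumes that "answering correctly across ten trials" means all ten trials were correct; the source does not spell this out. The model names and outcomes below are made up for illustration and are not data from the 53-model evaluation.

```python
from typing import Dict, List


def single_pass_correct(trials: List[bool]) -> bool:
    """Lenient score: only the first trial counts."""
    return bool(trials) and trials[0]


def strict_correct(trials: List[bool]) -> bool:
    """Strict score (assumed reading): every trial must be correct."""
    return bool(trials) and all(trials)


def summarize(results: Dict[str, List[bool]]) -> Dict[str, int]:
    """Count models passing under each scoring rule."""
    return {
        "models": len(results),
        "single_pass": sum(single_pass_correct(t) for t in results.values()),
        "strict": sum(strict_correct(t) for t in results.values()),
    }


# Hypothetical outcomes, ten trials each (True = recommended driving).
toy_results = {
    "model_a": [True] * 10,           # consistently correct
    "model_b": [True] + [False] * 9,  # lucky first pass only
    "model_c": [False] * 10,          # consistently wrong
}

summary = summarize(toy_results)
print(summary)
```

In this toy data two models pass on the first trial, but only one survives the strict rule, which is why single-pass accuracy can overstate how reliably a model follows the goal.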
The Heuristic Override Benchmark (HOB) generalized this single anecdote into a systematic 500-instance test crossing 4 heuristic families with 5 constraint families. Across 14 models the result is sharp: under strict 10/10 evaluation, no model exceeds 75 percent accuracy. Causal-behavioral analysis on six models showed that the Heuristic Dominance Ratio (HDR), which measures how much more the surface cue influences the decision than the goal does, ranged from 8.7× to 38×. Even in the least affected model, the distance cue is close to an order of magnitude more influential than the goal.
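One way to picture a dominance ratio of this kind is as the size of the cue's effect on the decision divided by the size of the goal's effect. The sketch below is an assumed operationalization for illustration only, not the benchmark's actual definition: the function names, the three prompt conditions, and the toy decisions are all hypothetical.

```python
from typing import List


def walk_rate(decisions: List[str]) -> float:
    """Fraction of responses that recommend walking."""
    return sum(d == "walk" for d in decisions) / len(decisions)


def heuristic_dominance_ratio(
    near_with_goal: List[str],
    far_with_goal: List[str],
    near_without_goal: List[str],
) -> float:
    """Assumed HDR: |effect of changing the cue| / |effect of changing the goal|.

    cue effect:  change in walk rate when only the distance changes
    goal effect: change in walk rate when only the goal changes
    """
    cue_effect = abs(walk_rate(near_with_goal) - walk_rate(far_with_goal))
    goal_effect = abs(walk_rate(near_without_goal) - walk_rate(near_with_goal))
    if goal_effect == 0:
        return float("inf")  # the goal has no measurable influence
    return cue_effect / goal_effect


# Hypothetical decisions, ten samples per condition.
near_with_goal = ["walk"] * 9 + ["drive"] * 1   # short distance, goal needs the car
far_with_goal = ["walk"] * 1 + ["drive"] * 9    # long distance, same goal
near_without_goal = ["walk"] * 10               # short distance, car not needed

hdr = heuristic_dominance_ratio(near_with_goal, far_with_goal, near_without_goal)
print(round(hdr, 2))
```

With these made-up numbers, changing the distance moves the walk rate by about 0.8 while changing the goal moves it by about 0.1, so the ratio comes out at roughly 8.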
Monotonicity curves further showed that all six models produced sigmoid conflict curves with the same shape, differing only in amplitude and crossover distance. The mapping from distance to decision is approximately context-independent: the goal does not gate the heuristic, it only weakly modulates it. This is not a tail-distribution problem at the edges of capability. It is a structural feature of how transformers handle conflicts between salient surface cues and unstated feasibility constraints. The cue dominates; the goal whispers.
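The difference between gating and weak modulation can be sketched with a logistic curve over distance. Everything here is an illustrative assumption rather than a fitted result: the logistic form, the parameter names (amplitude, crossover_m, steepness, shift_m), and the numeric values are hypothetical.

```python
import math


def p_drive(distance_m: float, amplitude: float, crossover_m: float,
            steepness: float = 0.02) -> float:
    """Sigmoid mapping from distance to the probability of recommending driving.

    amplitude sets the curve's ceiling; crossover_m is the distance at which
    the curve reaches half of that ceiling.
    """
    return amplitude / (1.0 + math.exp(-steepness * (distance_m - crossover_m)))


def weakly_modulated(distance_m: float, goal_requires_car: bool,
                     crossover_m: float = 500.0, shift_m: float = 50.0) -> float:
    """The goal only nudges the crossover; distance still drives the decision."""
    c = crossover_m - shift_m if goal_requires_car else crossover_m
    return p_drive(distance_m, amplitude=1.0, crossover_m=c)


def gated(distance_m: float, goal_requires_car: bool,
          crossover_m: float = 500.0) -> float:
    """The goal overrides the heuristic whenever the car must be present."""
    if goal_requires_car:
        return 1.0
    return p_drive(distance_m, amplitude=1.0, crossover_m=crossover_m)


# The car-wash case: 50 meters away, and the goal requires the car.
observed_like = weakly_modulated(50.0, goal_requires_car=True)
ideal = gated(50.0, goal_requires_car=True)
print(f"weakly modulated: {observed_like:.4f}, gated: {ideal:.1f}")
```

Under weak modulation the short distance keeps the probability of driving near zero even though the goal requires the car, whereas a gating model would answer "drive" regardless of distance.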
Inquiring lines that use this note as a source 8
This note is a source for these synthesized inquiries.
- Should LLMs query users back when presented with under-specified scenarios?
- How do models integrate conflicting signals in reasoning tasks?
- What role does terminal goal guarding play in model misalignment?
- What causes LLMs to ignore unstated constraints they know about?
- How does removing a spurious cue change LLM performance?
- What distinguishes surface cues from structural meaning in language understanding?
- What makes some interpretive postures stick while others fail to form?
- How can multiple conflicting values coexist in a single LLM system?
Related concepts in this collection 2
- Why do language models fail to use knowledge they possess?
  Large language models contain relevant world knowledge but often fail to activate it without explicit cues. This explores whether the bottleneck lies in knowledge storage or in the inference process that decides what background facts apply.
  characterizes the failure mode
- Are models actually reasoning about constraints or just defaulting conservatively?
  Do language models genuinely apply constraints when solving problems, or do they simply prefer harder options by default? Minimal pair testing reveals whether apparent reasoning success masks hidden biases.
  exposes the apparent-reasoning illusion
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning
- Reasoning Strategies in Large Language Models: Can They Follow, Prefer, and Optimize?
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning
- Premise Order Matters in Reasoning with Large Language Models
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- Cognitive Architectures for Language Agents
- The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
Original note title
LLMs systematically follow surface heuristics over implicit feasibility constraints with the heuristic 8 to 38 times more influential than the goal