Are models actually reasoning about constraints or just defaulting conservatively?
Do language models genuinely apply constraints when solving problems, or do they simply prefer harder options by default? Minimal pair testing reveals whether apparent reasoning success masks hidden biases.
The Heuristic Override Benchmark uses minimal pairs (the same surface heuristic, with and without the implicit constraint) to test whether apparent reasoning successes reflect actual reasoning. Twelve of fourteen models perform worse on the no-constraint variant than on the constraint-active variant, with drops of up to 38.5 percentage points. Only two models (GPT-OSS-120B at +13.8 points and GPT-OSS-20B at +11.0 points) improve when the constraint is removed.
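The per-model comparison described above can be sketched as a simple difference between the two variants. This is a minimal illustration, not the benchmark's own code: the function names and the accuracy figures in `scores` are made up, and accuracies are assumed to be reported in percent.

```python
def pair_asymmetry(acc_constraint_active: float, acc_no_constraint: float) -> float:
    """No-constraint accuracy minus constraint-active accuracy, in percentage points.

    A negative value means the model does worse once the constraint is removed.
    """
    return round(acc_no_constraint - acc_constraint_active, 1)


def flag_conservative_bias(scores: dict[str, tuple[float, float]]) -> dict[str, bool]:
    """Map each model name to True if its accuracy drops when the constraint is removed."""
    return {
        name: pair_asymmetry(active, removed) < 0
        for name, (active, removed) in scores.items()
    }


# Hypothetical (constraint-active, no-constraint) accuracies in percent.
# These numbers are illustrative only and are not taken from the benchmark.
scores = {
    "model_a": (80.0, 45.0),
    "model_b": (60.0, 72.0),
}

deltas = {name: pair_asymmetry(a, n) for name, (a, n) in scores.items()}
flags = flag_conservative_bias(scores)
```

Under these made-up numbers, `model_a` shows the pattern reported for twelve of the fourteen models (a negative delta), while `model_b` shows the pattern reported for the two that improve.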
This exposes a hidden mechanism behind apparent accuracy. When the constraint is present, the correct answer is the harder one (drive to the car wash that is 50m away, because the car itself has to get there). When the constraint is removed, the correct answer is the easier one (walk to the store that is 50m away). Models that default to recommending the harder option score correctly on constraint-active cases without doing any constraint reasoning. They are not solving the problem. They are reflexively choosing the harder, more conservative option, which happens to coincide with the constraint-required answer.
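A toy simulation makes the mechanism concrete. The sketch below assumes a hypothetical two-item schema (`PairItem`) and a constant policy, `always_harder`, that never reads the prompt. Neither comes from the benchmark; they only illustrate how a default can score perfectly on one half of a pair.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class PairItem:
    """One half of a minimal pair (hypothetical schema, not the benchmark's)."""
    prompt: str
    constraint_active: bool
    correct: str  # "harder" or "easier"


def always_harder(prompt: str) -> str:
    """A policy that ignores the prompt and always picks the harder option."""
    return "harder"


def accuracy(policy: Callable[[str], str], items: Iterable[PairItem]) -> float:
    """Fraction of items on which the policy returns the correct option."""
    items = list(items)
    return sum(policy(item.prompt) == item.correct for item in items) / len(items)


# Illustrative pair modeled on the example in the text.
pairs = [
    PairItem("The car wash is 50m away. Walk or drive?", True, "harder"),
    PairItem("The store is 50m away. Walk or drive?", False, "easier"),
]

active = [p for p in pairs if p.constraint_active]
removed = [p for p in pairs if not p.constraint_active]

acc_active = accuracy(always_harder, active)    # perfect score, no reasoning involved
acc_removed = accuracy(always_harder, removed)  # fails as soon as the constraint is gone
```

The constant policy is an extreme case; a real model with a strong but not absolute preference for the harder option would show the same asymmetry in a weaker form.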
The minimal-pair asymmetry is what catches this. Single-instance accuracy looks fine: the model recommended driving, and the right answer was driving. But the same model recommends driving even when walking would be correct, because the recommendation is not based on the constraint. The two of fourteen models that improve when the constraint is removed are the only ones whose constraint-active accuracy reflects genuine reasoning about the constraint. The rest are riding a conservative bias that aggregate metrics cannot distinguish from reasoning.
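One way to keep a one-sided default from passing as reasoning is to score at the pair level and give credit only when both members of a pair are answered correctly. The sketch below is one possible scoring rule, not necessarily the one the benchmark uses; the function names and the `results` data are hypothetical.

```python
def single_instance_accuracy(results: list[tuple[bool, bool]]) -> float:
    """Accuracy on the constraint-active member only, ignoring its twin."""
    if not results:
        raise ValueError("no pairs to score")
    return sum(active_ok for active_ok, _ in results) / len(results)


def joint_pair_accuracy(results: list[tuple[bool, bool]]) -> float:
    """Fraction of pairs where both the constraint-active and no-constraint items are correct."""
    if not results:
        raise ValueError("no pairs to score")
    return sum(active_ok and removed_ok for active_ok, removed_ok in results) / len(results)


# Hypothetical outcomes for four pairs:
# (correct on constraint-active item, correct on no-constraint item)
results = [(True, False), (True, False), (True, True), (False, True)]

single = single_instance_accuracy(results)
joint = joint_pair_accuracy(results)
```

With these made-up outcomes the single-instance score is 0.75 while the joint score is 0.25, which is the kind of gap the asymmetry argument predicts for a model that defaults to the harder option.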
Related papers in this collection
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning
- Reasoning Strategies in Large Language Models: Can They Follow, Prefer, and Optimize?
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- Can Large Language Models Reason and Optimize Under Constraints?
- When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs
- Premise Order Matters in Reasoning with Large Language Models
- On the Reasoning Capacity of AI Models and How to Quantify It
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Original note title
Conservative bias hides behind apparent reasoning success — most models perform worse when the constraint is removed than when it is present