Can benchmarks designed for shortcut learning detect heuristic override failures?
This explores whether the benchmarks built to catch shortcut learning — template-matching, memorization, output-format mimicry — can also catch a subtler failure: when a model needs to override a learned default heuristic and doesn't.
This reads the question as asking whether the test designs that expose shortcut-taking (out-of-distribution swaps, controlled variants, semantically stripped instructions) are the same tools that reveal heuristic-override failures: the cases where a model has to suppress a strong prior and apply an exception instead. The corpus suggests the overlap is real but partial. The shortcut benchmarks are excellent at proving a model leaned on a heuristic, and weaker at proving it could have overridden one if asked.

The cleanest probe is the out-of-distribution swap. When models are tested on N-1 variants, problems structurally identical except for the piece that defeats template-matching, even RL-fine-tuned models drop sharply, showing that the training sharpened memorization rather than installing a procedure (see "Do fine-tuned language models actually learn optimization procedures?"). The same logic shows up in latent optimization, where models recognize a problem as template-similar and emit plausible-but-wrong values rather than actually iterating (see "Do large language models actually perform iterative optimization?"). These benchmarks detect the heuristic; that's their whole point.
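As a rough illustration of the out-of-distribution swap, the sketch below scores a model on paired results: each base problem alongside its N-1 variant. The record type `PairedResult`, the function `swap_report`, and the `flip_rate` metric are hypothetical names introduced here for illustration; none of them come from the cited notes, and the demo data is invented. It is a minimal sketch of how such a comparison could be tallied, not the procedure any of the benchmarks actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedResult:
    """Outcome for one base problem and its N-1 variant (hypothetical record)."""
    problem_id: str
    base_correct: bool      # solved the in-distribution original
    variant_correct: bool   # solved the structurally identical N-1 variant


def swap_report(results: list[PairedResult]) -> dict[str, float]:
    """Summarize how much accuracy is lost once the template is broken."""
    if not results:
        raise ValueError("need at least one paired result")
    n = len(results)
    base_acc = sum(r.base_correct for r in results) / n
    variant_acc = sum(r.variant_correct for r in results) / n
    # Among problems solved in-distribution, count those failed on the
    # variant: the pattern expected when a template, not a procedure,
    # produced the original answer.
    base_solved = [r for r in results if r.base_correct]
    flipped = sum(not r.variant_correct for r in base_solved)
    flip_rate = flipped / len(base_solved) if base_solved else 0.0
    return {
        "base_accuracy": base_acc,
        "variant_accuracy": variant_acc,
        "accuracy_drop": base_acc - variant_acc,
        "flip_rate": flip_rate,
    }


if __name__ == "__main__":
    # Invented demo data, for illustration only.
    demo = [
        PairedResult("p1", True, False),
        PairedResult("p2", True, True),
        PairedResult("p3", True, False),
        PairedResult("p4", False, False),
    ]
    print(swap_report(demo))
```

A large `accuracy_drop` or `flip_rate` shows that the model leaned on the template. Neither number says whether the model could have overridden the heuristic if prompted to, which is the part of the question these benchmarks leave open.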
Sources (8 notes)
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
Across four game-based tasks, reasoning models scored below 25% on exception rules versus 55–65% for non-reasoning models. Chain-of-thought introduces math overuse, overgeneralization, and hallucinated constraints that amplify errors in negative evidence recognition.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions: 43% versus a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.
RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.