INQUIRING LINE

Does the Heuristic Override Benchmark measure enumeration or world knowledge?

This asks about a specific named test — the 'Heuristic Override Benchmark' — and whether it probes systematic enumeration or stored world knowledge; the corpus doesn't contain that benchmark by name, but it has a lot to say about the deeper distinction the question is reaching for: when a benchmark score reflects recall versus genuine capability.


Straight answer first: none of the notes in this collection name a 'Heuristic Override Benchmark,' so I can't tell you what that particular test measures. But the question underneath it — is a score capturing memorized patterns or real problem-solving? — is exactly the fault line this corpus keeps returning to, and it's worth following even without the specific paper.

The recurring finding is that many benchmark gains turn out to be recall wearing the costume of reasoning. The sharpest case study shows Qwen2.5-Math-7B reconstructing 54.6% of MATH-500 from partial prompts alone, then scoring 0.0% on the post-release LiveMathBench, which it could not have memorized (Does RLVR success on math benchmarks reflect genuine reasoning improvement?). The same pattern shows up in optimization: models recognize a problem as template-similar to something they've seen and emit plausible-but-wrong numbers rather than actually running the procedure (Do large language models actually perform iterative optimization?). So 'enumeration vs. world knowledge' is, in this collection's terms, often a false binary: what looks like either can really be schema retrieval.
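The partial-prompt check described above can be sketched in a few lines. This is a minimal illustration, not the cited study's procedure: the `complete` callable, the 0.6 prefix fraction, the exact-match criterion, and the toy "memorised" model are all assumptions made for the example.

```python
from typing import Callable, Sequence


def reconstruction_rate(
    problems: Sequence[str],
    complete: Callable[[str], str],
    prefix_fraction: float = 0.6,  # assumed split point, not taken from the notes
) -> float:
    """Fraction of problems whose held-back tail the model reproduces exactly."""
    if not problems:
        return 0.0
    hits = 0
    for text in problems:
        cut = int(len(text) * prefix_fraction)
        prefix, tail = text[:cut], text[cut:]
        # Strict exact match after trimming whitespace (an assumed criterion).
        if complete(prefix).strip() == tail.strip():
            hits += 1
    return hits / len(problems)


# Hypothetical stand-in for a model that has memorised one benchmark item.
MEMORISED = {"What is 2 + 2? Answer with a single integer."}


def toy_complete(prefix: str) -> str:
    """Return the memorised continuation if the prefix was seen, else nothing."""
    for item in MEMORISED:
        if item.startswith(prefix):
            return item[len(prefix):]
    return ""


problems = [
    "What is 2 + 2? Answer with a single integer.",  # seen in "training"
    "What is 7 * 6? Answer with a single integer.",  # unseen
]
rate = reconstruction_rate(problems, toy_complete)
```

A high rate on an older benchmark paired with a near-zero score on a newer one is the contamination signature the note describes; the rate on its own says nothing about reasoning ability.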

There's a clean experimental handle on telling them apart: distribution. The length of a reasoning trace tracks difficulty only when the problem resembles training data, and decouples entirely when it doesn't (Does longer reasoning actually mean harder problems?). Push past the familiar and what remains is fluent reasoning form over broken underlying logic (Does chain-of-thought reasoning actually generalize beyond training data?). The diagnostic move, then, isn't to ask whether a benchmark 'measures enumeration or knowledge' but whether it varies the instance structure enough that memorized schemas stop helping.
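One way to make the distribution handle concrete is to correlate trace length with difficulty separately for in-distribution and out-of-distribution items. The sketch below uses a hand-rolled Pearson coefficient; the difficulty scores and token counts are made-up illustrative numbers, not data from the cited experiments.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples of size >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant sample carries no correlation signal
    return cov / sqrt(vx * vy)


# Illustrative records only: (difficulty, trace_length_in_tokens).
in_dist = [(1, 110), (2, 205), (3, 310), (4, 395), (5, 520)]
out_dist = [(1, 400), (2, 150), (3, 390), (4, 160), (5, 410)]

r_in = pearson([d for d, _ in in_dist], [t for _, t in in_dist])
r_out = pearson([d for d, _ in out_dist], [t for _, t in out_dist])
```

With these toy numbers `r_in` is close to 1 and `r_out` is close to 0, which is the shape of the decoupling the notes report: length stops carrying information about difficulty once the instances leave the familiar distribution.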

That's also what makes a benchmark actually informative. Constraint-satisfaction tests, which demand genuine backtracking over unfamiliar instances rather than pattern-matched answers, drop frontier reasoning models (DeepSeek-R1 and o1-preview) to 20-23.6% exact match (Can reasoning models actually sustain long-chain reflection?). The low ceiling is the point: a benchmark earns its keep precisely when it strips away the option to recall.
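For readers who want to see what "genuine backtracking" means procedurally, here is a minimal sketch using N-queens as a stand-in constraint-satisfaction problem. N-queens is an assumption chosen for brevity; the notes do not say which problems the cited benchmark uses.

```python
from typing import List, Optional


def solve_n_queens(n: int) -> Optional[List[int]]:
    """Return one solution as a column index per row, or None if none exists."""
    cols: List[int] = []

    def safe(row: int, col: int) -> bool:
        # A new queen conflicts if it shares a column or a diagonal.
        for r, c in enumerate(cols):
            if c == col or abs(c - col) == row - r:
                return False
        return True

    def place(row: int) -> bool:
        if row == n:
            return True
        for col in range(n):
            if safe(row, col):
                cols.append(col)
                if place(row + 1):
                    return True
                cols.pop()  # backtrack: undo this choice, try the next column
        return False

    return cols if place(0) else None


def is_valid(placement: List[int]) -> bool:
    """Check a full placement independently of how it was produced."""
    n = len(placement)
    return all(
        placement[i] != placement[j]
        and abs(placement[i] - placement[j]) != j - i
        for i in range(n)
        for j in range(i + 1, n)
    )


solution = solve_n_queens(6)
```

The `cols.pop()` line is the step that cannot be faked by recall: a partial assignment that looked fine has to be abandoned when a later constraint fails. Scoring a model's answer with an independent checker such as `is_valid` is what makes an exact-match metric on such problems hard to game.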

If you came to this question because you're trying to evaluate a benchmark's validity, that reframe is the takeaway worth carrying: the useful question is rarely 'enumeration or world knowledge' but 'can a model pass this by recognizing rather than working?' — and the way you find out is by moving the test off the training distribution and watching whether the score survives.


Sources (5 notes)

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.
