How should benchmarks test whether models fit algorithms or patterns?

This explores how to design benchmarks that can tell the difference between a model that has genuinely learned a procedure (an algorithm it can run on new inputs) and one that has just memorized patterns that happen to produce right answers — the difference between competence and a convincing imitation of it.

The uncomfortable starting point across the corpus is that standard benchmarks can't tell a learned procedure from memorized patterns at all. A model can pass grammar tests by leaning on sentence length, word choice, and spelling rather than any grammatical rule (Can models pass tests while missing the actual grammar?); it can ace theory-of-mind tasks by exploiting templated artifacts and distribution biases instead of reasoning about mental states (Can language models solve ToM benchmarks without real reasoning?). A single accuracy number is blind to the difference, so the design question becomes: what extra structure do you build into the test to force the gap into the open?

The most direct technique is the out-of-distribution stress test. If a model has installed an algorithm, it should survive variations that leave the underlying procedure unchanged. The N-1 approach does exactly this: hold the procedure fixed but shift the surface. RL-fine-tuned models drop sharply on the variant while staying strong on the familiar version, which indicates the training sharpened template-matching rather than teaching a solution procedure (Do fine-tuned language models actually learn optimization procedures?). A related move probes whether the model is actually executing iterative steps: ask it to run a numerical method it can't shortcut, and it emits plausible-looking but wrong values because it recognized the problem as template-similar instead of computing through it (Do large language models actually perform iterative optimization?).
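The stress test described above can be sketched in a few lines. This is a toy illustration, not the N-1 construction from the cited note: the task (integer addition), the two stand-in "models" (`memorizer` and `procedure`), and the `stress_test` helper are all hypothetical names chosen for the example. The point is only that the familiar-versus-shifted gap separates the two, while familiar accuracy alone does not.

```python
import re
from typing import Callable, Dict, List, Tuple

# Each item: (familiar prompt, surface-shifted prompt, shared correct answer).
# The underlying procedure (add two integers) is identical in both phrasings.
Item = Tuple[str, str, int]

ITEMS: List[Item] = [
    ("3 + 4", "add 3 to 4", 7),
    ("10 + 5", "add 10 to 5", 15),
    ("2 + 9", "add 2 to 9", 11),
]


def memorizer(prompt: str) -> int:
    """Hypothetical pattern-matcher: a lookup table over familiar prompts only."""
    table = {familiar: answer for familiar, _, answer in ITEMS}
    return table.get(prompt, -1)  # -1 stands for "no matching template"


def procedure(prompt: str) -> int:
    """Hypothetical procedure-follower: extracts the operands and adds them."""
    return sum(int(tok) for tok in re.findall(r"\d+", prompt))


def accuracy(model: Callable[[str], int], pairs: List[Tuple[str, int]]) -> float:
    """Fraction of prompts the model answers correctly."""
    return sum(model(p) == a for p, a in pairs) / len(pairs)


def stress_test(model: Callable[[str], int], items: List[Item]) -> Dict[str, float]:
    """Score the same model on familiar and surface-shifted versions of each item."""
    familiar = accuracy(model, [(f, a) for f, _, a in items])
    shifted = accuracy(model, [(s, a) for _, s, a in items])
    return {"familiar": familiar, "shifted": shifted, "gap": familiar - shifted}


if __name__ == "__main__":
    print("memorizer:", stress_test(memorizer, ITEMS))
    print("procedure:", stress_test(procedure, ITEMS))
```

Both stand-ins score perfectly on the familiar prompts; only the gap column tells them apart, which is why the shifted set has to be scored alongside the original rather than instead of it.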

There's a sharper, less obvious benchmark design hiding here: vary *what the training touched* and see what moves. When a 1.5B model with LoRA-only post-training matches larger full-parameter RL models on reasoning tasks, that suggests the benchmark was measuring output-format organization, not new knowledge. The two turn out to be separable, and a good benchmark should know which one it's pricing (Can small models reason well by just learning output format?). Similarly, when supervised fine-tuning matches reinforcement learning on a task, that's a signal the task didn't require the deeper capability RL is supposed to add (Can language models solve ToM benchmarks without real reasoning?). Comparing cheap and expensive training recipes is itself a benchmark instrument: if the cheap one keeps up, your test wasn't measuring what you thought.
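One way to turn the recipe comparison into a routine check is sketched below. The function name `undiscriminating_tasks`, the tolerance value, and the task names and scores are all made-up placeholders for illustration; none of them come from the cited notes.

```python
from typing import Dict, List


def undiscriminating_tasks(
    cheap: Dict[str, float],
    expensive: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Return tasks where the cheap recipe scores within `tolerance` of the
    expensive one, i.e. tasks that fail to separate the two recipes."""
    flagged = []
    for task in sorted(cheap.keys() & expensive.keys()):
        if expensive[task] - cheap[task] <= tolerance:
            flagged.append(task)
    return flagged


if __name__ == "__main__":
    # Placeholder accuracies, invented purely to exercise the function.
    cheap_scores = {"task_a": 0.71, "task_b": 0.40}
    expensive_scores = {"task_a": 0.72, "task_b": 0.65}
    print(undiscriminating_tasks(cheap_scores, expensive_scores))
```

A flagged task is not necessarily a bad task; it is one whose score should not be read as evidence for whatever the expensive recipe was supposed to add.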

The deepest version of the worry is that behavior alone may never be enough. The Fractured Entangled Representation work shows two networks can produce identical outputs on every input while one has clean internal structure and the other is a tangled mess that shatters under perturbation or distribution shift (Can models be smart without organized internal structure?; Can AI pass every test while understanding nothing?). If perfect test performance can coexist with broken internal organization, then a benchmark that only reads outputs is structurally incapable of detecting the difference, which pushes evaluation toward representational and robustness probes, not just accuracy.
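A deliberately tiny example makes the logic concrete. It is not the construction used in the Fractured Entangled Representation work; it is a two-weight linear "network" invented here, with hypothetical names (`forward`, `perturb`, `max_output_shift`), showing that two parameterizations can agree on every output and still respond very differently when the same small perturbation is applied to their weights.

```python
from typing import List, Tuple

Weights = Tuple[float, float]


def forward(weights: Weights, x: float) -> float:
    """Two-layer linear map: hidden = w1 * x, output = w2 * hidden."""
    w1, w2 = weights
    hidden = w1 * x
    return w2 * hidden


def perturb(weights: Weights, eps: float) -> Weights:
    """Add the same small offset to every weight."""
    w1, w2 = weights
    return (w1 + eps, w2 + eps)


def max_output_shift(weights: Weights, inputs: List[float], eps: float) -> float:
    """Largest change in output caused by perturbing the weights by eps."""
    noisy = perturb(weights, eps)
    return max(abs(forward(noisy, x) - forward(weights, x)) for x in inputs)


# Both parameterizations compute the identity function (w1 * w2 == 1),
# so an output-only benchmark cannot tell them apart.
balanced: Weights = (1.0, 1.0)
lopsided: Weights = (64.0, 1.0 / 64.0)
inputs = [1.0, 2.0, 3.0]

if __name__ == "__main__":
    print([forward(balanced, x) for x in inputs])
    print([forward(lopsided, x) for x in inputs])
    print("balanced shift:", max_output_shift(balanced, inputs, eps=0.01))
    print("lopsided shift:", max_output_shift(lopsided, inputs, eps=0.01))
```

The output columns match exactly, while the perturbation probe separates the two by more than an order of magnitude. That is the sense in which a robustness probe can see something an accuracy score cannot.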

Two framings make benchmark design predictive rather than reactive. One says: characterize the task at the computational level first. Treating LLMs as autoregressive probability machines correctly predicted *in advance* which logically simple tasks (backwards alphabet, letter counting) would be hard, because their targets are low-probability, so a benchmark can be built to target known failure geometry rather than stumbling onto it (Can we predict where language models will fail?). The other says: some gaps are architectural, not training gaps. Autoregressive models can't retract emitted tokens, so they hit a ceiling on constraint-satisfaction problems that no amount of scale fixes (Why does autoregressive generation fail at constraint satisfaction?). A benchmark that wants to test for genuine algorithm-execution should deliberately include tasks where pattern-matching and procedure-following must diverge. The surprise the corpus leaves you with is that the cleanest such tests aren't harder questions; they're the *same* question wearing a different surface, scored against a model that should have learned the rule underneath.
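The architectural point about retraction can be illustrated with a classic constraint-satisfaction problem. The sketch below uses N-queens as an assumed example task, and `commit_only` is only an analogy for left-to-right generation without retraction, not a model of a transformer. It contrasts a solver that can never undo a placement with a standard backtracking solver that discards invalid partial assignments.

```python
from typing import List, Optional


def safe(cols: List[int], col: int) -> bool:
    """True if a queen in the next row at `col` attacks none already placed."""
    row = len(cols)
    return all(c != col and abs(c - col) != row - r for r, c in enumerate(cols))


def commit_only(n: int) -> Optional[List[int]]:
    """Place queens row by row, taking the first safe column and never retracting."""
    cols: List[int] = []
    for _ in range(n):
        choice = next((c for c in range(n) if safe(cols, c)), None)
        if choice is None:
            return None  # dead end, and earlier choices cannot be revised
        cols.append(choice)
    return cols


def backtracking(n: int, cols: Optional[List[int]] = None) -> Optional[List[int]]:
    """Standard depth-first search that undoes placements leading to dead ends."""
    cols = [] if cols is None else cols
    if len(cols) == n:
        return list(cols)
    for c in range(n):
        if safe(cols, c):
            cols.append(c)
            result = backtracking(n, cols)
            if result is not None:
                return result
            cols.pop()  # retract the placement and try the next column
    return None


if __name__ == "__main__":
    print("commit-only, n=4:", commit_only(4))
    print("backtracking, n=4:", backtracking(4))
```

On the 4x4 board the commit-only strategy gets stuck at the third row, while the backtracking solver finds a valid placement. A benchmark built from instances like this, where the first locally valid choice leads to a dead end, is one where committing and searching must diverge.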

Sources (9 notes)

Can models pass tests while missing the actual grammar?

BabyLM evaluations showed models can produce correct outputs by relying on sentence length, word choice, and orthography rather than grammatical structure. Standard benchmarks cannot distinguish these two generalization types without tests specifically designed to rule out surface heuristics.

Can language models solve ToM benchmarks without real reasoning?

Supervised fine-tuning matches reinforcement learning performance on ToM tasks, suggesting models exploit structural vulnerabilities rather than develop genuine reasoning. Distribution biases and templated artifacts allow surface-level pattern recognition to achieve competitive generalization.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Can small models reason well by just learning output format?

A 1.5B parameter model with LoRA-only post-training matched larger full-parameter RL models on reasoning tasks, suggesting RL teaches output format organization rather than new factual knowledge. This efficiency indicates reasoning and knowledge storage are separable capabilities.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can AI pass every test while understanding nothing?

The Fractured Entangled Representation hypothesis shows that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Why does autoregressive generation fail at constraint satisfaction?

The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.
