Can reasoning benchmarks separate logic from believability?

This explores whether our reasoning benchmarks can tell genuine logical inference apart from output that merely sounds convincing. The corpus is fairly blunt that most of them can't.

This reads the question as asking whether reasoning benchmarks actually isolate valid logic, or whether they reward believability: fluent, well-formed output that imitates reasoning without doing it. The collection leans hard toward the second answer, with several notes converging on the same uncomfortable point from different angles. The sharpest is the finding that logically *invalid* chain-of-thought prompts perform nearly as well as valid ones on hard benchmarks (Does logical validity actually drive chain-of-thought gains?). If you can scramble the logic and keep the score, the benchmark was never measuring logic; it was measuring the *form* of reasoning. A companion note frames this directly: CoT is constrained imitation of reasoning schemata learned in training, not genuine abstract inference (Does chain-of-thought reasoning reveal genuine inference or pattern matching?).
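The "scramble the logic and keep the score" test can be made concrete as a ratio: how much of the chain-of-thought gain survives when the exemplars are logically invalid. This is only a sketch. The function name `gain_retained` and the accuracy figures are hypothetical placeholders, not numbers from the cited note.

```python
def gain_retained(baseline: float, valid_cot: float, invalid_cot: float) -> float:
    """Share of the valid-CoT gain over baseline that invalid-CoT prompts keep."""
    gain = valid_cot - baseline
    if gain <= 0:
        raise ValueError("valid CoT shows no gain over baseline; ratio is undefined")
    return (invalid_cot - baseline) / gain


# Placeholder accuracies, made up for illustration only.
ratio = gain_retained(baseline=0.40, valid_cot=0.60, invalid_cot=0.58)
print(f"gain retained with invalid logic: {ratio:.0%}")
```

A ratio near 1 would suggest the benchmark rewards the form of the exemplars; a ratio near 0 would suggest validity is doing the work.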

The believability trap shows up across task types, not just arithmetic CoT. Theory-of-mind benchmarks turn out to be solvable through pattern matching alone: templated artifacts and distribution biases let surface recognition score competitively without any mental-state reasoning (Can language models solve ToM benchmarks without real reasoning?). And the tell that you're looking at imitation rather than logic is *how* performance breaks: it degrades predictably the moment you shift task, length, or format away from the training distribution (Does chain-of-thought reasoning actually generalize beyond training data?), and failures cluster at instance-level *unfamiliarity* rather than at genuine complexity thresholds (Do language models fail at reasoning due to complexity or novelty?). A model that reasons should fail when problems get genuinely harder; a model that pattern-matches fails when problems get unfamiliar. Most benchmarks don't separate those two cases.
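One way to separate those two cases is to tabulate accuracy along both axes at once. The sketch below assumes each benchmark result carries a complexity label and a flag for whether similar instances appeared in training. The `Result` fields, the helper `accuracy_grid`, and the toy data are all hypothetical and not drawn from any cited benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Result:
    complexity: str  # "low" or "high" (assumed labelling)
    familiar: bool   # True if similar instances were seen in training (assumed flag)
    correct: bool


def accuracy_grid(results: Iterable[Result]) -> Dict[Tuple[str, bool], float]:
    """Accuracy for every (complexity, familiar) cell that has at least one result."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        key = (r.complexity, r.familiar)
        totals[key] += 1
        hits[key] += int(r.correct)
    return {key: hits[key] / totals[key] for key in totals}


# Toy data, made up for illustration: ten results per cell.
results = (
    [Result("low", True, True)] * 9 + [Result("low", True, False)] * 1
    + [Result("high", True, True)] * 8 + [Result("high", True, False)] * 2
    + [Result("low", False, True)] * 3 + [Result("low", False, False)] * 7
    + [Result("high", False, True)] * 2 + [Result("high", False, False)] * 8
)
grid = accuracy_grid(results)

# Average accuracy lost when instances become unfamiliar, at fixed complexity.
novelty_drop = sum(grid[(c, True)] - grid[(c, False)] for c in ("low", "high")) / 2
# Average accuracy lost when instances become more complex, at fixed familiarity.
complexity_drop = sum(grid[("low", f)] - grid[("high", f)] for f in (True, False)) / 2

print(f"novelty drop: {novelty_drop:.2f}, complexity drop: {complexity_drop:.2f}")
```

In this toy data the novelty drop dwarfs the complexity drop, which is the pattern-matching signature described above. A benchmark that only reports one aggregate score cannot show this split.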

The most useful lateral move in the corpus is the work proposing what a benchmark *should* measure instead. Rather than scoring final-output plausibility, one note argues for three structural properties (traceability, counterfactual adaptability, and motif compositionality) as testable signatures of real reasoning (Can we measure reasoning quality beyond output plausibility?). Counterfactual adaptability is the key one: change a premise and see whether the conclusion moves the way logic demands. That's a direct probe of logic-over-believability, and it's exactly what plausibility-based scoring misses. Relatedly, the corpus notes that benchmark *improvement* and genuine reasoning *activation* are separable phenomena: RLVR can light up real reasoning patterns while the score gains may come from memorized contaminated data, because the two sit at different measurement levels (Can genuine reasoning activation coexist with contaminated benchmarks?).
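The counterfactual probe described above can be sketched in a few lines. This is an illustration under stated assumptions, not the method from the cited note: premises are modelled as simple if-then rules over named facts, and `forward_chain`, `counterfactual_adaptability`, and the two toy "models" are hypothetical names introduced here.

```python
from typing import Callable, FrozenSet, List, Tuple

# (premises, conclusion): if every premise holds, the conclusion holds.
Rule = Tuple[FrozenSet[str], str]
Model = Callable[[FrozenSet[str], List[Rule], str], bool]


def forward_chain(facts: FrozenSet[str], rules: List[Rule]) -> FrozenSet[str]:
    """Ground truth: everything derivable from the facts under the rules."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return frozenset(derived)


def counterfactual_adaptability(
    model: Model, facts: FrozenSet[str], rules: List[Rule], query: str
) -> float:
    """Fraction of single-fact deletions on which the model's answer tracks logic."""
    if not facts:
        raise ValueError("need at least one fact to edit")
    correct = 0
    for removed in facts:
        edited = facts - {removed}
        expected = query in forward_chain(edited, rules)
        correct += int(model(edited, rules, query) == expected)
    return correct / len(facts)


def logical_model(facts: FrozenSet[str], rules: List[Rule], query: str) -> bool:
    """Toy stand-in for a system that actually follows the premises."""
    return query in forward_chain(facts, rules)


def believable_model(facts: FrozenSet[str], rules: List[Rule], query: str) -> bool:
    """Toy stand-in for a system that ignores premises and says what sounds right."""
    return True


rules = [
    (frozenset({"rain"}), "wet_ground"),
    (frozenset({"wet_ground", "cold"}), "ice"),
]
facts = frozenset({"rain", "cold", "sunday"})

logical_score = counterfactual_adaptability(logical_model, facts, rules, "ice")
believable_score = counterfactual_adaptability(believable_model, facts, rules, "ice")
print(logical_score, believable_score)
```

Both toy models answer the unedited question correctly, so a final-answer score cannot tell them apart. Only the edited premises do: deleting "rain" or "cold" should flip the answer, while deleting the irrelevant "sunday" should not.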

There's a dissent worth sitting with. One note argues content-independence is the *wrong* target altogether: humans and LLMs show the same kind of content effects on reasoning tasks like the Wason selection task, so demanding logic divorced from believable content may be demanding something humans don't do either (Do language models fail reasoning tests that humans pass?). That reframes the whole question: maybe "separate logic from believability" is a cleaner ideal than reasoning, human or machine, actually instantiates. And a second caution: some apparent reasoning failures are really *execution* failures, where a model knows an algorithm but can't run it across enough text-only steps, which means a benchmark can also wrongly penalize sound logic for a bandwidth problem (Are reasoning model collapses really failures of reasoning?).
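For readers unfamiliar with the Wason task, the logic it tests is small enough to write down. The note does not spell out the task; the sketch below uses the classic abstract version (cards with a letter on one side and a number on the other, and the rule "if a card shows a vowel, its other side is even"). The function name `must_turn` is hypothetical.

```python
VOWELS = set("AEIOU")


def must_turn(visible: str) -> bool:
    """True if the hidden side of this card could falsify 'vowel implies even'."""
    if visible.isalpha():
        # A visible vowel is the antecedent: the hidden number might be odd.
        return visible.upper() in VOWELS
    # A visible odd number negates the consequent: the hidden letter might be a vowel.
    return int(visible) % 2 == 1


cards = ["A", "K", "4", "7"]
to_turn = [card for card in cards if must_turn(card)]
print(to_turn)
```

The logically required choice is the vowel and the odd number. The well-known human error is to pick the vowel and the even number instead, and accuracy typically improves when the same rule is dressed in familiar, believable content, which is the content effect the note refers to.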

The thing you may not have known you wanted to know is that the corpus points away from scoring *answers* and toward scoring *the shape of the derivation*: whether the chain is traceable, and whether it bends correctly under counterfactual edits. That shift is the real answer to your question. Benchmarks can separate logic from believability, but only once they stop grading the destination and start grading the road.

Sources (9 notes)

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Can language models solve ToM benchmarks without real reasoning?

Supervised fine-tuning matches reinforcement learning performance on ToM tasks, suggesting models exploit structural vulnerabilities rather than develop genuine reasoning. Distribution biases and templated artifacts allow surface-level pattern recognition to achieve competitive generalization.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) break at instance-novelty boundaries rather than at complexity thresholds. They fit instance-based patterns rather than generalizable algorithms, so a reasoning chain tends to succeed when the model was trained on similar instances, regardless of the chain's length.

Can we measure reasoning quality beyond output plausibility?

Research identifies traceability, counterfactual adaptability, and motif compositionality as testable measures of human-like reasoning. These structural properties reveal whether an agent genuinely reasons causally or merely mimics coherent speech.

Can genuine reasoning activation coexist with contaminated benchmarks?

Reinforcement learning with verifiable rewards (RLVR) can activate genuine reasoning patterns, while benchmark improvements may reflect memorization of contaminated datasets. The two operate at different measurement levels and can coexist without contradiction.

Do language models fail reasoning tests that humans pass?

Research shows both humans and LLMs succeed and fail along the same content-sensitivity axis in reasoning tasks such as the Wason selection task and natural language inference. Content-independence is not a meaningful criterion for distinguishing real reasoning from pattern matching.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
