Can simple structure perturbations reliably expose memorization in reasoning models?
This explores whether deliberately disrupting a problem's structure (shuffling premises, corrupting the reasoning trace, padding the input, changing format) reliably tells us when a 'reasoning' model is actually reasoning rather than replaying memorized patterns. The corpus shows perturbation is a sharp diagnostic, but a two-edged one.
The corpus suggests structural perturbations can expose memorization in reasoning models, but in a counterintuitive way: the most revealing perturbations are often the ones that *don't* hurt performance. The classic move is to break the logic and see if the answer survives. When researchers fed models systematically invalid chain-of-thought exemplars, accuracy on BIG-Bench Hard roughly matched accuracy with valid exemplars (Does logical validity actually drive chain-of-thought gains?). When traces were deliberately corrupted into irrelevance, models trained on them matched the solution accuracy of models trained on correct traces, and sometimes generalized better out of distribution (Do reasoning traces need to be semantically correct?). The fact that breaking the content changes so little is itself the tell: the model is keying on the *form* of reasoning, not its substance (Why does chain-of-thought reasoning fail in predictable ways?).
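As a rough illustration of the "break the logic, keep the form" test described above, the sketch below rotates the intermediate steps of an exemplar trace while leaving the final answer line in place, then compares accuracy under the valid and the corrupted exemplar. Everything here is hypothetical: `corrupt_trace`, `build_prompt`, `accuracy`, and `validity_gap` are illustrative names, `model` is assumed to be any callable from a prompt string to an answer string, and rotation is only a stand-in for whatever corruption scheme the cited studies used.

```python
from typing import Callable, List, Sequence, Tuple

Item = Tuple[str, str]  # (question, expected answer)


def corrupt_trace(steps: Sequence[str]) -> List[str]:
    """Rotate the intermediate steps by one; keep the final step last.

    Assumption: the last element is the answer line. The step order is
    broken while the surface form (same lines, same length) is preserved.
    """
    if len(steps) < 3:
        return list(steps)  # too short to reorder anything
    body = list(steps[:-1])
    return body[1:] + body[:1] + [steps[-1]]


def build_prompt(exemplar_steps: Sequence[str], question: str) -> str:
    """Prepend the exemplar trace to the question (hypothetical format)."""
    return "\n".join(exemplar_steps) + "\n\nQ: " + question + "\nA:"


def accuracy(model: Callable[[str], str], items: Sequence[Item],
             exemplar_steps: Sequence[str]) -> float:
    """Exact-match accuracy of `model` given one exemplar trace."""
    if not items:
        raise ValueError("items must be non-empty")
    correct = sum(
        model(build_prompt(exemplar_steps, q)).strip() == a for q, a in items
    )
    return correct / len(items)


def validity_gap(model: Callable[[str], str], items: Sequence[Item],
                 exemplar_steps: Sequence[str]) -> float:
    """Accuracy with the valid exemplar minus accuracy with the corrupted one."""
    valid = accuracy(model, items, exemplar_steps)
    broken = accuracy(model, items, corrupt_trace(exemplar_steps))
    return valid - broken
```

Under these assumptions, a gap near zero is the suspicious outcome: it would suggest the exemplar's logical content is not doing the work.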
A second family of perturbations works the opposite way: break the surface and watch performance collapse. Swapping in random, unsupporting premises doesn't stop models from confidently predicting entailment, as long as the hypothesis itself looks familiar from training; the model is responding to memorized propositions, not the premise-hypothesis relationship (Do LLMs predict entailment based on what they memorized?). Even a perturbation as crude as padding the input with irrelevant tokens drops reasoning accuracy from 92% to 68% at about 3,000 tokens of padding, far below the context window limit (Does reasoning ability actually degrade with longer inputs?). And systematically shifting task, length, or format reveals that chain-of-thought degrades along predictable seams, producing fluent but logically inconsistent reasoning rather than generalizing (Does chain-of-thought reasoning actually generalize beyond training data?).
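The padding perturbation is simple enough to sketch. The code below is a minimal illustration, not the FLenQA protocol: it counts whitespace-separated words as a crude proxy for tokens, prepends filler before the question, and reports exact-match accuracy at each padding length. `pad_prompt` and `accuracy_by_padding` are hypothetical names, and `model` is assumed to be any callable from a prompt string to an answer string.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Item = Tuple[str, str]  # (question, expected answer)


def pad_prompt(prompt: str, filler_words: List[str], n_pad: int) -> str:
    """Prepend `n_pad` filler words (cycled from `filler_words`) to a prompt."""
    if n_pad <= 0:
        return prompt
    if not filler_words:
        raise ValueError("filler_words must be non-empty when n_pad > 0")
    padding = [filler_words[i % len(filler_words)] for i in range(n_pad)]
    return " ".join(padding) + "\n\n" + prompt


def accuracy_by_padding(model: Callable[[str], str], items: Sequence[Item],
                        filler_words: List[str],
                        pad_lengths: Iterable[int]) -> Dict[int, float]:
    """Map each padding length to exact-match accuracy over `items`."""
    if not items:
        raise ValueError("items must be non-empty")
    results: Dict[int, float] = {}
    for n_pad in pad_lengths:
        correct = sum(
            model(pad_prompt(q, filler_words, n_pad)).strip() == a
            for q, a in items
        )
        results[n_pad] = correct / len(items)
    return results
```

A real run would use the model's own tokenizer for lengths and would vary where the padding sits relative to the relevant text; both are left out here to keep the sketch short.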
So 'reliably' deserves a caveat. Perturbations expose *something* consistently, but what they expose depends on which knob you turn. The deepest finding here is that the real fault line isn't task complexity or logical validity at all: it's instance-level novelty. Reasoning models fit instance-based patterns rather than generalizable algorithms, so any chain succeeds if the model has seen something similar, regardless of length or difficulty (Do language models fail at reasoning due to complexity or novelty?). That reframes the whole diagnostic: a perturbation reliably exposes memorization only insofar as it pushes an instance *off* the manifold of familiar examples. Cosmetic perturbations that stay near familiar instances won't trip anything.
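One way to make the "off the manifold" idea concrete is to score each perturbed instance by its distance from a set of familiar instances and report accuracy separately for near and far cases. The sketch below uses word-set Jaccard overlap as the distance, which is purely an assumption for illustration; the cited work does not prescribe this measure, and `jaccard`, `novelty`, `accuracy_by_novelty`, and the 0.5 threshold are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity between two strings (case-insensitive)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def novelty(instance: str, familiar: Sequence[str]) -> float:
    """1 minus the similarity to the closest familiar instance."""
    if not familiar:
        return 1.0
    return 1.0 - max(jaccard(instance, f) for f in familiar)


def accuracy_by_novelty(results: Sequence[Tuple[str, bool]],
                        familiar: Sequence[str],
                        threshold: float = 0.5) -> Dict[str, Optional[float]]:
    """Split (instance, was_correct) records into near/far and average each.

    Returns None for a bucket that received no instances.
    """
    buckets: Dict[str, List[bool]] = {"near": [], "far": []}
    for text, was_correct in results:
        key = "far" if novelty(text, familiar) >= threshold else "near"
        buckets[key].append(was_correct)
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}
```

If the instance-novelty account holds, accuracy should fall mainly in the "far" bucket, while a perturbation whose outputs all land in "near" would be one of the cosmetic ones that trips nothing. Lexical overlap is a weak proxy for similarity to training data, so this is best read as a template for the analysis rather than a usable metric.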
There's also a finer-grained answer to *where* the memorization lives. The STIM framework decomposes token-level memorization into local, mid-range, and long-range sources, and finds that local memorization (predicting the next token from the immediately preceding ones) accounts for up to 67% of reasoning errors, especially as complexity rises and the distribution shifts (Where do memorization errors arise in chain-of-thought reasoning?). This is the mechanistic complement to the perturbation experiments: it explains *why* surface-level structural cues matter so much.
The one note that complicates the pessimism is the distinction between procedural and factual knowledge. Analysis of five million pretraining documents shows reasoning generalization is driven by broad, transferable procedural knowledge, while factual recall depends on narrow document-specific memorization (Does procedural knowledge drive reasoning more than factual retrieval?). That suggests not everything a reasoning model does is memorization, which is precisely why a *single* perturbation can't be a clean detector. To reliably separate the two, you'd want perturbations targeted at the procedural-vs-factual boundary, not just generic structural noise.
Sources (9 notes)
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.
McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, indicating that they respond to memorized propositions rather than premise-hypothesis relationships.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.