INQUIRING LINE

Can simple structure perturbations reliably expose memorization in reasoning models?

This explores whether deliberately disrupting a problem's structure — shuffling premises, corrupting the reasoning trace, padding the input, changing format — reliably tells us when a 'reasoning' model is actually reasoning versus replaying memorized patterns, and the corpus shows perturbation is a sharp diagnostic but a two-edged one.


The corpus suggests structural perturbations can expose memorization, but in a counterintuitive way: the most revealing perturbations are often the ones that *don't* hurt performance. The classic move is to break the logic and see if the answer survives. When researchers fed models systematically invalid chain-of-thought exemplars, accuracy on BIG-Bench Hard roughly matched that of valid exemplars (Does logical validity actually drive chain-of-thought gains?). When training traces were deliberately made irrelevant to the problem, models trained on them kept their solution accuracy and sometimes generalized better out of distribution (Do reasoning traces need to be semantically correct?). The fact that breaking the content changes nothing is itself the tell: the model is keying on the *form* of reasoning, not its substance (Why does chain-of-thought reasoning fail in predictable ways?).
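
One way to make the 'break the logic' test concrete is sketched below in Python. It is a minimal illustration, not the protocol of any study cited here: shuffling step order is just one assumed way to invalidate an exemplar while preserving its form, and `corrupt_exemplar`, `build_prompt`, and `form_gap` are hypothetical names.

```python
import random


def corrupt_exemplar(steps, rng):
    """Return the same reasoning steps in a different order.

    Assumption: reordering keeps the surface form (same sentences,
    same length) while breaking the dependency between steps.
    """
    original = list(steps)
    # Nothing to reorder if there are fewer than two distinct steps.
    if len(set(original)) < 2:
        return original
    shuffled = list(original)
    while shuffled == original:
        rng.shuffle(shuffled)
    return shuffled


def build_prompt(question, steps, answer):
    """Format one few-shot exemplar as plain text."""
    lines = [f"Q: {question}"]
    lines += [f"Step {i}: {s}" for i, s in enumerate(steps, 1)]
    lines.append(f"A: {answer}")
    return "\n".join(lines)


def form_gap(acc_valid, acc_corrupted):
    """Accuracy lost when exemplar logic is broken but form is kept."""
    return acc_valid - acc_corrupted


if __name__ == "__main__":
    rng = random.Random(0)
    steps = ["Tom has 3 apples.", "He buys 2 more.", "3 + 2 = 5."]
    print(build_prompt("How many apples does Tom have?",
                       corrupt_exemplar(steps, rng), "5"))
```

A gap near zero between valid and corrupted exemplars would be the signature described above: the exemplar's form, not its logic, is doing the work.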

A second family of perturbations works the opposite way: break the surface and watch performance collapse. Swapping in random, unsupporting premises doesn't stop models from confidently predicting entailment, as long as the hypothesis itself looks familiar from training; the model is responding to memorized propositions, not the premise-hypothesis relationship (Do LLMs predict entailment based on what they memorized?). Even a perturbation as crude as padding the input with irrelevant tokens drops reasoning accuracy from 92% to 68% at about 3,000 tokens, far below the context window limit (Does reasoning ability actually degrade with longer inputs?). And systematically shifting task, length, or format reveals that chain-of-thought degrades along predictable seams, producing fluent but logically inconsistent output rather than generalizing (Does chain-of-thought reasoning actually generalize beyond training data?).
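
The two surface perturbations in this paragraph, padding and premise swapping, can be sketched as a small harness. This is a hedged illustration under stated assumptions: `model` is any callable from prompt to answer that you supply, `FILLER` is an arbitrary irrelevant sentence, and none of the function names come from the cited work.

```python
import random

FILLER = "The sky was a pale grey that morning."  # assumed irrelevant text


def pad_input(premises, question, n_filler, rng):
    """Insert filler sentences at random positions.

    The relative order of the real premises is preserved, so only
    the input length changes, not the logic of the task.
    """
    sentences = list(premises)
    for _ in range(n_filler):
        sentences.insert(rng.randrange(len(sentences) + 1), FILLER)
    return " ".join(sentences) + " " + question


def swap_premise(premise_pool, original_premise, rng):
    """Pick a random premise other than the original one."""
    candidates = [p for p in premise_pool if p != original_premise]
    if not candidates:
        raise ValueError("premise_pool has no alternative premise")
    return rng.choice(candidates)


def accuracy(model, items):
    """items: list of (prompt, gold_answer) pairs."""
    if not items:
        return 0.0
    correct = sum(1 for prompt, gold in items if model(prompt) == gold)
    return correct / len(items)


def accuracy_drop(model, clean_items, perturbed_items):
    """Positive values mean the perturbation hurt the model."""
    return accuracy(model, clean_items) - accuracy(model, perturbed_items)
```

For the premise swap, the gold label of the perturbed item should become 'no entailment', so a model that keeps predicting entailment shows up as an accuracy drop rather than as unchanged confidence.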

So 'reliably' deserves a caveat. Perturbations expose *something* consistently, but what they expose depends on which knob you turn. The deepest finding here is that the real fault line isn't task complexity or logical validity at all: it's instance-level novelty. Reasoning models fit instance-based patterns rather than generalizable algorithms, so a chain tends to succeed when the model was trained on similar instances, regardless of length or difficulty (Do language models fail at reasoning due to complexity or novelty?). That reframes the whole diagnostic: a perturbation reliably exposes memorization only insofar as it pushes an instance *off* the manifold of familiar examples. Cosmetic perturbations that stay near familiar instances won't trip anything.
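
A rough way to check whether a perturbation moved an instance away from familiar examples is to score its surface overlap with a reference set. The sketch below assumes word-trigram Jaccard overlap as a crude stand-in for distance from the training manifold, which cannot be observed directly; `novelty` and the example strings are hypothetical.

```python
def ngrams(text, n=3):
    """Set of lowercase word n-grams in a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def novelty(instance, familiar, n=3):
    """1 minus the best n-gram overlap with any familiar instance.

    0.0 means an exact surface match exists; 1.0 means no shared
    n-grams with anything in the reference set.
    """
    target = ngrams(instance, n)
    best = max((jaccard(target, ngrams(f, n)) for f in familiar),
               default=0.0)
    return 1.0 - best


if __name__ == "__main__":
    familiar = ["if alice is taller than bob and bob is taller than "
                "carol then alice is taller than carol"]
    cosmetic = ("if alice is taller than bob and bob is taller than "
                "dave then alice is taller than dave")
    distant = "sort the list by repeatedly swapping adjacent items"
    print(novelty(cosmetic, familiar), novelty(distant, familiar))
```

Under this proxy, a renamed entity scores as low novelty and a structurally different problem scores as high novelty, which matches the distinction drawn above between cosmetic and off-manifold perturbations. Surface overlap is only an approximation: two instances can share no n-grams and still be near each other in whatever space the model actually uses.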

There's also a finer-grained answer to *where* the memorization lives. The STIM framework decomposes token-level memorization into local, mid-range, and long-range sources, and finds that local memorization (predicting the next token from the immediately preceding ones) accounts for up to 67% of reasoning errors, especially as complexity rises and the distribution shifts (Where do memorization errors arise in chain-of-thought reasoning?). This is the mechanistic complement to the perturbation experiments: it explains *why* surface-level structural cues matter so much.

The one note that complicates the pessimism is the distinction between procedural and factual knowledge. Analysis of five million pretraining documents shows reasoning generalization is driven by broad, transferable procedural knowledge, while factual recall depends on narrow document-specific memorization (Does procedural knowledge drive reasoning more than factual retrieval?). That suggests not everything a reasoning model does is memorization, which is precisely why a *single* perturbation can't be a clean detector. To reliably separate the two, you'd want perturbations targeted at the procedural-vs-factual boundary, not just generic structural noise.


Sources (9 notes)

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

Do LLMs predict entailment based on what they memorized?

McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, indicating they respond to memorized propositions rather than premise-hypothesis relationships.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Where do memorization errors arise in chain-of-thought reasoning?

STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-model diagnostician. The question remains: *Can simple structure perturbations reliably expose memorization in reasoning models?* This is still live, but the answer depends on what 'reliable' and 'memorization' mean—and newer models and tools may have shifted the terrain.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. Key constraints emerged:
• Logically-invalid chain-of-thought prompts perform nearly as well as valid ones; models key on *form*, not substance (2023).
• Token-level memorization has three distinct sources; local memorization (next-token prediction from immediate context) accounts for up to 67% of reasoning errors, especially under distribution shift (2025).
• Reasoning performance degrades predictably with input length (92% → 68% accuracy when padding irrelevant tokens), even far below stated context limits (2024).
• Reasoning generalization is driven by broad procedural knowledge in pretraining, not narrow factual memorization—the two are mechanistically separable (2024–2025).
• CoT reasoning is distribution-bounded; it succeeds when instances stay near the training manifold, regardless of logical validity; off-manifold perturbations expose memorization, on-manifold ones do not (2025–2026).

Anchor papers (verify; mind their dates):
- arXiv:2307.10573 (2023): Invalid Logic, Equivalent Gains
- arXiv:2508.02037 (2025): Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
- arXiv:2411.12580 (2024): Procedural Knowledge in Pretraining Drives Reasoning
- arXiv:2508.01191 (2025): Is Chain-of-Thought Reasoning of LLMs a Mirage?

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For the form-over-substance finding: has activation steering, interpretability tooling, or finetuning on synthetic procedural data since enabled models to decouple form from logic? For the 67% local-memorization figure: do newer tokenizers, instruction-tuning regimes, or chain-length optimization change the ratio? For the distribution-boundedness claim: can in-context examples, retrieval-augmented generation, or multi-agent scaffolding push reasoning beyond the training manifold? Plainly separate which constraints still hold and which have been circumvented.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months. Look for papers claiming perturbations *fail* to reliably expose memorization, or claiming procedural knowledge cannot be isolated from memorization, or claiming CoT reasoning has escaped distribution bounds.
(3) **Propose 2 research questions** that assume the regime may have moved: (a) If perturbation-based exposure is unreliable because the boundary between memorization and procedural knowledge is ill-defined, what mechanistic dissection (attention pattern, gradient-based probing, causal intervention) would cleanly separate them? (b) If newer reasoning models have learned to generalize beyond the training manifold through some architectural or training innovation, what property of that model would a perturbation test need to have to reliably expose memorization in it?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
