Can irrelevant information reliably expose the limits of LLM reasoning?
This reads the question as: are distractor/irrelevant-content probes a dependable diagnostic for where LLM reasoning breaks down — and what do such probes actually reveal versus confound?
The corpus suggests that injecting irrelevant or misleading information is a genuinely revealing probe, but what it exposes is more specific (and more confoundable) than a simple competence gap. The cleanest version of this experiment shows up in the finding that LLMs are semantic, not symbolic, reasoners: when you decouple meaningful content from the logical task, performance collapses *even when the correct rules are sitting right there in the context* (Do large language models reason symbolically or semantically?). That collapse is exactly what irrelevant-information probes are designed to trigger: the model rides surface associations and training-distribution semantics rather than manipulating the structure of the problem.
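A distractor probe of the kind described above can be sketched as a paired accuracy comparison. This is a minimal illustration, not a method taken from the notes: the `Model` interface, the helper names, and the exact-match scoring are all assumptions, and `toy_model` is a deliberately naive stand-in rather than a real LLM.

```python
import re
from typing import Callable, Sequence

# Hypothetical interface: any callable that maps a prompt to an answer string.
Model = Callable[[str], str]


def inject_distractor(question: str, distractor: str) -> str:
    """Prepend an irrelevant sentence; the task itself is left unchanged."""
    return f"{distractor} {question}"


def accuracy(model: Model, prompts: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of prompts answered with an exact string match."""
    correct = sum(model(p).strip() == a for p, a in zip(prompts, answers))
    return correct / len(prompts)


def distractor_gap(model: Model, questions: Sequence[str],
                   answers: Sequence[str], distractor: str) -> float:
    """Clean accuracy minus accuracy with the distractor injected."""
    clean = accuracy(model, questions, answers)
    noisy_prompts = [inject_distractor(q, distractor) for q in questions]
    return clean - accuracy(model, noisy_prompts, answers)


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM (illustration only): sums every integer it sees."""
    return str(sum(int(n) for n in re.findall(r"\d+", prompt)))


questions = ["What is 2 + 3?", "What is 4 + 4?"]
answers = ["5", "8"]
print(distractor_gap(toy_model, questions, answers, "Maria owns 7 cats."))
```

The toy model is fooled only by distractors that contain numbers, which mirrors the point that the probe measures reliance on surface features. A real evaluation would swap in an actual model call and a more forgiving answer matcher.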
Several other notes describe the same vulnerability from different angles, which is what makes the probe feel reliable rather than incidental. Models accommodate false presuppositions baked into a prompt even when direct questioning proves they know the fact is wrong; distracting framing overrides stored knowledge (Why do language models accept false assumptions they know are wrong?). And the frame problem reappears as an enumeration failure: models fail not for lack of world knowledge but because they can't surface which background conditions are *relevant*, and forcing explicit enumeration raises accuracy from 30% to 85% (Do language models fail at identifying unstated preconditions?). Both are the flip side of the irrelevance problem: the same machinery that can't filter out irrelevant content also can't reliably pull in relevant-but-unstated content. Relevance handling, in either direction, is the soft spot.
The sharper, less obvious twist is that irrelevance probes don't *reliably* expose reasoning limits, because they can be confounded by other failure modes. One line of work argues that many apparent reasoning collapses are actually execution failures: text-only models know the algorithm but can't run it at scale, while tool-enabled models sail past the supposed cliff (Are reasoning model collapses really failures of reasoning?). Another shows that some failures are predictable purely from the model being an autoregressive probability machine: low-probability targets are hard regardless of logical simplicity (Can we predict where language models will fail?). So a distractor that tanks performance might be revealing genuine reasoning brittleness, or it might just be pushing the model toward a low-probability output or an execution wall. The probe is informative only if you control for what kind of limit you're hitting.
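One simple control is to pair each distracted item with its clean version and blame the distractor only when the clean version succeeds. The sketch below assumes per-item correctness flags have already been collected. The function name and the four labels are hypothetical, chosen here for illustration.

```python
from collections import Counter
from typing import Sequence


def attribute_failures(clean_correct: Sequence[bool],
                       distracted_correct: Sequence[bool]) -> Counter:
    """Label each paired item by how it behaves with and without the distractor."""
    labels = []
    for clean_ok, distracted_ok in zip(clean_correct, distracted_correct):
        if clean_ok and distracted_ok:
            labels.append("robust")
        elif clean_ok and not distracted_ok:
            # Only these failures can be attributed to the distractor itself.
            labels.append("distractor_induced")
        elif not clean_ok and not distracted_ok:
            # Fails even without the distractor: some other limit is in play.
            labels.append("baseline_failure")
        else:
            labels.append("recovered_under_distractor")
    return Counter(labels)


counts = attribute_failures(
    clean_correct=[True, True, False, False],
    distracted_correct=[True, False, False, True],
)
print(counts)
```

This pairing separates distractor-induced failures from baseline ones, but it does not by itself say whether a baseline failure is an execution wall or a low-probability target. Telling those apart would need further conditions, such as a tool-enabled run.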
There's also a reliability problem in the failures themselves. Potemkin understanding describes models that explain a concept correctly, fail to apply it, and then recognize their own failure, a pattern in which the explanation and execution pathways are functionally disconnected (Can LLMs understand concepts they cannot apply?). If a model can look competent on the explanation and incompetent on the application of the *same* concept, then a single irrelevant-information test can mislead you in either direction depending on which pathway it happens to engage.
The most useful takeaway is that the brittleness these probes expose is often a prompting artifact, not a hard ceiling, which means it's partly *fixable*, which in turn complicates 'reliably.' Structured argument prompts that force models to check warrants catch failures that plain chain-of-thought lets through (Can structured argument prompts make LLM reasoning more rigorous?), and modular 'cognitive tools' that isolate each reasoning operation lifted GPT-4.1 on competition math from 26.7% to 43.3% with no retraining (Can modular cognitive tools unlock reasoning without training?). A distractor that breaks a model under bare prompting may stop breaking it under scaffolded prompting, so irrelevant information reliably exposes the limits of a *given prompting setup*, not a fixed limit of the model. The reasoning frontier moves depending on how you ask.
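The dependence on prompting setup can be made explicit by scoring the same distracted items under several prompt templates. This is a rough sketch under stated assumptions: the template wording, the `accuracy_by_setup` helper, and the `Model` interface are all hypothetical. `toy_model` is a hand-built stand-in whose behaviour is hard-coded for illustration, so its numbers say nothing about any real model.

```python
import re
from typing import Callable, Dict, Sequence

# Hypothetical interface: any callable that maps a prompt to an answer string.
Model = Callable[[str], str]


def accuracy_by_setup(model: Model, setups: Dict[str, str],
                      questions: Sequence[str],
                      answers: Sequence[str]) -> Dict[str, float]:
    """Score the same items under each prompt template (keyed by setup name)."""
    results = {}
    for name, template in setups.items():
        correct = sum(
            model(template.format(question=q)).strip() == a
            for q, a in zip(questions, answers)
        )
        results[name] = correct / len(questions)
    return results


def toy_model(prompt: str) -> str:
    """Stand-in only: sums all integers, unless told to ignore irrelevant
    facts, in which case it reads just the text after the last period."""
    if "irrelevant" in prompt.lower():
        prompt = prompt.rsplit(".", 1)[-1]
    return str(sum(int(n) for n in re.findall(r"\d+", prompt)))


# Illustrative templates; the scaffold wording is an assumption.
setups = {
    "bare": "{question}",
    "scaffolded": "Ignore facts that are irrelevant to the question, then answer.\n{question}",
}
questions = [
    "Maria owns 7 cats. What is 2 + 3?",
    "A train left 9 minutes late. What is 4 + 4?",
]
answers = ["5", "8"]
print(accuracy_by_setup(toy_model, setups, questions, answers))
```

Reporting one accuracy per setup, rather than a single number per model, keeps the conclusion tied to the prompting condition that produced it.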
Sources (8 notes)
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.
LLMs struggle not from lacking world knowledge but from failing to bring background conditions forward as relevant constraints. Prompting that forces explicit enumeration of preconditions raises accuracy from 30% to 85%, revealing the frame problem persists in statistical systems.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.
Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.