Can models distinguish between logical impossibility and their own execution limits?
This explores whether a model can tell the difference between a problem that's genuinely unsolvable (logically impossible) and a problem it simply can't carry out because of its own limits — and the corpus suggests the field is only beginning to draw that line, often discovering that what looks like a reasoning wall is really an execution wall.
The most striking thread in the corpus is that researchers themselves struggle to separate a genuinely unsolvable problem from one a model simply cannot execute, which suggests the models struggle too. The headline case: when reasoning models 'collapse' on hard problems, the collapse is often misread as the model hitting a reasoning ceiling, when in fact it is running out of execution bandwidth. It knows the algorithm but cannot carry out the steps in text-only generation. Give the same model tools, and it sails past the supposed cliff (Are reasoning model collapses really failures of reasoning?). So the very distinction your question asks about, impossible versus can't-execute, is one the research community keeps getting wrong, which is a clue about how hard it is for a model to self-diagnose.
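The gap between knowing an algorithm and executing it can be made concrete with a small sketch. Tower of Hanoi is used here purely as an assumed illustration (the notes do not name a specific task), and `tokens_per_move` and `token_budget` are hypothetical numbers, not measurements of any model. The recursive procedure is a few lines long, yet the number of moves it must emit grows as 2^n - 1, so a written-out transcript outgrows any fixed output budget while a tool that runs the same procedure does not.

```python
def hanoi_moves(n, src="A", dst="C", via="B"):
    """Yield every move needed to shift n disks from src to dst."""
    if n == 0:
        return
    # Move the top n-1 disks out of the way, move the largest, then restack.
    yield from hanoi_moves(n - 1, src, via, dst)
    yield (src, dst)
    yield from hanoi_moves(n - 1, via, dst, src)


def fits_in_budget(n, tokens_per_move=10, token_budget=32_000):
    """Check whether writing out all moves fits a hypothetical output budget.

    Both defaults are illustrative assumptions, not properties of a real model.
    """
    total_moves = 2 ** n - 1  # closed form for the number of moves
    return total_moves * tokens_per_move <= token_budget


if __name__ == "__main__":
    for disks in (3, 11, 12, 15):
        print(disks, 2 ** disks - 1, fits_in_budget(disks))
```

Under these assumed numbers the transcript fits up to 11 disks and stops fitting at 12, even though the procedure that generates it never changes. That is the sense in which a failure at larger sizes can reflect execution limits and not a missing algorithm.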
The deeper finding is a kind of split between knowing and doing. Models can state a correct principle (87% accuracy) and then fail to apply it (64%), not because they lack the knowledge, but because the pathways for articulation and execution are dissociated (Can language models understand without actually executing correctly?). If a model's competence is structurally walled off from its knowledge, then asking it to distinguish 'this is impossible' from 'I can't do this' is asking it to introspect across exactly the seam where it's most blind. A related tell: when constraints are removed from a problem, twelve of fourteen models actually do *worse*, because they were never reasoning about feasibility. They were defaulting to conservative, harder-looking options that masqueraded as careful reasoning (Are models actually reasoning about constraints or just defaulting conservatively?). That's the opposite of recognizing impossibility; it's faking the recognition.
There's also evidence the failure isn't where it appears to be. Reasoning models break down not at complexity thresholds but at *unfamiliarity*: they fit instance-level patterns rather than general algorithms, so a chain works only if something similar was in training (Do language models fail at reasoning due to complexity or novelty?). And on negative evidence, meaning exception-based rules, the very stuff of 'this case is impossible', reasoning models actually underperform plain models (below 25% against 55–65%), hallucinating constraints and overgeneralizing (Why do reasoning models fail at exception-based rule inference?). Recognizing impossibility is fundamentally about negative evidence ('no solution exists here'), and that's precisely where chain-of-thought hurts rather than helps.
The most provocative corner of the corpus pushes toward a formal answer: some limits aren't executional at all, they're mathematical. Hallucination is provably inevitable for any computable model on infinitely many inputs, and no amount of internal self-correction can remove it; the only fix is external safeguards (Can any computable LLM truly avoid hallucinating?). That reframes your question: a model can't reliably distinguish logical impossibility from its own limits because some of its own limits *are* a form of formal impossibility, and it has no internal vantage point above itself to tell which is which. The work on predicting failures from the 'computational level' makes the same point from the outside. You can forecast where a model will fail by treating it as an autoregressive probability machine, meaning the limits are legible to an external observer in a way they aren't to the model itself (Can we predict where language models will fail?).
Where does that leave the optimistic reading? The most promising path isn't asking the model to self-diagnose; it's verifying the *process* from outside. Checking intermediate reasoning states rather than final answers lifts success from 32% to 87%, because most failures are process violations the model never flags itself (Where do reasoning agents actually fail during long traces?). The takeaway you might not have expected: the line between 'impossible' and 'I can't' may simply not be one a model can draw from the inside, but it's increasingly one we can draw from the outside, by watching how it works rather than asking it what it can do.
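Checking the process rather than the final answer can be sketched as follows. Everything here is illustrative: the trace format, the `first_violation` helper, and the two balance checks are hypothetical and are not taken from the cited work. The point is only that a trace can end in a correct-looking state while an intermediate step breaks a rule that a final-answer check never sees.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, int]          # hypothetical snapshot: account name -> balance
Check = Callable[[State], bool]  # returns True when the state is acceptable


def first_violation(
    trace: List[State], checks: Dict[str, Check]
) -> Optional[Tuple[int, str]]:
    """Return (step index, check name) for the first failed check, else None."""
    for step, state in enumerate(trace):
        for name, check in checks.items():
            if not check(state):
                return step, name
    return None


def final_answer_ok(trace: List[State], expected: State) -> bool:
    """Outcome-only check: looks at the last state and nothing else."""
    return trace[-1] == expected


# Toy invariants for a transfer task (assumed, for illustration only).
CHECKS: Dict[str, Check] = {
    "no_negative_balance": lambda s: all(v >= 0 for v in s.values()),
    "total_conserved": lambda s: sum(s.values()) == 100,
}

# Toy trace: the middle step overdraws account "a", then the end state recovers.
TRACE: List[State] = [
    {"a": 100, "b": 0},
    {"a": -20, "b": 120},
    {"a": 50, "b": 50},
]

if __name__ == "__main__":
    print(final_answer_ok(TRACE, {"a": 50, "b": 50}))
    print(first_violation(TRACE, CHECKS))
```

In this toy trace the outcome check passes while the step-level check flags step 1, which is the kind of unflagged process violation an external verifier is positioned to catch.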
Sources (8 notes)
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds when similar instances appeared in training.
Across four game-based tasks, reasoning models scored below 25% on exception rules versus 55–65% for non-reasoning models. Chain-of-thought introduces math overuse, overgeneralization, and hallucinated constraints that amplify errors in negative evidence recognition.
Three formal theorems prove that any computable LLM must hallucinate on infinitely many inputs, and internal mechanisms like self-correction cannot eliminate this mathematical constraint. External safeguards are therefore necessary, not optional.
By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed the predictions on tasks such as reciting the alphabet backwards and counting letters.
Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.