Why do reasoning LLMs fail at deeper problem solving?
Explores whether current reasoning models systematically search solution spaces or merely wander through them, and how this affects their ability to solve increasingly complex problems.
"Reasoning LLMs are Wandering Solution Explorers" provides the most rigorous formalization yet of why reasoning models fail as problem complexity increases. The claim: current RLLMs do not systematically explore solution spaces. They wander.
Systematic exploration requires three properties: (a) validity — the trace follows the reachability structure; (b) effectiveness — the trace contains at least one goal state; (c) necessity — every state in the trace contributes to goal discovery or dead-end elimination. Current models fail all three.
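The three properties can be made concrete with a small trace checker. The following is a minimal sketch under stated assumptions, not the paper's formal definitions: the toy graph, the function names, and especially the necessity check are hypothetical. Necessity is approximated by a simple proxy (no state is repeated, and nothing is explored after the first goal state), which is weaker than the property described above.

```python
from typing import Dict, List, Set

# Hypothetical toy problem: `children` maps each state to the states
# reachable from it in one step.
Graph = Dict[str, List[str]]


def is_valid(trace: List[str], children: Graph, start: str) -> bool:
    """Validity: the trace begins at `start`, and every later state is a
    child of some previously visited state (so backtracking is allowed)."""
    if not trace or trace[0] != start:
        return False
    seen: Set[str] = {start}
    for state in trace[1:]:
        if not any(state in children.get(parent, []) for parent in seen):
            return False
        seen.add(state)
    return True


def is_effective(trace: List[str], goals: Set[str]) -> bool:
    """Effectiveness: the trace contains at least one goal state."""
    return any(state in goals for state in trace)


def is_necessary(trace: List[str], goals: Set[str]) -> bool:
    """Necessity, approximated (assumption): no state is repeated and
    nothing is explored after the first goal state."""
    if len(set(trace)) != len(trace):
        return False
    for i, state in enumerate(trace):
        if state in goals:
            return i == len(trace) - 1
    return True


# Illustrative data: a depth-2 binary tree with a single goal leaf "G".
children: Graph = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
goals: Set[str] = {"G"}

systematic = ["A", "B", "D", "E", "C", "F", "G"]  # full depth-first sweep
wandering = ["A", "B", "D", "B", "D", "F"]        # revisits, then jumps to F

for name, trace in (("systematic", systematic), ("wandering", wandering)):
    print(name,
          is_valid(trace, children, "A"),
          is_effective(trace, goals),
          is_necessary(trace, goals))
```

In this toy example the systematic trace passes all three checks, while the wandering trace fails each one: it reaches F without ever visiting F's parent C, it never reaches the goal, and it revisits B and D.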
The formalization makes the failure quantifiable. A wandering RLLM performing depth-first search on a binary tree of depth d omits one of the two child nodes with probability p_w at each decision point, so its success probability drops exponentially with d. The degradation is not gradual; it is catastrophic. Problems that appear within reach at depth 5 become virtually impossible at depth 15, not because the model lacks reasoning ability but because it lacks search discipline.
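The exponential decay can be illustrated with a simplified model. This is a sketch under assumptions that are not taken from the paper: a single goal leaf, an omission probability (written p_w here) applied independently at each node on the goal path, and the omitted child chosen uniformly at random. Under those assumptions the goal-path child survives each decision with probability 1 - p_w/2, so the success probability is (1 - p_w/2)^d. The function names are hypothetical.

```python
import random


def success_probability(p_w: float, depth: int) -> float:
    """Closed form under the simplified model: at each of `depth` decisions
    the goal-path child is dropped with probability p_w / 2."""
    return (1.0 - p_w / 2.0) ** depth


def simulate(p_w: float, depth: int, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        reached_goal = True
        for _ in range(depth):
            # With probability p_w one child is omitted; half of the time
            # the omitted child is the one leading to the goal.
            if rng.random() < p_w and rng.random() < 0.5:
                reached_goal = False
                break
        if reached_goal:
            successes += 1
    return successes / trials


if __name__ == "__main__":
    for depth in (5, 10, 15):
        exact = success_probability(0.5, depth)
        estimate = simulate(0.5, depth)
        print(f"depth={depth:2d}  closed form={exact:.4f}  simulated={estimate:.4f}")
```

Under this model with p_w = 0.5, the closed form gives roughly 0.24 at depth 5 and roughly 0.013 at depth 15. The exact numbers depend entirely on the assumed p_w; only the exponential shape is the point.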
Four failure modes are identified:
- Invalid exploration: transitions violate the problem's reachability structure
- Unnecessary exploration: superfluous states that don't contribute to goal discovery
- Evaluation error: misinterpreting current state or executing planned moves erroneously
- Hallucinated conclusions: claiming solutions that don't satisfy problem constraints
The finding directly challenges the "more thinking tokens = better reasoning" narrative. A wandering model given more tokens does not explore more systematically; it wanders more extensively. This is the mechanism behind the related note "Does more thinking time always improve reasoning accuracy?": additional compute does not fix a structural search deficiency.
The exponential degradation result connects to the related note "Does policy entropy collapse limit reasoning performance in RL?". Entropy collapse reduces exploration diversity during training; wandering reduces exploration discipline during inference. Both are manifestations of the same problem: the model converges on familiar patterns rather than systematically covering the solution space.
Apple's three-regime confirmation. "The Illusion of Thinking" (Apple) provides independent confirmation through controllable puzzle environments with precise complexity manipulation. Three performance regimes emerge: (1) low complexity, where standard models outperform reasoning models and use fewer tokens; (2) medium complexity, where reasoning models gain an advantage through extended thinking; (3) high complexity, where accuracy for both model types collapses to zero. Near the collapse point, reasoning models reduce their reasoning effort despite having ample token budget, a counterintuitive behavioral scaling limit. Even providing explicit optimal algorithms does not prevent collapse, confirming that the bottleneck is execution, not conceptualization. The three-regime structure refines the wandering explorer thesis: wandering is harmful at low complexity (overthinking easy problems), partially beneficial at medium complexity (exploring toward solutions), and irrelevant at high complexity (no amount of wandering reaches the goal).
Inquiring lines that use this note as a source (105)
This note is a source for these synthesized inquiries.
- Why do foundation models develop heuristics instead of world models?
- Can surface heuristics override implicit constraints in domain-specific reasoning?
- Where do LLMs succeed at generation but struggle with evaluation?
- Why do LLM personas struggle with specificity in specialized domains like law?
- What specific execution barriers do LLM ideas encounter most frequently?
- Can evidence density alone shift an LLM from generation to reasoning?
- How much of LLM reasoning failure stems from missing knowledge versus signal weighting?
- What is the relationship between reasoning depth and verbalization requirements?
- Can latent reasoning architectures work as retrofits to existing models?
- Why do LLM outputs match researcher priors without solving tasks correctly?
- Do tool-enabled reasoning models close the gap on constraint satisfaction?
- Why do reasoning models fail on structurally unfamiliar instances?
- How do humans and LMs differ on multi-hop reasoning?
- Why does training format shape reasoning strategy more than domain?
- Why does LLM research ideation collapse into low diversity despite high novelty?
- What graph structures would enable transformational creative reasoning in LLMs?
- Should LLM reasoning be studied as latent state trajectories rather than surface text?
- Does partial trace guidance work better than curriculum learning for hard problems?
- When should an LLM engage extended reasoning versus responding directly?
- Can LLMs explain concepts correctly while failing to use them?
- Why do models automatically adjust reasoning length to problem difficulty?
- Why do LLM social behaviors undermine collaborative reasoning outcomes?
- How do search tasks differ from derivation tasks in reasoning efficiency?
- Why does semantic decoupling specifically break LLM reasoning abilities?
- Why do medical and mathematical tasks require fundamentally different model capabilities?
- How do LLMs and knowledge graphs work together in different integration patterns?
- Why does comparison reasoning generalize better than composition reasoning?
- Can latent reasoning in continuous space scale beyond supervised reasoning tasks?
- Can forcing warrant checking through structured prompts improve LLM reasoning?
- Does this optimism bias contribute to the knowing-doing gap in LLM decision-making?
- Why do simple math problems get worse with longer reasoning chains?
- Do reasoning models trade instruction following for deliberative capability?
- Can knowledge density explain why LLM writing feels coherent but fatiguing?
- Why do LLMs plateau on creativity tasks while humans reach further?
- Where do LLMs fail as knowledge systems compared to humans?
- Why do LLMs generate novel ideas but lack evaluative commitment?
- What internal mechanisms explain LLM reasoning and representation limits?
- Why can't LLMs reason from first principles or initial commitments?
- Why does extended reasoning fail for search and knowledge retrieval tasks?
- Do LLMs fail exploration because of context integration or computational limitations?
- What data presentation structures enable LLMs to learn decision-making from examples?
- How does structural complexity affect LLM performance differently than inferential complexity?
- Which knowledge types do LLMs handle better than humans in reasoning tasks?
- Can LLMs improve at simple deduction through different training approaches?
- Why do reasoning models struggle with self-evaluation and revision?
- How do LLMs default to surface-level strategies instead of genuine mental simulation?
- Do reasoning models overthink ill-posed questions instead of recognizing incompleteness?
- Do LLMs lack architectural scaffolding for compositional reasoning?
- How does structural complexity in sentences degrade LLM reasoning systematically?
- What explains the gap between perplexity performance and actual reasoning capability?
- How do beam search and MCTS traverse reasoning topologies?
- Why do reasoning models wander instead of searching systematically?
- Is the reasoning cliff actually a tool-use problem?
- Why do verbalized reasoning chains fail on certain problem classes?
- What makes constraint satisfaction problems epistemically cleaner than other reasoning tasks?
- What role does curriculum design play in reasoning emergence?
- Why do difficult problems force models to develop reasoning strategies?
- What distinguishes systematic search from wandering exploration in reasoning?
- Which constraint types do reasoning models handle best?
- Can LLM judges be trained to think more rigorously during evaluation?
- Why do LLMs generate novel ideas but struggle to evaluate them?
- Does LLM reasoning always match the outputs it generates?
- Can the LLM-Modulo framework extend solver integration to domain planning?
- How much reasoning depth do we actually need for most real-world tasks?
- Can reasoning models succeed at logic but fail at execution?
- Do reasoning failures stem from strategy or from calculation breakdown?
- How can one training example improve reasoning across thousands of unseen problems?
- Do search agents face their own overthinking threshold like reasoning models do?
- What is the optimal balance between search rounds and reasoning depth per round?
- Do reasoning models switch approaches when encountering local difficulty?
- Why do foundation models develop task-specific heuristics instead of causal understanding?
- Can extended thinking modes introduce genuine rhetorical exploration to LLMs?
- Can you control LLM reasoning strategy without fine-tuning the model?
- What mechanisms cause reasoning models to wander rather than focus?
- What happens when students encounter errors they cannot resolve through prompting alone?
- Why do reasoning model failures stem from execution rather than reasoning?
- Does structured decomposition improve LLM reasoning in other compound tasks?
- What distinguishes LLM Programs from chain-of-thought and agentic frameworks?
- What mechanism causes LLMs to plateau on numerical optimization tasks?
- Why do reasoning models fail to improve constrained optimization performance?
- How should organizations redesign workflows if LLMs cannot solve optimization directly?
- What concrete problems do LLMs solve at the computational level?
- Does RL amplify existing reasoning or create genuinely new computational strategies?
- Why does RL behavior differ between standard reasoning tasks and complex planning domains?
- What failure modes emerge when scheme classification feeds downstream reasoning pipelines?
- What happens to iterative search quality when reasoning depth is unconstrained?
- Why does the Chinese Room argument miss the deeper abstraction problem?
- Can reinforcement learning close the gap between LLM reasoning and action?
- Can symbolic solvers reliably replace LLM reasoning for logical tasks?
- How do reasoning-related features behave when trained on near-impossible problems?
- Does performative reasoning mask underlying uncertainty even on easy problems?
- When is numeric computation the real bottleneck versus reasoning depth?
- What makes deterministic recursive reasoning models underperform on multi-solution tasks?
- How does active reasoning through interaction differ from passive single-turn problem solving?
- What makes reasoning traces effective or ineffective for solving problems?
- How does neuro-symbolic design differ from pure LLM reasoning?
- Why do students learn better from explanations than from solving problems from scratch?
- Is reasoning failure caused by task complexity or training distribution gaps?
- Do different game types reveal different strategic reasoning capabilities in LLMs?
- Why does LLM performance improve when forecasting tasks include organized reasoning?
- How can we turn reasoning model failures into useful training signals?
- How does o1-style reasoning relate to learned search processes versus memorized solutions?
- How does question difficulty and breadth affect what models learn to reason?
- Why do LLMs reason fluently about causality but lack causal rigor?
- Can tools unlock reasoning strategies that require abstract insight beyond computation?
Related concepts in this collection (8)
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
  this provides the mechanism: additional tokens fund wandering, not systematic exploration
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  training-time collapse mirrors inference-time wandering
- Does self-revision actually improve reasoning in language models?
  When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability.
  self-revision is a specific form of wandering: revisiting explored states rather than covering new ones
- Why does parallel reasoning outperform single chain thinking?
  Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
  parallel chains explore independently and thus cover more space than a single wandering chain
- Does outcome-based RL diversity loss spread across unsolved problems?
  When RL concentrates probability mass on correct answers for solved problems, does that narrowing propagate to problems the model cannot yet solve? And if so, what are the separate mechanisms for preserving diversity during training versus at test time?
  training-time cause of inference-time wandering: outcome-based RL suppresses exploration diversity during training, which means the model enters inference with a narrowed repertoire of search strategies; wandering is partly a consequence of having lost systematic search diversity during RL training
- Can evolutionary search beat sampling and revision at inference time?
  Can LLMs evolve populations of solutions through recombination and selection to outperform simpler inference strategies? This matters because it could reveal whether biological-inspired search improves planning without formal problem definitions.
  architectural response to wandering: Mind Evolution's island-model population diversity maintains exploration discipline through parallel sub-populations that prevent the premature convergence and systematic exploration failure that single-trajectory wandering exhibits
- Do reasoning models switch between ideas too frequently?
  Research explores whether o1-like models abandon promising reasoning paths prematurely by switching to different approaches without sufficient depth, and whether penalizing such transitions could improve accuracy.
  complementary failure mode: wandering is insufficient spatial coverage of the solution space; underthinking is insufficient depth on any single path; a model can exhibit both simultaneously, producing long traces that wander between shallow explorations
- Why do reasoning models fail differently at training versus inference?
  Reasoning models exhibit two distinct failure modes (entropy collapse during training and variance inflation during inference) that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.
  wandering is an inference-time manifestation of the exploration-exploitation failure; entropy collapse at training time narrows the repertoire of search strategies, while wandering at inference time reflects the lack of systematic discipline those strategies would provide
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Reasoning LLMs are Wandering Solution Explorers
- Large Language Model Reasoning Failures
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models
- Reasoning Strategies in Large Language Models: Can They Follow, Prefer, and Optimize?
- Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning
Original note title
reasoning llms are wandering explorers not systematic searchers — four failure modes degrade success probability exponentially with problem depth