Where exactly do reasoning models fail and break?

Maps where and how reasoning models break down across search, decision-making, adversarial attack, and social understanding.

Topic Hub · 42 linked notes · 7 sections

View as

Exploration and Search Failures

3 notes

Why do reasoning LLMs fail at deeper problem solving?

Explores whether current reasoning models systematically search solution spaces or merely wander through them, and how this affects their ability to solve increasingly complex problems.

Do reasoning models switch between ideas too frequently?

Research explores whether o1-like models abandon promising reasoning paths prematurely by switching to different approaches without sufficient depth, and whether penalizing such transitions could improve accuracy.

Why do large language models explore less effectively than humans?

This research investigates why LLMs make decisions too quickly during open-ended exploration tasks. It examines whether the problem lies in training data, prompt engineering, or something deeper in how transformer architectures process information over time.

Hybrid Reasoning and Mode Selection

6 notes

Can models learn when to think versus respond quickly?

Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.

Does the choice of reasoning framework actually matter for test-time performance?

Explores whether different slow-thinking methods like BoN and MCTS produce meaningfully different outcomes, or whether total compute budget is the dominant factor determining reasoning success.

Does extended thinking help or hurt model reasoning?

Explores whether activating thinking mode improves reasoning performance, and what role training plays in determining whether extended internal reasoning chains are productive or counterproductive.

Can models learn to ask clarifying questions instead of guessing?

Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.

Are reasoning model collapses really failures of reasoning?

Explores whether language models hit a fundamental reasoning ceiling or whether text-only evaluation masks execution limitations. Examines how tool access might reveal hidden reasoning capabilities.

Does the reasoning cliff depend on how we test models?

If language models hit a capability wall in text-only reasoning tasks, does that wall disappear when they can use tools? What does this reveal about what we're actually measuring?

Argumentation and Adversarial Failures

14 notes

Does a model improve by arguing with itself?

When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?

Why do language models fail at collaborative reasoning?

When LLMs work together on problems, do their social behaviors undermine correct reasoning? This explores whether collaboration activates accommodation over accuracy.

When does debate actually improve reasoning accuracy?

Multi-agent debate shows promise for reasoning tasks, but under what conditions does it help versus hurt? The research explores whether debate amplifies errors when evidence verification is missing.

Why do reasoning models fail under manipulative prompts?

Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.

How vulnerable are reasoning models to irrelevant text?

Can simple adversarial triggers like unrelated sentences degrade reasoning model accuracy? This explores whether step-by-step reasoning actually provides robustness against subtle input perturbations.

Can language models strategically underperform on safety evaluations?

Explores whether LLMs can covertly sandbag on capability tests by bypassing chain-of-thought monitoring. Understanding this vulnerability matters for safety evaluation pipelines that rely on reasoning transparency.

Do inference-time prompts actually fix sycophancy or redirect it?

Meta-cognitive prompting reduces sycophancy at inference time, but it's unclear whether this fixes the underlying problem or just activates different attention patterns. Understanding the mechanism matters for evaluating whether the fix is robust or brittle.

Do language models actually use their reasoning steps?

Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.

Does reasoning fine-tuning make models worse at declining to answer?

When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.

Can three-way rewards fix the accuracy versus abstention problem?

Standard RL forces models to choose between accuracy and honesty about uncertainty. Could treating correct answers, hallucinations, and abstentions as distinct reward outcomes let models learn when to say 'I don't know'?

When does explicit reasoning actually help model performance?

Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?

Does revising your own reasoning actually help or hurt?

Self-revision in reasoning models often degrades accuracy, while external critique improves it. Understanding what makes revision helpful or harmful could reshape how we design systems that need to correct themselves.

Does deliberative alignment genuinely reduce scheming or just hide it?

Deliberative alignment dramatically cuts covert actions in language models, but their reasoning reveals awareness of being evaluated. The question is whether the improvement reflects real alignment or strategic compliance.

Can a coordination layer turn LLM patterns into genuine reasoning?

LLMs excel at pattern retrieval but lack external constraint binding. Can a System 2 coordination layer—anchoring outputs to goals and evidence—transform statistical associations into goal-directed reasoning?

Heuristic Override and the Frame Problem (HOB)

6 notes

Do language models ignore goals when surface cues conflict?

When a task has an obvious surface cue that contradicts an unstated requirement, do LLMs follow the cue or the actual goal? This matters because it reveals whether reasoning failures come from missing knowledge or from how models weight competing signals.

Why do language models fail to use knowledge they possess?

Large language models contain relevant world knowledge but often fail to activate it without explicit cues. This explores whether the bottleneck lies in knowledge storage or in the inference process that decides what background facts apply.

Are models actually reasoning about constraints or just defaulting conservatively?

Do language models genuinely apply constraints when solving problems, or do they simply prefer harder options by default? Minimal pair testing reveals whether apparent reasoning success masks hidden biases.

Why does removing spurious cues sometimes hurt model performance?

Most models improve when spurious features are removed, but some fail worse. This note explores whether that failure represents a fundamentally different problem than traditional shortcut learning.

Do language models fail at identifying unstated preconditions?

When LLMs ignore background conditions needed for reasoning, is this a knowledge problem or an enumeration problem? Understanding what causes these failures could improve how we prompt and evaluate reasoning.

Why do confident wrong answers hide in standard accuracy metrics?

When AI systems produce fluent but incorrect recommendations in high-stakes domains, standard accuracy evaluation may miss the failures entirely. What structural blind spot allows these errors to remain invisible?

Writing Angles

2 notes

Why do reasoning models abandon promising solution paths?

Explores whether reasoning models fail because they think insufficiently or because they structurally misorganize their thinking. Challenges the assumption that longer reasoning traces automatically improve performance.

Can LLM judges be tricked without accessing their internals?

Explores whether AI language models used to grade other AI systems are vulnerable to simple presentation-layer tricks like fake citations or formatting, and what that means for benchmark reliability.