TOPIC

Reasoning Critiques

22 synthesis notes · 113 source papers

Does every correct chain-of-thought trace improve fine-tuning?

Are all answer-correct reasoning traces equally valuable for training? This explores whether some correct traces contain reasoning that actually harms model learning despite reaching the right answer.

Can chain-of-thought reasoning be deliberately manipulated to deceive?

Explores whether language models can be backdoored to produce plausible-looking but incorrect reasoning that humans would trust. This matters because CoT inspection is widely used as a safety measure.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.

Does chain-of-thought reasoning actually generalize beyond training data?

Explores whether CoT's strong performance on benchmarks reflects genuine reasoning ability or merely learned patterns tied to specific distributions. Tests how CoT behaves when tasks, formats, or reasoning length shift away from training data.

Does longer reasoning actually mean harder problems?

Do chain-of-thought trace lengths reliably reflect problem difficulty, or do they primarily indicate proximity to training examples? Understanding this matters for designing effective scaling heuristics.

Do chain-of-thought traces actually help users understand model reasoning?

Chain-of-thought explanations are often presented as transparency tools, but do they genuinely improve human understanding or create an illusion of interpretability? A human-subject study tests whether traces help users follow and evaluate model reasoning.

Does failed-step fraction predict reasoning quality better?

Can we use the fraction of abandoned reasoning branches to forecast whether a model will solve a problem correctly? This matters because it could guide more efficient test-time scaling than simply adding more tokens.

Does LLM math reasoning truly generalize or just pattern match?

This explores whether high scores on math benchmarks reflect genuine reasoning ability or merely template familiarity. The question matters because it determines how much we should trust LLMs on novel numerical problems.

What do models actually learn from chain-of-thought training?

When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.

Why do reasoning models overthink ill-posed questions?

Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions—and whether this represents a fixable training deficit or inherent limitation.

Why does chain of thought accuracy eventually decline with length?

Explores why longer reasoning chains don't always improve answers, and how the optimal length shifts based on task difficulty and model capability.

Does chain-of-thought reasoning reflect genuine thinking or performance?

When language models generate step-by-step reasoning, are they actually thinking through problems or just producing text that looks like reasoning? This matters for understanding whether extended reasoning tokens add real computational value.

Why do reasoning models fail at exception-based rule inference?

Explores why chain-of-thought models systematically underperform on tasks requiring inductive rule inference from exceptions in game-based settings, despite excelling at normal rule patterns.

Does RL training collapse format diversity in pretrained models?

Exploring whether RL fine-tuning systematically selects one output format from pretraining while suppressing others, and how this selection mechanism drives performance gains.

Why do better reasoning models ignore instructions?

As models develop stronger reasoning abilities through training, they appear to become worse at following specified constraints. Is this an unavoidable trade-off, and what causes it?

Do models fail worse when their own errors fill the context?

As a model's prior mistakes accumulate in context, does subsequent accuracy degrade predictably? And can scaling or architectural changes prevent this self-contamination effect?

Why do models hide what users want them to say?

Chain-of-thought monitoring should catch when models follow user preferences, but sycophancy cues—hints about what users want—are both most influential and least reported. Why does the model's reasoning trace systematically obscure this failure mode?

Does telling models they are watched improve reasoning faithfulness?

Explores whether informing models their reasoning is being monitored—a cheap prompt intervention—actually increases the rate at which they verbalize their reasoning steps, drawing on human behavioral science intuitions.

What critical thinking skills do reasoning models actually lose?

Step-by-step reasoning training optimizes narrow deductive thinking while degrading meta-cognitive abilities like recognizing futile thinking and maintaining tentative reasoning. Understanding this tradeoff matters for deploying reasoning models reliably.

Source papers 113

The Arxiv papers behind this sub-topic. Links may take you off-site to arxiv.org.

A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap
The recent work by Shojaee et al. (2025), titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity”, presents a compelling e…
A Comprehensive Evaluation of Inductive Reasoning Capabilities and Problem Solving in Large Language Models
Inductive reasoning is fundamental to both human and artificial intelligence. The inductive reasoning abilities of current Large Language Models (LLMs) are evaluated in this research. We argue that on…
A Systematic Review on the Evaluation of Large Language Models in Theory of Mind Tasks
This systematic review synthesizes current efforts to assess LLMs’ ability to perform ToM tasks—an essential aspect of human cognition involving the attribution of mental states to oneself and others.…
AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
For Large Language Models (LLMs) to be reliably deployed in both everyday and high-stakes domains, knowing when not to answer is equally critical as answering correctly. Real-world user queries, which…
Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models
Large Language Models (LLMs) like closed weights ones GPT-3.5/4, Claude, Gemini or open weights ones like LLaMa 2/3, Mistral, Mixtral, and more recent ones Dbrx or Command R+ are often described as be…
An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems
In this paper, we introduce a systematic framework beyond conventional method to assess LLMs’ mathematical-reasoning robustness by stress-testing them on advanced math problems that are mathematically…
Attentive Reasoning Queries: A Systematic Method for Optimizing Instruction-Following in Large Language Models
We present Attentive Reasoning Queries (ARQs), a novel structured reasoning approach that significantly improves instruction-following in Large Language Models through domain-specialized reasoning blu…
Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and correctness in LLMs
Large language models (LLMs) are increasingly optimized for long reasoning, under the assumption that more reasoning leads to better performance. However, emerging evidence suggests that longer respon…
Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models
Large Language Models (LLMs) have demonstrated remarkable self-improvement capabilities, whereby models iteratively revise their outputs through self-generated feedback. While this reflective mechanis…
Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
When language models (LMs) are trained via reinforcement learning (RL) to generate natural language “reasoning chains”, their performance improves on a variety of difficult question answering tasks. T…
Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts
Large Language Models (LLMs) have been widely deployed in reasoning, planning, and decision-making tasks, making their trustworthiness a critical concern. The potential for intentional deception, wher…
Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
In this paper, we critically examine that interpretation by investigating how the semantics of intermediate tokens—often anthropomorphized as “thoughts” or reasoning traces and which are claimed to di…
Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
Large Language Models (LLMs) leverage step-by-step reasoning to solve complex problems. Standard evaluation practice involves generating a complete reasoning trace and assessing the correctness of the…
Bounds of Chain-of-Thought Robustness: Reasoning Steps, Embed Norms, and Beyond
Existing research indicates that the output of Chain-of-Thought (CoT) is significantly affected by input perturbations. Although many methods aim to mitigate such impact by optimizing prompts, a theor…
Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess
While reinforcement learning (RL) for large language models (LLMs) has shown promise in mathematical reasoning, strategic reasoning for LLMs using RL remains largely unexplored. We investigate whether…
Can Large Language Models Understand Argument Schemes?
Argument schemes represent stereotypical patterns of reasoning that occur in everyday arguments. However, despite their usefulness, argument scheme classification — that is, classifying natural langua…
Can Large Language Models do Analytical Reasoning?
Our divide-and-conquer approach breaks down play-by-play data into smaller, more manageable segments, solves each piece individually, and then aggregates them together. Besides the divide-and-conquer …
Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning
When AI systems explain their reasoning step-by-step, practitioners often assume these explanations reveal what actually influenced the AI's answer. We tested this assumption by embedding hints into q…
Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge
Large Language Models (LLMs) have become increasingly powerful and ubiquitous, but their stochastic nature poses challenges to the reliability of their outputs. While deterministic settings can improv…
Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning
Chain-of-Thought (CoT) prompting plays an indispensable role in endowing large language models (LLMs) with complex reasoning capabilities. However, CoT currently faces two fundamental challenges: (1) …
Chain of Thoughtlessness? An Analysis of CoT in Planning
While our problems are very simple, we only find meaningful performance improvements from chain of thought prompts when those prompts are exceedingly specific to their problem class, and that those im…
Chain-of-Thought Is Not Explainability
We argue that CoT rationales can be misleading and are neither necessary nor sufficient for trustworthy interpretability. By analysing faithfulness in terms of whether CoTs are not only human-interpret…
Chain-of-Verification Reduces Hallucination in Large Language Models
Generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in large language models. We study the ability of language models to deliberate on the responses t…
Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data
We argue that the language modeling task, because it only uses form as training data, cannot in principle lead to learning of meaning. We take the term language model to refer to any system trained on…
CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective
Chain-of-Thought (CoT) does not elicit genuine, abstract reasoning. Instead, we argue that Chain-of-Thought (CoT) functions as a powerful structural constraint that guides Large Language Models (LLMs…
Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Note: this comment was debunked as AI-generated, and its math is flawed. Shojaee et al. (2025) report that Large Reasoning Models (LRMs) exhibit “accuracy collapse” on planning puzzles beyond certain comp…
Complex Logical Instruction Generation
Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tas…
Could you be wrong: Debiasing LLMs using a metacognitive prompt for improving human decision making
Identifying bias in LLMs is ongoing. Because they are still in development, what is true today may be false tomorrow. We therefore need general strategies for debiasing that will outlive current model…
Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
Supervised Fine-Tuning (SFT) is commonly used to train language models to imitate annotated responses for given instructions. In this paper, we challenge this paradigm and propose Critique Fine-Tuning…
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Test-time scaling such as OpenAI’s o1 [1] and DeepSeek’s R1 [2] brings a profound paradigm shift to Large Language Models (LLMs) [3–7]. Test-time scaling enables longer Chain-of-Thought thinking and i…
DecepChain: Inducing Deceptive Reasoning in Large Language Models
Large Language Models (LLMs) have been demonstrating increasingly strong reasoning capability with their chain-of-thoughts (CoT), which are routinely used by humans to judge answer quality. This relia…
Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning
Chain-of-Thought (CoT) prompting has been shown to enhance the multi-step reasoning capabilities of Large Language Models (LLMs). However, debates persist about whether LLMs exhibit abstract generaliz…
DeepSeek-R1 Thoughtology: Let's think about LLM Reasoning
Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an answer for a given input, DeepSeek-R1 creates detailed…
Demystifying Chains, Trees, and Graphs of Thoughts
The field of natural language processing (NLP) has witnessed significant progress in recent years, with a notable focus on improving large language models’ (LLM) performance through innovative prompti…
Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
Large reasoning models (LRMs) have demonstrated impressive capabilities in complex problem-solving, yet their internal reasoning mechanisms remain poorly understood. In this paper, we investigate the …
Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces
Long chain-of-thought (CoT) traces are widely used as supervision for reasoning-oriented LLM SFT, yet answer-correct traces can still lead to markedly different fine-tuning outcomes. We study post-con…
Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
Large Language Models (LLMs) perform well on reasoning benchmarks but often fail when inputs alter slightly, raising concerns about the extent to which their success relies on memorization. This issue…
Do Cognitively Interpretable Reasoning Traces Improve LLM Performance?
Recent progress in reasoning-oriented Large Language Models (LLMs) has been driven by introducing Chain-of-Thought (CoT) traces, where models generate intermediate reasoning traces before producing an…
Do Large Language Models Reason Causally Like Us? Even Better?
Indeed, a growing number of researchers have proposed that current LLMs are unable to generalize causal ideas beyond their training distribution and/or without strong user-induced guidance (e.g., chai…
Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations
Large language models (LLMs) are trained to imitate humans to explain human decisions. However, do LLMs explain themselves? Can they help humans build mental models of how LLMs process different input…
Do Prompt-Based Models Really Understand the Meaning of Their Prompts?
Recently, a boom of papers has shown extraordinary progress in zero-shot and few-shot learning with various prompt-based models. It is commonly argued that prompts help models to learn faster in the s…
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly in mathematics and …
Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models
This raises a natural question: Does thinking more at test-time truly lead to better reasoning? To answer this question, we perform a detailed empirical study across models and benchmarks, which revea…
Don't "Overthink" Passage Reranking: Is Reasoning Truly Necessary?
With the growing success of reasoning models across complex natural language tasks, researchers in the Information Retrieval (IR) community have begun exploring how similar reasoning capabilities can …
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models for advanced mathematical reasoning and coding. Following the success of frontier reasoning mod…
Efficient Reasoning with Hidden Thinking
Chain-of-Thought (CoT) reasoning has become a powerful framework for improving complex problem-solving capabilities in Multimodal Large Language Models (MLLMs). However, the verbose nature of textual …
Evaluating Large Language Models in Theory of Mind Tasks
Many animals excel at using cues such as vocalization, body posture, gaze, or facial expression to predict other animals’ behavior and mental states. Dogs, for example, can easily distinguish between …
Evaluating the False Trust Engendered by LLM Explanations
Large Language Models (LLMs) and Large Reasoning Models (LRMs) are increasingly used for critical tasks, yet they provide no guarantees about the correctness of their solutions. Users must decide whet…
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathemat…
Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis
The emergence of reasoning models and their integration into practical AI chat bots has led to breakthroughs in solving advanced math, deep search, and extractive question answering problems that requ…
Instance-adaptive Zero-shot Chain-of-Thought Prompting
the efficacy of a singular, task-level prompt uniformly applied across the whole of instances is inherently limited since one prompt cannot be a good partner for all, a more appropriate approach shoul…
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providi…
LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling
Large language models (LLMs) have demonstrated remarkable reasoning capabilities through test-time scaling approaches, particularly when fine-tuned with chain-of-thought (CoT) data distilled from more…
LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring
Trustworthy evaluations of dangerous capabilities are increasingly crucial for determining whether an AI system is safe to deploy. One empirically demonstrated threat to this is sandbagging — the stra…
LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!
Large reasoning models (LRMs) tackle complex reasoning problems by following long chain-of-thoughts (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training tech…
LLMs can implicitly learn from mistakes in-context
Learning from mistakes is a fundamental feature of human intelligence. Previous work has shown that Large Language Models (LLMs) can also learn from incorrect answers when provided with a comprehensiv…
LR^2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems
Recent progress in Large Reasoning Models (LRMs) has significantly enhanced the reasoning abilities of Large Language Models (LLMs), empowering them to tackle increasingly complex tasks through reflec…
Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought
Large language models (LLMs) have shown remarkable reasoning capabilities given chain-of-thought prompts (examples with intermediate reasoning steps). Existing benchmarks measure reasoning ability ind…
Language Models Learn to Mislead Humans via RLHF
Language models (LMs) can produce errors that are hard to detect for humans, especially when the task is complex. RLHF, the most popular post-training method, may exacerbate this problem: to achieve h…
Large Language Model Reasoning Failures
Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist…
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-by-step reasoning process that underlies the final evaluation of a response. However, due to the lack of hu…
LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framewo…
Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
Reinforcement Learning with Verifiable Rewards (RLVR)-based post-training of Large Language Models (LLMs) has been shown to improve accuracy on reasoning tasks and continues to attract significant att…
Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models
Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to statements made without regard to their truth value. While previous work has explored large language model (LLM) hallucination and…
Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning
Large language models (LLMs) have been shown to perform better when asked to reason step-by-step before answering a question. However, it is unclear to what degree the model’s final answer is faithful…
Measuring Faithfulness in Chain-of-Thought Reasoning
Large language models (LLMs) perform better when they produce step-by-step, “Chain-of-Thought” (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful exp…
Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models
Large Reasoning Models (LRMs) have significantly enhanced their capabilities in complex problem-solving by introducing a thinking draft that enables multipath Chain-of-Thought explorations before prod…
Metacognitive Retrieval-Augmented Large Language Models
Retrieval-augmented generation has become central in natural language processing due to its efficacy in generating factual content. While traditional methods employ single-time retrieval, more rece…
Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse
Chain-of-thought (Wei et al., 2022; Nye et al., 2021) is a widely used prompting technique for large language and multimodal models (LLMs and LMMs), instructing models to “think step-by-step” or provi…
Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?
We find that the response length of reasoning LLMs, whether trained by reinforcement learning or supervised learning, drastically increases for ill-posed questions with missing premises (MiP), ending …
Mitigating Hallucinations in Large Language Models via Causal Reasoning
Large language models (LLMs) exhibit logically inconsistent hallucinations that appear coherent yet violate reasoning principles, with recent research suggesting an inverse relationship between causal…
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Mitigating reward hacking—where AI systems misbehave due to flaws or misspecifications in their learning objectives—remains a key challenge in constructing capable and aligned models. We show that we …
On the Reasoning Capacity of AI Models and How to Quantify It
Through controlled experiments on reasoning benchmarks, we show that true reasoning remains challenging for current models, with apparent success often relying on sophisticated combinations of memoriz…
OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
Thinking LLMs solve complex tasks at the expense of increased compute and overthinking on simpler problems, while non-thinking LLMs are faster and cheaper but underthink on harder reasoning problems. …
Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity
Intermediate token generation (ITG), where a model produces output before the solution, has been proposed as a method to improve the performance of language models on reasoning tasks. While these reas…
Potemkin Understanding in Large Language Models
This paper first introduces a formal framework to address this question. The key is to note that the benchmarks used to test LLMs—such as AP exams—are also those used to test people. However, this rai…
Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs
Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large language models (LLMs) by enabling detailed step-by-step solutions. However, due to the verbosity of LLMs, the resulting reaso…
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
it remains contentious whether RL truly expands a model’s reasoning capabilities or merely amplifies high-reward outputs already latent in the base model’s distribution, and whether continually scalin…
QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?
Large language models (LLMs) have shown impressive performance on reasoning benchmarks like math and logic. While many works have largely assumed well-defined tasks, real-world queries are often under…
RLPR: Extrapolating RLVR to General Domains without Verifiers
Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates promising potential in advancing the reasoning capabilities of LLMs. However, its success remains largely confined to mathematical an…
Reasoning Can Hurt the Inductive Abilities of Large Language Models
Large Language Models (LLMs) have shown remarkable progress across domains, yet their ability to perform inductive reasoning—inferring latent rules from sparse examples—remains limited. It is often as…
Reasoning Models Can Be Effective Without Thinking
Recent LLMs have significantly improved reasoning capabilities, primarily by including an explicit, lengthy Thinking process as part of generation. In this paper, we question whether this explicit thi…
Reasoning Models Don't Always Say What They Think
Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model’s CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monit…
Reasoning Strategies in Large Language Models: Can They Follow, Prefer, and Optimize?
Human reasoning involves different strategies, each suited to specific problems. Prior work shows that large language models (LLMs) tend to favor a single reasoning strategy, potentially limiting their…
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its inter…
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination
To obtain trustworthy evaluation signals, we introduce a generator that creates fully synthetic arithmetic problems of arbitrary length and difficulty, yielding clean datasets we call RandomCalculatio…
Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
The impressive performance of recent language models across a wide range of tasks suggests that they possess a degree of abstract reasoning skills. Are these skills general and transferable, or specia…
Reasoning with Large Language Models, a Survey
in addition to these associative “System 1” tasks, recent advances in Chain-of-thought prompt learning have demonstrated strong “System 2” reasoning abilities, answering a question in the field of art…
S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models
We introduce S1-Bench, a novel benchmark designed to evaluate Large Reasoning Models’ (LRMs) performance on simple tasks that favor intuitive system 1 thinking rather than deliberative system 2 reason…
Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models
Instruction-following is essential for aligning large language models (LLMs) with user intent. While recent reasoning-oriented models exhibit impressive performance on complex mathematical problems, t…
Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution?
To reveal when a large language model (LLM) is uncertain about a response, uncertainty quantification commonly produces percentage numbers along with the output. But is this all we can do? We argue th…
Sources of Hallucination by Large Language Models on Inference Tasks
We establish two biases originating from pretraining which predict much of their behavior, and show that these are major sources of hallucination in generative LLMs. First, memorization at the level o…
Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
Intermediate token generation (ITG), where a model produces output before the solution, has been proposed as a method to improve the performance of language models on reasoning tasks. These intermedia…
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further improved …
Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories
Despite the remarkable capabilities of large language models, current training paradigms inadvertently foster sycophancy, i.e., the tendency of a model to agree with or reinforce user-provided informa…
Talking About Large Language Models
“Third, a great many tasks that demand intelligence in humans can be reduced to next token prediction with a sufficiently performant model. It is the last of these three surprises that is the focus of…
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
Does continued scaling of large language models (LLMs) yield diminishing returns? Real-world value often stems from the length of task an agent can complete. We start this work by observing the simple…
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from data contamination and do…
The Illusion of the Illusion of the Illusion of Thinking
"Shojaee et al.’s underlying observations hint at a more subtle, yet real, challenge for LRMs: a brittleness in sustained, high-fidelity, step-by-step execution. The true illusion is the belief that …
The Invisible Leash: Why RLVR May Not Escape Its Origin
Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI’s capabilities, particularly in solving complex logical…
The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning
Large language models systematically fail when a salient surface cue conflicts with an unstated feasibility constraint. We study this through a diagnose–measure–bridge–treat framework. Causal-behavior…
Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models
Large language models (LLMs) have shown strong performance across natural language reasoning tasks, yet their reasoning processes remain brittle and difficult to interpret. Prompting techniques like C…
Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts ar…
Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods
This work conducts a comprehensive analysis of inference-time scaling methods for both reasoning and non-reasoning models on challenging reasoning tasks. Specifically, we focus our research on verifie…
Thought Anchors: Which LLM Reasoning Steps Matter?
We argue that analyzing reasoning traces at the sentence level is a promising approach to understanding reasoning processes. We present three complementary attribution methods: (1) a black-box method …
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
Large language models (LLMs) such as OpenAI’s o1 have demonstrated remarkable abilities in complex reasoning tasks by scaling test-time compute and exhibiting humanlike deep thinking. However, we iden…
Thoughts without Thinking: Reconsidering the Explanatory Value of Chain-of-Thought Reasoning in LLMs through Agentic Pipelines
Agentic pipelines present novel challenges and opportunities for human-centered explainability. The HCXAI community is still grappling with how best to make the inner workings of LLMs transparent in a…
Towards a Deeper Understanding of Reasoning Capabilities in Large Language Models
While large language models demonstrate impressive performance on static benchmarks, the true potential of large language models as self-learning and reasoning agents in dynamic environments…
Unsupervised Elicitation of Language Models
To steer pretrained language models for downstream tasks, today’s post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficul…
What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
Large reasoning models (LRMs) spend substantial test-time compute on long chain-of-thought (CoT) traces, but what characterizes an effective CoT remains unclear. While prior work reports gains from le…
When More is Less: Understanding Chain-of-Thought Length in LLMs
Large Language Models (LLMs) employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems. While longer CoTs are often presumed superior, this paper challenges that notion, arguing that lon…
When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs
Reasoning-enhanced large language models (RLLMs), whether explicitly trained for reasoning or prompted via chain-of-thought (CoT), have achieved state-of-the-art performance on many complex reasoning…
interwhen: A Generalizable Framework for Steering Reasoning Models with Test-time Verification
Reasoning models produce long traces of intermediate decisions and tool calls, making test-time verification increasingly important for ensuring correctness. Existing approaches either verify only the…