TOPIC

Chain-of-Thought and Reasoning Methods

23 synthesis notes · 76 source papers

Why do models fail at asking good questions during interaction?

When models must actively seek information through questions rather than receive it passively, they struggle dramatically. This explores why GPT-4o plateaus at 35% accuracy and whether training or prompting can fix the underlying deficit.

Can minimal reasoning chains match full explanations?

Does removing all explanatory text from chain-of-thought reasoning preserve accuracy? This tests whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained.

Can reasoning models actually sustain long-chain reflection?

Tests whether large reasoning models genuinely perform self-correction and backtracking, or merely simulate it fluently. Uses constraint satisfaction problems where performance cannot be faked by surface plausibility.

Why does autoregressive generation fail at constraint satisfaction?

Explores whether the 20-23% performance ceiling on constraint satisfaction benchmarks reflects model limitations or a fundamental architectural mismatch between how LLMs generate tokens and how constraint solvers need to work.

Why do chain-of-thought examples fail across different conditions?

Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.

Can one statistical measure serve dual purposes in RL training?

Explores whether cross-rollout variance can simultaneously weight important tokens and filter low-signal queries, potentially unlocking efficiency gains in reasoning tasks without human labels.

How quickly do errors compound during model self-training?

When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.

Why do models trust their own generated answers?

Can language models reliably detect their own errors through self-evaluation? This explores whether the same process that generates answers can objectively assess their correctness.

Do large language models make the same causal reasoning mistakes as humans?

Research on collider structures reveals whether LLMs share human biases in causal inference. This matters because if both fail identically, collaboration might reinforce rather than correct errors.

Can longer reasoning chains eliminate model sensitivity to input noise?

Does adding more chain-of-thought steps eventually make language models robust to perturbations? This matters because it determines whether extended reasoning is a viable defense against adversarial attacks.

Can small models reason well by just learning output format?

Does reasoning performance depend primarily on adapting how models express outputs rather than acquiring new knowledge? The Tina research tests this by applying LoRA to a 1.5B model during reasoning training.

What alignment data structure best trains reasoning generalists?

Explores whether preference trees—with diverse reasoning chains, multi-turn critique loops, and pairwise contrasts—offer a structured way to build alignment datasets that improve open-model reasoning across domains.

Can models recognize question difficulty before they reason?

Do reasoning language models encode implicit knowledge of problem difficulty in their hidden states, even before generating solution steps? And if so, why don't they act on this knowledge?

Can reasoning topologies be formally classified as graph types?

This explores whether Chain of Thought, Tree of Thought, and Graph of Thought represent distinct formal graph structures with different computational properties. Understanding this matters because the topology itself determines what reasoning strategies are possible.

Source papers 76

The arXiv papers behind this sub-topic. Links may take you off-site to arxiv.org.

A Survey of Calibration Process for Black-Box LLMs
Large Language Models (LLMs) demonstrate remarkable performance in semantic understanding and generation, yet accurately assessing their output reliability remains a significant challenge. While numer…
A Survey on Diffusion Language Models
Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterati…
Activation Steering for Chain-of-Thought Compression
Large language models (LLMs) excel at complex reasoning when they include intermediate steps, known as chains of thought (CoTs). However, these rationales are often overly verbose, even for simple pro…
Advancing LLM Reasoning Generalists with Preference Trees
We introduce EURUS, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, EURUS models achieve state-of-the-art results among open-source models…
Agentic Code Reasoning
Can LLM agents explore codebases and reason about code semantics without executing the code? We study this capability, which we call agentic code reasoning, and introduce semi-formal reasoning: a stru…
Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models
Bilgehan Sel, Ahmad Al-Tawaha, Vanshaj Khattar, Lu Wang, Ruoxi Jia, and Ming Jin (Virginia Tech, Microsoft). “Current literature, aiming to surpass the “Chain-of-Thought” approach, often resorts to an ex…
Answering Questions by Meta-Reasoning over Multiple Chains of Thought
Modern systems for multi-hop question answering (QA) typically break questi…
Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data
“The recent success in large language models (LLMs) has shown that properly prompted LLMs demonstrate emergent capabilities on complex understanding and question-answering tasks (Wei et al., 2022a). E…
Base Models Know How to Reason, Thinking Models Learn When
Why do thinking language models like DeepSeek R1 outperform their base counterparts? Despite consistent performance gains, it remains unclear to what extent thinking models learn entirely new reasonin…
Bounds of Chain-of-Thought Robustness: Reasoning Steps, Embed Norms, and Beyond
Existing research indicates that the output of Chain-of-Thought (CoT) is significantly affected by input perturbations. Although many methods aim to mitigate such impact by optimizing prompts, a theor…
Break the Chain: Large Language Models Can be Shortcut Reasoners
Recent advancements in Chain-of-Thought (CoT) reasoning utilize complex modules but are hampered by high token consumption, limited applicability, and challenges in reproducibility. This paper conduct…
Can Large Language Models Reason and Optimize Under Constraints?
Large Language Models (LLMs) have achieved notable performance across a wide range of natural language understanding and generation tasks, from open-ended dialogue and code synthesis to mathematical r…
Chain of Draft: Thinking Faster by Writing Less
In this work, we propose Chain of Draft (CoD), a novel paradigm inspired by human cognitive processes, where LLMs generate minimalistic yet informative intermediate reasoning outputs while solving tas…
Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective
Large Language Models (LLMs) have made notable progress in mathematical reasoning, yet they often rely on single-paradigm reasoning that limits their effectiveness across diverse tasks. In this paper,…
Chain-of-Retrieval Augmented Generation
This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer. Conventional RAG methods usually p…
Chain-of-Thought Is Not Explainability
We argue that CoT rationales can be misleading and are neither necessary nor sufficient for trustworthy interpretability. By analysing faithfulness in terms of whether CoTs are not only human-interpret…
Chain-of-Thought Reasoning Without Prompting
In enhancing the reasoning capabilities of large language models (LLMs), prior research primarily focuses on specific prompting techniques such as few-shot or zero-shot chain-of-thought (CoT) promptin…
Chain-of-thought Reasoning Is A Policy Improvement Operator
Large language models have astounded the world with fascinating new capabilities. However, they currently lack the ability to teach themselves new skills, relying instead on large amounts of human-gen…
CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective
Chain-of-Thought (CoT) does not elicit genuine, abstract reasoning. Instead, we argue that Chain-of-Thought (CoT) functions as a powerful structural constraint that guides Large Language Models (LLMs…
ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning
Narrative comprehension on long stories and novels has been a challenging domain attributed to their intricate plotlines and entangled, often evolving relations among characters and entities. Given th…
Competitive Programming with Large Reasoning Models
We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks. Additionally, we compare two general-purpose reasoni…
Compositional Reasoning with Transformers, RNNs, and Chain of Thought
Large language models [Touvron et al., 2023, Anil et al., 2023, Achiam et al., 2023] are increasingly used to perform logical reasoning and other problems that require algorithmic thinking. To underst…
Cumulative Reasoning with Large Language Models
Despite the recent advancements in language models (LMs), their ability to solve complex problems remains limited. This paper introduces Cumulative Reasoning (CR), a novel approach that utilizes LMs c…
Decoupling Knowledge and Reasoning in LLMs: An Exploration Using Cognitive Dual-System Theory
While large language models (LLMs) leverage both knowledge and reasoning during inference, the capacity to distinguish between them plays a pivotal role in model analysis, interpretability, and develo…
DeepAgent: A General Reasoning Agent with Scalable Toolsets
Large reasoning models have demonstrated strong problem-solving abilities, yet real-world tasks often require external tools and long-horizon interactions. Existing agent frameworks typically follow p…
Demystifying Chains, Trees, and Graphs of Thoughts
The field of natural language processing (NLP) has witnessed significant progress in recent years, with a notable focus on improving large language models’ (LLM) performance through innovative prompti…
Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
Large reasoning models (LRMs) have demonstrated impressive capabilities in complex problem-solving, yet their internal reasoning mechanisms remain poorly understood. In this paper, we investigate the …
Diagnostic Reasoning Prompts Reveal the Potential for Large Language Model Interpretability in Medicine
One of the major barriers to using large language models (LLMs) in medicine is the perception they use uninterpretable methods to make clinical decisions that are inherently different from the cogniti…
Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks
Reinforcement learning (RL) training of large language models (LLMs) on unverifiable tasks is challenging even when a reasonable-quality reference answer is available. We propose a constrained RL trai…
Do Large Language Models Reason Causally Like Us? Even Better?
Indeed, a growing number of researchers have proposed that current LLMs are unable to generalize causal ideas beyond their training distribution and/or without strong user-induced guidance (e.g., chai…
Efficient Reasoning with Balanced Thinking
Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, faili…
Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy
Large language models (LLMs) have demonstrated remarkable proficiency in generating detailed and coherent explanations of complex concepts. However, the extent to which these models truly comprehend t…
Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning
Large Language Models (LLMs) have shown remarkable abilities across various language tasks, but solving complex reasoning problems remains a challenge. While existing methods like Chain-of-Thought (CoT…
From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step
When leveraging language models for reasoning tasks, generating explicit chain-of-thought (CoT) steps often proves essential for achieving high accuracy in final outputs. In this paper, we investigate…
From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?
While existing benchmarks probe the reasoning abilities of large language models (LLMs) across diverse domains, they predominantly assess passive reasoning, providing models with all the information n…
Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis
The emergence of reasoning models and their integration into practical AI chat bots has led to breakthroughs in solving advanced math, deep search, and extractive question answering problems that requ…
Implicit Chain of Thought Reasoning via Knowledge Distillation
To augment language models with the ability to reason, researchers usually prompt or finetune them to produce chain of thought reasoning steps before producing the final answer. However, although peop…
Integrating Large Language Models and Reinforcement Learning for Non-Linear Reasoning
Large Language Models (LLMs) were shown to struggle with long-term planning, which may be caused by the limited way in which they explore the space of possible solutions. We propose an architecture wh…
Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation
Since smaller language models (SLMs) are computationally more efficient but often under-perform compared to larger models, Knowledge Distillation (KD) methods allow for finetuning these smaller models…
Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization
We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. Althoug…
LLM Reasoning Is Latent, Not the Chain of Thought
This position paper argues that large language model (LLM) reasoning should be studied as latent-state trajectory formation rather than as faithful surface chain-of-thought (CoT). This matters because…
LR^2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems
Recent progress in Large Reasoning Models (LRMs) has significantly enhanced the reasoning abilities of Large Language Models (LLMs), empowering them to tackle increasingly complex tasks through reflec…
Language models show human-like content effects on reasoning tasks
Abstract reasoning is a key ability for an intelligent system. Large language models (LMs) achieve above-chance performance on abstract reasoning tasks, but exhibit many imperfections. However, human …
Large Language Model Guided Tree-of-Thought
“Fields Medal winner Terence Tao once shared his experiences solving hard math problems: “When I was a kid, I had a romanticized notion of mathematics, that hard problems were solved in Eureka moment…
Latent Skill Discovery for Chain-of-Thought Reasoning
Recent advances in Large Language Models (LLMs) have led to an emergent ability o…
Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering
Answering questions that require multi-hop reasoning at web-scale necessitates retrieving multiple evidence documents, one of which often has little lexical or semantic relationship to the question. T…
Least-to-most Prompting Enables Complex Reasoning In Large Language Models
“However, chain-of-thought prompting has a key limitation—it often performs poorly on tasks that require generalization of solving problems harder than the demonstration examples, such as compositiona…
Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones
Inference-time computation has emerged as a promising scaling axis for improving large language model reasoning. However, despite yielding impressive performance, the optimal allocation of inference-t…
Logic-of-Thought: Injecting Logic into Contexts for Full Reasoning in Large Language Models
To address this issue, some studies employ the approach of propositional logic to further enhance logical reasoning abilities of LLMs. However, the potential omissions in the extraction of logical exp…
Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning
Large Language Models (LLMs) have demonstrated significant improvements in reasoning capabilities through supervised fine-tuning and reinforcement learning. However, when training reasoning models, th…
Multi-hop Question Answering via Reasoning Chains
Multi-hop question answering requires models to gather information from different parts of a text to answer a question. Most current approaches learn to address this task in an end-to-end way with neu…
Nexus: An Agentic Framework for Time Series Forecasting
Time series forecasting is not just numerical extrapolation, but often requires reasoning with unstructured contextual data such as news or events. While specialized Time Series Foundation Models (TSF…
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
Recently, there has been significant progress in teaching language models to perform step-by-step reasoning to solve complex numerical reasoning tasks. Chain-of-thoughts prompting (CoT) is the state-of…
ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs
Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on model fi…
Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models
Chain-of-Thought (CoT) prompting has improved the reasoning performance of large language models (LLMs), but it remains unclear why it works and whether it is the unique mechanism for triggering reaso…
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its inter…
Reasoning to Learn from Latent Thoughts
Human-written text is the culmination of an underlying thought process—when we write, there is often an internal dialogue that clarifies or even determines the written word. However, modern language m…
S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models
We introduce S1-Bench, a novel benchmark designed to evaluate Large Reasoning Models’ (LRMs) performance on simple tasks that favor intuitive system 1 thinking rather than deliberative system 2 reason…
Self-consistency Improves Chain Of Thought Reasoning In Language Models
“Although language models have demonstrated remarkable success across a range of NLP tasks, their ability to demonstrate reasoning is often seen as a limitation, which cannot be overcome solely by inc…
Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
Human cognition typically involves thinking through abstract, fluid concepts rather than strictly using discrete linguistic tokens. Current reasoning models, however, are constrained to reasoning with…
Soft Tokens, Hard Truths
Large Language Models (LLMs) have achieved impressive success across a wide range of reasoning tasks, particularly when enhanced with Chain-of-Thought (CoT) prompting, where models generate intermedia…
SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs
Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to solve complex reasoning tasks by generating intermediate reasoning steps. However, most existing approaches focus on hard token…
Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
Intermediate token generation (ITG), where a model produces output before the solution, has been proposed as a method to improve the performance of language models on reasoning tasks. These intermedia…
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further improved …
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correc…
The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think
Long chain-of-thought (CoT) is an essential ingredient in effective usage of modern large language models, but our understanding of the reasoning strategies underlying these capabilities remains limit…
Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models
Large language models (LLMs) have shown strong performance across natural language reasoning tasks, yet their reasoning processes remain brittle and difficult to interpret. Prompting techniques like C…
Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection
Self-detection for Large Language Models (LLMs) seeks to evaluate the trustworthiness of the LLM’s output by leveraging its own capabilities, thereby alleviating the issue of output hallucination. How…
Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor
Context compression aims to shorten long context inputs with minimal information loss for LLM inference acceleration. While existing methods have shown promise, they typically rely on complex compress…
Thought Anchors: Which LLM Reasoning Steps Matter?
We argue that analyzing reasoning traces at the sentence level is a promising approach to understanding reasoning processes. We present three complementary attribution methods: (1) a black-box method …
Tina: Tiny Reasoning Models via LoRA
How cost-effectively can strong reasoning abilities be achieved in language models? Driven by this fundamental question, we present Tina, a family of tiny reasoning models achieved with high cost-effi…
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra “thinking” really helpful?…
Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis
Chain-of-Thought (CoT) is an efficient prompting method that enables the reasoning ability of large language models by augmenting the query using multiple examples with multiple intermediate steps. De…
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inferenc…
Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can re…
Zero-Shot Verification-guided Chain of Thoughts
Previous works have demonstrated the effectiveness of Chain-of-Thought (COT) prompts and verifiers in guiding Large Language Models (LLMs) through the space of reasoning. However, most such studies ei…

Chain-of-Thought and Reasoning Methods

Why do models fail at asking good questions during interaction?

Can minimal reasoning chains match full explanations?

Can reasoning models actually sustain long-chain reflection?

Why does autoregressive generation fail at constraint satisfaction?

Why do chain-of-thought examples fail across different conditions?

Can one statistical measure serve dual purposes in RL training?

How quickly do errors compound during model self-training?

Why do models trust their own generated answers?

Do large language models make the same causal reasoning mistakes as humans?

Can longer reasoning chains eliminate model sensitivity to input noise?

Can small models reason well by just learning output format?

What alignment data structure best trains reasoning generalists?

Can models recognize question difficulty before they reason?

Can reasoning topologies be formally classified as graph types?

Do reasoning traces actually cause correct answers?

Can we identify which tokens actually matter for reasoning?

Should reasoning benchmarks score final answers or reasoning traces?

What makes reflection actually work in reasoning models?

Can rubrics and dense rewards work together without hacking?

When does sequential reasoning beat parallel voting?

Which sentences actually steer a reasoning trace?

Does training data format shape reasoning strategy more than domain?

Why do standard process reward models fail on thinking traces?
