Can models learn when to think versus respond quickly?

Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.

Synthesis note · 2026-02-22 · sourced from Reasoning o1 o3 Search

The question "can LLMs learn when to think?" has a concrete answer. Thinkless trains a single model to adaptively select between extended chain-of-thought reasoning and concise direct responses, guided by three factors: task complexity, model capability, and the user's efficiency-accuracy tolerance.

The mechanism: the model emits one of two control tokens (<think> or <short>) as its first output token, signaling the reasoning mode for the rest of the response. A distillation warm-up phase aligns each token with expert behavior: a reasoning model for <think>, a compact instruction model for <short>. RL then optimizes the routing policy.
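A minimal sketch of how the first-token decision could be read at inference time, assuming access to the logits of the two control tokens at the first output position. The function names and the think_threshold knob (a stand-in for the user's efficiency-accuracy tolerance) are hypothetical and not taken from Thinkless.

```python
import math

THINK, SHORT = "<think>", "<short>"

def mode_probabilities(control_logits):
    """Softmax restricted to the two control tokens (hypothetical helper)."""
    peak = max(control_logits.values())
    exps = {tok: math.exp(v - peak) for tok, v in control_logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def select_mode(control_logits, think_threshold=0.5):
    """Pick the reasoning mode from first-position logits.

    think_threshold is an assumed knob: raising it favors <short>
    (cheaper), lowering it favors <think> (more compute).
    """
    probs = mode_probabilities(control_logits)
    return THINK if probs[THINK] >= think_threshold else SHORT

# Example: the same logits route differently under different tolerances.
logits = {THINK: 0.4, SHORT: 0.0}
print(select_mode(logits))                       # default tolerance
print(select_mode(logits, think_threshold=0.9))  # efficiency-leaning
```

In the trained model the decision is simply whichever control token gets sampled first; the explicit threshold here only makes the efficiency-accuracy trade-off visible.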

The critical technical contribution is DeGRPO (Decoupled Group Relative Policy Optimization). Vanilla GRPO treats all tokens uniformly, but the control token is a single token while the response spans hundreds to thousands. Response tokens therefore dominate gradient updates, and the control token receives a weak, biased signal. The model rapidly collapses to one mode, typically <short>, since short samples update the control token faster.
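The imbalance described above can be made concrete with a little arithmetic. This sketch assumes a per-sample loss that averages uniformly over the control token plus all response tokens; the exact normalization used in the Thinkless baseline is not stated in this note, and the response lengths are illustrative.

```python
def control_token_share(response_len):
    """Fraction of a length-normalized per-sample loss carried by the
    single control token, assuming every token gets equal weight."""
    return 1.0 / (1 + response_len)

# Illustrative lengths only: a short direct answer vs. a long reasoning trace.
for name, n in [("short", 50), ("long", 2000)]:
    share = control_token_share(n)
    print(f"{name:5s} response ({n} tokens): control share = {share:.5f}")
```

Under this assumption the control token's weight is roughly 40 times larger in the short sample than in the long one, which is consistent with short samples moving the mode decision faster.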

DeGRPO separates two objectives: (1) mode selection — how quickly the policy adapts based on current accuracy, and (2) accuracy improvement — refining answer content within the selected mode. This decoupling stabilizes training and prevents the mode collapse observed in all vanilla GRPO experiments.
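A simplified sketch of the decoupling idea, not the Thinkless implementation. It omits the importance ratios, clipping, and KL terms of a full GRPO objective, and the weight alpha is a hypothetical parameter introduced here for illustration. The point is only that the mode-selection term gets its own coefficient instead of being averaged in with the response tokens.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]  # no signal when all rewards agree
    return [(r - mean) / std for r in rewards]

def vanilla_loss(control_logprob, response_logprobs, advantage):
    """All tokens pooled and length-normalized: the control token's
    weight shrinks as the response grows."""
    logprobs = [control_logprob] + list(response_logprobs)
    return -advantage * sum(logprobs) / len(logprobs)

def decoupled_loss(control_logprob, response_logprobs, advantage, alpha=1.0):
    """Control token and response get separate terms, so the
    mode-selection signal no longer depends on response length.
    alpha is an assumed weight, not a value from the source."""
    mode_term = -alpha * advantage * control_logprob
    answer_term = -advantage * sum(response_logprobs) / len(response_logprobs)
    return mode_term + answer_term

advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
print(decoupled_loss(-0.5, [-1.0] * 10, advantages[0]))
print(decoupled_loss(-0.5, [-1.0] * 1000, advantages[0]))
```

With identical per-token log-probabilities, the decoupled loss is the same for a 10-token and a 1000-token response, whereas the pooled version would weight the control token 1/11 versus 1/1001.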

The result: the model self-calibrates. Simple arithmetic routes to <short>. Multi-condition problems with multiple concepts route to <think>. The policy reflects a well-calibrated difficulty assessment without explicit difficulty labels in training.

This is the concrete instantiation of "Does RL teach reasoning or just when to use it?": RL doesn't teach the model to reason; it teaches it to recognize when reasoning is worth the compute. The capability comes from pre-training and distillation; RL manages the deployment decision. The design premise aligns with "Do base models already contain hidden reasoning ability?": if reasoning capability is already latent, then what's needed is not more capability training but a routing mechanism, and the control token trained with DeGRPO is exactly that mechanism.

The connection to "Can we allocate inference compute based on prompt difficulty?" is architecturally direct. Compute-optimal scaling proposes adaptive budget allocation as a principle. Thinkless implements it as a learned routing mechanism inside a single model.

Three-mode taxonomy with two knowledge boundaries (from Arxiv/Routers): The Fast, Slow, and Tool-augmented Thinking survey formalizes the decision space Thinkless operates in. Two knowledge boundaries define the taxonomy: (1) a fast/slow boundary separating intuitive from deliberative processes (System 1 vs System 2), and (2) an internal/external boundary distinguishing parameter-grounded reasoning from tool-augmented reasoning. This extends Thinkless's binary think/short routing to a three-mode decision: fast thinking (direct generation), slow thinking (CoT/self-reflection/verification), and tool-augmented thinking (calculators, code interpreters, search). Selection mechanisms are either implicit (learned end-to-end during post-training, no explicit control signal) or explicit (rule-based or model-based external routing). Thinkless is an implicit selector for the fast/slow boundary; extending it to the internal/external boundary would require a third mode for tool invocation decisions.
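The three-mode decision space can be written down as a small data structure. This is an illustrative sketch: the names are hypothetical, and the rule that the internal/external boundary is checked before the fast/slow boundary is an assumption made here, not something the survey or Thinkless specifies.

```python
from enum import Enum

class ThinkingMode(Enum):
    FAST = "fast"            # direct generation
    SLOW = "slow"            # CoT, self-reflection, verification
    TOOL = "tool-augmented"  # calculators, code interpreters, search

def classify_mode(needs_deliberation, needs_external_knowledge):
    """Map the two knowledge boundaries onto one of three modes.

    Assumption: the internal/external boundary is checked first, so a
    query that needs outside resources always routes to TOOL.
    """
    if needs_external_knowledge:
        return ThinkingMode.TOOL
    return ThinkingMode.SLOW if needs_deliberation else ThinkingMode.FAST

print(classify_mode(False, False))
print(classify_mode(True, False))
print(classify_mode(True, True))
```

Thinkless covers only the first two outcomes; a third control token would be one way to expose the TOOL branch.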

Related concepts in this collection 9

Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
Thinkless is the concrete implementation: RL learns the routing, not the reasoning
Can we allocate inference compute based on prompt difficulty? Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
Thinkless implements adaptive allocation as a learned control token decision
When does explicit reasoning actually help model performance? Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?
the routing decision is the practical resolution: use reasoning where it helps, skip where it hurts
Can routers select the right model before generation happens? Explores whether LLMs can be matched to queries by estimating difficulty upfront, before any generation begins. This matters because routing could cut costs significantly while preserving response quality.
external model routing as the inter-model analog of Thinkless's intra-model mode routing
Does RL post-training create reasoning or just deploy it? Investigates whether reasoning capability emerges during RL fine-tuning or already exists in base models. Matters because it reshapes how we build and optimize reasoning systems.
Thinkless is the strongest concrete evidence for the "deploy" side of this question: RL literally learns a routing token, not reasoning capability; the "when, not how" claim is architecturally explicit in the control token design
Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
Thinkless's design premise: if reasoning capability is already latent in the base model, what's needed is not more capability training but a routing mechanism that decides when to activate it; DeGRPO is that routing mechanism
When should an agent actually stop and deliberate? How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.
SAND extends the Thinkless routing principle to a finer granularity: Thinkless decides once per response whether to think or not, while SAND decides at each step within an agentic trajectory whether to deliberate; together they form a hierarchy of adaptive compute allocation (response-level routing + step-level gating)
Does thinking emerge when agents choose between learned sub-policies? Can we formally understand thinking as the selection of pre-existing sub-policies during reinforcement learning? This explores whether thinking requires new capabilities or just the right conditions to activate what's already there.
theoretical grounding: the thought MDP formalizes what DeGRPO's control token does — selecting between sub-policies (think vs. short) already contained in the policy function; the meta-policy over sub-policies IS the routing decision
When should retrieval happen during model generation? Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.
the retrieval-level analog of Thinkless's compute routing: FLARE gates retrieval on low token-probability, Thinkless gates extended thinking on task complexity; both implement uncertainty-triggered compute allocation, one at the retrieval layer, one at the reasoning layer
