SYNTHESIS NOTE

Can models learn when to think versus respond quickly?

Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.

Synthesis note · 2026-02-22 · sourced from Reasoning o1 o3 Search
How should we allocate compute budget at inference time?

The question "can LLMs learn when to think?" has a concrete answer. Thinkless trains a single model to adaptively select between extended chain-of-thought reasoning and concise direct responses, guided by three factors: task complexity, model capability, and the user's efficiency-accuracy tolerance.

The mechanism: the model emits one of two control tokens (<think> or <short>) as its first output token, signaling the reasoning mode for the rest of the response. A distillation warm-up phase aligns each token with expert behavior: a reasoning model for <think>, a compact instruction model for <short>. RL then optimizes the routing policy.
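A minimal sketch of how a caller might read the mode off the first output token. The function name and the list-of-strings token representation are assumptions for illustration, not part of Thinkless itself.

```python
THINK, SHORT = "<think>", "<short>"

def reasoning_mode(tokens):
    """Return the mode signaled by the first output token.

    `tokens` is assumed to be a list of decoded token strings
    (a hypothetical representation chosen for this sketch).
    """
    if not tokens:
        raise ValueError("empty output")
    first = tokens[0]
    if first == THINK:
        return "think"
    if first == SHORT:
        return "short"
    raise ValueError(f"expected a control token first, got {first!r}")
```

Because the decision is a single leading token, the routing choice is visible before any of the response is generated.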

The critical technical contribution is DeGRPO (Decoupled Group Relative Policy Optimization). Vanilla GRPO treats all tokens uniformly, but the control token is one token while the response spans hundreds to thousands. Long responses dominate gradient updates, causing the single control token to receive weak, biased signals. The model rapidly collapses to one mode — typically <short>, since short samples update the control token faster.
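The imbalance can be made concrete with a little arithmetic. The sketch below assumes each sample's loss is averaged uniformly over its control token plus its response tokens; that normalization is an assumption made here for illustration, and the helper name is hypothetical.

```python
def control_token_share(response_len: int) -> float:
    """Fraction of a sample's per-token weight that lands on the control token.

    Assumes uniform averaging over 1 control token + `response_len`
    response tokens (an illustrative assumption, not a quoted formula).
    """
    if response_len < 0:
        raise ValueError("response_len must be non-negative")
    return 1.0 / (response_len + 1)

# Illustrative lengths only: a short answer vs. a long reasoning trace.
short_share = control_token_share(50)
long_share = control_token_share(2000)
```

Under this assumption the control token's weight shrinks as the response grows, so a <short> sample pushes on its control token far harder than a <think> sample does, which is consistent with the collapse toward <short> described above.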

DeGRPO separates two objectives: (1) mode selection — how quickly the policy adapts based on current accuracy, and (2) accuracy improvement — refining answer content within the selected mode. This decoupling stabilizes training and prevents the mode collapse observed in all vanilla GRPO experiments.
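The decoupling can be sketched in a few lines. This is an illustration of the idea rather than the published objective: `ctrl_term` and `resp_terms` stand in for per-token advantage-weighted objective terms, and `alpha` is a hypothetical weight on the mode-selection term; none of these names come from the source.

```python
def vanilla_loss(ctrl_term, resp_terms):
    """Uniform treatment: the control token is one term among 1 + T."""
    return -(ctrl_term + sum(resp_terms)) / (1 + len(resp_terms))

def decoupled_loss(ctrl_term, resp_terms, alpha):
    """Decoupled treatment: separate weights for the two objectives.

    alpha (hypothetical) scales the mode-selection term on its own,
    so its weight no longer depends on the response length.
    """
    if not resp_terms:
        raise ValueError("response must contain at least one token")
    mode_selection = alpha * ctrl_term               # objective (1)
    accuracy = sum(resp_terms) / len(resp_terms)     # objective (2)
    return -(mode_selection + accuracy)
```

In this sketch the control token contributes `alpha * ctrl_term` whether the response has ten tokens or ten thousand, whereas in the uniform version its contribution is divided by the response length.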

The result: the model self-calibrates. Simple arithmetic routes to <short>. Multi-condition problems with multiple concepts route to <think>. The policy reflects a well-calibrated difficulty assessment without explicit difficulty labels in training.

This is the concrete instantiation of "Does RL teach reasoning or just when to use it?" RL doesn't teach the model to reason; it teaches it to recognize when reasoning is worth the compute. The capability comes from pre-training and distillation; RL manages the deployment decision. The design premise aligns with "Do base models already contain hidden reasoning ability?": if reasoning capability is already latent, then what's needed is not more capability training but a routing mechanism, and the DeGRPO-trained control token is exactly that mechanism.

The connection to "Can we allocate inference compute based on prompt difficulty?" is architecturally direct. Compute-optimal scaling proposes adaptive budget allocation as a principle. Thinkless implements it as a learned routing mechanism inside a single model.

Three-mode taxonomy with two knowledge boundaries (from Arxiv/Routers): The Fast, Slow, and Tool-augmented Thinking survey formalizes the decision space Thinkless operates in. Two knowledge boundaries define the taxonomy: (1) a fast/slow boundary separating intuitive from deliberative processes (System 1 vs System 2), and (2) an internal/external boundary distinguishing parameter-grounded reasoning from tool-augmented reasoning. This extends Thinkless's binary think/short routing to a three-mode decision: fast thinking (direct generation), slow thinking (CoT/self-reflection/verification), and tool-augmented thinking (calculators, code interpreters, search). Selection mechanisms are either implicit (learned end-to-end during post-training, no explicit control signal) or explicit (rule-based or model-based external routing). Thinkless is an implicit selector for the fast/slow boundary; extending it to the internal/external boundary would require a third mode for tool invocation decisions.
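One way to picture the taxonomy is as two boolean boundaries that yield three modes. The sketch below is an interpretation of the description above, with hypothetical names; in particular, it assumes any externally grounded process counts as tool-augmented regardless of how deliberative it is.

```python
from enum import Enum

class Mode(Enum):
    FAST = "fast"   # direct generation
    SLOW = "slow"   # CoT / self-reflection / verification
    TOOL = "tool"   # calculators, code interpreters, search

def classify(deliberative: bool, external: bool) -> Mode:
    """Map the two knowledge boundaries onto the three modes.

    external=True crosses the internal/external boundary (assumed here
    to take precedence); otherwise the fast/slow boundary decides.
    """
    if external:
        return Mode.TOOL
    return Mode.SLOW if deliberative else Mode.FAST
```

Thinkless's binary routing corresponds to the `external=False` half of this table; the third mode is the extension the paragraph above describes.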

Original note title

hybrid reasoning via decoupled rl learns when to engage extended thinking versus giving concise responses based on task complexity and model capability