Can a single model implement fast thinking, slow thinking, and tool use?
This explores whether one model can do all three—answer simply when that's enough (fast thinking), reason at length when needed (slow thinking), and call external tools—rather than splitting those jobs across separate systems.
The corpus suggests the hard part isn't whether one model *can* hold these capabilities (it usually already does) but whether it can decide *when* to use which. The cleanest evidence is Thinkless, which trains one model to route between extended reasoning and direct answers using DeGRPO, a method that decouples the choice of mode from the quality of the answer and so keeps the model from collapsing into always-think or always-skip (see "Can models learn when to think versus respond quickly?"). That routing matters because more thinking is not free: accuracy peaks and then declines as thinking tokens grow (in the cited result, from 87.3% at roughly 1,100 tokens to 70.3% at roughly 16K), with models overthinking easy problems and underthinking hard ones (see "Does more thinking time always improve reasoning accuracy?").
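A toy calculation may help show why decoupling the mode choice from the answer matters. It assumes (this is an illustration, not the Thinkless implementation) that a joint objective averages per-token losses over one mode token plus the answer tokens, so the mode token's share of the signal shrinks as answers get longer. A decoupled objective instead gives the mode term its own fixed weight. The function names and the weight `alpha` are hypothetical.

```python
def joint_mode_share(answer_len: int) -> float:
    """Share of a length-normalized loss that lands on the single mode token.

    Assumption: the loss is a plain average over 1 mode token + answer_len tokens.
    """
    return 1.0 / (1 + answer_len)


def decoupled_mode_share(alpha: float) -> float:
    """Share on the mode token when it is weighted separately from the answer.

    Assumption: loss = alpha * mode_loss + mean(answer_token_losses),
    so the answer term carries total weight 1 regardless of its length.
    """
    return alpha / (alpha + 1.0)


if __name__ == "__main__":
    alpha = 0.5  # hypothetical weight on the mode-selection term
    for label, length in [("direct answer", 50), ("extended reasoning", 5000)]:
        print(
            f"{label:>18}: joint={joint_mode_share(length):.5f} "
            f"decoupled={decoupled_mode_share(alpha):.5f}"
        )
```

Under the joint average, the mode token gets about 2% of the weight for a 50-token answer and about 0.02% for a 5,000-token reasoning trace. Under the decoupled form it gets one third in both cases, which is one plausible reading of how separating the two terms keeps the routing decision from being swamped by answer length.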
A recurring theme is that the slow-thinking capacity is often already latent in the base model: the bottleneck is elicitation, not acquisition. Several independent mechanisms (RL steering, critique tuning, decoding tweaks, feature steering) all surface reasoning that was already present, suggesting post-training selects reasoning rather than creating it (see "Do base models already contain hidden reasoning ability?"). But there's a real limit to the single-model dream: a non-reasoning model can't simply spend more inference compute to catch up to a reasoning-trained one, because training installs a *protocol* that makes the extra tokens productive in the first place (see "Can non-reasoning models catch up with more compute?"). The *quality* of the thinking mode depends on training too: vanilla models can use extended thinking counterproductively (self-doubt that hurts answers), while RL redirects the same mechanism into useful analysis (see "Does extended thinking help or hurt model reasoning?").
The tool-use leg of your question gets an interesting answer from a different angle: rather than one monolith doing everything internally, the corpus shows that reasoning operations themselves can be packaged as tool calls. 'Cognitive tools' (reasoning steps implemented as sandboxed, modular LLM calls) lifted GPT-4.1 on AIME2024 from 26.7% to 43.3% with no RL training, because modularity enforces an isolation that plain prompting can't guarantee (see "Can modular cognitive tools unlock reasoning without training?"). This blurs the line in your question: tool use and slow thinking can be the *same* mechanism, where the model invokes structured reasoning as a callable operation. A related line shows that separating the planner (decomposer) from the executor (solver) outperforms a single model trying to do both, with the planning skill transferring across domains while the solving skill doesn't (see "Does separating planning from execution improve reasoning accuracy?").
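The idea of a reasoning step as a callable, isolated operation can be sketched as follows. This is a minimal illustration under stated assumptions, not the cited work's code: the tool names, the prompt templates, and the `LLM` callable type are all hypothetical, and `echo_llm` is a stand-in so the sketch runs without a real model. The point is only that each tool call builds a fresh prompt from its explicit input and shares no conversation history.

```python
from typing import Callable, Dict

# Hypothetical interface: a model is any function from prompt text to output text.
LLM = Callable[[str], str]

# Hypothetical tool prompts. Each tool sees only the payload passed to it.
TOOL_PROMPTS: Dict[str, str] = {
    "restate_problem": "Restate the problem and list what is known:\n{payload}",
    "check_answer": "Check this candidate answer for errors:\n{payload}",
}


def call_tool(llm: LLM, name: str, payload: str) -> str:
    """Run one reasoning step as an isolated call: fresh prompt, no shared history."""
    if name not in TOOL_PROMPTS:
        raise KeyError(f"unknown tool: {name}")
    return llm(TOOL_PROMPTS[name].format(payload=payload))


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    return f"[model output for a {len(prompt)}-character prompt]"


if __name__ == "__main__":
    print(call_tool(echo_llm, "restate_problem", "What is 12 * 13?"))
```

In a real setup the orchestrating model would decide which tool to invoke and when, and `llm` would wrap an actual model call. The isolation comes from the fact that `call_tool` passes nothing except the template and the payload.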
So there's a genuine tension worth knowing: one model can be trained to route between modes and to invoke tools, but the corpus repeatedly finds that *forcing separation* (decomposer from solver, reasoning steps into isolated tool calls) buys accuracy and generalizability that a single undifferentiated forward pass struggles to match. The slow-thinking machinery also needs guardrails the single model doesn't naturally have. Models switch reasoning paths too early and waste tokens, which is fixable by penalizing thought-transition tokens at decode time (see "Do reasoning models switch between ideas too frequently?"). And even elaborate reasoning frameworks such as BoN and MCTS converge once you control for total compute, meaning the win comes from compute and reward quality, not the framework wrapper (see "Does the choice of reasoning framework actually matter for test-time performance?").
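The decode-time penalty on thought transitions can be sketched roughly as below. The source only says that a decoding-only penalty on thought-transition tokens discourages switching; the specific shape here (subtract a constant from those tokens' logits during an initial window of each thought) and the names `penalty`, `window`, and `transition_ids` are assumptions for illustration, with arbitrary default values.

```python
from typing import List, Set


def penalize_transitions(
    logits: List[float],
    transition_ids: Set[int],
    steps_in_thought: int,
    penalty: float = 3.0,   # hypothetical strength, not a tuned value
    window: int = 600,      # hypothetical number of steps the penalty stays active
) -> List[float]:
    """Lower the logits of thought-switching tokens early in the current thought.

    logits[i] is the score for token id i. Tokens in transition_ids are those
    that begin a new line of reasoning (for example, a token for "Alternatively").
    """
    if steps_in_thought >= window:
        # The thought has run long enough; switching is no longer discouraged.
        return list(logits)
    return [
        score - penalty if token_id in transition_ids else score
        for token_id, score in enumerate(logits)
    ]


if __name__ == "__main__":
    scores = [2.0, 1.0, 0.5]  # token 0 plays the role of a transition token
    print(penalize_transitions(scores, {0}, steps_in_thought=10))
    print(penalize_transitions(scores, {0}, steps_in_thought=700))
```

Early in a thought the transition token drops from the top score to the lowest, so sampling favors continuing the current line of reasoning. Once the window has passed, the logits are returned unchanged. No model weights are touched, which matches the source's description of the fix as decoding-only.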
The point you may not have known you were looking for: the frontier framing isn't 'fast vs. slow vs. tools' as three skills to bolt together, but a single learned *controller* deciding how much to think and when to reach outside itself. The evidence suggests that this controller is the scarce ingredient, while the underlying capabilities are largely already there.
Sources (9 notes)
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.
Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.