Can extended deliberation in agents become counterproductive like human overthinking?
This explores whether agents and reasoning models, like humans, can hurt themselves by deliberating too long — and what the corpus knows about when more thinking flips from helpful to harmful.
The corpus says yes, with surprising specificity. The cleanest evidence is a non-monotonic curve: accuracy climbs with thinking tokens up to a critical count, then declines sharply. One study watched benchmark accuracy drop from 87.3% to 70.3% as thinking tokens scaled from ~1,100 to ~16,000 (Does more thinking time always improve reasoning accuracy?; When does thinking too much actually hurt reasoning?). The mechanism is recognizably human: extra tokens inflate output variance and invite self-revision errors. The model talks itself out of correct answers, the same way a person second-guesses a right first instinct.
What's striking is that the corpus locates the failure not in the *amount* of thinking but in *when and why* it happens. Vanilla models use 'thinking mode' as a kind of anxious self-doubt that degrades performance; RL training doesn't make them think less, it redirects the same mechanism toward productive gap analysis (Does extended thinking help or hurt model reasoning?). So overthinking isn't a quantity bug; it's a matter of disposition. A related failure shows up with ill-posed questions: reasoning models spin out long redundant chains on problems with missing premises, while plainer models simply flag them as unanswerable. The reasoning model was optimized to *produce* steps and never taught *when to disengage* (Why do reasoning models overthink ill-posed questions?). That's overthinking as a trained reflex with no off-switch.
The more interesting turn is what the corpus offers as antidotes, because they map onto how disciplined human thinkers actually behave. Rather than a fixed thinking budget, two notes argue for *spending compute only where uncertainty is real*. SAND samples several candidate actions and deliberates only when they diverge; if the model already agrees with itself, it acts without belaboring (When should an agent actually stop and deliberate?). ReBalance reads the model's own confidence, including its variance, as a live diagnostic signal, steering the model to cut redundancy when it is overthinking and to explore more when it is underthinking, with no retraining at all (Can confidence patterns reveal overthinking versus underthinking?). Both treat deliberation as something to *meter*, not maximize.
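The divergence gate described for SAND can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SAND implementation: `should_deliberate`, `choose_action`, `sample_action`, and `deliberate` are hypothetical names, actions are assumed to be comparable values such as strings, and "diverge" is read here as "the samples are not all identical" (with an optional check against an expert action, as the source note mentions).

```python
from typing import Callable, Hashable, List, Optional, Sequence


def should_deliberate(samples: Sequence[Hashable],
                      expert_action: Optional[Hashable] = None) -> bool:
    """Return True when sampled actions disagree with each other
    or, if an expert action is given, with that expert action."""
    if not samples:
        raise ValueError("need at least one sampled action")
    distinct = set(samples)
    if len(distinct) > 1:
        return True  # the policy disagrees with itself: uncertainty is real
    if expert_action is not None and expert_action not in distinct:
        return True  # unanimous, but not the action the expert took
    return False


def choose_action(sample_action: Callable[[], Hashable],
                  deliberate: Callable[[List[Hashable]], Hashable],
                  n_samples: int = 5) -> Hashable:
    """Sample n_samples candidate actions; pay for deliberation only on divergence."""
    samples = [sample_action() for _ in range(n_samples)]
    if should_deliberate(samples):
        return deliberate(samples)  # hypothetical slow path (e.g. critique candidates)
    return samples[0]  # unanimous: act without further thinking


# Usage: a policy that always proposes the same action never reaches the slow path.
action = choose_action(lambda: "click_submit", deliberate=lambda s: "reconsidered")
```

The design choice worth noting is that the cost of deliberation is paid per decision point rather than per task, so an easy step is never charged for the difficulty of a hard one.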
There's a sharp distinction worth carrying away: not all 'more' is overthinking. The corpus separates two axes. Per-step reasoning depth (chain-of-thought) is the one prone to diminishing and then negative returns. But test-time *interaction* (taking more steps in an environment to explore, backtrack, and replan) scales differently and dominates on tasks where the agent can't see everything at once (Does agent interaction time scale separately from reasoning depth?). Even search budget in research agents follows its own scaling curve with its own diminishing returns (Do search steps follow the same scaling rules as reasoning tokens?). So an agent that wanders the world longer is not necessarily an agent that's rotting its own answer by re-thinking the same thought.
The thing you might not have known you wanted: the human-overthinking analogy is almost too good. The corpus elsewhere frames LLMs as 'scaled System-1 cognition', meaning fast intuition rather than deliberate reason (Why do people trust AI outputs they shouldn't?). Under that lens, forcing extended deliberation on a System-1 engine is exactly the move that produces human-style overthinking: a fast intuitive system pushed to rationalize generates noise, not insight. The fixes the corpus likes are less 'think harder' and more 'know when to stop and what to offload': externalizing memory and skills into a harness so the model isn't re-deriving everything each time (Where does agent reliability actually come from?), or letting agents fold their history into compact schemas and pause to reconsider (Can agents compress their own memory without losing critical details?). That's the same lesson humans learn the hard way.
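As a rough picture of what "folding history into compact schemas" could look like, here is a minimal sketch. It is an assumption-heavy toy, not DeepAgent's method: the `Event` and `FoldedMemory` types, the event kinds, and the rule for what goes into each of the three schemas (episodic, working, tool) are all hypothetical, and a real system would presumably use the model itself to summarize rather than the simple bucketing shown here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Event:
    kind: str                   # hypothetical labels: "observation", "thought", "tool_call"
    text: str
    tool: Optional[str] = None  # tool name, only meaningful for "tool_call" events


@dataclass
class FoldedMemory:
    episodic: List[str] = field(default_factory=list)   # what happened, in order
    working: str = ""                                    # the current plan or subgoal
    tool: Dict[str, int] = field(default_factory=dict)  # how often each tool was used


def fold(history: List[Event]) -> FoldedMemory:
    """Collapse a raw interaction history into three compact schemas."""
    memory = FoldedMemory()
    for event in history:
        if event.kind == "tool_call" and event.tool is not None:
            memory.tool[event.tool] = memory.tool.get(event.tool, 0) + 1
        elif event.kind == "observation":
            memory.episodic.append(event.text)
        elif event.kind == "thought":
            memory.working = event.text  # a later thought replaces an earlier one
    return memory


history = [
    Event("thought", "search for the API name"),
    Event("tool_call", "query 1", tool="search"),
    Event("observation", "first query returned nothing useful"),
    Event("tool_call", "query 2", tool="search"),
    Event("thought", "try the docs page instead"),
]
folded = fold(history)
```

After folding, the agent carries forward only `folded` instead of the full `history`, which is the sense in which the model stops re-reading, and re-deriving from, everything it has already done.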
Sources 11 notes
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Empirical studies demonstrate non-monotonic scaling in test-time reasoning: accuracy peaks at a critical thinking-token count, then declines sharply (87.3% to 70.3% as tokens scale from 1,100 to 16,000). Extended thinking inflates output variance and introduces self-revision errors rather than improving solution quality.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.
SAND uses self-consistency sampling to flag uncertainty: if N policy samples all match the expert action, skip deliberation; if they diverge, trigger execution-guided critiques. This step-level compute allocation lets agents deliberate only at genuinely uncertain decision points.
ReBalance uses confidence variance and overconfidence as diagnostic signals to apply training-free steering vectors that reduce overthinking redundancy while promoting exploration during underthinking, improving accuracy across models from 0.5B to 32B parameters.
Test-time interaction—increasing environment steps—enables exploration, backtracking, and replanning that per-step reasoning cannot achieve. Curriculum-based RL on rollout length produces SOTA web agents, showing interaction scaling dominates on tasks with partial observability.
Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.
Rose-Frame identifies map-territory confusion, intuition-reason conflation, and confirmation-bias reinforcement as traps that multiply their distorting effects when they co-occur. Evidence from cross-linguistic overreliance and architectural transformer biases confirms the compounding mechanism operates universally.
Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.
DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.