Does extended thinking help or hurt model reasoning?

Explores whether activating thinking mode improves reasoning performance, and what role training plays in determining whether extended internal reasoning chains are productive or counterproductive.

Synthesis note · 2026-02-22 · sourced from Conversation Agents

The proactive critical thinking experiments reveal a striking interaction between training and inference-time reasoning. For vanilla (off-the-shelf) models, activating "thinking mode" — the extended internal reasoning chains used by models like Qwen3 — actually degrades performance on proactive critical thinking tasks. The extended thinking "appears to induce counterproductive self-doubt rather than useful analysis, leading to a clear drop in performance."

But after RL training on proactive critical thinking tasks, the same thinking mode becomes beneficial. Training changes how models use their internal reasoning. The issue is not how much the model thinks but the direction its thinking takes: toward productive analysis or toward self-doubt.
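The reversal described above is a crossover interaction: the sign of the thinking-mode effect flips with training. A minimal sketch of how one might check for that pattern follows. The function names are hypothetical, and the accuracies are made-up placeholders, not results from the experiments.

```python
def thinking_effect(acc_thinking_on, acc_thinking_off):
    """Change in accuracy attributable to turning thinking mode on."""
    return acc_thinking_on - acc_thinking_off


def is_reversal(effect_before, effect_after):
    """True when thinking hurts before training and helps after it."""
    return effect_before < 0 < effect_after


# Placeholder accuracies for illustration only (NOT reported results).
placeholder = {
    "vanilla": {"on": 0.40, "off": 0.50},
    "rl_trained": {"on": 0.70, "off": 0.60},
}

effects = {
    name: thinking_effect(acc["on"], acc["off"])
    for name, acc in placeholder.items()
}

print(effects)
print(is_reversal(effects["vanilla"], effects["rl_trained"]))
```

Comparing the two effects separately, rather than averaging over training conditions, is what exposes the flip: a pooled average could show no effect of thinking mode at all.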

The finding connects to several established insights but adds a distinct mechanism:

As argued in Does RL teach reasoning or just when to use it?, RL manages the timing of reasoning. The proactive thinking result extends this: RL also manages the mode of reasoning, redirecting extended thinking from unproductive self-doubt toward productive gap analysis.

The SFT finding adds nuance: when SFT data is self-generated by the model, it "does not inherently enhance its capabilities" and may reduce output entropy, constraining the subsequent RL phase. This echoes Does policy entropy collapse limit reasoning performance in RL? — SFT-then-RL may face the same entropy collapse that pure RL faces, but through a different mechanism (entropy reduction from self-generated imitation rather than RL convergence).
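To make "output entropy" concrete, here is a minimal sketch that computes the Shannon entropy of a next-token distribution. The two distributions are hypothetical illustrations of a broad policy and a narrowed one; they are not taken from the source experiments.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token distributions over a 4-token vocabulary.
broad = [0.25, 0.25, 0.25, 0.25]   # many continuations remain plausible
peaked = [0.97, 0.01, 0.01, 0.01]  # assumed shape after imitation narrows the policy

print(shannon_entropy(broad))
print(shannon_entropy(peaked))
```

The peaked distribution has far lower entropy than the uniform one. A policy in that state samples fewer distinct continuations, which is the sense in which low entropy can constrain exploration in a later RL phase.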

The practical implication: extended thinking is not a universal good. It is a resource that can be directed productively or destructively, and the direction depends on training. "More thinking" applied to a model without the right training signal may systematically make things worse.

Related concepts in this collection 4

Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
RL manages timing; this paper shows RL also manages the direction of reasoning
Can models learn when to think versus respond quickly? Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.
DeGRPO mode selection; proactive thinking adds a training-mediated quality dimension
Does policy entropy collapse limit reasoning performance in RL? As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
SFT-then-RL may face entropy collapse through self-generated imitation
What critical thinking skills do reasoning models actually lose? Step-by-step reasoning training optimizes narrow deductive thinking while degrading meta-cognitive abilities like recognizing futile thinking and maintaining tentative reasoning. Understanding this tradeoff matters for deploying reasoning models reliably.
the thinking-mode reversal is a specific instance of the broader critical thinking problem: reasoning training optimizes one narrow type of thinking while degrading others. The proactive thinking result shows RL can selectively repair one form of degradation (self-doubt → gap analysis), while the critical thinking post documents the broader pattern

Original note title

rl training transforms thinking mode from counterproductive self-doubt into beneficial proactive analysis — the same mechanism helps or hurts depending on training