SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Reasoning, Retrieval, and Evaluation · Model Architecture and Internals

Do reasoning models switch between ideas too frequently?

Research explores whether o1-like models abandon promising reasoning paths prematurely by switching to different approaches without sufficient depth, and whether penalizing such transitions could improve accuracy.

Synthesis note · 2026-02-22 · sourced from Reasoning o1 o3 Search

"Thoughts Are All Over the Place" identifies a failure mode complementary to but distinct from overthinking: underthinking. Where overthinking generates excessively long traces, underthinking generates traces that switch between reasoning directions too frequently, failing to follow any promising path to completion.

The empirical finding: across multiple o1-like models, frequent thought switching correlates with incorrect responses on challenging mathematical test sets. The model starts down one reasoning path, encounters difficulty, switches to a different approach, encounters difficulty there too, and switches again, never going deep enough on any single path to reach a solution.

A novel metric quantifies this: token efficiency in incorrect answers, measuring how much of the reasoning trace was "wasted" on abandoned approaches versus productively advancing toward a solution.
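
One way to make the metric concrete is sketched below. It is an illustration under stated assumptions, not the paper's exact definition. It assumes each incorrect response has already been segmented into thoughts, and that each thought carries a token count and a hypothetical `promising` label (for instance from a judge model). Neither the segmentation nor the labeling is specified in this note. The score is the fraction of tokens generated after the first promising thought ends, averaged over incorrect responses. Responses with no promising thought score 0, which is an assumed convention.

```python
from dataclasses import dataclass


@dataclass
class Thought:
    tokens: int      # number of tokens in this thought
    promising: bool  # hypothetical label: could this thought have led to a correct answer?


def wasted_fraction(thoughts):
    """Fraction of a response's tokens generated after its first promising thought ends."""
    total = sum(t.tokens for t in thoughts)
    if total == 0:
        return 0.0
    used = 0
    for t in thoughts:
        used += t.tokens
        if t.promising:
            # Everything after this point was spent on other approaches.
            return 1.0 - used / total
    # Assumed convention: no promising thought means nothing was abandoned.
    return 0.0


def underthinking_score(incorrect_responses):
    """Mean wasted fraction over a list of incorrect responses (each a list of Thought)."""
    if not incorrect_responses:
        return 0.0
    return sum(wasted_fraction(r) for r in incorrect_responses) / len(incorrect_responses)
```

Under this sketch, a higher score means more of the trace came after a path that could have worked, which is the "wasted" portion described above.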

TIP (Thought-switching Penalty) is a pure decoding strategy — no model fine-tuning required. During generation, it penalizes the probability of tokens that signal thought transitions (linguistic markers like "Alternatively," "Let me try," "Wait"), encouraging the model to continue exploring the current path rather than jumping to a new one. The result: accuracy improves across challenging datasets.
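
The decoding-time idea can be sketched as a logit adjustment applied before sampling. The following is a minimal illustration and not the authors' implementation. The function name, the `alpha` (penalty strength) and `beta` (how many tokens into a thought the penalty stays active) parameters, and their default values are assumptions made for this example. It also assumes each transition marker maps to a single token id; multi-token phrases such as "Let me try" would need extra handling.

```python
import math


def apply_switch_penalty(logits, switch_token_ids, tokens_in_current_thought,
                         alpha=3.0, beta=600):
    """Return a copy of `logits` with thought-switch tokens penalized.

    The penalty is applied only while the current thought is shorter than
    `beta` tokens, so the model can still switch after exploring for a while.
    `alpha` and `beta` are illustrative values, not taken from this note.
    """
    if tokens_in_current_thought >= beta:
        return list(logits)
    return [x - alpha if i in switch_token_ids else x
            for i, x in enumerate(logits)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


# Toy vocabulary of three tokens; id 1 stands in for a marker like "Alternatively".
raw = [2.0, 2.0, 0.0]
penalized = apply_switch_penalty(raw, {1}, tokens_in_current_thought=10)
p_before = softmax(raw)[1]
p_after = softmax(penalized)[1]
```

In this toy case the switch token's probability drops after the penalty while the other tokens' logits are untouched, which is the sense in which the strategy needs no fine-tuning: only the sampling distribution changes.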

This reframes the overthinking/underthinking relationship. They are not opposites on a single dimension (trace length). Overthinking is excessive computation within a committed path. Underthinking is insufficient computation per path due to premature switching. A model can simultaneously overthink (too many tokens total) and underthink (too few tokens per path) — producing a long trace that wanders between incomplete approaches.
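
The two dimensions can be measured separately on a single trace. The sketch below is a rough diagnostic under loud assumptions: it splits a trace wherever one of the transition markers named in this note appears, and it counts whitespace-separated words as a crude stand-in for tokens. The function names are hypothetical, and a real analysis would use the model's tokenizer and a more careful thought segmenter.

```python
import re

# Transition markers named in this note; case-sensitive by assumption.
SWITCH_MARKERS = ("Alternatively", "Let me try", "Wait")


def split_into_paths(trace, markers=SWITCH_MARKERS):
    """Split a trace into segments, starting a new segment at each marker."""
    alternation = "|".join(re.escape(m) for m in markers)
    # Zero-width lookahead keeps the marker at the start of its segment
    # (splitting on empty matches requires Python 3.7+).
    pattern = r"(?=\b(?:" + alternation + r")\b)"
    return [p.strip() for p in re.split(pattern, trace) if p.strip()]


def trace_profile(trace):
    """Report total length and average length per path for one trace."""
    paths = split_into_paths(trace)
    counts = [len(p.split()) for p in paths]  # words as a proxy for tokens
    total = sum(counts)
    return {
        "total_tokens": total,
        "num_paths": len(paths),
        "mean_tokens_per_path": total / len(paths) if paths else 0.0,
    }


profile = trace_profile(
    "Try factoring the polynomial first. "
    "Alternatively, use substitution here. "
    "Wait, maybe induction works."
)
```

A trace that is both overthought and underthought would show a large `total_tokens` together with a small `mean_tokens_per_path`.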

The connection to the note "Why do reasoning LLMs fail at deeper problem solving?" is direct: premature thought switching is one mechanism that produces wandering behavior. The "unnecessary exploration" failure mode is exactly what happens when the model abandons productive branches for new ones before exploring them in enough depth.

Original note title

underthinking is premature thought switching — penalizing reasoning transitions improves accuracy without fine-tuning