Can abstractions guide exploration better than depth alone?

Does training a model to propose reasoning abstractions as intermediate subgoals help it explore diverse solution strategies more effectively than simply extending chain-of-thought depth?

Synthesis note · 2026-02-22 · sourced from Training Fine Tuning

RLAD addresses a structural problem with current reasoning training: RL incentivizes depth (longer chains attempting to verify one strategy) but not breadth (exploring diverse strategies). Long chains degenerate into frequent logic switches and unfocused exploration, the "underthinking" failure mode. As covered in Why do reasoning LLMs fail at deeper problem solving?, merely extending chains doesn't help.

The solution: reasoning abstractions — concise natural language descriptions of procedural and factual knowledge that function as high-level subgoals. Two models are jointly trained:

Abstraction generator: given a problem, propose multiple reasoning abstractions (strategies, intermediate lemmas, relevant principles)
Solution generator: conditioned on an abstraction, generate a solution that utilizes its information

The abstraction generator is rewarded for the improvement in solution accuracy that conditioning on its abstractions produces. The solution generator is rewarded for accuracy when using the abstraction. This cooperative two-player RL setup decouples learning signals: abstraction proposal and solution execution develop separately.
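The two reward signals described above can be sketched as follows. This is a minimal illustration, not RLAD's actual implementation: the function names are hypothetical, and it assumes the "improvement" is measured against the accuracy of solutions sampled without any abstraction, which the note does not specify.

```python
from statistics import mean


def solution_reward(is_correct: bool) -> float:
    """Solution generator reward: 1.0 for a correct final answer, else 0.0."""
    return 1.0 if is_correct else 0.0


def abstraction_reward(correct_with: list[bool], correct_without: list[bool]) -> float:
    """Abstraction generator reward: accuracy gain from conditioning on the abstraction.

    ASSUMPTION: the baseline is the accuracy of solutions sampled with no abstraction.
    """
    acc_with = mean([solution_reward(c) for c in correct_with])
    acc_without = mean([solution_reward(c) for c in correct_without])
    return acc_with - acc_without


# Toy correctness flags for four solutions sampled with and without one abstraction.
with_abs = [True, True, False, True]
without_abs = [True, False, False, False]
gain = abstraction_reward(with_abs, without_abs)
```

Under this reading, an abstraction that does not change accuracy earns zero reward and one that misleads the solver earns a negative reward, so the abstraction generator is credited only for guidance the solution generator can actually use.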

The key scaling result: at large test-time budgets, allocating more compute to generating abstractions improves performance more than generating more solutions. This challenges the standard parallel sampling approach (generate N solutions, pick the best). Instead, generate diverse abstractions, then one good solution per abstraction. The abstractions enforce breadth where depth-only chains fail.
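A rough sketch of this inference pattern, under stated assumptions: `propose` and `solve` stand in for the abstraction generator and solution generator and are hypothetical callables, the toy stubs exist only to make the example runnable, and majority voting is an assumed aggregation rule that the note does not specify.

```python
from collections import Counter
from typing import Callable


def abstraction_guided_sampling(
    problem: str,
    propose: Callable[[str, int], list[str]],  # hypothetical abstraction generator
    solve: Callable[[str, str], str],          # hypothetical solution generator
    budget: int,
    solutions_per_abstraction: int = 1,
) -> str:
    """Spend the sampling budget on breadth: many abstractions, few solutions each."""
    num_abstractions = max(1, budget // solutions_per_abstraction)
    abstractions = propose(problem, num_abstractions)
    answers = [
        solve(problem, abstraction)
        for abstraction in abstractions
        for _ in range(solutions_per_abstraction)
    ]
    # ASSUMPTION: aggregate by majority vote over final answers.
    return Counter(answers).most_common(1)[0][0]


# Toy stand-ins so the sketch runs end to end; real models would replace these.
def toy_propose(problem: str, n: int) -> list[str]:
    return [f"strategy-{i}" for i in range(n)]


def toy_solve(problem: str, abstraction: str) -> str:
    # Pretend one strategy leads to a wrong answer and the rest agree.
    return "41" if abstraction.endswith("-0") else "42"


answer = abstraction_guided_sampling("toy problem", toy_propose, toy_solve, budget=4)
```

Setting `solutions_per_abstraction=1` corresponds to the "one solution per abstraction" allocation; raising it shifts the same budget back toward repeated sampling within a single strategy.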

This connects to Why does parallel reasoning outperform single chain thinking?, since abstractions are a mechanism for structured parallel exploration. It also connects to Does separating planning from execution improve reasoning accuracy?, since abstractions are a learned, RL-trained form of decomposition rather than a fixed prompt scaffold. In the terms of Can reasoning topologies be formally classified as graph types?, RLAD creates a two-level structure: parallel abstraction nodes (breadth-first, like CoT-SC) each conditioning a single depth-first solution chain (like CoT), producing a learned GoT-like topology where aggregation happens at the abstraction level.

The warmstart from SFT (summarize multiple candidate solutions → generate diverse abstractions) followed by RL refinement mirrors the dynamic described in Why does SFT-then-RL training follow a predictable three-phase pattern?, but in a cooperative multi-agent setting.

Related concepts in this collection 5

Why do reasoning LLMs fail at deeper problem solving? Explores whether current reasoning models systematically search solution spaces or merely wander through them, and how this affects their ability to solve increasingly complex problems.
the problem RLAD addresses: depth without breadth
Why does parallel reasoning outperform single chain thinking? Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
abstractions enforce structured parallel exploration
Does separating planning from execution improve reasoning accuracy? Can modular LM architectures that split problem decomposition from solution execution outperform monolithic models? This explores whether decoupling these cognitive operations reduces interference and boosts performance.
abstractions as learned decomposition
Does policy entropy collapse limit reasoning performance in RL? As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
abstractions may resist entropy collapse by maintaining strategy diversity
Can reasoning topologies be formally classified as graph types? This explores whether Chain of Thought, Tree of Thought, and Graph of Thought represent distinct formal graph structures with different computational properties. Understanding this matters because the topology itself determines what reasoning strategies are possible.
RLAD creates a distinct topology: a two-level graph where the abstraction generator produces parallel breadth nodes (like CoT-SC) and each abstraction conditions a depth-first solution chain (like CoT); the result is a learned GoT-like structure where aggregation (in-degree > 1) happens at the abstraction level rather than at the solution level

Original note title

reasoning abstractions decompose exploration into breadth-first strategy discovery and depth-first solution generation via two-player rl
