Can abstractions guide exploration better than depth alone?
Does training a model to propose reasoning abstractions as intermediate subgoals help it explore diverse solution strategies more effectively than simply extending chain-of-thought depth?
RLAD addresses a structural problem with current reasoning training: RL incentivizes depth (longer chains that try to verify a single strategy) but not breadth (exploring diverse strategies). Long chains degenerate into frequent logic switches and unfocused exploration, the "underthinking" failure mode. As Why do reasoning LLMs fail at deeper problem solving? argues, merely extending chains does not fix this.
The solution: reasoning abstractions — concise natural language descriptions of procedural and factual knowledge that function as high-level subgoals. Two models are jointly trained:
- Abstraction generator: given a problem, propose multiple reasoning abstractions (strategies, intermediate lemmas, relevant principles)
- Solution generator: conditioned on an abstraction, generate a solution that utilizes its information
The abstraction generator is rewarded for the improvement in solution accuracy that conditioning on its abstractions produces. The solution generator is rewarded for accuracy when using the abstraction. This cooperative two-player RL setup decouples learning signals: abstraction proposal and solution execution develop separately.
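The two reward signals described above can be sketched as follows. This is a minimal illustration, not the reward used in the RLAD paper: the exact-match check, the function names, and the choice to estimate the baseline from solutions sampled without any abstraction are all assumptions made for the example.

```python
from statistics import mean
from typing import Sequence


def solution_reward(answer: str, reference: str) -> float:
    """Accuracy reward for the solution generator.

    Assumption: correctness is judged by exact string match.
    """
    return 1.0 if answer.strip() == reference.strip() else 0.0


def abstraction_reward(
    answers_with: Sequence[str],
    answers_without: Sequence[str],
    reference: str,
) -> float:
    """Reward for the abstraction generator.

    Computed as the gain in mean solution accuracy when the solver is
    conditioned on the abstraction, relative to an unconditioned baseline
    (the baseline choice is an assumption of this sketch).
    """
    acc_with = mean(solution_reward(a, reference) for a in answers_with)
    acc_without = mean(solution_reward(a, reference) for a in answers_without)
    return acc_with - acc_without
```

Under this sketch an abstraction that leaves accuracy unchanged earns zero reward and one that misleads the solver earns a negative reward, so the abstraction generator is only credited for abstractions the solution generator can actually use.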
The key scaling result: at large test-time budgets, allocating additional compute to generating abstractions improves performance more than generating additional solutions. This challenges the standard parallel sampling approach (generate N solutions, pick the best). Instead: generate diverse abstractions, then one good solution per abstraction. The abstractions enforce breadth where depth-only chains fail.
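The two ways of spending a fixed test-time budget can be contrasted in a short sketch. The callables `propose`, `solve`, and `solve_plain` are hypothetical stand-ins for the abstraction generator and solution generator, and majority voting is an assumed aggregation rule; the note does not specify how the final answer is selected.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer (assumed aggregation rule)."""
    return Counter(answers).most_common(1)[0][0]


def solve_with_sampling(
    problem: str,
    solve_plain: Callable[[str], str],
    num_samples: int,
) -> str:
    """Standard parallel sampling: N unconditioned solutions, then aggregate."""
    return majority_vote([solve_plain(problem) for _ in range(num_samples)])


def solve_with_abstractions(
    problem: str,
    propose: Callable[[str, int], List[str]],
    solve: Callable[[str, str], str],
    num_abstractions: int,
    solutions_per_abstraction: int = 1,
) -> str:
    """Abstraction-guided sampling: spend the budget on breadth of strategy.

    Total solver calls = num_abstractions * solutions_per_abstraction.
    """
    abstractions = propose(problem, num_abstractions)
    answers = [
        solve(problem, abstraction)
        for abstraction in abstractions
        for _ in range(solutions_per_abstraction)
    ]
    return majority_vote(answers)
```

With a budget of, say, 16 solver calls, the comparison in the paragraph above corresponds to `solve_with_sampling(..., num_samples=16)` against `solve_with_abstractions(..., num_abstractions=16, solutions_per_abstraction=1)`.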
This connects to Why does parallel reasoning outperform single chain thinking? (abstractions are a mechanism for structured parallel exploration) and to Does separating planning from execution improve reasoning accuracy? (abstractions are a learned, RL-trained form of decomposition rather than a fixed prompt scaffold). In the terms of Can reasoning topologies be formally classified as graph types?, RLAD creates a two-level structure: parallel abstraction nodes (breadth-first, like CoT-SC) each condition a single depth-first solution chain (like CoT), producing a learned GoT-like topology where aggregation happens at the abstraction level.
The warmstart from SFT (summarize multiple candidate solutions → generate diverse abstractions) followed by RL refinement mirrors the dynamic described in Why does SFT-then-RL training follow a predictable three-phase pattern?, but in a cooperative multi-agent setting.
Inquiring lines that use this note as a source (129)
This note is a source for these synthesized inquiries.
- What makes conceptual inquiry the fastest high-scoring AI interaction pattern?
- Why do foundation models develop heuristics instead of world models?
- What scaffolding tools help users specify implicit contextual boundaries to models?
- Can surface heuristics override implicit constraints in domain-specific reasoning?
- Why do abstract semantic memories outperform specific interaction histories for journey discovery?
- Can explicit constraint statements override the dominance of surface heuristics?
- What is the relationship between reasoning depth and verbalization requirements?
- How does SONAR embedding quality affect downstream reasoning accuracy?
- Can step-level deliberation flags guide other reasoning systems?
- Can graph cyclicity and topology predict when reasoning systems achieve breakthrough insights?
- Does optimizing directly for semantic diversity improve both reasoning quality and exploration?
- Why must procedural skills consolidate before strategic reasoning can develop?
- How does nesting optimization levels improve on traditional network depth?
- Why does explicit theory injection work better than example-based learning for reasoning tasks?
- Can a proposer agent actively surface a solver's weaknesses to prevent plateau?
- How does semantic search over research papers guide autonomous architecture proposals?
- How does critique fine-tuning on one problem unlock broader reasoning?
- What makes bilevel metacognition architectural rather than emergent in current systems?
- Do larger models develop more abstract features than smaller ones?
- What role does exploration-exploitation balance play in abstraction formation?
- Does the model learn depth-wise drift as an explicit strategy?
- Can marginal hints integrate better into reasoning than comprehensive explanations?
- Can activation patching reveal which reasoning steps actually matter?
- Can prompting for specific creative paradigms improve ideation diversity?
- Can diverse critiques on a single problem unlock reasoning without diverse problem sets?
- How do evolutionary archives enable diverse exploration in self-improving systems?
- Can accelerated sampling techniques from image generation speed up evolutionary search?
- Can energy minimization replace reasoning-specific reinforcement learning for system 2 thinking?
- How do gradient descent iterations at inference compare to chain-of-thought reasoning chains?
- Can evolutionary approaches avoid the overthinking failure mode of iterative refinement?
- What does an intermediate interface between planning and grounding actually look like?
- Does the planning-grounding factoring principle apply to other agent tasks?
- What makes hierarchical community summaries useful for exploration without a specific question?
- Does architectural design matter more than model scale for reasoning tasks?
- Do reasoning models trade instruction following for deliberative capability?
- Does social scaffolding outperform purely intrinsic motivation for agent exploration?
- Can combinational creativity alone drive open-ended learning in agents?
- Do task-specific heuristics emerge because they compress well enough?
- Can breadth-first search in continuous space outperform chain-of-thought on logical tasks?
- Why do longer reasoning chains signal hesitation rather than depth?
- Does reasoning structure match explicit versus implicit task demands?
- How do foundation models develop task-specific heuristics instead of world models?
- Can external summarization solve exploration problems in complex real-world environments?
- Do LLMs fail exploration because of context integration or computational limitations?
- How does explicit exploratory prompting compare to fine-tuned reinforcement learning for in-context adaptation?
- Why do models learn reasoning form instead of actual abstract inference?
- Can we transfer reasoning structure without copying surface form?
- How does MCTS combine parallel exploration with sequential reasoning depth?
- When does self-reflection actually help reasoning models improve?
- How does inductive reasoning from partial evidence enable hypothesis formation?
- Can depth scaling and breadth scaling unlock independent capability axes?
- Why does exploration quality matter more than learner network depth?
- Do depth thresholds correspond to transitions between procedural and strategic learning?
- Why does imitation learning create a ceiling for reasoning capability?
- What makes diverse reasoning sources more valuable than deeper single paths?
- Does deep-thinking ratio measure computational effort better than chain-of-thought length?
- Why do reasoning chains degenerate into undirected exploration at scale?
- How does separating decomposition from execution improve multi-step reasoning?
- Can recursive subtask trees implement tree-of-thought reasoning more efficiently?
- How does graph of thoughts enable divide-and-conquer reasoning patterns?
- Why do reasoning models wander instead of searching systematically?
- Is the reasoning cliff actually a tool-use problem?
- Does small-world structure in reasoning graphs improve generalization?
- How does dynamic recurrence during training improve depth extrapolation?
- Do higher asymptote recipes unlock genuinely novel reasoning strategies?
- What distinguishes systematic search from wandering exploration in reasoning?
- Does verbal step-by-step reflection preserve learning signals that abstraction removes?
- When are multiple independent attempts more valuable than depth?
- How does soft thinking compare to sampling multiple independent reasoning paths?
- Can curriculum approaches teach agents when to stop exploring?
- How much does chain-of-thought reasoning narrow the decompression gap?
- How can semantic diversity optimization work if exploration and exploitation were truly opposed?
- Can capability boundary collapse be addressed by operating at representational rather than token level?
- How much reasoning depth do we actually need for most real-world tasks?
- Do novelty and feasibility always trade off in idea generation?
- Can a single architecture represent both physical and mental possibility spaces?
- Do reasoning failures stem from strategy or from calculation breakdown?
- Does unrestricted reasoning per search step degrade iterative quality over time?
- Do search agents face their own overthinking threshold like reasoning models do?
- What is the optimal balance between search rounds and reasoning depth per round?
- Do reasoning models switch approaches when encountering local difficulty?
- Why do foundation models develop task-specific heuristics instead of causal understanding?
- Does critique training improve exploration diversity during model training or only test time?
- Why is metacognition neglected as a foundational AI research area?
- Why do per-turn thinking budgets matter alongside iterative retrieval depth?
- What distinguishes task-specific heuristics from genuine world models?
- Does algorithmic decomposition prevent planning-execution interference in reasoning?
- Can operationalizing theory into prompt structure improve reasoning more than theory itself?
- How should humans specify deterministic abstractions of RL problems?
- How does policy initialization with sub-policies enable emergent thinking?
- What structural differences emerge between early generic skills and later meta-strategy skills?
- Does RL amplify existing reasoning or create genuinely new computational strategies?
- Why does RL behavior differ between standard reasoning tasks and complex planning domains?
- What makes planning, tool use, and reasoning into jointly optimizable subsystems?
- How does interaction horizon differ from chain-of-thought depth?
- How does making implicit reasoning requirements explicit change model performance?
- What makes a causal abstraction more transferable than a generic heuristic?
- What happens to iterative search quality when reasoning depth is unconstrained?
- How does Self-Discover compare to the cognitive tools approach?
- How do progressive abstraction chains differ from branching reasoning topologies?
- Why does the Chinese Room argument miss the deeper abstraction problem?
- How does planning-before-execution compare to iterative reasoning and action loops?
- How do strategy-level abstractions differ from storing raw task workflows?
- Why does per-step deliberation lose global perspective compared to dynamic discovery?
- What distinguishes graph-of-thought reasoning from other structured reasoning topologies?
- Can deterministic recurrent depth achieve the computational benefits of stochastic reasoning?
- Do linearized traces genuinely expand exploration beyond standard chain-of-thought?
- How does entropy loss enable exploration beyond a single training example?
- Does preference tuning help or hurt the exploration of solution spaces in code?
- Can the exploration ceiling be raised beyond what pretraining established?
- When is numeric computation the real bottleneck versus reasoning depth?
- Can width-scaling replace depth-scaling on inherently sequential problems?
- Why do longer reasoning chains explore like tourists instead of scientists?
- Why does prompting discover capabilities that need reward-driven refinement?
- How does active reasoning through interaction differ from passive single-turn problem solving?
- How does the prefrontal cortex inspire artificial reasoning architectures?
- How does continuous soft thinking explore multiple paths without explicit training?
- How does training data structure shape reasoning strategy more than domain content?
- Can backward planning reduce search difficulty when multiple goal state paths exist?
- Why does extended reasoning training improve exploration without adding new capabilities?
- What makes o1's chain-of-thought processing specifically effective for exploration tasks?
- How should AI ideation systems decompose and recombine research concepts?
- How do extrapolative and contextual generalization measure RL reasoning gains?
- How does o1-style reasoning relate to learned search processes versus memorized solutions?
- What makes multi-turn critique trajectories more effective than single-turn reasoning chains?
- Why does strategy diversity within reasoning chains improve model generalization?
- How does early commitment in reasoning differ from early exploitation in planning?
- What makes exploration a verifiable and measurable training objective?
- Can tools unlock reasoning strategies that require abstract insight beyond computation?
Related concepts in this collection (5)
- Why do reasoning LLMs fail at deeper problem solving?
  Explores whether current reasoning models systematically search solution spaces or merely wander through them, and how this affects their ability to solve increasingly complex problems.
  the problem RLAD addresses: depth without breadth
- Why does parallel reasoning outperform single chain thinking?
  Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
  abstractions enforce structured parallel exploration
- Does separating planning from execution improve reasoning accuracy?
  Can modular LM architectures that split problem decomposition from solution execution outperform monolithic models? This explores whether decoupling these cognitive operations reduces interference and boosts performance.
  abstractions as learned decomposition
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  abstractions may resist entropy collapse by maintaining strategy diversity
- Can reasoning topologies be formally classified as graph types?
  This explores whether Chain of Thought, Tree of Thought, and Graph of Thought represent distinct formal graph structures with different computational properties. Understanding this matters because the topology itself determines what reasoning strategies are possible.
  RLAD creates a distinct topology: a two-level graph where the abstraction generator produces parallel breadth nodes (like CoT-SC) and each abstraction conditions a depth-first solution chain (like CoT); the result is a learned GoT-like structure where aggregation (in-degree > 1) happens at the abstraction level rather than at the solution level
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Reasoning LLMs are Wandering Solution Explorers
- RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
- Stream of Search (SoS): Learning to Search in Language
- Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models
- Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models
Original note title
reasoning abstractions decompose exploration into breadth-first strategy discovery and depth-first solution generation via two-player rl