What makes planning, tool use, and reasoning into jointly optimizable subsystems?
This note explores why breaking an AI agent into separate parts (the part that plans, the part that calls tools, the part that reasons) turns out to be what makes the whole thing tunable, and what conditions the corpus identifies as necessary for that.
This reads the question as asking what structural conditions let planning, tool use, and reasoning be treated as separate modules you can improve independently or together — rather than as one tangled chain of thought you can only tune as a blob. The corpus's strongest answer is surprisingly consistent: the precondition is decoupling. The moment these capabilities stop sharing the same context window and the same forward pass, each gains a clean interface, and a clean interface is what makes a subsystem optimizable.
The clearest version of this is the idea of representing an agent as a computational graph, where nodes are operations and edges define how information flows between them (see "Can we automatically optimize both prompts and agent coordination?"). Once you do that, prompting tricks that looked like separate inventions (chain-of-thought, tree-of-thought, Reflexion) turn out to be the same structure with different wiring, and you can optimize two axes at once: the prompt inside each node, and the connections between nodes. That's the literal meaning of "jointly optimizable": the planning topology and the reasoning content become two knobs on one object instead of one inseparable habit.
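To make the "two knobs on one object" idea concrete, here is a minimal sketch of an agent as a graph. Everything in it is an assumption made for illustration: the `AgentGraph` class, the node names, and the `echo_llm` stand-in for a model call are hypothetical and do not come from the cited work. The point is only that a chain-of-thought style pipeline and a Reflexion-like critique-and-revise pipeline share one structure and differ in their prompts and edges.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a model call: takes a prompt, returns text.
LLM = Callable[[str], str]


@dataclass
class AgentGraph:
    prompts: Dict[str, str]        # knob 1: the prompt template inside each node
    edges: List[Tuple[str, str]]   # knob 2: the wiring, as (source, destination)
    order: List[str]               # execution order, assumed topologically sorted

    def run(self, task: str, llm: LLM) -> Dict[str, str]:
        outputs: Dict[str, str] = {}
        for node in self.order:
            # A node sees only the outputs of the nodes wired into it.
            inputs = [outputs[src] for src, dst in self.edges if dst == node]
            prompt = self.prompts[node].format(task=task, inputs=" | ".join(inputs))
            outputs[node] = llm(prompt)
        return outputs


def echo_llm(prompt: str) -> str:
    """Deterministic fake model so the sketch runs without any API."""
    return f"<{prompt}>"


# A two-node chain: think, then answer.
chain = AgentGraph(
    prompts={"think": "Think about {task}", "answer": "Answer {task} using {inputs}"},
    edges=[("think", "answer")],
    order=["think", "answer"],
)

# Same kind of object, different wiring: add critique and revise nodes.
reflexive = AgentGraph(
    prompts={
        **chain.prompts,
        "critique": "Critique {inputs}",
        "revise": "Revise {task} given {inputs}",
    },
    edges=[
        ("think", "answer"),
        ("answer", "critique"),
        ("answer", "revise"),
        ("critique", "revise"),
    ],
    order=["think", "answer", "critique", "revise"],
)

chain_out = chain.run("2+2", echo_llm)
reflexive_out = reflexive.run("2+2", echo_llm)
```

An optimizer over this object could edit `prompts` and `edges` independently or together, which is the sense in which the two axes are jointly tunable.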
Decoupling matters for more than elegance: fusing these jobs causes interference. Separating the model that decomposes a problem from the model that solves it improves accuracy, and notably the *decomposition* skill transfers across domains while the *solving* skill does not (see "Does separating planning from execution improve reasoning accuracy?"). That asymmetry is the whole argument for modularity: planning and execution are different kinds of skill that generalize differently, so welding them together wastes the part that travels. The same logic shows up in tool use, where decoupling reasoning from tool observations, either by planning the full chain before executing it or by reasoning over abstract placeholders that get filled in later, eliminates the quadratic prompt growth and serial latency you get when every tool response is stuffed back into the reasoning stream (see "Can reasoning and tool execution be truly decoupled?").
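The plan-first, fill-in-later pattern can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation of any cited system: the `#E1`-style placeholder syntax, the `execute_plan` helper, and the two toy tools are all hypothetical. The plan is written in full before any tool runs, so the planner's prompt never grows with tool output.

```python
import re
from typing import Callable, Dict, List, Tuple

Tool = Callable[[str], str]

# Each plan step is (placeholder, tool_name, argument). Arguments may refer
# to earlier placeholders such as "#E1" (a hypothetical naming scheme).
Plan = List[Tuple[str, str, str]]


def execute_plan(plan: Plan, tools: Dict[str, Tool]) -> Dict[str, str]:
    """Run a pre-written plan, substituting placeholders with observations."""
    evidence: Dict[str, str] = {}
    for placeholder, tool_name, argument in plan:
        # Replace every earlier placeholder with the value observed for it.
        resolved = re.sub(r"#E\d+", lambda m: evidence[m.group(0)], argument)
        evidence[placeholder] = tools[tool_name](resolved)
    return evidence


def lookup(key: str) -> str:
    """Toy tool: a fixed table standing in for a search or database call."""
    return {"width": "4", "height": "5"}[key]


def multiply(expr: str) -> str:
    """Toy tool: multiplies two integers written as 'a*b'."""
    left, right = expr.split("*")
    return str(int(left) * int(right))


# The whole chain is planned up front; no tool output is needed to write it.
plan: Plan = [
    ("#E1", "lookup", "width"),
    ("#E2", "lookup", "height"),
    ("#E3", "multiply", "#E1*#E2"),
]

evidence = execute_plan(plan, {"lookup": lookup, "multiply": multiply})
```

Here `evidence["#E3"]` holds the product of the two looked-up values, and the only component that ever saw tool output was the executor, not the planner.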
There's a deeper claim underneath, which is that planning and reasoning aren't even one capability. One line of work argues reasoning systems should separate *when* to reason from the *capacity* to reason: RL post-training mostly teaches activation timing for mechanisms pre-training already installed (see "How should reasoning systems actually be architected?"). The failure mode when you don't separate these is visible: reasoning models "wander" and abandon promising paths not because they lack compute but because nothing is governing the structure of exploration (see "Why do reasoning models abandon promising solution paths?"). Give exploration an explicit governing layer, namely abstractions that enforce breadth before depth, and you can jointly train the abstraction generator alongside the solution generator (see "Can abstractions guide exploration better than depth alone?"). Planning becomes a trainable subsystem precisely because it's been pulled out as its own thing.
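At inference time, "breadth before depth" can be pictured as a small control loop that first spreads the budget across distinct abstractions and only then samples solutions under each one. The sketch below is a loose illustration, not the training procedure the corpus describes: `explore_breadth_first` and the `propose`, `solve`, and `score` callables are hypothetical names, and the stubs are deterministic so the example runs on its own.

```python
from typing import Callable, List, Tuple


def explore_breadth_first(
    task: str,
    propose: Callable[[str, int], List[str]],  # hypothetical abstraction generator
    solve: Callable[[str, str], str],          # hypothetical solution generator
    score: Callable[[str], float],             # hypothetical verifier or reward
    n_abstractions: int,
    samples_per_abstraction: int,
) -> Tuple[str, str]:
    """Return the (abstraction, solution) pair with the highest score."""
    candidates: List[Tuple[str, str]] = []
    # Breadth first: every abstraction gets a share of the budget
    # before any single line of attack is pursued further.
    for abstraction in propose(task, n_abstractions):
        for _ in range(samples_per_abstraction):
            candidates.append((abstraction, solve(task, abstraction)))
    return max(candidates, key=lambda pair: score(pair[1]))


# Deterministic stubs standing in for the two generators and the scorer.
def stub_propose(task: str, k: int) -> List[str]:
    return ["factor", "complete the square"][:k]


def stub_solve(task: str, abstraction: str) -> str:
    return f"{task} via {abstraction}"


best = explore_breadth_first(
    "x^2+2x+1=0",
    stub_propose,
    stub_solve,
    score=len,  # toy scorer: longer text wins
    n_abstractions=2,
    samples_per_abstraction=1,
)
```

Because `propose` and `solve` are separate callables with their own inputs and outputs, each could in principle be trained or swapped on its own, which is the separation the paragraph above argues for.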
The practical payoff of treating these as separate modules is that you can also *learn and reuse* them. Once sub-tasks are first-class objects, an agent can extract reusable routines from past experience and compound them hierarchically, with relative gains of 24.6% on Mind2Web and 51.1% on WebArena that grow as tasks drift from training (see "Can agents learn reusable sub-task routines from past experience?"). This connects to the most radical framing in the collection: hiding step-irrelevant context so each LLM call sees only what it needs, turning reasoning into modular, debuggable sub-tasks embedded in an explicit algorithm (see "Can algorithms control LLM reasoning better than LLMs alone?"), and structuring reasoning as recursive subtask trees that prune their own memory (see "Can recursive subtask trees overcome context window limits?"). The thread tying all of these together, and the thing you might not have known you wanted to know, is that "jointly optimizable" and "cleanly separable" are the same property viewed from two sides: subsystems become tunable together only once each one has been given its own boundary, interface, and memory.
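The information-hiding idea and the self-pruning subtask tree can be shown together in one small sketch. It is a plain-Python analogy under stated assumptions, not the mechanism of any cited system: the `Subtask` class, `run_subtask`, and `fake_llm` are hypothetical, and list clearing stands in for the KV cache pruning that a real model would do internally. Each call sees only its own goal plus its children's final results, and a child's working memory is dropped once its parent has consumed the result.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Subtask:
    goal: str
    children: List["Subtask"] = field(default_factory=list)
    scratch: List[str] = field(default_factory=list)  # working memory
    result: Optional[str] = None


def run_subtask(task: Subtask, llm: Callable[[str], str]) -> str:
    # Solve children first; only their final results flow upward.
    child_results = [run_subtask(child, llm) for child in task.children]
    # Step-specific context: this goal plus child results, nothing else.
    context = task.goal + "".join(f"\n- {r}" for r in child_results)
    task.scratch.append(context)
    task.result = llm(context)
    # Prune: the children's working memory is no longer needed.
    for child in task.children:
        child.scratch.clear()
    return task.result


seen: List[str] = []  # records exactly what each model call was shown


def fake_llm(context: str) -> str:
    """Deterministic fake model that echoes the goal line of its context."""
    seen.append(context)
    return f"done({context.splitlines()[0]})"


root = Subtask("plan trip", children=[Subtask("book flight"), Subtask("book hotel")])
answer = run_subtask(root, fake_llm)
```

Inspecting `seen` shows that the root call received two one-line results instead of the children's full working context, which is what keeps each call small and each sub-task debuggable in isolation.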
Sources (9 notes)
Language agents represented as computational graphs—where nodes are operations and edges define information flow—reveal that CoT, ToT, and Reflexion are formally equivalent structures. This unified view enables automatic optimization of both node prompts and edge connectivity without manual redesign.
Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.
ReWOO and Chain-of-Abstraction both decouple reasoning from tool responses through different mechanisms—planning-before-execution and abstract placeholders respectively—eliminating quadratic prompt growth and sequential latency while maintaining reasoning quality.
Research shows RL post-training teaches models *when* to use reasoning mechanisms that pre-training already provides. Decoupled architectures, latent reasoning in continuous space, and interleaved action-grounding all outperform monolithic chain-of-thought approaches.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting that viable solution paths exist but are abandoned prematurely.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.
LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.
The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.