Can language agents be represented as optimizable computational graphs?
This explores whether you can model an AI agent's whole workflow—its prompts and the way information flows between steps—as a graph that can be tuned automatically, rather than hand-built.
The corpus says yes, and the payoff is bigger than it first appears. When you represent a language agent as a computational graph (nodes are operations, edges carry information between them), popular prompting strategies like chain-of-thought, tree-of-thought, and Reflexion stop looking like separate inventions and turn out to be the same kind of structure wearing different clothes (see "Can we automatically optimize both prompts and agent coordination?"). Once they're all graphs, you can optimize along two axes at once: the text inside each node (the prompts) and the wiring between nodes (which node passes information to which).
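To make the graph view concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the representation used in the underlying note: the `Node` and `AgentGraph` classes, the `stub_llm` function, and the node names are all hypothetical, and a real system would call a language model where the stub just formats a string.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def stub_llm(prompt: str, inputs: List[str]) -> str:
    """Hypothetical stand-in for a model call: wraps the inputs in the prompt."""
    return f"{prompt}({' + '.join(inputs)})"


@dataclass
class Node:
    name: str
    prompt: str  # the text an optimizer could rewrite
    op: Callable[[str, List[str]], str] = stub_llm


@dataclass
class AgentGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: Dict[str, List[str]] = field(default_factory=dict)  # node -> predecessors

    def add(self, node: Node, inputs=()) -> None:
        self.nodes[node.name] = node
        self.edges[node.name] = list(inputs)

    def run(self, order: List[str], task: str) -> Dict[str, str]:
        """Execute nodes in the given order; nodes without predecessors read the task."""
        outputs: Dict[str, str] = {}
        for name in order:
            node = self.nodes[name]
            inputs = [outputs[p] for p in self.edges[name]] or [task]
            outputs[name] = node.op(node.prompt, inputs)
        return outputs


# A chain-of-thought style agent is a two-node chain: think, then answer.
graph = AgentGraph()
graph.add(Node("think", "Think step by step"))
graph.add(Node("answer", "Give the final answer"), inputs=["think"])
outputs = graph.run(["think", "answer"], "2+2")
print(outputs["answer"])
```

In this sketch the two optimization axes are visible as plain data: `Node.prompt` is the text inside a node, and `AgentGraph.edges` is the wiring between nodes. A tree-of-thought or Reflexion style agent would use the same two classes with a different set of nodes and edges.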
What makes this useful is that the two things people normally tune by trial and error become search problems. The same note shows you don't have to choose between improving a prompt and rearranging the agent's coordination—both can be learned. That reframes a lot of "agent engineering" as graph optimization rather than craft.
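As a rough illustration of tuning becoming search, the sketch below enumerates a few candidate prompts and a few candidate wirings and keeps the combination that scores best. Everything here is assumed for the example: the node names, the candidate lists, the `toy_op` stand-in for a model call, and especially `toy_score`, which substitutes for whatever external evaluation a real optimizer would use. Exhaustive enumeration is used only because the toy space is tiny; it is not presented as the method from the note.

```python
from itertools import product


def toy_op(prompt, inputs):
    """Hypothetical stand-in for a model call."""
    return f"[{prompt}]<-({', '.join(inputs)})"


def run_graph(prompts, edges, order, task):
    """Run nodes in order; a node with no predecessors reads the task directly."""
    outputs = {}
    for name in order:
        inputs = [outputs[p] for p in edges.get(name, [])] or [task]
        outputs[name] = toy_op(prompts[name], inputs)
    return outputs[order[-1]]


def toy_score(output):
    """Toy external signal: rewards step-by-step reasoning and use of the critic."""
    return int("step by step" in output) + int("Critique" in output)


# Axis 1: candidate prompts for the "solve" node.
candidate_prompts = ["Answer.", "Think step by step."]
# Axis 2: candidate input wirings for the "final" node.
candidate_edges = [("solve",), ("solve", "critic")]

best, best_score = None, -1
for solve_prompt, final_inputs in product(candidate_prompts, candidate_edges):
    prompts = {"solve": solve_prompt, "critic": "Critique the draft.", "final": "Combine."}
    edges = {"critic": ["solve"], "final": list(final_inputs)}
    score = toy_score(run_graph(prompts, edges, ["solve", "critic", "final"], "2+2"))
    if score > best_score:
        best, best_score = (solve_prompt, final_inputs), score

print(best, best_score)
```

The point of the sketch is only that prompt choice and edge choice sit in the same search loop, so neither has to be fixed by hand before the other is tuned.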
The corpus also hints at where the edges of the graph come from and where the graph stops helping. On the wiring side, there's a complementary idea: instead of hand-connecting which agent handles what, you can let agents discover each other through semantic capability vectors, making coordination a learned, scalable operation rather than manual plumbing (see "Can semantic capability vectors replace manual agent routing?"). And a node in such a graph need not be a giant model: much of the repetitive work can be handled by small language models, with large ones called only selectively, which is itself a structural optimization of the graph's cost (see "Can small language models handle most agent tasks?").
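A small sketch of capability-based routing with a cost budget follows. It is a simplification under stated assumptions: the agent names, the three-dimensional vectors, and the cost figures are invented for illustration, and it uses a brute-force cosine-similarity scan where the source note describes HNSW indices with policy and budget constraints.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical registry: each agent advertises a capability vector and a cost.
AGENTS = {
    "sql_writer_slm": {"vector": [0.9, 0.1, 0.0], "cost": 1},
    "summarizer_slm": {"vector": [0.1, 0.9, 0.1], "cost": 1},
    "planner_llm": {"vector": [0.5, 0.5, 0.7], "cost": 20},
}


def route(task_vector, budget):
    """Return the best-matching agent whose cost fits the budget, or None."""
    eligible = {name: a for name, a in AGENTS.items() if a["cost"] <= budget}
    if not eligible:
        return None
    return max(eligible, key=lambda name: cosine(task_vector, eligible[name]["vector"]))


print(route([0.8, 0.2, 0.0], budget=5))   # a narrow, well-defined task
print(route([0.4, 0.4, 0.8], budget=50))  # a task that matches the large model
print(route([0.4, 0.4, 0.8], budget=5))   # same task, but the large model is too costly
```

With a tight budget, the router falls back to the closest small model, which is one way to read "small models by default, large models selectively" as a property of the graph's edges.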
But a graph that optimizes prompts and edges is still only as good as the signals flowing through it. Reflexion, one of the techniques the unifying view absorbs, works precisely because it feeds back an unambiguous success/failure signal that the agent stores as memory and reuses, no weight updates required (see "Can agents learn from failure without updating their weights?"). That's the kind of clean signal graph optimization thrives on. The harder truth is that other parts of the agent live outside the graph: turning a model into something that reliably acts in the world takes pipeline transformation (data, grounding, infrastructure, safety), not just rewiring nodes (see "Can you turn an LLM into an agent by just fine-tuning?").
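The Reflexion-style loop described above can be sketched in a few lines. This is a toy under assumptions, not the published algorithm: the `act` policy is a stub that simply skips candidates ruled out by earlier reflections, and the `avoid:` reflection format is invented for the example. What it does preserve is the shape of the loop: a binary signal from the environment, reflections stored verbatim in memory, and no parameter updates.

```python
def act(candidates, memory):
    """Stub policy: pick the first candidate that no stored reflection rules out."""
    for candidate in candidates:
        if f"avoid:{candidate}" not in memory:
            return candidate
    return None


def reflexion_loop(candidates, environment_check, max_episodes=5):
    """Retry across episodes, learning only through reflections kept in memory."""
    memory = []  # reflections are stored uncompressed and reread every episode
    for _ in range(max_episodes):
        action = act(candidates, memory)
        if action is None:
            break  # nothing left to try
        if environment_check(action):  # unambiguous success/failure signal
            return action, memory
        memory.append(f"avoid:{action}")  # self-diagnosis written after a failure
    return None, memory


# The environment, not the agent, decides what counts as success.
result, reflections = reflexion_loop(["a", "b", "c"], lambda action: action == "c")
print(result, reflections)
```

Note that `environment_check` is passed in from outside the loop: the agent improves across episodes only because something external tells it, without ambiguity, that an attempt failed.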
The most interesting limit is conceptual. Optimizing a graph is a form of self-improvement, and self-improvement in language models has a formal ceiling: every reliable fix needs something external to verify it, because a system can't escape the gap between generating an answer and checking it through metacognition alone (see "What stops large language models from improving themselves?"). So "optimizable computational graph" is real and powerful for unifying techniques and automating design, but the optimizer still needs an outside signal to climb toward, which is exactly why the trial-and-error feedback loops keep showing up as the thing that makes the graph learn at all.
Sources (6 notes)
Language agents represented as computational graphs—where nodes are operations and edges define information flow—reveal that CoT, ToT, and Reflexion are formally equivalent structures. This unified view enables automatic optimization of both node prompts and edge connectivity without manual redesign.
Versioned Capability Vectors embedded in HNSW indices couple semantic matching with policy and budget constraints, making capability discovery a first-class operation that scales sub-linearly as agent heterogeneity increases.
SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
Converting LLMs to action-capable systems requires four distinct stages: curating action-environment-user datasets, training for action grounding, integrating agent infrastructure with memory and tools, and rigorous safety evaluation. The surrounding system and harness determine whether actions are grounded or hallucinated.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.