How does this differ from using LLMs as the policy itself?

This explores the contrast between making the LLM *be* the policy — the thing that directly chooses actions and gets optimized end-to-end — versus embedding it as a component inside a larger structure that holds the real control. The corpus stakes out both poles, and the gap between them turns out to be where most of the practical design happens.

The purest version of "LLM as policy itself" comes from reframing the model as a learnable agent. When you treat an LLM as a policy in a partially observable decision process rather than a one-shot text generator, its memory, planning, and reasoning all become things you can optimize with reinforcement learning across many steps (see "How does treating LLMs as multi-step agents change what we can optimize?"). That's the maximal-trust stance: the model decides, and you shape its decisions through reward. The alternative designs in the corpus are mostly reactions to the ways that trust gets betrayed.
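To make the "LLM as policy" framing concrete, here is a minimal sketch of the loop it implies: a policy sees only its own history of observations, picks an action, and collects reward over many steps. Everything here is an illustrative assumption. `GuessEnv` is a toy environment, and `bisect_policy` is a hand-written stub standing in for an LLM call; neither comes from the survey.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type for the policy: observation history in, action out.
# In the LLM-as-policy setting this callable would wrap a model call.
Policy = Callable[[List[str]], str]


@dataclass
class GuessEnv:
    """Toy partially observable task: the hidden target is never shown,
    only 'higher' / 'lower' / 'correct' feedback after each guess."""
    target: int
    max_steps: int = 10
    steps: int = 0

    def step(self, action: str) -> Tuple[str, float, bool]:
        self.steps += 1
        guess = int(action)
        if guess == self.target:
            return "correct", 1.0, True
        hint = "higher" if guess < self.target else "lower"
        return hint, 0.0, self.steps >= self.max_steps


def bisect_policy(history: List[str]) -> str:
    """Stub policy (stand-in for an LLM): replays the history to narrow
    the search range. Using memory well is what reward would reinforce."""
    lo, hi = 0, 100
    for entry in history:
        guess, hint = entry.split(":")
        if hint == "higher":
            lo = int(guess) + 1
        elif hint == "lower":
            hi = int(guess) - 1
    return str((lo + hi) // 2)


def rollout(policy: Policy, env: GuessEnv) -> Tuple[float, List[str]]:
    """Run one episode; the policy alone chooses every action."""
    history: List[str] = []
    total_reward = 0.0
    done = False
    while not done:
        action = policy(history)
        observation, reward, done = env.step(action)
        history.append(f"{action}:{observation}")
        total_reward += reward
    return total_reward, history
```

Calling `rollout(bisect_policy, GuessEnv(target=37))` runs a full episode. In an RL setup, the returned reward is the signal that would be used to update the policy, which is what makes multi-step behavior optimizable at all.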

The first betrayal is execution. LLMs are good at *knowing* how to plan but bad at producing plans that actually run: only about 12% of GPT-4's generated plans execute without errors (see "Can large language models actually create executable plans?"). So instead of letting the model be the policy, you make it a subroutine inside an explicit algorithm that manages control flow and feeds it only the context relevant to each step (see "Can algorithms control LLM reasoning better than LLMs alone?"). The algorithm is the policy; the LLM is hired help. This matters because LLMs also lock into bad early guesses and can't recover when information arrives gradually (see "Why do language models fail in gradually revealed conversations?"), which is a hard argument against handing them the whole loop.
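A rough sketch of the "algorithm is the policy" pattern, assuming a generic text-in, text-out model call. The names `Step`, `run_program`, `fake_llm`, and `non_empty` are hypothetical and are not taken from the LLM Programs work. The point is only that ordinary code owns the step order, the retries, and the state, while each model call sees just the fields its step declares.

```python
from typing import Callable, Dict, List, NamedTuple


class Step(NamedTuple):
    name: str          # key under which the step's output is stored
    needs: List[str]   # the only state keys this step is allowed to see
    template: str      # prompt template filled from those keys


def run_program(
    steps: List[Step],
    state: Dict[str, str],
    call_llm: Callable[[str], str],
    validate: Callable[[str, str], bool],
    max_retries: int = 2,
) -> Dict[str, str]:
    """The algorithm, not the model, decides what runs next."""
    for step in steps:
        # Information hiding: pass only step-specific context.
        context = {key: state[key] for key in step.needs}
        prompt = step.template.format(**context)
        for _ in range(max_retries + 1):
            output = call_llm(prompt)
            if validate(step.name, output):
                break
        else:
            # All attempts failed validation; control stays with the code.
            raise RuntimeError(f"step {step.name!r} failed validation")
        state[step.name] = output
    return state


# Hypothetical stand-ins so the sketch runs without a real model.
seen_prompts: List[str] = []


def fake_llm(prompt: str) -> str:
    seen_prompts.append(prompt)
    return prompt.upper()


def non_empty(step_name: str, output: str) -> bool:
    return bool(output.strip())


STEPS = [
    Step("outline", ["question"], "Outline an answer to: {question}"),
    Step("draft", ["outline"], "Write a draft from: {outline}"),
]
```

Running `run_program(STEPS, {"question": "...", "scratch": "..."}, fake_llm, non_empty)` fills in `outline` and then `draft`. The `scratch` entry never reaches a prompt because no step lists it in `needs`.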

The recommender-systems work makes the same move from a different angle, and names the tradeoff cleanly. There are three ways to integrate an LLM: feed its embeddings to a traditional ranker, have it emit semantic tokens, or let it recommend directly as the policy (see "How should language models integrate into recommender systems?"). The surprising empirical result is that the indirect routes win: using the LLM to enrich item descriptions and then handing that text to a conventional recommender beats asking the LLM to recommend directly, because the model is great at content understanding but lacks the specialized ranking bias the task needs (see "Does LLM input augmentation beat direct LLM recommendation?"). "LLM as policy" and "LLM as feature extractor" aren't just architectural preferences; they cash out in measurable quality.
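The wiring of the augmentation route can be sketched as follows. This is a toy under stated assumptions, not the method from the cited work: the "traditional recommender" is a bag-of-words cosine ranker, `fake_enrich` is a canned lookup standing in for an LLM enrichment call, and the items and profile are invented. It shows where the LLM sits in the pipeline and says nothing about real-world quality.

```python
import re
from collections import Counter
from math import sqrt
from typing import Callable, Dict, List


def tokens(text: str) -> Counter:
    """Lowercased bag of words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank(profile_text: str, items: Dict[str, str]) -> List[str]:
    """Stand-in for a conventional content-based recommender."""
    profile = tokens(profile_text)
    return sorted(items, key=lambda name: cosine(profile, tokens(items[name])), reverse=True)


def augment(items: Dict[str, str], enrich: Callable[[str], str]) -> Dict[str, str]:
    """The LLM only enriches text; it never ranks anything."""
    return {name: f"{desc} {enrich(desc)}" for name, desc in items.items()}


# Invented example data and a canned (hypothetical) enrichment function.
ITEMS = {
    "pasta": "An epic collection of weeknight pasta recipes",
    "dune": "Paul Atreides travels to Arrakis",
}
CANNED = {
    ITEMS["pasta"]: "italian cooking recipes",
    ITEMS["dune"]: "science fiction novel about desert politics",
}
PROFILE = "epic science fiction about politics"


def fake_enrich(description: str) -> str:
    return CANNED.get(description, "")
```

With the raw descriptions, `rank(PROFILE, ITEMS)` puts `"pasta"` first because of the accidental word overlap on "epic". After `augment(ITEMS, fake_enrich)`, the same ranker puts `"dune"` first. The ranking logic is untouched; only its input text changed.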

There's even a layer *above* the policy question: routing decides which model handles a query before any generation happens, a pre-generation choice that's distinct from evaluating what the model produces (see "Can routers select the right model before generation happens?"). So the full spectrum runs from the LLM as the optimized agent that acts, down through the LLM as a decomposed sub-task worker, the LLM as a content-enrichment component, and finally the LLM as one selectable option a router picks among. The thing worth taking away: "use the LLM as the policy" is the most powerful and the most fragile choice, and a lot of the best systems deliberately demote the model precisely where its planning, ranking, or recovery weaknesses would otherwise sink the whole loop.
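Pre-generation routing can be sketched in a few lines. This is an assumption-heavy illustration and not the RouteLLM or Hybrid-LLM implementation: `predict_difficulty` is a made-up keyword-and-length heuristic where a real router would use a learned predictor, and the two "models" are plain functions.

```python
from typing import Callable, Dict, Tuple

# Hypothetical markers of a hard query; a real router learns this signal.
HARD_MARKERS = ("prove", "derive", "optimize")


def predict_difficulty(query: str) -> float:
    """Heuristic difficulty score in [0, 1], computed from the query alone."""
    length_score = min(len(query.split()) / 50, 1.0)
    marker_score = 1.0 if any(m in query.lower() for m in HARD_MARKERS) else 0.0
    return 0.5 * length_score + 0.5 * marker_score


def route(
    query: str,
    models: Dict[str, Callable[[str], str]],
    threshold: float = 0.4,
) -> Tuple[str, str]:
    """Pick exactly one model before any text is generated, then call it."""
    name = "large" if predict_difficulty(query) >= threshold else "small"
    return name, models[name](query)


# Stand-in models so the sketch runs without any API.
MODELS = {
    "small": lambda q: f"[small] {q}",
    "large": lambda q: f"[large] {q}",
}
```

Because the decision uses only the query, only one model is ever invoked per request. That is the difference from a cascade, which would generate with the small model first and escalate after inspecting the response.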

Sources (7 notes)

How does treating LLMs as multi-step agents change what we can optimize?

The Agentic RL survey shows that modeling LLMs as policies in Partially Observable MDPs rather than single-step generators makes memory, planning, and reasoning into RL-optimizable subsystems. This structural reframing explains the recent empirical convergence across memory-based agents, skill learning, and strategy distillation.

Can large language models actually create executable plans?

Only 12% of GPT-4-generated plans are executable without errors. LLMs excel at acquiring planning knowledge but fail at the reasoning assembly required to handle subgoal and resource interactions.

Can algorithms control LLM reasoning better than LLMs alone?

LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.

Why do language models fail in gradually revealed conversations?

Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings, driven by locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.

How should language models integrate into recommender systems?

Research identifies three patterns: LLM embeddings feeding traditional recommenders, LLM-generated semantic tokens for decision-making, and direct LLM-as-recommender. Each trades off compatibility, latency, bias exposure, and capability utilization differently.

Does LLM input augmentation beat direct LLM recommendation?

Using LLMs to augment item descriptions with paraphrases, summaries, and categories—then feeding enriched text to traditional recommenders—beats asking LLMs to recommend directly. The mechanism: LLMs excel at content understanding but lack specialized ranking bias, so their textual enrichment is more valuable than their predictions.

Can routers select the right model before generation happens?

RouteLLM and Hybrid-LLM both achieve 40-50% cost reduction by routing to a single model based on query difficulty prediction, not response evaluation. Single-model routing minimizes latency compared to ensemble or cascade alternatives.
