Should user simulators be trained via RL like agents or decomposed into trackable state components?

This explores whether the better path to a faithful user simulator is reinforcement learning (treating the simulator like a trainable agent) or structural decomposition (breaking the user's goal into trackable parts) — and whether those are even rival choices.

The question is framed as an either/or, but the corpus's sharpest finding is that the two are not rivals: decomposition is what makes the RL signal trustworthy in the first place. The straight RL case is real. Inverting the usual setup so the *simulator* is the policy, rewarded on prompt-to-line, line-to-line, and Q&A consistency, cuts persona drift by more than half (Can training user simulators reduce persona drift in dialogue?). That treats the simulator like any agent: give it a reward and let multi-turn experience shape it. But the goal-tracking work shows why that alone is fragile: simulators lose track of their own goals mid-conversation, and that drift *corrupts the very reward signal* an RL loop depends on. The fix, UGST, decomposes a user goal into profile, policy, task, requirements, and preferences, each independently tracked, and then internalizes alignment through a three-stage pipeline of steering, SFT, and GRPO (Why do LLM user simulators fail to track their own goals?). Note what that means: the decomposition isn't an alternative to RL; it's the scaffolding that lets RL train on a coherent target instead of a slowly corrupting one.
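As a rough illustration of what "each independently tracked" could look like in practice, here is a minimal sketch of a goal state split into the five components named above. The `Status` labels, the `GoalState` class, and its methods are hypothetical names chosen for this example; they are not taken from the UGST work.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # Hypothetical status labels; the text only says each part is tracked.
    PENDING = "pending"
    ALIGNED = "aligned"
    VIOLATED = "violated"


# The five goal components named in the text.
COMPONENTS = ("profile", "policy", "task", "requirements", "preferences")


@dataclass
class GoalState:
    """Tracks each sub-component of a user goal independently."""

    status: dict = field(
        default_factory=lambda: {name: Status.PENDING for name in COMPONENTS}
    )

    def update(self, component: str, new_status: Status) -> None:
        if component not in self.status:
            raise KeyError(f"unknown goal component: {component}")
        self.status[component] = new_status

    def violated(self) -> list:
        """Components the simulator has drifted away from."""
        return [c for c in COMPONENTS if self.status[c] is Status.VIOLATED]

    def is_coherent(self) -> bool:
        """True when no tracked component has been violated."""
        return not self.violated()


state = GoalState()
state.update("task", Status.ALIGNED)
state.update("preferences", Status.VIOLATED)
print(state.violated())     # ['preferences']
print(state.is_coherent())  # False
```

Keeping the components separate means a drift in one of them (here, preferences) is visible on its own instead of being averaged away in a single "on goal / off goal" judgment.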

So the real answer the corpus offers is: decompose *so that* you can train. And there's a third axis the question doesn't name: conditioning. RecLLM gets realism not from RL or goal-tracking but from feeding the simulator explicit latent variables, namely a session-level user profile and a turn-level intent (Can controlled latent variables make LLM user simulators realistic?). That's a different lever entirely, controlling the inputs rather than training the behavior, and the resulting conversations test as realistic under crowdsourced discrimination, discriminator models, and classifier-ensemble distribution matching. The trackable-components view and the controllable-latents view are close cousins: both say a simulator improves when its hidden state is made explicit rather than left implicit in a prompt.
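To make the conditioning lever concrete, the sketch below builds the text a simulator would be conditioned on from a session-level profile and a turn-level intent. It is an assumption-laden illustration, not RecLLM's actual interface: `UserProfile`, `INTENTS`, `build_turn_prompt`, and the prompt layout are all invented for this example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class UserProfile:
    # Session-level latent: fixed for the whole conversation.
    description: str


# Hypothetical turn-level intent labels, for illustration only.
INTENTS = ("ask_recommendation", "refine_request", "reject_item", "accept_item")


def build_turn_prompt(profile: UserProfile, intent: str, history: list) -> str:
    """Make both latents explicit in the text the simulator is conditioned on."""
    if intent not in INTENTS:
        raise ValueError(f"unknown intent: {intent}")
    lines = [
        f"User profile: {profile.description}",
        f"Intent for this turn: {intent}",
        "Conversation so far:",
        *history,
        "User:",
    ]
    return "\n".join(lines)


rng = random.Random(0)  # seeded so the sampled intent is reproducible
profile = UserProfile("enjoys slow-paced documentaries, dislikes horror")
history = [
    "User: Hi, I want something to watch tonight.",
    "System: How about a nature documentary?",
]
intent = rng.choice(INTENTS)  # turn-level latent, resampled every turn
print(build_turn_prompt(profile, intent, history))
```

The profile is created once per session while the intent is chosen per turn, which mirrors the session-level versus turn-level split described above.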

The wider agent-design literature backs the decomposition instinct. Reliable agents come from *externalizing* cognitive burdens (memory, skills, protocols) into a harness rather than hoping model scale solves them internally (Where does agent reliability actually come from?). A simulator whose goal is split into tracked sub-states is doing exactly this: externalizing 'who am I and what do I want' so it can't silently drift. But the RL camp has a counter-warning worth hearing: agents trained only on static, pre-specified structure are capped by their curators' imagination and never learn from their own failures (Can agents learn beyond what their training data shows?). Over-decompose and hand-specify everything, and you may build a simulator that covers only the user types you thought to enumerate.

That tension surfaces in two failure modes the corpus has already mapped. Persona work shows that persona generators optimized for *coverage* beat statistical density-matching at catching rare-but-consequential user configurations (Should persona simulation prioritize coverage over statistical matching?), so structure helps reach the edges. But social simulation collapses the moment agents must hold *private* information the model would normally just share with itself. Omniscient setups, where one model controls every interlocutor, hide this, and no amount of clean decomposition fixes a simulator that skips the grounding work of genuinely not knowing (Why do LLMs fail when simulating agents with private information?). That's a behavior you'd more plausibly train into existence than specify.

The thing worth carrying away: 'RL vs. decomposition' dissolves on contact with the strongest paper here. Decompose the user's goal into trackable state to keep the reward honest, condition on explicit latents for realism, then run RL on top of that clean signal — and keep enough open-ended interaction that the simulator can still surprise you with user behavior nobody enumerated. The order is the insight, not the choice.
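One way to picture "keep the reward honest" is a reward that only counts consistency when the tracked goal is still intact. The sketch below is a guess at such a rule, not a method from any of the cited notes: the equal-weight average of the three consistency scores, the hard zero when the goal has drifted, and the function names are all assumptions made for illustration.

```python
from statistics import mean


def consistency_reward(prompt_to_line: float, line_to_line: float, qa: float) -> float:
    """Average the three consistency scores, each assumed to lie in [0, 1]."""
    scores = (prompt_to_line, line_to_line, qa)
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must be in [0, 1]")
    return mean(scores)


def gated_reward(consistency: float, goal_coherent: bool) -> float:
    """Assumed gating rule: give no reward when the tracked goal has drifted."""
    return consistency if goal_coherent else 0.0


r = consistency_reward(0.9, 0.8, 0.7)
print(gated_reward(r, goal_coherent=True))
print(gated_reward(r, goal_coherent=False))  # 0.0
```

A softer variant could scale the reward by the fraction of goal components still aligned instead of zeroing it; the hard gate is used here only because it is the simplest rule to read.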

Sources (7 notes)

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Why do LLM user simulators fail to track their own goals?

The UGST framework breaks user goals into profile, policy, task, requirements, and preferences—each with explicit status tracking. A three-stage method (steering, SFT, GRPO) progressively internalizes goal alignment, reducing the misalignment that corrupts RL training signals.

Can controlled latent variables make LLM user simulators realistic?

RecLLM demonstrates that conditioning an LLM simulator on session-level (user profile) and turn-level (user intent) latent variables produces synthetic conversations measurable as realistic via crowdsource discrimination, discriminator models, and classifier-ensemble distribution matching.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Should persona simulation prioritize coverage over statistical matching?

Evolutionary optimization of Persona Generator code achieves broader trait coverage than density-matched baselines, including rare but consequential user configurations that naive LLM prompting misses.

Why do LLMs fail when simulating agents with private information?

Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.
