Can curiosity reward during conversation compete with simulated interaction optimization for alignment?
This explores whether rewarding a model for being curious mid-conversation (asking clarifying questions, probing intent) can win out against training it on simulated multi-turn rollouts as a route to alignment. The corpus suggests the two aren't actually rivals.
This reads the question as a contest: curiosity reward (paying the model to ask questions and discover intent) versus simulated interaction optimization (training on rollouts of future turns). The corpus's sharpest point is that the contest is mostly an illusion: simulated interaction is the mechanism that makes curiosity rewardable in the first place. Under standard single-turn RLHF, a clarifying question tends to score worse than a confident answer, because the reward is collected immediately and a question defers the payoff. CollabLLM's fix is to estimate the long-term value of a turn by simulating where the conversation goes, which is exactly what flips clarifying questions from penalized to rewarded (see: Why do language models respond passively instead of asking clarifying questions?). So 'curiosity' isn't a competitor to simulated optimization; it's what simulated optimization buys you that next-turn optimization can't.
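The contrast between next-turn and trajectory-level reward can be made concrete with a toy example. What follows is a minimal sketch under stated assumptions, not CollabLLM's actual implementation: the names `multiturn_aware_reward`, `toy_simulate`, `toy_score`, and `single_turn_reward`, the two-intent user, and the 0.05 per-turn cost are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class Rollout:
    turns: List[str]   # full conversation, including simulated future turns
    user_intent: str   # hidden from the model, known to the simulator


def single_turn_reward(candidate: str) -> float:
    """Toy stand-in for an immediate-helpfulness judge (assumption):
    a confident answer scores 1.0, a question scores 0.0."""
    return 0.0 if candidate.endswith("?") else 1.0


def multiturn_aware_reward(
    history: List[str],
    candidate: str,
    simulate_future: Callable[[List[str], int], Rollout],
    score_rollout: Callable[[Rollout], float],
    num_rollouts: int = 4,
) -> float:
    """Score a candidate turn by the average outcome of simulated futures."""
    rollouts = [simulate_future(history + [candidate], i) for i in range(num_rollouts)]
    return mean([score_rollout(r) for r in rollouts])


INTENTS = ["refund", "exchange"]  # hypothetical hidden user goals


def toy_simulate(turns: List[str], rollout_index: int) -> Rollout:
    """Deterministic toy user simulator: intents alternate across rollouts."""
    intent = INTENTS[rollout_index % len(INTENTS)]
    if turns[-1].endswith("?"):
        # The user answers the question, so the model can act on the real intent.
        turns = turns + [f"user: I want a {intent}", f"assistant: processing your {intent}"]
    return Rollout(turns=turns, user_intent=intent)


def toy_score(rollout: Rollout, turn_cost: float = 0.05) -> float:
    """1.0 if the final turn serves the hidden intent, minus a per-turn cost."""
    solved = 1.0 if rollout.user_intent in rollout.turns[-1] else 0.0
    return solved - turn_cost * len(rollout.turns)


HISTORY = ["user: I have a problem with my order"]
QUESTION = "assistant: Would you like a refund or an exchange?"
GUESS = "assistant: processing your refund"

ask_value = multiturn_aware_reward(HISTORY, QUESTION, toy_simulate, toy_score)
guess_value = multiturn_aware_reward(HISTORY, GUESS, toy_simulate, toy_score)
```

In this toy setup the immediate judge prefers the guess, while the rollout-based estimate prefers the question, because the guess is wrong in half of the simulated futures. The ordering depends entirely on the assumed turn cost and intent distribution; a cheaper guess or a more predictable user would reverse it.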
The reason this matters is that the default alignment target is actively hostile to curiosity. Preference optimization rewards fluent, confident responses, and in doing so erodes the grounding acts (checks, clarifications, repairs) that humans use to build shared understanding, leaving models producing about 77.5% fewer of them than humans do (see: Does preference optimization damage conversational grounding in large language models?; Does preference optimization harm conversational understanding?). That's an alignment tax: a model that looks more helpful in a single turn but fails silently across turns. Curiosity reward is one way to pay that tax back, but only if the training horizon is long enough to see the payoff, which loops you straight back to needing simulation.
The more interesting tension the corpus surfaces is internal to curiosity itself: when should a model ask versus just act? Insert-expansions from conversation analysis give a formal account of when probing the user prevents misunderstanding rather than merely recovering from it (see: When should AI agents ask users instead of just searching?). And proactive dialogue research shows the opposite move, supplying relevant information without being asked, can cut dialogue turns by about 60% in simulated medium-complexity domains (see: Could proactive dialogue make conversations dramatically more efficient?). So 'curiosity' isn't a single reward signal; over-rewarding questions could produce a model that interrogates when it should just answer. Conversational recommender work argues this is why asking, recommending, and timing should be optimized as one joint policy rather than as separate rewards that can't inform each other (see: Can unified policy learning improve conversational recommender systems?).
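One way to picture the ask-versus-act trade-off is as a value-of-information check: ask only when the expected gain from resolving ambiguity exceeds the cost of an extra turn. The sketch below is an illustrative assumption, not a method taken from the cited notes; `should_ask`, `intent_beliefs`, and `question_cost` are hypothetical names.

```python
from typing import Dict


def should_ask(intent_beliefs: Dict[str, float], question_cost: float) -> bool:
    """Return True when asking is expected to beat acting on the best guess.

    Assumes (hypothetically) that acting on the most likely intent succeeds
    with probability p_best, and that asking reveals the intent at a fixed cost.
    """
    total = sum(intent_beliefs.values())
    if total <= 0:
        raise ValueError("intent_beliefs must contain positive weight")
    p_best = max(intent_beliefs.values()) / total
    expected_gain = 1.0 - p_best  # success probability gained by asking first
    return expected_gain > question_cost


# Ambiguous request: asking is worth a turn.
ambiguous = should_ask({"refund": 0.5, "exchange": 0.5}, question_cost=0.2)
# Near-certain request: asking would only interrogate the user.
obvious = should_ask({"refund": 0.95, "exchange": 0.05}, question_cost=0.2)
```

A fixed threshold like this treats asking in isolation, which is precisely the limitation the joint-policy argument points at; it is shown only to make the over-asking failure mode concrete.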
Where the two approaches genuinely diverge is on what the simulator is for. Simulated interaction optimization can be turned inward: training a consistent user simulator cuts persona drift by over 55% and gives you a stable partner to optimize against (see: Can training user simulators reduce persona drift in dialogue?). The risk is that you align to the simulated user, not the real conversational work. Several notes warn that the work being skipped isn't informational at all: conversation maintenance (reference repair, topic hand-off, mirroring word choice) is social action that training-by-prediction never rewards (see: Why don't language models develop conversation maintenance skills?; Why don't conversational AI systems mirror their users' word choices?). A curiosity reward at least targets that relational layer directly; a simulator optimized for task completion may route around it.
The quiet payoff here: the real lever isn't which reward wins, it's the horizon. Curiosity and simulated rollouts are the same bet, namely that conversational quality lives in the trajectory, not the turn. The structure-only finding that conversation shape predicts satisfaction nearly as well as full-text analysis (68% versus 70% accuracy; see: Can conversation shape predict whether it will work?) is the clearest evidence that single-turn reward is measuring the wrong thing, and that whatever signal you use, it has to see the arc.
Sources (10 notes)
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
Research shows LLMs generate 77.5% fewer grounding acts than humans, and RLHF preference optimization actively worsens this gap. The optimization target—fluent, confident responses—directly undermines the communicative work of establishing shared understanding.
RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically suppresses grounding acts, leaving them 77.5% below human levels and creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.
Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.
Simulations show proactivity—providing relevant information without being asked—cuts dialogue turns by 60% in medium-complexity domains. This behavior mirrors human conversation and Grice's maxims but is almost entirely absent from AI datasets and research benchmarks.
Research shows that formulating attribute-asking, item-recommending, and timing decisions as a single graph-based RL policy achieves better joint optimization than isolated components. Separation prevents gradient signals from informing one another and fails to optimize conversation trajectory holistically.
By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off that sustain relational interaction, not convey information. Language models don't develop these because training signals reward information prediction, not relational work.
Response generation models fail to adapt vocabulary toward users' lexical choices, a phenomenon central to human rapport and clarity. Post-training via DPO on coreference-identified preferences can teach models in-context convention formation.
A structure-only model analyzing conversation trajectory achieved 68% accuracy predicting satisfaction, nearly matching full-text LLM analysis at 70%. Combined structural and textual features reached 80%, showing that how conversations unfold geometrically captures interaction quality text-based classifiers miss.