Why do standard next-token prediction models struggle with conversational initiative?
This explores why models trained to predict the next token tend to wait for instructions rather than steer a conversation — asking questions, raising topics, or planning ahead — and what in their training causes that passivity.
The corpus points to a clean answer: passivity isn't a quirk, it's baked into the objective. A model trained to predict the most likely continuation, then aligned to be helpful on each turn, is structurally built to react. Research finds that LLMs like ChatGPT can't initiate topics, plan strategically, or drive dialogue from their own goals, because their training optimizes for answering queries, not for creating dialogue (Why can't conversational AI agents take the initiative?). The fluency of the output hides the absence of any underlying agenda.
The sharper culprit is the reward signal layered on top. When models are tuned for immediate, single-turn helpfulness, they learn that the safest move is to answer right now, which actively discourages asking clarifying questions or holding back for a better multi-turn outcome (Why do language models respond passively instead of asking clarifying questions?). Initiative requires optimizing for the *trajectory* of a conversation, not the next reply. Tellingly, the same lesson shows up in conversational recommenders: when 'what to ask,' 'what to recommend,' and 'when to do each' are split into separate decisions, none can inform the others, and the system can't optimize the conversation as a whole. A single unified policy optimizes them jointly and does better (Can unified policy learning improve conversational recommender systems?).
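The gap between rewarding the next reply and rewarding the trajectory can be made concrete with a toy comparison. This is a minimal sketch, not the method of any cited work: the action names, the reward values, and the `discount` parameter are all hypothetical and chosen only to illustrate the argument above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    immediate_reward: float  # hypothetical per-turn helpfulness score
    future_reward: float     # hypothetical value of the turns that follow


# Illustrative numbers only; they are not taken from any cited study.
ACTIONS = [
    Action("answer_now", immediate_reward=0.6, future_reward=0.1),
    Action("ask_clarifying_question", immediate_reward=0.2, future_reward=0.9),
]


def single_turn_choice(actions):
    """Pick the action that maximizes the reward of the current turn only."""
    return max(actions, key=lambda a: a.immediate_reward)


def multi_turn_choice(actions, discount=0.9):
    """Pick the action that maximizes immediate plus discounted future reward."""
    return max(
        actions,
        key=lambda a: a.immediate_reward + discount * a.future_reward,
    )


if __name__ == "__main__":
    print(single_turn_choice(ACTIONS).name)
    print(multi_turn_choice(ACTIONS).name)
```

Under these assumed numbers, the single-turn scorer prefers answering immediately, while the trajectory-level scorer prefers the clarifying question. Setting `discount` to 0 collapses the second scorer into the first, which is one way to read the claim that per-turn rewards select against initiative.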
This matters because reactivity has a real cost. Models that don't take initiative lock into premature assumptions early in underspecified conversations and fail to recover: across 200,000+ conversations, all major LLMs showed an average performance drop of 39% in multi-turn settings, and agent-style mitigations recovered only 15-20% of that loss (Why do language models fail in gradually revealed conversations?). The flip side is the upside being left on the table: proactively offering relevant information before it's asked for, the way humans actually talk, cut dialogue turns by up to 60% in simulated medium-complexity domains, yet this behavior is almost entirely absent from AI training data and benchmarks (Could proactive dialogue make conversations dramatically more efficient?). The models aren't incapable; they were never rewarded for it.
What's interesting is that some of the missing ingredients seem latent rather than impossible. Small models trained with uncertainty-aware objectives can learn *when not to answer*, abstaining on uncertain predictions, and match pre-trained models 10x their size on conversation forecasting. That suggests the capacity for calibrated restraint exists but stays undertrained in standard LLMs (Can models learn to abstain when uncertain about predictions?). Knowing you're uncertain is the precondition for asking a good clarifying question. And at the foundations, reframing next-token prediction itself as a reasoning task with verifiable rewards hints that the bare objective isn't the hard ceiling; the limit lies in how we train on top of it (Can next-token prediction become a reasoning task with RL?).
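The link between uncertainty and clarifying questions can be sketched as a simple decision rule. Note the substitution: this is an inference-time entropy threshold, not the uncertainty-aware training objective the cited note describes. The intent labels, the `respond_or_ask` helper, and the `max_entropy_bits` threshold are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def respond_or_ask(intent_probs, max_entropy_bits=1.0):
    """Answer when the intent distribution is peaked; otherwise ask.

    intent_probs: dict mapping a hypothetical user intent to its probability.
    max_entropy_bits: hypothetical threshold that would need tuning in practice.
    """
    if entropy(intent_probs.values()) > max_entropy_bits:
        return "ask_clarifying_question"
    best = max(intent_probs, key=intent_probs.get)
    return f"answer:{best}"


if __name__ == "__main__":
    confident = {"book_flight": 0.9, "book_hotel": 0.05, "cancel": 0.05}
    unsure = {"book_flight": 1 / 3, "book_hotel": 1 / 3, "cancel": 1 / 3}
    print(respond_or_ask(confident))
    print(respond_or_ask(unsure))
```

A rule like this only works if the probabilities are calibrated, which is the part the paragraph above argues is undertrained in standard LLMs.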
The thread across all of this: conversational initiative is a goal-directed, multi-turn skill, and the standard recipe (predict the next token, reward immediate helpfulness on each turn) selects against exactly that. The models default to a reactive 'superposition' of plausible responses (Do large language models actually commit to a single character?) because nothing in their training gives them a reason to commit to a direction of their own.
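The 'superposition' idea can be illustrated with a toy version of the 20-questions setup. This is an analogy under stated assumptions, not a description of how an LLM works internally: the `OBJECTS` table, the attribute names, and the `reveal` helper are hypothetical.

```python
import random

# Hypothetical toy world: each object has yes/no attributes.
OBJECTS = {
    "cat": {"alive": True, "bigger_than_breadbox": False},
    "dog": {"alive": True, "bigger_than_breadbox": False},
    "horse": {"alive": True, "bigger_than_breadbox": True},
    "teapot": {"alive": False, "bigger_than_breadbox": False},
    "piano": {"alive": False, "bigger_than_breadbox": True},
}


def consistent_objects(history):
    """Return every object compatible with the answers given so far."""
    return sorted(
        name
        for name, attrs in OBJECTS.items()
        if all(attrs[question] == answer for question, answer in history)
    )


def reveal(history, rng):
    """Sample one object at reveal time instead of recalling a fixed one."""
    return rng.choice(consistent_objects(history))


if __name__ == "__main__":
    history = [("alive", True)]
    print(consistent_objects(history))
    # Each "regeneration" uses a fresh random source, so the revealed
    # object may differ while staying consistent with the history.
    for seed in range(3):
        print(reveal(history, random.Random(seed)))
```

The answerer here never holds a single secret object. It holds the set of objects still consistent with the dialogue and samples from it when asked, so repeated reveals can differ while each one remains valid.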
Sources (8 notes)
Research shows LLMs including ChatGPT cannot initiate topics, plan strategically, or lead conversations because their training optimizes for responding to queries, not creating dialogue from agent goals. This passivity is reinforced by alignment objectives and masked by fluent-sounding outputs.
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
Research shows that formulating attribute-asking, item-recommending, and timing decisions as a single graph-based RL policy achieves better joint optimization than isolated components. Separation prevents gradient signals from informing one another and fails to optimize conversation trajectory holistically.
Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.
Simulations show proactivity—providing relevant information without being asked—cuts dialogue turns by 60% in medium-complexity domains. This behavior mirrors human conversation and Grice's maxims but is almost entirely absent from AI datasets and research benchmarks.
Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.
Reinforcement Pre-Training transforms next-token prediction into a reasoning task by providing verifiable rewards from the corpus itself, eliminating reward hacking and enabling inference-time scaling during pretraining. This suggests token-level reasoning patterns during pretraining strengthen downstream RL fine-tuning.
Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.