Why do token-level language models fail at utterance-level pragmatic optimization?
This explores why models trained to predict the next token — a local, moment-by-moment objective — struggle when success depends on the whole utterance or conversation landing right (being clear, asking the right question, achieving a communicative goal).
The short version the corpus keeps circling back to: the training objective and the success criterion live at different altitudes. A language model is, at its core, an autoregressive probability machine that maximizes the likelihood of the next token given what came before (see "Can we predict where language models will fail?"). Pragmatic optimization, meaning saying the thing that actually moves a conversation toward what the user wants, is a property of the whole exchange, not of any single token. Nothing in the local objective is looking at that larger target, so it gets sacrificed whenever it conflicts with what's locally probable.
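The altitude gap can be written down as a rough sketch. The notation here is assumed for illustration and does not come from the source notes: $x_{1:T}$ is a token sequence, $c$ is the conversational context, $y$ is a full utterance sampled from the model $p_\theta$, and $R(c, y)$ is a hypothetical scalar score for how well the whole utterance serves the user's goal.

```latex
% Token-level training objective: next-token negative log-likelihood
\mathcal{L}_{\text{token}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)

% Utterance-level success criterion: expected score of the whole utterance
J_{\text{utt}}(\theta) = \mathbb{E}_{y \sim p_\theta(\cdot \mid c)}\left[ R(c, y) \right]
```

Each term of the first sum rewards matching one next token. The second expression assigns a single score to the complete utterance, and nothing in the first objective refers to it.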
The sharpest illustration is in how reward shaping bakes this in. Standard RLHF optimizes for immediate, next-turn helpfulness, which quietly trains models to give a confident answer now rather than ask a clarifying question that would pay off three turns later (see "Why do language models respond passively instead of asking clarifying questions?"). The fix in that work is telling: you have to estimate the long-term value of an interaction, that is, explicitly optimize at the conversation level, before the model will do the pragmatically smart thing. That's the same gap from the other side: utterance-level competence has to be designed in, because token-level (or turn-level) reward won't produce it on its own.
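The difference between scoring the next turn and scoring the conversation can be shown with a toy comparison. This is a minimal sketch, not the method from the cited work: the `Candidate` class, both value functions, and every reward number are hypothetical and chosen only to make the contrast visible.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    # Hypothetical helpfulness score per turn, starting with the current turn.
    turn_rewards: List[float]


def next_turn_value(c: Candidate) -> float:
    """Turn-level objective: only the immediate turn counts."""
    return c.turn_rewards[0]


def conversation_value(c: Candidate, gamma: float = 1.0) -> float:
    """Conversation-level objective: discounted sum over all turns."""
    return sum(r * gamma**k for k, r in enumerate(c.turn_rewards))


# Made-up numbers: answering immediately scores well now but poorly later;
# asking a clarifying question scores poorly now but well later.
candidates = [
    Candidate("answer_now", [0.6, 0.2, 0.2]),
    Candidate("ask_clarifying_question", [0.3, 0.8, 0.9]),
]

best_turn_level = max(candidates, key=next_turn_value).name
best_conversation_level = max(candidates, key=conversation_value).name

print("turn-level choice:", best_turn_level)
print("conversation-level choice:", best_conversation_level)
```

With these assumed rewards the turn-level objective prefers answering immediately, while the conversation-level objective prefers the clarifying question. The candidates are identical in both cases; only the horizon of the score changes.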
The cost shows up downstream as conversations that derail. When information is revealed gradually, models lock onto a premature guess early and can't recover: a 39% average performance drop across multi-turn settings, with mitigations clawing back only 15–20% of that loss (see "Why do language models fail in gradually revealed conversations?"). A pragmatically optimizing speaker would hold off, hedge, or probe; a next-token optimizer commits to the locally fluent continuation and pays for it later. Relatedly, the model doesn't even hold a fixed stance to optimize around. It maintains a superposition of possible characters and samples one at generation time, so there's no stable communicative intent being steered (see "Do large language models actually commit to a single character?").
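To make the size of the gap concrete, here is a small worked calculation. It rests on two assumptions that the notes do not spell out: the 39% figure is read as a relative drop from the single-turn score, and the 15–20% figure is read as a fraction of the lost performance. The baseline of 100 and the function name `after_mitigation` are hypothetical.

```python
def after_mitigation(baseline: float, drop: float, recovered_fraction: float) -> float:
    """Score after a relative drop, with a fraction of the lost amount recovered."""
    lost = baseline * drop
    return baseline - lost + recovered_fraction * lost


baseline = 100.0  # hypothetical single-turn score, normalized
drop = 0.39       # average multi-turn drop reported in the notes

multi_turn = after_mitigation(baseline, drop, 0.0)
low = after_mitigation(baseline, drop, 0.15)
high = after_mitigation(baseline, drop, 0.20)

print(f"multi-turn, no mitigation: {multi_turn:.2f}")
print(f"with mitigation: {low:.2f} to {high:.2f}")
```

Under these assumptions a system that scored 100 in the single-turn setting would land at 61 in the multi-turn setting, and at roughly 67 to 69 with mitigations, so most of the loss remains.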
What makes this feel less like a tuning bug and more like an architectural ceiling is that the same mismatch recurs wherever the goal is procedural rather than next-step. Models don't actually run iterative optimization in latent space; they recognize a problem as template-similar to something seen in training and emit a plausible-looking value, a failure that persists across scale (see "Do large language models actually perform iterative optimization?"). Pragmatic optimization is itself iterative (track the goal, evaluate whether the last move helped, adjust), and the corpus suggests that this kind of held-over-time optimization is exactly what next-token prediction replaces with pattern-matching. There's even a mechanistic hint at why: under RLVR, learning concentrates in a small set of high-entropy 'forking' tokens, roughly 20% of the total (see "Do high-entropy tokens drive reasoning model improvements?"), i.e. the signal that gets refined is local decision points, not utterance-level plans.
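One way to picture what a high-entropy 'forking' token is: compute the Shannon entropy of the next-token distribution at each position and keep the positions with the highest values. The sketch below is only an illustration of that idea, not the procedure used in the cited work. The helper names, the toy three-token vocabulary, and the distributions are made up; the 20% cutoff echoes the figure in the notes.

```python
import math
from typing import List


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_positions(dists: List[List[float]], top_fraction: float = 0.2) -> List[int]:
    """Indices of the top `top_fraction` of positions, ranked by entropy."""
    entropies = [entropy(d) for d in dists]
    k = max(1, round(top_fraction * len(dists)))
    ranked = sorted(range(len(dists)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


# Toy next-token distributions over a 3-token vocabulary, one per position.
dists = [
    [0.98, 0.01, 0.01],    # near-certain continuation
    [0.34, 0.33, 0.33],    # genuine fork: many continuations are plausible
    [0.90, 0.05, 0.05],
    [0.97, 0.02, 0.01],
    [0.99, 0.005, 0.005],
]

forks = forking_positions(dists)
print("forking positions:", forks)
```

In this toy sequence only the second position qualifies. The selection is made position by position, which is the point of the paragraph above: the signal lives at local decision points, and nothing in the computation looks at the utterance as a whole.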
The deeper reason it can't be prompted away: prompting and in-context steering work only within what the model already is. Strong training priors override the current context when they conflict (see "Why do language models ignore information in their context?"), and prompt optimization can reorganize existing knowledge but cannot inject a capability the model lacks (see "Can prompt optimization teach models knowledge they lack?"). So if utterance-level pragmatic optimization isn't in the objective, you can't reliably ask for it at inference time. You have to change what's being optimized, which is precisely the move the multi-turn-reward work makes. The upshot: 'be more pragmatic' isn't a prompt problem, it's an altitude problem in the loss function.
Sources (8 notes)
"Can we predict where language models will fail?" By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions such as the backwards alphabet and letter counting.
"Why do language models respond passively instead of asking clarifying questions?" CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
"Why do language models fail in gradually revealed conversations?" Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15–20% of this loss.
"Do large language models actually commit to a single character?" Shanahan's 20-questions test shows that LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, showing that there is no fixed commitment.
"Do large language models actually perform iterative optimization?" Research shows that LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
"Do high-entropy tokens drive reasoning model improvements?" Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
"Why do language models ignore information in their context?" Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
"Can prompt optimization teach models knowledge they lack?" Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.