How does monological training on text differ from dialogical training in conversation?
This explores the gap between models trained to predict static text — one writer, no turn-taking — and what real conversation actually requires: two parties jointly building and repairing shared understanding.
This reads the question as asking what gets lost when a system learns language from monologue (text as finished artifact) rather than from the back-and-forth work of dialogue. The corpus frames the core difference sharply: text training is form-to-form prediction, while conversation is a coordinated social act. Bender & Koller's argument is the anchor here: meaning lives in the relation between expressions and communicative intent, and a model trained only on form has no access to the shared attention or intent that grounds language (Can language models learn meaning from text patterns alone?). A related framing says LLMs essentially operationalize Saussure's *langue*: they compress the relational structure of a language without ever touching its external referents (Can language models learn meaning without engaging the world?). Monological training can produce stunning fluency precisely because fluency turns out not to require dialogue at all.
But dialogue requires things text never teaches. Conversation is held together by implicit maintenance work (reference repair, topic hand-off, the small acts that keep two people oriented), and models don't acquire these because the training signal rewards predicting information, not doing relational work (Why don't language models develop conversation maintenance skills?). The deepest version of this gap is common ground: human dialogue lets both parties propose and update shared assumptions, but an LLM interprets every later turn inside its fixed initial prompt frame and can't symmetrically revise the shared scoreboard, leaving the user as its sole maintainer (Can LLMs truly update shared conversational common ground?). One note pushes this to its blunt conclusion: we talk *at* models, not *to* them, because the preposition 'to' presupposes an addressee capable of mutual uptake (Are we really communicating with language models?).
Here's the twist the corpus adds, and the thing you might not expect: the dialogical failures aren't only a side effect of monological pretraining; they're actively *manufactured* by the alignment stage that's supposed to make models conversational. RLHF optimizes for single-turn helpfulness, rewarding confident answers over clarifying questions, which drives grounding acts 77.5% below human levels (that is, to roughly 22% of them), an 'alignment tax' where the model looks helpful but fails silently across turns (Does preference optimization harm conversational understanding?). Because the reward lands on the next turn, models learn to respond passively rather than actively discover what the user wants; multi-turn-aware rewards reverse this and restore real collaboration (Why do language models respond passively instead of asking clarifying questions?). The same single-turn pressure suppresses proactivity, that is, volunteering relevant information unasked, even though doing so cuts dialogue turns by 60% in medium-complexity domains (Could proactive dialogue make conversations dramatically more efficient?).
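The contrast between a reward that lands on the next reply and one scoped to the whole interaction can be made concrete with a toy sketch. Everything in it is an assumption made for illustration: the function names, the scoring rules, and the scripted "simulated user" are hypothetical stand-ins, not the training setup of RLHF or of any system named in the notes. It only shows how the same two candidate replies can rank in opposite orders depending on which unit the reward is computed over.

```python
from statistics import mean
from typing import Callable, List

Conversation = List[str]


def single_turn_reward(reply: str, score_reply: Callable[[str], float]) -> float:
    """Reward scoped to the next reply only (hypothetical helper)."""
    return score_reply(reply)


def multi_turn_reward(
    history: Conversation,
    reply: str,
    simulate_future: Callable[[Conversation], Conversation],
    score_conversation: Callable[[Conversation], float],
    num_rollouts: int = 3,
) -> float:
    """Reward scoped to the interaction: average score of simulated continuations."""
    rollouts = [simulate_future(history + [reply]) for _ in range(num_rollouts)]
    return mean(score_conversation(c) for c in rollouts)


def toy_score_reply(reply: str) -> float:
    # Assumption: a single-turn scorer that favors confident answers over questions.
    return 0.3 if reply.endswith("?") else 1.0


def toy_simulate_future(conversation: Conversation) -> Conversation:
    # Assumption: a scripted user. A clarifying question leads to a solved task;
    # a confident guess leads to corrections and no resolution.
    if conversation[-1].endswith("?"):
        return conversation + [
            "user: the Python one",
            "assistant: here is the fix",
            "user: solved",
        ]
    return conversation + [
        "user: that's not what I meant",
        "assistant: here is another guess",
        "user: still wrong",
    ]


def toy_score_conversation(conversation: Conversation) -> float:
    # Assumption: the interaction is scored by whether the task ended up solved.
    return 1.0 if conversation[-1] == "user: solved" else 0.2


history = ["user: my script is broken"]
guess = "assistant: reinstall Node"
ask = "assistant: which script, the Python one or the shell one?"

for name, reply in [("guess", guess), ("ask", ask)]:
    st = single_turn_reward(reply, toy_score_reply)
    mt = multi_turn_reward(history, reply, toy_simulate_future, toy_score_conversation)
    print(f"{name}: single-turn={st:.2f} multi-turn={mt:.2f}")
```

Under these toy assumptions the confident guess wins on the single-turn reward while the clarifying question wins on the multi-turn one. The scripted user is deterministic, so the rollouts are identical; a real estimate of long-term interaction value would need a stochastic user simulator.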
There's also an identity cost. Human dialogue is pragmatic: speakers switch register and renegotiate the terms of the exchange as it unfolds. Alignment instead locks a model into one static communicative persona that users can't reshape through conversation (Can language models adapt communication style to different contexts?). And the registers a model does have are inherited wholesale from its training distributions: the sycophantic chat voice comes from RLHF on conversational data, the falsely objective essay voice from published prose, each carrying its source's failure modes (Why do LLMs produce such different writing in chat versus posts?). A systematic review reinforces why this matters: lexical alignment serves task efficiency while emotional and prosodic alignment build trust, so collapsing these dimensions produces category errors like cold service bots and evasive mental-health assistants (Do different types of alignment serve different conversational goals?).
The synthesis, then: monological training gives you a system that has absorbed the *structure* of language but none of the *coordination* of conversation — and the standard fix, preference alignment, optimizes the wrong unit (one turn) and so deepens the dialogical deficit it appears to address. The lever isn't more text or more RLHF; it's reward signals scoped to the whole interaction rather than the next reply.
Sources (11 notes)
Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.
Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.
Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off, which sustain relational interaction rather than convey information. Language models don't develop these because training signals reward information prediction, not relational work.
LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background, which makes the user the sole maintainer of the conversational scoreboard.
LLMs process tokens and generate continuations rather than receive and take up communication. The preposition 'to' presupposes an addressee capable of mutual orientation and shared commitment, which LLMs cannot provide, so Chalmers' investigation rests on an unwarranted linguistic foundation.
RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts to 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
Simulations show proactivity—providing relevant information without being asked—cuts dialogue turns by 60% in medium-complexity domains. This behavior mirrors human conversation and Grice's maxims but is almost entirely absent from AI datasets and research benchmarks.
System prompts and RLHF training lock models into one communicative identity across all interactions, preventing the contextual register-switching and value trade-offs that characterize human pragmatics. Users cannot reshape model behavior through dialogue negotiation.
The same model produces sycophantic chat (shaped by RLHF on conversational data) and falsely objective posts (shaped by published prose training). Each register inherits failure modes from its training distribution rather than representing different models or subsystems.
A 2020–2025 systematic review shows lexical alignment drives task efficiency and comprehension, while emotional and prosodic alignment drive relational warmth and trust. Conflating them in design produces category errors—cold customer-service bots and evasive mental-health assistants.