Does DPO training with coreference chains teach spontaneous convention formation?

This explores whether the DPO-on-coreference-chains method actually teaches LLMs to invent linguistic conventions on the fly during conversation — and what the rest of the corpus suggests about whether that's even possible.

The short answer from the corpus is yes, and the way it works is more interesting than the question lets on. The core finding (Can we teach LLMs to form linguistic conventions in context?) is that you don't fine-tune a model for each task. Instead, you build preference pairs from TV scripts: some reward shortening a reference after it has been introduced ("the red-haired detective" → "she"), others penalize shortening it too early, and special [remention] planning tokens are added. After DPO on those pairs, the model spontaneously forms ad-hoc conventions mid-interaction. So 'spontaneous convention formation' isn't a metaphor here; it's the measured behavior.
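The pair construction can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the note's implementation. It assumes each coreference chain has already been reduced to the dialogue context preceding each mention of one referent, plus a single full form and a single short form. It also treats the first mention as the "too early to shorten" case and assumes the [remention] token is prepended to re-mentions only. `PreferencePair`, `build_pairs`, and the token placement are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical spelling and placement of the planning token; the note only
# says special [remention] tokens are added, not exactly where they go.
REMENTION = "[remention]"


@dataclass
class PreferencePair:
    prompt: str    # dialogue context preceding the mention
    chosen: str    # continuation DPO should make more likely
    rejected: str  # continuation DPO should make less likely


def build_pairs(contexts: list[str], full_form: str, short_form: str) -> list[PreferencePair]:
    """Turn one (simplified) coreference chain into DPO preference pairs.

    contexts[i] is the dialogue text preceding the i-th mention of a single
    referent. This flat view of a chain is an assumption for illustration.
    """
    pairs = []
    for i, context in enumerate(contexts):
        if i == 0:
            # First mention: prefer the full form, so shortening before the
            # referent has been introduced is penalized.
            pairs.append(PreferencePair(context, chosen=full_form, rejected=short_form))
        else:
            # Re-mention: prefer the shortened form, flagged with the planning token.
            pairs.append(PreferencePair(
                context,
                chosen=f"{REMENTION} {short_form}",
                rejected=f"{REMENTION} {full_form}",
            ))
    return pairs


pairs = build_pairs(
    ["A: Who's running the case?\nB:", "A: Any news since then?\nB:"],
    full_form="the red-haired detective",
    short_form="she",
)
for p in pairs:
    print(p.chosen, "|", p.rejected)
```

Keeping both pair types in one function makes the asymmetry explicit: the same two strings swap roles depending on whether the referent is already part of the shared context.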

What makes this land is a deeper claim the corpus keeps circling: conventions are a relational, in-conversation phenomenon, and that is exactly what LLMs are usually bad at. Two notes make the case. One argues that models treat the opening prompt as a fixed frame and can't jointly revise shared assumptions with a user (Can LLMs truly update shared conversational common ground?). The other argues that the implicit techniques humans use to keep a conversation coherent, such as reference repair and topic hand-off, never develop because training rewards predicting information, not doing relational work (Why don't language models develop conversation maintenance skills?). Convention formation falls squarely in that 'relational work' category. The coreference-DPO result is notable precisely because it manufactures a training signal for something the standard objective ignores.
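To make "manufactures a training signal" concrete: DPO scores each preference pair with a loss of its own instead of relying on next-token prediction alone. The sketch below is the standard per-pair DPO loss, not code from the note. It assumes you already have summed sequence log-probabilities for the chosen and rejected continuations under both the policy being trained and a frozen reference model; `beta = 0.1` is an illustrative value, not the note's setting.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative; controls how far the policy may drift from the reference
) -> float:
    """Standard DPO loss for one preference pair: -log sigmoid(beta * margin)."""
    # How much more the policy favors each continuation than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(z)) == log(1 + exp(-z)), split into two branches to avoid overflow.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Identical policy and reference: no preference learned yet, so the loss is log 2.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# The policy has shifted probability toward the chosen (e.g. shortened) re-mention.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

Because the loss depends only on the gap between the chosen and rejected continuations, the same machinery applies to any behavior that can be expressed as a pair, including behaviors the next-token objective never targets directly.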

There's also a nice resonance with how these models learn meaning at all. One note argues that LLMs operationalize Saussure's *langue*: they pick up culturally situated patterns purely by compressing the relational structure of text, with no external referents required (Can language models learn meaning without engaging the world?). A linguistic convention is a relational fact (this short form now stands for that long one), so it is the kind of thing a relational compressor should be able to absorb given the right examples. The coreference method supplies those examples in a targeted way rather than hoping they emerge from generic pretraining.

It's worth holding some skepticism, though, because the corpus has a recurring caution: trained behaviors that look like a new capability are sometimes imitation of form. Chain-of-thought, for instance, reproduces familiar reasoning *shapes* learned in training and degrades under distribution shift rather than reflecting genuine inference (Does chain-of-thought reasoning reveal genuine inference or pattern matching?). The honest open question is whether DPO-taught convention formation generalizes to novel referents and conversation types, or whether it merely reproduces the re-mention patterns of TV dialogue. The source note frames it as genuine in-context convention formation; the broader corpus would push you to ask how far beyond the training distribution that holds.
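One hypothetical way to put a number on that open question: sample conversations about referents that appear in the training scripts and about held-out ones, then compare how much the model shortens its re-mentions in each group. This sketch assumes mentions have already been extracted per referent and uses word count as a crude length proxy. The names `shortening_ratio` and `generalization_gap` are invented for illustration; none of this comes from the note.

```python
from statistics import mean


def shortening_ratio(mentions: list[str]) -> float:
    """Mean word length of re-mentions divided by the first mention's word length.

    Values below 1.0 mean the reference got shorter after it was introduced.
    """
    if len(mentions) < 2:
        raise ValueError("need a first mention and at least one re-mention")
    first = len(mentions[0].split())
    later = [len(m.split()) for m in mentions[1:]]
    return mean(later) / first


def generalization_gap(seen: list[list[str]], held_out: list[list[str]]) -> float:
    """Held-out mean shortening ratio minus seen mean shortening ratio.

    Near 0: shortening transfers to novel referents. Clearly positive: the model
    shortens less on held-out referents, which fits the pattern-reproduction worry.
    """
    return mean(map(shortening_ratio, held_out)) - mean(map(shortening_ratio, seen))


seen = [["the red-haired detective", "she", "she"]]
held_out = [["the man with the umbrella", "the man with the umbrella"]]
print(shortening_ratio(seen[0]), generalization_gap(seen, held_out))
```

Word count ignores whether the short form is still resolvable, so a fuller check would also confirm that a listener can identify the referent from the shortened mention.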

If you want to go deeper, the contrast with the rigidity findings is the richest thread. Models of the same kind that resist personality conditioning and stay locked in a single communicative identity (Can language models adapt communication style to different contexts?) can, given the right preference signal, become flexible about reference. That tension, a globally static persona alongside locally adaptive convention, is the thing you didn't know you wanted to know.

Sources 6 notes

Can we teach LLMs to form linguistic conventions in context?

Post-training with two types of preference pairs derived from TV scripts — one encouraging re-mention shortening, one preventing premature shortening — plus special [remention] tokens enables models to spontaneously form ad-hoc linguistic conventions during interaction without task-specific fine-tuning.

Can LLMs truly update shared conversational common ground?

LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background, making the user the sole maintainer of the conversational scoreboard.

Why don't language models develop conversation maintenance skills?

Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off, which sustain relational interaction rather than convey information. Language models don't develop these because training signals reward information prediction, not relational work.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Can language models adapt communication style to different contexts?

System prompts and RLHF training lock models into one communicative identity across all interactions, preventing the contextual register-switching and value trade-offs that characterize human pragmatics. Users cannot reshape model behavior through dialogue negotiation.
