What training signals would models need to learn reciprocal common-ground construction?

This explores what kind of feedback during training could teach a model to *co-build* shared understanding with a user — proposing and revising mutual assumptions — rather than just interpreting everything inside a fixed opening frame.

This question is really about a missing signal. The corpus's sharpest finding on the problem is that LLMs can't actually update common ground at all: they read every later turn through the lens of the initial prompt, so when a user pivots or contradicts an earlier framing, the model can't absorb that revision into jointly held background, and the human ends up being the sole keeper of the conversational scoreboard (Can LLMs truly update shared conversational common ground?). Reciprocity fails not because the model is dumb but because nothing in training ever rewarded it for being a co-author of shared assumptions.

The most suggestive recipe for fixing that comes from a very different corner of the collection: agents trained against a *diverse* set of partners develop cooperation on their own, because mutual vulnerability to being exploited creates pressure for both sides to adapt to each other (Can agents learn cooperation by adapting to diverse partners?). The training signal there isn't a rule that says 'cooperate'; it's a population of changing counterparts that makes adapting-to-your-partner the only stable strategy. Applied to common ground, this hints that you'd need to train models against interlocutors who themselves revise their assumptions, so the model is penalized for clinging to a stale frame and rewarded for tracking a moving shared state.
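As a rough illustration of that idea, the toy sketch below scores two hand-written policies against a small population of simulated partners who flip their own assumptions at different rates. It is not taken from any of the cited notes: `RevisingPartner`, `match_reward`, the proposition names, and the revision rates are all hypothetical, and the "partner announces what it revised" shortcut stands in for the much harder problem of inferring revisions from language.

```python
import random
from dataclasses import dataclass


@dataclass
class RevisingPartner:
    """Hypothetical simulated interlocutor that revises its own assumptions."""
    assumptions: dict      # proposition -> bool, the partner's current view
    revise_prob: float     # chance per turn of flipping one assumption
    rng: random.Random

    def step(self):
        """Maybe flip one assumption; return the announced key, or None."""
        if self.rng.random() < self.revise_prob:
            key = self.rng.choice(sorted(self.assumptions))
            self.assumptions[key] = not self.assumptions[key]
            return key
        return None


def match_reward(belief, partner):
    """Fraction of propositions where the model's belief matches the partner's view."""
    view = partner.assumptions
    return sum(belief[k] == view[k] for k in view) / len(view)


def run_episode(track_revisions, revise_prob, turns=20, seed=0):
    """Average per-turn reward for a policy that does or does not absorb revisions."""
    rng = random.Random(seed)
    start = {"formal_tone": True, "topic_is_python": True, "user_is_expert": False}
    partner = RevisingPartner(dict(start), revise_prob, rng)
    belief = dict(start)  # the opening frame
    total = 0.0
    for _ in range(turns):
        revised = partner.step()
        if track_revisions and revised is not None:
            # Tracking policy: absorb the announced revision into shared state.
            belief[revised] = partner.assumptions[revised]
        total += match_reward(belief, partner)
    return total / turns


def population_score(track_revisions, revise_probs=(0.0, 0.3, 0.6, 1.0)):
    """Mean reward across partners with different (assumed) revision rates."""
    scores = [run_episode(track_revisions, p, seed=i)
              for i, p in enumerate(revise_probs)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print("frozen frame :", population_score(track_revisions=False))
    print("tracks shared:", population_score(track_revisions=True))
```

Against a partner who never revises, the frozen-frame policy is indistinguishable from the tracking one; only the revising partners in the population expose the difference, which is the sense in which partner diversity, rather than an explicit rule, supplies the signal.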

Such a signal also has to be honest and persistent. Reflexion shows that agents learn well from *unambiguous* feedback: a clean success/failure signal stops the model from rationalizing, and keeping the resulting self-diagnosis uncompressed in memory keeps it usable across turns (Can agents learn from failure without updating their weights?). Common-ground construction would need something analogous: a clear signal of whether the model's belief about 'what we both now assume' actually matches the user's, carried forward rather than re-derived from scratch each turn. There's even a route to making the model generate part of that signal itself: post-completion learning trains a model to compute its own evaluation in unused sequence space, internalizing assessment instead of leaning on an external judge (Can models learn to evaluate their own work during training?).
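One way to picture a clean, persistent belief-match signal is a small ledger that returns a binary match/mismatch verdict each turn and keeps its diagnoses verbatim. This is a minimal sketch under stated assumptions, not Reflexion's or Post-Completion Learning's actual mechanism: `CommonGroundLedger` is a hypothetical name, and it assumes the user's true view is observable at training time.

```python
from dataclasses import dataclass, field


@dataclass
class CommonGroundLedger:
    """Hypothetical record of 'what we both now assume', carried across turns."""
    belief: dict = field(default_factory=dict)     # model's view of shared assumptions
    diagnoses: list = field(default_factory=list)  # uncompressed notes on past misses

    def check(self, turn, user_view):
        """Return a binary signal: does the belief match the user's view?"""
        mismatched = sorted(k for k in user_view
                            if self.belief.get(k) != user_view[k])
        success = not mismatched
        if not success:
            # Keep the diagnosis verbatim instead of summarizing it away.
            self.diagnoses.append(f"turn {turn}: stale on {', '.join(mismatched)}")
            # Carry the corrected state forward instead of re-deriving it.
            self.belief.update({k: user_view[k] for k in mismatched})
        return success


if __name__ == "__main__":
    ledger = CommonGroundLedger(belief={"topic": "python"})
    print(ledger.check(1, {"topic": "python"}))  # belief matches
    print(ledger.check(2, {"topic": "rust"}))    # user pivoted
    print(ledger.diagnoses)
```

The verdict is deliberately all-or-nothing: a partial-credit score would be easier to optimize but would reintroduce the ambiguity that the Reflexion note warns against.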

But the corpus also flags two traps. First, you can't prompt your way there: prompt optimization only reorganizes knowledge already in the model and can't inject a capability that training never built (Can prompt optimization teach models knowledge they lack?), so reciprocal common ground has to come from the training signal, not clever instructions. Second, the obvious tool (RL) tends to *narrow* rather than broaden: RL post-training collapses onto a single dominant format and suppresses alternatives in the first epoch (Does RL training collapse format diversity in pretrained models?), which is the opposite of the partner-diversity that drove cooperation in the first place. And there's a deeper worry: that a model could learn the surface *form* of updating common ground without the substance, the same way chain-of-thought reproduces the shape of reasoning without genuine inference and degrades off-distribution (Does chain-of-thought reasoning reveal genuine inference or pattern matching?).

Put together, the corpus points to a signal that doesn't yet exist in standard pipelines: train against diverse, self-revising partners so mutual adaptation becomes necessary; give the model a clean, persistent signal of belief-match it can carry and partly self-generate; and resist the RL-style convergence that would flatten the very diversity that makes reciprocity emerge. The unexpected takeaway is that 'learning to share common ground' may look less like a language-modeling objective and more like a multi-agent cooperation problem.

Sources (7 notes)

Can LLMs truly update shared conversational common ground?

LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background, making the user the sole maintainer of the conversational scoreboard.

Can agents learn cooperation by adapting to diverse partners?

Sequence model agents trained against diverse co-players develop in-context best-response strategies that naturally resolve into cooperation. Mutual vulnerability to exploitation creates pressure that drives cooperative mutual adaptation without hardcoded assumptions or timescale separation.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
