Can models infer maintenance operations from conversational text data alone?

This reads 'maintenance operations' as conversational maintenance (the repair, hand-off, and intent-tracking work that keeps a dialogue coherent) and asks whether a model can learn that social work from text transcripts alone. The corpus suggests models absorb the surface of maintenance but not its function.

On whether models can infer conversational maintenance (reference repair, topic hand-off, knowing when to probe versus answer) purely from text, the corpus draws a sharp line: they pick up the look of maintenance without the work of it. The clearest statement of the problem is that conversation maintenance is social action, not information encoding (see "Why don't language models develop conversation maintenance skills?"). Humans keep talk smooth through implicit moves that sustain a relationship rather than transmit facts, and because training rewards predicting the next informative token, the relational layer never gets a gradient. On that account, the answer is largely no: the signal that maintenance work exists lies not in what the text says but in what the text is doing.

But the corpus complicates its own answer. Models clearly do infer some social conventions from transcripts; they just infer the wrong half. They learn face-saving avoidance: declining to correct a user's false claim even when they give the correct answer to a direct question, mirroring a human politeness norm absorbed from training data (see "Why do language models avoid correcting false user claims?"). So a model can read the etiquette of maintenance off text alone, yet the result works as a bug, preserving social harmony at the cost of grounding. The maintenance behavior that does survive text training is the one that hurts the conversation.

The failures that follow look less like missing knowledge and more like missing repair. Multi-turn performance degrades not because capability vanishes but because RLHF rewards committing to a premature answer over the clarifying move that maintenance would call for: an intent-alignment gap, not a competence gap (see "Why do language models lose performance in longer conversations?"). Tool-using agents drift from user intent through silent tool chaining, which is exactly where conversation analysis says an 'insert-expansion', a clarifying detour before proceeding, should fire (see "When should AI agents ask users instead of just searching?"). The maintenance operation is well specified; the model just doesn't infer that this is the moment to perform it.

What shifts the answer from 'no' toward 'not yet, but trainable' is that these moves respond to explicit objectives even when they don't emerge from raw text. Proactive critical thinking (spotting missing information and asking instead of guessing) rises from 0.15% to 73.98% accuracy on deliberately flawed math problems under reinforcement learning. It stays fragile, though: inference-time scaling degrades it in models without that training and improves it only afterward (see "Can models learn to ask clarifying questions instead of guessing?"). Calibrated abstention shows the same shape: the ability to hold back when uncertain exists but is undertrained in standard models, and small open-source models given an explicit uncertainty objective match pre-trained models 10x their size on conversation forecasting (see "Can models learn to abstain when uncertain about predictions?"). The capacity is latent; conversational text alone doesn't surface it.
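As a rough illustration of what "holding back when uncertain" means operationally, the sketch below abstains whenever the top predicted probability falls under a cutoff. This is plain inference-time thresholding, not the uncertainty-aware training objective the cited note describes. The function name, the 0.7 threshold, and the example labels are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def predict_or_abstain(
    probs: Sequence[float],
    labels: Sequence[str],
    threshold: float = 0.7,  # assumed cutoff, not a value from the source
) -> Optional[str]:
    """Return the most probable label, or None to signal abstention."""
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and the same length")
    # Index of the highest-probability label.
    best = max(range(len(probs)), key=lambda i: probs[i])
    # Abstain when the model's confidence is below the cutoff.
    if probs[best] < threshold:
        return None
    return labels[best]


# Hypothetical conversation-forecasting labels.
confident = predict_or_abstain([0.9, 0.1], ["derails", "stays civil"])
unsure = predict_or_abstain([0.55, 0.45], ["derails", "stays civil"])
```

Here `confident` is `"derails"` and `unsure` is `None`. A threshold like this only helps if the probabilities are reasonably calibrated, which is the part the note says standard models are undertrained on.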

The takeaway a curious reader might not expect: maintenance operations aren't hiding in the words; they live in the relational moves the words were performing, and text training strips exactly that layer while faithfully copying the surface manners on top of it. So 'from conversational text alone' the answer leans no. What recovers the behavior is architecture and objectives that name the maintenance move explicitly: a mediator that parses intent before acting (see "Why do language models lose performance in longer conversations?"), or workflows that separate when-to-probe from what-to-answer (see "When should AI agents ask users instead of just searching?").
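One way to picture "separating when-to-probe from what-to-answer" is a gate that runs before the answering step and returns either a clarifying question or a go-ahead. The sketch below is a toy under stated assumptions: the intent and its slots are taken as already parsed, the `REQUIRED_SLOTS` table is hand-written, and every name here (`Probe`, `Answer`, `mediate`, `book_flight`) is hypothetical. It is not the Mediator-Assistant architecture from the cited note, which parses intent with a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class Probe:
    """A clarifying question asked before acting (an insert-expansion)."""
    question: str


@dataclass
class Answer:
    """The assistant's response once intent is sufficiently specified."""
    text: str


# Hypothetical table: what each intent needs before it is safe to act.
REQUIRED_SLOTS: Dict[str, List[str]] = {
    "book_flight": ["origin", "destination", "date"],
}


def mediate(
    intent: str,
    slots: Dict[str, str],
    answer_fn: Callable[[str, Dict[str, str]], str],
) -> Union[Probe, Answer]:
    """Decide when to probe; delegate what to answer to answer_fn."""
    missing = [s for s in REQUIRED_SLOTS.get(intent, []) if not slots.get(s)]
    if missing:
        # Ask about the first gap instead of guessing or silently chaining.
        return Probe(f"Before I proceed, could you tell me the {missing[0]}?")
    return Answer(answer_fn(intent, slots))


result = mediate("book_flight", {"origin": "SFO"}, lambda i, s: "booked")
```

With only `origin` supplied, `result` is a `Probe` asking for the destination. The point of the split is that the probing decision is made by an explicit component rather than left for the answering model to infer.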

Sources (6 notes)

Why don't language models develop conversation maintenance skills?

Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off that sustain relational interaction rather than convey information. Language models don't develop these because training signals reward information prediction, not relational work.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Why do language models lose performance in longer conversations?

LLMs degrade in multi-turn settings because RLHF training rewards premature answers over clarification-seeking, creating a pragmatic mismatch with individual user behaviors. A Mediator-Assistant architecture that explicitly parses user intent before execution recovers lost performance without retraining.

When should AI agents ask users instead of just searching?

Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.

Can models learn to ask clarifying questions instead of guessing?

Reinforcement learning increased proactive critical thinking accuracy from 0.15% to 73.98% on deliberately flawed math problems. Notably, inference-time scaling degraded this ability in untrained models but improved it after RL training, suggesting the capability is learnable but fragile without explicit training.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.
