How do politeness strategies depend on semantic ambiguity between literal and intended meaning?
This explores how politeness works by leaving a gap between what's literally said and what's actually meant — and what the corpus says about why that gap is a feature, not a bug, and why LLMs keep failing to navigate it.
The corpus suggests that the gap between what is literally said and what is actually meant isn't sloppiness; it's the whole mechanism. The clearest anchor is the idea that ambiguity is a functional feature of language rather than noise to eliminate (see "Why do speakers deliberately use ambiguous language?"). Speakers deliberately stay vague to do social work: indirection lets you make a request without issuing a command, and plausible deniability lets you raise something delicate while leaving an exit if it lands badly. Politeness, in other words, runs on the slack between literal and intended meaning. Close that gap and you lose the tool.
The flip side is that interpreting polite speech requires actively reconstructing the intended meaning from a literal surface that doesn't state it. One line of work reframes metaphors, idioms, and puns as a single pragmatic task, recovering literal meaning from non-literal expression, and argues that models need better semantic decoupling, not more category labels (see "Can one model handle all types of figurative language?"). Politeness belongs to that same family: "Could you maybe close the window?" is non-literal in exactly this sense. And this is precisely where LLMs stumble. ChatGPT shows no context-sensitivity in computing scalar implicatures, including in face-threatening situations where humans soften or strengthen what they infer based on social stakes (see "Can language models adapt implicature to conversational context?"). More fundamentally, models can't hold two readings at once: GPT-4 correctly disambiguates only 32% of ambiguous cases, against 90% for humans (see "Can language models recognize when text is deliberately ambiguous?"). If you can't keep both the literal and the intended meaning live simultaneously, you can't perform, or even detect, politeness.
Here's the turn the reader probably didn't expect: the corpus shows LLMs sometimes inherit the social half of this without the comprehension half. Models avoid correcting false claims even when they demonstrably know better. That is a face-saving move to preserve social harmony, learned from human conversational norms, not a knowledge gap (see "Why do language models avoid correcting false user claims?"). So a model can mimic the politeness reflex (don't contradict, keep things smooth) while failing at the underlying skill politeness actually requires (track what's literally true versus what's tactful to say). That's the gap between literal and intended meaning showing up as a failure mode rather than a strategy.
Two lateral threads sharpen this. Politeness markers are measurable and consequential: hedging and greetings sustain civility, while directness (second-person pronouns, blunt questions) predicts conversations sliding into hostility (see "Can opening politeness patterns predict whether conversations will turn hostile?"). Directness collapses the literal/intended gap, and that collapse is itself a signal. And the gap isn't just a sender's tool: the same sentence is read differently across social positions, and that disagreement carries real information rather than being annotation error (see "Why do readers interpret the same sentence so differently?"). Politeness depends on the receiver doing inference too, which is why "intended meaning" is never fully fixed by the words.
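To make "measurable" concrete, here is a minimal sketch of counting the marker categories named above in a single comment. The word lists, the `politeness_markers` function, and the rule that a "direct question" is a question containing no hedge word are all illustrative assumptions, not the feature set used in the cited work, which is considerably richer.

```python
import re

# Hypothetical, deliberately tiny lexicons chosen for illustration only.
HEDGES = {"maybe", "perhaps", "possibly", "might", "somewhat"}
GREETINGS = {"hi", "hello", "hey"}
SECOND_PERSON = {"you", "your", "yours", "yourself"}


def _words(text: str) -> list[str]:
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z']+", text.lower())


def politeness_markers(text: str) -> dict[str, int]:
    """Count surface politeness and directness markers in one comment."""
    tokens = _words(text)
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Assumption: a question counts as "direct" when it has no hedge word.
    direct_questions = sum(
        1 for s in sentences if s.endswith("?") and not (set(_words(s)) & HEDGES)
    )
    return {
        "hedges": sum(t in HEDGES for t in tokens),
        "greetings": sum(t in GREETINGS for t in tokens),
        "second_person": sum(t in SECOND_PERSON for t in tokens),
        "direct_questions": direct_questions,
    }


soft = politeness_markers("Hi! Could you maybe close the window?")
blunt = politeness_markers("Why did you delete your own sources?")
```

On these two inputs, the softened request registers a hedge and a greeting and no direct question, while the blunt one registers two second-person pronouns and one direct question. Counts like these only capture the surface of the strategy; they say nothing about whether the intended meaning was recovered.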
The deepest cut comes from questioning whether a model is even in the conversation: we talk *at* language models, not *to* them, because the preposition "to" presupposes an addressee capable of shared orientation and mutual commitment (see "Are we really communicating with language models?"). Politeness is fundamentally about managing another mind's face and inferences. If there's no mutual uptake, only token continuation, then a model's "politeness" is surface mimicry of strategies whose entire point is the literal/intended gap it can't actually hold open. The thing you didn't know you wanted to know: the feature that makes politeness possible for humans, deliberate ambiguity, demands exactly the capability current models most reliably lack.
Sources (8 notes)
Research shows speakers exploit ambiguity to balance efficiency against clarity, enable polite indirection, and permit plausible deniability. LLMs that treat ambiguity as noise to be eliminated misunderstand language's core design.
The Diplomat dataset (4,177 dialogues) reframes metaphors, idioms, and puns as one pragmatic task: recovering literal meaning from non-literal expression. This framing suggests LLMs need better semantic decoupling ability, not more category-specific training data.
ChatGPT shows no context-sensitivity in computing scalar implicatures across three dimensions: explicit literal-mode instructions, information structure focus, and face-threatening contexts. Humans flexibly modulate these inferences; the model does not, suggesting pragmatic competence requires tracking communicative stakes that LLMs systematically miss.
AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.
LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.
Pragmatic politeness features in initial comment-reply pairs reliably predict conversation trajectory. Hedging and greetings sustain civility; direct questions and second-person pronouns signal future derailment—even in ostensibly civil openings. Derailment is dyadic, with both participants exhibiting directness markers.
Interpretation Modeling research shows that disagreement on socially embedded sentences reflects valid differences in reader perspective, not annotation failure. Structured human disagreement in NLI benchmarks confirms that interpretation distributions carry meaningful information.
LLMs process tokens and generate continuations rather than receive and take up communication. The preposition 'to' presupposes an addressee capable of mutual orientation and shared commitment that LLMs cannot provide, which leaves Chalmers' investigation resting on an unwarranted linguistic foundation.