Is interpretive multiplicity a bug in language or a feature?
This explores whether the fact that the same words yield many valid readings is a defect to be engineered away or a real property of meaning — and what the corpus shows about how humans and LLMs each handle that multiplicity.
The corpus draws a sharp line: for humans, multiplicity is a feature; for today's language models, it's a blind spot they mostly can't see. The most direct evidence is Interpretation Modeling work showing that when readers disagree about a socially loaded sentence, that disagreement isn't annotation noise to be cleaned up; it carries real information about where readers stand (see "Why do readers interpret the same sentence so differently?"). The spread of readings *is* part of what the sentence means. So at the level of human language, interpretive multiplicity looks like a feature: meaning is partly a function of who's reading.
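To make "the spread of readings carries information" concrete, here is a minimal sketch that keeps annotator labels as a distribution and measures their spread with Shannon entropy. The labels and sentences are hypothetical toy data, not drawn from the Interpretation Modeling work, and entropy is just one reasonable choice of spread measure.

```python
from collections import Counter
from math import log2


def reading_distribution(labels):
    """Empirical distribution over readings, as {label: probability}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def entropy_bits(dist):
    """Shannon entropy in bits; 0.0 means every reader agreed."""
    return sum(-p * log2(p) for p in dist.values() if p > 0)


# Hypothetical annotations: six readers label the same sentence.
contested = ["sincere", "sarcastic", "sincere", "sarcastic", "sincere", "sarcastic"]
unanimous = ["sincere"] * 6

contested_dist = reading_distribution(contested)
unanimous_dist = reading_distribution(unanimous)

contested_entropy = entropy_bits(contested_dist)
unanimous_entropy = entropy_bits(unanimous_dist)

print(contested_dist, contested_entropy)
print(unanimous_dist, unanimous_entropy)
```

Majority voting would reduce both sentences to a single label and discard the difference between them; keeping the distribution preserves the fact that one sentence splits its readers evenly while the other does not.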
The twist is that LLMs, which you might expect to be natural homes for many-readings-at-once, are strikingly bad at it. On the AMBIENT benchmark, GPT-4 correctly disambiguates deliberately ambiguous text only 32% of the time, against 90% for humans, and it fails across lexical, structural, and scope ambiguity alike: it can't hold two readings in mind simultaneously (see "Can language models recognize when text is deliberately ambiguous?"). So the bug isn't in language; it's in the model's relationship to language. The multiplicity is really there, and the system that can't represent it is the one with the deficit.
Why can't they? The corpus points to a mechanism: models track statistical surface mass rather than meaning. They systematically prefer the higher-frequency phrasing of two equivalent paraphrases (see "Do language models really understand meaning or just surface frequency?"), and they treat structural cues like presupposition triggers and non-factive verbs as surface patterns instead of computing what those cues actually do to meaning (see "Why do embedding contexts confuse LLM entailment predictions?"). A system that collapses toward the most frequent reading is, almost by construction, a system that flattens multiplicity. That same surface-over-structure habit shows up as grammatical competence that decays predictably as sentences get more deeply embedded (see "Does LLM grammatical performance decline with structural complexity?" and "Why do large language models fail at complex linguistic tasks?").
Here's the thing you might not expect: at one level LLMs *do* carry multiplicity, just not where reading happens. An LLM behaves like a non-deterministic simulator holding a superposition of possible characters, which is why regenerating a prompt yields different personalities, all consistent with context, and why that cloud narrows as a conversation proceeds (see "Does an LLM commit to a single character or maintain many?"). So the model maintains many possible *speakers* while being unable to recognize many possible *meanings*. Multiplicity lives in its outputs but not in its comprehension, which is almost the inverse of the human case, where we read multiply but each speaker is singular.
The practical payoff: collapsing interpretive multiplicity is exactly the kind of failure that hides from standard evaluation. Models can produce a correct-sounding explanation while failing to apply the concept (see "Can LLMs understand concepts they cannot apply?"), and identical accuracy scores can sit on top of fractured internal structure (see "Can models be smart without organized internal structure?"). So if you treat ambiguity as a bug and optimize it away, you don't get a clearer model; you get one that confidently picks the popular reading and looks fine on the leaderboard while having quietly thrown away information that, in human language, was the point.
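A small sketch of how this can hide from a leaderboard: two hypothetical models get the same top-1 accuracy against the majority human reading, yet differ sharply once they are compared against the full distribution of human readings. All numbers are invented for illustration, and total variation distance is one assumed choice of distributional metric, not one the cited notes prescribe.

```python
def top1(dist):
    """The single most probable reading in a distribution."""
    return max(dist, key=dist.get)


def accuracy(human, model):
    """Share of items where the model's top reading matches the human majority."""
    hits = sum(top1(h) == top1(m) for h, m in zip(human, model))
    return hits / len(human)


def total_variation(p, q):
    """Total variation distance between two distributions over readings."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def mean_tv(human, model):
    """Average total variation distance across items."""
    return sum(total_variation(h, m) for h, m in zip(human, model)) / len(human)


# Hypothetical human reading distributions for three sentences.
human = [
    {"literal": 0.6, "ironic": 0.4},
    {"literal": 0.25, "ironic": 0.75},
    {"literal": 1.0, "ironic": 0.0},
]

# Model that always commits fully to the popular reading.
collapsed = [
    {"literal": 1.0, "ironic": 0.0},
    {"literal": 0.0, "ironic": 1.0},
    {"literal": 1.0, "ironic": 0.0},
]

# Model that keeps probability mass on the minority reading.
graded = [
    {"literal": 0.7, "ironic": 0.3},
    {"literal": 0.25, "ironic": 0.75},
    {"literal": 1.0, "ironic": 0.0},
]

print(accuracy(human, collapsed), accuracy(human, graded))
print(mean_tv(human, collapsed), mean_tv(human, graded))
```

Under top-1 accuracy the two models are indistinguishable; only a metric that scores the whole distribution shows that the collapsed model has discarded the minority readings.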
Sources (9 notes)
Interpretation Modeling research shows that disagreement on socially embedded sentences reflects valid differences in reader perspective, not annotation failure. Structured human disagreement in NLI benchmarks confirms that interpretation distributions carry meaningful information.
AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.
LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.
LLMs treat presupposition triggers and non-factive verbs as surface cues rather than computing their opposite semantic effects on entailments. This structural failure persists across prompts and models, suggesting models rely on surface patterns instead of structural analysis.
LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.