SYNTHESIS NOTE
Language, Text, and Discourse · Reasoning, Retrieval, and Evaluation

Why do semantically identical prompts produce different LLM outputs?

This note explores why paraphrases with the same meaning yield different model outputs. This matters because it reveals what LLMs actually respond to during inference, and whether prompt engineering is optimizing meaning or something else.

Synthesis note · 2026-05-02 · sourced from Natural Language Inference

Cao et al. (2024) showed that prompts with the same meaning can yield very different output quality. Adam's Law isolates frequency as a primary variable in that variance: when paraphrase pairs are matched on meaning but differ in sentence-level corpus frequency, the higher-frequency variant systematically wins. This converts a known phenomenon, prompt sensitivity, from a vague reliability concern into a specific architectural claim about what the model is actually responding to.
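To make the frequency comparison concrete, here is a minimal sketch. It assumes a crude proxy for sentence-level corpus frequency: the mean add-one-smoothed unigram log probability of a sentence's tokens. The toy corpus, the function names, and the proxy itself are illustrative assumptions, not the measure Adam's Law uses; a real estimate would need counts from a corpus resembling the model's pre-training data.

```python
import math
from collections import Counter

def build_unigram_counts(corpus):
    """Count lowercase whitespace tokens across a list of sentences."""
    counts = Counter()
    for sentence in corpus:
        counts.update(sentence.lower().split())
    return counts

def sentence_log_frequency(sentence, counts):
    """Mean add-one-smoothed log probability per token (a crude frequency proxy)."""
    tokens = sentence.lower().split()
    if not tokens:
        raise ValueError("sentence has no tokens")
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen tokens
    log_probs = [math.log((counts[t] + 1) / (total + vocab)) for t in tokens]
    return sum(log_probs) / len(log_probs)

def higher_frequency_variant(a, b, counts):
    """Return whichever paraphrase scores higher on the frequency proxy."""
    if sentence_log_frequency(a, counts) >= sentence_log_frequency(b, counts):
        return a
    return b

# Hypothetical toy corpus standing in for pre-training text.
TOY_CORPUS = [
    "how do i sort a list in python",
    "how do i reverse a list in python",
    "how do i read a file in python",
]
COUNTS = build_unigram_counts(TOY_CORPUS)

# Two paraphrases a human would read as the same request.
COMMON = "how do i sort a list"
RARE = "by what means might one order a list"

print(higher_frequency_variant(COMMON, RARE, COUNTS))
```

The proxy only ranks surface forms. It says nothing about which paraphrase a model answers better; that has to be measured separately against the model's outputs.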

The implication for the note "Does model confidence predict robustness to prompt changes?" is direct but complicating. Confidence-based accounts read prompt sensitivity as model uncertainty fluctuating across surface variations. Adam's Law inserts a deeper variable: even at fixed model confidence, frequency mass differs across paraphrases because pre-training exposure differs, and that exposure asymmetry shapes the prediction independent of how confident the model "feels." Confidence and frequency are entangled, but frequency is the more upstream cause.
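One way to probe the "fixed confidence" claim is to stratify paraphrase pairs by model confidence and check whether the higher-frequency variant still scores better within each stratum. The sketch below assumes each record carries a confidence in [0, 1] and an output-quality score for each variant; the field names, the bin width, and the numbers are placeholders invented for illustration, not results from any study.

```python
from collections import defaultdict

def win_rate_by_confidence_bin(records, bin_width=0.25):
    """Map each confidence bin index to the share of pairs in which the
    higher-frequency variant scored strictly better (ties count as losses)."""
    n_bins = int(1 / bin_width)
    wins = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        # Clamp so that confidence == 1.0 falls into the top bin.
        bin_index = min(int(record["confidence"] / bin_width), n_bins - 1)
        totals[bin_index] += 1
        if record["high_freq_score"] > record["low_freq_score"]:
            wins[bin_index] += 1
    return {b: wins[b] / totals[b] for b in sorted(totals)}

# Placeholder records (made-up values, for illustration only).
TOY_RECORDS = [
    {"confidence": 0.90, "high_freq_score": 0.80, "low_freq_score": 0.60},
    {"confidence": 0.85, "high_freq_score": 0.70, "low_freq_score": 0.75},
    {"confidence": 0.30, "high_freq_score": 0.50, "low_freq_score": 0.20},
    {"confidence": 0.35, "high_freq_score": 0.60, "low_freq_score": 0.40},
]

RATES = win_rate_by_confidence_bin(TOY_RECORDS)
print(RATES)  # {1: 1.0, 3: 0.5}
```

If frequency mattered only through confidence, the within-bin win rates would hover near 0.5; rates well above 0.5 inside narrow bins would be consistent with frequency acting upstream of confidence.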

For a Language-as-Event frame, this is load-bearing. A prompt is not a transparent vessel that hands meaning to the model. It is a token sequence whose statistical mass relative to pre-training shapes how the model parses the request before any semantic interpretation occurs. Two synonymous sentences are not the same event. They are two different statistical encounters that happen to share a meaning a human would assign them. The model registers the encounter; meaning is what we read into the registration. This connects to the note "Can models pass tests while missing the actual grammar?": when surface and meaning compete, surface wins by construction.

A practical corollary: prompt-engineering as a discipline is partly a folk practice of frequency optimization. "Phrase it like a textbook" or "rewrite the prompt the way StackOverflow would phrase it" are intuitive moves toward higher-frequency surface forms. Adam's Law gives that folk practice a name and a mechanism — and a warning, because frequency-tuning a prompt does not improve the model's reasoning; it just moves the request into the model's denser distributional region.

Original note title

paraphrase equivalence is a fiction — same-meaning prompts produce different LLM outputs because frequency, not semantics, drives the prediction