Do language models really understand meaning or just surface frequency?
Explores whether LLMs comprehend semantic meaning independently of textual frequency, or whether high-frequency paraphrases systematically outperform rare ones even when meaning is identical across math, translation, and reasoning tasks.
Adam's Law, the Textual Frequency Law (TFL), generalizes a previously local finding into a global property of LLM computation. The earlier NLI work showed that predicates in entailment hypotheses skew higher-frequency than those in premises, and that fine-tuning amplifies rather than dilutes this bias (see "Does fine-tuning on NLI teach inference or amplify shortcuts?"). Adam's Law extends this across four task families: math reasoning, machine translation across hundreds of language pairs, commonsense reasoning, and agentic tool calling. The constant: when meaning is held fixed and only surface form varies, the higher-frequency paraphrase outperforms the lower-frequency one.
The mechanism is straightforward but uncomfortable. Higher-frequency text occurred more often during pre-training, so it sits in a denser, better-modeled region of the distribution. The model's "comprehension" is therefore not meaning-recognition first and surface-decoding second; it is statistical-mass recognition first, with meaning emerging downstream of that recognition. This converges with "Can models pass tests while missing the actual grammar?": correct outputs do not certify that meaning is what the model is tracking.
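As a rough illustration of what "higher-frequency phrasing" could mean operationally, the sketch below scores paraphrases by the mean smoothed log-frequency of their words in a reference corpus. The note does not say how textual frequency is measured, so everything here is an assumption: the unigram scoring, the add-one smoothing, the toy corpus standing in for pre-training data, and the function names are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_counts(corpus):
    """Count word occurrences in a reference corpus (assumed proxy for pre-training text)."""
    counts = Counter()
    for doc in corpus:
        counts.update(tokenize(doc))
    return counts


def frequency_score(sentence, counts):
    """Mean add-one-smoothed log relative frequency of the sentence's tokens."""
    tokens = tokenize(sentence)
    if not tokens:
        raise ValueError("sentence has no tokens")
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen words
    log_probs = [math.log((counts[t] + 1) / (total + vocab)) for t in tokens]
    return sum(log_probs) / len(log_probs)


def rank_paraphrases(paraphrases, counts):
    """Order paraphrases from most to least frequent under the score above."""
    return sorted(paraphrases, key=lambda p: frequency_score(p, counts), reverse=True)


# Toy corpus and paraphrase pair, invented purely for illustration.
toy_corpus = [
    "how many apples are left",
    "how many books are left",
    "how many are there in total",
]
counts = build_counts(toy_corpus)
common = "how many apples are left"
rare = "what quantity of apples remains"
ranked = rank_paraphrases([rare, common], counts)
```

Under these assumptions `ranked[0]` is the common phrasing. A real measurement would need counts from a corpus that plausibly resembles the model's training data, and probably n-gram or sequence-level statistics rather than unigrams.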
The pattern matters because paraphrase invariance is a load-bearing assumption almost everywhere LLMs are deployed. We assume the same prompt, said two ways, will yield the same answer. Adam's Law says no: it will yield the frequency-weighted answer, and the surface form is a covariate of accuracy, not a transparent vehicle for the request. The same dynamic appears on the output side. "Do different AI models actually produce diverse outputs?" documents convergence in what models say; Adam's Law documents the same convergence in how models comprehend what is said to them. Both endpoints of the prompt-response loop pull toward the corpus mean. Frequency is not noise around meaning. Frequency is a substantial fraction of what comprehension means inside a transformer.
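One way to test the paraphrase-invariance assumption in a deployment is to send several phrasings of the same request and check whether the answers agree. The sketch below is a minimal version of that check. It assumes the model is available as a plain callable from prompt to answer string; `toy_model` is a hypothetical stand-in that keys on a surface word, and the prompts are invented for illustration.

```python
from typing import Callable, Iterable, Sequence


def invariance_rate(
    model: Callable[[str], str],
    paraphrase_sets: Iterable[Sequence[str]],
) -> float:
    """Fraction of paraphrase sets where every phrasing yields the same answer."""
    sets = list(paraphrase_sets)
    if not sets:
        raise ValueError("need at least one paraphrase set")
    consistent = 0
    for phrasings in sets:
        # Normalize lightly so trivial whitespace or case changes do not count.
        answers = {model(p).strip().lower() for p in phrasings}
        if len(answers) == 1:
            consistent += 1
    return consistent / len(sets)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: it only 'understands' the word 'plus'."""
    return "4" if "plus" in prompt.lower() else "unknown"


# Each inner list holds phrasings intended to mean the same thing.
paraphrase_sets = [
    ["What is 2 plus 2?", "What is the sum of 2 and 2?"],
    ["What is 1 plus 3?", "Compute 1 plus 3."],
]
rate = invariance_rate(toy_model, paraphrase_sets)
```

Here the toy model is consistent on one set out of two, so `rate` is 0.5. Exact string agreement is a crude criterion; free-form answers would need a task-specific equivalence check instead.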
Inquiring lines that use this note as a source 79
This note is a source for these synthesized inquiries.
- Why do multiple language models independently produce similar outputs in influence campaigns?
- Can AI detect sense-of-nonsense the way human readers do?
- Why does training data saliency distort how models judge meaning?
- What makes ambiguity recognition fundamentally important for poetry analysis?
- Why does statistical compression destroy literary connotation and meaning?
- Does functional grounding through discourse patterns count as genuine semantic meaning?
- Why does combining natural language with numerical scores improve prediction accuracy?
- Can meaning-level metrics like Semantic Entropy avoid length bias?
- How does syntactic encoding relate to semantic feature representation?
- What percentage of natural language relies on plausible deniability through ambiguous phrasing?
- How does semantic ambiguity differ from structural ambiguity in language?
- Is interpretive multiplicity a bug in language or a feature?
- What other semantic relations benefit from explicit surface markers in text?
- Can language models ground clarifications without vision and kinesthetic modalities?
- How should meaning spaces be systematically modeled across different applications?
- Why do language models fail when semantic content is stripped away?
- Can autoregressive models learn faithful translation to logical representations without semantic loss?
- How does semantic grounding differ between human minds and language models?
- How do rare linguistic registers differ from conceptually complex examples?
- Why do embeddings measure semantic association instead of task relevance?
- Why does homework adherence remain low despite advances in language model capability?
- What semantic classifier design avoids lexical variation without genuine conceptual distinctness?
- Do LLMs compute scalar implicature differently across conversational contexts?
- Can LLMs improve at metaphor if they handle decoupled semantics better?
- How does implicit meaning processing limit LLM pragmatic reasoning?
- Can language models acquire meaning from distributional patterns alone without joint attention?
- Does generalization frequency explain why models favor upward semantic movement?
- What makes vector embeddings fail on single-hop semantic relevance queries?
- Does fine-tuning on NLI tasks amplify or reduce frequency bias in language models?
- Is paraphrase invariance a reliable assumption when deploying language models in production?
- Can correct model outputs prove that semantic meaning rather than surface patterns drove the response?
- Can frame semantics explain why context matters more than word similarity?
- Do LLMs struggle more with semantic accuracy than syntactic correctness across domains?
- Can LLMs infer implicit meaning without surface linguistic markers?
- What distinguishes surface cues from structural meaning in language understanding?
- Why do LLMs fail at semantic generalization despite grammatical accuracy?
- Can language models develop world models that ground meaning in causal reality?
- Can understanding language happen entirely within a language system alone?
- Do metaphors work by decoupling meaning from linguistic associations?
- Can LLMs identify implicit metaphoric mappings that require pragmatic inference?
- How does the distance between natural language and formal notation affect translation accuracy?
- Why do LLMs choose surface-order quantifier scope over contextually correct readings?
- Can LLM semantic representations exist without causally influencing their generation output?
- Can encoder models match human conceptual structure better than larger language models?
- How much semantic meaning survives when LLMs paraphrase poetry and literary text?
- Can benchmark performance distinguish surface from structural linguistic knowledge?
- Can adding more words to a passage actually interfere with meaning?
- Why do different readers extract different meanings from identical text?
- What distinguishes conceptual understanding from statistical pattern matching in models?
- Why do NLP models fail at recognizing multiple valid interpretations?
- Does model confidence actually explain why paraphrases produce different outputs?
- Do LLMs learn linguistic generalizations or just surface-level frequency patterns?
- Can presupposition projection strength vary by context in embeddings?
- Why do semantic similarity and task relevance diverge in vector search results?
- How does bidirectional entailment distinguish semantic equivalence from token similarity?
- How do LLMs compress literary language without losing essential nuance?
- Why does cross-text analogical reasoning fail when semantics decouple from symbols?
- Why does training data not function as a searchable corpus?
- Why do readability and style metrics plateau while reasoning improves with scale?
- How do language models transmit traits through semantically unrelated data?
- Why does probability of text completion not equal knowledge value?
- What distinguishes real understanding from superficial pattern matching?
- Why does joint attention matter for acquiring linguistic meaning?
- How do static embeddings and contextualized representations divide semantic labor?
- How does training distribution shape what language models understand best?
- How does modeling capability relate to lossless compression in language models?
- Why does semantic similarity retrieval enable skill transfer to novel situations?
- Can adversarial paraphrasing defeat feature-based detection of LLM text?
- Why does semantic diversity matter more than surface lexical diversity?
- Can vector embeddings measure task relevance instead of semantic similarity?
- When does RLHF reduce diversity and when does it preserve semantic variation?
- Does language convey meaning purely through relational structure without external grounding?
- Can autoformalisation from natural language preserve semantic accuracy?
- How do mechanistic features compare to natural language for interpretability?
- How well does semantic similarity preserve survey response nuance?
- What is the comprehension-generation asymmetry in language models?
- Can readers detect meaning through resonance patterns alone without knowing authorial intent?
- Where does the meaning actually originate in reader-detected resonance across language?
- Why do multimodal models fail on rare and underrepresented concepts?
Related concepts in this collection 3
- Does fine-tuning on NLI teach inference or amplify shortcuts?
  When LLMs are fine-tuned on natural language inference datasets, do they learn genuine reasoning abilities or become better at exploiting statistical patterns in the training data? Understanding this distinction matters for assessing model capabilities.
  Relation: local finding that Adam's Law generalizes
- Can models pass tests while missing the actual grammar?
  Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
  Relation: mechanism: surface, not semantics
- Do different AI models actually produce diverse outputs?
  Explores whether using multiple different language models together creates genuine diversity or whether shared training and alignment cause them to converge on similar answers despite independence.
  Relation: output-side counterpart of the same dynamic
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Adam's Law: Textual Frequency Law on Large Language Models
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- Word Meanings in Transformer Language Models
- Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds
- Bigger is not always better: The importance of human-scale language modeling for psycholinguistics
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
- Language models show human-like content effects on reasoning tasks
- Large Linguistic Models: Investigating LLMs' metalinguistic abilities
Original note title
high-frequency phrasing wins — LLMs systematically prefer textually frequent paraphrases over rare ones with the same meaning