Why does training data saliency distort how models judge meaning?

This explores how the sheer statistical weight of frequent or strongly associated training patterns can override actual meaning when a model decides what a text 'says', and what that reveals about whether models judge meaning at all.

The statistical mass of training data (how often a phrasing or association appeared) can crowd out meaning when a model evaluates text. The clearest evidence is direct: models systematically prefer high-frequency surface forms over rarer but semantically equivalent paraphrases, and this bias holds across math, translation, commonsense reasoning, and tool calling (see: Do language models really understand meaning or just surface frequency?). That consistency is the tell: it suggests the model is tracking how much statistical weight a form carries from pretraining, not recognizing what it means.
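One way to make the frequency preference concrete is to score paired phrasings and count how often the common one wins. The sketch below is illustrative only: `ParaphrasePair`, `frequent_form_preference_rate`, and every score in it are hypothetical names and made-up values, not the protocol or data from the cited note.

```python
from dataclasses import dataclass


@dataclass
class ParaphrasePair:
    """Two semantically equivalent phrasings of one item (hypothetical data)."""
    frequent_form_score: float  # model score (e.g. log-probability) for the common phrasing
    rare_form_score: float      # model score for the rarer paraphrase


def frequent_form_preference_rate(pairs: list[ParaphrasePair]) -> float:
    """Fraction of pairs where the model scores the frequent phrasing strictly higher."""
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(1 for p in pairs if p.frequent_form_score > p.rare_form_score)
    return wins / len(pairs)


# Made-up scores for illustration only; these are not measurements.
example_pairs = [
    ParaphrasePair(-4.0, -9.5),
    ParaphrasePair(-3.2, -3.9),
    ParaphrasePair(-7.1, -6.8),
    ParaphrasePair(-2.5, -8.0),
]
rate = frequent_form_preference_rate(example_pairs)
print(f"frequent form preferred in {rate:.0%} of pairs")
```

If the two phrasings really do mean the same thing, a meaning-tracking scorer should sit near 50%; a rate well above that across unrelated tasks is the pattern described above.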

The same dynamic shows up as a tug-of-war between what is in front of the model and what it learned. When prior training associations are strong enough, models generate outputs that contradict their own context, and prompting harder does not fix this; it takes intervening directly in the model's internal representations (see: Why do language models ignore information in their context?). Saliency, in other words, isn't a surface quirk you can argue the model out of. There is even a measurable threshold: how strongly a keyword gets primed after training is predictable from its probability beforehand, with a sharp cutoff around one in a thousand (~10^-3) separating contexts where priming occurs from those where it doesn't, and as few as three training exposures are enough to establish the effect (see: Can we predict keyword priming before learning happens?).
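The threshold claim can be read as a simple decision rule on a keyword's pre-learning probability. The sketch below is a hedged illustration: the function name `priming_expected`, the `primes_below` flag, and the example probabilities are all hypothetical. In particular, which side of the cutoff primes is an assumption here, since the text only says that a cutoff near 10^-3 exists.

```python
PRIMING_THRESHOLD = 1e-3  # cutoff reported in the text (about one in a thousand)


def priming_expected(p_before: float, primes_below: bool = True) -> bool:
    """Predict whether priming occurs from a keyword's pre-learning probability.

    primes_below=True is an ASSUMPTION (low-probability keywords are the ones
    that prime); the text states only that a sharp cutoff exists.
    """
    if not 0.0 <= p_before <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    below = p_before < PRIMING_THRESHOLD
    return below if primes_below else not below


# Hypothetical keyword probabilities, for illustration only.
keyword_probs = {"keyword_a": 2e-5, "keyword_b": 4e-2}
for name, p in keyword_probs.items():
    print(name, p, priming_expected(p))
```

Keeping the direction as an explicit flag makes the assumption visible and easy to flip once it is checked against the underlying note.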

Why would frequency dominate meaning in the first place? One camp argues it's structural: meaning requires linking expressions to communicative intent, and a system trained only on form-to-form prediction never has access to that, so it can only ever reconstruct statistical regularity (see: Can language models learn meaning from text patterns alone?). But the corpus doesn't let that conclusion sit unchallenged. Other work shows that LLMs operationalize Saussure's *langue*: they compress the relational structure of language so well that fluent, situated generation needs no external referent (see: Can language models learn meaning without engaging the world?). Further work shows that even static embeddings, before attention runs, encode genuine semantic content such as valence and concreteness (see: Do transformer static embeddings actually encode semantic meaning?). So the distortion may be less 'no meaning' and more 'real semantic signal getting drowned out by a louder frequency signal.'

Saliency does the most damage in cases that demand holding more than one reading at once. Models fail badly at recognizing deliberate ambiguity: GPT-4 correctly disambiguates only 32% of cases, against 90% for humans, because it collapses to a single dominant interpretation instead of entertaining the alternatives (see: Can language models recognize when text is deliberately ambiguous?). That's saliency as a failure of plurality: the most-trained reading wins by default. Contrast this with how humans interpret, where disagreement across social positions is irreducible and meaningful rather than noise to be averaged away (see: Why do readers interpret the same sentence so differently?). The thing you didn't know you wanted to know: the distortion isn't only that models pick the frequent reading; it's that they fail to register that the other readings exist at all.
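The difference between collapsing to one reading and holding several can be illustrated with the entropy of a distribution over candidate interpretations. This is an illustrative measure chosen for this sketch, not the metric used by the cited benchmark; `interpretation_entropy` and both example distributions are hypothetical.

```python
import math


def interpretation_entropy(probs: list[float]) -> float:
    """Shannon entropy, in bits, of a distribution over candidate readings."""
    if any(p < 0 for p in probs) or not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probs must be non-negative and sum to 1")
    # Zero-probability readings contribute nothing to the entropy.
    return sum(-p * math.log2(p) for p in probs if p > 0)


# Hypothetical distributions over two readings of one ambiguous sentence.
collapsed = [1.0, 0.0]  # all weight on the dominant reading
split = [0.5, 0.5]      # both readings kept in play

print(f"collapsed: {interpretation_entropy(collapsed):.2f} bits")
print(f"split:     {interpretation_entropy(split):.2f} bits")
```

A collapsed distribution carries zero bits, so nothing in it signals that a second reading was ever available, whereas an even split over two readings carries one bit.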

Sources (8 notes)

Do language models really understand meaning or just surface frequency?

LLMs show a consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests that the models' primary mechanism is tracking statistical mass from pretraining rather than recognizing meaning.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Can language models learn meaning from text patterns alone?

Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.

Do transformer static embeddings actually encode semantic meaning?

Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.

Can language models recognize when text is deliberately ambiguous?

The AMBIENT benchmark shows that GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity, revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Why do readers interpret the same sentence so differently?

Interpretation Modeling research shows that disagreement on socially embedded sentences reflects valid differences in reader perspective, not annotation failure. Structured human disagreement in NLI benchmarks confirms that interpretation distributions carry meaningful information.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher probing whether training-data saliency—the crowding-out of semantic meaning by statistical frequency—remains a hard constraint on model judgment or has been relaxed by new architectures, training methods, or evaluation. The question: *Can models learn to privilege meaning over frequency, and if so, how?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026; treat as perishable claims to re-test:
• Models systematically prefer high-frequency surface forms over rarer but semantically identical paraphrases across math, translation, and commonsense reasoning (2023–2024).
• When prior training associations are strong, models generate outputs contradicting their own context; prompting cannot fix this; only internal intervention works (~2024).
• The priming threshold is sharp: a pre-learning keyword probability of ~10^-3 separates contexts where priming occurs from those where it doesn't, and as few as three exposures establish the effect (~2024).
• LLMs fail at ambiguity recognition (GPT-4: 32% vs. humans 90%) because they collapse to a single dominant reading instead of entertaining alternatives (~2023).
• Newer work hints at mechanistic understanding and consistency training as possible remedies (~2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2304.14399 (2023-04): We're Afraid Language Models Aren't Modeling Ambiguity
• arXiv:2404.04125 (2024-04): No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Performance
• arXiv:2507.08017 (2025-07): Mechanistic Indicators of Understanding in Large Language Models
• arXiv:2510.27062 (2025-10): Consistency Training Helps Stop Sycophancy and Jailbreaks

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above—frequency bias, context failure, priming thresholds, ambiguity collapse—judge whether newer models (GPT-4o, o1, Claude 3.5+), training innovations (consistency training, mechanistic steering, in-context inference), or evaluation harnesses have since RELAXED or OVERTURNED it. Separate the durable question (saliency as a statistical phenomenon) from the perishable limitation (whether it can be mitigated). Where a constraint still holds, flag it plainly.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Does mechanistic understanding (2025-07) or consistency training (2025-10) actually *solve* the ambiguity problem, or only narrow it? Has frequency law (2026-04) been refuted by scaling or method?
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., *If saliency bias can be suppressed via internal steering, does that preserve semantic fidelity or just compress it differently?* *Can models now hold multiple meanings in superposition, or do newer architectures still collapse to dominance?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
