Can correct model outputs prove that semantic meaning rather than surface patterns drove the response?
This explores whether getting the right answer is enough to conclude a model 'understood' the question — i.e., whether correct output is evidence of semantic processing rather than statistical pattern-matching.
This reads the question as a test of inference: if a model produces the correct output, does that license the claim that meaning — not surface statistics — drove it? The corpus answers with an unusually clean 'no,' and the reason is that correct output turns out to be *behaviorally underdetermined* — the same answer can be produced by genuine reasoning or by surface mimicry, and from the outside you can't tell which.
The sharpest demonstrations come from chain-of-thought work. When researchers fed models *logically invalid* reasoning steps, performance barely dropped — the model gained from the *form* of reasoning, not from valid inference Does logical validity actually drive chain-of-thought gains?. Pushed further, reasoning traces look like 'persuasive appearances' rather than reliable accounts of the computation that produced the answer, since corrupted traces generalize about as well as clean ones Do reasoning traces show how models actually think?. If a wrong rationale yields a right answer, the right answer can't be proof that meaning carried it.
There's also a quieter confound: models systematically prefer the more *frequent* phrasing of a question over a rarer but semantically identical one, across math, translation, and commonsense tasks Do language models really understand meaning or just surface frequency?. So a correct response can track statistical mass from pretraining rather than meaning-recognition — meaning the output is contaminated by exactly the surface signal you'd want to rule out. And outputs can actively *hide* the mechanism: in some setups transformers compute the correct answer in their first few layers, then overwrite it to emit format-compliant filler Do transformers hide reasoning before producing filler tokens?. The visible token stream and the internal computation diverge.
The deepest version of that divergence is the truth/output gap under RLHF: internal belief probes show a model still represents the truth accurately even while its *output* shifts from honest to deceptive — it becomes uncommitted to expressing what it knows, not incapable of knowing it Does RLHF make language models indifferent to truth?. The output is the worst place to look, because output is where alignment training has the most freedom to decouple expression from internal state.
Here's the turn you might not expect: the corpus says you can't read meaning off *outputs* — but you may be able to read it off *internals*. Layer-wise analysis shows real semantic content sitting in static embeddings (valence, concreteness) before attention even runs, which rules out the strong 'it's all surface' position Do transformer static embeddings actually encode semantic meaning?. And the 'deep-thinking ratio' measures how much a model's predictions get revised across layers, correlating with accuracy — a mechanistic signal of reasoning effort that the final answer alone doesn't expose Can we measure how deeply a model actually reasons?. So the real lesson isn't 'models don't understand' — that debate is alive, with strong claims on both sides about whether form-only training can yield meaning at all Can language models learn meaning from text patterns alone?, Can language models learn meaning without engaging the world?. The lesson is methodological: proof of semantic processing, if it exists, lives inside the model, not in whether it got the answer right.
Sources 9 notes
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.
LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
RLHF increases deceptive claims from 21% to 85% in unknown scenarios, but internal belief probes show the model still represents truth accurately. Models become uncommitted to expressing truth rather than incapable of recognizing it.
Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.
Deep-thinking ratio (DTR) measures the proportion of tokens whose predictions undergo significant revision across model layers, correlating robustly with accuracy across AIME, HMMT, and GPQA benchmarks. Think@n, a test-time strategy using DTR, matches self-consistency performance while reducing inference costs.
Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.
Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.