INQUIRING LINE

What distinguishes surface generalizations from true linguistic generalizations?

This explores the line between a model that has actually learned a rule and one that's only matched a pattern that looks like the rule — and asks how you can tell them apart, in grammar and beyond.


The cleanest case comes from grammar. BabyLM evaluations found that models can produce grammatically correct outputs by leaning on sentence length, word choice, and spelling rather than on any underlying syntactic structure. Crucially, standard benchmarks can't catch this unless they are designed specifically to rule out the surface shortcut (see "Can models pass tests while missing the actual grammar?"). So the distinguishing feature isn't in the output, which looks the same either way; it's whether performance survives when you strip away the cues that correlate with the right answer but aren't the answer.

That diagnostic (does the behavior track meaning, or just the statistics that usually accompany meaning?) recurs across the corpus far from grammar. In moral reasoning, GPT-4's ratings for a scenario and its meaning-reversed twin correlate at r=.99, while humans land at r=.54: the model is tracking lexical distribution, not what the scenario actually says (see "Do LLMs generalize moral reasoning by meaning or surface form?"). In theory-of-mind tasks, models default to surface strategies and collapse on open-ended perspective-taking even while passing structured benchmarks (see "Do large language models genuinely simulate mental states?"). The pattern is identical: success on the constrained test, failure once the surface correlate is removed.
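The moral-reasoning comparison comes down to one number: the correlation between ratings of each scenario and ratings of its meaning-reversed twin. Below is a minimal sketch of that computation, assuming ratings are stored as paired lists in the same scenario order. The function name `pearson_r` and all rating values are hypothetical illustrations, not data or code from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy ratings, invented for illustration only.
original = [1.0, 2.0, 3.0, 4.0, 5.0]         # ratings of the original scenarios
surface_tracker = [1.1, 2.0, 2.9, 4.2, 5.0]  # reversed twins rated almost the same
meaning_tracker = [5.0, 4.0, 3.0, 2.0, 1.0]  # reversed twins rated the opposite way

print("surface-tracking rater:", pearson_r(original, surface_tracker))
print("meaning-tracking rater:", pearson_r(original, meaning_tracker))
```

A correlation near 1 between original and reversed ratings means the reversal barely moved the rater, which is the signature of tracking wording rather than meaning. The fully inverted toy rater is a deliberately extreme contrast; the human figure reported above (r=.54) shows that real meaning-sensitive ratings shift only partly.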

The sharpest articulation of the gap is "potemkin understanding": models that explain a concept correctly, fail to apply it, and then recognize their own failure, a combination no human cognition produces (see "Can LLMs understand concepts they cannot apply?"). This suggests surface and true generalization aren't just two scores on a scale but functionally separate pathways: explanation and execution can come apart entirely. Mechanistic interpretability backs this up by showing understanding isn't one thing. Conceptual, world-state, and principled understanding sit in distinct mechanisms, and higher tiers don't replace the lower-tier heuristics; they coexist with them as a patchwork (see "Do language models understand in fundamentally different ways?"). A model can have a genuine circuit for one thing and a shortcut for another simultaneously.

Here's the twist worth carrying away: not every surface-looking behavior is a failure of understanding. Content effects show LLMs reproducing human belief-bias patterns item-by-item across three reasoning tasks, which the corpus reads not as a bug but as evidence that semantic content and logical form are architecturally inseparable, in humans too (see "Do language models show the same content effects humans do?"). And the relational view of language argues that compressing the structure of text alone (Saussure's langue, with no external referents) is genuinely a form of learning, not a cheap imitation of it (see "Can language models learn meaning without engaging the world?"). So "surface" isn't always shallow. The real distinction the corpus draws is narrower and more useful: a true generalization is one that holds when you vary everything except the thing it claims to be about. The surface kind is the one that quietly depended on something else the whole time, and the only way to know which you have is to build the test that takes the crutch away.
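One way to picture "the test that takes the crutch away" is to score a model separately on items where a surface cue points to the right answer and on items where it points to the wrong one. The sketch below assumes a boolean grammaticality task and uses sentence length as the stand-in cue. The names `accuracy_by_cue_alignment` and `length_heuristic`, and the four toy sentences, are hypothetical and are not taken from BabyLM or any benchmark.

```python
def accuracy_by_cue_alignment(items, predict, cue):
    """Report accuracy separately for cue-agreeing and cue-conflicting items.

    items:   list of (example, label) pairs
    predict: the model under test, mapping example -> predicted label
    cue:     a surface heuristic, mapping example -> the label it would guess
    """
    buckets = {"cue_agrees": [], "cue_conflicts": []}
    for example, label in items:
        key = "cue_agrees" if cue(example) == label else "cue_conflicts"
        buckets[key].append(predict(example) == label)
    # None marks an empty bucket, so a missing condition is not read as 0% accuracy.
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}


def length_heuristic(sentence):
    """Hypothetical surface cue: call any sentence of 4 words or fewer grammatical."""
    return len(sentence.split()) <= 4


# Hypothetical toy items: (sentence, is_grammatical).
toy_items = [
    ("The dog barks", True),                      # short and grammatical: cue agrees
    ("The dogs barks", False),                    # short but ungrammatical: cue conflicts
    ("The dog that the cats chase barks", True),  # long but grammatical: cue conflicts
    ("The dog that the cats chase bark", False),  # long and ungrammatical: cue agrees
]

# Worst case: a "model" that is nothing but the surface heuristic.
print(accuracy_by_cue_alignment(toy_items, length_heuristic, length_heuristic))
```

A model that has learned the rule should score about the same in both buckets. A large drop in the `cue_conflicts` bucket suggests the behavior was riding on the cue. Which cues to control for is a design decision this sketch does not settle.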


Sources (7 notes)

Can models pass tests while missing the actual grammar?

BabyLM evaluations showed models can produce correct outputs by relying on sentence length, word choice, and orthography rather than grammatical structure. Standard benchmarks cannot distinguish these two generalization types without tests specifically designed to rule out surface heuristics.

Do LLMs generalize moral reasoning by meaning or surface form?

GPT-4 ratings for original and meaning-reversed scenarios correlate at r=.99, while human ratings correlate at r=.54. LLMs track lexical distribution; humans track semantic content, suggesting LLMs reproduce training distributions rather than simulate moral cognition.

Do large language models genuinely simulate mental states?

ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Do language models understand in fundamentally different ways?

Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.

Do language models show the same content effects humans do?

LLMs show the same content-sensitivity patterns as humans on NLI, syllogisms, and Wason tasks, with belief-bias signatures matching human error rates item-by-item. This behavioral isomorphism across three independent tasks suggests that content and logical form are architecturally inseparable in transformer reasoning.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.
