SYNTHESIS NOTE

Can models pass tests while missing the actual grammar?

Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.

Synthesis note · 2026-02-21 · sourced from Discourses

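The BabyLM Challenge included an evaluation specifically designed to distinguish two kinds of generalization: surface generalizations, which key on shallow properties of the input, and linguistic generalizations, which key on structural rules.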

Models were fine-tuned on an ambiguous training set where labels were consistent with either generalization, then evaluated on a test set that disambiguated which one the model converged on.
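The ambiguous-train, disambiguating-test design can be sketched in a few lines. This is a toy illustration under stated assumptions, not the challenge's actual evaluation code: `Item`, `train_accuracy`, `linguistic_bias`, and the two stand-in "models" are hypothetical names, and each item is reduced to two boolean features instead of a real sentence.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Item:
    surface_feature: bool     # hypothetical shallow cue, e.g. a specific word is present
    linguistic_feature: bool  # hypothetical structural cue, e.g. a property of the main verb


Model = Callable[[Item], bool]

# Ambiguous training set: both features always agree, so the label
# is consistent with either generalization.
train = [Item(True, True), Item(False, False)] * 4

# Disambiguating test set: the two features disagree on every item.
test = [Item(True, False), Item(False, True)] * 4


def train_accuracy(model: Model, items: Sequence[Item]) -> float:
    """Accuracy on ambiguous items, where the label equals either feature."""
    return sum(model(it) == it.linguistic_feature for it in items) / len(items)


def linguistic_bias(model: Model, items: Sequence[Item]) -> float:
    """Score in [-1, 1]: +1 if predictions always follow the linguistic
    feature, -1 if they always follow the surface feature."""
    mixed = [it for it in items if it.surface_feature != it.linguistic_feature]
    if not mixed:
        raise ValueError("no disambiguating items")
    follows_linguistic = sum(model(it) == it.linguistic_feature for it in mixed)
    follows_surface = len(mixed) - follows_linguistic
    return (follows_linguistic - follows_surface) / len(mixed)


def surface_model(item: Item) -> bool:
    """Stand-in for a model that converged on the surface rule."""
    return item.surface_feature


def structural_model(item: Item) -> bool:
    """Stand-in for a model that converged on the structural rule."""
    return item.linguistic_feature
```

Both stand-in models are indistinguishable on `train`, since every training label fits both rules. Only `linguistic_bias` on `test` separates them, which is the point of building the test set so that the two features disagree.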

The key insight: a model can produce correct outputs on typical evaluation tasks while relying on surface generalizations rather than structural ones. If test sets are not specifically designed to rule out surface heuristics, you cannot tell which kind of generalization the model is using.

This has wide implications for how we evaluate LLMs. When a model answers a grammaticality judgment task correctly, we tend to assume it has learned the relevant grammar. But it may have learned that short sentences with common words tend to be grammatical, that sentences with complex embeddings tend to be flagged as ungrammatical, or some other surface regularity that happens to correlate with the training labels.

Instruction tuning provides a striking parallel. The note "Does instruction tuning teach task understanding or output format?" shows that instruction-tuned (IT) models achieve comparable accuracy even when instructions are replaced with simplified or deliberately wrong ("delusive") instructions. Models learn the output format distribution (what kind of response is expected) rather than the task semantics the instructions describe. The "instruction-following" that benchmarks measure is largely format compliance, which correlates with task understanding but doesn't require it, just as syntactic benchmark performance correlates with grammatical knowledge but doesn't require it.

The distinction matters for robustness: surface generalizations fail on unusual structures. Linguistic generalizations are rule-governed and extend systematically to novel forms. If deployment involves unusual syntactic structures, a model relying on surface heuristics will fail — and the failure won't be predictable from standard benchmark performance.

A behavioral counterpart exists in moral reasoning, covered in the note "Do LLMs generalize moral reasoning by meaning or surface form?". Minimal wording changes that reverse the moral meaning of a scenario (e.g., "wrongfully convicted" → "rightfully convicted") leave LLM moral ratings nearly unchanged (r=.99) while human ratings shift substantially (r=.54). This extends the surface-generalization finding from grammatical structure into moral reasoning: the same failure mode operating at a higher cognitive level. Humans track the semantic reversal; LLMs track the token distribution.
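One way to read the r values is as the correlation between ratings of the original scenarios and ratings of the reworded ones. The note does not say how r was computed, so the sketch below assumes a Pearson correlation. The function name `pearson_r` and all the ratings are invented for illustration and are not data from the study.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length rating lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / sqrt(var_x * var_y)


# Made-up ratings on a 1-7 scale, for illustration only.
original = [6.0, 2.0, 5.0, 1.0, 4.0]

# A rater that ignores the meaning-reversing edit rates the reworded scenarios the same.
insensitive_after = [6.0, 2.0, 5.0, 1.0, 4.0]

# A rater that tracks the reversal changes several ratings.
sensitive_after = [2.0, 6.0, 5.0, 1.0, 3.0]

r_insensitive = pearson_r(original, insensitive_after)
r_sensitive = pearson_r(original, sensitive_after)
```

Under this reading, an r near 1 after a meaning-reversing edit signals that the ratings follow the wording rather than the meaning, and a lower r signals that the ratings moved with the meaning.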

Original note title

lms may learn surface generalizations rather than linguistic generalizations despite correct outputs