INQUIRING LINE

Does approaching human performance mean learning the same grammatical rules?

This explores whether a model that scores like a human on grammar tests has actually internalized grammatical rules — or whether matching the score and matching the underlying competence are two different things.


The corpus is unusually pointed on this, and the short answer it suggests is: no, approaching human performance does not guarantee learning the same rules. Performance and competence come apart.

The optimistic side of the ledger is real. Models trained on child-scale data (100 million words or fewer) can land within a few percentage points of humans on grammatical acceptability tasks, with data composition and curation mattering more than raw volume (see "Can language models learn grammar from child-scale data?"). So the headline number gets there. But other work shows that number can be earned the wrong way: models often produce 'correct' outputs by leaning on sentence length, word choice, and spelling cues rather than grammatical structure, and standard benchmarks can't tell the two apart unless they're specifically designed to rule out surface shortcuts (see "Can models pass tests while missing the actual grammar?"). The tell shows up under pressure: grammatical performance degrades systematically as sentences get structurally complex, with recursion and deep embedding failing consistently. That's the signature of surface heuristics, not internalized rules, because a rule that is truly structural should apply the same way however deep the nesting goes (see "Does LLM grammatical performance decline with structural complexity?").
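One way to make the shortcut problem concrete is to score a deliberately structure-blind heuristic on the same minimal pairs a model would see, broken down by embedding depth. The sketch below is illustrative only: the sentence pairs, the `prefers_longer` heuristic, and the `accuracy_by_depth` helper are all hypothetical and come from none of the cited notes. If a rule this crude scores well above chance on a test set, that test set cannot separate surface cues from grammatical structure.

```python
from collections import defaultdict

# Hypothetical minimal pairs: (grammatical, ungrammatical, embedding_depth).
# Invented for illustration; not drawn from any benchmark in the notes.
PAIRS = [
    ("The dog barks.", "The dog bark.", 0),
    ("The dogs bark.", "The dogs barks.", 0),
    ("The dog that the cats chase barks.", "The dog that the cats chase bark.", 1),
    ("The dogs that the cat chases bark.", "The dogs that the cat chases barks.", 1),
]


def prefers_longer(good, bad):
    """Surface heuristic: pick the longer string, ignoring structure entirely."""
    return len(good) > len(bad)


def accuracy_by_depth(pairs, picks_good):
    """Fraction of pairs, per depth, where the scorer picks the grammatical one."""
    hits, totals = defaultdict(int), defaultdict(int)
    for good, bad, depth in pairs:
        totals[depth] += 1
        hits[depth] += int(picks_good(good, bad))
    return {depth: hits[depth] / totals[depth] for depth in sorted(totals)}


# Length-balanced set: the heuristic sits at chance at every depth.
balanced = accuracy_by_depth(PAIRS, prefers_longer)

# Unbalanced set (only pairs where the grammatical sentence happens to be
# longer): the same heuristic now looks perfect without knowing any grammar.
unbalanced = accuracy_by_depth([PAIRS[0], PAIRS[2]], prefers_longer)

print(balanced)    # {0: 0.5, 1: 0.5}
print(unbalanced)  # {0: 1.0, 1: 1.0}
```

Swapping `prefers_longer` for a real model's preference (for example, whichever sentence it assigns higher probability) would give the per-depth curve described above; a drop as depth grows on a length-balanced set is the pattern the notes read as heuristic rather than rule-based behavior.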

What makes this more than a grammar curiosity is that the same pattern recurs everywhere the corpus looks. Instruction tuning is the clearest cousin: models trained on semantically empty or deliberately wrong instructions perform comparably to those trained on correct ones (the source note reports 43% against a random baseline of 42.6%), meaning what transfers is the shape of the output space, not understanding of the task (see "Does instruction tuning teach task understanding or output format?"). Same lesson, different domain: matching the behavior doesn't mean learning the thing the behavior is supposed to require. Even social judgment shows the split. GPT-4.5 out-predicts every individual human on social appropriateness, yet all the models share identical blind spots on unwritten norms, suggesting they learned the surface of culture without the grounding underneath (see "Can AI learn social norms better than humans?").

There's a quieter mechanism for why this happens. Models fail to integrate what's in front of them when prior training associations are strong enough to override it: parametric pattern-matching wins over the actual structure of the input (see "Why do language models ignore information in their context?"). A system that defaults to learned statistical associations is exactly the kind that would pass a grammar test by association rather than by rule.

The interesting twist is whether 'same rules' is even the right bar. One thread argues humans and LLMs differ categorically when you observe them from the outside, but converge when you treat both as participants drawing on the same symbolic substrate, making the difference structural rather than absolute (see "Do humans and LLMs differ fundamentally or just superficially?"). And finetuned models can out-predict theory-built cognitive models of human decision-making without implementing any human-like theory at all (see "Can language models learn to model human decision making?"). So the corpus leaves you with a sharper question than you started with: if a system can match, or even beat, human performance while demonstrably not running the same rules, then 'learning the same grammar' may be a thing we want for interpretability and trust, not something performance will ever force on its own.


Sources (8 notes)

Can language models learn grammar from child-scale data?

Models trained on ≤100 million words performed within a few percentage points of humans on grammatical acceptability tasks, suggesting syntactic competence doesn't require massive datasets. Data composition and curation mattered more than raw volume.

Can models pass tests while missing the actual grammar?

BabyLM evaluations showed models can produce correct outputs by relying on sentence length, word choice, and orthography rather than grammatical structure. Standard benchmarks cannot distinguish these two generalization types without tests specifically designed to rule out surface heuristics.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to those trained on full, correct instructions, with a reported 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can AI learn social norms better than humans?

GPT-4.5 outperformed every individual human at judging social appropriateness across 555 scenarios, challenging the theory that embodied cultural experience is necessary. However, all AI models share identical systematic errors on unwritten norms.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Do humans and LLMs differ fundamentally or just superficially?

Applies Habermas's observer/participant distinction to AI: from the outside, humans and LLMs are utterly different; from within shared discourse, both draw on the same symbolic substrate, making the difference structural rather than absolute.

Can language models learn to model human decision making?

LLMs finetuned on psychology experiment data predict human behavior more accurately than theory-driven models in decision tasks, capture individual differences in their embeddings, and transfer learning across tasks without task-specific design.

Next inquiring lines