INQUIRING LINE

How do rare linguistic registers differ from conceptually complex examples?

This explores the difference between two reasons an LLM might struggle with an example — because the wording is statistically rare (far from what it saw in pretraining) versus because the underlying idea is genuinely hard — and whether the corpus treats these as the same axis or two different ones.


The corpus draws this distinction sharply: 'rare' and 'hard' are not the same axis. Rarity is about distance from the pretraining distribution, that is, how often a phrasing or instance showed up in the data. Conceptual complexity is about the difficulty of the underlying reasoning. The most direct statement of this is curriculum work that flips the standard intuition: instead of fine-tuning models on easy concepts first, it fine-tunes them on rare data first, because rarity signals a distributional gap, not a pedagogical one (see 'Does ordering training data by rarity actually improve language models?'). In that framing, 'easy vs. hard' is really 'common vs. far-from-training' wearing a disguise.
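A rough way to picture rarity-first ordering is the sketch below. It is only an illustration under loud assumptions: the source note does not say how rarity is measured, so the add-one-smoothed unigram score, the tiny reference corpus, and the names `unigram_model`, `rarity`, and `rare_first` are all hypothetical stand-ins, not the method from the cited work.

```python
import math
from collections import Counter

def unigram_model(reference_texts):
    """Count word frequencies in a stand-in for the pretraining corpus."""
    counts = Counter(w for text in reference_texts for w in text.lower().split())
    return counts, sum(counts.values())

def rarity(text, counts, total):
    """Mean negative log-probability per word, with add-one smoothing.

    Assumption: a unigram score is a crude proxy for distance from the
    pretraining distribution; it is not the measure used in the cited work.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    vocab = len(counts) + 1  # one extra slot shared by all unseen words
    nll = [-math.log((counts[w] + 1) / (total + vocab)) for w in words]
    return sum(nll) / len(nll)

def rare_first(examples, counts, total):
    """Order fine-tuning examples from most rare to least rare."""
    return sorted(examples, key=lambda ex: rarity(ex, counts, total), reverse=True)

# Toy data, invented for illustration only.
reference = ["the cat sat on the mat", "the dog sat on the rug", "the cat saw the dog"]
examples = ["the cat sat", "whence cometh yon feline", "the dog saw the cat"]

counts, total = unigram_model(reference)
ordered = rare_first(examples, counts, total)
print(ordered)
```

Note that the archaic phrasing sorts first even though it says nothing conceptually harder than the other two examples, which is the point of separating the two axes.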

This matters because frequency, not meaning, turns out to be a primary lever on model behavior. Models systematically prefer high-frequency surface forms over semantically identical rare paraphrases: the same idea, said two ways, gets different performance purely because one phrasing carries more statistical mass (see 'Do language models really understand meaning or just surface frequency?'). Taken to its root, this lets you predict where a model will fail just by asking which target responses are low-probability: tasks that are logically trivial (counting letters, reciting the alphabet backwards) break not because they're complex but because they're rare in text (see 'Can we predict where language models will fail?'). So a rare register fails for a distributional reason even when nothing conceptually difficult is happening.

Conceptual or structural complexity is a genuinely separate failure mode. Grammatical competence degrades predictably as syntactic depth and embedding increase: models that handle simple sentences fine misidentify embedded clauses and complex nominals, suggesting they learned surface heuristics rather than structural rules (see 'Does LLM grammatical performance decline with structural complexity?' and 'Why do large language models fail at complex linguistic tasks?'). Here the difficulty scales with the structure itself, not with how often that structure appeared.
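To make "difficulty scales with the structure itself" concrete, here is a minimal sketch that computes embedding depth from a hand-bracketed sentence. The bracket annotation and the function name `max_embedding_depth` are assumptions made for illustration; the cited notes do not say how syntactic depth was measured.

```python
def max_embedding_depth(bracketed):
    """Return the deepest nesting level in a bracket-annotated sentence.

    Assumption: each clause is wrapped in square brackets by hand.
    No corpus statistics are consulted, so the score depends only on structure.
    """
    depth = 0
    deepest = 0
    for ch in bracketed:
        if ch == "[":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == "]":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced closing bracket")
    if depth != 0:
        raise ValueError("unbalanced opening bracket")
    return deepest

simple = "[the dog barked]"
embedded = "[the dog [that the cat [that the rat bit] chased] barked]"

print(max_embedding_depth(simple))    # 1
print(max_embedding_depth(embedded))  # 3
```

Both sentences use common words, so a frequency-based rarity score would rate them similarly; only the depth measure tells them apart.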

The sharpest entry into your question is the reasoning literature, which directly pits novelty against complexity and finds that novelty wins. Reasoning models don't break at a complexity threshold; they break at instance-unfamiliarity boundaries. A long, complicated reasoning chain succeeds if the model was trained on similar instances, while a short one fails if the instance is novel, because models fit instance-based patterns rather than general algorithms (see 'Do language models fail at reasoning due to complexity or novelty?'). That's the cleanest articulation of your distinction: what looks like 'too hard' is often just 'too rare.' There's even a mechanistic correlate: hidden states sparsify under out-of-distribution shift, an adaptive response that tracks task unfamiliarity alongside reasoning load (see 'Do language models sparsify their activations under difficult tasks?').
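For readers unsure what "sparsify" means operationally, one possible reading is sketched below: the fraction of hidden-state entries that are negligible relative to the largest one. The metric, the `rel_threshold` parameter, and both vectors are assumptions invented for illustration; the source note does not specify how sparsity was measured, and the numbers are not measurements from any model.

```python
def sparsity(hidden_state, rel_threshold=0.05):
    """Fraction of entries with magnitude below rel_threshold * max magnitude.

    Assumption: this near-zero fraction is just one plausible sparsity metric.
    """
    if not hidden_state:
        raise ValueError("hidden_state must be non-empty")
    peak = max(abs(x) for x in hidden_state)
    if peak == 0:
        return 1.0  # an all-zero vector is treated as fully sparse
    cutoff = rel_threshold * peak
    return sum(1 for x in hidden_state if abs(x) < cutoff) / len(hidden_state)

# Hand-written toy vectors, not real activations.
dense_vector = [0.8, -0.6, 0.7, 0.5, -0.9, 0.4, 0.6, -0.7]
sparse_vector = [0.01, 0.0, 2.5, -0.02, 0.0, 0.03, -1.9, 0.0]

print(sparsity(dense_vector))   # 0.0
print(sparsity(sparse_vector))  # 0.75
```

Under this reading, a sparser state concentrates its magnitude in a few dimensions, which is the pattern the note describes as appearing under out-of-distribution shift.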

The twist worth taking away: for these models the two categories partly collapse. Because the underlying mechanism is statistical compression, which captures broad category structure while discarding fine distinctions (see 'Do LLMs compress concepts more aggressively than humans do?'), a rare register and a complex concept can produce the same symptom for the same reason: both sit in low-density regions of what the model compressed well. The 'potemkin' pattern, where a model explains a concept correctly but cannot apply it, hints that conceptual mastery and execution are separately distributed too, so a model can have seen the explanation often (common) while the application case stays rare (see 'Can LLMs understand concepts they cannot apply?'). The practical upshot: when a model stumbles, it is usually more useful to ask 'how rare is this?' before asking 'how hard is this?'


Sources (9 notes)

Does ordering training data by rarity actually improve language models?

CTFT fine-tunes LLMs on rare data first because rarity signals distributional weakness, not conceptual difficulty. This reframes curriculum learning as managing distance from pre-training distribution rather than pedagogical scaffolding.

Do language models really understand meaning or just surface frequency?

LLMs show a consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests that tracking statistical mass from pretraining, rather than recognizing meaning, is their primary mechanism.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed these predictions on tasks such as reciting the alphabet backwards and counting letters.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds if the model was trained on similar instances.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Do LLMs compress concepts more aggressively than humans do?

Using Rate-Distortion Theory on cognitive datasets, LLMs capture broad category structure but lose fine-grained distinctions humans preserve. LLMs maximize compression efficiency; humans trade compression for contextual meaning that enables situated action.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
