Why do rare complex structures in training data harm LLM generalization?

This explores why training data that is both statistically rare and structurally complex tends to break LLM generalization. The corpus suggests that rarity and complexity are two separate harms that get tangled together.

The most useful thing the corpus does is pull apart two mechanisms that the question bundles into one. "Rare" is about statistical frequency: how far an example sits from the pre-training distribution. "Complex" is about structure: syntactic depth, recursion, embedded clauses. They feel like the same problem, but the research treats them as distinct failure routes.

Start with rarity. One line of work reframes curriculum learning entirely: rare data is hard not because it is conceptually difficult but because it signals a *distributional gap*, a place where pre-training never built coverage (see "Does ordering training data by rarity actually improve language models?"). That connects to a sharper claim: because an LLM is fundamentally an autoregressive probability machine, tasks whose correct answers are *low-probability strings* are systematically harder even when they are logically trivial, such as reciting the alphabet backwards or counting letters (see "Can we predict where language models will fail?"). On this view, rarity harms generalization because the model is, at bottom, matching frequency rather than learning the procedure that would let it extrapolate to the rare case.
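The low-probability-target idea can be illustrated with a deliberately tiny stand-in for an autoregressive model. The sketch below is not how any real LLM is trained or scored; it assumes a character bigram model with add-one smoothing, and the names `train_bigram` and `sequence_log_prob` are hypothetical. It only shows that a target which is logically trivial (the alphabet reversed) can still be a very unlikely string under a model fit to forward text.

```python
import math
from collections import Counter


def train_bigram(corpus, alphabet):
    """Count character bigrams and return a smoothed log P(next | prev) function."""
    pair_counts = Counter()
    prev_counts = Counter()
    for text in corpus:
        for prev, nxt in zip(text, text[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    vocab_size = len(alphabet)

    def log_prob(prev, nxt):
        # Add-one smoothing: unseen pairs get small but nonzero probability.
        return math.log(
            (pair_counts[(prev, nxt)] + 1) / (prev_counts[prev] + vocab_size)
        )

    return log_prob


def sequence_log_prob(log_prob, text):
    """Sum the conditional log-probabilities of each character given the previous one."""
    return sum(log_prob(prev, nxt) for prev, nxt in zip(text, text[1:]))


alphabet = "abcdefghijklmnopqrstuvwxyz"
corpus = [alphabet] * 50  # toy "pre-training" data: forward alphabet only (assumption)
log_prob = train_bigram(corpus, alphabet)

forward = sequence_log_prob(log_prob, alphabet)
backward = sequence_log_prob(log_prob, alphabet[::-1])
print(f"forward:  {forward:.2f}")
print(f"backward: {backward:.2f}")
```

Both strings contain the same characters and need the same "knowledge", yet the reversed one scores far lower because none of its transitions were ever seen. That is the sense in which a frequency-matching model fights a rare target.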

Now complexity, which exposes a different weakness. As syntactic depth and embedding increase, grammatical competence degrades *predictably*: models that handle simple sentences cleanly fall apart on recursion and nested clauses (see "Does LLM grammatical performance decline with structural complexity?" and "Why do large language models fail at complex linguistic tasks?"). The interpretation is not "not enough data". It is that the model learned surface heuristics that *approximate* grammar rather than the compositional rules that would generalize to arbitrary depth. The same pattern shows up outside language: when semantic cues are stripped out, reasoning collapses even with the correct rules sitting in context, because models lean on token associations rather than symbolic manipulation (see "Do large language models reason symbolically or semantically?").
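To make "syntactic depth" concrete, here is a minimal sketch that builds center-embedded sentences of increasing depth from one fixed rule. The word lists and the function name `center_embedded` are illustrative assumptions and do not come from the cited notes. A system that had learned the compositional rule would treat every depth the same way; one relying on surface heuristics has far fewer examples to lean on as depth grows.

```python
# Hypothetical word lists; each noun is paired with the verb at the same index.
NOUNS = ["the rat", "the cat", "the dog", "the boy"]
VERBS = ["died", "bit", "chased", "owned"]


def center_embedded(depth):
    """Build a sentence with `depth` center-embedded relative clauses.

    Nouns open clauses left to right; verbs close them innermost-first,
    so the verb list is reversed before joining.
    """
    if not 0 <= depth < len(NOUNS):
        raise ValueError(f"depth must be between 0 and {len(NOUNS) - 1}")
    nouns = NOUNS[: depth + 1]
    verbs = VERBS[: depth + 1]
    return " ".join(nouns + verbs[::-1])


for d in range(len(NOUNS)):
    print(d, center_embedded(d))
```

Sentences like these could be used to bucket an evaluation set by depth, which is one way to check whether accuracy falls off as embedding increases.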

Here is where the two strands meet, and why the combination is especially damaging. Complex structures are *also* rare (deeply nested sentences are uncommon in any corpus), so a complex-and-rare example hits both failure routes at once: it is a low-frequency target that the autoregressive prior works against, *and* it demands compositional structure that the model only ever approximated at the surface. The deeper finding across the corpus is that scale does not rescue this. Models pattern-match memorized templates instead of executing the underlying procedure, and out-of-distribution variants produce sharp performance drops regardless of parameter count (see "Do large language models actually perform iterative optimization?" and "Do fine-tuned language models actually learn optimization procedures?"). Generalization fails because there was never a generalizing mechanism, just a very good frequency map.

One less obvious consequence: this is why a model can *explain* a complex structure correctly and still fail to *apply* it, the "potemkin" pattern in which explanation and execution run on functionally disconnected pathways (see "Can LLMs understand concepts they cannot apply?"). Rare complex structures do not just sit in a blind spot; they reveal that fluency and competence were never the same thing. Counterintuitively, deliberately training on the rare cases *first*, treating rarity as a map of where coverage is thin rather than as difficulty to defer, may be the more direct fix (see "Does ordering training data by rarity actually improve language models?").
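The rare-first ordering can be sketched as a simple sort. This is not the CTFT method itself, whose rarity measure is not described here. As a stand-in, the sketch assumes rarity is the mean negative log unigram probability of an example's tokens under a reference corpus, and the names `rarity_scores` and `rare_first` are hypothetical.

```python
import math
from collections import Counter


def rarity_scores(examples, reference_corpus):
    """Score each example by mean negative log unigram probability (higher = rarer)."""
    counts = Counter(tok for text in reference_corpus for tok in text.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot so unseen tokens get nonzero mass

    def score(text):
        tokens = text.split()
        if not tokens:
            return 0.0
        log_probs = [math.log((counts[t] + 1) / (total + vocab)) for t in tokens]
        return -sum(log_probs) / len(tokens)

    return [score(text) for text in examples]


def rare_first(examples, reference_corpus):
    """Return examples ordered from rarest to most common under the reference corpus."""
    scores = rarity_scores(examples, reference_corpus)
    ranked = sorted(zip(scores, examples), key=lambda pair: pair[0], reverse=True)
    return [example for _, example in ranked]


reference = ["the cat sat", "the cat ran", "the dog sat"]  # toy stand-in for pre-training text
examples = ["the cat sat", "zygote quorum", "the dog ran"]
ordered = rare_first(examples, reference)
print(ordered)
```

The design choice worth noting is that the score measures distance from the reference distribution, not conceptual difficulty, which matches the reframing in the paragraph above. Any better estimate of that distance could replace the unigram score without changing the ordering logic.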

Sources (8 notes)

Does ordering training data by rarity actually improve language models?

CTFT fine-tunes LLMs on rare data first because rarity signals distributional weakness, not conceptual difficulty. This reframes curriculum learning as managing distance from pre-training distribution rather than pedagogical scaffolding.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
