How does circuit complexity limit which grammatical structures transformers can acquire?
This explores whether the theoretical computation limits of fixed-depth transformers (the AC0/TC0 complexity classes) are what stops them from acquiring recursive, deeply-embedded grammar — and what the corpus says about working around it.
This explores whether the theoretical computation limits of fixed-depth transformers are what stops them from learning recursive grammar, and the corpus lines up surprisingly cleanly around that idea. The sharpest statement of the ceiling comes from work on hierarchical recurrence Can recurrent hierarchies achieve reasoning that transformers cannot?: a standard transformer has a fixed number of layers, which caps the depth of computation it can perform in a single forward pass — formally it sits inside the AC0/TC0 complexity classes. Grammar that requires tracking arbitrarily deep nesting (clauses inside clauses inside clauses) needs a kind of unbounded counting or stack-keeping that those classes can't express. So the limit isn't a training-data problem; it's baked into the architecture's shape.
You can see the fingerprints of this ceiling in behavior. Two independent lines find that LLM grammatical competence doesn't fail randomly — it degrades *predictably* as structural depth and embedding increase Does LLM grammatical performance decline with structural complexity?, with top models consistently misidentifying embedded clauses and complex nominals exactly when nesting gets deep Why do large language models fail at complex linguistic tasks?. That smooth, monotonic decline is the signature of a system that learned surface heuristics within its computational budget rather than a true recursive rule that would generalize to any depth. The compositional-reasoning work makes the same point from another angle: transformers tend to solve structured tasks by memorizing linearized computation subgraphs and then fail catastrophically on novel combinations Do transformers actually learn systematic compositional reasoning? — pattern-matching standing in for a procedure the architecture can't natively run.
The most revealing evidence is what happens when you hand the architecture the missing primitive. Pushdown Layers bolt an explicit stack onto self-attention, and recursive syntactic generalization jumps 3-5x while perplexity holds steady Can explicit stack tracking improve how transformers learn recursive syntax?. That's the tell: if recursion were just hard to learn from data, more scale would fix it; instead, giving the model the one data structure that escapes the complexity class is what unlocks it. Recursive grammar specifically benefits from that architectural inductive bias in a way general compositional skill does not.
There's a fascinating tension worth sitting with, though. In principle a single finite transformer is Turing-complete — the right prompt can make it compute anything Can a single transformer become universally programmable through prompts?. So the ceiling isn't absolute capability; it's what *gradient-descent training reliably installs*. Standard training rarely produces models that implement arbitrary programs, so the practical limit on grammar acquisition is the gap between what the architecture *could* express and what learning actually carves into it. Escape routes therefore come in two flavors: change the architecture's effective depth (hierarchical recurrence) or change its memory (the explicit stack of pushdown layers, or surprise-gated neural memory modules that scale computation beyond attention's native limits Can neural memory modules scale language models beyond attention limits?).
The thing you might not have expected to want to know: the deepest grammatical failures and the famous failures with jokes and wordplay may share a root. One account argues transformers integrate tokens by additive parallel aggregation rather than selectively suppressing irrelevant words Why do AI systems miss jokes and wordplay so consistently? — a structural operation the architecture lacks, not a knowledge gap. Whether it's deep clause embedding or frame-dependent meaning, the recurring story is the same: some linguistic structures sit just outside what a fixed-depth, flat-aggregation machine can compute, and no amount of data moves that wall — only changing the machine does.
Sources 8 notes
The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.
LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
Pushdown Layers—a drop-in self-attention replacement with explicit stack tracking—achieve 3-5x more sample-efficient syntactic generalization while maintaining perplexity. The improvement shows that recursive structure specifically benefits from architectural inductive bias despite general compositional generalization emerging from scale.
Research proves a single finite-size transformer exists that can compute any computable function given the right prompt, achieving complexity bounds nearly matching unbounded models. However, standard training rarely produces models that learn to implement arbitrary programs this way.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
Transformers integrate token information through weighted parallel aggregation rather than selective suppression of irrelevant words. This structural difference explains consistent failures with jokes, wordplay, and frame-dependent meaning—not knowledge gaps, but missing cognitive operations.