What formal language complexity level matches transformer computational limits best?

This explores where transformers actually sit on the formal-language scale (the Chomsky hierarchy and its circuit-complexity cousin) — what level of structured language they can genuinely compute versus merely imitate.

This explores where transformers actually sit on the formal-language scale rather than where their architecture theoretically tops out, and the corpus pulls in two directions worth holding side by side. On paper, transformers are extremely powerful: a single finite-size transformer can be coaxed by the right prompt into computing any computable function, making prompts themselves Turing-complete programs Can a single transformer become universally programmable through prompts?. But that's a statement about what's possible, not what gets learned — the same work notes standard training almost never produces models that actually implement arbitrary programs this way. So the interesting question isn't the ceiling; it's the operating level.

The sharpest direct answer comes from research on which formal languages help transformers learn at all. Transfer from a formal language to natural language only works when the formal language meets two conditions at once: it captures hierarchical, nested dependencies (its place in the Chomsky hierarchy) AND it stays inside what transformers can actually learn with length generalization (its circuit complexity) What formal languages actually help transformers learn natural language?. That second constraint is the real boundary. Circuit complexity — roughly, how much parallel computation a fixed-depth network can do in one forward pass — is the level that matches transformer limits, and it sits below the full recursive power the Chomsky hierarchy alone would suggest. Languages that need genuine unbounded recursion fall outside it.

You can see that ceiling show up behaviorally. Grammatical competence degrades predictably as sentences get structurally deeper: simple sentences are handled well, but recursion and center-embedding fail consistently — evidence that models lean on surface heuristics rather than true structural grammar rules Does LLM grammatical performance decline with structural complexity?. The same pattern recurs in reasoning: transformers succeed on compositional tasks by memorizing and matching linearized computation subgraphs from training, then collapse on novel compositions, with errors compounding step by step Do transformers actually learn systematic compositional reasoning?. And when a task demands genuine iteration — running a numerical method to convergence — models pattern-match a memorized template and emit plausible-but-wrong values instead of executing the loop Do large language models actually perform iterative optimization?. Each of these is the signature of a system that's strong at bounded, parallel, pattern-shaped computation and weak at unbounded sequential recursion.

The thing you might not have expected: this isn't a scale problem you can buy your way out of. On constrained-optimization tasks, models plateau around 55–60% satisfaction regardless of parameter count, architecture, or training regime — a fundamental ceiling rather than a gap waiting for the next bigger model Do larger language models solve constrained optimization better?. That's what a complexity-class limit looks like from the outside: it doesn't move when you add parameters. If you want to go deeper on why some hard limits are provably immovable for any computable model at all, the hallucination-inevitability work makes the formal version of that argument Can any computable LLM truly avoid hallucinating?.

So the cleanest synthesis the corpus supports: transformers are Turing-complete in principle but operate, in practice, at the level of bounded-depth circuit complexity — they comfortably handle hierarchical structure up to a point, but the moment a task requires true recursion or iteration past their fixed compute budget, they substitute matching for computing. The right formal-language target to train on, accordingly, is the one that lives inside that circuit-complexity envelope while still demanding hierarchy What formal languages actually help transformers learn natural language?.

Sources 7 notes

Can a single transformer become universally programmable through prompts?

Research proves a single finite-size transformer exists that can compute any computable function given the right prompt, achieving complexity bounds nearly matching unbounded models. However, standard training rarely produces models that learn to implement arbitrary programs this way.

What formal languages actually help transformers learn natural language?

Transfer from formal to natural language succeeds only when formal languages satisfy two conditions: they capture hierarchical dependencies (Chomsky hierarchy) AND are learnable by transformers with length generalization (circuit complexity). Formal languages meeting both constraints outperform matched natural language training.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Can any computable LLM truly avoid hallucinating?

Three formal theorems prove that any computable LLM must hallucinate on infinitely many inputs, and internal mechanisms like self-correction cannot eliminate this mathematical constraint. External safeguards are therefore necessary, not optional.

What formal language complexity level matches transformer computational limits best?

Sources 7 notes

Next inquiring lines