Can explicit stack mechanisms extend what formal languages transformers can learn?
This explores whether bolting an explicit stack onto a transformer actually widens the class of formal/recursive languages it can learn — and what the corpus says about why transformers struggle with hierarchical structure in the first place.
This explores whether giving transformers an explicit stack helps them learn the kinds of nested, recursive structure that formal languages demand. The corpus gives a fairly direct yes — but a qualified one, and the qualifications are where it gets interesting. The clearest evidence is Pushdown Layers, a drop-in replacement for self-attention that maintains an explicit stack tape as it reads. The result is 3-5x more sample-efficient generalization on recursive syntax, with no hit to perplexity Can explicit stack tracking improve how transformers learn recursive syntax?. The signal here isn't just "stacks help" — it's that recursive structure *specifically* benefits from baked-in architectural bias, even though looser compositional skills tend to emerge on their own from scale. Some structure the model will find; recursion it would rather be handed.
Why recursion in particular? A second note reframes the whole problem. When researchers strip the semantic content out of reasoning tasks, transformer performance collapses even when the correct rules sit right there in the context — models lean on token associations and parametric commonsense, not formal symbol manipulation Do large language models reason symbolically or semantically?. A stack mechanism is essentially a way to smuggle genuine symbolic state into a system that otherwise fakes it semantically. That also explains a related failure: transformers often solve compositional tasks by memorizing linearized computation subgraphs from training, then breaking badly on novel combinations as errors compound across steps Do transformers actually learn systematic compositional reasoning?. An explicit stack attacks exactly that weakness — it gives the model a place to *track* depth rather than pattern-match it.
The most surprising adjacent piece is what makes a formal language *useful* as a transformer teacher in the first place. Pre-training on formal languages transfers to natural language only when the formal language satisfies two conditions at once: it captures hierarchical dependencies (high on the Chomsky hierarchy) AND stays learnable by a transformer with length generalization (bounded circuit complexity) What formal languages actually help transformers learn natural language?. That's the crux of your question turned inside out: there's a tension between *expressiveness* and *learnability*. Plain transformers can only reach so far up the hierarchy before length generalization breaks. An explicit stack is precisely the device that raises the learnable ceiling — it lets the architecture reach languages (context-free and beyond) that a vanilla attention stack stalls on.
Worth knowing the outer bound, though. In principle a single finite transformer is already Turing complete given the right prompt prompting-is-turing-complete-a-single-finite-transformer-can-compute-any-co — so the question was never "can transformers represent recursion" but "will training actually find that solution." That same note flags that standard training rarely produces models implementing arbitrary programs. So explicit stacks aren't expanding theoretical capacity; they're a learnability shortcut, narrowing the search so gradient descent lands on the recursive solution instead of a brittle memorized one. A neighboring finding sharpens this: several reasoning collapses turn out to be *execution* failures, not reasoning failures — models that know an algorithm still can't run it step-by-step in text alone, and tool-augmented versions clear the supposed cliff Are reasoning model collapses really failures of reasoning?. An explicit stack is the same intuition pushed into the architecture rather than into an external tool: give the model real working memory for state, and the recursion it couldn't sustain becomes tractable.
Sources 6 notes
Pushdown Layers—a drop-in self-attention replacement with explicit stack tracking—achieve 3-5x more sample-efficient syntactic generalization while maintaining perplexity. The improvement shows that recursive structure specifically benefits from architectural inductive bias despite general compositional generalization emerging from scale.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
Transfer from formal to natural language succeeds only when formal languages satisfy two conditions: they capture hierarchical dependencies (Chomsky hierarchy) AND are learnable by transformers with length generalization (circuit complexity). Formal languages meeting both constraints outperform matched natural language training.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.