What does next-token prediction tell us about compositional linguistic competence?
This explores what training a model to predict the next token actually buys you when it comes to genuine grammatical and compositional understanding — whether the objective produces real structural competence or convincing surface mimicry.
The corpus splits sharply on what the next-token prediction objective actually teaches a model about grammar and composition. The skeptical reading is that prediction from form yields surface pattern-matching dressed up as competence. When researchers increase grammatical complexity, LLM performance degrades systematically: simple sentences are handled well, but recursion and deep embedding fail consistently, which looks far more like learned surface heuristics than internalized structural rules (see "Does LLM grammatical performance decline with structural complexity?"). Bender & Koller make the strongest version of this case: meaning is the relation between expressions and communicative intent, and a system trained only on form-to-form prediction has no access to that relation, so it cannot reconstruct the meaning that grounds language in the first place (see "Can language models learn meaning from text patterns alone?").
The sharpest diagnostic comes from treating the model as exactly what the objective makes it: an autoregressive probability machine. Once you do that, you can *predict in advance* where it will fail: tasks whose correct answer is a low-probability sequence (reciting the alphabet backwards, counting letters) are systematically hard even when they are logically trivial (see "Can we predict where language models will fail?"). That is the deep point: next-token prediction optimizes for likelihood, not structural correctness, so 'competence' tracks frequency, not grammar. The same fingerprint shows up when strong training priors override what is in the context window (see "Why do language models ignore information in their context?"), and when local token-to-token associations, the most prediction-native signal there is, turn out to account for up to 67% of reasoning errors (see "Where do memorization errors arise in chain-of-thought reasoning?").
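The link between sequence probability and difficulty can be made concrete with a toy model. The sketch below is illustrative only: it substitutes a smoothed character bigram model for an LLM, and the names `train_bigram` and `sequence_log_prob` are hypothetical helpers, not anything from the cited work. It scores a sequence with the chain rule and compares the alphabet forwards and backwards.

```python
import math
from collections import Counter

def train_bigram(corpus, alpha=1.0):
    """Fit an add-alpha smoothed character bigram model (toy stand-in for an LLM)."""
    vocab = sorted(set(corpus))
    pair_counts = Counter(zip(corpus, corpus[1:]))
    prev_counts = Counter(corpus[:-1])

    def log_prob(prev, nxt):
        # Smoothed estimate of log p(nxt | prev).
        num = pair_counts[(prev, nxt)] + alpha
        den = prev_counts[prev] + alpha * len(vocab)
        return math.log(num / den)

    return log_prob

def sequence_log_prob(log_prob, seq):
    """Chain rule over transitions, conditioning on the first character."""
    return sum(log_prob(a, b) for a, b in zip(seq, seq[1:]))

alphabet = "abcdefghijklmnopqrstuvwxyz"
corpus = alphabet * 50  # training data only ever shows the forward order

log_prob = train_bigram(corpus)
forward = sequence_log_prob(log_prob, alphabet)
backward = sequence_log_prob(log_prob, alphabet[::-1])

print(f"forward:  {forward:.2f}")
print(f"backward: {backward:.2f}")
```

Both strings are equally simple to specify, yet the reversed one gets a far lower score because none of its transitions occur in the training data. A real LLM conditions on much longer contexts, so this only illustrates the direction of the effect.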
But the corpus also pushes back, and this is the part most readers won't expect. Compositional structure does seem to emerge from prediction, just not where you'd look for it. Pruning experiments show neural networks spontaneously implementing compositional subroutines in isolated, ablatable subnetworks, and pretraining makes that modularity *more* reliable, not less (see "Do neural networks naturally learn modular compositional structure?"). Learned skills can even be recombined at inference time: tuning just the singular values of weight matrices yields composable 'expert' vectors that mix dynamically without interfering (see "Can models dynamically activate expert skills at inference time?"). So composition is in there, and the objective builds modular machinery, but the machinery is statistical, not symbolic.
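As a rough illustration of what "tuning just the singular values" means, the sketch below builds a 2x2 weight matrix from hand-chosen factors (two rotations and a list of singular values, so no SVD routine is needed) and rescales the singular values with per-skill vectors. Everything here is an assumption made for illustration: the vectors `z_a` and `z_b`, the mixing weights, and the `adapt` helper are hypothetical and do not reproduce the cited method's training or dispatch procedure.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product for small list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(row) for row in zip(*m)]

def rotation(theta):
    """2x2 rotation matrix, used here as an orthogonal factor."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

# Fixed factors of the base weight W = U * diag(SIGMA) * V^T.
U = rotation(0.3)
V = rotation(1.1)
SIGMA = [3.0, 1.0]

def adapt(z):
    """Rebuild the weight with each singular value scaled by the matching entry of z."""
    scaled = [[SIGMA[0] * z[0], 0.0], [0.0, SIGMA[1] * z[1]]]
    return matmul(matmul(U, scaled), transpose(V))

z_a = [1.5, 0.5]  # hypothetical expert vector for skill A
z_b = [0.8, 1.2]  # hypothetical expert vector for skill B
z_mix = [0.7 * a + 0.3 * b for a, b in zip(z_a, z_b)]

W_base = adapt([1.0, 1.0])  # all-ones vector leaves the base weight unchanged
W_mix = adapt(z_mix)        # interpolated expert applied at inference time
```

Because only the diagonal is touched, each expert is one number per singular value, and mixing two experts is a weighted sum of short vectors. The sketch shows that mixing is linear in the weights; it does not by itself demonstrate the non-interference reported in the source.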
The most interesting tension is that the model's internal computation and its emitted tokens can diverge. Logit-lens work finds transformers computing the correct answer in their early layers, then actively *suppressing* that representation in the final layers to emit format-compliant filler (see "Do transformers hide reasoning before producing filler tokens?"). Read alongside the finding that only a small minority of tokens, the roughly 20% with high entropy that act as 'forking' points, carry the real decision-making load (see "Do high-entropy tokens drive reasoning model improvements?"), a subtler picture emerges: next-token prediction is a noisy readout of richer internal structure, not a transparent window into it.
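To make "high-entropy forking tokens" concrete, here is a minimal sketch that computes the Shannon entropy of each position's next-token distribution and keeps the top fraction. The distributions are made-up numbers, and `forking_positions` with its `fraction` parameter is a hypothetical helper; the 0.2 default simply mirrors the roughly 20% figure in the source note.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def forking_positions(distributions, fraction=0.2):
    """Return the indices of the highest-entropy positions, in sequence order."""
    scores = [entropy(d) for d in distributions]
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Made-up next-token distributions for a five-token sequence.
dists = [
    [0.97, 0.01, 0.01, 0.01],      # near-certain continuation
    [0.25, 0.25, 0.25, 0.25],      # genuine fork: four equally likely options
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]

picked = forking_positions(dists)
print(picked)
```

In this toy input only the uniform distribution at index 1 qualifies. With a real model the distributions would come from the softmax over its logits at each position.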
The thing worth walking away with: 'does next-token prediction yield compositional competence?' is the wrong question, because the answer is both yes and no at different levels. The objective demonstrably builds modular, recombinable internal structure — yet what it surfaces as output is governed by sequence probability, so it fails exactly where being correct means being improbable. The competence and the incompetence have the same cause.
Sources (9 notes)
LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.
Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.
By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.