What does next-token prediction tell us about compositional linguistic competence?

This note explores what training a model to predict the next token actually buys you in terms of genuine grammatical and compositional understanding: whether the objective produces real structural competence or convincing surface mimicry.

The corpus splits sharply on the answer. The skeptical reading is that prediction from form alone yields surface pattern-matching dressed up as competence. When researchers increase grammatical complexity, LLM performance degrades systematically: simple sentences are handled well, but recursion and deep embedding fail consistently, which looks far more like learned surface heuristics than internalized structural rules (see "Does LLM grammatical performance decline with structural complexity?"). Bender & Koller make the strongest version of this case: meaning is the relation between expressions and communicative intent, and a system trained only on form-to-form prediction has no access to that relation, so it cannot reconstruct the meaning that grounds language in the first place (see "Can language models learn meaning from text patterns alone?").

The sharpest diagnostic comes from treating the model as exactly what the objective makes it: an autoregressive probability machine. Once you do that, you can *predict in advance* where it will fail: tasks whose correct answer is a low-probability sequence (reciting the alphabet backwards, counting letters) are systematically hard even when they are logically trivial (see "Can we predict where language models will fail?"). That is the deep point: next-token prediction optimizes for likelihood, not structural correctness, so 'competence' tracks frequency, not grammar. The same fingerprint appears when strong training priors override what is in the context window (see "Why do language models ignore information in their context?"), and when local token-to-token associations, the most prediction-native signal there is, turn out to account for up to 67% of reasoning errors in chain-of-thought (see "Where do memorization errors arise in chain-of-thought reasoning?").
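The likelihood-versus-correctness point can be made concrete with a toy model. The sketch below is an illustration only, not the method used in the cited work: it assumes a hypothetical add-one-smoothed bigram model and an invented corpus in which the forward alphabet fragment is far more frequent than the reversed one. Both sequences are equally easy to specify, yet the reversed one scores much lower because the score reflects frequency alone.

```python
import math
from collections import Counter


def train_bigram(corpus, vocab):
    """Count bigrams over whitespace tokens and return a smoothed log-prob scorer."""
    pair_counts = Counter()
    context_counts = Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            pair_counts[(prev, nxt)] += 1
            context_counts[prev] += 1

    def log_prob(prev, nxt):
        # Add-one smoothing: unseen bigrams get a small nonzero probability.
        numerator = pair_counts[(prev, nxt)] + 1
        denominator = context_counts[prev] + len(vocab)
        return math.log(numerator / denominator)

    return log_prob


def sequence_log_prob(log_prob, sentence):
    """Sum next-token log-probabilities left to right (autoregressive scoring)."""
    tokens = ["<s>"] + sentence.split()
    return sum(log_prob(p, n) for p, n in zip(tokens, tokens[1:]))


# Toy data (assumption): the forward order dominates the training corpus.
vocab = ["a", "b", "c", "d", "e"]
corpus = ["a b c d e"] * 50 + ["e d c b a"]

log_prob = train_bigram(corpus, vocab)
forward = sequence_log_prob(log_prob, "a b c d e")
backward = sequence_log_prob(log_prob, "e d c b a")

print(f"forward  log-prob: {forward:.3f}")
print(f"backward log-prob: {backward:.3f}")
```

A real LLM is far richer than a bigram table, but the scoring rule has the same shape: a sum of conditional log-probabilities, with no term that rewards being correct.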

But the corpus also pushes back, and this is the part most readers won't expect. Compositional structure does seem to emerge from prediction, just not where you'd look for it. Pruning experiments show neural networks spontaneously implement compositional subroutines in isolated, ablatable subnetworks, and pretraining makes that modularity *more* reliable, not less (see "Do neural networks naturally learn modular compositional structure?"). At inference these learned components can even be recombined: tuning just the singular values of weight matrices yields composable 'expert' vectors that mix dynamically without interfering (see "Can models dynamically activate expert skills at inference time?"). So composition is in there, and the objective builds modular machinery, but the machinery is statistical, not symbolic.
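One way to write down the singular-value idea is sketched below. The notation is an assumption introduced here for illustration (the symbols $z_k$ and $\alpha_k$ do not appear in the source notes), and the cited work's exact parameterization may differ: a weight matrix is factored by SVD, an expert vector rescales its singular values, and several experts are combined by a weighted sum.

```latex
\begin{align}
W &= U \,\mathrm{diag}(\sigma)\, V^{\top},
  & \sigma &\in \mathbb{R}^{r} \\
W'(z) &= U \,\mathrm{diag}(\sigma \odot z)\, V^{\top},
  & z &\in \mathbb{R}^{r} \\
z_{\text{mix}} &= \sum_{k=1}^{K} \alpha_k \, z_k,
  & \alpha_k &\ge 0
\end{align}
```

Under this reading, setting $z = \mathbf{1}$ recovers the original weights, and each expert adds only $r$ parameters per matrix, which is consistent with the note's claim that the approach uses fewer parameters than LoRA.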

The most interesting tension is that the model's internal computation and its emitted tokens can diverge. Logit-lens work on transformers trained with hidden chain-of-thought tokens finds them computing the correct answer in their early layers, then actively *suppressing* that representation in the final layers to emit format-compliant filler (see "Do transformers hide reasoning before producing filler tokens?"). Read alongside the finding that only about 20% of tokens, the high-entropy 'forking' tokens, carry the real decision-making load (see "Do high-entropy tokens drive reasoning model improvements?"), a subtler picture emerges: next-token prediction is a noisy readout of richer internal structure, not a transparent window into it.
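To make 'high-entropy forking tokens' concrete, here is a minimal sketch of how such positions could be flagged from next-token distributions. The distributions are invented toy values, and `forking_positions` and its `top_fraction` parameter are hypothetical names; the 0.2 default mirrors the roughly 20% figure in the source note rather than any published threshold.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_positions(distributions, top_fraction=0.2):
    """Return the indices of the highest-entropy positions and all entropies.

    top_fraction is an illustrative cutoff: the share of positions to flag.
    """
    entropies = [entropy(d) for d in distributions]
    k = max(1, round(top_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k]), entropies


# Toy next-token distributions over a 4-token vocabulary (assumption).
distributions = [
    [0.97, 0.01, 0.01, 0.01],      # near-deterministic continuation
    [0.25, 0.25, 0.25, 0.25],      # genuine fork: every option equally likely
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.10, 0.03, 0.02],
]

forks, entropies = forking_positions(distributions)
print("flagged positions:", forks)
print("entropies:", [round(e, 3) for e in entropies])
```

With a real model, each distribution would be the softmax over the vocabulary at one generation step; most steps are near-deterministic, and only a few have the probability mass spread widely enough to count as decision points.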

The thing worth walking away with: 'does next-token prediction yield compositional competence?' is the wrong question, because the answer is both yes and no at different levels. The objective demonstrably builds modular, recombinable internal structure — yet what it surfaces as output is governed by sequence probability, so it fails exactly where being correct means being improbable. The competence and the incompetence have the same cause.

Sources (9 notes)

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Can language models learn meaning from text patterns alone?

Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Where do memorization errors arise in chain-of-thought reasoning?

The STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization, which is driven by the immediately preceding tokens, accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about what next-token prediction teaches LLMs about compositional linguistic competence. The question remains open: does the autoregressive objective genuinely build structural understanding, or only surface statistical associations?

What a curated library found — and when (dated claims, not current truth): Findings span 2022–2025.

• Next-token prediction optimizes for likelihood, not grammatical correctness; performance degrades systematically with structural complexity (recursion, embedding), suggesting learned heuristics rather than internalized rules (~2023–2024).
• Internally, transformers build modular, ablatable subnetworks for compositional tasks, and these modules become MORE reliable with pretraining; singular-value tuning enables dynamic expert recombination at inference (~2025).
• Models compute correct answers in early layers then actively suppress those representations to emit high-probability tokens; only a small minority of high-entropy tokens carry decision-critical information (~2025).
• Training priors and local token associations override context and drive reasoning errors; low-probability sequences (reversals, counting) fail even when trivial (~2023–2024).
• Logit-lens reveals a gap: internal computation and emitted tokens diverge; the model is a noisy readout of richer structure, not a transparent window (~2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2301.10884 (2023-01) – Structural compositionality in neural networks
• arXiv:2412.04537 (2024-12) – Hidden computations in chain-of-thought
• arXiv:2501.06252 (2025-01) – Self-adaptive LLMs via expert vectors
• arXiv:2506.01939 (2025-06) – High-entropy minority tokens drive RL efficiency

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, ask: have recent scaling (larger models, longer context windows), training methods (constitutional AI, RLHF refinements), inference tools (speculative decoding, adaptive compute), or evaluation benchmarks since late 2024 relaxed these limits? Separate the durable tension (does the objective reward likelihood over correctness?) from the perishable limitation (do *current* models fail at recursion?). Cite what changed it.
(2) Surface the strongest work from the last 6 months that contradicts or supersedes the library's findings — especially any showing modular structure doesn't emerge or that early-layer computations don't predict final output.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "If high-entropy tokens now carry 80% of reasoning load, can we train smaller models to allocate entropy more uniformly?" or "Do newer architectures (e.g., non-causal, recurrent) still exhibit the early-layer suppression pattern?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
