Can standard next-token prediction capture complex multi-step human reasoning directly?

This explores whether plain next-token prediction — the basic 'guess the next word' objective — can on its own produce genuine multi-step reasoning, or whether reasoning has to be added through some other mechanism.

The corpus gives a split verdict: the raw objective seems insufficient, but it is a richer foundation than it first appears, and most of the action is in how you shape what gets predicted. A striking finding is that models can carry reasoning that never reaches their output. One logit-lens analysis of transformers trained with hidden chain-of-thought tokens shows them computing correct answers in layers 1-3 and then actively overwriting those answers in the final layers with format-compliant filler tokens, so the reasoning is present but suppressed by the surface objective (Do transformers hide reasoning before producing filler tokens?). That hints that the prediction objective is less the bottleneck than what the model is rewarded to surface.

Several threads argue you can coax reasoning out of pure prediction without changing the architecture at all. Quiet-STaR trains a model to generate a private rationale at every token position on ordinary internet text and judges each rationale by whether it improves the next-token prediction, so reasoning emerges as a side effect of better language modeling (Can models learn reasoning from predicting any text?). Reinforcement Pre-Training reframes next-token prediction as a reasoning task by treating the corpus itself as a verifiable reward signal (Can next-token prediction become a reasoning task with RL?), and RLP plants chain-of-thought during pretraining using the model's own log-likelihood gain as a verifier-free reward (Can chain-of-thought reasoning be learned during pretraining itself?). Even just seeding training data with 'lookahead' tokens that carry future information lets a vanilla model learn planning without any architectural change (Can embedding future information in training data improve planning?). The common move: keep next-token prediction, but enrich the target so the gradient flows toward reasoning.
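The shared idea behind the Quiet-STaR and RLP rewards described above can be sketched in a few lines: a rationale is scored by how much it raises the log-probability of the true next token. This is only a minimal illustration under stated assumptions. The function name and the probabilities are hypothetical, and the actual methods add further machinery (baselines, sampling, policy-gradient updates) that is not shown here.

```python
import math

def log_likelihood_gain(p_with_thought: float, p_without_thought: float) -> float:
    """Reward = log p(next | context, thought) - log p(next | context).

    Positive when the rationale made the true next token more likely.
    """
    if not (0.0 < p_with_thought <= 1.0 and 0.0 < p_without_thought <= 1.0):
        raise ValueError("probabilities must lie in (0, 1]")
    return math.log(p_with_thought) - math.log(p_without_thought)

# Hypothetical probabilities assigned to the true next token.
helpful = log_likelihood_gain(p_with_thought=0.60, p_without_thought=0.20)
harmful = log_likelihood_gain(p_with_thought=0.05, p_without_thought=0.20)

print(f"helpful rationale reward: {helpful:+.3f}")
print(f"harmful rationale reward: {harmful:+.3f}")
```

No labeled answer is needed: the only supervision is the next token already present in the text, which is why this kind of reward is described as verifier-free.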

The skeptical camp says the surface form fools us. On this account, chain-of-thought is largely constrained imitation of reasoning patterns seen in training rather than genuine inference: performance degrades predictably once you shift task, length, or format away from the training distribution (Does chain-of-thought reasoning reveal genuine inference or pattern matching?; Does chain-of-thought reasoning actually generalize beyond training data?). Probing further, when you strip familiar meaning out of a problem and leave only the logical structure, model performance collapses even with the correct rules in context, which suggests they reason through semantic association rather than symbolic manipulation (Do large language models reason symbolically or semantically?). And reasoning quietly falls apart with longer inputs well before the context window fills, dropping from 92% to 68% accuracy at just 3000 tokens of padding (Does reasoning ability actually degrade with longer inputs?). So 'direct' multi-step reasoning from prediction is partly real, partly mimicry that breaks under stress.

What sharpens the picture is that the reasoning signal lives in a small minority of tokens. Only about 20% of tokens are high-entropy 'forking points' where the model genuinely decides, and training on just those matches or exceeds full-gradient performance (Do high-entropy tokens drive reasoning model improvements?). Likelihood-preserving pruning of reasoning chains points the same way: symbolic-computation tokens are preferentially preserved, while grammar and meta-discourse are pruned first (Which tokens in reasoning chains actually matter most?). This reframes the whole question: next-token prediction treats every token equally, but reasoning concentrates in a handful of pivotal choices, which is why merely predicting text well doesn't automatically yield reliable reasoning.
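To make 'high-entropy forking points' concrete, here is a minimal sketch that computes the entropy of each position's next-token distribution and keeps only the top fraction. The distributions, the function names, and the use of a simple top-k cut are assumptions for illustration, not the cited work's exact procedure.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def forking_mask(dists, keep_fraction=0.2):
    """Mark the highest-entropy positions; only these would receive gradient."""
    ents = [entropy(d) for d in dists]
    k = max(1, round(keep_fraction * len(ents)))
    top = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)[:k]
    keep = set(top)
    return [i in keep for i in range(len(ents))]

# Hypothetical distributions over a 4-token vocabulary at 5 positions.
dists = [
    [0.97, 0.01, 0.01, 0.01],    # confident: routine continuation
    [0.25, 0.25, 0.25, 0.25],    # maximally uncertain: a forking point
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]
mask = forking_mask(dists)
print(mask)
```

In a training loop, a mask like this would multiply the per-token loss so that low-entropy positions contribute nothing to the update.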

The takeaway you might not have expected: the debate isn't really 'prediction vs. reasoning' but how much structure you inject into the prediction target. Models trained by prediction can hold latent reasoning that they then hide, and training regime, not raw compute, decides whether that latent capacity becomes usable. Non-reasoning models can't close the gap no matter how much inference budget you throw at them, because the reasoning protocol has to be instilled during training (Can non-reasoning models catch up with more compute?; Can chain-of-thought reasoning be learned during pretraining itself?). So the honest answer is: not cleanly on its own, but the gap is closed by reshaping what the model predicts, not by abandoning prediction.

Sources 12 notes

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
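As a rough illustration of what a logit-lens reading looks like, the toy sketch below projects a hidden state from each layer through an unembedding matrix and tracks the rank of the answer token. Every number here is made up, the three-token vocabulary is hypothetical, and a real logit lens would also apply the model's final normalization before unembedding.

```python
def logits(hidden, unembed):
    """Project one hidden state through the unembedding matrix (vocab x dim)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembed]

def rank_of(token_id, scores):
    """Rank of a token among the scores; 0 means it is the top prediction."""
    return sum(1 for s in scores if s > scores[token_id])

# Toy vocabulary: id 0 = answer "7", id 1 = filler ".", id 2 = "the".
unembed = [
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, -1.0],
]

# Made-up hidden states at the same position, one per layer.
hidden_by_layer = [
    [2.0, 0.1],  # early layer: aligned with the answer token
    [1.5, 1.0],  # middle layer: answer still on top
    [0.8, 2.0],  # final layer: filler token now dominates
]

answer_id = 0
ranks = [rank_of(answer_id, logits(h, unembed)) for h in hidden_by_layer]
print(ranks)
```

The pattern of interest is the answer token holding rank 0 early and then slipping to a lower rank at the output layer while remaining recoverable.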

Can models learn reasoning from predicting any text?

Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.

Can next-token prediction become a reasoning task with RL?

Reinforcement Pre-Training transforms next-token prediction into a reasoning task by providing verifiable rewards from the corpus itself, eliminating reward hacking and enabling inference-time scaling during pretraining. This suggests token-level reasoning patterns during pretraining strengthen downstream RL fine-tuning.
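A minimal sketch of a corpus-derived verifiable reward: the model reasons, commits to a next-token guess, and is rewarded only if the guess matches what the corpus actually contains. Exact string matching and the helper names are assumptions made for illustration; the method's real matching rule may differ.

```python
def next_token_reward(predicted: str, corpus_next: str) -> float:
    """1.0 if the guess equals the token that really follows, else 0.0."""
    return 1.0 if predicted == corpus_next else 0.0

def mean_reward(rollouts, corpus_next):
    """Average reward over several sampled reasoning rollouts for one position."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    return sum(next_token_reward(r, corpus_next) for r in rollouts) / len(rollouts)

# Hypothetical final guesses from four reasoning rollouts for the context "2 + 2 =".
guesses = ["4", "4", "5", "4"]
score = mean_reward(guesses, corpus_next="4")
print(score)
```

Because the target is read directly from the text, there is no learned reward model involved in scoring the guess.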

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Can embedding future information in training data improve planning?

TRELAWNEY augments training data with special tokens encapsulating future information, allowing models to learn goal-conditioned generation using standard infrastructure. Results show improved planning, algorithmic reasoning, and story generation without modifying architecture or training procedures.
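The data-augmentation idea can be sketched as a plain list operation: copy a token from later in the sequence and insert it earlier, wrapped in special delimiters. The delimiter strings `<T>` and `</T>`, the function name, and the choice of a single future token are hypothetical placeholders, not the method's documented format.

```python
def add_lookahead(tokens, insert_at, future_index, open_tag="<T>", close_tag="</T>"):
    """Insert a delimited copy of a future token at an earlier position."""
    if not (0 <= insert_at <= future_index < len(tokens)):
        raise ValueError("need 0 <= insert_at <= future_index < len(tokens)")
    hint = [open_tag, tokens[future_index], close_tag]
    return tokens[:insert_at] + hint + tokens[insert_at:]

story = ["Ana", "left", "home", "and", "reached", "the", "summit"]
augmented = add_lookahead(story, insert_at=1, future_index=6)
print(augmented)
```

The augmented sequence is still trained with ordinary next-token prediction, so the model sees the goal before it has to generate the steps that lead there.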

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
