How does predictive accuracy on future tokens differ from correctness on labeled answers?
This note explores the gap between two things we might call 'right': predictive accuracy on future tokens, meaning how well a model guesses the next token from raw text (self-supervised likelihood), and correctness on labeled answers, where an external signal says 'this is the right solution.' The two look similar, since both reward 'getting it right', but the corpus shows they pull on different parts of the model and reveals when they come apart. The most interesting work lives in the gap between them.
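To make the two notions of 'right' concrete, here is a minimal sketch that scores the same completions both ways. The probabilities, answers, and function names are invented for illustration and do not come from any of the cited notes; the point is only that the two scores are computed from different things and can rank completions in opposite orders.

```python
import math

def mean_token_log_likelihood(token_probs):
    """Average log-probability the model assigned to the tokens that occurred."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

def exact_match(predicted_answer, labeled_answer):
    """1.0 if the final answer equals the label after trimming whitespace, else 0.0."""
    return 1.0 if predicted_answer.strip() == labeled_answer.strip() else 0.0

# Hypothetical completions of the same problem (all numbers are made up).
fluent_but_wrong = {"token_probs": [0.9, 0.8, 0.9, 0.85], "answer": "41"}
halting_but_right = {"token_probs": [0.4, 0.3, 0.5, 0.35], "answer": "42"}
label = "42"

for name, completion in [("fluent_but_wrong", fluent_but_wrong),
                         ("halting_but_right", halting_but_right)]:
    ll = mean_token_log_likelihood(completion["token_probs"])
    em = exact_match(completion["answer"], label)
    print(f"{name}: mean log-likelihood={ll:.3f}, exact match={em}")
```

In this toy case the completion with the higher likelihood is the one with the wrong answer, which is the kind of divergence the rest of the note examines.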
The cleanest bridge between the two is Reinforcement Pre-Training (Can next-token prediction become a reasoning task with RL?), which turns next-token prediction itself into a verifiable task: the corpus *is* the label, so predicting the next token becomes a reasoning problem with a built-in correctness check. That reframing matters because it exposes what ordinary pretraining hides: not all tokens carry the same weight. Only about 20% of tokens are high-entropy 'forking points' where reasoning actually branches, and training on just those matches or exceeds full-gradient updates (Do high-entropy tokens drive reasoning model improvements?). Specific tokens like 'Wait' and 'Therefore' spike in mutual information with the correct answer, and suppressing them harms accuracy while suppressing an equal number of random tokens does not (Do reflection tokens carry more information about correct answers?). So token-level prediction and answer-level correctness aren't uniformly linked: a minority of tokens carries most of the signal that connects the two.
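The forking-token idea can be sketched as an entropy-based mask over token positions. This is a rough illustration under stated assumptions, not the procedure from the cited note: the function names and the toy distributions are hypothetical, and a plain negative log-likelihood stands in for whatever objective the actual training uses (the note describes RL updates, which this sketch does not implement).

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_mask(distributions, keep_fraction=0.2):
    """Mark the top `keep_fraction` of positions by entropy as forking tokens."""
    entropies = [entropy(d) for d in distributions]
    n_keep = max(1, round(keep_fraction * len(entropies)))
    # Rank positions from highest to lowest entropy; ties go to the earlier position.
    ranked = sorted(range(len(entropies)), key=lambda i: (-entropies[i], i))
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(entropies))]

def masked_loss(distributions, target_ids, mask):
    """Mean negative log-likelihood over the masked-in positions only."""
    losses = [-math.log(d[t])
              for d, t, m in zip(distributions, target_ids, mask) if m]
    return sum(losses) / len(losses)

# Toy sequence: five positions over a three-token vocabulary (made-up numbers).
dists = [
    [0.98, 0.01, 0.01],
    [0.34, 0.33, 0.33],   # near-uniform: the model is genuinely undecided here
    [0.90, 0.05, 0.05],
    [0.97, 0.02, 0.01],
    [0.99, 0.005, 0.005],
]
targets = [0, 1, 0, 0, 0]

mask = forking_mask(dists, keep_fraction=0.2)
print(mask)
print(masked_loss(dists, targets, mask))
```

With a keep fraction of 0.2 and five positions, only the single most uncertain position contributes to the loss; the confident positions are ignored.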
The more surprising finding is how far the two can decouple. Models trained on *deliberately corrupted* reasoning traces, ones systematically irrelevant to the problem, maintain final-answer accuracy and sometimes generalize better out of distribution (Do reasoning traces need to be semantically correct?), meaning the tokens a model emits don't have to be semantically true for the labeled answer to come out right. They function as computational scaffolding, not meaning. The flip side: logit-lens analysis shows that transformers trained with hidden chain-of-thought tokens can compute the correct answer in their first few layers and then actively *suppress* it in the final layers in favor of format-compliant filler (Do transformers hide reasoning before producing filler tokens?). Internal correctness and the tokens you actually predict can diverge inside a single forward pass.
This is why optimizing for one doesn't cleanly buy you the other. Longer chains of thought (more predicted tokens) help accuracy only up to a point, then hurt: an inverted-U whose peak moves toward longer chains on harder tasks and toward shorter chains for more capable models, with RL naturally driving toward shorter chains as models improve (Why does chain of thought accuracy eventually decline with length?). And a model's confidence in its own predicted tokens is a poor proxy for correctness: models systematically over-trust answers they generated themselves, because high-probability tokens simply *feel* right during self-evaluation (Why do models trust their own generated answers?). The fix isn't better prediction; it's calibration. Small models trained with uncertainty-aware objectives that let them *abstain* when unsure can match pre-trained models 10x their size on conversation forecasting (Can models learn to abstain when uncertain about predictions?).
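The link between confidence, correctness, and abstention can be illustrated with a small selective-prediction sketch. Everything here is an assumption for illustration: the records are made up, the fixed confidence threshold is one simple way to abstain, and nothing below reproduces the uncertainty-aware training objective from the cited note.

```python
def selective_accuracy(records, threshold):
    """Answer only when confidence >= threshold.

    records: list of (confidence, is_correct) pairs.
    Returns (coverage, accuracy_on_answered); accuracy is None if nothing is answered.
    """
    answered = [ok for conf, ok in records if conf >= threshold]
    coverage = len(answered) / len(records)
    accuracy = sum(answered) / len(answered) if answered else None
    return coverage, accuracy

def calibration_gap(records):
    """Mean confidence minus accuracy; a positive value means overconfidence."""
    mean_conf = sum(conf for conf, _ in records) / len(records)
    accuracy = sum(ok for _, ok in records) / len(records)
    return mean_conf - accuracy

# Hypothetical (confidence, is_correct) pairs for six model answers.
records = [(0.95, True), (0.90, False), (0.85, True),
           (0.60, False), (0.55, False), (0.99, True)]

print(calibration_gap(records))
for threshold in (0.0, 0.8):
    print(threshold, selective_accuracy(records, threshold))
```

Raising the threshold trades coverage for accuracy on the questions the model does answer. That trade only pays off when confidence tracks correctness, which is why calibration rather than raw predictive fit is the lever here.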
The thing worth taking away: predictive accuracy is about plausibility (does this token fit the distribution?), while labeled correctness is about truth (does this answer match reality?). The corpus suggests the real engineering lever is learning *where* in the token stream those two coincide, such as the forking tokens and the lookahead signals you can embed in training data (Can embedding future information in training data improve planning?), rather than assuming a fluent prediction and a correct answer are the same achievement.
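As a rough picture of what 'embedding lookahead signals in training data' could look like, the sketch below copies a span from later in a sequence and inserts it earlier, wrapped in delimiter tokens. The function name, the delimiter strings `<T>` and `</T>`, and the choice of positions are hypothetical placeholders; the cited note only says that special tokens encapsulate future information, so this should not be read as the actual augmentation scheme.

```python
def add_lookahead(tokens, insert_at, future_start, future_len,
                  open_tok="<T>", close_tok="</T>"):
    """Insert a delimited copy of a later span at an earlier position.

    The delimiter strings are placeholders chosen for this sketch.
    """
    if not 0 <= insert_at <= future_start:
        raise ValueError("the copied span must lie at or after the insertion point")
    future = tokens[future_start:future_start + future_len]
    return tokens[:insert_at] + [open_tok] + future + [close_tok] + tokens[insert_at:]

tokens = "the knight rode north and finally reached the castle".split()
augmented = add_lookahead(tokens, insert_at=2, future_start=7, future_len=2)
print(" ".join(augmented))
```

Because the change is made to the data rather than the model, a sequence augmented this way can go through an ordinary next-token training pipeline unchanged.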
Sources (9 notes)
Reinforcement Pre-Training transforms next-token prediction into a reasoning task by providing verifiable rewards from the corpus itself, eliminating reward hacking and enabling inference-time scaling during pretraining. This suggests token-level reasoning patterns during pretraining strengthen downstream RL fine-tuning.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.
TRELAWNEY augments training data with special tokens encapsulating future information, allowing models to learn goal-conditioned generation using standard infrastructure. Results show improved planning, algorithmic reasoning, and story generation without modifying architecture or training procedures.