Why does masking the penultimate token outperform random token masking?
This reads the question as being about masked-prediction training: when you train a model by hiding tokens and asking it to recover them, why would deliberately masking the second-to-last token beat masking a random one? The collection doesn't contain a paper on that exact recipe, so treat what follows as the corpus reasoning *around* the principle rather than a direct citation of the result. What it does have is a strong, repeated finding about *why position and informativeness matter*. The throughline across several notes is the same: tokens are not equal carriers of learning signal, and spending your training budget on the high-signal ones beats spreading it uniformly.
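To make the two recipes being compared concrete, here is a minimal sketch of a positional mask versus a uniform random one. The `[MASK]` placeholder string and the function names `mask_penultimate` and `mask_random` are hypothetical, chosen for illustration; nothing here comes from a specific paper or library.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder string, not tied to any tokenizer


def mask_penultimate(tokens):
    """Mask the second-to-last token; return (masked_tokens, index, target)."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to have a penultimate one")
    idx = len(tokens) - 2
    masked = list(tokens)
    target = masked[idx]
    masked[idx] = MASK
    return masked, idx, target


def mask_random(tokens, rng):
    """Mask one uniformly chosen token; return (masked_tokens, index, target)."""
    if not tokens:
        raise ValueError("need at least one token")
    idx = rng.randrange(len(tokens))
    masked = list(tokens)
    target = masked[idx]
    masked[idx] = MASK
    return masked, idx, target


tokens = ["the", "cat", "sat", "down", "."]
pen_masked, pen_idx, pen_target = mask_penultimate(tokens)
rnd_masked, rnd_idx, rnd_target = mask_random(tokens, random.Random(0))
```

The positional rule always hides the same slot, so every training example asks for a prediction conditioned on nearly the full sequence; the random rule spreads its targets over all positions, including the easy ones.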
The sharpest version of this is the finding that only about 20% of tokens are high-entropy 'forking points' where the model actually makes a reasoning decision, and that RLVR training exclusively on those tokens matches or beats updating on all of them (see "Do high-entropy tokens drive reasoning model improvements?"). Random masking, by contrast, lands mostly on the predictable 80%: grammar, connective tissue, tokens the model could already guess from context. A mask that reliably hits a decision-bearing position would be doing the same thing the 20%-token result does: concentrating the gradient where the information is. A plausible reason the penultimate token qualifies is that it sits where almost the entire sentence is available as context but the prediction is not yet trivial, though the corpus does not test this directly.
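The selection step behind the 20% figure can be sketched as follows: compute the entropy of each position's predictive distribution and keep only the top fraction. This is a toy illustration under stated assumptions, not the method from the cited note; the function names, the `keep_fraction` parameter, and the example distributions are all made up for the sketch.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_mask(dists, keep_fraction=0.2):
    """Mark the top `keep_fraction` of positions by entropy (at least one)."""
    entropies = [token_entropy(d) for d in dists]
    n_keep = max(1, round(keep_fraction * len(entropies)))
    # Rank positions from highest to lowest entropy; ties broken by position.
    ranked = sorted(range(len(entropies)), key=lambda i: (-entropies[i], i))
    selected = set(ranked[:n_keep])
    return [i in selected for i in range(len(entropies))]


# Hypothetical distributions over a 4-token vocabulary, one per position.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # near-certain: predictable filler
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain: a forking point
    [0.90, 0.05, 0.03, 0.02],
    [0.85, 0.05, 0.05, 0.05],
    [0.99, 0.01, 0.00, 0.00],
]
mask = high_entropy_mask(dists, keep_fraction=0.2)
```

In a training loop, a mask like this would gate which positions contribute to the loss; a uniform random mask of the same size would mostly select the low-entropy rows.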
A second note makes the 'not all tokens matter equally' point from the opposite direction: when reasoning chains are pruned by functional importance, models preferentially keep symbolic-computation tokens and discard grammar and meta-discourse first (see "Which tokens in reasoning chains actually matter most?"). So the model's own internal ranking already says some positions are load-bearing and most are filler. Random masking is blind to that ranking; a structured masking rule can align with it.
There's also a deeper, more theoretical reason targeting matters, which the latent-prediction work gives in formal terms: same-level representations are far more *correlated* with each other than raw tokens are, which is why predicting structured targets is exponentially more sample-efficient than predicting arbitrary tokens (see "Why is predicting latents more sample-efficient than tokens?"). The flip side for masking is that a random target is often weakly coupled to its context (low signal, easy or noisy to predict), whereas a well-chosen position carries a prediction that's both hard and highly determined by what surrounds it, which is exactly the kind of target that teaches the model something. The byte-level work shows the same instinct operationalized: spend more compute on high-entropy regions and less on predictable ones (see "Can byte-level models match tokenized performance with better efficiency?").
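The byte-level idea of spending more compute where entropy is high can be illustrated with a simple threshold rule: start a new patch whenever the next-byte entropy exceeds a cutoff, so uncertain regions are cut into shorter patches. This is a simplified sketch, not the Byte Latent Transformer's actual implementation; the function name, the threshold value, and the entropy numbers are assumptions for illustration.

```python
def segment_by_entropy(entropies, threshold):
    """Group positions into patches, opening a new patch at each
    position whose entropy exceeds `threshold`."""
    patches = []
    current = []
    for i, h in enumerate(entropies):
        # A high-entropy position closes the running patch and starts a new one.
        if h > threshold and current:
            patches.append(current)
            current = []
        current.append(i)
    if current:
        patches.append(current)
    return patches


# Hypothetical per-byte entropies from a small entropy model.
entropies = [0.1, 0.2, 2.5, 0.3, 0.1, 3.0, 2.8, 0.2]
patches = segment_by_entropy(entropies, threshold=2.0)
```

If each patch costs roughly one step of the larger model, the run of high-entropy positions at indices 5 and 6 receives two steps for two bytes, while the predictable stretch at indices 2 to 4 shares a single step across three bytes.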
The thing worth taking away: 'where you mask' is really a question about *where the learning signal lives*, and across entropy analysis, functional-token pruning, latent prediction, and dynamic byte compute, the corpus keeps arriving at the same answer — uniform treatment of tokens wastes most of your effort on positions that teach nothing. If you want the exact penultimate-vs-random comparison, this collection won't have it, but it tells you why such a result would be expected rather than surprising.
Sources (4 notes)
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.
The Byte Latent Transformer (BLT) dynamically segments bytes into patches based on next-byte entropy, allocating more compute to high-entropy regions and less to predictable ones. At 8B parameters, BLT matches tokenized baselines while reducing inference cost and improving robustness to typos and cross-lingual transfer.