Does token-level loss aggregation help aligned models differently?
This note asks whether token-level loss aggregation (how the learning signal is spread across the tokens in a sequence, as opposed to treating every token equally) helps, and whether already-aligned models respond to it differently than raw base models. The corpus doesn't frame this as one tidy experiment, but several notes converge on a shared premise: tokens are not equal, so aggregating loss uniformly across all of them likely wastes much of the signal.
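To make "aggregation" concrete, here is a minimal sketch of two common ways to reduce per-token losses to one number. The function names and the loss values are hypothetical and chosen only for illustration; none of the notes prescribes either scheme.

```python
def token_mean_loss(batch):
    """Average over every token in the batch, so long sequences weigh more."""
    total = sum(sum(seq) for seq in batch)
    count = sum(len(seq) for seq in batch)
    return total / count


def sequence_mean_loss(batch):
    """Average within each sequence first, so every sequence weighs the same."""
    per_seq = [sum(seq) / len(seq) for seq in batch]
    return sum(per_seq) / len(per_seq)


# Hypothetical per-token losses: one short sequence, one long one.
batch = [
    [2.0, 2.0],                      # short sequence, high loss
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # long sequence, low loss
]

token_level = token_mean_loss(batch)        # (4.0 + 3.0) / 8 tokens
sequence_level = sequence_mean_loss(batch)  # (2.0 + 0.5) / 2 sequences
```

The two reductions disagree on the same data because they implicitly assign different weights to each token, which is the design choice the rest of this note is about.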
The sharpest evidence is that a small minority of tokens carries the learning. In reinforcement learning with verifiable rewards (RLVR), only about 20% of tokens are high-entropy 'forking points,' and training on just those matches or beats updating on everything (see: Do high-entropy tokens drive reasoning model improvements?). Reasoning chains show the same shape from a different angle: likelihood-preserving pruning ranks tokens by functional importance, keeping symbolic-computation tokens while grammar and meta-discourse get pruned first, and students trained on those pruned chains outperform students trained on frontier-model compression (see: Which tokens in reasoning chains actually matter most?). Even on the input side, byte-level models that allocate more compute to high-entropy stretches match tokenized baselines at lower inference cost (see: Can byte-level models match tokenized performance with better efficiency?). The recurring lesson is that where you spend gradient (or compute) should follow where the uncertainty and the real decisions live, not be smeared evenly.
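The idea of updating only on high-entropy tokens can be sketched as a loss mask. This is an illustration under stated assumptions, not the method from the cited note: that work concerns RLVR policy-gradient updates, while this sketch masks a generic per-token loss. The helper names, distributions, and loss values are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_entropy_mask(entropies, fraction=0.2):
    """Return a 0/1 mask keeping the `fraction` highest-entropy positions."""
    keep = max(1, round(len(entropies) * fraction))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    kept = set(ranked[:keep])
    return [1 if i in kept else 0 for i in range(len(entropies))]


def masked_mean_loss(losses, mask):
    """Average the loss over masked-in tokens only."""
    kept = [loss for loss, m in zip(losses, mask) if m]
    return sum(kept) / len(kept)


# Hypothetical next-token distributions for a 5-token sequence.
dists = [
    [0.97, 0.01, 0.01, 0.01],     # near-certain: filler
    [0.25, 0.25, 0.25, 0.25],     # maximally uncertain: a forking point
    [0.90, 0.05, 0.03, 0.02],
    [0.85, 0.05, 0.05, 0.05],
    [0.99, 0.005, 0.003, 0.002],
]
losses = [0.03, 1.40, 0.10, 0.16, 0.01]  # hypothetical per-token losses

entropies = [entropy(d) for d in dists]
mask = top_entropy_mask(entropies, fraction=0.2)  # keeps 1 of 5 positions
loss = masked_mean_loss(losses, mask)
```

Here only the uniform-distribution position survives the mask, so the aggregated loss reflects that single forking token and ignores the four predictable ones.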
The 'aligned models differently' part is where it gets interesting. Alignment doesn't leave the token distribution untouched; it appears to flatten it. The 'Artificial Hivemind' finding shows 70+ models producing strikingly similar outputs, attributed to overlapping training data and shared alignment procedures, which collapses diversity (see: Do different AI models actually produce diverse outputs?). If alignment has already homogenized the easy, predictable tokens, then for an aligned model the meaningful signal plausibly concentrates even harder in the rare divergent tokens, which is exactly what a token-weighted loss would target and a uniform loss would drown out. That last step is an inference from the notes, not a result any of them reports directly.
There's also a warning about how you apply the pressure. Direct fine-tuning corrupts knowledge stored in lower layers, whereas proxy-tuning, which shifts the output distribution at decoding time without touching base weights, closes 88-91% of the alignment gap while preserving knowledge (see: Can decoding-time tuning preserve knowledge better than weight fine-tuning?). And DPO's advantage for small models comes from supplying explicit negative examples that target the specific output-format failures where plain SFT alone underperforms (see: Can small models match large models on function calling?). Both point the same way: precision about where the update lands beats brute uniform pressure.
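A decoding-time shift of this kind can be sketched as logit arithmetic. The notes here do not spell out a formula; the version below follows the published proxy-tuning recipe as we understand it (large base logits plus the difference between a small tuned and a small untuned model), and every name and number is a hypothetical stand-in.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(base_logits, expert_logits, antiexpert_logits):
    """Shift the base model's logits by the (tuned - untuned) small-model offset."""
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)
    ]
    return softmax(shifted)


# Hypothetical logits over a 3-token vocabulary at one decoding step.
base_logits = [2.0, 1.0, 0.0]        # large untuned model
expert_logits = [0.0, 2.0, 0.0]      # small tuned model
antiexpert_logits = [0.0, 0.0, 0.0]  # the same small model, untuned

probs = proxy_tuned_probs(base_logits, expert_logits, antiexpert_logits)
best = max(range(len(probs)), key=lambda i: probs[i])
```

The base model's weights are never modified: the preference learned by the small tuned model moves the top choice from token 0 to token 1 purely at the output distribution.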
The unexpected payoff is a caution. Models can contain every linearly decodable feature a task needs while their internal organization is quietly fractured and brittle (see: Can models be smart without organized internal structure?), and transformers trained with hidden chain-of-thought tokens will compute the correct answer in layers 1-3, then suppress it in the final layers in favor of format-compliant filler (see: Do transformers hide reasoning before producing filler tokens?). So a loss that rewards the visible filler tokens equally with the load-bearing ones isn't just inefficient; it can train the model to look aligned while burying the reasoning. Token-level aggregation is more than a tuning knob: it's a question of whether you're spending your signal on the tokens that actually decide the answer.
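The logit-lens observation can be illustrated with a toy projection: read each layer's hidden state through the unembedding matrix and see which token it favors. This is a simplified sketch that omits the final layer norm a real model would apply, and the 2-dimensional states and 2-token vocabulary are hypothetical values constructed to show the pattern, not data from the cited note.

```python
def logit_lens(hidden_states, unembed):
    """Project each layer's hidden state through the unembedding matrix
    and return the top-scoring token id per layer."""
    top_tokens = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        top_tokens.append(max(range(len(logits)), key=lambda i: logits[i]))
    return top_tokens


# Hypothetical hidden states at one position, one vector per layer.
hidden_states = [
    [1.0, 0.0],  # layer 1: points along the "answer" direction
    [0.8, 0.3],  # layer 2: still mostly "answer"
    [0.1, 1.0],  # final layer: points along the "filler" direction
]
# Hypothetical unembedding matrix: one row per vocabulary token.
unembed = [
    [1.0, 0.0],  # token 0: the answer token
    [0.0, 1.0],  # token 1: the filler token
]

per_layer_top = logit_lens(hidden_states, unembed)
```

In this toy setup the answer token leads in the early layers and the filler token wins only at the end, so a loss computed on the final output alone never sees the intermediate answer.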
Sources (8 notes)
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
The Byte Latent Transformer (BLT) dynamically segments bytes into patches based on next-byte entropy, allocating more compute to high-entropy regions and less to predictable ones. At 8B parameters, BLT matches tokenized baselines while reducing inference cost and improving robustness to typos and cross-lingual transfer.
INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.