Can context compression preserve what matters without introducing bias?

This note explores whether shrinking a model's working context (pruning tokens, evicting history, compressing long inputs) can keep the parts that actually drive an answer without quietly distorting what is preserved. The corpus is encouraging on the first half (you *can* keep what matters) and pointed on the second half (the selection itself is where bias sneaks in).

The strongest evidence that compression can be selective rather than lossy comes from work showing that models already rank their own tokens by function. Greedy likelihood-preserving pruning reveals that symbolic computation tokens get preferentially kept while grammar and meta-discourse fall away first, and students trained on these self-pruned chains actually outperform those trained on frontier-model compression (see 'Which tokens in reasoning chains actually matter most?'). A similar instinct drives the Titans architecture, which splits short-term attention from a long-term neural memory that adaptively stores *surprising* tokens, scaling past two million tokens without the quadratic penalty (see 'Can neural memory modules scale language models beyond attention limits?'). And 'Atom of Thoughts' goes furthest: it contracts a problem so each reasoning state depends only on the current problem, throwing away accumulated history entirely while maintaining answer equivalence (see 'Can reasoning systems forget history without losing coherence?'). So forgetting, done well, need not cost coherence.
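To make the greedy likelihood-preserving idea concrete, here is a minimal sketch. It is not the cited work's implementation: `greedy_prune`, `toy_score`, and `TOY_WEIGHTS` are hypothetical names, and the toy scorer stands in for a real model's answer log-likelihood, which is an assumption made purely for illustration.

```python
from typing import Callable, List, Sequence

def greedy_prune(tokens: Sequence[str],
                 score: Callable[[Sequence[str]], float],
                 keep: int) -> List[str]:
    """Repeatedly drop the token whose removal lowers `score` the least."""
    kept = list(tokens)
    while len(kept) > keep:
        # Try deleting each position; pick the deletion that preserves the score best.
        best_i = max(range(len(kept)),
                     key=lambda i: score(kept[:i] + kept[i + 1:]))
        del kept[best_i]
    return kept

# Hypothetical per-token contribution to the answer's log-likelihood.
# In a real system this would come from scoring the answer with a model.
TOY_WEIGHTS = {"12": 3.0, "*": 2.5, "7": 3.0, "=": 2.0, "84": 4.0}

def toy_score(chain: Sequence[str]) -> float:
    """Toy stand-in: penalize every weighted token that is missing from the chain."""
    present = set(chain)
    return -sum(w for tok, w in TOY_WEIGHTS.items() if tok not in present)

chain = ["so", "we", "compute", "12", "*", "7", "=", "84", "obviously"]
pruned = greedy_prune(chain, toy_score, keep=5)
print(pruned)
```

In this toy setup the meta-discourse words cost nothing to remove, so they go first and the symbolic tokens survive, mirroring the pattern the note describes. The outcome here is built into the toy weights; it illustrates the procedure, not the empirical result.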

But there's a subtler reframing worth knowing: one line of work argues the long-context bottleneck isn't memory capacity at all. It is the *compute* needed to consolidate evicted context into the model's fast weights during an offline 'sleep' phase, with quality improving as more consolidation passes are run (see 'Is long-context bottleneck really about memory or compute?'). That reframes 'preserve what matters' as a budget problem, not a storage one. It pairs uncomfortably with the FLenQA finding that reasoning accuracy drops from 92% to 68% with just 3,000 tokens of *irrelevant* padding, far below any context limit, and that the drop persists even with chain-of-thought prompting (see 'Does reasoning ability actually degrade with longer inputs?'). Compression that removes noise might therefore *help* reasoning, not just shrink cost.

The bias half of your question is where the corpus turns cautionary, and it points somewhere you might not expect: the bias is often already in the model, and compression just amplifies which signal wins. Models routinely ignore in-context information when their training priors are strong, and textual prompting alone can't override those priors; you need causal intervention in the representations themselves (see 'Why do language models ignore information in their context?'). The same skew shows up as era-sensitivity, where models reason worse about historical legal cases because the training corpus over-represents recent ones (see 'Why do language models struggle with historical legal cases?'). And keyword priming after learning is predictable from a token's pre-learning probability, with a sharp threshold (around 10^-3) below which priming simply doesn't take (see 'Can we predict keyword priming before learning happens?'). The unsettling implication: any compression scheme that scores tokens by 'importance' using the model's own judgments inherits the model's existing probability landscape, so it will tend to keep what the model already favors and drop what it already under-weights. That is a systematic selection bias, not random loss.
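The inheritance effect described above can be illustrated with a toy sketch. Everything here is hypothetical: `compress_by_prior` is an illustrative name, and the prior values are invented numbers chosen only to show the mechanism, not measurements from any model.

```python
from typing import Dict, List, Sequence

def compress_by_prior(items: Sequence[str],
                      prior: Dict[str, float],
                      keep: int) -> List[str]:
    """Keep the `keep` items the model's prior rates highest, in original order."""
    ranked = sorted(range(len(items)),
                    key=lambda i: prior[items[i]],
                    reverse=True)
    kept_idx = sorted(ranked[:keep])  # restore document order
    return [items[i] for i in kept_idx]

# Invented priors: the older precedent is under-represented in training,
# so the scorer rates it as unimportant regardless of its relevance.
toy_prior = {
    "1890_precedent": 0.001,
    "modern_case": 0.6,
    "recent_ruling": 0.3,
}

context = ["1890_precedent", "modern_case", "recent_ruling"]
compressed = compress_by_prior(context, toy_prior, keep=2)
print(compressed)
```

The scorer never looks at the question being asked, so the under-weighted item is dropped even if it is the one the answer depends on. A real importance scorer is more sophisticated than this, but to the extent its scores track the model's priors, the same tilt applies.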

So the honest answer is two-sided. Compression can preserve what matters, and sometimes it even improves reasoning by stripping distracting padding. But 'what matters' is judged by the same priors that carry the bias, so the risk isn't lost information so much as a quietly tilted selection. If you want a tool that pushes against this, consistency training is the closest thing here: it teaches a model to respond identically to clean and perturbed prompts using its own clean answers as targets, which suggests a way to make behavior under compression *invariant* rather than preference-amplifying (see 'Can models learn to ignore irrelevant prompt changes?').
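As a rough sketch of the data-construction step behind output-level consistency training, the snippet below pairs each perturbed prompt with the model's own answer to the clean prompt. This is an assumption-laden illustration, not the cited methods' code: `build_consistency_pairs`, `toy_generate`, and `toy_perturb` are hypothetical, and the stand-in functions replace a real model and a real perturbation (which, in the compression setting, could be a compressor).

```python
from typing import Callable, List, Sequence, Tuple

def build_consistency_pairs(clean_prompts: Sequence[str],
                            perturb: Callable[[str], str],
                            generate: Callable[[str], str]
                            ) -> List[Tuple[str, str]]:
    """Return (perturbed_prompt, target) pairs for fine-tuning.

    The target is the model's own response to the *clean* prompt, so the
    model is trained to answer the perturbed prompt the same way.
    """
    pairs = []
    for prompt in clean_prompts:
        target = generate(prompt)            # model's own clean answer
        pairs.append((perturb(prompt), target))
    return pairs

# Hypothetical stand-ins for a model call and a prompt perturbation.
def toy_generate(prompt: str) -> str:
    return f"answer({prompt})"

def toy_perturb(prompt: str) -> str:
    return "[padding] " + prompt

pairs = build_consistency_pairs(["2+2?"], toy_perturb, toy_generate)
print(pairs)
```

Because the targets come from the model itself rather than from a fixed labeled dataset, the pairs can be regenerated whenever the model changes, which is the property the source note credits with avoiding stale targets.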

Sources 9 notes

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Can reasoning systems forget history without losing coherence?

Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Why do language models struggle with historical legal cases?

Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than modern ones. Root cause is training corpus over-representation of recent cases, creating shallower representations of older precedent.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Can models learn to ignore irrelevant prompt changes?

Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.
