INQUIRING LINE

Can simple proxies like length predict optimal sparsity per request?

This explores whether a cheap, observable signal like input length is enough to set the right sparse-attention budget for each request — or whether you need to know more about the task.


Can you read off the optimal sparsity for a given request from a simple proxy like input length, rather than measuring it directly? The corpus says: partly yes. Length is a real signal, but it's not the whole story, and the more interesting finding is that the *right* proxy depends on what the request is doing.

The strongest case for length comes from work showing that optimal sparse-attention budgets scale with sequence length: longer inputs tolerate much higher sparsity without losing quality, so a fixed budget is wasteful and per-request adaptation pays off (see "Does fixed sparsity work for all sequence lengths?"). So length isn't a bad proxy; it captures something real about how much redundancy a request contains. And the payoff is real: sparsity isn't a pure quality-for-speed trade but a Pareto improvement, because at equal compute larger sparse-attention models beat smaller dense ones on long-context tasks, so getting the budget right matters (see "Does sparse attention trade off quality for speed?").

But length alone hides a second axis: task structure. Tolerance varies sharply with the kind of reasoning the request needs. A single-fact lookup can survive 95% sparsity, while multi-hop or aggregation tasks degrade substantially at 50-67% sparsity because they need attention spread across many regions of the context (see "How much sparsity can different reasoning tasks actually tolerate?"). Two requests of identical length can therefore have very different optimal budgets. So length predicts *capacity to tolerate* sparsity, but the task predicts *demand* for dense attention, and you need both.
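The interaction of the two axes can be sketched in a few lines of Python. This is a minimal illustration, not a method from the cited notes: the function names, the `ref_len` parameter, and the "keep roughly a fixed number of tokens" length rule are all assumptions made for the example. Only the 95% and 50-67% figures come from the text.

```python
# Illustrative only: the length rule and the table layout are assumptions.
# The 0.95 and 0.50 figures echo the tolerances quoted in the text.
TASK_MAX_SPARSITY = {
    "single_qa": 0.95,    # single-fact lookup is reported to survive 95% sparsity
    "multi_hop": 0.50,    # degradation is reported at 50-67%; a real cap may need to be lower
    "aggregation": 0.50,
}


def length_sparsity_cap(n_tokens: int, ref_len: int = 4096, ceiling: float = 0.95) -> float:
    """Assumed rule: keep roughly ref_len tokens, so tolerated sparsity rises with length."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    cap = 1.0 - ref_len / n_tokens
    return max(0.0, min(ceiling, cap))


def pick_sparsity(n_tokens: int, task: str) -> float:
    """Take the stricter of the length-based cap and the task-based cap."""
    task_cap = TASK_MAX_SPARSITY.get(task, 0.50)  # unknown task: stay conservative
    return min(length_sparsity_cap(n_tokens), task_cap)
```

Under these assumed numbers, two 131,072-token requests get different budgets: `pick_sparsity(131072, "single_qa")` returns 0.95, while `pick_sparsity(131072, "multi_hop")` returns 0.5. Taking the minimum encodes the idea that length sets how much sparsity a request can absorb and the task sets how much it can afford.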

The lateral move the corpus suggests is to stop thinking about hand-picked proxies and ask what the model already knows about its own request. There's a recurring pattern of cheap pre-generation prediction matching or beating elaborate heuristics. Routers estimate query difficulty before generating to pick the right model and cut cost by 40-50% (see "Can routers select the right model before generation happens?"). On the retrieval side, calibrated token-probability uncertainty beats complex adaptive heuristics on single-hop tasks and matches them on multi-hop when deciding whether to fetch more context, with far fewer model and retriever calls (see "Can simple uncertainty estimates beat complex adaptive retrieval?"). The implication for sparsity is that the model's own uncertainty or estimated complexity may be a sharper per-request signal than any external proxy like length.
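As a rough sketch of what a self-estimated signal could look like, the snippet below turns next-token probability distributions into a sparsity budget. Everything here is hypothetical: the cited notes apply uncertainty to retrieval decisions, not to sparsity, and the function names, the linear mapping, and the `s_min`/`s_max` defaults are assumptions. Raw entropy also stands in for the calibrated uncertainty the note describes.

```python
import math


def mean_token_entropy(prob_rows):
    """Average Shannon entropy (in nats) over a list of next-token distributions."""
    if not prob_rows:
        raise ValueError("need at least one distribution")
    total = 0.0
    for probs in prob_rows:
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(prob_rows)


def sparsity_from_uncertainty(entropy, vocab_size, s_min=0.5, s_max=0.95):
    """Assumed linear map: confident requests get s_max, uncertain ones get s_min."""
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    # Normalize by the maximum possible entropy (uniform distribution), clamp to [0, 1].
    u = min(1.0, max(0.0, entropy / math.log(vocab_size)))
    return s_max - u * (s_max - s_min)
```

A request whose early token distributions are sharply peaked would land near `s_max`, and one with near-uniform distributions near `s_min`. Whether such a mapping actually tracks optimal sparsity is an open question that would need measurement.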

There's even a hint that sparsity is something the network expresses internally on its own terms: representational density is learned, with models developing dense activations for familiar inputs and defaulting to sparse ones for unfamiliar inputs (see "Is representational sparsity learned or intrinsic to neural networks?"). That reframes the question: instead of guessing optimal sparsity from outside, you might read it off the model's own activation patterns. So length is a useful starting proxy, but the corpus points past it toward task-awareness and self-estimated difficulty as the signals that actually carry the prediction.


Sources (6 notes)

Does fixed sparsity work for all sequence lengths?

Longer sequences tolerate significantly higher sparsity levels than shorter ones without performance loss. Fixed-budget sparse attention is suboptimal in production; budgets should adapt per input based on context length and other request properties.

Does sparse attention trade off quality for speed?

The Sparse Frontier benchmark shows that at equivalent compute cost, larger sparse-attention models outperform smaller dense models on long-context tasks. Sparsity lets you train bigger models within the same budget, making it Pareto-improving rather than a pure trade-off.

How much sparsity can different reasoning tasks actually tolerate?

Single-QA tasks tolerate 95% sparsity while multi-hop and aggregation tasks degrade substantially at 50-67% sparsity. This pattern reflects structural differences: single-QA concentrates reasoning in few tokens, while multi-hop and aggregation require distributed attention across multiple regions.

Can routers select the right model before generation happens?

RouteLLM and Hybrid-LLM both achieve 40-50% cost reduction by routing to a single model based on query difficulty prediction, not response evaluation. Single-model routing minimizes latency compared to ensemble or cascade alternatives.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.
