INQUIRING LINE

Does sparsity-guided ordering work equally well for reasoning and classification tasks?

This explores whether using a model's own representation sparsity to order examples (a difficulty signal) holds up the same way across reasoning-heavy and classification-style tasks — and what the corpus says about why sparsity tracks difficulty at all.


Ordering examples by representation sparsity, with 'harder' sparse cases ahead of 'easier' dense ones, is unlikely to pay off uniformly across reasoning and classification tasks. The corpus suggests the technique is real, but its leverage almost certainly isn't flat across task types, because sparsity means something different depending on how much reasoning a task demands.

The core method comes from "Can representation sparsity order few-shot demonstrations effectively?", which uses last-layer activation sparsity to rank few-shot demonstrations from sparse (hard) to dense (easy) and needs no human difficulty labels. Why this works at all is sharpened by "Do language models sparsify their activations under difficult tasks?": models sparsify their hidden states *adaptively* as tasks become unfamiliar or reasoning-heavy, so sparsity rises both with reasoning load and with distance from what the model has seen. Sparsity is therefore a proxy for two distinct things at once: genuine reasoning effort and mere distributional unfamiliarity. On a reasoning task those overlap; on a classification task that is simply out-of-distribution, sparsity may be flagging novelty, not difficulty. That gap is exactly where 'equally well' breaks down.
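The ranking step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method's actual implementation: the notes do not say how sparsity is measured, so the fraction of near-zero activations (threshold `eps`) is an assumed metric, and `activation_sparsity` and `order_demonstrations` are hypothetical names. The activation vectors are assumed to have been extracted from the model's last layer beforehand.

```python
from typing import List, Sequence, Tuple


def activation_sparsity(activations: Sequence[float], eps: float = 1e-3) -> float:
    """Fraction of entries with magnitude at most eps (an assumed sparsity metric)."""
    if not activations:
        raise ValueError("activations must be non-empty")
    near_zero = sum(1 for a in activations if abs(a) <= eps)
    return near_zero / len(activations)


def order_demonstrations(
    demos: Sequence[str],
    last_layer_activations: Sequence[Sequence[float]],
    eps: float = 1e-3,
) -> List[Tuple[str, float]]:
    """Return (demo, sparsity) pairs sorted from sparsest (hard) to densest (easy)."""
    if len(demos) != len(last_layer_activations):
        raise ValueError("need one activation vector per demonstration")
    scored = [
        (demo, activation_sparsity(acts, eps))
        for demo, acts in zip(demos, last_layer_activations)
    ]
    # Python's sort is stable, so tied demos keep their original relative order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: three demonstrations with made-up 4-dimensional activations.
demos = ["demo A", "demo B", "demo C"]
acts = [
    [0.0, 0.5, 0.0, 0.2],  # 2 of 4 entries near zero
    [0.0, 0.0, 0.0, 0.9],  # 3 of 4 entries near zero
    [0.3, 0.4, 0.1, 0.2],  # no entries near zero
]
ordered = order_demonstrations(demos, acts)
print([demo for demo, _ in ordered])  # ['demo B', 'demo A', 'demo C']
```

Because the score is a single scalar per demonstration, the sketch cannot tell whether a high value reflects reasoning effort or unfamiliarity, which is the ambiguity the paragraph above describes.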

The curriculum literature here makes the same point from the other side. "Does ordering training data by rarity actually improve language models?" argues that the signal an ordering should track is *distance from the pre-training distribution*, not conceptual difficulty: rare data goes first because rarity marks a distributional weakness. If sparsity is partly an unfamiliarity detector, sparsity-guided ordering is implicitly a distributional curriculum, which would help most where the model's coverage is patchy and matter little where the task is already well represented. Whether a given classification task benefits depends less on the label 'classification' and more on how far the task sits from the training distribution.
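One way to probe this for a particular task is to check how closely the sparsity ranking agrees with a ranking by distributional distance. The sketch below computes a Spearman rank correlation in plain Python. It is a diagnostic suggested here for illustration, not something the notes report running; the `novelty` scores are a hypothetical per-example measure of distance from training (how to obtain one is left open), and all numbers are made up.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-example scores for one task (illustrative values only).
sparsity = [0.10, 0.40, 0.25, 0.80]
novelty = [1.2, 3.1, 2.0, 5.5]
rho = spearman(sparsity, novelty)
```

A correlation near 1 on a classification task would be consistent with sparsity mostly reporting unfamiliarity there; a weaker correlation would suggest it is picking up something else. Either reading is only as good as the novelty measure chosen.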

There's a deeper reason reasoning tasks may respond differently: reasoning isn't uniform internally. "Which tokens in reasoning chains actually matter most?" shows that models rank tokens by functional importance, preserving symbolic-computation tokens while pruning grammar and filler first, so 'difficulty' in a reasoning chain is concentrated in specific structural moments rather than spread evenly. A single scalar sparsity score is a blunter instrument for that than for a classification decision, where difficulty is closer to a single boundary call. And "Does chain-of-thought reasoning actually generalize beyond training data?" warns that reasoning behavior degrades predictably outside the training distribution, producing fluent but invalid logic. The very sparsity spike that signals a hard reasoning case can therefore coincide with the model reasoning *badly*, which complicates any clean hard-to-easy story.

The corpus doesn't contain a head-to-head benchmark of sparsity-guided ordering on reasoning versus classification, so a flat 'yes/no' isn't earned. But the lateral read is consistent: sparsity works as a label-free difficulty proxy because it conflates reasoning effort with distributional novelty, and that conflation is benign on reasoning tasks (where the two move together) and noisier on classification tasks (where it mostly reports unfamiliarity). The thing you didn't know you wanted to know: the method has a quiet kinship with "Can non-reasoning models catch up with more compute?". Both say the productive signal lives in how training shaped the model's internal structure, not in raw compute or surface task labels.


Sources (6 notes)

Can representation sparsity order few-shot demonstrations effectively?

Sparsity-Guided Curriculum In-Context Learning uses last-layer activation sparsity to order demonstrations from sparse (harder) to dense (easier), yielding considerable performance improvements. This approach requires no external difficulty labels and works across diverse in-context learning tasks.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Does ordering training data by rarity actually improve language models?

CTFT fine-tunes LLMs on rare data first because rarity signals distributional weakness, not conceptual difficulty. This reframes curriculum learning as managing distance from pre-training distribution rather than pedagogical scaffolding.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
