Can representation sparsity order few-shot demonstrations effectively?

Does measuring how sparse a model's hidden states are for each example provide a reliable signal for ordering few-shot demonstrations in prompts? This matters because curriculum ordering significantly affects in-context learning performance.

Synthesis note · 2026-05-18 · sourced from LLM Architecture

Once representational sparsity is shown to track task difficulty for a given model, sparsity itself becomes a usable signal for curriculum design. The paper "Farther the Shift, Sparser the Representation" operationalizes this with Sparsity-Guided Curriculum In-Context Learning (SG-ICL), which uses the sparsity of last-layer activations to schedule few-shot demonstrations in the prompt.

The mechanism: measure how sparse the model's last-layer hidden states are when it processes each candidate few-shot example, then order the demonstrations by that score. The order can run from sparse (high difficulty for this model) to dense (low difficulty), or the reverse, depending on what the curriculum is meant to achieve. The reported result is a considerable performance gain over random or naive ordering.
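The ordering step can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the note does not say which sparsity measure SG-ICL uses, so Hoyer sparsity stands in here, and the function names `hoyer_sparsity` and `order_demonstrations` are hypothetical. The hidden states are assumed to have been extracted already, one flat vector per candidate demonstration.

```python
import math
from typing import List, Sequence


def hoyer_sparsity(activations: Sequence[float]) -> float:
    """Hoyer sparsity: 0.0 for a uniform vector, 1.0 for a one-hot vector.

    Assumption: this metric is a stand-in; the source note does not
    specify the sparsity measure used by SG-ICL.
    """
    n = len(activations)
    if n < 2:
        raise ValueError("need at least two activations")
    l1 = sum(abs(a) for a in activations)
    l2 = math.sqrt(sum(a * a for a in activations))
    if l2 == 0.0:
        # All-zero vector: scored as 0.0 by convention (an assumption).
        return 0.0
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)


def order_demonstrations(
    demos: Sequence[str],
    hidden_states: Sequence[Sequence[float]],
    sparse_first: bool = True,
) -> List[str]:
    """Sort demonstrations by the sparsity of their last-layer hidden state.

    sparse_first=True puts the sparsest (hardest for this model, per the
    note) first; sparse_first=False reverses the curriculum.
    """
    if len(demos) != len(hidden_states):
        raise ValueError("one hidden state per demonstration is required")
    scores = [hoyer_sparsity(h) for h in hidden_states]
    ranked = sorted(range(len(demos)), key=lambda i: scores[i], reverse=sparse_first)
    return [demos[i] for i in ranked]


if __name__ == "__main__":
    demos = ["demo A", "demo B", "demo C"]
    # Toy vectors standing in for real last-layer activations.
    states = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0, 1.0]]
    print(order_demonstrations(demos, states))
```

Which direction works better is left as a parameter because the note itself treats the direction as depending on the curriculum's goal.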

This is a model-internal curriculum signal. Most curriculum learning approaches require external difficulty labels — annotator effort, heuristics about problem features, or proxy measures like solution length. Sparsity sidesteps this entirely. The model itself reveals which examples are hard for it through how its representations respond. The curriculum can be tailored to the specific model being used rather than to some external notion of universal difficulty.

The technique generalizes across the in-context learning landscape. Anywhere few-shot prompting is used — classification, reasoning, agentic deployments — sparsity-derived ordering is available. It costs nothing extra at the relevant scale: hidden states are computed regardless, and reading their sparsity is a free byproduct. The only requirement is access to the activations, which is available for any white-box deployment.

For builders of LLM pipelines, this argues for instrumentation that exposes activation-sparsity statistics. The signal supports curriculum ordering, hard-example mining, confidence calibration, and likely other applications not yet identified. Sparsity is becoming a richer interpretability primitive than the static-property framing has suggested.
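As a rough sketch of what such instrumentation might look like, the snippet below records one sparsity score per processed example and exposes the sparsest ones for hard-example mining. Everything here is hypothetical: the `SparsityLog` class, the `near_zero_fraction` measure, and the `eps` threshold are illustrative choices, not part of any existing library or of the paper. Treating "sparsest" as "hardest" follows the note's claim and is an assumption in the code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


def near_zero_fraction(activations: Sequence[float], eps: float = 1e-3) -> float:
    """Fraction of activations with magnitude below eps (illustrative measure)."""
    if not activations:
        raise ValueError("activations must be non-empty")
    return sum(1 for a in activations if abs(a) < eps) / len(activations)


@dataclass
class SparsityLog:
    """Hypothetical per-example log of activation sparsity."""

    eps: float = 1e-3
    scores: Dict[str, float] = field(default_factory=dict)

    def record(self, example_id: str, last_hidden_state: Sequence[float]) -> float:
        """Store and return the sparsity score for one example."""
        score = near_zero_fraction(last_hidden_state, self.eps)
        self.scores[example_id] = score
        return score

    def hardest(self, k: int) -> List[str]:
        """Return the k example ids with the sparsest representations.

        Assumption: sparser means harder for this model, as the note argues.
        """
        ranked = sorted(self.scores, key=lambda ex: self.scores[ex], reverse=True)
        return ranked[:k]


if __name__ == "__main__":
    log = SparsityLog()
    log.record("a", [0.0, 0.0, 0.0, 2.0])
    log.record("b", [1.0, 2.0, 3.0, 4.0])
    log.record("c", [0.0, 0.5, 0.0, 1.0])
    print(log.hardest(2))
```

The same log could feed curriculum ordering or a calibration check; only the consumer of `scores` changes.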

The deeper template is that adaptive internal phenomena — sparsity here, attention concentration elsewhere, gradient magnitudes during training — can be operationalized as signals for system behavior once they are recognized as informative rather than incidental.

Related concepts in this collection 3

Do language models sparsify their activations under difficult tasks?
When LLMs encounter unfamiliar or difficult inputs, do their internal representations become sparser rather than denser? Understanding this adaptive response could reveal how models stabilize reasoning under uncertainty.
same paper, the underlying phenomenon this method exploits

Is representational sparsity learned or intrinsic to neural networks?
Explores whether sparsity in neural network activations is engineered through training or emerges as a default response to unfamiliar inputs. Understanding this distinction could reshape how we design and interpret model behavior.
same paper, the developmental story

Why do trajectories matter more than individual examples for in-context learning?
Can language models learn new sequential decision-making tasks from context alone, and if so, what data properties make this possible? This explores why isolated state-action pairs fail where full trajectories succeed.
adjacent: another structural requirement for effective ICL