Why does entropy-based frame sampling work better than uniform stride selection?
This question asks why selecting frames where information concentrates (high entropy, meaning where content changes most) outperforms grabbing every Nth frame on a fixed schedule. The corpus holds no paper on video frame sampling specifically, but the same contest between information-guided selection and uniform stride recurs across very different tasks, and the verdict is consistent: uniform selection spends budget on redundant samples and under-samples the moments that actually carry signal.
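To make the contrast concrete, here is a minimal sketch of the two selection rules. It is an illustration, not a method taken from the corpus: it assumes frames are flat lists of grayscale pixel values, and it scores each frame by the Shannon entropy of its pixel-wise difference from the previous frame. The names `change_entropy`, `uniform_stride`, and `entropy_sample` are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def change_entropy(prev_frame, frame):
    """Assumed score: entropy of the per-pixel absolute difference between frames."""
    diffs = [abs(a - b) for a, b in zip(prev_frame, frame)]
    return shannon_entropy(diffs)

def uniform_stride(num_frames, budget):
    """Content-blind baseline: pick `budget` indices at a fixed interval."""
    stride = max(1, num_frames // budget)
    return list(range(0, num_frames, stride))[:budget]

def entropy_sample(frames, budget):
    """Pick the `budget` frames with the highest change entropy, in time order."""
    scores = [0.0] + [
        change_entropy(frames[i - 1], frames[i]) for i in range(1, len(frames))
    ]
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

# Toy clip: five static frames, a scene change at frame 5, a hold, then a partial change.
static = [0, 0, 0, 0]
frames = [static] * 5 + [[0, 50, 100, 200], [0, 50, 100, 200], [0, 50, 0, 0]]

uniform_picks = uniform_stride(len(frames), budget=2)   # [0, 4]: two static frames
entropy_picks = entropy_sample(frames, budget=2)        # [5, 7]: both change events
```

On this toy clip the fixed stride spends its whole budget on frames that repeat each other, while the entropy score lands on the two frames where the content actually changes.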
The sharpest parallel is in how reasoning traces get filtered. Step-level confidence filtering beats global confidence averaging precisely because averaging is the uniform-stride mistake in disguise: it smears one number across a whole trace and masks the exact step where reasoning breaks (see "Does step-level confidence outperform global averaging for trace filtering?"). Look locally, at the points of high uncertainty, and you catch the breakdown, and you can stop early instead of paying for the whole trace. Entropy-based frame sampling is the same move in the time dimension: spend your budget where the signal spikes, not on a flat schedule that treats every interval as equally worth looking at.
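The masking effect described above can be shown with a few lines of arithmetic. This is a sketch under assumptions: the per-step confidence values are invented, and the sliding-window minimum and the 0.6 threshold are illustrative choices, not the procedure from the cited note.

```python
def global_confidence(step_conf):
    """One number for the whole trace: the mean over all steps."""
    return sum(step_conf) / len(step_conf)

def lowest_window_confidence(step_conf, window=2):
    """Local view: the worst mean confidence over any run of `window` steps."""
    means = [
        sum(step_conf[i:i + window]) / window
        for i in range(len(step_conf) - window + 1)
    ]
    return min(means)

def early_stop_step(step_conf, threshold=0.6, window=2):
    """Index of the step at which a window first falls below `threshold`, else None."""
    for end in range(window, len(step_conf) + 1):
        chunk = step_conf[end - window:end]
        if sum(chunk) / window < threshold:
            return end - 1
    return None

# Hypothetical traces: A is confident except for one broken step, B is uniformly mediocre.
trace_a = [0.9, 0.9, 0.2, 0.9, 0.9, 0.9]
trace_b = [0.75, 0.75, 0.75, 0.75, 0.75, 0.75]

# The global mean prefers A; the local measure flags A's breakdown at step index 2.
prefers_a_globally = global_confidence(trace_a) > global_confidence(trace_b)
flags_a_locally = lowest_window_confidence(trace_a) < lowest_window_confidence(trace_b)
stop_at = early_stop_step(trace_a)
```

The early-stop index is what saves budget: generation of trace A could be abandoned at the third step instead of running to completion.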
The pattern recurs as a deliberate design choice elsewhere. Sparsity-guided curriculum learning orders in-context demonstrations by an internal information measure (activation sparsity) instead of an arbitrary order, with no external labels needed (see "Can representation sparsity order few-shot demonstrations effectively?"). DRO reuses cross-rollout variance as a selection signal to filter out degenerate, low-information comparisons before they waste training (see "Can one statistical measure serve dual purposes in RL training?"). SkillRL refuses to process all trajectories uniformly: successes and failures carry different information, so they get handled differently, beating uniform consolidation (see "Should successful and failed episodes be processed differently?"). In every case the lesson is the same: a content-blind, uniform rule leaves information density on the table.
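The variance-based filtering idea can be sketched in a few lines. This is not DRO's actual statistic or code: the scores, the `filter_degenerate` helper, and the `min_variance` threshold are hypothetical, and the point is only that queries whose rollouts all score the same offer nothing to compare.

```python
from statistics import pvariance

def filter_degenerate(rollout_scores, min_variance=1e-6):
    """Keep only queries whose rollout scores actually differ from each other."""
    return {
        query: scores
        for query, scores in rollout_scores.items()
        if pvariance(scores) > min_variance
    }

# Hypothetical scores for four rollouts per query.
rollout_scores = {
    "q1": [0.5, 0.5, 0.5, 0.5],   # every rollout identical: no learning signal
    "q2": [0.1, 0.9, 0.4, 0.6],   # rollouts disagree: informative comparison
    "q3": [1.0, 1.0, 1.0, 1.0],   # saturated: also degenerate
}

kept = filter_degenerate(rollout_scores)
```

A uniform rule would train on all three queries; the variance filter keeps only `q2`.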
The inverse case makes it concrete. Random tool sampling fails for synthetic data generation because picking items without regard to their relationships produces incoherent, low-value samples. The fix is to sample from a relevance graph so that what you select actually composes (see tool-calling-data-synthesis-fails-through-random-tool-sampling-and-single-turn-fo). Uniform stride is random sampling's orderly cousin: both ignore where the meaningful structure lives. And there is a subtler reason placement matters at all: position itself can swing accuracy by up to 20% in in-context learning, independent of content (see "How much does demo position alone affect in-context learning accuracy?"), a reminder that which samples land where is never neutral.
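One way to picture relevance-graph sampling is a random walk in which each new tool must be a neighbor of the previous one. This is a sketch of the general idea only: the graph, the tool names, and `sample_related_tools` are hypothetical, and the cited note may build and traverse its graph differently.

```python
import random

def sample_related_tools(graph, k, rng):
    """Random walk over a relevance graph: each pick is a neighbor of the last one."""
    current = rng.choice(sorted(graph))
    chain = [current]
    while len(chain) < k:
        candidates = [tool for tool in graph[current] if tool not in chain]
        if not candidates:
            break  # dead end: return a shorter but still coherent chain
        current = rng.choice(candidates)
        chain.append(current)
    return chain

# Hypothetical relevance graph: an edge means two tools plausibly compose in one task.
tool_graph = {
    "search_flights": ["book_flight", "get_weather"],
    "book_flight": ["search_flights", "send_email"],
    "get_weather": ["search_flights"],
    "send_email": ["book_flight"],
    "resize_image": [],
}

chain = sample_related_tools(tool_graph, k=3, rng=random.Random(0))
```

Unlike drawing three tools independently at random, every consecutive pair in `chain` is connected by an edge, so an unrelated tool such as `resize_image` can only appear alone.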
What you didn't know you wanted to know: the win from entropy sampling isn't a video trick. It is an instance of a principle the corpus keeps rediscovering under different names (confidence, sparsity, variance, relevance graphs). In each of these notes, a system that can read off where its own signal concentrates does better with that internal measure than with a fixed external schedule. If you want to go deeper, start with the confidence-filtering note, which is the cleanest statement of why local information beats a flat global rule.
Sources (6 notes)
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, which suggests that trace quality matters more than quantity.
Sparsity-Guided Curriculum In-Context Learning uses last-layer activation sparsity to order demonstrations from sparse (harder) to dense (easier), yielding considerable performance improvements. This approach requires no external difficulty labels and works across diverse in-context learning tasks.
DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
Repositioning an identical demo block from the start of the prompt to the end shifts accuracy by up to 20% and flips nearly half of predictions. This positional effect operates independently of demo content and spans multiple task types.