INQUIRING LINE

Does importance sampling actually recover capabilities lost to hard sample training?

This explores whether reweighting training examples (importance sampling) can undo the capability damage that comes from training a model on problems that are too hard for it.


The corpus's most useful move here is to question the premise that importance sampling is doing recovery at all. The collection has sharp material on how hard samples cause damage and on what reweighting actually accomplishes, but those turn out to be two different jobs.

Start with the damage. Training on near-impossible problems doesn't just waste compute; it actively corrupts capabilities the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). The mechanism matters here: group-relative normalization treats a rare accidental success on an impossible problem as a high-advantage trajectory, so the model gets rewarded for answer-repetition and computation-skipping, and those shortcuts bleed into tasks it used to do honestly. The crucial wrinkle for your question is that this corruption lives *inside the weighting itself*: the advantage estimate is what amplifies the garbage. So reweighting isn't a neutral repair tool applied after the fact; it's the same lever that caused the harm.
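
To make the amplification concrete, here is a minimal sketch of group-relative normalization. It assumes the common mean-and-standard-deviation form (each rollout's reward minus the group mean, divided by the group standard deviation) and binary rewards. The function name, the group size of 16, and the `eps` constant are illustrative choices, not details taken from the notes.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group (assumed form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Near-impossible problem: 1 accidental success out of 16 rollouts.
hard = [1.0] + [0.0] * 15
# Medium problem: 8 successes out of 16 rollouts.
medium = [1.0] * 8 + [0.0] * 8

hard_adv = group_relative_advantages(hard)
medium_adv = group_relative_advantages(medium)

print(f"advantage of the lone success on the hard problem: {hard_adv[0]:.2f}")
print(f"advantage of a success on the medium problem:      {medium_adv[0]:.2f}")
```

Under these assumptions a success in a group with pass rate p gets an advantage of about sqrt((1 - p) / p), so the rarer the success, the larger the push toward whatever that trajectory happened to do.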

The closest thing the corpus has to importance sampling shows it working as prevention, not recovery. DRO reuses cross-rollout variance to both weight tokens and *filter out* degenerate queries before they poison training (see "Can one statistical measure serve dual purposes in RL training?"). The informativeness of any sample also shifts as the model's ability changes: the productive band of medium-difficulty problems drifts during training, so static difficulty labels go stale within steps (see "How does model ability change what samples teach?"). Gradient-based selection tells the same story: choosing 5% of data beats full training precisely because the dropped examples were *actively hindering* specific skills (see "Can we train better models on less data?"). In every case the win comes from never reinforcing the harmful trajectory, which keeps the damage from happening rather than reversing it.
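
The prevention idea can be sketched as a per-step filter on reward variance across rollouts. This is an illustration only, not DRO's actual statistic or procedure: the function name, the query ids, and the `min_variance` threshold of 0.1 are all hypothetical, and rewards are assumed to be binary.

```python
import statistics

def keep_informative(rollout_rewards, min_variance=0.1):
    """Keep only queries whose rollouts disagree enough to carry signal.

    Meant to be re-run at every training step with fresh rollouts, since
    the band of useful difficulty moves as the model improves.
    """
    kept = {}
    for query_id, rewards in rollout_rewards.items():
        if statistics.pvariance(rewards) >= min_variance:
            kept[query_id] = rewards
    return kept

batch = {
    "never_solved": [0.0] * 16,               # variance 0: no signal
    "lucky_once": [1.0] + [0.0] * 15,         # rare accidental success
    "medium": [1.0, 0.0] * 8,                 # model succeeds half the time
    "always_solved": [1.0] * 16,              # variance 0: nothing to learn
}

kept = keep_informative(batch)
print(sorted(kept))
```

With binary rewards the variance is p(1 - p), so a threshold of 0.1 keeps pass rates between roughly 11% and 89%. The lone accidental success is dropped before it can be reinforced, which is the preventive role described above.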

When the corpus talks about genuine *recovery*, the working levers are different ones entirely. Capability comes back by staying close to the base model rather than by clever sample weighting. Low KL drift from base preserves the model's plasticity and its ability to keep learning (see "Does staying close to the base model preserve learning ability?"). Decoding-time proxy tuning leaves base weights untouched and so dodges the knowledge corruption that direct fine-tuning inflicts (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). And the capability you're trying to recover may have been latent in the base activations the whole time: post-training selects reasoning rather than creating it (see "Do base models already contain hidden reasoning ability?").
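
As a rough illustration of what "distance from the base model" measures, the sketch below computes KL divergence between a tuned and a base next-token distribution. The three-token distributions are made-up numbers, and the direction KL(tuned || base) is an assumption, since the notes do not say which direction they measure.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens.

    Assumes q is nonzero wherever p is nonzero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token probabilities over a 3-token vocabulary.
base = [0.50, 0.30, 0.20]
lightly_tuned = [0.55, 0.27, 0.18]   # small shift away from base
heavily_tuned = [0.95, 0.04, 0.01]   # mass collapsed onto one token

light_drift = kl_divergence(lightly_tuned, base)
heavy_drift = kl_divergence(heavily_tuned, base)

print(f"drift of lightly tuned model: {light_drift:.4f} nats")
print(f"drift of heavily tuned model: {heavy_drift:.4f} nats")
```

In practice this quantity would be averaged over many contexts sampled from the tuned model. The point of the sketch is only that drift is a property of the output distribution, not of how training samples were weighted.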

So the honest synthesis: the corpus gives you no evidence that importance sampling *recovers* capabilities lost to hard-sample training, and good reason to doubt the framing. Reweighting is well-positioned to stop the damage — filter the degenerate queries, track the drifting difficulty band, drop the harmful examples. But once shortcuts have already contaminated a skill, the recovery lever the research points to is staying near the base distribution or re-eliciting what's still latent there, not resampling the same poisoned signal more cleverly. Prevention is a sampling problem; recovery is a distribution-distance problem.


Sources (7 notes)

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.

How does model ability change what samples teach?

A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within steps.

Can we train better models on less data?

LESS uses low-rank gradient features to select instruction data most similar to target capabilities, and training on the selected 5% consistently outperforms full dataset training. The improvement occurs because mixed datasets contain examples that actively hinder specific skills by shifting reasoning strategy away from task requirements.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
