INQUIRING LINE

Can importance sampling reduce variance in off-policy reward estimation?

This explores whether reweighting samples drawn from one policy can give you a lower-variance estimate of rewards for a different policy. The honest answer is that the corpus addresses variance and off-policy reward estimation, but through different machinery than classical importance sampling.


Importance sampling reweights data collected under one policy to estimate rewards under another; the question is whether it can tame the variance that makes off-policy estimation noisy. The collection doesn't contain a note that tackles importance-sampling estimators head-on (no inverse-propensity weighting, no clipped likelihood ratios). What it does have is a cluster of work attacking the same underlying pain: reward signals are noisy, and that noise has to be managed before learning is stable. Read laterally, these notes suggest the corpus has largely routed *around* importance sampling rather than through it.
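For reference, the classical estimators that the collection does not cover can be sketched roughly as follows. This is a minimal illustration on logged bandit-style data, not something taken from any of the notes; the function names `ips_estimate` and `snips_estimate` and the dictionary-based policy representation are assumptions made for the example.

```python
def ips_estimate(samples, target_prob, behavior_prob, clip=None):
    """Inverse-propensity estimate of the target policy's expected reward.

    samples: list of (action, reward) pairs logged under the behavior policy.
    target_prob, behavior_prob: dicts mapping action -> probability (illustrative).
    clip: optional cap on the likelihood ratio; capping lowers variance
          but introduces bias.
    """
    total = 0.0
    for action, reward in samples:
        ratio = target_prob[action] / behavior_prob[action]
        if clip is not None:
            ratio = min(ratio, clip)
        total += ratio * reward
    return total / len(samples)


def snips_estimate(samples, target_prob, behavior_prob):
    """Self-normalized variant: divide by the sum of weights instead of n."""
    weights = [target_prob[a] / behavior_prob[a] for a, _ in samples]
    weighted = sum(w * r for w, (_, r) in zip(weights, samples))
    return weighted / sum(weights)


# Toy log: the behavior policy rarely picks "b", the target policy usually does.
samples = [("a", 0.0)] * 9 + [("b", 1.0)]
behavior = {"a": 0.9, "b": 0.1}
target = {"a": 0.1, "b": 0.9}

unclipped = ips_estimate(samples, target, behavior)
clipped = ips_estimate(samples, target, behavior, clip=2.0)
normalized = snips_estimate(samples, target, behavior)
```

In this toy log the whole estimate rests on the single rare sample, which carries a weight of about 9. That concentration of weight is the variance problem in miniature, and capping the ratio at 2 shows the price of the usual remedy: the estimate falls well below the unclipped one.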

The most direct neighbor is the idea of using variance itself as the signal rather than something to suppress. "Can one statistical measure serve dual purposes in RL training?" takes the spread across multiple rollouts of the same query and reuses it two ways: weighting tokens within an answer and filtering out degenerate queries entirely. That's a variance-reduction move in spirit (discard the comparisons that would inject the most noise), but it's self-supervised and on-policy, so it sidesteps the reweighting problem importance sampling is built to solve. Its companion note, "Can rubrics and dense rewards work together without hacking?", adds a related instinct: use rubrics as accept/reject *gates* on whole rollout groups rather than converting fuzzy scores into dense rewards, which keeps unreliable signal from contaminating the estimate at all.
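The query-level half of that idea can be sketched as below. This is not the note's actual method: the statistic it uses is not specified here, so the sketch substitutes the population standard deviation of per-rollout rewards, and it assumes that "degenerate" means a query whose rollouts barely differ. The function name `filter_queries_by_spread` and the `min_spread` threshold are hypothetical. The token-level weighting is left out because the summary does not say how it is computed.

```python
from statistics import pstdev


def filter_queries_by_spread(rollout_rewards, min_spread=1e-6):
    """Keep queries whose rollout rewards disagree enough to be informative.

    rollout_rewards: dict mapping query id -> list of rewards, one per rollout.
    min_spread: hypothetical threshold; at or below it a query is treated as
                degenerate (an assumption, not taken from the source note).
    Returns a dict mapping each kept query id -> its reward spread.
    """
    kept = {}
    for query, rewards in rollout_rewards.items():
        spread = pstdev(rewards)
        if spread > min_spread:
            kept[query] = spread
    return kept


groups = {
    "q1": [1.0, 1.0, 1.0, 1.0],  # every rollout agrees: nothing to compare
    "q2": [0.0, 1.0, 0.0, 1.0],  # rollouts disagree: kept
}
kept = filter_queries_by_spread(groups)
```

The returned spread could then be reused as a per-query weight, which is the "one statistic, two uses" pattern in its simplest form.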

Where off-policy estimation genuinely shows up is reward estimation *without ground truth*. "Can models improve themselves using only majority voting?" estimates rewards by majority vote across samples. This is a consensus estimator whose variance shrinks as you draw more rollouts, pursuing the same goal as importance-sampling variance reduction by averaging instead of reweighting, and at the cost of bias whenever the consensus is wrong. And "Can reward models benefit from reasoning before scoring?" shows that reward models which reason before scoring can spend more test-time compute to produce more reliable estimates, again trading compute for a lower-variance reward signal without touching propensity ratios.
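A majority-vote reward of the kind described above might look like the following sketch. It assumes final answers are short, comparable strings and that each rollout gets a binary reward for matching the consensus; the function name `majority_vote_rewards` is hypothetical and the note's actual reward shaping may differ.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Score each rollout by agreement with the most common answer.

    answers: list of final answers, one per rollout of the same query.
    Returns (consensus, rewards), where rewards[i] is 1.0 if answers[i]
    matches the consensus and 0.0 otherwise. Ties go to the answer that
    appeared first, which is how Counter.most_common orders equal counts.
    """
    consensus, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if answer == consensus else 0.0 for answer in answers]
    return consensus, rewards


consensus, rewards = majority_vote_rewards(["42", "42", "17", "42"])
```

Drawing more rollouts makes the consensus label more stable, but it cannot correct a model that is confidently and consistently wrong; that is the bias this estimator accepts in exchange for lower variance.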

The more surprising thread is that several notes suggest *which* samples you keep matters more than how you reweight them. "Does negative reinforcement alone outperform full reinforcement learning?" finds that training on negative samples alone can match full RL while preserving diversity, implying that the positive trajectories importance weighting would carefully reweight may be the ones degrading performance by concentrating probability mass. "Should successful and failed episodes be processed differently?" pushes the same asymmetry: successes and failures should be processed *differently*, not folded into one reweighted estimate. If the corpus has a thesis here, it's that off-policy noise is better handled by structured selection and consensus than by a single scalar reweighting. "Can scalar rewards capture all the information in agent feedback?" hints at why: a scalar reward (the only thing importance sampling can reweight) throws away the directive information that richer feedback carries.

So: the corpus can't tell you whether classical importance sampling reduces variance in your estimator; that specific question is a gap here. But it strongly suggests that the working answer in this body of work is to attack reward-estimation variance through consensus, gating, asymmetric trajectory handling, and reasoning-augmented scoring instead, which may be why the importance-sampling framing is underrepresented.


Sources (7 notes)

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.

Can models improve themselves using only majority voting?

Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
