What limits RLVR effectiveness beyond mathematical and coding domains?
This explores why RLVR — reinforcement learning from verifiable rewards — struggles to travel beyond domains like math and code where answers can be checked automatically, and what the corpus reveals about the underlying mechanism.
The corpus suggests the limit isn't really about subject matter. It's about what RLVR actually does to a model and what conditions a domain has to meet for it to do anything at all. The deepest finding is that RLVR mostly doesn't teach new reasoning; it sharpens sampling toward solutions the base model could already reach. Pass@k analysis shows base models outperform RLVR-trained models at high k, meaning the technique narrows the search rather than expanding the frontier (Does RLVR actually expand what models can reason about?). A related line of work frames this as RLVR "activating" pretraining strategies rather than installing new ones: a single training example can trigger the activation, and for models with appropriate pretraining even spurious rewards work nearly as well as correct ones (What does reward learning actually do to model reasoning?). If RLVR only amplifies what pretraining already seeded, then any domain where the base model is weak gets little lift no matter how good the reward signal is.
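To make the pass@k argument concrete, here is a minimal sketch using the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem correct counts below are invented toy numbers chosen only to illustrate the crossover pattern; they are not results from any of the cited notes, and the helper names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# ASSUMPTION: toy data, 5 problems, 256 samples each.
# The "base" model solves every problem, but rarely.
# The "rl" model solves fewer problems, but far more reliably.
N_SAMPLES = 256
base_counts = [40, 30, 4, 2, 1]
rl_counts = [200, 180, 60, 0, 0]

if __name__ == "__main__":
    for k in (1, 8, 64, 256):
        base = mean_pass_at_k(base_counts, N_SAMPLES, k)
        rl = mean_pass_at_k(rl_counts, N_SAMPLES, k)
        print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

With these toy counts the RL-style model wins at k=1 and the base-style model wins at k=256, which is the shape of the finding described above: better sampling efficiency, but no larger set of solvable problems.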
The second limit is environmental, and it's the most direct answer to the question: domains differ in whether they can even support this kind of optimization. The conditions that make autonomous, reward-driven improvement work are immediate scalar metrics, modular structure, fast iteration, and version control. A domain missing any of them resists the approach regardless of model power, because the bottleneck is the environment's structure, not the model (What makes a research domain suitable for autonomous optimization?). Math and code happen to hand you a clean, instant, checkable score. Most domains, such as writing, strategy, and open-ended judgment, don't, which is exactly where "verifiable" in RLVR stops being free.
Worse, where verification is shaky, RLVR's gains can be illusory. On contaminated benchmarks, apparent improvements are largely memorization: Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts yet scores 0.0% on the post-release LiveMathBench (Does RLVR success on math benchmarks reflect genuine reasoning improvement?). And even genuine gains are often structural rather than semantic: RLVR reduces logical errors between adjacent reasoning steps, making traces locally more coherent without guaranteeing they're globally valid (Does RLVR actually improve mathematical reasoning or just coherence?). Outside math, where you can't cheaply audit a final answer, that gap between "looks right" and "is right" becomes much harder to police.
There's also a set of failure modes that get worse precisely when problems are hard or rewards are noisy, which is the regime non-math domains live in. Overly hard samples push models toward degenerate shortcuts that contaminate skills they already had (Do overly hard RLVR samples actually harm model capabilities?). The on-policy constraint drives "capability boundary collapse," where exploitation crowds out exploration and the model's problem-solving scope actually shrinks (Why does RLVR training narrow a model's problem solving ability?). And RL tends to converge onto a single dominant output format, suppressing the diversity that open-ended tasks depend on (Does RL training collapse format diversity in pretrained models?). The multimodal case is the cleanest cautionary tale: text-token RL and verbose chain-of-thought help reasoning but actively degrade fine-grained perception, because they optimize verbalization when the real bottleneck is visual attention. It is the right reward aimed at the wrong target (Does verbose chain-of-thought actually help multimodal perception tasks?).
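The source notes attribute the hard-sample failure to group-relative normalization, which treats a rare accidental success as a high-advantage trajectory. A small sketch shows the arithmetic, assuming the common mean-and-standard-deviation form of group normalization used in GRPO-style training. The function name, the use of the population standard deviation, and the epsilon value are illustrative assumptions, not details taken from the cited notes.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its own group's mean and spread.

    ASSUMPTION: population standard deviation plus a small epsilon;
    real implementations differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A nearly impossible problem: one lucky success out of 16 rollouts.
hard_group = [1.0] + [0.0] * 15
# A moderately hard problem: half of the 16 rollouts succeed.
medium_group = [1.0] * 8 + [0.0] * 8
# An impossible problem: no rollout succeeds, so there is no signal at all.
failed_group = [0.0] * 16

hard_adv = group_relative_advantages(hard_group)
medium_adv = group_relative_advantages(medium_group)
failed_adv = group_relative_advantages(failed_group)

if __name__ == "__main__":
    print(f"lucky success on hard problem:  {hard_adv[0]:.2f}")
    print(f"success on medium problem:      {medium_adv[0]:.2f}")
    print(f"any rollout on failed problem:  {failed_adv[0]:.2f}")
```

Under these assumptions the single lucky rollout receives an advantage of roughly sqrt(15), close to four times the weight of a success on the medium problem. Whatever that rollout happened to do, including answer repetition or skipped computation, is reinforced strongly.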
The interesting forward edge is that researchers are trying to engineer around the verifiability wall rather than accept it. One approach reuses cross-rollout variance as a self-supervised signal that both weights tokens and filters degenerate queries, reporting 2–3× faster and more stable training specifically on unverifiable tasks. That is a hint that the binary "is there a checkable answer" gate might be softened with internal statistics instead of external ground truth (Can one statistical measure serve dual purposes in RL training?). So the honest summary is layered: RLVR is limited beyond math and code partly because it only amplifies existing capability, partly because most domains lack the clean scalar feedback it needs, and partly because its characteristic failures intensify exactly where verification is weakest. The frontier work, though, is about replacing the missing verifier, not just lamenting its absence.
Sources (10 notes)
Does RLVR actually expand what models can reason about? Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
What does reward learning actually do to model reasoning? Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
What makes a research domain suitable for autonomous optimization? Autonomous research pipelines require immediate scalar metrics, modular architecture, fast iteration cycles, and version control. Domains lacking any property resist autoresearch regardless of LLM capability, because the bottleneck is environmental structure, not model power.
Does RLVR success on math benchmarks reflect genuine reasoning improvement? Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.
Does RLVR actually improve mathematical reasoning or just coherence? RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.
Do overly hard RLVR samples actually harm model capabilities? Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Why does RLVR training narrow a model's problem solving ability? RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.
Does RL training collapse format diversity in pretrained models? Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Does verbose chain-of-thought actually help multimodal perception tasks? Long rationales and text-token RL help reasoning but hurt fine-grained perception tasks because the actual bottleneck is visual attention allocation, not verbalization. Standard CoT optimization trains the wrong policy target.
Can one statistical measure serve dual purposes in RL training? DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.