INQUIRING LINE

At what capability level does the generation-verification gap make intrinsic rewards insufficient?

This explores the boundary condition where a model's ability to check its own answers stops outpacing its ability to produce them — the point past which a model training on its own internal reward signal can no longer improve and needs outside help.


This is really a question about a single asymmetry: can a model judge a candidate answer more reliably than it can generate one? When verification is the easier half of that pair (checkable math, code that runs, problems where wrong answers are obvious), intrinsic signals like self-consistency carry a model a long way. The generation-verification gap makes intrinsic rewards insufficient precisely at the frontier where that asymmetry closes: where the hardest problems a model faces are ones it cannot verify any better than it can solve. "Can models reliably improve themselves without external feedback?" makes this the load-bearing point: pure self-improvement stalls there, and every method that keeps working quietly smuggles in an external anchor, such as a frozen past version, a third-party judge, a user correction, or a tool that returns ground truth. The 'capability level' in the question isn't a fixed model size; it's wherever a given problem sits relative to that model's own verification ceiling.
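To make the self-consistency signal concrete, here is a minimal sketch, assuming the simplest possible form: sample several answers to one prompt and reward each answer for agreeing with the majority. The function name and the sample answers are hypothetical, chosen only for illustration.

```python
from collections import Counter

def self_consistency_rewards(answers):
    """Give each sampled answer 1.0 if it matches the majority answer, else 0.0.

    Assumption: answers are already normalized strings, so equality is a
    fair test of agreement. No ground truth is consulted anywhere.
    """
    if not answers:
        return []
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

# Hypothetical samples for a problem the model can effectively verify:
# the majority answer is the right one, so the reward points the right way.
easy_samples = ["42", "42", "41", "42"]

# Hypothetical samples for a problem past the verification ceiling:
# suppose the true answer is "9". The consensus is wrong, and the
# intrinsic reward reinforces it anyway.
hard_samples = ["7", "7", "7", "9"]

print(self_consistency_rewards(easy_samples))
print(self_consistency_rewards(hard_samples))
```

In the second case the only correct sample is the one that gets penalized, which is the failure mode the paragraph above describes: without an external anchor, agreement stands in for correctness.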

What sharpens this is that even *external* verifiable rewards don't buy you new capability, so intrinsic ones certainly can't. "Does RLVR actually expand what models can reason about?" shows via pass@k that base models overtake RLVR-trained models at high k: RLVR narrows sampling toward solutions already living in the base model's distribution rather than expanding the set of solvable problems. "What does reward learning actually do to model reasoning?" drives it home: a single example can trigger the gains, and *spurious* rewards work nearly as well as correct ones for models with appropriate pretraining. That's the giveaway: if a wrong reward and a right reward produce nearly the same lift, the reward isn't teaching anything; it's activating pretraining. So an intrinsic reward, which at best approximates a correct external one, is structurally capped at re-sorting what the model already knows. The moment a task requires reasoning patterns outside that base distribution, you've left what any self-generated signal can reach, and only genuine transfer (distillation) crosses that line.
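The pass@k argument can be sketched with the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c is how many of them were correct. The per-problem counts below are invented purely to illustrate the shape of the crossover; they are not results from any of the cited notes.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k samples is correct,
    given that c of n drawn samples were correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k must include a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

N = 100  # samples per problem (assumed)

# Hypothetical counts of correct samples on four problems.
# The base model rarely succeeds, but succeeds somewhere on every problem.
base_correct = [2, 1, 3, 1]
# The tuned model is far more reliable on two problems and never
# solves the other two.
tuned_correct = [60, 0, 70, 0]

for k in (1, 64):
    print(k, mean_pass_at_k(base_correct, N, k), mean_pass_at_k(tuned_correct, N, k))
```

With these made-up counts the tuned model wins at k = 1 and the base model wins at k = 64, which is the pattern read as "narrowing" rather than "expanding": the tuned model concentrates probability on problems it could already solve and loses coverage of the rest.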

The interesting escape hatch is to raise the verification side of the gap rather than accept it as fixed. "Can reward models benefit from reasoning before scoring?" and "Can generative reasoning beat discriminative models with less training data?" both show that letting the judge *reason* before it scores, spending test-time compute on evaluation, lifts the verification ceiling well past what a snap outcome-judgment achieves (a 1.5B GenPRM beats GPT-4o, and ThinkPRM surpasses full-dataset discriminative verifiers using only 1% of the PRM800K labels). That reframes the answer to the question: intrinsic rewards become insufficient not at some absolute capability level but wherever generation has outrun a *non-reasoning* verifier. Make the verifier think, and you push the insufficiency threshold higher.

A few notes complicate the picture in useful ways. "Does binary reward training hurt model calibration?" shows that a crude intrinsic-style signal (binary correctness) doesn't just plateau: it actively pushes the model toward confident guessing, because it never punishes confident wrong answers. And "Can scalar rewards capture all the information in agent feedback?" argues that a scalar reward throws away the *directive* half of feedback (how to change) and keeps only the *evaluative* half (how good), so part of what makes self-rewarding insufficient is that the reward format itself is lossy, independent of capability. The takeaway a reader might not expect: 'intrinsic rewards run out' is less a wall at a fixed skill level and more a relationship you can renegotiate, whether by anchoring to something external, by making the verifier reason, or by using a richer feedback signal than a single number.
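The calibration point can be shown with a small sketch. It assumes one plausible way of adding a Brier term to a binary reward, namely correctness minus the squared error of the model's stated confidence; the exact formulation used in the cited note is not given here, and the function names are hypothetical.

```python
def binary_reward(correct, confidence):
    """Binary correctness reward: the stated confidence is ignored entirely."""
    return 1.0 if correct else 0.0

def brier_augmented_reward(correct, confidence):
    """Correctness minus the Brier penalty (confidence - outcome)^2.

    Assumption: confidence is a probability in [0, 1] reported by the model
    alongside its answer.
    """
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2

# A wrong answer stated with high versus low confidence.
for confidence in (0.9, 0.1):
    print(
        confidence,
        binary_reward(False, confidence),
        brier_augmented_reward(False, confidence),
    )
```

Under the binary reward both wrong answers score the same, so nothing discourages overclaiming. With the Brier term the confident wrong answer scores strictly lower than the hesitant one, so the model is paid for reporting confidence that matches its actual hit rate.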


Sources (7 notes)

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Can generative reasoning beat discriminative models with less training data?

GenPRM and ThinkPRM reframe process supervision as generative tasks with CoT reasoning before judgment, achieving superior performance on far fewer labels. A 1.5B GenPRM beats GPT-4o; ThinkPRM uses only 1% of PRM800K labels to surpass full-dataset discriminative verifiers.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Next inquiring lines