INQUIRING LINE

What makes some tasks bounded enough for reliable RL?

This explores what properties make a task 'bounded' enough (verifiable, decomposable, closed-ended) that reinforcement learning produces reliable gains rather than noise or degradation.


Across the corpus, one answer keeps recurring: what matters is not the size of the task but whether you can cheaply tell right from wrong at each step. The cleanest illustration is verifiability. Execution-free code reasoning becomes a usable RL signal only once structured reasoning templates reach roughly 93% accuracy at checking whether two patches are equivalent. The source note treats that level as crossing the reliability threshold: below it the reward is too noisy to train on; above it, certain task classes (fault localization, patch equivalence) become RL-tractable (Can structured reasoning replace code execution for RL rewards?). Boundedness, in other words, is a property of the *verifier*, not just the problem.
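The verifier point can be made concrete with a little arithmetic. The sketch below assumes a symmetric noisy verifier (it labels a correct sample "pass" and an incorrect sample "fail" with the same probability); that symmetry and the function name `reward_gap` are illustrative assumptions, not something the source notes specify. Under that assumption, the expected reward separating a correct sample from an incorrect one is `2p - 1`, which shrinks to zero as accuracy approaches a coin flip.

```python
def reward_gap(verifier_accuracy: float) -> float:
    """Expected reward difference between a truly correct and a truly
    incorrect sample, assuming a symmetric noisy binary verifier."""
    p = verifier_accuracy
    expected_reward_if_correct = p          # verifier says "pass" with prob p
    expected_reward_if_incorrect = 1.0 - p  # verifier wrongly says "pass"
    return expected_reward_if_correct - expected_reward_if_incorrect


if __name__ == "__main__":
    for acc in (0.50, 0.60, 0.80, 0.93, 1.00):
        print(f"accuracy={acc:.2f}  reward gap={reward_gap(acc):.2f}")
```

At 93% accuracy the gap is about 0.86 of the ideal signal; at 60% it is only about 0.20, so most of the reward variance would come from the verifier rather than the policy.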

The second ingredient is decomposability. A task that looks impossibly long-horizon becomes bounded if you can shatter it into minimal subtasks, each small enough to check by voting. MAKER runs million-step tasks to zero errors this way and, strikingly, finds that small non-reasoning models suffice once decomposition is extreme enough (Can extreme task decomposition enable reliable execution at million-step scale?). The same instinct shows up in reasoning structured as recursive subtask trees, where bounding each step's working memory lets a single model sustain accuracy past its context limits (Can recursive subtask trees overcome context window limits?). Reliability isn't found in the whole task; it's manufactured by carving the task into pieces whose correctness is locally decidable.
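Why per-step voting matters at this scale can be sketched numerically. The code below uses plain majority voting over independent samples as a stand-in; it is not a reproduction of MAKER's actual voting scheme, and the independence assumption is exactly what correlated errors would break. The function names and the 0.99 per-sample accuracy are hypothetical.

```python
from math import comb


def majority_correct(p: float, votes: int) -> float:
    """Probability that a strict majority of `votes` independent samples
    is correct, when each sample is correct with probability p.
    Use an odd number of votes so ties cannot occur."""
    need = votes // 2 + 1
    return sum(
        comb(votes, k) * p**k * (1.0 - p) ** (votes - k)
        for k in range(need, votes + 1)
    )


def chain_success(p_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return p_step**steps


if __name__ == "__main__":
    steps = 1_000_000
    single = 0.99  # hypothetical accuracy of one sample on one minimal subtask
    voted = majority_correct(single, votes=11)
    print("no voting :", chain_success(single, steps))
    print("11 votes  :", chain_success(voted, steps))
```

Without voting, a 1% per-step error rate makes a million-step run essentially certain to fail; with voting on each minimal step, the per-step error falls far enough that the whole chain has a realistic chance of finishing cleanly.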

The domain itself also matters. When you compare structured domains (math, code: closed answers) against creative ones (open-ended generation), they pull entropy in opposite directions: structured training systematically *decreases* output entropy toward a correct attractor, while creative training increases it. Train them in the wrong order and the structured collapse damages the open-ended skills (Does training order reshape how models handle different task types?). So 'bounded' partly means 'has a low-entropy correct answer the model can converge onto', which is exactly why RL behaves differently on essays than on equations.
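For readers who want the entropy claim pinned down, here is a minimal sketch of the quantity involved: Shannon entropy of an output distribution. The two distributions are made up for illustration (one concentrated on a single answer, one spread evenly); they are not measurements from the source.

```python
import math


def shannon_entropy(probs) -> float:
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


if __name__ == "__main__":
    # Hypothetical distributions over four candidate outputs.
    after_structured_training = [0.97, 0.01, 0.01, 0.01]  # mass on one answer
    after_creative_training = [0.25, 0.25, 0.25, 0.25]    # mass spread out
    print("structured:", round(shannon_entropy(after_structured_training), 3), "bits")
    print("creative  :", round(shannon_entropy(after_creative_training), 3), "bits")
```

A distribution that has converged onto one answer carries very little entropy, which is the sense in which a bounded task gives the policy something to collapse toward.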

But boundedness has a ceiling worth knowing about. Even on perfectly verifiable tasks, RLVR doesn't seem to expand what a model can do: pass@k analysis shows it sharpens sampling toward solutions already latent in the base model rather than teaching genuinely new reasoning (Does RLVR actually expand what models can reason about?). And the reward shape can quietly betray you: binary correct/incorrect rewards train confident wrong answers because nothing penalizes confident errors, until you add a proper scoring rule like Brier (Does binary reward training hurt model calibration?). A task can be bounded and still teach the wrong lesson if the reward is mis-specified.
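Both halves of this paragraph can be sketched in a few lines. The first function is the standard unbiased pass@k estimator computed from n samples with c correct; the sample counts fed to it are invented to show the pattern, not taken from the cited work. The second function compares a binary reward with a binary-minus-Brier reward in expectation; the exact combination `correct - (confidence - correct)^2` is an assumption about how the Brier term is added.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


def expected_reward(p_correct: float, confidence: float, use_brier: bool) -> float:
    """Expected reward when the answer is right with probability p_correct
    and the model reports `confidence`. Binary reward ignores confidence;
    the Brier variant subtracts the expected squared calibration error."""
    binary = p_correct
    if not use_brier:
        return binary
    brier = p_correct * (confidence - 1.0) ** 2 + (1.0 - p_correct) * confidence**2
    return binary - brier


if __name__ == "__main__":
    # Made-up correct-sample counts (out of n=100) on two problems.
    base_counts = [2, 2]   # base model: rarely right, but on both problems
    rl_counts = [60, 0]    # RL model: often right on one, never on the other
    for k in (1, 64):
        print(k, mean_pass_at_k(base_counts, 100, k), mean_pass_at_k(rl_counts, 100, k))

    # With a 30% chance of being right, which stated confidence pays best?
    grid = [i / 10 for i in range(11)]
    print(max(grid, key=lambda q: expected_reward(0.3, q, use_brier=True)))
```

In the invented numbers, the RL model wins at k=1 and the base model wins at k=64, which is the signature the pass@k argument relies on. On the reward side, the binary reward pays the same whatever confidence is reported, while the Brier variant is maximized by reporting the true probability of being right.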

The encouraging counterweight is that boundedness can be *engineered* into territory that looks unbounded. Modified DAPO training roughly doubled SWE-bench Verified performance (from 20% to 39% on Qwen2.5-72B) on genuinely stateful, multi-turn software tasks with delayed rewards (Can reinforcement learning scale beyond single-turn language tasks?), and GRPO-RoC got a 14B model to frontier math by filtering noisy positive trajectories while keeping diverse failures as signal, essentially cleaning the reward channel rather than the task (Why do correct code trajectories teach models to tolerate errors?). The less obvious takeaway: 'bounded enough for RL' is rarely a fixed fact about a task. It's a verifier you can build, a decomposition you can impose, and a reward you can shape, and where you can't build any of those, RL stalls no matter how simple the problem looks.
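The idea of cleaning the reward channel can be illustrated with a small sketch in the spirit of that filtering step. Everything here is an assumption for illustration: the `Trajectory` fields, the use of tool-call errors as the quality signal, the half-and-half split between positives and negatives, and the helper names. It is not the published GRPO-RoC procedure. The advantage computation is the usual group-normalized form (reward minus group mean, divided by group standard deviation).

```python
import random
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Trajectory:
    reward: float     # 1.0 if the final answer verified as correct, else 0.0
    tool_errors: int  # hypothetical quality signal: failed tool calls in the rollout


def filter_group(rollouts, keep: int, rng: random.Random):
    """Keep up to `keep` rollouts: the cleanest positives (fewest tool errors)
    plus a uniform random sample of failures, so failures stay diverse."""
    positives = sorted((t for t in rollouts if t.reward > 0), key=lambda t: t.tool_errors)
    negatives = [t for t in rollouts if t.reward <= 0]
    n_neg = min(len(negatives), keep // 2)  # assumed split, not from the source
    n_pos = min(len(positives), keep - n_neg)
    return positives[:n_pos] + rng.sample(negatives, n_neg)


def group_advantages(group, eps: float = 1e-6):
    """Group-normalized advantages: (reward - group mean) / (group std + eps)."""
    rewards = [t.reward for t in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    rollouts = [Trajectory(1.0, e) for e in (0, 3, 1, 5)]
    rollouts += [Trajectory(0.0, e) for e in (2, 0, 4, 1)]
    kept = filter_group(rollouts, keep=4, rng=rng)
    for traj, adv in zip(kept, group_advantages(kept)):
        print(traj, round(adv, 3))
```

The design choice worth noticing is the asymmetry: correct rollouts are ranked and pruned so that sloppy successes are not rewarded, while failures are sampled without ranking so the policy still sees the full variety of ways to go wrong.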


Sources (8 notes)

Can structured reasoning replace code execution for RL rewards?

Semi-formal reasoning templates enable execution-free patch equivalence verification at 93% accuracy on real agent code, crossing the reliability threshold needed for RL reward signals. This makes execution-free verification viable for certain task classes like fault localization and code reasoning.

Can extreme task decomposition enable reliable execution at million-step scale?

MAKER solves million-step tasks with zero errors by decomposing into minimal subtasks, applying voting at each step, and flagging correlated errors. Surprisingly, small non-reasoning models suffice when decomposition is extreme enough, inverting the standard approach to hard problems.

Can recursive subtask trees overcome context window limits?

The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can reinforcement learning scale beyond single-turn language tasks?

Modified DAPO training doubled SWE-bench Verified performance from 20% to 39% on Qwen2.5-72B, matching larger models. This demonstrates RL works in stateful multi-step environments with delayed rewards and complex feedback, beyond theoretical single-turn MDPs.

Why do correct code trajectories teach models to tolerate errors?

GRPO-RoC filters positive trajectories for quality while preserving diverse failures as negative signal, allowing a 14B model to reach frontier math performance in 510 RL steps, surpassing much larger models with cleaner reasoning.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher auditing RL-task-boundedness claims from a curated 2024–2026 library. The question remains open: what makes some tasks bounded enough for reliable RL?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–09 through 2026–05. Key constraints the library identified:

• Verifiability is the hard gate: execution-free code reasoning requires ~93% accuracy in structured checkers before RL reward becomes usable; below that, noise overwhelms training (~2024–09).
• Decomposability manufactures boundedness: extreme subtask splitting (voting, recursive trees, microagents) enables million-step tasks; base model scale becomes secondary (~2025–07, ~2025–11).
• Entropy dynamics differ sharply by domain: structured tasks (math, code) collapse toward a correct attractor; creative tasks expand entropy; training order matters (~2025–07).
• RL sharpens, not expands: pass@k analysis shows RL refines latent solutions rather than teaching new reasoning; base model capability is the ceiling (~2025–04).
• Reward mis-specification is silent: binary rewards train confident errors; proper scoring rules (Brier) are necessary (~2024–09).
• Long-horizon multi-turn tasks (SWE-bench) become tractable via trajectory filtering and asymmetric reward shaping (~2025–08).

Anchor papers (verify; mind their dates):
• arXiv:2504.13837 (2025–04) — Does RL expand reasoning beyond base model?
• arXiv:2511.09030 (2025–11) — Solving million-step tasks via decomposition.
• arXiv:2508.03501 (2025–08) — Multi-turn software engineering with RL.
• arXiv:2409.15360 (2024–09) — Reward robustness and calibration.

Your task:
(1) RE-TEST EACH CONSTRAINT. For verifiability, decomposability, entropy dynamics, capability expansion, and reward shape: has newer model scale, synthetic-verifier training, or orchestration (memory continuity, caching) since relaxed these limits? Separate the durable question ("is boundedness engineered or intrinsic?") from perishable claims ("93% is the threshold"; "RL cannot teach new reasoning"). Name what loosened each constraint or confirm it still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months, especially any paper that shows RL genuinely expanding reasoning, verifiers breaking the ~93% floor, or entropy dynamics reversed.
(3) Propose 2 research questions that ASSUME the regime shifted: e.g., "If synthetic verifiers can reach 98%+ via meta-training, does that unlock stateful tasks?" or "Does continuous memory update (arXiv:2605.12978) alter decomposability requirements?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
