INQUIRING LINE

How does behavior cloning reduce complexity before RL training in rerankers?

This reads 'behavior cloning' as the imitation/supervised warm-start phase that runs before reinforcement learning, and asks how copying demonstrated behavior shrinks the search space RL then has to optimize over. The corpus doesn't speak to rerankers specifically, but it covers the imitation-before-RL pattern in depth.


This explores how a behavior-cloning (imitation) phase makes the later RL phase tractable. The collection has no note on rerankers as such, but the underlying mechanic shows up clearly in the work on curriculum and warm-start training. The cleanest statement of it is the finding that running supervised imitation first, then RL against verifiable rewards, beats either method alone: the imitation phase exists precisely to create 'reasonable rollouts the RL phase can then sharpen' (see: Does sequencing imitation then exploration training improve reasoning?). That's the complexity reduction in one sentence: outcome rewards are nearly useless when a fresh policy almost never produces a good trajectory to reward, so cloning demonstrated behavior lifts the policy into a region where the reward signal actually carries information.
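To make the "nearly useless" point concrete, here is a minimal sketch under stated assumptions: rewards are binary, rollouts are independent, and advantages are computed against a group-relative baseline. None of these details, nor the function name or the group size of 8, come from the notes; they are illustrative choices. Under those assumptions, a group of rollouts only yields a learning signal when its outcomes disagree.

```python
def informative_group_prob(p_success: float, group_size: int) -> float:
    """Probability that a group of rollouts has mixed outcomes.

    Assumes binary rewards and independent rollouts. With a group-relative
    baseline, an all-fail or all-pass group has zero advantage everywhere,
    so only mixed groups contribute a gradient.
    """
    all_fail = (1.0 - p_success) ** group_size
    all_pass = p_success ** group_size
    return 1.0 - all_fail - all_pass


if __name__ == "__main__":
    # Hypothetical success rates: a fresh policy vs. one after cloning.
    for p in (0.001, 0.01, 0.3):
        share = informative_group_prob(p, group_size=8)
        print(f"p={p}: share of informative groups = {share:.3f}")
```

In this toy model, raising the per-rollout success rate is what turns most sampled groups from silent into informative, which is the role the notes assign to the imitation phase.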

You can see why this matters by looking at what happens when RL has to do the exploring on its own with sparse signal. Training on problems that are too hard for the current policy doesn't teach reasoning; it teaches degenerate shortcuts, because group-relative normalization treats the rare accidental success as a high-advantage trajectory and reinforces answer-repetition and computation-skipping (see: Do overly hard RLVR samples actually harm model capabilities?). Behavior cloning heads this off by raising the baseline competence so that 'success' is common enough to be meaningful rather than accidental. In a reranker, the analogous move would be cloning a teacher's ordering decisions so the policy starts from sensible rankings and RL only has to refine the margins.
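The effect of group-relative normalization on a rare success can be sketched in a few lines. This is an illustration, not the exact procedure of any method in the notes: it assumes binary rewards, standardization by the group's population standard deviation, and a small `eps` for numerical safety. The function name and group sizes are hypothetical.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    rare = [1.0] + [0.0] * 15        # one accidental success in 16 rollouts
    common = [1.0] * 8 + [0.0] * 8   # success is routine
    print("advantage of the rare success:  ", round(group_relative_advantages(rare)[0], 2))
    print("advantage of a common success:  ", round(group_relative_advantages(common)[0], 2))
```

With binary rewards, the standardized advantage of a success works out to sqrt((1 - p) / p), where p is the fraction of successes in the group, so the rarer the success, the harder that single trajectory is pushed, regardless of whether it was reached by sound reasoning or by accident.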

The other half of the answer is about what cloning preserves that pure RL destroys. RL reliably compresses behavioral diversity: search agents converge on narrow reward-maximizing strategies through the same entropy collapse seen in reasoning, and it's specifically supervised fine-tuning on diverse demonstrations that keeps exploration breadth alive (see: Does reinforcement learning squeeze exploration diversity in search agents?). Relatedly, RL tends to collapse onto a single dominant format inherited from pretraining within the first epoch (see: Does RL training collapse format diversity in pretrained models?). So cloning isn't just a head start; it's a way of installing the variety of good behaviors before RL's narrowing pressure kicks in.

There's also a structural reason the warm-start is cheap to exploit. RL doesn't rewrite the whole network: it updates only 5–30% of parameters, in sparse but nearly full-rank subnetworks that are nearly identical across seeds (see: Does reinforcement learning update only a small fraction of parameters?). That fits the division of labor: behavior cloning sets the bulk of the competent behavior, and RL makes a small, targeted adjustment on top. The two-phase view reinforces this. RL itself proceeds from procedural mastery first to strategic refinement second (see: Does RL training follow a predictable two-phase learning sequence?), and cloning essentially pre-pays the procedural-mastery phase so RL can spend its budget on the strategic part.
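One way to check the "small, targeted adjustment" picture on your own checkpoints is to count how many parameters actually moved during RL. The notes do not say how the 5–30% figure was measured, so the sketch below is an assumption: it compares two flattened weight lists element by element, and `update_fraction`, `tol`, and the toy values are all hypothetical.

```python
def update_fraction(before: list[float], after: list[float], tol: float = 0.0) -> float:
    """Fraction of parameters whose value changed by more than tol."""
    if len(before) != len(after):
        raise ValueError("checkpoints must have the same number of parameters")
    if not before:
        raise ValueError("checkpoints are empty")
    changed = sum(1 for b, a in zip(before, after) if abs(a - b) > tol)
    return changed / len(before)


if __name__ == "__main__":
    # Toy flattened weights (hypothetical values, not from any real model).
    bc_checkpoint = [0.5, -1.2, 0.0, 0.8, 0.3, -0.7, 1.1, 0.0, 0.2, -0.4]
    rl_checkpoint = [0.5, -1.2, 0.0, 0.9, 0.3, -0.7, 1.1, 0.0, 0.1, -0.4]
    print(f"fraction updated: {update_fraction(bc_checkpoint, rl_checkpoint):.0%}")
```

For a real model you would flatten each tensor and run the same comparison per layer; a nonzero `tol` can be used to ignore changes that are only numerical noise.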

The thing you might not have expected: the value of behavior cloning isn't mainly that it saves compute; it's that it makes the reward signal *legible*. A reward you can't earn teaches nothing, or worse, teaches a shortcut. If you want the sharper contrast between imitation-foundation and reward-refinement, the curriculum note (Does sequencing imitation then exploration training improve reasoning?) is the doorway; for the failure mode it prevents, start with the overly-hard-samples note (Do overly hard RLVR samples actually harm model capabilities?).


Sources (6 notes)

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about behavior cloning as a complexity-reduction phase before RL training. The question remains: does imitation-then-refinement still hold as the dominant paradigm, or have newer methods (e.g., direct RL initialization, in-context adaptation, or hybrid curricula) since eroded or unified the two-phase picture?

What a curated library found — and when (findings span 2024–2026; treat as dated claims, not current truth):
• Supervised imitation first, then RL against verifiable rewards, beats either method alone; cloning lifts the policy into a region where reward signal carries information (2024–2025)
• RL on overly hard samples with sparse rewards induces degenerate shortcuts (answer-repetition, computation-skipping); behavior cloning raises baseline competence so success is common enough to be meaningful (2024)
• RL training collapses behavioral diversity; supervised fine-tuning on diverse demonstrations preserves exploration breadth (2025)
• RL post-training converges on a single dominant pretraining distribution format within the first epoch (2025)
• RL updates only 5–30% of parameters in sparse, nearly full-rank subnetworks; cloning sets bulk competence, RL makes targeted adjustment (2025)
• Two-phase RL dynamics: procedural mastery first, strategic refinement second; cloning pre-pays procedural phase (2025)

Anchor papers (verify; mind their dates):
• arXiv:2402.05808 (2024-02) — Reverse Curriculum RL
• arXiv:2409.15360 (2024-09) — Reward-Robust RLHF
• arXiv:2605.22817 (2026-05) — Vector Policy Optimization; training for diversity
• arXiv:2605.28388 (2026-05) — Sample difficulty in RLVR

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, assess whether newer model scales, better initialization schemes (e.g., instruct-tuned or multi-task pretrained base), modern RL frameworks (PPO variants, DPO, or end-to-end differentiable reward learning), or improved evaluation harnesses have since RELAXED the two-phase requirement or shown single-phase RL can match cloning+RL. Separate the durable question (does RL still collapse diversity? does reward sparsity still induce shortcuts?) from the perishable limitation (must we clone first?). Cite what relaxed it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially any claiming end-to-end RL without explicit imitation, or unified loss formulations that dissolve the boundary.
(3) Propose 2 research questions that assume the regime may have shifted: (a) Under what model-size or reward-quality thresholds does cloning become optional? (b) Can in-context exemplars or prompt-level curriculum replace behavior-cloning pretraining?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
