Does random tree expansion depth affect process supervision granularity?
This explores whether the depth at which a randomly-branching reasoning tree expands changes how fine-grained the 'process supervision' (step-by-step feedback) you get out of it is. The corpus says yes, and that mapping is essentially free.
This explores whether the depth at which a randomly-branching reasoning tree expands changes how fine-grained the step-by-step feedback signal becomes. The most direct note in the corpus says yes, with a twist worth knowing: the granularity isn't something you schedule or pay annotators for; it falls out of the sampling structure itself. In Tree-GRPO, early branches in the tree sit near the start of a reasoning trajectory and naturally produce coarse, strategy-level signals, while late branches sit deep in the trajectory and produce fine-grained, detail-level supervision. So expansion depth doesn't just *affect* granularity, it *is* the dial for it, and the dial turns by itself (see "Does tree depth automatically produce supervision at multiple granularities?").
The deeper trick underneath is how tree structure converts a single end-of-trajectory reward into many step-level signals. Because sibling subtrees share a common prefix and diverge afterward, comparing their outcomes tells you which step caused the divergence. That turns one outcome reward into step-wise preference data without ever training a separate process reward model (see "Can tree structure alone convert outcome rewards into process supervision?"). AlphaLLM makes the same move from a different angle: MCTS rankings over solution paths, plus three critic models, produce dense process-level quality signals that stand in for human step labels (see "Can tree search replace human feedback in LLM training?").
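The sibling comparison described above can be sketched in a few lines. This is a minimal illustration, not Tree-GRPO's actual implementation: the `Node` class, the averaging rule in `value`, and the toy tree are all assumptions made for the example. It shows how leaf-only outcome rewards yield a preferred and a rejected step at every branch point, tagged with the depth at which the siblings diverge.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    """One reasoning step (hypothetical structure); only leaves carry a reward."""
    step: str
    reward: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


def value(node: Node) -> float:
    """A leaf's value is its outcome reward; an inner node averages its children."""
    if not node.children:
        return float(node.reward)
    child_values = [value(c) for c in node.children]
    return sum(child_values) / len(child_values)


def sibling_preferences(node: Node, depth: int = 0) -> Iterator[Tuple[int, str, str]]:
    """Yield (depth, preferred_step, rejected_step) for sibling pairs whose values differ."""
    ranked = sorted(node.children, key=value, reverse=True)
    for i, better in enumerate(ranked):
        for worse in ranked[i + 1:]:
            if value(better) > value(worse):
                yield depth, better.step, worse.step
    for child in node.children:
        yield from sibling_preferences(child, depth + 1)


# Toy tree: two strategies branch early, two executions branch late.
root = Node("question", children=[
    Node("plan: use algebra", children=[
        Node("expand correctly", reward=1.0),
        Node("sign error", reward=0.0),
    ]),
    Node("plan: guess and check", children=[
        Node("try x=1", reward=0.0),
        Node("try x=2", reward=0.0),
    ]),
])

prefs = list(sibling_preferences(root))
for depth, preferred, rejected in prefs:
    print(f"depth {depth}: prefer {preferred!r} over {rejected!r}")
```

In this toy tree the depth-0 pair compares whole strategies and the depth-1 pair compares individual execution steps, which is the coarse-to-fine pattern the notes describe. Siblings with equal value (the two failed guesses) produce no preference at all.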
What makes this an Inquiring Line rather than a single-paper answer is that tree depth is only *one* structural feature you can exploit for the same trick. A synthesis note in the corpus lines up three siblings: Tree-GRPO reads tree topology, Supervised RL reads expert-aligned actions, and ToolPO reads tool-call positions. Each squeezes dense step signal out of sparse outcomes using whatever structure its trajectories happen to have (see "Can trajectory structure replace hand-annotated process rewards?"). Depth-of-branching is the tree's particular handle; position and alignment are the others'.
And if you don't have a tree at all, you can still manufacture granularity by manipulating *where* reasoning starts rather than where it branches. Reverse-curriculum RL (R3) slides the start state backward from near-completion, so each curriculum stage exposes a different step's failure mode, reaching process-supervision granularity from pure outcome feedback, no tree required (see "Can curriculum learning approximate expensive process supervision?"). Read together, these notes reframe your question: granularity of supervision is less about how deep you dig and more about which axis of structure (depth, position, or start-point) you let the geometry expose for free.
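The sliding start state can be made concrete with a small sketch. It assumes a demonstration is available as a list of steps and that stages advance strictly one step at a time; R3's real schedule may differ, and the rollout and reward code are deliberately left out. The function name `reverse_curriculum` and the demo steps are hypothetical.

```python
from typing import Iterator, List, Tuple


def reverse_curriculum(demo_steps: List[str]) -> Iterator[Tuple[int, List[str], int]]:
    """Yield (stage, given_prefix, steps_left_to_generate), starting near completion.

    At each stage the model is handed `given_prefix` from the demonstration and
    must produce the rest itself; only the final outcome is rewarded.
    """
    n = len(demo_steps)
    for stage, start in enumerate(range(n - 1, -1, -1), start=1):
        yield stage, demo_steps[:start], n - start


demo = ["parse problem", "set up equation", "solve for x", "state answer"]
stages = list(reverse_curriculum(demo))

for stage, prefix, remaining in stages:
    print(f"stage {stage}: {len(prefix)} steps given, {remaining} to generate")
```

At stage 1 an outcome failure can only come from the last step; each later stage adds exactly one more self-generated step, so a new failure at stage k points at the step that was just exposed. That is how outcome-only feedback ends up localized to individual steps.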
One adjacent thread worth a doorway: even when you *have* fine-grained signal, how you read it matters. Step-level confidence filtering catches reasoning breakdowns that whole-trace averaging hides, which is the consumer-side mirror of this whole question: fine-grained supervision is only as useful as your willingness to act locally on it (see "Does step-level confidence outperform global averaging for trace filtering?").
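A toy comparison shows why local reading matters. The per-step confidence values, the window size, and the 0.6 threshold below are invented for illustration; the notes do not specify how confidence is computed or which threshold is used.

```python
from typing import List


def global_confidence(step_conf: List[float]) -> float:
    """Whole-trace average of per-step confidence."""
    return sum(step_conf) / len(step_conf)


def lowest_window_confidence(step_conf: List[float], window: int = 2) -> float:
    """Minimum mean confidence over any run of `window` consecutive steps."""
    if len(step_conf) <= window:
        return global_confidence(step_conf)
    return min(
        sum(step_conf[i:i + window]) / window
        for i in range(len(step_conf) - window + 1)
    )


# Hypothetical traces: one uniformly moderate, one confident with a mid-trace collapse.
steady = [0.75, 0.75, 0.75, 0.75, 0.75, 0.75]
breakdown = [0.95, 0.95, 0.40, 0.40, 0.95, 0.95]

THRESHOLD = 0.6  # assumed cutoff, not taken from the source notes

for name, trace in [("steady", steady), ("breakdown", breakdown)]:
    keep_global = global_confidence(trace) >= THRESHOLD
    keep_local = lowest_window_confidence(trace) >= THRESHOLD
    print(f"{name}: keep by global average={keep_global}, keep by local window={keep_local}")
```

Here the trace with the collapse has the higher global average, so whole-trace averaging would rank it above the steady trace, while the windowed minimum flags it. A windowed score can also be checked while a trace is still being generated, which is what makes early stopping possible.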
Sources (6 notes)
Tree-GRPO's random expansion strategy naturally produces supervision at varying granularities—early branches provide coarse strategy-level signals while late branches provide fine-grained detail supervision. This multi-resolution signal emerges from sampling structure alone, without annotation effort or granularity scheduling.
Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.
AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.
Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.
R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, suggesting that trace quality matters more than quantity.