INQUIRING LINE

Why does exploration quality matter more than learner network depth?

This line follows a tension the corpus keeps circling: whether the *behavior* a model is trained into (how widely and how messily it searches) does more work than the raw architectural capacity (layers) you give it.


The corpus suggests that ranking exploration quality above network depth is partly a trap, because the two aren't really competitors: depth's gains largely *route through* exploration. The clearest case is self-supervised RL, where scaling up to 1000 layers produces sudden capability jumps (walking at depth 16, wall-climbing at depth 256), but the authors attribute these to *synergistic* gains in exploration and expressivity, not to depth alone (Does network depth unlock qualitatively new behaviors in RL?). Depth matters precisely because it buys better search, not in spite of it. So 'depth vs. exploration' often resolves into 'depth as one way to fund exploration.'

Exploration is the load-bearing variable because it is the thing that breaks most easily, and it breaks independently of how big the network is. Reinforcement learning reliably *collapses* behavioral diversity: policies converge on a few reward-maximizing strategies through entropy collapse, in search agents just as in reasoning, and no amount of depth fixes a policy that has stopped exploring (Does reinforcement learning squeeze exploration diversity in search agents?). LLMs are also surprisingly poor explorers on their own. In simple multi-armed-bandit tasks, only GPT-4 combined with explicit exploratory hints, *external* history summarization, and chain-of-thought reasoning explores adequately, because the model can't reliably aggregate its own interaction history (Why do LLMs struggle with exploration in simple decision tasks?). There's even a mechanistic story for the failure: uncertainty signals dominate the early transformer blocks while the long-horizon 'empowerment' signals only emerge in the middle blocks, so models commit to a choice before the exploration-relevant representations have formed (Why do large language models explore less effectively than humans?).
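To make 'external history summarization' concrete, here is a minimal sketch of what such a helper might look like: it collapses a raw interaction log into per-arm statistics that could be placed in a prompt. The function names, the log format, and the wording of the summary are all assumptions made for illustration; the source note does not specify how the summarization was implemented.

```python
from collections import defaultdict


def summarize_history(history):
    """Collapse a raw (arm, reward) log into per-arm pull counts and mean rewards."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for arm, reward in history:
        counts[arm] += 1
        totals[arm] += reward
    return {
        arm: {"pulls": counts[arm], "mean_reward": totals[arm] / counts[arm]}
        for arm in sorted(counts)
    }


def render_summary(summary, all_arms):
    """Format the summary as plain text, flagging arms that were never tried."""
    lines = []
    for arm in all_arms:
        if arm in summary:
            stats = summary[arm]
            lines.append(
                f"arm {arm}: pulled {stats['pulls']} times, "
                f"mean reward {stats['mean_reward']:.2f}"
            )
        else:
            lines.append(f"arm {arm}: never pulled")
    return "\n".join(lines)


# Hypothetical interaction log: (arm label, observed reward).
history = [("A", 1), ("B", 0), ("A", 0), ("A", 1), ("B", 0)]
summary = summarize_history(history)
print(render_summary(summary, all_arms=["A", "B", "C"]))
```

The point of the sketch is that the aggregation happens outside the model: the unexplored arm is stated explicitly rather than left for the model to infer from an unstructured transcript.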

What *does* move capability, then, is the structure and richness of the search itself. Training on the full messy search process (mistakes, backtracking, and dead ends serialized as text) yields 25% higher accuracy than training only on clean optimal trajectories, because the model learns an internal world-model for searching rather than memorizing one route (Does training on messy search processes improve reasoning?). Forcing breadth helps too: at large budgets, allocating test-time compute across diverse *abstractions* beats simply sampling more solutions in parallel, because depth-only reasoning chains fall into an 'underthinking' rut (Can abstractions guide exploration better than depth alone?). And tree search (MCTS) can manufacture the dense quality signal that exploration needs without any human annotation, letting a model improve by ranking its own paths (Can tree search replace human feedback in LLM training?).
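The difference between a 'messy' search trace and a clean trajectory can be sketched on a toy problem. The example below runs a depth-first search for a subset of numbers that sums to a target and logs every attempt, dead end, and backtrack as a line of text. The task, the function name `search_with_trace`, and the trace wording are hypothetical choices for illustration; they are not the serialization format used in the Stream of Search work, and the sketch assumes positive integers.

```python
def search_with_trace(numbers, target):
    """Depth-first subset-sum search that logs every step, including dead ends.

    Assumes all numbers are positive, so overshooting the target is final.
    """
    trace = []

    def dfs(start, chosen, total):
        if total == target:
            trace.append(f"goal {chosen} sums to {target}")
            return list(chosen)
        for i in range(start, len(numbers)):
            n = numbers[i]
            if total + n > target:
                trace.append(f"try {n}: {total + n} exceeds {target}, dead end")
                continue
            trace.append(f"try {n}: running total {total + n}")
            chosen.append(n)
            found = dfs(i + 1, chosen, total + n)
            if found is not None:
                return found
            chosen.pop()
            trace.append(f"backtrack from {n}")
        return None

    return dfs(0, [], 0), trace


solution, trace = search_with_trace([5, 4, 3], target=7)

# Messy training text: the whole search, mistakes included.
messy_example = "\n".join(trace)
# Clean training text: only the final route.
clean_example = " ".join(f"take {n}" for n in solution)

print(messy_example)
print(clean_example)
```

The messy text contains the failed attempt at 5 and the backtrack away from it, which is exactly the information a clean trajectory discards.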

A less obvious point: the supposed *cost* of exploration may be partly illusory. One analysis finds the exploration–exploitation trade-off is a measurement artifact of looking at things token by token; at the hidden-state level the two are nearly uncorrelated, and a method that enhances both at once reports 21.4% accuracy gains on Gaokao 2024 (Is the exploration-exploitation trade-off actually fundamental?). Pair that with the bandit result that smarter uncertainty handling (separating what's genuinely unknown from irreducible noise) cuts the interactions needed by 29% (Can neural networks explore efficiently at recommendation scale?), and the picture sharpens: better exploration is often *free capability* you can unlock by changing how a fixed model searches, whereas adding depth is a more expensive lever and, as the tiny-model results suggest (Does depth matter more than width for tiny language models?), a more situational one.
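The distinction between 'genuinely unknown' and 'irreducible noise' is easiest to see in classic Thompson sampling, which the bandit note builds on. The sketch below is a textbook Beta-Bernoulli version, not the ENR method itself: the Beta posterior over each arm's success rate stands in for epistemic uncertainty (it narrows as data arrives), while the coin-flip reward is aleatoric noise that no amount of data removes. The function name, arm probabilities, and seed are assumptions for illustration.

```python
import random


def thompson_sampling(true_probs, steps, seed=0):
    """Beta-Bernoulli Thompson sampling; returns how often each arm was pulled."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    # Beta(1, 1) prior per arm: the posterior encodes what is still unknown.
    successes = [1] * n_arms
    failures = [1] * n_arms
    pulls = [0] * n_arms

    for _ in range(steps):
        # Draw one plausible success rate per arm from its posterior.
        samples = [rng.betavariate(successes[a], failures[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        # The reward itself is a noisy coin flip (aleatoric uncertainty).
        reward = 1 if rng.random() < true_probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


# Hypothetical arms with hidden success rates 0.2, 0.5, and 0.8.
pulls = thompson_sampling([0.2, 0.5, 0.8], steps=2000)
print(pulls)
```

Exploration here is driven only by posterior width: an arm gets retried while its success rate is still uncertain, and is abandoned once the posterior concentrates below the best arm, regardless of how noisy individual rewards are.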


Sources (10 notes)

Does network depth unlock qualitatively new behaviors in RL?

Scaling to 1000-layer networks in self-supervised RL produces dramatic capability jumps at specific thresholds—depth 16 enables walking, depth 256 enables wall-climbing—driven by synergistic gains in both exploration and expressivity rather than gradual improvement.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Why do LLMs struggle with exploration in simple decision tasks?

Across multi-armed bandit environments, only GPT-4 with explicit exploratory hints, external history summarization, and chain-of-thought reasoning achieves satisfactory exploration. Without external summarization, models cannot reliably track and aggregate unstructured interaction history to guide exploratory decisions.

Why do large language models explore less effectively than humans?

SAE decomposition shows uncertainty values dominate early transformer blocks while empowerment representations emerge only in middle blocks. This temporal mismatch causes models to commit to decisions before long-term exploration signals can influence them. Reasoning-trained o1 overcomes this by extending computation time.

Does training on messy search processes improve reasoning?

Stream of Search pretraining, which represents exploration and backtracking as serialized strings, achieves 25% higher accuracy than optimal-trajectory-only training. Models learn internal world models for search and adaptive strategies rather than fixed external methods.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.

Is the exploration-exploitation trade-off actually fundamental?

Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing the trade-off emerges only at token level. VERL demonstrates simultaneous enhancement achieving 21.4% accuracy gains on Gaokao 2024.

Can neural networks explore efficiently at recommendation scale?

ENR separates aleatoric from epistemic uncertainty, focusing computation only on parameter uncertainty needed for Thompson sampling. It improved click-through rates 9% and ratings 6% while requiring 29% fewer interactions than baselines.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
