Does network depth unlock qualitatively new behaviors in RL?
Can scaling neural network depth from shallow (2-5 layers) to very deep (on the order of 1000 layers) produce fundamental shifts in what self-supervised RL agents can learn, rather than just incremental improvements? This matters because it challenges the assumption that RL provides too little feedback to train large networks.
Most RL research uses shallow architectures (2-5 layers). Scaling network depth to 1024 layers in self-supervised RL produces 2x to 50x performance improvements, but not through gradual improvement. Instead, performance jumps sharply at critical depth thresholds that vary by environment. In the reported examples, depth 4 produces rudimentary policies (falling or throwing toward the target), depth 16 enables walking upright, depth 64 navigates simple mazes, and depth 256 produces entirely novel behaviors (leveraging body position to propel over walls, shifting into a seated posture to worm through obstacles).
The mechanism is a synergy between exploration and expressivity. A controlled experiment separates the two factors: a deep and a shallow "learner" network each train on data gathered by a separate "collector" network. When the collector is deep (rich exploration data), the deep learner substantially outperforms the shallow one, so expressivity matters. When the collector is shallow (poor exploration data), both learners perform equally poorly, so exploration constrains everything. Neither factor alone explains the gains; scaling depth enhances both simultaneously.
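The collector/learner design above is a 2x2 comparison, and the quantity of interest is the gap between deep and shallow learners with the collector held fixed. A minimal sketch of that bookkeeping follows. The function names and the success rates are hypothetical placeholders invented for illustration; they are not results from the source.

```python
from itertools import product

DEPTHS = ("shallow", "deep")


def design_cells():
    """All (collector, learner) pairings in the 2x2 design."""
    return list(product(DEPTHS, DEPTHS))


def expressivity_gap(success, collector):
    """Deep-learner minus shallow-learner success, holding the collector fixed."""
    return success[(collector, "deep")] - success[(collector, "shallow")]


# Placeholder success rates keyed by (collector, learner).
# These numbers are made up for illustration only, NOT taken from the source.
placeholder = {
    ("deep", "deep"): 0.8,
    ("deep", "shallow"): 0.4,
    ("shallow", "deep"): 0.1,
    ("shallow", "shallow"): 0.1,
}

for collector in DEPTHS:
    gap = expressivity_gap(placeholder, collector)
    print(f"{collector} collector: deep-vs-shallow learner gap = {gap:.2f}")
```

Under the pattern the note describes, the gap would be large for the deep collector and close to zero for the shallow collector, which is what separates the expressivity effect from the exploration effect.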
The experiments are run in unsupervised goal-conditioned settings with no demonstrations or rewards: the agent must explore from scratch and learn to reach commanded goals. A self-supervised contrastive RL algorithm provides the learning framework. Training networks this deep stably requires residual connections, layer normalization, and Swish activations.
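The three stabilization ingredients named above can be sketched as a single residual block in plain Python. This is an illustrative toy, not the source's implementation: the ordering (dense, then layer norm, then Swish, then skip connection), the one-layer block size, the random Gaussian weights, and all function names are assumptions made for the example.

```python
import math
import random


def swish(x):
    """Swish (SiLU) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def dense(v, weights):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(v, weights):
    """Assumed ordering: dense -> layer norm -> Swish, then add the input back."""
    h = [swish(x) for x in layer_norm(dense(v, weights))]
    return [x + y for x, y in zip(v, h)]


def deep_forward(v, depth, seed=0):
    """Stack `depth` residual blocks with random (untrained) weights."""
    rng = random.Random(seed)
    width = len(v)
    scale = 1.0 / math.sqrt(width)
    for _ in range(depth):
        weights = [[rng.gauss(0.0, scale) for _ in range(width)] for _ in range(width)]
        v = residual_block(v, weights)
    return v


out = deep_forward([1.0, -2.0, 0.5, 3.0], depth=64)
print(all(math.isfinite(x) for x in out))
```

The skip connection lets each block add a bounded update to its input, and layer normalization keeps that update at a fixed scale regardless of depth, which is one common intuition for why this combination helps very deep stacks train.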
The finding challenges the conventional wisdom that RL provides too few bits of feedback to train large networks. In self-supervised RL specifically, the ratio of feedback to parameters becomes less constraining because the agent generates its own training signal. Read alongside the related note "Why does parallel reasoning outperform single chain thinking?", the depth-scaling result offers a complementary axis: scaling depth may be as important as scaling parallel breadth for unlocking qualitatively new capabilities.
Inquiring lines that use this note as a source (9)
This note is a source for these synthesized inquiries.
- How does nesting optimization levels improve on traditional network depth?
- Does the model learn depth-wise drift as an explicit strategy?
- Why does depth outperform width for sub-billion parameter models?
- Why does exploration quality matter more than learner network depth?
- How do residual connections and layer norm stabilize training in deep RL?
- How does dynamic recurrence during training improve depth extrapolation?
- How does layer removal affect transformers compared to ResNets?
- How do parallel sampling and sequential depth compare as scaling dimensions?
- Can the exploration ceiling be raised beyond what pretraining established?
Related concepts in this collection (3)
- Why does parallel reasoning outperform single chain thinking?
  Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
  complements: depth scaling and parallel scaling may be independent capability axes
- Can reinforcement learning discover reasoning strategies base models cannot?
  Does RL training truly expand what models can do, or does it just find solutions already hidden in base models? ProRL tests this by running RL longer and on diverse tasks beyond mathematics.
  parallels: both show RL discovering qualitatively new behaviors, though in different domains (reasoning vs locomotion)
- Does RL training follow a predictable two-phase learning sequence?
  This explores whether reinforcement learning exhibits consistent phases where basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.
  connects: depth thresholds may correspond to phase transitions between procedural and strategic capabilities
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
- DeepNet: Scaling Transformers to 1,000 Layers
- The Art of Scaling Reinforcement Learning Compute for LLMs
- Hierarchical Reasoning Model
- RL + Transformer = A General-Purpose Problem Solver
- Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
- Progress Measures For Grokking Via Mechanistic Interpretability
Original note title
network depth above critical thresholds causes qualitative behavioral jumps in self-supervised rl