1000 Layer Networks for Self-Supervised RL:
Scaling Depth Can Enable New Goal-Reaching Capabilities

Kevin Wang 1 Ishaan Javali 1 Michał Bortkiewicz 2 Tomasz Trzcinski 2 Benjamin Eysenbach 1
1 Princeton University 2 Warsaw University of Technology

Please email kw6487@princeton.edu if you have any questions, comments, or suggestions!

[Videos: side-by-side rollouts on Humanoid, Ant U4-Maze, Ant Big Maze, Arm Push Hard, Arm Binpick Hard, and Humanoid Big Maze, each comparing conventional nets (4 layers) against deep nets (64 layers).]

*Here we scale to depth 64. We test the limits of scaling up to 1000 layers in the scaling limits section below.

Abstract

Scaling up self-supervised learning has driven breakthroughs in language and vision, yet comparable progress has remained elusive in reinforcement learning (RL). In this paper, we introduce a self-supervised RL paradigm that unlocks substantial improvements in scalability, with network depth serving as a critical factor. Whereas most RL papers in recent years have relied on shallow architectures (around 2–5 layers), we demonstrate that increasing the depth up to 1000 layers can significantly boost performance. Our experiments are conducted in an unsupervised goal-conditioned setting, where no demonstrations or rewards are provided, so an agent must explore (from scratch) and learn how to maximize the likelihood of reaching commanded goals. Evaluated on a diverse range of simulated locomotion and manipulation tasks, our approach yields gains ranging from a doubling of performance to improvements of more than 50× on humanoid-based tasks. Increasing model depth not only boosts success rates but also qualitatively changes the learned behaviors, with more sophisticated behaviors emerging as model capacity grows.

The Building Blocks of RL Scaling (Our Approach)

In NLP and Computer Vision, scaling up model size has driven notable AI advancements in recent years. Yet, why hasn't RL seen comparable progress? In this work, we ask: can we leverage insights and lessons from language and vision to serve as building blocks for scaling RL?

At first glance, it makes sense why training very large RL networks should be difficult: the RL problem provides very few bits of feedback (e.g., only a sparse reward after a long sequence of observations), so the ratio of feedback to parameters is very small. The conventional wisdom (which many recent models reflect) has been that large AI systems must be trained primarily in a self-supervised fashion and that RL should only be used to finetune these models. Indeed, many of the recent breakthroughs in other fields have been primarily achieved with self-supervised methods.

1. Self-supervised learning

Our first step is to rethink the conventional wisdom above: "reinforcement learning" and "self-supervised learning" are not diametrically opposed learning paradigms; rather, they can be married together into self-supervised RL systems that explore and learn policies without reference to a reward function or demonstrations. In this work, we use one of the simplest self-supervised RL algorithms, contrastive RL (CRL).

2. Data Scale

A key factor in scaling has been the vast amount of internet data available for learning; in RL, by contrast, data has often been scarce. In recent years, however, there has been a proliferation of GPU-accelerated RL environments, enabling the collection of hundreds of millions of environment steps of online RL data within a few hours. In this work, we leverage the JaxGCRL suite of robotic locomotion, navigation, and manipulation environments.

3. Signal Density

The success of scaling relies not only on data quantity but also on the density of training signals: in next-word prediction, every token in a text becomes a labeled example. By contrast, RL often relies on sparse, delayed rewards, with goal-conditioned RL providing only a single bit of reward feedback per trajectory. However, one mechanism underlying the CRL algorithm is Hindsight Experience Replay (HER). HER "relabels" each trajectory with the goal the agent actually achieved, so all trajectory data, even unsuccessful attempts that never reached the commanded goal, effectively become labeled samples for learning, significantly improving signal density in a self-supervised manner.
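To make relabeling concrete, below is a minimal sketch (not the authors' implementation) of hindsight relabeling in JAX: each transition in a rollout is paired with a state the agent actually reached later in the same trajectory, and that state serves as the training goal. The function name, shapes, and uniform future sampling are illustrative assumptions; CRL-style methods typically weight future states geometrically (by the discount) rather than uniformly.

import jax
import jax.numpy as jnp

def relabel_with_future_goals(rng, states, actions, goal_fn=lambda s: s):
    """Hindsight relabeling sketch: pair each (state, action) in a trajectory
    with a goal drawn from a *later* state of the same trajectory.

    states:  [T, obs_dim] observations from one rollout
    actions: [T, act_dim] actions from the same rollout
    goal_fn: maps a state to its goal representation (identity here)
    """
    T = states.shape[0]
    t = jnp.arange(T)
    # For each timestep t, sample a future index uniformly from [t, T).
    u = jax.random.uniform(rng, (T,))
    future_idx = (t + (u * (T - t)).astype(jnp.int32)).clip(0, T - 1)
    goals = goal_fn(states[future_idx])
    # Every transition now carries a goal it actually achieved later on, so
    # even "failed" rollouts become labeled training examples.
    return states, actions, goals

# Usage on dummy data:
rng = jax.random.PRNGKey(0)
s, a, g = relabel_with_future_goals(rng, jnp.zeros((100, 17)), jnp.zeros((100, 6)))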

4. Classification vs. Regression

Scaling in language and vision is based on a classification paradigm, in which increased network capacity yields predictably improved cross-entropy loss. One related work is Farebrother et al. (2024), who showed that discretizing the TD objective of value-based RL into a categorical cross-entropy loss leads to improved scaling properties. In this vein, note that the CRL algorithm in our approach also effectively uses a cross-entropy loss. Its InfoNCE objective is a generalization of the cross-entropy loss: it performs the RL task by classifying whether a given state-action pair and a goal state come from the same trajectory or from different ones.
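Below is a minimal sketch of how such an InfoNCE-style critic loss reduces to a batch-level cross-entropy classification. The L₂ energy matches the Q-value parameterization visualized later on this page, but the exact energy function, symmetrization, and regularization in the actual CRL implementation may differ; treat this as an illustration rather than the training code.

import jax.numpy as jnp
from jax.nn import log_softmax

def contrastive_critic_loss(sa_repr, g_repr):
    """InfoNCE-style loss: for each (state, action) pair, classify which goal
    in the batch came from the same trajectory.

    sa_repr: [B, d] representations phi(s_i, a_i)
    g_repr:  [B, d] representations psi(g_j), where g_i was actually reached
             later in the trajectory of (s_i, a_i) (the positive pair)
    """
    # Pairwise logits: negative L2 distance between phi(s_i, a_i) and psi(g_j);
    # a larger value means "more likely to be from the same trajectory".
    dists = jnp.linalg.norm(sa_repr[:, None, :] - g_repr[None, :, :], axis=-1)
    logits = -dists                                                  # [B, B]
    # Cross-entropy where the correct "class" for row i is column i: each
    # state-action pair should be matched with its own achieved goal.
    return -jnp.mean(jnp.diag(log_softmax(logits, axis=-1)))

# Usage on dummy representations:
B, d = 256, 64
loss = contrastive_critic_loss(jnp.ones((B, d)), jnp.ones((B, d)))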

5. Network Architecture

Finally, training large RL networks often runs into instabilities. Scaling therefore requires incorporating architectural techniques from prior work, including residual connections, layer normalization, and Swish activations. These components allow model capacity to grow while preserving training stability.
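As a concrete illustration, here is a minimal Flax sketch of the kind of residual block these techniques suggest: a Dense layer followed by LayerNorm and a Swish activation, wrapped in a skip connection and stacked depth-many times. The exact block layout in the paper (e.g., how many Dense layers per block and where normalization sits) may differ, and the widths and depths below are placeholders.

import jax
import jax.numpy as jnp
import flax.linen as nn

class ResidualBlock(nn.Module):
    """One residual block: Dense -> LayerNorm -> Swish, plus a skip connection."""
    width: int

    @nn.compact
    def __call__(self, x):
        h = nn.Dense(self.width)(x)
        h = nn.LayerNorm()(h)
        h = nn.swish(h)
        return x + h                      # residual/skip connection

class DeepEncoder(nn.Module):
    """A stack of residual blocks; depth can grow without destabilizing training."""
    width: int = 256
    depth: int = 64
    out_dim: int = 64

    @nn.compact
    def __call__(self, x):
        x = nn.Dense(self.width)(x)       # project input to the working width
        for _ in range(self.depth):
            x = ResidualBlock(self.width)(x)
        return nn.Dense(self.out_dim)(x)  # representation head

# Usage on a dummy observation batch:
model = DeepEncoder()
params = model.init(jax.random.PRNGKey(0), jnp.zeros((1, 17)))
reps = model.apply(params, jnp.zeros((32, 17)))   # shape [32, 64]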

Empirical Results

Scaling network depth yields performance gains across a suite of locomotion, navigation, and manipulation tasks, ranging from doubling performance to 50× improvements on Humanoid-based tasks. Notably, rather than scaling smoothly, performance often jumps at specific "critical" depths (e.g., 8 layers on Ant Big Maze, 64 on Humanoid U-Maze), which correspond to the emergence of qualitatively distinct learned policies (see the section below).


Qualitatively-Different Learned Behaviors Emerge With Depth

Increasing depth results in new capabilities.
Row 1: A humanoid agent trained with network depth 4 collapses and throws itself toward the goal, whereas in Row 2 the depth-16 agent gains the ability to walk upright.
Row 3: At depth 64, the humanoid agent in the U-Maze struggles to reach the goal and falls.
Row 4: An impressively novel policy emerges at depth 256: the agent exhibits an acrobatic strategy of compressing its body to vault over the maze wall.

Deep networks exhibit improved generalization.
(Top left) We modify the training setup of the Ant U-Maze environment so that start-goal pairs are separated by at most 3 units. This design guarantees that no evaluation pairs (top right) were encountered during training, testing the capacity for combinatorial generalization via "stitching."
(Bottom) Generalization improves as network depth grows from 4 to 16 to 64 layers.

Deep networks learn better representations.
In the U4-Maze, the start and goal positions are indicated by the ⊙ and G symbols respectively, and the visualized Q values are computed from the L₂ distance in the learned representation space, i.e., Q(s,a,g) = -‖φ(s,a) - ψ(g)‖₂. The shallow depth-4 network (left) appears to rely naively on Euclidean proximity, as shown by the high Q values forming a semicircular gradient near the start position, despite the intervening maze wall.
In the depth-64 heatmap (right), the highest Q values cluster at the goal and taper gradually along the maze's interior boundary. These results highlight how increased depth matters for learning value functions in goal-conditioned settings, which are characterized by long horizons and sparse rewards.
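For illustration, a heatmap like the ones described above could be generated by sweeping candidate goal positions over a grid and evaluating the critic's energy at each point. The sketch below assumes two trained encoder callables (phi_apply, psi_apply) standing in for the actual critic networks, and assumes the goal is a 2D position; the dummy encoders in the usage example exist only to make the snippet runnable.

import jax
import jax.numpy as jnp

def q_value(phi_apply, psi_apply, s, a, g):
    """Q(s, a, g) = -||phi(s, a) - psi(g)||_2 in the learned representation space."""
    return -jnp.linalg.norm(phi_apply(s, a) - psi_apply(g))

def q_heatmap(phi_apply, psi_apply, s, a, xs, ys):
    """Evaluate Q for a fixed start state and action over a grid of 2D goal positions."""
    grid = jnp.stack(jnp.meshgrid(xs, ys), axis=-1).reshape(-1, 2)   # [N, 2] goal positions
    q_fn = lambda g: q_value(phi_apply, psi_apply, s, a, g)
    return jax.vmap(q_fn)(grid).reshape(len(ys), len(xs))

# Usage with dummy stand-in encoders (placeholders for the trained networks):
phi_apply = lambda s, a: jnp.concatenate([s, a])[:8]   # hypothetical phi(s, a)
psi_apply = lambda g: jnp.pad(g, (0, 6))               # hypothetical psi(g)
xs = ys = jnp.linspace(-5.0, 5.0, 50)
q_map = q_heatmap(phi_apply, psi_apply, jnp.zeros(29), jnp.zeros(8), xs, ys)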


Scaling Depth Unlocks Batch Size Scaling and Outperforms Width Scaling

Scaling network width vs. depth.
Our results echo findings from prior work suggesting that increasing network width can enhance performance. In contrast to that work, however, our method is able to scale depth, which yields larger performance gains while also being more parameter-efficient (similar performance with 50× smaller models). For instance, in the Humanoid environment, raising the width to 4096 (at depth 4) fails to match the performance achieved by simply doubling the depth to 8 (at width 256). This comparative advantage of depth scaling appears more pronounced as the observation dimensionality increases.
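As a rough back-of-the-envelope comparison (under simple MLP assumptions, not the paper's exact architectures or parameter counts), the parameter cost of width grows quadratically while the cost of depth grows only linearly, which is why a much deeper but narrower network can match a wide one at a fraction of the size:

def mlp_hidden_params(width: int, depth: int) -> int:
    """Approximate parameter count of `depth` hidden layers of size `width`
    (hidden-to-hidden weights and biases only; input/output projections,
    LayerNorm, and residual bookkeeping are ignored)."""
    return depth * (width * width + width)

wide = mlp_hidden_params(width=4096, depth=4)   # ~67M parameters
deep = mlp_hidden_params(width=256, depth=8)    # ~0.5M parameters
print(f"wide-and-shallow uses {wide / deep:.0f}x more parameters")   # roughly 128x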

Deeper networks unlock batch size scaling.
Scaling batch size has been an effective mechanism in many areas of ML, but it has not demonstrated the same effectiveness in reinforcement learning. Our findings indicate that, while scaling batch size provides only marginal benefits at the original network capacity, larger networks can effectively leverage batch size scaling to achieve further improvements.


How Far Can We Scale Depth?

Exploring the limits of scaling.
We push the boundaries of network depth scaling up to 1024 layers. In Humanoid Big Maze, performance plateaus. However, on Humanoid U-Maze, we observe continued performance improvements with network depths of 256 and 1024 layers. This may be because the maneuver being learned (flipping over the wall) is exceptionally complex (see Figure above), and requires greater network depth to learn. Note that for the 1024-layer training runs, we observed the actor loss exploding at the onset of training, so we maintained the actor depth at 512 while using 1024-layer networks only for the two critic encoders.

Summary of Key Empirical Findings

  • CRL scales to depths unattainable by other proprioceptive (state-based) RL algorithms (1000+ layers), perhaps due to its self-supervised nature.
  • Both width and depth are key factors influencing CRL's performance, but depth yields larger gains and better parameter efficiency (similar performance with 50× smaller models).
  • We observe signs of emergent behaviors in CRL with deep networks, such as a humanoid learning to walk and to navigate a maze.
  • Scale unlocks learning in difficult maze topologies.
  • Batch size scaling becomes effective in CRL with deep networks.
  • CRL benefits from scaling both the actor and the critic.

Open Questions and Future Directions

  • Pushing depth scaling further: Can we continue to push the limits of depth scaling, beyond 1000+ layers, particularly in the critic network? What happens if we use Neural ODEs, which can be viewed as modeling a ResNet of continuous (effectively infinite) depth?
  • Single-goal setting: Does scaling CRL work in the "single-goal" setting, with no subgoals in addition to no rewards or demonstrations? Although the tasks in this research use subgoals during training, a promising sign is that one of the ten tasks, Arm Binpick Hard, is very close to single-goal (the bins are small and far apart, with no inherent subgoal structure), and we see scaling on this task as well.
  • Distributed training: Can we test the limits of CRL scaling via distributed training? Currently, all experiments run on a single GPU. Given that scaling depth, width, and batch size all improve performance, what gains and emergent capabilities can we achieve when we scale all of these dimensions together?
  • Theoretical understanding: Can we gain a more rigorous theoretical understanding of why depth scaling works in CRL?
  • Offline RL: Can deep networks provide similar benefits in the offline setting?
  • Transfer learning: Can we apply transfer learning to CRL, and if so, can scaling enable the learning of generalized policies that contribute toward developing foundational RL models?

Join us! Our open-source code makes it easy to start experimenting with these ideas. Reach out if you have ideas about how to solve these open problems!

BibTeX

@article{wang2025thousand,
    title     = {1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities},
    author    = {Wang, Kevin and Javali, Ishaan and Bortkiewicz, Micha{\l} and Trzci\'nski, Tomasz and Eysenbach, Benjamin},
    journal   = {arXiv preprint arXiv:2503.14858},
    year      = {2025}
}