By Frederick d’Oleire Uquillas, Science Communications Fellow for the AI Lab
For years, reinforcement learning has typically relied on relatively shallow neural network architectures.
In language and vision, researchers increased depth to hundreds of layers and observed the emergence of new capabilities. In reinforcement learning (RL), by contrast, most architectures used two to five layers, occasionally up to eight. Most prior work found that increasing depth did not lead to meaningful improvements and, in some cases, made performance worse.
A paper from Kevin Wang, Ishaan Javali, Michał Bortkiewicz and colleagues at Princeton University and Warsaw University of Technology asks the question: What if we can leverage the insights from scaling in other domains to scale RL?
In their paper, they propose core building blocks that enable RL to scale in reward-sparse settings: self-supervision, scaling data, signal density, stable architecture (via Layer Normalization and residual connections), and scaling depth (the number of layers in the network). With this combination of algorithmic and architectural choices, they scale networks up to 1,024 layers.
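To make the "stable architecture" ingredient concrete, here is a minimal NumPy sketch of a deep network built from residual blocks with Layer Normalization. It is an illustration under stated assumptions, not the authors' implementation: the block size (four dense layers per block), the Swish activation, the ordering of operations, the initialization, and all function names are choices made for this example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and unit variance (no learned scale/shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def swish(x):
    # Swish activation: x * sigmoid(x). The activation choice is an assumption.
    return x / (1.0 + np.exp(-x))

def init_block(rng, width, layers_per_block=4):
    # One (weight, bias) pair per dense layer in the block.
    scale = np.sqrt(1.0 / width)
    return [(rng.normal(0.0, scale, size=(width, width)), np.zeros(width))
            for _ in range(layers_per_block)]

def residual_block(x, params):
    # Dense -> LayerNorm -> activation, repeated, then a skip connection
    # that adds the block's input back to its output.
    h = x
    for w, b in params:
        h = swish(layer_norm(h @ w + b))
    return x + h

def deep_encoder(x, blocks):
    # Stack residual blocks; total depth = len(blocks) * layers_per_block.
    for params in blocks:
        x = residual_block(x, params)
    return x

rng = np.random.default_rng(0)
width, num_blocks = 64, 16            # 16 blocks x 4 dense layers = 64 layers
blocks = [init_block(rng, width) for _ in range(num_blocks)]
total_layers = sum(len(params) for params in blocks)

x = rng.normal(size=(8, width))       # a batch of 8 hypothetical inputs
out = deep_encoder(x, blocks)
```

The skip connection gives every block a direct path from input to output, and the normalization keeps activations in a bounded range. Together these are the standard reasons very deep stacks remain trainable.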
The result is not only improved performance across a variety of locomotion, manipulation, and navigation environments, but also qualitatively novel behaviors that emerge with scale. In most settings, agents outperform state-of-the-art baseline algorithms and solve tasks that shallower networks consistently fail to complete. In one maze environment, humanoid agents with deep networks (256+ layers) even learn to catapult themselves over a maze wall, rather than walking around it, to reach the goal faster.
Their paper, “1000 Layer Networks for Self-Supervised Reinforcement Learning: Scaling Depth Can Enable New Goal-Reaching Capabilities,” supervised by Prof. Benjamin Eysenbach at Princeton University, earned a Best Paper Award at NeurIPS, one of the most competitive conferences in machine learning.
The problem with reinforcement learning
Reinforcement learning trains an agent through trial and error: Actions that lead toward a goal are reinforced, and those that do not are penalized, over many iterations.
A central challenge is that reward signals are often sparse. An agent may execute thousands of actions before receiving informative feedback. This makes optimization of large networks difficult, since many parameters must be updated from a limited supervisory signal.
In addition, reward functions are typically manually specified by humans. Designing effective reward functions for high-dimensional agents, such as dexterous manipulators or humanoid robots, is difficult and often requires extensive task-specific engineering. Poorly specified rewards can lead to unintended behaviors or inefficient exploration.
The setup: Self-supervised, goal-conditioned RL
Their paper doesn’t try to scale “classic” RL with handcrafted rewards. Instead, it focuses on self-supervised, goal-conditioned reinforcement learning.
In this setting, the agent is given a target state and must reach it, without access to handcrafted rewards, demonstrations, or external guidance. The learning signal is derived from whether its behavior achieves the specified goal.
The authors use a method called contrastive reinforcement learning, which reframes learning as a representation learning problem: The agent learns representations that distinguish state-action pairs that reach a goal from those that do not. Even if a goal isn’t reached, the final state that was reached can be “relabeled” and used to train the networks as if it were the actual goal. This setup produces a far denser and more stable training signal than sparse terminal rewards.
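The two ideas in this setup, relabeling and contrastive learning, can be sketched in a few lines of NumPy. This is an illustrative sketch and not the paper's code: the function names, the dot-product similarity, and the InfoNCE-style loss are assumptions chosen for this example, and the embeddings are toy values, not the output of a trained encoder.

```python
import numpy as np

def relabel_with_final_state(states, actions):
    # states: (T, state_dim); actions: (T, action_dim), actions[t] taken in states[t].
    # Treat the last state actually reached as if it had been the goal all along,
    # so every earlier (state, action) pair becomes a positive example for it.
    goals = np.repeat(states[-1:], len(states) - 1, axis=0)
    return states[:-1], actions[:-1], goals

def info_nce_loss(sa_embed, goal_embed):
    # sa_embed[i] and goal_embed[i] come from the same trajectory (a positive pair);
    # every other goal in the batch acts as a negative for row i.
    logits = sa_embed @ goal_embed.T                      # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # cross-entropy on the diagonal

rng = np.random.default_rng(0)
states = rng.normal(size=(6, 3))     # a toy 6-step trajectory
actions = rng.normal(size=(6, 2))
s, a, g = relabel_with_final_state(states, actions)

# Toy embeddings: aligned pairs should score a lower loss than mismatched ones.
sa_embed = 5.0 * np.eye(4)
goal_embed = np.eye(4)
loss_aligned = info_nce_loss(sa_embed, goal_embed)
loss_shuffled = info_nce_loss(sa_embed, np.roll(goal_embed, 1, axis=0))
```

Because every trajectory supplies its own goal, each batch contains a learning signal whether or not the agent reached the goal it was originally given.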
Effects of increasing depth
The authors observe that each environment exhibits a critical depth at which performance improves sharply. For relatively simple environments, substantial gains can occur at depths as low as 8 or 16 layers. In more challenging domains, such as humanoid locomotion, increasing depth continues to improve performance, with large jumps occurring at depths as great as 256 layers.
Across tasks, these transitions correspond to performance increases ranging from approximately 2× to 50× relative to shallow baselines. Below the critical depth, improvements are incremental; beyond it, agents often achieve qualitatively different and substantially stronger behavior.
Additional experiments suggest that deeper networks learn more structured and meaningful representations of the environment. Deeper models also exhibit more effective exploration dynamics and can leverage larger batch sizes during training. Crucially, given a fixed parameter budget, increasing depth leads to significantly larger performance gains than increasing width.
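To see what a "fixed parameter budget" comparison involves, the short sketch below counts the parameters of a plain fully connected network and finds the widest network that fits a given budget at each depth. The input and output sizes, the helper names, and the specific depths are hypothetical values chosen for illustration. The sketch only does the bookkeeping and says nothing about performance.

```python
def mlp_param_count(in_dim, width, depth, out_dim):
    # depth = number of hidden layers (depth >= 1); counts weights and biases.
    first = in_dim * width + width
    hidden = (depth - 1) * (width * width + width)
    last = width * out_dim + out_dim
    return first + hidden + last

def widest_within_budget(budget, in_dim, depth, out_dim):
    # Largest hidden width whose parameter count stays within the budget.
    # Assumes the budget is at least the count at width 1.
    width = 1
    while mlp_param_count(in_dim, width + 1, depth, out_dim) <= budget:
        width += 1
    return width

in_dim, out_dim = 32, 64                                # hypothetical sizes
budget = mlp_param_count(in_dim, 1024, 4, out_dim)      # a wide, shallow reference
widths = {depth: widest_within_budget(budget, in_dim, depth, out_dim)
          for depth in (4, 16, 64)}

for depth, width in widths.items():
    print(depth, width, mlp_param_count(in_dim, width, depth, out_dim))
```

Hidden-layer parameters grow roughly with depth times width squared, so at a fixed budget the affordable width shrinks roughly with the square root of the depth. That is the trade-off the depth-versus-width comparison explores.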
Significance of the results
This work provides building blocks that enable RL performance to scale with increasing depth, something that no prior work had rigorously shown.
It suggests that predicting future state representations, combined with an appropriate architectural design, leads to an inherently more stable and scalable approach than reward-based reinforcement learning. Additionally, combining the proposed building blocks enables the emergence of new behavioral capabilities.
These findings extend beyond benchmark navigation tasks. Goal-conditioned, self-supervised agents provide a framework for systems that learn autonomously and generalize across tasks without manual reward design.
The contribution is therefore not only larger models, but evidence that depth alters the representational and behavioral capacity of RL systems.
Curious to learn more?
You can read the full NeurIPS 2025 paper here.
