2026-05-19 03:46:53

Mastering Long-Horizon Planning with GRASP: A New Gradient-Based Approach for World Models

GRASP is a new gradient-based planner that makes long-horizon planning with world models practical, using parallel virtual states, stochastic exploration, and gradient reshaping to overcome vanishing gradients and local minima.

Planning over long time horizons with learned world models has been a persistent challenge: optimization becomes ill-conditioned, gradients vanish or explode, and local minima abound. GRASP introduces a novel gradient-based planner that makes long-horizon planning practical by rethinking how trajectories are optimized. Instead of directly backpropagating through a high-dimensional vision model, GRASP lifts the trajectory into virtual states for parallel optimization, injects stochasticity for exploration, and reshapes gradients to deliver clean signals to actions. Below, we answer common questions about this approach, its motivations, and its impact.

What is GRASP and why was it developed?

GRASP (Gradient-based Recursive Action Search for Planning) is a method for planning over long horizons using learned dynamics models, often called world models. It was developed because even powerful world models are hard to use for control: their high-dimensional latent spaces yield fragile gradients, tasks with non-greedy structure (where progress requires temporarily moving away from the goal) create deep local minima, and long sequences amplify both issues. GRASP addresses these weaknesses with three key ideas: (1) lifting the trajectory into virtual states so that optimization can run in parallel across time steps, which can reduce wall-clock time; (2) adding controlled stochasticity to the state iterates to encourage exploration and avoid poor local optima; and (3) reshaping the gradient flow so that action updates receive strong, clean signals without passing brittle gradients through high-dimensional visual encoders. Together, these ideas make gradient-based planning robust and practical for horizons that were previously intractable.
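
The first idea, lifting the trajectory into virtual states, can be illustrated with a small sketch. Everything here is an assumption for illustration: the scalar toy dynamics `step`, the names `rollout` and `lifted_residuals`, and the residual form are hypothetical and not taken from the source. The point is only that each residual depends on one (state, action, next state) triple, so all time steps can be evaluated independently instead of through a sequential rollout.

```python
from typing import List


def step(state: float, action: float) -> float:
    """Hypothetical toy dynamics standing in for a learned world model."""
    return 0.9 * state + 0.5 * action


def rollout(s0: float, actions: List[float]) -> List[float]:
    """Sequential rollout: each state needs the previous one first."""
    states = [s0]
    for a in actions:
        states.append(step(states[-1], a))
    return states


def lifted_residuals(states: List[float], actions: List[float]) -> List[float]:
    """One consistency residual per time step.

    `states` holds s0 followed by free "virtual" states z_1..z_T. Each
    residual uses only (z_t, a_t, z_{t+1}), so the entries could be
    computed in parallel (for example as one batched model call).
    """
    return [states[t + 1] - step(states[t], actions[t])
            for t in range(len(actions))]


actions = [1.0, -0.5, 0.25]
consistent = rollout(0.0, actions)
print(lifted_residuals(consistent, actions))  # all zeros: matches the model

perturbed = consistent[:]
perturbed[2] += 0.1                           # move one virtual state
print(lifted_residuals(perturbed, actions))   # only the adjacent residuals change
```

In a lifted formulation of this kind, the planner would treat the virtual states as free variables and drive the residuals toward zero during optimization, so the final plan is consistent with the model.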

Source: bair.berkeley.edu

Why are long horizons particularly difficult for gradient-based planning?

Long horizons amplify three classic problems in gradient-based planning. First, optimization becomes ill-conditioned: the loss landscape over many time steps is highly non-convex and riddled with flat regions and sharp ravines. Gradients can vanish, leaving early actions with almost no update signal, or explode, destabilizing the entire plan. Second, non-greedy structure means that the optimal sequence of actions may require coordinated changes across many time steps; gradient descent often gets stuck in local minima that are locally plausible but globally poor. Third, when the world model is a deep neural network (especially a vision model), the gradients of the loss with respect to the actions must flow through many layers of high-dimensional state representations. These “state-input” gradients are notoriously brittle: they can be noisy, small, and unreliable, especially over long rollouts. GRASP tackles each of these directly by parallelizing optimization, injecting stochasticity for exploration, and reshaping gradients to bypass the high-dimensional vision encoder.
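
A toy calculation makes the vanishing and exploding cases concrete. The sketch below assumes a hypothetical scalar linear model s_{t+1} = w * s_t + a_t, which is not the world model from the source. In this model the sensitivity of the final state to the first action is exactly w^(T-1), so it shrinks or grows geometrically with the horizon T.

```python
def first_action_sensitivity(w: float, horizon: int) -> float:
    """d s_T / d a_0 for the toy model s_{t+1} = w * s_t + a_t.

    Backpropagating from s_T to a_0 multiplies by the state Jacobian
    (here just w) once per intermediate step, giving w ** (horizon - 1).
    """
    grad = 1.0  # d s_1 / d a_0
    for _ in range(horizon - 1):
        grad *= w  # chain rule through one more transition
    return grad


for w in (0.9, 1.1):
    for horizon in (10, 100, 500):
        print(f"w={w}, T={horizon}: {first_action_sensitivity(w, horizon):.3e}")
```

With a deep network in place of the scalar w, the same product runs over Jacobian matrices, and the behavior depends on their spectra instead of a single number.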

How does GRASP improve gradient flow for action updates?

GRASP introduces a gradient reshaping technique that sidesteps the fragile “state-input” gradients through high-dimensional vision models. Instead of backpropagating through the entire world model to compute gradients of the planning objective with respect to actions, GRASP computes gradients of the loss with respect to the virtual states that it creates during planning. These virtual states are intermediate representations that are optimized in parallel across time. By reshaping the gradient paths, GRASP ensures that action updates receive strong and clean signals, regardless of the complexity of the underlying world model. The key insight is that the gradient signal for an action does not need to traverse the high-dimensional encoder; it can be derived from the difference between the predicted next state (under the plan) and a target state. This avoids vanishing or noisy gradients and makes planning over hundreds of steps feasible with a simple gradient-based optimizer.
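
As a rough illustration of that insight, the sketch below updates each action from its own one-step residual, using only the derivative of the model with respect to the action. The toy dynamics, the function names, and the step size are assumptions made for this example, and the update is a simplified stand-in, not the exact GRASP rule. The state derivative of `step` never appears, which is the sense in which the state-input path is bypassed here.

```python
from typing import List

B = 0.5  # d step / d action for the toy model below


def step(state: float, action: float) -> float:
    """Hypothetical toy dynamics; a real world model would be a network."""
    return 0.9 * state + B * action


def action_gradients(states: List[float], actions: List[float]) -> List[float]:
    """Gradient of sum_t (step(z_t, a_t) - z_{t+1})**2 w.r.t. each a_t.

    The virtual states are treated as constants, so each gradient is
    2 * residual_t * (d step / d a_t) and involves no product of state
    Jacobians across time.
    """
    return [2.0 * (step(states[t], actions[t]) - states[t + 1]) * B
            for t in range(len(actions))]


# Virtual states describing a desired path from 0 to 1 (held fixed in this demo).
states = [0.0, 0.25, 0.5, 0.75, 1.0]
actions = [0.0] * 4
for _ in range(200):
    grads = action_gradients(states, actions)
    actions = [a - 0.5 * g for a, g in zip(actions, grads)]

residuals = [step(states[t], actions[t]) - states[t + 1] for t in range(4)]
print(max(abs(r) for r in residuals))  # close to zero after the updates
```

In this demo the virtual states are fixed so that the action update can be seen in isolation; a full planner would update the states and actions together.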

What exactly is a world model in this context?

In this work, a world model is a learned predictive model that, given a sequence of past observations and actions, forecasts what will happen next. Formally, it defines a distribution \( P_\theta(s_{t+1} \mid s_{t-h:t}, a_t) \) that approximates the environment’s transition dynamics. These models are typically deep neural networks trained on large datasets of agent experience, often in high-dimensional spaces like images or latent vectors. As they scale, they become general-purpose simulators that can predict long sequences of future observations and generalize across tasks. However, having a powerful predictor is not the same as being able to plan with it. Planning requires solving an optimization problem: find actions that, when fed into the world model, produce a trajectory achieving a high reward or low cost. GRASP is designed specifically to make this optimization tractable for long horizons, even with complex world models.
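
Written out, the planning problem described above takes a standard form. The notation is an assumption for illustration: \( F_\theta \) denotes a deterministic one-step prediction from the world model, the history window is dropped for brevity, \( c \) is a terminal cost, and \( g \) is a goal state; none of these symbols come from the source.

\[
\min_{a_0, \dots, a_{T-1}} \; c(s_T, g)
\quad \text{subject to} \quad
s_{t+1} = F_\theta(s_t, a_t), \quad t = 0, \dots, T-1 .
\]

Because \( s_T \) depends on \( a_0 \) through \( T \) nested applications of \( F_\theta \), solving this by direct backpropagation means differentiating through the entire rollout.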

How does GRASP use stochasticity for exploration?

GRASP injects controlled stochasticity directly into the state iterates during planning. When optimizing a trajectory, each virtual state is perturbed with noise at every iteration, encouraging the optimizer to explore alternative trajectories that purely deterministic gradient descent might miss. This is analogous to noise injection in stochastic optimization, but tailored to the planning setting. The noise is scaled so that it does not destabilize convergence; instead, it helps the planner escape shallow local minima and discover better action sequences. This is particularly important for long horizons, where the problem is rich with poor local optima. The stochasticity is applied to the intermediate states and not to the actions themselves, which allows the dynamics of the world model to propagate perturbations and create diverse candidate trajectories. The result is more robust planning across a wide range of tasks and horizon lengths.
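
A minimal sketch of this kind of update follows, assuming Gaussian noise with a linearly decaying scale. The function names, the decay schedule, and all constants are hypothetical choices for illustration; the source does not specify how the noise is scaled.

```python
import random
from typing import List


def noisy_state_update(states: List[float], grads: List[float],
                       lr: float, sigma: float,
                       rng: random.Random) -> List[float]:
    """One gradient step on the virtual states plus Gaussian noise.

    Only the intermediate states are perturbed; the actions would be
    updated elsewhere without added noise.
    """
    return [z - lr * g + sigma * rng.gauss(0.0, 1.0)
            for z, g in zip(states, grads)]


def noise_scale(iteration: int, total: int, sigma0: float = 0.1) -> float:
    """Assumed schedule: decay linearly from sigma0 to 0 so late steps settle."""
    return sigma0 * max(0.0, 1.0 - iteration / total)


rng = random.Random(0)  # seeded for repeatability
states = [0.2, 0.4, 0.6]
grads = [0.1, -0.1, 0.0]  # placeholder gradients, held fixed for the demo
for it in range(3):
    states = noisy_state_update(states, grads, lr=0.1,
                                sigma=noise_scale(it, total=3), rng=rng)
print(states)
```

Setting `sigma` to zero recovers plain gradient descent on the states, which makes it easy to compare the two behaviors on the same problem.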

What are the key results and why do they matter?

The main result is that GRASP enables gradient-based planning over horizons of hundreds or even thousands of steps, a regime where previous methods fail due to vanishing gradients or prohibitive compute. Empirically, GRASP outperforms popular alternatives such as the cross-entropy method (CEM) and standard gradient descent with backpropagation through time, both in planning success rate and in the quality of the resulting plans. Importantly, GRASP does not require engineering tricks like gradient clipping or careful initialization; its built-in parallelization, stochastic exploration, and gradient reshaping make it robust out of the box. These results matter because they show that learned world models can be used for long-horizon control without sacrificing the simplicity of gradient-based optimization. This opens the door to applying sophisticated generative world models to sequential decision-making problems where planning over long horizons is essential, such as robotics, video games, and simulation-based planning.

How might GRASP influence future research in world models and planning?

GRASP suggests that the bottleneck in model-based planning today is not the predictive power of world models but rather the optimizer used to plan with them. By making gradient-based planning practical for long horizons, GRASP encourages the community to focus on scaling up world models without worrying about planning brittleness. Future research could extend GRASP to stochastic environments, integrate it with model-based reinforcement learning algorithms, or combine it with advanced gradient estimation techniques. Additionally, the idea of lifting trajectories into virtual states for parallel optimization may influence how we design neural planners more broadly. GRASP also highlights the importance of gradient reshaping—a principle that could be applied beyond planning, such as in control or sequence generation tasks. Ultimately, GRASP provides a foundation for building agents that can think ahead over long time scales using the full power of learned dynamics.