VERSES Research Blog

Behavior Prediction under Occlusion

Written by VERSES | Sep 18, 2025 4:00:00 PM

In autonomous driving and social robot navigation, one of the most crucial tasks is predicting the future behavior of surrounding humans and objects, known as behavior prediction. Equipped with prediction models that are both accurate and aware of their own uncertainty, an autonomous vehicle can plan movements that maximize efficiency while avoiding potential collisions.

To deal with the heterogeneity of relevant information on the road, such as complex road geometry, traffic signals, and various road users, existing autonomous vehicle solutions employ large transformer models to learn from massive datasets. While these models excel at prediction under nominal driving conditions, they have been shown to struggle in safety-critical “long-tail” scenarios, such as when the vehicle's view of surrounding pedestrians is occluded. This is not surprising, because these black-box models do not explicitly represent occlusion in their model structure, and they are rarely trained on such data.

In our recent paper, “Navigation under uncertainty: Trajectory prediction and occlusion reasoning with switching dynamical systems,” written in collaboration with Volvo Cars, a world leader in vehicle safety, we explored a novel approach to behavior prediction that bakes occlusion awareness into the model while enhancing the expressiveness of state-of-the-art (SOTA) models.

 

Behavior prediction with switching dynamical systems

Road user behavior is inherently multimodal: when observing a cyclist riding along the side of the road, it is often difficult to tell whether they will continue straight or make a turn. However, when they do decide to make a turn, the change of course often appears to occur abruptly. Similarly, abrupt changes in sensory experience can occur in the sudden appearance of encroaching pedestrians when driving by walls or large vehicles. 

We tackle both of these scenarios with switching dynamical systems (SDS), a unified probabilistic framework for modeling sequential data that is designed to handle such unexpected events. An SDS is a set of simple dynamical system models that, when combined, can express more complex trajectories than any single dynamical system in the set can alone. A single dynamical system cannot describe sudden jumps in signals, such as those caused by the sudden appearance of a pedestrian or the sudden turn of a cyclist, because it only excels at modeling smoothly transitioning behavior. A set of different dynamical systems can, however, with each system accounting for a different segment of the full trajectory.
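To make the idea of switching concrete, here is a minimal sketch of a cyclist who rides straight, turns, and rides straight again. It is not the model from the paper: the state layout [px, py, vx, vy], the two hand-written modes, the fixed mode sequence, and the helper names rotation_dynamics and simulate_sds are all assumptions chosen for illustration, and noise is left out entirely.

```python
import numpy as np

def rotation_dynamics(angle, dt=1.0):
    """Hypothetical linear dynamics on [px, py, vx, vy]:
    rotate the velocity by `angle`, then advance the position by dt * velocity."""
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([[c, -s], [s, c]])
    A = np.zeros((4, 4))
    A[:2, :2] = np.eye(2)       # position carries over
    A[:2, 2:] = dt * rotation   # position moves along the rotated velocity
    A[2:, 2:] = rotation        # velocity is rotated
    return A

def simulate_sds(modes, mode_sequence, x0):
    """Roll out the state, applying the dynamics of whichever mode is active."""
    x = np.asarray(x0, dtype=float)
    states = [x]
    for mode in mode_sequence:
        x = modes[mode] @ x
        states.append(x)
    return np.stack(states)

# Two simple systems; neither alone can produce a straight-turn-straight path.
modes = {
    "straight": rotation_dynamics(0.0),
    "turn_left": rotation_dynamics(np.pi / 8),
}
# Illustrative mode sequence: four turn steps of pi/8 add up to a 90-degree turn.
mode_sequence = ["straight"] * 5 + ["turn_left"] * 4 + ["straight"] * 5
trajectory = simulate_sds(modes, mode_sequence, x0=[0.0, 0.0, 1.0, 0.0])
print(trajectory[:, :2])  # positions along the path
```

In a full SDS the active mode is not given in advance; it is a latent variable inferred from data together with the dynamics of each mode.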

SDS are generalizations of mixture-of-experts models, as the strength of different experts, in our case different dynamical systems, can be allocated not only at each time step but also across temporal scales. Furthermore, two previous works on mixture models, “Gradient-free variational learning with conditional mixture networks” and “Variational Bayes Gaussian Splatting,” show that this class of models is particularly suited for efficient inference and learning.

The SDS perspective not only provides us with a unified modeling framework but also affords the following benefits which we demonstrate in the paper:

  • Enhanced model expressiveness
  • Occlusion awareness
  • Interpretability

 

Enhanced model expressiveness

To account for the multimodality of road user behavior, SOTA models already use a mixture modeling approach called conditional Gaussian mixture models (cGMM). Given the current position, velocity, and heading of a road user, a cGMM predicts a set of paths the road user may continue along, where each path is modeled as a Gaussian distribution. However, the time steps along each path are treated as independent of one another, which tends to lose local correlation and consistency between consecutive positions. As a more expressive form of mixture model, SDS generalizes cGMM by replacing the independent Gaussian distributions with smoothly varying dynamical systems, thereby enhancing the expressiveness of SOTA models.
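The difference between independent per-step Gaussians and a dynamical system shows up in the covariance across time. The sketch below compares the two for a one-dimensional position. The random-walk dynamics, the horizon T, and the noise levels sigma and q are illustrative assumptions, not values or dynamics taken from the paper.

```python
import numpy as np

T = 10       # prediction horizon in time steps (illustrative)
sigma = 1.0  # per-step standard deviation of the independent model (illustrative)
q = 0.3      # process-noise standard deviation of the dynamical model (illustrative)

# Independent per-step Gaussians: the covariance across time is diagonal,
# so knowing the position at step t says nothing about step t + 1.
cov_independent = sigma**2 * np.eye(T)

# Assumed random-walk dynamics x[t+1] = x[t] + v + q * noise:
# Cov(x[s], x[t]) = q^2 * min(s, t), so nearby time steps are tied together.
steps = np.arange(1, T + 1)
cov_dynamical = q**2 * np.minimum.outer(steps, steps)

def adjacent_correlation(cov):
    """Correlation between each pair of consecutive time steps."""
    std = np.sqrt(np.diag(cov))
    corr = cov / np.outer(std, std)
    return np.diag(corr, k=1)

corr_independent = adjacent_correlation(cov_independent)
corr_dynamical = adjacent_correlation(cov_dynamical)
print(corr_independent)  # all zeros
print(corr_dynamical)    # sqrt(t / (t + 1)), approaching 1 as t grows
```

Under these assumptions, a path sampled from the independent model can jump arbitrarily between consecutive steps, while a path sampled from the dynamical model stays locally consistent.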

Below is a comparison between SDS and the cGMM used by SOTA models on a behavior prediction task from the Waymo Open Motion Dataset, consisting of 192 randomly selected evaluation trajectories. For this task, we want models that are more accurate, as measured by lower average displacement error (ADE) from the ground truth trajectories, and that more accurately characterize their own uncertainty or mistakes, as measured by lower normalized calibration error (NCE). SDS outperforms cGMM on both metrics.

Model      cGMM   SDS
ADE (↓)    6.76   4.62
NCE (↓)    4.74   3.84
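For readers unfamiliar with ADE, the sketch below uses the common definition: the Euclidean distance between predicted and ground truth positions, averaged over time steps. The function name and the toy trajectories are made up for illustration, and the paper may use a variant (for example, one that accounts for multiple predicted modes). NCE is not sketched here because its exact definition is not given in this post.

```python
import numpy as np

def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance between matching time steps of two trajectories.

    Both inputs have shape (num_steps, 2): one (x, y) position per time step.
    """
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    return np.linalg.norm(predicted - ground_truth, axis=-1).mean()

# Toy example: the prediction drifts sideways by 0, 1, and 2 units.
ground_truth = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
predicted = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
ade = average_displacement_error(predicted, ground_truth)
print(ade)  # mean of the per-step distances 0, 1, 2
```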

 

Occlusion awareness

The SDS architecture allows the model to anticipate the existence of pedestrians before they even appear. For each pedestrian, the model maintains two dynamical systems: one for nominal sensor measurements when no pedestrian is within sensor range, and another for when a pedestrian has been detected. Based on its observations, the model weights the two systems by how accurately each describes the current scene. For multiple pedestrians, the model effectively maintains a mixture of mixtures and automatically allocates newly identified pedestrians to unused experts. As a result, the model is never surprised by the sudden appearance of pedestrians.
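As a rough illustration of how the weights on two such systems could shift with incoming observations, the sketch below runs a two-mode Bayesian filter on a single scalar range reading. Everything specific here is an assumption made for the example: the Gaussian measurement models, the numbers in means and stds, the switching probabilities in transition, and the function names. The paper's actual sensor model and inference scheme may differ.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution, evaluated elementwise."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def update_mode_weights(weights, measurement, means, stds, transition):
    """One filtering step: allow a mode switch, then reweight by the likelihood."""
    predicted = transition.T @ weights            # prior after a possible switch
    likelihood = gaussian_pdf(measurement, means, stds)
    posterior = predicted * likelihood
    return posterior / posterior.sum()

# Mode 0: nominal reading, no pedestrian. Mode 1: a pedestrian shortens the range.
means = np.array([20.0, 8.0])   # hypothetical expected range readings (meters)
stds = np.array([1.0, 1.0])     # hypothetical sensor noise
transition = np.array([[0.95, 0.05],   # transition[i, j] = P(next mode j | mode i)
                       [0.05, 0.95]])

weights = np.array([0.9, 0.1])
history = []
for measurement in [20.3, 19.8, 8.4, 7.9]:   # a pedestrian appears at the third step
    weights = update_mode_weights(weights, measurement, means, stds, transition)
    history.append(weights)
history = np.array(history)
print(history)  # weight moves from mode 0 to mode 1 when the reading drops
```

Because the transition matrix always leaves some probability on the pedestrian mode, the filter can switch as soon as one measurement supports it.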

Below is an illustration of how our model reasons about potential pedestrians occluded by parked vehicles along the street. At the center of each image is our vehicle, driving from the south end to the north end of the road. Behind each parked vehicle is a blue trapezoid highlighting the region occluded from the vehicle's view. The concentric circles on top of the occluded regions represent the vehicle’s beliefs about potential pedestrians as Gaussian distributions. Specifically, the size of each Gaussian represents the area in which the vehicle thinks it may find a pedestrian (i.e., the “where”), and the opacity of each Gaussian represents whether the vehicle thinks a pedestrian exists (i.e., the “if”). The actual pedestrians in the scene are represented by red and green x’s, depending on whether they are occluded or not.

From Figure 1 to Figure 3, we see that as the vehicle moves forward, it correctly identifies pedestrians that have stepped into view by clamping a very small Gaussian on top of them. For occluded regions that did not hide any pedestrians, the model correctly updates its beliefs about the “if” and “where” of the pedestrians: the Gaussians in those regions shrink in size and fade in opacity.
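The fading of the “if” belief can be sketched with a simple Bayes update. The assumption here, which is ours and not necessarily the paper's, is that a pedestrian, if one exists, is equally likely to be anywhere in the occluded region, so seeing a fraction of that region empty rules out the same fraction of the “exists” hypothesis. The function name and the numbers are illustrative.

```python
def existence_after_empty_view(prior, revealed_fraction):
    """Posterior probability that a pedestrian exists in an occluded region
    after `revealed_fraction` of the region has been observed to be empty.

    Assumes a uniform location prior within the region and prior < 1.
    """
    still_hidden = prior * (1.0 - revealed_fraction)  # exists, but in the unseen part
    return still_hidden / (still_hidden + (1.0 - prior))

# The belief fades as more of the region comes into view and turns out empty.
beliefs = [existence_after_empty_view(0.5, f) for f in (0.0, 0.5, 0.9, 1.0)]
print(beliefs)
```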

 

Interpretability

The SDS also carries a level of interpretability, because each dynamical system can be examined and understood separately. In the context of behavior prediction, each simpler dynamical system within the full SDS can be interpreted as a motion primitive, e.g., going straight, turning in a particular direction, or stopping. Switching between these primitives can be understood as dynamic discrete choices made by the road users, which is consistent with well-known human modeling frameworks in psychology and behavioral economics. Furthermore, such compositionality allows certain safety-critical components of the model to be designed and evaluated separately, and sometimes manually. For example, we specified the dynamical system governing object occlusion manually, as the geometry of the problem is known. This allows designers to address safety-critical long-tail scenarios where purely data-driven models struggle.
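To illustrate what reading a trajectory as a sequence of motion primitives might look like, the sketch below labels each step with the primitive whose one-step prediction best matches the next observed state. This is a deliberately simplified, noise-free, hard-assignment stand-in for probabilistic inference in an SDS. The four primitives, the state layout [px, py, vx, vy], and the helper names are hypothetical.

```python
import numpy as np

def primitive(angle=0.0, speed_scale=1.0):
    """Hypothetical linear dynamics on [px, py, vx, vy]:
    rotate and scale the velocity, then move the position by the new velocity."""
    c, s = np.cos(angle), np.sin(angle)
    velocity_map = speed_scale * np.array([[c, -s], [s, c]])
    A = np.eye(4)
    A[:2, 2:] = velocity_map
    A[2:, 2:] = velocity_map
    return A

primitives = {
    "straight": primitive(),
    "turn_left": primitive(angle=np.pi / 8),
    "turn_right": primitive(angle=-np.pi / 8),
    "stop": primitive(speed_scale=0.0),
}

def label_steps(states, primitives):
    """Assign each transition to the primitive with the smallest prediction error."""
    names = list(primitives)
    labels = []
    for current, following in zip(states[:-1], states[1:]):
        errors = [np.linalg.norm(primitives[n] @ current - following) for n in names]
        labels.append(names[int(np.argmin(errors))])
    return labels

# Generate a short trajectory from a known sequence, then recover the labels.
true_modes = ["straight", "straight", "turn_left", "turn_left", "stop"]
state = np.array([0.0, 0.0, 1.0, 0.0])
states = [state]
for name in true_modes:
    state = primitives[name] @ state
    states.append(state)

labels = label_steps(np.stack(states), primitives)
print(labels)
```

The recovered labels give a human-readable account of the trajectory, and any single primitive can be inspected or replaced without touching the others.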

 

Conclusion

Taken together, our work represents an alternative, grey-box approach for behavior prediction that is tailored to the needs of safety-critical domains.

Going forward, we plan to incorporate additional data modalities for behavior prediction and enhance the inference efficiency of our model. We plan to include auxiliary input data such as road graphs and traffic signals to provide more contextual information for predicting road user behavior. Our model is also particularly suited to efficient Bayesian inference using the technique we demonstrated in our previous work on conditional mixture networks.