Why Learn if You Can Infer: Active Inference for Robot Planning & Control

For a less technical explanation, see Real World Intelligence: These Are the Droids You're Looking For.

Artificial intelligence (AI) is rapidly evolving: vision-language models like GPT-4V and Gemini are unifying modalities, while companies like Figure AI and World Labs are accelerating progress toward embodied agents that can perceive, reason, and act in the physical world. At VERSES, we have long been advocating for spatial intelligence as a foundational layer for AI, collaborating with the IEEE to develop standards for the Spatial Web, an architecture for grounding AI in physical, semantic, and contextual environments.

While these developments represent meaningful progress, they largely follow a familiar trajectory: train ever-larger black-box models (mostly Transformers) on ever-larger datasets. Fueled by recent successes in deep learning, the field tends to "turn everything into a machine learning problem", treating the design of intelligent agents as a matter of mapping observations directly to actions via learned policies. This paradigm has serious shortcomings, especially for embodied agents. The real world is messy, dynamic, and diverse; the data requirements for training general-purpose, physically grounded policies far exceed those in more constrained domains like language or autonomous driving. Crucially, such end-to-end learning often obscures any understanding of the why behind an agent’s actions.

What if, instead of learning policies directly, we equipped agents to infer the hidden causes of their observations—and to generate actions by reasoning about those inferred states? This is the central idea behind active inference: rather than passively learning behavior from vast amounts of experience, agents use a generative model of the world to make sense of sensory inputs and select actions that reduce uncertainty or fulfill internal goals.

But this raises a natural question: what kind of generative models are apt for robot control? In the remainder of this article, we'll introduce a hierarchical active inference architecture tailored for embodied agents. At the top level, the agent reasons over discrete, symbolic planning states—such as “go to the kitchen”, “open the fridge”, or “pick up the bottle”. At the bottom level, it manages continuous sensorimotor dynamics, such as predicting joint velocities and proprioceptive feedback. This layered model enables flexible, goal-directed behavior and precise motor control. 

In addition, we combine this with a perception module that models the environment using a mixture-based generative representation—similar in spirit to Gaussian splats—capturing spatial structures like objects, surfaces, and obstacles in a compact and probabilistic way. This allows the agent to perceive affordances, avoid collisions, and reason about navigational constraints—all within the same inference-based framework.
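As a rough stand-in for this idea, the sketch below fits a variational Gaussian mixture to a synthetic point cloud using scikit-learn's BayesianGaussianMixture, so that scene geometry is summarized by a small set of means and covariances that a planner could treat as surfaces, objects, or obstacles. To be clear, this is not VERSES' VBGS module; the synthetic data and component count are invented for illustration.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Synthetic stand-in for an RGBD-derived point cloud: a flat "table"
# surface and a small "mug" on top of it (both invented for this sketch).
rng = np.random.default_rng(0)
table = rng.normal([1.0, 0.0, 0.8], [0.30, 0.40, 0.01], size=(500, 3))
mug = rng.normal([1.1, 0.1, 0.9], [0.03, 0.03, 0.05], size=(200, 3))
points = np.vstack([table, mug])

# Variational Gaussian mixture: each active component is a compact,
# probabilistic "blob" of geometry (mean + covariance).
gmm = BayesianGaussianMixture(
    n_components=8, covariance_type="full", random_state=0
).fit(points)

# Components with non-negligible weight can be treated as surfaces,
# objects, or obstacles by the planner and the low-level controller.
active = gmm.weights_ > 1e-2
for mean, cov in zip(gmm.means_[active], gmm.covariances_[active]):
    extent = 2 * np.sqrt(np.linalg.eigvalsh(cov))  # rough size along principal axes
    print(mean.round(2), extent.round(2))
```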

By integrating structured perception, planning, and control into a unified generative model, we argue that active inference offers a scalable, interpretable, and biologically grounded alternative to brute-force policy learning—one better suited to the demands of real-world robotics.


Universal Generative Models

To equip our embodied agent with a generative model, we build on what Professor Karl Friston refers to as a universal generative model—a general framework for modeling any agent interacting with any environment. Concretely, the generative model specifies how hidden states (s) give rise to observations (o) via a likelihood mapping denoted by A, and how these states evolve over time under the influence of actions (u) via a transition model B. In a universal generative model, this same basic structure is stacked hierarchically, such that higher levels contextualize the prior beliefs over outcomes and states at lower levels—typically through C (preferences over observations) and D (priors over initial states). Within each hierarchical level, the state space can also be factorized into independent but interacting components, giving the model factorial depth.
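To make the notation concrete, here is a minimal NumPy sketch of a single level of such a model: an A likelihood, per-action B transitions, C preferences over observations, and a D prior, together with a one-step expected free energy used to score actions. The sizes, random parameters, and the simple softmax posterior update are illustrative assumptions, not the specific model used in this work.

```python
import numpy as np

np.random.seed(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative sizes: 4 hidden states, 3 observation outcomes, 2 actions.
n_states, n_obs, n_actions = 4, 3, 2

# A: likelihood p(o | s); column s is a distribution over outcomes.
A = np.random.dirichlet(np.ones(n_obs), size=n_states).T          # (n_obs, n_states)

# B: transitions p(s' | s, u); one matrix per action u.
B = np.stack([np.random.dirichlet(np.ones(n_states), size=n_states).T
              for _ in range(n_actions)])                          # (n_actions, n_states, n_states)

# C: log-preferences over observations (here: prefer the third outcome).
C = np.log(softmax(np.array([0.0, 0.0, 3.0])))

# D: prior over initial hidden states.
D = np.ones(n_states) / n_states

def infer_states(o, prior):
    """Posterior over hidden states after observing outcome o (Bayes rule)."""
    return softmax(np.log(A[o] + 1e-16) + np.log(prior + 1e-16))

def expected_free_energy(q_s, u):
    """One-step expected free energy of action u: risk plus ambiguity."""
    q_s_next = B[u] @ q_s                      # predicted next-state distribution
    q_o = A @ q_s_next                         # predicted outcome distribution
    risk = q_o @ (np.log(q_o + 1e-16) - C)     # divergence from preferred outcomes
    ambiguity = -q_s_next @ (A * np.log(A + 1e-16)).sum(axis=0)
    return risk + ambiguity

q_s = infer_states(o=1, prior=D)
best_u = min(range(n_actions), key=lambda u: expected_free_energy(q_s, u))
```

At higher hierarchical levels, the inferred states of one level would parameterize the C and D of the level below, which is how context and goals propagate downwards.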

[Figure: Active Inference Robot Planning and Control 01]

For a robot operating in the real world, this hierarchical model unfolds to represent structured beliefs about objects and their spatial relationships (i.e., "what" and "where"), while temporal coarse-graining enables the agent to acquire and coordinate goal-directed behaviors such as “pick”, “place”, or “navigate”. At the lowest level of the hierarchy, the temporal resolution becomes fine-grained enough to require a continuous-time active inference formulation, where inference and motor control are tightly coupled within a loop that minimizes variational free energy in real time.
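As a rough illustration of that lowest, continuous-time loop, the sketch below runs gradient descent on a variational free energy for a single degree of freedom: perception updates a belief about position, while action drives the plant so that sensations come to match that belief (and hence the goal encoded as a prior). The plant, the precisions, and the assumption that the sensory signal responds one-to-one to the motor command are invented for this toy example.

```python
import numpy as np

dt, steps = 0.01, 2000
sigma_y, sigma_p = 0.1, 1.0   # sensory and prior variances (inverse precisions)
x_target = 1.0                # goal encoded as a prior belief over position

x = 0.0                       # true (hidden) position of the joint
mu = 0.0                      # belief about that position
a = 0.0                       # motor command (a velocity)

rng = np.random.default_rng(0)
for _ in range(steps):
    y = x + rng.normal(0.0, 0.01)        # noisy proprioceptive observation

    # Precision-weighted prediction errors of the free energy
    # F = (y - mu)^2 / (2*sigma_y) + (mu - x_target)^2 / (2*sigma_p).
    eps_y = (y - mu) / sigma_y           # sensory error
    eps_p = (mu - x_target) / sigma_p    # prior (goal) error

    mu += dt * (eps_y - eps_p)           # perception: descend dF/dmu
    a += dt * (-eps_y)                   # action: descend dF/da, assuming dy/da = 1
    x += dt * a                          # plant: the command moves the joint

print(round(x, 2), round(mu, 2))         # both settle near the target
```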


Control as Inference

A core advantage of our approach lies in how control is reframed as inference. Traditional robot control methods—such as inverse kinematics (IK) or operational space control (OSC)—compute joint positions or torques to achieve desired end-effector trajectories. While effective in well-defined settings, these methods assume perfect knowledge of the robot’s morphology and often struggle with redundancy (e.g., robots with more than six degrees of freedom) or with environmental constraints like dynamic obstacles. While deep reinforcement learning is often used to address these limitations, it typically does so via opaque, black-box controllers that lack interpretability and require extensive training. In contrast, active inference offers a principled alternative: it treats control as a process of probabilistic inference. By designing a generative model to mirror the robot’s kinematic hierarchy, we can recover kinematic inversion as a natural consequence of inference. This not only enables the agent to generate motor commands that fulfill spatial goals, but also allows it to infer beliefs about morphological parameters, such as link lengths. As a result, a single model can flexibly adapt to different robot configurations, environments, and control objectives.
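As a toy illustration of kinematic inversion as inference, the sketch below defines a forward (generative) model for a planar two-link arm and recovers the joint angles by descending the prediction error between the predicted and preferred end-effector position. The link lengths, step size, and squared-error objective are illustrative choices rather than the actual generative model; the same scheme could, in principle, also treat the link lengths as latent variables to be inferred, as described above.

```python
import numpy as np

L1, L2 = 1.0, 0.8   # assumed link lengths of a planar 2-link arm

def forward_kinematics(q):
    """Generative (forward) model: joint angles -> end-effector position."""
    x = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    y = L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def jacobian(q):
    """Analytic Jacobian of the forward model."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])

def infer_joint_angles(target, lr=0.1, steps=1000):
    """Inverse kinematics as inference: gradient descent on the squared
    prediction error between predicted and preferred end-effector position."""
    q = np.zeros(2)
    for _ in range(steps):
        err = target - forward_kinematics(q)   # sensory prediction error
        q += lr * jacobian(q).T @ err          # descend the error landscape
    return q

q_star = infer_joint_angles(np.array([1.2, 0.6]))
print(forward_kinematics(q_star).round(3))     # close to the target [1.2, 0.6]
```

Because the goal enters only as a preferred observation, the same procedure works for any reachable target, and adding more links simply extends the forward model; in this sketch, redundancy is resolved implicitly by where the gradient descent settles.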

A mosaic of multiple robot configurations, reaching for the target (green) with their end effector, while avoiding obstacles (red).


The Habitat Benchmark

To evaluate our approach, we use the Habitat benchmark, a high-fidelity simulation platform for embodied AI that features realistic indoor environments and tasks requiring perception, navigation, and interaction. We focus on three challenging, long-horizon tasks: TidyHouse, PrepareGroceries, and SetTable.

🧹 TidyHouse: The robot must relocate five objects from their initial positions to their goal locations, which are typically open surfaces such as kitchen counters or tables.

🛒 PrepareGroceries: The robot must first move two items from an already open refrigerator to a nearby countertop, then take one item from the counter and place it back inside the fridge.

🍽️ SetTable: The most complex of the three tasks, SetTable requires the robot to interact with articulated (movable) objects. It must open a closed drawer to retrieve a bowl and place it on a table, then open the refrigerator to grab an apple and place it next to the bowl. Both the drawer and the fridge must ultimately be closed.

These tasks require the agent to sequence multiple navigation and manipulation actions to pick and place objects from different locations. The complexity and extended time horizons of these three tasks make them a strong test for general-purpose control architectures. Habitat tasks pose a significant challenge for traditional learning-based approaches, which struggle with sparse rewards and the need to master long sequences of interdependent behaviors. 

We take a different approach and implement a hierarchical generative model based on active inference (see figure below). At the top level, the agent selects symbolic goals (e.g., Pick, Move, Place) based on inferred beliefs over task structure and scene context. Mid-level control resolves discrete subgoals, such as object manipulation or retrying failed actions, while low-level controllers operate in continuous space to predict velocities and maintain whole-body control. In addition, a Variational Bayes Gaussian Splatting (VBGS) module converts raw RGBD sensor input into a spatial representation that provides spatial goals and obstacles for the generative model.
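To make the information flow between the levels concrete, here is a deliberately simplistic, runnable sketch: a two-case goal rule stands in for high-level symbolic goal selection, a target lookup for the mid level, and a proportional velocity law with obstacle repulsion for the low level. None of the function names or belief fields below come from our implementation; they only illustrate how the output of each level parameterizes the level beneath it.

```python
import numpy as np

def high_level(beliefs):
    """Symbolic goal selection from beliefs about the task state (toy rule)."""
    return "Pick" if beliefs["holding"] is None else "Place"

def mid_level(goal, beliefs):
    """Resolve the symbolic goal into a spatial subgoal (a 3D target)."""
    key = "object_pos" if goal == "Pick" else "goal_pos"
    return np.asarray(beliefs[key])

def low_level(subgoal, end_effector, obstacles, gain=1.0, margin=0.3):
    """Continuous control: velocity toward the subgoal, pushed away from
    nearby obstacles (a stand-in for the constraints provided by perception)."""
    v = gain * (subgoal - end_effector)
    for obs in obstacles:
        offset = end_effector - obs
        dist = np.linalg.norm(offset)
        if dist < margin:
            v += (margin - dist) * offset / (dist + 1e-6)
    return v

beliefs = {"holding": None,
           "object_pos": [1.0, 0.5, 0.8],
           "goal_pos": [0.0, 1.0, 0.9]}
obstacles = [np.array([0.8, 0.5, 0.8])]    # e.g. supplied by the perception module
end_effector = np.array([0.2, 0.2, 0.9])

goal = high_level(beliefs)                          # "Pick"
subgoal = mid_level(goal, beliefs)                  # the object's position
velocity = low_level(subgoal, end_effector, obstacles)
print(goal, subgoal, velocity.round(2))
```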


[Figure: Active Inference Robot Planning and Control 02]


Our results on the Habitat benchmark are visualized in the graphs below, where we plot the average success rate (over 100 runs in different apartment layouts) against the sequential stages within each task. The results show that our approach (green solid line) outperforms two RL baselines across all three tasks, with the performance advantage becoming more pronounced as task complexity increases. Notably, the Monolithic RL baseline (red dash-dotted line) shows a sharp drop in performance, essentially failing on the later stages of all tasks. The Multi-skill RL Mobile Manipulation (MM) baseline (yellow dashed line) performs better than Monolithic RL but still falls short of our approach, especially in the more complex SetTable task.

The TAMP baseline (purple dotted line) uses a perfect task planner and implements each skill with a classical Sense-Plan-Act pipeline. It assumes perfect knowledge of scene geometry from the simulator, a perfect arm controller that kinematically sets the arm to a desired joint pose, and oracle navigation. Despite having access to such privileged information, TAMP underperformed our solution on the TidyHouse and PrepareGroceries tasks, and it was not applicable to SetTable due to the complexity of the problem.

[Figure: VERSES Hierarchical Active Inference Habitat Tidy House]

Tidy House

[Figure: VERSES Hierarchical Active Inference Habitat Robot Prepare Groceries]

Prepare Groceries

[Figure: VERSES Hierarchical Active Inference Habitat Robot Set Table]

Set Table

[Figure: VERSES Habitat Hierarchical Active Inference Robot Benchmark Comparison]


The difference in performance might stem from the different architectural choices. The Monolithic RL approach trains a single policy for each task, which is notoriously difficult in long-horizon scenarios. The MM baseline instead trains separate policies for subtasks (such as pick and place) and then chains these trained policies together in a fixed sequence. Although simple and effective in many cases, a fixed plan cannot recover from task failures, such as an object falling off a surface and needing to be picked up again. Additionally, the MM baseline requires extensive offline training to achieve good performance: 6,400 episodes per task across varied randomized layouts and configurations in the Habitat environments, and 100 million training steps per skill across a total of 7 skills.

In contrast, our approach overcomes both the training overhead and the failure-recovery issues of these baselines. Where fixed plans cannot handle task failures, our high-level active inference model enables automatic recovery; for example, the robot will retry picking up objects that accidentally fall off the table, as shown in the video below. We also equip the robot to manipulate objects in confined spaces (such as the fridge) via the VBGS perception module in the low-level controller, which provides obstacle constraints (shown as red dots in the video below). Our approach requires no offline training, only tuning the parameters of the generative model over a handful of episodes per task. This demonstrates that capturing domain knowledge in an explicit generative model can compete with, and even outperform, learning-based approaches.
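As a toy illustration of this kind of failure recovery, the snippet below re-derives the next symbolic goal from the current beliefs about the scene, so a dropped object simply leads back to a Pick goal instead of derailing a fixed script. The belief fields and goal names are invented for this sketch; the actual agent selects goals through inference on the hierarchical generative model rather than hand-written rules like these.

```python
def select_goal(beliefs):
    """Derive the next symbolic goal from beliefs, not from a fixed plan."""
    for obj, placed in beliefs["placed"].items():
        if placed:
            continue
        if beliefs["holding"] == obj:
            # Carrying the object: move to its goal location, then place it.
            return ("Place", obj) if beliefs["near_goal"] else ("Move", beliefs["goal_location"][obj])
        if beliefs["near_object"].get(obj, False):
            # Not holding it (e.g. it was dropped): (re)pick it.
            return ("Pick", obj)
        return ("Move", beliefs["object_location"][obj])
    return ("Done", None)

beliefs = {
    "placed": {"bowl": False, "apple": False},
    "holding": None,                       # the bowl slipped out of the gripper
    "near_object": {"bowl": True},
    "near_goal": False,
    "object_location": {"bowl": "drawer", "apple": "fridge"},
    "goal_location": {"bowl": "table", "apple": "table"},
}
print(select_goal(beliefs))                # -> ('Pick', 'bowl')
```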

[Video: VERSES Hierarchical Active Inference Habitat Robot Confined Pick ep1]

Confined pick with obstacle constraints

[Video: VERSES Hierarchical Active Inference Habitat Robot Repick ep9]

Repick after dropping object

And Beyond?

Our results on the Habitat benchmark demonstrate the potential of active inference as a scalable and interpretable framework for embodied intelligence. By leveraging a universal generative model with hierarchical structure, factorial depth, and temporal abstraction, our agent can reason over symbolic goals, adaptively plan in uncertain environments, and execute fine-grained control—all through a unified process of inference. This stands in contrast to traditional policy-learning approaches, which often require vast amounts of data and struggle to generalize to long-horizon, multi-stage tasks.

While we focused on task execution through inference on a given generative model, learning plays a foundational role in building and fine-tuning generative models that generalize to new tasks and environments. Crucially, learning in our approach is also an inference process, over model parameters and structures rather than state-to-action mappings. A key future direction is therefore enabling autonomous discovery of model parameters and structures, including factorizations and hierarchies, through exploratory interaction with the task under minimal supervision. Our previous work on supervised structure learning provides a foundation that would reduce reliance on hand-designed state spaces and improve generalizability. At the motor level, we see promise in using few-shot learning from demonstrations to rapidly acquire skills like grasping or placing, which can then be integrated into the generative model as reusable control primitives.

Ultimately, framing autonomy as structured inference—grounded in learning and driven by uncertainty reduction—provides a principled pathway toward general-purpose embodied intelligence. Integrating online belief updating with continual learning mechanisms will be essential for maintaining performance in non-stationary environments where task requirements, environmental dynamics, and reward structures evolve over time. This adaptive capacity to handle changing environmental conditions is a fundamental requirement for robust autonomous systems operating in real-world settings.

The full study behind this article is published as Mobile Manipulation with Active Inference for Long-Horizon Rearrangement Tasks.