Real World Intelligence: These Are the Droids You’re Looking For
Steven Swanson : Aug 14, 2025 5:30:00 AM

Adaptive robot without pre-training beats top model with up to 93% better performance on Meta's Habitat benchmark
Bottom Line Up Front
Today's robots rely on pre-programming and pre-training and have difficulty operating in the real world, where conditions are constantly changing. No amount of hardcoding and pre-training will allow a robot to adapt to the curveballs and anomalies it will invariably encounter outside of controlled environments, or on real-world multi-step tasks. What if, instead of following instructions or parroting training data, robots were given some basic principles to work within, such as “collisions should be avoided” and “actions might fail,” combined with a sort of mental model of their environment that allows them to figure out how to optimally fulfill their goals on their own? Robots that can adapt would be far more useful and broadly applicable than they are today.
Care to know how? Read on.
The dream of household robots bustling through our homes, folding laundry, tidying up, or fetching a snack, has been a staple of science fiction for decades. Yet, for all our technological leaps in robotics, that future still feels stubbornly out of reach. Why is it so incredibly difficult to build a truly helpful robot, one that can navigate a messy living room as easily as you or me? It turns out, giving a machine common sense and a helping hand is far more complicated than sending a rocket to Mars.
For every impressive video of robots performing backflips or deftly climbing over an uneven surface, there are many more of them laughably failing at simple tasks. For you and me, setting a table is trivial, but for a robot it's a bewildering gauntlet of challenges. It needs to know what a fork is and where the forks are, how to open a drawer, how to grasp a plate without dropping it, how to move through a cluttered kitchen, and how to place each item precisely. Every one of these "simple" actions involves hundreds of tiny, interconnected decisions and precise movements.
For years, scientists have tried to tackle this problem with different approaches, each hitting its own wall. One method, Task and Motion Planning (TAMP), is like giving a robot a highly detailed blueprint for every single action. While precise, these blueprints require perfect knowledge of the environment and are incredibly slow and computationally expensive to create. What if a chair is moved, or a cat walks by? The blueprint becomes useless.
Another popular approach, Reinforcement Learning (RL), tries to teach robots through trial and error, much like how you learn to ride a bike by falling off a lot. First, RL models need millions of practice tries in simulated worlds to learn even basic skills, which is incredibly inefficient. More importantly, if a robot encounters conditions not represented in its training data or not part of a pre-determined sequence of steps, such as dropping an object, it doesn't know how to adapt or recover. It's the difference between being book smart and street smart. These models also struggle with "hand-off issues," where one learned skill, like picking an object, doesn't smoothly connect to the next skill, like placing it. Imagine forgetting how to turn your bike after you master pedaling straight. Plus, designing the "rewards" for correct behavior is incredibly complex for multi-step tasks. And unlike a human, the parts of a robot's body, such as a moving base and a flexible arm, are often controlled separately to simplify the problem, which makes the robot less robust and adaptive overall.
State-of-the-art TAMP and RL success rates drop quickly as the number of interactions increases
Source: Habitat 2.0: Training Home Assistants to Rearrange their Habitat
In essence, real-world robotics is hard because the world is messy, unpredictable, and full of surprises, and robots have lacked an intelligent way to cope with that uncertainty. They can be designed to perform well on highly repetitive tasks in highly controlled environments but fall short in dynamic environments that require learning and adapting in real time to anomalous situations not represented in their training data. For all the cautious optimism around autonomous vehicles (AVs), they still exhibit erratic behavior like driving around in circles, confusing fog with a barrier, getting stuck in a standoff against other AVs, or driving through wet concrete.
A New Brain for Robots: Learning from Ourselves
In a new paper titled Mobile Manipulation with Active Inference for Long-Horizon Rearrangement Tasks, submitted to the International Workshop on Active Inference for peer review, VERSES brings a radically different idea to the table, one inspired by the very way our own brains work: Active Inference. Think of it like this: your brain is constantly making guesses about what will happen next. When something unexpected happens, your brain gets a "surprise," and it adjusts its guesses and actions to reduce that surprise. VERSES is building digital brains that work on this same principle, constantly predicting and acting to minimize unexpected outcomes.
In the paper, we introduce a fully Hierarchical Active Inference model for robots, especially suitable for complex multi-step tasks like rearranging objects in a home. It isn't just a big, monolithic brain; it's more like a company with different levels of management, each with its own job. Instead of taking a bottom-up, highly prescriptive, task-first approach to robotics, the model applies a top-down, self-organizing, goal-first philosophy. What's more, the model is not tailored to a specific robot configuration or environment and will adapt to any robot form factor and apartment layout.
Notably, this model differs from our AXIOM architecture, which we encoded with a few basic priors, or principles, and whose agent learned to achieve competent gameplay gradually through experience. In contrast, we fine-tuned this model with more sophisticated priors (see below), and instead of learning through experience or being trained on samples, it infers, or "figures out," the optimal actions to take to achieve its goals.
Among other things, the robot knows:
- that it has a body
- how to control its body
- that objects exist and how to identify them
- that objects can be contained inside other objects (food in the fridge)
- that it can collide with obstacles or items, and that collisions should be avoided
- that actions can fail
While different, both architectures are rooted in the same framework (Active Inference), which includes a probabilistic world model on which agents perform inference in order to select optimal actions, and both leverage aspects of other research we've shared, including VBGS, Predictive Coding, and CAVI-CMN.
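To make "inferring rather than learning" concrete, here is a minimal, purely illustrative sketch, with made-up names and numbers that are not VERSES' actual model or API. Each candidate action is scored by its expected "surprise" (the negative log-probability of reaching the goal), folding in the priors that actions can fail and that collisions should be avoided; the agent then simply picks the least surprising action instead of replaying a trained policy.

```python
import math

# Illustrative-only sketch (not the actual VERSES model): score each
# candidate action by expected surprise, i.e. -log P(goal | action),
# using the priors "actions can fail" and "collisions should be avoided".

def expected_surprise(p_success: float, p_collision: float) -> float:
    # P(goal achieved) = P(action succeeds) * P(no collision)
    p_goal = p_success * (1.0 - p_collision)
    return -math.log(max(p_goal, 1e-9))  # guard against log(0)

def select_action(candidates: dict) -> str:
    # candidates maps action name -> (p_success, p_collision),
    # the agent's current beliefs from its world model
    return min(candidates, key=lambda a: expected_surprise(*candidates[a]))

beliefs = {
    "reach_directly":   (0.60, 0.30),  # fast, but risks bumping the table
    "reposition_first": (0.90, 0.05),  # move the base, then reach
}
print(select_action(beliefs))  # -> reposition_first
```

Because the scores come from the agent's current beliefs rather than a frozen policy, updating those beliefs after a surprise immediately changes which action looks best.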
VERSES Active Inference based robot performing domestic tasks. Original Habitat simulations can be viewed on the technical deep dive, Why Learn When You Can Infer.
At the top of the hierarchy is the High-Level Model, the "CEO" of the robot's brain. This boss decides the big picture: "Go to the kitchen," "Pick up the apple," "Set the table." These aren't just single commands; they're abstract goals that the lower-level "managers" then figure out how to achieve.
Below that, there are specialized "managers" like the Pick & Place Model and the Navigation Model. The Pick & Place Model is clever because it can adapt to unexpected issues, such as a failed grasp attempt. It has different strategies to "retry" an action, making the robot more resilient. The Navigation Model figures out (infers) the best paths to move the robot around, using knowledge of the environment to plan its route.
And at the very bottom, coordinating all the robot's physical movements, is the Whole-Body Controller. This is where the magic happens. Unlike older robotic systems that might separate the wheeled base from the arm, VERSES' stack coordinates the entire robot's body at once. This means the robot can use its wheels to move its body into a better position to help its arm reach for an object, greatly extending its reach and flexibility. It's like a dancer using their whole body to make a graceful move, rather than just isolated limbs.
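As a rough mental model, the three-level hierarchy described above could be sketched like this. All class and method names here are hypothetical, chosen only to mirror the structure in the paper, and the controller trivially "succeeds" so the skeleton runs end to end:

```python
# Hypothetical skeleton of the three-level hierarchy (illustrative names,
# not VERSES' code): a high-level model emits abstract goals, mid-level
# "managers" decompose them, and one whole-body controller moves base
# and arm together.

class WholeBodyController:
    def execute(self, target) -> bool:
        # Coordinates base + arm jointly, e.g. rolling closer so the
        # arm can reach. Simulated as always succeeding here.
        return True

class PickPlaceModel:
    def __init__(self, controller, max_retries: int = 3):
        self.controller = controller
        self.max_retries = max_retries

    def achieve(self, goal) -> bool:
        # Resilience prior: actions can fail, so retry a few times
        for _attempt in range(self.max_retries):
            if self.controller.execute(goal):
                return True
        return False

class NavigationModel:
    def __init__(self, controller):
        self.controller = controller

    def achieve(self, goal) -> bool:
        # Infers a route, then hands waypoints to the controller
        return self.controller.execute(goal)

class HighLevelModel:
    """The 'CEO': turns a task into a sequence of abstract goals."""
    def __init__(self):
        wbc = WholeBodyController()
        self.managers = {"navigate": NavigationModel(wbc),
                         "pick": PickPlaceModel(wbc),
                         "place": PickPlaceModel(wbc)}

    def run(self, plan) -> bool:
        return all(self.managers[skill].achieve(goal) for skill, goal in plan)

robot = HighLevelModel()
print(robot.run([("navigate", "kitchen"), ("pick", "apple"), ("place", "table")]))  # -> True
```

The key design point the sketch tries to capture: every manager delegates to the same whole-body controller, so base and arm are never planned in isolation.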
To "see" its world, this robot brain uses something called Variational Bayes Gaussian Splatting (VBGS). Rather than just storing point clouds (a common way to do navigation in commercial systems) we use a sophisticated way of building a probabilistic 3D map of the environment from what the robot observes. This map helps the robot understand where objects are, where obstacles lie, and how to avoid bumping into things, making it safe and effective in cluttered environments.
We generate a Gaussian Splat as a probabilistic map of the environment
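As a toy illustration of the idea (deliberately simplified, and not the actual VBGS algorithm), a map stored as a handful of weighted Gaussian blobs yields a smooth occupancy probability at any query point, which a planner can threshold to keep a safe margin from obstacles, instead of reasoning over thousands of raw points:

```python
import math

# Toy Gaussian-mixture map (illustrative only, not VBGS): each component
# is (weight, mean, isotropic variance) summarizing observed surfaces.

def gaussian_density(point, mean, var):
    d2 = sum((p - m) ** 2 for p, m in zip(point, mean))
    dim = len(point)
    norm = (2 * math.pi * var) ** (dim / 2)
    return math.exp(-d2 / (2 * var)) / norm

def occupancy(point, components):
    # Smooth "how occupied is this spot?" score: a weighted sum of densities
    return sum(w * gaussian_density(point, mu, var) for w, mu, var in components)

table_map = [(1.0, (1.0, 0.0, 0.5), 0.01)]  # one tight blob around a table corner
near = occupancy((1.0, 0.0, 0.5), table_map)
far = occupancy((2.0, 0.0, 0.5), table_map)
print(near > far)  # -> True: the map "knows" the corner is occupied
```

A planner could then treat any point whose occupancy exceeds a chosen threshold as an obstacle, with the variance of each blob naturally encoding uncertainty about where the surface really is.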
The results are impressive. When tested on complex tasks from Meta's Habitat Benchmark, like "TidyHouse" (rearranging objects), "PrepareGroceries" (getting food from the fridge), and "SetTable" (setting a table), our robot outperformed all previous state-of-the-art methods. In the graphic below, Monolithic RL (orange) fails as the number of tasks increases. MM Baseline (yellow) performs reasonably well but shows similar signs of degrading performance as the number of tasks increases. It should be noted that the MM Baseline is pre-loaded in advance with the steps required to achieve its goal, while ours (green) infers them, making it more resilient and adaptive. These Habitat results show that Active Inference is not just a theoretical concept; it's a powerful, practical solution for real-world robotics.
Our approach outperforms the baselines in all three tasks, averaging a 76% completion rate in TidyHouse, 83% in PrepareGroceries, and 56% in SetTable. The best-performing baseline, MM, averages 71%, 64%, and 29% respectively. Additionally, the MM baseline requires extensive offline training: 6,400 episodes per task across varied layouts and configurations in the Habitat environments, and 100 million steps per skill across a total of 7 skills (1.3 billion combined steps for all skills used across all 3 tasks). In contrast, our method relies on hand-tuning each skill over just a handful of episodes and is evaluated directly on unseen layouts and configurations, demonstrating strong generalization without the need for data-intensive training. However, we still rely on privileged information, such as the floor map for path planning and articulated object states. These assumptions will be removed in future work.
| Task | MM Baseline | VERSES | Difference |
| --- | --- | --- | --- |
| TidyHouse | 71% | 76% | +7% |
| PrepareGroceries | 64% | 83% | +30% |
| SetTable | 29% | 56% | +93% |
| Combined | 54.7% | 66.5% | +21% |
| Total Training Steps | 1.3 billion | 0 | |
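For the curious, the Difference column above is the relative improvement of VERSES over the MM baseline, (VERSES - MM) / MM. The per-task figures can be reproduced in a couple of lines:

```python
# Relative improvement of VERSES over the MM baseline, per task,
# using the completion rates reported above.
results = {
    "TidyHouse": (71, 76),
    "PrepareGroceries": (64, 83),
    "SetTable": (29, 56),
}

for task, (mm, verses) in results.items():
    print(f"{task}: +{round((verses - mm) / mm * 100)}%")
# -> TidyHouse: +7%
#    PrepareGroceries: +30%
#    SetTable: +93%
```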
Why it Matters: A Rosey Future
The Hierarchical Active Inference model implemented in this study showcases a means of solving the kind of long-horizon (multi-step) planning required for robots to operate autonomously in dynamic, real-world environments. Critically, it embodies the generalizability, composability, and skill transferability needed to scale robotics at large in a way that is form-factor agnostic. By way of example, the following mosaic shows a variety of robots with different configurations, all powered by the same model, pursuing targets and avoiding obstacles.
Agentic and autonomous robots that can adapt to novel challenges, self-organize, self-correct, and require less upfront programming and training could expand the scope of tasks that can be automated across sectors like logistics, manufacturing, healthcare, and exploration, and accelerate market growth. GlobalData estimates the robotics market will be worth $218 billion by 2030, while Boston Consulting Group projects the global robotics market to exceed $260 billion by 2030.
It is no secret that the first company to build a robot that can simply load a dishwasher will immediately have a bestselling product. - Max Bennett
In his 2023 book, A Brief History of Intelligence, Max Bennett wrote, “Although The Jetsons correctly predicted cell phones and smartwatches, we still don’t have anything like Rosey. As of this book going to print, even Rosey’s most basic behaviors are still out of reach. It is no secret that the first company to build a robot that can simply load a dishwasher will immediately have a bestselling product. All attempts to do this have failed. It isn’t fundamentally a mechanical problem; it’s an intellectual one—the ability to identify objects in a sink, pick them up appropriately, and load them without breaking anything has proven far more difficult than previously thought.”
While Rosey isn't available for pre-order today, we believe there is now a line of sight.
VERSES' vision is a Smarter World, where people and technology live in greater harmony, and we are building these technologies into Genius toward that end. If you're a Data Scientist, Machine Learning Engineer, or Researcher seeking to rapidly build reliable, domain-specific predictions and decisions, get started for free today.
For a more technical deep dive, see Why Learn if You Can Infer: Active Inference for Robot Planning & Control.