AXIOM: Mastering Arcade Games in Minutes with Active Inference and Structure Learning

Humans demonstrate a remarkable ability to learn complex behaviours from minimal experience, whether mastering new motor skills, understanding novel social dynamics, or adapting to unfamiliar environments. This sample efficiency suggests that the brain begins each task with strong hypotheses about how the world is organised, and can efficiently learn from single examples while generalizing these insights to new situations without requiring extensive replay or repetition. Moreover, humans appear to naturally balance exploration driven by curiosity and information gain with exploitation of known rewarding actions, enabling rapid discovery of relevant structure in novel environments.

Modern deep learning architectures have achieved impressive cross-domain performance through flexible designs that discover rich representations with minimal inductive biases. However, they typically require massive datasets for training and therefore fail to exhibit human-like sample efficiency. In contrast, Bayesian approaches have become the dominant paradigm in the brain sciences, offering a principled way to endow models with structured prior knowledge, to support continual and online learning, and to define active sampling strategies - potentially forming the foundations of sample-efficient cognition. Yet traditional Bayesian architectures often rely on hand-engineered priors tailored to specific tasks, limiting their general applicability. This creates a fundamental tension: structured priors enable fast learning but lack flexibility, while flexible representations generalize broadly but demand extensive data.

Our approach addresses all three aspects of human-like learning through a unified framework. We develop Bayesian models that incorporate core priors - abstract organisational principles that specify architectural constraints for learning while leaving specific structural instantiations to be discovered. For instance, object-centric core priors posit that the world consists of discrete, extensive entities with sparse interactions and piecewise linear dynamics, without dictating particular objects or interaction rules for any given task. To enable learning from single examples, we employ fast structure learning algorithms that grow and adapt structures based on novel data points rather than replaying datasets and backpropagating gradients, emulating humans’ ability to perform fast, continual learning. Finally, we use active inference to plan actions based on principled uncertainty estimation instead of relying on fixed state-action policies, naturally balancing exploration and exploitation to quickly discover task-relevant structure. This trinity of core priors, fast structure learning, and active planning bridges the gap between the sample efficiency of human learning and the broad applicability of modern machine learning approaches.

 

Gameworld 10k — a Test-bed for Core Priors

 

 

Figure 1.  All ten titles of Gameworld, reading left to right: Aviate, Bounce, Cross, Drive, Explode (first row); Fruits, Gold, Hunt, Impact, Jump (second row). 


Figure 2.  Examples of controlled perturbations on two of the ten games: Explode (left) with changed shapes, and Fruits (right) with changed colours.

We introduce Gameworld 10k as a suite of ten arcade-style games engineered from the ground up to let us “write core priors large” while still posing a non-trivial visual control challenge. Each title obeys the same high-level template:

  1. Discrete objects, continuous motion. Sprites are single-colour geometric shapes that move with piecewise linear dynamics.
  2. Sparse, local interactions. Objects affect one another only on contact (or within a short-distance field), and at most one interaction fires per frame.
  3. A controllable player avatar. Every game features a single player object whose action space is 2–4 discrete moves.
  4. Reward is linked directly to events. Points (positive or negative) are emitted in the same step that the causative interaction occurs, eliminating hidden timers.
  5. Full programmability. Because the games are implemented in a few hundred lines of Python, we can perturb any element at run-time - change colours, swap sprites, halve gravity, flip reward signs - and watch how models adapt to these interventions (see the sketch after this list).
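As a concrete illustration of point 5, here is a minimal sketch of what a mid-episode intervention might look like. The `Gameworld` class, its attribute names, and the `step` signature below are hypothetical stand-ins for exposition, not the benchmark's actual API.

```python
import numpy as np


class Gameworld:
    """Hypothetical stand-in for a fully programmable arcade environment."""

    def __init__(self):
        self.sprite_colour = (0, 255, 0)   # green player sprite
        self.gravity = 9.8                 # acceleration applied to falling objects
        self.reward_sign = +1              # sign applied to emitted points

    def step(self, action: int):
        # Game logic elided; returns (observation, reward) for one frame.
        obs = np.zeros((210, 160, 3), dtype=np.uint8)
        return obs, self.reward_sign * 0.0


env = Gameworld()
for t in range(10_000):
    obs, reward = env.step(action=0)
    if t == 5_000:                         # intervene mid-episode
        env.sprite_colour = (128, 0, 128)  # reskin: green -> purple
        env.gravity /= 2                   # halve gravity
        env.reward_sign *= -1              # flip reward signs
```

Because the environment is ordinary Python, interventions like these are one-line attribute changes rather than ROM surgery.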

The ten titles cover the classic arcade spectrum - Aviate, Bounce, Cross, Drive, Explode, Fruits, Gold, Hunt, Impact, and Jump. For example, Bounce is directly inspired by the classic Atari game “Pong”, while Aviate requires navigating a flying player in between pipes, inspired by the popular game “Flappy Bird.” Despite sharing a base template, each game introduces its own twist through game mechanics, object types and shapes, or the reward schedule. As part of the Gameworld 10k benchmark, we cap training at 10,000 environment steps - hence the “10k” - an order of magnitude less than the classic Atari 100k challenge, and equivalent to approximately 12 minutes of human gameplay. Because every latent variable is under our control, we can run causal tests normally impossible in ALE. Figure 2 shows an animation of a game of Explode, which flips mid-episode so that all the objects change shape and the falling object becomes twice as large.

Gameworld 10k offers a transparent set of regularities - objects, sparse interactions, locally-linear motion - that are published up-front. Because those assumptions are common across many real-world tasks, any agent (pixel-based, object-centric, or otherwise) can decide whether and how to exploit them, letting us measure how quickly different learning strategies convert general-purpose structure into competent play.

 

Why not just stick with Atari?

The Arcade Learning Environment (ALE) has been a workhorse for reinforcement-learning research, but it was never designed to probe how structured world models learn and adapt to controlled perturbations. ALE’s source code offers little room for controlled interventions: you cannot reskin sprites, rewire collisions, or inject a new force field without editing assembly-like ROMs and breaking the leaderboard. In addition, many ALE game environments hide bookkeeping in ad-hoc variables (e.g., Kaboom’s negative points counter is only indirectly available through a special lives counter, which decrements long after multiple bombs have been missed) or embed quirks such as the ball periodically disappearing in Pong. Such artefacts are orthogonal to the regularities that most object-centric or physics-based models rely on, forcing researchers either to abandon those priors, shoe-horn in special-case fixes, or train a value network that can solve the credit-assignment problem to circumvent these counterintuitive feedback mechanisms.

In short, Gameworld 10k provides fully controllable 2-D arcade environments written in Python, replacing the accidental idiosyncrasies of Atari with designed regularities that mirror many of the physical and causal intuitions baked into 2-D arcade games. It is a probe for the central hypothesis of this work: that giving an agent core priors unlocks human-like sample efficiency without sacrificing flexibility or interpretability.

 

AXIOM: Active eXpanding Inference with Object-centric Models

Figure 3.  Inference and prediction flow using AXIOM. The sMM extracts object-centric representations from pixel inputs. For each object latent and its closest interacting counterpart, a discrete identity token is inferred using the iMM and passed to the rMM, along with the distance and the action, to predict the next reward and the tMM switch. The object latents are then updated using the tMM and the predicted switch to generate the next state for all objects. (a) Projection of the object latents into image space. (b) Projection of the kth latent whose dynamics are being predicted and (c) of its interaction partner. (d) Projection of the rMM in image space; each of the visualised clusters corresponds to a particular linear dynamical system from the tMM. (e) Projection of the predicted latents. The past latents at time t are shown in grey.

Previous work from VERSES has demonstrated the power of describing complex scenes and dynamics with mixture models; we have found that learning and inference with mixture models can combine the expressiveness of deep learning with the learnability and sample efficiency of Bayesian methods [1, 2, 3]. Mixture models admit fast, online structure learning algorithms that add and update components whenever the current set cannot adequately explain the data; importantly, this proceeds without storing previous datapoints in a replay buffer or re-fitting existing components. Since the expressiveness of these models resides in approximating complex densities with compositions of many simpler ones, the updates for individual components can be made exact using appropriate conjugate priors, which in turn allows uncertainty quantification by computing posterior distributions over model parameters.

The world model of AXIOM heavily leverages these insights to build an object-centric world model that incorporates mixture models at every stage: segmenting objects, identifying their types, predicting their dynamics, and classifying their interactions. Figure 3 shows the flow of inference and planning in the AXIOM architecture.

 

AXIOM’s Mixture Modules

Everything AXIOM does starts with an image frame. Instead of feeding raw pixels into a neural net, the slot mixture model (sMM) explains every pixel with one of a handful of object variables or slots. Each slot proposes a set of interpretable object features (colour, position, shape), and whichever slot best predicts a given pixel “wins” that pixel, in a very similar fashion to slot attention [4, 5] or spatial mixture [6] models. Because the sMM is a Bayesian mixture, it can grow new slots the moment an unfamiliar sprite appears, and it can let go of slots that stop receiving evidence. In practice, the system converges on the right number of game objects within a few dozen frames.
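To make the winner-takes-the-pixel idea concrete, the following is a minimal numpy sketch of soft pixel-to-slot assignment under per-slot diagonal Gaussians. The five-dimensional pixel features and the simple likelihood are illustrative simplifications, not the sMM's exact parameterisation.

```python
import numpy as np


def slot_responsibilities(pixels, slot_means, slot_vars, slot_weights):
    """Soft-assign each pixel to the slot that best predicts it.

    pixels:       (P, 5) per-pixel features, e.g. (x, y, r, g, b)
    slot_means:   (K, 5) per-slot predicted feature means
    slot_vars:    (K, 5) per-slot diagonal variances
    slot_weights: (K,)   mixing weights summing to 1
    """
    diff = pixels[:, None, :] - slot_means[None, :, :]   # (P, K, 5)
    log_lik = -0.5 * np.sum(diff**2 / slot_vars + np.log(2 * np.pi * slot_vars), axis=-1)
    log_post = np.log(slot_weights) + log_lik            # unnormalised log posterior
    log_post -= log_post.max(axis=1, keepdims=True)      # numerical stabilisation
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)        # (P, K) responsibilities
```

Growing a new slot then amounts to appending a row to `slot_means` and `slot_vars` whenever no existing slot assigns a pixel reasonable probability (see the shared expansion rule below).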

Once a slot is established, AXIOM assigns it a discrete identity - ball, paddle, fruit - using an identity mixture model (iMM). The iMM clusters simple geometric features (colour + shape) so that, for example, every red square is treated as the same kind of thing even if it respawns in a different place. Identifying the type is important for differentially predicting its dynamics [7] – for instance, an obstacle may interact differently with the player than an item in the game Hunt (see Figure 1). Because identities are treated probabilistically and inferred directly from object-level data, the agent can seamlessly re-label objects when the game designer reskins them at run-time.
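A heavily simplified sketch of this identity-assignment step is shown below, with a Euclidean distance test standing in for the iMM's full Bayesian posterior; the threshold and feature encoding are assumptions made for illustration.

```python
import numpy as np


def assign_identity(feature, id_means, threshold=3.0):
    """Return a discrete identity token for an object's (colour, shape)
    feature vector, creating a new identity when nothing known is close."""
    feature = np.asarray(feature, dtype=float)
    dists = [np.linalg.norm(feature - m) for m in id_means]
    if not id_means or min(dists) > threshold:
        id_means.append(feature.copy())  # e.g. a reskinned sprite gets a fresh token
        return len(id_means) - 1
    return int(np.argmin(dists))
```

Under such a rule, recolouring a sprite simply spawns a new identity token, while everything the agent has learned about motion and interactions can be re-used.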

To forecast motion, AXIOM uses a transition mixture model (tMM) that behaves like a switching linear dynamical system (SLDS). Think of it as a library of motion “verbs” - falling, sliding, bouncing. Every frame, each slot chooses the verb that best explains its current velocity; if none fit, the library automatically expands with a new linear mode. Because these modes are shared across objects, the system can learn “gravity” once and reuse it for every apple that drops.
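The sketch below shows the core of such a switching linear system: each mode is a pair (A, b) with x_{t+1} ≈ A x_t + b, the best-fitting mode is selected every frame, and a new mode is added when none fits. The threshold and the one-example initialisation of new modes are illustrative assumptions, not AXIOM's exact scheme.

```python
import numpy as np


def predict_next_state(x_t, x_tp1_obs, modes, noise_var=1e-2, surprise_thresh=20.0):
    """Pick the linear motion 'verb' that best explains an observed transition.

    modes: list of (A, b) pairs, each defining x_{t+1} ~ A @ x_t + b.
    If every mode is too surprising, append a new one fitted to this single
    transition (A = I, b = observed displacement): the library expands.
    """
    errors = [np.sum((x_tp1_obs - (A @ x_t + b))**2) / noise_var for A, b in modes]
    if not modes or min(errors) > surprise_thresh:
        modes.append((np.eye(len(x_t)), x_tp1_obs - x_t))  # new mode from one example
        k = len(modes) - 1
    else:
        k = int(np.argmin(errors))
    A, b = modes[k]
    return k, A @ x_t + b
```

Because `modes` is shared across all slots, a mode learned from one falling object is immediately available to every other object.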

Games are interesting because objects collide, trigger rewards, or end the episode. AXIOM captures such sparse interactions with a recurrent mixture model (rMM). The rMM looks at a slot, its nearest neighbour, the recent action and the current reward, and clusters these multi-object snapshots. The cluster a slot falls into then predicts which motion verb the tMM should switch to next. This tight Bayesian loop lets AXIOM notice, from a single example, that “ball plus paddle plus left-move → ball bounces up.”
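Here is a toy version of that loop, assuming contexts are fixed-length numeric vectors and using running averages and count updates in place of the exact conjugate updates of the full model; the class name and threshold are hypothetical.

```python
import numpy as np


class RecurrentMixture:
    """Toy rMM: clusters interaction contexts and predicts the next tMM
    switch and the reward."""

    def __init__(self, n_switches, threshold=2.0):
        self.clusters = []
        self.n_switches = n_switches
        self.threshold = threshold

    def update(self, context, switch, reward):
        """context: e.g. [own_id, neighbour_id, distance, action]."""
        context = np.asarray(context, dtype=float)
        dists = [np.linalg.norm(context - c["mean"]) for c in self.clusters]
        if not self.clusters or min(dists) > self.threshold:
            # Too surprising: spin up a fresh cluster for this snapshot
            self.clusters.append({"mean": context.copy(),
                                  "switch_counts": np.ones(self.n_switches),
                                  "reward": 0.0, "n": 0})
            k = len(self.clusters) - 1
        else:
            k = int(np.argmin(dists))
        c = self.clusters[k]
        c["n"] += 1
        c["mean"] += (context - c["mean"]) / c["n"]  # online mean update
        c["switch_counts"][switch] += 1              # Dirichlet-style count
        c["reward"] += (reward - c["reward"]) / c["n"]

    def predict(self, context):
        """Return (switch probabilities, expected reward) for a context."""
        context = np.asarray(context, dtype=float)
        k = int(np.argmin([np.linalg.norm(context - c["mean"]) for c in self.clusters]))
        c = self.clusters[k]
        return c["switch_counts"] / c["switch_counts"].sum(), c["reward"]
```

A single surprising collision creates its own cluster immediately, which is exactly the one-example learning described above.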

Learning structure on the fly (Model Expansion and Reduction)

All four mixtures - sMM, iMM, tMM, rMM - share the same expansion rule: if a new observation is too surprising under the current clusters, spin up a fresh component; otherwise, update an existing one. Because updates use closed-form variational formulas, AXIOM can run this one frame at a time, online, with no replay buffer and no gradients. The result is human-scale learning curves: most Gameworld titles are mastered in under ten thousand steps. This expansion algorithm can be thought of as the continuous mixture-model equivalent of the fast structure learning algorithms developed in “Supervised structure learning” [8] and “From pixels to planning: scale-free active inference” [7], where new hidden states are added automatically to accommodate the complexity of the data stream.
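In pseudocode, the shared rule looks something like the following; the Gaussian component form, the surprise threshold `tau`, and the running-average update are simplifications of the closed-form variational updates used by the actual modules.

```python
import numpy as np


def gauss_logpdf(x, mu, var):
    """Diagonal-Gaussian log density."""
    return -0.5 * np.sum((x - mu)**2 / var + np.log(2 * np.pi * var))


def expand_or_update(x, components, tau=-10.0):
    """Shared expansion rule: score the datapoint under every component;
    if even the best score falls below tau, spin up a fresh component
    centred on the datapoint, otherwise update the winner online."""
    x = np.asarray(x, dtype=float)
    if components:
        scores = [gauss_logpdf(x, c["mu"], c["var"]) + np.log(c["n"])
                  for c in components]
        best = int(np.argmax(scores))
    if not components or scores[best] < tau:
        components.append({"mu": x.copy(), "var": np.ones_like(x), "n": 1})
    else:
        c = components[best]
        c["n"] += 1
        c["mu"] += (x - c["mu"]) / c["n"]  # stands in for exact conjugate updates
    return components
```

Note that the datapoint is absorbed in a single pass: there is no buffer to revisit and no gradient step to schedule.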

Expansion alone would eventually leave a sprawling set of near-duplicate clusters. Periodically, AXIOM therefore applies Bayesian model reduction (BMR): it asks whether merging two clusters would increase the expected model evidence. If so, the reduction is kept; if not, it is rolled back. BMR turns dozens of one-off “ball-moving-left” clusters seen in different parts of space into a single, general rule for identifying dynamics, improving generalisation and keeping inference fast by reducing the number of model parameters.
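A cartoon of the merge test for one-dimensional Gaussian components is sketched below. The evidence expression is a crude conjugate approximation chosen for brevity, not the exact computation used by AXIOM; `ss` denotes each component's sum of squared deviations from its mean.

```python
import numpy as np


def log_evidence(n, ss, prior_var=10.0, noise_var=1.0):
    """Approximate log evidence of a Gaussian component with a Gaussian
    prior over its mean, from n points with squared deviations ss."""
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    accuracy = -0.5 * (ss / noise_var + n * np.log(2.0 * np.pi * noise_var))
    complexity = 0.5 * np.log(post_var / prior_var)  # Occam factor
    return accuracy + complexity


def try_merge(c1, c2):
    """BMR-style test: keep the merge only if the pooled component's
    evidence beats the two components kept separate; otherwise roll back."""
    n = c1["n"] + c2["n"]
    mu = (c1["n"] * c1["mu"] + c2["n"] * c2["mu"]) / n
    ss = (c1["ss"] + c1["n"] * (c1["mu"] - mu)**2
          + c2["ss"] + c2["n"] * (c2["mu"] - mu)**2)
    if log_evidence(n, ss) >= log_evidence(c1["n"], c1["ss"]) + log_evidence(c2["n"], c2["ss"]):
        return {"n": n, "mu": mu, "ss": ss}  # keep the reduction
    return None                              # roll back
```

In this cartoon, two clusters describing the same dynamics pool their statistics and pass the test, while clusters describing genuinely different dynamics fail it and stay separate.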

 

Acting through Active Inference

AXIOM plans with active inference. It rolls out imagined futures under its current world model, scores them by (i) expected reward and (ii) the information they would add to the rMM, then chooses the action sequence with the lowest expected free energy. Early in training, the information-gain term drives curiosity; once the dynamics are nailed down, utility dominates, and the agent settles on high-scoring play.
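Schematically, the planner does something like the following. The `world_model` interface (`current_state`, `predict`, `info_gain`) is an assumed stand-in for AXIOM's modules, and the real expected free energy is computed over posterior distributions rather than point predictions.

```python
import numpy as np


class DummyModel:
    """Minimal stand-in world model exposing the interface the planner assumes."""

    def current_state(self):
        return np.zeros(2)

    def predict(self, state, action):
        next_state = state + np.array([action, 0.0])
        expected_reward = -abs(next_state[0] - 3.0)  # toy utility landscape
        return next_state, expected_reward

    def info_gain(self, state, action):
        return 0.1 / (1.0 + np.linalg.norm(state))   # toy epistemic bonus


def expected_free_energy(policy, model, horizon=10):
    """G = -(expected reward) - (expected information gain), accumulated
    over an imagined rollout; the agent picks the policy with lowest G."""
    state = model.current_state()
    G = 0.0
    for action in policy[:horizon]:
        state, expected_reward = model.predict(state, action)
        G -= expected_reward                 # utility term (exploitation)
        G -= model.info_gain(state, action)  # epistemic term (exploration)
    return G


policies = [[1, 1, 1], [0, 0, 0], [-1, -1, -1]]
best = min(policies, key=lambda p: expected_free_energy(p, DummyModel()))
```

Early on, the `info_gain` term dominates and the agent probes unfamiliar configurations; as the rMM's parameters sharpen, the gain shrinks and the reward term takes over.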

 

Why this matters

Sample efficiency. AXIOM’s structured world model and efficient learning algorithm mean that a single update pass suffices to internalise a new observation. The complete model reaches high scores relative to SOTA backpropagation-based RL baselines (BBF and DreamerV3 - see Figure 4) on the ten-game Gameworld 10k benchmark, often requiring far fewer than 10,000 frames to reach top performance.

Paradigm shift. AXIOM moves away from overparameterized models optimized with backpropagation on large replay buffers, toward mixture models that are grown and pruned one datapoint at a time. This represents a compute- and sample-efficient Bayesian departure from SOTA methods in deep RL, which rely on massive models, compute budgets, and stochastic gradient descent for optimization.

Interpretability by design. Because objects, motions and interactions sit in explicit clusters, we can render them back to the screen (see Figures 5–6) and watch the agent imagine the next bounce or predict where penalties lurk.

Robustness to cosmetic shifts. Change every sprite from green to purple mid-game, and the iMM simply adds a “purple” identity and re-uses the same motion verbs; the policy hardly falters (see Figure 4(d) from the preprint).

Together, these ingredients make AXIOM a concrete demonstration that Bayesian structure learning, coupled with object-centric core priors, can deliver the rapid learning and graceful generalisation long promised by cognitive theories of intelligence, while staying computationally practical for real-world use.

Results

Figure 4.  Online learning performance. Moving average (1k steps) reward per step during training for AXIOM, BBF and DreamerV3 on the Gameworld 10k environments. Mean and standard deviation over 10 parameter seeds per model and environment.

 

Figure 5. Raw pixels, segmentations of moving objects, planned trajectories, and growing and merging of the rMM model (left to right) on Explode (top row) and Impact (bottom row).  

 


Figure 6. Visualisation of reward and gameplay over 2,000 steps on Explode (left) and Impact (right). These results show that AXIOM can learn to play the game with a minimal number of datapoints.

 

Scaling to Complex Visual Scenes

While AXIOM demonstrates impressive sample efficiency on our Gameworld 10k benchmark, the base environments use deliberately simplified visual elements - single-colour sprites of different shapes. This suggests a clear avenue for future work, which should explore how the approach can handle more complex scenes, with multicolored, multi-part sprites and richer visual textures. The slot mixture model (sMM) naturally affords hierarchical extensions that could handle far more complex visual scenes, where objects are composed of simpler sub-objects in a dynamic scene graph. A car, for instance, might be decomposed into wheels, body, and windows, each with its own dynamics and interaction patterns. These high-level latent representations - summarising the complicated visual parts - can then be used more effectively for dynamics modelling and planning. This hierarchical structure would allow AXIOM to maintain its interpretability and sample efficiency even when dealing with the rich visual complexity of environments like Atari or modern video games.


Figure 7. Hierarchical decomposition. Example hierarchical segmentation on a version of Explode with more complex sprites (left panel). The first level sMM (middle panel) identifies each single-colored part of a complex, multi-colored object as a slot. The second level sMM (right panel) clusters the lower-level parts into whole objects.

 

Future work

Discovering Core Priors

Our work demonstrates that when the set of core priors aligns with the structural regularities present in the data distribution, fast structure learning can rapidly acquire models capable of complex behaviour. AXIOM's success on Gameworld 10k stems precisely from this alignment: the object-centric, sparsely-interacting world structure we assumed matches the actual generative process underlying these games. Thus, we have shown that utilising core priors can close much of the sample-efficiency gap between current agents and humans when those core priors match the underlying data structure.

Future work should therefore focus on developing principled methods to automatically infer such core priors across a domain of tasks, enabling our approach to scale to domains like Minecraft, where the underlying generative processes are less transparent but still governed by discoverable structural regularities. The challenge lies in discovering organisational principles that operate at the right level of abstraction - general enough to apply across diverse scenarios within a domain, yet specific enough to meaningfully constrain the search space for structure discovery. This represents a crucial step toward building truly adaptive agents that can rapidly construct structural models of novel environments without requiring explicit engineering of domain-specific architectural constraints, ultimately bridging the gap between the flexibility of modern deep learning and the sample efficiency of human-like learning.

Inspired by recent work highlighting soft inductive biases as a unifying principle in modern deep learning [10] - rather than strictly restricting the hypothesis space, one embraces a flexible space with a soft preference for simpler, data-consistent solutions - we can envision similar soft core priors for AXIOM. Rather than encoding our meta-structural principles (sparsity, compositionality, locality, extensiveness) as hard constraints, we would impose them as soft variants that gently bias the model toward these organisational motifs while still permitting deviations when the data demand novel structures. Investigating how to design, learn, and balance these soft core priors - so that AXIOM retains both its interpretability and its human-like sample efficiency without brittleness - remains an exciting open question for future work.

 

Conclusion

In this work, we introduced AXIOM [9], a novel and fully Bayesian object-centric agent that learns how to play simple games from raw pixels with improved sample efficiency compared to both model-based and model-free deep RL baselines. Importantly, it does so without relying on neural networks, gradient-based optimization, or replay buffers. We believe it is important to challenge the hegemony of deep learning in reinforcement learning research, and this and previous work from VERSES [2, 7, 8] represent an important step toward demonstrating that alternative paradigms can achieve human-like performance.

References

  1. Singh, R., & Buckley, C. L. (2023). Attention as implicit structural inference. Advances in Neural Information Processing Systems, 36, 24929-24946.
  2. Heins, C., Wu, H., Markovic, D., Tschantz, A., Beck, J., & Buckley, C. (2024). Gradient-free variational learning with conditional mixture networks.
  3. Van de Maele, T., Catal, O., Tschantz, A., Buckley, C. L., & Verbelen, T. (2024). Variational Bayes Gaussian Splatting.
  4. Locatello, F., Weissenborn, D., Unterthiner, T., Mahendran, A., Heigold, G., Uszkoreit, J., ... & Kipf, T. (2020). Object-centric learning with slot attention. Advances in neural information processing systems, 33, 11525-11538.
  5. Kirilenko, D., Vorobyov, V., Kovalev, A. K., & Panov, A. I. (2023). Object-centric learning with slot mixture module.
  6. Greff, K., Kaufman, R. L., Kabra, R., Watters, N., Burgess, C., Zoran, D., ... & Lerchner, A. (2019, May). Multi-object representation learning with iterative variational inference. In International conference on machine learning (pp. 2424-2433). PMLR.
  7. Friston, K., Heins, C., Verbelen, T., Da Costa, L., Salvatori, T., Markovic, D., ... & Parr, T. (2024). From pixels to planning: scale-free active inference.
  8. Friston, K. J., Da Costa, L., Tschantz, A., Kiefer, A., Salvatori, T., Neacsu, V., ... & Buckley, C. L. (2024). Supervised structure learning. Biological Psychology, 193, 108891.
  9. Heins, C., Van de Maele, T., Tschantz, A., Linander, H., Markovic, D., Salvatori, T., ... Verbelen, T., & Buckley, C. (2025). AXIOM: Learning to Play Games in Minutes with Expanding Object-Centric Models.
  10. Wilson, A. G. (2025). Deep learning is not so mysterious or different.