On Upcoming 2024 Benchmark Work from VERSES
As VERSES emerges from stealth mode over the course of 2024, we have put forth a research roadmap that outlines the key milestones and benchmarks against which to measure the progress and significance of our research and development efforts relative to state-of-the-art deep learning, for the benefit of industry, academia, and the public.
We will demonstrate that AI technology based on active inference is able to match performance on industry-standard benchmarks, or exceed it by significant margins, while using orders of magnitude less data. In this blog post, we discuss benchmark targets and explain their overall relevance.
Around two years ago, our research team decided that it was best to stop arguing with peers in the scientific community over the principles and mathematics that underlie active inference. More specifically, several researchers argued that mathematical proofs were not enough, and requested materials assessing how fully implemented active inference technology fares against state-of-the-art deep learning on industry-standard benchmarks.
In our view, this request is fair—especially given the fact that, historically speaking, the Bayesian inference algorithms upon which active inference is based have been difficult to scale up. However, recent advances, both in academia and within our Research Lab at VERSES, have enabled us to significantly reduce the computational cost of inverting generative models that mirror and generalize the structure of modern attention-based neural networks and attention-augmented graph neural networks. The demonstration and validation of our technology—the “pudding” in which we shall offer the “proof” of our approach and technology—is the instantiation of a probabilistic programming language that implements variational Bayes and takes full advantage of modern compute environments.
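To give a flavor of the kind of model inversion that variational Bayes performs, here is a deliberately tiny, textbook-style sketch: mean-field coordinate-ascent inference for a Gaussian with unknown mean and precision under a conjugate Normal-Gamma prior. All names and numbers below are illustrative; this is not VERSES code.

```python
import numpy as np

# Mean-field CAVI for x_i ~ N(mu, 1/tau), with priors
# mu | tau ~ N(mu0, 1/(lam0 * tau)) and tau ~ Gamma(a0, b0).
rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=0.5, size=50)   # synthetic data
N, xbar = len(x), x.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0        # broad priors

# Variational factors: q(mu) = N(mu_n, 1/lam_n), q(tau) = Gamma(a_n, b_n)
a_n = a0 + (N + 1) / 2                        # closed form, fixed
E_tau = a0 / b0                               # initial guess for E[tau]
for _ in range(50):                           # coordinate ascent
    lam_n = (lam0 + N) * E_tau
    mu_n = (lam0 * mu0 + N * xbar) / (lam0 + N)
    E_mu, E_mu2 = mu_n, mu_n**2 + 1 / lam_n
    b_n = b0 + 0.5 * (np.sum(x**2) - 2 * E_mu * np.sum(x) + N * E_mu2
                      + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0**2))
    E_tau = a_n / b_n
```

After convergence, `mu_n` sits near the true mean (2.0) and `E_tau` near the true precision (4.0), recovered from only 50 samples; efficient large-scale variants of exactly this kind of update are what a variational-Bayes probabilistic programming language must provide.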
This is not to say that the community working on active inference has not made great strides in scaling up these technologies. Indeed, in recent years, work on predictive coding, a particular implementation of this line of work, has been shown to scale to the performance of large deep learning architectures; moreover, it does so using a completely distributed architecture (see [1] for a review). However, despite the success of predictive coding for classification and generation, recent work has shown it to be unsuitable for reinforcement learning tasks. In short, predictive coding relies on Gaussian approximations to the distributions used in active inference, and thus lacks the expressive capacity to perform the kind of sophisticated planning needed to showcase the benefits of active inference on tasks tailored for reinforcement learning.
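To make the Gaussian assumption concrete, here is a minimal predictive-coding-style inference loop for a two-layer linear Gaussian model (weights, dimensions, and step size are all hypothetical): the latent estimate is refined by descending the free-energy gradient, which here reduces to precision-weighted prediction errors.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))   # hypothetical generative weights: x ~ N(W mu, I)
x = rng.normal(size=4)        # a single observation
mu = np.zeros(2)              # latent estimate, prior N(0, I)

lr = 0.05
for _ in range(200):
    eps_x = x - W @ mu        # sensory prediction error
    eps_mu = -mu              # prior prediction error (prior mean 0)
    mu += lr * (W.T @ eps_x + eps_mu)  # gradient descent on free energy
```

For this linear Gaussian model the loop converges to the exact posterior mean, `(W.T W + I)^-1 W.T x`; the limitation noted above is precisely that every belief in such a scheme is a Gaussian, which is too restrictive for the multimodal, discrete planning distributions that reinforcement learning tasks demand.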
Over the past several years, researchers in active inference, many of whom are at VERSES, have explored combining deep learning and active inference [2,3,4,5,6]. Essentially, this involves using a deep neural network to learn a direct mapping from observations to an approximate posterior, a technique known as amortized inference. This works well on reinforcement learning benchmarks, by leveraging the fact that the objective function used in active inference contains a term that leads agents to act in order to resolve uncertainty, yielding behavior premised on information gain (sometimes called an intrinsic objective) [7]. However, because it relies on deep learning to perform amortized inference, this approach ultimately suffers from the same sample inefficiency as deep learning itself.
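The information-gain term can be illustrated with a toy discrete example (our own illustrative sketch, not the exact objective of the cited papers): for a categorical model, the epistemic value of an action is the expected KL divergence between the agent's state beliefs after and before an observation. Actions whose observation channels are informative score higher and are therefore preferred, all else being equal.

```python
import numpy as np

def expected_info_gain(A, qs):
    """E_{q(o)}[ KL(q(s|o) || q(s)) ] for a likelihood A[o, s] and prior qs."""
    qo = A @ qs                      # predictive observation distribution q(o)
    joint = A * qs                   # q(o, s) = p(o|s) q(s)
    post = joint / qo[:, None]       # posterior q(s|o) for each outcome o
    kl = np.sum(post * np.log(post / qs), axis=1)
    return float(qo @ kl)

A_sharp = np.array([[0.9, 0.1],      # informative observation channel
                    [0.1, 0.9]])
A_flat = np.full((2, 2), 0.5)        # uninformative channel
qs = np.array([0.5, 0.5])            # maximally uncertain prior belief

# An action routed through the informative channel resolves more uncertainty.
assert expected_info_gain(A_sharp, qs) > expected_info_gain(A_flat, qs)
```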
We believe that the real power of active inference can only be realized if we can perform efficient inference on expressive distributions, and avoid the sample inefficiency incurred by amortization. This is exactly what our new approach enables, and we have set out to demonstrate it in benchmark work. We will initially demonstrate that we have reduced the computational cost of Bayesian inference, relative to contemporary approaches like NumPyro and PyTorch, achieving computational costs comparable to non-Bayesian approaches to machine learning. A full demonstration of the viability and scalability of this approach will come from applying it to state-of-the-art reinforcement learning problems, comparing sample efficiency and performance against other state-of-the-art architectures. Lastly, going beyond this, we will showcase the unique potential of our technology in a multi-agent setting, where active inference uniquely enables agents to coordinate by sharing beliefs with one another.
Why these benchmarks?
First benchmark: Classification and generation tasks
Our first benchmark will demonstrate the compute and sample efficiency of our approach on classification and generation tasks such as MNIST and CIFAR; in particular, demonstrating its computational efficiency over and above other state-of-the-art Bayesian inference toolboxes, such as NumPyro. We will also show that this approach is competitive with the computational efficiency of traditional deep learning approaches, based on tools like PyTorch, while adding the greater sample efficiency that comes from adopting a fully Bayesian approach. We hope to share these results, demonstrating the efficient compute and improved sample efficiency of our approach to classification and generation tasks, in Q1–Q2 2024 in open-access publications.
Second benchmark: Atari 100k
The second benchmark will demonstrate that our approach is vastly more sample and compute efficient than state-of-the-art alternatives. To showcase this, we have chosen the Atari 100k challenge. The initial Atari benchmark was introduced in 2015 [8] and involves producing a single AI system that can meet or beat human-level performance on 26 classic Atari games. The task involves working directly from pixel data, using only the score as a reward signal. The initial architecture designed for this was extremely data-heavy, using years of gameplay, usually orders of magnitude more data than a human player would ever have access to. To address this, the Atari 100k benchmark was introduced, which restricts the amount of gameplay used in learning to 100,000 environment steps. Thus, Atari 100k is a good benchmark to showcase the power and sample efficiency of the active inference approach. Our goal is to further increase the sample and compute efficiency of these algorithms by using active inference. We expect two sources of gains in efficiency. The first comes from fast online learning of the world model. The second comes from efficient policy estimation that does not require the periodic resets used by current state-of-the-art gradient-based methods, such as Q-learning. We hope to share these results, demonstrating a frugal and sample-efficient approach to reinforcement learning tasks using our active inference-based architecture, in Q3–Q4 2024, again in open-access publications.
Our immediate aim is to demonstrate competitive play on the 100k benchmark. To go further and showcase the unique strength of active inference-based AI, namely vastly improved sample efficiency, we propose the Atari 10k benchmark challenge (roughly 12 minutes of gameplay), using only raw pixel data and the score as input. The challenge is to reach human-level performance (or greater), measured on the same amount of gameplay. We know humans can achieve competent play very quickly, but how do state-of-the-art architectures perform? We will demonstrate that our system vastly outperforms state-of-the-art deep learning on the 10k benchmark, learning to play efficiently from very little data. We have preliminary results demonstrating that active inference agents are able to get a basic grasp of gameplay, scoring on simple games, in very few samples.
Third benchmark: NeurIPS 2024 Melting Pot Challenge
One of the bugbears of the active inference community has always been that, when trying to compare the capabilities of our algorithms, we are forced to compete with state-of-the-art approaches on benchmarks that have been set up by and for the deep learning community. Accordingly, these benchmarks have tended to cater to the strengths of deep learning approaches, i.e., they often involve noiseless tasks that are fully observed (with no ambiguity) and that involve well-defined reward functions. Such benchmarks do not showcase the power of active inference. Our ultimate goal is to develop more naturalistic benchmarks that showcase the ability of active inference agents to deal with much more uncertain environments. As a compromise, however, for our third benchmark we will take on the new multi-agent NeurIPS Melting Pot Challenge benchmark. One of the main advantages of building active inference agents that work directly in belief space, with an explicit representational structure, is that it becomes possible to share beliefs between agents. We believe that this benchmark will showcase the benefits that active inference brings to engineering multi-agent systems, and that it aligns with the central ambition of VERSES AI research: to create ecosystems of AI systems. We hope to share these results, showcasing the unique ability of active inference agents to lay the foundations of smart multi-agent systems, around Q4 2024–Q1 2025.
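As a hypothetical sketch of what belief sharing can look like in belief space (an illustrative assumption, not a description of our implementation), two agents holding categorical posteriors over a shared latent state can pool them with a normalized product, a standard product-of-experts combination that is only possible because beliefs are represented explicitly.

```python
import numpy as np

def pool_beliefs(*posteriors):
    """Combine categorical posteriors by normalized elementwise product."""
    combined = np.prod(np.stack(posteriors), axis=0)
    return combined / combined.sum()

agent_a = np.array([0.7, 0.2, 0.1])   # agent A suspects state 0
agent_b = np.array([0.5, 0.4, 0.1])   # agent B is less certain
shared = pool_beliefs(agent_a, agent_b)
```

Here the pooled belief concentrates on state 0 more sharply than either agent's belief alone: agreement between agents compounds evidence, which is one intuition for why belief sharing can accelerate multi-agent coordination.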
References

[1] Salvatori, T., Mali, A., Buckley, C. L., Lukasiewicz, T., Rao, R. P., Friston, K., & Ororbia, A. (2023). Brain-inspired computational intelligence via predictive coding. arXiv preprint arXiv:2308.07870.
[2] Tschantz, A., Baltieri, M., Seth, A. K., & Buckley, C. L. (2020). Scaling active inference. In 2020 International Joint Conference on Neural Networks (IJCNN) (pp. 1-8). IEEE.
[3] Tschantz, A., Millidge, B., Seth, A. K., & Buckley, C. L. (2020). Reinforcement learning through active inference. arXiv e-prints, arXiv-2002.
[4] Mazzaglia, P., Verbelen, T., & Dhoedt, B. (2021). Contrastive active inference. Advances in Neural Information Processing Systems, 34, 13870-13882.
[5] Çatal, O., Wauthier, S., De Boom, C., Verbelen, T., & Dhoedt, B. (2020). Learning generative state space models for active inference. Frontiers in Computational Neuroscience, 14, 574372.
[6] Çatal, O., Verbelen, T., Van de Maele, T., Dhoedt, B., & Safron, A. (2021). Robot navigation as hierarchical active inference. Neural Networks, 142, 192-204.
[7] Mazzaglia, P., Catal, O., Verbelen, T., & Dhoedt, B. (2022). Curiosity-driven exploration via latent Bayesian surprise. Proceedings of the AAAI Conference on Artificial Intelligence, 36(7), 7752-7760.
[8] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.