A quantitative demonstration, based on MDPs, of the increasing need for a world model (learnt or given) as the complexity of the task and the performance of the agent increase

Jonathan Richens, David Abel, Alexis Bellot, Tom Everitt, General agents contain world models, arXiv cs:AI, Sep. 2025, arXiv:2506.01622.

Are world models a necessary ingredient for flexible, goal-directed behaviour, or is model-free learning sufficient? We provide a formal answer to this question, showing that any agent capable of generalizing to multi-step goal-directed tasks must have learned a predictive model of its environment. We show that this model can be extracted from the agent’s policy, and that increasing the agent’s performance or the complexity of the goals it can achieve requires learning increasingly accurate world models. This has a number of consequences: from developing safe and general agents, to bounding agent capabilities in complex environments, and providing new algorithms for eliciting world models from agents.

Inclusion of LLMs in multiple task learning for generating rewards

Z. Lin, Y. Chen and Z. Liu, AutoSkill: Hierarchical Open-Ended Skill Acquisition for Long-Horizon Manipulation Tasks via Language-Modulated Rewards, IEEE Transactions on Cognitive and Developmental Systems, vol. 17, no. 5, pp. 1141-1152, Oct. 2025, 10.1109/TCDS.2025.3551298.

A desirable property of generalist robots is the ability to both bootstrap diverse skills and solve new long-horizon tasks in open-ended environments without human intervention. Recent advancements have shown that large language models (LLMs) encapsulate vast-scale semantic knowledge about the world to enable long-horizon robot planning. However, they are typically restricted to reasoning about high-level instructions and lack world grounding, which makes it difficult for them to coordinately bootstrap and acquire new skills in unstructured environments. To this end, we propose AutoSkill, a hierarchical system that empowers the physical robot to automatically learn to cope with new long-horizon tasks by growing an open-ended skill library without hand-crafted rewards. AutoSkill consists of two key components: 1) an in-context skill chain generation and new skill bootstrapping guided by LLMs that inform the robot of discrete and interpretable skill instructions for skill retrieval and augmentation within the skill library; and 2) a zero-shot language-modulated reward scheme that, in conjunction with a meta prompter, facilitates online new skill acquisition via expert-free supervision aligned with proposed skill directives. Extensive experiments conducted in both simulated and realistic environments demonstrate AutoSkill’s superiority over other LLM-based planners as well as hierarchical methods in expediting online learning for novel manipulation tasks.

A cognitive map implemented according to the latest biological knowledge and aimed at robot navigation

M. A. Hicks, T. Lei, C. Luo, D. W. Carruth and Z. Bi, A Bio-Inspired Goal-Directed Cognitive Map Model for Robot Navigation and Exploration, IEEE Transactions on Cognitive and Developmental Systems, vol. 17, no. 5, pp. 1125-1140, Oct. 2025, 10.1109/TCDS.2025.3552085.

The concept of a cognitive map (CM), or spatial map, was originally proposed to explain how mammals learn and navigate their environments. Over time, extensive research in neuroscience and psychology has established the CM as a widely accepted model. In this work, we introduce a new goal-directed cognitive map (GDCM) model that takes a nontraditional approach to spatial mapping for robot navigation and path planning. Unlike conventional models, GDCM does not require complete environmental exploration to construct a graph for navigation purposes. Inspired by biological navigation strategies, such as the use of landmarks, Euclidean distance, random motion, and reward-driven behavior, the GDCM can navigate complex, static environments efficiently without needing to explore the entire workspace. The model utilizes known cell types (head direction, speed, border, grid, and place cells) that constitute the CM, arranged in a unique configuration. Each cell model is designed to emulate its biological counterpart in a simple, computationally efficient way. Through simulation-based comparisons, this innovative CM graph-building approach demonstrates more efficient navigation than traditional models that require full exploration. Furthermore, GDCM consistently outperforms several established path planning and navigation algorithms by finding better paths.

On the model that humans use for predicting movements of targets in order to reach them, and some evidence of a biological Kalman filter-like processing

John F. Soechting, John Z. Juveli, and Hrishikesh M. Rao, Models for the Extrapolation of Target Motion for Manual Interception, J Neurophysiol 102: 1491–1502, 2009 (first published July 1, 2009), 10.1152/jn.00398.2009.

Intercepting a moving target requires a prediction of the target’s future motion. This extrapolation could be achieved using sensed parameters of the target motion, e.g., its position and velocity. However, the accuracy of the prediction would be improved if subjects were also able to incorporate the statistical properties of the target’s motion, accumulated as they watched the target move. The present experiments were designed to test for this possibility. Subjects intercepted a target moving on the screen of a computer monitor by sliding their extended finger along the monitor’s surface. Along any of the six possible target paths, target speed could be governed by one of three possible rules: constant speed, a power law relation between speed and curvature, or the trajectory resulting from a sum of sinusoids. A go signal was given to initiate interception and was always presented when the target had the same speed, irrespective of the law of motion. The dependence of the initial direction of finger motion on the target’s law of motion was examined. This direction did not depend on the speed profile of the target, contrary to the hypothesis. However, finger direction could be well predicted by assuming that target location was extrapolated using target velocity and that the amount of extrapolation depended on the distance from the finger to the target. Subsequent analysis showed that the same model of target motion was also used for on-line, visually mediated corrections of finger movement when the motion was initially misdirected.
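
The extrapolation rule described in this abstract (aim at the target location projected forward along its velocity, with more extrapolation when the finger is farther away) can be sketched as follows. This is only an illustration: the linear dependence of the look-ahead time on distance and the `gain` parameter are assumptions made here, not the functional form fitted in the paper.

```python
import math


def extrapolated_aim_point(target_pos, target_vel, finger_pos, gain=0.5):
    """Return the target position projected forward along its velocity.

    Assumption (hypothetical): the look-ahead time is proportional to the
    current finger-target distance, look_ahead = gain * distance.
    """
    distance = math.hypot(target_pos[0] - finger_pos[0],
                          target_pos[1] - finger_pos[1])
    look_ahead = gain * distance
    return (target_pos[0] + target_vel[0] * look_ahead,
            target_pos[1] + target_vel[1] * look_ahead)


def initial_direction(target_pos, target_vel, finger_pos, gain=0.5):
    """Direction (radians) from the finger to the extrapolated aim point."""
    aim_x, aim_y = extrapolated_aim_point(target_pos, target_vel,
                                          finger_pos, gain)
    return math.atan2(aim_y - finger_pos[1], aim_x - finger_pos[0])
```

Note that only the instantaneous position and velocity of the target enter the computation, which matches the finding that the initial finger direction did not depend on the target's speed profile.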

Improvements in offline RL (from previously acquired datasets)

Lan Wu, Quan Liu, Renyang You, State slow feature softmax Q-value regularization for offline reinforcement learning, Engineering Applications of Artificial Intelligence, Volume 160, Part A, 2025, 10.1016/j.engappai.2025.111828.

Offline reinforcement learning is constrained by its reliance on pre-collected datasets, without the opportunity for further interaction with the environment. This restriction often results in distribution shifts, which can exacerbate Q-value overestimation and degrade policy performance. To address these issues, we propose a method called state slow feature softmax Q-value regularization (SQR), which enhances the stability and accuracy of Q-value estimation in offline settings. SQR employs slow feature representation learning to extract dynamic information from state trajectories, promoting the stability and robustness of the state representations. Additionally, a softmax operator is incorporated into the Q-value update process to smooth Q-value estimation, reducing overestimation and improving policy optimization. Finally, we apply our approach to locomotion and navigation tasks and establish a comprehensive experimental analysis framework. Empirical results demonstrate that SQR outperforms state-of-the-art offline RL baselines, achieving performance improvements ranging from 2.5% to 44.6% on locomotion tasks and 2.0% to 71.1% on navigation tasks. Moreover, it achieves the highest score on 7 out of 15 locomotion datasets and 4 out of 6 navigation datasets. Detailed experimental results confirm the stabilizing effect of slow feature learning and the effectiveness of the softmax regularization in mitigating Q-value overestimation, demonstrating the superiority of SQR in addressing key challenges in offline reinforcement learning.
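
To make the role of the softmax operator concrete, here is a minimal sketch of a generic softmax-weighted backup that replaces the usual max over next-state Q-values. It is not the SQR update itself: the function names, the inverse temperature `beta`, and the plain tabular setting are assumptions chosen for illustration.

```python
import math


def softmax_value(q_values, beta=1.0):
    """Softmax-weighted average of Q-values (never exceeds max(q_values)).

    beta is an assumed inverse temperature: beta = 0 gives the plain mean,
    and large beta approaches the max operator.
    """
    q_max = max(q_values)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(beta * (q - q_max)) for q in q_values]
    total = sum(weights)
    return sum(w * q for w, q in zip(weights, q_values)) / total


def td_target(reward, next_q_values, gamma=0.99, beta=1.0, done=False):
    """Bootstrapped target using the softmax value instead of the max."""
    if done:
        return reward
    return reward + gamma * softmax_value(next_q_values, beta)
```

Because the softmax value is a weighted average, a single overestimated action value in `next_q_values` pulls the target up less than it would under a hard max, which is the smoothing effect the abstract refers to.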

Clustering of state transitions in RL

Yasaman Saffari, Javad Salimi Sartakhti, A Graph-based State Representation Learning for episodic reinforcement learning in task-oriented dialogue systems, Engineering Applications of Artificial Intelligence, Volume 160, Part A, 2025, 10.1016/j.engappai.2025.111793.

Recent research in dialogue state tracking has made significant progress in tracking user goals using pretrained language models and context-driven approaches. However, existing work has primarily focused on contextual representations, often overlooking the structural complexity and topological properties of state transitions in episodic reinforcement learning tasks. In this study, we introduce a cutting-edge, dual-perspective state representation approach that provides a dynamic and inductive method for topological state representation learning in episodic reinforcement learning within task-oriented dialogue systems. The proposed model extracts inherent topological information from state transitions in the Markov Decision Process graph by employing a modified clustering technique to address the limitations of transductive graph representation learning. It inductively captures structural relationships and enables generalization to unseen states. Another key innovation of this approach is the incorporation of dynamic graph representation learning with task-specific rewards using Temporal Difference error. This captures topological features of state transitions, allowing the system to adapt to evolving goals and enhance decision-making in task-oriented dialogue systems. Experiments, including ablation studies, comparisons with existing approaches, and interpretability analysis, reveal that the proposed model significantly outperforms traditional contextual state representations, improving task success rates by 9%–13% across multiple domains. It also surpasses state-of-the-art Q-network-based methods, enhancing adaptability and decision-making in domains such as movie-ticket booking, restaurant reservations, and taxi ordering.

On the abstraction of actions

Bita Banihashemi, Giuseppe De Giacomo, Yves Lespérance, Abstracting situation calculus action theories, Artificial Intelligence, Volume 348, 2025, 10.1016/j.artint.2025.104407.

We develop a general framework for agent abstraction based on the situation calculus and the ConGolog agent programming language. We assume that we have a high-level specification and a low-level specification of the agent, both represented as basic action theories. A refinement mapping specifies how each high-level action is implemented by a low-level ConGolog program and how each high-level fluent can be translated into a low-level formula. We define a notion of sound abstraction between such action theories in terms of the existence of a suitable bisimulation between their respective models. Sound abstractions have many useful properties that ensure that we can reason about the agent’s actions (e.g., executability, projection, and planning) at the abstract level, and refine and concretely execute them at the low level. We also characterize the notion of complete abstraction where all actions (including exogenous ones) that the high level thinks can happen can in fact occur at the low level. To facilitate verifying that one has a sound/complete abstraction relative to a mapping, we provide a set of necessary and sufficient conditions. Finally, we identify a set of basic action theory constraints that ensure that for any low-level action sequence, there is a unique high-level action sequence that it refines. This allows us to track/monitor what the low-level agent is doing and describe it in abstract terms (i.e., provide high-level explanations, for instance, to a client or manager).

Learning representations from RL based on symmetries

Alexander Dean, Eduardo Alonso, Esther Mondragón, Algebras of actions in an agent’s representations of the world, Artificial Intelligence, Volume 348, 2025, 10.1016/j.tics.2025.06.009.

Learning efficient representations allows robust processing of data, data that can then be generalised across different tasks and domains, and it is thus paramount in various areas of Artificial Intelligence, including computer vision, natural language processing and reinforcement learning, among others. Within the context of reinforcement learning, we propose in this paper a mathematical framework to learn representations by extracting the algebra of the transformations of worlds from the perspective of an agent. As a starting point, we use our framework to reproduce representations from the symmetry-based disentangled representation learning (SBDRL) formalism proposed by [1] and prove that, although useful, they are restricted to transformations that respond to the properties of algebraic groups. We then generalise two important results of SBDRL –the equivariance condition and the disentangling definition– from only working with group-based symmetry representations to working with representations capturing the transformation properties of worlds for any algebra, using examples common in reinforcement learning and generated by an algorithm that computes their corresponding Cayley tables. Finally, we combine our generalised equivariance condition and our generalised disentangling definition to show that disentangled sub-algebras can each have their own individual equivariance conditions, which can be treated independently, using category theory. In so doing, our framework offers a rich formal tool to represent different types of symmetry transformations in reinforcement learning, extending the scope of previous proposals and providing Artificial Intelligence developers with a sound foundation to implement efficient applications.
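
As a small illustration of what a Cayley table of world transformations looks like, the sketch below builds one for a toy world that is an assumption of this note, not an example taken from the paper: a corridor of three cells with walls at both ends, where the agent can stay, step right, or step left. Each action is encoded as a tuple mapping every state to its successor, and the table is obtained by closing the generators under composition.

```python
def compose(f, g):
    """Apply f first, then g; each transformation is a tuple state -> state."""
    return tuple(g[s] for s in f)


def closure(generators):
    """All distinct transformations reachable by composing the generators."""
    elements = list(dict.fromkeys(generators))
    frontier = list(elements)
    while frontier:
        f = frontier.pop()
        for g in generators:
            h = compose(f, g)
            if h not in elements:
                elements.append(h)
                frontier.append(h)
    return elements


def cayley_table(elements):
    """Map each ordered pair (f, g) to the composite 'f then g'."""
    return {(f, g): compose(f, g) for f in elements for g in elements}


# Hypothetical toy world: cells 0, 1, 2; moving into a wall leaves the agent in place.
IDENTITY = (0, 1, 2)
RIGHT = (1, 2, 2)
LEFT = (0, 0, 1)

ELEMENTS = closure([IDENTITY, RIGHT, LEFT])
TABLE = cayley_table(ELEMENTS)
```

In this toy world, stepping right sends both cell 1 and cell 2 to cell 2, so no sequence of actions undoes it for every starting cell. The resulting algebra has an identity and is closed under composition but lacks inverses, which makes it a simple case of the non-group structures the abstract sets out to cover.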

Short letter with evidence of the use of models in mammal decision making, relating it to reinforcement learning

Ivo Jacobs, Tomas Persson, Peter Gärdenfors, Model-based animal cognition slips through the sequence bottleneck, Trends in Cognitive Sciences, Volume 29, Issue 10, 2025, Pages 872-873, 10.1016/j.tics.2025.06.009.

In a recent article in TiCS, Lind and Jon-And argued that the sequence memory of animals constitutes a cognitive bottleneck, the ‘sequence bottleneck’, and that mental simulations require faithful representation of sequential information. They therefore concluded that animals cannot perform mental simulations, and that behavioral and neurobiological studies suggesting otherwise are best interpreted as results of associative learning. Through examples of predictive maps, cognitive control, and active sleep, we illustrate the overwhelming evidence that mammals and birds make model-based simulations, which suggests the sequence bottleneck to be more limited in scope than proposed by Lind and Jon-And […]

There is a response to this paper.

A good review of the state of the art in hybridizing NNs and physical knowledge

Mikel Merino-Olagüe, Xabier Iriarte, Carlos Castellano-Aldave, Aitor Plaza, Hybrid modelling and identification of mechanical systems using Physics-Enhanced Machine Learning, Engineering Applications of Artificial Intelligence, Volume 159, Part C, 2025, 10.1016/j.engappai.2025.111762.

Obtaining mathematical models for mechanical systems is a key subject in engineering. These models are essential for calculation, simulation and design tasks, and they are usually obtained from physical principles or by fitting a black-box parametric input–output model to experimental data. However, both methodologies have some limitations: physics-based models may not take some phenomena into account and black-box models are complicated to interpret. In this work, we develop a novel methodology based on discrepancy modelling, which combines physical principles with neural networks to model mechanical systems with partially unknown or unmodelled physics. Two different mechanical systems with partially unknown dynamics are successfully modelled and the values of their physical parameters are obtained. Furthermore, the obtained models enable numerical integration for future state prediction, linearization and the possibility of varying the values of the physical parameters. The results show how a hybrid methodology provides accurate and interpretable models for mechanical systems when some physical information is missing. In essence, the presented methodology is a tool to obtain better mathematical models, which could be used for analysis, simulation and design tasks.
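
The idea of discrepancy modelling can be sketched on a toy case: a known physics term predicts part of the acceleration, and a data-driven term is fitted to whatever the physics leaves unexplained. Everything in this sketch is an assumption made for illustration: the mass-spring system, the parameter values, and above all the one-parameter least-squares term that stands in for the neural network used in the paper.

```python
def physics_acceleration(x, stiffness=4.0, mass=1.0):
    """Known part of the model: an undamped linear spring."""
    return -(stiffness / mass) * x


def fit_discrepancy(positions, velocities, accelerations):
    """Fit the residual a_measured - a_physics with the term -d * v.

    Returns d (damping per unit mass) by one-parameter least squares.
    Stand-in for a neural network: any regressor could play this role.
    """
    residuals = [a - physics_acceleration(x)
                 for x, a in zip(positions, accelerations)]
    numerator = sum(v * r for v, r in zip(velocities, residuals))
    denominator = sum(v * v for v in velocities)
    return -numerator / denominator


def hybrid_acceleration(x, v, damping_over_mass):
    """Hybrid model: physics term plus the learned discrepancy term."""
    return physics_acceleration(x) - damping_over_mass * v


# Synthetic measurements from a spring with an unmodelled damper (d = 0.3).
xs = [0.0, 0.5, 1.0, -0.5, -1.0]
vs = [1.0, 0.8, -0.2, -0.9, 0.4]
measured = [-4.0 * x - 0.3 * v for x, v in zip(xs, vs)]
estimated_damping = fit_discrepancy(xs, vs, measured)
```

Since the physics term keeps its explicit parameters, the hybrid model can still be integrated, linearized, or re-run with a different stiffness, which is the interpretability benefit the abstract emphasises.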