Model-based offline RL that addresses the problem of building models that can generate out-of-distribution data more safely

X.-Y. Liu et al., DOMAIN: Mildly Conservative Model-Based Offline Reinforcement Learning, IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 55, no. 10, pp. 7142-7155, Oct. 2025, 10.1109/TSMC.2025.3578666.

Model-based reinforcement learning (RL), which learns an environment model from the offline dataset and uses it to generate additional out-of-distribution model data, has become an effective approach to the problem of distribution shift in offline RL. Due to the gap between the learned and the actual environment, conservatism should be incorporated into the algorithm to balance accurate offline data against imprecise model data. The conservatism of current algorithms mostly relies on model uncertainty estimation. However, uncertainty estimation is unreliable and leads to poor performance in certain scenarios, and previous methods ignore differences among model data samples, which results in excessive conservatism. To address these issues, this article proposes a mildly conservative model-based offline RL algorithm (DOMAIN) that does not estimate model uncertainty, and designs an adaptive sampling distribution over model samples that can adaptively adjust the penalty on model data. The article theoretically demonstrates that the Q value learned by DOMAIN outside the data region is a lower bound of the true Q value, that DOMAIN is less conservative than previous model-based offline RL algorithms, and that it offers a safe policy improvement guarantee. Extensive experiments show that DOMAIN outperforms prior RL algorithms, improving average performance by 1.8% on the D4RL benchmark.

Related: 10.1109/TSMC.2025.3583392
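
As a rough illustration of mild conservatism (not the authors' DOMAIN implementation), the Python sketch below penalizes Q-values on model-generated transitions with adaptive per-sample weights instead of a uniform uncertainty-based penalty; the softmax weighting, function names, and loss structure are assumptions made for illustration.

import torch

def adaptive_model_penalty(q_net, model_obs, model_act, beta=1.0):
    # Q-values on model rollout data; samples with higher (likely overestimated)
    # Q get larger weights, so the penalty adapts to each model sample.
    q_model = q_net(model_obs, model_act)
    weights = torch.softmax(q_model.detach() / beta, dim=0)
    return (weights * q_model).sum()

def critic_loss(q_net, offline_batch, model_batch, alpha=1.0):
    obs, act, q_target = offline_batch                     # real (offline) transitions
    td_loss = ((q_net(obs, act) - q_target) ** 2).mean()   # standard Bellman error
    penalty = adaptive_model_penalty(q_net, *model_batch)  # conservative term on model data
    return td_loss + alpha * penalty                       # alpha trades off conservatism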

DRL in non-stationary environments

Y. Fu and Y. Gao, Learning Hidden Transition for Nonstationary Environments With Multistep Tree Search, IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 55, no. 10, pp. 7012-7023, Oct. 2025, 10.1109/TSMC.2025.3578730.

Deep reinforcement learning (DRL) algorithms have shown impressive results in various applications, but nonstationary environments, such as varying operating conditions and external disturbances, remain a significant challenge. To address this challenge, we propose the hidden transition inference (HTI) framework for learning nonstationary transitions in multistep tree search. Different from previous methods that focus on single-step transition changes, the HTI framework improves decision-making by inferring multistep environmental variations. Specifically, this framework constructs a probabilistic graphical model for Monte Carlo tree search (MCTS) in latent space and utilizes the variational lower bound of hidden states for policy improvement. Furthermore, this work theoretically proves the convergence of the HTI framework, ensuring its effectiveness in handling nonstationary environments. The proposed framework is integrated with the state-of-the-art MCTS-based algorithm Sampled MuZero and evaluated on multiple control tasks with different nonstationary dynamics. Experimental results show that the HTI framework can improve the inference capability of tree search in nonstationary environments, showcasing its potential for addressing the associated control challenges.
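
The Python sketch below shows one generic way to infer a latent variable that summarizes recent transition dynamics via a variational lower bound, so a learned model conditioned on it can support multistep planning under nonstationarity; it is not the paper's HTI code, and the encoder architecture, ELBO form, and latent-conditioned decoder are assumptions.

import torch
import torch.nn as nn

class HiddenTransitionEncoder(nn.Module):
    # Encodes a single (s, a, s') transition into a Gaussian posterior over a
    # latent z that is meant to capture the current (possibly drifting) dynamics.
    def __init__(self, obs_dim, act_dim, latent_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim + act_dim, 64), nn.ReLU(),
            nn.Linear(64, 2 * latent_dim))          # mean and log-variance

    def forward(self, obs, act, next_obs):
        h = self.net(torch.cat([obs, act, next_obs], dim=-1))
        mu, logvar = h.chunk(2, dim=-1)
        return mu, logvar

def elbo_step(encoder, decoder, obs, act, next_obs):
    # One variational step: reconstruct next_obs from (obs, act, z) while keeping
    # the posterior over z close to a standard normal prior.
    mu, logvar = encoder(obs, act, next_obs)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()        # reparameterization
    recon = decoder(torch.cat([obs, act, z], dim=-1))           # latent-conditioned dynamics
    recon_loss = ((recon - next_obs) ** 2).mean()
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()  # KL to N(0, I)
    return recon_loss + kl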

Accelerating image recognition through NNs that do not vary their weights

Yanli Yang, A brain-inspired projection contrastive learning network for instantaneous learning, Engineering Applications of Artificial Intelligence, Volume 158, 2025, 10.1016/j.engappai.2025.111524.

The biological brain can learn quickly and efficiently, while the learning of artificial neural networks is astonishingly time- and energy-consuming. Biosensory information is quickly projected to memory areas to be identified or assigned a label through biological neural networks. Inspired by the fast learning of biological brains, a projection contrastive learning model is designed for the instantaneous learning of samples. This model is composed of an information projection module for rapid information representation and a contrastive learning module for neural manifold disentanglement. An algorithm instance of projection contrastive learning is designed to process machinery vibration signals and is tested on several public datasets. The test on a mixed dataset containing 1426 training samples and 14,260 testing samples shows that the running time of our algorithm is approximately 37 s and that the average processing time is approximately 2.31 ms per sample, which is comparable to the processing speed of the human visual system. A prominent feature of this algorithm is that, in addition to its fast running speed, it can track the decision-making process to provide an explanation of its outputs.
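
A minimal sketch of the general recipe the abstract suggests, a fixed projection followed by prototype comparison with no weight updates; the projection size, tanh nonlinearity, and cosine-similarity classifier are illustrative assumptions rather than the paper's architecture.

import numpy as np

rng = np.random.default_rng(0)
PROJ = rng.standard_normal((1024, 128)) / np.sqrt(1024)    # fixed projection, never trained

def project(x):                       # x: (n, 1024) raw features, e.g. vibration spectra
    return np.tanh(x @ PROJ)          # rapid feed-forward representation

def fit_prototypes(x_train, y_train):
    # "Instantaneous learning": store one mean feature vector per class.
    z = project(x_train)
    return {c: z[y_train == c].mean(axis=0) for c in np.unique(y_train)}

def classify(x, prototypes):
    # Contrastive comparison against stored prototypes (nearest by cosine similarity).
    z = project(x)
    labels = list(prototypes)
    protos = np.stack([prototypes[c] for c in labels])      # (C, 128)
    sims = z @ protos.T / (np.linalg.norm(z, axis=1, keepdims=True)
                           * np.linalg.norm(protos, axis=1) + 1e-8)
    return np.array(labels)[sims.argmax(axis=1)]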

Adapting a shared teleoperation system to network delays

B. Güleçyüz et al., Enhancing Shared Autonomy in Teleoperation Under Network Delay: Transparency- and Confidence-Aware Arbitration, IEEE Robotics and Automation Letters, vol. 10, no. 10, pp. 9654-9661, Oct. 2025, 10.1109/LRA.2025.3596436.

Shared autonomy bridges human expertise with machine intelligence, yet existing approaches often overlook the impact of teleoperation delays. To address this gap, we propose a novel shared autonomy approach that enables robots to gradually learn from teleoperated demonstrations while adapting to network delays. Our method improves intent prediction by accounting for delayed feedback to the human operator and adjusts the arbitration function to balance reduced human confidence due to delay with confidence in learned autonomy. To ensure system stability, which might be compromised by delay and arbitration of human and autonomy control forces, we introduce a three-port extension of the Time-Domain Passivity Approach with Energy Reflection (TDPA-ER). Experimental validation with 12 participants demonstrated improvements in intent prediction accuracy, task performance, and the quality of final learned autonomy, highlighting the potential of our approach to enhance teleoperation and learning quality in remote environments.
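
A minimal sketch of delay- and confidence-aware arbitration, assuming an exponential delay factor and a linear blend of commands; it is not the authors' arbitration function and does not include the TDPA-ER passivity layer.

import numpy as np

def arbitration_weight(delay_ms, autonomy_confidence, delay_scale=100.0, conf_floor=0.1):
    # alpha in [0, 1]: share of the autonomous command in the blend. Longer delay
    # reduces trust in the (delayed) human input; higher learned-policy confidence
    # increases the autonomy share.
    delay_factor = 1.0 - np.exp(-delay_ms / delay_scale)
    alpha = np.clip(autonomy_confidence, conf_floor, 1.0) * delay_factor
    return float(np.clip(alpha, 0.0, 1.0))

def blended_command(u_human, u_auto, delay_ms, autonomy_confidence):
    alpha = arbitration_weight(delay_ms, autonomy_confidence)
    return alpha * np.asarray(u_auto) + (1.0 - alpha) * np.asarray(u_human)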

A review of cognitive costs of decision making

Christin Schulze, Ada Aka, Daniel M. Bartels, Stefan F. Bucher, Jake R. Embrey, Todd M. Gureckis, Gerald Häubl, Mark K. Ho, Ian Krajbich, Alexander K. Moore, Gabriele Oettingen, Joan D.K. Ongchoco, Ryan Oprea, Nicholas Reinholtz, Ben R. Newell, A timeline of cognitive costs in decision-making, Trends in Cognitive Sciences, Volume 29, Issue 9, 2025, Pages 827-839, 10.1016/j.tics.2025.04.004.

Recent research from economics, psychology, cognitive science, computer science, and marketing is increasingly interested in the idea that people face cognitive costs when making decisions. Reviewing and synthesizing this research, we develop a framework of cognitive costs that organizes concepts along a temporal dimension and maps out when costs occur in the decision-making process and how they impact decisions. Our unifying framework broadens the scope of research on cognitive costs to a wider timeline of cognitive processing. We identify implications and recommendations emerging from our framework for intervening on behavior to tackle some of the most pressing issues of our day, from improving health and saving decisions to mitigating the consequences of climate change.

Social learning is compatible with reward-based decision making

David Schultner, Lucas Molleman, Björn Lindström, Reward is enough for social learning, Trends in Cognitive Sciences, Volume 29, Issue 9, 2025, Pages 787-789, 10.1016/j.tics.2025.06.012.

Adaptive behaviour relies on selective social learning, yet the mechanisms underlying this capacity remain debated. A new account demonstrates that key strategies can emerge through reward-based learning of social features, explaining the widely observed flexibility of social learning and illuminating the cognitive basis of cultural evolution.

On the theoretical convergence of Q-learning when the environment is not stationary

Diogo S. Carvalho, Pedro A. Santos, Francisco S. Melo, Reinforcement learning in convergently non-stationary environments: Feudal hierarchies and learned representations, Artificial Intelligence, Volume 347, 2025, 10.1016/j.artint.2025.104382.

We study the convergence of Q-learning-based methods in convergently non-stationary environments, particularly in the context of hierarchical reinforcement learning and of dynamic features encountered in deep reinforcement learning. We demonstrate that Q-learning achieves convergence in tabular representations when applied to convergently non-stationary dynamics, such as the ones arising in a feudal hierarchical setting. Additionally, we establish convergence for Q-learning-based deep reinforcement learning methods with convergently non-stationary features, such as the ones arising in representation-based settings. Our findings offer theoretical support for the application of Q-learning in these complex scenarios and present methodologies for extending established theoretical results from standard cases to their convergently non-stationary counterparts.
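
A toy sketch of the setting (not the paper's code): tabular Q-learning run on dynamics that drift over time but converge to a limiting kernel, i.e. a convergently non-stationary environment; the environment, drift schedule, and step sizes are illustrative.

import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
P_limit = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # limiting dynamics
P_start = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # initial dynamics
R = rng.random((n_states, n_actions))                                   # fixed rewards

Q = np.zeros((n_states, n_actions))
visits = np.zeros((n_states, n_actions))
s = 0
for t in range(1, 200_001):
    mix = 1.0 / t                                    # drift vanishes as t grows
    P_t = mix * P_start + (1.0 - mix) * P_limit      # current transition kernel
    a = rng.integers(n_actions) if rng.random() < 0.1 else int(Q[s].argmax())
    s_next = rng.choice(n_states, p=P_t[s, a])
    visits[s, a] += 1
    lr = 1.0 / visits[s, a]                          # Robbins-Monro step sizes
    # Standard Q-learning update; in this setting Q converges to the optimal
    # Q-function of the limiting kernel P_limit.
    Q[s, a] += lr * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next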

Improving the generalization of robotic RL by drawing inspiration from the human motion control system

P. Zhang, Z. Hua and J. Ding, A Central Motor System Inspired Pretraining Reinforcement Learning for Robotic Control, IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 55, no. 9, pp. 6285-6298, Sept. 2025, 10.1109/TSMC.2025.3577698.

Robots typically encounter diverse tasks, bringing a significant challenge for motion control. Pretraining reinforcement learning (PRL) enables robots to adapt quickly to various tasks by exploiting reusable skills. Existing PRL methods often rely on datasets and human expert knowledge, struggle to discover diverse and dynamic skills, and exhibit limited generalization and adaptability to different types of robots and downstream tasks. This article proposes a novel PRL algorithm based on central motor system mechanisms, which can discover diverse and dynamic skills without relying on data and expert knowledge, effectively enabling robots to tackle different types of downstream tasks. Inspired by the cerebellum’s role in balance control and skill storage within the central motor system, an intrinsic fused reward is introduced to explore dynamic skills and eliminate dependence on data and expert knowledge during pretraining. Drawing from the basal ganglia’s function in motor programming, a discrete skill encoding method is designed to increase the diversity of discovered skills, improving the performance of complex robots in challenging environments. Furthermore, incorporating the basal ganglia’s role in motor regulation, a skill activity function is proposed to generate skills at varying dynamic levels, thereby improving the adaptability of robots in multiple downstream tasks. The effectiveness of the proposed algorithm has been demonstrated through simulation experiments on four different morphological robots across multiple downstream tasks.
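
As a point of reference only, the sketch below shows a generic discriminator-based intrinsic reward for pretraining a policy conditioned on a discrete skill code, in the spirit of diversity-driven skill discovery; it is not the paper's intrinsic fused reward, skill encoding, or skill activity function, and the module names are assumptions.

import torch
import torch.nn.functional as F

def intrinsic_reward(discriminator, state, skill_id, n_skills):
    # skill_id: LongTensor of shape (batch,) holding the discrete skill code that
    # generated each state. Reward = log q(z | s) - log p(z): higher when the
    # visited state makes the active skill easy to identify, pushing different
    # codes toward distinguishable behaviors.
    logits = discriminator(state)                          # (batch, n_skills)
    log_q = F.log_softmax(logits, dim=-1)[torch.arange(len(state)), skill_id]
    log_p = -torch.log(torch.tensor(float(n_skills)))      # uniform prior over skills
    return (log_q - log_p).detach()                        # used as reward for the policy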

Stacking multiple MDPs in an abstraction hierarchy to solve RL problems more efficiently

Roberto Cipollone, Marco Favorito, Flavio Maiorana, Giuseppe De Giacomo, Luca Iocchi, Fabio Patrizi, Exploiting robot abstractions in episodic RL via reward shaping and heuristics, Robotics and Autonomous Systems, Volume 193, 2025, 10.1016/j.robot.2025.105116.

One major limitation to the applicability of Reinforcement Learning (RL) to many domains of practical relevance, in particular in robotic applications, is the large number of samples required to learn an optimal policy. To address this problem and improve learning efficiency, we consider a linear hierarchy of abstraction layers of the Markov Decision Process (MDP) underlying the target domain. Each layer is an MDP representing a coarser model of the one immediately below in the hierarchy. In this work, we propose novel techniques to automatically define Reward Shaping and Reward Heuristic functions that are based on the solution obtained at a higher level of abstraction and provide rewards to the finer (possibly the concrete) MDP at the lower level, thus inducing an exploration heuristic that can effectively guide the learning process in the more complex domain. In contrast with other works in Hierarchical RL, our technique imposes fewer requirements on the design of the abstract models and is tolerant to modeling errors, thus making the proposed approach practical. We formally analyze the relationship between the abstract models and the exploration heuristic induced in the lower-level domain, we prove that the method guarantees optimal convergence, and finally demonstrate its effectiveness experimentally in several complex robotic domains.
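
A short sketch of one way to realize the underlying mechanism, written here as standard potential-based shaping with the abstract value function as the potential; the paper's actual shaping and heuristic functions differ, and abstract_value and abstraction (the state mapping) are assumed to be given.

def shaped_reward(r, s, s_next, abstract_value, abstraction, gamma=0.99):
    # Potential-based shaping F(s, s') = gamma * Phi(s') - Phi(s) preserves optimal
    # policies while steering exploration toward states the abstract solution rates highly.
    phi_s = abstract_value[abstraction(s)]
    phi_next = abstract_value[abstraction(s_next)]
    return r + gamma * phi_next - phi_s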

How biology uses primary rewards derived from basic physiological signals, with more immediate proxy rewards (which predict primary rewards) acting as shaping rewards

Lilian A. Weber, Debbie M. Yee, Dana M. Small, Frederike H. Petzschner, The interoceptive origin of reinforcement learning, Trends in Cognitive Sciences, Volume 29, 2025, 10.1016/j.tics.2025.05.008.

Rewards play a crucial role in sculpting all motivated behavior. Traditionally, research on reinforcement learning has centered on how rewards guide learning and decision-making. Here, we examine the origins of rewards themselves. Specifically, we discuss that the critical signal sustaining reinforcement for food is generated internally and subliminally during the process of digestion. As such, a shift in our understanding of primary rewards from an immediate sensory gratification to a state-dependent evaluation of an action’s impact on vital physiological processes is called for. We integrate this perspective into a revised reinforcement learning framework that recognizes the subliminal nature of biological rewards and their dependency on internal states and goals.
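
The piece is conceptual, but the idea of an immediate proxy reward standing in for a delayed physiological (primary) reward can be cast as a simple learned shaping signal; the sketch below, with a delta-rule predictor and entirely hypothetical names, only illustrates that framing.

import numpy as np

class ProxyReward:
    # Learns to predict the delayed primary (physiological) reward from immediate
    # sensory cues; the prediction can then serve as an immediate shaping reward.
    def __init__(self, n_features, lr=0.01):
        self.w = np.zeros(n_features)
        self.lr = lr

    def predict(self, cues):                  # cues: (n_features,) immediate sensory features
        return float(self.w @ np.asarray(cues))

    def update(self, cues, primary_reward):   # called when the delayed outcome arrives
        error = primary_reward - self.predict(cues)
        self.w += self.lr * error * np.asarray(cues)   # delta-rule update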