Category Archives: Cognitive Sciences

Using evolutionary computation to find better rewards in partially observable RL

Zhengwei Zhu, Zhixuan Chen, Chenyang Zhu, Wen Si, Fang Wang, Optimizing potential-based reward automata in partially observable reinforcement learning using genetic local search, Engineering Applications of Artificial Intelligence, Volume 169, 2026, 10.1016/j.engappai.2026.114054.

Partially observable reinforcement learning extends the reinforcement learning framework to environments in which agents have limited visibility of the state space, making it particularly relevant for applications in robotics and autonomous vehicle navigation. However, a primary challenge in partially observable reinforcement learning is defining effective reward functions that can guide the learning process despite partial observability. To address this challenge, this paper introduces a novel approach for constructing potential-based reward automata by employing genetic local search methods. Specifically, our method constructs these automata from compressed representations of exploration trajectories, which succinctly capture critical decision points and essential state transitions while eliminating redundant steps. By optimizing trajectory samples and shortening agent trajectories to their crucial transitions, our technique significantly reduces computational overhead. Formally, we define the learning objective as an optimization problem aimed at maximizing the log-likelihood of future observations while simultaneously minimizing the structural complexity of the learned reward automata. Furthermore, by incorporating value-based strategies to estimate potential values within the reward automata, our approach improves learning efficiency and facilitates the identification of optimal reward structures. We empirically evaluate our proposed method on seven partially observable grid-world benchmarks. Experimental results demonstrate that our method achieves superior performance relative to state-of-the-art reward automata-based techniques, exhibiting both accelerated learning speeds and higher accumulated rewards. Additionally, our genetic local search algorithm consistently outperforms comparative heuristic methods in terms of learning curves and reward accumulation.
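The core mechanism the paper builds on, potential-based reward shaping driven by a reward automaton, can be sketched in a few lines. The automaton states, events, and potential values below are illustrative placeholders, not the structures learned by the genetic local search in the paper:

```python
# Minimal sketch of potential-based reward shaping driven by a reward
# automaton. The automaton (states, transitions, potentials) is a made-up
# example, not the one learned in the paper.

GAMMA = 0.9

# Automaton: u0 --"got_key"--> u1 --"opened_door"--> u2 (accepting)
TRANSITIONS = {
    ("u0", "got_key"): "u1",
    ("u1", "opened_door"): "u2",
}

# Potential of each automaton state; higher as the agent nears the goal.
POTENTIAL = {"u0": 0.0, "u1": 0.5, "u2": 1.0}

def step_automaton(u, event):
    """Advance the automaton on an observed event (self-loop otherwise)."""
    return TRANSITIONS.get((u, event), u)

def shaped_reward(env_reward, u, event):
    """Return (shaped reward, next automaton state).

    Uses the standard potential-based shaping term
    gamma * Phi(u') - Phi(u), which preserves optimal policies.
    """
    u_next = step_automaton(u, event)
    bonus = GAMMA * POTENTIAL[u_next] - POTENTIAL[u]
    return env_reward + bonus, u_next
```

Because the shaping term telescopes along any trajectory, it changes the reward signal without changing which policies are optimal, which is what makes the potential values safe to optimize.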

Uncovering temporal variability in the decision-making of agents that do not always respond with the same policy

Anne E. Urai, Structure uncovered: understanding temporal variability in perceptual decision-making, Trends in Cognitive Sciences, Volume 30, Issue 1, 2026, Pages 54-65, 10.1016/j.tics.2025.06.003.

Studies of perceptual decision-making typically present the same stimulus repeatedly over the course of an experimental session but ignore the order of these observations, assuming unrealistic stability of decision strategies over trials. However, even ‘stable,’ ‘steady-state,’ or ‘expert’ decision-making behavior features significant trial-to-trial variability that is richly structured in time. Structured trial-to-trial variability of various forms can be uncovered using latent variable models such as hidden Markov models and autoregressive models, revealing how unobservable internal states change over time. Capturing such temporal structure can avoid confounds in cognitive models, provide insights into inter- and intraindividual variability, and bridge the gap between neural and cognitive mechanisms of variability in perceptual decision-making.
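A hidden Markov model of the kind the review discusses can be sketched with a small forward pass over a sequence of binary choices. The two regimes ("engaged" vs. "lapse") and all parameter values below are assumptions for illustration, not fitted to any dataset:

```python
# Illustrative two-state HMM over a 0/1 choice sequence ("engaged" vs.
# "lapse" regimes). Transition matrix, priors, and accuracies are made up.
import numpy as np

A = np.array([[0.95, 0.05],   # state-transition matrix (sticky states)
              [0.10, 0.90]])
pi = np.array([0.5, 0.5])     # initial state distribution
# P(correct choice | state): engaged is accurate, lapse is near chance
p_correct = np.array([0.9, 0.5])

def forward_loglik(choices):
    """Log-likelihood of a 0/1 choice sequence (scaled forward algorithm)."""
    alpha = pi * np.where(choices[0], p_correct, 1 - p_correct)
    loglik = 0.0
    for c in choices[1:]:
        norm = alpha.sum()
        loglik += np.log(norm)
        alpha = (alpha / norm) @ A * np.where(c, p_correct, 1 - p_correct)
    return loglik + np.log(alpha.sum())
```

Fitting such a model (e.g., by expectation-maximization over this likelihood) recovers a posterior over the latent state at each trial, which is how the structured trial-to-trial variability is made visible.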

See also: the not-so-strong influence of time in some cognitive processes, such as speech processing (https://doi.org/10.1016/j.tics.2025.05.017)

Evidence in the natural world of the benefits of communication errors among collaborative agents

Bradley D. Ohlinger, Takao Sasaki, How miscommunication can improve collective performance in social insects, Trends in Cognitive Sciences, Volume 30, Issue 1, 2026, Pages 10-12, 10.1016/j.tics.2025.10.005.

Communication errors are typically viewed as detrimental, yet they can benefit collective foraging in social insects. Temnothorax ants provide a powerful model for studying how such errors arise during tandem running and how they might improve group performance under certain environmental conditions.

RL with both discrete and continuous actions

Chengcheng Yan, Shujie Chen, Jiawei Xu, Xuejie Wang, Zheng Peng, Hybrid Reinforcement Learning in parameterized action space via fluctuates constraint, Engineering Applications of Artificial Intelligence, Volume 162, Part C, 2025, 10.1016/j.engappai.2025.112499.

Parameterized actions in Reinforcement Learning (RL) combine discrete action choices with continuous action parameters and are widely employed in game scenarios. However, previous works have often concentrated on the network structure of RL algorithms for hybrid actions, neglecting the impact of fluctuations in the action parameters on the agent's trajectory. Because discrete and continuous actions are coupled, instability in the discrete actions influences the selection of the corresponding continuous parameters, causing the agent to deviate from the optimal path. In this paper, we propose a parameterized RL approach based on parameter fluctuation restriction (PFR), called CP-DQN, to address this problem. Our method mitigates value fluctuation in the action parameters by constraining the change in parameters between adjacent time steps. Additionally, we incorporate a supervision module to optimize the entire training process. To quantify the advantage of our approach in minimizing trajectory deviations, we propose an indicator that measures the influence of parameter fluctuations on performance in hybrid action spaces. Our method is evaluated in three environments with hybrid action spaces, and the experiments demonstrate its superiority over existing approaches.
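The central idea, penalizing large step-to-step changes in the continuous action parameters, can be sketched as a simple regularizer. The loss form and weight below are illustrative assumptions, not the exact CP-DQN objective:

```python
# Sketch of a parameter-fluctuation penalty: discourage large changes in
# the continuous action parameters between adjacent time steps. The
# squared-difference form and the weight are assumed for illustration.
import numpy as np

def fluctuation_penalty(params, weight=0.1):
    """Weighted mean squared change of action parameters across steps.

    params: array of shape (T, d), one continuous parameter vector per step.
    """
    diffs = np.diff(params, axis=0)          # parameter change per step
    return weight * np.mean(np.sum(diffs ** 2, axis=1))
```

Added to the usual RL loss, a term like this pushes the policy toward smooth parameter sequences, which is the stabilizing effect the paper attributes to its constraint.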

A variant of RL aimed at reducing the bias of conventional Q-learning

Fanghui Huang, Wenqi Han, Xiang Li, Xinyang Deng, Wen Jiang, Reducing the estimation bias and variance in reinforcement learning via Maxmean and Aitken value iteration, Engineering Applications of Artificial Intelligence, Volume 162, Part C, 2025, 10.1016/j.engappai.2025.112502.

Value-based reinforcement learning methods suffer from overestimation bias, owing to the max operator, which results in suboptimal policies. Meanwhile, variance in value estimation causes instability in the networks. Many algorithms have been presented to solve these problems, but they lack theoretical analysis of the degree of estimation bias and of the trade-off between estimation bias and variance. Motivated by the above, in this paper we propose a novel method based on Maxmean and Aitken value iteration, named MMAVI. The Maxmean operation uses the average of multiple state–action values (Q values) as the estimated target value to mitigate bias and variance. Aitken value iteration is used to update Q values and improve the convergence rate. Based on the proposed method, combined with Q-learning and deep Q-networks, we design two novel algorithms to adapt to different environments. To understand the effect of MMAVI, we analyze it both theoretically and empirically. In theory, we derive closed-form expressions for the reduction in bias and variance, and prove that the convergence rate of our proposed method is faster than that of traditional methods based on the Bellman equation. In addition, the convergence of our algorithms is proved in a tabular setting. Finally, we demonstrate that our proposed algorithms outperform state-of-the-art algorithms in several environments.
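The Maxmean target can be sketched in a few lines: average several independent Q estimates before taking the max over actions, so the max is applied to a lower-variance estimate. The tabular representation and values below are illustrative, and the Aitken acceleration step of the full method is omitted:

```python
# Sketch of a Maxmean bootstrapped target: max over actions of the mean
# of several independent Q estimates, which tames the overestimation a
# max over a single noisy estimate produces. Tables here are toy dicts.
import numpy as np

def maxmean_target(q_tables, next_state, reward, gamma=0.99):
    """Target r + gamma * max_a mean_k Q_k(s', a)."""
    mean_q = np.mean([q[next_state] for q in q_tables], axis=0)
    return reward + gamma * np.max(mean_q)
```

With a single table this reduces to the ordinary Q-learning target; with several, the averaging cancels part of the estimation noise that the max would otherwise amplify.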

A quantitative demonstration, based on MDPs, of the increasing need for a world model (learned or given) as the complexity of the task and the performance of the agent increase

Jonathan Richens, David Abel, Alexis Bellot, Tom Everitt, General agents contain world models, arXiv cs:AI, Sep. 2025, arXiv:2506.01622.

Are world models a necessary ingredient for flexible, goal-directed behaviour, or is model-free learning sufficient? We provide a formal answer to this question, showing that any agent capable of generalizing to multi-step goal-directed tasks must have learned a predictive model of its environment. We show that this model can be extracted from the agent’s policy, and that increasing the agent’s performance or the complexity of the goals it can achieve requires learning increasingly accurate world models. This has a number of consequences: from developing safe and general agents, to bounding agent capabilities in complex environments, and providing new algorithms for eliciting world models from agents.

Inclusion of LLMs in multiple task learning for generating rewards

Z. Lin, Y. Chen and Z. Liu, AutoSkill: Hierarchical Open-Ended Skill Acquisition for Long-Horizon Manipulation Tasks via Language-Modulated Rewards, IEEE Transactions on Cognitive and Developmental Systems, vol. 17, no. 5, pp. 1141-1152, Oct. 2025, 10.1109/TCDS.2025.3551298.

A desirable property of generalist robots is the ability to both bootstrap diverse skills and solve new long-horizon tasks in open-ended environments without human intervention. Recent advancements have shown that large language models (LLMs) encapsulate vast-scale semantic knowledge about the world to enable long-horizon robot planning. However, they are typically restricted to reasoning high-level instructions and lack world grounding, which makes it difficult for them to coordinately bootstrap and acquire new skills in unstructured environments. To this end, we propose AutoSkill, a hierarchical system that empowers the physical robot to automatically learn to cope with new long-horizon tasks by growing an open-ended skill library without hand-crafted rewards. AutoSkill consists of two key components: 1) an in-context skill chain generation and new skill bootstrapping guided by LLMs that inform the robot of discrete and interpretable skill instructions for skill retrieval and augmentation within the skill library; and 2) a zero-shot language-modulated reward scheme in conjunction with a meta prompter facilitates online new skill acquisition via expert-free supervision aligned with proposed skill directives. Extensive experiments conducted in both simulated and realistic environments demonstrate AutoSkill’s superiority over other LLM-based planners as well as hierarchical methods in expediting online learning for novel manipulation tasks.

A cognitive map implemented according to the latest biological knowledge and aimed at robot navigation

M. A. Hicks, T. Lei, C. Luo, D. W. Carruth and Z. Bi, A Bio-Inspired Goal-Directed Cognitive Map Model for Robot Navigation and Exploration, IEEE Transactions on Cognitive and Developmental Systems, vol. 17, no. 5, pp. 1125-1140, Oct. 2025, 10.1109/TCDS.2025.3552085.

The concept of a cognitive map (CM), or spatial map, was originally proposed to explain how mammals learn and navigate their environments. Over time, extensive research in neuroscience and psychology has established the CM as a widely accepted model. In this work, we introduce a new goal-directed cognitive map (GDCM) model that takes a nontraditional approach to spatial mapping for robot navigation and path planning. Unlike conventional models, GDCM does not require complete environmental exploration to construct a graph for navigation purposes. Inspired by biological navigation strategies such as the use of landmarks, Euclidean distance, random motion, and reward-driven behavior, the GDCM can navigate complex, static environments efficiently without needing to explore the entire workspace. The model utilizes known cell types (head direction, speed, border, grid, and place cells) that constitute the CM, arranged in a unique configuration. Each cell model is designed to emulate its biological counterpart in a simple, computationally efficient way. Through simulation-based comparisons, this innovative CM graph-building approach demonstrates more efficient navigation than traditional models that require full exploration. Furthermore, GDCM consistently outperforms several established path planning and navigation algorithms by finding better paths.

On the model humans use to predict the movements of targets in order to reach them, with some evidence of biological Kalman filter-like processing

John F. Soechting, John Z. Juveli, and Hrishikesh M. Rao, Models for the Extrapolation of Target Motion for Manual Interception, J Neurophysiol 102: 1491–1502, 2009, 10.1152/jn.00398.2009.

Intercepting a moving target requires a prediction of the target’s future motion. This extrapolation could be achieved using sensed parameters of the target motion, e.g., its position and velocity. However, the accuracy of the prediction would be improved if subjects were also able to incorporate the statistical properties of the target’s motion, accumulated as they watched the target move. The present experiments were designed to test for this possibility. Subjects intercepted a target moving on the screen of a computer monitor by sliding their extended finger along the monitor’s surface. Along any of the six possible target paths, target speed could be governed by one of three possible rules: constant speed, a power law relation between speed and curvature, or the trajectory resulting from a sum of sinusoids. A go signal was given to initiate interception and was always presented when the target had the same speed, irrespective of the law of motion. The dependence of the initial direction of finger motion on the target’s law of motion was examined. This direction did not depend on the speed profile of the target, contrary to the hypothesis. However, finger direction could be well predicted by assuming that target location was extrapolated using target velocity and that the amount of extrapolation depended on the distance from the finger to the target. Subsequent analysis showed that the same model of target motion was also used for on-line, visually mediated corrections of finger movement when the motion was initially misdirected.
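The extrapolation model the data favored is easy to sketch: the predicted target location is the current location plus velocity times a lookahead interval that grows with finger-to-target distance. The linear gain below is an assumed functional form for illustration, not the one fitted in the paper:

```python
# Sketch of velocity-based target extrapolation: lookahead time grows
# with finger-target distance. The linear gain is an assumed choice.
import math

def predicted_target(pos, vel, finger, gain=0.5):
    """Extrapolate target position along its current velocity.

    pos, vel, finger: (x, y) tuples; extrapolation time = gain * distance
    from finger to target, so farther fingers look further ahead.
    """
    dist = math.hypot(pos[0] - finger[0], pos[1] - finger[1])
    dt = gain * dist
    return (pos[0] + vel[0] * dt, pos[1] + vel[1] * dt)
```

Note that this model uses only first-order information (position and velocity), which is exactly the finding: subjects did not exploit the statistical structure of the speed profile.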

Improvements in offline RL (learning from previously acquired datasets)

Lan Wu, Quan Liu, Renyang You, State slow feature softmax Q-value regularization for offline reinforcement learning, Engineering Applications of Artificial Intelligence, Volume 160, Part A, 2025, 10.1016/j.engappai.2025.111828.

Offline reinforcement learning is constrained by its reliance on pre-collected datasets, without the opportunity for further interaction with the environment. This restriction often results in distribution shifts, which can exacerbate Q-value overestimation and degrade policy performance. To address these issues, we propose a method called state slow feature softmax Q-value regularization (SQR), which enhances the stability and accuracy of Q-value estimation in offline settings. SQR employs slow feature representation learning to extract dynamic information from state trajectories, promoting the stability and robustness of the state representations. Additionally, a softmax operator is incorporated into the Q-value update process to smooth Q-value estimation, reducing overestimation and improving policy optimization. Finally, we apply our approach to locomotion and navigation tasks and establish a comprehensive experimental analysis framework. Empirical results demonstrate that SQR outperforms state-of-the-art offline RL baselines, achieving performance improvements ranging from 2.5% to 44.6% on locomotion tasks and 2.0% to 71.1% on navigation tasks. Moreover, it achieves the highest score on 7 out of 15 locomotion datasets and 4 out of 6 navigation datasets. Detailed experimental results confirm the stabilizing effect of slow feature learning and the effectiveness of the softmax regularization in mitigating Q-value overestimation, demonstrating the superiority of SQR in addressing key challenges in offline reinforcement learning.
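The softmax part of the method can be sketched as a Boltzmann-weighted backup operator that replaces the hard max in the Q-value target. The temperature below is an illustrative assumption, and the slow-feature representation learning of the full method is omitted:

```python
# Sketch of a softmax (Boltzmann-weighted) backup operator: a smooth
# stand-in for max over Q-values that reduces overestimation from noisy
# estimates. Temperature tau is an assumed hyperparameter.
import numpy as np

def softmax_backup(q_values, tau=1.0):
    """Boltzmann-weighted average of Q-values; approaches max as tau -> 0."""
    z = np.array(q_values, dtype=float) / tau
    z -= z.max()                          # shift for numerical stability
    w = np.exp(z)
    w /= w.sum()
    return float(np.dot(w, q_values))
```

Because the operator is a convex combination of the Q-values, it is always at most the hard max, which is what gives it its regularizing, anti-overestimation effect in the target.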