Selecting the least risky policy among offline RL candidates

Giorgio Angelotti, Nicolas Drougard, Caroline P. C. Chanel, An offline risk-aware policy selection method for Bayesian Markov decision processes, Artificial Intelligence, Volume 354, 2026, 10.1016/j.artint.2026.104519.

In offline model learning for planning and in offline reinforcement learning, the limited data set hinders the estimate of the value function of the underlying Markov Decision Process (MDP). Consequently, the performance of the obtained policy in the real world is bounded and possibly risky, especially when deploying a wrong policy can lead to catastrophic consequences. For this reason, several pathways are being followed with the aim of reducing the model error (or the distributional shift between the learned model and the true one) and, more broadly, of obtaining risk-aware solutions with respect to model uncertainty. But when it comes to the final application, which baseline should a practitioner choose? In an offline context where computational time is not an issue and robustness is the priority, we propose Exploitation vs Caution (EvC), a paradigm that (1) elegantly incorporates model uncertainty abiding by the Bayesian formalism, and (2) selects the policy that maximizes a risk-aware objective over the Bayesian posterior from a fixed set of candidate policies provided, for instance, by the current baselines. We validate EvC against state-of-the-art approaches in different discrete, yet simple, environments offering a fair variety of MDP classes. In the tested scenarios EvC manages to select robust policies and hence stands out as a useful tool for practitioners who aim to apply offline planning and reinforcement learning solvers in the real world.
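The selection step itself is simple to picture. Below is a minimal sketch assuming CVaR as the risk-aware objective and a hypothetical `evaluate` routine (the paper's actual objective and posterior machinery may differ): each candidate policy is scored on MDPs sampled from the Bayesian posterior, and the policy with the best worst-case tail wins.

```python
def cvar(returns, alpha=0.2):
    """Conditional Value at Risk: the mean of the worst alpha-fraction of returns."""
    k = max(1, int(len(returns) * alpha))
    worst = sorted(returns)[:k]
    return sum(worst) / len(worst)

def select_policy(candidates, posterior_mdps, evaluate, alpha=0.2):
    """Score each candidate policy by the CVaR of its returns across MDPs
    sampled from the Bayesian posterior; return the best policy and all scores."""
    scores = {pi: cvar([evaluate(pi, m) for m in posterior_mdps], alpha)
              for pi in candidates}
    return max(scores, key=scores.get), scores
```

A "greedy" candidate with a high mean return but a catastrophic tail on a few posterior samples loses here to a "safe" candidate with a modest but stable return, which is exactly the caution-over-exploitation trade-off in the title.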

Using DL to decide when a robot should stop exploring

Luperto, M., Ferrara, M.M., Princisgh, M. et al., Estimating map completeness in robot exploration, Auton Robot 50, 6 (2026) 10.1007/s10514-025-10221-8.

We present a novel method that, given a grid map of a partially explored indoor environment, estimates how much of the environment has already been mapped and whether it is worth continuing to explore the uncovered part. Our method is based on the idea that modern deep learning models can successfully solve this task by leveraging visual clues in the map. Thus, we train a deep convolutional neural network on images depicting grid maps of partially explored environments, with annotations derived from knowledge of the entire map, which is not available when the network is used for inference. We show that our network can be used to define a stopping criterion that successfully terminates the exploration process when it is expected to no longer add relevant details about the environment to the map, saving more than 35% of the total exploration time compared to covering the whole environment area.
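In practice such a stopping criterion reduces to thresholding the network's completeness estimate over a few consecutive predictions. The sketch below is an illustrative guess: the threshold, the patience window, and the predictor producing the estimates are assumptions, not the paper's learned criterion.

```python
def should_stop(predicted_completeness, history, threshold=0.9, patience=3):
    """Terminate exploration once the (CNN-)predicted map completeness stays
    above `threshold` for `patience` consecutive estimates, so a single
    noisy prediction cannot end the mission early."""
    history.append(predicted_completeness)
    recent = history[-patience:]
    return len(recent) == patience and all(c >= threshold for c in recent)
```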

Fixing artefacts in occupancy grid maps through DL

Leon Davies, Baihua Li, Mohamad Saada, Simon Sølvsten, Qinggang Meng, Transformation & Translation Occupancy Grid Mapping: 2-dimensional deep learning refined SLAM, Robotics and Autonomous Systems, Volume 200, 2026, 10.1016/j.robot.2026.105405.

SLAM (Simultaneous Localisation and Mapping) is an important component in robotics, providing a map of an environment and enabling localisation and navigation. While 3D LiDAR odometry and mapping systems have advanced in recent years, producing accurate motion estimates and detailed 3D maps, high-quality 2D occupancy grid maps (OGMs) remain challenging to obtain in large, complex indoor environments. OGMs are often degraded by drifts in odometry, sensor artefacts, and partial observability, resulting in maps with fractured walls, double boundaries, and artefacts that limit readability for mapping-centric tasks such as floor plan creation. To address this, we propose Transformation & Translation Occupancy Grid Mapping (TT-OGM), a system-level pipeline that targets map fidelity. TT-OGM leverages 3D scan registration to stabilise 2D map construction via projection and standard occupancy updates, then applies a learned GAN-based refinement module as post-processing to remove artefacts, regularise structure, and complete small missing regions. To enable training at scale, we introduce an offline DRL-based data generation process that produces paired but weakly aligned erroneous/clean OGMs spanning diverse error modes and severities. We demonstrate TT-OGM in real-time on a building-scale dataset collected at Loughborough University and evaluate map fidelity against a registered floor-plan reference using mIoU, masked SSIM, and occupied-boundary F1. We additionally report localisation accuracy on S3Ev2 using translation ATE (RMSE) against Cartographer and SLAM Toolbox (Karto). Our results show that 3D registration improves baseline 2D map quality over standard 2D SLAM outputs, and that GAN refinement further increases structural consistency and boundary accuracy in our pipeline. Additional ablations on synthetic stress tests and qualitative transfer to unseen Radish sequences show that the refinement module consistently improves OGM readability under common noise, moderate drift, and clutter conditions.
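The "standard occupancy updates" mentioned above are typically implemented in log-odds form, with each cell accumulating evidence independently. A minimal sketch follows; the inverse sensor model probabilities (0.7 for a hit, 0.4 for a miss) are illustrative assumptions.

```python
import math

# Log-odds increments from an assumed inverse sensor model:
# p(occupied | hit) = 0.7, p(occupied | miss) = 0.4.
L_HIT = math.log(0.7 / 0.3)
L_MISS = math.log(0.4 / 0.6)

def update_cell(logodds, hit):
    """Accumulate one observation into a cell's log-odds value."""
    return logodds + (L_HIT if hit else L_MISS)

def probability(logodds):
    """Convert log-odds back to an occupancy probability."""
    return 1.0 - 1.0 / (1.0 + math.exp(logodds))

grid = {}  # (x, y) -> log-odds; missing cells are unknown (p = 0.5)
for cell, hit in [((3, 4), True), ((3, 4), True), ((2, 4), False)]:
    grid[cell] = update_cell(grid.get(cell, 0.0), hit)
```

A refinement stage like TT-OGM's GAN module then operates on the rendered image of such a grid, not on the log-odds values themselves.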

Using evolutionary computation to find better rewards in partially observable RL

Zhengwei Zhu, Zhixuan Chen, Chenyang Zhu, Wen Si, Fang Wang, Optimizing potential-based reward automata in partially observable reinforcement learning using genetic local search, Engineering Applications of Artificial Intelligence, Volume 169, 2026, 10.1016/j.engappai.2026.114054.

Partially observable reinforcement learning extends the reinforcement learning framework to environments in which agents have limited visibility of the state space, making it particularly relevant for applications in robotics and autonomous vehicle navigation. However, a primary challenge in partially observable reinforcement learning is defining effective reward functions that can guide the learning process despite partial observability. To address this challenge, this paper introduces a novel approach for constructing potential-based reward automata by employing genetic local search methods. Specifically, our method constructs these automata from compressed representations of exploration trajectories, which succinctly capture critical decision points and essential state transitions while eliminating redundant steps. By optimizing trajectory samples and shortening agent trajectories to their crucial transitions, our technique significantly reduces computational overhead. Formally, we define the learning objective as an optimization problem aimed at maximizing the log-likelihood of future observations while simultaneously minimizing the structural complexity of the learned reward automata. Furthermore, by incorporating value-based strategies to estimate potential values within the reward automata, our approach improves learning efficiency and facilitates the identification of optimal reward structures. We empirically evaluate our proposed method on seven partially observable grid-world benchmarks. Experimental results demonstrate that our method achieves superior performance relative to state-of-the-art reward automata-based techniques, exhibiting both accelerated learning speeds and higher accumulated rewards. Additionally, our genetic local search algorithm consistently outperforms comparative heuristic methods in terms of learning curves and reward accumulation.
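For readers unfamiliar with the term, "potential-based" refers to the classic shaping form of Ng, Harada, and Russell (1999), which provably preserves optimal policies. A minimal sketch follows; the toy automaton states and potential values are made up for illustration, whereas the paper learns both the automaton and the potentials via genetic local search.

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Potential-based shaping: r' = r + gamma * phi(s') - phi(s).
    Rewards of this form leave the optimal policy unchanged."""
    return r + gamma * phi(s_next) - phi(s)

# Toy reward-automaton potentials: states nearer the accepting state
# get higher potential, so progress earns a positive shaping bonus.
potential = {"start": 0.0, "got_key": 0.5, "opened_door": 1.0}
bonus = shaped_reward(0.0, "start", "got_key", potential.get)
```

Moving toward the accepting state yields a positive bonus (here 0.99 * 0.5 - 0.0 = 0.495), while regressing yields a penalty, which is how the automaton guides learning under partial observability.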

Enhancing RRT* with more intelligent sampling

Asmaa Loulou, Mustafa Unel, Hybrid attention-guided RRT*: Learning spatial sampling priors for accelerated path planning, Robotics and Autonomous Systems, Volume 198, 2026, 10.1016/j.robot.2026.105338.

Sampling-based planners such as RRT* are widely used for motion planning in high-dimensional and complex environments. However, their reliance on uniform sampling often leads to slow convergence and inefficiency, especially in scenarios with narrow passages or long-range dependencies. To address this, we propose HAGRRT*, a Hybrid Attention-Guided RRT* algorithm that learns to generate spatially informed sampling priors. Our method introduces a new neural architecture that fuses multi-scale convolutional features with a lightweight cross-attention mechanism, explicitly conditioned on the start and goal positions. These features are decoded via a DPT-inspired module to produce 2D probability maps that guide the sampling process. Additionally, we propose an obstacle-aware loss function that penalizes disconnected and infeasible predictions which further encourages the network to focus on traversable, goal-directed regions. Extensive experiments on both structured (maze) and unstructured (forest) environments show that HAGRRT* achieves significantly faster convergence and improved path quality compared to both classical RRT* and recent deep-learning guided variants. Our method consistently requires fewer iterations and samples and is able to generalize across varying dataset types. On structured scenarios, our method achieves an average reduction of 39.6% in the number of samples and an average of 24.4% reduction in planning time compared to recent deep learning methods. On unstructured forest maps, our method reduces the number of samples by 71.5%, and planning time by 81.7% compared to recent deep learning methods, and improves the success rate from 67% to 93%. These results highlight the robustness, efficiency, and generalization ability of our approach across a wide range of planning environments.
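The core mechanism, replacing RRT*'s uniform sampling with draws from a learned probability map, can be sketched in a few lines. The dict-of-weights representation and the eps mixing value below are illustrative assumptions, not the paper's implementation.

```python
import random

def biased_sample(prob_map, rng=random, eps=0.1):
    """Sample a cell from a learned 2D probability map (a dict {(x, y): weight})
    instead of uniformly; with probability eps, fall back to a uniform draw so
    the planner keeps RRT*'s probabilistic completeness. An RRT* loop would
    then steer its tree toward the returned cell."""
    cells = list(prob_map)
    if rng.random() < eps:
        return rng.choice(cells)
    return rng.choices(cells, weights=[prob_map[c] for c in cells], k=1)[0]
```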


Uncovering temporal variations in the decision-making of agents that do not always respond with the same policy

Anne E. Urai, Structure uncovered: understanding temporal variability in perceptual decision-making, Trends in Cognitive Sciences, Volume 30, Issue 1, 2026, Pages 54-65, 10.1016/j.tics.2025.06.003.

Studies of perceptual decision-making typically present the same stimulus repeatedly over the course of an experimental session but ignore the order of these observations, assuming unrealistic stability of decision strategies over trials. However, even ‘stable,’ ‘steady-state,’ or ‘expert’ decision-making behavior features significant trial-to-trial variability that is richly structured in time. Structured trial-to-trial variability of various forms can be uncovered using latent variable models such as hidden Markov models and autoregressive models, revealing how unobservable internal states change over time. Capturing such temporal structure can avoid confounds in cognitive models, provide insights into inter- and intraindividual variability, and bridge the gap between neural and cognitive mechanisms of variability in perceptual decision-making.
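To make the latent-variable idea concrete, a two-state hidden Markov model can decode, trial by trial, whether an observer was in an "engaged" or a "lapsed" internal state from a sequence of correct/error responses. All probabilities below are made up for illustration, not fitted values.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence given observations (log-space)."""
    V = [{s: math.log(start_p[s] * emit_p[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans_p[p][s]))
            row[s] = V[-1][prev] + math.log(trans_p[prev][s] * emit_p[s][o])
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

states = ("engaged", "lapsed")
start = {"engaged": 0.5, "lapsed": 0.5}
trans = {"engaged": {"engaged": 0.9, "lapsed": 0.1},   # states are "sticky"
         "lapsed": {"engaged": 0.1, "lapsed": 0.9}}
emit = {"engaged": {"correct": 0.9, "error": 0.1},     # accurate when engaged
        "lapsed": {"correct": 0.5, "error": 0.5}}      # at chance when lapsed
```

Fitting such parameters to real data (e.g. by expectation-maximization) is what the reviewed models do; decoding then reveals when the internal state switched over the session.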

See also: the no so strong influence of time in some cognitive processes, such as speech processing (https://doi.org/10.1016/j.tics.2025.05.017)

Evidence from the natural world that communication errors can benefit collaborative agents

Bradley D. Ohlinger, Takao Sasaki, How miscommunication can improve collective performance in social insects, Trends in Cognitive Sciences, Volume 30, Issue 1, 2026, Pages 10-12, 10.1016/j.tics.2025.10.005.

Communication errors are typically viewed as detrimental, yet they can benefit collective foraging in social insects. Temnothorax ants provide a powerful model for studying how such errors arise during tandem running and how they might improve group performance under certain environmental conditions.

Deterministic guarantees (aka certification) for POMDPs

Moran Barenboim, Vadim Indelman, Online POMDP planning with anytime deterministic optimality guarantees, Artificial Intelligence, Volume 350, 2026, 10.1016/j.artint.2025.104442.

Decision-making under uncertainty is a critical aspect of many practical autonomous systems due to incomplete information. Partially Observable Markov Decision Processes (POMDPs) offer a mathematically principled framework for formulating decision-making problems under such conditions. However, finding an optimal solution for a POMDP is generally intractable. In recent years, there has been significant progress in scaling approximate solvers from small to moderately sized problems, using online tree search. Often, such approximate solvers are limited to probabilistic or asymptotic guarantees towards the optimal solution. In this paper, we derive a deterministic relationship for discrete POMDPs between an approximated and the optimal solution. We show that at any time we can derive bounds that relate the existing solution to the optimal one. We show that our derivations provide an avenue for a new set of algorithms and can be attached to existing algorithms that have a certain structure to provide them with deterministic guarantees at marginal computational overhead. In return, not only do we certify the solution quality, but we demonstrate that making a decision based on the deterministic guarantee may result in superior performance compared to the original algorithm without the certification.
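The practical payoff of deterministic bounds is that a planner can certify its own decisions. The sketch below shows generic interval logic of this kind, an assumption for illustration rather than the paper's specific derivation: an action is certified (epsilon-)optimal once its lower value bound dominates every rival's upper bound.

```python
def certified_action(bounds, epsilon=0.0):
    """bounds: {action: (lower, upper)} value bounds from an anytime solver.
    Returns (action, certified); certified is True when the chosen action's
    lower bound exceeds every other action's upper bound minus epsilon,
    i.e. the action is provably epsilon-optimal at this point in the search."""
    best = max(bounds, key=lambda a: bounds[a][0])
    lo = bounds[best][0]
    certified = all(bounds[a][1] <= lo + epsilon for a in bounds if a != best)
    return best, certified
```

An anytime loop would keep tightening the bounds and stop as soon as `certified` becomes True, which is what makes the guarantee nearly free computationally.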

A novel stochastic gradient optimization method that improves over common ones

Mengxiang Zhang, Shengjie Li, Inertial proximal stochastic gradient method with adaptive sampling for non-convex and non-smooth problems, Engineering Applications of Artificial Intelligence, Volume 163, Part 3, 2026, 10.1016/j.engappai.2025.113087.

Stochastic gradient methods with inertia have proven effective in convex optimization, yet most real-world tasks involve non-convex objectives. With the growing scale and dimensionality of modern datasets, non-convex and non-smooth regularization has become essential for improving generalization, controlling complexity, and mitigating overfitting. While widely applied in logistic regression, sparse recovery, medical imaging, and sparse neural networks, such formulations remain challenging due to the high cost of exact gradients, the sensitivity of stochastic gradients to sample size, and convergence difficulties caused by noise and non-smooth non-convexity. We propose a stochastic algorithm that addresses these issues by introducing an adaptive sampling strategy to balance stochastic gradient noise and efficiency, incorporating inertia for acceleration, and a step size update rule coupled with both sample size and inertia. We avoid the need for exact function value computations required by traditional inertial methods in non-convex and non-smooth problems, as well as the costly full-gradient evaluations or substantial memory usage typically associated with variance-reduction techniques. To our knowledge, this is the first stochastic method with adaptive sampling and inertia that guarantees convergence in non-convex and non-smooth settings, attaining O(1/K) rates to critical points under mild variance conditions, while achieving accelerated O(1/k²) convergence in convex optimization. Experiments on logistic regression and neural networks validate its efficiency and provide practical guidance for selecting sample sizes and step sizes.
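The core update combines an inertial (heavy-ball) extrapolation with a proximal step. The sketch below uses the convex ℓ1 regularizer and a deterministic gradient as stand-ins; the paper targets non-convex non-smooth regularizers, stochastic minibatch gradients, and an adaptive sample size, none of which are reproduced here.

```python
def soft_threshold(x, t):
    """Proximal operator of t * |x| (soft-thresholding)."""
    return max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0)

def inertial_prox_grad(grad, x0, step=0.1, beta=0.5, lam=0.01, iters=500):
    """Minimize f(x) + lam * |x| via
        y_k     = x_k + beta * (x_k - x_{k-1})               # inertia
        x_{k+1} = prox_{step*lam}(y_k - step * grad(y_k))    # proximal step
    A stochastic variant would replace grad with a minibatch estimate whose
    sample size grows adaptively to control gradient noise."""
    x_prev = x = x0
    for _ in range(iters):
        y = x + beta * (x - x_prev)
        x_prev, x = x, soft_threshold(y - step * grad(y), step * lam)
    return x
```

On the toy problem f(x) = (x - 1)²/2 with lam = 0.01, the minimizer of the regularized objective is x = 0.99, which the iteration reaches quickly.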

Analysis of using RL as a PID tuning method

Ufuk Demircioğlu, Halit Bakır, Reinforcement learning–driven proportional–integral–derivative controller tuning for mass–spring systems: Stability, performance, and hyperparameter analysis, Engineering Applications of Artificial Intelligence, Volume 162, Part D, 2025, 10.1016/j.engappai.2025.112692.

Artificial intelligence (AI) methods—particularly reinforcement learning (RL)—are used to tune Proportional–Integral–Derivative (PID) controller parameters for a mass–spring–damper system. Learning is performed with the Twin Delayed Deep Deterministic Policy Gradient (TD3) actor–critic algorithm, implemented in MATLAB (Matrix Laboratory) and Simulink (a simulation environment by MathWorks). The objective is to examine the effect of critical RL hyperparameters—including experience buffer size, mini-batch size, and target policy smoothing noise—on the quality of learned PID gains and control performance. The proposed method eliminates the need for manual gain tuning by enabling the RL agent to autonomously learn optimal control strategies through continuous interaction with the Simulink-modeled mass–spring–damper system, where the agent observes responses and applies control actions to optimize the PID gains. Results show that small buffer sizes and suboptimal batch configurations cause unstable behavior, while buffer sizes of 10⁵ or larger and mini-batch sizes between 64 and 128 yield robust tracking. A target policy smoothing noise of 0.01 produced the best performance, while values between 0.05 and 0.1 also provided stable results. Comparative analysis with the classical Simulink PID tuner indicated that, for this linear system, the conventional tuner achieved slightly better transient performance, particularly in overshoot and settling time. Although the RL-based method showed adaptability and generated valid PID gains, it did not surpass the classical approach in this structured system. These findings highlight the promise of AI- and RL-driven control in uncertain, nonlinear, or variable dynamics, while underscoring the importance of hyperparameter optimization in realizing the potential of RL-based Proportional–Integral–Derivative tuning.
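For intuition about the plant being tuned, here is a minimal Euler simulation of a mass-spring-damper under PID position control; all gains and physical parameters are illustrative assumptions, not the paper's values. An RL agent such as TD3 would treat the three gains as its action and a tracking-error penalty as its reward.

```python
def simulate_pid(kp, ki, kd, setpoint=1.0, m=1.0, k=1.0, c=0.5,
                 dt=0.001, steps=20000):
    """Euler integration of m*x'' + c*x' + k*x = u with a PID controller
    driving position x to `setpoint`; returns the final position."""
    x, v, integ, prev_err = 0.0, 0.0, 0.0, setpoint
    for _ in range(steps):
        err = setpoint - x
        integ += err * dt                    # integral term accumulates
        deriv = (err - prev_err) / dt        # finite-difference derivative
        u = kp * err + ki * integ + kd * deriv
        prev_err = err
        a = (u - c * v - k * x) / m          # plant dynamics
        v += a * dt
        x += v * dt
    return x
```

With the illustrative gains kp=10, ki=5, kd=2, the closed loop is stable and the integral term removes the steady-state error, so the mass settles at the setpoint; an RL tuner searches this gain space automatically.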