Tag Archives: Deep Reinforcement Learning

Improving the adaptation of RL to robots with different parameters through fuzzy ensembles of policies

A. G. Haddad, M. B. Mohiuddin, I. Boiko and Y. Zweiri, Fuzzy Ensembles of Reinforcement Learning Policies for Systems With Variable Parameters, IEEE Robotics and Automation Letters, vol. 10, no. 6, pp. 5361-5368, June 2025, DOI: 10.1109/LRA.2025.3559833.

This paper presents a novel approach to improving the generalization capabilities of reinforcement learning (RL) agents for robotic systems with varying physical parameters. We propose the Fuzzy Ensemble of RL policies (FERL), which enhances performance in environments where system parameters differ from those encountered during training. The FERL method selectively fuses aligned policies, determining their collective decision based on fuzzy memberships tailored to the current parameters of the system. Unlike traditional centralized training approaches that rely on shared experiences for policy updates, FERL allows for independent agent training, facilitating efficient parallelization. The effectiveness of FERL is demonstrated through extensive experiments, including a real-world trajectory tracking application in a quadrotor slung-load system. Our method improves success rates by up to 15.6% across various simulated systems with variable parameters compared to the existing benchmarks of domain randomization and robust adaptive ensemble adversary RL. In the real-world experiments, our method achieves a 30% reduction in 3D position RMSE compared to individual RL policies. The results underscore FERL’s robustness and applicability to real robotic systems.
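The fusion step is easy to picture. Below is a minimal sketch of the idea, assuming triangular fuzzy memberships over a single scalar system parameter (e.g., payload mass) and one policy per nominal parameter value; all names are illustrative, not taken from the paper.

```python
import numpy as np

def triangular_membership(x, center, width):
    """Degree to which parameter x belongs to the fuzzy set centered at `center`."""
    return max(0.0, 1.0 - abs(x - center) / width)

def fused_action(policies, centers, width, param, obs):
    """Membership-weighted fusion of actions from policies trained at different
    nominal parameter values: policies whose training parameter is close to the
    current one dominate the collective decision."""
    weights = np.array([triangular_membership(param, c, width) for c in centers])
    if weights.sum() == 0.0:  # outside all supports: fall back to the nearest policy
        weights[int(np.argmin([abs(param - c) for c in centers]))] = 1.0
    weights /= weights.sum()
    actions = np.stack([pi(obs) for pi in policies])  # one action per policy
    return weights @ actions

# toy usage: three "policies" tuned for payload masses of 0.5, 1.0 and 1.5 kg
policies = [lambda obs, k=k: k * obs for k in (0.5, 1.0, 1.5)]
print(fused_action(policies, centers=[0.5, 1.0, 1.5], width=0.5,
                   param=0.8, obs=np.ones(2)))
```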

Improving reward shaping in Deep RL to avoid designers’ cognitive biases and boost learning efficiency

Jiawei Lin, Xuekai Wei, Weizhi Xian, Jielu Yan, Leong Hou U, Yong Feng, Zhaowei Shang, Mingliang Zhou, Continuous reinforcement learning via advantage value difference reward shaping: A proximal policy optimization perspective, Engineering Applications of Artificial Intelligence, Volume 151, 2025, DOI: 10.1016/j.engappai.2025.110676.

Deep reinforcement learning has shown great promise in industrial applications. However, these algorithms suffer from low learning efficiency because of sparse reward signals in continuous control tasks. Reward shaping addresses this issue by transforming sparse rewards into more informative signals, but some designs that rely on domain experts or heuristic rules can introduce cognitive biases, leading to suboptimal solutions. To overcome this challenge, this paper proposes the advantage value difference (AVD), a generalized potential-based end-to-end exploration reward function. The main contribution of this paper is to improve the agent’s exploration efficiency, accelerate the learning process, and prevent premature convergence to local optima. The method leverages the temporal difference error to estimate the potential of states and uses the advantage function to guide the learning process toward more effective strategies. In the context of engineering applications, this paper proves the superiority of AVD in continuous control tasks within the multi-joint dynamics with contact (MuJoCo) environment. Specifically, the proposed method achieves an average increase of 23.5% in episode rewards for the Hopper, Swimmer, and Humanoid tasks compared with the state-of-the-art approaches. The results demonstrate the significant improvement in learning efficiency achieved by AVD for industrial robotic systems.
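For readers unfamiliar with potential-based shaping, the skeleton the paper builds on looks roughly as follows; the exact AVD formulation is in the paper, and the choice of potential, weighting, and names here are assumptions for illustration.

```python
def shaped_reward(r_env, v_s, v_s_next, advantage, gamma=0.99, beta=0.1):
    """Generic potential-based shaping, r' = r + gamma*Phi(s') - Phi(s), with
    the critic's value estimate standing in for the potential Phi and an
    advantage term steering exploration toward currently preferred actions.
    `beta` scales the shaping signal against the (possibly sparse) task reward."""
    potential_term = gamma * v_s_next - v_s
    return r_env + beta * (potential_term + advantage)
```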

Using Deep RL to model transitions and observations in EKF localization

Islem Kobbi, Abdelhak Benamirouche, Mohamed Tadjine, Enhancing pose estimation for mobile robots: A comparative analysis of deep reinforcement learning algorithms for adaptive Extended Kalman Filter-based estimation, Engineering Applications of Artificial Intelligence, Volume 150, 2025, DOI: 10.1016/j.engappai.2025.110548.

The Extended Kalman Filter (EKF) is a widely used algorithm for state estimation in control systems. However, its lack of adaptability limits its performance in dynamic and uncertain environments. To address this limitation, we use an approach that leverages Deep Reinforcement Learning (DRL) to achieve adaptive state estimation in the EKF. By integrating DRL techniques, we enable the state estimator to autonomously learn and update the values of the system dynamics and measurement noise covariance matrices, Q and R, based on observed data, which encode environmental changes or system failures. In this research, we compare the performance of four DRL algorithms, namely Deep Deterministic Policy Gradient (DDPG), Twin Delayed Deep Deterministic Policy Gradient (TD3), Soft Actor-Critic (SAC), and Proximal Policy Optimization (PPO), in optimizing the EKF’s adaptability. The experiments are conducted in both simulated and real-world settings using the Gazebo simulation environment and the Robot Operating System (ROS). The results demonstrate that the DRL-based adaptive state estimator outperforms traditional methods in terms of estimation accuracy and robustness. The comparative analysis provides insights into the strengths and limitations of different DRL agents, showing that TD3 and DDPG are the most effective algorithms, with TD3 achieving superior performance, resulting in a 91% improvement over the classic EKF, due to its delayed update mechanism that reduces training noise. This research highlights the potential of DRL to advance state estimation algorithms, offering valuable insights for future work in adaptive estimation techniques.
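In practice the coupling between agent and filter is simple: the agent’s action parameterizes Q and R, and the filter runs as usual. A minimal numpy sketch, assuming a linear model for brevity (in the EKF proper, F and H would be Jacobians of the dynamics and measurement functions) and an action holding log-diagonals so that any real-valued agent output yields valid covariances:

```python
import numpy as np

def adaptive_ekf_step(x, P, z, F, H, action):
    """One predict/update cycle where the DRL agent's action sets the noise models."""
    n, m = x.size, z.size
    Q = np.diag(np.exp(action[:n]))        # process noise chosen by the agent
    R = np.diag(np.exp(action[n:n + m]))   # measurement noise chosen by the agent
    # predict
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # update
    y = z - H @ x_pred                     # innovation
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(n) - K @ H) @ P_pred
    return x_new, P_new
```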

On the explainability of Deep RL and its improvement through the integration of human preferences

Georgios Angelopoulos, Luigi Mangiacapra, Alessandra Rossi, Claudia Di Napoli, Silvia Rossi, What is behind the curtain? Increasing transparency in reinforcement learning with human preferences and explanations, Engineering Applications of Artificial Intelligence, Volume 149, 2025, DOI: 10.1016/j.engappai.2025.110520.

In this work, we investigate whether the transparency of a robot’s behaviour is improved when human preferences on the actions the robot performs are taken into account during the learning process. For this purpose, a shielding mechanism called Preference Shielding is proposed and included in a reinforcement learning algorithm to account for human preferences. We also use the shielding to decide when to provide explanations of the robot’s actions. We carried out a within-subjects study involving 26 participants to evaluate the robot’s transparency. Results indicate that considering human preferences during learning improves legibility compared with providing only explanations. In addition, combining human preferences and explanations further amplifies transparency. Results also confirm that increased transparency leads to an increase in people’s perception of the robot’s safety, comfort, and reliability. These findings show the importance of transparency during learning and suggest a paradigm for robotic applications when a robot has to learn a task in the presence of or in collaboration with a human.
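Mechanically, a shield of this kind sits as a filter between the learner and the robot. The sketch below conveys the idea for a discrete action set; it is an illustration, not the paper’s exact Preference Shielding mechanism, and the override flag marks the moment an explanation would be offered to the user.

```python
import numpy as np

def preference_shield(q_values, preference_mask):
    """Pick the highest-value action among those the human marked as preferred.
    Returns the chosen action and whether the shield overrode the greedy choice."""
    greedy = int(np.argmax(q_values))
    if preference_mask[greedy]:
        return greedy, False               # greedy action already respects preferences
    masked = np.full(q_values.shape, -np.inf)
    masked[preference_mask] = q_values[preference_mask]
    return int(np.argmax(masked)), True    # override: a good moment to explain why
```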

Generating intrinsic rewards to address the sparse reward problem of RL

Z. Gao et al., Self-Supervised Exploration via Temporal Inconsistency in Reinforcement Learning, IEEE Transactions on Artificial Intelligence, vol. 5, no. 11, pp. 5530-5539, Nov. 2024, DOI: 10.1109/TAI.2024.3413692.

In sparse extrinsic reward settings, reinforcement learning remains a challenge despite increasing interest in this field. Existing approaches suggest that intrinsic rewards can alleviate issues caused by reward sparsity. However, many studies overlook the critical role of temporal information, essential for human curiosity. This article introduces a novel intrinsic reward mechanism inspired by human learning processes, where curiosity is evaluated by comparing current observations with historical knowledge. Our method involves training a self-supervised prediction model, periodically saving snapshots of the model parameters, and employing the nuclear norm to assess the temporal inconsistency between predictions from different snapshots as intrinsic rewards. Additionally, we propose a variational weighting mechanism to adaptively assign weights to the snapshots, enhancing the model’s robustness and performance. Experimental results across various benchmark environments demonstrate the efficacy of our approach, which outperforms other state-of-the-art methods without incurring additional training costs and exhibits higher noise tolerance. Our findings indicate that leveraging temporal information in intrinsic rewards can significantly improve exploration performance, motivating future research to develop more robust and accurate reward systems for reinforcement learning.
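The reward itself reduces to a disagreement measure between saved models. A minimal sketch, assuming each snapshot maps an observation to a predicted feature matrix and that the snapshot weights are given (the paper learns them through a variational mechanism):

```python
import numpy as np

def temporal_inconsistency_bonus(snapshots, obs, weights=None):
    """Intrinsic reward from disagreement between consecutive model snapshots,
    measured by the nuclear norm (sum of singular values) of the prediction
    difference: states where the model changed most recently look most 'curious'."""
    preds = [f(obs) for f in snapshots]
    if weights is None:  # uniform weights in place of the learned variational ones
        weights = np.ones(len(preds) - 1) / (len(preds) - 1)
    bonus = 0.0
    for w, p_old, p_new in zip(weights, preds[:-1], preds[1:]):
        bonus += w * np.linalg.norm(p_new - p_old, ord='nuc')
    return bonus
```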

Improving sample efficiency under sparse rewards and large continuous action spaces through predictive control in RL

Antonyshyn, L., Givigi, S., Deep Model-Based Reinforcement Learning for Predictive Control of Robotic Systems with Dense and Sparse Rewards, J Intell Robot Syst 110, 100 (2024) DOI: 10.1007/s10846-024-02118-y.

Sparse rewards and sample efficiency are open areas of research in the field of reinforcement learning. These problems are especially important when considering applications of reinforcement learning to robotics and other cyber-physical systems. This is so because in these domains many tasks are goal-based and naturally expressed with binary successes and failures, action spaces are large and continuous, and real interactions with the environment are limited. In this work, we propose Deep Value-and-Predictive-Model Control (DVPMC), a model-based predictive reinforcement learning algorithm for continuous control that uses system identification, value function approximation and sampling-based optimization to select actions. The algorithm is evaluated on a dense reward and a sparse reward task. We show that it can match the performance of a predictive control approach to the dense reward problem, and outperforms model-free and model-based learning algorithms on the sparse reward task on the metrics of sample efficiency and performance. We verify the performance of an agent trained in simulation using DVPMC on a real robot playing the reach-avoid game. Video of the experiment can be found here: https://youtu.be/0Q274kcfn4c.
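The control loop follows the familiar model-predictive pattern, with the learned value function scoring rollouts of the identified model. A random-shooting sketch under assumed names (the paper’s sampling-based optimizer and scoring may differ):

```python
import numpy as np

def predictive_action(model, value_fn, state, horizon=5, n_samples=256,
                      act_dim=2, act_limit=1.0, gamma=0.99, rng=None):
    """Sample random action sequences, roll them through the learned one-step
    dynamics `model`, score each by the discounted value of the terminal state,
    and execute only the first action of the best sequence (the MPC pattern)."""
    if rng is None:
        rng = np.random.default_rng()
    seqs = rng.uniform(-act_limit, act_limit, size=(n_samples, horizon, act_dim))
    scores = np.empty(n_samples)
    for i, seq in enumerate(seqs):
        s = state
        for a in seq:
            s = model(s, a)                # learned one-step dynamics
        scores[i] = (gamma ** horizon) * value_fn(s)
    return seqs[np.argmax(scores)][0]
```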

Reducing the need for samples in RL through evolutionary techniques

Onori, G., Shahid, A.A., Braghin, F., et al., Adaptive Optimization of Hyper-Parameters for Robotic Manipulation through Evolutionary Reinforcement Learning, J Intell Robot Syst 110, 108 (2024) DOI: 10.1007/s10846-024-02138-8.

Deep Reinforcement Learning applications are growing due to their capability of teaching the agent any task autonomously and generalizing the learning. However, this comes at the cost of a large number of samples and interactions with the environment. Moreover, the robustness of learned policies is usually achieved by a tedious tuning of hyper-parameters and reward functions. In order to address this issue, this paper proposes an evolutionary RL algorithm for the adaptive optimization of hyper-parameters. The policy is trained using an on-policy algorithm, Proximal Policy Optimization (PPO), coupled with an evolutionary algorithm. The achieved results demonstrate an improvement in the sample efficiency of the RL training on a robotic grasping task. In particular, the learning is improved with respect to the baseline case of a non-evolutionary agent: the evolutionary agent needs fewer samples to completely learn the grasping task, enabled by the adaptive transfer of knowledge between the agents through the evolutionary algorithm. The proposed approach also demonstrates the possibility of updating reward parameters during training, potentially providing a general approach to creating reward functions.
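The evolutionary layer is essentially a population-based exploit/explore loop on top of PPO. A PBT-style sketch, assuming each learner is a dict of policy weights and hyper-parameters (the paper’s evolutionary operators may differ):

```python
import numpy as np

def evolve_population(population, scores, sigma=0.2, rng=None):
    """One exploit/explore step: the bottom half copies the weights of the top
    half (knowledge transfer between agents) and perturbs its hyper-parameters
    multiplicatively, so learning rates, clip ranges, etc. keep adapting."""
    if rng is None:
        rng = np.random.default_rng()
    order = np.argsort(scores)             # ascending: worst performers first
    half = len(population) // 2
    for bad, good in zip(order[:half], order[-half:]):
        population[bad]['params'] = dict(population[good]['params'])    # exploit
        population[bad]['hparams'] = {k: v * float(rng.lognormal(0.0, sigma))
                                      for k, v in population[good]['hparams'].items()}  # explore
    return population
```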

Improving explainability of deep RL in Robotics

Mehran Taghian, Shotaro Miwa, Yoshihiro Mitsuka, Johannes Günther, Shadan Golestan, Osmar Zaiane, Explainability of deep reinforcement learning algorithms in robotic domains by using Layer-wise Relevance Propagation, Engineering Applications of Artificial Intelligence, Volume 137, Part A, 2024 DOI: 10.1016/j.engappai.2024.109131.

A key component to the recent success of reinforcement learning is the introduction of neural networks for representation learning. Doing so allows for solving challenging problems in several domains, one of which is robotics. However, a major criticism of deep reinforcement learning (DRL) algorithms is their lack of explainability and interpretability. This problem is even exacerbated in robotics, as robots oftentimes share space with humans, making it imperative to be able to reason about their behavior. In this paper, we propose to analyze the learned representation in a robotic setting by utilizing Graph Networks (GNs). Using the GN and Layer-wise Relevance Propagation (LRP), we represent the observations as an entity-relationship to allow us to interpret the learned policy. We evaluate our approach in two environments in MuJoCo. These two environments were delicately designed to effectively measure the value of knowledge gained by our approach to analyzing learned representations. This approach allows us to analyze not only how different parts of the observation space contribute to the decision-making process but also differentiate between policies and their differences in performance. This difference in performance also allows for reasoning about the agent’s recovery from faults. These insights are key contributions to explainable deep reinforcement learning in robotic settings.
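As a reminder of what LRP computes, here is the epsilon rule on plain dense layers (the paper propagates relevance through Graph Networks, which is more involved); `weights[l]` maps layer l to l+1 and `activations[l]` is the input to that layer, both names being illustrative.

```python
import numpy as np

def lrp_dense(weights, activations, relevance_out, eps=1e-6):
    """Redistribute output relevance backwards through dense layers so each
    input dimension's score reflects its contribution to the decision."""
    R = relevance_out
    for W, a in zip(reversed(weights), reversed(activations)):
        z = a @ W
        z = z + eps * np.where(z >= 0, 1.0, -1.0)  # stabilize against division by zero
        s = R / z                                  # relevance per unit of pre-activation
        R = a * (s @ W.T)                          # redistribute to the layer below
    return R
```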

A relatively simple way of reducing the sampling cost of DQN

Hossein Hassani, Soodeh Nikan, Abdallah Shami, Traffic navigation via reinforcement learning with episodic-guided prioritized experience replay, Engineering Applications of Artificial Intelligence, Volume 137, Part A, 2024, DOI: 10.1016/j.engappai.2024.109147.

Deep Reinforcement Learning (DRL) models play a fundamental role in autonomous driving applications; however, they typically suffer from sample inefficiency because they often require many interactions with the environment to learn effective policies. This makes the training process time-consuming. To address this shortcoming, Prioritized Experience Replay (PER) has proven to be effective by prioritizing samples with high Temporal-Difference (TD) error for learning. In this context, this study contributes to artificial intelligence by proposing a sample-efficient DRL algorithm called Episodic-Guided Prioritized Experience Replay (EPER). The core innovation of EPER lies in the utilization of an episodic memory, dedicated to storing successful training episodes. Within this memory, expected returns for each state–action pair are extracted. These returns, combined with TD error-based prioritization, form a novel objective function for deep Q-network training. To prevent excessive determinism, EPER introduces exploration into the learning process by incorporating a regularization term into the objective function that allows exploration of state-space regions with diverse Q-values. The proposed EPER algorithm is suitable to train a DRL agent for handling episodic tasks, and it can be integrated into off-policy DRL models. EPER is employed for traffic navigation through scenarios such as highway driving, merging, roundabout, and intersection to showcase its application in engineering. The attained results denote that, compared with the PER and an additional state-of-the-art training technique, EPER is superior in expediting the training of the agent and learning a more optimal policy that leads to lower collision rates within the constructed navigation scenarios.
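Stripped to its essentials, the change over plain PER is in how priorities are computed. An illustrative blend (not the paper’s exact objective), assuming the expected return of a state-action pair has already been looked up in the episodic memory:

```python
def eper_priority(td_error, q_value, episodic_return, eta=0.5, eps=1e-3):
    """Priority mixing the usual TD-error magnitude with how far the current
    Q-estimate is from the return achieved in a stored successful episode, so
    transitions that contradict remembered successes are replayed sooner."""
    return (1 - eta) * abs(td_error) + eta * abs(episodic_return - q_value) + eps
```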

A good survey and taxonomy for DRL in robotics

Chen Tang, Ben Abbatematteo, Jiaheng Hu, Rohan Chandra, Roberto Martín-Martín, Peter Stone, Deep Reinforcement Learning for Robotics: A Survey of Real-World Successes, arXiv:2408.03539 [cs.RO], https://www.arxiv.org/abs/2408.03539.

Reinforcement learning (RL), particularly its combination with deep neural networks referred to as deep RL (DRL), has shown tremendous promise across a wide range of applications, suggesting its potential for enabling the development of sophisticated robotic behaviors. Robotics problems, however, pose fundamental difficulties for the application of RL, stemming from the complexity and cost of interacting with the physical world. This article provides a modern survey of DRL for robotics, with a particular focus on evaluating the real-world successes achieved with DRL in realizing several key robotic competencies. Our analysis aims to identify the key factors underlying those exciting successes, reveal underexplored areas, and provide an overall characterization of the status of DRL in robotics. We highlight several important avenues for future work, emphasizing the need for stable and sample-efficient real-world RL paradigms, holistic approaches for discovering and integrating various competencies to tackle complex long-horizon, open-world tasks, and principled development and evaluation procedures. This survey is designed to offer insights for both RL practitioners and roboticists toward harnessing RL’s power to create generally capable real-world robotic systems.