Category Archives: Applications Of Reinforcement Learning To Robots

A nice summary of RL applied to robot navigation

N. Khlif, N. Khraief and S. Belghith, Reinforcement Learning for Mobile Robot Navigation: An overview IEEE Information Technologies & Smart Industrial Systems (ITSIS), Paris, France, 2022, pp. 1-7 DOI: 10.1109/ITSIS56166.2022.10118362.

For several years, research shows that interest in autonomous mobile robots is increasing and it has more and more grown. Autonomous mobile robots is an object of discussion but nowadays it’s an emerging topic due to the all progress related to field like autonomous driving and UAV (drones). Integrating intelligence into robotic systems requires solving various research problems, including one of the most important problems of mobile robotic systems: navigation. Find the answers to the following three questions: What is the localisation of the robot? Where are the robot going? How can it get there? presenting the solution of mobile robot navigation problem. These questions are answered by basic navigation parts which are localization, mapping and path planning. The paper present an overview of research on autonomous mobile robot navigation. First, a quick introduction to the various features of navigation. We also discuss machine learning and reinforcement learning in mobile robotics. Furthermore, we will discuss some path planning techniques. Some future directions are also suggested.

Mixing rule-based and reinforcement learning navigation for robots

Y. Zhu, Z. Wang, C. Chen and D. Dong, Rule-Based Reinforcement Learning for Efficient Robot Navigation With Space Reduction, IEEE/ASME Transactions on Mechatronics, vol. 27, no. 2, pp. 846-857, April 2022 DOI: 10.1109/TMECH.2021.3072675.

For real-world deployments, it is critical to allow robots to navigate in complex environments autonomously. Traditional methods usually maintain an internal map of the environment, and then design several simple rules, in conjunction with a localization and planning approach, to navigate through the internal map. These approaches often involve a variety of assumptions and prior knowledge. In contrast, recent reinforcement learning (RL) methods can provide a model-free, self-learning mechanism as the robot interacts with an initially unknown environment, but are expensive to deploy in real-world scenarios due to inefficient exploration. In this article, we focus on efficient navigation with the RL technique and combine the advantages of these two kinds of methods into a rule-based RL (RuRL) algorithm for reducing the sample complexity and cost of time. First, we use the rule of wall-following to generate a closed-loop trajectory. Second, we employ a reduction rule to shrink the trajectory, which in turn effectively reduces the redundant exploration space. Besides, we give the detailed theoretical guarantee that the optimal navigation path is still in the reduced space. Third, in the reduced space, we utilize the Pledge rule to guide the exploration strategy for accelerating the RL process at the early stage. Experiments conducted on real robot navigation problems in hex-grid environments demonstrate that RuRL can achieve improved navigation performance.

Incremental learning (i.e., non-stationary environments, online -live- learning, task adaptation, life-long learning,…) for robots with Q-learning

Y. Hu, D. Li, Y. He and J. Han, Incremental Learning Framework for Autonomous Robots Based on Q-Learning and the Adaptive Kernel Linear Model IEEE Transactions on Cognitive and Developmental Systems, vol. 14, no. 1, pp. 64-74, March 2022 DOI: 10.1109/TCDS.2019.2962228.

The performance of autonomous robots in varying environments needs to be improved. For such incremental improvement, here we propose an incremental learning framework based on Q -learning and the adaptive kernel linear (AKL) model. The AKL model is used for storing behavioral policies that are learned by Q -learning. Both the structure and parameters of the AKL model can be trained using a novel L2-norm kernel recursive least squares (L2-KRLS) algorithm. The AKL model initially without nodes and gradually accumulates content. The proposed framework allows to learn new behaviors without forgetting the previous ones. A novel local \u03b5 -greedy policy is proposed to speed the convergence rate of Q -learning. It calculates the exploration probability of each state for generating and selecting more important training samples. The performance of our incremental learning framework was validated in two experiments. A curve-fitting example shows that the L2-KRLS-based AKL model is suitable for incremental learning. The second experiment is based on robot learning tasks. The results show that our framework can incrementally learn behaviors in varying environments. Local \u03b5 -greedy policy-based Q -learning is faster than the existing Q -learning algorithms.

Adaptation of model-free RL to variations in the task under continuous state and action spaces applied to robot grasping

Shahid, A.A., Piga, D., Braghin, F. et al. Continuous control actions learning and adaptation for robotic manipulation through reinforcement learning, Auton Robot 46, 483\u2013498 (2022) DOI: 10.1007/s10514-022-10034-z.

This paper presents a learning-based method that uses simulation data to learn an object manipulation task using two model-free reinforcement learning (RL) algorithms. The learning performance is compared across on-policy and off-policy algorithms: Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). In order to accelerate the learning process, the fine-tuning procedure is proposed that demonstrates the continuous adaptation of on-policy RL to new environments, allowing the learned policy to adapt and execute the (partially) modified task. A dense reward function is designed for the task to enable an efficient learning of the agent. A grasping task involving a Franka Emika Panda manipulator is considered as the reference task to be learned. The learned control policy is demonstrated to be generalizable across multiple object geometries and initial robot/parts configurations. The approach is finally tested on a real Franka Emika Panda robot, showing the possibility to transfer the learned behavior from simulation. Experimental results show 100% of successful grasping tasks, making the proposed approach applicable to real applications.

Using physical human-robot interaction to deduce the goals of the human during learning

Losey DP, Bajcsy A, O’Malley MK, Dragan AD, Physical interaction as communication: Learning robot objectives online from human corrections, The International Journal of Robotics Research. 2022;41(1):20-44 DOI: 10.1177/02783649211050958.

When a robot performs a task next to a human, physical interaction is inevitable: the human might push, pull, twist, or guide the robot. The state of the art treats these interactions as disturbances that the robot should reject or avoid. At best, these robots respond safely while the human interacts; but after the human lets go, these robots simply return to their original behavior. We recognize that physical human\u2013robot interaction (pHRI) is often intentional: the human intervenes on purpose because the robot is not doing the task correctly. In this article, we argue that when pHRI is intentional it is also informative: the robot can leverage interactions to learn how it should complete the rest of its current task even after the person lets go. We formalize pHRI as a dynamical system, where the human has in mind an objective function they want the robot to optimize, but the robot does not get direct access to the parameters of this objective: they are internal to the human. Within our proposed framework human interactions become observations about the true objective. We introduce approximations to learn from and respond to pHRI in real-time. We recognize that not all human corrections are perfect: often users interact with the robot noisily, and so we improve the efficiency of robot learning from pHRI by reducing unintended learning. Finally, we conduct simulations and user studies on a robotic manipulator to compare our proposed approach with the state of the art. Our results indicate that learning from pHRI leads to better task performance and improved human satisfaction.

Learning rewards from diverse human sources

Bıyık E, Losey DP, Palan M, Landolfi NC, Shevchuk G, Sadigh D., Learning reward functions from diverse sources of human feedback: Optimally integrating demonstrations and preferences, . The International Journal of Robotics Research. 2022;41(1):45-67 DOI: 10.1177/02783649211041652.

Reward functions are a common way to specify the objective of a robot. As designing reward functions can be extremely challenging, a more promising approach is to directly learn reward functions from human teachers. Importantly, data from human teachers can be collected either passively or actively in a variety of forms: passive data sources include demonstrations (e.g., kinesthetic guidance), whereas preferences (e.g., comparative rankings) are actively elicited. Prior research has independently applied reward learning to these different data sources. However, there exist many domains where multiple sources are complementary and expressive. Motivated by this general problem, we present a framework to integrate multiple sources of information, which are either passively or actively collected from human users. In particular, we present an algorithm that first utilizes user demonstrations to initialize a belief about the reward function, and then actively probes the user with preference queries to zero-in on their true reward. This algorithm not only enables us combine multiple data sources, but it also informs the robot when it should leverage each type of information. Further, our approach accounts for the human’s ability to provide data: yielding user-friendly preference queries which are also theoretically optimal. Our extensive simulated experiments and user studies on a Fetch mobile manipulator demonstrate the superiority and the usability of our integrated framework..

Including a safety procedure in RL to avoid physical agent problems while learning

Kim Peter Wabersich, Melanie N. Zeilinger, A predictive safety filter for learning-based control of constrained nonlinear dynamical systems, . Automatica, Volume 129, 2021 DOI: 10.1016/j.automatica.2021.109597.

The transfer of reinforcement learning (RL) techniques into real-world applications is challenged by safety requirements in the presence of physical limitations. Most RL methods, in particular the most popular algorithms, do not support explicit consideration of state and input constraints. In this paper, we address this problem for nonlinear systems with continuous state and input spaces by introducing a predictive safety filter, which is able to turn a constrained dynamical system into an unconstrained safe system and to which any RL algorithm can be applied ‘out-of-the-box’. The predictive safety filter receives the proposed control input and decides, based on the current system state, if it can be safely applied to the real system, or if it has to be modified otherwise. Safety is thereby established by a continuously updated safety policy, which is based on a model predictive control formulation using a data-driven system model and considering state and input dependent uncertainties.

A hierarchical robot control architecture that supports learning of skills at different levels through “curriculum learning” and an interesting approach to mix behaviours

Suro, F., Ferber, J., Stratulat, T. et al., A hierarchical representation of behaviour supporting open ended development and progressive learning for artificial agents, . Auton Robot 45, 245–264 (2021) DOI: 10.1007/s10514-020-09960-7.

One of the challenging aspects of open ended or lifelong agent development is that the final behaviour for which an agent is trained at a given moment can be an element for the future creation of one, or even several, behaviours of greater complexity, whose purpose cannot be anticipated. In this paper, we present modular influence network design (MIND), an artificial agent control architecture suited to open ended and cumulative learning. The MIND architecture encapsulates sub behaviours into modules and combines them into a hierarchy reflecting the modular and hierarchical nature of complex tasks. Compared to similar research, the main original aspect of MIND is the multi layered hierarchy using a generic control signal, the influence, to obtain an efficient global behaviour. This article shows the ability of MIND to learn a curriculum of independent didactic tasks of increasing complexity covering different aspects of a desired behaviour. In so doing we demonstrate the contributions of MIND to open-ended development: encapsulation into modules allows for the preservation and re-usability of all the skills acquired during the curriculum and their focused retraining, the modular structure serves the evolving topology by easing the coordination of new sensors, actuators and heterogeneous learning structures.

Discrete Q-learning used, along a Deep CNN for localization, for mobile robot navigation

Amirhossein Shantia, Rik Timmers, Yiebo Chong, Cornel Kuiper, Francesco Bidoia, Lambert Schomaker, Marco Wiering, Two-stage visual navigation by deep neural networks and multi-goal reinforcement learning, . Robotics and Autonomous Systems, Volume 138, 2021 DOI: 10.1016/j.robot.2021.103731.

In this paper, we propose a two-stage learning framework for visual navigation in which the experience of the agent during exploration of one goal is shared to learn to navigate to other goals. We train a deep neural network for estimating the robot’s position in the environment using ground truth information provided by a classical localization and mapping approach. The second simpler multi-goal Q-function learns to traverse the environment by using the provided discretized map. Transfer learning is applied to the multi-goal Q-function from a maze structure to a 2D simulator and is finally deployed in a 3D simulator where the robot uses the estimated locations from the position estimator deep network. In the experiments, we first compare different architectures to select the best deep network for location estimation, and then compare the effects of the multi-goal reinforcement learning method to traditional reinforcement learning. The results show a significant improvement when multi-goal reinforcement learning is used. Furthermore, the results of the location estimator show that a deep network can learn and generalize in different environments using camera images with high accuracy in both position and orientation.

Summary of the state of the art and current challenges of Deep RL in Robotics

Ibarz J, Tan J, Finn C, Kalakrishnan M, Pastor P, Levine S., How to train your robot with deep reinforcement learning: lessons we have learned, . The International Journal of Robotics Research. 2021;40(4-5):698-721 DOI: 10.1177/0278364920987859.

Deep reinforcement learning (RL) has emerged as a promising approach for autonomously acquiring complex behaviors from low-level sensor observations. Although a large portion of deep RL research has focused on applications in video games and simulated control, which does not connect with the constraints of learning in real environments, deep RL has also demonstrated promise in enabling physical robots to learn complex skills in the real world. At the same time, real-world robotics provides an appealing domain for evaluating such algorithms, as it connects directly to how humans learn: as an embodied agent in the real world. Learning to perceive and move in the real world presents numerous challenges, some of which are easier to address than others, and some of which are often not considered in RL research that focuses only on simulated domains. In this review article, we present a number of case studies involving robotic deep RL. Building off of these case studies, we discuss commonly perceived challenges in deep RL and how they have been addressed in these works. We also provide an overview of other outstanding challenges, many of which are unique to the real-world robotics setting and are not often the focus of mainstream RL research. Our goal is to provide a resource both for roboticists and machine learning researchers who are interested in furthering the progress of deep RL in the real world.


  • Interesting summary of the state of the arts and algorithms used.
  • Defining reward beforehand partly defeats the primary goal of learning by itself.
  • Re-using experiences gathered for learning a task for other tasks, since experiences are mostly task-independent.
  • The problem of leaving the robot unattended while learning, and of mechanism damages and wear-tear. “Learning physically requires human presence for resetting experiments, monitoring hardware status and ensuring safety”. “The majority of robot learning experiments to date were conducted on a single robot closely monitored by a single human operator. This one-to-one relation between robot and operator has been a tedious but effective way to ensure continuous and safe operation. The human can reset the scene, stop the robot in unsafe situations, and simply restart and reset the robot on failures. However, to scale up data collection efforts and increase the throughput of evaluation runs, robots need to run without human supervision. It is impractical to allocate more operators to a set-up with multiple robots, or whenever a single robot is meant to run 24/7, and especially both.” “Repeated falling, self-collisions, jerky actuation, and collisions with obstacles may damage the robot and its surroundings, which will require costly repairs and manual interventions ” “We use the term robot persistence to refer to the capability of the robot to persist in collecting data and training with minimal human intervention.”
  • The Reality Gap can be very important, and so the life-long adaptation. “The reality gap is a major obstacle that prevents the application of learning to robotics”. “we found that the actuator dynamics and the lack of latency modeling are the main causes of the model error” in the reality gap. “Hardware degradation, such as change of battery level, wear and tear, and hardware failure, are the major causes of dynamic changes”
  • Recognizing dangerous situations: section 4.11.3, even learn them.
  • Importance of learning bad situations together with good situations: “to add demonstration data to the data buffer for the off-policy algorithm” -> “tends to be problematic in practice, because commonly used approximate dynamic programming methods (i.e., value function estimation) need to see both good and bad experience to learn which actions are desirable. Therefore, when the demonstrations are much better than the agent’s own experience, the value function will typically learn that the demonstrated states are better, but might fail to learn which actions must be taken to reach those states.” -> can be intertwined together, mixing their results into one (“joint training”) -> better to learn the models in model-based.
  • Simulation is needed to reduce the effort of real learning.”In the last few years, the OpenAI Gym benchmark (Brockman et al., 2016) is the key driving force behind the development of deep RL and its application to robotics”
  • “Generally speaking, among model-free techniques, off-policy methods are about an order of magnitude more data efficient than on-policy methods. Model-based methods could be another order of magnitude more data efficient than their model-free counterparts.”
  • The presence of delays in the learning loop compromises Markovianity and thus RL performance (sect. 4.8). These delays are not covered by simulators. Compensating delay techniques are addressed in sect. 4.3.1. “Latency measures the delay from when the observation is measured at the sensor, to when the action is actually executed at the actuator. This delay is usually on the order of milliseconds to seconds, depending on the hardware and the complexity of the policy. The existence of latency means that the next state of the system does not directly depend on the measured state, but instead on the state after a delay of latency after the measurement, which is not observable. Latency violates the most fundamental assumption of MDP (Xiao et al., 2020), and thus can cause failure to some RL algorithms.” ” For model-based methods, the planning component is often computationally expensive, and incurs additional latency.”
  • “pretrain a policy network with demonstrations via learning (also called behavioral cloning)”
  • Overfitting may be a cause of worsening learning quality with more experiences.
  • “effective exploration is particularly challenging in tasks with sparse reward. In the most extreme version of this problem, the agent must essentially find a (high-reward) needle in a (zero-reward) haystack. Unfortunately, the most natural formulation of many practical robotics tasks has this property. For this reason, a number of prior works have focused on studying exploration for sparse-reward robotic tasks”
  • A main drawback of Deep RL is the need of massive data.
  • High sensitivity of algorithms, particularly Deep ones, to the initial state and to the way their hyperparameters are set, specially for Off-policy algorithms.
  • “There is a tradeoff here as more environment diversity may cause the policies to have lower performance. Often this can be alleviated with larger and better neural network architectures”