Discrete Q-learning used, along with a deep CNN for localization, for mobile robot navigation

Amirhossein Shantia, Rik Timmers, Yiebo Chong, Cornel Kuiper, Francesco Bidoia, Lambert Schomaker, Marco Wiering, Two-stage visual navigation by deep neural networks and multi-goal reinforcement learning, Robotics and Autonomous Systems, Volume 138, 2021 DOI: 10.1016/j.robot.2021.103731.

In this paper, we propose a two-stage learning framework for visual navigation in which the experience of the agent during exploration of one goal is shared to learn to navigate to other goals. We train a deep neural network for estimating the robot’s position in the environment using ground truth information provided by a classical localization and mapping approach. The second simpler multi-goal Q-function learns to traverse the environment by using the provided discretized map. Transfer learning is applied to the multi-goal Q-function from a maze structure to a 2D simulator and is finally deployed in a 3D simulator where the robot uses the estimated locations from the position estimator deep network. In the experiments, we first compare different architectures to select the best deep network for location estimation, and then compare the effects of the multi-goal reinforcement learning method to traditional reinforcement learning. The results show a significant improvement when multi-goal reinforcement learning is used. Furthermore, the results of the location estimator show that a deep network can learn and generalize in different environments using camera images with high accuracy in both position and orientation.
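The experience-sharing idea in the abstract can be illustrated with a small tabular sketch. This is not the authors' implementation: the 4x4 grid, the reward of 1 on reaching a goal, the purely random exploration, and all names (`step`, `train`, `greedy_path`) are assumptions made for illustration. The point it shows is that every observed transition updates the Q-values of all goals, not only the goal being pursued.

```python
import random

# Hypothetical 4x4 grid world; cells are (row, col), moves are deterministic.
SIZE = 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
CELLS = [(r, c) for r in range(SIZE) for c in range(SIZE)]

def step(state, action):
    """Move one cell, staying in place when the move would leave the grid."""
    r, c = state[0] + action[0], state[1] + action[1]
    return (r, c) if 0 <= r < SIZE and 0 <= c < SIZE else state

def train(episodes=500, steps=30, alpha=0.5, gamma=0.9, seed=0):
    """Tabular multi-goal Q-learning with shared experience (illustrative)."""
    rng = random.Random(seed)
    n_actions = len(ACTIONS)
    # One Q-value per (goal, state, action index).
    q = {(g, s, a): 0.0 for g in CELLS for s in CELLS for a in range(n_actions)}
    for _ in range(episodes):
        s = rng.choice(CELLS)
        for _ in range(steps):
            a = rng.randrange(n_actions)  # pure random exploration
            s2 = step(s, ACTIONS[a])
            # Share the transition: update the Q-function of every goal.
            for g in CELLS:
                if s2 == g:
                    target = 1.0  # arriving at g is terminal for goal g
                else:
                    target = gamma * max(q[(g, s2, b)] for b in range(n_actions))
                q[(g, s, a)] += alpha * (target - q[(g, s, a)])
            s = s2
    return q

def greedy_path(q, start, goal, max_steps=20):
    """Follow the greedy policy of the Q-function associated with `goal`."""
    path, s = [start], start
    while s != goal and len(path) <= max_steps:
        a = max(range(len(ACTIONS)), key=lambda b: q[(goal, s, b)])
        s = step(s, ACTIONS[a])
        path.append(s)
    return path
```

After a single call to `train()`, the same table can be queried for any goal, for example `greedy_path(q, (0, 0), (3, 3))`, without collecting new experience for that goal.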

Summary of the state of the art and current challenges of Deep RL in Robotics

Ibarz J, Tan J, Finn C, Kalakrishnan M, Pastor P, Levine S., How to train your robot with deep reinforcement learning: lessons we have learned, The International Journal of Robotics Research. 2021;40(4-5):698-721 DOI: 10.1177/0278364920987859.

Deep reinforcement learning (RL) has emerged as a promising approach for autonomously acquiring complex behaviors from low-level sensor observations. Although a large portion of deep RL research has focused on applications in video games and simulated control, which does not connect with the constraints of learning in real environments, deep RL has also demonstrated promise in enabling physical robots to learn complex skills in the real world. At the same time, real-world robotics provides an appealing domain for evaluating such algorithms, as it connects directly to how humans learn: as an embodied agent in the real world. Learning to perceive and move in the real world presents numerous challenges, some of which are easier to address than others, and some of which are often not considered in RL research that focuses only on simulated domains. In this review article, we present a number of case studies involving robotic deep RL. Building off of these case studies, we discuss commonly perceived challenges in deep RL and how they have been addressed in these works. We also provide an overview of other outstanding challenges, many of which are unique to the real-world robotics setting and are not often the focus of mainstream RL research. Our goal is to provide a resource both for roboticists and machine learning researchers who are interested in furthering the progress of deep RL in the real world.


  • Interesting summary of the state of the art and of the algorithms used.
  • Defining the reward beforehand partly defeats the primary goal of having the robot learn by itself.
  • Re-using experiences gathered for learning a task for other tasks, since experiences are mostly task-independent.
  • The problem of leaving the robot unattended while learning, and of mechanical damage and wear and tear. “Learning physically requires human presence for resetting experiments, monitoring hardware status and ensuring safety”. “The majority of robot learning experiments to date were conducted on a single robot closely monitored by a single human operator. This one-to-one relation between robot and operator has been a tedious but effective way to ensure continuous and safe operation. The human can reset the scene, stop the robot in unsafe situations, and simply restart and reset the robot on failures. However, to scale up data collection efforts and increase the throughput of evaluation runs, robots need to run without human supervision. It is impractical to allocate more operators to a set-up with multiple robots, or whenever a single robot is meant to run 24/7, and especially both.” “Repeated falling, self-collisions, jerky actuation, and collisions with obstacles may damage the robot and its surroundings, which will require costly repairs and manual interventions” “We use the term robot persistence to refer to the capability of the robot to persist in collecting data and training with minimal human intervention.”
  • The reality gap can be very important, and so can life-long adaptation. “The reality gap is a major obstacle that prevents the application of learning to robotics”. “we found that the actuator dynamics and the lack of latency modeling are the main causes of the model error” in the reality gap. “Hardware degradation, such as change of battery level, wear and tear, and hardware failure, are the major causes of dynamic changes”
  • Recognizing dangerous situations, and even learning them (section 4.11.3).
  • Importance of learning from bad situations together with good ones: “to add demonstration data to the data buffer for the off-policy algorithm” -> “tends to be problematic in practice, because commonly used approximate dynamic programming methods (i.e., value function estimation) need to see both good and bad experience to learn which actions are desirable. Therefore, when the demonstrations are much better than the agent’s own experience, the value function will typically learn that the demonstrated states are better, but might fail to learn which actions must be taken to reach those states.” -> can be intertwined together, mixing their results into one (“joint training”) -> better to learn the models in model-based methods.
  • Simulation is needed to reduce the effort of real learning. “In the last few years, the OpenAI Gym benchmark (Brockman et al., 2016) is the key driving force behind the development of deep RL and its application to robotics”
  • “Generally speaking, among model-free techniques, off-policy methods are about an order of magnitude more data efficient than on-policy methods. Model-based methods could be another order of magnitude more data efficient than their model-free counterparts.”
  • The presence of delays in the learning loop compromises Markovianity and thus RL performance (sect. 4.8). These delays are not covered by simulators. Delay-compensation techniques are addressed in sect. 4.3.1. “Latency measures the delay from when the observation is measured at the sensor, to when the action is actually executed at the actuator. This delay is usually on the order of milliseconds to seconds, depending on the hardware and the complexity of the policy. The existence of latency means that the next state of the system does not directly depend on the measured state, but instead on the state after a delay of latency after the measurement, which is not observable. Latency violates the most fundamental assumption of MDP (Xiao et al., 2020), and thus can cause failure to some RL algorithms.” “For model-based methods, the planning component is often computationally expensive, and incurs additional latency.”
  • “pretrain a policy network with demonstrations via learning (also called behavioral cloning)”
  • Overfitting may cause learning quality to worsen as more experience is gathered.
  • “effective exploration is particularly challenging in tasks with sparse reward. In the most extreme version of this problem, the agent must essentially find a (high-reward) needle in a (zero-reward) haystack. Unfortunately, the most natural formulation of many practical robotics tasks has this property. For this reason, a number of prior works have focused on studying exploration for sparse-reward robotic tasks”
  • A main drawback of deep RL is the need for massive amounts of data.
  • High sensitivity of the algorithms, particularly the deep ones, to the initial state and to the way their hyperparameters are set, especially for off-policy algorithms.
  • “There is a tradeoff here as more environment diversity may cause the policies to have lower performance. Often this can be alleviated with larger and better neural network architectures”
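The note on latency above can be made concrete with a small sketch. It is an illustration under stated assumptions, not a method taken from the review: the delay is assumed constant and known, and `DelayedActionEnv`, `Integrator`, and `noop_action` are hypothetical names. The wrapper adds actuation delay to a simulator that lacks it, and exposes a commonly used remedy: appending the queue of not-yet-executed actions to the observation, so that the next state again depends only on the (augmented) current state and the chosen action.

```python
from collections import deque

class DelayedActionEnv:
    """Wraps an environment so each action takes effect `delay` steps late."""

    def __init__(self, env, delay, noop_action):
        self.env = env
        self.delay = delay
        self.noop_action = noop_action
        self.pending = deque()

    def reset(self):
        # Fill the queue with no-ops: nothing has been commanded yet.
        self.pending = deque([self.noop_action] * self.delay)
        return self.env.reset()

    def step(self, action):
        # Queue the new command and execute the oldest one instead.
        self.pending.append(action)
        return self.env.step(self.pending.popleft())

    def augmented_state(self, observation):
        """Observation plus the queued actions that have not been executed."""
        return (observation, tuple(self.pending))


class Integrator:
    """Toy plant (hypothetical): the position changes by the executed action."""

    def reset(self):
        self.x = 0
        return self.x

    def step(self, action):
        self.x += action
        return self.x
```

With `delay=2`, the first two commands sent to `Integrator` appear to have no effect, which is exactly the mismatch an agent trained without latency would face on hardware.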

Mixing Monte-Carlo Tree Search with Q-learning for robot learning

Francesco Riccio, Roberto Capobianco, Daniele Nardi, LoOP: Iterative learning for optimistic planning on robots, Robotics and Autonomous Systems, Volume 136, 2021 DOI: 10.1016/j.robot.2020.103693.

Efficient robotic behaviors require robustness and adaptation to dynamic changes of the environment, whose characteristics rapidly vary during robot operation. To generate effective robot action policies, planning and learning techniques have shown the most promising results. However, if considered individually, they present different limitations. Planning techniques lack generalization among similar states and require experts to define behavioral routines at different levels of abstraction. Conversely, learning methods usually require a considerable number of training samples and iterations of the algorithm. To overcome these issues, and to efficiently generate robot behaviors, we introduce LoOP, an iterative learning algorithm for optimistic planning that combines state-of-the-art planning and learning techniques to generate action policies. The main contribution of LoOP is the combination of Monte-Carlo Search Planning and Q-learning, which enables focused exploration during policy refinement in different robotic applications. We demonstrate the robustness and flexibility of LoOP in various domains and multiple robotic platforms, by validating the proposed approach with an extensive experimental evaluation.

Deep learning RL methods for robot navigation

Luong, M., Pham, C., Incremental Learning for Autonomous Navigation of Mobile Robots based on Deep Reinforcement Learning, J Intell Robot Syst 101, 1 (2021) DOI: 10.1007/s10846-020-01262-5.

This paper presents an incremental learning method and system for autonomous robot navigation. The range finder laser sensor and online deep reinforcement learning are utilized for generating the navigation policy, which is effective for avoiding obstacles along the robot’s trajectories as well as for robot’s reaching the destination. An empirical experiment is conducted under simulation and real-world settings. Under the simulation environment, the results show that the proposed method can generate a highly effective navigation policy (more than 90% accuracy) after only 150k training iterations. Moreover, our system has slightly outperformed deep-Q, while having considerably surpassed Proximal Policy Optimization, two recent state-of-the art robot navigation systems. Finally, two experiments are performed to demonstrate the feasibility and effectiveness of our robot’s proposed navigation system in real-time under real-world settings.

Combination of analytical models with NN learning for predicting action effects

Kloss A, Schaal S, Bohg J., Combining learned and analytical models for predicting action effects from sensory data, The International Journal of Robotics Research. 2022;41(8):778-797 DOI: 10.1177/0278364920954896.

One of the most basic skills a robot should possess is predicting the effect of physical interactions with objects in the environment. This enables optimal action selection to reach a certain goal state. Traditionally, dynamics are approximated by physics-based analytical models. These models rely on specific state representations that may be hard to obtain from raw sensory data, especially if no knowledge of the object shape is assumed. More recently, we have seen learning approaches that can predict the effect of complex physical interactions directly from sensory input. It is, however, an open question how far these models generalize beyond their training data. In this work, we investigate the advantages and limitations of neural-network-based learning approaches for predicting the effects of actions based on sensory input and show how analytical and learned models can be combined to leverage the best of both worlds. As physical interaction task, we use planar pushing, for which there exists a well-known analytical model and a large real-world dataset. We propose the use of a convolutional neural network to convert raw depth images or organized point clouds into a suitable representation for the analytical model and compare this approach with using neural networks for both, perception and prediction. A systematic evaluation of the proposed approach on a very large real-world dataset shows two main advantages of the hybrid architecture. Compared with a pure neural network, it significantly (i) reduces required training data and (ii) improves generalization to novel physical interaction.

Improving the simulation-to-real transfer of learning robotic skills by learning smaller skills and how to connect them in reality

Julian RC, Heiden E, He Z, et al., Scaling simulation-to-real transfer by learning a latent space of robot skills, The International Journal of Robotics Research. 2020;39(10-11):1259-1278 DOI: 10.1177/0278364920944474.

We present a strategy for simulation-to-real transfer, which builds on recent advances in robot skill decomposition. Rather than focusing on minimizing the simulation–reality gap, we propose a method for increasing the sample efficiency and robustness of existing simulation-to-real approaches which exploits hierarchy and online adaptation. Instead of learning a unique policy for each desired robotic task, we learn a diverse set of skills and their variations, and embed those skill variations in a continuously parameterized space. We then interpolate, search, and plan in this space to find a transferable policy which solves more complex, high-level tasks by combining low-level skills and their variations. In this work, we first characterize the behavior of this learned skill space, by experimenting with several techniques for composing pre-learned latent skills. We then discuss an algorithm which allows our method to perform long-horizon tasks never seen in simulation, by intelligently sequencing short-horizon latent skills. Our algorithm adapts to unseen tasks online by repeatedly choosing new skills from the latent space, using live sensor data and simulation to predict which latent skill will perform best next in the real world. Importantly, our method learns to control a real robot in joint-space to achieve these high-level tasks with little or no on-robot time, despite the fact that the low-level policies may not be perfectly transferable from simulation to real, and that the low-level skills were not trained on any examples of high-level tasks. In addition to our results indicating a lower sample complexity for families of tasks, we believe that our method provides a promising template for combining learning-based methods with proven classical robotics algorithms such as model-predictive control.

Combination of RL with human provided models for navigation

Amarildo Likmeta, Alberto Maria Metelli, Andrea Tirinzoni, Riccardo Giol, Marcello Restelli, Danilo Romano, Combining reinforcement learning with rule-based controllers for transparent and general decision-making in autonomous driving, Robotics and Autonomous Systems, Volume 131, 2020 DOI: 10.1016/j.robot.2020.103568.

The design of high-level decision-making systems is a topical problem in the field of autonomous driving. In this paper, we combine traditional rule-based strategies and reinforcement learning (RL) with the goal of achieving transparency and robustness. On the one hand, the use of handcrafted rule-based controllers allows for transparency, i.e., it is always possible to determine why a given decision was made, but they struggle to scale to complex driving scenarios, in which several objectives need to be considered. On the other hand, black-box RL approaches enable us to deal with more complex scenarios, but they are usually hardly interpretable. In this paper, we combine the best properties of these two worlds by designing parametric rule-based controllers, in which interpretable rules can be provided by domain experts and their parameters are learned via RL. After illustrating how to apply parameter-based RL methods (PGPE) to this setting, we present extensive numerical simulations in the highway and in two urban scenarios: intersection and roundabout. For each scenario, we show the formalization as an RL problem and we discuss the results of our approach in comparison with handcrafted rule-based controllers and black-box RL techniques.
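The idea of learning the parameters of an interpretable rule with a parameter-exploring method can be sketched as follows. This is a simplified illustration, not the paper's controllers or scenarios: the single rule "brake when the gap is below a threshold", the closed-form toy return, and every numeric value are assumptions. It is also a reduced variant of PGPE in which only the mean of the Gaussian over parameters is updated, while full PGPE adapts the standard deviation as well.

```python
import random

def rollout_return(brake_distance):
    """Return of one toy episode for the rule 'brake when gap < brake_distance'."""
    speed, decel = 20.0, 5.0                 # m/s and m/s^2 (hypothetical values)
    stopping = speed ** 2 / (2 * decel)      # 40 m are needed to stop
    if brake_distance < stopping:
        return -100.0                        # braking too late: collision
    return -(brake_distance - stopping)      # braking too early: wasted time

def pgpe(iterations=300, batch=20, mu=80.0, sigma=5.0, lr=0.5, seed=0):
    """Mean-only PGPE-style update of the rule threshold (sigma kept fixed)."""
    rng = random.Random(seed)
    for _ in range(iterations):
        # Explore in parameter space: each rollout uses one fixed threshold.
        thetas = [rng.gauss(mu, sigma) for _ in range(batch)]
        returns = [rollout_return(t) for t in thetas]
        baseline = sum(returns) / batch
        # d/dmu log N(theta; mu, sigma) = (theta - mu) / sigma^2
        grad = sum((r - baseline) * (t - mu) / sigma ** 2
                   for t, r in zip(thetas, returns)) / batch
        mu += lr * grad
    return mu
```

The learned controller stays readable: the result of `pgpe()` is a single braking distance in metres. In this toy setting the mean tends to settle somewhat above the 40 m stopping distance, since sampled thresholds close to 40 m occasionally fall below it and incur the collision penalty.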

Bayesian estimation of the model in model-based RL for robots

Senda, Kei, Hishinuma, Toru, Tani, Yurika, Approximate Bayesian reinforcement learning based on estimation of plant, Autonomous Robots 44(5), DOI: 10.1007/s10514-020-09901-4.

This study proposes an approximate parametric model-based Bayesian reinforcement learning approach for robots, based on online Bayesian estimation and online planning for an estimated model. The proposed approach is designed to learn a robotic task with a few real-world samples and to be robust against model uncertainty, within feasible computational resources. The proposed approach employs two-stage modeling, which is composed of (1) a parametric differential equation model with a few parameters based on prior knowledge such as equations of motion, and (2) a parametric model that interpolates a finite number of transition probability models for online estimation and planning. The proposed approach modifies the online Bayesian estimation to be robust against approximation errors of the parametric model to a real plant. The policy planned for the interpolating model is proven to have a form of theoretical robustness. Numerical simulation and hardware experiments of a planar peg-in-hole task demonstrate the effectiveness of the proposed approach.
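The online estimation over a finite number of transition probability models can be illustrated with a plain discrete Bayes update. This sketch omits the interpolation between models and the robustness modification described in the abstract, and the candidate "slip probability" models and all names are assumptions made for illustration only.

```python
def bayes_update(prior, likelihoods):
    """Posterior over models: proportional to prior times observation likelihood."""
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical candidate models: each predicts a different slip probability.
SLIP_PROB = [0.1, 0.5, 0.9]

def estimate(observations):
    """Belief over candidate models after a list of booleans (True = slip seen)."""
    belief = [1.0 / len(SLIP_PROB)] * len(SLIP_PROB)  # uniform prior
    for slipped in observations:
        likelihoods = [p if slipped else 1.0 - p for p in SLIP_PROB]
        belief = bayes_update(belief, likelihoods)
    return belief
```

For instance, `estimate([True, True, False, True])` concentrates the belief on the two higher slip probabilities; a planner could then act on the most probable model or on the belief-weighted mixture.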

Including the models into the state of a POMDP for learning them (using POMCPs in a robotic application)

Akinobu Hayashi, Dirk Ruiken, Tadaaki Hasegawa, Christian Goerick, Reasoning about uncertain parameters and agent behaviors through encoded experiences and belief planning, Artificial Intelligence, Volume 280, 2020 DOI: 10.1016/j.artint.2019.103228.

Robots are expected to handle increasingly complex tasks. Such tasks often include interaction with objects or collaboration with other agents. One of the key challenges for reasoning in such situations is the lack of accurate models that hinders the effectiveness of planners. We present a system for online model adaptation that continuously validates and improves models while solving tasks with a belief space planner. We employ the well known online belief planner POMCP. Particles are used to represent hypotheses about the current state and about models of the world. They are sufficient to configure a simulator to provide transition and observation models. We propose an enhanced particle reinvigoration process that leverages prior experiences encoded in a recurrent neural network (RNN). The network is trained through interaction with a large variety of object and agent parametrizations. The RNN is combined with a mixture density network (MDN) to process the current history of observations in order to propose suitable particles and models parametrizations. The proposed method also ensures that newly generated particles are consistent with the current history. These enhancements to the particle reinvigoration process help alleviate problems arising from poor sampling quality in large state spaces and enable handling of dynamics with discontinuities. The proposed approach can be applied to a variety of domains depending on what uncertainty the decision maker needs to reason about. We evaluate the approach with experiments in several domains and compare against other state-of-the-art methods. Experiments are done in a collaborative multi-agent and a single agent object manipulation domain. The experiments are performed both in simulation and on a real robot. The framework handles reasoning with uncertain agent behaviors and with unknown object and environment parametrizations well. 
The results show good performance and indicate that the proposed approach can improve existing state-of-the-art methods.

Application of Deep RL to person following by a robot, reducing the training effort of the network by reusing simple state situations in many artificially generated states

Pang, L., Zhang, Y., Coleman, S. et al., Efficient Hybrid-Supervised Deep Reinforcement Learning for Person Following Robot, J Intell Robot Syst 97, 299–312 (2020), DOI: 10.1007/s10846-019-01030-0.

Traditional person following robots usually need hand-crafted features and a well-designed controller to follow the assigned person. Normally it is difficult to be applied in outdoor situations due to variability and complexity of the environment. In this paper, we propose an approach in which an agent is trained by hybrid-supervised deep reinforcement learning (DRL) to perform a person following task in end-to-end manner. The approach enables the robot to learn features autonomously from monocular images and to enhance performance via robot-environment interaction. Experiments show that the proposed approach is adaptive to complex situations with significant illumination variation, object occlusion, target disappearance, pose change, and pedestrian interference. In order to speed up the training process to ensure easy application of DRL to real-world robotic follower controls, we apply an integration method through which the agent receives prior knowledge from a supervised learning (SL) policy network and reinforces its performance with a value-based or policy-based (including actor-critic method) DRL model. We also utilize an efficient data collection approach for supervised learning in the context of person following. Experimental results not only verify the robustness of the proposed DRL-based person following robot system, but also indicate how easily the robot can learn from mistakes and improve performance.