Ibarz J, Tan J, Finn C, Kalakrishnan M, Pastor P, Levine S. How to train your robot with deep reinforcement learning: lessons we have learned. The International Journal of Robotics Research. 2021;40(4-5):698-721. DOI: 10.1177/0278364920987859.
Deep reinforcement learning (RL) has emerged as a promising approach for autonomously acquiring complex behaviors from low-level sensor observations. Although a large portion of deep RL research has focused on applications in video games and simulated control, which does not connect with the constraints of learning in real environments, deep RL has also demonstrated promise in enabling physical robots to learn complex skills in the real world. At the same time, real-world robotics provides an appealing domain for evaluating such algorithms, as it connects directly to how humans learn: as an embodied agent in the real world. Learning to perceive and move in the real world presents numerous challenges, some of which are easier to address than others, and some of which are often not considered in RL research that focuses only on simulated domains. In this review article, we present a number of case studies involving robotic deep RL. Building off of these case studies, we discuss commonly perceived challenges in deep RL and how they have been addressed in these works. We also provide an overview of other outstanding challenges, many of which are unique to the real-world robotics setting and are not often the focus of mainstream RL research. Our goal is to provide a resource both for roboticists and machine learning researchers who are interested in furthering the progress of deep RL in the real world.
NOTES:
- Interesting summary of the state of the art and of the algorithms used.
- Defining the reward beforehand partly defeats the primary goal of having the robot learn by itself.
- Experiences gathered while learning one task can be re-used for other tasks, since experiences are mostly task-independent.
- The problem of leaving the robot unattended while learning, and of mechanical damage and wear and tear. “Learning physically requires human presence for resetting experiments, monitoring hardware status and ensuring safety”. “The majority of robot learning experiments to date were conducted on a single robot closely monitored by a single human operator. This one-to-one relation between robot and operator has been a tedious but effective way to ensure continuous and safe operation. The human can reset the scene, stop the robot in unsafe situations, and simply restart and reset the robot on failures. However, to scale up data collection efforts and increase the throughput of evaluation runs, robots need to run without human supervision. It is impractical to allocate more operators to a set-up with multiple robots, or whenever a single robot is meant to run 24/7, and especially both.” “Repeated falling, self-collisions, jerky actuation, and collisions with obstacles may damage the robot and its surroundings, which will require costly repairs and manual interventions” “We use the term robot persistence to refer to the capability of the robot to persist in collecting data and training with minimal human intervention.”
- The reality gap can be very important, and so can life-long adaptation. “The reality gap is a major obstacle that prevents the application of learning to robotics”. “we found that the actuator dynamics and the lack of latency modeling are the main causes of the model error” in the reality gap. “Hardware degradation, such as change of battery level, wear and tear, and hardware failure, are the major causes of dynamic changes”
- Recognizing dangerous situations, and even learning them: section 4.11.3.
- Importance of learning bad situations together with good situations: “to add demonstration data to the data buffer for the off-policy algorithm” -> “tends to be problematic in practice, because commonly used approximate dynamic programming methods (i.e., value function estimation) need to see both good and bad experience to learn which actions are desirable. Therefore, when the demonstrations are much better than the agent’s own experience, the value function will typically learn that the demonstrated states are better, but might fail to learn which actions must be taken to reach those states.” -> can be intertwined together, mixing their results into one (“joint training”) -> better to learn the models in model-based.
- Simulation is needed to reduce the effort of real learning. “In the last few years, the OpenAI Gym benchmark (Brockman et al., 2016) is the key driving force behind the development of deep RL and its application to robotics”
- “Generally speaking, among model-free techniques, off-policy methods are about an order of magnitude more data efficient than on-policy methods. Model-based methods could be another order of magnitude more data efficient than their model-free counterparts.”
- The presence of delays in the learning loop compromises Markovianity and thus RL performance (sect. 4.8). These delays are not covered by simulators. Delay compensation techniques are addressed in sect. 4.3.1. “Latency measures the delay from when the observation is measured at the sensor, to when the action is actually executed at the actuator. This delay is usually on the order of milliseconds to seconds, depending on the hardware and the complexity of the policy. The existence of latency means that the next state of the system does not directly depend on the measured state, but instead on the state after a delay of latency after the measurement, which is not observable. Latency violates the most fundamental assumption of MDP (Xiao et al., 2020), and thus can cause failure to some RL algorithms.” “For model-based methods, the planning component is often computationally expensive, and incurs additional latency.”
- “pretrain a policy network with demonstrations via learning (also called behavioral cloning)”
- Overfitting may be one reason why learning quality worsens as more experience is gathered.
- “effective exploration is particularly challenging in tasks with sparse reward. In the most extreme version of this problem, the agent must essentially find a (high-reward) needle in a (zero-reward) haystack. Unfortunately, the most natural formulation of many practical robotics tasks has this property. For this reason, a number of prior works have focused on studying exploration for sparse-reward robotic tasks”
- A main drawback of deep RL is the need for massive amounts of data.
- Algorithms, particularly deep ones, are highly sensitive to the initial state and to the way their hyperparameters are set, especially off-policy algorithms.
- “There is a tradeoff here as more environment diversity may cause the policies to have lower performance. Often this can be alleviated with larger and better neural network architectures”
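Sketch for the note on adding demonstration data to the off-policy data buffer. This is only a minimal illustration of the mechanism, not the paper's implementation: the class `MixedReplayBuffer`, the `demo_fraction` parameter and the string transitions are all hypothetical. Each sampled batch mixes a fixed share of demonstration transitions with the agent's own transitions. The quoted passage warns that this alone tends to be problematic when the demonstrations are much better than the agent's experience.

```python
import random


class MixedReplayBuffer:
    """Replay buffer holding demonstrations and agent experience separately (hypothetical)."""

    def __init__(self, demo_fraction, seed=0):
        self.demo_fraction = demo_fraction  # assumed share of each batch taken from demos
        self.demos = []
        self.agent = []
        self.rng = random.Random(seed)

    def add_demo(self, transition):
        self.demos.append(transition)

    def add_agent(self, transition):
        self.agent.append(transition)

    def sample(self, batch_size):
        # Fixed number of demonstration transitions per batch, the rest from the agent.
        n_demo = round(batch_size * self.demo_fraction) if self.demos else 0
        n_agent = batch_size - n_demo if self.agent else 0
        batch = [self.rng.choice(self.demos) for _ in range(n_demo)]
        batch += [self.rng.choice(self.agent) for _ in range(n_agent)]
        self.rng.shuffle(batch)
        return batch


buffer = MixedReplayBuffer(demo_fraction=0.25)
for _ in range(10):
    buffer.add_demo("demo")    # placeholder for a (state, action, reward, next_state) tuple
    buffer.add_agent("agent")  # placeholder for the agent's own, usually worse, experience
batch = buffer.sample(8)
```

Keeping the agent's own (often bad) transitions in every batch is what lets a value function see both good and bad experience.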
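Sketch for the latency note. The toy environment below is an assumption made for illustration (a 1-D integrator whose actions take effect `delay` steps late), and the remedy shown, appending the pending actions to the observation, is a common delay-compensation idea that may or may not match what sect. 4.3.1 describes. It shows why the measured state alone is no longer Markovian under latency.

```python
from collections import deque


class DelayedActionEnv:
    """Toy 1-D integrator whose actions are executed `delay` steps late (hypothetical)."""

    def __init__(self, delay):
        self.delay = delay
        self.reset()

    def reset(self):
        self.position = 0.0
        # Actions already issued but not yet executed, oldest first.
        self.pending = deque([0.0] * self.delay)
        return self.position

    def step(self, action):
        self.pending.append(action)
        executed = self.pending.popleft()  # the action issued `delay` steps ago
        self.position += executed
        return self.position

    def augmented_observation(self):
        # Measured state plus pending actions determines the next state again.
        return (self.position, tuple(self.pending))


env_a = DelayedActionEnv(delay=1)
env_b = DelayedActionEnv(delay=1)
obs_a = env_a.step(1.0)  # issued now, executed one step later
obs_b = env_b.step(0.0)
aug_a = env_a.augmented_observation()
aug_b = env_b.augmented_observation()
next_a = env_a.step(0.0)  # same observation, same action ...
next_b = env_b.step(0.0)  # ... but a different next state
```

Both environments report the same position before the second step, yet the same action leads to different next positions. The augmented observations differ, which is what restores the Markov property in this toy case.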
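Sketch for the behavioral cloning note. Behavioral cloning fits a policy to demonstrated state-action pairs as a regression problem. The paper pretrains a policy network. To stay self-contained, this sketch instead assumes a one-parameter linear policy fitted by closed-form least squares. The demonstration data and the names `fit_linear_policy`, `gain` and `policy` are hypothetical.

```python
def fit_linear_policy(states, actions):
    """Least-squares fit of action = gain * state to demonstration pairs."""
    numerator = sum(s * a for s, a in zip(states, actions))
    denominator = sum(s * s for s in states)
    return numerator / denominator


# Hypothetical demonstrations from an expert that applies action = -2 * state.
demo_states = [1.0, 2.0, 3.0]
demo_actions = [-2.0, -4.0, -6.0]

gain = fit_linear_policy(demo_states, demo_actions)


def policy(state):
    # Cloned policy: imitates the demonstrator on states like those it was shown.
    return gain * state
```

The cloned policy would then serve as the starting point for RL fine-tuning rather than as the final controller.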