Tag Archives: Deep Reinforcement Learning

On the influence of the representations obtained through Deep RL on the learning process

Han Wang, Erfan Miahi, Martha White, Marlos C. Machado, Zaheer Abbas, Raksha Kumaraswamy, Vincent Liu, Adam White, Investigating the properties of neural network representations in reinforcement learning, Artificial Intelligence, Volume 330, 2024 DOI: 10.1016/j.artint.2024.104100.

In this paper we investigate the properties of representations learned by deep reinforcement learning systems. Much of the early work on representations for reinforcement learning focused on designing fixed-basis architectures to achieve properties thought to be desirable, such as orthogonality and sparsity. In contrast, the idea behind deep reinforcement learning methods is that the agent designer should not encode representational properties, but rather that the data stream should determine the properties of the representation—good representations emerge under appropriate training schemes. In this paper we bring these two perspectives together, empirically investigating the properties of representations that support transfer in reinforcement learning. We introduce and measure six representational properties over more than 25,000 agent-task settings. We consider Deep Q-learning agents with different auxiliary losses in a pixel-based navigation environment, with source and transfer tasks corresponding to different goal locations. We develop a method to better understand why some representations work better for transfer, through a systematic approach varying task similarity and measuring and correlating representation properties with transfer performance. We demonstrate the generality of the methodology by investigating representations learned by a Rainbow agent that successfully transfers across Atari 2600 game modes.
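
The paper measures six representational properties; as a rough illustration of what such measurements look like, the sketch below computes two classic ones, sparsity and orthogonality, over a batch of feature vectors. The matrix `phi` is a hypothetical stand-in for penultimate-layer activations of a value network; the metrics and normalizations actually used in the paper may differ.

```python
import numpy as np

def sparsity(phi: np.ndarray, eps: float = 1e-8) -> float:
    """Fraction of (near-)zero activations over a batch of features.
    phi: (num_states, num_features), e.g. penultimate-layer activations."""
    return float(np.mean(np.abs(phi) < eps))

def orthogonality(phi: np.ndarray) -> float:
    """1 minus the average absolute cosine similarity between feature
    vectors of different states; 1.0 means pairwise-orthogonal features."""
    unit = phi / (np.linalg.norm(phi, axis=1, keepdims=True) + 1e-12)
    cos = np.abs(unit @ unit.T)
    off_diag = cos[~np.eye(len(phi), dtype=bool)]
    return float(1.0 - off_diag.mean())

# Hypothetical ReLU-like features for 128 states, 64 features each.
phi = np.maximum(0.0, np.random.randn(128, 64))
print(f"sparsity      = {sparsity(phi):.3f}")
print(f"orthogonality = {orthogonality(phi):.3f}")
```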

Graph NNs in RL for improving sample efficiency

Feng Zhang, Chengbin Xuan, Hak-Keung Lam, An obstacle avoidance-specific reinforcement learning method based on fuzzy attention mechanism and heterogeneous graph neural networks, Engineering Applications of Artificial Intelligence, Volume 130, 2024 DOI: 10.1016/j.engappai.2023.107764.

Deep reinforcement learning (RL) is an advancing learning tool to handle robotics control problems. However, it typically suffers from limited sample efficiency and effectiveness. The emergence of Graph Neural Networks (GNNs) enables the integration of RL and graph representation learning techniques. It realises outstanding training performance and transfer capability by casting control scenarios into the corresponding graph domain. Nevertheless, the existing approaches strongly depend on artificial graph formation processes with intensive bias and cannot propagate messages discriminatively on explicit physical dependence, which leads to restricted flexibility, size transfer capability and suboptimal performance. This paper proposes a fuzzy attention mechanism-based heterogeneous graph neural network (FAM-HGNN) framework for resolving the control problem under the RL context. FAM emphasises the significant connections and weakens the trivial connections in a fully connected graph, which mitigates the potential negative influence caused by the artificial graph formation process. HGNN obtains a higher level of relational inductive bias by conducting graph propagations on a masked graph. Experimental results show that our FAM-HGNN outperforms the multi-layer perceptron-based and the existing GNN-based RL approaches regarding training performance and size transfer capability. We also conducted an ablation study and sensitivity analysis to further validate the efficacy of the proposed method.
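
Details of FAM-HGNN are not given in the abstract, so the sketch below only illustrates the general flavour of fuzzy-gated attention on a fully connected graph: raw attention scores pass through a sigmoid-shaped membership function so that strong connections are emphasised and trivial ones damped before message aggregation. The node features, the membership shape, and the `sharpness` parameter are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuzzy_attention_step(nodes: np.ndarray, sharpness: float = 4.0) -> np.ndarray:
    """One message-passing step on a fully connected graph with fuzzy-gated
    attention. nodes: (num_nodes, dim) node embeddings (e.g. link states)."""
    scores = nodes @ nodes.T / np.sqrt(nodes.shape[1])             # dot-product scores
    membership = 1.0 / (1.0 + np.exp(-sharpness * (scores - scores.mean())))
    attn = softmax(scores * membership, axis=-1)                   # gated attention weights
    return attn @ nodes                                            # aggregated messages

robot_graph = np.random.randn(6, 16)              # hypothetical 6-node robot graph
print(fuzzy_attention_step(robot_graph).shape)    # (6, 16)
```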

Using Deep RL (TRPO) for selecting the best points of interest in the environment for path planning

Jie Fan, Xudong Zhang, Yuan Zou, Hierarchical path planner for unknown space exploration using reinforcement learning-based intelligent frontier selection, Expert Systems with Applications, Volume 230, 2023 DOI: 10.1016/j.eswa.2023.120630.

Path planning in unknown environments is extremely useful for some specific tasks, such as exploration of outer space planets, search and rescue in disaster areas, home sweeping services, etc. However, existing frontier-based path planners suffer from insufficient exploration, while reinforcement learning (RL)-based ones are confronted with problems in efficient training and effective searching. To overcome the above problems, this paper proposes a novel hierarchical path planner for unknown space exploration using RL-based intelligent frontier selection. Firstly, by decomposing the path planner into three-layered architecture (including the perception layer, planning layer, and control layer) and using edge detection to find potential frontiers to track, the path search space is shrunk from the whole map to a handful of points of interest, which significantly saves the computational resources in both training and execution processes. Secondly, one of the advanced RL algorithms, trust region policy optimization (TRPO), is used as a judge to select the best frontier for the robot to track, which ensures the optimality of the path planner with a shorter path length. The proposed method is validated through simulation and compared with both classic and state-of-the-art methods. Results show that the training process could be greatly accelerated compared with the traditional deep-Q network (DQN). Moreover, the proposed method has 4.2%–14.3% improvement in exploration region rate and achieves the highest exploration completeness.
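
As a hedged illustration of the planner's first stage, the sketch below extracts frontier candidates from a toy occupancy grid (free cells bordering unknown space); in the paper these points of interest would then be scored by the TRPO-trained selector, which is not reproduced here. The grid values and map layout are assumptions.

```python
import numpy as np

FREE, OCCUPIED, UNKNOWN = 0, 1, -1

def frontier_cells(grid: np.ndarray) -> np.ndarray:
    """Return (row, col) of free cells bordering unknown space.
    A frontier cell is free and has at least one unknown 4-neighbour;
    these are the candidate points of interest handed to the RL selector."""
    frontiers = []
    rows, cols = grid.shape
    for r in range(rows):
        for c in range(cols):
            if grid[r, c] != FREE:
                continue
            neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            if any(0 <= i < rows and 0 <= j < cols and grid[i, j] == UNKNOWN
                   for i, j in neighbours):
                frontiers.append((r, c))
    return np.array(frontiers)

# Toy map: left half explored and free, right half unknown.
grid = np.full((8, 8), UNKNOWN)
grid[:, :4] = FREE
print(frontier_cells(grid))   # cells in column 3, the free/unknown boundary
```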

Improving safety in deep RL in the case of autonomous driving

Eduardo Candela, Olivier Doustaly, Leandro Parada, Felix Feng, Yiannis Demiris, Panagiotis Angeloudis, Risk-aware controller for autonomous vehicles using model-based collision prediction and reinforcement learning, Artificial Intelligence, Volume 320, 2023 DOI: 10.1016/j.artint.2023.103923.

Autonomous Vehicles (AVs) have the potential to save millions of lives and increase the efficiency of transportation services. However, the successful deployment of AVs requires tackling multiple challenges related to modeling and certifying safety. State-of-the-art decision-making methods usually rely on end-to-end learning or imitation learning approaches, which still pose significant safety risks. Hence the necessity of risk-aware AVs that can better predict and handle dangerous situations. Furthermore, current approaches tend to lack explainability due to their reliance on end-to-end Deep Learning, where significant causal relationships are not guaranteed to be learned from data. This paper introduces a novel risk-aware framework for training AV agents using a bespoke collision prediction model and Reinforcement Learning (RL). The collision prediction model is based on Gaussian Processes and vehicle dynamics, and is used to generate the RL state vector. Using an explicit risk model increases the post-hoc explainability of the AV agent, which is vital for reaching and certifying the high safety levels required for AVs and other safety-sensitive applications. Experimental results obtained with a simulator and state-of-the-art RL algorithms show that the risk-aware RL framework decreases average collision rates by 15%, makes AVs more robust to sudden harsh braking situations, and achieves better performance in both safety and speed when compared to a standard rule-based method (the Intelligent Driver Model). Moreover, the proposed collision prediction model outperforms other models in the literature.
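
A minimal sketch of the general idea of feeding a model-based risk estimate into the RL state, assuming scikit-learn's GaussianProcessRegressor as a stand-in for the paper's collision prediction model; the features, the toy training data, and the paper's use of vehicle dynamics are simplified away.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical logged features (relative distance, relative speed) and a
# 0..1 label for how close the situation came to a collision.
X = np.array([[25.0, -1.0], [10.0, -6.0], [4.0, -8.0], [40.0, 2.0], [6.0, -5.0]])
y = np.array([0.0, 0.6, 1.0, 0.0, 0.8])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=10.0), alpha=1e-2)
gp.fit(X, y)

def risk_aware_state(raw_obs: np.ndarray, rel_features: np.ndarray) -> np.ndarray:
    """Append GP-predicted collision risk (and its uncertainty) to the
    observation the RL policy sees."""
    risk, std = gp.predict(rel_features.reshape(1, -1), return_std=True)
    return np.concatenate([raw_obs, [float(risk[0]), float(std[0])]])

obs = np.array([0.3, 12.5, 0.0])                 # hypothetical ego-vehicle state
print(risk_aware_state(obs, np.array([8.0, -7.0])))
```

Keeping the risk model explicit in this way is what the abstract credits with making the agent easier to inspect post hoc than a purely end-to-end policy.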

See also: https://doi.org/10.1016/j.artint.2023.103922
And also: https://doi.org/10.1177/02783649231169492

Doing more intelligent exploration in RL based on measuring uncertainty through prediction

Xiaoshu Zhou, Fei Zhu, Peiyao Zhao, Within the scope of prediction: Shaping intrinsic rewards via evaluating uncertainty, Expert Systems with Applications, Volume 206, 2022 DOI: 10.1016/j.eswa.2022.117775.

The agent in reinforcement learning based approaches needs to explore to learn more about the environment and seek the optimal policy. However, simply increasing the frequency of stochastic exploration sometimes fails to work or even causes the agent to fall into traps. To solve the problem, it is essential to improve the quality of exploration. An approach, referred to as the scope of prediction based on uncertainty exploration (SPE), is proposed, taking advantage of the uncertainty mechanism and considering the stochasticity of prospecting. Under the uncertainty mechanism, unexpected states elicit more curiosity: the model derives higher uncertainty by projecting future scenarios and comparing them with the actual future as it explores the world. The SPE method utilizes a prediction network to predict subsequent observations and calculates the mean squared difference between the predicted and the real subsequent observations to measure uncertainty, encouraging the agent to explore unknown regions more effectively. Moreover, to reduce the noise interference caused by uncertainty, a reward-penalty model is developed that discriminates noise using current observations and action predictions of future rewards, improving robustness against noise so that the agent can escape from noisy regions. Experimental results showed that deep reinforcement learning approaches equipped with SPE demonstrated significant improvements in simulated environments.
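
The core mechanism, an intrinsic reward equal to the prediction error of a forward model, can be sketched as follows; a single linear model stands in for the paper's prediction network, and the reward-penalty model for handling noise is omitted.

```python
import numpy as np

class PredictionIntrinsicReward:
    """Intrinsic reward = mean squared error between a forward model's
    predicted next observation and the real next observation (the basic
    idea behind prediction-based exploration bonuses)."""

    def __init__(self, obs_dim: int, act_dim: int, lr: float = 1e-2):
        self.W = np.zeros((obs_dim, obs_dim + act_dim))   # linear forward model
        self.lr = lr

    def __call__(self, obs, action, next_obs) -> float:
        x = np.concatenate([obs, action])
        pred = self.W @ x
        err = next_obs - pred
        # Gradient step so the bonus shrinks in well-understood regions.
        self.W += self.lr * np.outer(err, x)
        return float(np.mean(err ** 2))     # large in unfamiliar states

bonus = PredictionIntrinsicReward(obs_dim=4, act_dim=2)
r_int = bonus(np.ones(4), np.array([0.5, -0.5]), np.ones(4) * 1.1)
print(round(r_int, 4))
```

The returned bonus would typically be scaled and added to the extrinsic reward; it decays as the model learns to predict a region of the state space.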

Improving the quality of memory replay in RL through a biologically inspired evolutionary algorithm

M. Ramicic and A. Bonarini, Augmented Memory Replay in Reinforcement Learning With Continuous Control, IEEE Transactions on Cognitive and Developmental Systems, vol. 14, no. 2, pp. 485-496, June 2022 DOI: 10.1109/TCDS.2021.3050723.

Online reinforcement learning agents are currently able to process an increasing amount of data by converting it into higher-order value functions. This expansion of the information collected from the environment increases the agent's state space, enabling it to scale up to more complex problems but also increasing the risk of forgetting by learning on redundant or conflicting data. To improve the approximation of a large amount of data, a random mini-batch of the past experiences that are stored in the replay memory buffer is often replayed at each learning step. The proposed work takes inspiration from a biological mechanism which acts as a protective layer of higher cognitive functions found in the mammalian brain: active memory consolidation mitigates the effect of forgetting previous memories by dynamically processing the new ones. Similar dynamics are implemented by the proposed augmented memory replay or AMR algorithm. The architecture of AMR, based on a simple artificial neural network, is able to provide an augmentation policy which modifies each of the agent's experiences by augmenting their relevance prior to storing them in the replay memory. The function approximator of AMR is evolved using a genetic algorithm in order to obtain the specific augmentation policy function that yields the best performance of a learning agent in a specific environment, as given by its received cumulative reward. Experimental results show that an evolved AMR augmentation function capable of increasing the significance of specific memories is able to further increase the stability and convergence speed of learning algorithms dealing with the complexity of continuous action domains.
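
A rough sketch of the storage side of this idea: a tiny parametric "augmentation" function assigns each incoming transition a relevance weight before it enters the replay buffer, and sampling follows those weights. The per-transition feature vector and the single linear layer are assumptions; in the paper the augmentation function is a small neural network whose parameters are evolved by a genetic algorithm, which is not shown here.

```python
import random
import numpy as np

class AugmentedReplay:
    """Replay buffer whose sampling weights are scaled by an augmentation
    function before storage (a sketch of the AMR idea; `theta` would be
    evolved by a genetic algorithm rather than trained by gradient)."""

    def __init__(self, feat_dim: int, capacity: int = 10000):
        self.theta = np.random.randn(feat_dim) * 0.1
        self.buffer, self.weights = [], []
        self.capacity = capacity

    def relevance(self, features: np.ndarray) -> float:
        return float(np.exp(self.theta @ features))    # positive multiplier

    def store(self, transition, features: np.ndarray):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.weights.pop(0)
        self.buffer.append(transition)
        self.weights.append(self.relevance(features))

    def sample(self, batch_size: int):
        return random.choices(self.buffer, weights=self.weights, k=batch_size)

replay = AugmentedReplay(feat_dim=3)
for i in range(100):
    feats = np.array([abs(np.sin(i)), i / 100.0, 1.0])   # e.g. |TD error|, recency, bias
    replay.store(("s", "a", "r", "s2"), feats)
print(len(replay.sample(8)))
```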

Dealing with exploration, with a nice introduction to the problem

Jiayi Lu, Shuai Han, Shuai Lü, Meng Kang, Junwei Zhang, Sampling diversity driven exploration with state difference guidance, Expert Systems with Applications, Volume 203, 2022 DOI: 10.1016/j.eswa.2022.117418.

Exploration is one of the key issues of deep reinforcement learning, especially in the environments with sparse or deceptive rewards. Exploration based on intrinsic rewards can handle these environments. However, these methods cannot take both global interaction dynamics and local environment changes into account simultaneously. In this paper, we propose a novel intrinsic reward for off-policy learning, which not only encourages the agent to take actions not fully learned from a global perspective, but also instructs the agent to trigger remarkable changes in the environment from a local perspective. Meanwhile, we propose the double-actors–double-critics framework to combine intrinsic rewards with extrinsic rewards to avoid the inappropriate combination of intrinsic and extrinsic rewards in previous methods. This framework can be applied to off-policy learning algorithms based on the actor–critic method. We provide a comprehensive evaluation of our approach on the MuJoCo benchmark environments. The results demonstrate that our method can perform effective exploration in the environments with dense, deceptive and sparse rewards. Besides, we conduct sufficient ablation and quantitative analyses of intrinsic rewards. Furthermore, we also verify the superiority and rationality of our double-actors–double-critics framework through comparative experiments.
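
The abstract does not detail the double-actors–double-critics design, so the sketch below is only a tabular caricature of the underlying motivation: keep separate value estimates for extrinsic and intrinsic rewards instead of collapsing the two rewards into a single return, and mix them only when choosing exploratory actions. All constants and the mixing rule are assumptions.

```python
import numpy as np

# Tabular caricature: two value tables, one per reward stream, so the
# intrinsic bonus never contaminates the extrinsic return estimate.
n_states, n_actions, gamma, alpha = 5, 2, 0.9, 0.1
q_ext = np.zeros((n_states, n_actions))   # trained on task reward only
q_int = np.zeros((n_states, n_actions))   # trained on exploration bonus only

def update(s, a, r_ext, r_int, s2):
    q_ext[s, a] += alpha * (r_ext + gamma * q_ext[s2].max() - q_ext[s, a])
    q_int[s, a] += alpha * (r_int + gamma * q_int[s2].max() - q_int[s, a])

def behaviour_action(s, beta=0.5):
    # Exploration mixes both heads; evaluation would use q_ext alone.
    return int(np.argmax(q_ext[s] + beta * q_int[s]))

update(0, 1, r_ext=0.0, r_int=0.8, s2=1)
print(behaviour_action(0))
```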

Increasing exploration when the agent performs worse and decreasing it when it performs better, in the context of DQN for distributing computation among cloud and edge servers; the paper also deals with hybridizing RL with fuzzy logic

Do Bao Son, Ta Huu Binh, Hiep Khac Vo, Binh Minh Nguyen, Huynh Thi Thanh Binh, Shui Yu, Value-based reinforcement learning approaches for task offloading in Delay Constrained Vehicular Edge Computing, Engineering Applications of Artificial Intelligence, Volume 113, 2022 DOI: 10.1016/j.engappai.2022.104898.

In the age of booming information technology, humankind has witnessed the need for new paradigms with both high computational capability and low latency. A potential solution is Vehicular Edge Computing (VEC). Previous work proposed a Fuzzy Deep Q-Network in Offloading scheme (FDQO) that combines Fuzzy rules and Deep Q-Network (DQN) to improve DQN's early performance by using a Fuzzy Controller (FC). However, we notice that frequent usage of the FC can hinder the future growth of the model's performance. One way to overcome this issue is to remove the Fuzzy Controller entirely. We introduce an algorithm called baseline DQN (b-DQN), represented by its two variants, Static baseline DQN (Sb-DQN) and Dynamic baseline DQN (Db-DQN), to modify the exploration rate based on the average rewards of the closest observations. Our findings confirm that these baseline DQN algorithms surpass traditional DQN models in terms of average Quality of Experience (QoE) over 100 time slots by about 6%, but still suffer from poor early performance (such as in the first 5 time slots). Here, we introduce baseline FDQO (b-FDQO). This algorithm modifies the Fuzzy Logic usage instead of removing it entirely, while still observing the rewards to modify the exploration rate. It brings a higher average QoE in the first 5 time slots compared to other non-fuzzy-logic algorithms by at least 55.12%, prevents the model from obtaining excessively bad results over all time slots, and has late performance as good as that of b-DQN.
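
A minimal sketch of the stated exploration rule: raise the exploration rate when recent rewards fall below a running baseline of the closest observations, lower it otherwise. The window, step size, and bounds are assumptions, and the paper's Sb-DQN/Db-DQN variants define the baseline more specifically.

```python
from collections import deque

class AdaptiveEpsilon:
    """Adjust the epsilon-greedy exploration rate from recent performance:
    explore more when the latest episode reward is below the running
    baseline, less when it is above (sketch of the baseline-DQN idea)."""

    def __init__(self, eps=0.5, eps_min=0.05, eps_max=1.0, step=0.02, window=20):
        self.eps, self.eps_min, self.eps_max, self.step = eps, eps_min, eps_max, step
        self.recent = deque(maxlen=window)

    def update(self, episode_reward: float) -> float:
        if self.recent:
            baseline = sum(self.recent) / len(self.recent)
            if episode_reward < baseline:
                self.eps = min(self.eps_max, self.eps + self.step)  # doing worse: explore more
            else:
                self.eps = max(self.eps_min, self.eps - self.step)  # doing better: exploit more
        self.recent.append(episode_reward)
        return self.eps

sched = AdaptiveEpsilon()
for r in [1.0, 0.8, 1.2, 0.5, 1.5]:
    print(round(sched.update(r), 2))
```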

Generating counterfactual explanations of Deep RL decisions to identify flawed agents

Matthew L. Olson, Roli Khanna, Lawrence Neal, Fuxin Li, Weng-Keen Wong, Counterfactual state explanations for reinforcement learning agents via generative deep learning, Artificial Intelligence, Volume 295, 2021 DOI: 10.1016/j.artint.2021.103455.

Counterfactual explanations, which deal with “why not?” scenarios, can provide insightful explanations to an AI agent’s behavior [Miller [38]]. In this work, we focus on generating counterfactual explanations for deep reinforcement learning (RL) agents which operate in visual input environments like Atari. We introduce counterfactual state explanations, a novel example-based approach to counterfactual explanations based on generative deep learning. Specifically, a counterfactual state illustrates what minimal change is needed to an Atari game image such that the agent chooses a different action. We also evaluate the effectiveness of counterfactual states on human participants who are not machine learning experts. Our first user study investigates if humans can discern if the counterfactual state explanations are produced by the actual game or produced by a generative deep learning approach. Our second user study investigates if counterfactual state explanations can help non-expert participants identify a flawed agent; we compare against a baseline approach based on a nearest neighbor explanation which uses images from the actual game. Our results indicate that counterfactual state explanations have sufficient fidelity to the actual game images to enable non-experts to more effectively identify a flawed RL agent compared to the nearest neighbor baseline and to having no explanation at all.
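
To make the notion of a counterfactual state concrete, the sketch below runs a bare-bones gradient search for a minimally perturbed input that flips the agent's chosen action. The paper instead uses a deep generative model so that counterfactuals remain realistic-looking game frames, which this pixel-space version does not guarantee; the toy policy and image size are assumptions.

```python
import torch

def counterfactual_state(policy: torch.nn.Module, state: torch.Tensor,
                         target_action: int, steps: int = 200, lr: float = 0.05,
                         l2: float = 0.1) -> torch.Tensor:
    """Search for a small perturbation of `state` that makes the policy
    prefer `target_action`, penalising the size of the change."""
    delta = torch.zeros_like(state, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits = policy(state + delta)
        loss = -logits[0, target_action] + l2 * delta.pow(2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (state + delta).detach()

# Toy policy over flattened 84x84 frames and 4 discrete actions.
policy = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(84 * 84, 4))
frame = torch.rand(1, 84, 84)
cf = counterfactual_state(policy, frame, target_action=2)
print(torch.argmax(policy(cf), dim=1))   # ideally the target action
```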

Summary of the state of the art and current challenges of Deep RL in Robotics

Ibarz J, Tan J, Finn C, Kalakrishnan M, Pastor P, Levine S., How to train your robot with deep reinforcement learning: lessons we have learned, The International Journal of Robotics Research, 2021;40(4-5):698-721 DOI: 10.1177/0278364920987859.

Deep reinforcement learning (RL) has emerged as a promising approach for autonomously acquiring complex behaviors from low-level sensor observations. Although a large portion of deep RL research has focused on applications in video games and simulated control, which does not connect with the constraints of learning in real environments, deep RL has also demonstrated promise in enabling physical robots to learn complex skills in the real world. At the same time, real-world robotics provides an appealing domain for evaluating such algorithms, as it connects directly to how humans learn: as an embodied agent in the real world. Learning to perceive and move in the real world presents numerous challenges, some of which are easier to address than others, and some of which are often not considered in RL research that focuses only on simulated domains. In this review article, we present a number of case studies involving robotic deep RL. Building off of these case studies, we discuss commonly perceived challenges in deep RL and how they have been addressed in these works. We also provide an overview of other outstanding challenges, many of which are unique to the real-world robotics setting and are not often the focus of mainstream RL research. Our goal is to provide a resource both for roboticists and machine learning researchers who are interested in furthering the progress of deep RL in the real world.

NOTES:

  • Interesting summary of the state of the art and algorithms used.
  • Defining the reward beforehand partly defeats the primary goal of learning by itself.
  • Re-using experiences gathered for learning a task for other tasks, since experiences are mostly task-independent.
  • The problem of leaving the robot unattended while learning, and of mechanical damage and wear and tear. “Learning physically requires human presence for resetting experiments, monitoring hardware status and ensuring safety”. “The majority of robot learning experiments to date were conducted on a single robot closely monitored by a single human operator. This one-to-one relation between robot and operator has been a tedious but effective way to ensure continuous and safe operation. The human can reset the scene, stop the robot in unsafe situations, and simply restart and reset the robot on failures. However, to scale up data collection efforts and increase the throughput of evaluation runs, robots need to run without human supervision. It is impractical to allocate more operators to a set-up with multiple robots, or whenever a single robot is meant to run 24/7, and especially both.” “Repeated falling, self-collisions, jerky actuation, and collisions with obstacles may damage the robot and its surroundings, which will require costly repairs and manual interventions” “We use the term robot persistence to refer to the capability of the robot to persist in collecting data and training with minimal human intervention.”
  • The reality gap can be very important, and so is life-long adaptation. “The reality gap is a major obstacle that prevents the application of learning to robotics”. “we found that the actuator dynamics and the lack of latency modeling are the main causes of the model error” in the reality gap. “Hardware degradation, such as change of battery level, wear and tear, and hardware failure, are the major causes of dynamic changes”
  • Recognizing dangerous situations (section 4.11.3), and even learning them.
  • Importance of learning bad situations together with good situations: “to add demonstration data to the data buffer for the off-policy algorithm” -> “tends to be problematic in practice, because commonly used approximate dynamic programming methods (i.e., value function estimation) need to see both good and bad experience to learn which actions are desirable. Therefore, when the demonstrations are much better than the agent’s own experience, the value function will typically learn that the demonstrated states are better, but might fail to learn which actions must be taken to reach those states.” -> both can be intertwined, mixing their results into one (“joint training”) -> better to learn the models in model-based approaches.
  • Simulation is needed to reduce the effort of real-world learning. “In the last few years, the OpenAI Gym benchmark (Brockman et al., 2016) is the key driving force behind the development of deep RL and its application to robotics”
  • “Generally speaking, among model-free techniques, off-policy methods are about an order of magnitude more data efficient than on-policy methods. Model-based methods could be another order of magnitude more data efficient than their model-free counterparts.”
  • The presence of delays in the learning loop compromises Markovianity and thus RL performance (sect. 4.8). These delays are not covered by simulators. Delay-compensation techniques are addressed in sect. 4.3.1. “Latency measures the delay from when the observation is measured at the sensor, to when the action is actually executed at the actuator. This delay is usually on the order of milliseconds to seconds, depending on the hardware and the complexity of the policy. The existence of latency means that the next state of the system does not directly depend on the measured state, but instead on the state after a delay of latency after the measurement, which is not observable. Latency violates the most fundamental assumption of MDP (Xiao et al., 2020), and thus can cause failure to some RL algorithms.” “For model-based methods, the planning component is often computationally expensive, and incurs additional latency.”
  • “pretrain a policy network with demonstrations via learning (also called behavioral cloning)”
  • Overfitting may be a cause of worsening learning quality with more experiences.
  • “effective exploration is particularly challenging in tasks with sparse reward. In the most extreme version of this problem, the agent must essentially find a (high-reward) needle in a (zero-reward) haystack. Unfortunately, the most natural formulation of many practical robotics tasks has this property. For this reason, a number of prior works have focused on studying exploration for sparse-reward robotic tasks”
  • A main drawback of Deep RL is the need for massive amounts of data.
  • High sensitivity of the algorithms, particularly deep ones, to the initial state and to the way their hyperparameters are set, especially for off-policy algorithms.
  • “There is a tradeoff here as more environment diversity may cause the policies to have lower performance. Often this can be alleviated with larger and better neural network architectures”