Category Archives: Reinforcement Learning in AI

Dealing with exploration, with a nice introduction to the problem

Jiayi Lu, Shuai Han, Shuai Lü, Meng Kang, Junwei Zhang, Sampling diversity driven exploration with state difference guidance, Expert Systems with Applications, Volume 203, 2022 DOI: 10.1016/j.eswa.2022.117418.

Exploration is one of the key issues of deep reinforcement learning, especially in the environments with sparse or deceptive rewards. Exploration based on intrinsic rewards can handle these environments. However, these methods cannot take both global interaction dynamics and local environment changes into account simultaneously. In this paper, we propose a novel intrinsic reward for off-policy learning, which not only encourages the agent to take actions not fully learned from a global perspective, but also instructs the agent to trigger remarkable changes in the environment from a local perspective. Meanwhile, we propose the double-actors–double-critics framework to combine intrinsic rewards with extrinsic rewards to avoid the inappropriate combination of intrinsic and extrinsic rewards in previous methods. This framework can be applied to off-policy learning algorithms based on the actor–critic method. We provide a comprehensive evaluation of our approach on the MuJoCo benchmark environments. The results demonstrate that our method can perform effective exploration in the environments with dense, deceptive and sparse rewards. Besides, we conduct sufficient ablation and quantitative analyses of the intrinsic rewards. Furthermore, we also verify the superiority and rationality of our double-actors–double-critics framework through comparative experiments.

Increasing exploration when the agent performs worse and decreasing it when performing better, in the context of DQN for distributing computation among cloud and edge servers, also dealing with the hybridization of RL with fuzzy logic

Do Bao Son, Ta Huu Binh, Hiep Khac Vo, Binh Minh Nguyen, Huynh Thi Thanh Binh, Shui Yu, Value-based reinforcement learning approaches for task offloading in Delay Constrained Vehicular Edge Computing, Engineering Applications of Artificial Intelligence, Volume 113, 2022 DOI: 10.1016/j.engappai.2022.104898.

In the age of booming information technology, humanity has witnessed the need for new paradigms with both high computational capability and low latency. A potential solution is Vehicular Edge Computing (VEC). Previous work proposed a Fuzzy Deep Q-Network in Offloading scheme (FDQO) that combines Fuzzy rules and Deep Q-Network (DQN) to improve DQN's early performance by using a Fuzzy Controller (FC). However, we notice that frequent usage of the FC can hinder the future growth in performance of the model. One way to overcome this issue is to remove the Fuzzy Controller entirely. We introduced an algorithm called baseline DQN (b-DQN), represented by its two variants Static baseline DQN (Sb-DQN) and Dynamic baseline DQN (Db-DQN), to modify the exploration rate based on the average rewards of the closest observations. Our findings confirm that these baseline DQN algorithms surpass traditional DQN models in terms of average Quality of Experience (QoE) in 100 time slots by about 6%, but still suffer from poor early performance (such as in the first 5 time slots). Here, we introduce baseline FDQO (b-FDQO). This algorithm has a strategy to modify the Fuzzy Logic usage instead of removing it entirely while still observing the rewards to modify the exploration rate. It brings a higher average QoE in the first 5 time slots compared to other non-fuzzy-logic algorithms by at least 55.12%, prevents the model from getting overly poor results over all time slots, while having late performance as good as that of b-DQN.
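To make the idea of "modifying the exploration rate based on the average rewards of the closest observations" concrete, here is a minimal sketch of a reward-aware epsilon schedule. It is not the b-DQN rule from the paper: the class name `RewardAwareEpsilon`, the fixed `baseline`, the window size, and the additive step are all assumptions chosen for illustration.

```python
from collections import deque


class RewardAwareEpsilon:
    """Hypothetical schedule: raise epsilon when the mean of recent rewards
    falls below a baseline, lower it when the mean rises above."""

    def __init__(self, baseline, window=5, step=0.05, eps=0.5,
                 eps_min=0.01, eps_max=1.0):
        self.baseline = baseline            # assumed fixed reference reward
        self.recent = deque(maxlen=window)  # most recent rewards only
        self.step = step
        self.eps = eps
        self.eps_min = eps_min
        self.eps_max = eps_max

    def update(self, reward):
        """Record a reward and return the adjusted exploration rate."""
        self.recent.append(reward)
        mean_reward = sum(self.recent) / len(self.recent)
        if mean_reward < self.baseline:
            # Performing worse than the baseline: explore more.
            self.eps = min(self.eps_max, self.eps + self.step)
        elif mean_reward > self.baseline:
            # Performing better than the baseline: explore less.
            self.eps = max(self.eps_min, self.eps - self.step)
        return self.eps
```

An agent would call `update(reward)` once per time slot and use the returned value as the epsilon of its epsilon-greedy action selection. A dynamic variant could recompute `baseline` from past rewards instead of keeping it fixed, which is again an assumption rather than the paper's definition.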

Live-RL enhancement / reduction of unsafe situations by reducing the transition possibility of unsafe actions

Serhat Duman, Hamdi Tolga Kahraman, Yusuf Sonmez, Ugur Guvenc, Mehmet Kati, Sefa Aras, A powerful meta-heuristic search algorithm for solving global optimization and real-world solar photovoltaic parameter estimation problems, Engineering Applications of Artificial Intelligence, Volume 111, 2022 DOI: 10.1016/j.engappai.2022.104763.

The teaching-learning-based artificial bee colony (TLABC) is a new hybrid swarm-based metaheuristic search algorithm. It combines the exploitation of the teaching learning-based optimization (TLBO) with the exploration of the artificial bee colony (ABC). With the hybridization of these two nature-inspired swarm intelligence algorithms, a robust method has been proposed to solve global optimization problems. However, as with swarm-based algorithms, with the TLABC method, it is a great challenge to effectively simulate the selection process. Fitness-distance balance (FDB) is a powerful recently developed method to effectively imitate the selection process in nature. In this study, the three search phases of the TLABC algorithm were redesigned using the FDB method. In this way, the FDB-TLABC algorithm, which imitates nature more effectively and has a robust search performance, was developed. To investigate the exploitation, exploration, and balanced search capabilities of the proposed algorithm, it was tested on standard and complex benchmark suites (Classic, IEEE CEC 2014, IEEE CEC 2017, and IEEE CEC 2020). In order to verify the performance of the proposed FDB-TLABC for global optimization problems and in the photovoltaic parameter estimation problem (a constrained real-world engineering problem) a very comprehensive and qualified experimental study was carried out according to IEEE CEC standards. Statistical analysis results confirmed that the proposed FDB-TLABC provided the best optimum solution and yielded a superior performance compared to other optimization methods.

State of the art on the convergence of the Monte Carlo Exploring Starts RL method, a kind of policy iteration

Jun Liu, On the convergence of reinforcement learning with Monte Carlo Exploring Starts, Automatica, Volume 129, 2021 DOI: 10.1016/j.automatica.2021.109693.

A basic simulation-based reinforcement learning algorithm is the Monte Carlo Exploring Starts (MCES) method, also known as optimistic policy iteration, in which the value function is approximated by simulated returns and a greedy policy is selected at each iteration. The convergence of this algorithm in the general setting has been an open question. In this paper, we investigate the convergence of this algorithm for the case with undiscounted costs, also known as the stochastic shortest path problem. The results complement existing partial results on this topic and thereby help further settle the open problem.
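For readers who have not seen MCES written down, the following is a minimal sketch on a toy undiscounted shortest-path problem. The two-state problem, its costs, and all names (`transition`, `mces`) are hypothetical, and the toy is deterministic and always terminates, so it says nothing about the general convergence question the paper studies.

```python
import random
from collections import defaultdict

# Toy shortest-path problem (hypothetical, deterministic for brevity).
TERMINAL = 2
STATES = [0, 1]
ACTIONS = [0, 1]  # 0 = "step" to the next state, 1 = "jump" to the goal


def transition(state, action):
    """Return (next_state, cost) for the toy problem."""
    if action == 0:
        return state + 1, 1.0
    return TERMINAL, 1.5


def mces(num_episodes=500, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)   # running mean of sampled costs-to-go
    visits = defaultdict(int)
    policy = {s: 0 for s in STATES}

    for _ in range(num_episodes):
        # Exploring start: random initial state-action pair.
        state = rng.choice(STATES)
        action = rng.choice(ACTIONS)
        episode = []
        while state != TERMINAL:
            next_state, cost = transition(state, action)
            episode.append((state, action, cost))
            state = next_state
            if state != TERMINAL:
                action = policy[state]  # follow the current greedy policy

        # Walk backwards, accumulating the undiscounted cost-to-go.
        cost_to_go = 0.0
        for s, a, cost in reversed(episode):
            cost_to_go += cost
            visits[(s, a)] += 1
            q[(s, a)] += (cost_to_go - q[(s, a)]) / visits[(s, a)]
            # Greedy improvement: pick the action with the lowest estimate.
            policy[s] = min(ACTIONS, key=lambda act: q[(s, act)])
    return policy, dict(q)
```

In this toy problem the cheapest route is to jump from state 0 (cost 1.5) and to step from state 1 (cost 1.0), so `mces()` should settle on the policy that does exactly that once every state-action pair has been used as a start.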

Approximating the value function of RL through Max-Plus algebra

Vinicius Mariano Gonçalves, Max-plus approximation for reinforcement learning, Automatica, Volume 129, 2021 DOI: 10.1016/j.automatica.2021.109623.

Max-Plus Algebra has been applied in several contexts, especially in the control of discrete events systems. In this article, we discuss another application closely related to control: the use of Max-Plus algebra concepts in the context of reinforcement learning. Max-Plus Algebra and reinforcement learning are strongly linked due to the latter’s dependence on the Bellman Equation which, in some cases, is a linear Max-Plus equation. This fact motivates the application of Max-Plus algebra to approximate the value function, central to the Bellman Equation and thus also to reinforcement learning. This article proposes conditions so that this approach can be done in a simple way and following the philosophy of reinforcement learning: explore the environment, receive the rewards and use this information to improve the knowledge of the value function. The proposed conditions are related to two matrices and impose on them a relationship that is analogous to the concept of weak inverses in traditional algebra.
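The remark that the Bellman equation is, in some cases, a linear Max-Plus equation can be illustrated with a small sketch. In Max-Plus algebra "addition" is max and "multiplication" is +, so for a deterministic, undiscounted problem the backup V(s) = max over s' of (r(s, s') + V(s')) is a matrix-vector product. The three-state chain and the helper `maxplus_matvec` below are assumptions for illustration only, and this is not the approximation scheme proposed in the article.

```python
NEG_INF = float("-inf")  # the max-plus "zero" element


def maxplus_matvec(matrix, vector):
    """Max-plus product: result[i] = max_j (matrix[i][j] + vector[j])."""
    return [max(a + v for a, v in zip(row, vector)) for row in matrix]


# Hypothetical 3-state deterministic chain; state 2 is absorbing with reward 0.
# rewards[i][j] is the reward for moving from state i to state j,
# or NEG_INF when no action produces that move.
rewards = [
    [NEG_INF, 3.0, 4.0],
    [NEG_INF, NEG_INF, 2.0],
    [NEG_INF, NEG_INF, 0.0],
]

values = [0.0, 0.0, 0.0]
for _ in range(len(rewards)):
    # One Bellman backup, written as the max-plus product V <- A (x) V.
    values = maxplus_matvec(rewards, values)
```

After the loop, `values` is a fixed point of the backup: applying `maxplus_matvec(rewards, values)` once more returns the same vector.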

Generating counterfactual explanations of Deep RL decisions to identify flawed agents

Matthew L. Olson, Roli Khanna, Lawrence Neal, Fuxin Li, Weng-Keen Wong, Counterfactual state explanations for reinforcement learning agents via generative deep learning, Artificial Intelligence, Volume 295, 2021 DOI: 10.1016/j.artint.2021.103455.

Counterfactual explanations, which deal with “why not?” scenarios, can provide insightful explanations to an AI agent’s behavior [Miller [38]]. In this work, we focus on generating counterfactual explanations for deep reinforcement learning (RL) agents which operate in visual input environments like Atari. We introduce counterfactual state explanations, a novel example-based approach to counterfactual explanations based on generative deep learning. Specifically, a counterfactual state illustrates what minimal change is needed to an Atari game image such that the agent chooses a different action. We also evaluate the effectiveness of counterfactual states on human participants who are not machine learning experts. Our first user study investigates if humans can discern if the counterfactual state explanations are produced by the actual game or produced by a generative deep learning approach. Our second user study investigates if counterfactual state explanations can help non-expert participants identify a flawed agent; we compare against a baseline approach based on a nearest neighbor explanation which uses images from the actual game. Our results indicate that counterfactual state explanations have sufficient fidelity to the actual game images to enable non-experts to more effectively identify a flawed RL agent compared to the nearest neighbor baseline and to having no explanation at all.

Model-based (on ordinary differential equations) and partially model-free Policy Iteration on continuous space and time

Jaeyoung Lee, Richard S. Sutton, Policy iterations for reinforcement learning problems in continuous time and space — Fundamental theory and methods, Automatica, Volume 126, 2021 DOI: 10.1016/j.automatica.2020.109421.

Policy iteration (PI) is a recursive process of policy evaluation and improvement for solving an optimal decision-making/control problem, or in other words, a reinforcement learning (RL) problem. PI has also served as the fundamental for developing RL methods. In this paper, we propose two PI methods, called differential PI (DPI) and integral PI (IPI), and their variants, for a general RL framework in continuous time and space (CTS), where the environment is modeled by a system of ordinary differential equations (ODEs). The proposed methods inherit the current ideas of PI in classical RL and optimal control and theoretically support the existing RL algorithms in CTS: TD-learning and value-gradient-based (VGB) greedy policy update. We also provide case studies including (1) discounted RL and (2) optimal control tasks. Fundamental mathematical properties – admissibility, uniqueness of the solution to the Bellman equation (BE), monotone improvement, convergence, and optimality of the solution to the Hamilton–Jacobi–Bellman equation (HJBE) – are all investigated in-depth and improved from the existing theory, along with the general and case studies. Finally, the proposed ones are simulated with an inverted-pendulum model and their model-based and partially model-free implementations to support the theory and further investigate them beyond.

Finding the policy that generalizes the best in a sample of possible real scenarios by leveraging PAC-Bayes

Majumdar A, Farid A, Sonar A., PAC-Bayes control: learning policies that provably generalize to novel environments. The International Journal of Robotics Research. 2021;40(2-3):574-593 DOI: 10.1177/0278364920959444.

Our goal is to learn control policies for robots that provably generalize well to novel environments given a dataset of example environments. The key technical idea behind our approach is to leverage tools from generalization theory in machine learning by exploiting a precise analogy (which we present in the form of a reduction) between generalization of control policies to novel environments and generalization of hypotheses in the supervised learning setting. In particular, we utilize the probably approximately correct (PAC)-Bayes framework, which allows us to obtain upper bounds that hold with high probability on the expected cost of (stochastic) control policies across novel environments. We propose policy learning algorithms that explicitly seek to minimize this upper bound. The corresponding optimization problem can be solved using convex optimization (relative entropy programming in particular) in the setting where we are optimizing over a finite policy space. In the more general setting of continuously parameterized policies (e.g., neural network policies), we minimize this upper bound using stochastic gradient descent. We present simulated results of our approach applied to learning (1) reactive obstacle avoidance policies and (2) neural network-based grasping policies. We also present hardware results for the Parrot Swing drone navigating through different obstacle environments. Our examples demonstrate the potential of our approach to provide strong generalization guarantees for robotic systems with continuous state and action spaces, complicated (e.g., nonlinear) dynamics, rich sensory inputs (e.g., depth images), and neural network-based policies.
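As a rough illustration of the kind of upper bound involved, the sketch below evaluates one standard form of the PAC-Bayes bound (McAllester's bound with Maurer's constant) for a finite policy space: the empirical expected cost plus a regularizer that grows with the KL divergence between the posterior and the prior over policies. Whether the paper uses exactly this form and constant is an assumption, as are the function names and the convention that costs lie in [0, 1].

```python
import math


def kl_divergence(posterior, prior):
    """KL(posterior || prior) for discrete distributions over a finite policy set."""
    return sum(p * math.log(p / q) for p, q in zip(posterior, prior) if p > 0.0)


def pac_bayes_bound(costs, posterior, prior, num_envs, delta=0.01):
    """Empirical expected cost plus a McAllester/Maurer-style regularizer.

    costs[i] is the average cost, assumed to lie in [0, 1], of policy i
    over the num_envs training environments. The returned value is meant
    to hold with probability at least 1 - delta over the sampled environments.
    """
    empirical = sum(p * c for p, c in zip(posterior, costs))
    kl = kl_divergence(posterior, prior)
    regularizer = math.sqrt(
        (kl + math.log(2.0 * math.sqrt(num_envs) / delta)) / (2.0 * num_envs)
    )
    return empirical + regularizer
```

A learning algorithm in this spirit would search over `posterior` to minimize the returned value, trading a lower empirical cost against a larger KL term. Adding training environments shrinks the regularizer.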

Extracting video summaries from RL processes to explain and understand them

Pedro Sequeira, Melinda Gervasio, Interestingness elements for explainable reinforcement learning: Understanding agents’ capabilities and limitations, Artificial Intelligence, Volume 288, 2020 DOI: 10.1016/j.artint.2020.103367.

We propose an explainable reinforcement learning (XRL) framework that analyzes an agent’s history of interaction with the environment to extract interestingness elements that help explain its behavior. The framework relies on data readily available from standard RL algorithms, augmented with data that can easily be collected by the agent while learning. We describe how to create visual summaries of an agent’s behavior in the form of short video-clips highlighting key interaction moments, based on the proposed elements. We also report on a user study where we evaluated the ability of humans to correctly perceive the aptitude of agents with different characteristics, including their capabilities and limitations, given visual summaries automatically generated by our framework. The results show that the diversity of aspects captured by the different interestingness elements is crucial to help humans correctly understand an agent’s strengths and limitations in performing a task, and determine when it might need adjustments to improve its performance.

A new theory: we are curious about tasks that increase our ability to solve as many future tasks as possible

Franziska Brändle, Charley M. Wu, Eric Schulz, What Are We Curious about?, Trends in Cognitive Sciences, Volume 24, Issue 9, 2020 DOI: 10.1016/j.tics.2020.05.010.

(no abstract).