Category Archives: Reinforcement Learning In Ai

Meta-RL: given a distribution of tasks, learn a policy capable of adapting to any new task from the task distribution with as little data as possible

Jacob Beck, Risto Vuorio, Evan Zheran Liu, Zheng Xiong, Luisa Zintgraf, Chelsea Finn, Shimon Whiteson, A Survey of Meta-Reinforcement Learning, arXiv:2301.08028 [cs.LG], 2023 DOI: 10.48550/arXiv.2301.08028.

While deep reinforcement learning (RL) has fueled multiple high-profile successes in machine learning, it is held back from more widespread adoption by its often poor data efficiency and the limited generality of the policies it produces. A promising approach for alleviating these limitations is to cast the development of better RL algorithms as a machine learning problem itself in a process called meta-RL. Meta-RL is most commonly studied in a problem setting where, given a distribution of tasks, the goal is to learn a policy that is capable of adapting to any new task from the task distribution with as little data as possible. In this survey, we describe the meta-RL problem setting in detail as well as its major variations. We discuss how, at a high level, meta-RL research can be clustered based on the presence of a task distribution and the learning budget available for each individual task. Using these clusters, we then survey meta-RL algorithms and applications. We conclude by presenting the open problems on the path to making meta-RL part of the standard toolbox for a deep RL practitioner.

Limiting human intervention in the design of RL solutions (now called “Automated RL”)

Marco Mussi, Davide Lombarda, Alberto Maria Metelli, Francesco Trov�, Marcello Restelli, ARLO: A framework for Automated Reinforcement Learning, Expert Systems with Applications, Volume 224, 2023 DOI: 10.1016/j.eswa.2023.119883.

Automated Reinforcement Learning (AutoRL) is a relatively new area of research that is gaining increasing attention. The objective of AutoRL consists in easing the employment of Reinforcement Learning (RL) techniques for the broader public by alleviating some of its main challenges, including data collection, algorithm selection, and hyper-parameter tuning. In this work, we propose a general and flexible framework, namely ARLO: Automated Reinforcement Learning Optimizer, to construct automated pipelines for AutoRL. Based on this, we propose a pipeline for offline and one for online RL, discussing the components, interaction, and highlighting the difference between the two settings. Furthermore, we provide a Python implementation of such pipelines, released as an open-source library. Our implementation is tested on an illustrative LQG domain and on classic MuJoCo environments, showing the ability to reach competitive performances requiring limited human intervention. We also showcase the full pipeline on a realistic dam environment, automatically performing the feature selection and the model generation tasks.

Multi-task RL through common perceptions

Jinling Meng, Fei Zhu, Seek for commonalities: Shared features extraction for multi-task reinforcement learning via adversarial training, Expert Systems with Applications, Volume 224, 2023 DOI: 10.1016/j.eswa.2023.119975.

Multi-task reinforcement learning is promising to alleviate the low sample efficiency and high computation cost of reinforcement learning algorithms. However, current methods mostly focus on unique features that are not conducive to the transfer between tasks. Moreover, they usually lack a balance mechanism among tasks, which often leads to the unnecessary occupation of training resources by tasks that have already been trained. To address the problems, a simple yet effective method referred to as Adaptive Experience buffer with Shared Features Multi-Task Reinforcement Learning (AESF-MTRL) is proposed. In AESF-MTRL, input observation of the environment is divided into shared features and unique features, which are extracted using different feature extractors. Unique features are extracted by simple gradient descent, while shared features are extracted using adversarial training, with an additional discriminator trained to ensure that the extracted features are indeed shared features. AESF-MTRL also maintains a reward stack to adjust the sampling ratio of trajectories from different tasks dynamically during the update period to balance the learning process of different tasks. Experiments on multiple robotics control environments demonstrate the effectiveness of the proposed method.

Dealing with the exploration with a nice introduction to the problem

Jiayi Lu, Shuai Han, Shuai L�, Meng Kang, Junwei Zhang, Sampling diversity driven exploration with state difference guidance, Expert Systems with Applications, Volume 203, 2022 DOI: 10.1016/j.eswa.2022.117418.

Exploration is one of the key issues of deep reinforcement learning, especially in the environments with sparse or deceptive rewards. Exploration based on intrinsic rewards can handle these environments. However, these methods cannot take both global interaction dynamics and local environment changes into account simultaneously. In this paper, we propose a novel intrinsic reward for off-policy learning, which not only encourages the agent to take actions not fully learned from a global perspective, but also instructs the agent to trigger remarkable changes in the environment from a local perspective. Meanwhile, we propose the double-actors\u2013double-critics framework to combine intrinsic rewards with extrinsic rewards to avoid the inappropriate combination of intrinsic and extrinsic rewards in previous methods. This framework can be applied to off-policy learning algorithms based on the actor\u2013critic method. We provide a comprehensive evaluation of our approach on the MuJoCo benchmark environments. The results demonstrate that our method can perform effective exploration in the environments with dense, deceptive and sparse rewards. Besides, we conduct sufficient ablation and quantitative analyses to intrinsic rewards. Furthermore, we also verify the superiority and rationality of our double-actors\u2013double-critics framework through comparative experiments.

Increasing exploration when the agent performs worse, decreasing when performing better, in the context of DQN for distributing computation among cloud and edge servers, also dealing with hybridization of RL with Fuzzy

Do Bao Son, Ta Huu Binh, Hiep Khac Vo, Binh Minh Nguyen, Huynh Thi Thanh Binh, Shui Yu, Value-based reinforcement learning approaches for task offloading in Delay Constrained Vehicular Edge Computing, Engineering Applications of Artificial Intelligence, Volume 113, 2022 DOI: 10.1016/j.engappai.2022.104898.

In the age of booming information technology, human-being has witnessed the need for new paradigms with both high computational capability and low latency. A potential solution is Vehicular Edge Computing (VEC). Previous work proposed a Fuzzy Deep Q-Network in Offloading scheme (FDQO) that combines Fuzzy rules and Deep Q-Network (DQN) to improve DQN\u2019s early performance by using Fuzzy Controller (FC). However, we notice that frequent usage of FC can hinder the future growth performance of model. One way to overcome this issue is to remove Fuzzy Controller entirely. We introduced an algorithm called baseline DQN (b-DQN), represented by its two variants Static baseline DQN (Sb-DQN) and Dynamic baseline DQN (Db-DQN), to modify the exploration rate base on the average rewards of closest observations. Our findings confirm that these baseline DQN algorithms surpass traditional DQN models in terms of average Quality of Experience (QoE) in 100 time slots by about 6%, but still suffer from poor early performance (such as in the first 5 time slots). Here, we introduce baseline FDQO (b-FDQO). This algorithm has a strategy to modify the Fuzzy Logic usage instead of removing it entirely while still observing the rewards to modify the exploration rate. It brings a higher average QoE in the first 5 time slots compared to other non-fuzzy-logic algorithms by at least 55.12%, prevent the model from getting too bad result over all time slots, while having the late performance as good as that of b-DQN.

Live-RL enhancement / reduction of unsafe situations by reducing the transition possibility of unsafe actions

Serhat Duman, Hamdi Tolga Kahraman, Yusuf Sonmez, Ugur Guvenc, Mehmet Kati, Sefa Aras, A powerful meta-heuristic search algorithm for solving global optimization and real-world solar photovoltaic parameter estimation problems, Engineering Applications of Artificial Intelligence, Volume 111, 2022 DOI: 10.1016/j.engappai.2022.104763.

The teaching-learning-based artificial bee colony (TLABC) is a new hybrid swarm-based metaheuristic search algorithm. It combines the exploitation of the teaching learning-based optimization (TLBO) with the exploration of the artificial bee colony (ABC). With the hybridization of these two nature-inspired swarm intelligence algorithms, a robust method has been proposed to solve global optimization problems. However, as with swarm-based algorithms, with the TLABC method, it is a great challenge to effectively simulate the selection process. Fitness-distance balance (FDB) is a powerful recently developed method to effectively imitate the selection process in nature. In this study, the three search phases of the TLABC algorithm were redesigned using the FDB method. In this way, the FDB-TLABC algorithm, which imitates nature more effectively and has a robust search performance, was developed. To investigate the exploitation, exploration, and balanced search capabilities of the proposed algorithm, it was tested on standard and complex benchmark suites (Classic, IEEE CEC 2014, IEEE CEC 2017, and IEEE CEC 2020). In order to verify the performance of the proposed FDB-TLABC for global optimization problems and in the photovoltaic parameter estimation problem (a constrained real-world engineering problem) a very comprehensive and qualified experimental study was carried out according to IEEE CEC standards. Statistical analysis results confirmed that the proposed FDB-TLABC provided the best optimum solution and yielded a superior performance compared to other optimization methods.

State of the art of the convergence of Monte Carlo Exploring Starts RL, policy iteration kind, method

Jun Liu, On the convergence of reinforcement learning with Monte Carlo Exploring Starts, . Automatica, Volume 129, 2021 DOI: 10.1016/j.automatica.2021.109693.

A basic simulation-based reinforcement learning algorithm is the Monte Carlo Exploring Starts (MCES) method, also known as optimistic policy iteration, in which the value function is approximated by simulated returns and a greedy policy is selected at each iteration. The convergence of this algorithm in the general setting has been an open question. In this paper, we investigate the convergence of this algorithm for the case with undiscounted costs, also known as the stochastic shortest path problem. The results complement existing partial results on this topic and thereby help further settle the open problem.

Approximating the value function of RL through Max-Plus algebra

Vinicius Mariano Gonçalves, Max-plus approximation for reinforcement learning, . Automatica, Volume 129, 2021 DOI: 10.1016/j.automatica.2021.109623.

Max-Plus Algebra has been applied in several contexts, especially in the control of discrete events systems. In this article, we discuss another application closely related to control: the use of Max-Plus algebra concepts in the context of reinforcement learning. Max-Plus Algebra and reinforcement learning are strongly linked due to the latter’s dependence on the Bellman Equation which, in some cases, is a linear Max-Plus equation. This fact motivates the application of Max-Plus algebra to approximate the value function, central to the Bellman Equation and thus also to reinforcement learning. This article proposes conditions so that this approach can be done in a simple way and following the philosophy of reinforcement learning: explore the environment, receive the rewards and use this information to improve the knowledge of the value function. The proposed conditions are related to two matrices and impose on them a relationship that is analogous to the concept of weak inverses in traditional algebra.

Generating contrafactual explanations of Deep RL decisions to identify flawed agents

Matthew L. Olson, Roli Khanna, Lawrence Neal, Fuxin Li, Weng-Keen Wong, Counterfactual state explanations for reinforcement learning agents via generative deep learning, . Artificial Intelligence, Volume 295, 2021 DOI: 10.1016/j.artint.2021.103455.

Counterfactual explanations, which deal with “why not?” scenarios, can provide insightful explanations to an AI agent’s behavior [Miller [38]]. In this work, we focus on generating counterfactual explanations for deep reinforcement learning (RL) agents which operate in visual input environments like Atari. We introduce counterfactual state explanations, a novel example-based approach to counterfactual explanations based on generative deep learning. Specifically, a counterfactual state illustrates what minimal change is needed to an Atari game image such that the agent chooses a different action. We also evaluate the effectiveness of counterfactual states on human participants who are not machine learning experts. Our first user study investigates if humans can discern if the counterfactual state explanations are produced by the actual game or produced by a generative deep learning approach. Our second user study investigates if counterfactual state explanations can help non-expert participants identify a flawed agent; we compare against a baseline approach based on a nearest neighbor explanation which uses images from the actual game. Our results indicate that counterfactual state explanations have sufficient fidelity to the actual game images to enable non-experts to more effectively identify a flawed RL agent compared to the nearest neighbor baseline and to having no explanation at all.

Model-based (on ordinary differential equations) and partially model-free Policy Iteration on continuous space and time

Jaeyoung Lee, Richard S. Sutton, Policy iterations for reinforcement learning problems in continuous time and space — Fundamental theory and methods, . Automatica, Volume 126, 2021 DOI: 10.1016/j.automatica.2020.109421.

Policy iteration (PI) is a recursive process of policy evaluation and improvement for solving an optimal decision-making/control problem, or in other words, a reinforcement learning (RL) problem. PI has also served as the fundamental for developing RL methods. In this paper, we propose two PI methods, called differential PI (DPI) and integral PI (IPI), and their variants, for a general RL framework in continuous time and space (CTS), where the environment is modeled by a system of ordinary differential equations (ODEs). The proposed methods inherit the current ideas of PI in classical RL and optimal control and theoretically support the existing RL algorithms in CTS: TD-learning and value-gradient-based (VGB) greedy policy update. We also provide case studies including (1) discounted RL and (2) optimal control tasks. Fundamental mathematical properties – admissibility, uniqueness of the solution to the Bellman equation (BE), monotone improvement, convergence, and optimality of the solution to the Hamilton–Jacobi–Bellman equation (HJBE) – are all investigated in-depth and improved from the existing theory, along with the general and case studies. Finally, the proposed ones are simulated with an inverted-pendulum model and their model-based and partially model-free implementations to support the theory and further investigate them beyond.