Category Archives: Reinforcement Learning in AI

RL in periodic scenarios

A. Aniket and A. Chattopadhyay, Online Reinforcement Learning in Periodic MDP, IEEE Transactions on Artificial Intelligence, vol. 5, no. 7, pp. 3624-3637, July 2024 DOI: 10.1109/TAI.2024.3375258.

We study learning in periodic Markov decision process (MDP), a special type of nonstationary MDP where both the state transition probabilities and reward functions vary periodically, under the average reward maximization setting. We formulate the problem as a stationary MDP by augmenting the state space with the period index and propose a periodic upper confidence bound reinforcement learning-2 (PUCRL2) algorithm. We show that the regret of PUCRL2 varies linearly with the period N and as O(√(T log T)) with the horizon length T. Utilizing the information about the sparsity of transition matrix of augmented MDP, we propose another algorithm [periodic upper confidence reinforcement learning with Bernstein bounds (PUCRLB)] which enhances upon PUCRL2, both in terms of regret [O(√N) dependency on period] and empirical performance. Finally, we propose two other algorithms U-PUCRL2 and U-PUCRLB for extended uncertainty in the environment in which the period is unknown but a set of candidate periods are known. Numerical results demonstrate the efficacy of all the algorithms.
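
The state augmentation mentioned in the abstract can be illustrated with a minimal sketch. It assumes a single fixed action, a toy two-state MDP with period N = 3, and a flat indexing of the augmented state (s, n) as n * S + s; the function name `augment_periodic_mdp` and the numbers are hypothetical and not taken from the paper. The point is only that the augmented chain is stationary and that each row has at most S nonzero entries out of N * S, which is the kind of sparsity the abstract refers to.

```python
from typing import List

Matrix = List[List[float]]


def augment_periodic_mdp(P: List[Matrix]) -> Matrix:
    """Build a stationary transition matrix from N periodic ones.

    P[n][s][s2] is the probability of moving from s to s2 when the
    period index is n (for one fixed action; an assumption made here
    to keep the sketch short). Augmented state (s, n) is stored at
    index n * S + s and can only move to (s2, (n + 1) % N).
    """
    N = len(P)
    S = len(P[0])
    aug = [[0.0] * (N * S) for _ in range(N * S)]
    for n in range(N):
        nxt = (n + 1) % N  # the period index advances deterministically
        for s in range(S):
            for s2 in range(S):
                aug[n * S + s][nxt * S + s2] = P[n][s][s2]
    return aug


# Hypothetical toy example: 2 states, period 3.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
    [[0.3, 0.7], [0.6, 0.4]],
]
aug = augment_periodic_mdp(P)
```

Any standard average-reward learner for stationary MDPs could in principle be run on `aug`; how PUCRL2 and PUCRLB build their confidence sets on top of this structure is described in the paper itself.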

Reducing discovered skills in DRL to the essential ones, modelling skills with SMDP Q-learning

Shuai Qing, Fei Zhu, Refine to the essence: Less-redundant skill learning via diversity clustering, Engineering Applications of Artificial Intelligence, Volume 133, Part A, 2024 DOI: 10.1016/j.engappai.2024.107981.

In reinforcement learning, skill is a potentially conditional policy that solves tasks in a hierarchically controlled manner. Progress on skill discovery helps agents learn a set of diverse and useful skills without external supervision to tackle complex tasks with sparse rewards. Although most of the studies have aimed to maximize the diversity of skills discovered, the distinguishability between skills diminishes as the number of skills increases, leading to a subset of similar and redundant skills. To tackle this problem, a method called Refine to the Essence of Skills (RE-Skill) is proposed, which aims at learning skills with less redundancy. RE-Skill integrates the concepts of cluster analysis and policy distillation, clustering similar skills together based on their unique features, learning the most optimal performance within each cluster, and filtering out similar skills that involve excessive and intricate actions, thereby reducing redundancy among skills. By refining clusters of similar skills into less-redundant independent skills, RE-Skill demonstrates superior performance compared to other skill discovery algorithms and shows how these less-redundant skills effectively address downstream tasks, indicating that RE-Skill is able to extend its efficacy to engineering applications in robot control and obstacle training tasks within complex environments.

A novel RL setting for non-Markovian systems

Ronen I. Brafman, Giuseppe De Giacomo, Regular decision processes, Artificial Intelligence, Volume 331, 2024 DOI: 10.1016/j.artint.2024.104113.

We introduce and study Regular Decision Processes (RDPs), a new, compact model for domains with non-Markovian dynamics and rewards, in which the dependence on the past is regular, in the language theoretic sense. RDPs are an intermediate model between MDPs and POMDPs. They generalize k-order MDPs and can be viewed as a POMDP in which the hidden state is a regular function of the entire history. In factored RDPs, transition and reward functions are specified using formulas in linear temporal logics over finite traces, or using regular expressions. This allows specifying complex dependence on the past using intuitive and compact formulas, and building models of partially observable domains without specifying an underlying state space.
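
A toy sketch may help make "regular dependence on the past" concrete. It assumes a hypothetical pickup-and-deliver domain in which "deliver" is rewarded only if a "pickup" has occurred since the last delivery; the domain, the automaton, and all names below are illustrative and not from the paper, which specifies such dependencies with temporal logic formulas or regular expressions rather than a hand-written transition table. The reward is non-Markovian in the current symbol alone, but becomes a function of the current symbol and the state of a small finite automaton that summarizes the history.

```python
from typing import Dict, List, Tuple

# Hypothetical two-state automaton over the history of symbols.
# Pairs that are not listed leave the automaton state unchanged.
DELTA: Dict[Tuple[str, str], str] = {
    ("empty", "pickup"): "carrying",
    ("carrying", "deliver"): "empty",
}


def step(q: str, symbol: str) -> str:
    """Advance the automaton state after observing one symbol."""
    return DELTA.get((q, symbol), q)


def reward(q: str, symbol: str) -> float:
    """Reward depends on the automaton state, i.e. on the history."""
    return 1.0 if (q == "carrying" and symbol == "deliver") else 0.0


def rewards_along(history: List[str], q0: str = "empty") -> List[float]:
    """Rewards received along a history, tracking the automaton state."""
    q = q0
    out = []
    for symbol in history:
        out.append(reward(q, symbol))
        q = step(q, symbol)
    return out


trace = ["move", "pickup", "move", "deliver", "deliver"]
trace_rewards = rewards_along(trace)
```

In this trace the first "deliver" is rewarded and the second is not, even though the current symbol is identical in both cases; only the automaton state distinguishes them.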

POMDPs for obtaining policies that can be understood just by observing the robot’s actions

Miguel Faria, Francisco S. Melo, Ana Paiva, “Guess what I’m doing”: Extending legibility to sequential decision tasks, Artificial Intelligence, Volume 330, 2024 DOI: 10.1016/j.artint.2024.104107.

In this paper we investigate the notion of legibility in sequential decision tasks under uncertainty. Previous works that extend legibility to scenarios beyond robot motion either focus on deterministic settings or are computationally too expensive. Our proposed approach, dubbed PoLMDP, is able to handle uncertainty while remaining computationally tractable. We establish the advantages of our approach against state-of-the-art approaches in several scenarios of varying complexity. We also showcase the use of our legible policies as demonstrations in machine teaching scenarios, establishing their superiority in teaching new behaviours against the commonly used demonstrations based on the optimal policy. Finally, we assess the legibility of our computed policies through a user study, where people are asked to infer the goal of a mobile robot following a legible policy by observing its actions.

On the influence of the representations obtained through Deep RL in the learning process

Han Wang, Erfan Miahi, Martha White, Marlos C. Machado, Zaheer Abbas, Raksha Kumaraswamy, Vincent Liu, Adam White, Investigating the properties of neural network representations in reinforcement learning, Artificial Intelligence, Volume 330, 2024 DOI: 10.1016/j.artint.2024.104100.

In this paper we investigate the properties of representations learned by deep reinforcement learning systems. Much of the early work on representations for reinforcement learning focused on designing fixed-basis architectures to achieve properties thought to be desirable, such as orthogonality and sparsity. In contrast, the idea behind deep reinforcement learning methods is that the agent designer should not encode representational properties, but rather that the data stream should determine the properties of the representation—good representations emerge under appropriate training schemes. In this paper we bring these two perspectives together, empirically investigating the properties of representations that support transfer in reinforcement learning. We introduce and measure six representational properties over more than 25,000 agent-task settings. We consider Deep Q-learning agents with different auxiliary losses in a pixel-based navigation environment, with source and transfer tasks corresponding to different goal locations. We develop a method to better understand why some representations work better for transfer, through a systematic approach varying task similarity and measuring and correlating representation properties with transfer performance. We demonstrate the generality of the methodology by investigating representations learned by a Rainbow agent that successfully transfers across Atari 2600 game modes.

Object oriented paradigm to improve transfer learning in RL, i.e., a sort of symbolic abstraction mechanism

Ofir Marom, Benjamin Rosman, Transferable dynamics models for efficient object-oriented reinforcement learning, Artificial Intelligence, Volume 329, 2024 DOI: 10.1016/j.artint.2024.104079.

The Reinforcement Learning (RL) framework offers a general paradigm for constructing autonomous agents that can make effective decisions when solving tasks. An important area of study within the field of RL is transfer learning, where an agent utilizes knowledge gained from solving previous tasks to solve a new task more efficiently. While the notion of transfer learning is conceptually appealing, in practice, not all RL representations are amenable to transfer learning. Moreover, much of the research on transfer learning in RL is purely empirical. Previous research has shown that object-oriented representations are suitable for the purposes of transfer learning with theoretical efficiency guarantees. Such representations leverage the notion of object classes to learn lifted rules that apply to grounded object instantiations. In this paper, we extend previous research on object-oriented representations and introduce two formalisms: the first is based on deictic predicates, and is used to learn a transferable transition dynamics model; the second is based on propositions, and is used to learn a transferable reward dynamics model. In addition, we extend previously introduced efficient learning algorithms for object-oriented representations to our proposed formalisms. Our frameworks are then combined into a single efficient algorithm that learns transferable transition and reward dynamics models across a domain of related tasks. We illustrate our proposed algorithm empirically on an extended version of the Taxi domain, as well as the more difficult Sokoban domain, showing the benefits of our approach with regards to efficient learning and transfer.

Improving sample efficiency of RL through memory reconstruction

Y. Kang et al., Sample Efficient Reinforcement Learning Using Graph-Based Memory Reconstruction, IEEE Transactions on Artificial Intelligence, vol. 5, no. 2, pp. 751-762, Feb. 2024 DOI: 10.1109/TAI.2023.3268612.

Reinforcement learning (RL) algorithms typically require orders of magnitude more interactions than humans to learn effective policies. Research on memory in neuroscience suggests that humans’ learning efficiency benefits from associating their experiences and reconstructing potential events. Inspired by this finding, we introduce a human brain-like memory structure for agents and build a general learning framework based on this structure to improve the RL sampling efficiency. Since this framework is similar to the memory reconstruction process in psychology, we name the newly proposed RL framework as graph-based memory reconstruction (GBMR). In particular, GBMR first maintains an attribute graph on the agent’s memory and then retrieves its critical nodes to build and update potential paths among these nodes. This novel pipeline drives the RL agent to learn faster with its memory-enhanced value functions and reduces interactions with the environment by reconstructing its valuable paths. Extensive experimental analyses and evaluations in the grid maze and some challenging Atari environments demonstrate GBMR’s superiority over traditional RL methods. We will release the source code and trained models to facilitate further studies in this research direction.

Improving sample efficiency in actor-critic RL (A2C with NNs) through multimodal advantage function

Jonghyeok Park, Soohee Han, Reinforcement learning with multimodal advantage function for accurate advantage estimation in robot learning, Engineering Applications of Artificial Intelligence, Volume 126, Part C, 2023 DOI: 10.1016/j.engappai.2023.107019.

In this paper, we propose a reinforcement learning (RL) framework that uses a multimodal advantage function (MAF) to come close to the true advantage function, thereby achieving high returns. The MAF, which is constructed as a logarithm of a mixture of Gaussians policy (MoG-P) and trained by globally collected past experiences, directly assesses the complex true advantage function with its multi-modality and is expected to enhance the sample-efficiency of RL. To realize the expected enhanced learning performance with the proposed RL framework, two practical techniques are developed that include mode selection and rounding off of actions during the policy update process. Mode selection is conducted to sample the action around the most influential or weighted mode for efficient environment exploration. For fast policy updates, past actions are rounded off to discretized action values when calculating the multimodal advantage function. The proposed RL framework was validated using simulation environments and a real inverted pendulum system. The findings showed that the proposed framework can achieve a more sample-efficient performance or higher returns than other advantage-based RL benchmarks.
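
As a rough illustration of the two ingredients named in the abstract, the sketch below evaluates the logarithm of a one-dimensional mixture of Gaussians at a given action and picks the most heavily weighted mode. It is an assumption-laden simplification: the mixture here is not conditioned on the state, all weights are assumed strictly positive, and the function names `log_mog` and `dominant_mode` are hypothetical. The exact form of the MAF and of mode selection in the paper may differ.

```python
import math
from typing import List


def log_mog(a: float, weights: List[float], means: List[float],
            stds: List[float]) -> float:
    """Log-density of a 1-D Gaussian mixture at action a.

    Uses the log-sum-exp trick for numerical stability.
    Assumes every weight and standard deviation is > 0.
    """
    log_terms = [
        math.log(w) - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        - 0.5 * ((a - m) / s) ** 2
        for w, m, s in zip(weights, means, stds)
    ]
    mx = max(log_terms)
    return mx + math.log(sum(math.exp(t - mx) for t in log_terms))


def dominant_mode(weights: List[float]) -> int:
    """Index of the component with the largest mixture weight."""
    return max(range(len(weights)), key=lambda k: weights[k])


# Hypothetical bimodal example: two well-separated modes.
weights, means, stds = [0.3, 0.7], [-1.0, 2.0], [0.5, 0.5]
k = dominant_mode(weights)
score_at_mode = log_mog(means[k], weights, means, stds)
score_between = log_mog(0.5, weights, means, stds)
```

With two separated components the log-mixture is high near each mean and low in between, which is the multimodal shape a single Gaussian cannot represent.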

Learning options in RL and using rewards adequately in that context

Richard S. Sutton, Marlos C. Machado, G. Zacharias Holland, David Szepesvari, Finbarr Timbers, Brian Tanner, Adam White, Reward-respecting subtasks for model-based reinforcement learning, Artificial Intelligence, Volume 324, 2023 DOI: 10.1016/j.artint.2023.104001.

To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress with state abstraction, but temporal abstraction has rarely been used, despite extensively developed theory based on the options framework. One reason for this is that the space of possible options is immense, and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks, such as reaching a bottleneck state or maximizing the cumulative sum of a sensory signal other than reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. In most previous work, the subtasks ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option terminates. We show that option models obtained from such reward-respecting subtasks are much more likely to be useful in planning than eigenoptions, shortest path options based on bottleneck states, or reward-respecting options generated by the option-critic. Reward-respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how values, policies, options, and models can all be learned online and off-policy using standard algorithms and general value functions.
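
The contrast between a subtask that ignores the original reward and one that respects it can be shown with a small arithmetic sketch. It assumes the subtask return is the discounted sum of original rewards collected while the option runs, plus a discounted bonus proportional to a state feature at termination; this specific form, the discounting, and the name `subtask_return` are assumptions made for illustration, and the paper's precise definition of the stopping value may contain additional terms.

```python
from typing import List


def subtask_return(rewards: List[float], stop_feature: float,
                   bonus: float, gamma: float = 1.0) -> float:
    """Return of a hypothetical reward-respecting subtask.

    rewards      -- original-task rewards received while the option runs
    stop_feature -- value of the chosen state feature when it terminates
    bonus        -- weight of that feature in the stopping bonus
    gamma        -- discount factor (assumed; 1.0 means undiscounted)
    """
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g + (gamma ** len(rewards)) * bonus * stop_feature


# Reaching the feature costs three steps of reward -1 each.
respecting = subtask_return([-1.0, -1.0, -1.0], stop_feature=1.0, bonus=5.0)
# A subtask that ignores the original reward sees only the bonus.
ignoring = subtask_return([0.0, 0.0, 0.0], stop_feature=1.0, bonus=5.0)
```

Under these assumptions an option that reaches the feature through costly states scores lower on the reward-respecting subtask than on the reward-ignoring one, so the learned option is pushed toward routes that are also good for the original problem.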

Reward machines as reward specification method for RL and their automated learning

Rodrigo Toro Icarte, Toryn Q. Klassen, Richard Valenzano, Margarita P. Castro, Ethan Waldie, Sheila A. McIlraith, Learning reward machines: A study in partially observable reinforcement learning, Artificial Intelligence, Volume 323, 2023 DOI: 10.1016/j.artint.2023.103989.

Reinforcement Learning (RL) is a machine learning paradigm wherein an artificial agent interacts with an environment with the purpose of learning behaviour that maximizes the expected cumulative reward it receives from the environment. Reward machines (RMs) provide a structured, automata-based representation of a reward function that enables an RL agent to decompose an RL problem into structured subproblems that can be efficiently learned via off-policy learning. Here we show that RMs can be learned from experience, instead of being specified by the user, and that the resulting problem decomposition can be used to effectively solve partially observable RL problems. We pose the task of learning RMs as a discrete optimization problem where the objective is to find an RM that decomposes the problem into a set of subproblems such that the combination of their optimal memoryless policies is an optimal policy for the original problem. We show the effectiveness of this approach on three partially observable domains, where it significantly outperforms A3C, PPO, and ACER, and discuss its advantages, limitations, and broader potential.
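
For readers unfamiliar with reward machines, the following is a minimal hand-specified sketch of the data structure: a finite automaton whose transitions are triggered by high-level labels of what happened in the environment and which emits a reward on each transition. The class, the "coffee then office" task, and the convention that unlisted pairs self-loop with zero reward are illustrative assumptions. The paper's contribution is learning such machines from experience via discrete optimization, which this sketch does not attempt.

```python
from typing import Dict, List, Set, Tuple


class RewardMachine:
    """A toy reward machine: (state, label) -> (next state, reward)."""

    def __init__(self, initial: str,
                 transitions: Dict[Tuple[str, str], Tuple[str, float]],
                 terminal: Set[str]) -> None:
        self.initial = initial
        self.transitions = transitions
        self.terminal = terminal

    def step(self, u: str, label: str) -> Tuple[str, float]:
        """One transition; unlisted pairs self-loop with reward 0."""
        if u in self.terminal:
            return u, 0.0
        return self.transitions.get((u, label), (u, 0.0))

    def run(self, labels: List[str]) -> Tuple[str, List[float]]:
        """Feed a label sequence; return final state and rewards."""
        u = self.initial
        rewards = []
        for label in labels:
            u, r = self.step(u, label)
            rewards.append(r)
        return u, rewards


# Hypothetical task: get coffee first, then reach the office.
rm = RewardMachine(
    initial="u0",
    transitions={
        ("u0", "coffee"): ("u1", 0.0),
        ("u1", "office"): ("done", 1.0),
    },
    terminal={"done"},
)
final_state, rm_rewards = rm.run(["office", "coffee", "none", "office"])
```

Visiting the office before getting coffee earns nothing, while the same label after coffee completes the task; the machine state is exactly the memory an agent needs to make the reward Markovian.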