Category Archives: Reinforcement Learning Theory

On the effects of large variances in the transition function for Q-learning

D. Lee and W. B. Powell, Bias-Corrected Q-Learning With Multistate Extension. IEEE Transactions on Automatic Control, vol. 64, no. 10, pp. 4011-4023, DOI: 10.1109/TAC.2019.2912443.

Q-learning is a sample-based model-free algorithm that solves Markov decision problems asymptotically, but in finite time, it can perform poorly when random rewards and transitions result in large variance of value estimates. We pinpoint its cause to be the estimation bias due to the maximum operator in Q-learning algorithm, and present the evidence of max-operator bias in its Q value estimates. We then present an asymptotically optimal bias-correction strategy and construct an extension to bias-corrected Q-learning algorithm to multistate Markov decision processes, with asymptotic convergence properties as strong as those from Q-learning. We report the empirical performance of the bias-corrected Q-learning algorithm with multistate extension in two model problems: A multiarmed bandit version of Roulette and an electricity storage control simulation. The bias-corrected Q-learning algorithm with multistate extension is shown to control max-operator bias effectively, where the bias-resistance can be tuned predictably by adjusting a correction parameter.

A nice (short) survey of deep RL

Matthew Botvinick, Sam Ritter, Jane X. Wang, Zeb Kurth-Nelson, Charles Blundell, Demis Hassabis, Reinforcement Learning, Fast and Slow, Trends in Cognitive Sciences, Volume 23, Issue 5, 2019, Pages 408-422 DOI: 10.1016/j.tics.2019.02.006.

Deep reinforcement learning (RL) methods have driven impressive advances in artificial intelligence in recent years, exceeding human performance in domains ranging from Atari to Go to no-limit poker. This progress has drawn the attention of cognitive scientists interested in understanding human learning. However, the concern has been raised that deep RL may be too sample-inefficient – that is, it may simply be too slow – to provide a plausible model of how humans learn. In the present review, we counter this critique by describing recently developed techniques that allow deep RL to operate more nimbly, solving problems much more quickly than previous methods. Although these techniques were developed in an AI context, we propose that they may have rich implications for psychology and neuroscience. A key insight, arising from these AI methods, concerns the fundamental connection between fast RL and slower, more incremental forms of learning.

Transfer learning in reinforcement learning through case-based and the use of heuristics for selecting actions

Reinaldo A.C. Bianchi, Luiz A. Celiberto Jr., Paulo E. Santos, Jackson P. Matsuura, Ramon Lopez de Mantaras, Transferring knowledge as heuristics in reinforcement learning: A case-based approach, Artificial Intelligence, Volume 226, September 2015, Pages 102-121, ISSN 0004-3702, DOI: 10.1016/j.artint.2015.05.008.

The goal of this paper is to propose and analyse a transfer learning meta-algorithm that allows the implementation of distinct methods using heuristics to accelerate a Reinforcement Learning procedure in one domain (the target) that are obtained from another (simpler) domain (the source domain). This meta-algorithm works in three stages: first, it uses a Reinforcement Learning step to learn a task on the source domain, storing the knowledge thus obtained in a case base; second, it does an unsupervised mapping of the source-domain actions to the target-domain actions; and, third, the case base obtained in the first stage is used as heuristics to speed up the learning process in the target domain.
A set of empirical evaluations were conducted in two target domains: the 3D mountain car (using a learned case base from a 2D simulation) and stability learning for a humanoid robot in the Robocup 3D Soccer Simulator (that uses knowledge learned from the Acrobot domain). The results attest that our transfer learning algorithm outperforms recent heuristically-accelerated reinforcement learning and transfer learning algorithms.

Finding the common utility of actions in several tasks learnt in the same domain in order to reduce the learning cost of reinforcement learning

Rosman, B.; Ramamoorthy, S., Action Priors for Learning Domain Invariances, Autonomous Mental Development, IEEE Transactions on , vol.7, no.2, pp.107,118, June 2015, DOI: 10.1109/TAMD.2015.2419715.

An agent tasked with solving a number of different decision making problems in similar environments has an opportunity to learn over a longer timescale than each individual task. Through examining solutions to different tasks, it can uncover behavioral invariances in the domain, by identifying actions to be prioritized in local contexts, invariant to task details. This information has the effect of greatly increasing the speed of solving new problems. We formalise this notion as action priors, defined as distributions over the action space, conditioned on environment state, and show how these can be learnt from a set of value functions. We apply action priors in the setting of reinforcement learning, to bias action selection during exploration. Aggressive use of action priors performs context based pruning of the available actions, thus reducing the complexity of lookahead during search. We additionally define action priors over observation features, rather than states, which provides further flexibility and generalizability, with the additional benefit of enabling feature selection. Action priors are demonstrated in experiments in a simulated factory environment and a large random graph domain, and show significant speed ups in learning new tasks. Furthermore, we argue that this mechanism is cognitively plausible, and is compatible with findings from cognitive psychology.

Partially observable reinforcement learning and the problem of representing the history of the learning process efficiently

Doshi-Velez, F.; Pfau, D.; Wood, F.; Roy, N., Bayesian Nonparametric Methods for Partially-Observable Reinforcement Learning, Pattern Analysis and Machine Intelligence, IEEE Transactions on , vol.37, no.2, pp.394,407, Feb. 2015, DOI: 10.1109/TPAMI.2013.191

Making intelligent decisions from incomplete information is critical in many applications: for example, robots must choose actions based on imperfect sensors, and speech-based interfaces must infer a user\u2019s needs from noisy microphone inputs. What makes these tasks hard is that often we do not have a natural representation with which to model the domain and use for choosing actions; we must learn about the domain\u2019s properties while simultaneously performing the task. Learning a representation also involves trade-offs between modeling the data that we have seen previously and being able to make predictions about new data. This article explores learning representations of stochastic systems using Bayesian nonparametric statistics. Bayesian nonparametric methods allow the sophistication of a representation to scale gracefully with the complexity in the data. Our main contribution is a careful empirical evaluation of how representations learned using Bayesian nonparametric methods compare to other standard learning approaches, especially in support of planning and control. We show that the Bayesian aspects of the methods result in achieving state-of-the-art performance in decision making with relatively few samples, while the nonparametric aspects often result in fewer computations. These results hold across a variety of different techniques for choosing actions given a representation.