Tag Archives: Reinforcement Learning

A new model of reinforcement learning based on the human brain that copes with continuous spaces through continuous rewards, with a short but nice state-of-the-art of RL applied to large, continuous spaces

Feifei Zhao, Yi Zeng, Guixiang Wang, Jun Bai, Bo Xu, A Brain-Inspired Decision Making Model Based on Top-Down Biasing of Prefrontal Cortex to Basal Ganglia and Its Application in Autonomous UAV Explorations, Cognitive Computation, Volume 10, Issue 2, pp 296–306, DOI: 10.1007/s12559-017-9511-3.

Decision making is a fundamental ability for intelligent agents (e.g., humanoid robots and unmanned aerial vehicles). During decision making process, agents can improve the strategy for interacting with the dynamic environment through reinforcement learning. Many state-of-the-art reinforcement learning models deal with relatively smaller number of state-action pairs, and the states are preferably discrete, such as Q-learning and Actor-Critic algorithms. While in practice, in many scenario, the states are continuous and hard to be properly discretized. Better autonomous decision making methods need to be proposed to handle these problems. Inspired by the mechanism of decision making in human brain, we propose a general computational model, named as prefrontal cortex-basal ganglia (PFC-BG) algorithm. The proposed model is inspired by the biological reinforcement learning pathway and mechanisms from the following perspectives: (1) Dopamine signals continuously update reward-relevant information for both basal ganglia and working memory in prefrontal cortex. (2) We maintain the contextual reward information in working memory. This has a top-down biasing effect on reinforcement learning in basal ganglia. The proposed model separates the continuous states into smaller distinguishable states, and introduces continuous reward function for each state to obtain reward information at different time. To verify the performance of our model, we apply it to many UAV decision making experiments, such as avoiding obstacles and flying through window and door, and the experiments support the effectiveness of the model. Compared with traditional Q-learning and Actor-Critic algorithms, the proposed model is more biologically inspired, and more accurate and faster to make decision.

A very interesting analysis on how reinforcement learning depends on time, both for MDPs and for the psychological basis of RL in the human brain

Elijah A. Petter, Samuel J. Gershman, Warren H. Meck, Integrating Models of Interval Timing and Reinforcement Learning, Trends in Cognitive Sciences, Volume 22, Issue 10, 2018, Pages 911-922 DOI: 10.1016/j.tics.2018.08.004.

We present an integrated view of interval timing and reinforcement learning (RL) in the brain. The computational goal of RL is to maximize future rewards, and this depends crucially on a representation of time. Different RL systems in the brain process time in distinct ways. A model-based system learns ‘what happens when’, employing this internal model to generate action plans, while a model-free system learns to predict reward directly from a set of temporal basis functions. We describe how these systems are subserved by a computational division of labor between several brain regions, with a focus on the basal ganglia and the hippocampus, as well as how these regions are influenced by the neuromodulator dopamine.

Some quotes beyond the abstract:

The Markov assumption also makes explicit the requirements for temporal representation. All temporal dynamics must be captured by the state-transition function, which means that the state representation must encode the time-invariant structure of the environment.

An interesting model of Basal Ganglia that performs similarly to Q learning when applied to a robot

Y. Zeng, G. Wang and B. Xu, A Basal Ganglia Network Centric Reinforcement Learning Model and Its Application in Unmanned Aerial Vehicle, IEEE Transactions on Cognitive and Developmental Systems, vol. 10, no. 2, pp. 290-303 DOI: 10.1109/TCDS.2017.2649564.

Reinforcement learning brings flexibility and generality for machine learning, while most of them are mathematical optimization driven approaches, and lack of cognitive and neural evidence. In order to provide a more cognitive and neural mechanisms driven foundation and validate its applicability in complex task, we develop a basal ganglia (BG) network centric reinforcement learning model. Compared to existing work on modeling BG, this paper is unique from the following perspectives: 1) the orbitofrontal cortex (OFC) is taken into consideration. OFC is critical in decision making because of its responsibility for reward representation and is critical in controlling the learning process, while most of the BG centric models do not include OFC; 2) to compensate the inaccurate memory of numeric values, precise encoding is proposed to enable working memory system remember important values during the learning process. The method combines vector convolution and the idea of storage by digit bit and is efficient for accurate value storage; and 3) for information coding, the Hodgkin-Huxley model is used to obtain a more biological plausible description of action potential with plenty of ionic activities. To validate the effectiveness of the proposed model, we apply the model to the unmanned aerial vehicle (UAV) autonomous learning process in a 3-D environment. Experimental results show that our model is able to give the UAV the ability of free exploration in the environment and has comparable learning speed as the Q learning algorithm, while the major advances for our model is that it is with solid cognitive and neural basis.

Relation between optimization and reinforcement learning

Megumi Miyashita, Shiro Yano, Toshiyuki Kondo Mirror descent search and its acceleration, Robotics and Autonomous Systems, Volume 106, 2018, Pages 107-116 DOI: 10.1016/j.robot.2018.04.009.

In recent years, attention has been focused on the relationship between black-box optimization problem and reinforcement learning problem. In this research, we propose the Mirror Descent Search (MDS) algorithm which is applicable both for black box optimization problems and reinforcement learning problems. Our method is based on the mirror descent method, which is a general optimization algorithm. The contribution of this research is roughly twofold. We propose two essential algorithms, called MDS and Accelerated Mirror Descent Search (AMDS), and two more approximate algorithms: Gaussian Mirror Descent Search (G-MDS) and Gaussian Accelerated Mirror Descent Search (G-AMDS). This research shows that the advanced methods developed in the context of the mirror descent research can be applied to reinforcement learning problem. We also clarify the relationship between an existing reinforcement learning algorithm and our method. With two evaluation experiments, we show our proposed algorithms converge faster than some state-of-the-art methods.

Using interactive reinforcement learning with the advisor being another reinforcement learning agent

Francisco Cruz, Sven Magg, Yukie Nagai & Stefan Wermter, Improving interactive reinforcement learning: What makes a good teacher?, Connection Science, DOI: 10.1080/09540091.2018.1443318.

Interactive reinforcement learning (IRL) has become an important apprenticeship approach to speed up convergence in classic reinforcement learning (RL) problems. In this regard, a variant of IRL is policy shaping which uses a parent-like trainer to propose the next action to be performed and by doing so reduces the search space by advice. On some occasions, the trainer may be another artificial agent which in turn was trained using RL methods to afterward becoming an advisor for other learner-agents. In this work, we analyse internal representations and characteristics of artificial agents to determine which agent may outperform others to become a better trainer-agent. Using a polymath agent, as compared to a specialist agent, an advisor leads to a larger reward and faster convergence of the reward signal and also to a more stable behaviour in terms of the state visit frequency of the learner-agents. Moreover, we analyse system interaction parameters in order to determine how influential they are in the apprenticeship process, where the consistency of feedback is much more relevant when dealing with different learner obedience parameters.

Deep reinforcement learning applied to learn both attention and classification in a task of vehicle classification

D. Zhao, Y. Chen and L. Lv, Deep Reinforcement Learning With Visual Attention for Vehicle Classification, IEEE Transactions on Cognitive and Developmental Systems, vol. 9, no. 4, pp. 356-367, DOI: 10.1109/TCDS.2016.2614675.

Automatic vehicle classification is crucial to intelligent transportation system, especially for vehicle-tracking by police. Due to the complex lighting and image capture conditions, image-based vehicle classification in real-world environments is still a challenging task and the performance is far from being satisfactory. However, owing to the mechanism of visual attention, the human vision system shows remarkable capability compared with the computer vision system, especially in distinguishing nuances processing. Inspired by this mechanism, we propose a convolutional neural network (CNN) model of visual attention for image classification. A visual attention-based image processing module is used to highlight one part of an image and weaken the others, generating a focused image. Then the focused image is input into the CNN to be classified. According to the classification probability distribution, we compute the information entropy to guide a reinforcement learning agent to achieve a better policy for image classification to select the key parts of an image. Systematic experiments on a surveillance-nature dataset which contains images captured by surveillance cameras in the front view, demonstrate that the proposed model is more competitive than the large-scale CNN in vehicle classification tasks.

Interesting mixture of automated planning with reinforcement learning

Matteo Leonetti, Luca Iocchi, Peter Stone, A synthesis of automated planning and reinforcement learning for efficient, robust decision-making, Artificial Intelligence, Volume 241, 2016, Pages 103-130, ISSN 0004-3702, DOI: 10.1016/j.artint.2016.07.004.

Automated planning and reinforcement learning are characterized by complementary views on decision making: the former relies on previous knowledge and computation, while the latter on interaction with the world, and experience. Planning allows robots to carry out different tasks in the same domain, without the need to acquire knowledge about each one of them, but relies strongly on the accuracy of the model. Reinforcement learning, on the other hand, does not require previous knowledge, and allows robots to robustly adapt to the environment, but often necessitates an infeasible amount of experience. We present Domain Approximation for Reinforcement LearnING (DARLING), a method that takes advantage of planning to constrain the behavior of the agent to reasonable choices, and of reinforcement learning to adapt to the environment, and increase the reliability of the decision making process. We demonstrate the effectiveness of the proposed method on a service robot, carrying out a variety of tasks in an office building. We find that when the robot makes decisions by planning alone on a given model it often fails, and when it makes decisions by reinforcement learning alone it often cannot complete its tasks in a reasonable amount of time. When employing DARLING, even when seeded with the same model that was used for planning alone, however, the robot can quickly learn a behavior to carry out all the tasks, improves over time, and adapts to the environment as it changes.

Survey of model-based reinforcement learning (and of reinforcement learning in general), for its application to improve learning time in robotics; a lot of references but not so many -or clear- explanations

Athanasios S. Polydoros, Lazaros Nalpantidis, Survey of Model-Based Reinforcement Learning: Applications on Robotics, Journal of Intelligent & Robotic Systems, May 2017, Volume 86, Issue 2, pp 153–173, DOI: 10.1007/s10846-017-0468-y.

Reinforcement learning is an appealing approach for allowing robots to learn new tasks. Relevant literature reveals a plethora of methods, but at the same time makes clear the lack of implementations for dealing with real life challenges. Current expectations raise the demand for adaptable robots. We argue that, by employing model-based reinforcement learning, the—now limited—adaptability characteristics of robotic systems can be expanded. Also, model-based reinforcement learning exhibits advantages that makes it more applicable to real life use-cases compared to model-free methods. Thus, in this survey, model-based methods that have been applied in robotics are covered. We categorize them based on the derivation of an optimal policy, the definition of the returns function, the type of the transition model and the learned task. Finally, we discuss the applicability of model-based reinforcement learning approaches in new applications, taking into consideration the state of the art in both algorithms and hardware.

A nive review of reinforcement learning from the perspective of its physiological foundations and its application to Robotics

Cornelius Weber, Mark Elshaw, Stefan Wermter, Jochen Triesch and Christopher Willmot, Reinforcement Learning Embedded in Brains and Robots, Reinforcement Learning: Theory and Applications, Book edited by Cornelius Weber, Mark Elshaw and Norbert Michael Mayer, ISBN 978-3-902613-14-1, pp.424, January 2008, I-Tech Education and Publishing, Vienna, Austria. (Local copy)

“Nexting” (predicting events that occur next, possibly at different time scales) implemented in a robot through temporal difference learning and with a large number of learners

Joseph Modayil, Adam White, Richard S. Sutton (2011), Multi-timescale Nexting in a Reinforcement Learning Robot, arXiv:1112.1133 [cs.LG]. ARXIV, (this version to appear in the Proceedings of the Conference on the Simulation of Adaptive Behavior, 2012).

The term “nexting” has been used by psychologists to refer to the propensity of people and many other animals to continually predict what will happen next in an immediate, local, and personal sense. The ability to “next” constitutes a basic kind of awareness and knowledge of one’s environment. In this paper we present results with a robot that learns to next in real time, predicting thousands of features of the world’s state, including all sensory inputs, at timescales from 0.1 to 8 seconds. This was achieved by treating each state feature as a reward-like target and applying temporal-difference methods to learn a corresponding value function with a discount rate corresponding to the timescale. We show that two thousand predictions, each dependent on six thousand state features, can be learned and updated online at better than 10Hz on a laptop computer, using the standard TD(lambda) algorithm with linear function approximation. We show that this approach is efficient enough to be practical, with most of the learning complete within 30 minutes. We also show that a single tile-coded feature representation suffices to accurately predict many different signals at a significant range of timescales. Finally, we show that the accuracy of our learned predictions compares favorably with the optimal off-line solution.