Tag Archives: Reinforcement Learning

Learning rewards from diverse human sources

Bıyık E, Losey DP, Palan M, Landolfi NC, Shevchuk G, Sadigh D., Learning reward functions from diverse sources of human feedback: Optimally integrating demonstrations and preferences, The International Journal of Robotics Research. 2022;41(1):45-67 DOI: 10.1177/02783649211041652.

Reward functions are a common way to specify the objective of a robot. As designing reward functions can be extremely challenging, a more promising approach is to directly learn reward functions from human teachers. Importantly, data from human teachers can be collected either passively or actively in a variety of forms: passive data sources include demonstrations (e.g., kinesthetic guidance), whereas preferences (e.g., comparative rankings) are actively elicited. Prior research has independently applied reward learning to these different data sources. However, there exist many domains where multiple sources are complementary and expressive. Motivated by this general problem, we present a framework to integrate multiple sources of information, which are either passively or actively collected from human users. In particular, we present an algorithm that first utilizes user demonstrations to initialize a belief about the reward function, and then actively probes the user with preference queries to zero-in on their true reward. This algorithm not only enables us to combine multiple data sources, but it also informs the robot when it should leverage each type of information. Further, our approach accounts for the human’s ability to provide data, yielding user-friendly preference queries which are also theoretically optimal. Our extensive simulated experiments and user studies on a Fetch mobile manipulator demonstrate the superiority and the usability of our integrated framework.
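The two-stage idea in the abstract (demonstrations first, preference queries afterwards) can be illustrated with a small sketch. This is not the authors' implementation: it assumes a linear reward over trajectory features, a hypothetical finite set of candidate weight vectors (`CANDIDATES`), a softmax human model, and an information-gain rule for choosing queries. All names and numbers are made up for illustration.

```python
import math

# Hypothetical candidate reward weights w; reward(trajectory) = w . features.
CANDIDATES = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7), (-1.0, 0.0)]

def dot(w, phi):
    return sum(wi * pi for wi, pi in zip(w, phi))

def normalize(weights):
    total = sum(weights)
    return [x / total for x in weights]

def update_with_demo(belief, demo_phi, beta=1.0):
    """Passive stage: reweight each candidate by exp(beta * reward of the demo)."""
    return normalize([b * math.exp(beta * dot(w, demo_phi))
                      for b, w in zip(belief, CANDIDATES)])

def preference_likelihood(w, phi_a, phi_b):
    """Assumed softmax model: probability that the user prefers A over B."""
    return 1.0 / (1.0 + math.exp(dot(w, phi_b) - dot(w, phi_a)))

def update_with_preference(belief, phi_a, phi_b, prefers_a):
    """Active stage: Bayesian update of the belief after one answered query."""
    post = []
    for b, w in zip(belief, CANDIDATES):
        p = preference_likelihood(w, phi_a, phi_b)
        post.append(b * (p if prefers_a else 1.0 - p))
    return normalize(post)

def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def information_gain(belief, phi_a, phi_b):
    """Mutual information between the user's answer and the reward weights."""
    ps = [preference_likelihood(w, phi_a, phi_b) for w in CANDIDATES]
    mean_p = sum(b * p for b, p in zip(belief, ps))
    expected_h = sum(b * binary_entropy(p) for b, p in zip(belief, ps))
    return binary_entropy(mean_p) - expected_h

def pick_query(belief, queries):
    """Choose the (phi_a, phi_b) pair expected to be most informative."""
    return max(queries, key=lambda q: information_gain(belief, q[0], q[1]))
```

Starting from a uniform belief such as `[0.25] * 4`, one would call `update_with_demo` once per demonstration and then alternate `pick_query` and `update_with_preference`. A query comparing two identical trajectories has zero information gain, so it is never chosen over a query that separates the candidates.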

Trying to reach general AI through just decision-making (rewards) instead of using a diversity of paradigms

David Silver, Satinder Singh, Doina Precup, Richard S. Sutton, Reward is enough, Artificial Intelligence, Volume 299, 2021 DOI: 10.1016/j.artint.2021.103535.

In this article we hypothesise that intelligence, and its associated abilities, can be understood as subserving the maximisation of reward. Accordingly, reward is enough to drive behaviour that exhibits abilities studied in natural and artificial intelligence, including knowledge, learning, perception, social intelligence, language, generalisation and imitation. This is in contrast to the view that specialised problem formulations are needed for each ability, based on other signals or objectives. Furthermore, we suggest that agents that learn through trial and error experience to maximise reward could learn behaviour that exhibits most if not all of these abilities, and therefore that powerful reinforcement learning agents could constitute a solution to artificial general intelligence.


  • The agent's limited computational and physical capacity to cope with an overly complex world is the main reason to use learning instead of pre-built knowledge (evolution): learning lets the agent focus first on acquiring the skills that matter most in its own circumstances.
  • The paper argues why classification (supervised learning) is less powerful and efficient than RL.
  • It makes the same argument for multi-agent settings vs. a single agent confronted with one complex environment (which contains the other agents).

A nice survey on active learning, in particular for robotics

Annalisa T. Taylor, Thomas A. Berrueta, Todd D. Murphey, Active learning in robotics: A review of control principles, Mechatronics, Volume 77, 2021 DOI: 10.1016/j.mechatronics.2021.102576.

Active learning is a decision-making process. In both abstract and physical settings, active learning demands
both analysis and action. This is a review of active learning in robotics, focusing on methods amenable to
the demands of embodied learning systems. Robots must be able to learn efficiently and flexibly through
continuous online deployment. This poses a distinct set of control-oriented challenges: one must choose
suitable measures as objectives, synthesize real-time control, and produce analyses that guarantee performance
and safety with limited knowledge of the environment or robot itself. In this work, we survey the fundamental
components of robotic active learning systems. We discuss classes of learning tasks that robots typically
encounter, measures with which they gauge the information content of observations, and algorithms for
generating action plans. Moreover, we provide a variety of examples, from environmental mapping to
nonparametric shape estimation, that highlight the qualitative differences between learning tasks, information
measures, and control techniques. We conclude with a discussion of control-oriented open challenges, including
safety-constrained learning and distributed learning.


  • RL can be considered one of the areas within computational learning theory, which usually ignores the physical embodiment of the learning agent. However, that is so only when RL explores through decision-making, not when it explores randomly, with little intent to enhance learning itself through its actions.
  • RL caveats (particularly Deep RL): their large data requirements, lack of generalizability between tasks, as well as their inability to learn incrementally and guarantee
  • Bayesian filters can be seen as learner systems: they learn parameters of objects (pose) or environments (maps) aided by some models. However, they are more active learners when they use the robot actions to improve that parameter learning.
  • Gaussian processes can be effective for learning those models when neither a parametric form nor much first-principles knowledge is available, for instance, when the robot has to learn the model while observing only a small (local) part of the environment.
  • Entropy/information, Fisher’s information (conditional information) and ergodicity are the main ways of measuring information gain in active learning.
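For reference, the first two families of measures in the last note have standard textbook definitions, written here in LaTeX. The symbols are assumptions of this sketch rather than the review's notation: x is the quantity being learned, y an observation, and θ a model parameter. The ergodic metric is omitted because its definition needs more setup.

```latex
% Shannon entropy of a discrete belief p(x)
H(X) = -\sum_{x} p(x) \log p(x)

% Expected information gain (mutual information) of an observation Y about X
I(X; Y) = H(X) - \mathbb{E}_{y \sim p(y)}\left[ H(X \mid Y = y) \right]

% Fisher information that an observation y carries about a parameter \theta
\mathcal{I}(\theta) = \mathbb{E}_{y \sim p(y \mid \theta)}
  \left[ \left( \frac{\partial}{\partial \theta} \log p(y \mid \theta) \right)^{2} \right]
```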

State of the art on the convergence of the Monte Carlo Exploring Starts RL method (a policy-iteration-type method)

Jun Liu, On the convergence of reinforcement learning with Monte Carlo Exploring Starts, Automatica, Volume 129, 2021 DOI: 10.1016/j.automatica.2021.109693.

A basic simulation-based reinforcement learning algorithm is the Monte Carlo Exploring Starts (MCES) method, also known as optimistic policy iteration, in which the value function is approximated by simulated returns and a greedy policy is selected at each iteration. The convergence of this algorithm in the general setting has been an open question. In this paper, we investigate the convergence of this algorithm for the case with undiscounted costs, also known as the stochastic shortest path problem. The results complement existing partial results on this topic and thereby help further settle the open problem.
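To make the setting concrete, here is a minimal sketch of MCES on a toy stochastic shortest path problem. It is not the paper's construction: the four-state chain, the slip probability, the every-visit averaging and the episode-length cap are all assumptions chosen for illustration. Every step costs 1, there is no discounting, and the greedy policy minimizes the estimated cost-to-go.

```python
import random

GOAL = 3             # terminal state of a small chain 0-1-2-3 (hypothetical)
ACTIONS = (-1, +1)   # move left / move right
SLIP = 0.2           # probability that the move goes the opposite way

def step(state, action, rng):
    move = -action if rng.random() < SLIP else action
    return min(max(state + move, 0), GOAL)

def greedy(q, state):
    # Costs are minimized, so the greedy action has the lowest estimate.
    return min(ACTIONS, key=lambda a: q[(state, a)])

def mces(episodes=20000, seed=0, max_steps=1000):
    rng = random.Random(seed)
    states = list(range(GOAL))
    q = {(s, a): 0.0 for s in states for a in ACTIONS}   # cost-to-go estimates
    visits = {key: 0 for key in q}
    for _ in range(episodes):
        # Exploring start: random initial state and random first action.
        s, a = rng.choice(states), rng.choice(ACTIONS)
        path = []
        while s != GOAL and len(path) < max_steps:
            path.append((s, a))
            s = step(s, a, rng)
            if s != GOAL:
                a = greedy(q, s)   # afterwards follow the current greedy policy
        # Every-visit Monte Carlo update with undiscounted unit step costs.
        cost_to_go = 0
        for key in reversed(path):
            cost_to_go += 1
            visits[key] += 1
            q[key] += (cost_to_go - q[key]) / visits[key]
    policy = {s: greedy(q, s) for s in states}
    return q, policy
```

Because of the slip, every policy in this toy problem eventually reaches the goal, so episodes terminate with probability one; the `max_steps` cap is only a safeguard. With enough episodes one would expect the greedy policy to settle on moving right in every state.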

Approximating the value function of RL through Max-Plus algebra

Vinicius Mariano Gonçalves, Max-plus approximation for reinforcement learning, Automatica, Volume 129, 2021 DOI: 10.1016/j.automatica.2021.109623.

Max-Plus Algebra has been applied in several contexts, especially in the control of discrete events systems. In this article, we discuss another application closely related to control: the use of Max-Plus algebra concepts in the context of reinforcement learning. Max-Plus Algebra and reinforcement learning are strongly linked due to the latter’s dependence on the Bellman Equation which, in some cases, is a linear Max-Plus equation. This fact motivates the application of Max-Plus algebra to approximate the value function, central to the Bellman Equation and thus also to reinforcement learning. This article proposes conditions so that this approach can be done in a simple way and following the philosophy of reinforcement learning: explore the environment, receive the rewards and use this information to improve the knowledge of the value function. The proposed conditions are related to two matrices and impose on them a relationship that is analogous to the concept of weak inverses in traditional algebra.
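The link between the Bellman equation and Max-Plus algebra mentioned in the abstract can be seen in a tiny deterministic example. In Max-Plus algebra, "addition" is max and "multiplication" is +, so a Bellman backup over deterministic transitions becomes a matrix-vector product. The sketch below shows only that correspondence, on a made-up three-state problem; it does not implement the approximation scheme or the matrix conditions proposed in the paper.

```python
NEG_INF = float("-inf")   # Max-Plus "zero": marks a missing transition

def maxplus_matvec(A, v):
    """Max-Plus product: (A (x) v)_i = max_j (A[i][j] + v[j])."""
    return [max(a + x for a, x in zip(row, v)) for row in A]

# Hypothetical problem: A[i][j] is the reward of the deterministic
# transition i -> j, or NEG_INF when that transition does not exist.
A = [
    [NEG_INF, -1.0, -4.0],     # from state 0: go to 1 (reward -1) or to 2 (reward -4)
    [NEG_INF, NEG_INF, -1.0],  # from state 1: go to 2 (reward -1)
    [NEG_INF, NEG_INF, 0.0],   # state 2 is an absorbing goal with zero reward
]

def value_iteration(A, sweeps=10):
    """Repeated Bellman backups, each one a single Max-Plus product."""
    v = [0.0] * len(A)
    for _ in range(sweeps):
        v = maxplus_matvec(A, v)
    return v
```

Here `value_iteration(A)` settles at `[-2.0, -1.0, 0.0]`: from state 0 the route through state 1 (total reward -2) beats the direct jump to the goal (reward -4).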

Including a safety procedure in RL to avoid physical agent problems while learning

Kim Peter Wabersich, Melanie N. Zeilinger, A predictive safety filter for learning-based control of constrained nonlinear dynamical systems, Automatica, Volume 129, 2021 DOI: 10.1016/j.automatica.2021.109597.

The transfer of reinforcement learning (RL) techniques into real-world applications is challenged by safety requirements in the presence of physical limitations. Most RL methods, in particular the most popular algorithms, do not support explicit consideration of state and input constraints. In this paper, we address this problem for nonlinear systems with continuous state and input spaces by introducing a predictive safety filter, which is able to turn a constrained dynamical system into an unconstrained safe system and to which any RL algorithm can be applied ‘out-of-the-box’. The predictive safety filter receives the proposed control input and decides, based on the current system state, if it can be safely applied to the real system, or if it has to be modified otherwise. Safety is thereby established by a continuously updated safety policy, which is based on a model predictive control formulation using a data-driven system model and considering state and input dependent uncertainties.
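The filtering idea can be sketched on a toy system. The paper formulates the filter as a model predictive control problem with a data-driven model and uncertainty bounds; the simplified stand-in below replaces all of that with a known double integrator, a position limit (`P_MAX`) and a hand-written braking backup. The filter accepts the proposed input only if braking afterwards keeps the predicted trajectory inside the limit, and otherwise applies the backup. All names and constants are assumptions.

```python
DT, U_MAX, P_MAX = 0.1, 1.0, 1.0   # time step, input bound, position limit

def dynamics(p, v, u):
    """Known double-integrator model (an assumption of this sketch)."""
    return p + v * DT, v + u * DT

def clip(u):
    return max(-U_MAX, min(U_MAX, u))

def brake(v):
    """Backup input: decelerate toward zero velocity without overshooting."""
    return clip(-v / DT)

def is_safe_after(p, v, u, horizon=50):
    """Apply u once, then brake; check the position limit along the prediction."""
    p, v = dynamics(p, v, u)
    for _ in range(horizon):
        if p > P_MAX:
            return False
        p, v = dynamics(p, v, brake(v))
    return p <= P_MAX

def safety_filter(p, v, u_proposed):
    """Pass the learner's input through if it is certified, else use the backup."""
    u = clip(u_proposed)
    return u if is_safe_after(p, v, u) else brake(v)

def run_closed_loop(policy, steps=200):
    """Simulate the filtered system; `policy` stands in for any RL algorithm."""
    p, v, positions = 0.0, 0.0, []
    for _ in range(steps):
        u = safety_filter(p, v, policy(p, v))
        p, v = dynamics(p, v, u)
        positions.append(p)
    return positions
```

For example, `run_closed_loop(lambda p, v: U_MAX)` drives the system with a policy that always pushes toward the limit; the filter lets it accelerate and then overrides it with braking whenever the braking prediction would cross `P_MAX`. Unlike the paper's filter, this version does not search for the smallest modification of the proposed input.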

Qualitative modelling of quadcopters that is claimed to be better than reinforcement learning

Šoberl, D., Bratko, I. & Žabkar, J., Learning to Control a Quadcopter Qualitatively, J Intell Robot Syst 100, 1097–1110 (2020) DOI: 10.1007/s10846-020-01228-7.

Qualitative modeling allows autonomous agents to learn comprehensible control models, formulated in a way that is close to human intuition. By abstracting away certain numerical information, qualitative models can provide better insights into operating principles of a dynamic system in comparison to traditional numerical models. We show that qualitative models, learned from numerical traces, contain enough information to allow motion planning and path following. We demonstrate our methods on the task of flying a quadcopter. A qualitative control model is learned through motor babbling. Training is significantly faster than training times reported in papers using reinforcement learning with similar quadcopter experiments. A qualitative collision-free trajectory is computed by means of qualitative simulation, and executed reactively while dynamically adapting to numerical characteristics of the system. Experiments have been conducted and assessed in the V-REP robotic simulator.

Interesting review of psychological motivation and the role of RL in studying it

Randall C. O’Reilly, Unraveling the Mysteries of Motivation, Trends in Cognitive Sciences, Volume 24, Issue 6, 2020, Pages 425-434, DOI: 10.1016/j.tics.2020.03.001.

Motivation plays a central role in human behavior and cognition but is not well captured by widely used artificial intelligence (AI) and computational modeling frameworks. This Opinion article addresses two central questions regarding the nature of motivation: what are the nature and dynamics of the internal goals that drive our motivational system and how can this system be sufficiently flexible to support our ability to rapidly adapt to novel situations, tasks, etc.? In reviewing existing systems and neuroscience research and theorizing on these questions, a wealth of insights to constrain the development of computational models of motivation can be found.

On rewards and values when RL theory is applied to the human brain

Keno Juechems, Christopher Summerfield, Where Does Value Come From?, Trends in Cognitive Sciences, Volume 23, Issue 10, 2019, Pages 836-850, DOI: 10.1016/j.tics.2019.07.012.

The computational framework of reinforcement learning (RL) has allowed us to both understand biological brains and build successful artificial agents. However, in this opinion, we highlight open challenges for RL as a model of animal behaviour in natural environments. We ask how the external reward function is designed for biological systems, and how we can account for the context sensitivity of valuation. We summarise both old and new theories proposing that animals track current and desired internal states and seek to minimise the distance to a goal across multiple value dimensions. We suggest that this framework readily accounts for canonical phenomena observed in the fields of psychology, behavioural ecology, and economics, and recent findings from brain-imaging studies of value-guided decision-making.

Human interaction with the RL process

Celemin, C., Ruiz-del-Solar, J. & Kober, J., A fast hybrid reinforcement learning framework with human corrective feedback, Auton Robot (2019) 43: 1173, DOI: 10.1007/s10514-018-9786-6.

Reinforcement Learning agents can be supported by feedback from human teachers in the learning loop that guides the learning process. In this work we propose two hybrid strategies of Policy Search Reinforcement Learning and Interactive Machine Learning that benefit from both sources of information, the cost function and the human corrective feedback, for accelerating the convergence and improving the final performance of the learning process. Experiments with simulated and real systems of balancing tasks and a 3 DoF robot arm validate the advantages of the proposed learning strategies: (i) they speed up the convergence of the learning process between 3 and 30 times, saving considerable time during the agent adaptation, and (ii) they allow including non-expert feedback because they have low sensitivity to erroneous human advice.