Tag Archives: Reinforcement Learning

Using a physical simulator for sampled rollouts in stochastic optimal control

Carius J, Ranftl R, Farshidian F, Hutter M. Constrained stochastic optimal control with learned importance sampling: A path integral approach, The International Journal of Robotics Research. 2022;41(2):189-209, DOI: 10.1177/02783649211047890.

Modern robotic systems are expected to operate robustly in partially unknown environments. This article proposes an algorithm capable of controlling a wide range of high-dimensional robotic systems in such challenging scenarios. Our method is based on the path integral formulation of stochastic optimal control, which we extend with constraint-handling capabilities. Under our control law, the optimal input is inferred from a set of stochastic rollouts of the system dynamics. These rollouts are simulated by a physics engine, placing minimal restrictions on the types of systems and environments that can be modeled. Although sampling-based algorithms are typically not suitable for online control, we demonstrate in this work how importance sampling and constraints can be used to effectively curb the sampling complexity and enable real-time control applications. Furthermore, the path integral framework provides a natural way of incorporating existing control architectures as ancillary controllers for shaping the sampling distribution. Our results reveal that even in cases where the ancillary controller would fail, our stochastic control algorithm provides an additional safety and robustness layer. Moreover, in the absence of an existing ancillary controller, our method can be used to train a parametrized importance sampling policy using data from the stochastic rollouts. The algorithm may thereby bootstrap itself by learning an importance sampling policy offline and then refining it to unseen environments during online control. We validate our results on three robotic systems, including hardware experiments on a quadrupedal robot.
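As a rough illustration of how an input can be "inferred from a set of stochastic rollouts", here is a minimal sketch of a plain, unconstrained path integral update. It is not the constrained algorithm of the paper: the toy single-integrator `rollout_cost` stands in for the physics engine, and the names `path_integral_weights`, `path_integral_update`, `sigma` and `temperature` are assumptions made for this example. The nominal input sequence plays the role of the ancillary controller that shapes the sampling distribution.

```python
import math
import random


def rollout_cost(x0, inputs, target, dt=0.1):
    """Toy single-integrator rollout; a stand-in for the physics engine."""
    x, cost = x0, 0.0
    for u in inputs:
        x += u * dt
        cost += ((x - target) ** 2 + 0.01 * u ** 2) * dt
    return cost


def path_integral_weights(costs, temperature):
    """Normalised weights proportional to exp(-cost / temperature)."""
    best = min(costs)  # shift by the minimum for numerical stability
    unnormalised = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(unnormalised)
    return [w / total for w in unnormalised]


def path_integral_update(x0, nominal, target, n_rollouts=256, sigma=1.0,
                         temperature=0.05, seed=0):
    """Return the nominal inputs plus the weighted average of the sampled noise."""
    rng = random.Random(seed)
    noises, costs = [], []
    for _ in range(n_rollouts):
        # Sample around the nominal inputs (the importance sampling distribution).
        noise = [rng.gauss(0.0, sigma) for _ in nominal]
        perturbed = [u + e for u, e in zip(nominal, noise)]
        noises.append(noise)
        costs.append(rollout_cost(x0, perturbed, target))
    weights = path_integral_weights(costs, temperature)
    return [u + sum(w * noise[t] for w, noise in zip(weights, noises))
            for t, u in enumerate(nominal)]


nominal = [0.0] * 10  # hypothetical output of an ancillary controller
updated = path_integral_update(x0=0.0, nominal=nominal, target=1.0)
```

Low-cost rollouts dominate the average, so `updated` moves the toy system toward the target even though the nominal inputs do nothing. A better nominal sequence concentrates the samples where the cost is low, which is the intuition behind using importance sampling to curb the number of rollouts.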

Learning rewards from diverse human sources

Bıyık E, Losey DP, Palan M, Landolfi NC, Shevchuk G, Sadigh D. Learning reward functions from diverse sources of human feedback: Optimally integrating demonstrations and preferences, The International Journal of Robotics Research. 2022;41(1):45-67, DOI: 10.1177/02783649211041652.

Reward functions are a common way to specify the objective of a robot. As designing reward functions can be extremely challenging, a more promising approach is to directly learn reward functions from human teachers. Importantly, data from human teachers can be collected either passively or actively in a variety of forms: passive data sources include demonstrations (e.g., kinesthetic guidance), whereas preferences (e.g., comparative rankings) are actively elicited. Prior research has independently applied reward learning to these different data sources. However, there exist many domains where multiple sources are complementary and expressive. Motivated by this general problem, we present a framework to integrate multiple sources of information, which are either passively or actively collected from human users. In particular, we present an algorithm that first utilizes user demonstrations to initialize a belief about the reward function, and then actively probes the user with preference queries to zero in on their true reward. This algorithm not only enables us to combine multiple data sources, but it also informs the robot when it should leverage each type of information. Further, our approach accounts for the human’s ability to provide data: yielding user-friendly preference queries which are also theoretically optimal. Our extensive simulated experiments and user studies on a Fetch mobile manipulator demonstrate the superiority and the usability of our integrated framework.
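The two-stage idea (demonstrations initialize a belief, preferences refine it) can be sketched with a discrete belief over candidate reward weights. This is only an assumed, simplified model: the reward is taken to be linear in hypothetical trajectory features, the human is modelled as Boltzmann-rational with an assumed rationality `beta`, and the active selection of queries, which is central to the paper, is left out.

```python
import math


def dot(w, phi):
    return sum(a * b for a, b in zip(w, phi))


def normalise(values):
    total = sum(values)
    return [v / total for v in values]


def update_with_demo(belief, hypotheses, demo, alternatives, beta=2.0):
    """Bayes update: the demo is a Boltzmann-rational choice among `alternatives`."""
    posterior = []
    for b, w in zip(belief, hypotheses):
        partition = sum(math.exp(beta * dot(w, phi)) for phi in alternatives)
        posterior.append(b * math.exp(beta * dot(w, demo)) / partition)
    return normalise(posterior)


def update_with_preference(belief, hypotheses, preferred, rejected):
    """Bayes update with a logistic (softmax) model of a pairwise preference."""
    posterior = []
    for b, w in zip(belief, hypotheses):
        p = 1.0 / (1.0 + math.exp(dot(w, rejected) - dot(w, preferred)))
        posterior.append(b * p)
    return normalise(posterior)


# Hypothetical candidate reward weights and trajectory features.
hypotheses = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]
traj_a, traj_b, traj_c = (1.0, 0.0), (0.0, 1.0), (0.6, 0.6)

prior = [1.0 / 3.0] * 3
after_demo = update_with_demo(prior, hypotheses, traj_c, [traj_a, traj_b, traj_c])
after_pref = update_with_preference(after_demo, hypotheses, traj_a, traj_b)
```

The demonstration of the balanced trajectory `traj_c` shifts mass toward the balanced weights, and the answer "A is preferred over B" then separates the two remaining one-sided hypotheses, which the demonstration alone could not tell apart.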

Trying to reach general AI through just decision-making (rewards) instead of using a diversity of paradigms

David Silver, Satinder Singh, Doina Precup, Richard S. Sutton, Reward is enough, Artificial Intelligence, Volume 299, 2021, DOI: 10.1016/j.artint.2021.103535.

In this article we hypothesise that intelligence, and its associated abilities, can be understood as subserving the maximisation of reward. Accordingly, reward is enough to drive behaviour that exhibits abilities studied in natural and artificial intelligence, including knowledge, learning, perception, social intelligence, language, generalisation and imitation. This is in contrast to the view that specialised problem formulations are needed for each ability, based on other signals or objectives. Furthermore, we suggest that agents that learn through trial and error experience to maximise reward could learn behaviour that exhibits most if not all of these abilities, and therefore that powerful reinforcement learning agents could constitute a solution to artificial general intelligence.

NOTES:

  • The agent's computational and physical limitations in coping with an overly complex world are the main reason to use learning instead of pre-built knowledge (evolution): learning lets the agent focus first on acquiring the skills that matter most in its own circumstances.
  • Gives an argument for why classification (supervised learning) is less powerful and efficient than RL.
  • Gives a similar argument for multi-agent settings vs. a single agent confronted with one complex environment (which contains the other agents).

A nice survey on active learning, in particular for robotics

Annalisa T. Taylor, Thomas A. Berrueta, Todd D. Murphey, Active learning in robotics: A review of control principles, Mechatronics, Volume 77, 2021, DOI: 10.1016/j.mechatronics.2021.102576.

Active learning is a decision-making process. In both abstract and physical settings, active learning demands both analysis and action. This is a review of active learning in robotics, focusing on methods amenable to the demands of embodied learning systems. Robots must be able to learn efficiently and flexibly through continuous online deployment. This poses a distinct set of control-oriented challenges: one must choose suitable measures as objectives, synthesize real-time control, and produce analyses that guarantee performance and safety with limited knowledge of the environment or robot itself. In this work, we survey the fundamental components of robotic active learning systems. We discuss classes of learning tasks that robots typically encounter, measures with which they gauge the information content of observations, and algorithms for generating action plans. Moreover, we provide a variety of examples, from environmental mapping to nonparametric shape estimation, that highlight the qualitative differences between learning tasks, information measures, and control techniques. We conclude with a discussion of control-oriented open challenges, including safety-constrained learning and distributed learning.

NOTES:

  • RL can be considered one of the areas within computational learning theory, which usually ignores the physical embodiment of the learning agent. However, that is only so when RL explores through decision-making, not when it explores randomly, with little intent to enhance learning itself through its actions.
  • RL caveats (particularly Deep RL): large data requirements, lack of generalizability between tasks, and inability to learn incrementally and guarantee safety.
  • Bayesian filters can be seen as learner systems: they learn parameters of objects (pose) or environments (maps) aided by some models. However, they are more active learners when they use the robot actions to improve that parameter learning.
  • Gaussian processes can be effective in learning those models when neither a parametric form nor much first-principles knowledge is available, for instance when the robot has to learn the model while observing only a small (local) part of the environment.
  • Entropy/information, Fisher’s information (conditional information) and ergodicity are the main ways of measuring information gain in active learning.
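To make the entropy-based measure in the notes concrete, here is a minimal sketch of choosing a sensing action by expected entropy reduction. The setting is an assumption made for illustration and is not taken from the review: a target hides in one of a few cells, the robot keeps a discrete Bayesian belief over them, and a noisy binary detector (hypothetical parameters `p_hit` and `p_false`) can be pointed at one cell at a time.

```python
import math


def entropy(belief):
    """Shannon entropy (in nats) of a discrete belief."""
    return -sum(p * math.log(p) for p in belief if p > 0.0)


def posterior(belief, cell, detected, p_hit=0.9, p_false=0.1):
    """Bayes filter update after looking at `cell`.

    Returns the posterior belief and the probability of the observation.
    """
    unnormalised = []
    for i, b in enumerate(belief):
        p_detect = p_hit if i == cell else p_false
        likelihood = p_detect if detected else 1.0 - p_detect
        unnormalised.append(likelihood * b)
    evidence = sum(unnormalised)
    return [u / evidence for u in unnormalised], evidence


def expected_information_gain(belief, cell):
    """Expected entropy reduction (mutual information) of looking at `cell`."""
    prior_entropy = entropy(belief)
    gain = 0.0
    for detected in (True, False):
        post, prob = posterior(belief, cell, detected)
        gain += prob * (prior_entropy - entropy(post))
    return gain


def best_cell(belief):
    """Active learning step: pick the most informative cell to look at."""
    return max(range(len(belief)),
               key=lambda c: expected_information_gain(belief, c))


belief = [0.5, 0.3, 0.2]
choice = best_cell(belief)
```

A passive Bayesian filter would only call `posterior` on whatever observation arrives; the active variant additionally uses `best_cell` to decide where to look next.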

State of the art on the convergence of the Monte Carlo Exploring Starts RL method (of the policy iteration kind)

Jun Liu, On the convergence of reinforcement learning with Monte Carlo Exploring Starts, Automatica, Volume 129, 2021, DOI: 10.1016/j.automatica.2021.109693.

A basic simulation-based reinforcement learning algorithm is the Monte Carlo Exploring Starts (MCES) method, also known as optimistic policy iteration, in which the value function is approximated by simulated returns and a greedy policy is selected at each iteration. The convergence of this algorithm in the general setting has been an open question. In this paper, we investigate the convergence of this algorithm for the case with undiscounted costs, also known as the stochastic shortest path problem. The results complement existing partial results on this topic and thereby help further settle the open problem.
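For readers who have not seen MCES, the following is a minimal first-visit sketch on a tiny stochastic shortest path problem with undiscounted costs. The three-state chain, its costs, and the names `step` and `mces` are assumptions invented for this example; it only illustrates the algorithm's structure (random initial state-action pair, simulated returns, greedy improvement) and says nothing about the convergence conditions studied in the paper.

```python
import random

GOAL = 3
SAFE, RISKY = 0, 1
ACTIONS = (SAFE, RISKY)


def step(state, action, rng):
    """Toy dynamics: SAFE advances one state (cost 1); RISKY costs 2 and
    jumps to the goal with probability 0.8, otherwise stays in place."""
    if action == SAFE:
        return state + 1, 1.0
    if rng.random() < 0.8:
        return GOAL, 2.0
    return state, 2.0


def mces(n_episodes=20000, seed=0):
    rng = random.Random(seed)
    pairs = [(s, a) for s in range(GOAL) for a in ACTIONS]
    q = dict.fromkeys(pairs, 0.0)       # estimated cost-to-go
    visits = dict.fromkeys(pairs, 0)
    policy = dict.fromkeys(range(GOAL), SAFE)
    for _ in range(n_episodes):
        # Exploring start: uniformly random initial state-action pair.
        state, action = rng.choice(pairs)
        episode = []
        while state != GOAL:
            next_state, cost = step(state, action, rng)
            episode.append((state, action, cost))
            state = next_state
            if state != GOAL:
                action = policy[state]
        # Walk backwards; the last write per pair is its first visit.
        cost_to_go, first_visit = 0.0, {}
        for s, a, c in reversed(episode):
            cost_to_go += c
            first_visit[(s, a)] = cost_to_go
        # Evaluation: running average of the simulated returns.
        for pair, ret in first_visit.items():
            visits[pair] += 1
            q[pair] += (ret - q[pair]) / visits[pair]
        # Improvement: greedy (cost-minimising) policy at the visited states.
        for s in {pair[0] for pair in first_visit}:
            policy[s] = min(ACTIONS, key=lambda a: q[(s, a)])
    return q, policy


q, policy = mces()
```

In this toy problem the risky jump pays off only from the state farthest from the goal, and the learned greedy policy reflects that. Note that the averages mix returns generated under different policies, which is part of what makes a general convergence analysis delicate.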

Approximating the value function of RL through Max-Plus algebra

Vinicius Mariano Gonçalves, Max-plus approximation for reinforcement learning, Automatica, Volume 129, 2021, DOI: 10.1016/j.automatica.2021.109623.

Max-Plus Algebra has been applied in several contexts, especially in the control of discrete events systems. In this article, we discuss another application closely related to control: the use of Max-Plus algebra concepts in the context of reinforcement learning. Max-Plus Algebra and reinforcement learning are strongly linked due to the latter’s dependence on the Bellman Equation which, in some cases, is a linear Max-Plus equation. This fact motivates the application of Max-Plus algebra to approximate the value function, central to the Bellman Equation and thus also to reinforcement learning. This article proposes conditions so that this approach can be done in a simple way and following the philosophy of reinforcement learning: explore the environment, receive the rewards and use this information to improve the knowledge of the value function. The proposed conditions are related to two matrices and impose on them a relationship that is analogous to the concept of weak inverses in traditional algebra.
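To see why the Bellman equation can be a linear Max-Plus equation, consider a deterministic problem in which the reward depends only on the transition taken. The backup V(i) = max_j (r(i, j) + V(j)) is then a matrix-vector product in which "max" plays the role of addition and "+" the role of multiplication. The sketch below shows only this textbook fact on a made-up three-state graph; it does not implement the approximation scheme proposed in the article.

```python
NEG_INF = float("-inf")  # the max-plus "zero": marks a forbidden transition


def maxplus_matvec(a, v):
    """Max-plus product: result[i] = max over j of (a[i][j] + v[j])."""
    return [max(a_ij + v_j for a_ij, v_j in zip(row, v)) for row in a]


def finite_horizon_values(a, horizon):
    """Value iteration written as repeated max-plus matrix-vector products."""
    values = [0.0] * len(a)
    for _ in range(horizon):
        values = maxplus_matvec(a, values)
    return values


# a[i][j] is the reward of the deterministic move i -> j (hypothetical numbers).
A = [
    [NEG_INF, 4.0, 5.0],       # from state 0: go to 1 (reward 4) or to 2 (reward 5)
    [NEG_INF, NEG_INF, 3.0],   # from state 1: go to 2 (reward 3)
    [NEG_INF, NEG_INF, 0.0],   # state 2 is absorbing with zero reward
]

one_step = finite_horizon_values(A, 1)   # [5.0, 3.0, 0.0]
two_step = finite_horizon_values(A, 2)   # [7.0, 3.0, 0.0]
```

With a horizon of two, the detour through state 1 (4 + 3) beats the direct move (5), which the max-plus product picks up without any explicit maximisation over action sequences.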

Including a safety procedure in RL to avoid physical agent problems while learning

Kim Peter Wabersich, Melanie N. Zeilinger, A predictive safety filter for learning-based control of constrained nonlinear dynamical systems, Automatica, Volume 129, 2021, DOI: 10.1016/j.automatica.2021.109597.

The transfer of reinforcement learning (RL) techniques into real-world applications is challenged by safety requirements in the presence of physical limitations. Most RL methods, in particular the most popular algorithms, do not support explicit consideration of state and input constraints. In this paper, we address this problem for nonlinear systems with continuous state and input spaces by introducing a predictive safety filter, which is able to turn a constrained dynamical system into an unconstrained safe system and to which any RL algorithm can be applied ‘out-of-the-box’. The predictive safety filter receives the proposed control input and decides, based on the current system state, if it can be safely applied to the real system, or if it has to be modified otherwise. Safety is thereby established by a continuously updated safety policy, which is based on a model predictive control formulation using a data-driven system model and considering state and input dependent uncertainties.
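The "receive the proposed input, check it, modify it if needed" logic can be sketched on a toy example. Everything below is an assumption made for illustration: a known discrete-time double integrator with a position limit `p_max`, a hand-written full-braking backup policy, and a grid search over inputs. The filter in the paper instead solves a model predictive control problem with a data-driven model and uncertainty bounds.

```python
def step(p, v, u, dt):
    """Discrete-time double integrator (assumed known model)."""
    return p + v * dt, v + u * dt


def backup_is_safe(p, v, p_max, u_max, dt, horizon=1000):
    """Roll out full braking; safe if the robot stops without crossing p_max."""
    for _ in range(horizon):
        if p > p_max:
            return False
        if v <= 0.0:
            return True  # stopped (or moving away) inside the constraint
        p += v * dt
        v = max(v - u_max * dt, 0.0)
    return False


def safety_filter(p, v, u_proposed, p_max=1.0, u_max=1.0, dt=0.1, n_grid=41):
    """Return the admissible input closest to `u_proposed` that keeps the
    braking backup feasible after it is applied."""
    candidates = [-u_max + 2.0 * u_max * i / (n_grid - 1) for i in range(n_grid)]
    candidates.append(max(-u_max, min(u_max, u_proposed)))
    candidates.sort(key=lambda u: abs(u - u_proposed))
    for u in candidates:
        if backup_is_safe(*step(p, v, u, dt), p_max, u_max, dt):
            return u
    return -u_max  # nothing certified: fall back to full braking


far_from_limit = safety_filter(p=0.0, v=0.0, u_proposed=1.0)
near_limit = safety_filter(p=0.9, v=0.3, u_proposed=1.0)
```

Far from the limit the learner's input passes through unchanged; close to it, the filter reduces the acceleration just enough that braking can still stop the system in time. Any RL algorithm could sit upstream and supply `u_proposed`.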

Qualitative modelling of quadcopters that is claimed to be better than reinforcement learning

Šoberl, D., Bratko, I. & Žabkar, J., Learning to Control a Quadcopter Qualitatively, J Intell Robot Syst 100, 1097–1110 (2020), DOI: 10.1007/s10846-020-01228-7.

Qualitative modeling allows autonomous agents to learn comprehensible control models, formulated in a way that is close to human intuition. By abstracting away certain numerical information, qualitative models can provide better insights into operating principles of a dynamic system in comparison to traditional numerical models. We show that qualitative models, learned from numerical traces, contain enough information to allow motion planning and path following. We demonstrate our methods on the task of flying a quadcopter. A qualitative control model is learned through motor babbling. Training is significantly faster than training times reported in papers using reinforcement learning with similar quadcopter experiments. A qualitative collision-free trajectory is computed by means of qualitative simulation, and executed reactively while dynamically adapting to numerical characteristics of the system. Experiments have been conducted and assessed in the V-REP robotic simulator.

Interesting review of psychological motivation and the role of RL in studying it

Randall C. O’Reilly, Unraveling the Mysteries of Motivation, Trends in Cognitive Sciences, Volume 24, Issue 6, 2020, Pages 425-434, DOI: 10.1016/j.tics.2020.03.001.

Motivation plays a central role in human behavior and cognition but is not well captured by widely used artificial intelligence (AI) and computational modeling frameworks. This Opinion article addresses two central questions regarding the nature of motivation: what are the nature and dynamics of the internal goals that drive our motivational system and how can this system be sufficiently flexible to support our ability to rapidly adapt to novel situations, tasks, etc.? In reviewing existing systems and neuroscience research and theorizing on these questions, a wealth of insights to constrain the development of computational models of motivation can be found.

On rewards and values when RL theory is applied to the human brain

Keno Juechems, Christopher Summerfield, Where Does Value Come From?, Trends in Cognitive Sciences, Volume 23, Issue 10, 2019, Pages 836-850, DOI: 10.1016/j.tics.2019.07.012.

The computational framework of reinforcement learning (RL) has allowed us to both understand biological brains and build successful artificial agents. However, in this opinion, we highlight open challenges for RL as a model of animal behaviour in natural environments. We ask how the external reward function is designed for biological systems, and how we can account for the context sensitivity of valuation. We summarise both old and new theories proposing that animals track current and desired internal states and seek to minimise the distance to a goal across multiple value dimensions. We suggest that this framework readily accounts for canonical phenomena observed in the fields of psychology, behavioural ecology, and economics, and recent findings from brain-imaging studies of value-guided decision-making.