Category Archives: Applications Of Reinforcement Learning To Control Engineering

RL for multiple tasks in the case of quadrotors and a short state of the art about the general problem

J. Xing, I. Geles, Y. Song, E. Aljalbout and D. Scaramuzza, Multi-Task Reinforcement Learning for Quadrotors, IEEE Robotics and Automation Letters, vol. 10, no. 3, pp. 2112-2119, March 2025, DOI: 10.1109/LRA.2024.3520894.

Reinforcement learning (RL) has shown great effectiveness in quadrotor control, enabling specialized policies to develop even human-champion-level performance in single-task scenarios. However, these specialized policies often struggle with novel tasks, requiring a complete retraining of the policy from scratch. To address this limitation, this paper presents a novel multi-task reinforcement learning (MTRL) framework tailored for quadrotor control, leveraging the shared physical dynamics of the platform to enhance sample efficiency and task performance. By employing a multi-critic architecture and shared task encoders, our framework facilitates knowledge transfer across tasks, enabling a single policy to execute diverse maneuvers, including high-speed stabilization, velocity tracking, and autonomous racing. Our experimental results, validated both in simulation and real-world scenarios, demonstrate that our framework outperforms baseline approaches in terms of sample efficiency and overall task performance. Video is available at

Integrating the physical model of a Model Predictive Controller into an Actor-Critic RL framework to improve safety and flexibility at the same time

Angel Romero, Yunlong Song, Davide Scaramuzza, Actor-Critic Model Predictive Control, IEEE International Conference on Robotics and Automation, Yokohama, 2024 arXiv:2306.09852 [cs.RO].

An open research question in robotics is how to combine the benefits of model-free reinforcement learning (RL), known for its strong task performance and flexibility in optimizing general reward formulations, with the robustness and online replanning capabilities of model predictive control (MPC). This paper provides an answer by introducing a new framework called Actor-Critic Model Predictive Control. The key idea is to embed a differentiable MPC within an actor-critic RL framework. The proposed approach leverages the short-term predictive optimization capabilities of MPC with the exploratory and end-to-end training properties of RL. The resulting policy effectively manages both short-term decisions through the MPC-based actor and long-term prediction via the critic network, unifying the benefits of both model-based control and end-to-end learning. We validate our method in both simulation and the real world with a quadcopter platform across various high-level tasks. We show that the proposed architecture can achieve real-time control performance, learn complex behaviors via trial and error, and retain the predictive properties of the MPC to better handle out of distribution

RL in manufacturing control

Vladimir Samsonov, Karim Ben Hicham, Tobias Meisen, Reinforcement Learning in Manufacturing Control: Baselines, challenges and ways forward, Engineering Applications of Artificial Intelligence, Volume 112, 2022 DOI: 10.1016/j.engappai.2022.104868.

The field of Neural Combinatorial Optimization (NCO) offers multiple learning-based approaches to solve well-known combinatorial optimization tasks such as Traveling Salesman or Knapsack problem capable of competing with classical optimization approaches in terms of both solution quality and speed. This brought the attention of the research community to the tasks of Manufacturing Control (MC) with combinatorial nature. In this paper we outline the main components of MC tasks, select the most promising application fields and analyze dedicated learning-based solutions available in the literature. We draw multiple parallels to the current state of the art in the NCO field and allocate the main research gaps and directions on the perception, cognition and interaction levels. Using a set of practical examples we implement and benchmark common design patterns for single-agent Reinforcement Learning (RL) solutions. Along with testing existing solutions, we build on the ranked reward idea (Laterre et al., 2018) and offer a novel Multi-Instance Ranked Reward (m-R2) approach tailored to MC optimization tasks. It minimizes the reward shaping effort and defines a suitable training curriculum for more stable learning by separately tracking the agent's performance on every scheduling task and rewarding only policies contributing towards better scheduling solutions. We implement all solution design patterns as a set of interchangeable modules with a shared API, unified in a benchmarking framework with the focus on standardization of training and evaluation processes, reproducibility and simplified experiment lifecycle management. In addition to the framework, we make available our discrete-event simulation of a job shop production.
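
The per-instance ranked reward described in the abstract can be pictured with a minimal sketch. This is not the authors' m-R2 implementation: the class name, the percentile rule, the buffer size, and the use of makespan (lower is better) as the score are all assumptions made for illustration. The idea shown is only that each scheduling instance keeps its own history, and a solution earns a positive reward when it beats a percentile of that history.

```python
from collections import defaultdict, deque


class InstanceRankedReward:
    """Binary ranked reward with a separate score history per problem instance.

    Hypothetical sketch: names and defaults are illustrative assumptions.
    """

    def __init__(self, percentile=0.75, buffer_size=100):
        self.percentile = percentile
        # One bounded history of makespans per instance id.
        self.history = defaultdict(lambda: deque(maxlen=buffer_size))

    def threshold(self, instance_id):
        """Makespan a new solution must beat (lower is better), or None without history."""
        scores = sorted(self.history[instance_id], reverse=True)  # worst first
        if not scores:
            return None
        index = min(int(self.percentile * len(scores)), len(scores) - 1)
        return scores[index]

    def reward(self, instance_id, makespan):
        """Return +1 or -1 relative to the instance's own history, 0 on first visit."""
        limit = self.threshold(instance_id)
        self.history[instance_id].append(makespan)
        if limit is None:
            return 0.0  # nothing to compare against yet
        return 1.0 if makespan < limit else -1.0
```

Because the threshold moves with the agent's recent results on each instance, the reward needs no hand-tuned scaling across instances of different size, which is one way to read the "minimizes the reward shaping effort" claim.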


Zhihao Liu, Quan Liu, Wenjun Xu, Lihui Wang, Zude Zhou, Robot learning towards smart robotic manufacturing: A review, Robotics and Computer-Integrated Manufacturing, Volume 77, ISSN 0736-5845.

Shorter exploration stage in RL through the use of an expert (a PID) that sets the expectation of the explored action

J. Enrique Sierra-Garcia, Matilde Santos, Ravi Pandit, Wind turbine pitch reinforcement learning control improved by PID regulator and learning observer, Engineering Applications of Artificial Intelligence, Volume 111, 2022 DOI: 10.1016/j.engappai.2022.104769.

Wind turbine (WT) pitch control is a challenging issue due to the non-linearities of the wind device and its complex dynamics, the coupling of the variables and the uncertainty of the environment. Reinforcement learning (RL) based control arises as a promising technique to address these problems. However, its applicability is still limited due to the slowness of the learning process. To help alleviate this drawback, in this work we present a hybrid RL-based control that combines an RL-based controller with a proportional–integral–derivative (PID) regulator, and a learning observer. The PID is beneficial during the first training episodes as the RL based control does not have any experience to learn from. The learning observer oversees the learning process by adjusting the exploration rate and the exploration window in order to reduce the oscillations during the training and improve convergence. Simulation experiments on a small real WT show how the learning significantly improves with this control architecture, speeding up the learning convergence up to 37%, and increasing the efficiency of the intelligent control strategy. The best hybrid controller reduces the error of the output power by around 41% regarding a PID regulator. Moreover, the proposed intelligent hybrid control configuration has proved more efficient than a fuzzy controller and a neuro-control strategy.
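
A rough sketch of how a PID can center the exploration, assuming a scalar pitch command. None of this is taken from the paper: the function names, the uniform exploration window, and the observer's decay rule are hypothetical, chosen only to show exploratory actions being drawn around the PID output while an observer narrows the exploration rate and window.

```python
import random


class PID:
    """Textbook discrete PID regulator."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def choose_action(greedy_action, pid_action, epsilon, window, low, high, rng=random):
    """With probability epsilon explore around the PID output, otherwise act greedily."""
    if rng.random() < epsilon:
        action = rng.uniform(pid_action - window, pid_action + window)
    else:
        action = greedy_action
    return max(low, min(high, action))  # respect actuator limits


def observer_update(epsilon, window, error_improved, decay=0.9, min_window=0.5):
    """Hypothetical observer rule: narrow exploration while the tracking error improves."""
    if error_improved:
        epsilon *= decay
        window = max(min_window, window * decay)
    return epsilon, window
```

Early in training `epsilon` is large, so most actions stay close to what the PID would have done; as the observer shrinks `epsilon` and `window`, the learned greedy action takes over.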

Using a physical simulator for sampled rollouts in stochastic optimal control

Carius J, Ranftl R, Farshidian F, Hutter M. Constrained stochastic optimal control with learned importance sampling: A path integral approach, The International Journal of Robotics Research. 2022;41(2):189-209, DOI: 10.1177/02783649211047890.

Modern robotic systems are expected to operate robustly in partially unknown environments. This article proposes an algorithm capable of controlling a wide range of high-dimensional robotic systems in such challenging scenarios. Our method is based on the path integral formulation of stochastic optimal control, which we extend with constraint-handling capabilities. Under our control law, the optimal input is inferred from a set of stochastic rollouts of the system dynamics. These rollouts are simulated by a physics engine, placing minimal restrictions on the types of systems and environments that can be modeled. Although sampling-based algorithms are typically not suitable for online control, we demonstrate in this work how importance sampling and constraints can be used to effectively curb the sampling complexity and enable real-time control applications. Furthermore, the path integral framework provides a natural way of incorporating existing control architectures as ancillary controllers for shaping the sampling distribution. Our results reveal that even in cases where the ancillary controller would fail, our stochastic control algorithm provides an additional safety and robustness layer. Moreover, in the absence of an existing ancillary controller, our method can be used to train a parametrized importance sampling policy using data from the stochastic rollouts. The algorithm may thereby bootstrap itself by learning an importance sampling policy offline and then refining it to unseen environments during online control. We validate our results on three robotic systems, including hardware experiments on a quadrupedal robot.
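
The core of a path integral control law, inferring the input from cost-weighted stochastic rollouts, can be sketched as follows. This is a generic unconstrained update in the spirit of the abstract, not the paper's algorithm: there is no constraint handling and no learned importance sampler, and the 1-D point mass stands in for the physics engine. All names, the cost, and the parameters are assumptions.

```python
import math
import random


def rollout_cost(x, v, controls, dt=0.05, target=1.0):
    """Simulate a 1-D point mass (a stand-in for the physics engine) and sum the cost."""
    cost = 0.0
    for u in controls:
        v += u * dt
        x += v * dt
        cost += ((x - target) ** 2 + 0.1 * v ** 2 + 0.001 * u ** 2) * dt
    return cost


def path_integral_update(x, v, nominal, num_samples=64, sigma=1.0,
                         temperature=0.05, rng=random):
    """Return the nominal control sequence shifted by the cost-weighted noise average."""
    noises, costs = [], []
    for _ in range(num_samples):
        noise = [rng.gauss(0.0, sigma) for _ in nominal]
        noises.append(noise)
        costs.append(rollout_cost(x, v, [u + e for u, e in zip(nominal, noise)]))
    best = min(costs)
    # Subtracting the best cost keeps the exponentials numerically well behaved.
    weights = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(weights)
    return [
        u + sum(w * n[t] for w, n in zip(weights, noises)) / total
        for t, u in enumerate(nominal)
    ]
```

In this sketch the `nominal` sequence plays the role that the abstract assigns to an ancillary controller: it decides where the samples are drawn, so a better nominal means fewer rollouts are wasted on high-cost trajectories.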

Model-based RL for controlling a soft manipulator arm

T. G. Thuruthel, E. Falotico, F. Renda and C. Laschi, Model-Based Reinforcement Learning for Closed-Loop Dynamic Control of Soft Robotic Manipulators, IEEE Transactions on Robotics, vol. 35, no. 1, pp. 124-134, Feb. 2019. DOI: 10.1109/TRO.2018.2878318.

Dynamic control of soft robotic manipulators is an open problem yet to be well explored and analyzed. Most of the current applications of soft robotic manipulators utilize static or quasi-dynamic controllers based on kinematic models or linearity in the joint space. However, such approaches are not truly exploiting the rich dynamics of a soft-bodied system. In this paper, we present a model-based policy learning algorithm for closed-loop predictive control of a soft robotic manipulator. The forward dynamic model is represented using a recurrent neural network. The closed-loop policy is derived using trajectory optimization and supervised learning. The approach is verified first on a simulated piecewise constant strain model of a cable driven under-actuated soft manipulator. Furthermore, we experimentally demonstrate on a soft pneumatically actuated manipulator how closed-loop control policies can be derived that can accommodate variable frequency control and unmodeled external loads.

Value iteration applied in control systems when the model of the plant is substituted by data acquired from the plant

Yongqiang Li, Zhongsheng Hou, Yuanjing Feng, Ronghu Chi, Data-driven approximate value iteration with optimality error bound analysis, Automatica, Volume 78, April 2017, Pages 79-87, ISSN 0005-1098, DOI: 10.1016/j.automatica.2016.12.019.

Features of the data-driven approximate value iteration (AVI) algorithm, proposed in Li et al. (2014) for dealing with the optimal stabilization problem, include that only process data is required and that the estimate of the domain of attraction for the closed-loop is enlarged. However, the controller generated by the data-driven AVI algorithm is an approximate solution for the optimal control problem. In this work, a quantitative analysis result on the error bound between the optimal cost and the cost under the designed controller is given. This error bound is determined by the approximation error of the estimation for the optimal cost and the approximation error of the controller function estimator. The first one is concretely determined by the approximation error of the data-driven dynamic programming (DP) operator to the DP operator and the approximation error of the value function estimator. These three approximation errors are zero when the data set of the plant is sufficient and infinitely complete, and the number of samples in the state space of interest is infinite. This means that the cost under the designed controller equals the optimal cost when the number of iterations is infinite.
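
To make "the model of the plant is substituted by data" concrete, here is a toy sketch of value iteration that only ever touches recorded transitions. It is not the AVI algorithm of the paper: the value function is a lookup table with nearest-neighbour rounding instead of a function estimator, and the scalar plant, the grid, and the quadratic stage cost are hypothetical.

```python
def collect_data(states, inputs, plant):
    """Record (x, u, x_next) samples; the plant is only queried, never modelled."""
    return [(x, u, plant(x, u)) for x in states for u in inputs]


def data_driven_value_iteration(states, data, stage_cost, sweeps=50):
    """Approximate the optimal cost on a state grid using recorded transitions only."""

    def nearest(x):
        # Crude value-function estimator: snap to the closest grid point.
        return min(states, key=lambda s: abs(s - x))

    value = {x: 0.0 for x in states}
    policy = {}
    for _ in range(sweeps):
        new_value = {}
        for x in states:
            # Data-driven DP operator: minimise over the inputs tried at x.
            candidates = [
                (stage_cost(x, u) + value[nearest(x_next)], u)
                for (x_data, u, x_next) in data
                if x_data == x
            ]
            new_value[x], policy[x] = min(candidates)
        value = new_value
    return value, policy


states = [i * 0.5 for i in range(-4, 5)]  # grid on [-2, 2]
inputs = [-1.0, -0.5, 0.0, 0.5, 1.0]
# Hypothetical plant, used here only to generate the data set.
data = collect_data(states, inputs, plant=lambda x, u: 0.8 * x + u)
value, policy = data_driven_value_iteration(states, data, lambda x, u: x * x + u * u)
```

The two error sources named in the abstract are visible even in this toy: the coarse data set limits which transitions the DP operator can consider, and the nearest-neighbour lookup is an imperfect value function estimator.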

NOTE: Another paper on the same issue in the same journal.

Model-based reinforcement learning with a reduced number of basis functions to approximate the value function, a study of its convergence guarantees, and a nice state of the art on the use of (model-based) reinforcement learning for automatic control

Rushikesh Kamalapurkar, Joel A. Rosenfeld, Warren E. Dixon, Efficient model-based reinforcement learning for approximate online optimal control, Automatica, Volume 74, 2016, Pages 247-258, ISSN 0005-1098, DOI: 10.1016/j.automatica.2016.08.004.

An infinite horizon optimal regulation problem is solved online for a deterministic control-affine nonlinear dynamical system using a state following (StaF) kernel method to approximate the value function. Unlike traditional methods that aim to approximate a function over a large compact set, the StaF kernel method aims to approximate a function in a small neighborhood of a state that travels within a compact set. Simulation results demonstrate that stability and approximate optimality of the control system can be achieved with significantly fewer basis functions than may be required for global approximation methods.

Reinforcement learning in the automatic control area

Yu Jiang, Zhong-Ping Jiang, Global Adaptive Dynamic Programming for Continuous-Time Nonlinear Systems, IEEE Transactions on Automatic Control, vol. 60, no. 11, pp. 2917-2929, Nov. 2015, DOI: 10.1109/TAC.2015.2414811.

This paper presents a novel method of global adaptive dynamic programming (ADP) for the adaptive optimal control of nonlinear polynomial systems. The strategy consists of relaxing the problem of solving the Hamilton-Jacobi-Bellman (HJB) equation to an optimization problem, which is solved via a new policy iteration method. The proposed method distinguishes from previously known nonlinear ADP methods in that the neural network approximation is avoided, giving rise to significant computational improvement. Instead of semiglobally or locally stabilizing, the resultant control policy is globally stabilizing for a general class of nonlinear polynomial systems. Furthermore, in the absence of the a priori knowledge of the system dynamics, an online learning method is devised to implement the proposed policy iteration technique by generalizing the current ADP theory. Finally, three numerical examples are provided to validate the effectiveness of the proposed method.

Nice summary of reinforcement learning in control (Adaptive Dynamic Programming) and the use of Q-learning plus NN approximators for solving a control problem under a game theory framework

Kyriakos G. Vamvoudakis, Non-zero sum Nash Q-learning for unknown deterministic continuous-time linear systems, Automatica, Volume 61, November 2015, Pages 274-281, ISSN 0005-1098, DOI: 10.1016/j.automatica.2015.08.017.

This work proposes a novel Q-learning algorithm to solve the problem of non-zero sum Nash games of linear time invariant systems with N players (control inputs) and centralized uncertain/unknown dynamics. We first formulate the Q-function of each player as a parametrization of the state and all the other control inputs or players. An integral reinforcement learning approach is used to develop a model-free structure of N actors/N critics to estimate the parameters of the N coupled Q-functions online while also guaranteeing closed-loop stability and convergence of the control policies to a Nash equilibrium. A 4th-order simulation example with five players is presented to show the efficacy of the proposed approach.