Tag Archives: Actor-critic

Integrating the physical model of a Model Predictive Controller into an Actor-Critic RL framework to improve safety and flexibility at the same time

Angel Romero, Yunlong Song, Davide Scaramuzza, Actor-Critic Model Predictive Control, IEEE International Conference on Robotics and Automation, Yokohama, 2024, arXiv:2306.09852 [cs.RO].

An open research question in robotics is how to combine the benefits of model-free reinforcement learning (RL), known for its strong task performance and flexibility in optimizing general reward formulations, with the robustness and online replanning capabilities of model predictive control (MPC). This paper provides an answer by introducing a new framework called Actor-Critic Model Predictive Control. The key idea is to embed a differentiable MPC within an actor-critic RL framework. The proposed approach leverages the short-term predictive optimization capabilities of MPC with the exploratory and end-to-end training properties of RL. The resulting policy effectively manages both short-term decisions through the MPC-based actor and long-term prediction via the critic network, unifying the benefits of both model-based control and end-to-end learning. We validate our method in both simulation and the real world with a quadcopter platform across various high-level tasks. We show that the proposed architecture can achieve real-time control performance, learn complex behaviors via trial and error, and retain the predictive properties of the MPC to better handle out-of-distribution behaviour.
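
To make the "short-term predictive optimization" role of the MPC-based actor more concrete, here is a minimal receding-horizon sketch on a toy scalar linear system. It is not the paper's differentiable MPC, quadcopter model, or training loop, and it has no critic or learning at all. The function names, the dynamics x' = a*x + b*u, and the quadratic cost weights are all assumptions chosen for illustration.

```python
def mpc_first_action(x, a=1.0, b=1.0, q=1.0, r=1.0, q_final=1.0, horizon=20):
    """First action of a finite-horizon plan for the toy system x' = a*x + b*u.

    The plan minimizes the sum of q*x^2 + r*u^2 over `horizon` steps plus a
    terminal cost q_final*x^2, solved by a backward Riccati recursion.
    All parameters are hypothetical, illustrative values.
    """
    p = q_final  # cost-to-go weight at the end of the horizon
    gain = 0.0
    for _ in range(horizon):
        # Walk backwards in time; the last gain computed belongs to step 0.
        gain = (b * p * a) / (r + b * p * b)
        p = q + a * p * a - a * p * b * gain
    return -gain * x


def rollout(x0, steps=5, a=1.0, b=1.0):
    """Receding horizon: replan at every step and apply only the first action."""
    xs = [x0]
    for _ in range(steps):
        u = mpc_first_action(xs[-1], a=a, b=b)
        xs.append(a * xs[-1] + b * u)
    return xs


if __name__ == "__main__":
    print(rollout(1.0))
```

Replanning from the current state at every step is what gives MPC its online replanning behaviour. In the paper's framework this optimization sits inside the actor and is trained end to end, which this sketch does not attempt.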

Improving sample efficiency in actor-critic RL (A2C with NNs) through a multimodal advantage function

Jonghyeok Park, Soohee Han, Reinforcement learning with multimodal advantage function for accurate advantage estimation in robot learning, Engineering Applications of Artificial Intelligence, Volume 126, Part C, 2023 DOI: 10.1016/j.engappai.2023.107019.

In this paper, we propose a reinforcement learning (RL) framework that uses a multimodal advantage function (MAF) to come close to the true advantage function, thereby achieving high returns. The MAF, which is constructed as a logarithm of a mixture of Gaussians policy (MoG-P) and trained by globally collected past experiences, directly assesses the complex true advantage function with its multi-modality and is expected to enhance the sample-efficiency of RL. To realize the expected enhanced learning performance with the proposed RL framework, two practical techniques are developed that include mode selection and rounding off of actions during the policy update process. Mode selection is conducted to sample the action around the most influential or weighted mode for efficient environment exploration. For fast policy updates, past actions are rounded off to discretized action values when calculating the multimodal advantage function. The proposed RL framework was validated using simulation environments and a real inverted pendulum system. The findings showed that the proposed framework can achieve a more sample-efficient performance or higher returns than other advantage-based RL benchmarks.
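
The three ingredients named in the abstract (the logarithm of a mixture of Gaussians, mode selection, and rounding off actions) can be sketched for a one-dimensional action as follows. This is only an illustration under assumed, fixed mixture parameters. The paper's network, its training on past experiences, and its exact rounding scheme are not given in the abstract, so every function name and the grid `step` here are hypothetical.

```python
import math
import random


def log_mog(action, weights, means, stds):
    """Log-density of a 1-D mixture of Gaussians, computed with log-sum-exp."""
    terms = [
        math.log(w) - 0.5 * math.log(2.0 * math.pi * s * s)
        - 0.5 * ((action - m) / s) ** 2
        for w, m, s in zip(weights, means, stds)
    ]
    top = max(terms)  # subtract the largest term for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


def select_mode(weights):
    """Index of the most heavily weighted mixture component."""
    return max(range(len(weights)), key=lambda i: weights[i])


def sample_around_mode(weights, means, stds, rng):
    """Draw an action from the Gaussian of the selected (heaviest) mode."""
    k = select_mode(weights)
    return rng.gauss(means[k], stds[k])


def round_action(action, step=0.1):
    """Snap an action to the nearest point of a grid with spacing `step`."""
    return round(action / step) * step


if __name__ == "__main__":
    weights, means, stds = [0.2, 0.7, 0.1], [-1.0, 0.5, 2.0], [0.3, 0.2, 0.5]
    rng = random.Random(0)
    a = round_action(sample_around_mode(weights, means, stds, rng))
    print(a, log_mog(a, weights, means, stds))
```

Rounding past actions to a grid reduces the number of distinct action values the function has to be evaluated on, which fits the abstract's stated motivation of fast policy updates.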

Application of Deep RL to person following by a robot, reducing the training effort of the network by reusing simple state situations in many artificially generated states

Pang, L., Zhang, Y., Coleman, S. et al., Efficient Hybrid-Supervised Deep Reinforcement Learning for Person Following Robot, J Intell Robot Syst 97, 299–312 (2020), DOI: 10.1007/s10846-019-01030-0.

Traditional person-following robots usually need hand-crafted features and a well-designed controller to follow the assigned person. Such systems are normally difficult to apply in outdoor situations due to the variability and complexity of the environment. In this paper, we propose an approach in which an agent is trained by hybrid-supervised deep reinforcement learning (DRL) to perform a person following task in an end-to-end manner. The approach enables the robot to learn features autonomously from monocular images and to enhance performance via robot-environment interaction. Experiments show that the proposed approach is adaptive to complex situations with significant illumination variation, object occlusion, target disappearance, pose change, and pedestrian interference. In order to speed up the training process to ensure easy application of DRL to real-world robotic follower controls, we apply an integration method through which the agent receives prior knowledge from a supervised learning (SL) policy network and reinforces its performance with a value-based or policy-based (including actor-critic methods) DRL model. We also utilize an efficient data collection approach for supervised learning in the context of person following. Experimental results not only verify the robustness of the proposed DRL-based person following robot system, but also indicate how easily the robot can learn from mistakes and improve performance.

A good introduction to actor-critic methods and model-free decision making on MDPs

J. Wang and I. C. Paschalidis, “An Actor-Critic Algorithm With Second-Order Actor and Critic,” in IEEE Transactions on Automatic Control, vol. 62, no. 6, pp. 2689-2703, June 2017. DOI: 10.1109/TAC.2016.2616384.

Actor-critic algorithms solve dynamic decision making problems by optimizing a performance metric of interest over a user-specified parametric class of policies. They employ a combination of an actor, making policy improvement steps, and a critic, computing policy improvement directions. Many existing algorithms use a steepest ascent method to improve the policy, which is known to suffer from slow convergence for ill-conditioned problems. In this paper, we first develop an estimate of the (Hessian) matrix containing the second derivatives of the performance metric with respect to policy parameters. Using this estimate, we introduce a new second-order policy improvement method and couple it with a critic using a second-order learning method. We establish almost sure convergence of the new method to a neighborhood of a policy parameter stationary point. We compare the new algorithm with some existing algorithms in two applications and demonstrate that it leads to significantly faster convergence.
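
The claim that steepest ascent converges slowly on ill-conditioned problems, and that second-order information helps, can be seen on a toy objective. The sketch below uses the exact gradient and Hessian of a hypothetical two-parameter quadratic, J(theta) = -0.5 * (c1*theta1^2 + c2*theta2^2). The paper instead estimates the Hessian of a policy performance metric from samples, so this is an illustration of the motivation and not their algorithm.

```python
def gradient(theta, curv):
    """Gradient of the toy objective J(theta) = -0.5 * sum(c_i * theta_i**2)."""
    return [-c * t for c, t in zip(curv, theta)]


def steepest_ascent(theta, curv, step, iters):
    """Plain gradient ascent with one fixed step size for every parameter."""
    for _ in range(iters):
        g = gradient(theta, curv)
        theta = [t + step * gi for t, gi in zip(theta, g)]
    return theta


def newton_step(theta, curv):
    """One Newton update; the Hessian of this toy objective is diag(-c_i)."""
    g = gradient(theta, curv)
    hessian_diag = [-c for c in curv]
    return [t - gi / h for t, gi, h in zip(theta, g, hessian_diag)]


if __name__ == "__main__":
    curv = [1.0, 100.0]   # curvatures differ by a factor of 100
    theta0 = [1.0, 1.0]   # the maximizer is [0.0, 0.0]
    # The step must stay below 2/100 for stability, so progress along the
    # flat direction (curvature 1.0) is slow.
    print(steepest_ascent(theta0, curv, step=0.01, iters=100))
    print(newton_step(theta0, curv))
```

After 100 gradient steps the first parameter is still at roughly 0.37 of its initial value, because the step size is limited by the steep direction. The single Newton step rescales each direction by its curvature and lands on the maximizer of this quadratic.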