Tag Archives: Policy Iteration

Offline RL in robotics

L. Yao, B. Zhao, X. Xu, Z. Wang, P. K. Wong and Y. Hu, Efficient Incremental Offline Reinforcement Learning With Sparse Broad Critic Approximation, IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 54, no. 1, pp. 156-169, Jan. 2024. DOI: 10.1109/TSMC.2023.3305498.

Offline reinforcement learning (ORL) has been attracting increasing attention in robot learning, benefiting from its ability to avoid hazardous exploration and to learn policies directly from precollected samples. Approximate policy iteration (API) is one of the most commonly investigated ORL approaches in robotics, owing to its linear representation of policies, which makes it fairly transparent in both theoretical and engineering analysis. One open problem of API is how to design efficient and effective basis functions. The broad learning system (BLS) has been extensively studied in supervised and unsupervised learning across various applications, but few investigations have applied it to ORL. In this article, a novel incremental ORL approach with sparse broad critic approximation (BORL) is proposed that exploits the advantages of BLS: it approximates the critic function linearly with randomly projected, sparse, and compact features and dynamically expands its broad structure. BORL is the first extension of API with BLS in the field of robotics and ORL. The approximation ability and convergence of BORL are also analyzed. Comprehensive simulation studies on two benchmarks demonstrate that the proposed BORL obtains comparable or better performance than conventional API methods without laborious hyperparameter fine-tuning. To further demonstrate the effectiveness of BORL in practical robotic applications, a variable force tracking problem in robotic ultrasound scanning (RUSS) is investigated, and a learning-based adaptive impedance control (LAIC) algorithm is proposed based on BORL. The experimental results demonstrate the advantages of LAIC over conventional force tracking methods.
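The abstract does not spell out the sparse broad feature construction, but the API backbone it extends can be illustrated with a short sketch: a linear Q-function over randomly projected features (standing in here for the broad feature nodes, which is an assumption on my part), evaluated by least-squares temporal difference (LSTD-Q) on an offline batch and improved greedily, in the spirit of LSPI. The toy chain MDP, feature sizes, and ReLU sparsification below are illustrative choices, not the authors' exact design.

```python
import numpy as np

# Minimal sketch of offline approximate policy iteration with a linear critic
# over randomly projected features (LSTD-Q evaluation + greedy improvement).
# The random-projection + ReLU features stand in for the paper's sparse broad
# feature nodes and are an assumption, not the authors' construction.

rng = np.random.default_rng(0)

n_states, n_actions, n_feats = 20, 2, 30
gamma = 0.95

# Random projection of (state, action) one-hot pairs into a feature space.
W = rng.normal(size=(n_states * n_actions, n_feats))

def phi(s, a):
    onehot = np.zeros(n_states * n_actions)
    onehot[s * n_actions + a] = 1.0
    return np.maximum(W.T @ onehot, 0.0)          # ReLU keeps features sparse

# Toy chain MDP used only to generate an offline batch of transitions.
def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

batch = []
for _ in range(5000):                              # precollected (offline) samples
    s = int(rng.integers(n_states))
    a = int(rng.integers(n_actions))
    s_next, r = step(s, a)
    batch.append((s, a, r, s_next))

def greedy(w, s):
    return int(np.argmax([phi(s, a) @ w for a in range(n_actions)]))

w = np.zeros(n_feats)
for it in range(20):                               # policy iteration loop
    A = np.eye(n_feats) * 1e-3                     # small ridge term for stability
    b = np.zeros(n_feats)
    for s, a, r, s_next in batch:                  # LSTD-Q policy evaluation
        f = phi(s, a)
        f_next = phi(s_next, greedy(w, s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    w_new = np.linalg.solve(A, b)
    if np.linalg.norm(w_new - w) < 1e-6:
        break
    w = w_new

print("greedy actions per state:", [greedy(w, s) for s in range(n_states)])
```

A natural follow-up, closer to the incremental "broad" flavor described in the abstract, would be to append additional random feature columns to W between iterations instead of fixing the feature width up front.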

See also: X. Wang, D. Hou, L. Huang and Y. Cheng, "Offline–Online Actor–Critic," IEEE Transactions on Artificial Intelligence, vol. 5, no. 1, pp. 61-69, Jan. 2024. DOI: 10.1109/TAI.2022.3225251.

Model-based (on ordinary differential equations) and partially model-free policy iteration in continuous time and space

Jaeyoung Lee, Richard S. Sutton, Policy iterations for reinforcement learning problems in continuous time and space — Fundamental theory and methods, Automatica, Volume 126, 2021. DOI: 10.1016/j.automatica.2020.109421.

Policy iteration (PI) is a recursive process of policy evaluation and improvement for solving an optimal decision-making/control problem, or in other words, a reinforcement learning (RL) problem. PI has also served as a foundation for developing RL methods. In this paper, we propose two PI methods, called differential PI (DPI) and integral PI (IPI), and their variants, for a general RL framework in continuous time and space (CTS), where the environment is modeled by a system of ordinary differential equations (ODEs). The proposed methods inherit the key ideas of PI from classical RL and optimal control and theoretically support the existing RL algorithms in CTS: TD learning and value-gradient-based (VGB) greedy policy update. We also provide case studies on (1) discounted RL and (2) optimal control tasks. Fundamental mathematical properties – admissibility, uniqueness of the solution to the Bellman equation (BE), monotone improvement, convergence, and optimality of the solution to the Hamilton–Jacobi–Bellman equation (HJBE) – are all investigated in depth and improved over the existing theory, for both the general framework and the case studies. Finally, the proposed methods and their model-based and partially model-free implementations are simulated with an inverted-pendulum model to support the theory and to investigate them further.
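The integral PI idea becomes concrete in the linear-quadratic special case: the value function is quadratic, policy evaluation reduces to a least-squares fit of the integral Bellman equation from trajectory data alone, and the policy update uses only the input matrix, which is what makes the scheme partially model-free. The sketch below assumes a linearized inverted-pendulum model and illustrative constants; none of the numbers or names are taken from the paper.

```python
import numpy as np

# Minimal sketch of integral policy iteration (IPI) for a continuous-time
# linear-quadratic problem: policy evaluation uses only trajectory data,
# while the policy update uses the input matrix B (but not the drift A),
# hence "partially model-free". Dynamics and constants are illustrative.

A = np.array([[0.0, 1.0],
              [10.0, 0.0]])      # used only to simulate the environment
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

dt, T = 1e-3, 0.05               # integration step and reinforcement interval

def rollout(K, x0, horizon):
    """Simulate u = -K x and collect (x_start, x_end, integral cost) per interval."""
    x, data = x0.copy(), []
    steps = int(T / dt)
    for _ in range(horizon):
        x_start, cost = x.copy(), 0.0
        for _ in range(steps):                   # forward Euler integration
            u = -K @ x
            cost += (x @ Q @ x + u @ R @ u) * dt
            x = x + (A @ x + B @ u) * dt
        data.append((x_start, x.copy(), cost))
    return data

def quad_basis(x):
    # Basis for a symmetric 2x2 value matrix P: [p11, p12, p22].
    return np.array([x[0]**2, 2 * x[0] * x[1], x[1]**2])

K = np.array([[20.0, 5.0]])                      # initial stabilizing gain
for it in range(8):
    # Collect data from several initial conditions for excitation.
    data = []
    for x0 in (np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, -1.0])):
        data += rollout(K, x0, horizon=10)
    # Integral Bellman equation: V(x_start) - V(x_end) = integral cost, V(x) = x'Px.
    Phi = np.array([quad_basis(xs) - quad_basis(xe) for xs, xe, _ in data])
    c = np.array([cost for _, _, cost in data])
    p = np.linalg.lstsq(Phi, c, rcond=None)[0]   # policy evaluation (data only)
    P = np.array([[p[0], p[1]], [p[1], p[2]]])
    K = np.linalg.solve(R, B.T @ P)              # policy improvement (uses B only)
    print(f"iteration {it}: K = {K.ravel()}")
```

Each pass performs one policy evaluation (solving V(x(t)) - V(x(t+T)) = integral of x'Qx + u'Ru over [t, t+T] in a least-squares sense) and one improvement K = R^{-1} B' P, so the printed gains should settle after a few iterations near the LQR gain of the corresponding Riccati equation.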