Reward machines as reward specification method for RL and their automated learning

Rodrigo Toro Icarte, Toryn Q. Klassen, Richard Valenzano, Margarita P. Castro, Ethan Waldie, Sheila A. McIlraith, Learning reward machines: A study in partially observable reinforcement learning, Artificial Intelligence, Volume 323, 2023 DOI: 10.1016/j.artint.2023.103989.

Reinforcement Learning (RL) is a machine learning paradigm wherein an artificial agent interacts with an environment with the purpose of learning behaviour that maximizes the expected cumulative reward it receives from the environment. Reward machines (RMs) provide a structured, automata-based representation of a reward function that enables an RL agent to decompose an RL problem into structured subproblems that can be efficiently learned via off-policy learning. Here we show that RMs can be learned from experience, instead of being specified by the user, and that the resulting problem decomposition can be used to effectively solve partially observable RL problems. We pose the task of learning RMs as a discrete optimization problem where the objective is to find an RM that decomposes the problem into a set of subproblems such that the combination of their optimal memoryless policies is an optimal policy for the original problem. We show the effectiveness of this approach on three partially observable domains, where it significantly outperforms A3C, PPO, and ACER, and discuss its advantages, limitations, and broader potential.

A PID-based global optimization algorithm

Yuansheng Gao, PID-based search algorithm: A novel metaheuristic algorithm based on PID algorithm, Expert Systems with Applications, Volume 232, 2023, DOI: 10.1016/j.eswa.2023.120886.

In this paper, a metaheuristic algorithm called PID-based search algorithm (PSA) is proposed for global optimization. The algorithm is based on an incremental PID algorithm that converges the entire population to an optimal state by continuously adjusting the system deviations. PSA is mathematically modeled and implemented to achieve optimization in a wide range of search spaces. PSA is used to solve CEC2017 benchmark test functions and six constrained problems. The optimization performance of PSA is verified by comparing it with seven metaheuristics proposed in recent years. The Kruskal-Wallis, Holm and Friedman tests verified the superiority of PSA in terms of statistical significance. The results show that PSA can be better balanced exploration and exploitation with strong optimization capability. Source codes�of PSA are publicly available at

A review of RL algorithms

Ashish Kumar Shakya, Gopinatha Pillai, Sohom Chakrabarty, Reinforcement learning algorithms: A brief survey, Expert Systems with Applications, Volume 231, 2023 DOI: 10.1016/j.eswa.2023.120495.

Reinforcement Learning (RL) is a machine learning (ML) technique to learn sequential decision-making in complex problems. RL is inspired by trial-and-error based human/animal learning. It can learn an optimal policy autonomously with knowledge obtained by continuous interaction with a stochastic dynamical environment. Problems considered virtually impossible to solve, such as learning to play video games just from pixel information, are now successfully solved using deep reinforcement learning. Without human intervention, RL agents can surpass human performance in challenging tasks. This review gives a broad overview of RL, covering its fundamental principles, essential methods, and illustrative applications. The authors aim to develop an initial reference point for researchers commencing their research work in RL. In this review, the authors cover some fundamental model-free RL algorithms and pathbreaking function approximation-based deep RL (DRL) algorithms for complex uncertain tasks with continuous action and state spaces, making RL useful in various interdisciplinary fields. This article also provides a brief review of model-based and multi-agent RL approaches. Finally, some promising research directions for RL are briefly presented.

Interesting way of explaining pointers and arrays of C in teaching programming

W. Rong, T. Xu, Z. Sun, Z. Sun, Y. Ouyang and Z. Xiong, An Object Tuple Model for Understanding Pointer and Array in C Language, IEEE Transactions on Education, vol. 66, no. 4, pp. 318-329, Aug. 2023 DOI: 10.1109/TE.2023.3236027.

Contribution: In this study, an object tuple model has been proposed, and a quasi-experimental study on its usage in an introductory programming language course has been reported. This work can be adopted by all C language teachers and students in learning pointer and array-related concepts. Background: C language has been extensively employed in numerous universities as an introductory programming practice. However, the pointer and array have long been recognized as some of the most difficult concepts for novice students learning C language. To help students become familiar with the concept of pointer and array and also their related operations, a comprehensive understanding from memory management\u2019s perspective might be helpful. Research Questions: 1) How does the object tuple model help students understand all kinds of object types from a generalized perspective? 2) Why is it important to let the students consider multiple arrays from a 1-D perspective? and 3) How do the memory-oriented operations from the object\u2019s perspective help students comprehensively understand the pointer and array? Methodology: The students were divided into experimental and control groups, and the object tuple model was presented in the experimental group. An examination was conducted at end of the semester, and test data were gathered for further analysis. Findings: The proposed object tuple model is effective in giving students clear guidance and helping them further understand the pointer and array in C language.

Pure pursuit with linear velocity regulation

Macenski, S., Singh, S., Mart�n, F. et al. Regulated pure pursuit for robot path tracking, Auton Robot 47, 685\u2013694 (2023) DOI: 10.1007/s10514-023-10097-6.

The accelerated deployment of service robots have spawned a number of algorithm variations to better handle real-world conditions. Many local trajectory planning techniques have been deployed on practical robot systems successfully. While most formulations of Dynamic Window Approach and Model Predictive Control can progress along paths and optimize for additional criteria, the use of pure path tracking algorithms is still commonplace. Decades later, Pure Pursuit and its variants continues to be one of the most commonly utilized classes of local trajectory planners. However, few Pure Pursuit variants have been proposed with schema for variable linear velocities\u2014they either assume a constant velocity or fails to address the point at all. This paper presents a variant of Pure Pursuit designed with additional heuristics to regulate linear velocities, built atop the existing Adaptive variant. The Regulated Pure Pursuit algorithm makes incremental improvements on state of the art by adjusting linear velocities with particular focus on safety in constrained and partially observable spaces commonly negotiated by deployed robots. We present experiments with the Regulated Pure Pursuit algorithm on industrial-grade service robots. We also provide a high-quality reference implementation that is freely included ROS 2 Nav2 framework at for fast evaluation.


H. A. G. C. Premachandra, R. Liu, C. Yuen and U. -X. Tan, UWB Radar SLAM: An Anchorless Approach in Vision Denied Indoor Environments, IEEE Robotics and Automation Letters, vol. 8, no. 9, pp. 5299-5306, Sept. 2023 DOI: 10.1109/LRA.2023.3293354.

LiDAR and cameras are frequently used as sensors for simultaneous localization and mapping (SLAM). However, these sensors are prone to failure under low visibility (e.g. smoke) or places with reflective surfaces (e.g. mirrors). On the other hand, electromagnetic waves exhibit better penetration properties when the wavelength increases, thus are not affected by low visibility. Hence, this letter presents ultra-wideband (UWB) radar as an alternative to the existing sensors. UWB is generally known to be used in anchor-tag SLAM systems. One or more anchors are installed in the environment and the tags are attached to the robots. Although this method performs well under low visibility, modifying the existing infrastructure is not always feasible. UWB has also been used in peer-to-peer ranging collaborative SLAM systems. However, this requires more than a single robot and does not include mapping in the mentioned environment like smoke. Therefore, the presented approach in this letter solely depends on the UWB transceivers mounted on-board. In addition, an extended Kalman filter (EKF) SLAM is used to solve the SLAM problem at the back-end. Experiments were conducted and demonstrated that the proposed UWB-based radar SLAM is able to map natural point landmarks inside an indoor environment while improving robot localization.

They had to do it: Certified RL (through online reward shaping/definition)

Hosein Hasanbeig, Daniel Kroening, Alessandro Abate, Certified reinforcement learning with logic guidance, Artificial Intelligence, Volume 322, 2023 DOI: 10.1016/j.artint.2023.103949.

Reinforcement Learning (RL) is a widely employed machine learning architecture that has been applied to a variety of control problems. However, applications in safety-critical domains require a systematic and formal approach to specifying requirements as tasks or goals. We propose a model-free RL algorithm that enables the use of Linear Temporal Logic (LTL) to formulate a goal for unknown continuous-state/action Markov Decision Processes (MDPs). The given LTL property is translated into a Limit-Deterministic Generalised B�chi Automaton (LDGBA), which is then used to shape a synchronous reward function on-the-fly. Under certain assumptions, the algorithm is guaranteed to synthesise a control policy whose traces satisfy the LTL specification with maximal probability.

Meta-RL: given a distribution of tasks, learn a policy capable of adapting to any new task from the task distribution with as little data as possible

Jacob Beck, Risto Vuorio, Evan Zheran Liu, Zheng Xiong, Luisa Zintgraf, Chelsea Finn, Shimon Whiteson, A Survey of Meta-Reinforcement Learning, arXiv:2301.08028 [cs.LG], 2023 DOI: 10.48550/arXiv.2301.08028.

While deep reinforcement learning (RL) has fueled multiple high-profile successes in machine learning, it is held back from more widespread adoption by its often poor data efficiency and the limited generality of the policies it produces. A promising approach for alleviating these limitations is to cast the development of better RL algorithms as a machine learning problem itself in a process called meta-RL. Meta-RL is most commonly studied in a problem setting where, given a distribution of tasks, the goal is to learn a policy that is capable of adapting to any new task from the task distribution with as little data as possible. In this survey, we describe the meta-RL problem setting in detail as well as its major variations. We discuss how, at a high level, meta-RL research can be clustered based on the presence of a task distribution and the learning budget available for each individual task. Using these clusters, we then survey meta-RL algorithms and applications. We conclude by presenting the open problems on the path to making meta-RL part of the standard toolbox for a deep RL practitioner.

Using “empowerment” to better select actions in RL when there are only sparse rewards

Dai, S., Xu, W., Hofmann, A. et al. An empowerment-based solution to robotic manipulation tasks with sparse rewards, Auton Robot 47, 617\u2013633 (2023) DOI: 10.1007/s10514-023-10087-8.

In order to provide adaptive and user-friendly solutions to robotic manipulation, it is important that the agent can learn to accomplish tasks even if they are only provided with very sparse instruction signals. To address the issues reinforcement learning algorithms face when task rewards are sparse, this paper proposes an intrinsic motivation approach that can be easily integrated into any standard reinforcement learning algorithm and can allow robotic manipulators to learn useful manipulation skills with only sparse extrinsic rewards. Through integrating and balancing empowerment and curiosity, this approach shows superior performance compared to other state-of-the-art intrinsic exploration approaches during extensive empirical testing. When combined with other strategies for tackling the exploration challenge, e.g. curriculum learning, our approach is able to further improve the exploration efficiency and task success rate. Qualitative analysis also shows that when combined with diversity-driven intrinsic motivations, this approach can help manipulators learn a set of diverse skills which could potentially be applied to other more complicated manipulation tasks and accelerate their learning process.

Using Deep RL (TRPO) for selecting best interest points in the environment for path planning

Jie Fan, Xudong Zhang, Yuan Zou, Hierarchical path planner for unknown space exploration using reinforcement learning-based intelligent frontier selection, Expert Systems with Applications, Volume 230, 2023 DOI: 10.1016/j.eswa.2023.120630.

Path planning in unknown environments is extremely useful for some specific tasks, such as exploration of outer space planets, search and rescue in disaster areas, home sweeping services, etc. However, existing frontier-based path planners suffer from insufficient exploration, while reinforcement learning (RL)-based ones are confronted with problems in efficient training and effective searching. To overcome the above problems, this paper proposes a novel hierarchical path planner for unknown space exploration using RL-based intelligent frontier selection. Firstly, by decomposing the path planner into three-layered architecture (including the perception layer, planning layer, and control layer) and using edge detection to find potential frontiers to track, the path search space is shrunk from the whole map to a handful of points of interest, which significantly saves the computational resources in both training and execution processes. Secondly, one of the advanced RL algorithms, trust region policy optimization (TRPO), is used as a judge to select the best frontier for the robot to track, which ensures the optimality of the path planner with a shorter path length. The proposed method is validated through simulation and compared with both classic and state-of-the-art methods. Results show that the training process could be greatly accelerated compared with the traditional deep-Q network (DQN). Moreover, the proposed method has 4.2%\u201314.3% improvement in exploration region rate and achieves the highest exploration completeness.