Category Archives: Artificial Intelligence

Improving sample efficiency under sparse rewards and large continuous action spaces through predictive control in RL

Antonyshyn, L., Givigi, S., Deep Model-Based Reinforcement Learning for Predictive Control of Robotic Systems with Dense and Sparse Rewards, J Intell Robot Syst 110, 100 (2024) DOI: 10.1007/s10846-024-02118-y.

Sparse rewards and sample efficiency are open areas of research in the field of reinforcement learning. These problems are especially important when considering applications of reinforcement learning to robotics and other cyber-physical systems. This is so because in these domains many tasks are goal-based and naturally expressed with binary successes and failures, action spaces are large and continuous, and real interactions with the environment are limited. In this work, we propose Deep Value-and-Predictive-Model Control (DVPMC), a model-based predictive reinforcement learning algorithm for continuous control that uses system identification, value function approximation and sampling-based optimization to select actions. The algorithm is evaluated on a dense reward and a sparse reward task. We show that it can match the performance of a predictive control approach to the dense reward problem, and outperforms model-free and model-based learning algorithms on the sparse reward task on the metrics of sample efficiency and performance. We verify the performance of an agent trained in simulation using DVPMC on a real robot playing the reach-avoid game. Video of the experiment can be found here: https://youtu.be/0Q274kcfn4c.

An interesting survey -before the “generative AI” boom- of the integration of sub-symbolic (for learning) and symbolic (for reasoning) systems

Artur d’Avila Garcez, Luis C. Lamb, Neurosymbolic AI: The 3rd Wave, arXiv:2012.05876 [cs.AI] https://arxiv.org/abs/2012.05876v2.

Current advances in Artificial Intelligence (AI) and Machine Learning (ML) have achieved unprecedented impact across research communities and industry. Nevertheless, concerns about trust, safety, interpretability and accountability of AI were raised by influential thinkers. Many have identified the need for well-founded knowledge representation and reasoning to be integrated with deep learning and for sound explainability. Neural-symbolic computing has been an active area of research for many years seeking to bring together robust learning in neural networks with reasoning and explainability via symbolic representations for network models. In this paper, we relate recent and early research results in neurosymbolic AI with the objective of identifying the key ingredients of the next wave of AI systems. We focus on research that integrates in a principled way neural network-based learning with symbolic knowledge representation and logical reasoning. The insights provided by 20 years of neural-symbolic computing are shown to shed new light onto the increasingly prominent role of trust, safety, interpretability and accountability of AI. We also identify promising directions and challenges for the next decade of AI research from the perspective of neural-symbolic systems.

A relatively simple way of reducing the sampling cost of DQN

Hossein Hassani, Soodeh Nikan, Abdallah Shami, Traffic navigation via reinforcement learning with episodic-guided prioritized experience replay, Engineering Applications of Artificial Intelligence, Volume 137, Part A, 2024, DOI: 10.1016/j.engappai.2024.109147.

Deep Reinforcement Learning (DRL) models play a fundamental role in autonomous driving applications; however, they typically suffer from sample inefficiency because they often require many interactions with the environment to learn effective policies. This makes the training process time-consuming. To address this shortcoming, Prioritized Experience Replay (PER) has proven to be effective by prioritizing samples with high Temporal-Difference (TD) error for learning. In this context, this study contributes to artificial intelligence by proposing a sample-efficient DRL algorithm called Episodic-Guided Prioritized Experience Replay (EPER). The core innovation of EPER lies in the utilization of an episodic memory, dedicated to storing successful training episodes. Within this memory, expected returns for each state–action pair are extracted. These returns, combined with TD error-based prioritization, form a novel objective function for deep Q-network training. To prevent excessive determinism, EPER introduces exploration into the learning process by incorporating a regularization term into the objective function that allows exploration of state-space regions with diverse Q-values. The proposed EPER algorithm is suitable to train a DRL agent for handling episodic tasks, and it can be integrated into off-policy DRL models. EPER is employed for traffic navigation through scenarios such as highway driving, merging, roundabout, and intersection to showcase its application in engineering. The attained results denote that, compared with the PER and an additional state-of-the-art training technique, EPER is superior in expediting the training of the agent and learning a more optimal policy that leads to lower collision rates within the constructed navigation scenarios.

Equivalence between Transformers and SVMs

Davoud Ataee Tarzanagh, Yingcong Li, Christos Thrampoulidis, Samet Oymak, Transformers as Support Vector Machines, arXiv:2308.16898 [cs.LG], https://arxiv.org/abs/2308.16898.

Since its inception in “Attention Is All You Need”, transformer architecture has led to revolutionary advancements in NLP. The attention layer within the transformer admits a sequence of input tokens X and makes them interact through pairwise similarities computed as softmax(XQK⊤X⊤), where (K,Q) are the trainable key-query parameters. In this work, we establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem that separates optimal input tokens from non-optimal tokens using linear constraints on the outer-products of token pairs. This formalism allows us to characterize the implicit bias of 1-layer transformers optimized with gradient descent: (1) Optimizing the attention layer with vanishing regularization, parameterized by (K,Q), converges in direction to an SVM solution minimizing the nuclear norm of the combined parameter W=KQ⊤. Instead, directly parameterizing by W minimizes a Frobenius norm objective. We characterize this convergence, highlighting that it can occur toward locally-optimal directions rather than global ones. (2) Complementing this, we prove the local/global directional convergence of gradient descent under suitable geometric conditions. Importantly, we show that over-parameterization catalyzes global convergence by ensuring the feasibility of the SVM problem and by guaranteeing a benign optimization landscape devoid of stationary points. (3) While our theory applies primarily to linear prediction heads, we propose a more general SVM equivalence that predicts the implicit bias with nonlinear heads. Our findings are applicable to arbitrary datasets and their validity is verified via experiments. We also introduce several open problems and research directions. We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.

A novel way of addressing the maximization bias in RL

Martin Waltz, Ostap Okhrin, Addressing maximization bias in reinforcement learning with two-sample testing, Artificial Intelligence, Volume 336, 2024, DOI: 10.1016/j.artint.2024.104204.

Value-based reinforcement-learning algorithms have shown strong results in games, robotics, and other real-world applications. Overestimation bias is a known threat to those algorithms and can sometimes lead to dramatic performance decreases or even complete algorithmic failure. We frame the bias problem statistically and consider it an instance of estimating the maximum expected value (MEV) of a set of random variables. We propose the T-Estimator (TE) based on two-sample testing for the mean, that flexibly interpolates between over- and underestimation by adjusting the significance level of the underlying hypothesis tests. We also introduce a generalization, termed K-Estimator (KE), that obeys the same bias and variance bounds as the TE and relies on a nearly arbitrary kernel function. We introduce modifications of Q-Learning and the Bootstrapped Deep Q-Network (BDQN) using the TE and the KE, and prove convergence in the tabular setting. Furthermore, we propose an adaptive variant of the TE-based BDQN that dynamically adjusts the significance level to minimize the absolute estimation bias. All proposed estimators and algorithms are thoroughly tested and validated on diverse tasks and environments, illustrating the bias control and performance potential of the TE and KE.

RL in periodic scenarios

A. Aniket and A. Chattopadhyay, Online Reinforcement Learning in Periodic MDP, IEEE Transactions on Artificial Intelligence, vol. 5, no. 7, pp. 3624-3637, July 2024 DOI: 10.1109/TAI.2024.3375258.

We study learning in periodic Markov decision process (MDP), a special type of nonstationary MDP where both the state transition probabilities and reward functions vary periodically, under the average reward maximization setting. We formulate the problem as a stationary MDP by augmenting the state space with the period index and propose a periodic upper confidence bound reinforcement learning-2 (PUCRL2) algorithm. We show that the regret of PUCRL2 varies linearly with the period N and as O(TlogT−−−−−√) with the horizon length T . Utilizing the information about the sparsity of transition matrix of augmented MDP, we propose another algorithm [periodic upper confidence reinforcement learning with Bernstein bounds (PUCRLB) which enhances upon PUCRL2, both in terms of regret ( O(N−−√) dependency on period] and empirical performance. Finally, we propose two other algorithms U-PUCRL2 and U-PUCRLB for extended uncertainty in the environment in which the period is unknown but a set of candidate periods are known. Numerical results demonstrate the efficacy of all the algorithms.

Review of the current methologies for achieving continuous learning, and its biological bases

Buddhi Wickramasinghe, Gobinda Saha , and Kaushik Roy, Continual Learning: A Review of Techniques, Challenges, and Future Directions, IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, VOL. 5, NO. 6, JUNE 2024 DOI: 10.1109/TAI.2023.3339091.

Continual learning (CL), or the ability to acquire, process, and learn from new information without forgetting acquired knowledge, is a fundamental quality of an intelligent agent. The human brain has evolved into gracefully dealing with ever-changing circumstances and learning from experience with the help of complex neurophysiological mechanisms. Even though artificial intelligence takes after human intelligence, traditional neural networks do not possess the ability to adapt to dynamic environments. When presented with new information, an artificial neural network (ANN) often completely forgets its prior knowledge, a phenomenon called catastrophic forgetting or catastrophic interference. Incorporating CL capabilities into ANNs is an active field of research and is integral to achieving artificial general intelligence. In this review, we revisit CL approaches and critically examine their strengths and limitations. We conclude that CL approaches should look beyond mitigating catastrophic forgetting and strive for systems that can learn, store, recall, and transfer knowledge, much like the human brain. To this end, we highlight the importance of adopting alternative brain-inspired data representations and learning algorithms and provide our perspective on promising new directions where CL could play an instrumental role.

See also: doi: 10.1109/TAI.2024.3355879

Reducing discovered skills in DRL to the essential ones, modelling skills with SMDP Q-learning

Shuai Qing, Fei Zhu, Refine to the essence: Less-redundant skill learning via diversity clustering, Engineering Applications of Artificial Intelligence, Volume 133, Part A, 2024 DOI: 10.1016/j.engappai.2024.107981.

In reinforcement learning, skill is a potentially conditional policy that solves tasks in a hierarchically controlled manner. Progress on skill discovery helps agents learn a set of diverse and useful skills without external supervision to tackle complex tasks with sparse rewards. Although most of the studies have aimed to maximize the diversity of skills discovered, the distinguishability between skills diminishes as the number of skills increases, leading to a subset of similar and redundant skills. To tackle this problem, a method called Refine to the Essence of Skills (RE-Skill) is proposed, which aims at learning skills with less redundancy. RE-Skill integrates the concepts of cluster analysis and policy distillation, clustering similar skills together based on their unique features, learning the most optimal performance within each cluster, and filtering out similar skills that involve excessive and intricate actions, thereby reducing redundancy among skills. By refining clusters of similar skills into less-redundant independent skills, RE-Skill demonstrates superior performance compared to other skill discovery algorithms and shows how these less-redundant skills effectively address downstream tasks, indicating that RE-Skill is able to extend its efficacy to engineering applications in robot control and obstacle training tasks within complex environments.

A novel RL setting for non-Markovian systems

Ronen I. Brafman, Giuseppe De Giacomo, Regular decision processes, Artificial Intelligence, Volume 331, 2024 DOI: 10.1016/j.artint.2024.104113.

We introduce and study Regular Decision Processes (RDPs), a new, compact model for domains with non-Markovian dynamics and rewards, in which the dependence on the past is regular, in the language theoretic sense. RDPs are an intermediate model between MDPs and POMDPs. They generalize k-order MDPs and can be viewed as a POMDP in which the hidden state is a regular function of the entire history. In factored RDPs, transition and reward functions are specified using formulas in linear temporal logics over finite traces, or using regular expressions. This allows specifying complex dependence on the past using intuitive and compact formulas, and building models of partially observable domains without specifying an underlying state space.

The problem of incorporating novelties into the knowledge of an AI agent

Shivam Goel, Panagiotis Lymperopoulos, Ravenna Thielstrom, Evan Krause, Patrick Feeney, Pierrick Lorang, Sarah Schneider, Yichen Wei, Eric Kildebeck, Stephen Goss, Michael C. Hughes, Liping Liu, Jivko Sinapov, Matthias Scheutz, A neurosymbolic cognitive architecture framework for handling novelties in open worlds, Artificial Intelligence, Volume 331, 2024 DOI: 10.1016/j.artint.2024.104111.

“Open world” environments are those in which novel objects, agents, events, and more can appear and contradict previous understandings of the environment. This runs counter to the “closed world” assumption used in most AI research, where the environment is assumed to be fully understood and unchanging. The types of environments AI agents can be deployed in are limited by the inability to handle the novelties that occur in open world environments. This paper presents a novel cognitive architecture framework to handle open-world novelties. This framework combines symbolic planning, counterfactual reasoning, reinforcement learning, and deep computer vision to detect and accommodate novelties. We introduce general algorithms for exploring open worlds using inference and machine learning methodologies to facilitate novelty accommodation. The ability to detect and accommodate novelties allows agents built on this framework to successfully complete tasks despite a variety of novel changes to the world. Both the framework components and the entire system are evaluated in Minecraft-like simulated environments. Our results indicate that agents are able to efficiently complete tasks while accommodating “concealed novelties” not shared with the architecture development team.