Tag Archives: MDPs

Safety in MDPs by measuring the probability of reaching dangerous states

Rafal Wisniewski, Luminita-Manuela Bujorianu, Safety of stochastic systems: An analytic and computational approach, Automatica, Volume 133, 2021. DOI: 10.1016/j.automatica.2021.109839.

We refine the concept of stochastic reach avoidance for a general class of Markov processes by introducing a threshold p for the reaching probability. This new problem is called p-safety, and it aims to ensure that the probability that the given process reaches a forbidden set before leaving its ‘working’ state space is less than p. When an initial probability measure characterizes the initial states, a variant of p-safety, which we call weak p-safety, is put forward. In this work, we characterize both p-safety and weak p-safety and show how to compute them. We employ semi-definite programming to compute p-safety and linear programming to compute weak p-safety. To this end, we use certificates of positivity of polynomials translated into sum-of-squares and Bernstein forms.
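
The paper works with general (continuous-state) Markov processes and certifies the probability bound with sum-of-squares and Bernstein certificates via SDP/LP. Purely as an illustration of the quantity being bounded, here is a finite-state sketch (the chain, the numbers and the fixed-point iteration are my own assumptions, not the authors' method) that estimates the probability of hitting a forbidden set before leaving the working set and compares it with the threshold p:

```python
import numpy as np

def reach_before_exit_prob(P, forbidden, outside, iters=10_000, tol=1e-12):
    """Probability, from each state, of hitting `forbidden` before
    entering `outside` (i.e., before leaving the working space)."""
    h = np.zeros(P.shape[0])
    h[list(forbidden)] = 1.0
    for _ in range(iters):
        h_new = P @ h
        h_new[list(forbidden)] = 1.0   # forbidden states: already unsafe
        h_new[list(outside)] = 0.0     # left the working space first: counts as safe
        if np.max(np.abs(h_new - h)) < tol:
            break
        h = h_new
    return h

# Toy 4-state chain: states 0-1 form the working space, 2 is forbidden, 3 is "outside".
P = np.array([[0.90, 0.05, 0.03, 0.02],
              [0.10, 0.80, 0.05, 0.05],
              [0.00, 0.00, 1.00, 0.00],
              [0.00, 0.00, 0.00, 1.00]])
h = reach_before_exit_prob(P, forbidden={2}, outside={3})
p = 0.2
print("reach probability from state 0:", h[0], "p-safe:", h[0] < p)
```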

Synthesizing a supervisor (a Finite State Machine) instead of finding a standard policy in MDPs, applied to multi-agent systems

B. Wu, X. Zhang and H. Lin, Permissive Supervisor Synthesis for Markov Decision Processes Through Learning, IEEE Transactions on Automatic Control, vol. 64, no. 8, pp. 3332-3338, Aug. 2019. DOI: 10.1109/TAC.2018.2879505.

This paper considers permissive supervisor synthesis for probabilistic systems modeled as Markov Decision Processes (MDPs). Such systems are prevalent in power grids, transportation networks, communication networks, and robotics. We propose a novel supervisor synthesis framework using automata learning and compositional model checking to generate the permissive local supervisors in a distributed manner. With recent advances in assume-guarantee reasoning verification for MDPs, constructing the composed system can be avoided to alleviate state space explosion. Our framework learns the supervisors iteratively using counterexamples from the verification and is guaranteed to terminate in a finite number of steps and to be correct.
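
As I read the abstract, the core of the framework is a verify-and-refine loop. The sketch below is only a schematic of that loop; the four arguments (local MDPs, specification, model checker, automaton learner) are placeholders the reader would supply, not interfaces from the paper:

```python
def synthesize_supervisors(local_mdps, spec, verify, learn_fsm):
    """Schematic counterexample-guided loop (all arguments are hypothetical callables)."""
    examples = []                                                 # traces labelled so far
    supervisors = [learn_fsm(examples) for _ in local_mdps]       # initial hypothesis FSMs
    while True:
        ok, counterexample = verify(local_mdps, supervisors, spec)  # assume-guarantee check
        if ok:
            return supervisors                                    # permissive local supervisors
        examples.append(counterexample)
        supervisors = [learn_fsm(examples) for _ in local_mdps]   # refine from the counterexample
```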

An application of MDPs to UAV collision-free navigation with an interesting taxonomy of the state-of-the-art

Xiang Yu, Xiaobin Zhou, Youmin Zhang, Collision-Free Trajectory Generation and Tracking for UAVs Using Markov Decision Process in a Cluttered Environment, Journal of Intelligent & Robotic Systems, 2019, 93:17–32. DOI: 10.1007/s10846-018-0802-z.

A collision-free trajectory generation and tracking method capable of re-planning unmanned aerial vehicle (UAV) trajectories can increase flight safety and decrease the possibility of mission failures. In this paper, a Markov decision process (MDP) based algorithm combined with a backtracking method is presented to create a safe trajectory in hostile environments. Subsequently, a differential flatness method is adopted to smooth the profile of the rerouted trajectory so as to satisfy the UAV’s physical constraints. Lastly, a flight controller based on passivity-based control (PBC) is designed to maintain the UAV’s stability and trajectory tracking performance. Simulation results demonstrate that a UAV using the proposed strategy is capable of avoiding obstacles in a hostile environment.
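
For intuition only, here is a toy grid-world value iteration that plans around obstacle cells under noisy motion. It is a generic MDP planning sketch with made-up numbers, not the paper's algorithm (which adds backtracking, differential-flatness smoothing and passivity-based tracking control):

```python
import numpy as np

H, W = 5, 5
obstacles = {(1, 2), (2, 2), (3, 2)}
goal = (4, 4)
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
gamma, p_intended = 0.95, 0.9          # 10% chance the vehicle drifts to a random neighbour

def step(s, m):
    r, c = s[0] + m[0], s[1] + m[1]
    return (r, c) if 0 <= r < H and 0 <= c < W else s   # hitting the boundary: stay put

def reward(s):
    return 10.0 if s == goal else (-50.0 if s in obstacles else -1.0)

V = np.zeros((H, W))
for _ in range(200):                   # value iteration
    Vn = np.empty_like(V)
    for r in range(H):
        for c in range(W):
            s = (r, c)
            q = [reward(s) + gamma * (p_intended * V[step(s, m)]
                 + (1 - p_intended) * np.mean([V[step(s, d)] for d in moves]))
                 for m in moves]
            Vn[r, c] = max(q)
    V = Vn

# Greedy move per cell; following it from the start cell yields a collision-averse route.
policy = {(r, c): max(moves, key=lambda m: V[step((r, c), m)])
          for r in range(H) for c in range(W)}
print(policy[(0, 0)])
```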

Considering the robot and all the intermediate objects that participate in the manipulation of another object as an MDP

Yilun Zhou, Benjamin Burchfiel, George Konidaris, Representing, learning, and controlling complex object interactions, Autonomous Robots, Volume 42, Issue 7, pp 1355–1367, DOI: 10.1007/s1051.

We present a framework for representing scenarios with complex object interactions, where a robot cannot directly interact with the object it wishes to control and must instead influence it via intermediate objects. For instance, a robot learning to drive a car can only change the car’s pose indirectly via the steering wheel, and must represent and reason about the relationship between its own grippers and the steering wheel, and the relationship between the steering wheel and the car. We formalize these interactions as chains and graphs of Markov decision processes (MDPs) and show how such models can be learned from data. We also consider how they can be controlled given known or learned dynamics. We show that our complex model can be collapsed into a single MDP and solved to find an optimal policy for the combined system. Since the resulting MDP may be very large, we also introduce a planning algorithm that efficiently produces a potentially suboptimal policy. We apply these models to two systems in which a robot uses learning from demonstration to achieve indirect control: playing a computer game using a joystick, and using a hot water dispenser to heat a cup of water.
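
A small sketch of the "collapse into a single MDP" step for a two-link chain, under my own simplifying assumption that the first object's state plays the role of the action applied to the second object (tabular dictionaries, illustrative only; not the paper's representation):

```python
import itertools

def collapse_chain(T1, T2):
    """Collapse a two-link MDP chain into one composite MDP.
    T1[s1][a]  -> {s1': prob}  (robot action a drives object 1)
    T2[s2][s1] -> {s2': prob}  (object 1's state drives object 2)
    Returns T[(s1, s2)][a] -> {(s1', s2'): prob}."""
    T = {}
    for s1, s2 in itertools.product(T1, T2):
        T[(s1, s2)] = {}
        for a in T1[s1]:
            dist = {}
            for s1p, p1 in T1[s1][a].items():
                for s2p, p2 in T2[s2][s1].items():   # s1 acts as the "action" on object 2
                    dist[(s1p, s2p)] = dist.get((s1p, s2p), 0.0) + p1 * p2
            T[(s1, s2)][a] = dist
    return T

# Tiny invented example: one robot action, two states per object.
T1 = {0: {"push": {0: 0.2, 1: 0.8}}, 1: {"push": {1: 1.0}}}
T2 = {0: {0: {0: 1.0}, 1: {0: 0.3, 1: 0.7}}, 1: {0: {1: 1.0}, 1: {1: 1.0}}}
print(collapse_chain(T1, T2)[(0, 0)]["push"])
```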

Solving MDPs with discounted rewards to minimize variance instead of the expected (discounted) reward

Li Xia, Mean–variance optimization of discrete time discounted Markov decision processes, Automatica, Volume 88, 2018, Pages 76-82, DOI: 10.1016/j.automatica.2017.11.012.

In this paper, we study a mean–variance optimization problem in an infinite-horizon discrete-time discounted Markov decision process (MDP). The objective is to minimize the variance of system rewards under a constraint on mean performance. Unlike most works in the literature, which require the mean performance to already be optimal, we allow the discounted performance to equal any constant. The difficulty of this problem stems from the quadratic form of the variance function, which makes the variance minimization problem not a standard MDP. By proving the decomposable structure of the feasible policy space, we transform this constrained variance minimization problem into an equivalent unconstrained MDP with a new discounted criterion and a new reward function. The difference between the variances of the Markov chains under any two feasible policies is quantified by a difference formula. Based on this variance difference formula, a policy iteration algorithm is developed to find the optimal policy. We also prove the optimality of deterministic policies over the randomized policies generated in the mean-constrained policy space. Numerical experiments demonstrate the effectiveness of our approach.
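
In symbols (my notation, reconstructed from the abstract rather than taken from the paper), the problem is to pin the discounted mean to a prescribed constant and minimize the variance around it:

```latex
\begin{align*}
J_\pi &= \mathbb{E}_\pi\!\Big[\sum_{t=0}^{\infty}\beta^{t}\, r(s_t,a_t)\Big]
  && \text{(discounted mean performance)}\\
V_\pi &= \mathbb{E}_\pi\!\Big[\Big(\sum_{t=0}^{\infty}\beta^{t}\, r(s_t,a_t) - J_\pi\Big)^{2}\Big]
  && \text{(variance of the discounted reward)}\\
&\min_{\pi}\; V_\pi \quad \text{s.t.}\quad J_\pi = \lambda
  && \text{(for any prescribed constant } \lambda\text{)}
\end{align*}
```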

A good introduction to actor-critic methods and model-free decision making in MDPs

J. Wang and I. C. Paschalidis, “An Actor-Critic Algorithm With Second-Order Actor and Critic,” in IEEE Transactions on Automatic Control, vol. 62, no. 6, pp. 2689-2703, June 2017. DOI: 10.1109/TAC.2016.2616384.

Actor-critic algorithms solve dynamic decision making problems by optimizing a performance metric of interest over a user-specified parametric class of policies. They employ a combination of an actor, making policy improvement steps, and a critic, computing policy improvement directions. Many existing algorithms use a steepest ascent method to improve the policy, which is known to suffer from slow convergence for ill-conditioned problems. In this paper, we first develop an estimate of the (Hessian) matrix containing the second derivatives of the performance metric with respect to policy parameters. Using this estimate, we introduce a new second-order policy improvement method and couple it with a critic using a second-order learning method. We establish almost sure convergence of the new method to a neighborhood of a policy parameter stationary point. We compare the new algorithm with some existing algorithms in two applications and demonstrate that it leads to significantly faster convergence.
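
A toy sketch of the "second-order policy improvement" idea on a bandit stand-in: the plain gradient estimate is preconditioned by an estimated curvature matrix before the update. Note that the curvature used here is a Fisher-information estimate (a natural-gradient-style substitute), not the paper's Hessian of the performance metric, and no second-order critic is modelled:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.9])       # 3-armed bandit stand-in for the MDP
theta = np.zeros(3)                          # softmax policy parameters

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(theta)
    grads, curvs = [], []
    for _ in range(32):                      # mini-batch of sampled actions
        a = rng.choice(3, p=probs)
        reward = rng.normal(true_means[a], 0.1)
        score = -probs                       # gradient of log softmax policy ...
        score[a] += 1.0                      # ... with respect to theta, at action a
        grads.append(reward * score)
        curvs.append(np.outer(score, score)) # Fisher-style curvature estimate
    g = np.mean(grads, axis=0)
    F = np.mean(curvs, axis=0) + 1e-3 * np.eye(3)
    theta += 0.5 * np.linalg.solve(F, g)     # preconditioned (second-order) ascent step

print("learned action probabilities:", softmax(theta))
```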

Modelling hierarchical stochastic signals (i.e., signals that decompose hierarchically into sub-signals)

Truyen Tran, Dinh Phung, Hung Bui, Svetha Venkatesh, Hierarchical semi-Markov conditional random fields for deep recursive sequential data, Artificial Intelligence, Volume 246, May 2017, Pages 53-85, ISSN 0004-3702, DOI: 10.1016/j.artint.2017.02.003.

We present the hierarchical semi-Markov conditional random field (HSCRF), a generalisation of linear-chain conditional random fields to model deep nested Markov processes. It is parameterised as a conditional log-linear model and has polynomial time algorithms for learning and inference. We derive algorithms for partially-supervised learning and constrained inference. We develop numerical scaling procedures that handle the overflow problem. We show that when depth is two, the HSCRF can be reduced to the semi-Markov conditional random fields. Finally, we demonstrate the HSCRF on two applications: (i) recognising human activities of daily living (ADLs) from indoor surveillance cameras, and (ii) noun-phrase chunking. The HSCRF is capable of learning rich hierarchical models with reasonable accuracy in both fully and partially observed data cases.
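
For readers unfamiliar with the family, the generic conditional log-linear (CRF) form that the HSCRF specialises looks like this (my notation; in the HSCRF the cliques c range over hierarchically nested segments rather than adjacent label pairs):

```latex
p(\mathbf{y}\mid\mathbf{x};\mathbf{w})
  = \frac{1}{Z(\mathbf{x};\mathbf{w})}
    \exp\!\Big(\sum_{c}\mathbf{w}^{\top}\mathbf{f}_{c}(\mathbf{y}_{c},\mathbf{x})\Big),
\qquad
Z(\mathbf{x};\mathbf{w})
  = \sum_{\mathbf{y}'}\exp\!\Big(\sum_{c}\mathbf{w}^{\top}\mathbf{f}_{c}(\mathbf{y}'_{c},\mathbf{x})\Big).
```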

Implementation of affect in artificial systems through MDPs

Jesse Hoey, Tobias Schröder, Areej Alhothali, Affect control processes: Intelligent affective interaction using a partially observable Markov decision process, Artificial Intelligence, Volume 230, January 2016, Pages 134-172, DOI: 10.1016/j.artint.2015.09.004.

This paper describes a novel method for building affectively intelligent human-interactive agents. The method is based on a key sociological insight that has been developed and extensively verified over the last twenty years, but has yet to make an impact in artificial intelligence. The insight is that resource bounded humans will, by default, act to maintain affective consistency. Humans have culturally shared fundamental affective sentiments about identities, behaviours, and objects, and they act so that the transient affective sentiments created during interactions confirm the fundamental sentiments. Humans seek and create situations that confirm or are consistent with, and avoid and suppress situations that disconfirm or are inconsistent with, their culturally shared affective sentiments. This “affect control principle” has been shown to be a powerful predictor of human behaviour. In this paper, we present a probabilistic and decision-theoretic generalisation of this principle, and we demonstrate how it can be leveraged to build affectively intelligent artificial agents. The new model, called BayesAct, can maintain multiple hypotheses about sentiments simultaneously as a probability distribution, and can make use of an explicit utility function to make value-directed action choices. This allows the model to generate affectively intelligent interactions with people by learning about their identity, predicting their behaviours using the affect control principle, and taking actions that are simultaneously goal-directed and affect-sensitive. We demonstrate this generalisation with a set of simulations. We then show how our model can be used as an emotional “plug-in” for artificially intelligent systems that interact with humans in two different settings: an exam practice assistant (tutor) and an assistive device for persons with a cognitive disability.
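
As a cartoon of the "multiple hypotheses as a probability distribution plus value-directed choice" idea, here is a tiny discrete belief update followed by an expected-utility decision. BayesAct itself works in a continuous sentiment space with learned dynamics; every number and label below is invented for illustration:

```python
import numpy as np

identities = ["friendly", "hostile"]
belief = np.array([0.5, 0.5])                      # multiple hypotheses as a distribution
p_obs_given_id = {"greets": np.array([0.8, 0.2])}  # assumed observation model

obs = "greets"
belief = belief * p_obs_given_id[obs]
belief /= belief.sum()                             # Bayesian belief update

utility = np.array([[ 1.0, -2.0],                  # U[action, identity] (assumed numbers)
                    [ 0.2,  0.5]])                 # actions: 0 = approach, 1 = keep distance
expected_u = utility @ belief
print("belief:", dict(zip(identities, belief)))
print("chosen action:", ["approach", "keep distance"][int(np.argmax(expected_u))])
```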

Using MDPs when the transition probability matrix is only partially specified, thereby getting closer to a model-free approach

Karina V. Delgado, Leliane N. de Barros, Daniel B. Dias, Scott Sanner, Real-time dynamic programming for Markov decision processes with imprecise probabilities, Artificial Intelligence, Volume 230, January 2016, Pages 192-223, ISSN 0004-3702, DOI: 10.1016/j.artint.2015.09.005.

Markov Decision Processes have become the standard model for probabilistic planning. However, when applied to many practical problems, the estimates of transition probabilities are inaccurate. This may be due to conflicting elicitations from experts or insufficient state transition information. The Markov Decision Process with Imprecise Transition Probabilities (MDP-IP) was introduced to obtain a robust policy when there is uncertainty in the transitions. Although a symbolic dynamic programming algorithm for MDP-IPs (called SPUDD-IP) has been proposed that can solve problems with up to 22 state variables, in practice solving MDP-IP problems is time-consuming. In this paper we propose efficient algorithms for a more general class of MDP-IPs, called Stochastic Shortest Path MDP-IPs (SSP MDP-IPs), that use initial state information to solve complex problems by focusing on reachable states. The (L)RTDP-IP algorithm, a (Labeled) Real-Time Dynamic Programming algorithm for SSP MDP-IPs, is proposed together with three different methods for sampling the next state. It is shown here that the convergence of (L)RTDP-IP can be obtained using any of these three methods, although the Bellman backups for this class of problems prescribe a minimax optimization. As far as we are aware, this is the first asynchronous algorithm for SSP MDP-IPs given in terms of a general set of probability constraints that requires non-linear optimization over the imprecise probabilities in the Bellman backup. Our results show up to three orders of magnitude speedup for (L)RTDP-IP when compared with the SPUDD-IP algorithm.
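
The minimax Bellman backup mentioned in the abstract has, in my notation, the following schematic form, where \(\mathcal{K}(s,a)\) is the credal set of transition distributions consistent with the imprecise-probability constraints and \(c(s,a)\) is the stage cost:

```latex
V(s) \;=\; \min_{a \in A(s)} \; \max_{P(\cdot\mid s,a)\,\in\,\mathcal{K}(s,a)}
  \Big[\, c(s,a) + \sum_{s'} P(s'\mid s,a)\, V(s') \,\Big].
```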

See also:

  • Karina Valdivia Delgado, Scott Sanner, Leliane Nunes de Barros, Efficient solutions to factored MDPs with imprecise transition probabilities, Artif. Intell. 175 (9–10) (2011) 1498–1527.
  • Satia, J. K., and Lave Jr., R. E. 1970. MDPs with uncertain transition probabilities. Operations Research 21:728–740.
  • White III, C. C., and El-Deib, H. K. 1994. MDPs with Imprecise Transition Probabilities. Operations Research 42(4):739–749.

Planning tasks for mobile robots with MDPs that maximize the probability of satisfying the user’s requirements, specified through temporal logic, estimating transition probabilities through simulation only when needed

Jing Wang, Xuchu Ding, Morteza Lahijanian, Ioannis Ch. Paschalidis, and Calin A. Belta, Temporal logic motion control using actor–critic methods, The International Journal of Robotics Research, September 2015, 34: 1329-1344, first published on May 26, 2015. DOI: 10.1177/0278364915581505.

This paper considers the problem of deploying a robot from a specification given as a temporal logic statement about some properties satisfied by the regions of a large, partitioned environment. We assume that the robot has noisy sensors and actuators and model its motion through the regions of the environment as a Markov decision process (MDP). The robot control problem becomes finding the control policy which maximizes the probability of satisfying the temporal logic task on the MDP. For a large environment, obtaining transition probabilities for each state–action pair, as well as solving the necessary optimization problem for the optimal policy, are computationally intensive. To address these issues, we propose an approximate dynamic programming framework based on a least-squares temporal difference learning method of the actor–critic type. This framework operates on sample paths of the robot and optimizes a randomized control policy with respect to a small set of parameters. The transition probabilities are obtained only when needed. Simulations confirm that convergence of the parameters translates to an approximately optimal policy.
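
In compact form (my notation), the control objective optimised by the actor-critic scheme is

```latex
\max_{\theta}\;\; \mathrm{Pr}^{\,\pi_\theta}_{\mathcal{M}}\!\left[\,\sigma \models \phi\,\right],
```

where \(\mathcal{M}\) is the MDP modelling the robot's motion through the partitioned environment, \(\phi\) the temporal logic specification, \(\pi_\theta\) the randomized control policy over the small parameter vector \(\theta\), and \(\sigma\) the random run that \(\pi_\theta\) induces on \(\mathcal{M}\).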