Tag Archives: MDPs

Safety in MDPs by measuring the probability of reaching dangerous states

Rafal Wisniewski, Luminita-Manuela Bujorianu, Safety of stochastic systems: An analytic and computational approach, Automatica, Volume 133, 2021. DOI: 10.1016/j.automatica.2021.109839.

We refine the concept of stochastic reach avoidance for a general class of Markov processes by introducing a threshold p for the reaching probability. This new problem is called p-safety, and it aims to ensure that the probability that the given process reaches a forbidden set before leaving its ‘working’ state space is less than p. When an initial probability measure characterizes the initial states, a variant of p-safety, which we call weak p-safety, is put forward. In this work, we characterize both p-safety and weak p-safety and show how to compute them. We employ semi-definite programming to compute p-safety and linear programming to compute weak p-safety. To this end, we use certificates of positivity of polynomials translated into sum-of-squares and Bernstein forms.
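
The paper works with general (continuous-state) Markov processes and certifies the probability bound with sum-of-squares and Bernstein certificates via SDP/LP. Purely as an illustration of the quantity being bounded, here is a finite-state sketch (the chain, the numbers and the fixed-point iteration are my own assumptions, not the authors' method) that estimates the probability of hitting a forbidden set before leaving the working set and compares it with the threshold p:

```python
import numpy as np

def reach_before_exit_prob(P, forbidden, outside, iters=10_000, tol=1e-12):
    """Probability, from each state, of hitting `forbidden` before
    entering `outside` (i.e., before leaving the working space)."""
    h = np.zeros(P.shape[0])
    h[list(forbidden)] = 1.0
    for _ in range(iters):
        h_new = P @ h
        h_new[list(forbidden)] = 1.0   # forbidden states: already unsafe
        h_new[list(outside)] = 0.0     # left the working space first: counts as safe
        if np.max(np.abs(h_new - h)) < tol:
            break
        h = h_new
    return h

# Toy 4-state chain: states 0-1 form the working space, 2 is forbidden, 3 is "outside".
P = np.array([[0.90, 0.05, 0.03, 0.02],
              [0.10, 0.80, 0.05, 0.05],
              [0.00, 0.00, 1.00, 0.00],
              [0.00, 0.00, 0.00, 1.00]])
h = reach_before_exit_prob(P, forbidden={2}, outside={3})
p = 0.2
print("reach probability from state 0:", h[0], "p-safe:", h[0] < p)
```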

Synthesizing a supervisor (a Finite State Machine) instead of finding a standard policy in MDPs, applied to multi-agent systems

B. Wu, X. Zhang and H. Lin, Permissive Supervisor Synthesis for Markov Decision Processes Through Learning, IEEE Transactions on Automatic Control, vol. 64, no. 8, pp. 3332-3338, Aug. 2019. DOI: 10.1109/TAC.2018.2879505.

This paper considers permissive supervisor synthesis for probabilistic systems modeled as Markov Decision Processes (MDPs). Such systems are prevalent in power grids, transportation networks, communication networks, and robotics. We propose a novel supervisor synthesis framework using automata learning and compositional model checking to generate the permissive local supervisors in a distributed manner. With recent advances in assume-guarantee reasoning verification for MDPs, constructing the composed system can be avoided to alleviate state space explosion. Our framework learns the supervisors iteratively using counterexamples from the verification and is guaranteed to terminate in a finite number of steps and to be correct.
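
As I read the abstract, the core of the framework is a verify-and-refine loop. The sketch below is only a schematic of that loop; the four arguments (local MDPs, specification, model checker, automaton learner) are placeholders the reader would supply, not interfaces from the paper:

```python
def synthesize_supervisors(local_mdps, spec, verify, learn_fsm):
    """Schematic counterexample-guided loop (all arguments are hypothetical callables)."""
    examples = []                                                 # traces labelled so far
    supervisors = [learn_fsm(examples) for _ in local_mdps]       # initial hypothesis FSMs
    while True:
        ok, counterexample = verify(local_mdps, supervisors, spec)  # assume-guarantee check
        if ok:
            return supervisors                                    # permissive local supervisors
        examples.append(counterexample)
        supervisors = [learn_fsm(examples) for _ in local_mdps]   # refine from the counterexample
```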

An application of MDPs to UAV collision-free navigation with an interesting taxonomy of the state-of-the-art

Xiang Yu, Xiaobin Zhou, Youmin Zhang, Collision-Free Trajectory Generation and Tracking for UAVs Using Markov Decision Process in a Cluttered Environment, Journal of Intelligent & Robotic Systems, 2019, 93:17–32. DOI: 10.1007/s10846-018-0802-z.

A collision-free trajectory generation and tracking method capable of re-planning unmanned aerial vehicle (UAV) trajectories can increase flight safety and decrease the possibility of mission failures. In this paper, a Markov decision process (MDP) based algorithm combined with a backtracking method is presented to create a safe trajectory in hostile environments. Subsequently, a differential flatness method is adopted to smooth the profile of the rerouted trajectory so as to satisfy the UAV’s physical constraints. Lastly, a flight controller based on passivity-based control (PBC) is designed to maintain the UAV’s stability and trajectory tracking performance. Simulation results demonstrate that a UAV using the proposed strategy is capable of avoiding obstacles in a hostile environment.
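
For intuition only, here is a toy grid-world value iteration that plans around obstacle cells under noisy motion. It is a generic MDP planning sketch with made-up numbers, not the paper's algorithm (which adds backtracking, differential-flatness smoothing and passivity-based tracking control):

```python
import numpy as np

H, W = 5, 5
obstacles = {(1, 2), (2, 2), (3, 2)}
goal = (4, 4)
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
gamma, p_intended = 0.95, 0.9          # 10% chance the vehicle drifts to a random neighbour

def step(s, m):
    r, c = s[0] + m[0], s[1] + m[1]
    return (r, c) if 0 <= r < H and 0 <= c < W else s   # hitting the boundary: stay put

def reward(s):
    return 10.0 if s == goal else (-50.0 if s in obstacles else -1.0)

V = np.zeros((H, W))
for _ in range(200):                   # value iteration
    Vn = np.empty_like(V)
    for r in range(H):
        for c in range(W):
            s = (r, c)
            q = [reward(s) + gamma * (p_intended * V[step(s, m)]
                 + (1 - p_intended) * np.mean([V[step(s, d)] for d in moves]))
                 for m in moves]
            Vn[r, c] = max(q)
    V = Vn

# Greedy move per cell; following it from the start cell yields a collision-averse route.
policy = {(r, c): max(moves, key=lambda m: V[step((r, c), m)])
          for r in range(H) for c in range(W)}
print(policy[(0, 0)])
```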

Considering the robot and all the intermediate objects that participate in the manipulation of another object as an MDP

Yilun Zhou, Benjamin Burchfiel, George Konidaris, Representing, learning, and controlling complex object interactions, Autonomous Robots, Volume 42, Issue 7, pp 1355–1367, DOI: 10.1007/s1051.

We present a framework for representing scenarios with complex object interactions, where a robot cannot directly interact with the object it wishes to control and must instead influence it via intermediate objects. For instance, a robot learning to drive a car can only change the car’s pose indirectly via the steering wheel, and must represent and reason about the relationship between its own grippers and the steering wheel, and the relationship between the steering wheel and the car. We formalize these interactions as chains and graphs of Markov decision processes (MDPs) and show how such models can be learned from data. We also consider how they can be controlled given known or learned dynamics. We show that our complex model can be collapsed into a single MDP and solved to find an optimal policy for the combined system. Since the resulting MDP may be very large, we also introduce a planning algorithm that efficiently produces a potentially suboptimal policy. We apply these models to two systems in which a robot uses learning from demonstration to achieve indirect control: playing a computer game using a joystick, and using a hot water dispenser to heat a cup of water.
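
A small sketch of the "collapse into a single MDP" step for a two-link chain, under my own simplifying assumption that the first object's state plays the role of the action applied to the second object (tabular dictionaries, illustrative only; not the paper's representation):

```python
import itertools

def collapse_chain(T1, T2):
    """Collapse a two-link MDP chain into one composite MDP.
    T1[s1][a]  -> {s1': prob}  (robot action a drives object 1)
    T2[s2][s1] -> {s2': prob}  (object 1's state drives object 2)
    Returns T[(s1, s2)][a] -> {(s1', s2'): prob}."""
    T = {}
    for s1, s2 in itertools.product(T1, T2):
        T[(s1, s2)] = {}
        for a in T1[s1]:
            dist = {}
            for s1p, p1 in T1[s1][a].items():
                for s2p, p2 in T2[s2][s1].items():   # s1 acts as the "action" on object 2
                    dist[(s1p, s2p)] = dist.get((s1p, s2p), 0.0) + p1 * p2
            T[(s1, s2)][a] = dist
    return T

# Tiny invented example: one robot action, two states per object.
T1 = {0: {"push": {0: 0.2, 1: 0.8}}, 1: {"push": {1: 1.0}}}
T2 = {0: {0: {0: 1.0}, 1: {0: 0.3, 1: 0.7}}, 1: {0: {1: 1.0}, 1: {1: 1.0}}}
print(collapse_chain(T1, T2)[(0, 0)]["push"])
```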

Solving MDPs with discounted rewards to minimize variance instead of the expected (discounted) reward

Li Xia, Mean–variance optimization of discrete time discounted Markov decision processes, Automatica, Volume 88, 2018, Pages 76-82, DOI: 10.1016/j.automatica.2017.11.012.

In this paper, we study a mean–variance optimization problem in an infinite-horizon discrete-time discounted Markov decision process (MDP). The objective is to minimize the variance of system rewards under a constraint on mean performance. Unlike most works in the literature, which require the mean performance to already be optimal, we allow the discounted performance to equal any constant. The difficulty of this problem stems from the quadratic form of the variance function, which makes the variance minimization problem not a standard MDP. By proving the decomposable structure of the feasible policy space, we transform this constrained variance minimization problem into an equivalent unconstrained MDP with a new discounted criterion and a new reward function. The difference between the variances of the Markov chains under any two feasible policies is quantified by a difference formula. Based on this variance difference formula, a policy iteration algorithm is developed to find the optimal policy. We also prove the optimality of deterministic policies over the randomized policies generated in the mean-constrained policy space. Numerical experiments demonstrate the effectiveness of our approach.
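
In symbols (my notation, reconstructed from the abstract rather than taken from the paper), the problem is to pin the discounted mean to a prescribed constant and minimize the variance around it:

```latex
\begin{align*}
J_\pi &= \mathbb{E}_\pi\!\Big[\sum_{t=0}^{\infty}\beta^{t}\, r(s_t,a_t)\Big]
  && \text{(discounted mean performance)}\\
V_\pi &= \mathbb{E}_\pi\!\Big[\Big(\sum_{t=0}^{\infty}\beta^{t}\, r(s_t,a_t) - J_\pi\Big)^{2}\Big]
  && \text{(variance of the discounted reward)}\\
&\min_{\pi}\; V_\pi \quad \text{s.t.}\quad J_\pi = \lambda
  && \text{(for any prescribed constant } \lambda\text{)}
\end{align*}
```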

A good introduction to actor-critic methods and model-free decision making in MDPs

J. Wang and I. C. Paschalidis, “An Actor-Critic Algorithm With Second-Order Actor and Critic,” in IEEE Transactions on Automatic Control, vol. 62, no. 6, pp. 2689-2703, June 2017. DOI: 10.1109/TAC.2016.2616384.

Actor-critic algorithms solve dynamic decision making problems by optimizing a performance metric of interest over a user-specified parametric class of policies. They employ a combination of an actor, making policy improvement steps, and a critic, computing policy improvement directions. Many existing algorithms use a steepest ascent method to improve the policy, which is known to suffer from slow convergence for ill-conditioned problems. In this paper, we first develop an estimate of the (Hessian) matrix containing the second derivatives of the performance metric with respect to policy parameters. Using this estimate, we introduce a new second-order policy improvement method and couple it with a critic using a second-order learning method. We establish almost sure convergence of the new method to a neighborhood of a policy parameter stationary point. We compare the new algorithm with some existing algorithms in two applications and demonstrate that it leads to significantly faster convergence.
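
A toy sketch of the "second-order policy improvement" idea on a bandit stand-in: the plain gradient estimate is preconditioned by an estimated curvature matrix before the update. Note that the curvature used here is a Fisher-information estimate (a natural-gradient-style substitute), not the paper's Hessian of the performance metric, and no second-order critic is modelled:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.9])       # 3-armed bandit stand-in for the MDP
theta = np.zeros(3)                          # softmax policy parameters

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(theta)
    grads, curvs = [], []
    for _ in range(32):                      # mini-batch of sampled actions
        a = rng.choice(3, p=probs)
        reward = rng.normal(true_means[a], 0.1)
        score = -probs                       # gradient of log softmax policy ...
        score[a] += 1.0                      # ... with respect to theta, at action a
        grads.append(reward * score)
        curvs.append(np.outer(score, score)) # Fisher-style curvature estimate
    g = np.mean(grads, axis=0)
    F = np.mean(curvs, axis=0) + 1e-3 * np.eye(3)
    theta += 0.5 * np.linalg.solve(F, g)     # preconditioned (second-order) ascent step

print("learned action probabilities:", softmax(theta))
```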

Modelling hierarchical stochastic signals (i.e., signals that decompose hierarchically into sub-signals)

Truyen Tran, Dinh Phung, Hung Bui, Svetha Venkatesh, Hierarchical semi-Markov conditional random fields for deep recursive sequential data, Artificial Intelligence, Volume 246, May 2017, Pages 53-85, ISSN 0004-3702, DOI: 10.1016/j.artint.2017.02.003.

We present the hierarchical semi-Markov conditional random field (HSCRF), a generalisation of linear-chain conditional random fields to model deep nested Markov processes. It is parameterised as a conditional log-linear model and has polynomial time algorithms for learning and inference. We derive algorithms for partially-supervised learning and constrained inference. We develop numerical scaling procedures that handle the overflow problem. We show that when depth is two, the HSCRF can be reduced to the semi-Markov conditional random fields. Finally, we demonstrate the HSCRF on two applications: (i) recognising human activities of daily living (ADLs) from indoor surveillance cameras, and (ii) noun-phrase chunking. The HSCRF is capable of learning rich hierarchical models with reasonable accuracy in both fully and partially observed data cases.
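
For readers unfamiliar with the family, the generic conditional log-linear (CRF) form that the HSCRF specialises looks like this (my notation; in the HSCRF the cliques c range over hierarchically nested segments rather than adjacent label pairs):

```latex
p(\mathbf{y}\mid\mathbf{x};\mathbf{w})
  = \frac{1}{Z(\mathbf{x};\mathbf{w})}
    \exp\!\Big(\sum_{c}\mathbf{w}^{\top}\mathbf{f}_{c}(\mathbf{y}_{c},\mathbf{x})\Big),
\qquad
Z(\mathbf{x};\mathbf{w})
  = \sum_{\mathbf{y}'}\exp\!\Big(\sum_{c}\mathbf{w}^{\top}\mathbf{f}_{c}(\mathbf{y}'_{c},\mathbf{x})\Big).
```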

Implementation of affect in artificial systems through MDPs

Jesse Hoey, Tobias Schröder, Areej Alhothali, Affect control processes: Intelligent affective interaction using a partially observable Markov decision process, Artificial Intelligence, Volume 230, January 2016, Pages 134-172, DOI: 10.1016/j.artint.2015.09.004.

This paper describes a novel method for building affectively intelligent human-interactive agents. The method is based on a key sociological insight that has been developed and extensively verified over the last twenty years, but has yet to make an impact in artificial intelligence. The insight is that resource bounded humans will, by default, act to maintain affective consistency. Humans have culturally shared fundamental affective sentiments about identities, behaviours, and objects, and they act so that the transient affective sentiments created during interactions confirm the fundamental sentiments. Humans seek and create situations that confirm or are consistent with, and avoid and suppress situations that disconfirm or are inconsistent with, their culturally shared affective sentiments. This “affect control principle” has been shown to be a powerful predictor of human behaviour. In this paper, we present a probabilistic and decision-theoretic generalisation of this principle, and we demonstrate how it can be leveraged to build affectively intelligent artificial agents. The new model, called BayesAct, can maintain multiple hypotheses about sentiments simultaneously as a probability distribution, and can make use of an explicit utility function to make value-directed action choices. This allows the model to generate affectively intelligent interactions with people by learning about their identity, predicting their behaviours using the affect control principle, and taking actions that are simultaneously goal-directed and affect-sensitive. We demonstrate this generalisation with a set of simulations. We then show how our model can be used as an emotional “plug-in” for artificially intelligent systems that interact with humans in two different settings: an exam practice assistant (tutor) and an assistive device for persons with a cognitive disability.
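
As a cartoon of the "multiple hypotheses as a probability distribution plus value-directed choice" idea, here is a tiny discrete belief update followed by an expected-utility decision. BayesAct itself works in a continuous sentiment space with learned dynamics; every number and label below is invented for illustration:

```python
import numpy as np

identities = ["friendly", "hostile"]
belief = np.array([0.5, 0.5])                      # multiple hypotheses as a distribution
p_obs_given_id = {"greets": np.array([0.8, 0.2])}  # assumed observation model

obs = "greets"
belief = belief * p_obs_given_id[obs]
belief /= belief.sum()                             # Bayesian belief update

utility = np.array([[ 1.0, -2.0],                  # U[action, identity] (assumed numbers)
                    [ 0.2,  0.5]])                 # actions: 0 = approach, 1 = keep distance
expected_u = utility @ belief
print("belief:", dict(zip(identities, belief)))
print("chosen action:", ["approach", "keep distance"][int(np.argmax(expected_u))])
```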

Using MDPs when the transition probability matrix is only partially specified, thereby getting closer to a model-free approach

Karina V. Delgado, Leliane N. de Barros, Daniel B. Dias, Scott Sanner, Real-time dynamic programming for Markov decision processes with imprecise probabilities, Artificial Intelligence, Volume 230, January 2016, Pages 192-223, ISSN 0004-3702, DOI: 10.1016/j.artint.2015.09.005.

Markov Decision Processes have become the standard model for probabilistic planning. However, when applied to many practical problems, the estimates of transition probabilities are inaccurate. This may be due to conflicting elicitations from experts or insufficient state transition information. The Markov Decision Process with Imprecise Transition Probabilities (MDP-IP) was introduced to obtain a robust policy when there is uncertainty in the transitions. Although a symbolic dynamic programming algorithm for MDP-IPs (called SPUDD-IP) has been proposed that can solve problems with up to 22 state variables, in practice solving MDP-IP problems is time-consuming. In this paper we propose efficient algorithms for a more general class of MDP-IPs, called Stochastic Shortest Path MDP-IPs (SSP MDP-IPs), that use initial state information to solve complex problems by focusing on reachable states. The (L)RTDP-IP algorithm, a (Labeled) Real-Time Dynamic Programming algorithm for SSP MDP-IPs, is proposed together with three different methods for sampling the next state. It is shown here that the convergence of (L)RTDP-IP can be obtained using any of these three methods, although the Bellman backups for this class of problems prescribe a minimax optimization. As far as we are aware, this is the first asynchronous algorithm for SSP MDP-IPs given in terms of a general set of probability constraints that requires non-linear optimization over the imprecise probabilities in the Bellman backup. Our results show up to three orders of magnitude speedup for (L)RTDP-IP when compared with the SPUDD-IP algorithm.
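
The minimax Bellman backup mentioned in the abstract has, in my notation, the following schematic form, where \(\mathcal{K}(s,a)\) is the credal set of transition distributions consistent with the imprecise-probability constraints and \(c(s,a)\) is the stage cost:

```latex
V(s) \;=\; \min_{a \in A(s)} \; \max_{P(\cdot\mid s,a)\,\in\,\mathcal{K}(s,a)}
  \Big[\, c(s,a) + \sum_{s'} P(s'\mid s,a)\, V(s') \,\Big].
```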

See also:

  • Karina Valdivia Delgado, Scott Sanner, Leliane Nunes de Barros, Efficient solutions to factored MDPs with imprecise transition probabilities, Artif. Intell. 175 (9–10) (2011) 1498–1527.
  • Satia, J. K., and Lave Jr., R. E. 1970. MDPs with uncertain transition probabilities. Operations Research 21:728–740.
  • White III, C. C., and El-Deib, H. K. 1994. MDPs with Imprecise Transition Probabilities. Operations Research 42(4):739–749.

Planning tasks for mobile robots with MDPs that maximize the probability of satisfying the user’s requirements, specified through temporal logic, estimating transition probabilities through simulation only when needed

Jing Wang, Xuchu Ding, Morteza Lahijanian, Ioannis Ch. Paschalidis, and Calin A. Belta, Temporal logic motion control using actor–critic methods, The International Journal of Robotics Research, September 2015, 34: 1329-1344, first published on May 26, 2015. DOI: 10.1177/0278364915581505.

This paper considers the problem of deploying a robot from a specification given as a temporal logic statement about some properties satisfied by the regions of a large, partitioned environment. We assume that the robot has noisy sensors and actuators and model its motion through the regions of the environment as a Markov decision process (MDP). The robot control problem becomes finding the control policy which maximizes the probability of satisfying the temporal logic task on the MDP. For a large environment, obtaining transition probabilities for each state–action pair, as well as solving the necessary optimization problem for the optimal policy, are computationally intensive. To address these issues, we propose an approximate dynamic programming framework based on a least-squares temporal difference learning method of the actor–critic type. This framework operates on sample paths of the robot and optimizes a randomized control policy with respect to a small set of parameters. The transition probabilities are obtained only when needed. Simulations confirm that convergence of the parameters translates to an approximately optimal policy.
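
In compact form (my notation), the control objective optimised by the actor-critic scheme is

```latex
\max_{\theta}\;\; \mathrm{Pr}^{\,\pi_\theta}_{\mathcal{M}}\!\left[\,\sigma \models \phi\,\right],
```

where \(\mathcal{M}\) is the MDP modelling the robot's motion through the partitioned environment, \(\phi\) the temporal logic specification, \(\pi_\theta\) the randomized control policy over the small parameter vector \(\theta\), and \(\sigma\) the random run that \(\pi_\theta\) induces on \(\mathcal{M}\).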