ABSTRACT
This chapter presents a dynamic programming approach to the solution of a stochastic decision process that can be described by a finite number of states. The transition probabilities between the states are described by a Markov chain. The reward structure of the process is described by a matrix whose individual elements represent the revenue (or cost) resulting from moving from one state to another.
Both the transition and revenue matrices depend on the decision alternatives
available to the decision maker. The objective of the problem is to determine the optimal
policy that maximizes the expected revenue of the process over a finite or infinite number
of stages.
INTRODUCTION
Markov decision processes (MDPs), named after Andrey Markov, provide a
mathematical framework for modeling decision making in situations where outcomes are
partly random and partly under the control of a decision maker. MDPs are useful for studying a
wide range of optimization problems solved via dynamic programming and reinforcement
learning. MDPs were known at least as early as the 1950s (cf. Bellman 1957). A core body of
research on Markov decision processes resulted from Ronald A. Howard's book published in
1960, Dynamic Programming and Markov Processes. They are used in a wide range of disciplines, including robotics, automated control, economics, and manufacturing.
More precisely, a Markov decision process is a discrete-time stochastic control process. At each time step, the process is in some state s, and the decision maker may choose any action a that is available in state s. The process responds at the next time step by randomly moving into a new state s', and giving the decision maker a corresponding reward R_a(s, s').
The probability that the process moves into its new state s' is influenced by the chosen action. Specifically, it is given by the state transition function P_a(s, s'). Thus, the next state s' depends on the current state s and the decision maker's action a. But given s and a, it is conditionally independent of all previous states and actions; in other words, the state transitions of an MDP possess the Markov property.
Markov decision processes are an extension of Markov chains; the difference is the
addition of actions (allowing choice) and rewards (giving motivation). Conversely, if only one
action exists for each state and all rewards are zero, a Markov decision process reduces to
a Markov chain.
Definition
A Markov decision process is a 4-tuple (S, A, P, R), where
S is a finite set of states,
A is a finite set of actions (alternatively, A_s is the finite set of actions available from state s),
P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a) is the probability that action a in state s at time t will lead to state s' at time t+1,
R_a(s, s') is the immediate reward (or expected immediate reward) received after transition to state s' from state s with transition probability P_a(s, s').
(The theory of Markov decision processes does not actually require S or A to be finite, but the basic algorithms below assume that they are finite.)
(Figure: example of a simple MDP with three states and two actions.)
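As a concrete illustration of this tuple, the following sketch stores S, A, P, and R as NumPy arrays for a toy two-state, two-action MDP; all of the numbers are invented for illustration only.

```python
import numpy as np

# Toy two-state, two-action MDP (the numbers are illustrative assumptions).
S = [0, 1]                                  # finite set of states
A = [0, 1]                                  # finite set of actions
# P[a][s, s2] = probability that action a taken in state s leads to state s2
P = {0: np.array([[0.9, 0.1],
                  [0.4, 0.6]]),
     1: np.array([[0.5, 0.5],
                  [0.1, 0.9]])}
# R[a][s, s2] = immediate reward received after moving from s to s2 under a
R = {0: np.array([[ 1.0,  0.0],
                  [ 0.0,  2.0]]),
     1: np.array([[ 0.5,  3.0],
                  [-1.0,  1.0]])}
```

The same dictionary-of-arrays layout is reused in the algorithm sketches below.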
Problem
The core problem of MDPs is to find a "policy" for the decision maker: a function π that specifies the action π(s) that the decision maker will choose when in state s. Note that once a Markov decision process is combined with a policy in this way, this fixes the action for each state and the resulting combination behaves like a Markov chain.
The goal is to choose a policy π that will maximize some cumulative function of the random rewards, typically the expected discounted sum over a potentially infinite horizon:
E[ Σ_{t=0}^{∞} γ^t R_{a_t}(s_t, s_{t+1}) ]   (where we choose a_t = π(s_t)),
where γ is the discount factor and satisfies 0 ≤ γ < 1. (For example, γ = 1/(1 + r) when the discount rate is r.) γ is typically close to 1. Because of the Markov property, the optimal policy for this particular problem can indeed be written as a function of s only, as assumed above.
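To make the objective concrete, the short sketch below evaluates the discounted sum Σ γ^t r_t for one sample reward sequence; the rewards and the discount factor are made-up values used only for illustration.

```python
# Discounted return of a single sample trajectory: sum over t of gamma^t * r_t.
gamma = 0.95                                 # assumed discount factor
rewards = [1.0, 0.0, 2.0, 2.0, -1.0]         # assumed sample rewards r_0, ..., r_4
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)                     # weighted sum; later rewards count less
```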
Algorithms
MDPs can be solved by linear programming or dynamic programming. In what
follows we present the latter approach.
Suppose we know the state transition function P and the reward function R, and we wish to calculate the policy π that maximizes the expected discounted reward.
The standard family of algorithms to calculate this optimal policy requires storage for two arrays indexed by state: value V, which contains real values, and policy π, which contains actions. At the end of the algorithm, π will contain the solution and V(s) will contain the discounted sum of the rewards to be earned (on average) by following that solution from state s.
The algorithm has the following two kinds of steps, which are repeated in some order for all the states until no further changes take place. They are
π(s) := argmax_a { Σ_{s'} P_a(s, s') [ R_a(s, s') + γ V(s') ] }
V(s) := Σ_{s'} P_{π(s)}(s, s') [ R_{π(s)}(s, s') + γ V(s') ]
Their order depends on the variant of the algorithm; one can also do them for all states at once or state by state, and more often to some states than others. As long as no state is permanently excluded from either of the steps, the algorithm will eventually arrive at the correct solution.
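A minimal sketch of these two steps, assuming the dictionary-of-arrays representation of P and R shown earlier (P[a][s, s'] and R[a][s, s']) and a discount factor gamma; it is one possible coding, not a prescribed one.

```python
import numpy as np

def policy_update(V, P, R, gamma):
    """Step 1: pi(s) := argmax_a sum_s' P_a(s,s') [R_a(s,s') + gamma V(s')]."""
    actions = sorted(P)
    pi = []
    for s in range(len(V)):
        q = [np.sum(P[a][s] * (R[a][s] + gamma * V)) for a in actions]
        pi.append(actions[int(np.argmax(q))])
    return pi

def value_update(V, pi, P, R, gamma):
    """Step 2: V(s) := sum_s' P_pi(s)(s,s') [R_pi(s)(s,s') + gamma V(s')]."""
    return np.array([np.sum(P[pi[s]][s] * (R[pi[s]][s] + gamma * V))
                     for s in range(len(V))])
```

Repeatedly applying these two functions in any fair order is the skeleton that the variants below specialize.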
Notable variants
Value iteration
In value iteration (Bellman 1957), which is also called backward induction, the π array is not used; instead, the value of π(s) is calculated whenever it is needed. Shapley's 1953 paper on stochastic games included as a special case the value iteration method for MDPs, but this was recognized only later on.
Substituting the calculation of π(s) into the calculation of V(s) gives the combined step:
V(s) := max_a { Σ_{s'} P_a(s, s') [ R_a(s, s') + γ V(s') ] }
This update rule is iterated for all states s until it converges with the left-hand side equal to the right-hand side (which is the "Bellman equation" for this problem).
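A sketch of value iteration under the same assumed representation; the default discount factor and tolerance are arbitrary illustrative choices.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Repeat the combined Bellman update until V stops changing."""
    actions = sorted(P)
    n_states = P[actions[0]].shape[0]
    V = np.zeros(n_states)
    while True:
        V_new = np.array([max(np.sum(P[a][s] * (R[a][s] + gamma * V))
                              for a in actions)
                          for s in range(n_states)])
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < tol:                     # left-hand side ~ right-hand side
            break
    # Read off a greedy policy from the converged values.
    policy = [max(actions,
                  key=lambda a: np.sum(P[a][s] * (R[a][s] + gamma * V)))
              for s in range(n_states)]
    return V, policy
```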
Policy iteration
In policy iteration (Howard 1960), step one is performed once, and then step two
is repeated until it converges. Then step one is again performed once and so on.
Instead of repeating step two to convergence, it may be formulated and solved as
a set of linear equations.
This variant has the advantage that there is a definite stopping condition: when the array π does not change in the course of applying step one to all states, the algorithm is completed.
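A sketch of policy iteration in which step two is solved exactly as a set of linear equations, (I - gamma P_π) V = r_π, again under the assumed P/R arrays; the loop stops when the policy array no longer changes.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    actions = sorted(P)
    n = P[actions[0]].shape[0]
    policy = [actions[0]] * n                          # arbitrary starting policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
        P_pi = np.array([P[policy[s]][s] for s in range(n)])
        r_pi = np.array([np.sum(P[policy[s]][s] * R[policy[s]][s])
                         for s in range(n)])
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
        # Policy improvement: greedy action with respect to the new V.
        new_policy = [max(actions,
                          key=lambda a: np.sum(P[a][s] * (R[a][s] + gamma * V)))
                      for s in range(n)]
        if new_policy == policy:                       # definite stopping condition
            return V, policy
        policy = new_policy
```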
Modified policy iteration
In modified policy iteration (van Nunen, 1976; Puterman and Shin 1978), step one
is performed once, and then step two is repeated several times. Then step one is again
performed once and so on.
Prioritized sweeping
In this variant, the steps are preferentially applied to states which are in some way important - whether based on the algorithm (there were large changes in V or π around those states recently) or based on use (those states are near the starting state, or otherwise of interest to the person or program using the algorithm).
Scope of the Markovian Decision Problem – The Gardener Example
The example paraphrases a number of important applications in the areas of inventory, replacement, cash flow management, and regulation of water reservoir capacity.
An avid gardener tends a plot of land in her backyard. Every year, at the beginning of the gardening season, she uses chemical tests to classify the garden's productivity for the new season as good, fair, or poor.
Over the years, the gardener has observed that the current year's productivity can be assumed to depend only on last year's soil condition. She is thus able to represent the transition probabilities over a one-year period from one productivity state to another in terms of the following Markov chain:
                               State of the system next year
                                  1     2     3
State of the system          1   .2    .5    .3
this year                    2    0    .5    .5     = P1
                             3    0     0     1
The representation assumes the following correspondence between productivity and the states of the chain:
Productivity (soil condition)     State of the system
Good                              1
Fair                              2
Poor                              3
The transition probabilities in P1 indicate that the productivity for a current year can be no better than last year's. For example, if the soil condition for this year is fair (state 2), next year's productivity may remain fair with probability .5 or become poor (state 3), also with probability .5.
The gardener can alter the transition probabilities P1 by taking other courses of
action available to her. Typically, she may decide to fertilize the garden to boost the soil
condition. If she does not, her transition probabilities will remain as given in P1. But if
she does, the following transition matrix P2 will result:
           1     2     3
      1   .3    .6    .1
P2 =  2   .1    .6    .3
      3  .05    .4   .55
With the new transition matrix P2, it is possible for the condition of the soil to improve over last year's.
To put the decision problem in perspective, the gardener associates a return
function (or a reward structure) with the transition from one state to another. The return
function expresses the gain or loss during a 1-year period, depending on the states
between which the transition is made. Since the gardener has the option of using or not
using fertilizer, her gains and losses are expected to vary depending on the decision she makes. The matrices R1 and R2 summarize the return functions, in hundreds of dollars, associated with the matrices P1 and P2, respectively. Thus R1 applies when no fertilizer is used; otherwise, R2 represents the return function.
                        1    2    3
                    1   7    6    3
R1 = || r_ij^1 || = 2   0    5    1
                    3   0    0   -1

                        1    2    3
                    1   6    5   -1
R2 = || r_ij^2 || = 2   7    4    0
                    3   6    3   -2
Notice that the elements r_ij^2 of R2 take into account the cost of applying the fertilizer. For example, if the system was in state 1 and remained in state 1 during the next year, its gain would be r_11^2 = 6, compared with r_11^1 = 7 when no fertilizer is used.
What kind of decision problem does the gardener have? First, we must know whether the gardening activity will continue for a limited number of years or, for all practical purposes, indefinitely. These situations are referred to as finite-stage and infinite-stage decision problems. In both cases, the gardener would need to determine the best course of action to follow (fertilize or do not fertilize) given the outcome of the chemical tests (state of the system). Her optimization process will be based on maximization of expected revenue.
The gardener may also be interested in evaluating the expected revenue resulting from following a prespecified course of action whenever a given state of the system occurs. For example, she may decide to fertilize whenever the soil condition is poor (state 3). The decision-making process in this case is said to be represented by a stationary policy.
We must note that each stationary policy is associated with different transition and return matrices, which, in general, can be constructed from the matrices P1, P2, R1, and R2. For example, for the stationary policy calling for applying fertilizer only when the soil condition is poor (state 3), the resulting transition and return matrices, P and R, respectively, are given as
      .2    .5    .3
P =    0    .5    .5
     .05    .4   .55

and

       7     6     3
R =    0     5     1
       6     3    -2
These matrices differ from P1 and R1 in the third row only, which is taken directly from P2 and R2. The reason is that P2 and R2 are the matrices that result when fertilizer is applied in every state.
Extension and Generalizations
A Markov decision process is a stochastic game with only one player.
Partial observability
Main article: partially observable Markov decision process
The solution above assumes that the state s is known when the action is to be taken; otherwise π(s) cannot be calculated. When this assumption is not true, the problem is called a partially observable Markov decision process or POMDP.
A major breakthrough in this area was provided in "Optimal adaptive policies for
Markov decision processes" [2] by Burnetas and Katehakis. In this work a class of
adaptive policies that possess uniformly maximum convergence rate properties for the
total expected finite-horizon reward were constructed under the assumptions of finite
state-action spaces and irreducibility of the transition law. These policies prescribe that
the choice of actions, at each state and time period, should be based on indices that are
inflations of the right-hand side of the estimated average reward optimality equations.
Reinforcement learning
If the probabilities or rewards are unknown, the problem is one of reinforcement
learning (Sutton and Barto, 1998).
For this purpose it is useful to define a further function Q(s, a), which corresponds to taking action a and then continuing optimally (or according to whatever policy one currently has):
Q(s, a) = Σ_{s'} P_a(s, s') [ R_a(s, s') + γ V(s') ]
While this function is also unknown, experience during learning is based on (s, a) pairs (together with the outcome s'); that is, "I was in state s and I tried doing a, and s' happened". Thus, one has an array Q and uses experience to update it directly. This is known as Q-learning.
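A tabular Q-learning sketch; the simulator function step(s, a), the learning rate, and the epsilon-greedy exploration scheme are assumptions introduced for illustration, not part of the text above.

```python
import random
import numpy as np

def q_learning(step, n_states, n_actions,
               episodes=500, horizon=100,
               gamma=0.95, alpha=0.1, epsilon=0.1):
    """Tabular Q-learning: update Q(s, a) from observed (s, a, s', r) experience.
    `step(s, a)` is an assumed simulator returning (next_state, reward)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = random.randrange(n_states)                 # uniformly random restart
        for _ in range(horizon):
            # epsilon-greedy choice of action
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r = step(s, a)
            # move the estimate toward the one-step lookahead target
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```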
Reinforcement learning can solve Markov decision processes without explicit
specification of the transition probabilities; the values of the transition probabilities are
needed in value and policy iteration. In reinforcement learning, instead of explicit
specification of the transition probabilities, the transition probabilities are accessed
through a simulator that is typically restarted many times from a uniformly random initial
state. Reinforcement learning can also be combined with function approximation to
address problems with a very large number of states.
Continuous-Time Markov Decision Process
In discrete-time Markov decision processes, decisions are made at discrete time epochs. In a continuous-time Markov decision process, by contrast, decisions can be made at any time the decision maker chooses. Unlike the discrete-time Markov decision process, the continuous-time Markov decision process can better model the decision-making process for a system that has continuous dynamics, i.e., whose dynamics are defined by partial differential equations (PDEs).
Definition
In order to discuss the continuous-time Markov decision process, we introduce two sets of notation.
If the state space and action space are finite:
S: state space;
A: action space;
q(j | i, a): transition rate function, giving the rate at which the system moves from state i to state j when action a is taken;
R(i, a): a reward function.
If the state space and action space are continuous:
X: state space;
U: space of possible controls;
f(x(t), u(t)): a transition rate function describing the evolution of the state;
r(x(t), u(t)): a reward rate function, the continuous-time counterpart of the reward function R discussed in the previous case, so that the reward accumulated over a short interval of length dt is approximately r(x(t), u(t)) dt.
Problem
As in the discrete-time Markov decision process, in the continuous-time Markov decision process we want to find the optimal policy or control u*(·) that gives us the optimal expected integrated reward:
max_u E[ ∫_0^T γ^t r(x(t), u(t)) dt ]
where 0 ≤ γ < 1 is the discount factor.
Linear programming formulation
If the state space and action space are finite, we can use a linear programming formulation to find the optimal policy; this was one of the earliest solution approaches. Here we consider only the ergodic model, which means that our continuous-time MDP becomes an ergodic continuous-time Markov chain under a stationary policy. Under this assumption, although the decision maker can make a decision at any time while in the current state, there is no benefit in taking more than one action: it is better to take an action only at the time when the system transitions from the current state to another state. Under some conditions (for details check Corollary 3.14 of Continuous-Time Markov Decision Processes), if the optimal value function is independent of the state i, we will have the following inequality:
g ≥ R(i, a) + Σ_{j ∈ S} q(j | i, a) h(j)   for all i ∈ S and a ∈ A(i).
If there exists a function h, then the optimal gain will be the smallest g satisfying the above inequality. In order to find it, we can use the following linear programming model:
Primal linear program (P-LP):
minimize g
subject to g ≥ R(i, a) + Σ_{j ∈ S} q(j | i, a) h(j) for all i ∈ S and a ∈ A(i).
Dual linear program (D-LP):
maximize Σ_i Σ_a R(i, a) y(i, a)
subject to Σ_i Σ_a q(j | i, a) y(i, a) = 0 for all j ∈ S,
Σ_i Σ_a y(i, a) = 1,
y(i, a) ≥ 0 for all a ∈ A(i) and i ∈ S.
y(i, a) is a feasible solution to the D-LP if y(i, a) is nonnegative and satisfies the constraints in the D-LP problem. A feasible solution y*(i, a) to the D-LP is said to be an optimal solution if
Σ_i Σ_a R(i, a) y*(i, a) ≥ Σ_i Σ_a R(i, a) y(i, a)
for every feasible solution y(i, a) to the D-LP. Once we have found the optimal solution y*(i, a), we can use it to establish the optimal policies.
Hamilton-Jacobi-Bellman equation
In a continuous-time MDP, if the state space and action space are continuous, the optimal criterion can be found by solving the Hamilton-Jacobi-Bellman (HJB) partial differential equation. In order to discuss the HJB equation, we need to reformulate our problem:
V(x(0), 0) = max_u ∫_0^T r(x(t), u(t)) dt + D(x(T))
subject to dx(t)/dt = f(x(t), u(t)).
D(·) is the terminal reward function, x(t) is the system state vector, and u(t) is the system control vector we try to find. f(·) shows how the state vector changes over time. The Hamilton-Jacobi-Bellman equation is as follows:
0 = ∂V(x, t)/∂t + max_u { r(x, u) + (∂V(x, t)/∂x) f(x, u) }
We can solve the equation to find the optimal control u(t), which gives us the optimal value function V.
Application
Queueing systems, epidemic processes, and population processes.
Alternative Notations
The terminology and notation for MDPs are not entirely settled. There are two main streams. One focuses on maximization problems from contexts like economics, using the terms action, reward, and value, and calling the discount factor β or γ; the other focuses on minimization problems from engineering and navigation, using the terms control, cost, and cost-to-go, and calling the discount factor α. In addition, the notation for the transition probability varies.

in this article                       alternative                           comment
action a                              control u
reward R                              cost g                                g is the negative of R
value V                               cost-to-go J                          J is the negative of V
policy π                              policy μ
discount factor γ                     discount factor α
transition probability P_a(s, s')     transition probability p_{ss'}(a)

In addition, the transition probability is sometimes written Pr(s_{t+1} = s' | s_t = s, a_t = a), Pr(s' | s, a) or, rarely, p_{s's}(a).
Finite-Stage Dynamic Programming Model
Suppose that the gardener plans to “retire” from exercising her hobby in N
years. She is thus interested in determining her optimal course of action for each year (to
fertilize or not fertilize) over a finite planning horizon. Optimality here is defined such
that the gardener will accumulate the highest expected revenue at the end of the N years.
Let k = 1 and 2 represent the two courses of action (alternatives) available to the gardener. The matrices Pk and Rk represent the transition probabilities and reward function for alternative k.
The summary is:

                      .2   .5   .3
P1 = || p_ij^1 || =    0   .5   .5
                       0    0    1

                       7    6    3
R1 = || r_ij^1 || =    0    5    1
                       0    0   -1

                      .3   .6   .1
P2 = || p_ij^2 || =   .1   .6   .3
                     .05   .4  .55

                       6    5   -1
R2 = || r_ij^2 || =    7    4    0
                       6    3   -2
Recall that the system has three states: good (state 1), fair (state 2), and poor (state
3).
We can express the gardener’s problem as a finite-stage dynamic programming
(DP) model as follows. For the sake of generalization, suppose that the number of states
for each stage (year) is m (= 3 in the gardener's example), and define f_n(i) = the optimal expected revenue of stages n, n+1, ..., N, given that the state of the system (soil condition) at the beginning of year n is i.
The backward recursive equation relating f_n and f_{n+1} can be written as
f_n(i) = max_k { Σ_{j=1}^{m} p_ij^k [ r_ij^k + f_{n+1}(j) ] },   n = 1, 2, ..., N
where f_{N+1}(j) = 0 for all j.
(Figure: schematic of one stage of the recursion, showing the transition from each state i at stage n to each state j at stage n+1 with probability p_ij^k and return r_ij^k, linking f_n(i) to f_{n+1}(j).)
A justification for the equation above is that the cumulative revenue, r_ij^k + f_{n+1}(j), resulting from reaching state j at stage n+1 from state i at stage n, occurs with probability p_ij^k. In fact, if v_i^k represents the expected return resulting from a single transition from state i given alternative k, then v_i^k can be expressed as
v_i^k = Σ_{j=1}^{m} p_ij^k r_ij^k
The DP recursive equation can thus be written as
f_N(i) = max_k { v_i^k }
f_n(i) = max_k { v_i^k + Σ_{j=1}^{m} p_ij^k f_{n+1}(j) },   n = 1, 2, ..., N - 1
Before showing how the recursive equation is used to solve the gardener's problem, we illustrate the computation of v_i^k, which is part of the recursive equation. For example, suppose that no fertilizer is used (k = 1); then
v_1^1 = .2(7) + .5(6) + .3(3) = 5.3
v_2^1 = 0(0) + .5(5) + .5(1) = 3
v_3^1 = 0(0) + 0(0) + 1(-1) = -1
These values show that if the soil condition is found to be good (state 1) at the beginning of the year, a single transition is expected to yield 5.3 for that year. Similarly, if the soil condition is fair (poor), the expected revenue is 3 (-1).
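These one-step expected returns can also be checked mechanically. The sketch below recomputes v_i^k for both alternatives from the matrices P1, P2, R1, and R2 given earlier; the only assumption is the NumPy representation of the data.

```python
import numpy as np

P = {1: np.array([[.2, .5, .3], [0, .5, .5], [0, 0, 1]]),
     2: np.array([[.3, .6, .1], [.1, .6, .3], [.05, .4, .55]])}
R = {1: np.array([[7, 6, 3], [0, 5, 1], [0, 0, -1]]),
     2: np.array([[6, 5, -1], [7, 4, 0], [6, 3, -2]])}

# v[k][i] = sum over j of p_ij^k * r_ij^k  (expected one-step return)
v = {k: (P[k] * R[k]).sum(axis=1) for k in P}
print(v[1])   # approximately [ 5.3  3.  -1. ], matching the values above
print(v[2])   # approximately [ 4.7  3.1  0.4]
```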
The (finite-horizon) gardener's problem can be generalized in two ways. First, the transition probabilities and their return functions need not be the same for every year. Second, she may apply a discounting factor to the expected revenue of the successive stages, so that the values of f_1(i) would represent the present value of the expected revenues of all the stages.
The first generalization requires simply that the return values r_ij^k and transition probabilities p_ij^k be additionally functions of the stage n, written r_ij^k(n) and p_ij^k(n). In this case, the DP recursive equation appears as
f_N(i) = max_k { v_i^k(N) }
f_n(i) = max_k { v_i^k(n) + Σ_{j=1}^{m} p_ij^k(n) f_{n+1}(j) },   n = 1, 2, ..., N - 1
where
v_i^k(n) = Σ_{j=1}^{m} p_ij^k(n) r_ij^k(n)
The second generalization is accomplished as follows. Let a (less than 1) be the discount factor per year, which is normally computed as a = 1/(1 + t), where t is the annual interest rate. Thus D dollars a year from now are equivalent to aD dollars now. The introduction of the discount factor modifies the original recursive equation as follows:
f_N(i) = max_k { v_i^k }
f_n(i) = max_k { v_i^k + a Σ_{j=1}^{m} p_ij^k f_{n+1}(j) },   n = 1, 2, ..., N - 1
The application of this recursive equation is similar to that of the undiscounted case. In general, the use of a discount factor may result in a different optimum decision in comparison with the case where no discount is used.
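A sketch of the backward recursion with the discount factor a, reusing the P and R dictionaries from the previous sketch; setting a = 1 recovers the undiscounted recursion.

```python
import numpy as np

def finite_stage_dp(P, R, N, a=1.0):
    """Backward recursion f_n(i) = max_k { v_i^k + a * sum_j p_ij^k f_{n+1}(j) }.
    Returns, for each stage n, the values f_n(i) and the best alternative per state."""
    v = {k: (P[k] * R[k]).sum(axis=1) for k in P}      # one-step expected returns
    m = next(iter(P.values())).shape[0]
    f_next = np.zeros(m)                               # f_{N+1}(j) = 0 for all j
    plan = []
    for n in range(N, 0, -1):
        candidates = {k: v[k] + a * P[k] @ f_next for k in P}
        f = np.max(np.array(list(candidates.values())), axis=0)
        best_k = [max(P, key=lambda k: candidates[k][i]) for i in range(m)]
        plan.append((n, f, best_k))
        f_next = f
    return list(reversed(plan))
```

For example, finite_stage_dp(P, R, N=3) evaluates a three-year horizon for the gardener, and passing a < 1 shows how discounting can change the recommended alternatives.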
The DP recursive equation can be used to evaluate any stationary policy for the gardener's problem. Assuming that no discounting is used (i.e., a = 1), the recursive equation for evaluating a stationary policy is
f_n(i) = v_i + Σ_{j=1}^{m} p_ij f_{n+1}(j)
where p_ij is the (i, j)th element of the transition matrix associated with the policy and v_i is the expected one-step transition revenue of the policy.
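A sketch of this policy-evaluation recursion, assuming the policy is given as a tuple with one alternative k per state and reusing the P and R dictionaries above.

```python
import numpy as np

def evaluate_stationary_policy(P, R, policy, N, a=1.0):
    """f_n(i) = v_i + a * sum_j p_ij f_{n+1}(j) for a fixed stationary policy."""
    m = next(iter(P.values())).shape[0]
    # Build the policy's own transition matrix and one-step expected revenues.
    P_pol = np.array([P[policy[i]][i] for i in range(m)])
    v_pol = np.array([np.sum(P[policy[i]][i] * R[policy[i]][i]) for i in range(m)])
    f = np.zeros(m)                                     # f_{N+1}(j) = 0
    for _ in range(N):
        f = v_pol + a * P_pol @ f
    return f                                            # f_1(i) for each state i
```

For instance, evaluate_stationary_policy(P, R, (1, 1, 2), N=3) evaluates, over three years, the policy that fertilizes only when the soil condition is poor.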
Infinite-Stage Model
The long-run behavior of the Markovian process is characterized by its
independence of the initial state of the system. In this case the system is said to have
reached steady state. We are thus primarily interested in evaluating policies for which the
associated Markov chains allow the existence of a steady-state solution.
Here, we are interested in determining the optimum long-run policy of a
Markovian decision problem. It is logical to base the evaluation of a policy on
maximizing (minimizing) the expected revenue (cost) per transition period. For example,
in the gardener’s problem, the selection of the best (infinite-stage) policy is based on the
maximum expected revenue per year.
There are two methods for solving the infinite-stage problem. The first method
calls for enumerating all possible stationary policies of the decision problem. By
evaluating each policy, the optimum solution can be determined. This is basically
equivalent to an alternative enumeration process and can be used only if the total number
of the stationary policies is reasonably small for practical computations.
The second method, called policy iteration, alleviates the computational difficulties that can arise in the exhaustive enumeration procedure. The new method is generally efficient in the sense that it determines the optimum policy in a small number of iterations.
Naturally, both methods must lead to the same optimum solution. We demonstrate these points, as well as the application of the two methods, via the gardener example.
Exhaustive Enumeration Method
Suppose that the decision problem has a total of S policies, and assume that Ps and Rs are the (one-step) transition and revenue matrices associated with the sth policy, s = 1, 2, ..., S. The steps of the enumeration method are as follows.
Step 1: Compute v_i^s, the expected one-step (one-period) revenue of policy s given state i, i = 1, 2, ..., m.
Step 2: Compute π_i^s, the long-run stationary probabilities of the transition matrix Ps associated with policy s. These probabilities, when they exist, are computed from the equations
π^s Ps = π^s
π_1^s + π_2^s + ... + π_m^s = 1
where π^s = (π_1^s, π_2^s, ..., π_m^s).
Step 3: Determine E^s, the expected revenue of policy s per transition step (period), by using the formula
E^s = Σ_{i=1}^{m} π_i^s v_i^s
Step 4: The optimum policy s* is determined such that
E^{s*} = max_s { E^s }
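A sketch of these four steps for the gardener's data, assuming the stationary probabilities are obtained by solving π^s Ps = π^s together with the normalization constraint via a least-squares solve (which also handles the degenerate chain of the never-fertilize policy); each policy is encoded as a tuple giving the alternative k chosen in each state.

```python
import itertools
import numpy as np

def stationary_distribution(P_s):
    """Solve pi P = pi together with sum(pi) = 1 for the limiting probabilities."""
    m = P_s.shape[0]
    A = np.vstack([P_s.T - np.eye(m), np.ones(m)])
    b = np.append(np.zeros(m), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def exhaustive_enumeration(P, R):
    """Enumerate every stationary policy and return the one with the largest E^s."""
    v = {k: (P[k] * R[k]).sum(axis=1) for k in P}       # step 1: one-step returns
    m = next(iter(P.values())).shape[0]
    best = None
    for policy in itertools.product(sorted(P), repeat=m):
        P_s = np.array([P[policy[i]][i] for i in range(m)])
        v_s = np.array([v[policy[i]][i] for i in range(m)])
        pi = stationary_distribution(P_s)               # step 2
        E_s = float(pi @ v_s)                           # step 3: revenue per period
        if best is None or E_s > best[1]:               # step 4: keep the maximum
            best = (policy, E_s)
    return best
```

Running exhaustive_enumeration(P, R) on the gardener's matrices evaluates all 2^3 = 8 stationary policies and reports the best one.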
Policy Iteration Method with Discounting
The policy iteration algorithm can be extended to include discounting.
Specifically, given that a (less than 1) is the discount factor, the finite-stage recursive equation can be written as
f_n(i) = max_k { v_i^k + a Σ_{j=1}^{m} p_ij^k f_{n+1}(j) }
where n represents the number of stages to go.
Policy Iteration Method without Discounting
To gain an appreciation of the difficulty associated with the exhaustive enumeration method, let us assume that the gardener has four courses of action (alternatives) instead of two: do not fertilize, fertilize once during the season, fertilize twice, and fertilize three times. In this case, the gardener would have a total of 4^3 = 64 stationary policies. Thus, by increasing the number of alternatives from 2 to 4, the number of stationary policies "soars" from 2^3 = 8 to 64. Not only is it difficult to enumerate all the policies explicitly, but the number of computations involved in the evaluation of these policies may also be prohibitively large.
The policy iteration method is based principally on the following development. For any specific policy, we showed that the expected total return at stage n is expressed by the recursive equation
f_n(i) = v_i + Σ_{j=1}^{m} p_ij f_{n+1}(j),   i = 1, 2, ..., m
Summary
This chapter provides models for the solution of the Markovian decision
problem. The models developed include the finite-stage models solved directly by the DP
recursive equations. In the infinite-stage model, it is shown that exhaustive enumeration
is not practical for large problems. The policy iteration algorithm, which is based on the
DP recursive equation, is shown to be more efficient computationally than the exhaustive
enumeration method, since it normally converges in a small number of iterations.
Discounting is shown to result in a possible change of the optimal policy in comparison
with the case where no discounting is used. This conclusion applies to both the finite- and
infinite-stage models.
The LP formulation is quite interesting but not as efficient computationally as the
policy iteration algorithm. For problems with K decision alternatives and m states, the
associated LP model would include (m + 1) constraints and mK variables, which tend to
be large for large values of m and K.
Although we presented the simplified gardener example to demonstrate the development of the algorithms, the Markovian decision problem has applications in such areas as inventory, maintenance, replacement, and water resources.
References
R. Bellman. A Markovian Decision Process. Journal of Mathematics and Mechanics 6,
1957.
R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.
Dover paperback edition (2003), ISBN 0-486-42809-5.
Ronald A. Howard. Dynamic Programming and Markov Processes. The M.I.T. Press, 1960.
D. Bertsekas. Dynamic Programming and Optimal Control. Volume 2, Athena, MA,
1995.
Burnetas, A. N. and M. N. Katehakis. "Optimal Adaptive Policies for Markov Decision Processes." Mathematics of Operations Research, 22(1), 1995.
M. L. Puterman. Markov Decision Processes. Wiley, 1994.
H.C. Tijms. A First Course in Stochastic Models. Wiley, 2003.
Sutton, R. S. and Barto A. G. Reinforcement Learning: An Introduction. The MIT Press,
Cambridge, MA, 1998.
J.A. E. E van Nunen. A set of successive approximation methods for discounted
Markovian decision problems. Z. Operations Research, 20:203-208, 1976.
S. P. Meyn, 2007. Control Techniques for Complex Networks, Cambridge University
Press, 2007. ISBN 978-0-521-88441-9. Appendix contains abridged Meyn & Tweedie.
S. M. Ross. Introduction to Stochastic Dynamic Programming. Academic Press, 1983.
X. Guo and O. Hernández-Lerma. Continuous-Time Markov Decision Processes,
Springer, 2009.
M. L. Puterman and M. C. Shin. Modified Policy Iteration Algorithms for Discounted Markov Decision Problems. Management Science 24, 1978.