MS&E 221: Stochastic Modeling

Session 7: Nonlinear Optimization, Markov Decision Processes

Lin Fan

February 20, 2019

1 / 18
Nonlinear Optimization: Finding Maximum Likelihood Estimates

Example 2 (from Estimation Slide Deck): Queueing model

Xn+1 = [Xn + Zn+1 − 2]+

where (Zn : n ≥ 1) are i.i.d. Geometric(p*)

From the data, estimate p∗

2 / 18
Nonlinear Optimization: Finding Maximum Likelihood Estimates

Example 2 (from Estimation Slide Deck):

Observed data:
Z1 = 0, Z2 = 1, Z3 = 3, Z4 = 1, Z5 = 2
L(p) = (1 − p)^0 p · (1 − p)^1 p · (1 − p)^3 p · (1 − p)^1 p · (1 − p)^2 p = (1 − p)^7 p^5

The maximizing value of p is p̂ = 5/12
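(To check: log L(p) = 7 log(1 − p) + 5 log p, and setting the derivative −7/(1 − p) + 5/p to zero gives p̂ = 5/12.)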

Find p̂ using Matlab:

fun=@(p)(-(1-p)^7*p^5);    % negative likelihood (fminsearch minimizes)
p0=0.1;                    % initial guess
p_hat=fminsearch(fun,p0)   % returns approximately 5/12

3 / 18
Nonlinear Optimization: Finding Maximum Likelihood Estimates

Example 2 Variant (from Estimation Slide Deck): Queueing model

Xn+1 = [Xn + Zn+1 − 1]+

where (Zn : n ≥ 1) are i.i.d. Poisson(λ*)

From the data, estimate λ∗

4 / 18
Nonlinear Optimization: Finding Maximum Likelihood Estimates
Example 2 Variant (from Estimation Slide Deck): Xn+1 = [Xn + Zn+1 − 1]+

Observed data:

X1 = 0, X2 = 0, X3 = 3, X4 = 2, X5 = 1, X6 = 0, X7 = 0
L(λ) = (e^{−λ} + e^{−λ} λ/1!) · (e^{−λ} λ^4/4!) · (e^{−λ} λ^0/0!)^3 · (e^{−λ} + e^{−λ} λ/1!)

(Each factor is the Poisson probability of the Zn values consistent with a transition of the observed Xn: the first and last transitions only require Zn ∈ {0, 1}, the jump from 0 to 3 requires Zn = 4, and the three unit decreases require Zn = 0.)
The maximizing value of λ is λ̂ = 0.8165
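(To check: up to constants, log L(λ) = −6λ + 4 log λ + 2 log(1 + λ), and setting the derivative −6 + 4/λ + 2/(1 + λ) to zero gives 6λ^2 = 4, i.e. λ̂ = sqrt(2/3) ≈ 0.8165.)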

Find λ̂ using Matlab:

% negative likelihood (the constant 1/4! is dropped; it does not change the maximizer)
fun=@(lambda)(-(exp(-lambda)+exp(-lambda)*lambda)*exp(-lambda)*lambda^4...
    *exp(-3*lambda)*(exp(-lambda)+exp(-lambda)*lambda));
lambda0=0.1;
lambda_hat=fminsearch(fun,lambda0)

5 / 18
Sequential Decision Making

Goal: Maximize reward sequentially over time

Reward is a mathematical expression of a desirable state


Decisions made in stages
Current decision affects future outcomes, and therefore future decisions
Balance high present reward vs. potentially low future rewards

6 / 18
Markov Decision Processes

S: set of states, A(x) ⊆ A: set of actions permissible at state x ∈ S


r : S × A → R+ : reward function
Xn : state of system at time n
An : S → A: action mapping at time n, with An (x) ∈ A(x)

P(Xn+1 = xn+1 | X0 = x0, A0 = a0, . . . , Xn = xn, An = an) = P(Xn+1 = xn+1 | Xn = xn, An = an) =: Pan(xn, xn+1)

Goal: Denoting the policy Π = (An)n≥0, solve

maximize over Π:   EΠ[ Σ_{n=0}^∞ e^{−αn} r(Xn, An(Xn)) ]

where α > 0 is the discount rate.

7 / 18
Applications

Robotics, Control
Rockets
Autonomous Robots
Business Decisions
Inventory management
Scheduling, controlling queues
Personalized marketing
Finance
Portfolio management (e.g. pension funds)
Option pricing
Education (edtech services)
And many others...

8 / 18
Example: Perpetual Option

Consider an option on a stock that you can exercise at any time


Let Xn be the stock price at time n
If you exercise the option (action a0) at time n, you get r(Xn, a0); otherwise (action a1) you get 0
Once you exercise option (state E), no reward from then on
S = {E, 0, 1, 2, . . .}, A = {a0, a1}
Some transition matrix Pa with Pa0 (x, E) = 1
For a ∈ A, Pa (E, E) = 1 and r(E, a) = 0

9 / 18
Bellman’s Equation

Recall An : S → A, and Π = (An )n≥0


Define optimal value

V*(x) = max_Π  EΠ[ Σ_{n=0}^∞ e^{−αn} r(Xn, An(Xn)) | X0 = x ]

Theorem 1 (Bellman’s Equation)


Suppose |S| < ∞ and |A| < ∞. Then V* satisfies, for all x ∈ S,

V*(x) = max_{a∈A(x)} { r(x, a) + e^{−α} Σ_{y∈S} Pa(x, y) V*(y) }.

Further, V* is the unique finite solution to the above fixed point equation.

10 / 18
Sketch of Proof

Goal: Show V* satisfies V*(x) = max_{a∈A(x)} { r(x, a) + e^{−α} Σ_{y∈S} Pa(x, y) V*(y) }

We postulate that the optimal policy is given by An = A* for some A*: S → A (with A*(x) ∈ A(x)). Then V*(x) = EA*[ Σ_{n=0}^∞ e^{−αn} r(Xn, A*(Xn)) ].

From first transition analysis, we have

V*(x) = r(x, A*(x)) + e^{−α} Σ_{y∈S} P_{A*(x)}(x, y) V*(y).

Similarly, we can show for all a ∈ A(x)


V*(x) ≥ r(x, a) + e^{−α} Σ_{y∈S} Pa(x, y) V*(y)

Intuition: Playing optimally is better than playing action a at time 0, and then
optimally onwards

11 / 18
Solution Methods

Can we compute V*(x) for all x ∈ S?

Use Bellman’s equation

V*(x) = max_{a∈A(x)} { r(x, a) + e^{−α} Σ_{y∈S} Pa(x, y) V*(y) }.

Once we have V*, the optimal strategy A*: S → A is given by

A*(x) = argmax_{a∈A(x)} { r(x, a) + e^{−α} Σ_{y∈S} Pa(x, y) V*(y) }

12 / 18
Approach 1: Linear Programming

Since V* is the unique solution to

V*(x) = max_{a∈A(x)} { r(x, a) + e^{−α} Σ_{y∈S} Pa(x, y) V*(y) },

it is given by the linear program


minimize over V:   Σ_{x∈S} V(x)

s.t.   V(x) ≥ r(x, a) + e^{−α} Σ_{y∈S} Pa(x, y) V(y)   for all x ∈ S, a ∈ A(x)

Drawback: Computationally expensive when |S| and |A| are large!
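A minimal Matlab sketch of this LP, using the perpetual-option example from the later slides (states ordered 1, 2, 3, E, with e^{−α} = 1/2) and assuming the Optimization Toolbox's linprog is available; the variable names are ad hoc:

beta = 1/2;                                                   % e^{-alpha}
Pa1  = [1/2 1/2 0 0; 1/3 0 2/3 0; 1/3 1/3 1/3 0; 0 0 0 1];    % hold (a1) dynamics
Pa0  = [0 0 0 1; 0 0 0 1; 0 0 0 1; 0 0 0 1];                  % exercise (a0): go to E
r1   = [0; 0; 0; 0];                                          % r(x, a1)
r0   = [0; 1; 2; 0];                                          % r(x, a0)
% V >= r_a + beta*Pa*V for each action, rewritten as (beta*Pa - I)*V <= -r_a
A = [beta*Pa1 - eye(4); beta*Pa0 - eye(4)];
b = [-r1; -r0];
f = ones(4,1);                                                % minimize sum_x V(x)
V_star = linprog(f, A, b)

Each pair (x, a) contributes one constraint, which is why the LP grows quickly with |S| and |A|.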

13 / 18
Approach 2: Value Iteration

Let T: R^|S| → R^|S| be the Bellman operator

(T V)(x) = max_{a∈A(x)} { r(x, a) + e^{−α} Σ_{y∈S} Pa(x, y) V(y) }.

The value function V* is a fixed point of T: V* = T V*

Theorem 2 (Value Iteration)


For any vector V0, we have lim_{k→∞} (T^k V0)(x) = V*(x) for all x ∈ S.

Starting with some |S|-dimensional vector V0 , iteratively apply Vk+1 = T Vk !
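A minimal Matlab sketch of this iteration (a hypothetical helper, say value_iter.m; it assumes every action is allowed in every state, and takes a cell array P with one transition matrix per action, a reward matrix r with one column per action, and the discount factor beta = e^{−α}):

function V = value_iter(P, r, beta, num_iters)
% Value iteration: repeatedly apply the Bellman operator T, starting from V0 = 0.
% P: cell array of |S|-by-|S| transition matrices, one per action
% r: |S|-by-|A| reward matrix; beta: discount factor e^{-alpha}
    V = zeros(size(r, 1), 1);
    for k = 1:num_iters
        Q = zeros(size(r));
        for a = 1:numel(P)
            Q(:, a) = r(:, a) + beta * (P{a} * V);   % value of playing a now, then V
        end
        V = max(Q, [], 2);                           % (T V)(x): best action at each x
    end
end

By Theorem 2 the iterates converge to V* for any starting vector; in practice one stops once successive iterates change very little.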

14 / 18
Example: Perpetual Option

Let Pa0 (x, E) = 1 for x ∈ S and

Pa1 =   (rows and columns ordered 1, 2, 3, E)

            1     2     3     E
      1    1/2   1/2    0     0
      2    1/3    0    2/3    0
      3    1/3   1/3   1/3    0
      E     0     0     0     1

Let r(1, a0) = 0, r(2, a0) = 1, r(3, a0) = 2, r(x, a1) = 0 for all x ∈ S,
and r(E, a0) = r(E, a1) = 0 (so V(E) = 0).

15 / 18
Example: Perpetual Option
Compute (T V)(x) = max_{a∈A(x)} { r(x, a) + e^{−α} Σ_{y∈S} Pa(x, y) V(y) }

Suppose that e^{−α} = 1/2. Let Pa0(x, E) = 1 for x ∈ S and

Pa1 =   (rows and columns ordered 1, 2, 3, E)

            1     2     3     E
      1    1/2   1/2    0     0
      2    1/3    0    2/3    0
      3    1/3   1/3   1/3    0
      E     0     0     0     1
Let r(1, a0) = 0, r(2, a0) = 1, r(3, a0) = 2, r(x, a1) = 0 for all x ∈ S,
and r(E, a0) = r(E, a1) = 0 (so V(E) = 0).

(T V)(1) = max{ 0, (1/2)[(1/2)V(1) + (1/2)V(2)] }
         = max{ 0, (1/4)V(1) + (1/4)V(2) }

16 / 18
Example: Perpetual Option
Compute (T V)(x) = max_{a∈A(x)} { r(x, a) + e^{−α} Σ_{y∈S} Pa(x, y) V(y) }

Suppose that e^{−α} = 1/2. Let Pa0(x, E) = 1 for x ∈ S and

Pa1 =   (rows and columns ordered 1, 2, 3, E)

            1     2     3     E
      1    1/2   1/2    0     0
      2    1/3    0    2/3    0
      3    1/3   1/3   1/3    0
      E     0     0     0     1
Let r(1, a0) = 0, r(2, a0) = 1, r(3, a0) = 2, r(x, a1) = 0 for all x ∈ S,
and r(E, a0) = r(E, a1) = 0 (so V(E) = 0).

(T V)(2) = max{ 1, (1/2)[(1/3)V(1) + (2/3)V(3)] }
         = max{ 1, (1/6)V(1) + (1/3)V(3) }

17 / 18
Example: Perpetual Option
Compute (T V)(x) = max_{a∈A(x)} { r(x, a) + e^{−α} Σ_{y∈S} Pa(x, y) V(y) }

Suppose that e^{−α} = 1/2. Let Pa0(x, E) = 1 for x ∈ S and

Pa1 =   (rows and columns ordered 1, 2, 3, E)

            1     2     3     E
      1    1/2   1/2    0     0
      2    1/3    0    2/3    0
      3    1/3   1/3   1/3    0
      E     0     0     0     1

Let r(1, a0) = 0, r(2, a0) = 1, r(3, a0) = 2, r(x, a1) = 0 for all x ∈ S,
and r(E, a0) = r(E, a1) = 0 (so V(E) = 0).

(T V)(3) = max{ 2, (1/2)[(1/3)V(1) + (1/3)V(2) + (1/3)V(3)] }
         = max{ 2, (1/6)V(1) + (1/6)V(2) + (1/6)V(3) }
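Putting the three updates together, a minimal Matlab sketch of value iteration for this example (states ordered 1, 2, 3, E; starting from V0 = 0, so V(E) stays 0; variable names are ad hoc):

beta = 1/2;                                                   % e^{-alpha}
Pa1  = [1/2 1/2 0 0; 1/3 0 2/3 0; 1/3 1/3 1/3 0; 0 0 0 1];    % hold (a1) dynamics
r0   = [0; 1; 2; 0];                                          % r(x, a0); r(x, a1) = 0
V    = zeros(4, 1);                                           % V0 = 0
for k = 1:100
    exer     = r0 + beta * V(4);       % exercise: r(x, a0) + e^{-alpha} V(E)
    hold_val = beta * (Pa1 * V);       % hold: 0 + e^{-alpha} sum_y Pa1(x, y) V(y)
    V = max(exer, hold_val);           % V_{k+1} = T V_k  (elementwise max)
end
V

The iterates converge (up to numerical precision) to V(1) = 1/3, V(2) = 1, V(3) = 2, V(E) = 0, so it is optimal to exercise the option in states 2 and 3 and to hold in state 1.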

18 / 18
