
Stochastic Dynamic Programming


Stochastic dynamic programming differs from deterministic dynamic
programming in that the state at the next stage is not completely
determined by the state and policy decision at the current stage.

Rather, there is a probability distribution for what the next state will
be. However, this probability distribution is still completely determined
by the state and policy decision at the current stage.


Let $A$, a finite set, be the set of all possible actions, and let
$f_n(s_n, x_n)$ denote the minimum expected sum from stage $n$ onward,
given that the state and policy decision at stage $n$ are $s_n$ and
$x_n$, respectively.

Optimality equation

$$f_n(s_n, x_n) = \sum_{i=1}^{S} p_i \left[ C_i + f_{n+1}^{*}(i) \right],$$

where $S$ is the number of possible states at stage $n+1$, and $p_i$ and
$C_i$ are the probability of moving to state $i$ and the cost incurred on
that transition, both determined by $s_n$ and $x_n$.

The optimal value

$$f_{n+1}^{*}(i) = \min_{x_{n+1}} f_{n+1}(i, x_{n+1}),$$

where this minimization is taken over the feasible values of $x_{n+1}$.
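As a concrete rendering of these two equations, here is a minimal
backward-induction sketch in Python. The data layout (nested dictionaries
of transition probabilities and costs) and every name in it are
illustrative assumptions, not part of the original formulation.

def solve_stochastic_dp(stages, states, actions, prob, cost, terminal):
    """Backward induction for a finite-horizon stochastic DP.

    prob[n][s][x] -- dict mapping next state i to its probability p_i,
                     given state s and decision x at stage n
    cost[n][s][x] -- dict mapping next state i to the cost C_i
    terminal      -- dict mapping final states to terminal costs
    actions(n, s) -- iterable of feasible decisions at stage n, state s
    """
    f_star = {stages: dict(terminal)}   # boundary values at the horizon
    policy = {}
    for n in reversed(range(stages)):
        f_star[n], policy[n] = {}, {}
        for s in states:
            best_x, best_val = None, float("inf")
            for x in actions(n, s):
                # expected cost: sum_i p_i * (C_i + f*_{n+1}(i))
                val = sum(p * (cost[n][s][x][i] + f_star[n + 1][i])
                          for i, p in prob[n][s][x].items())
                if val < best_val:
                    best_x, best_val = x, val
            f_star[n][s], policy[n][s] = best_val, best_x
    return f_star, policy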


EXAMPLE: DETERMINING REJECT ALLOWANCES

A company has received an order to supply one item of a particular
type. However, the customer has specified such stringent quality
requirements that the manufacturer may have to produce more than
one item to obtain an item that is acceptable.

The number of extra items produced in a production run is called the
reject allowance. Including a reject allowance is common practice
when producing for a custom order, and it seems advisable in this
case.


The manufacturer estimates that each item of this type that is
produced will be acceptable with probability 1/2 and defective with
probability 1/2. Thus, the number of acceptable items produced in a
lot of size $L$ will have a binomial distribution; i.e., the probability
of producing no acceptable items in such a lot is $(1/2)^L$.

Marginal production costs for this product are estimated to be $100
per item (even if defective), and excess items are worthless. In
addition, a setup cost of $300 must be incurred whenever the
production process is set up for this product, and a completely new
setup at this same cost is required for each subsequent production
run if a lengthy inspection procedure reveals that a completed lot has
not yielded an acceptable item.


The manufacturer has time to make no more than three production
runs. If an acceptable item has not been obtained by the end of the
third production run, the cost to the manufacturer in lost sales income
and penalty costs will be $1,600.

The objective is to determine the policy regarding the lot size for the
required production run(s) that minimizes total expected cost for the
manufacturer.
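A short numeric sketch of the backward recursion for this example,
written in Python. The only assumption beyond the problem data is that
lot sizes of at most 5 are worth considering at each run.

SETUP, UNIT, PENALTY, P_DEFECT, RUNS = 300, 100, 1600, 0.5, 3

def solve_reject_allowance(max_lot=5):
    f = PENALTY        # cost-to-go if no acceptable item after the last run
    policy = []
    for n in reversed(range(RUNS)):
        # minimize setup + production cost + P(all defective) * future cost
        best = min(
            ((SETUP * (x > 0) + UNIT * x + P_DEFECT ** x * f, x)
             for x in range(max_lot + 1)),
            key=lambda t: t[0],
        )
        f = best[0]
        policy.append((n + 1, best[1]))
    return f, list(reversed(policy))

cost, policy = solve_reject_allowance()
print(cost, policy)    # 675.0, lot sizes 2, 2, 3

This sketch reports a minimal expected cost of $675, producing 2 items in
the first run, then 2 (or 3), then 3 (or 4) if an acceptable item is still
needed; ties are broken toward smaller lots here.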


EXAMPLE: WINNING IN LAS VEGAS

An enterprising young statistician believes that she has developed a
system for winning a popular Las Vegas game. Her colleagues do not
believe that her system works, so they have made a large bet with her
that if she starts with three chips, she will not have at least five chips
after three plays of the game.

Each play of the game involves betting any desired number of
available chips and then either winning or losing this number of chips.
The statistician believes that her system will give her a probability of
2/3 of winning a given play of the game.


Assuming the statistician is correct, we now use dynamic
programming to determine her optimal policy regarding how many
chips to bet at each of the three plays of the game.

The decision at each play should take into account the results of
earlier plays. The objective is to maximize the probability of winning
her bet with her colleagues.
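A brute-force Python check of this example: a memoized recursion over
(chips, plays remaining), with integer bets assumed.

from functools import lru_cache

P_WIN, TARGET, PLAYS, START = 2 / 3, 5, 3, 3

@lru_cache(maxsize=None)
def win_prob(chips, plays_left):
    """Maximal probability of finishing with at least TARGET chips."""
    if plays_left == 0:
        return 1.0 if chips >= TARGET else 0.0
    return max(
        P_WIN * win_prob(chips + bet, plays_left - 1)
        + (1 - P_WIN) * win_prob(chips - bet, plays_left - 1)
        for bet in range(chips + 1)
    )

print(win_prob(START, PLAYS))   # 0.7407... = 20/27

The recursion gives a win probability of 20/27, attained by betting one
chip on the first play and then adjusting to the outcomes.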


EXAMPLE: A GAMBLING MODEL

At each play of the game a gambler can bet any nonnegative
amount up to his present fortune and will either win or lose that
amount with probabilities $p$ and $q = 1 - p$, respectively. The
gambler is allowed to make $n$ bets, and his objective is to
maximize the expectation of the logarithm of his final fortune.
What strategy achieves this end?

What's the state space, action space and value function?



Let $V_n(x)$ denote the maximal expected return if the gambler has a
present fortune of $x$ and is allowed $n$ more gambles.

Let $x$ denote the present fortune and $\alpha$ the fraction of the
gambler's fortune that is bet. Then

$$V_n(x) = \max_{0 \le \alpha \le 1} \left\{ p V_{n-1}(x + \alpha x) + q V_{n-1}(x - \alpha x) \right\},$$

with the boundary condition $V_0(x) = \log x$.
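For this model the recursion admits a closed form: induction on $n$ gives
$V_n(x) = \log x + nC$, so every stage reduces to maximizing
$g(\alpha) = p \log(1+\alpha) + q \log(1-\alpha)$, whose maximizer is the
fixed fraction $\alpha^* = p - q$ when $p > 1/2$ (and $\alpha^* = 0$
otherwise), the Kelly fraction. The Python grid search below is an
illustrative numeric check of that single-stage maximization.

import math

def best_fraction(p, grid=10_001):
    """Maximize p*log(1 + a) + (1 - p)*log(1 - a) over a in [0, 1)."""
    best_a, best_val = 0.0, -math.inf
    for k in range(grid - 1):          # stop short of a = 1 (log 0)
        a = k / (grid - 1)
        val = p * math.log1p(a) + (1 - p) * math.log1p(-a)
        if val > best_val:
            best_a, best_val = a, val
    return best_a

print(best_fraction(0.7))   # 0.4 = 2p - 1, the Kelly fraction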




EXAMPLE: INVENTORY CONTROL

A store has, at time $n$, $s_n$ items in stock. It then orders (and
receives) $a_n$ items, and sells $d_n$ items, where $d_n$ follows a given
probability distribution. Assume that backorders are not allowed.
Thus,

$$s_{n+1} = (s_n + a_n - d_n)^+.$$

Suppose there is a unit holding cost $h$ for any inventory on hand
($s_{n+1} > 0$), as well as a unit shortage penalty $p$ for any demand
that is not fulfilled (i.e., when $d_n > s_n + a_n$). Then the (random)
cost incurred in period $n$ is

$$r(s_n, a_n) = h \cdot (s_n + a_n - d_n)^+ + p \cdot (d_n - s_n - a_n)^+.$$


Objective for an $N$-stage problem:

$$V_0(s_0) = \min_{a_0, \ldots, a_{N-1}} \mathbb{E} \left[ \sum_{n=0}^{N-1} r(s_n, a_n) \right],$$

where $V_0(s_0)$ is the expected total cost over the $N$-period planning
horizon.

Optimality equation

$$V_n(s_n) = \min_{a_n \in A} \left\{ R(s_n, a_n) + \mathbb{E}_F \left[ V_{n+1}(s_{n+1}) \right] \right\},$$

where $R(s_n, a_n) = \mathbb{E}[r(s_n, a_n)]$, the expectation $\mathbb{E}_F$
is taken over the demand distribution $F$, and the recursion starts from
the terminal condition $V_N(s_N) = 0$.
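A compact Python sketch of this backward recursion. The demand
distribution, cost parameters, capacity, and horizon below are all
assumptions chosen for illustration.

H, P = 1.0, 4.0                       # unit holding cost h, shortage penalty p
DEMAND = {0: 0.25, 1: 0.5, 2: 0.25}   # assumed pmf of d_n
MAX_INV, N = 5, 4                     # storage capacity and horizon

def solve_inventory():
    V = [0.0] * (MAX_INV + 1)         # terminal condition V_N(s) = 0
    policy = []
    for n in reversed(range(N)):
        V_new, acts = [], []
        for s in range(MAX_INV + 1):
            # minimize R(s, a) + E[V_{n+1}(s_{n+1})] over feasible orders a
            best = min(
                (sum(pr * (H * max(s + a - d, 0) + P * max(d - s - a, 0)
                           + V[max(s + a - d, 0)])
                     for d, pr in DEMAND.items()), a)
                for a in range(MAX_INV - s + 1)   # keep s + a within capacity
            )
            V_new.append(best[0])
            acts.append(best[1])
        V, policy = V_new, [acts] + policy
    return V, policy

V0, policy = solve_inventory()
print(V0[0], policy[0])   # cost-to-go from empty stock; stage-0 order rule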


EXAMPLE: PRICING TO MAXIMIZE REVENUE

Assume that you have $s_0$ items to sell over $N$ periods, and that
you cannot order more items during that time. However, you can
influence demand by varying the price from period to period. The
problem is to select a price at the beginning of each time period
to maximize the total expected revenue.

What's the state space, action space, reward function and state
transition?



Let (1) $s_n$ be the inventory at the beginning of period $n$, (2) $a_n$
be the price per item during period $n$, and (3) $d_n$ be the amount
demanded during period $n$.

Since backorders are not allowed, the state is updated by

$$s_{n+1} = (s_n - d_n)^+.$$

The (random) revenue during period $n$ is

$$r(s_n, a_n) = a_n \cdot \min(s_n, d_n).$$


Objective for an $N$-stage problem:

$$V_0(s_0) = \max_{a_0, \ldots, a_{N-1}} \mathbb{E} \left[ \sum_{n=0}^{N-1} r(s_n, a_n) \right],$$

where $V_0(s_0)$ is the expected total revenue over the $N$-period
planning horizon.

Optimality equation

$$V_n(s_n) = \max_{a_n \in A} \left\{ R(s_n, a_n) + \mathbb{E}_F \left[ V_{n+1}(s_{n+1}) \right] \right\},$$

where $R(s_n, a_n) = \mathbb{E}[r(s_n, a_n)]$.
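The same recursion in Python, with an invented price menu and a
price-sensitive demand model standing in for the unspecified
distribution; every number here is an assumption.

PRICES = [4.0, 6.0, 8.0]   # hypothetical price menu (the action set A)
MAX_D = 3                  # demand per period capped at 3 for illustration

def demand_pmf(price):
    """Hypothetical pmf of d_n: higher prices shift mass toward 0."""
    mean = max(0.2, 2.0 - 0.2 * price)
    weights = [mean ** d for d in range(MAX_D + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def solve_pricing(s0, N):
    V = [0.0] * (s0 + 1)   # terminal condition: leftover items earn nothing
    acts = []
    for n in reversed(range(N)):
        V_new, acts = [], []
        for s in range(s0 + 1):
            # maximize R(s, a) + E[V_{n+1}(s_{n+1})] over the price menu
            best = max(
                (sum(pr * (a * min(s, d) + V[max(s - d, 0)])
                     for d, pr in enumerate(demand_pmf(a))), a)
                for a in PRICES
            )
            V_new.append(best[0])
            acts.append(best[1])
        V = V_new
    return V, acts          # stage-0 values and stage-0 pricing rule

V, prices = solve_pricing(s0=4, N=3)
print(V[4], prices[4])      # expected revenue and opening price, 4 items left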
