Learning and Planning in Dynamical Systems – 046194 Spring 2016
Formula Sheet
Section 1 – Deterministic Dynamic Programming
Finite-horizon Dynamic Programming
(i) Initialize the value function: $V_N(s) = r_N(s)$, $\forall s \in S_N$.
(ii) Backward recursion: For $k = N-1, \dots, 0$, compute
$$V_k(s) = \max_{a \in A_k} \left\{ r_k(s,a) + V_{k+1}\big(f_k(s,a)\big) \right\}, \quad s \in S_k$$
(iii) Optimal policy: Choose any control policy $\pi = (\pi_k)$ that satisfies:
$$\pi_k(s) \in \arg\max_{a \in A_k} \left\{ r_k(s,a) + V_{k+1}\big(f_k(s,a)\big) \right\}, \quad k = 0, \dots, N-1$$
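As an illustration, a minimal backward-recursion sketch in Python; the state/action sets and the callables `r`, `f`, `r_final` are hypothetical placeholders, not part of the formula sheet:

def finite_horizon_dp(states, actions, r, f, r_final, N):
    """Backward recursion for deterministic finite-horizon DP.

    r(k, s, a): immediate reward, f(k, s, a): next state (assumed to stay in `states`),
    r_final(s): terminal reward r_N(s).
    """
    V = {s: r_final(s) for s in states}          # V_N
    policy = [dict() for _ in range(N)]
    for k in range(N - 1, -1, -1):               # k = N-1, ..., 0
        V_new = {}
        for s in states:
            best_a, best_v = None, float("-inf")
            for a in actions:
                v = r(k, s, a) + V[f(k, s, a)]   # r_k(s,a) + V_{k+1}(f_k(s,a))
                if v > best_v:
                    best_a, best_v = a, v
            V_new[s] = best_v
            policy[k][s] = best_a
        V = V_new                                # V_k
    return V, policy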
Shortest Path on a Graph
DP equation (or Bellman's equation) for the shortest path problem:
$$d(u,v) = \min\big\{ w(u,u') + d(u',v) \;:\; (u,u') \in E \big\}$$
Bellman-Ford Algorithm:
Input: a weighted directed graph $G = (V, E)$ and a destination node $t$.
Initialization: $d[t] = 0$; $d[v] = \infty$ for $v \in V \setminus \{t\}$; $\pi[v] = \emptyset$ for $v \in V$.
for $i = 1$ to $|V| - 1$:
    $d'[v] \leftarrow d[v]$, $\forall v \in V \setminus \{t\}$
    for each vertex $v \in V \setminus \{t\}$:
        compute $q[v] = \min_u \{ w(v,u) + d'[u] : (v,u) \in E \}$
        if $q[v] < d[v]$:
            set $d[v] = q[v]$, $\pi[v] = \arg\min_u \{ w(v,u) + d'[u] : (v,u) \in E \}$
return $\{ d[v], \pi[v] \}$
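A compact Python sketch of this destination-oriented Bellman-Ford; the edge-list representation and weight dictionary are assumptions made for the illustration:

import math

def bellman_ford_to_dest(vertices, edges, w, t):
    """Shortest-path distances d[v] from every v to destination t.

    edges: iterable of (v, u) pairs, w[(v, u)]: edge weight.
    Returns distances d and successor pointers pi.
    """
    d = {v: math.inf for v in vertices}
    pi = {v: None for v in vertices}
    d[t] = 0.0
    for _ in range(len(vertices) - 1):           # |V| - 1 rounds
        d_prev = dict(d)                         # d' <- d
        for (v, u) in edges:
            if v == t:
                continue
            cand = w[(v, u)] + d_prev[u]         # w(v,u) + d'[u]
            if cand < d[v]:
                d[v], pi[v] = cand, u
    return d, pi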
Dijkstra's Algorithm:
Input: a weighted directed graph, and a destination node $t$.
Initialization: $d[t] = 0$; $d[v] = \infty$ for $v \in V \setminus \{t\}$; $\pi[v] = \emptyset$ for $v \in V$; $S = \emptyset$.
while $S \neq V$:
    choose $u \in V \setminus S$ with minimal value $d[u]$, add it to $S$
    for each vertex $v$ with $(v,u) \in E$:
        if $d[v] > w(v,u) + d[u]$:
            set $d[v] = w(v,u) + d[u]$, $\pi[v] = u$
return $\{ d[v], \pi[v] \}$
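A sketch of the same destination-oriented Dijkstra using a binary heap; the reversed-adjacency representation `in_edges` is an assumption made for the illustration:

import heapq
import math

def dijkstra_to_dest(vertices, in_edges, t):
    """Dijkstra toward destination t (assumes nonnegative edge weights).

    in_edges[u]: list of (v, w_vu) pairs such that (v, u) is in E,
    i.e. the edges entering u, so relaxations run backwards from t.
    """
    d = {v: math.inf for v in vertices}
    pi = {v: None for v in vertices}
    d[t] = 0.0
    heap = [(0.0, t)]
    done = set()
    while heap:
        du, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)                              # u joins S with its final d[u]
        for v, w_vu in in_edges.get(u, []):      # edges (v, u) in E
            if d[v] > w_vu + du:
                d[v] = w_vu + du
                pi[v] = u
                heapq.heappush(heap, (d[v], v))
    return d, pi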
Section 2 – Planning
Markov Chains
Stationary Distribution: $\pi$ is a stationary distribution for the Markov chain if $\pi P = \pi$, namely $\pi_j = \sum_i \pi_i p_{ij}$, $\forall j$.
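For concreteness, a minimal power-iteration sketch for computing $\pi$ numerically; the transition matrix and tolerance below are illustrative assumptions:

import numpy as np

def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Power iteration for pi P = pi on a row-stochastic matrix P."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)                     # start from the uniform distribution
    for _ in range(max_iter):
        pi_next = pi @ P                         # one step of pi <- pi P
        if np.abs(pi_next - pi).sum() < tol:
            return pi_next
        pi = pi_next
    return pi

# Example: a simple two-state chain
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
print(stationary_distribution(P))                # approx [0.833, 0.167]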
Theorem: Recurrence of finite Markov chains: Let $(X_t)$ be an irreducible, aperiodic Markov chain over a finite state space $X$. Then the following properties hold:
1. All states are positive recurrent.
2. There exists a unique stationary distribution $\pi$.
3. Convergence to the stationary distribution: $\lim_{t \to \infty} p_{ij}^{(t)} = \pi_j$ $(\forall j)$.
4. Ergodicity: $\lim_{t \to \infty} \frac{1}{t} \sum_{s=0}^{t-1} f(X_s) = \sum_i \pi(i) f(i) \triangleq \pi f$.
Theorem: Countable Markov chains: Let $(X_t)$ be an irreducible and aperiodic Markov chain over a countable state space $X$. Then:
(a) Either (i) all states are positive recurrent, or (ii) all states are null recurrent, or (iii) all
states are transient.
(b) If (i) holds, then properties (2)-(4) of the previous Theorem hold as well.
(c) Conversely, if there exists a stationary distribution then properties (1)-(4) are
satisfied.
Contraction Operators
Norm: A norm $\|\cdot\|$ over $\mathbb{R}^n$ is a real-valued function $\|\cdot\|: \mathbb{R}^n \to \mathbb{R}$ such that, for any pair of vectors $x, y \in \mathbb{R}^n$ and scalar $a$,
(1) $\|ax\| = |a|\,\|x\|$, (2) $\|x + y\| \le \|x\| + \|y\|$, (3) $\|x\| = 0$ only if $x = 0$.
Contraction: An operator $T: \mathbb{R}^n \to \mathbb{R}^n$ is a contraction in $\|\cdot\|$ if there exists $\beta \in (0,1)$ such that
$$\|T(v_1) - T(v_2)\| \le \beta \|v_1 - v_2\|, \quad \text{for all } v_1, v_2 \in \mathbb{R}^n$$
Theorem: Banach's fixed point theorem
Let $T: \mathbb{R}^n \to \mathbb{R}^n$ be a contraction operator. Then
(C1) The equation $T(v) = v$ has a unique solution $V^* \in \mathbb{R}^n$.
(C2) For any $v_0 \in \mathbb{R}^n$, $\lim_{n \to \infty} T^n(v_0) = V^*$. In fact, $\|T^n(v_0) - V^*\| = O(\beta^n)$.
Finite Horizon MDPs
Performance Criterion
$$J^\pi(s_0) = \mathbb{E}^{\pi, s_0}\left[ \sum_{t=0}^{T-1} r_t(s_t, a_t) + r_T(s_T) \right]$$
Finite-horizon Dynamic Programming
(i) Backward recursion: Set $V_T(s) = r_T(s)$ for $s \in S_T$.
For $k = T-1, \dots, 0$, $V_k(s)$ may be computed using the following recursion:
$$V_k(s) = \max_{a \in A_k} \left\{ r_k(s,a) + \sum_{s' \in S_{k+1}} p_k(s' \mid s, a)\, V_{k+1}(s') \right\}, \quad s \in S_k$$
(ii) Optimal policy: Any Markov policy $\pi^*$ that satisfies, for $t = 0, \dots, T-1$,
$$\pi_t^*(s) \in \arg\max_{a \in A_t} \left\{ r_t(s,a) + \sum_{s' \in S_{t+1}} p_t(s' \mid s, a)\, V_{t+1}(s') \right\}, \quad s \in S_t$$
is an optimal control policy. Furthermore, $\pi^*$ maximizes $J^\pi(s_0)$ simultaneously for every initial state $s_0 \in S_0$.
Discounted MDPs
Value Function (fixed policy)
$$V^\pi(s) = \mathbb{E}^{\pi, s}\left( \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right)$$
Bellman (fixed policy)
$$V^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi(s))\, V^\pi(s'), \quad \forall s \in S$$
Vector form: $V^\pi = r^\pi + \gamma P^\pi V^\pi$.
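The vector form gives exact policy evaluation by a linear solve, $V^\pi = (I - \gamma P^\pi)^{-1} r^\pi$; a minimal sketch, where the array inputs are illustrative assumptions:

import numpy as np

def evaluate_policy(P_pi, r_pi, gamma):
    """Solve (I - gamma * P_pi) V = r_pi for the fixed-policy value V^pi.

    P_pi: (n, n) transition matrix under the policy, r_pi: (n,) reward vector.
    """
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)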
Optimal Value Function
$$V^*(s) = \sup_\pi \mathbb{E}^{\pi, s}\left( \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right)$$
Theorem: Bellman's Optimality Equation
(i) $V^*$ is the unique solution of the following set of (nonlinear) equations:
$$V(s) = \max_{a \in A} \left\{ r(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V(s') \right\}, \quad \forall s \in S.$$
(ii) Any stationary policy $\pi^*$ that satisfies
$$\pi^*(s) \in \arg\max_{a \in A} \left\{ r(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V^*(s') \right\}$$
is an optimal policy (for any initial state $s_0 \in S$).
Operator Notation
$$(T^\pi(V))(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi(s))\, V(s'), \quad s \in S$$
$$(T^*(V))(s) = \max_{a \in A} \left\{ r(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V(s') \right\}, \quad s \in S$$
Theorem: Contraction property
(i) $T^\pi$ is a $\gamma$-contraction operator with respect to the max-norm, namely
$$\|T^\pi(V_1) - T^\pi(V_2)\|_\infty \le \gamma \|V_1 - V_2\|_\infty \quad \text{for all } V_1, V_2 \in \mathbb{R}^{|S|}$$
(ii) Similarly, $T^*$ is a $\gamma$-contraction operator with respect to the max-norm.
Value Iteration
Starting with an arbitrary $V_0(s)$, $s \in S$, compute recursively
$$V_{n+1}(s) = \max_{a \in A} \left\{ r(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V_n(s') \right\}, \quad s \in S$$
Then $\lim_{n \to \infty} V_n = V^*$ (component-wise). The rate of convergence is $O(\gamma^n)$.
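A minimal value-iteration sketch in Python; the array shapes and stopping tolerance are illustrative assumptions:

import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_iter=10_000):
    """Value iteration for a discounted MDP.

    P: (A, S, S) array with P[a, s, s'] = p(s' | s, a),
    r: (S, A) reward matrix. Returns (V, greedy policy).
    """
    V = np.zeros(r.shape[0])
    for _ in range(max_iter):
        Q = r + gamma * (P @ V).T       # Q[s, a] = r(s,a) + gamma * sum_s' p(s'|s,a) V(s')
        V_new = Q.max(axis=1)           # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)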
Policy Iteration
Initialization: choose some stationary policy $\pi_0$.
For $k = 0, 1, \dots$:
1. Policy evaluation: compute $V^{\pi_k}$.
2. Policy improvement: compute $\pi_{k+1}$, a greedy policy with respect to $V^{\pi_k}$:
$$\pi_{k+1}(s) \in \arg\max_{a \in A} \left\{ r(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V^{\pi_k}(s') \right\}.$$
3. Stop if $V^{\pi_{k+1}} = V^{\pi_k}$ (or if $V^{\pi_k}$ satisfies the optimality equation); else repeat.
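A compact policy-iteration sketch combining exact evaluation with greedy improvement; the `P`, `r` array conventions are the same illustrative assumptions used above:

import numpy as np

def policy_iteration(P, r, gamma):
    """Policy iteration for a discounted MDP.

    P: (A, S, S) with P[a, s, s'] = p(s' | s, a), r: (S, A).
    """
    n_states, n_actions = r.shape
    pi = np.zeros(n_states, dtype=int)           # arbitrary initial policy pi_0
    while True:
        # 1. Policy evaluation: solve (I - gamma P^pi) V = r^pi
        P_pi = P[pi, np.arange(n_states), :]     # (S, S): row s is P[pi[s], s, :]
        r_pi = r[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # 2. Policy improvement: greedy policy w.r.t. V^{pi_k}
        Q = r + gamma * (P @ V).T
        pi_new = Q.argmax(axis=1)
        # 3. Stop when the policy (hence its value) no longer changes
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new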
Theorem: Policy Iteration
(i) Each policy $\pi_{k+1}$ is improving over the previous one $\pi_k$, in the sense that $V^{\pi_{k+1}} \ge V^{\pi_k}$ (component-wise).
(ii) $V^{\pi_{k+1}} = V^{\pi_k}$ if and only if $\pi_k$ is an optimal policy.
(iii) $\pi_k$ converges to the optimal policy after a finite number of steps.
Section 3 – Learning – MAB
Hoeffding's Inequality:
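A standard form, assuming i.i.d. samples $X_1, \dots, X_n \in [0,1]$ with mean $\mu$ and empirical mean $\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ (the $[0,1]$ normalization is an assumption):
$$\Pr\big( |\hat{\mu}_n - \mu| \ge \epsilon \big) \le 2 \exp\left( -2 n \epsilon^2 \right)$$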
Union bound:
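For any (not necessarily independent) events $A_1, \dots, A_n$:
$$\Pr\left( \bigcup_{i=1}^{n} A_i \right) \le \sum_{i=1}^{n} \Pr(A_i)$$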
Exploratory MAB
PAC definition: An algorithm is an $(\varepsilon, \delta)$-PAC algorithm for the multi-armed bandit with complexity $T$, if it outputs an $\varepsilon$-optimal arm $a'$ with probability at least $1 - \delta$ when it
terminates, and the number of time steps the algorithm performs until it terminates is bounded by $T$.
Naïve Algorithm:
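A sketch of the usual naïve $(\varepsilon,\delta)$-PAC scheme, under the assumption that each of the $k$ arms is sampled $\ell = \frac{4}{\varepsilon^2}\ln\frac{2k}{\delta}$ times and the empirically best arm is returned; `pull(i)` is a hypothetical sampling function:

import math

def naive_pac(pull, k, eps, delta):
    """Naive (eps, delta)-PAC bandit algorithm (sketch).

    pull(i) returns one stochastic reward in [0, 1] from arm i.
    Samples every arm the same number of times, returns the empirical best.
    """
    ell = math.ceil((4.0 / eps**2) * math.log(2.0 * k / delta))  # samples per arm (assumed constant)
    means = []
    for i in range(k):
        means.append(sum(pull(i) for _ in range(ell)) / ell)
    return max(range(k), key=lambda i: means[i])                 # empirically best arm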
Successive Elimination (known biases):
Successive Elimination (unknown biases):
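A sketch of the unknown-biases variant, under the assumption that every surviving arm is sampled once per round and an arm is eliminated when its empirical mean falls more than $2\alpha_t$ below the current best, with $\alpha_t = \sqrt{\ln(4kt^2/\delta)/t}$ (the constant inside the logarithm is an assumption):

import math

def successive_elimination(pull, k, delta, max_rounds=100_000):
    """Successive elimination with unknown biases (sketch).

    pull(i) returns one stochastic reward in [0, 1] from arm i.
    Keeps sampling all surviving arms until a single arm remains.
    """
    surviving = set(range(k))
    sums = [0.0] * k
    for t in range(1, max_rounds + 1):
        for i in surviving:
            sums[i] += pull(i)                   # one sample per surviving arm per round
        means = {i: sums[i] / t for i in surviving}
        alpha = math.sqrt(math.log(4.0 * k * t * t / delta) / t)   # confidence radius (assumed form)
        best = max(means.values())
        surviving = {i for i in surviving if best - means[i] <= 2.0 * alpha}
        if len(surviving) == 1:
            return surviving.pop()
    return max(surviving, key=lambda i: sums[i]) # fallback if the round budget runs out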
Learning and Planning in Dynamical Systems – 046194 Spring 2016
Median Elimination:
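A sketch following the scheme of Even-Dar, Mannor and Mansour: halve the surviving arms by their empirical median each phase while tightening $(\varepsilon_\ell, \delta_\ell)$; the per-phase sample count is an assumption taken from that paper:

import math

def median_elimination(pull, k, eps, delta):
    """Median elimination (eps, delta)-PAC algorithm (sketch).

    pull(i) returns one stochastic reward in [0, 1] from arm i.
    """
    surviving = list(range(k))
    eps_l, delta_l = eps / 4.0, delta / 2.0
    while len(surviving) > 1:
        n_samples = math.ceil((4.0 / eps_l**2) * math.log(3.0 / delta_l))  # samples per arm this phase
        means = {i: sum(pull(i) for _ in range(n_samples)) / n_samples
                 for i in surviving}
        surviving.sort(key=lambda i: means[i], reverse=True)
        surviving = surviving[: (len(surviving) + 1) // 2]   # keep arms at or above the empirical median
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2.0
    return surviving[0]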
Regret Minimization in MAB
Expected Regret definition: $\mathbb{E}[\text{regret}_t] = t\, r^* - \mathbb{E}\left[ \sum_{\tau=1}^{t} r_\tau \right]$.
UCB algorithm:
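A sketch of the standard UCB1 index rule, $\hat{\mu}_i + \sqrt{2\ln t / n_i}$, following Auer et al.; whether the course uses exactly this constant is an assumption, and `pull(i)` is a hypothetical sampling function:

import math

def ucb1(pull, k, horizon):
    """UCB1 (sketch): play each arm once, then always the arm with the highest index.

    pull(i) returns one stochastic reward in [0, 1] from arm i.
    Returns the total collected reward.
    """
    counts = [0] * k
    sums = [0.0] * k
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            i = t - 1                            # initialization: each arm once
        else:
            i = max(range(k),
                    key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(t) / counts[j]))   # empirical mean + exploration bonus
        reward = pull(i)
        counts[i] += 1
        sums[i] += reward
        total += reward
    return total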
Arm differences: $\Delta_i = r^* - r_i$.
Theorem: UCB
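The regret bound proved by Auer, Cesa-Bianchi and Fischer for the UCB1 index above (stated here in their form and with their constants; that this is the intended version is an assumption): for any number of plays $n$,
$$\mathbb{E}[\text{regret}_n] \le \sum_{i : \Delta_i > 0} \frac{8 \ln n}{\Delta_i} + \left( 1 + \frac{\pi^2}{3} \right) \sum_{i=1}^{k} \Delta_i .$$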
Section 4 – Reinforcement Learning
Stochastic Approximation
Martingales: A sequence $(X_k)$ is a martingale with respect to a filtration $(\mathcal{F}_k)$ if $X_k$ is $\mathcal{F}_k$-measurable, $\mathbb{E}|X_k| < \infty$, and $\mathbb{E}(X_{k+1} \mid \mathcal{F}_k) = X_k$.
Martingale Difference: the same, but with $\mathbb{E}(X_{k+1} \mid \mathcal{F}_k) = 0$.
Stochastic Approximation Algorithm:
$$\theta_{n+1} = \theta_n + \alpha_n \left[ h(\theta_n) + \omega_n \right]$$
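A small illustration of the scheme: estimating the mean of a noisy signal, with $h(\theta) = \mathbb{E}[X] - \theta$ and step sizes $\alpha_n = 1/(n+1)$; the target distribution and step-size choice are assumptions made for the example:

import random

def stochastic_approximation(sample, theta0=0.0, n_steps=100_000):
    """Robbins-Monro iteration theta <- theta + alpha_n * (sample() - theta).

    Here h(theta) = E[X] - theta and the noise is X - E[X],
    so the iterates converge to the mean of the samples.
    """
    theta = theta0
    for n in range(n_steps):
        alpha = 1.0 / (n + 1)                    # step sizes: sum alpha_n = inf, sum alpha_n^2 < inf
        theta += alpha * (sample() - theta)      # noisy observation of h(theta) + omega_n
    return theta

# Example: samples from N(2, 1); the iterates approach 2
print(stochastic_approximation(lambda: random.gauss(2.0, 1.0)))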
Assumptions:
SA Convergence Theorems:
Reinforcement Learning – Tabular Methods
Deterministic Q-learning:
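For deterministic dynamics $s' = f(s,a)$, the update is commonly written as follows (stated here as an assumption about the intended form):
$$\hat{Q}(s, a) \leftarrow r(s, a) + \gamma \max_{a'} \hat{Q}(s', a')$$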
Policy Evaluation: TD(0), TD($\lambda$)
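The standard tabular updates, with step size $\alpha_t$ and temporal-difference error $\delta_t = r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)$; the accumulating-trace form of TD($\lambda$) below is the usual one and is stated as an assumption:
$$\text{TD(0):} \quad \hat{V}(s_t) \leftarrow \hat{V}(s_t) + \alpha_t \delta_t$$
$$\text{TD}(\lambda): \quad e_t(s) = \gamma \lambda\, e_{t-1}(s) + \mathbb{1}\{s = s_t\}, \qquad \hat{V}(s) \leftarrow \hat{V}(s) + \alpha_t \delta_t\, e_t(s) \ \ \forall s$$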
Policy Evaluation: SARSA
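The standard SARSA update on the observed tuple $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$ (again the usual form, stated as an assumption):
$$\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \alpha_t \left[ r_t + \gamma \hat{Q}(s_{t+1}, a_{t+1}) - \hat{Q}(s_t, a_t) \right]$$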
Policy Improvement: Q-learning
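The standard tabular Q-learning update is $\hat{Q}(s_t,a_t) \leftarrow \hat{Q}(s_t,a_t) + \alpha_t \big[ r_t + \gamma \max_{a'} \hat{Q}(s_{t+1},a') - \hat{Q}(s_t,a_t) \big]$; a minimal sketch with $\varepsilon$-greedy exploration, where the environment interface `env.reset()` / `env.step(a)` and the constant step size are illustrative assumptions:

import random
from collections import defaultdict

def q_learning(env, n_actions, gamma, n_episodes=5_000, alpha=0.1, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (sketch)."""
    Q = defaultdict(float)                       # Q[(s, a)], defaults to 0
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:        # explore
                a = random.randrange(n_actions)
            else:                                # exploit: greedy action
                a = max(range(n_actions), key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * max(Q[(s_next, b)] for b in range(n_actions)))
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # Q-learning update
            s = s_next
    return Q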
Reinforcement Learning – Function Approximation
Linear Function Approximation:
$$\hat{V}(s) = \phi(s)^\top w$$
$\xi$-weighted Euclidean norm:
$$\|X\|_\xi^2 = \sum_{j=1}^{n} \xi(j)\, X(j)^2$$
Projection Operator: $\Pi: \mathbb{R}^n \to \mathbb{R}^n$ projects a vector $V \in \mathbb{R}^n$ onto the linear subspace that is the span of $k$ features $\phi_1(s), \dots, \phi_k(s)$, w.r.t. $\|\cdot\|_\xi$:
$$\Pi V = \Phi w^*, \qquad w^* = \arg\min_{w \in \mathbb{R}^k} \|\Phi w - V\|_\xi^2$$
where $\Phi$ is the matrix whose rows are the feature vectors $\phi(s)^\top$. Explicitly, we can write $\Pi$ in vector notation as $\Pi = \Phi (\Phi^\top \Xi \Phi)^{-1} \Phi^\top \Xi$, with $\Xi = \mathrm{diag}(\xi)$.
The projection is non-expansive: $\|\Pi V - \Pi \bar{V}\|_\xi \le \|V - \bar{V}\|_\xi$.
Lemma: When $\xi$ is the stationary distribution: $\|P z\|_\xi^2 \le \|z\|_\xi^2$.
Theorem: $\Pi T^\pi$ is a contraction in the $\|\cdot\|_\xi$ norm. Its unique fixed point $\Phi w$ satisfies:
$$\Phi^\top \Xi \big( (\gamma P - I)\Phi w + r \big) = 0 .$$
TD(0) with Function Approximation
$$w_{t+1} = w_t + \alpha_t \delta_t \phi(s_t), \qquad \delta_t = r(s_t) + \gamma \phi(s_{t+1})^\top w_t - \phi(s_t)^\top w_t$$
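A minimal sketch of these updates along a sampled trajectory; the feature map `phi`, the environment interface, and the step-size schedule are illustrative assumptions:

import numpy as np

def linear_td0(env, phi, k, gamma, n_steps=100_000):
    """TD(0) with linear function approximation V_hat(s) = phi(s) . w (sketch).

    env.reset() -> s, env.step() -> (r, s_next) following the fixed policy;
    phi(s) -> length-k feature vector.
    """
    w = np.zeros(k)
    s = env.reset()
    for t in range(1, n_steps + 1):
        r, s_next = env.step()                   # reward and next state under the evaluated policy
        delta = r + gamma * phi(s_next) @ w - phi(s) @ w   # TD error delta_t
        w += (1.0 / t) * delta * phi(s)          # w_{t+1} = w_t + alpha_t * delta_t * phi(s_t)
        s = s_next
    return w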
Theorem: Convergence of TD(0) with Function Approximation. Let the step sizes satisfy G1. Also, assume that $\Phi$ is full rank, and that the stationary distribution $\xi$ exists. Then $\lim_{t \to \infty} w_t = w^*$ with probability 1, and $\Phi w^*$ is the fixed point of $\Pi T^\pi$.
Error Bound:
$$\|\Phi w^* - V^\pi\|_\xi \le \frac{1}{1 - \gamma}\, \|\Pi V^\pi - V^\pi\|_\xi$$
LSTD Algorithm:
$$\hat{A} = \sum_{i=1}^{N} \phi(s_i) \big( \phi(s_i) - \gamma \phi(s_i') \big)^\top$$
$$\hat{b} = \sum_{i=1}^{N} r(s_i)\, \phi(s_i)$$
$$w = \hat{A}^{-1} \hat{b}$$
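A batch LSTD sketch over $N$ observed transitions; the data format (a list of `(s, r, s_next)` tuples) and the feature map are illustrative assumptions:

import numpy as np

def lstd(transitions, phi, k, gamma):
    """LSTD(0): solve A_hat w = b_hat from a batch of transitions.

    transitions: iterable of (s, r, s_next) generated by the evaluated policy;
    phi(s): length-k feature vector.
    """
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)     # phi(s)(phi(s) - gamma phi(s'))^T
        b += r * f                               # r(s) phi(s)
    return np.linalg.solve(A, b)                 # w = A_hat^{-1} b_hat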
LSPI Algorithm:
Start with some arbitrary $w_0$.
for $m = 0, 1, 2, \dots$:
$$\hat{A} = \sum_{i=1}^{N} \phi(s_i, a_i) \big( \phi(s_i, a_i) - \gamma\, \phi(s_i', \text{greedy}(s_i'; w_m)) \big)^\top$$
$$\hat{b} = \sum_{i=1}^{N} r(s_i, a_i)\, \phi(s_i, a_i)$$
$$w_{m+1} = \hat{A}^{-1} \hat{b}$$
where $\text{greedy}(s; w) = \arg\max_a \phi(s, a)^\top w$.
Reinforcement Learning – Policy Gradients
Gradient ascent scheme:
$$\theta_{k+1} = \theta_k + \alpha_k \nabla_\theta J(\theta_k) .$$
Likelihood ratio formula:
$$\nabla_\theta J(\theta) = \nabla_\theta \sum_{\tau} R(\tau)\, p_\theta(\tau) = \sum_{\tau} \big[ R(\tau) \nabla_\theta \log p_\theta(\tau) \big] p_\theta(\tau) = \mathbb{E}_\theta\big[ R(\tau) \nabla_\theta \log p_\theta(\tau) \big]$$
Gradient estimation algorithm:
Simulate a single episode ("rollout") of the controlled system with policy $\pi_\theta$.
Compute $R(\tau)$ as $R(\tau) = \sum_{t=0}^{T-1} r(x_t, u_t) + r_T(x_T)$, or directly using the observed rewards, $R(\tau) = \sum_{t=0}^{T} R_t$.
Compute $\hat{\nabla}_\theta J(\theta) = R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t \mid x_t)$.
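A sketch of this single-rollout estimator for a softmax policy with linear features, $\pi_\theta(u \mid x) \propto \exp(\theta^\top \psi(x, u))$; the policy class, the feature map `psi`, and the environment interface are illustrative assumptions:

import numpy as np

def softmax_probs(theta, psi, x, actions):
    """pi_theta(u | x) proportional to exp(theta . psi(x, u))."""
    scores = np.array([theta @ psi(x, u) for u in actions])
    scores -= scores.max()                       # numerical stability
    p = np.exp(scores)
    return p / p.sum()

def rollout_gradient(env, theta, psi, actions):
    """One-rollout likelihood-ratio estimate R(tau) * sum_t grad log pi(u_t | x_t)."""
    grad_log = np.zeros_like(theta)
    R = 0.0
    x, done = env.reset(), False
    while not done:
        p = softmax_probs(theta, psi, x, actions)
        idx = np.random.choice(len(actions), p=p)
        u = actions[idx]
        # grad log pi(u|x) = psi(x,u) - sum_u' pi(u'|x) psi(x,u')
        grad_log += psi(x, u) - sum(p[j] * psi(x, actions[j]) for j in range(len(actions)))
        x, r, done = env.step(u)
        R += r
    return R * grad_log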
Policy Gradient Theorem:
$$\nabla_\theta J(\theta) = \mathbb{E}\left[ \sum_{t=0}^{T-1} \big( \nabla_\theta \log \pi_\theta(u_t \mid x_t) \big)\, Q^{\pi_\theta}(x_t, u_t) \right]$$