Contents
- 2.1 version 0.5
- 2.2 version 1.0
- 2.3 MPC
- 2.4 PILCO
- 2.5 model-based vs. model-free
- 2.6 what model to choose
- 4 policy learning with a learned model
1. Control and Planning
Open-loop trajectory optimization methods
assumptions: a (learned) dynamics model in hand
objective: find the action sequence that maximizes the expected return (equivalently, minimizes the expected cost) of the trajectory induced by that sequence under the known dynamics model.
inputs: a dynamics model, a cost function (to minimize)
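Written out for a deterministic model (with a stochastic model the sum sits inside an expectation):

```latex
% planning as trajectory optimization over the action sequence a_1..a_T
a_{1:T}^{*} = \arg\min_{a_1,\dots,a_T} \sum_{t=1}^{T} c(s_t, a_t)
\quad \text{s.t.} \quad s_{t+1} = f(s_t, a_t)
```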
side note open-loop versus closed-loop control
open-loop = no-feedback system
closed-loop = feedback system
feedback means the process output is fed back into the next decision.
1.1 stochastic optimization
1.1.1 random shooting
generate random action sequences from an arbitrary distribution, and take the sequence among them that yields the lowest cost as the final solution.
weakness: in a high-dimensional action space, the odds of sampling near-optimal action sequences are low (the curse of dimensionality).
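A minimal sketch of random shooting, where `dynamics(s, a)` and `cost(s, a)` are hypothetical placeholders standing in for the learned model and the cost to minimize:

```python
import numpy as np

def random_shooting(s0, dynamics, cost, horizon=20, n_samples=1000, action_dim=2):
    """Sample random action sequences and return the lowest-cost one."""
    best_seq, best_cost = None, np.inf
    for _ in range(n_samples):
        seq = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))  # arbitrary sampling distribution
        s, total = s0, 0.0
        for a in seq:
            total += cost(s, a)
            s = dynamics(s, a)          # roll forward under the (learned) model
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq
```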
1.1.2 CEM: Cross-Entropy Methods
loop {
1. generate samples (action sequences) from a parameterized distribution.
2. evaluate costs and keep the k lowest-cost samples (the elites)
3. refit the distribution to these k elites
}
intuition: use a parameterized distribution that iteratively adapts toward the lower-cost region of the action space.
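A minimal sketch of this loop, assuming a diagonal-Gaussian sampling distribution and the same hypothetical `dynamics(s, a)` / `cost(s, a)` placeholders as above:

```python
import numpy as np

def cem_plan(s0, dynamics, cost, horizon=20, action_dim=2,
             n_samples=500, k=50, n_iters=5):
    """Cross-Entropy Method over action sequences with a diagonal-Gaussian distribution."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))

    def seq_cost(seq):
        s, total = s0, 0.0
        for a in seq:
            total += cost(s, a)
            s = dynamics(s, a)
        return total

    for _ in range(n_iters):
        samples = mean + std * np.random.randn(n_samples, horizon, action_dim)  # 1. sample
        costs = np.array([seq_cost(seq) for seq in samples])                     # 2. evaluate
        elites = samples[np.argsort(costs)[:k]]                                  # keep k lowest-cost
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6               # 3. refit
    return mean  # the fitted mean is taken as the plan
```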
1.1.3 CMA-ES
like CEM with momentum
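A rough illustration of the "momentum" idea only, not the actual CMA-ES update rules (which also adapt a full covariance matrix and a step size); `cost_fn` is the same hypothetical sequence-cost placeholder as above:

```python
import numpy as np

def cem_momentum_step(mean, std, path, cost_fn, n_samples=64, k=8, alpha=0.7):
    """One CEM-style iteration with a momentum ('evolution path') term on the mean update."""
    samples = mean + std * np.random.randn(n_samples, *mean.shape)
    costs = np.array([cost_fn(x) for x in samples])
    elites = samples[np.argsort(costs)[:k]]
    shift = elites.mean(axis=0) - mean
    path = alpha * path + (1 - alpha) * shift     # running average of recent mean shifts
    mean = mean + path                            # move the mean with momentum
    std = elites.std(axis=0) + 1e-6
    return mean, std, path
```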
1.2 Monte Carlo Tree Search with Upper Confidence Bounds for Trees
MCTS
Given: a root node, a TreePolicy, a DefaultPolicy
do {
1. get to a leaf node according to the TreePolicy
2. rollout the node according to the DefaultPolicy until a terminating state.
3. compute the reward of the rollout trajectory and update the statistics along the route from the current leaf node to the root node
} until some criterion is met.
DefaultPolicy
Given: a node (state)
1. sample an action uniformly from the allowed action space of the state
2. return the next state (inferred under the known dynamics)
TreePolicy: UCT (Upper Confidence Bounds for Trees) algorithm
Given: the root node
1. do {
1. choose the best child node of the current node according to the UCT function
2. set the current node to this best child node.
} until the current node is expandable.
2. randomly choose an action from the allowed action space of the current node.
3. get to the next state under the known dynamics.
4. expand the current node with a new node of this state, return this child node.
The statistics of each node include the total reward and the number of visits of the subtree rooted at that node.
As for how we define the goodness of a child, we use the UCT function. The UCT function balances exploration and exploitation: intuitively, it prefers to expand nodes that either have high average returns or have been visited less often than their siblings.
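One common form of the UCT score (the exploration constant c and the exact bonus vary across implementations):

```latex
% u = parent(v); Q(v) = total reward of v's subtree, N(v) = visit count, c = exploration constant
UCT(v) = \frac{Q(v)}{N(v)} + c \sqrt{\frac{\ln N(u)}{N(v)}}
```

The first term rewards children with high average return (exploitation); the second grows for children visited less often than their siblings (exploration). The TreePolicy picks the child with the highest score.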
side note: this is straightforward for deterministic dynamics; for the stochastic case, some reparameterization tricks are required.
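A compact sketch of MCTS with UCT for the deterministic case, where `dynamics(s, a)`, `reward(s, a)`, the discrete action set `actions(s)`, and the `terminal(s)` test are hypothetical placeholders supplied by the caller:

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                      # action -> child Node
        self.visits, self.total_reward = 0, 0.0

def uct_score(parent, child, c=1.4):
    # exploitation (average return) + exploration bonus for rarely visited children
    return (child.total_reward / child.visits
            + c * math.sqrt(math.log(parent.visits) / child.visits))

def tree_policy(node, dynamics, actions):
    # descend via UCT until we reach an expandable node, then expand one untried action
    while True:
        fresh = [a for a in actions(node.state) if a not in node.children]
        if fresh:
            a = random.choice(fresh)
            child = Node(dynamics(node.state, a), parent=node)
            node.children[a] = child
            return child
        if not node.children:                   # terminal state: nothing to expand
            return node
        node = max(node.children.values(), key=lambda ch: uct_score(node, ch))

def default_policy(state, dynamics, reward, actions, terminal, max_depth=200):
    # uniform-random rollout under the known dynamics until a terminating state
    total = 0.0
    for _ in range(max_depth):
        if terminal(state):
            break
        a = random.choice(actions(state))
        total += reward(state, a)
        state = dynamics(state, a)
    return total

def mcts(root_state, dynamics, reward, actions, terminal, n_iters=1000):
    root = Node(root_state)
    for _ in range(n_iters):
        leaf = tree_policy(root, dynamics, actions)
        ret = default_policy(leaf.state, dynamics, reward, actions, terminal)
        node = leaf
        while node is not None:                 # update statistics back up to the root
            node.visits += 1
            node.total_reward += ret
            node = node.parent
    best_action, _ = max(root.children.items(), key=lambda kv: kv[1].visits)
    return best_action
```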
Case Study
Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning
combines DAgger with MCTS as follows:
loop {
1. train a policy on state-action pairs D
2. run the policy to obtain unseen states {o_j}
3. label the states with actions suggested by MCTS to forge state-action pairs D_u
4. aggregate D with D_u
}
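A sketch of that aggregation loop; `train_policy`, `run_policy`, and `mcts_action` are hypothetical callables standing in for the supervised learner, the policy rollout, and the MCTS planner of the paper:

```python
def dagger_with_mcts(train_policy, run_policy, mcts_action, n_rounds=10, init_data=()):
    """DAgger-style aggregation: states visited by the learner are labeled by an MCTS expert."""
    D = list(init_data)                                  # dataset of (state, action) pairs
    policy = None
    for _ in range(n_rounds):
        policy = train_policy(D)                         # 1. train on the current dataset
        states = run_policy(policy)                      # 2. states the current policy visits
        D_u = [(s, mcts_action(s)) for s in states]      # 3. label with MCTS-suggested actions
        D += D_u                                         # 4. aggregate D with D_u
    return policy
```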
1.3 Linear Quadratic Regulator (LQR)
given: a linear dynamics function and a quadratic cost function
objective: find the action sequence that minimizes the cost function under the dynamics.
By substituting the dynamics constraint into the objective, we turn it into an unconstrained optimization problem.
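Concretely, one common parameterization of "linear dynamics, quadratic cost" (consistent with the gains a_t = K_t*s_t + k_t used below; the matrices F_t, f_t, C_t, c_t are assumed given or fitted):

```latex
% linear dynamics and quadratic cost, written over the stacked vector [s_t; a_t]
s_{t+1} = f(s_t, a_t) = F_t \begin{bmatrix} s_t \\ a_t \end{bmatrix} + f_t,
\qquad
c(s_t, a_t) = \frac{1}{2} \begin{bmatrix} s_t \\ a_t \end{bmatrix}^{\top} C_t \begin{bmatrix} s_t \\ a_t \end{bmatrix}
            + \begin{bmatrix} s_t \\ a_t \end{bmatrix}^{\top} c_t
```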
At the last time step t, we first optimize the Q-function to get the optimal action.
Substituting this action back, we compute the optimum, which is the value function at this state.
For the previous time step t-1, the Bellman backup puts the Q-function back into quadratic form.
Again we solve for the optimal action at time t-1, and the recursion continues backwards in time.
1.3.1 The deterministic case
logic
loop {
1. At time t we can find the optimal action w.r.t. the Q-function. (The action is a linear function of the state.)
2. Since this action optimizes the cost function, it also optimizes the Q-function. Substituting a_t = K_t*s_t + k_t into the Q-function, we obtain the optimal value at time t with the only variable s_t. Now the value function is a quadratic function of s_t.
3. At time t-1, by the Bellman backup, the Q-function equals the current-step cost plus the next-step value function. We already have the time-t value function in terms of s_t; substituting the dynamics function for s_t eliminates it, so the Q-function at time t-1 depends only on the variables s_{t-1} and a_{t-1}.
}
To summarize, this is a process of backpropagating through time, where we iteratively compute 1) the optimal action as a linear function of the state, a_t = K_t*s_t + k_t, and 2) the value function at time t as a quadratic of s_t, before stepping back to time t-1.
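A sketch of the backward pass under that parameterization (deterministic case; F, f, C, c are per-time-step lists of the matrices defined above):

```python
import numpy as np

def lqr_backward(F, f, C, c, n_s, n_a):
    """Backward pass of LQR: returns the gains (K_t, k_t) of a_t = K_t s_t + k_t.

    F[t]: (n_s, n_s+n_a), f[t]: (n_s,)            -- dynamics  s_{t+1} = F_t [s_t; a_t] + f_t
    C[t]: (n_s+n_a, n_s+n_a), c[t]: (n_s+n_a,)    -- quadratic cost over [s_t; a_t]
    """
    T = len(F)
    V = np.zeros((n_s, n_s))          # value function stays quadratic: 1/2 s^T V s + s^T v + const
    v = np.zeros(n_s)
    Ks, ks = [None] * T, [None] * T
    for t in reversed(range(T)):
        # Bellman backup: Q-function = current-step cost + next-step value under the dynamics
        Q = C[t] + F[t].T @ V @ F[t]
        q = c[t] + F[t].T @ (V @ f[t] + v)
        Q_ss, Q_sa = Q[:n_s, :n_s], Q[:n_s, n_s:]
        Q_as, Q_aa = Q[n_s:, :n_s], Q[n_s:, n_s:]
        q_s, q_a = q[:n_s], q[n_s:]
        # optimal action is linear in the state: a_t = K_t s_t + k_t
        K = -np.linalg.solve(Q_aa, Q_as)
        k = -np.linalg.solve(Q_aa, q_a)
        Ks[t], ks[t] = K, k
        # substitute a_t back in: the value function at time t is quadratic in s_t
        V = Q_ss + Q_sa @ K + K.T @ Q_as + K.T @ Q_aa @ K
        v = q_s + Q_sa @ k + K.T @ q_a + K.T @ Q_aa @ k
    return Ks, ks
```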