[CS294-112] model-based RL

This post digs into model-based RL methods, including stochastic optimization, Monte Carlo Tree Search (MCTS), the Linear Quadratic Regulator (LQR), and iLQR/DDP. It also covers different versions of model-based RL such as MPC and PILCO, and weighs the strengths and weaknesses of model-based versus model-free methods. It then discusses where local models are applicable and how policies can be learned with them, e.g. Guided Policy Search (GPS) and DAgger. Finally, it ranks RL methods by sample efficiency and gives a recommended path for choosing among them.


Content

1 Control and Planning

2 Model-Based RL

3 Local Models

4 policy learning with a learned model

5 How to choose methods

1. Control and Planning

Open-loop Trajectory optimization methods
assumptions: a (learned) dynamics model in hand
objective: find the action sequence that maximizes the expected return of the trajectory it induces under the known dynamics model.
inputs: a dynamics model, a cost function (to minimize)
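Written out, the optimization problem is roughly the following (a sketch of the standard formulation; here $c$ denotes the cost and $f$ the learned dynamics):

$$
a_1, \dots, a_T = \arg\min_{a_1, \dots, a_T} \sum_{t=1}^{T} c(s_t, a_t) \quad \text{s.t.} \quad s_{t+1} = f(s_t, a_t)
$$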

side note open-loop versus closed-loop control
open-loop = no-feedback system
closed-loop = feedback system
feedback means the process output is fed back into the next decision.

1.1 stochastic optimization

1.1.1 random shooting
generate random action sequences from an arbitrary distribution, and take the sequence among them that yields the lowest cost as the final solution.

weakness in a high-dimensional action space, the odds of sampling a near-optimal sequence are low (the curse of dimensionality).
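A minimal Python sketch (not from the lecture; `dynamics` and `cost` are assumed, hypothetical callables for the learned model and the per-step cost):

import numpy as np

def random_shooting(s0, dynamics, cost, horizon=15, n_samples=1000, action_dim=2):
    """Sample random action sequences and keep the cheapest one."""
    best_seq, best_cost = None, np.inf
    for _ in range(n_samples):
        # sample an action sequence from an arbitrary (here uniform) distribution
        seq = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = s0, 0.0
        for a in seq:                       # roll out under the known dynamics
            total += cost(s, a)
            s = dynamics(s, a)
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq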

1.1.2 CEM: Cross-Entropy Methods

loop {
	1. generate samples (action sequences) from a parameterized distribution.
	2. evaluate costs and pick the k lowest-cost samples
	3. fit the distribution to those k samples
}

intuition use a parameterized distribution that iteratively adapts toward the lower-cost region of the action-sequence space.
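A minimal sketch under the same assumptions, fitting a diagonal Gaussian over flattened action sequences (`total_cost` is a hypothetical helper that rolls a sequence out under the model and sums the cost):

import numpy as np

def cem(total_cost, dim, n_iters=10, n_samples=500, n_elites=50):
    """Cross-Entropy Method over flattened action sequences of length `dim`."""
    mean, std = np.zeros(dim), np.ones(dim)
    for _ in range(n_iters):
        # 1. generate samples from the parameterized (Gaussian) distribution
        samples = mean + std * np.random.randn(n_samples, dim)
        # 2. evaluate costs and pick the k lowest-cost samples (the elites)
        costs = np.array([total_cost(x) for x in samples])
        elites = samples[np.argsort(costs)[:n_elites]]
        # 3. fit the distribution to the elites
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean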

1.1.3 CMA-ES
like CEM with momentum

1.2 Monte Carlo Tree Search with Upper Confidence Bounds for Trees

MCTS
Given: a root node, a TreePolicy, a DefaultPolicy

do {
	1. get to a leaf node according to the TreePolicy
	2. roll out from that node according to the DefaultPolicy until a terminating state.
	3. compute the reward of the rollout trajectory, and update the statistics along the route from the current leaf node back to the root node
} until some criterion is met.

DefaultPolicy
Given: a node (state)

1. sample an action uniformly from the allowed action space of the state
2. return the next state (inferred under the known dynamics)

TreePolicy: UCT (Upper Confidence Bounds for Trees) algorithm
Given: the root node

1. do {
	1. choose the best child of the current node according to the UCT function
	2. set the current node to that best child.
} until the current node is expandable.

2. randomly choose an action from the allowed action space of the current node.
3. get to the next state under the known dynamics.
4. expand the current node with a new child node for this state, and return this child node.

The statistics of each node include the total reward and the number of visits of the subtree rooted at that node.

As for how we define the goodness of a child, we use the UCT function. The UCT function balances exploration against exploitation: intuitively, it prefers to expand nodes that have high average returns and have been visited less often than their siblings.
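A common form of the score, reconstructed here from the standard UCT formulation ($Q(s_t)$ is the total reward and $N(s_t)$ the visit count stored at the node, and $C$ is an exploration constant); the TreePolicy picks the child with the highest score:

$$
\text{Score}(s_t) = \frac{Q(s_t)}{N(s_t)} + 2C\sqrt{\frac{2\ln N(s_{t-1})}{N(s_t)}}
$$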
side note this is straightforward for deterministic dynamics; for the stochastic case, some reparameterization tricks are required.
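Putting the pieces together, here is a compact Python sketch of the whole procedure (assumptions of mine, not the lecture's code: a discrete, hashable action set `actions`, a known deterministic `step(state, action)` dynamics function, and a per-state `reward(state)`):

import math, random

class Node:
    """A search-tree node; its statistics are the total reward and visit count of its subtree."""
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                         # action -> child Node
        self.total_reward, self.visits = 0.0, 0

def uct_score(parent, child, C=1.0):
    # average return (exploitation) plus a visit-count bonus (exploration)
    return (child.total_reward / child.visits
            + 2 * C * math.sqrt(2 * math.log(parent.visits) / child.visits))

def tree_policy(node, actions, step):
    # descend via UCT while the node is fully expanded, then expand one new child
    while len(node.children) == len(actions):
        node = max(node.children.values(), key=lambda c: uct_score(node, c))
    a = random.choice([a for a in actions if a not in node.children])
    child = Node(step(node.state, a), parent=node)
    node.children[a] = child
    return child

def default_policy(state, actions, step, reward, depth=50):
    # random rollout under the known dynamics (depth-limited here instead of
    # waiting for a terminating state, to keep the sketch simple)
    total = 0.0
    for _ in range(depth):
        state = step(state, random.choice(actions))
        total += reward(state)
    return total

def mcts(root_state, actions, step, reward, n_iters=1000):
    root = Node(root_state)
    root.visits = 1
    for _ in range(n_iters):
        leaf = tree_policy(root, actions, step)                  # 1. TreePolicy
        ret = default_policy(leaf.state, actions, step, reward)  # 2. DefaultPolicy rollout
        node = leaf
        while node is not None:        # 3. backpropagate statistics to the root
            node.visits += 1
            node.total_reward += ret
            node = node.parent
    # act with the most-visited child of the root
    return max(root.children, key=lambda a: root.children[a].visits)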

Case Study
Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning
combines DAgger with MCTS as follows:

loop {
	1. train a policy on the state-action pairs D
	2. run the policy to obtain unseen states {o_j}
	3. label these states with the actions suggested by MCTS to form new state-action pairs D_u
	4. aggregate D with D_u
}
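A compact sketch of that loop in Python (the helpers `train`, `run_policy`, and `mcts_action` are hypothetical placeholders, not the paper's code):

def dagger_with_mcts(policy, model, n_iters=10):
    """Imitation loop with MCTS as the expert labeller (a sketch)."""
    D = []                                    # aggregated state-action pairs
    # (in practice D is seeded, e.g. with states and actions from MCTS rollouts)
    for _ in range(n_iters):
        policy = train(policy, D)             # 1. train the policy on D
        states = run_policy(policy)           # 2. states the learned policy visits
        D_u = [(o, mcts_action(model, o))     # 3. label them with MCTS's suggested actions
               for o in states]
        D = D + D_u                           # 4. aggregate D with D_u
    return policy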

1.3 Linear Quadratic Regulator (LQR)

given: a linear dynamics function, a quadratic cost (Q) function
objective: find the action sequence that minimizes the cost function under the dynamics.
and by substituting in the dynamics constraint, we turn it into an unconstrained optimization problem.
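In symbols (a sketch in the standard notation of the lecture; $F_t, f_t$ parameterize the linear dynamics and $C_t, c_t$ the quadratic cost):

$$
f(s_t, a_t) = F_t \begin{bmatrix} s_t \\ a_t \end{bmatrix} + f_t,
\qquad
c(s_t, a_t) = \frac{1}{2} \begin{bmatrix} s_t \\ a_t \end{bmatrix}^\top C_t \begin{bmatrix} s_t \\ a_t \end{bmatrix} + \begin{bmatrix} s_t \\ a_t \end{bmatrix}^\top c_t,
$$

so the unconstrained problem becomes

$$
\min_{a_1, \dots, a_T} \; c(s_1, a_1) + c(f(s_1, a_1), a_2) + \dots + c(f(f(\dots)\dots), a_T).
$$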
At the last time step t, we first optimize the Q-function with respect to the action to get the optimal action.
Plugging the optimal action back in, we compute the optimum, which is the value function at this state.
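Concretely, for the last time step the optimal action is linear in the state (a sketch; the subscripts pick out the state/action blocks of $C_t$ and $c_t$):

$$
a_t = K_t s_t + k_t, \qquad
K_t = -C_{a_t, a_t}^{-1} C_{a_t, s_t}, \qquad
k_t = -C_{a_t, a_t}^{-1} c_{a_t},
$$

and substituting $a_t$ back into the Q-function leaves a value function $V(s_t)$ that is quadratic in $s_t$ alone.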
For the previous time step t-1, we use the Bellman backup to write the Q-function as the current-step cost plus the next-step value, and clean it up into quadratic form.
Again we solve for the optimal action at time t-1.
1.3.1 The Deterministic case
logic

loop {
	1. At time t we can find the optimal action w.r.t. the Q-function. (The action is a linear function of the state: a_t = K_t*s_t + k_t.)
	2. Since this action optimizes the cost function, it also optimizes the Q-function. Substituting a_t = K_t*s_t + k_t into the Q-function, we obtain the optimal value at time t, with s_t as the only variable. Now the value function is a quadratic function of s_t.
	3. At time t-1, based on the Bellman backup, the Q-function equals the current-step cost plus the next-step value function. We obtained the time-t value function w.r.t. s_t, and now we substitute the dynamics function for s_t to get rid of it. We obtain the Q-function at time t-1, which depends only on the variables s_{t-1} and a_{t-1}.
}

To summarize, this is a process of backpropagating through time: working backwards from the last time step to the first, we iteratively compute 1) the optimal action $a_t$ as a linear function of the state $s_t$, and 2) the value function as a quadratic function of $s_t$.
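A compact NumPy sketch of this backward pass plus the forward rollout (a sketch, not the lecture's reference code; the time-varying $F_t, f_t, C_t, c_t$ are assumed to be given as lists of arrays, and constant terms of the value function are dropped since they do not affect the optimal actions):

import numpy as np

def lqr_backward(F, f, C, c, n, m):
    """Backward pass of finite-horizon LQR.

    F[t]: (n, n+m), f[t]: (n,)     -- linear dynamics  s_{t+1} = F[t] @ [s; a] + f[t]
    C[t]: (n+m, n+m), c[t]: (n+m,) -- quadratic cost 0.5*z^T C z + z^T c, with z = [s; a]
    Returns time-varying gains K[t], k[t] such that a_t = K[t] @ s_t + k[t].
    """
    T = len(F)
    V, v = np.zeros((n, n)), np.zeros(n)   # value is quadratic: 0.5 s^T V s + s^T v
    K, k = [None] * T, [None] * T
    for t in reversed(range(T)):
        # Bellman backup: current cost plus next-step value pushed through the dynamics
        Q = C[t] + F[t].T @ V @ F[t]
        q = c[t] + F[t].T @ V @ f[t] + F[t].T @ v
        Q_ss, Q_sa, Q_as, Q_aa = Q[:n, :n], Q[:n, n:], Q[n:, :n], Q[n:, n:]
        q_s, q_a = q[:n], q[n:]
        # optimal action is a linear function of the state: a_t = K_t s_t + k_t
        K[t] = -np.linalg.solve(Q_aa, Q_as)
        k[t] = -np.linalg.solve(Q_aa, q_a)
        # plug the optimal action back in -> value function, quadratic in s_t alone
        V = Q_ss + Q_sa @ K[t] + K[t].T @ Q_as + K[t].T @ Q_aa @ K[t]
        v = q_s + Q_sa @ k[t] + K[t].T @ q_a + K[t].T @ Q_aa @ k[t]
    return K, k

def lqr_forward(s0, F, f, K, k):
    """Forward pass: roll out the optimal actions under the known dynamics."""
    s, actions = s0, []
    for t in range(len(F)):
        a = K[t] @ s + k[t]
        actions.append(a)
        s = F[t] @ np.concatenate([s, a]) + f[t]
    return actions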
