1.Sarsa Algorithm
Each parameter update uses the quintuple $(s_t,a_t,r_t,s_{t+1},a_{t+1})$, i.e. State-Action-Reward-State-Action (SARSA).
1.0.Derive TD Target
Discounted return: the reward $R_t$ depends on $(S_t,A_t,S_{t+1})$.
$$
\begin{aligned}
U_t&=R_t+\gamma\cdot R_{t+1}+\gamma^2\cdot R_{t+2}+\cdots
\\&=R_t+\gamma\cdot(R_{t+1}+\gamma\cdot R_{t+2}+\cdots)
\\&=R_t+\gamma\cdot U_{t+1}
\end{aligned}
$$
$$
\begin{aligned}
Q_\pi(s_t,a_t)&=E[U_t|s_t,a_t]
\\&=E[R_t+\gamma\cdot U_{t+1}|s_t,a_t]
\\&=E[R_t|s_t,a_t]+\gamma\cdot E[U_{t+1}|s_t,a_t]
\\&=E[R_t|s_t,a_t]+\gamma\cdot E[Q_\pi(S_{t+1},A_{t+1})|s_t,a_t]
\end{aligned}
$$
Identity: $Q_\pi(s_t,a_t)=E[R_t+\gamma\cdot Q_\pi(S_{t+1},A_{t+1})]$, for all $\pi$.
Monte Carlo approximation: $Q_\pi(s_t,a_t)\approx r_t+\gamma\cdot Q_\pi(s_{t+1},a_{t+1})=y_t$, where $y_t$ is the TD target.
1.1.Tabular Version
Suitable for small problems where the table stays small: build a $Q$ table indexed by states and actions, and update it with the Sarsa algorithm (a code sketch follows the list).
- Observe a transition $(s_t,a_t,r_t,s_{t+1})$
- Sample the next action $a_{t+1}$ from the policy $\pi(\cdot|s_{t+1})$
- TD target: $y_t=r_t+\gamma\cdot Q_\pi(s_{t+1},a_{t+1})$, where $Q_\pi(s_{t+1},a_{t+1})$ is looked up in the table
- TD error: $\delta_t=Q_\pi(s_t,a_t)-y_t$
- Update the $Q$ table: $Q_\pi(s_t,a_t)\leftarrow Q_\pi(s_t,a_t)-\alpha\cdot\delta_t$
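A minimal sketch of the tabular procedure in Python, assuming a small NumPy $Q$ table, an $\epsilon$-greedy behavior policy, and a hypothetical `env` whose `reset()`/`step()` interface is described in the comments (none of these names come from the notes):

```python
import numpy as np

# Assumption: `env.reset()` returns an integer state, and `env.step(a)` returns
# (s_next, r, done); the interface and hyperparameters are illustrative only.
def sarsa_tabular(env, n_states, n_actions, alpha=0.1, gamma=0.99,
                  epsilon=0.1, n_episodes=500):
    Q = np.zeros((n_states, n_actions))          # Q_pi table, one row per state

    def policy(s):                               # epsilon-greedy w.r.t. current Q
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = policy(s_next)              # sample a_{t+1} from pi(.|s_{t+1})
            y = r + gamma * Q[s_next, a_next] * (not done)   # TD target y_t
            delta = Q[s, a] - y                  # TD error delta_t
            Q[s, a] -= alpha * delta             # table update
            s, a = s_next, a_next
    return Q
```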
1.2.Neural Network Version
Use a neural network, the value network $q(s,a;w)$, to approximate $Q_\pi(s,a)$ (a code sketch follows the list):
- TD target: $y_t=r_t+\gamma\cdot q(s_{t+1},a_{t+1};w)$
- TD error: $\delta_t=q(s_t,a_t;w)-y_t$
- Loss: $\delta_t^2/2$
- Gradient: $\frac{\partial \delta^2_t/2}{\partial w}=\delta_t\cdot\frac{\partial q(s_t,a_t;w)}{\partial w}$
- Gradient descent: $w\leftarrow w-\alpha\cdot\delta_t\cdot\frac{\partial q(s_t,a_t;w)}{\partial w}$
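A minimal sketch of one such gradient step, assuming PyTorch, a small MLP as the value network, and single (unbatched) transitions; the architecture and names are illustrative, not part of the notes. Detaching the TD target implements the semi-gradient update above:

```python
import torch
import torch.nn as nn

# Assumption: 4-dimensional float state vectors and 2 discrete actions.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # q(s, .; w)
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma = 0.99

def sarsa_step(s, a, r, s_next, a_next):
    """One semi-gradient Sarsa update; s, s_next are float tensors, a, a_next ints."""
    q_sa = q_net(s)[a]                                   # q(s_t, a_t; w)
    with torch.no_grad():                                # treat y_t as a constant
        y = r + gamma * q_net(s_next)[a_next]            # TD target
    loss = 0.5 * (q_sa - y) ** 2                         # delta_t^2 / 2
    optimizer.zero_grad()
    loss.backward()                                      # delta_t * dq/dw
    optimizer.step()                                     # w <- w - alpha * grad
    return loss.item()
```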
2.Q-Learning
Comparing Q-Learning with Sarsa:
| | Sarsa | Q-Learning |
|---|---|---|
| Approximates | $Q_\pi(s,a)$ | $Q^*(s,a)$ |
| TD target | $y_t=r_t+\gamma\cdot Q_\pi(s_{t+1},a_{t+1})$ | $y_t=r_t+\gamma\cdot\max_a Q^*(s_{t+1},a)$ |
| Used to train | value network (the critic) | DQN |
2.0.TD Target
As already derived in Section 1.0, for a policy $\pi$:
$$Q_\pi(s_t,a_t)=E[R_t+\gamma\cdot Q_\pi(S_{t+1},A_{t+1})]$$
For the optimal policy $\pi^*$:
$$Q^*(s_t,a_t)=E[R_t+\gamma\cdot Q^*(S_{t+1},A_{t+1})]$$
Taking the action $A_{t+1}=\arg\max_a Q^*(S_{t+1},a)$ gives $Q^*(S_{t+1},A_{t+1})=\max_a Q^*(S_{t+1},a)$, so
$$Q^*(s_t,a_t)=E[R_t+\gamma\cdot\max_a Q^*(S_{t+1},a)]$$
Applying the Monte Carlo approximation yields the TD target $y_t$:
$$Q^*(s_t,a_t)\approx r_t+\gamma\cdot\max_a Q^*(s_{t+1},a)=y_t$$
2.1.Tabular Version
Suitable for small problems where the table stays small: build a $Q^*$ table indexed by states and actions, and update it with the Q-Learning algorithm (a code sketch follows the list).
- Observe a transition $(s_t,a_t,r_t,s_{t+1})$
- TD target: $y_t=r_t+\gamma\cdot\max_a Q^*(s_{t+1},a)$, i.e. take the largest entry in the row of the table for $s_{t+1}$
- TD error: $\delta_t=Q^*(s_t,a_t)-y_t$
- Update the $Q^*$ table: $Q^*(s_t,a_t)\leftarrow Q^*(s_t,a_t)-\alpha\cdot\delta_t$
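Compared with the tabular Sarsa sketch in 1.1, only the TD target changes and no action $a_{t+1}$ needs to be sampled. A minimal update sketch under the same assumed NumPy table layout:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One tabular Q-Learning update; Q is an (n_states, n_actions) array."""
    y = r + gamma * np.max(Q[s_next]) * (not done)   # y_t uses the row maximum
    delta = Q[s, a] - y                              # TD error delta_t
    Q[s, a] -= alpha * delta                         # in-place table update
    return delta
```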
2.2.DQN Version
Use a DQN $Q(s,a;w)$ to approximate $Q^*(s,a)$; the agent takes the action $a_t=\arg\max_a Q(s_t,a;w)$.
The DQN can be trained with the Q-Learning algorithm (a code sketch follows the list):
- Observe a transition $(s_t,a_t,r_t,s_{t+1})$
- TD target: $y_t=r_t+\gamma\cdot\max_a Q(s_{t+1},a;w)$
- TD error: $\delta_t=Q(s_t,a_t;w)-y_t$
- Loss: $\delta_t^2/2$
- Gradient: $\frac{\partial \delta^2_t/2}{\partial w}=\delta_t\cdot\frac{\partial Q(s_t,a_t;w)}{\partial w}$
- Gradient descent: $w\leftarrow w-\alpha\cdot\delta_t\cdot\frac{\partial Q(s_t,a_t;w)}{\partial w}$
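A minimal sketch of one training step, assuming PyTorch, a small MLP as the DQN, and single transitions without a replay buffer or target network; all names and dimensions are illustrative:

```python
import torch
import torch.nn as nn

# Assumption: 4-dimensional float state vectors and 2 discrete actions.
dqn = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q(s, .; w)
optimizer = torch.optim.SGD(dqn.parameters(), lr=1e-3)
gamma = 0.99

def q_learning_step(s, a, r, s_next, done):
    """One semi-gradient Q-Learning update on the DQN."""
    q_sa = dqn(s)[a]                                       # Q(s_t, a_t; w)
    with torch.no_grad():                                  # treat y_t as a constant
        y = r + gamma * dqn(s_next).max() * (1.0 - done)   # TD target with max_a
    loss = 0.5 * (q_sa - y) ** 2                           # delta_t^2 / 2
    optimizer.zero_grad()
    loss.backward()                                        # delta_t * dQ/dw
    optimizer.step()                                       # w <- w - alpha * grad
    return loss.item()

def act(s):
    """Greedy action a_t = argmax_a Q(s_t, a; w)."""
    with torch.no_grad():
        return int(dqn(s).argmax())
```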
3.Multi-Step TD Target
3.0.Motivation
The algorithms above use only one step of reward per update; using the rewards from multiple steps usually gives better results.
3.1.Multi-Step Return
$$U_t=R_t+\gamma\cdot U_{t+1}$$
Unrolling this recursion once gives:
$$
\begin{aligned}
U_t&=R_t+\gamma\cdot(R_{t+1}+\gamma\cdot U_{t+2})
\\&=R_t+\gamma\cdot R_{t+1}+\gamma^2\cdot U_{t+2}
\end{aligned}
$$
Continuing the recursion for $m$ steps:
$$U_t=\sum_{i=0}^{m-1}\gamma^i\cdot R_{t+i}+\gamma^m\cdot U_{t+m}$$
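A quick numerical check of this identity on a finite-horizon reward sequence (a sketch; the horizon and the values of $\gamma$, $m$, $t$ are arbitrary choices):

```python
import numpy as np

# Verify U_t = sum_{i=0}^{m-1} gamma^i R_{t+i} + gamma^m U_{t+m} when the
# episode ends after 10 rewards (rewards beyond the terminal step are zero).
gamma, m, t = 0.9, 3, 2
R = np.random.rand(10)                                  # R_0 ... R_9

def U(k):                                               # discounted return from step k
    return sum(gamma**i * R[k + i] for i in range(len(R) - k))

lhs = U(t)
rhs = sum(gamma**i * R[t + i] for i in range(m)) + gamma**m * U(t + m)
assert np.isclose(lhs, rhs)
```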
3.2.Multi-Step TD Target
- m-step TD target for Sarsa:
  $$y_t=\sum_{i=0}^{m-1}\gamma^i\cdot r_{t+i}+\gamma^m\cdot Q_\pi(s_{t+m},a_{t+m})$$
- m-step TD target for Q-Learning:
  $$y_t=\sum_{i=0}^{m-1}\gamma^i\cdot r_{t+i}+\gamma^m\cdot\max_a Q^*(s_{t+m},a)$$
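Both targets share the same discounted reward sum and differ only in the bootstrap term. A small helper, assuming the $m$ observed rewards and the bootstrap value are already available:

```python
def m_step_td_target(rewards, bootstrap, gamma=0.99):
    """m-step TD target: sum_{i=0}^{m-1} gamma^i * r_{t+i} + gamma^m * bootstrap.

    `rewards` holds r_t ... r_{t+m-1}; `bootstrap` is Q_pi(s_{t+m}, a_{t+m})
    for Sarsa, or max_a Q*(s_{t+m}, a) for Q-Learning.
    """
    m = len(rewards)
    discounted = sum(gamma**i * r for i, r in enumerate(rewards))
    return discounted + gamma**m * bootstrap
```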