Reinforcement Learning Notes 5: TD-Learning

This post takes a close look at two reinforcement learning algorithms, Sarsa and Q-Learning: their target functions, how their TD targets are derived, and both the tabular and neural-network implementations. It also introduces multi-step TD targets and explains why multi-step returns help during training. Comparing the two algorithms highlights their differences in policy updating and optimal-policy estimation.


1.Sarsa Algorithm

Each parameter update uses the five-tuple $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$, hence the name State-Action-Reward-State-Action (SARSA).

1.0.Derive TD Target

Discounted return ($R_t$ depends on $(S_t, A_t, S_{t+1})$):

$$
\begin{aligned}
U_t &= R_t + \gamma\cdot R_{t+1} + \gamma^2\cdot R_{t+2} + \cdots \\
&= R_t + \gamma\cdot\left(R_{t+1} + \gamma\cdot R_{t+2} + \cdots\right) \\
&= R_t + \gamma\cdot U_{t+1}
\end{aligned}
$$

$$
\begin{aligned}
Q_\pi(s_t,a_t) &= E[U_t \mid s_t,a_t] \\
&= E[R_t + \gamma\cdot U_{t+1} \mid s_t,a_t] \\
&= E[R_t \mid s_t,a_t] + \gamma\cdot E[U_{t+1} \mid s_t,a_t] \\
&= E[R_t \mid s_t,a_t] + \gamma\cdot E[Q_\pi(S_{t+1},A_{t+1}) \mid s_t,a_t]
\end{aligned}
$$

Identity: $Q_\pi(s_t,a_t)=E[R_t+\gamma\cdot Q_\pi(S_{t+1},A_{t+1})]$, for all $\pi$.

Monte Carlo approximation: $Q_\pi(s_t,a_t)\approx r_t+\gamma\cdot Q_\pi(s_{t+1},a_{t+1})=y_t$

$y_t$ is the TD target.

1.1.Tabular Version

This works for small-scale problems where the table stays manageable: build a $Q$ table indexed by states and actions and update it with the Sarsa algorithm (a minimal update sketch follows the list below).

  • Observe a transition $(s_t, a_t, r_t, s_{t+1})$
  • Sample the next action $a_{t+1}$ from the policy $\pi(\cdot\mid s_{t+1})$
  • TD target: $y_t=r_t+\gamma\cdot Q_\pi(s_{t+1},a_{t+1})$, where $Q_\pi(s_{t+1},a_{t+1})$ is looked up in the table
  • TD error: $\delta_t=Q_\pi(s_t,a_t)-y_t$
  • Update the $Q$ table: $Q_\pi(s_t,a_t)\leftarrow Q_\pi(s_t,a_t)-\alpha\cdot\delta_t$
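
A minimal Python sketch of this tabular update, assuming a dict-based $Q$ table keyed by (state, action); the function and variable names are illustrative:

```python
from collections import defaultdict

Q = defaultdict(float)  # tabular Q_pi, keyed by (state, action), initialized to 0


def sarsa_update(Q, s_t, a_t, r_t, s_next, a_next, alpha=0.1, gamma=0.99):
    """One tabular Sarsa step on the Q table."""
    y_t = r_t + gamma * Q[(s_next, a_next)]  # TD target, looked up in the table
    delta_t = Q[(s_t, a_t)] - y_t            # TD error
    Q[(s_t, a_t)] -= alpha * delta_t         # Q <- Q - alpha * delta
```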
1.2.Neural Network Version

Use a value network $q(s,a;w)$ to approximate $Q_\pi(s,a)$ (a gradient-step sketch follows the list below):

  • TD target: $y_t=r_t+\gamma\cdot q(s_{t+1},a_{t+1};w)$

  • TD error: $\delta_t=q(s_t,a_t;w)-y_t$

  • Loss: $\delta_t^2/2$

  • Gradient: $\frac{\partial\,\delta_t^2/2}{\partial w}=\delta_t\cdot\frac{\partial q(s_t,a_t;w)}{\partial w}$

  • Gradient descent: $w\leftarrow w-\alpha\cdot\delta_t\cdot\frac{\partial q(s_t,a_t;w)}{\partial w}$
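
A minimal PyTorch sketch of this update, assuming states and actions arrive as batched tensors and the action is encoded as a vector (e.g. one-hot); the architecture and names are illustrative:

```python
import torch
import torch.nn as nn


class ValueNet(nn.Module):
    """Value network q(s, a; w): a small MLP over the concatenated (state, action)."""

    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)


def sarsa_nn_update(q, optimizer, s_t, a_t, r_t, s_next, a_next, gamma=0.99):
    """One gradient step on loss = delta_t^2 / 2, with the TD target held fixed."""
    with torch.no_grad():                      # do not backpropagate through the target
        y_t = r_t + gamma * q(s_next, a_next)  # TD target
    delta_t = q(s_t, a_t) - y_t                # TD error
    loss = 0.5 * delta_t.pow(2).mean()         # loss = delta_t^2 / 2
    optimizer.zero_grad()
    loss.backward()                            # gradient = delta_t * dq/dw
    optimizer.step()                           # w <- w - alpha * gradient
```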

2.Q-Learning

Comparing Q-Learning and Sarsa:

|  | Sarsa | Q-Learning |
| --- | --- | --- |
| Target function | $Q_\pi(s,a)$ | $Q^*(s,a)$ |
| TD target | $y_t=r_t+\gamma\cdot Q_\pi(s_{t+1},a_{t+1})$ | $y_t=r_t+\gamma\cdot\max_a Q^*(s_{t+1},a)$ |
| Parameters updated | value network (critic) | DQN |
2.0.TD Target

As derived in 1.0, for any policy $\pi$:

$$
Q_\pi(s_t,a_t)=E[R_t+\gamma\cdot Q_\pi(S_{t+1},A_{t+1})]
$$

For the optimal policy $\pi^*$:

$$
Q^*(s_t,a_t)=E[R_t+\gamma\cdot Q^*(S_{t+1},A_{t+1})]
$$

The action is taken as $A_{t+1}=\arg\max_a Q^*(S_{t+1},a)$, so

$$
Q^*(S_{t+1},A_{t+1})=\max_a Q^*(S_{t+1},a)
$$

$$
Q^*(s_t,a_t)=E[R_t+\gamma\cdot\max_a Q^*(S_{t+1},a)]
$$

Applying the Monte Carlo approximation yields the TD target $y_t$:

$$
Q^*(s_t,a_t)\approx r_t+\gamma\cdot\max_a Q^*(s_{t+1},a)=y_t
$$

2.1.Tabular Version

This works for small-scale problems where the table stays manageable: build a $Q^*$ table indexed by states and actions and update it with the Q-Learning algorithm (a minimal update sketch follows the list below).

  • Observe a transition $(s_t, a_t, r_t, s_{t+1})$
  • TD target: $y_t=r_t+\gamma\cdot\max_a Q^*(s_{t+1},a)$: among the actions available at $s_{t+1}$, take the largest entry in the table
  • TD error: $\delta_t=Q^*(s_t,a_t)-y_t$
  • Update the $Q^*$ table: $Q^*(s_t,a_t)\leftarrow Q^*(s_t,a_t)-\alpha\cdot\delta_t$
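
A minimal Python sketch of this tabular update, assuming a dict-based table and an explicit set of available actions; the names are illustrative:

```python
from collections import defaultdict

Q_star = defaultdict(float)  # tabular estimate of Q*, keyed by (state, action)


def q_learning_update(Q_star, actions, s_t, a_t, r_t, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-Learning step; `actions` is the set of actions available at s_next."""
    y_t = r_t + gamma * max(Q_star[(s_next, a)] for a in actions)  # TD target with the max
    delta_t = Q_star[(s_t, a_t)] - y_t                             # TD error
    Q_star[(s_t, a_t)] -= alpha * delta_t                          # table update
```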
2.2.DQN Version

Use a DQN $Q(s,a;w)$ to approximate $Q^*(s,a)$; the agent acts by $a_t=\arg\max_a Q(s_t,a;w)$.

The DQN can be trained with the Q-Learning algorithm (a single-step gradient sketch follows the list below):

  • Observe a transition $(s_t, a_t, r_t, s_{t+1})$

  • TD target: $y_t=r_t+\gamma\cdot\max_a Q(s_{t+1},a;w)$

  • TD error: $\delta_t=Q(s_t,a_t;w)-y_t$

  • Loss: $\delta_t^2/2$

  • Gradient: $\frac{\partial\,\delta_t^2/2}{\partial w}=\delta_t\cdot\frac{\partial Q(s_t,a_t;w)}{\partial w}$

  • Gradient descent: $w\leftarrow w-\alpha\cdot\delta_t\cdot\frac{\partial Q(s_t,a_t;w)}{\partial w}$
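
A minimal PyTorch sketch of one such gradient step on a batch of transitions, without the replay buffer and target network that a full DQN implementation would usually add; the architecture and names are illustrative:

```python
import torch
import torch.nn as nn


class DQN(nn.Module):
    """DQN Q(s, .; w): outputs one value per discrete action."""

    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return self.net(s)


def dqn_update(Q, optimizer, s_t, a_t, r_t, s_next, gamma=0.99):
    """One Q-Learning gradient step on loss = delta_t^2 / 2 (a_t: LongTensor of action indices)."""
    with torch.no_grad():                                    # target is held fixed
        y_t = r_t + gamma * Q(s_next).max(dim=-1).values     # TD target with max over actions
    q_sa = Q(s_t).gather(-1, a_t.unsqueeze(-1)).squeeze(-1)  # Q(s_t, a_t; w)
    delta_t = q_sa - y_t                                     # TD error
    loss = 0.5 * delta_t.pow(2).mean()                       # loss = delta_t^2 / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```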

3.Multi-Step TD Target
3.0.Motivation

The algorithms above train with only a single step of reward; using several steps of reward typically gives better results.


3.1.Multi-Step Return

$$
U_t=R_t+\gamma\cdot U_{t+1}
$$

Recursing on this identity gives:

$$
\begin{aligned}
U_t &= R_t+\gamma\cdot(R_{t+1}+\gamma\cdot U_{t+2}) \\
&= R_t+\gamma\cdot R_{t+1}+\gamma^2\cdot U_{t+2}
\end{aligned}
$$

Continuing the recursion:

$$
U_t=\sum_{i=0}^{m-1}\gamma^i\cdot R_{t+i}+\gamma^m\cdot U_{t+m}
$$

3.2.Multi-Step TD Target
  • m-step TD target for Sarsa:
    $$
    y_t=\sum_{i=0}^{m-1}\gamma^i\cdot r_{t+i}+\gamma^m\cdot Q_\pi(s_{t+m},a_{t+m})
    $$

  • m-step TD target for Q-Learning (a small helper sketch follows the list):
    $$
    y_t=\sum_{i=0}^{m-1}\gamma^i\cdot r_{t+i}+\gamma^m\cdot\max_a Q^*(s_{t+m},a)
    $$
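
A small Python helper sketch for computing the m-step target, assuming `rewards` holds the observed rewards $r_t,\dots,r_{t+m-1}$ and `bootstrap_value` is the bootstrap term ($Q_\pi(s_{t+m},a_{t+m})$ for Sarsa, $\max_a Q^*(s_{t+m},a)$ for Q-Learning); the names are illustrative:

```python
def multi_step_td_target(rewards, bootstrap_value, gamma=0.99):
    """m-step TD target: sum_i gamma^i * r_{t+i} + gamma^m * bootstrap_value."""
    m = len(rewards)
    y_t = sum(gamma**i * r for i, r in enumerate(rewards))  # discounted sum of m rewards
    y_t += gamma**m * bootstrap_value                       # bootstrap with the value estimate
    return y_t
```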
