Reinforcement Learning Notes 7: Baselines in Policy Learning

This post walks through the policy gradient algorithm in reinforcement learning and explains the roles of the policy network and the state-value function. It introduces a baseline function to reduce the variance of the Monte Carlo estimate, shows how the state value can be used as the baseline and how this affects convergence speed, and then compares policy gradient with baseline (Reinforce) against Advantage Actor-Critic (A2C), focusing on the multi-step (TD) target used in A2C and its advantages.


1. Policy Gradient with Baseline
1.1 Policy Gradient
  • Policy network: $\pi(a|s;\theta)$

  • State-value function:
    $$V_\pi(s)=E_{A\sim\pi}[Q_\pi(s,A)]=\sum_a\pi(a|s;\theta)\cdot Q_\pi(s,a)$$

  • Policy gradient:
    $$\frac{\partial V_\pi(s)}{\partial\theta}=E_{A\sim\pi}\left[\frac{\partial \ln\pi(A|s;\theta)}{\partial\theta}\cdot Q_\pi(s,A)\right]$$
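As a concrete reference, the sketch below shows a minimal discrete-action policy network in PyTorch; it is my own illustration rather than part of the original notes, and `obs_dim`, `n_actions`, and the hidden size are placeholder values. The `log_prob` it produces supplies the $\ln\pi(a|s;\theta)$ term in the gradient above.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNet(nn.Module):
    """pi(a|s;theta): maps a state to a distribution over discrete actions."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s: torch.Tensor) -> Categorical:
        # Softmax over the final logits gives the action probabilities.
        return Categorical(logits=self.net(s))

# Sampling an action and its log-probability (the ln pi(a|s;theta) term):
policy = PolicyNet(obs_dim=4, n_actions=2)
s = torch.randn(1, 4)            # a dummy state
dist = policy(s)
a = dist.sample()                # a ~ pi(.|s;theta)
log_prob = dist.log_prob(a)      # ln pi(a|s;theta), differentiable w.r.t. theta
```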

1.2. Baseline in Policy Gradient
  • Baseline: a function (or constant) $b$ that does not depend on the action $A$.

  • Property: if $b$ does not depend on $A$, then $E_{A\sim\pi}\left[b\cdot \frac{\partial\ln\pi(A|s;\theta)}{\partial\theta}\right]=0$:
    $$\begin{aligned} E_{A\sim\pi}\left[b\cdot \frac{\partial\ln\pi(A|s;\theta)}{\partial\theta}\right]&= b\cdot E_{A\sim\pi}\left[\frac{\partial\ln\pi(A|s;\theta)}{\partial\theta}\right] \\&=b\cdot \sum_a\pi(a|s;\theta)\cdot \frac{\partial\ln\pi(a|s;\theta)}{\partial\theta} \\&=b\cdot \sum_a\pi(a|s;\theta)\cdot \frac{1}{\pi(a|s;\theta)}\cdot\frac{\partial\pi(a|s;\theta)}{\partial\theta} \\&=b\cdot \sum_a\frac{\partial\pi(a|s;\theta)}{\partial\theta} \\&=b\cdot \frac{\partial}{\partial\theta}\sum_a\pi(a|s;\theta) \\&=b\cdot \frac{\partial 1}{\partial\theta} \\&=0 \end{aligned}$$

  • policy gradient
    $$\begin{aligned} \frac{\partial V_\pi(s)}{\partial\theta} &=E_{A\sim\pi}\left[\frac{\partial \ln\pi(A|s;\theta)}{\partial\theta}\cdot Q_\pi(s,A)\right] \\&=E_{A\sim\pi}\left[\frac{\partial \ln\pi(A|s;\theta)}{\partial\theta}\cdot Q_\pi(s,A)\right] -E_{A\sim\pi}\left[b\cdot \frac{\partial\ln\pi(A|s;\theta)}{\partial\theta}\right] \\&=E_{A\sim\pi}\left[\frac{\partial \ln\pi(A|s;\theta)}{\partial\theta}\cdot (Q_\pi(s,A)-b)\right] \end{aligned}$$
    That is:
    $$\frac{\partial V_\pi(s_t)}{\partial\theta} =E_{A_t\sim\pi}\left[\frac{\partial \ln\pi(A_t|s_t;\theta)}{\partial\theta}\cdot (Q_\pi(s_t,A_t)-b)\right]$$
    In practice this expectation is evaluated by Monte Carlo approximation. The baseline $b$ does not change the expectation (i.e. the policy gradient itself), but it does change the variance of the Monte Carlo estimate; a well-chosen $b$ reduces that variance and makes training converge faster.
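As a sanity check, the zero-mean property above can be verified numerically for a small softmax policy. The sketch below is my own illustration with made-up logits and an arbitrary constant baseline, not part of the original notes.

```python
import numpy as np

# A toy softmax policy over 3 actions; theta holds one logit per action.
theta = np.array([0.2, -0.5, 1.0])
pi = np.exp(theta) / np.exp(theta).sum()      # pi(a|s;theta)

# For a softmax policy, d ln pi(a) / d theta = e_a - pi (standard identity).
grad_log_pi = np.eye(3) - pi                  # row a = gradient of ln pi(a|s;theta)

b = 7.3                                       # any constant baseline, independent of A
expectation = sum(pi[a] * b * grad_log_pi[a] for a in range(3))
print(expectation)                            # ~[0, 0, 0] up to floating-point error
```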

1.3. Monte Carlo Approximation

$$\begin{aligned} g(A_t)&=\frac{\partial \ln\pi(A_t|s_t;\theta)}{\partial\theta}\cdot (Q_\pi(s_t,A_t)-b) \\ \frac{\partial V_\pi(s_t)}{\partial\theta} &=E_{A_t\sim\pi}\left[\frac{\partial \ln\pi(A_t|s_t;\theta)}{\partial\theta}\cdot (Q_\pi(s_t,A_t)-b)\right] =E_{A_t\sim\pi}[g(A_t)] \end{aligned}$$

Sample an action $a_t\sim\pi(\cdot|s_t;\theta)$ from the policy and compute $g(a_t)$; this is a Monte Carlo approximation of the expectation above and an unbiased estimate of the policy gradient.

  • Stochastic Policy Gradient (gradient ascent):
    $$\begin{aligned} g(a_t)&=\frac{\partial \ln\pi(a_t|s_t;\theta)}{\partial\theta}\cdot (Q_\pi(s_t,a_t)-b) \\ \theta&\leftarrow \theta+\beta\cdot g(a_t) \end{aligned}$$
    bbbAtA_tAt无关,故不会影响g(At)g(A_t)g(At)的期望,但是会影响其方差。如果选取的bbb很接近于QπQ_\piQπ,则方差会很小。
1.4. Choices of Baseline
  • Choice 1: $b=0$, i.e. no baseline.

  • Choice 2: $b$ is the state value, $b=V_\pi(s_t)$.

    The state $s_t$ is observed before $A_t$ is drawn, so it does not depend on $A_t$.

    (This is reminiscent of the dueling network: the weight $Q_\pi(s_t,A_t)-V_\pi(s_t)$ is an advantage.)
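    For reference, the advantage function alluded to above is conventionally defined as
    $$A_\pi(s,a)=Q_\pi(s,a)-V_\pi(s),$$
    so the weight $Q_\pi(s_t,A_t)-V_\pi(s_t)$ in the baselined gradient is exactly the advantage $A_\pi(s_t,A_t)$, which is where the name Advantage Actor-Critic in Section 3 comes from.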

2. Reinforce with Baseline
2.1. Policy Gradient

Use $V_\pi(s_t)$ as the baseline:
$$\begin{aligned} g(A_t)&=\frac{\partial \ln\pi(A_t|s_t;\theta)}{\partial\theta}\cdot (Q_\pi(s_t,A_t)-V_\pi(s_t)) \\ \frac{\partial V_\pi(s_t)}{\partial\theta} &=E_{A_t\sim\pi}\left[\frac{\partial \ln\pi(A_t|s_t;\theta)}{\partial\theta}\cdot (Q_\pi(s_t,A_t)-V_\pi(s_t))\right] =E_{A_t\sim\pi}[g(A_t)] \end{aligned}$$

2.2. Approximation
  • Randomly sample an action $a_t\sim\pi(\cdot|s_t;\theta)$ to obtain the stochastic policy gradient:
    $$g(a_t)=\frac{\partial \ln\pi(a_t|s_t;\theta)}{\partial\theta}\cdot (Q_\pi(s_t,a_t)-V_\pi(s_t))$$

  • Approximating $Q_\pi(s_t,a_t)$:
    $$Q_\pi(s_t,a_t)=E[U_t|s_t,a_t]$$
    Approximate it with the observed return $u_t$, i.e. $Q_\pi(s_t,a_t)\approx u_t$:

    • Observe a complete trajectory: $s_t,a_t,r_t,s_{t+1},a_{t+1},r_{t+1},\cdots,s_n,a_n,r_n$
    • Compute the return: $u_t=\sum_{i=t}^n \gamma^{i-t}\cdot r_i$
    • Use $u_t$ as an unbiased estimate of $Q_\pi(s_t,a_t)$
  • Approximating $V_\pi(s_t)$: use a neural network, the value network $v(s;w)$.

After these three approximations, the result is:
$$\frac{\partial V_\pi(s_t)}{\partial\theta}\approx g(a_t) \approx \frac{\partial \ln\pi(a_t|s_t;\theta)}{\partial\theta}\cdot (u_t-v(s_t;w))$$
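A minimal helper (my own, with placeholder rewards and discount factor) for the second approximation above: it computes all returns $u_t$ of an episode backwards in a single pass.

```python
def discounted_returns(rewards, gamma=0.99):
    """u_t = sum_{i=t}^{n} gamma^(i-t) * r_i, computed backwards in one pass."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))   # ~[2.62, 1.8, 2.0]
```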

2.3. Policy and Value Network
  • Policy Network

    (figure: the policy network $\pi(a|s;\theta)$)

  • Value Network

    (figure: the value network $v(s;w)$)

  • Parameter Sharing

    (figure: the policy and value networks sharing parameters)
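A minimal PyTorch sketch of the parameter-sharing layout is given below; it is my own assumed architecture (the original figures are not reproduced here), with placeholder layer sizes: a shared trunk feeds both a policy head producing $\pi(a|s;\theta)$ and a value head producing $v(s;w)$.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyValueNet(nn.Module):
    """Shared trunk with two heads: pi(a|s;theta) and v(s;w)."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)    # logits for pi(a|s;theta)
        self.value_head = nn.Linear(hidden, 1)             # scalar v(s;w)

    def forward(self, s: torch.Tensor):
        h = self.trunk(s)
        return Categorical(logits=self.policy_head(h)), self.value_head(h).squeeze(-1)

net = PolicyValueNet(obs_dim=4, n_actions=2)
dist, value = net(torch.randn(1, 4))    # action distribution and state-value estimate
```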

2.4. Reinforce with Baseline
  • Updating the policy network

    Policy gradient: $\frac{\partial V_\pi(s_t)}{\partial\theta}\approx \frac{\partial \ln\pi(a_t|s_t;\theta)}{\partial\theta}\cdot (u_t-v(s_t;w))$

    Gradient ascent: $\theta\leftarrow\theta+\beta\cdot\frac{\partial \ln\pi(a_t|s_t;\theta)}{\partial\theta}\cdot (u_t-v(s_t;w))$


    Define the prediction error
    $$\delta_t=v(s_t;w)-u_t.$$
    Then, since $u_t-v(s_t;w)=-\delta_t$, gradient ascent can also be written as:
    $$\theta\leftarrow\theta-\beta\cdot\frac{\partial \ln\pi(a_t|s_t;\theta)}{\partial\theta}\cdot\delta_t$$

  • Updating the value network

    Make $v(s_t;w)$ approximate $V_\pi(s_t)=E[U_t|s_t]$ by fitting it to the observed return $u_t$.

    • Prediction error: $\delta_t=v(s_t;w)-u_t$
    • Gradient: $\frac{\partial(\delta_t^2/2)}{\partial w}=\delta_t\cdot\frac{\partial v(s_t;w)}{\partial w}$
    • Gradient descent: $w\leftarrow w-\alpha\cdot\delta_t\cdot\frac{\partial v(s_t;w)}{\partial w}$
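Putting the two updates together, here is a hedged PyTorch sketch of one Reinforce-with-baseline update over a finished episode, reusing the hypothetical PolicyValueNet and discounted_returns helpers sketched above; the optimizer, learning rate, and recorded episode tensors are placeholders.

```python
import torch

def reinforce_with_baseline_update(net, optimizer, states, actions, rewards, gamma=0.99):
    """One update from a complete episode.

    states: Tensor [T, obs_dim]; actions: Tensor [T] (long); rewards: list of T floats.
    """
    u = torch.tensor(discounted_returns(rewards, gamma))   # returns u_t, shape [T]
    dist, v = net(states)                                   # pi(.|s_t;theta) and v(s_t;w)
    log_prob = dist.log_prob(actions)                       # ln pi(a_t|s_t;theta)

    delta = v - u                                           # delta_t = v(s_t;w) - u_t
    # Policy: ascending along grad ln pi * (u_t - v) == descending on log_prob * delta.
    policy_loss = (log_prob * delta.detach()).mean()
    # Value: descending on delta_t^2 / 2.
    value_loss = 0.5 * (delta ** 2).mean()

    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()

# Example wiring (shapes only):
# net = PolicyValueNet(obs_dim=4, n_actions=2)
# optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
```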


3. Advantage Actor-Critic (A2C)
3.1. Actor and Critic

The networks have the same structure as in Section 2, but the training method is different. Unlike the earlier Actor-Critic, the critic here estimates the state value rather than the action value; since the state value depends only on the current state, it is easier to train.


3.2. Training of A2C
  • Observe a transition $(s_t,a_t,r_t,s_{t+1})$

  • TD target: $y_t=r_t+\gamma\cdot v(s_{t+1};w)$

  • TD error: $\delta_t=v(s_t;w)-y_t$

  • Update the policy network (actor) by:
    $$\theta\leftarrow\theta-\beta\cdot\delta_t\cdot\frac{\partial\ln\pi(a_t|s_t;\theta)}{\partial\theta}$$

  • Update the value network (critic) by:
    $$w\leftarrow w-\alpha\cdot\delta_t\cdot\frac{\partial v(s_t;w)}{\partial w}$$
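A hedged PyTorch sketch of this per-transition update, again reusing the hypothetical shared PolicyValueNet from Section 2.3; the `done` flag handling and the hyperparameters are my own assumptions rather than part of the notes.

```python
import torch

def a2c_update(net, optimizer, s, a, r, s_next, done, gamma=0.99):
    """One A2C step from a single transition (s_t, a_t, r_t, s_{t+1}).

    s, s_next: Tensor [obs_dim]; a: Tensor scalar (long); r, done: floats (done is 0.0 or 1.0).
    """
    dist, v = net(s)                                 # pi(.|s_t;theta) and v(s_t;w)
    with torch.no_grad():
        _, v_next = net(s_next)
        y = r + gamma * v_next * (1.0 - done)        # TD target y_t (no bootstrap at episode end)

    delta = v - y                                    # TD error delta_t
    policy_loss = dist.log_prob(a) * delta.detach()  # descent here == theta <- theta - beta*delta*grad ln pi
    value_loss = 0.5 * delta ** 2                    # descent here == w <- w - alpha*delta*grad v

    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()
```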

3.3. Mathematical Derivation of the Algorithm
  • Value functions

    From the TD algorithm we already have: $Q_\pi(s_t,a_t)=E_{S_{t+1},A_{t+1}}[R_t+\gamma\cdot Q_\pi(S_{t+1},A_{t+1})]$

    Since $R_t$ depends on $S_{t+1}$ but not on $A_{t+1}$, while $Q_\pi(S_{t+1},A_{t+1})$ depends on both, we can further write:
    $$\begin{aligned} Q_\pi(s_t,a_t)&=E_{S_{t+1}}\left[R_t+\gamma\cdot E_{A_{t+1}}[Q_\pi(S_{t+1},A_{t+1})]\right] \\&=E_{S_{t+1}}[R_t+\gamma\cdot V_\pi(S_{t+1})] \end{aligned}$$

    • Theorem 1: $Q_\pi(s_t,a_t)=E_{S_{t+1}}[R_t+\gamma\cdot V_\pi(S_{t+1})]$

    Using the definition of the state-value function, $V_\pi(s_t)=E_{A_t}[Q_\pi(s_t,A_t)]$, we further get:
    $$V_\pi(s_t)=E_{A_t}\left[E_{S_{t+1}}[R_t+\gamma\cdot V_\pi(S_{t+1})]\right]$$

    • Theorem 2: $V_\pi(s_t)=E_{A_t,S_{t+1}}[R_t+\gamma\cdot V_\pi(S_{t+1})]$
  • Monte Carlo approximations

    Given one observed transition, apply a Monte Carlo approximation to Theorems 1 and 2:
    $$\begin{aligned} Q_\pi(s_t,a_t)&\approx r_t+\gamma\cdot V_\pi(s_{t+1}) \\ V_\pi(s_t)&\approx r_t+\gamma\cdot V_\pi(s_{t+1}) \end{aligned}$$

  • Updating policy network

    The stochastic policy gradient with baseline: $g(a_t)=\frac{\partial \ln\pi(a_t|s_t;\theta)}{\partial\theta}\cdot (Q_\pi(s_t,a_t)-V_\pi(s_t))$

    Qπ(st,at)Q_\pi(s_t,a_t)Qπ(st,at)进行蒙特卡洛近似: g(at)=∂ln⁡(π(at∣st;θ))∂θ⋅rt+γ⋅Vπ(st+1)−Vπ(st))g(a_t)=\frac{\partial \ln(\pi(a_t|s_t;\theta))}{\partial\theta}·r_t+\gamma·V_\pi(s_{t+1})-V_\pi(s_t))g(at)=θln(π(atst;θ))rt+γVπ(st+1)Vπ(st))

    Approximate $V_\pi$ with the value network $v(s;w)$:
    $$g(a_t)\approx\frac{\partial \ln\pi(a_t|s_t;\theta)}{\partial\theta}\cdot (r_t+\gamma\cdot v(s_{t+1};w)-v(s_t;w))$$
    With the TD target $y_t=r_t+\gamma\cdot v(s_{t+1};w)$, this becomes:
    $$g(a_t)\approx\frac{\partial \ln\pi(a_t|s_t;\theta)}{\partial\theta}\cdot (y_t-v(s_t;w))$$
    Since $y_t-v(s_t;w)=-\delta_t$, gradient ascent $\theta\leftarrow\theta+\beta\cdot g(a_t)$ becomes:
    $$\theta\leftarrow\theta-\beta\cdot\delta_t\cdot\frac{\partial\ln\pi(a_t|s_t;\theta)}{\partial\theta}$$

  • Updating value network

    By Theorem 2, fit the value network $v(s;w)$ to $V_\pi(s_t)$ so that:
    $$v(s_t;w)\approx r_t+\gamma\cdot v(s_{t+1};w)=y_t$$
    TD error: $\delta_t=v(s_t;w)-y_t$

    Gradient: $\frac{\partial(\delta_t^2/2)}{\partial w}=\delta_t\cdot\frac{\partial v(s_t;w)}{\partial w}$

    Gradient descent: $w\leftarrow w-\alpha\cdot\delta_t\cdot\frac{\partial v(s_t;w)}{\partial w}$

4. Reinforce versus A2C
4.1 Policy and Value Networks

The two algorithms use networks with the same structure: both have a policy network and a value network. In Reinforce with baseline, the value network serves only as a baseline that reduces the variance of the stochastic gradient; in A2C, the value network evaluates the actor, i.e. it is the critic.

In terms of the algorithm flow, the two differ only in the TD target and error: Reinforce uses the actually observed return, while A2C uses the TD target, which is partly based on an observation and partly on a prediction.


With the multi-step TD target, using only one step gives the one-step TD target of standard A2C; using all remaining steps of the episode turns the target into $y_t=u_t=\sum_{i=t}^n\gamma^{i-t}\cdot r_i$, and A2C becomes identical to Reinforce with baseline. In this sense, Reinforce is a special case of A2C.

4.2 A2C with Multi-Step TD Target

One-step TD target:
$$y_t=r_t+\gamma\cdot v(s_{t+1};w)$$
m-step TD target:
$$y_t=\sum_{i=0}^{m-1}\gamma^i\cdot r_{t+i}+\gamma^m\cdot v(s_{t+m};w)$$
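A small helper sketch (my own, with placeholder inputs) of computing the m-step TD target from m recorded rewards and the bootstrapped value of the state m steps ahead:

```python
def m_step_td_target(rewards, v_boot, gamma=0.99):
    """y_t = sum_{i=0}^{m-1} gamma^i * r_{t+i} + gamma^m * v(s_{t+m}; w).

    rewards: the m observed rewards r_t, ..., r_{t+m-1}
    v_boot:  the value network's estimate v(s_{t+m}; w)
    """
    m = len(rewards)
    y = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return y + (gamma ** m) * v_boot

# With m = 1 this reduces to the one-step TD target r_t + gamma * v(s_{t+1}; w):
print(m_step_td_target([1.0], v_boot=0.5, gamma=0.9))   # 1.0 + 0.9 * 0.5 = 1.45
```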
