Reinforcement Learning Notes 8: Continuous Control

This post covers the difference between discrete and continuous control and focuses on the Deterministic Policy Gradient (DPG) algorithm, including the updates of the deterministic policy network and the value network. In DPG, target networks are used to address the bootstrapping problem, and further improvements such as an experience replay buffer and multi-step TD targets are discussed. The post also contrasts stochastic and deterministic policies, and shows how continuous control can use a policy network with a multivariate normal distribution to generate random actions. Finally, it covers the policy gradient methods REINFORCE and Actor-Critic, and introduces a baseline to improve training stability.


1. Discrete vs. Continuous Control
  • Discrete Action Space

  • Continuous Action Space

  • Algorithms such as DQN and policy networks can handle discrete control problems; the output is a vector of a fixed, finite dimension (one entry per discrete action).

  • Discretization: turn the action space into a finite discrete set; suitable when the dimensionality is small.

    If the control problem has $d$ degrees of freedom, the action space is $d$-dimensional; after discretization, the number of points in the discrete set grows exponentially with $d$, causing the curse of dimensionality and making training difficult (see the counting example after this list).

  • Other approaches: deterministic policy networks; stochastic policy networks.
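
As a quick illustration of the counting argument above, the snippet below uses made-up values for the degrees of freedom $d$ and the number of grid points $k$ per dimension:

```python
# Curse of dimensionality under discretization: with k grid points per
# dimension and d degrees of freedom, there are k**d discrete actions.
k = 10                      # hypothetical: 10 grid points per dimension
for d in (1, 3, 7):         # hypothetical degrees of freedom (7 ~ a robot arm)
    print(f"d = {d}: {k**d:,} discrete actions")
# d = 1: 10 discrete actions
# d = 3: 1,000 discrete actions
# d = 7: 10,000,000 discrete actions
```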

2. Deterministic Policy Gradient (DPG)
2.1. Deterministic Actor-Critic
  • Deterministic policy network (actor): $a=\pi(s;\theta)$. The output is not a probability but a concrete action $a$; its dimension equals the dimension of the action space.
  • Value network (critic): $q(s,a;w)$, a scalar score of taking action $a$ in state $s$.

(Figure: deterministic actor-critic architecture.)
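
A minimal PyTorch sketch of the two networks, assuming fully connected layers, a tanh-squashed action range of $[-1,1]$, and arbitrary layer sizes (none of these choices come from the notes):

```python
import torch
import torch.nn as nn

class DeterministicActor(nn.Module):
    """pi(s; theta): maps a state directly to one concrete action."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # squash each action dim to [-1, 1]
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """q(s, a; w): scores a state-action pair with a single scalar value."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)
```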

2.2. Updating Value Network by TD
  • Transition: $(s_t,a_t,r_t,s_{t+1})$

  • Value network:
    $$\begin{aligned} q_t &= q(s_t,a_t;w) \\ q_{t+1} &= q(s_{t+1},a'_{t+1};w), \quad \text{where } a'_{t+1}=\pi(s_{t+1};\theta) \end{aligned}$$

  • TD error: $\delta_t=q_t-(r_t+\gamma\cdot q_{t+1})$

    TD target: $y_t=r_t+\gamma\cdot q_{t+1}$

  • Update: $w\leftarrow w-\alpha\cdot\delta_t\cdot\frac{\partial q(s_t,a_t;w)}{\partial w}$ (gradient descent on the squared TD error)
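
A sketch of this TD step, reusing the hypothetical `DeterministicActor`/`Critic` classes from Section 2.1 and a standard optimizer; minimizing the squared TD error reproduces the $\delta_t\cdot\partial q/\partial w$ update up to a constant factor:

```python
import torch
import torch.nn.functional as F

def td_update_critic(critic, actor, critic_opt, s_t, a_t, r_t, s_next, gamma=0.99):
    """One TD step on q(s, a; w): move q_t toward y_t = r_t + gamma * q_{t+1}."""
    with torch.no_grad():                        # the TD target is treated as a constant
        a_next = actor(s_next)                   # a'_{t+1} = pi(s_{t+1}; theta)
        y_t = r_t + gamma * critic(s_next, a_next)
    q_t = critic(s_t, a_t)
    loss = F.mse_loss(q_t, y_t)                  # gradient is proportional to delta_t * dq/dw
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()
```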

2.3. Updating Policy Network by DPG

Goal: increase $q(s,a;w)$, where $a=\pi(s;\theta)$.

DPG:
$$g=\frac{\partial q(s,\pi(s;\theta);w)}{\partial \theta}=\frac{\partial a}{\partial \theta}\cdot\frac{\partial q(s,a;w)}{\partial a}$$
Gradient ascent: $\theta\leftarrow \theta+\beta\cdot g$
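
A sketch of the corresponding actor step; autograd applies the chain rule $\frac{\partial a}{\partial\theta}\cdot\frac{\partial q}{\partial a}$ automatically when $a=\pi(s;\theta)$ is kept in the computation graph (function and variable names are illustrative):

```python
def dpg_update_actor(actor, critic, actor_opt, s_t):
    """Gradient ascent on q(s, pi(s; theta); w) with respect to theta only."""
    a = actor(s_t)                         # a = pi(s; theta), kept in the graph
    actor_loss = -critic(s_t, a).mean()    # ascending q is the same as descending -q
    actor_opt.zero_grad()
    actor_loss.backward()                  # chain rule: dq/da * da/dtheta
    actor_opt.step()                       # only theta is stepped; w is left unchanged
```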

2.4. Improvement: Using Target Network
  • Bootstrapping

    The TD error $\delta_t=q_t-(r_t+\gamma\cdot q_{t+1})$ involves bootstrapping: the network's own estimate $q_{t+1}$ appears in its training target, so an initial overestimate (or underestimate) tends to propagate and cause further overestimation (or underestimation) in later updates.

    Solution: compute the TD target with separate networks, called target networks.

  • target networks

    • Value network: $q_t=q(s_t,a_t;w)$

    • TD target with target networks: $q_{t+1}=q(s_{t+1},a'_{t+1};w^-)$, where $a'_{t+1}=\pi(s_{t+1};\theta^-)$

      Target value network: $q(s,a;w^-)$

      Target policy network: $\pi(s;\theta^-)$

  • The full training procedure (TD update of the critic, DPG update of the actor, and target networks for the TD target) is summarized in the sketch at the end of this subsection.

  • Updating the target networks

    With a hyper-parameter $\tau\in(0,1)$, the parameters are updated by weighted averaging:
    $$\begin{aligned} w^- &\leftarrow \tau\cdot w+(1-\tau)\cdot w^- \\ \theta^- &\leftarrow \tau\cdot \theta+(1-\tau)\cdot \theta^- \end{aligned}$$
    Because the target-network parameters still track the online networks, this mitigates but does not completely eliminate bootstrapping.
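
Putting Sections 2.2-2.4 together, a sketch of the target-network bookkeeping; the value $\tau=0.005$ is a made-up example, and `actor`, `critic` and the update functions are the hypothetical ones sketched above:

```python
import copy

# Target networks start as exact copies of the online networks.
actor_target = copy.deepcopy(actor)
critic_target = copy.deepcopy(critic)

def soft_update(online, target, tau=0.005):
    """w^- <- tau * w + (1 - tau) * w^-   (and likewise for theta^-)."""
    for p, p_targ in zip(online.parameters(), target.parameters()):
        p_targ.data.copy_(tau * p.data + (1.0 - tau) * p_targ.data)

# In the training loop, the TD target now uses the target networks:
#     a'_{t+1} = actor_target(s_next)
#     y_t      = r_t + gamma * critic_target(s_next, a'_{t+1})
# and after every critic/actor update:
#     soft_update(critic, critic_target)
#     soft_update(actor, actor_target)
```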

2.5. Improvements
  • Target networks
  • Experience replay (a minimal buffer sketch follows this list)
  • Multi-step TD targets
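
A minimal sketch of a uniform replay buffer (capacity and batch size are arbitrary); multi-step TD targets would additionally sum $n$ discounted rewards before bootstrapping:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions (s_t, a_t, r_t, s_{t+1}), sampled uniformly."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=64):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```
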
2.6. Stochastic Policy vs. Deterministic Policy

(Figure: stochastic vs. deterministic policies. A stochastic policy $\pi(a|s;\theta)$ outputs a probability distribution over actions; a deterministic policy $\pi(s;\theta)$ maps each state to a single action.)

3. Stochastic Policy for Continuous Control
3.1. Policy Network
  • Univariate normal distribution

    Consider a single degree of freedom, $d=1$. The mean $\mu$ and standard deviation $\sigma$ are functions of the state $s$.

    The probability density function of the normal distribution is used as the policy function:
    $$\pi(a|s)=\frac{1}{\sqrt{2\pi}\,\sigma}\cdot\exp\!\left(-\frac{(a-\mu)^2}{2\sigma^2}\right)$$

  • Multivariate normal distribution

    With $d$ degrees of freedom, the action $a$ is $d$-dimensional. The mean and standard deviation are functions $\pmb{\mu},\pmb{\sigma}:\mathcal{S}\rightarrow\mathbb{R}^d$ that take the state $s$ as input and output $d$-dimensional vectors.

    Let $\mu_i,\sigma_i$ denote the $i$-th components of $\pmb{\mu}(s),\pmb{\sigma}(s)$. Assuming the action dimensions are independent, the PDF is
    $$\pi(a|s)=\prod_{i=1}^d \frac{1}{\sqrt{2\pi}\,\sigma_i}\cdot\exp\!\left(-\frac{(a_i-\mu_i)^2}{2\sigma_i^2}\right)$$

  • Function approximation

    • Approximate the mean $\pmb\mu(s)$ with a neural network $\pmb\mu(s;\pmb\theta^\mu)$;

    • Approximating the standard deviation $\pmb\sigma(s)$ directly with a neural network $\pmb\sigma(s;\pmb\theta^\sigma)$ works poorly in practice;

    • Instead, approximate the log-variance with a neural network $\pmb\rho(s;\pmb\theta^\rho)$, where
      $$\rho_i=\ln\sigma_i^2,\quad i=1,\dots,d$$

    (Figure: policy network with a mean head $\pmb\mu(s;\pmb\theta^\mu)$ and a log-variance head $\pmb\rho(s;\pmb\theta^\rho)$.)
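
A sketch of such a policy network, assuming a shared hidden trunk with two linear heads for $\pmb\mu$ and $\pmb\rho$ (the sharing and the layer sizes are assumptions, not something the notes specify):

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Outputs mu(s; theta^mu) and rho(s; theta^rho) = ln sigma(s)^2, both d-dimensional."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, action_dim)    # mean head
        self.rho_head = nn.Linear(hidden, action_dim)   # log-variance head

    def forward(self, s):
        h = self.trunk(s)
        return self.mu_head(h), self.rho_head(h)
```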

  • Continuous Control

    • Observe the current state $s_t$;

    • Compute the mean $\hat{\pmb\mu}=\pmb\mu(s_t;\pmb\theta^\mu)$ and the log-variance $\hat{\pmb\rho}=\pmb\rho(s_t;\pmb\theta^\rho)$, so that $\hat{\sigma}_i^2=\exp(\hat{\rho}_i)$;

    • Randomly sample the action component-wise (see the sketch after this list):
      $$a_i\sim\mathcal{N}(\hat{\mu}_i,\hat{\sigma}_i^2),\quad i=1,\dots,d$$
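
A sketch of these three steps, using the hypothetical `GaussianPolicy` above and the reparameterization $a_i=\hat\mu_i+\hat\sigma_i\cdot\varepsilon_i$ with $\varepsilon_i\sim\mathcal N(0,1)$:

```python
import torch

def sample_action(policy, s_t):
    """Draw a ~ N(mu_hat, diag(sigma_hat^2)) with sigma_hat_i^2 = exp(rho_hat_i)."""
    with torch.no_grad():                      # acting, not training
        mu_hat, rho_hat = policy(s_t)
        sigma_hat = torch.exp(0.5 * rho_hat)   # sigma_i = exp(rho_i / 2)
        eps = torch.randn_like(mu_hat)         # standard normal noise
        return mu_hat + sigma_hat * eps
```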

  • Training Policy Network

    • Auxiliary network

    • Policy gradient methods

      • Option 1: REINFORCE
      • Option 2: Actor-Critic
3.2. Training (1/2): Auxiliary Network
  • Stochastic policy gradient:
    $$g(a)=\frac{\partial \ln\pi(a|s;\theta)}{\partial\theta}\cdot Q_\pi(s,a)$$

  • Policy network:
    $$\pi(a|s;\pmb\theta^\mu,\pmb\theta^\rho)=\prod_{i=1}^d \frac{1}{\sqrt{2\pi}\,\sigma_i}\cdot\exp\!\left(-\frac{(a_i-\mu_i)^2}{2\sigma_i^2}\right)$$
    Log of the policy network:
    $$\begin{aligned} \ln\pi(a|s;\pmb\theta^\mu,\pmb\theta^\rho) &=\sum_{i=1}^d \left[-\ln\sigma_i-\frac{(a_i-\mu_i)^2}{2\sigma_i^2}\right]+\text{const} \\&=\sum_{i=1}^d \left[-\frac{\rho_i}{2}-\frac{(a_i-\mu_i)^2}{2\exp(\rho_i)}\right]+\text{const} \end{aligned}$$

  • Auxiliary network:
    $$f(s,a;\pmb\theta)=\sum_{i=1}^d \left[-\frac{\rho_i}{2}-\frac{(a_i-\mu_i)^2}{2\exp(\rho_i)}\right],\quad \pmb\theta=(\pmb\theta^\mu,\pmb\theta^\rho)$$
    (Figures: the auxiliary network $f(s,a;\pmb\theta)$ is computed from the $\pmb\mu$ and $\pmb\rho$ outputs of the policy network, so gradients flow back to $\pmb\theta$.)
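
A sketch of $f(s,a;\pmb\theta)$ built on the hypothetical `GaussianPolicy` from Section 3.1; it is just $\ln\pi(a|s;\pmb\theta)$ with the constant dropped, so gradients flow back into $\pmb\theta^\mu$ and $\pmb\theta^\rho$:

```python
import torch

def auxiliary_f(policy, s, a):
    """f(s, a; theta) = sum_i [ -rho_i/2 - (a_i - mu_i)^2 / (2 * exp(rho_i)) ]."""
    mu, rho = policy(s)
    return (-0.5 * rho - (a - mu) ** 2 / (2.0 * torch.exp(rho))).sum(dim=-1)
```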

3.2. Training (2/2): Policy Gradient Methods
  • Stochastic policy gradient:
    $$\begin{aligned} f(s,a;\pmb\theta)&=\ln\pi(a|s;\pmb\theta)+\text{const} \\ g(a)&=\frac{\partial \ln\pi(a|s;\pmb\theta)}{\partial\pmb\theta}\cdot Q_\pi(s,a) \end{aligned}$$
    Since the constant does not depend on $\pmb\theta$, it follows that
    $$g(a)=\frac{\partial f(s,a;\pmb\theta)}{\partial\pmb\theta}\cdot Q_\pi(s,a)$$
    Next, $Q_\pi(s,a)$ has to be approximated.

  • Option 1: REINFORCE

    Monte Carlo approximation: use the observed return $u_t$ to approximate $Q_\pi(s,a)$. Parameter update:
    $$\pmb\theta\leftarrow\pmb\theta+\beta\cdot\frac{\partial f(s,a;\pmb\theta)}{\partial\pmb\theta}\cdot u_t$$

  • Option 2: Actor-Critic

    Use a value network $q(s,a;\pmb w)$ to approximate $Q_\pi(s,a)$. Parameter update:
    $$\pmb\theta\leftarrow\pmb\theta+\beta\cdot\frac{\partial f(s,a;\pmb\theta)}{\partial\pmb\theta}\cdot q(s,a;\pmb w)$$
    The value network $q(s,a;\pmb w)$ is trained with TD learning.
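
A sketch covering both options in one update function; `q_value` is either the observed return $u_t$ (REINFORCE) or the detached critic output $q(s_t,a_t;\pmb w)$ (Actor-Critic), $\beta$ corresponds to the optimizer's learning rate, and all names are illustrative:

```python
def update_policy(policy, policy_opt, s_t, a_t, q_value):
    """theta <- theta + beta * df/dtheta * q_value  (q_value carries no gradient)."""
    loss = -(auxiliary_f(policy, s_t, a_t) * q_value).mean()  # ascend f * Q == descend -f * Q
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()

# Option 1 (REINFORCE):    update_policy(policy, opt, s_t, a_t, u_t)
# Option 2 (Actor-Critic): update_policy(policy, opt, s_t, a_t, critic(s_t, a_t).detach())
```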

3.3. Improvement: Policy Gradient with Baseline
  • REINFORCE: REINFORCE with baseline.
  • Actor-Critic: Advantage Actor-Critic (A2C).