Policy-Based Reinforcement Learning

Reinforcement Learning

This chapter covers policy-based reinforcement learning.

Preface

A note up front: these are my study notes. The course link is as follows:

最易懂的强化学习课程
WeChat official account: AI那些事


1. The Policy Function

Let us first review action-value functions:
[Figure: review of action-value functions]
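For reference, the definitions being reviewed (written here in the standard notation; the exact symbols on the slide may differ slightly) are the discounted return and the action-value function:

$$U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots$$

$$Q_\pi(s_t, a_t) = \mathbb{E}\left[\, U_t \mid S_t = s_t,\ A_t = a_t \,\right]$$

where the expectation is taken over the future states and over the future actions drawn from the policy π.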
  The policy function, written π(a|s), is a probability density function that we can use to automatically control the agent. Its input is the current state s, and its output is a probability distribution that assigns a probability to every action. Given these probabilities, the agent performs one random draw to obtain an action a: every action can be sampled, but actions with larger probability are more likely to be chosen.
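As a minimal sketch of this sampling step (the action names and probabilities below are made up for illustration and are not from the course):

```python
import numpy as np

# Hypothetical policy output for the current state s: one probability per
# discrete action (illustrative numbers only).
actions = ["left", "right", "up"]
probs = np.array([0.2, 0.1, 0.7])   # non-negative, sums to 1

# The agent performs one random draw: every action can be sampled,
# but actions with larger probability are chosen more often.
a = np.random.choice(actions, p=probs)
print("sampled action:", a)
```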

  So as long as we have a good policy function π, we can use π to control the agent automatically. But how do we obtain such a policy function?

  If the game had only 5 actions and 10 states, that would be easy: draw a 5×10 table with one probability in each cell, and estimate those 50 values by playing the game. But a game like Super Mario has an essentially unlimited number of states, so there is no way to record the probability of every action in every state in a table. To make an agent play a game like Super Mario automatically, we cannot compute the policy function directly.
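That small tabular case can be sketched directly; the sizes and the uniform initial values below are placeholders, just to make the idea concrete:

```python
import numpy as np

n_states, n_actions = 10, 5

# One row per state, one column per action; each row is a probability
# distribution over the 5 actions, so these 50 entries fully specify pi(a|s).
policy_table = np.full((n_states, n_actions), 1.0 / n_actions)  # start uniform

# To act in state s, just look up row s and sample an action from it.
s = 3
a = np.random.choice(n_actions, p=policy_table[s])
```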

  So we have to use function approximation: learn a function that approximates the policy function. There are many ways to approximate a function, e.g. with linear functions or neural networks. If we use a neural network, we call it a policy network and write it as π(a|s;θ), where θ denotes the parameters of the network. θ is initialized at first and then improved through learning. For the Super Mario problem, we can set up the neural network as follows:
[Figure: policy network architecture for Super Mario]
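A possible sketch of such a policy network in PyTorch follows; the layer sizes, the input resolution, and the three actions are my own illustrative assumptions rather than the exact architecture on the slide:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """pi(a|s; theta): maps a screen image s to one probability per action."""
    def __init__(self, n_actions=3):
        super().__init__()
        # Convolutional feature extractor (layer sizes are assumptions).
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        # Dense layers followed by softmax, so the outputs form a distribution.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(128), nn.ReLU(),
            nn.LazyLinear(n_actions),
            nn.Softmax(dim=-1),   # outputs are positive and sum to 1
        )

    def forward(self, s):
        return self.head(self.conv(s))

net = PolicyNetwork(n_actions=3)
s = torch.rand(1, 3, 84, 84)      # a fake 84x84 RGB screen state, batch of 1
probs = net(s)                    # e.g. tensor([[0.33, 0.31, 0.36]])
print(probs.sum(dim=-1))          # = 1, as a probability distribution must
```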
  Since the policy function π is a probability density function, π must satisfy:
$$\sum_{a \in \mathcal{A}} \pi(a \mid s; \theta) = 1$$
  What does this mean? It means that, summed over all actions a, the outputs of the π function must add up to 1.
  This is also why a softmax layer is placed at the output: softmax makes every output positive and forces them to sum to 1.
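A quick check of that property (the raw scores below are arbitrary): softmax turns any vector of scores into positive values that sum to exactly 1:

```python
import numpy as np

def softmax(scores):
    z = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return z / z.sum()

scores = np.array([2.0, -1.0, 0.5])    # raw, unnormalized network outputs (made up)
probs = softmax(scores)
print(probs, probs.sum())              # every entry positive, total exactly 1.0
```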

Multi-agent reinforcement learning (MARL) is a subfield of reinforcement learning (RL) that involves multiple agents learning simultaneously in a shared environment. MARL has been studied for several decades, but recent advances in deep learning and computational power have led to significant progress in the field. The development of MARL can be divided into several key stages:

1. Early approaches: In the early days, MARL algorithms were based on game theory and heuristic methods. These approaches were limited in their ability to handle complex environments or large numbers of agents.
2. Independent Learners: The Independent Learners (IL) algorithm was proposed in the 1990s, which allowed agents to learn independently while interacting with a shared environment. This approach was successful in simple environments but often led to convergence issues in more complex scenarios.
3. Decentralized Partially Observable Markov Decision Process (Dec-POMDP): The Dec-POMDP framework was introduced to address the challenges of coordinating multiple agents in a decentralized manner. This approach models the environment as a partially observable Markov decision process (POMDP), which allows agents to reason about the beliefs and actions of other agents.
4. Deep MARL: The development of deep learning techniques, such as deep neural networks, has enabled the use of MARL in more complex environments. Deep MARL algorithms, such as Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG), have achieved state-of-the-art performance in many applications.
5. Multi-Agent Actor-Critic (MAAC): MAAC is a recent algorithm that combines the advantages of policy-based and value-based methods. MAAC uses an actor-critic architecture to learn decentralized policies and value functions for each agent, while also incorporating a centralized critic to estimate the global value function.

Overall, the development of MARL has been driven by the need to address the challenges of coordinating multiple agents in complex environments. While there is still much to be learned in this field, recent advances in deep learning and reinforcement learning have opened up new possibilities for developing more effective MARL algorithms.