1. Finite Markov Decision Process

A Finite Markov Decision Process (FMDP) is a decision-theoretic model that describes how an agent learns an optimal policy by observing states, choosing actions, and receiving rewards from its environment. The value function under a given policy is computed through the Bellman equation. In practice, however, the environment's dynamics may be unknown, or the state space may be too large to enumerate. An optimal policy is one that maximizes the total reward in the long run and satisfies a particular Bellman equation; finding and implementing such a policy is the core challenge of reinforcement learning.


Contents

Finite Markov Decision Process

Definition

Formula (Bellman equation)

Consideration

Limitation

 Optimal policies

Definition

Optimal Bellman equation


RL: An Introduction link:

Sutton & Barto Book: Reinforcement Learning: An Introduction

Finite Markov Decision Process

Definition

The ego (agent) monitors the situation of the environment, for example through data flows or sensors (cameras, lidars or others); this observation is what the term state refers to. We have to highlight that we presume the agent always knows enough information about the situation to make its decisions. So we have the state \mathbf{S_t} at time step t.

After observing the current state, the agent has a finite set of actions to choose from ( \mathbf{A_t} ). After taking an action, the agent obtains the reward for this step ( \mathbf{R_{t+1}} ) and moves into the next state ( \mathbf{S_{t+1}} ). In this process the agent also knows the environment's dynamics ( \mathbf{p(s^{'}, r | s, a)} ). The agent then keeps interacting with the environment in this way until the episode ends.

The environment's dynamics are not decided by us, but which action to take depends on the agent's judgement. In every state \mathbf{S_t} we want to choose the actions that give us the most reward in total, not just in the short run but also in the long run. Therefore, the policy for choosing actions in each state is the core of reinforcement learning. We use \mathbf{\pi (a|s)} to describe the probability of taking each action in the current state.


Therefore, a Finite Markov Decision Process is a process in which the agent knows the current state, the actions it can choose from, and even the probability of \mathbf{R_{t+1}} and \mathbf{S_{t+1}} for each action ( \mathbf{p(s^{'}, r | s, a)} ), and obtains the expected return under different policies.
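As a concrete illustration (a minimal sketch with made-up states, actions, rewards and probabilities, not an example from the book), such a finite MDP can be written down explicitly as a table of the dynamics p(s', r | s, a) together with a policy \pi(a|s):

```python
# A hypothetical toy finite MDP.
# dynamics[(s, a)] lists (next_state, reward, probability) triples,
# i.e. an explicit tabulation of p(s', r | s, a).
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

dynamics = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "move"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "move"): [("s0", 2.0, 1.0)],
}

# policy[s][a] = pi(a | s): the probability of taking action a in state s.
policy = {
    "s0": {"stay": 0.5, "move": 0.5},
    "s1": {"stay": 0.5, "move": 0.5},
}
```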

Formula (Bellman equation)

Mathematically, we can calculate the value function \mathbf{v_{\pi}}:

v_{\pi}(s) = E[ G_t | S_t = s ] = E[ R_{t+1} + \gamma G_{t+1} | S_t = s ] = \sum_{a}\pi(a|s)\sum_{s^{'}}\sum_{r} p(s^{'}, r | s, a) [ r + \gamma v_{\pi}(s^{'}) ]
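As a sketch of how this equation is used in practice (reusing the hypothetical STATES, dynamics and policy structures from the example above), iterative policy evaluation simply applies the right-hand side as an update rule until the values stop changing:

```python
def evaluate_policy(states, dynamics, policy, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation: repeatedly apply the Bellman equation
    v(s) <- sum_a pi(a|s) sum_{s', r} p(s', r | s, a) [r + gamma * v(s')]."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                pi_a * sum(p * (r + gamma * v[s2]) for s2, r, p in dynamics[(s, a)])
                for a, pi_a in policy[s].items()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

v_pi = evaluate_policy(STATES, dynamics, policy)
print(v_pi)   # approximate v_pi(s) for each state
```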

Consideration

For every scenario, we know the dynamics of the environment \mathbf{p(s^{'}, r | s, a)}, the state set \mathbf{S_t} and the corresponding action set \mathbf{A_t}. For every policy we set, we know \mathbf{\pi (a|s)}. So we obtain N equations, one per state, in the N unknowns \mathbf{v_\pi (s)}.
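In other words, for a small problem these N Bellman equations form a linear system that can be solved exactly. A minimal sketch, assuming the same hypothetical dynamics/policy structures as above and using numpy:

```python
import numpy as np

def solve_bellman_equations(states, dynamics, policy, gamma=0.9):
    """Solve the N linear Bellman equations (I - gamma * P_pi) v = r_pi exactly."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    P = np.zeros((n, n))   # P[s, s'] = sum_a pi(a|s) sum_r p(s', r | s, a)
    r_bar = np.zeros(n)    # r_bar[s] = expected one-step reward under pi
    for s in states:
        for a, pi_a in policy[s].items():
            for s2, reward, p in dynamics[(s, a)]:
                P[idx[s], idx[s2]] += pi_a * p
                r_bar[idx[s]] += pi_a * p * reward
    v = np.linalg.solve(np.eye(n) - gamma * P, r_bar)
    return dict(zip(states, v))
```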

Limitation

  1. In many cases, we do not know the dynamics of the environment.
  2. In many cases, such as backgammon, there are far too many states, so we have no capacity to solve the problem in this way (by solving the system of equations).
  3. The formulation requires the Markov property: \mathbf{s^{'}} and \mathbf{r} depend only on the current state s and action a, not on the earlier history. In other words, s and a are enough to determine the distribution over all possible \mathbf{s^{'}} and \mathbf{r}.

Optimal policies

Definition

For policy \pi and policy \pi^{'}, if for each state s the following inequality is fulfilled:

v_{\pi}(s) \geq v_{\pi^{'}}(s)

then we can say \pi is better than (or equal to) \pi^{'}.

Therefore, there is always at least one policy that is better than or equal to all the others (there may be more than one); this is the optimal policy \pi_{*}.

At the same time, every state under the optimal policy \pi_{*} also satisfies the Bellman equation:

v_{\pi_{*}}(s) = \sum_{a}\pi_{*}(a|s)\sum_{s^{'}}\sum_{r} p(s^{'}, r | s, a) [ r + \gamma v_{\pi_{*}}(s^{'}) ]
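The ordering of policies defined above can also be checked numerically. A small sketch, reusing the hypothetical evaluate_policy function from the earlier example: \pi dominates \pi^{'} exactly when its value is at least as large in every state.

```python
def policy_dominates(states, dynamics, pi, pi_prime, gamma=0.9):
    """Return True if v_pi(s) >= v_pi'(s) holds for every state s."""
    v = evaluate_policy(states, dynamics, pi, gamma)
    v_prime = evaluate_policy(states, dynamics, pi_prime, gamma)
    return all(v[s] >= v_prime[s] for s in states)
```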

 Optimal Bellman equation

For \pi_{*} among the set of all policies:

v_{\pi_{*}}(s) = \max_{\pi}\sum_{a}\pi(a|s)\, q_{\pi_{*}}(s,a) = \max_{\pi}\sum_{a}\pi(a|s)\sum_{s^{'}}\sum_{r} p(s^{'}, r | s, a) [ r + \gamma v_{\pi_{*}}(s^{'}) ]

For a specific case, the environment's dynamics are fixed; we can only change how \pi(a|s) apportions probability among the actions.

To maximize v(s), we should apportion a probability of 1 to the action with the maximum q(s,a).

Therefore, the optimal policy is actually a greedy policy (with respect to the optimal action values) without any exploration.
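As a sketch of this reasoning in code (again using the hypothetical toy MDP structures from above, and value iteration rather than solving the equations exactly), we can apply the optimal Bellman equation as an update and then read off the greedy policy:

```python
def value_iteration(states, actions, dynamics, gamma=0.9, tol=1e-8):
    """Repeatedly apply v(s) <- max_a sum_{s', r} p(s', r | s, a) [r + gamma * v(s')]."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (r + gamma * v[s2]) for s2, r, p in dynamics[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v

def greedy_policy(states, actions, dynamics, v, gamma=0.9):
    """Apportion probability 1 to the action with the largest q(s, a)."""
    policy = {}
    for s in states:
        q = {a: sum(p * (r + gamma * v[s2]) for s2, r, p in dynamics[(s, a)])
             for a in actions}
        best_a = max(q, key=q.get)
        policy[s] = {a: (1.0 if a == best_a else 0.0) for a in actions}
    return policy

v_star = value_iteration(STATES, ACTIONS, dynamics)
pi_star = greedy_policy(STATES, ACTIONS, dynamics, v_star)
```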
