Table of Contents
Finite Markov Decision Process
RL: An Introduction link:
Sutton & Barto Book: Reinforcement Learning: An Introduction
Finite Markov Decision Process
Definition
The agent monitors the situation of the environment, for example via data flows or sensors (cameras, lidars, or others); this observation is called the state. We have to highlight that we presume the agent always knows enough information about the situation to make its decisions. So we have the state $S_t \in \mathcal{S}$ at time step $t$.
After the agent knows the current state, it has a finite set of actions to choose from ($A_t \in \mathcal{A}(s)$). After taking an action, the agent obtains the reward for this step ($R_{t+1} \in \mathcal{R}$) and runs into the next state ($S_{t+1}$); this transition is governed by the environment's dynamics ($p(s', r \mid s, a)$). The agent then continues to deal with each new state in the same way until the scenario (episode) ends.
The environment's dynamics are not decided by us, but which action to take depends on the agent's judgement. In every state $s$, we want to choose the actions that give us the most total reward, not just in the short run but also in the long run. Therefore, the policy of choosing actions in each state is the core of reinforcement learning. We use $\pi(a \mid s)$ to describe the probability of taking each action $a$ in the current state $s$.
Therefore, a finite Markov decision process is a process in which the agent knows the current state $s$, the actions available to choose from, and even the probability of each next state $s'$ and reward $r$ for each action ($p(s', r \mid s, a)$), and obtains a different expected return under each policy.
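To make the notation concrete, here is a minimal Python sketch of a finite MDP and a few steps of the agent-environment interaction. The two states, the actions, the rewards, and the policy below are hypothetical toy values chosen for illustration, not taken from the book:

```python
import random

# Hypothetical dynamics p(s', r | s, a):
# p[(s, a)] is a list of (next_state, reward, probability) triples.
p = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "move"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 1.0, 1.0)],
    ("s1", "move"): [("s0", 0.0, 1.0)],
}

# Hypothetical stochastic policy pi(a | s): probability of each action per state.
pi = {
    "s0": {"stay": 0.5, "move": 0.5},
    "s1": {"stay": 0.9, "move": 0.1},
}

def step(s):
    """One interaction: sample A_t from pi, then (R_{t+1}, S_{t+1}) from p."""
    actions, probs = zip(*pi[s].items())
    a = random.choices(actions, weights=probs)[0]
    transitions = p[(s, a)]
    s_next, r, _ = random.choices(transitions,
                                  weights=[w for _, _, w in transitions])[0]
    return a, r, s_next

s = "s0"
for t in range(5):
    a, r, s = step(s)
    print(f"t={t}: A_t={a}, R_(t+1)={r}, S_(t+1)={s}")
```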
Formula (Bellman equation)
Mathematically, we can calculate the value function $v_\pi(s)$, the expected return when starting from state $s$ and following policy $\pi$ thereafter. It satisfies the Bellman equation:

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_\pi(s') \bigr]$$

where $\gamma \in [0, 1]$ is the discount rate.
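One standard way to compute $v_\pi$ is iterative policy evaluation, which repeatedly applies the right-hand side of the Bellman equation as an update rule until the values stop changing. A minimal sketch, reusing the hypothetical `p` and `pi` dictionaries from the example above:

```python
def policy_evaluation(p, pi, states, gamma=0.9, theta=1e-8):
    """Sweep over states, applying the Bellman equation until convergence."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                pi[s][a] * sum(prob * (r + gamma * v[s2])
                               for s2, r, prob in p[(s, a)])
                for a in pi[s]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:  # values have (numerically) converged
            return v

print(policy_evaluation(p, pi, states=["s0", "s1"]))
```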
Consideration
For every scenario, we know the dynamics of the environment $p(s', r \mid s, a)$, the state set $\mathcal{S}$, and the corresponding action sets $\mathcal{A}(s)$. For every policy we set, we know $\pi(a \mid s)$. So we can obtain $N = |\mathcal{S}|$ linear equations in the $N$ unknowns $v_\pi(s)$, one per state, and solve them directly.
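In matrix form these $N$ equations read $v_\pi = r_\pi + \gamma P_\pi v_\pi$, so $(I - \gamma P_\pi)\,v_\pi = r_\pi$ can be solved in one shot. A minimal NumPy sketch, again built on the hypothetical `p` and `pi` from the earlier examples:

```python
import numpy as np

states = ["s0", "s1"]
gamma = 0.9
idx = {s: i for i, s in enumerate(states)}

# Build P_pi (state-to-state transition matrix under pi)
# and r_pi (expected one-step reward in each state).
P = np.zeros((len(states), len(states)))
r_pi = np.zeros(len(states))
for s in states:
    for a, pa in pi[s].items():
        for s2, r, prob in p[(s, a)]:
            P[idx[s], idx[s2]] += pa * prob
            r_pi[idx[s]] += pa * prob * r

# Solve the N linear equations (I - gamma * P_pi) v = r_pi at once.
v = np.linalg.solve(np.eye(len(states)) - gamma * P, r_pi)
print(dict(zip(states, v)))
```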
Limitation
- Often, we cannot know the dynamics of the environment.
- Often, as in backgammon, there are too many states, so we have no capacity to compute the problem in this way (by solving equations).
- The problems must have the Markov property, which means $S_{t+1}$ and $R_{t+1}$ depend only on the current state $s$ and action $a$. In other words, $s$ and $a$ determine the distribution over all possible $s'$ and $r$.
Optimal policies
Definition
For a policy $\pi$ and a policy $\pi'$, if for each state $s$ the inequality

$$v_\pi(s) \ge v_{\pi'}(s)$$

is fulfilled, then we can say $\pi$ is better than (or as good as) $\pi'$.
Therefore, there is always at least one policy that is better than or equal to all the others (and there may be more than one); any such policy is an optimal policy $\pi_*$. Meanwhile, the optimal value function $v_*$ also satisfies a Bellman equation in every state.
Bellman optimality equation
For $\pi_*$, whose value function is the maximum over the total policy set ($v_*(s) = \max_\pi v_\pi(s)$), the Bellman equation takes a special form with no reference to any particular policy:

$$v_*(s) = \max_{a \in \mathcal{A}(s)} q_*(s, a) = \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_*(s') \bigr]$$
For a specific case, the environment's dynamics are fixed; we can only change how $\pi(a \mid s)$ apportions probability among the actions. To maximize $v(s)$, we should apportion probability 1 to the action with the maximal $q(s, a)$. Therefore, the optimal policy is actually a greedy policy (with respect to $v_*$), without any exploration.
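A minimal sketch of value iteration, which uses the Bellman optimality equation as an update rule and then reads off the greedy (hence optimal) policy; it reuses the hypothetical `p` dictionary from the earlier examples:

```python
def value_iteration(p, states, actions, gamma=0.9, theta=1e-8):
    """Apply the Bellman optimality equation as an update until convergence."""
    def q(s, a, v):
        # One-step lookahead: expected return of taking a in s under current v.
        return sum(prob * (r + gamma * v[s2]) for s2, r, prob in p[(s, a)])

    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(q(s, a, v) for a in actions)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            break

    # The optimal policy is greedy: probability 1 on the argmax action.
    policy = {s: max(actions, key=lambda a: q(s, a, v)) for s in states}
    return v, policy

v_star, pi_star = value_iteration(p, ["s0", "s1"], ["stay", "move"])
print(v_star, pi_star)
```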