Understanding Markov Decision Processes

 Living near an airport for a year and getting used to the sound of airplanes passing overhead ─ Habituation
 Hearing loud thunder when at home alone at night and then becoming easily startled by bright flashes of light ─ Sensitization

5.6.3 Markov Decision Process

Reinforcement Learning is a type of Machine Learning. It allows machines and software agents to automatically determine the ideal behavior within a specific context in order to maximize their performance. Only simple reward feedback is required for the agent to learn its behavior; this is known as the reinforcement signal.

There are many different algorithms that tackle this issue. In fact, Reinforcement Learning is defined by a specific type of problem, and all its solutions are classed as Reinforcement Learning algorithms. In this problem, an agent must decide the best action to select based on its current state. When this step is repeated, the problem is known as a Markov Decision Process.
A Markov Decision Process (MDP) model contains:

 A set of possible world states S.
 A set of models.
 A set of possible actions A.
 A real-valued reward function R(s, a).
 A policy, the solution of the Markov Decision Process.
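These components can be sketched as plain Python data structures. This is only a minimal illustration; the state names, action names, and numbers below are made up for this example.

```python
# Minimal sketch of the MDP components above (illustrative names and values).
S = ["s0", "s1"]                       # set of possible world states
A = ["stay", "go"]                     # set of possible actions

# Transition model: P(S' | S, a), one distribution per (state, action) pair
P = {
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s0", "stay"): {"s0": 1.0},
    ("s1", "go"):   {"s0": 0.1, "s1": 0.9},
    ("s1", "stay"): {"s1": 1.0},
}

# Real-valued reward function R(s, a)
R = {("s0", "go"): -1.0, ("s0", "stay"): 0.0,
     ("s1", "go"): -1.0, ("s1", "stay"): 1.0}
```

Each distribution in the transition model sums to 1, since the agent must land in some next state.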

IT DEPT-R20-MACHINE LEARNING Page 124


State:

A State is a set of tokens that represent every state that the agent can be in.

Model:
A Model (sometimes called a Transition Model) gives an action's effect in a state. In particular, T(S, a, S') defines a transition T where being in state S and taking action 'a' takes us to state S' (S and S' may be the same). For stochastic actions (noisy, non-deterministic) we also define a probability P(S'|S, a), which represents the probability of reaching state S' if action 'a' is taken in state S. Note that the Markov property states that the effects of an action taken in a state depend only on that state and not on the prior history.

Actions
A is the set of all possible actions. A(s) defines the set of actions that can be taken in state S.

Reward
A Reward is a real-valued reward function. R(s) indicates the reward for simply being in
the state S. R(S,a) indicates the reward for being in a state S and taking an action ‘a’.
R(S,a,S’) indicates the reward for being in a state S, taking an action ‘a’ and ending up in a
state S’.
Policy
A Policy is a solution to the Markov Decision Process. A policy is a mapping from S to A; it indicates the action 'a' to be taken while in state S.
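A policy can be represented as a simple lookup from states to actions. The state and action names here are the made-up ones from the earlier sketch, not anything prescribed by the text.

```python
# A policy as a plain mapping from states to actions (illustrative names).
policy = {"s0": "go", "s1": "stay"}

def act(state):
    """Return the action the policy prescribes for `state`."""
    return policy[state]
```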

Let us take the example of a grid world:



An agent lives in the grid. The above example is a 3×4 grid. The grid has a START state (grid no 1,1). The purpose of the agent is to wander around the grid and finally reach the Blue Diamond (grid no 4,3). Under all circumstances, the agent should avoid the Fire grid (orange color, grid no 4,2). Grid no 2,2 is a blocked grid; it acts as a wall, so the agent cannot enter it.

The agent can take any one of these actions: UP, DOWN, LEFT, RIGHT.
Walls block the agent's path, i.e., if there is a wall in the direction the agent would have taken, the agent stays in the same place. So, for example, if the agent chooses LEFT in the START grid, it stays put in the START grid.
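The grid and its wall rule can be sketched as follows. Coordinates are (column, row) with (1, 1) at the bottom-left, matching the grid numbers in the text; the function and constant names are my own.

```python
# 3x4 grid world: (1,1) is START, (4,3) the diamond, (4,2) the fire,
# and (2,2) a blocked (wall) cell.
ACTIONS = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
WALLS = {(2, 2)}
COLS, ROWS = 4, 3

def step(state, action):
    """Deterministic move; bumping a wall or the border leaves the agent put."""
    dx, dy = ACTIONS[action]
    nxt = (state[0] + dx, state[1] + dy)
    if nxt in WALLS or not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS):
        return state                    # blocked: stay in place
    return nxt
```

For example, `step((1, 1), "LEFT")` returns `(1, 1)`: the agent stays put, exactly as described above.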

First aim: to find the shortest sequence getting from START to the Diamond. Two such sequences can be found:
 RIGHT RIGHT UP UP RIGHT
 UP UP RIGHT RIGHT RIGHT
Let us take the second one (UP UP RIGHT RIGHT RIGHT) for the subsequent discussion. The move is now noisy: 80% of the time the intended action works correctly, and 20% of the time the action the agent takes moves it at right angles to the intended direction. For example, if the agent chooses UP, the probability of going UP is 0.8, whereas the probability of going LEFT is 0.1 and the probability of going RIGHT is 0.1 (since LEFT and RIGHT are at right angles to UP).
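The 80/10/10 noise model can be written down directly. This is a small sketch; the table of perpendicular directions is just one way of encoding it.

```python
# Noisy moves: the intended direction succeeds with probability 0.8;
# each of the two perpendicular directions occurs with probability 0.1.
PERPENDICULAR = {
    "UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
    "LEFT": ("UP", "DOWN"),  "RIGHT": ("UP", "DOWN"),
}

def transition_probs(intended):
    """Return {actual_direction: probability} for a noisy move."""
    side_a, side_b = PERPENDICULAR[intended]
    return {intended: 0.8, side_a: 0.1, side_b: 0.1}
```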

The agent receives rewards at each time step:

 A small reward each step (it can be negative, in which case it can also be termed a punishment; in the above example, entering the Fire grid yields a reward of -1).
 Big rewards come at the end (good or bad).
 The goal is to maximize the sum of rewards.



5.6.4 Q-learning
Q-learning is a model-free, value-based, off-policy algorithm that will find the best series of
actions based on the agent's current state. The “Q” stands for quality. Quality represents how
valuable the action is in maximizing future rewards.

Model-based algorithms use transition and reward functions to estimate the optimal policy and create a model. In contrast, model-free algorithms learn the consequences of their actions through experience, without using transition and reward functions.

Value-based methods train a value function to learn which states are more valuable, and act accordingly. Policy-based methods, on the other hand, train the policy directly to learn which action to take in a given state.

An off-policy algorithm evaluates and updates a policy that differs from the policy used to take actions. Conversely, an on-policy algorithm evaluates and improves the same policy that is used to take actions.

Before we jump into how Q-learning works, we need to learn a few useful terms to understand Q-learning's fundamentals.

 State (s): the current position of the agent in the environment.
 Action (a): a step taken by the agent in a particular state.
 Rewards: for every action, the agent receives a reward or a penalty.
 Episodes: an episode ends when the agent reaches a terminal state, either because it has achieved the goal or because it has failed; it can then take no new action.
 Q(St+1, a): the expected optimal Q-value of taking the action in a particular state.
 Q(St, At): the current estimate of Q(St+1, a).
 Q-Table: a table the agent maintains with one Q-value per pair of state and action.
 Temporal Difference (TD): used to estimate the expected value of Q(St+1, a) using the current state and action and the previous state and action.



We will learn in detail how Q-learning works by using the example of a frozen lake. In this environment, the agent must cross the frozen lake from the start to the goal without falling into the holes. The best strategy is to reach the goal by taking the shortest path.

Q-Table

The agent will use a Q-table to take the best possible action based on the expected reward for
each state in the environment. In simple words, a Q-table is a data structure of sets of actions
and states, and we use the Q-learning algorithm to update the values in the table.
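One common way to hold such a table is a 2-D array with one row per state and one column per action. The sizes below are assumptions for a 4×4 frozen lake with four moves, not values given in the text.

```python
import numpy as np

# Q-table sketch: rows are states, columns are actions, all zeros initially.
n_states, n_actions = 16, 4            # assumed 4x4 lake, 4 moves
q_table = np.zeros((n_states, n_actions))

def best_action(state):
    """Greedy lookup: the action with the highest Q-value for `state`."""
    return int(np.argmax(q_table[state]))
```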

Q-Function

The Q-function takes the state (s) and action (a) as input and uses the Bellman equation to compute the state-action value.
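The text does not reproduce the equation itself; the standard Q-learning form of the Bellman update is Q(s, a) ← Q(s, a) + α[r + γ·max Q(s', a') − Q(s, a)], which can be sketched as follows. The values of α and γ and the four-action assumption are illustrative.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.99               # learning rate and discount (assumed)
N_ACTIONS = 4

def q_update(q, s, a, r, s_next):
    """One Q-learning update: Q(s,a) += alpha * (TD target - Q(s,a))."""
    td_target = r + GAMMA * max(q[(s_next, b)] for b in range(N_ACTIONS))
    q[(s, a)] += ALPHA * (td_target - q[(s, a)])
    return q[(s, a)]

q = defaultdict(float)                 # unseen (state, action) pairs default to 0
```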



Q-learning algorithm

Initialize Q-Table

We will first initialize the Q-table, building it with columns based on the number of actions and rows based on the number of states.

In our example, the character can move up, down, left, and right. We have four possible actions and four states (start, idle, wrong path, and end). You can also treat "wrong path" as falling into the hole. We will initialize the Q-table with all values set to 0.
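With the simplification above (four states, four actions), initializing the table looks like this; the labels follow the text.

```python
import numpy as np

# Rows = states, columns = actions, every entry starts at zero.
states = ["start", "idle", "wrong path", "end"]
actions = ["up", "down", "left", "right"]
Q = np.zeros((len(states), len(actions)))
```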



Choose an Action

The second step is quite simple. At the start, the agent chooses a random action (down or right); on subsequent runs, it uses the updated Q-table to select the action.

Perform an Action

Choosing and performing an action are repeated multiple times until the training loop stops. The first action and state are selected using the Q-table; in our case, all its values are zero.

Then, the agent will move down and update the Q-Table using the Bellman equation. With
every move, we will be updating values in the Q-Table and also using it for determining the
best course of action.

Initially, the agent is in exploration mode and chooses a random action to explore the environment. The Epsilon-Greedy Strategy is a simple method to balance exploration and exploitation: epsilon is the probability of choosing to explore, and the agent exploits otherwise.

At the start, the epsilon rate is high, meaning the agent is in exploration mode. As it explores the environment, epsilon decreases, and the agent starts to exploit it. With every iteration of exploration, the agent becomes more confident in its estimates of the Q-values.
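A minimal epsilon-greedy selector with decay might look like this; the decay rate and floor are assumptions, not values from the text.

```python
import random

def epsilon_greedy(q_row, epsilon):
    """Explore with probability epsilon; otherwise exploit the best Q-value."""
    if random.random() < epsilon:
        return random.randrange(len(q_row))               # explore: random action
    return max(range(len(q_row)), key=lambda a: q_row[a]) # exploit: greedy action

def decay(epsilon, rate=0.995, floor=0.05):
    """Shrink epsilon each episode, but never below a small floor."""
    return max(floor, epsilon * rate)
```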



In the frozen lake example, the agent is unaware of the environment, so it takes a random action (move down) to start. As we can see in the above image, the Q-table is then updated using the Bellman equation.

Measuring the Rewards

After taking the action, we will measure the outcome and the reward.

 The reward for reaching the goal is +1

 The reward for taking the wrong path (falling into the hole) is 0

 The reward for staying idle or moving on the frozen surface is also 0.

Update Q-Table

We will update the function Q(St, At) using the update equation. It uses the previous estimate of the Q-value, the learning rate, and the Temporal Difference error. The Temporal Difference error is calculated from the immediate reward, the discounted maximum expected future reward, and the former Q-value estimate.
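Written out, the two pieces described above might look like this (the function names are mine, and α and γ are assumed values):

```python
def td_error(reward, max_q_next, q_current, gamma=0.99):
    """TD error = immediate reward + discounted max future Q - current estimate."""
    return reward + gamma * max_q_next - q_current

def update_estimate(q_current, td, alpha=0.1):
    """New estimate = old estimate + learning rate * TD error."""
    return q_current + alpha * td
```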

The process is repeated multiple times until the Q-table stops changing significantly and the Q-value estimates converge.



At the start, the agent explores the environment to update the Q-table. When the Q-table is ready, the agent starts exploiting it and making better decisions.

In the case of a frozen lake, the agent will learn to take the shortest path to reach the goal and
avoid jumping into the holes.
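Putting the pieces together, a complete tabular Q-learning loop for a small hand-rolled frozen lake might look like this. The 4×4 layout, the deterministic moves, and all hyperparameters are assumptions for illustration, not details from the text.

```python
import random

random.seed(0)

# 4x4 frozen lake: S = start, F = frozen, H = hole, G = goal (assumed layout).
LAKE = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
N = 4
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # up, down, left, right

def step(state, action):
    """Deterministic move clipped to the grid; returns (next_state, reward, done)."""
    row, col = divmod(state, N)
    dr, dc = MOVES[action]
    row = min(max(row + dr, 0), N - 1)
    col = min(max(col + dc, 0), N - 1)
    nxt = row * N + col
    cell = LAKE[row][col]
    return nxt, (1.0 if cell == "G" else 0.0), cell in "GH"

def train(episodes=3000, alpha=0.1, gamma=0.95, eps=1.0):
    q = [[0.0] * 4 for _ in range(N * N)]
    for _ in range(episodes):
        s = 0
        for _ in range(100):                     # cap episode length
            if random.random() < eps:
                a = random.randrange(4)          # explore
            else:
                a = max(range(4), key=lambda x: q[s][x])  # exploit
            s2, r, done = step(s, a)
            # Bellman / TD update
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
            if done:
                break
        eps = max(0.05, eps * 0.999)             # decay exploration
    return q

q = train()
```

After training, the start state's Q-values point toward the goal while avoiding the holes. The ready-made version of this environment is Gymnasium's FrozenLake-v1, which additionally supports slippery (stochastic) moves.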

