Introduction and Overview
• Procedures and Process.
• Markov Decision Processes.
Procedures and Process
[Figure: MDP diagram showing states, actions, and transitions; policy visualization showing action probabilities]
• In reinforcement learning, the interactions between the agent and the environment are often described by an infinite-horizon, discounted Markov Decision Process (MDP) M = (S, A, P, r, γ, µ), specified by:
• A state space S, which may be finite or infinite. For mathematical convenience, we will assume that S is finite or countably infinite.
• An action space A, which also may be discrete or infinite. For mathematical convenience, we will assume that A is finite.
• A transition function P : S × A → ∆(S), where ∆(S) is the space of probability distributions over S (i.e., the probability simplex). P(s′ | s, a) is the probability of transitioning into state s′ upon taking action a in state s. We use P_{s,a} to denote the vector P(· | s, a).
• A reward function r : S × A → [0, 1]. r(s, a) is the immediate reward associated with taking action a in state s. More generally, r(s, a) could be a random variable (where the distribution depends on s, a). While we largely focus on the case where r(s, a) is deterministic, the extension to methods with stochastic rewards is often straightforward.
• A discount factor γ ∈ [0, 1), which defines a horizon for the problem. (A toy instantiation of these components is sketched below.)
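To make the components above concrete, here is a minimal sketch of a two-state, two-action MDP in Python. The state names, transition probabilities, and rewards are invented for illustration and are not part of the original notes; they only respect the constraints stated above: each P(· | s, a) and µ lie on the probability simplex, r maps into [0, 1], and γ ∈ [0, 1) (which, roughly, makes the effective horizon on the order of 1/(1 − γ) steps).

```python
import numpy as np

# A toy MDP M = (S, A, P, r, gamma, mu); all numbers are illustrative.
S = ["s0", "s1"]           # state space (finite here)
A = ["left", "right"]      # action space (finite)

# Transition function P: for each (s, a), a distribution over next states.
# Each vector P[(s, a)] lies on the probability simplex over S.
P = {
    ("s0", "left"):  np.array([0.9, 0.1]),
    ("s0", "right"): np.array([0.2, 0.8]),
    ("s1", "left"):  np.array([0.5, 0.5]),
    ("s1", "right"): np.array([0.0, 1.0]),
}

# Reward function r: S x A -> [0, 1], deterministic in this example.
r = {
    ("s0", "left"):  0.0,
    ("s0", "right"): 0.3,
    ("s1", "left"):  0.1,
    ("s1", "right"): 1.0,
}

gamma = 0.95                   # discount factor in [0, 1)
mu = np.array([1.0, 0.0])      # initial state distribution over S

# Sanity checks: mu and every P(. | s, a) must sum to 1.
assert np.isclose(mu.sum(), 1.0)
assert all(np.isclose(p.sum(), 1.0) for p in P.values())
```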
Important Information
• In a given MDP M = (S, A, P, r, γ, µ), the agent interacts with the environment according to the following protocol: the agent starts at some state s₀ ∼ µ; at each time step t = 0, 1, 2, ..., the agent takes an action aₜ ∈ A, obtains the immediate reward rₜ = r(sₜ, aₜ), and observes the next state sₜ₊₁ sampled according to sₜ₊₁ ∼ P(·|sₜ, aₜ).
• The interaction record at time t, τₜ = (s₀, a₀, r₀, s₁, ..., sₜ), is called a trajectory, which includes the observed state at time t.
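As an illustration only (not part of the original notes), the rollout below follows this protocol step by step using the toy S, A, P, r, gamma, mu from the sketch above: it samples s₀ ∼ µ, picks actions uniformly at random (how actions are chosen is exactly what a policy, defined next, specifies), records rₜ = r(sₜ, aₜ) and sₜ₊₁ ∼ P(·|sₜ, aₜ), and returns the resulting trajectory together with the discounted sum of rewards it collected.

```python
rng = np.random.default_rng(0)

def rollout(horizon=5):
    """Follow the interaction protocol for `horizon` steps.

    Returns the trajectory (s_0, a_0, r_0, ..., s_T) and the discounted
    sum of rewards sum_t gamma^t * r_t collected along the way.
    """
    s = S[rng.choice(len(S), p=mu)]                    # s_0 ~ mu
    trajectory, discounted_return = [], 0.0
    for t in range(horizon):
        a = A[rng.choice(len(A))]                      # a_t in A (uniform here)
        reward = r[(s, a)]                             # r_t = r(s_t, a_t)
        s_next = S[rng.choice(len(S), p=P[(s, a)])]    # s_{t+1} ~ P(. | s_t, a_t)
        trajectory += [s, a, reward]
        discounted_return += gamma ** t * reward
        s = s_next
    trajectory.append(s)   # the trajectory includes the observed state at time T
    return trajectory, discounted_return

tau, G = rollout()
```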
• In the most general setting, a policy specifies a decision-making strategy in which the agent chooses actions adaptively based on the history of observations; precisely, a policy is a (possibly randomized) mapping from a trajectory to an action, i.e. π : H → ∆(A), where H is the set of all possible trajectories (of all lengths) and ∆(A) is the space of probability distributions over A.
• A stationary policy π : S → ∆(A) specifies a decision-making strategy in which the agent chooses actions based only on the current state, i.e. aₜ ∼ π(·|sₜ).
[Figure: value function heatmap showing state values]
• A deterministic, stationary policy is of the form π : S → A. (Both forms are sketched below.)
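A minimal sketch (again, an illustration layered on the toy MDP above rather than anything from the notes) of both policy classes: a stationary stochastic policy π : S → ∆(A) stored as a table of action distributions, and a deterministic stationary policy π : S → A stored as a plain state-to-action map. Sampling aₜ ∼ π(·|sₜ) from the stochastic table is the line that would replace the uniform action choice in the rollout sketch above.

```python
rng_pi = np.random.default_rng(1)

# Stationary stochastic policy pi: S -> Delta(A); each row is a distribution over A.
pi_stochastic = {
    "s0": np.array([0.7, 0.3]),   # pi(. | s0) over ["left", "right"]
    "s1": np.array([0.1, 0.9]),   # pi(. | s1)
}

# Deterministic stationary policy pi: S -> A.
pi_deterministic = {"s0": "right", "s1": "right"}

def sample_action(s):
    """Draw a_t ~ pi(. | s_t) from the stochastic policy."""
    return A[rng_pi.choice(len(A), p=pi_stochastic[s])]

# A history-dependent policy pi: H -> Delta(A) would instead condition on the
# whole trajectory (s_0, a_0, r_0, ..., s_t), not just the current state s_t.
```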
Summary and Key Takeaways
• Markov Decision Processes.
• 1. Discounted (Infinite-Horizon) Markov Decision Processes
• 1.1 The objective, policies, and values