Generated Presentation 20250910 233933

Introduction and Overview

• Procedures and Process.
• Markov Decision Processes.
• In reinforcement learning, the interactions between the agent and the environment are often described by an infinite-horizon, discounted Markov Decision Process (MDP) M = (S, A, P, r, γ, µ), specified by:
• A state space S, which may be finite or infinite. For mathematical convenience, we will assume that S is finite or countably infinite.

📊 Convergence analysis showing different learning rates
📊 Mathematical function visualization
Procedures and Process
• Procedures and Process.
• Markov Decision Processes.

📊 MDP diagram showing states, actions, and transitions
📊 Policy visualization showing action probabilities
Procedures and Process
• Procedures and Process.
• In reinforcement learning, the interactions between the agent and the environment are often described by an infinite-horizon, discounted Markov Decision Process (MDP) M = (S, A, P, r, γ, µ), specified by:
• A state space S, which may be finite or infinite. For mathematical convenience, we will assume that S is finite or countably infinite.
• An action space A, which also may be discrete or infinite. For mathematical convenience, we will assume that A is finite.
• A transition function P : S × A → ∆(S), where ∆(S) is the space of probability distributions over S (i.e., the probability simplex). P(s′ | s, a) is the probability of transitioning into state s′ upon taking action a in state s. We use P_{s,a} to denote the vector P(· | s, a).
• A reward function r : S × A → [0, 1]. r(s, a) is the immediate reward associated with taking action a in state s. More generally, r(s, a) could be a random variable (where the distribution depends on s, a). While we largely focus on the case where r(s, a) is deterministic, the extension to methods with stochastic rewards is often straightforward.
• A discount factor γ ∈ [0, 1), which defines a horizon for the problem (a minimal code sketch of these components follows this list).
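To make these components concrete, here is a minimal Python sketch of a finite tabular MDP; the class name TabularMDP and the two-state example values are illustrative assumptions, not content from the slides.

```python
import numpy as np

class TabularMDP:
    """Finite MDP M = (S, A, P, r, gamma, mu) stored as NumPy arrays."""

    def __init__(self, P, r, gamma, mu):
        self.P = P          # P[s, a, s'] = probability of reaching s' after action a in state s
        self.r = r          # r[s, a] = immediate reward in [0, 1]
        self.gamma = gamma  # discount factor in [0, 1)
        self.mu = mu        # mu[s] = probability that the initial state s0 equals s
        self.num_states, self.num_actions = r.shape

# Hypothetical 2-state, 2-action MDP used only for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # shape (S, A, S); each P[s, a] sums to 1
r = np.array([[0.0, 1.0],
              [0.5, 0.2]])                 # shape (S, A)
mdp = TabularMDP(P, r, gamma=0.9, mu=np.array([1.0, 0.0]))
```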
Important Information
• Important Information.
• In a given MDP M = (S, A, P, r, γ, µ), the agent interacts with the environment according to the following protocol: the agent starts at some state s₀ ∼ µ; at each time step t = 0, 1, 2, ..., the agent takes an action aₜ ∈ A, obtains the immediate reward rₜ = r(sₜ, aₜ), and observes the next state sₜ₊₁ sampled according to sₜ₊₁ ∼ P(· | sₜ, aₜ).
• The interaction record at time t, τₜ = (s₀, a₀, r₀, s₁, a₁, r₁, ..., sₜ), is called a trajectory, which includes the observed state at time t. This protocol is simulated in the sketch after this list.
• In the most general setting, a policy specifies a decision-making strategy in which the agent chooses actions adaptively based on the history of observations; precisely, a policy is a (possibly randomized) mapping from a trajectory to an action, i.e., π : H → ∆(A), where H is the set of all possible trajectories (of all lengths) and ∆(A) is the space of probability distributions over A.
• A stationary policy π : S → ∆(A) specifies a decision-making strategy in which the agent chooses actions based only on the current state, i.e., aₜ ∼ π(· | sₜ).
• A deterministic, stationary policy is of the form π : S → A. Both forms appear in the sketch after this list.

📊 Value function heatmap showing state values
📊 Mathematical function visualization
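The interaction protocol and the two stationary policy forms above can be made concrete with a short Python sketch; it reuses the hypothetical TabularMDP object from the earlier snippet, and the specific probability values are assumptions for illustration only.

```python
import numpy as np

# Stationary stochastic policy pi : S -> Delta(A), one probability row per state.
pi = np.array([[0.7, 0.3],
               [0.1, 0.9]])      # pi[s, a] = probability of choosing action a in state s

# Deterministic, stationary policy pi : S -> A, one fixed action per state.
pi_det = np.array([1, 0])        # pi_det[s] = action taken in state s

def rollout(mdp, pi, horizon, seed=0):
    """Follow the protocol: s0 ~ mu; at each t, a_t ~ pi(.|s_t), r_t = r(s_t, a_t), s_{t+1} ~ P(.|s_t, a_t)."""
    rng = np.random.default_rng(seed)
    s = rng.choice(mdp.num_states, p=mdp.mu)               # s0 ~ mu
    trajectory = []
    for t in range(horizon):
        a = rng.choice(mdp.num_actions, p=pi[s])           # a_t ~ pi(. | s_t)
        r_t = mdp.r[s, a]                                   # immediate reward r(s_t, a_t)
        s_next = rng.choice(mdp.num_states, p=mdp.P[s, a])  # s_{t+1} ~ P(. | s_t, a_t)
        trajectory.append((s, a, r_t))
        s = s_next
    trajectory.append((s,))                                 # observed state at the final time step
    return trajectory

tau = rollout(mdp, pi, horizon=5)   # [(s0, a0, r0), (s1, a1, r1), ..., (s5,)]
```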
Summary and Key Takeaways
• Markov Decision Processes.
• 1. Discounted (Infinite-Horizon) Markov Decision Processes
• In reinforcement learning, the interactions between the agent and the environment are often described by an infinite-horizon, discounted Markov Decision Process (MDP) M = (S, A, P, r, γ, µ), specified by:
• A state space S, which may be finite or infinite.

• 1.1 The objective, policies, and values
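As a pointer to the objective named in that subsection: a policy is typically evaluated by the discounted sum of rewards it collects along a trajectory. The helper below is an assumed illustration of that quantity, not notation taken from the slides.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma**t * r_t for an observed reward sequence."""
    return sum(gamma ** t * r_t for t, r_t in enumerate(rewards))

# Example: three observed rewards with gamma = 0.9.
discounted_return([1.0, 0.5, 0.2], gamma=0.9)  # 1.0 + 0.45 + 0.162 = 1.612
```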

📊 Convergence analysis showing different learning rates
📊 Mathematical function visualization
