Assignment 2 - Policy Gradients
1 Introduction
The goal of this assignment is to experiment with policy gradient and its variants, including variance reduction
tricks such as reward-to-go and neural network baselines. The starter code can be found at
https://github.com/berkeleydeeprlcourse/homework_fall2020/tree/master/hw2
2 Review
2.1 Policy gradient
Recall that the reinforcement learning objective is to learn a θ∗ that maximizes the objective function:

    J(θ) = E_{τ∼πθ(τ)}[r(τ)]                                                    (1)

and

    r(τ) = r(s0, a0, ..., sT−1, aT−1) = Σ_{t=0}^{T−1} r(st, at).
The policy gradient approach is to directly take the gradient of this objective:
    ∇θ J(θ) = ∇θ ∫ πθ(τ) r(τ) dτ                                                (2)
            = ∫ πθ(τ) ∇θ log πθ(τ) r(τ) dτ.                                     (3)
In practice, the expectation over trajectories τ can be approximated from a batch of N sampled trajectories:
    ∇θ J(θ) ≈ (1/N) Σ_{i=1}^{N} ∇θ log πθ(τi) r(τi)                             (6)
            = (1/N) Σ_{i=1}^{N} (Σ_{t=0}^{T−1} ∇θ log πθ(ait|sit)) (Σ_{t=0}^{T−1} r(sit, ait)).   (7)
Here we see that the policy πθ is a probability distribution over the action space, conditioned on the state. In
the agent-environment loop, the agent samples an action at from πθ (·|st ) and the environment responds with
a reward r(st , at ).
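As a concrete (toy) illustration of Equation 7, the sketch below estimates the gradient for a stateless softmax policy, for which ∇θ log πθ(a) has the closed form onehot(a) − π. The helper names are illustrative and not part of the starter code:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    exps = np.exp(z)
    return exps / exps.sum()

def grad_log_softmax(logits, action):
    # For a softmax policy, the score function ∇_logits log π(action)
    # is onehot(action) − π.
    probs = softmax(logits)
    onehot = np.zeros_like(probs)
    onehot[action] = 1.0
    return onehot - probs

def pg_estimate(logits, trajectories):
    """Monte Carlo policy gradient (Equation 7): average over trajectories
    of (sum of score functions) times the trajectory's total reward."""
    grad = np.zeros_like(logits)
    for traj in trajectories:              # traj is a list of (action, reward)
        total_reward = sum(r for _, r in traj)
        for action, _ in traj:
            grad += grad_log_softmax(logits, action) * total_reward
    return grad / len(trajectories)
```

Trajectories that earn higher total reward push the policy toward the actions they took, which is exactly the weighting in Equation 7.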
2.2.1 Reward-to-go
One way to reduce the variance of this estimator is to exploit causality: the policy at time step t cannot affect
rewards received before t, so the sum of rewards multiplying ∇θ log πθ(ait|sit) should not include the rewards
achieved prior to the time step at which the policy is being queried. This sum of rewards is a sample estimate
of the Q function, and is referred to as the "reward-to-go":
    ∇θ J(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{T−1} ∇θ log πθ(ait|sit) (Σ_{t'=t}^{T−1} r(sit', ait')).   (8)
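The inner sum of Equation 8 can be computed for every t in a single backward pass over the reward sequence. A minimal sketch (the function name is illustrative, not the starter code's API):

```python
def reward_to_go(rewards):
    """rtg[t] = sum of rewards[t], rewards[t+1], ..., rewards[T-1]
    (the inner sum of Equation 8, undiscounted)."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg
```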
2.2.2 Discounting
Multiplying the rewards by a discount factor γ can be interpreted as encouraging the agent to focus more on
rewards that are closer in time and less on rewards that are further in the future. This can also be thought of
as a means of reducing variance, because futures that are further away carry more variance. We saw in lecture
that the discount factor can be incorporated in two ways, as shown below.
The first way applies the discount to the rewards of the full trajectory:

    ∇θ J(θ) ≈ (1/N) Σ_{i=1}^{N} (Σ_{t=0}^{T−1} ∇θ log πθ(ait|sit)) (Σ_{t'=0}^{T−1} γ^{t'} r(sit', ait'))   (9)

The second way applies the discount to the rewards from time step t onward, i.e., to the reward-to-go:

    ∇θ J(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{T−1} ∇θ log πθ(ait|sit) (Σ_{t'=t}^{T−1} γ^{t'−t} r(sit', ait'))   (10)
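Both discounting schemes reduce to a short loop over a trajectory's rewards. A sketch with illustrative names (Case 1 weights every reward by γ raised to its absolute time step and gives every t the same scalar; Case 2 restarts the discounting at the current step):

```python
def discounted_return(rewards, gamma):
    """Case 1: every time step is weighted by the same full-trajectory
    discounted return, sum over t' of gamma**t' * rewards[t']."""
    total = sum(gamma ** t * r for t, r in enumerate(rewards))
    return [total] * len(rewards)

def discounted_reward_to_go(rewards, gamma):
    """Case 2: rtg[t] = sum over t' >= t of gamma**(t'-t) * rewards[t'],
    computed with one backward recursion."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg
```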
2.2.3 Baseline
Another variance reduction method is to subtract a baseline b (a constant with respect to τ) from the sum of
rewards:

    ∇θ J(θ) ≈ (1/N) Σ_{i=1}^{N} ∇θ log πθ(τi) (r(τi) − b).                      (11)
In this assignment, we will implement a value function Vφπ which acts as a state-dependent baseline. This
value function will be trained to approximate the sum of future rewards starting from a particular state:
    Vφπ(st) ≈ Σ_{t'=t}^{T−1} Eπθ[r(st', at') | st],                             (12)
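Once Vφπ is trained, the quantity that replaces the raw return in the gradient is the return minus the predicted value, commonly standardized for a better-conditioned update. A minimal numpy sketch (hypothetical names, not the starter code's exact API):

```python
import numpy as np

def estimate_advantage(returns, baseline_predictions, normalize=True):
    """Advantage = reward-to-go minus the value baseline V_phi(s_t)."""
    adv = (np.asarray(returns, dtype=np.float64)
           - np.asarray(baseline_predictions, dtype=np.float64))
    if normalize:
        # standardize to mean 0, std 1; the small epsilon guards
        # against division by zero when all advantages are equal
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)
    return adv
```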
3 Overview of Implementation
3.1 Files
To implement policy gradients, we will be building up the code that we started in homework 1. All files needed
to run your code are in the hw2 folder, but there will be some blanks you will fill with your solutions from
homework 1. These locations are marked with # TODO: get this from hw1 and are found in the following
files:
• infrastructure/rl_trainer.py
• infrastructure/utils.py
• policies/MLP_policy.py
After bringing in the required components from the previous homework, you can begin work on the new policy
gradient code. These placeholders are marked with TODO, located in the following files:
• agents/pg_agent.py
• policies/MLP_policy.py
The script to run the experiments is found in scripts/run_hw2.py (for the local option) or scripts/run_hw2.ipynb
(for the Colab option).
3.2 Overview
As in the previous homework, the main training loop is implemented in
infrastructure/rl_trainer.py.
The policy gradient algorithm uses the following 3 steps:
1. Sample trajectories by generating rollouts under your current policy.
2. Estimate returns and compute advantages. This is executed in the train function of pg_agent.py.
3. Train/Update parameters. The computational graph for the policy and the baseline, as well as the
update functions, are implemented in policies/MLP_policy.py.
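To make the three steps concrete, here is a self-contained toy version of the loop, using a two-armed bandit as the "environment" and two logits as the entire "policy." This is a sketch of the structure only, not the starter code:

```python
import math
import random

def softmax2(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def train_pg(n_iter=3, batch_size=4, seed=0):
    """Toy end-to-end version of the 3-step policy gradient loop."""
    random.seed(seed)
    logits = [0.0, 0.0]  # the entire "policy"
    for _ in range(n_iter):
        # 1. Sample "trajectories" (single-step episodes) under the current policy
        probs = softmax2(logits)
        acts = [0 if random.random() < probs[0] else 1 for _ in range(batch_size)]
        rews = [1.0 if a == 0 else 0.0 for a in acts]  # action 0 is better
        # 2. Estimate returns and advantages (mean reward as a constant baseline)
        baseline = sum(rews) / len(rews)
        advs = [r - baseline for r in rews]
        # 3. Update parameters with the score-function (REINFORCE) gradient
        probs = softmax2(logits)
        grad = [0.0, 0.0]
        for a, adv in zip(acts, advs):
            for k in range(2):
                grad[k] += adv * ((1.0 if k == a else 0.0) - probs[k])
        logits = [l + g / batch_size for l, g in zip(logits, grad)]
    return softmax2(logits)
```

After a few iterations the policy shifts probability toward the higher-reward action, mirroring on a toy scale what the starter code does across rl_trainer.py, pg_agent.py, and MLP_policy.py.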
The second (“Case 2”) uses the “reward-to-go” formulation from Equation 10:
    r(τi) = Σ_{t'=t}^{T−1} γ^{t'−t} r(sit', ait').                              (15)
Note that these differ only by the starting point of the summation.
Implement these return estimators as well as the remaining sections marked TODO in the code. For the small-
scale experiments, you may skip those sections that are run only if nn_baseline is True; we will return to
baselines in Section 6. (These sections are in MLPPolicyPG:update and PGAgent:estimate_advantage.)
Berkeley CS 285 Deep Reinforcement Learning, Decision Making, and Control Fall 2020
5 Small-Scale Experiments
After you have implemented all non-baseline code from Section 4, you will run two small-scale experiments to
get a feel for how different settings impact the performance of policy gradient methods.
Experiment 1 (CartPole). Run multiple experiments with the PG algorithm on the discrete CartPole-v0
environment, using the following commands:
python cs285/scripts/run_hw2.py --env_name CartPole-v0 -n 100 -b 1000 \
-dsa --exp_name q1_sb_no_rtg_dsa
• Provide the exact command line configurations (or #@params settings in Colab) you used to run your
experiments, including any parameters changed from their defaults.
What to Expect:
• The best configuration of CartPole in both the large and small batch cases should converge to a maximum
score of 200.
Experiment 2. In this experiment, your task is to find the smallest batch size b* and largest learning rate r*
that get to optimum (maximum score of 1000) in less than 100 iterations. The policy performance may fluctuate
around 1000; this is fine. The precision of b* and r* need only be one significant digit.
Deliverables:
• Given the b* and r* you found, provide a learning curve where the policy gets to optimum (maximum
score of 1000) in less than 100 iterations. (This may be for a single random seed, or averaged over
multiple.)
• Provide the exact command line configurations you used to run your experiments.
Experiment 3 (LunarLander). You will now use your policy gradient implementation to learn a controller
for LunarLanderContinuous-v2. The purpose of this problem is to test and help you debug your baseline
implementation from Section 6.
Run the following command:
python cs285/scripts/run_hw2.py \
--env_name LunarLanderContinuous-v2 --ep_len 1000 \
--discount 0.99 -n 100 -l 2 -s 64 -b 40000 -lr 0.005 \
--reward_to_go --nn_baseline --exp_name q3_b40000_r0.005
Deliverables:
• Plot a learning curve for the above command. You should expect to achieve an average return of around
180 by the end of training.
Experiment 4 (HalfCheetah). You will be using your policy gradient implementation to learn a controller
for the HalfCheetah-v2 benchmark environment with an episode length of 150. This is shorter than the default
episode length (1000), which speeds up training significantly. Search over batch sizes b ∈ [10000, 30000, 50000]
and learning rates r ∈ [0.005, 0.01, 0.02] to replace <b> and <r> below.
python cs285/scripts/run_hw2.py --env_name HalfCheetah-v2 --ep_len 150 \
--discount 0.95 -n 100 -l 2 -s 32 -b <b> -lr <r> -rtg --nn_baseline \
--exp_name q4_search_b<b>_lr<r>_rtg_nnbaseline
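One way to launch the 3×3 grid is a plain shell loop. This sketch only prints the nine commands (remove the echo, or pipe the output to bash, to actually run them):

```shell
for b in 10000 30000 50000; do
  for r in 0.005 0.01 0.02; do
    echo python cs285/scripts/run_hw2.py --env_name HalfCheetah-v2 --ep_len 150 \
      --discount 0.95 -n 100 -l 2 -s 32 -b $b -lr $r -rtg --nn_baseline \
      --exp_name q4_search_b${b}_lr${r}_rtg_nnbaseline
  done
done
```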
Deliverables:
• Provide a single plot with the learning curves for the HalfCheetah experiments that you tried. Describe
in words how the batch size and learning rate affected task performance.
Once you’ve found optimal values b* and r*, use them to run the following commands (replace the terms in
angle brackets):
python cs285/scripts/run_hw2.py --env_name HalfCheetah-v2 --ep_len 150 \
--discount 0.95 -n 100 -l 2 -s 32 -b <b*> -lr <r*> \
--exp_name q4_b<b*>_r<r*>
Run this command four times in total: as-is, with -rtg added, with --nn_baseline added, and with both flags
added, adjusting --exp_name to match in each case (e.g. q4_b<b*>_r<r*>_rtg_nnbaseline for the last run).
Deliverables: Provide a single plot with the learning curves for these four runs. The run with both reward-
to-go and the baseline should achieve an average score close to 200.
8 Bonus!
Choose any (or all) of the following:
• A serious bottleneck in the learning, for more complex environments, is the sample collection time. In
infrastructure/rl_trainer.py, we only collect trajectories in a single thread, but this process can be
fully parallelized across threads to get a useful speedup. Implement the parallelization and report on
the difference in training time.
• Implement GAE-λ for advantage estimation.[1] Run experiments in a MuJoCo gym environment to
explore whether this speeds up training. (Walker2d-v2 may be good for this.)
• In PG, we collect a batch of data, estimate a single gradient, and then discard the data and move on.
Can we potentially accelerate PG by taking multiple gradient descent steps with the same batch of data?
Explore this option and report on your results. Set up a fair comparison between single-step PG and
multi-step PG on at least one MuJoCo gym environment.
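For the GAE-λ bonus, the estimator from the linked paper computes TD residuals δt = rt + γV(st+1) − V(st) and accumulates At = δt + γλAt+1 backward in time. A sketch with illustrative names (it assumes a terminal bootstrap value has been appended to the value array):

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized Advantage Estimation for one trajectory.
    `values` has length T+1: V(s_0) .. V(s_T), where V(s_T) is the
    terminal bootstrap (use 0.0 if the episode ended)."""
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        # one-step TD residual
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # exponentially weighted sum of future residuals
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

λ = 0 recovers the one-step TD advantage, while λ = 1 recovers the discounted Monte Carlo return minus the baseline.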
9 Submission
[1] https://arxiv.org/abs/1506.02438
9.3 Turning it in
Turn in your assignment by the deadline on Gradescope. Upload the zip file with your code to HW2 Code,
and upload the PDF of your report to HW2.