Assignment 2 - Policy Gradients
1 Introduction
The goal of this assignment is to experiment with policy gradient and its variants, including variance reduction
tricks such as reward-to-go and neural network baselines. The starter code can be found at
https://github.com/berkeleydeeprlcourse/homework_fall2020/tree/master/hw2
2 Review
2.1 Policy gradient
Recall that the reinforcement learning objective is to learn a θ∗ that maximizes the objective function:

    J(θ) = E_{τ∼πθ(τ)}[r(τ)]                                                    (1)

and

    r(τ) = r(s0, a0, ..., sT−1, aT−1) = Σ_{t=0}^{T−1} r(st, at).
The policy gradient approach is to directly take the gradient of this objective:
    ∇θ J(θ) = ∇θ ∫ πθ(τ) r(τ) dτ                                                (2)
            = ∫ πθ(τ) ∇θ log πθ(τ) r(τ) dτ.                                     (3)
In practice, the expectation over trajectories τ can be approximated from a batch of N sampled trajectories:
    ∇θ J(θ) ≈ (1/N) Σ_{i=1}^{N} ∇θ log πθ(τi) r(τi)                             (6)
            = (1/N) Σ_{i=1}^{N} (Σ_{t=0}^{T−1} ∇θ log πθ(ait|sit)) (Σ_{t=0}^{T−1} r(sit, ait)).   (7)
Here we see that the policy πθ is a probability distribution over the action space, conditioned on the state. In
the agent-environment loop, the agent samples an action at from πθ (·|st ) and the environment responds with
a reward r(st , at ).
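As a concrete (toy) illustration of Equation 7, the sketch below estimates the gradient for a stateless softmax policy, for which ∇θ log πθ(a) has the closed form onehot(a) − π. The helper names are illustrative and not part of the starter code:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    exps = np.exp(z)
    return exps / exps.sum()

def grad_log_softmax(logits, action):
    # For a softmax policy, the score function ∇_logits log π(action)
    # is onehot(action) − π.
    probs = softmax(logits)
    onehot = np.zeros_like(probs)
    onehot[action] = 1.0
    return onehot - probs

def pg_estimate(logits, trajectories):
    """Monte Carlo policy gradient (Equation 7): average over trajectories
    of (sum of score functions) times the trajectory's total reward."""
    grad = np.zeros_like(logits)
    for traj in trajectories:              # traj is a list of (action, reward)
        total_reward = sum(r for _, r in traj)
        for action, _ in traj:
            grad += grad_log_softmax(logits, action) * total_reward
    return grad / len(trajectories)
```

Trajectories that earn higher total reward push the policy toward the actions they took, which is exactly the weighting in Equation 7.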
2.2.1 Reward-to-go
One way to reduce the variance of this estimator is to exploit causality: the policy at time step t cannot affect
rewards received before t, so the sum of rewards multiplying ∇θ log πθ(ait|sit) should not include the rewards
achieved prior to the time step at which the policy is being queried. This sum of rewards is a sample estimate
of the Q function, and is referred to as the "reward-to-go":
    ∇θ J(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{T−1} ∇θ log πθ(ait|sit) (Σ_{t'=t}^{T−1} r(sit', ait')).   (8)
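The inner sum of Equation 8 can be computed for every t in a single backward pass over the reward sequence. A minimal sketch (the function name is illustrative, not the starter code's API):

```python
def reward_to_go(rewards):
    """rtg[t] = sum of rewards[t], rewards[t+1], ..., rewards[T-1]
    (the inner sum of Equation 8, undiscounted)."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg
```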
2.2.2 Discounting
Multiplying the rewards by a discount factor γ can be interpreted as encouraging the agent to focus more on
rewards that are closer in time and less on rewards that are further in the future. This can also be thought of
as a means of reducing variance, because futures that are further away carry more variance. We saw in lecture
that the discount factor can be incorporated in two ways, as shown below.
The first way applies the discount to the rewards of the full trajectory:

    ∇θ J(θ) ≈ (1/N) Σ_{i=1}^{N} (Σ_{t=0}^{T−1} ∇θ log πθ(ait|sit)) (Σ_{t'=0}^{T−1} γ^{t'} r(sit', ait'))   (9)

The second way applies the discount to the rewards from time step t onward, i.e., to the reward-to-go:

    ∇θ J(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{T−1} ∇θ log πθ(ait|sit) (Σ_{t'=t}^{T−1} γ^{t'−t} r(sit', ait'))   (10)
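Both discounting schemes reduce to a short loop over a trajectory's rewards. A sketch with illustrative names (Case 1 weights every reward by γ raised to its absolute time step and gives every t the same scalar; Case 2 restarts the discounting at the current step):

```python
def discounted_return(rewards, gamma):
    """Case 1: every time step is weighted by the same full-trajectory
    discounted return, sum over t' of gamma**t' * rewards[t']."""
    total = sum(gamma ** t * r for t, r in enumerate(rewards))
    return [total] * len(rewards)

def discounted_reward_to_go(rewards, gamma):
    """Case 2: rtg[t] = sum over t' >= t of gamma**(t'-t) * rewards[t'],
    computed with one backward recursion."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg
```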
2.2.3 Baseline
Another variance reduction method is to subtract a baseline b (a constant with respect to τ) from the sum of
rewards:

    ∇θ J(θ) ≈ (1/N) Σ_{i=1}^{N} ∇θ log πθ(τi) (r(τi) − b).                      (11)
In this assignment, we will implement a value function Vφπ which acts as a state-dependent baseline. This
value function will be trained to approximate the sum of future rewards starting from a particular state:
    Vφπ(st) ≈ Σ_{t'=t}^{T−1} Eπθ[r(st', at') | st],                             (12)
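Once Vφπ is trained, the quantity that replaces the raw return in the gradient is the return minus the predicted value, commonly standardized for a better-conditioned update. A minimal numpy sketch (hypothetical names, not the starter code's exact API):

```python
import numpy as np

def estimate_advantage(returns, baseline_predictions, normalize=True):
    """Advantage = reward-to-go minus the value baseline V_phi(s_t)."""
    adv = (np.asarray(returns, dtype=np.float64)
           - np.asarray(baseline_predictions, dtype=np.float64))
    if normalize:
        # standardize to mean 0, std 1; the small epsilon guards
        # against division by zero when all advantages are equal
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)
    return adv
```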
3 Overview of Implementation
3.1 Files
To implement policy gradients, we will be building up the code that we started in homework 1. All files needed
to run your code are in the hw2 folder, but there will be some blanks you will fill with your solutions from
homework 1. These locations are marked with # TODO: get this from hw1 and are found in the following
files:
• infrastructure/rl_trainer.py
• infrastructure/utils.py
• policies/MLP_policy.py
After bringing in the required components from the previous homework, you can begin work on the new policy
gradient code. These placeholders are marked with TODO, located in the following files:
• agents/pg_agent.py
• policies/MLP_policy.py
The script to run the experiments is found in scripts/run_hw2.py (for the local option) or scripts/run_hw2.ipynb
(for the Colab option).
3.2 Overview
As in the previous homework, the main training loop is implemented in
infrastructure/rl_trainer.py.
The policy gradient algorithm uses the following 3 steps:
1. Sample trajectories by generating rollouts under your current policy.
2. Estimate returns and compute advantages. This is executed in the train function of pg_agent.py.
3. Train/Update parameters. The computational graph for the policy and the baseline, as well as the
update functions, are implemented in policies/MLP_policy.py.
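To make the three steps concrete, here is a self-contained toy version of the loop, using a two-armed bandit as the "environment" and two logits as the entire "policy." This is a sketch of the structure only, not the starter code:

```python
import math
import random

def softmax2(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def train_pg(n_iter=3, batch_size=4, seed=0):
    """Toy end-to-end version of the 3-step policy gradient loop."""
    random.seed(seed)
    logits = [0.0, 0.0]  # the entire "policy"
    for _ in range(n_iter):
        # 1. Sample "trajectories" (single-step episodes) under the current policy
        probs = softmax2(logits)
        acts = [0 if random.random() < probs[0] else 1 for _ in range(batch_size)]
        rews = [1.0 if a == 0 else 0.0 for a in acts]  # action 0 is better
        # 2. Estimate returns and advantages (mean reward as a constant baseline)
        baseline = sum(rews) / len(rews)
        advs = [r - baseline for r in rews]
        # 3. Update parameters with the score-function (REINFORCE) gradient
        probs = softmax2(logits)
        grad = [0.0, 0.0]
        for a, adv in zip(acts, advs):
            for k in range(2):
                grad[k] += adv * ((1.0 if k == a else 0.0) - probs[k])
        logits = [l + g / batch_size for l, g in zip(logits, grad)]
    return softmax2(logits)
```

After a few iterations the policy shifts probability toward the higher-reward action, mirroring on a toy scale what the starter code does across rl_trainer.py, pg_agent.py, and MLP_policy.py.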
The second (“Case 2”) uses the “reward-to-go” formulation from Equation 10:
    r(τi) = Σ_{t'=t}^{T−1} γ^{t'−t} r(sit', ait').                              (15)
Note that these differ only by the starting point of the summation.
Implement these return estimators as well as the remaining sections marked TODO in the code. For the small-
scale experiments, you may skip those sections that are run only if nn_baseline is True; we will return to
baselines in Section 6. (These sections are in MLPPolicyPG:update and PGAgent:estimate_advantage.)
Berkeley CS 285 Deep Reinforcement Learning, Decision Making, and Control Fall 2020
5 Small-Scale Experiments
After you have implemented all non-baseline code from Section 4, you will run two small-scale experiments to
get a feel for how different settings impact the performance of policy gradient methods.
Experiment 1 (CartPole). Run multiple experiments with the PG algorithm on the discrete CartPole-v0
environment, using the following commands:
python cs285/scripts/run_hw2.py --env_name CartPole-v0 -n 100 -b 1000 \
-dsa --exp_name q1_sb_no_rtg_dsa
• Provide the exact command line configurations (or #@params settings in Colab) you used to run your
experiments, including any parameters changed from their defaults.
What to Expect:
• The best configuration of CartPole in both the large and small batch cases should converge to a maximum
score of 200.
Experiment 2. In this experiment, your task is to find the smallest batch size b* and largest learning rate r*
that get to optimum (maximum score of 1000) in less than 100 iterations. The policy performance may fluctuate
around 1000; this is fine. The precision of b* and r* need only be one significant digit.
Deliverables:
• Given the b* and r* you found, provide a learning curve where the policy gets to optimum (maximum
score of 1000) in less than 100 iterations. (This may be for a single random seed, or averaged over
multiple.)
• Provide the exact command line configurations you used to run your experiments.
Experiment 3 (LunarLander). You will now use your policy gradient implementation to learn a controller
for LunarLanderContinuous-v2. The purpose of this problem is to test and help you debug your baseline
implementation from Section 6.
Run the following command:
python cs285/scripts/run_hw2.py \
--env_name LunarLanderContinuous-v2 --ep_len 1000 \
--discount 0.99 -n 100 -l 2 -s 64 -b 40000 -lr 0.005 \
--reward_to_go --nn_baseline --exp_name q3_b40000_r0.005
Deliverables:
• Plot a learning curve for the above command. You should expect to achieve an average return of around
180 by the end of training.
Experiment 4 (HalfCheetah). You will be using your policy gradient implementation to learn a controller
for the HalfCheetah-v2 benchmark environment with an episode length of 150. This is shorter than the default
episode length (1000), which speeds up training significantly. Search over batch sizes b ∈ [10000, 30000, 50000]
and learning rates r ∈ [0.005, 0.01, 0.02] to replace <b> and <r> below.
python cs285/scripts/run_hw2.py --env_name HalfCheetah-v2 --ep_len 150 \
--discount 0.95 -n 100 -l 2 -s 32 -b <b> -lr <r> -rtg --nn_baseline \
--exp_name q4_search_b<b>_lr<r>_rtg_nnbaseline
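One way to launch the 3×3 grid is a plain shell loop. This sketch only prints the nine commands (remove the echo, or pipe the output to bash, to actually run them):

```shell
for b in 10000 30000 50000; do
  for r in 0.005 0.01 0.02; do
    echo python cs285/scripts/run_hw2.py --env_name HalfCheetah-v2 --ep_len 150 \
      --discount 0.95 -n 100 -l 2 -s 32 -b $b -lr $r -rtg --nn_baseline \
      --exp_name q4_search_b${b}_lr${r}_rtg_nnbaseline
  done
done
```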
Deliverables:
• Provide a single plot with the learning curves for the HalfCheetah experiments that you tried. Describe
in words how the batch size and learning rate affected task performance.
Once you’ve found optimal values b* and r*, use them to run the following commands (replace the terms in
angle brackets):
python cs285/scripts/run_hw2.py --env_name HalfCheetah-v2 --ep_len 150 \
--discount 0.95 -n 100 -l 2 -s 32 -b <b*> -lr <r*> \
--exp_name q4_b<b*>_r<r*>
Run this command four times in total: as-is, with -rtg added, with --nn_baseline added, and with both flags
added, adjusting --exp_name to match in each case (e.g. q4_b<b*>_r<r*>_rtg_nnbaseline for the last run).
Deliverables: Provide a single plot with the learning curves for these four runs. The run with both reward-
to-go and the baseline should achieve an average score close to 200.
8 Bonus!
Choose any (or all) of the following:
• A serious bottleneck in the learning, for more complex environments, is the sample collection time. In
infrastructure/rl_trainer.py, we only collect trajectories in a single thread, but this process can be
fully parallelized across threads to get a useful speedup. Implement the parallelization and report on
the difference in training time.
• Implement GAE-λ for advantage estimation.[1] Run experiments in a MuJoCo gym environment to
explore whether this speeds up training. (Walker2d-v2 may be good for this.)
• In PG, we collect a batch of data, estimate a single gradient, and then discard the data and move on.
Can we potentially accelerate PG by taking multiple gradient descent steps with the same batch of data?
Explore this option and report on your results. Set up a fair comparison between single-step PG and
multi-step PG on at least one MuJoCo gym environment.
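For the GAE-λ bonus, the estimator from the linked paper computes TD residuals δt = rt + γV(st+1) − V(st) and accumulates At = δt + γλAt+1 backward in time. A sketch with illustrative names (it assumes a terminal bootstrap value has been appended to the value array):

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized Advantage Estimation for one trajectory.
    `values` has length T+1: V(s_0) .. V(s_T), where V(s_T) is the
    terminal bootstrap (use 0.0 if the episode ended)."""
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        # one-step TD residual
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # exponentially weighted sum of future residuals
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

λ = 0 recovers the one-step TD advantage, while λ = 1 recovers the discounted Monte Carlo return minus the baseline.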
9 Submission
[1] https://arxiv.org/abs/1506.02438
9.3 Turning it in
Turn in your assignment by the deadline on Gradescope. Upload the zip file with your code to HW2 Code,
and upload the PDF of your report to HW2.