CS229 Linear Algebra Review
Notes: (1) These questions require thought, but do not require long answers. Please be as
concise as possible. (2) If you have a question about this homework, we encourage you to post
your question on our Piazza forum, at https://piazza.com/stanford/fall2018/cs229. (3) If
you missed the first lecture or are unfamiliar with the collaboration or honor code policy, please
read the policy on Handout #1 (available from the course website) before starting work. (4)
This specific homework is not graded, but we encourage you to solve each of the problems to
brush up on your linear algebra. Some of them may even be useful for subsequent problem sets.
It also serves as your introduction to using Gradescope for submissions.
(a) Suppose that the matrix A ∈ Rn×n is diagonalizable, that is, A = T ΛT −1 for an invertible
matrix T ∈ Rn×n , where Λ = diag(λ1 , . . . , λn ) is diagonal. Use the notation t(i) for the
columns of T , so that T = [t(1) · · · t(n) ], where t(i) ∈ Rn . Show that At(i) = λi t(i) , so
that the eigenvector/eigenvalue pairs of A are (t(i) , λi ).
A matrix U ∈ Rn×n is orthogonal if U T U = I. The spectral theorem, perhaps one of the most
important theorems in linear algebra, states that if A ∈ Rn×n is symmetric, that is, A = AT ,
then A is diagonalizable by a real orthogonal matrix. That is, there are a diagonal matrix
Λ ∈ Rn×n and orthogonal matrix U ∈ Rn×n such that U T AU = Λ, or, equivalently,
A = U ΛU T .
(b) Let A be symmetric. Show that if U = [u(1) · · · u(n) ] is orthogonal, where u(i) ∈
Rn and A = U ΛU T , then u(i) is an eigenvector of A and Au(i) = λi u(i) , where Λ =
diag(λ1 , . . . , λn ).
(c) Show that if A is PSD, then λi (A) ≥ 0 for each i.
CS229 Problem Set #1
Notes: (1) These questions require thought, but do not require long answers. Please be as
concise as possible. (2) If you have a question about this homework, we encourage you to post
your question on our Piazza forum, at http://piazza.com/stanford/fall2018/cs229. (3) If
you missed the first lecture or are unfamiliar with the collaboration or honor code policy, please
read the policy on Handout #1 (available from the course website) before starting work. (4)
For the coding problems, you may not use any libraries except those defined in the provided
environment.yml file. In particular, ML-specific libraries such as scikit-learn are not permitted.
(5) To account for late days, the due date listed on Gradescope is Oct 20 at 11:59 pm. If you
submit after Oct 17, you will begin consuming your late days. If you wish to submit on time,
submit before Oct 17 at 11:59 pm.
All students must submit an electronic PDF version of the written questions. We highly recommend typesetting your solutions via LaTeX. If you are scanning your document by cell phone, please check the Piazza forum for recommended scanning apps and best practices. All students must also submit a zip file of their source code to Gradescope, which should be created using the make_zip.py script. In order to pass the auto-grader tests, you should make sure to (1)
restrict yourself to only using libraries included in the environment.yml file, and (2) make sure
your code runs without errors using the run.py script. Your submission will be evaluated by
the auto-grader using a private test set.
Honor code: We strongly encourage students to form study groups. Students may discuss and
work on homework problems in groups. However, each student must write down the solutions
independently, and without referring to written notes from the joint session. In other words,
each student must understand the solution well enough in order to reconstruct it by him/herself.
In addition, each student should write on the problem set the set of people with whom s/he
collaborated. Further, because we occasionally reuse problem set questions from previous years,
we expect students not to copy, refer to, or look at the solutions in preparing their answers. It
is an honor code violation to intentionally refer to a previous year’s solutions.
(a) [10 points] In lecture we saw the average empirical loss for logistic regression:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\right],$$
where y (i) ∈ {0, 1}, hθ (x) = g(θT x) and g(z) = 1/(1 + e−z ).
Find the Hessian H of this function, and show that for any vector z, it holds that $z^T H z \ge 0$.
Hint: You may want to start by showing that $\sum_i \sum_j z_i x_i x_j z_j = (x^T z)^2 \ge 0$. Recall also that $g'(z) = g(z)(1 - g(z))$.
Remark: This is one of the standard ways of showing that the matrix H is positive semi-definite, written "$H \succeq 0$." This implies that J is convex, and has no local minima other than the global one. If you have some other way of showing $H \succeq 0$, you're also welcome to use your method instead of the one above.
(b) [5 points] Coding problem. Follow the instructions in src/p01b_logreg.py to train a logistic regression classifier using Newton's Method. Starting with $\theta = \vec{0}$, run Newton's Method until the updates to θ are small: specifically, train until the first iteration k such that $\|\theta_k - \theta_{k-1}\|_1 < \epsilon$, where $\epsilon = 1 \times 10^{-5}$. Make sure to write your model's predictions to the file specified in the code.
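For intuition, here is a minimal numpy sketch of such a Newton's Method loop (the function and variable names are illustrative and do not match the starter code's interface):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logreg_newton(x, y, eps=1e-5):
    """Newton's Method for the average empirical logistic loss J above.

    x: (m, n) design matrix (intercept column assumed included),
    y: (m,) labels in {0, 1}.
    """
    m, n = x.shape
    theta = np.zeros(n)
    while True:
        h = sigmoid(x @ theta)                  # current predictions
        grad = x.T @ (h - y) / m                # gradient of J at theta
        hess = (x.T * (h * (1 - h))) @ x / m    # Hessian of J at theta
        step = np.linalg.solve(hess, grad)      # Newton step: H^{-1} grad
        theta = theta - step
        if np.linalg.norm(step, 1) < eps:       # ||theta_k - theta_{k-1}||_1
            return theta
```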
(c) [5 points] Recall that in GDA we model the joint distribution of (x, y) by the following
equations:
$$p(y) = \begin{cases} \phi & \text{if } y = 1 \\ 1 - \phi & \text{if } y = 0 \end{cases}$$
$$p(x|y=0) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0)\right)$$
$$p(x|y=1) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right),$$
By maximizing ` with respect to the four parameters, prove that the maximum likelihood
estimates of φ, µ0 , µ1 , and Σ are indeed as given in the formulas above. (You may assume
that there is at least one positive and one negative example, so that the denominators in
the definitions of µ0 and µ1 above are non-zero.)
(e) [3 points] Coding problem. In src/p01e_gda.py, fill in the code to calculate φ, µ0 , µ1 , and Σ, use these parameters to derive θ, and use the resulting GDA model to make predictions on the validation set.
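As a rough illustration of what this entails, here is a minimal numpy sketch of the closed-form GDA estimates and the resulting linear decision boundary (variable names are illustrative; the θ derivation is the standard one for a shared covariance):

```python
import numpy as np

def fit_gda(x, y):
    """Maximum likelihood GDA parameters; x is (m, n), y is (m,) in {0, 1}."""
    m, _ = x.shape
    phi = np.mean(y == 1)
    mu_0 = x[y == 0].mean(axis=0)
    mu_1 = x[y == 1].mean(axis=0)
    # Shared covariance: average outer product of examples centered
    # at their own class mean.
    centered = x - np.where((y == 1)[:, None], mu_1, mu_0)
    sigma = centered.T @ centered / m
    # The posterior is p(y=1|x) = sigmoid(theta^T x + theta_0), with:
    sigma_inv = np.linalg.inv(sigma)
    theta = sigma_inv @ (mu_1 - mu_0)
    theta_0 = (0.5 * (mu_0 @ sigma_inv @ mu_0 - mu_1 @ sigma_inv @ mu_1)
               + np.log(phi / (1 - phi)))
    return phi, mu_0, mu_1, sigma, theta, theta_0
```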
(f) [5 points] For Dataset 1, create a plot of the training data with x1 on the horizontal axis, and
x2 on the vertical axis. To visualize the two classes, use a different symbol for examples x(i)
with y (i) = 0 than for those with y (i) = 1. On the same figure, plot the decision boundary
found by logistic regression in part (b). Make an identical plot with the decision boundary
found by GDA in part (e).
(g) [5 points] Repeat the steps in part (f) for Dataset 2. On which dataset does GDA seem to
perform worse than logistic regression? Why might this be the case?
(h) [3 extra credit points] For the dataset where GDA performed worse in parts (f) and (g),
can you find a transformation of the x(i) ’s such that GDA performs significantly better?
What is this transformation?
All labeled examples are positive, which is to say p(t(i) = 1 | y (i) = 1) = 1, but unlabeled
examples may be positive or negative. Our goal in this problem is to construct a binary classifier
h of the true label t, with only access to the partial labels y. In other words, we want to construct
h such that h(x(i) ) ≈ p(t(i) = 1 | x(i) ) as closely as possible, using only x and y.
Real world example: Suppose we maintain a database of proteins which are involved in transmit-
ting signals across membranes. Every example added to the database is involved in a signaling
process, but there are many proteins involved in cross-membrane signaling which are missing
from the database. It would be useful to train a classifier to identify proteins that should be
added to the database. In our notation, each example x(i) corresponds to a protein, y (i) = 1
if the protein is in the database and 0 otherwise, and t(i) = 1 if the protein is involved in a
cross-membrane signaling process and thus should be added to the database, and 0 otherwise.
(a) [5 points] Suppose that y (i) and x(i) are conditionally independent given t(i) :
Note this is equivalent to saying that labeled examples were selected uniformly at random
from the set of positive examples. Prove that the probability of an example being labeled
differs by a constant factor from the probability of an example being positive. That is,
show that p(t(i) = 1 | x(i) ) = p(y (i) = 1 | x(i) )/α for some α ∈ R.
(b) [5 points] Suppose we want to estimate α using a trained classifier h and a held-out validation
set V . Let V+ be the set of labeled (and hence positive) examples in V , given by V+ =
{x(i) ∈ V | y (i) = 1}. Assuming that h(x(i) ) ≈ p(y (i) = 1 | x(i) ) for all examples x(i) , show
that
h(x(i) ) ≈ α for all x(i) ∈ V+ .
You may assume that p(t(i) = 1 | x(i) ) ≈ 1 when x(i) ∈ V+ .
(c) [5 points] Coding problem. The following three problems will deal with a dataset which
we have provided in the following files:
data/ds3_{train,valid,test}.csv
Each file contains the following columns: x1 , x2 , y, and t. As in Problem 1, there is one
example per row.
First we will consider the ideal case, where we have access to the true t-labels for training.
In src/p02cde_posonly.py, write a logistic regression classifier that uses x1 and x2 as input
features, and train it using the t-labels (you can ignore the y-labels for this part). Output
the trained model’s predictions on the test set to the file specified in the code.
(d) [5 points] Coding problem. We now consider the case where the t-labels are unavailable, so you only have access to the y-labels at training time. Add to your code in p02cde_posonly.py to re-train the classifier (still using x1 and x2 as input features), but using the y-labels only.
(e) [10 points] Coding problem. Using the validation set, estimate the constant α by aver-
aging your classifier’s predictions over all labeled examples in the validation set:
$$\alpha \approx \frac{1}{|V_+|} \sum_{x^{(i)} \in V_+} h(x^{(i)}).$$
Add code in src/p02cde_posonly.py to rescale your classifier's predictions from part (d)
using the estimated value for α.
Finally, using a threshold of p(t(i) = 1 | x(i) ) = 0.5, make three separate plots with the
decision boundaries from parts (c) - (e) plotted on top of the test set. Plot x1 on the
horizontal axis and x2 on the vertical axis, and use two different symbols for the positive
(t(i) = 1) and negative (t(i) = 0) examples. In each plot, indicate the separating hyperplane
with a red line.
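A minimal sketch of the estimation and rescaling steps, assuming a fitted classifier h that maps a batch of inputs to probabilities (all names here are illustrative):

```python
import numpy as np

def estimate_alpha(h, x_valid, y_valid):
    """alpha ~ mean prediction over the labeled validation examples V+."""
    v_plus = x_valid[y_valid == 1]
    return np.mean(h(v_plus))

def adjusted_predictions(h, x, alpha):
    """p(t=1|x) ~ p(y=1|x) / alpha, clipped to remain a valid probability."""
    return np.clip(h(x) / alpha, 0.0, 1.0)
```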
Remark: We saw that the true probability p(t | x) was only a constant factor away from
p(y | x). This means, if our task is only to rank examples (i.e. sort them) in a particular order (e.g., sort the proteins in order of being most likely to be involved in transmitting signals across
membranes), then in fact we do not even need to estimate α. The rank based on p(y | x) will
agree with the rank based on p(t | x).
$$p(y; \lambda) = \frac{e^{-\lambda} \lambda^y}{y!}.$$
Show that the Poisson distribution is in the exponential family, and clearly state the values
for b(y), η, T (y), and a(η).
(b) [3 points] Consider performing regression using a GLM model with a Poisson response
variable. What is the canonical response function for the family? (You may use the fact
that a Poisson random variable with parameter λ has mean λ.)
(c) [7 points] For a training set {(x(i) , y (i) ); i = 1, . . . , m}, let the log-likelihood of an example
be log p(y (i) |x(i) ; θ). By taking the derivative of the log-likelihood with respect to θj , derive
the stochastic gradient ascent update rule for learning using a GLM model with Poisson
responses y and the canonical response function.
(d) [7 points] Coding problem. Consider a website that wants to predict its daily traffic.
The website owners have collected a dataset of past traffic to their website, along with
some features which they think are useful in predicting the number of visitors per day. The
dataset is split into train/valid/test sets and follows the same format as Datasets 1-3:
data/ds4_{train,valid}.csv
We will apply Poisson regression to model the number of visitors per day. Note that ap-
plying Poisson regression in particular assumes that the data follows a Poisson distribution
whose natural parameter is a linear combination of the input features (i.e., η = θT x).
In src/p03d_poisson.py, implement Poisson regression for this dataset and use gradient
ascent to maximize the log-likelihood of θ.
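As a sketch of the kind of update this produces (full-batch rather than stochastic, with illustrative names and an assumed small learning rate):

```python
import numpy as np

def fit_poisson_regression(x, y, lr=1e-7, n_iters=10000):
    """Gradient ascent on the Poisson log-likelihood.

    x: (m, n) design matrix, y: (m,) observed counts. With the canonical
    response function, the model's prediction is exp(theta^T x), and the
    gradient of the log-likelihood is X^T (y - exp(X theta)).
    """
    theta = np.zeros(x.shape[1])
    for _ in range(n_iters):
        theta += lr * (x.T @ (y - np.exp(x @ theta)))
    return theta

# Predicted expected counts on new inputs: np.exp(x_new @ theta)
```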
where η is the natural parameter of the distribution. Moreover, in a Generalized Linear Model, η
is modeled as θT x, where x ∈ Rn are the input features of the example, and θ ∈ Rn are learnable
parameters. In order to show that the NLL loss is convex for GLMs, we break down the process
into sub-parts, and approach them one at a time. Our approach is to show that the second
derivative (i.e., Hessian) of the loss w.r.t the model parameters is Positive Semi-Definite (PSD)
at all values of the model parameters. We will also show some nice properties of Exponential
Family distributions as intermediate steps.
For the sake of convenience we restrict ourselves to the case where η is a scalar. Assume
p(Y |X; θ) ∼ ExponentialFamily(η), where η ∈ R is a scalar, and T (y) = y. This makes the
exponential family representation take the form
(a) [5 points] Derive an expression for the mean of the distribution. Show that E[Y | X; θ] can
be represented as the gradient of the log-partition function a with respect to the natural
parameter η.
Hint: Start by observing that $\frac{\partial}{\partial \eta} \int p(y; \eta)\, dy = \int \frac{\partial}{\partial \eta} p(y; \eta)\, dy$.
(b) [5 points] Next, derive an expression for the variance of the distribution. In particular,
show that Var(Y | X; θ) can be expressed as the derivative of the mean w.r.t η (i.e., the
second derivative of the log-partition function a(η) w.r.t the natural parameter η.)
(c) [5 points] Finally, write out the loss function `(θ), the NLL of the distribution, as a function
of θ. Then, calculate the Hessian of the loss w.r.t θ, and show that it is always PSD. This
concludes the proof that NLL loss of GLM is convex.
Hint: Use the chain rule of calculus along with the results of the previous parts to simplify
your derivations.
Remark: The main takeaways from this problem are:
• The NLL loss of any GLM is convex in the model parameters.
• The exponential family of probability distributions is mathematically nice. Whereas calculating the mean and variance of a general distribution involves integrals (hard), for the exponential family we can surprisingly calculate them using derivatives (easy).
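For a concrete instance of the second takeaway, consider the Poisson distribution from the previous problem, written in exponential family form:
$$p(y; \eta) = \frac{1}{y!} \exp\left(\eta y - e^{\eta}\right), \qquad \eta = \log \lambda, \qquad a(\eta) = e^{\eta},$$
so that
$$\mathbb{E}[Y] = a'(\eta) = e^{\eta} = \lambda, \qquad \mathrm{Var}(Y) = a''(\eta) = e^{\eta} = \lambda,$$
recovering the familiar Poisson mean and variance by differentiation alone, with no integrals (here, sums).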
(a) [10 points] Consider a linear regression problem in which we want to “weight” different
training examples differently. Specifically, suppose we want to minimize
$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} w^{(i)} \left(\theta^T x^{(i)} - y^{(i)}\right)^2.$$
In class, we worked out what happens for the case where all the weights (the w(i) ’s) are the
same. In this problem, we will generalize some of those ideas to the weighted setting.
i. [2 points] Show that J(θ) can also be written
$$J(\theta) = (X\theta - \vec{y})^T W (X\theta - \vec{y})$$
for an appropriate matrix W , and where X and $\vec{y}$ are as defined in class. Clearly specify the value of each element of the matrix W .
ii. [4 points] If all the w(i) ’s equal 1, then we saw in class that the normal equation is
$$X^T X \theta = X^T \vec{y},$$
and that the value of θ that minimizes J(θ) is given by (X T X)−1 X T y. By finding
the derivative ∇θ J(θ) and setting that to zero, generalize the normal equation to this
weighted setting, and give the new value of θ that minimizes J(θ) in closed form as a
function of X, W and y.
iii. [4 points] Suppose we have a dataset {(x(i) , y (i) ); i = 1, . . . , m} of m independent ex-
amples, but we model the y (i) ’s as drawn from conditional distributions with different
levels of variance (σ (i) )2 . Specifically, assume the model
$$p(y^{(i)} | x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma^{(i)}} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2(\sigma^{(i)})^2}\right)$$
That is, each y (i) is drawn from a Gaussian distribution with mean θT x(i) and variance
(σ (i) )2 (where the σ (i) ’s are fixed, known, constants). Show that finding the maximum
likelihood estimate of θ reduces to solving a weighted linear regression problem. State
clearly what the w(i) ’s are in terms of the σ (i) ’s.
(b) [10 points] Coding problem. We will now consider the following dataset (the formatting
matches that of Datasets 1-4, except x(i) is 1-dimensional):
data/ds5_{train,valid,test}.csv
In src/p05b_lwr.py, implement locally weighted linear regression using the normal equations you derived in Part (a) and using
$$w^{(i)} = \exp\left(-\frac{\|x^{(i)} - x\|_2^2}{2\tau^2}\right).$$
Train your model on the train split using τ = 0.5, then run your model on the valid split
and report the mean squared error (MSE). Finally plot your model’s predictions on the
validation set (plot the training set with blue 'x' markers and the validation set with red 'o' markers). Does the model seem to be under- or overfitting?
(c) [5 points] Coding problem. We will now tune the hyperparameter τ . In src/p05c_tau.py,
find the MSE value of your model on the validation set for each of the values of τ specified
in the code. For each τ , plot your model’s predictions on the validation set in the format
described in part (b). Report the value of τ which achieves the lowest MSE on the valid
split, and finally report the MSE on the test split using this τ -value.
CS229 Problem Set #2
Notes: (1) These questions require thought, but do not require long answers. Please be as
concise as possible. (2) If you have a question about this homework, we encourage you to post
your question on our Piazza forum, at http://piazza.com/stanford/fall2018/cs229. (3) If
you missed the first lecture or are unfamiliar with the collaboration or honor code policy, please
read the policy on Handout #1 (available from the course website) before starting work. (4)
For the coding problems, you may not use any libraries except those defined in the provided
environment.yml file. In particular, ML-specific libraries such as scikit-learn are not permitted.
(5) To account for late days, the due date listed on Gradescope is Nov 03 at 11:59 pm. If you
submit after Oct 31, you will begin consuming your late days. If you wish to submit on time,
submit before Oct 31 at 11:59 pm.
All students must submit an electronic PDF version of the written questions. We highly recommend typesetting your solutions via LaTeX. If you are scanning your document by cell phone, please check the Piazza forum for recommended scanning apps and best practices. All students must also submit a zip file of their source code to Gradescope, which should be created using the make_zip.py script. In order to pass the auto-grader tests, you should make sure to (1) restrict yourself to only using libraries included in the environment.yml file, and (2) make sure your code runs without errors when running p05_percept.py and p06_spam.py. Your submission will
be evaluated by the auto-grader using a private test set.
(a) [2 points] What is the most notable difference in training the logistic regression model on
datasets A and B?
(b) [5 points] Investigate why the training procedure behaves unexpectedly on dataset B, but
not on A. Provide hard evidence (in the form of math, code, plots, etc.) to corroborate
your hypothesis for the misbehavior. Remember, you should address why your explanation
does not apply to A.
Hint: The issue is not a numerical rounding or over/underflow error.
(c) [5 points] For each of these possible modifications, state whether or not it would lead to
the provided training algorithm converging on datasets such as B. Justify your answers.
i. Using a different constant learning rate.
ii. Decreasing the learning rate over time (e.g. scaling the initial learning rate by $1/t^2$, where t is the number of gradient descent iterations thus far).
iii. Linear scaling of the input features.
iv. Adding a regularization term $\|\theta\|_2^2$ to the loss function.
v. Adding zero-mean Gaussian noise to the training data or labels.
(d) [3 points] Are support vector machines, which use the hinge loss, vulnerable to datasets like
B? Why or why not? Give an informal justification.
Hint: Recall the distinction between functional margin and geometric margin.
where $P(y = 1 \mid x; \theta) = h_\theta(x) = 1/(1 + \exp(-\theta^T x))$, $I_{a,b} = \{i \mid i \in \{1, \ldots, m\},\ h_\theta(x^{(i)}) \in (a, b)\}$ is an index set of all training examples x(i) where hθ (x(i) ) ∈ (a, b), and |S| denotes the size of the set S.
(a) [5 points] Show that the above property holds true for the described logistic regression
model over the range (a, b) = (0, 1).
Hint: Use the fact that we include a bias term.
(b) [3 points] If we have a binary classification model that is perfectly calibrated—that is, the
property we just proved holds for any (a, b) ⊂ [0, 1]—does this necessarily imply that the
model achieves perfect accuracy? Is the converse necessarily true? Justify your answers.
(c) [2 points] Discuss what effect including L2 regularization in the logistic regression objective
has on model calibration.
Remark: We considered the range (a, b) = (0, 1). This is the only range for which logistic
regression is guaranteed to be calibrated on the training set. When the GLM modeling assump-
tions hold, all ranges (a, b) ⊂ [0, 1] are well calibrated. In addition, when the training and test set
are from the same distribution and when the model has not overfit or underfit, logistic regression
tends to be well-calibrated on unseen test data as well. This makes logistic regression a very
popular model in practice, especially when we are interested in the level of uncertainty in the
model output.
Compare this to the maximum likelihood estimation (MLE) we have seen previously:
In this problem, we explore the connection between MAP estimation, and common regularization
techniques that are applied with MLE estimation. In particular, you will show how the choice
of prior distribution over θ (e.g., Gaussian or Laplace prior) is equivalent to different kinds of regularization (e.g., L2 or L1 regularization). To show this, we shall proceed step by step, showing
intermediate steps.
(a) [3 points] Show that θMAP = argmaxθ p(y|x, θ)p(θ) if we assume that p(θ) = p(θ|x). The
assumption that p(θ) = p(θ|x) will be valid for models such as linear regression where the
input x are not explicitly modeled by θ. (Note that this means x and θ are marginally
independent, but not conditionally independent when y is given.)
(b) [5 points] Recall that L2 regularization penalizes the L2 norm of the parameters while
minimizing the loss (i.e., negative log likelihood in case of probabilistic models). Now we
will show that MAP estimation with a zero-mean Gaussian prior over θ, specifically θ ∼
N (0, η 2 I), is equivalent to applying L2 regularization with MLE estimation. Specifically,
show that
$$\theta_{\text{MAP}} = \arg\min_{\theta}\; -\log p(y|x, \theta) + \lambda \|\theta\|_2^2.$$
where each row vector is one example input, and ~y be the column vector of all the example
outputs.
Come up with a closed form expression for θMAP .
(d) [5 points] Next, consider the Laplace distribution, whose density is given by
$$f_{\mathcal{L}}(z | \mu, b) = \frac{1}{2b} \exp\left(-\frac{|z - \mu|}{b}\right).$$
Remark: Linear regression with L2 regularization is also commonly called Ridge regression, and
when L1 regularization is employed, is commonly called Lasso regression. These regularizations
can be applied to any Generalized Linear models just as above (by replacing log p(y|x, θ) with
the appropriate family likelihood). Regularization techniques of the above type are also called
weight decay, and shrinkage. The Gaussian and Laplace priors encourage the parameter values
to be closer to their mean (i.e., zero), which results in the shrinkage effect.
Remark: Lasso regression (i.e., L1 regularization) is known to result in sparse parameters, where
most of the parameter values are zero, with only some of them non-zero.
[Hint: For part (e), the answer is that K is indeed a kernel. You still have to prove it, though.
(This one may be harder than the rest.) This result may also be useful for another part of the
problem.]
where θ(i) is the value of the parameters after the algorithm has seen the first i training examples.
Prior to seeing any training examples, θ(0) is initialized to $\vec{0}$.
(a) [9 points] Let K be a Mercer kernel corresponding to some very high-dimensional feature
mapping φ. Suppose φ is so high-dimensional (say, ∞-dimensional) that it’s infeasible to
ever represent φ(x) explicitly. Describe how you would apply the “kernel trick” to the
perceptron to make it work in the high-dimensional feature space φ, but without ever
explicitly computing φ(x).
[Note: You don’t have to worry about the intercept term. If you like, think of φ as having
the property that φ0 (x) = 1 so that this is taken care of.] Your description should specify:
i. [3 points] How you will (implicitly) represent the high-dimensional parameter vector
θ(i) , including how the initial value θ(0) = 0 is represented (note that θ(i) is now a
vector whose dimension is the same as the feature vectors φ(x));
ii. [3 points] How you will efficiently make a prediction on a new input x(i+1) . I.e., how you will compute $h_{\theta^{(i)}}(x^{(i+1)}) = g(\theta^{(i)T} \phi(x^{(i+1)}))$, using your representation of θ(i) ; and
iii. [3 points] How you will modify the update rule given above to perform an update to θ
on a new training example (x(i+1) , y (i+1) ); i.e., using the update rule corresponding to
the feature mapping φ:
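To make the idea concrete, here is a minimal sketch of one possible implicit representation (illustrative names; it assumes an update rule of the form θ := θ + α(y − hθ (x))φ(x), with θ stored as a weighted sum of mapped training points):

```python
class KernelPerceptron:
    """theta is represented implicitly as sum_j beta_j * phi(x_j)."""

    def __init__(self, kernel, lr=0.5):
        self.kernel, self.lr = kernel, lr
        self.xs, self.betas = [], []    # empty lists represent theta = 0

    def predict(self, x):
        # theta^T phi(x) = sum_j beta_j K(x_j, x): phi is never computed.
        score = sum(b * self.kernel(xj, x)
                    for b, xj in zip(self.betas, self.xs))
        return 1 if score >= 0 else 0

    def update(self, x, y):
        err = y - self.predict(x)
        if err != 0:
            # Adding lr * err * phi(x) to theta appends one term to the sum.
            self.xs.append(x)
            self.betas.append(self.lr * err)
```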
(b) [5 points] Implement your approach by completing the initial_state, predict, and update_state methods of src/p05_percept.py.
(c) [2 points] Run src/p05_percept.py to train kernelized perceptrons on data/ds5_train.csv. The code will then test the perceptron on data/ds5_test.csv and save the resulting predictions in the src/output folder. Plots will also be saved in src/output. We provide two kernels, a dot product kernel and a radial basis function (RBF) kernel. One of the provided
kernels performs extremely poorly in classifying the points. Which kernel performs badly
and why does it fail?
(a) [5 points] Implement code for processing the spam messages into numpy arrays that can be fed into machine learning models. Do this by completing the get_words, create_dictionary, and transform_text functions within our provided src/p06_spam.py. Do note the corresponding comments for each function for instructions on what specific processing is required. The provided code will then run your functions and save the resulting dictionary into output/p06_dictionary and a sample of the resulting training matrix into output/p06_sample_train_matrix.
(b) [10 points] In this question you are going to implement a naive Bayes classifier for spam
classification with multinomial event model and Laplace smoothing (refer to class notes on
Naive Bayes for details on Laplace smoothing).
Write your implementation by completing the fit_naive_bayes_model and predict_from_naive_bayes_model functions in src/p06_spam.py. src/p06_spam.py should then be able to train a Naive Bayes model, compute your prediction accuracy and then save your resulting predictions to output/p06_naive_bayes_predictions.
Remark. If you implement naive Bayes the straightforward way, you'll find that the computed $p(x|y) = \prod_i p(x_i|y)$ often equals zero. This is because p(x|y), which is the product of many numbers less than one, is a very small number. The standard computer
representation of real numbers cannot handle numbers that are too small, and instead
rounds them off to zero. (This is called “underflow.”) You’ll have to find a way to compute
Naive Bayes’ predicted class labels without explicitly representing very small numbers such
as p(x|y). [Hint: Think about using logarithms.]
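For example, under the multinomial event model the comparison can be done entirely in log space (a sketch with illustrative names; log_priors and log_token_probs would come from your fitted model):

```python
import numpy as np

def predict_class(x, log_priors, log_token_probs):
    """x: (vocab,) token counts for one message;
    log_priors: (2,) log p(y); log_token_probs: (2, vocab) log p(token|y).

    log p(y) p(x|y) = log p(y) + sum_k x_k log p(token k | y), up to a
    term that is identical for both classes, so the argmax is unchanged.
    """
    log_posteriors = log_priors + log_token_probs @ x
    return int(np.argmax(log_posteriors))
```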
(c) [5 points] Intuitively, some tokens may be particularly indicative of an SMS being in a
particular class. We can try to get an informal sense of how indicative token i is for the
SPAM class by looking at:
$$\log \frac{p(x_j = i \mid y = 1)}{p(x_j = i \mid y = 0)} = \log \frac{P(\text{token } i \mid \text{email is SPAM})}{P(\text{token } i \mid \text{email is NOTSPAM})}.$$
Complete the get_top_five_naive_bayes_words function within the provided code using the above formula in order to obtain the 5 most indicative tokens.
The provided code will print out the resulting indicative tokens and then save them to output/p06_top_indicative_words.
1 Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New
Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG’11),
Mountain View, CA, USA, 2011.
(d) [2 points] Support vector machines (SVMs) are an alternative machine learning model that
we discussed in class. We have provided you an SVM implementation (using a radial basis
function (RBF) kernel) within src/svm.py (You should not need to modify that code).
One important part of training an SVM parameterized by an RBF kernel is choosing an
appropriate kernel radius.
Complete the compute_best_svm_radius function by writing code to compute the best SVM radius, which maximizes accuracy on the validation dataset.
The provided code will use your compute_best_svm_radius to compute and then write the best radius into output/p06_optimal_radius.
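The search itself can be a simple loop over the candidate radii (a sketch; train_and_predict stands in for whatever training interface the provided SVM code exposes, and is an illustrative name):

```python
import numpy as np

def best_svm_radius(train_x, train_y, val_x, val_y, radii, train_and_predict):
    """Return the radius whose trained SVM scores highest on validation."""
    best_radius, best_acc = None, -1.0
    for radius in radii:
        preds = train_and_predict(train_x, train_y, val_x, radius)
        acc = np.mean(preds == val_y)
        if acc > best_acc:
            best_radius, best_acc = radius, acc
    return best_radius
```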
CS229 Problem Set #3
Notes: (1) These questions require thought, but do not require long answers. Please be as
concise as possible. (2) If you have a question about this homework, we encourage you to post
your question on our Piazza forum, at http://piazza.com/stanford/fall2018/cs229. (3) If
you missed the first lecture or are unfamiliar with the collaboration or honor code policy, please
read the policy on Handout #1 (available from the course website) before starting work. (4)
For the coding problems, you may not use any libraries except those defined in the provided
environment.yml file. In particular, ML-specific libraries such as scikit-learn are not permitted.
(5) To account for late days, the due date listed on Gradescope is Nov 17 at 11:59 pm. If you
submit after Nov 14, you will begin consuming your late days. If you wish to submit on time,
submit before Nov 14 at 11:59 pm.
All students must submit an electronic PDF version of the written questions. We highly recommend typesetting your solutions via LaTeX. If you are scanning your document by cell phone, please check the Piazza forum for recommended scanning apps and best practices. All students must also submit a zip file of their source code to Gradescope, which should be created using the make_zip.py script. In order to pass the auto-grader tests, you should make sure to (1) restrict yourself to only using libraries included in the environment.yml file, and (2) make sure your code runs without errors when running p04_gmm.py and p05_kmeans.py. Your submission will
be evaluated by the auto-grader using a private test set.
[Figure 1: Scatter plot of the dataset, with x1 on the horizontal axis and x2 on the vertical axis (both ranging from 0.0 to 4.0).]
The examples in class 1 are marked as “×” and examples in class 0 are marked as “◦”. We want
to perform binary classification using a simple neural network with the architecture shown in
Figure 2:
[Figure 2: The neural network architecture: inputs x1 and x2, one hidden layer with neurons h1, h2, h3, and a single output neuron o.]
Denote the two features x1 and x2 , the three neurons in the hidden layer h1 , h2 , and h3 , and the output neuron as o. Let the weight from $x_i$ to $h_j$ be $w^{[1]}_{i,j}$ for i ∈ {1, 2}, j ∈ {1, 2, 3}, and the weight from $h_j$ to o be $w^{[2]}_j$ . Finally, denote the intercept weight for $h_j$ as $w^{[1]}_{0,j}$ , and the intercept weight for o as $w^{[2]}_0$ . For the loss function, we'll use average squared loss instead of the usual negative log-likelihood:
$$l = \frac{1}{m} \sum_{i=1}^{m} (o^{(i)} - y^{(i)})^2,$$
(a) [5 points] Suppose we use the sigmoid function as the activation function for h1 , h2 , h3 and o. What is the gradient descent update to $w^{[1]}_{1,2}$ , assuming we use a learning rate of α? Your answer should be written in terms of x(i) , o(i) , y (i) , and the weights.
(b) [10 points] Now, suppose that instead of using the sigmoid function for the activation function for h1 , h2 , h3 and o, we use the step function f (x), defined as
$$f(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}$$
Is it possible to have a set of weights that allow the neural network to classify this dataset
with 100% accuracy?
If it is possible, please provide a set of weights that enable 100% accuracy by completing optimal_step_weights within src/p01_nn.py and explain your reasoning for those weights in your PDF.
If it is not possible, please explain your reasoning in your PDF. (There is no need to modify optimal_step_weights if it is not possible.)
Hint: There are three sides to a triangle, and there are three neurons in the hidden layer.
(c) [10 points] Let the activation functions for h1 , h2 , h3 be the linear function f (x) = x and
the activation function for o be the same step function as before.
Is it possible to have a set of weights that allow the neural network to classify this dataset
with 100% accuracy?
If it is possible, please provide a set of weights that enable 100% accuracy by completing optimal_linear_weights within src/p01_nn.py and explain your reasoning for those weights in your PDF.
If it is not possible, please explain your reasoning in your PDF. (There is no need to modify optimal_linear_weights if it is not possible.)
$$D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$$
For notational convenience, we assume P (x) > 0, ∀x. (One other standard thing to do is to adopt the convention that "0 log 0 = 0.") Sometimes, we also write the KL divergence more explicitly as DKL (P ||Q) = DKL (P (X)||Q(X)). (If P and Q are densities for continuous-valued random variables, then the sum is replaced by an integral, and everything stated in this problem works fine as well; for the sake of simplicity, in this problem we'll just work with this form of KL divergence for probability mass functions/discrete-valued distributions.)
Background on Information Theory
Before we dive deeper, we give a brief (optional) Information Theoretic background on KL
divergence. While this introduction is not necessary to answer the assignment question, it may
help you better understand and appreciate why we study KL divergence, and how Information
Theory can be relevant to Machine Learning.
We start with the entropy H(P ) of a probability distribution P (X), which is defined as
$$H(P) = -\sum_{x \in \mathcal{X}} P(x) \log P(x).$$
Intuitively, entropy measures how dispersed a probability distribution is. For example, a uni-
form distribution is considered to have very high entropy (i.e. a lot of uncertainty), whereas a
distribution that assigns all its mass on a single point is considered to have zero entropy (i.e.
no uncertainty). Notably, it can be shown that among all continuous distributions over R with a given mean µ and variance σ², the Gaussian distribution N (µ, σ²) has the highest entropy (highest uncertainty).
To further solidify our intuition, we present motivation from communication theory. Suppose we
want to communicate from a source to a destination, and our messages are always (a sequence
of) discrete symbols over space X (for example, X could be letters {a, b, . . . , z}). We want to
construct an encoding scheme for our symbols in the form of sequences of binary bits that are
transmitted over the channel. Further, suppose that in the long run the frequency of occurrence
of symbols follow a probability distribution P (X). This means, in the long run, the fraction of
times the symbol x gets transmitted is P (x).
A common desire is to construct an encoding scheme such that the average number of bits per
symbol transmitted remains as small as possible. Intuitively, this means we want very frequent
symbols to be assigned to a bit pattern having a small number of bits. Likewise, because we are interested in reducing the average number of bits per symbol in the long term, it is tolerable for
infrequent words to be assigned to bit patterns having a large number of bits, since their low
frequency has little effect on the long term average. The encoding scheme can be as complex as
we desire, for example, a single bit could possibly represent a long sequence of multiple symbols
(if that specific pattern of symbols is very common). The entropy of a probability distribution
P (X) is its optimal bit rate, i.e., the lowest average bits per message that can possibly be
achieved if the symbols x ∈ X occur according to P (X). It does not specifically tell us how to
construct that optimal encoding scheme. It only tells us that no encoding can possibly give us
a lower long term bits per message than H(P ).
To see a concrete example, suppose our messages have a vocabulary of K = 32 symbols, and
each symbol has an equal probability of transmission in the long term (i.e, uniform probability
distribution). An encoding scheme that would work well for this scenario would be to have
log2 K bits per symbol, and assign each symbol some unique combination of the log2 K bits. In
fact, it turns out that this is the most efficient encoding one can come up with for the uniform
distribution scenario.
It may have occurred to you by now that the long term average number of bits per message
depends only on the frequency of occurrence of symbols. The encoding scheme of scenario A can
in theory be reused in scenario B with a different set of symbols (assume equal vocabulary size
for simplicity), with the same long term efficiency, as long as the symbols of scenario B follow
the same probability distribution as the symbols of scenario A. It might also have occurred to you that reusing the encoding scheme designed to be optimal for scenario A, for messages in scenario B having a different probability of symbols, will always be suboptimal for scenario B. To be clear, we do not need to know what the specific optimal schemes are in either scenario. As long as we know the distributions of their symbols, we can say that the optimal scheme designed for scenario A will be suboptimal for scenario B if the distributions are different.
Concretely, if we reuse the optimal scheme designed for a scenario having symbol distribution
Q(X), into a scenario that has symbol distribution P (X), the long term average number of bits
per symbol achieved is called the cross entropy, denoted by H(P, Q):
$$H(P, Q) = -\sum_{x \in \mathcal{X}} P(x) \log Q(x).$$
To recap, the entropy H(P ) is the best possible long term average bits per message (optimal) that can be achieved under a symbol distribution P (X) by using an encoding scheme (possibly
unknown) specifically designed for P (X). The cross entropy H(P, Q) is the long term average bits
per message (suboptimal) that results under a symbol distribution P (X), by reusing an encoding
scheme (possibly unknown) designed to be optimal for a scenario with symbol distribution Q(X).
Now, KL divergence is the penalty we pay, as measured in average number of bits, for using the
optimal scheme for Q(X), under the scenario where symbols are actually distributed as P (X).
It is straightforward to see this:
$$\begin{aligned} D_{KL}(P \| Q) &= \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)} \\ &= \sum_{x \in \mathcal{X}} P(x) \log P(x) - \sum_{x \in \mathcal{X}} P(x) \log Q(x) \\ &= H(P, Q) - H(P). \qquad \text{(difference in average number of bits)} \end{aligned}$$
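A quick numerical check of this identity on a toy pair of distributions:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.4, 0.4, 0.2])

entropy = -np.sum(p * np.log2(p))          # H(P): optimal bits per symbol
cross_entropy = -np.sum(p * np.log2(q))    # H(P, Q): bits using Q's scheme
kl = np.sum(p * np.log2(p / q))            # D_KL(P||Q): the penalty

assert np.isclose(kl, cross_entropy - entropy)
```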
If the cross entropy between P and Q equals the entropy of P (and hence DKL (P ||Q) = 0), then it necessarily means P = Q. In Machine Learning, it is a common task to find a distribution Q that is "close" to another distribution P . To achieve this, we use DKL (P ||Q) as the loss function to be optimized. As we will see in this question below, Maximum Likelihood Estimation, which is a commonly used optimization objective, turns out to be equivalent to minimizing KL divergence between the training data (i.e. the empirical distribution over the data) and the model.
Now, we get back to showing some simple properties of KL divergence.
This can be thought of as the expected KL divergence between the corresponding conditional
distributions on x (that is, between P (X|Y = y) and Q(X|Y = y)), where the expectation
is taken over the random y.
Prove the following chain rule for KL divergence:
$$D_{KL}(P(X, Y) \| Q(X, Y)) = D_{KL}(P(X) \| Q(X)) + D_{KL}(P(Y|X) \| Q(Y|X)).$$
(c) [5 points] KL and maximum likelihood. Consider a density estimation problem, and suppose we are given a training set {x(i) ; i = 1, . . . , m}. Let the empirical distribution be $\hat{P}(x) = \frac{1}{m} \sum_{i=1}^{m} 1\{x^{(i)} = x\}$. ($\hat{P}$ is just the uniform distribution over the training set; i.e., sampling from the empirical distribution is the same as picking a random example from the training set.)
Suppose we have some family of distributions Pθ parameterized by θ. (If you like, think of
Pθ (x) as an alternative notation for P (x; θ).) Prove that finding the maximum likelihood
estimate for the parameter θ is equivalent to finding Pθ with minimal KL divergence from
P̂ . I.e. prove:
$$\arg\min_{\theta} D_{KL}(\hat{P} \| P_\theta) = \arg\max_{\theta} \sum_{i=1}^{m} \log P_\theta(x^{(i)})$$
Remark. Consider the relationship between parts (b-c) and multi-variate Bernoulli Naive
Bayes parameter estimation. In the Naive Bayes model we assumed Pθ is of the following
form: $P_\theta(x, y) = p(y) \prod_{i=1}^{n} p(x_i|y)$. By the chain rule for KL divergence, we therefore have:
$$D_{KL}(\hat{P} \| P_\theta) = D_{KL}(\hat{P}(y) \| p(y)) + \sum_{i=1}^{n} D_{KL}(\hat{P}(x_i|y) \| p(x_i|y)).$$
This shows that finding the maximum likelihood/minimum KL-divergence estimate of the
parameters decomposes into 2n + 1 independent optimization problems: One for the class
priors p(y), and one for each of the conditional distributions p(xi |y) for each feature xi
given each of the two possible labels for y. Specifically, finding the maximum likelihood
estimates for each of these problems individually results in also maximizing the likelihood
of the joint distribution. (If you know what Bayesian networks are, a similar remark applies
to parameter estimation for them.)
Intuitively, the Fisher information represents the amount of information that a random
variable Y carries about a parameter θ of interest. When the parameter of interest is a
vector (as in our case, since θ ∈ Rn ), this information becomes a matrix. Show that the
Fisher information can equivalently be given by
$$\mathcal{I}(\theta) = \mathbb{E}_{y \sim p(y;\theta)}\left[\nabla_{\theta'} \log p(y; \theta')\, \nabla_{\theta'} \log p(y; \theta')^T \Big|_{\theta'=\theta}\right]$$
Note that the Fisher Information is a function of the parameter. The parameter of the
Fisher information is both a) the parameter value at which the score function is evaluated,
and b) the parameter of the distribution with respect to which the expectation and variance
is calculated.
(c) [5 points] Fisher Information (alternate form)
It turns out that the Fisher Information can not only be defined as the covariance of the
score function, but in most situations it can also be represented as the expected negative
Hessian of the log-likelihood.
Show that $\mathbb{E}_{y \sim p(y;\theta)}\left[-\nabla^2_{\theta'} \log p(y; \theta')\big|_{\theta'=\theta}\right] = \mathcal{I}(\theta)$.
Remark. The Hessian represents the curvature of a function at a point. This shows that
the expected curvature of the log-likelihood function is also equal to the Fisher information
matrix. If the curvature of the log-likelihood at a parameter is very steep (i.e., Fisher Information is very high), this generally means you need fewer data samples to estimate that parameter well (assuming the data was generated from the distribution with those parameters), and vice versa. The Fisher information matrix associated with a statistical
model parameterized by θ is extremely important in determining how a model behaves as
a function of the number of training set examples.
(d) [5 points] Approximating DKL with Fisher Information
As we explained at the start of this problem, we are interested in the set of all distributions
that are at a small fixed DKL distance away from the current distribution. In order to
calculate DKL between p(y; θ) and p(y; θ + d), where d ∈ Rn is a small magnitude “delta”
vector, we approximate it using the Fisher Information at θ. Eventually d will be the
natural gradient update we will add to θ. To approximate the KL divergence with Fisher Information, we will start with the Taylor Series expansion of DKL and see that the Fisher Information pops up in the expansion.
Show that $D_{KL}(p_\theta \| p_{\theta+d}) \approx \frac{1}{2}\, d^T\, \mathcal{I}(\theta)\, d$.
Hint: Start with the Taylor Series expansion of DKL (pθ ||pθ̃ ) where θ is a constant and θ̃
is a variable. Later set θ̃ = θ + d. Recall that the Taylor Series allows us to approximate a
scalar function f (θ̃) near θ by:
$$f(\tilde{\theta}) \approx f(\theta) + (\tilde{\theta} - \theta)^T \nabla_{\theta'} f(\theta')\big|_{\theta'=\theta} + \frac{1}{2} (\tilde{\theta} - \theta)^T \nabla^2_{\theta'} f(\theta')\big|_{\theta'=\theta} (\tilde{\theta} - \theta)$$
First we note that we can use the Taylor approximation $\ell(\theta + d) \approx \ell(\theta) + d^T \nabla_{\theta'} \ell(\theta')|_{\theta'=\theta}$. Also note that we calculated the Taylor approximation of $D_{KL}(p_\theta \| p_{\theta+d})$ in the previous sub-problem. We shall substitute both these approximations into the above constrained optimization problem.
In order to solve this constrained optimization problem, we employ the method of Lagrange
multipliers. If you are familiar with Lagrange multipliers, you can proceed directly to solve
for d∗ . If you are not familiar with Lagrange multipliers, here is a simplified introduction.
(You may also refer to a slightly more comprehensive introduction in the Convex Opti-
mization section notes, but for the purposes of this problem, the simplified introduction
provided here should suffice).
Consider the following constrained optimization problem
The function f is the objective function and g is the constraint (say, of the form g(d) = c for a constant c). We instead optimize the Lagrangian $\mathcal{L}(d, \lambda)$, which is defined as
$$\mathcal{L}(d, \lambda) = f(d) - \lambda\,(g(d) - c),$$
and look for points at which its gradients vanish:
$$\nabla_d \mathcal{L}(d, \lambda) = 0, \qquad \text{(a)}$$
$$\nabla_\lambda \mathcal{L}(d, \lambda) = 0. \qquad \text{(b)}$$
CS229 Problem Set #3 11
So we have two equations (a and b above) with two unknowns (d and λ), which can sometimes be solved analytically (in our case, they can).
The following steps guide you through solving the constrained optimization problem:
• Construct the Lagrangian for the constrained optimization problem (1) with the Taylor
approximations substituted in for both the objective and the constraint.
• Then construct the system of linear equations (like (a) and (b)) from the Lagrangian
you obtained.
• From (a), come up with an expression for d that involves λ.
At this stage we have already found the “direction” of the natural gradient d, since
λ is only a positive scaling constant. For most practical purposes, the solution we
obtain here is sufficient. This is because we almost always include a learning rate
hyperparameter in our optimization algorithms, or perform some kind of a line search
for algorithmic stability. This can make the exact calculation of λ less critical. Let's call this expression $\tilde{d}$ (involving λ) the unscaled natural gradient. Clearly state what $\tilde{d}$ is as a function of λ.
The remaining steps are to figure out the value of the scaling constant λ along the
direction of d, for completeness.
• Plug that expression for d into (b). Now we have an equation that has λ but not d.
Come up with an expression for λ that does not include d.
• Plug that expression for λ (without d) back into (a). Now we have an equation that
has d but not λ. Come up with an expression for d that does not include λ.
The expression of d obtained this way will be the desired natural gradient update d∗ . Clearly
state and highlight your final expression for d∗ . This expression cannot include λ.
(f) [2 points] Relation to Newton’s Method
After going through all these steps to calculate the natural gradient, you might wonder if
this is something used in practice. We will now see that the familiar Newton’s method that
we studied earlier, when applied to Generalized Linear Models, is equivalent to natural
gradient on Generalized Linear Models. While the two methods (Newton's and natural
gradient) agree on GLMs, in general they need not be equivalent.
Show that the direction of update of Newton’s method, and the direction of natural gradient,
are exactly the same for Generalized Linear Models. You may want to recall and cite the
results you derived in problem set 1 question 4 (Convexity of GLMs). For the natural gradient, it is sufficient to use $\tilde{d}$, the unscaled natural gradient.
$$\ell_{\text{unsup}}(\theta) = \sum_{i=1}^{m} \log p(x^{(i)}; \theta) = \sum_{i=1}^{m} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$$
E-step (semi-supervised)

M-step (semi-supervised)
$$\theta^{(t+1)} := \arg\max_{\theta} \left[ \left( \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i^{(t)}(z^{(i)})} \right) + \alpha \left( \sum_{i=1}^{\tilde{m}} \log p(\tilde{x}^{(i)}, \tilde{z}^{(i)}; \theta) \right) \right]$$
(a) [5 points] Convergence. First we will show that this algorithm eventually converges. In
order to prove this, it is sufficient to show that our semi-supervised objective `semi-sup (θ)
monotonically increases with each iteration of E and M step. Specifically, let θ(t) be the
parameters obtained at the end of t EM-steps. Show that `semi-sup (θ(t+1) ) ≥ `semi-sup (θ(t) ).
Semi-supervised GMM
Now we will revisit the Gaussian Mixture Model (GMM), to apply our semi-supervised EM al-
gorithm. Let us consider a scenario where data is generated from k ∈ N Gaussian distributions,
with unknown means µj ∈ Rd and covariances Σj ∈ Sd+ where j ∈ {1, . . . , k}. We have m data
points x(i) ∈ Rd , i ∈ {1, . . . , m}, and each data point has a corresponding latent (hidden/un-
known) variable z (i) ∈ {1, . . . , k} indicating which distribution x(i) belongs to. Specifically,
$z^{(i)} \sim \text{Multinomial}(\phi)$, such that $\sum_{j=1}^{k} \phi_j = 1$ and $\phi_j \ge 0$ for all j, and $x^{(i)} | z^{(i)} \sim \mathcal{N}(\mu_{z^{(i)}}, \Sigma_{z^{(i)}})$ i.i.d. So, µ, Σ, and φ are the model parameters.
We also have an additional m̃ data points x̃(i) ∈ Rd , i ∈ {1, . . . , m̃}, and an associated observed
variable z̃ (i) ∈ {1, . . . , k} indicating the distribution x̃(i) belongs to. Note that the z̃ (i) are known
constants (in contrast to z (i) which are unknown random variables). As before, we assume
x̃(i) |z̃ (i) ∼ N (µz̃(i) , Σz̃(i) ) i.i.d.
In summary we have m + m̃ examples, of which m are unlabelled data points x’s with unobserved
z’s, and m̃ are labelled data points x̃(i) with corresponding observed labels z̃ (i) . The traditional
EM algorithm is designed to take only the m unlabelled examples as input, and learn the model
parameters µ, Σ, and φ.
Our task now will be to apply the semi-supervised EM algorithm to GMMs in order to leverage
the additional m̃ labelled examples, and come up with semi-supervised E-step and M-step update
rules specific to GMMs. Whenever required, you can cite the lecture notes for derivations and
steps.
(b) [5 points] Semi-supervised E-Step. Clearly state which are all the latent variables that
need to be re-estimated in the E-step. Derive the E-step to re-estimate all the stated
latent variables. Your final E-step expression must only involve x, z, µ, Σ, φ and universal
constants.
(c) [5 points] Semi-supervised M-Step. Clearly state which are all the parameters that
need to be re-estimated in the M-step. Derive the M-step to re-estimate all the stated
parameters. Specifically, derive closed form expressions for the parameter update rules for
µ(t+1) , Σ(t+1) and φ(t+1) based on the semi-supervised objective.
(d) [5 points] [Coding Problem] Classical (Unsupervised) EM Implementation. For
this sub-question, we are only going to consider the m unlabelled examples. Follow the
instructions in src/p04_gmm.py to implement the traditional EM algorithm, and run it on
the unlabelled data-set until convergence.
Run three trials and use the provided plotting function to construct a scatter plot of the
resulting assignments to clusters (one plot for each trial). Your plot should indicate clus-
ter assignments with colors they got assigned to (i.e., the cluster which had the highest
probability in the final E-step).
Note: You only need to submit the three plots in your write-up. Your code will not be
autograded.
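For reference, one EM iteration for a k-component GMM might look like the following sketch (illustrative names; the starter code's structure will differ):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate normal density evaluated at each row of x."""
    d = mu.shape[0]
    diff = x - mu
    quad = np.sum(diff @ np.linalg.inv(sigma) * diff, axis=1)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm

def em_step(x, phi, mu, sigma):
    """x: (m, d); phi: (k,); mu, sigma: lists of k means/covariances."""
    k = phi.shape[0]
    # E-step: responsibilities w[i, j] = p(z^(i) = j | x^(i); current params).
    w = np.stack([phi[j] * gaussian_pdf(x, mu[j], sigma[j])
                  for j in range(k)], axis=1)
    w /= w.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted maximum likelihood updates.
    phi = w.mean(axis=0)
    mu = [w[:, j] @ x / w[:, j].sum() for j in range(k)]
    sigma = [((x - mu[j]).T * w[:, j]) @ (x - mu[j]) / w[:, j].sum()
             for j in range(k)]
    return phi, mu, sigma
```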
(e) [7 points] [Coding Problem] Semi-supervised EM Implementation. Now we will
consider both the labelled and unlabelled examples (a total of m + m̃), with 5 labelled
examples per cluster. We have provided starter code for splitting the dataset into a matrix x of unlabelled examples and a matrix x_tilde of labelled examples. Add to your code in src/p04_gmm.py to implement the modified EM algorithm, and run it on the dataset until
convergence.
Create a plot for each trial, as done in the previous sub-question.
Note: You only need to submit the three plots in your write-up. Your code will not be
autograded.
(f) [3 points] Comparison of Unsupervised and Semi-supervised EM. Briefly describe
the differences you saw in unsupervised vs. semi-supervised EM for each of the following:
i. Number of iterations taken to converge.
ii. Stability (i.e., how much did assignments change with different random initializations?)
iii. Overall quality of assignments.
Note: The dataset was sampled from a mixture of three low-variance Gaussian distribu-
tions, and a fourth, high-variance Gaussian distribution. This should be useful in deter-
mining the overall quality of the assignments that were found by the two algorithms.
(a) [15 points] [Coding Problem] K-Means Compression Implementation. From the
data directory, open an interactive Python prompt, and type
from matplotlib.image import imread; import matplotlib.pyplot as plt;
and run A = imread(’peppers-large.tiff’). Now, A is a “three dimensional matrix,”
and A[:,:,0], A[:,:,1] and A[:,:,2] are 512x512 arrays that respectively contain the
red, green, and blue values for each pixel. Enter plt.imshow(A); plt.show() to display
the image.
Since the large image has 262144 pixels and would take a while to cluster, we will instead run
vector quantization on a smaller image. Repeat (a) with peppers-small.tiff. Treating
each pixel’s (r, g, b) values as an element of R3 , run K-means with 16 clusters on the pixel
data from this smaller image, iterating (preferably) to convergence, but in no case for less
than 30 iterations. For initialization, set each cluster centroid to the (r, g, b)-values of a
randomly chosen pixel in the image.
Take the matrix A from peppers-large.tiff, and replace each pixel’s (r, g, b) values with
the value of the closest cluster centroid. Display the new image, and compare it visually
to the original image. Include in your write-up all your code and a copy of your
compressed image.
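A minimal numpy sketch of the clustering loop on flattened pixel data (illustrative; you may structure your submitted code differently):

```python
import numpy as np

def kmeans_quantize(pixels, k=16, n_iters=30, seed=0):
    """pixels: (num_pixels, 3) float array of (r, g, b) values.

    Returns the k centroids and each pixel's assigned centroid index.
    """
    rng = np.random.default_rng(seed)
    centroids = pixels[rng.choice(len(pixels), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Assign each pixel to the nearest centroid (squared distance).
        d2 = ((pixels[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of the pixels assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = pixels[labels == j].mean(axis=0)
    return centroids, labels
```

The large image can then be compressed by mapping each of its pixels to the nearest of the 16 centroids found on the small image.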
(b) [5 points] Compression Factor. If we represent the image with these reduced (16) colors,
by (approximately) what factor have we compressed the image?
CS229 Problem Set #4
Notes: (1) These questions require thought, but do not require long answers. Please be as
concise as possible. (2) If you have a question about this homework, we encourage you to post
your question on our Piazza forum, at http://piazza.com/stanford/fall2018/cs229. (3) If
you missed the first lecture or are unfamiliar with the collaboration or honor code policy, please
read the policy on Handout #1 (available from the course website) before starting work. (4)
For the coding problems, you may not use any libraries except those defined in the provided
environment.yml file. In particular, ML-specific libraries such as scikit-learn are not permitted.
(5) To account for late days, the due date listed on Gradescope is Dec 08 at 11:59 pm. If you
submit after Dec 05, you will begin consuming your late days. If you wish to submit on time,
submit before Dec 05 at 11:59 pm.
All students must submit an electronic PDF version of the written questions. We highly recommend typesetting your solutions via LaTeX. If you are scanning your document by cell phone, please check the Piazza forum for recommended scanning apps and best practices. All students must also submit a zip file of their source code to Gradescope, which should be created using the make_zip.py script. In order to pass the auto-grader tests, you should make sure to (1) restrict yourself to only using libraries included in the environment.yml file, and (2) make sure your code runs without errors when running p01_nn.py, p04_ica.py and p06_cartpole.py. Your
submission will be evaluated by the auto-grader using a private test set.
The data for this problem can be found in the data folder as images_train.csv, images_test.csv, labels_train.csv and labels_test.csv.
The code for this assignment can be found within p01_nn.py within the src folder.
The starter code splits the set of 60,000 training images and labels into a training set of 59,600 examples and a dev set of 400 examples.
To start, you will implement a simple convolutional neural network and cross entropy loss, and
train it with the provided data set.
The architecture is as follows:
(a) The first layer is a convolutional layer with 2 output channels with a convolution size of 4
by 4.
(b) The second layer is a max pooling layer of stride and width 5 by 5.
(c) The third layer is a ReLU activation layer.
(d) After the fourth layer, the data is flattened into a single dimension.
(e) The fifth layer is a single linear layer with output size 10 (the number of classes).
(f) The sixth layer is a softmax layer that computes the probabilities for each class.
(g) Finally, we use a cross entropy loss as our loss function.
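As a sanity check on this architecture, the following sketch traces the tensor shapes for a single 28 × 28 MNIST input, assuming a stride-1 convolution with no padding and non-overlapping 5 × 5 pooling; the exact conventions are fixed by the provided forward functions, so treat these assumptions as illustrative.

    # Shape walkthrough for one 28x28 grayscale input (assumed conventions).
    H = W = 28
    conv_size, out_channels = 4, 2
    H, W = H - conv_size + 1, W - conv_size + 1   # conv:    2 x 25 x 25
    pool = 5
    H, W = H // pool, W // pool                   # maxpool: 2 x 5 x 5
    flat = out_channels * H * W                   # flatten: 50
    num_classes = 10                              # linear + softmax: 10
    print(flat, num_classes)                      # 50 10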
We have provided all of the forward functions for these different layers so there is an unambiguous
definition of them in the code. Your job in this assignment will be to implement functions that
compute the gradients for these layers. However, here is some additional text that might be
helpful in understanding the forward functions.
Convolutional layers appeared on the exam, but as a review: a 2d convolution slides a small
learned filter over the input and, at each spatial position, outputs the inner product of the
filter with the corresponding input patch (summed over the input channels, with one such filter
per output channel). The max pooling layer downsamples by taking the maximum over each pooling
window:

output[channel, x, y] = max_{di, dj} input[channel, x * pool_width + di, y * pool_height + dj]
The ReLU (rectified linear unit) is our activation function. The ReLU is simply max(0, x) where
x is the input.
We use cross entropy loss as our loss function. Recall that for a single example (x, y), the cross
entropy loss is:
CE(y, ŷ) = − Σ_{k=1}^{K} y_k log ŷ_k,
where ŷ ∈ RK is the vector of softmax outputs from the model for the training example x, and
y ∈ RK is the ground-truth vector for the training example x such that y = [0, . . . , 0, 1, 0, . . . , 0]^T
contains a single 1 at the position of the correct class (also called a “one-hot” representation).
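A direct numpy transcription of this loss for a single example may help; the names below are illustrative rather than the starter-code signatures.

    import numpy as np

    def softmax(z):
        z = z - z.max()                 # shift for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def cross_entropy(y, y_hat):
        # CE(y, y_hat) = -sum_k y_k log(y_hat_k); with one-hot y this is just
        # -log of the probability assigned to the correct class.
        return -np.sum(y * np.log(y_hat))

    logits = np.array([2.0, 0.5, -1.0])
    y = np.array([1.0, 0.0, 0.0])       # one-hot label for class 0
    print(cross_entropy(y, softmax(logits)))  # ≈ 0.241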
We are also doing mini-batch gradient descent with a batch size of 16. Normally we would
iterate over the data multiple times for multiple epochs, but for this assignment we only do
400 batches to save time.
(a) [20 points] Implement the following functions within p01 nn.py. We recommend that you
start at the top of the list and work your way down:
i. backward softmax
ii. backward relu
iii. backward cross entropy loss
iv. backward linear
v. backward convolution
vi. backward max pool
(b) [10 points] Now implement a function that computes the full backward pass.
i. backward prop
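To give a feel for what these functions look like, here is a sketch of two of the simpler backward rules with hypothetical signatures (the starter code fixes the real ones, including any bias terms). backward prop then applies the per-layer backward functions in the reverse of the forward order, threading the gradient of the loss through each layer and collecting the parameter gradients.

    import numpy as np

    def backward_relu(x, grad_output):
        # ReLU passes gradient through only where its forward input was positive.
        return grad_output * (x > 0)

    def backward_linear(W, x, grad_output):
        # For output = W @ x: gradients w.r.t. the weights and w.r.t. the input.
        grad_W = np.outer(grad_output, x)
        grad_x = W.T @ grad_output
        return grad_W, grad_x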
E_{s∼p(s), a∼π1(s,a)}[R(s, a)]

E_{s∼p(s), a∼π1(s,a)}[R(s, a)] = Σ_{(s,a)} R(s, a) p(s, a)
= Σ_{(s,a)} R(s, a) p(s) p(a|s)
= Σ_{(s,a)} R(s, a) p(s) π1(s, a)
Unfortunately, we cannot estimate this directly as we only have samples created under policy π0
and not π1 . For this problem, we will be looking at formulas that approximate this value using
expectations under π0 that we can actually estimate.
We will make one additional assumption that each action has a non-zero probability in the
observed policy π0 (s, a). In other words, for all actions a and states s, π0 (s, a) > 0.
Regression: The simplest possible estimator is to directly use our learned MDP parameters to
estimate our goal. This is usually called the regression estimator. While training our MDP, we
learn an estimator R̂(s, a) that estimates R(s, a). We can now directly estimate

E_{s∼p(s), a∼π1(s,a)}[R(s, a)]

with

E_{s∼p(s), a∼π1(s,a)}[R̂(s, a)].
(a) [2 points] Importance Sampling: One commonly used estimator is known as the impor-
tance sampling estimator. Let π̂0 be an estimate of the true π0 . The importance sampling
estimator uses that π̂0 and has the form:
E_{s∼p(s), a∼π0(s,a)}[(π1(s, a)/π̂0(s, a)) R(s, a)]

Please show that if π̂0 = π0, then the importance sampling estimator is equal to

E_{s∼p(s), a∼π1(s,a)}[R(s, a)].
Note that this estimator only requires us to model π0 as we have the R(s, a) values for the
items in the observational data.
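As a quick numeric illustration (not a proof), consider a toy problem with a single state and two actions, taking π̂0 = π0; all names here are made up for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    R = np.array([1.0, 5.0])          # reward for each action
    pi0 = np.array([0.8, 0.2])        # behavior policy that generated the data
    pi1 = np.array([0.1, 0.9])        # target policy we want to evaluate

    a = rng.choice(2, size=100_000, p=pi0)    # actions observed under pi0
    is_est = np.mean(pi1[a] / pi0[a] * R[a])  # importance sampling estimate
    print(is_est, np.sum(pi1 * R))            # both approximately 4.6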
(b) [2 points] Weighted Importance Sampling: One variant of the importance sampling es-
timator is known as the weighted importance sampling estimator. The weighted importance
sampling estimator has the form:
E_{s∼p(s), a∼π0(s,a)}[(π1(s, a)/π̂0(s, a)) R(s, a)] / E_{s∼p(s), a∼π0(s,a)}[π1(s, a)/π̂0(s, a)]

Please show that if π̂0 = π0, then the weighted importance sampling estimator is equal to

E_{s∼p(s), a∼π1(s,a)}[R(s, a)].
(c) [2 points] One issue with the weighted importance sampling estimator is that it can be
biased in many finite sample situations. In finite samples, we replace the expected value
with a sum over the seen values in our observational dataset. Please show that the weighted
importance sampling estimator is biased in these situations.
Hint: Consider the case where there is only a single data element in your observational
dataset.
(d) [7 points] Doubly Robust: One final commonly used estimator is the doubly robust
estimator. The doubly robust estimator has the form:
E_{s∼p(s), a∼π0(s,a)}[(E_{a∼π1(s,a)} R̂(s, a)) + (π1(s, a)/π̂0(s, a))(R(s, a) − R̂(s, a))]
One advantage of the doubly robust estimator is that it works if either π̂0 = π0 or R̂(s, a) = R(s, a).
i. [4 points] Please show that the doubly robust estimator is equal to E_{s∼p(s), a∼π1(s,a)}[R(s, a)] when π̂0 = π0.
ii. [3 points] Please show that the doubly robust estimator is equal to E_{s∼p(s), a∼π1(s,a)}[R(s, a)] when R̂(s, a) = R(s, a).
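The first property is easy to see numerically on the same kind of toy problem: below, the reward model R̂ is deliberately wrong but π̂0 = π0, and the doubly robust estimate still lands on the target value (names illustrative).

    import numpy as np

    rng = np.random.default_rng(0)
    R = np.array([1.0, 5.0])
    R_hat = np.array([0.0, 3.0])      # deliberately biased reward model
    pi0 = np.array([0.8, 0.2])
    pi1 = np.array([0.1, 0.9])

    a = rng.choice(2, size=100_000, p=pi0)
    baseline = np.sum(pi1 * R_hat)                    # E_{a~pi1} R_hat(s, a)
    correction = pi1[a] / pi0[a] * (R[a] - R_hat[a])  # importance-weighted residual
    print(np.mean(baseline + correction), np.sum(pi1 * R))  # both ≈ 4.6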
(e) [2 points] We will now consider several situations where you might have a choice between
the importance sampling estimator and the regression estimator. Please state whether the
importance sampling estimator or the regression estimator would probably work best in
each situation and explain why it would work better. In all of these situations, your states
s consist of patients, your actions a represent the drugs to give to certain patients and your
R(s, a) is the lifespan of the patient after receiving the drug.
i. [1 point] Drugs are randomly assigned to patients, but the interaction between the
drug, patient, and lifespan is very complicated.
ii. [1 point] Drugs are assigned to patients in a very complicated manner, but the inter-
action between the drug, patient, and lifespan is very simple.
Show that the unit-length vector u that minimizes the mean squared error between projected
points and original points corresponds to the first principal component for the data. I.e., show
that
arg min_{u : u^T u = 1} Σ_{i=1}^{m} ||x(i) − f_u(x(i))||_2^2 .
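Although the problem asks for a proof, a quick numerical sanity check is easy: among unit vectors, the top eigenvector of the empirical covariance should give the smallest reconstruction error. The sketch below assumes zero-mean data and that f_u(x) = (u^T x)u denotes projection onto u, consistent with the setup above.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])  # anisotropic data
    X = X - X.mean(axis=0)

    def recon_error(u):
        # Squared error between each point and its projection f_u(x) = (u^T x) u.
        proj = (X @ u)[:, None] * u[None, :]
        return np.sum((X - proj) ** 2)

    evals, evecs = np.linalg.eigh(X.T @ X / len(X))  # eigenvalues in ascending order
    u_pc1 = evecs[:, -1]                             # first principal component
    others = [v / np.linalg.norm(v) for v in rng.normal(size=(5, 2))]
    print(recon_error(u_pc1) <= min(recon_error(u) for u in others))  # True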
where g is the cumulative distribution function, and g′ is the probability density function of
the source distribution (in this sub-question it is a standard normal distribution). Whereas
in the notes we derive an update rule to train W iteratively, for the case of Gaussian-
distributed sources we can analytically reason about the resulting W .
Try to derive a closed form expression for W in terms of X when g is the standard normal
CDF. Deduce the relation between W and X in the simplest terms, and highlight the
ambiguity (in terms of rotational invariance) in computing W .
(b) [10 points] Laplace source.
For this sub-question, we assume sources are distributed according to a standard Laplace
distribution, i.e., si ∼ L(0, 1). The Laplace distribution L(0, 1) has PDF f_L(s) = (1/2) exp(−|s|).
With this assumption, derive the update rule for a single example in the form
W := W + α (. . .) .
(c) [5 points] Cocktail Party Problem
For this question you will implement the Bell and Sejnowski ICA algorithm, but assuming
a Laplace source (as derived in part (b)), instead of the logistic distribution covered in class.
The file mix.dat contains the input data which consists of a matrix with 5 columns, with
each column corresponding to one of the mixed signals xi . The code for this question can
be found in p04 ica.py.
Implement the update W and unmix functions in p04 ica.py.
You can then run p04 ica.py in order to split the mixed audio into its components. The
mixed audio tracks are written to mixed i.wav in the output folder. The split audio tracks
are written to split i.wav in the output folder.
To make sure your code is correct, you should listen to the resulting unmixed sources.
(Some overlap or noise in the sources may be present, but the different sources should be
pretty clearly separated.)
If your implementation is correct, your output split 0.wav should sound similar to the file
correct split 0.wav included with the source code.
Note: In our implementation, we anneal the learning rate α (slowly decrease it over
time) to speed up learning. In addition to using the variable learning rate to speed up
convergence, we also choose a random permutation of the training data and run stochastic
gradient ascent visiting the training data in that order (each of the specified learning rates
is then used for one full pass through the data).
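Schematically, that training schedule looks like the skeleton below, where grad_fn stands for your part (b) update and the specific learning rates are placeholders; the real structure and constants live in p04 ica.py.

    import numpy as np

    def train_ica(X, grad_fn, alphas=(0.1, 0.05, 0.02, 0.01, 0.005, 0.002, 0.001)):
        # X: n x d matrix of mixed signals; grad_fn(W, x) returns the update term.
        n, d = X.shape
        W = np.eye(d)
        rng = np.random.default_rng(0)
        for alpha in alphas:                  # annealed learning rates
            for i in rng.permutation(n):      # one stochastic pass in random order
                W = W + alpha * grad_fn(W, X[i])
        return W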
(a) [10 points] Prove that, for any two finite-valued vectors V1, V2, it holds true that

||B(V1) − B(V2)||∞ ≤ γ ||V1 − V2||∞,

where

||V||∞ = max_{s∈S} |V(s)|.
(This shows that the Bellman update operator is a “γ-contraction in the max-norm.”)
(b) [5 points] We say that V is a fixed point of B if B(V ) = V . Using the fact that the
Bellman update operator is a γ-contraction in the max-norm, prove that B has at most
one fixed point—i.e., that there is at most one solution to the Bellman equations. You may
assume that B has at least one fixed point.
Remark: The result you proved in part (a) implies that value iteration converges geometrically
to the optimal value function V∗. That is, each iteration shrinks the max-norm distance between
V and V∗ by a factor of at least γ, so after k iterations that distance is at most γ^k times its
initial value.
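The contraction is easy to observe numerically. The sketch below assumes the standard Bellman optimality update B(V)(s) = R(s) + γ max_a Σ_{s′} P(s′|s, a) V(s′) on a randomly generated MDP; the setup is illustrative, not part of the assignment.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, gamma = 10, 2, 0.9
    P = rng.random((n_actions, n_states, n_states))
    P /= P.sum(axis=2, keepdims=True)        # P[a, s, s'] = P(s' | s, a)
    R = rng.random(n_states)

    def B(V):
        # B(V)(s) = R(s) + gamma * max_a sum_{s'} P(s'|s, a) V(s')
        return R + gamma * (P @ V).max(axis=0)

    V1, V2 = rng.random(n_states) * 10, rng.random(n_states) * 10
    print(np.abs(B(V1) - B(V2)).max() <= gamma * np.abs(V1 - V2).max() + 1e-12)  # True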
We have written a simple simulator for this problem. The simulation proceeds in discrete time
cycles (steps). The state of the cart and pole at any time is completely characterized by 4
parameters: the cart position x, the cart velocity ẋ, the angle of the pole θ measured as its
deviation from the vertical position, and the angular velocity of the pole θ̇. Since it would
be simpler to consider reinforcement learning in a discrete state space, we have approximated
the state space by a discretization that maps a state vector (x, ẋ, θ, θ̇) into a number from 0 to
NUM STATES-1. Your learning algorithm will need to deal only with this discretized representation
of the states.
At every time step, the controller must choose one of two actions: push (accelerate) the cart
right, or push the cart left. (To keep the problem simple, there is no do-nothing action.) These
are represented as actions 0 and 1 respectively in the code. When the action choice is made, the
simulator updates the state parameters according to the underlying dynamics, and provides a
new discretized state.
We will assume that the reward R(s) is a function of the current state only. When the pole
angle goes beyond a certain limit or when the cart goes too far out, a negative reward is given,
and the system is reinitialized randomly. At all other times, the reward is zero. Your program
must learn to balance the pole using only the state transitions and rewards observed.
The files for this problem are in the src directory. Most of the code has already been written
for you, and you need to make changes only to p06 cartpole.py in the places specified. This
file can be run to show a display and to plot a learning curve at the end. Read the comments at
the top of the file for more details on the working of the simulation.
The dynamics are adapted from http://www-anw.cs.umass.edu/rlr/domains.html
To solve the inverted pendulum problem, you will estimate a model (i.e., transition probabilities
and rewards) for the underlying MDP, solve Bellman’s equations for this estimated MDP to
obtain a value function, and act greedily with respect to this value function.
Briefly, you will maintain a current model of the MDP and a current estimate of the value func-
tion. Initially, each state has estimated reward zero, and the estimated transition probabilities
are uniform (equally likely to end up in any other state).
During the simulation, you must choose actions at each time step according to some current
policy. As the program goes along taking actions, it will gather observations on transitions and
rewards, which it can use to get a better estimate of the MDP model. Since it is inefficient to
update the whole estimated MDP after every observation, we will store the state transitions and
reward observations each time, and update the model and value function/policy only periodically.
Thus, you must maintain counts of the total number of times the transition from state si to state
sj using action a has been observed (similarly for the rewards). Note that the rewards at any
state are deterministic, but the state transitions are not because of the discretization of the state
space (several different but close configurations may map onto the same discretized state).
Each time a failure occurs (such as if the pole falls over), you should re-estimate the transition
probabilities and rewards as the average of the observed values (if any). Your program must then
use value iteration to solve Bellman’s equations on the estimated MDP, to get the value function
and new optimal policy for the new model. For value iteration, use a convergence criterion that
declares convergence once the maximum absolute change in the value function over an iteration
falls below some specified tolerance.
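For reference, value iteration with exactly this stopping rule looks like the sketch below, run here on an illustrative random model; in your implementation, the transition probabilities and rewards come from the counts and averages described above, and the Bellman update shown assumes the standard optimality form.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, gamma, tol = 10, 2, 0.995, 1e-6
    P = rng.random((n_actions, n_states, n_states))
    P /= P.sum(axis=2, keepdims=True)         # stand-in estimated transitions
    R = rng.random(n_states)                  # stand-in estimated rewards

    V = np.zeros(n_states)
    while True:
        V_new = R + gamma * (P @ V).max(axis=0)
        converged = np.abs(V_new - V).max() < tol  # max change below tolerance
        V = V_new
        if converged:
            break
    policy = (P @ V).argmax(axis=0)           # act greedily w.r.t. V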
Finally, assume that the whole learning procedure has converged once several consecutive at-
tempts (defined by the parameter NO LEARNING THRESHOLD) to solve Bellman’s equation all con-
verge in the first iteration. Intuitively, this indicates that the estimated model has stopped
changing significantly.
The code outline for this problem is already in p06 cartpole.py, and you need to write code
fragments only at the places specified in the file. There are several details (convergence criteria
etc.) that are also explained inside the code. Use a discount factor of γ = 0.995.
Implement the reinforcement learning algorithm as specified, and run it.
• How many trials (how many times did the pole fall over or the cart fall off) did it take before
the algorithm converged? Hint: if your solution is correct, on the plot the red line indicating
smoothed log num steps to failure should start to flatten out at about 60 iterations.
• Plot a learning curve showing the number of time-steps for which the pole was balanced
on each trial. The Python starter code already includes the plotting code. Include the
plot in your submission.
• Find the line of code that says np.random.seed, and rerun the code with the seed set to 1,
2, and 3. What do you observe? What does this imply about the algorithm?