

CS 229, Fall 2018


Problem Set #0: Linear Algebra and Multivariable
Calculus

Notes: (1) These questions require thought, but do not require long answers. Please be as
concise as possible. (2) If you have a question about this homework, we encourage you to post
your question on our Piazza forum, at https://piazza.com/stanford/fall2018/cs229. (3) If
you missed the first lecture or are unfamiliar with the collaboration or honor code policy, please
read the policy on Handout #1 (available from the course website) before starting work. (4)
This specific homework is not graded, but we encourage you to solve each of the problems to
brush up on your linear algebra. Some of them may even be useful for subsequent problem sets.
It also serves as your introduction to using Gradescope for submissions.

1. [0 points] Gradients and Hessians


Recall that a matrix A ∈ Rn×n is symmetric if AT = A, that is, Aij = Aji for all i, j. Also
recall the gradient ∇f (x) of a function f : Rn → R, which is the n-vector of partial derivatives
 ∂   
∂x1 f (x) x1
.
.. . 
∇f (x) =   where x =  ..  .
  

∂xn f (x)
xn

The Hessian ∇2 f(x) of a function f : Rn → R is the n × n symmetric matrix of twice partial
derivatives,

    ∇2 f(x) =
        [ ∂^2 f(x)/∂x_1^2       ∂^2 f(x)/∂x_1 ∂x_2    · · ·   ∂^2 f(x)/∂x_1 ∂x_n ]
        [ ∂^2 f(x)/∂x_2 ∂x_1    ∂^2 f(x)/∂x_2^2       · · ·   ∂^2 f(x)/∂x_2 ∂x_n ]
        [        ...                    ...             .            ...         ]
        [ ∂^2 f(x)/∂x_n ∂x_1    ∂^2 f(x)/∂x_n ∂x_2    · · ·   ∂^2 f(x)/∂x_n^2    ]

(a) Let f(x) = (1/2) xT Ax + bT x, where A is a symmetric matrix and b ∈ Rn is a vector. What
is ∇f(x)?
(b) Let f (x) = g(h(x)), where g : R → R is differentiable and h : Rn → R is differentiable.
What is ∇f (x)?
(c) Let f(x) = (1/2) xT Ax + bT x, where A is symmetric and b ∈ Rn is a vector. What is ∇2 f(x)?
(d) Let f (x) = g(aT x), where g : R → R is continuously differentiable and a ∈ Rn is a vector.
What are ∇f(x) and ∇2 f(x)? (Hint: your expression for ∇2 f(x) may have as few as 11
symbols, including ′ and parentheses.)
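Note (illustrative, not required for the problem): one quick way to sanity-check a closed-form gradient against your derivation is to compare it with finite differences in numpy. The snippet below uses a randomly generated symmetric A and vector b purely as a hypothetical example.

    import numpy as np

    def numerical_gradient(f, x, eps=1e-6):
        # Central-difference approximation of the gradient of f at x.
        grad = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e[i] = eps
            grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
        return grad

    n = 4
    rng = np.random.default_rng(0)
    M = rng.standard_normal((n, n))
    A = (M + M.T) / 2                      # a random symmetric matrix
    b = rng.standard_normal(n)
    f = lambda x: 0.5 * x @ A @ x + b @ x  # the f from part (a)

    x0 = rng.standard_normal(n)
    print(numerical_gradient(f, x0))       # compare against your closed-form answer to (a)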

2. [0 points] Positive definite matrices


A matrix A ∈ Rn×n is positive semi-definite (PSD), denoted A ⪰ 0, if A = AT and xT Ax ≥ 0
for all x ∈ Rn . A matrix A is positive definite, denoted A ≻ 0, if A = AT and xT Ax > 0 for
all x ≠ 0, that is, all non-zero vectors x. The simplest example of a positive definite matrix is
the identity I (the diagonal matrix with 1s on the diagonal and 0s elsewhere), which satisfies

    xT I x = ||x||_2^2 = Σ_{i=1}^n x_i^2 .
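As an informal numerical illustration of these definitions (not a proof), the quadratic form for the identity and the eigenvalue test for positive semidefiniteness can be checked with numpy; recall that a symmetric matrix is PSD exactly when all of its eigenvalues are nonnegative.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    x = rng.standard_normal(n)

    # The identity satisfies x^T I x = ||x||_2^2.
    print(np.isclose(x @ np.eye(n) @ x, np.sum(x ** 2)))      # True

    def is_psd(A, tol=1e-10):
        # A symmetric matrix is PSD iff its eigenvalues are all >= 0 (up to round-off).
        return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

    M = rng.standard_normal((n, n))
    print(is_psd(M @ M.T))                                     # True: M M^T is PSD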

(a) Let z ∈ Rn be an n-vector. Show that A = zz T is positive semidefinite.


(b) Let z ∈ Rn be a non-zero n-vector. Let A = zz T . What is the null-space of A? What is
the rank of A?
(c) Let A ∈ Rn×n be positive semidefinite and B ∈ Rm×n be arbitrary, where m, n ∈ N. Is
BAB T PSD? If so, prove it. If not, give a counterexample with explicit A, B.

3. [0 points] Eigenvectors, eigenvalues, and the spectral theorem


The eigenvalues of an n × n matrix A ∈ Rn×n are the roots of the characteristic polynomial
pA(λ) = det(λI − A), which may (in general) be complex. They are also defined as the
values λ ∈ C for which there exists a vector x ∈ Cn such that Ax = λx. We call such a pair
(x, λ) an eigenvector, eigenvalue pair. In this question, we use the notation diag(λ1 , . . . , λn )
to denote the diagonal matrix with diagonal entries λ1 , . . . , λn , that is,
 
    diag(λ_1 , . . . , λ_n) =
        [ λ_1   0     0    · · ·   0   ]
        [  0   λ_2    0    · · ·   0   ]
        [  0    0    λ_3   · · ·   0   ]
        [  .    .     .      .     .   ]
        [  0    0     0    · · ·  λ_n  ]

(a) Suppose that the matrix A ∈ Rn×n is diagonalizable, that is, A = T ΛT −1 for an invertible
matrix T ∈ Rn×n , where Λ = diag(λ1 , . . . , λn ) is diagonal. Use the notation t(i) for the
columns of T , so that T = [t(1) · · · t(n) ], where t(i) ∈ Rn . Show that At(i) = λi t(i) , so
that the eigenvalues/eigenvector pairs of A are (t(i) , λi ).

A matrix U ∈ Rn×n is orthogonal if U T U = I. The spectral theorem, perhaps one of the most
important theorems in linear algebra, states that if A ∈ Rn×n is symmetric, that is, A = AT ,
then A is diagonalizable by a real orthogonal matrix. That is, there are a diagonal matrix
Λ ∈ Rn×n and orthogonal matrix U ∈ Rn×n such that U T AU = Λ, or, equivalently,

A = U ΛU T .

Let λi = λi (A) denote the ith eigenvalue of A.

(b) Let A be symmetric. Show that if U = [u(1) · · · u(n) ] is orthogonal, where u(i) ∈
Rn and A = U ΛU T , then u(i) is an eigenvector of A and Au(i) = λi u(i) , where Λ =
diag(λ1 , . . . , λn ).
(c) Show that if A is PSD, then λi (A) ≥ 0 for each i.
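As a purely numerical companion to the spectral theorem (illustrative only), numpy's eigh routine returns exactly such a pair (Λ, U) for a symmetric input, and the decomposition can be verified up to round-off:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    M = rng.standard_normal((n, n))
    A = (M + M.T) / 2                        # a random symmetric matrix

    lam, U = np.linalg.eigh(A)               # eigenvalues (ascending) and orthogonal U
    Lam = np.diag(lam)

    print(np.allclose(U.T @ U, np.eye(n)))                # U is orthogonal
    print(np.allclose(A, U @ Lam @ U.T))                  # A = U Λ U^T
    print(np.allclose(A @ U[:, 0], lam[0] * U[:, 0]))     # columns of U are eigenvectors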

CS 229, Fall 2018


Problem Set #1: Supervised Learning

Due Wednesday, Oct 17 at 11:59 pm on Gradescope.

Notes: (1) These questions require thought, but do not require long answers. Please be as
concise as possible. (2) If you have a question about this homework, we encourage you to post
your question on our Piazza forum, at http://piazza.com/stanford/fall2018/cs229. (3) If
you missed the first lecture or are unfamiliar with the collaboration or honor code policy, please
read the policy on Handout #1 (available from the course website) before starting work. (4)
For the coding problems, you may not use any libraries except those defined in the provided
environment.yml file. In particular, ML-specific libraries such as scikit-learn are not permitted.
(5) To account for late days, the due date listed on Gradescope is Oct 20 at 11:59 pm. If you
submit after Oct 17, you will begin consuming your late days. If you wish to submit on time,
submit before Oct 17 at 11:59 pm.
All students must submit an electronic PDF version of the written questions. We highly recom-
mend typesetting your solutions via LaTeX. If you are scanning your document by cell phone,
please check the Piazza forum for recommended scanning apps and best practices. All students
must also submit a zip file of their source code to Gradescope, which should be created using
the make zip.py script. In order to pass the auto-grader tests, you should make sure to (1)
restrict yourself to only using libraries included in the environment.yml file, and (2) make sure
your code runs without errors using the run.py script. Your submission will be evaluated by
the auto-grader using a private test set.
Honor code: We strongly encourage students to form study groups. Students may discuss and
work on homework problems in groups. However, each student must write down the solutions
independently, and without referring to written notes from the joint session. In other words,
each student must understand the solution well enough in order to reconstruct it by him/herself.
In addition, each student should write on the problem set the set of people with whom s/he
collaborated. Further, because we occasionally reuse problem set questions from previous years,
we expect students not to copy, refer to, or look at the solutions in preparing their answers. It
is an honor code violation to intentionally refer to a previous year’s solutions.

1. [40 points] Linear Classifiers (logistic regression and GDA)


In this problem, we cover two probabilistic linear classifiers we have covered in class so far.
First, a discriminative linear classifier: logistic regression. Second, a generative linear classifier:
Gaussian discriminant analysis (GDA). Both the algorithms find a linear decision boundary that
separates the data into two classes, but make different assumptions. Our goal in this problem is
to get a deeper understanding of the similarities and differences (and, strengths and weaknesses)
of these two algorithms.
For this problem, we will consider two datasets, provided in the following files:
i. data/ds1_{train,valid}.csv
ii. data/ds2_{train,valid}.csv
Each file contains m examples, one example (x(i) , y(i) ) per row. In particular, the i-th row
contains columns x_0^(i) ∈ R, x_1^(i) ∈ R, and y^(i) ∈ {0, 1}. In the subproblems that follow, we
will investigate using logistic regression and Gaussian discriminant analysis (GDA) to perform
binary classification on these two datasets.

(a) [10 points] In lecture we saw the average empirical loss for logistic regression:
    J(θ) = −(1/m) Σ_{i=1}^m [ y^(i) log(h_θ(x^(i))) + (1 − y^(i)) log(1 − h_θ(x^(i))) ],

where y (i) ∈ {0, 1}, hθ (x) = g(θT x) and g(z) = 1/(1 + e−z ).
Find the Hessian H of this function, and show that for any vector z, it holds true that

z T Hz ≥ 0.

Hint: You may want to start by showing that Σ_i Σ_j z_i x_i x_j z_j = (xT z)^2 ≥ 0. Recall also
that g′(z) = g(z)(1 − g(z)).
Remark: This is one of the standard ways of showing that the matrix H is positive semi-
definite, written “H  0.” This implies that J is convex, and has no local minima other
than the global one. If you have some other way of showing H  0, you’re also welcome to
use your method instead of the one above.
(b) [5 points] Coding problem. Follow the instructions in src/p01b logreg.py to train a
logistic regression classifier using Newton's Method. Starting with θ = 0 (the zero vector), run Newton's
Method until the updates to θ are small: specifically, train until the first iteration k such
that ||θ_k − θ_{k−1}||_1 < ε, where ε = 1 × 10^−5. Make sure to write your model's predictions to
the file specified in the code.
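For reference, the Newton iteration described above can be sketched as follows (grad and hess are placeholders for whatever gradient and Hessian of J you derive in part (a); they are not names from the starter code):

    import numpy as np

    def newton(grad, hess, n, eps=1e-5, max_iter=100):
        # Newton's method for minimizing J, starting from theta = 0, with an L1 stopping rule.
        theta = np.zeros(n)
        for _ in range(max_iter):
            theta_new = theta - np.linalg.solve(hess(theta), grad(theta))
            if np.linalg.norm(theta_new - theta, ord=1) < eps:
                return theta_new
            theta = theta_new
        return theta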
(c) [5 points] Recall that in GDA we model the joint distribution of (x, y) by the following
equations:
    p(y) = φ          if y = 1,
    p(y) = 1 − φ      if y = 0,

    p(x | y = 0) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp( −(1/2) (x − µ_0)^T Σ^{−1} (x − µ_0) ),

    p(x | y = 1) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp( −(1/2) (x − µ_1)^T Σ^{−1} (x − µ_1) ),

where φ, µ0 , µ1 , and Σ are the parameters of our model.


Suppose we have already fit φ, µ0 , µ1 , and Σ, and now want to predict y given a new point
x. To show that GDA results in a classifier that has a linear decision boundary, show the
posterior distribution can be written as
    p(y = 1 | x; φ, µ_0 , µ_1 , Σ) = 1 / (1 + exp(−(θ^T x + θ_0))),
where θ ∈ Rn and θ0 ∈ R are appropriate functions of φ, Σ, µ0 , and µ1 .
(d) [7 points] For this part of the problem only, you may assume n (the dimension of x) is 1, so
that Σ = [σ 2 ] is just a real number, and likewise the determinant of Σ is given by |Σ| = σ 2 .
Given the dataset, we claim that the maximum likelihood estimates of the parameters are
given by
    φ   = (1/m) Σ_{i=1}^m 1{y^(i) = 1}

    µ_0 = ( Σ_{i=1}^m 1{y^(i) = 0} x^(i) ) / ( Σ_{i=1}^m 1{y^(i) = 0} )

    µ_1 = ( Σ_{i=1}^m 1{y^(i) = 1} x^(i) ) / ( Σ_{i=1}^m 1{y^(i) = 1} )

    Σ   = (1/m) Σ_{i=1}^m (x^(i) − µ_{y^(i)}) (x^(i) − µ_{y^(i)})^T

The log-likelihood of the data is


    ℓ(φ, µ_0 , µ_1 , Σ) = log Π_{i=1}^m p(x^(i), y^(i); φ, µ_0 , µ_1 , Σ)
                        = log Π_{i=1}^m p(x^(i) | y^(i); µ_0 , µ_1 , Σ) p(y^(i); φ).

By maximizing ` with respect to the four parameters, prove that the maximum likelihood
estimates of φ, µ0 , µ1 , and Σ are indeed as given in the formulas above. (You may assume
that there is at least one positive and one negative example, so that the denominators in
the definitions of µ0 and µ1 above are non-zero.)
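(For reference, the estimates stated above translate directly into numpy; the sketch below is illustrative only, with x an m×n array of inputs and y a length-m array of 0/1 labels — the proof asked for in this part is separate.)

    import numpy as np

    def gda_parameters(x, y):
        # Closed-form estimates of phi, mu_0, mu_1 and Sigma, exactly as stated above.
        m, _ = x.shape
        phi = np.mean(y == 1)
        mu_0 = x[y == 0].mean(axis=0)
        mu_1 = x[y == 1].mean(axis=0)
        centered = x - np.where((y == 1)[:, None], mu_1, mu_0)
        sigma = centered.T @ centered / m
        return phi, mu_0, mu_1, sigma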
(e) [3 points] Coding problem. In src/p01e gda.py, fill in the code to calculate φ, µ0 ,
µ1 , and Σ, use these parameters to derive θ, and use the resulting GDA model to make
predictions on the validation set.
(f) [5 points] For Dataset 1, create a plot of the training data with x1 on the horizontal axis, and
x2 on the vertical axis. To visualize the two classes, use a different symbol for examples x(i)
with y (i) = 0 than for those with y (i) = 1. On the same figure, plot the decision boundary
found by logistic regression in part (b). Make an identical plot with the decision boundary
found by GDA in part (e).
(g) [5 points] Repeat the steps in part (f) for Dataset 2. On which dataset does GDA seem to
perform worse than logistic regression? Why might this be the case?
(h) [3 extra credit points] For the dataset where GDA performed worse in parts (f) and (g),
can you find a transformation of the x(i) ’s such that GDA performs significantly better?
What is this transformation?

2. [30 points] Incomplete, Positive-Only Labels


In this problem we will consider training binary classifiers in situations where we do not have
full access to the labels. In particular, we consider a scenario, which is not too infrequent in real
life, where we have labels only for a subset of the positive examples. All the negative examples
and the rest of the positive examples are unlabelled.
That is, we assume a dataset {(x^(i), t^(i), y^(i))}_{i=1}^m , where t^(i) ∈ {0, 1} is the “true” label, and
where

    y^(i) = 1 if x^(i) is labeled, and y^(i) = 0 otherwise.

All labeled examples are positive, which is to say p(t(i) = 1 | y (i) = 1) = 1, but unlabeled
examples may be positive or negative. Our goal in the problem is to construct a binary classifier
h of the true label t, with only access to the partial labels y. In other words, we want to construct
h such that h(x(i) ) ≈ p(t(i) = 1 | x(i) ) as closely as possible, using only x and y.
Real world example: Suppose we maintain a database of proteins which are involved in transmit-
ting signals across membranes. Every example added to the database is involved in a signaling
process, but there are many proteins involved in cross-membrane signaling which are missing
from the database. It would be useful to train a classifier to identify proteins that should be
added to the database. In our notation, each example x(i) corresponds to a protein, y (i) = 1
if the protein is in the database and 0 otherwise, and t(i) = 1 if the protein is involved in a
cross-membrane signaling process and thus should be added to the database, and 0 otherwise.

(a) [5 points] Suppose that each y (i) and x(i) are conditionally independent given t(i) :

p(y (i) = 1 | t(i) = 1, x(i) ) = p(y (i) = 1 | t(i) = 1).

Note this is equivalent to saying that labeled examples were selected uniformly at random
from the set of positive examples. Prove that the probability of an example being labeled
differs by a constant factor from the probability of an example being positive. That is,
show that p(t(i) = 1 | x(i) ) = p(y (i) = 1 | x(i) )/α for some α ∈ R.
(b) [5 points] Suppose we want to estimate α using a trained classifier h and a held-out validation
set V . Let V+ be the set of labeled (and hence positive) examples in V , given by V+ =
{x(i) ∈ V | y (i) = 1}. Assuming that h(x(i) ) ≈ p(y (i) = 1 | x(i) ) for all examples x(i) , show
that
h(x(i) ) ≈ α for all x(i) ∈ V+ .
You may assume that p(t(i) = 1 | x(i) ) ≈ 1 when x(i) ∈ V+ .
(c) [5 points] Coding problem. The following three problems will deal with a dataset which
we have provided in the following files:
data/ds3_{train,valid,test}.csv
Each file contains the following columns: x1 , x2 , y, and t. As in Problem 1, there is one
example per row.
First we will consider the ideal case, where we have access to the true t-labels for training.
In src/p02cde posonly, write a logistic regression classifier that uses x1 and x2 as input
features, and train it using the t-labels (you can ignore the y-labels for this part). Output
the trained model’s predictions on the test set to the file specified in the code.

(d) [5 points] Coding problem. We now consider the case where the t-labels are unavail-
able, so you only have access to the y-labels at training time. Add to your code in
p02cde posonly.py to re-train the classifier (still using x1 and x2 as input features), but
using the y-labels only.
(e) [10 points] Coding problem. Using the validation set, estimate the constant α by aver-
aging your classifier’s predictions over all labeled examples in the validation set:
    α ≈ (1 / |V_+|) Σ_{x^(i) ∈ V_+} h(x^(i)).

Add code in src/p02cde posonly.py to rescale your classifier’s predictions from part (d)
using the estimated value for α.
Finally, using a threshold of p(t(i) = 1 | x(i) ) = 0.5, make three separate plots with the
decision boundaries from parts (c) - (e) plotted on top of the test set. Plot x1 on the
horizontal axis and x2 on the vertical axis, and use two different symbols for the positive
(t(i) = 1) and negative (t(i) = 0) examples. In each plot, indicate the separating hyperplane
with a red line.
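(In code, the α estimate and the rescaling in this part amount to only a couple of lines; the sketch below uses placeholder names h_valid, y_valid, and p_test for the validation-set probabilities, the y-labels, and the part-(d) test-set predictions — these are not names from the starter code.)

    import numpy as np

    def rescale_predictions(h_valid, y_valid, p_test):
        # Estimate alpha as the mean prediction over the labeled validation examples V_+,
        # then rescale the partial-label predictions to approximate p(t = 1 | x).
        alpha = h_valid[y_valid == 1].mean()
        return np.clip(p_test / alpha, 0.0, 1.0)   # clipping is an extra safeguard, not required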

Remark: We saw that the true probability p(t | x) was only a constant factor away from
p(y | x). This means, if our task is to only rank examples (i.e. sort them) in a particular order
(e.g, sort the proteins in order of being most likely to be involved in transmitting signals across
membranes), then in fact we do not even need to estimate α. The rank based on p(y | x) will
agree with the rank based on p(t | x).

3. [25 points] Poisson Regression

(a) [5 points] Consider the Poisson distribution parameterized by λ:

    p(y; λ) = e^{−λ} λ^y / y! .
Show that the Poisson distribution is in the exponential family, and clearly state the values
for b(y), η, T (y), and a(η).
(b) [3 points] Consider performing regression using a GLM model with a Poisson response
variable. What is the canonical response function for the family? (You may use the fact
that a Poisson random variable with parameter λ has mean λ.)
(c) [7 points] For a training set {(x(i) , y (i) ); i = 1, . . . , m}, let the log-likelihood of an example
be log p(y (i) |x(i) ; θ). By taking the derivative of the log-likelihood with respect to θj , derive
the stochastic gradient ascent update rule for learning using a GLM model with Poisson
responses y and the canonical response function.
(d) [7 points] Coding problem. Consider a website that wants to predict its daily traffic.
The website owners have collected a dataset of past traffic to their website, along with
some features which they think are useful in predicting the number of visitors per day. The
dataset is split into train/valid/test sets and follows the same format as Datasets 1-3:
data/ds4_{train,valid}.csv
We will apply Poisson regression to model the number of visitors per day. Note that ap-
plying Poisson regression in particular assumes that the data follows a Poisson distribution
whose natural parameter is a linear combination of the input features (i.e., η = θT x).
In src/p03d poisson.py, implement Poisson regression for this dataset and use gradient
ascent to maximize the log-likelihood of θ.

4. [15 points] Convexity of Generalized Linear Models


In this question we will explore and show some nice properties of Generalized Linear Models,
specifically those related to its use of Exponential Family distributions to model the output.
Most commonly, GLMs are trained by using the negative log-likelihood (NLL) as the loss func-
tion. This is mathematically equivalent to Maximum Likelihood Estimation (i.e., maximizing
the log-likelihood is equivalent to minimizing the negative log-likelihood). In this problem, our
goal is to show that the NLL loss of a GLM is a convex function w.r.t the model parameters.
As a reminder, this is convenient because a convex function is one for which any local minimum
is also a global minimum.
To recap, an exponential family distribution is one whose probability density can be represented as

    p(y; η) = b(y) exp(η^T T(y) − a(η)),

where η is the natural parameter of the distribution. Moreover, in a Generalized Linear Model, η
is modeled as θT x, where x ∈ Rn are the input features of the example, and θ ∈ Rn are learnable
parameters. In order to show that the NLL loss is convex for GLMs, we break down the process
into sub-parts, and approach them one at a time. Our approach is to show that the second
derivative (i.e., Hessian) of the loss w.r.t the model parameters is Positive Semi-Definite (PSD)
at all values of the model parameters. We will also show some nice properties of Exponential
Family distributions as intermediate steps.
For the sake of convenience we restrict ourselves to the case where η is a scalar. Assume
p(Y |X; θ) ∼ ExponentialFamily(η), where η ∈ R is a scalar, and T (y) = y. This makes the
exponential family representation take the form

p(y; η) = b(y) exp(ηy − a(η)).

(a) [5 points] Derive an expression for the mean of the distribution. Show that E[Y | X; θ] can
be represented as the gradient of the log-partition function a with respect to the natural
parameter η.

Hint: Start with observing that (∂/∂η) ∫ p(y; η) dy = ∫ (∂/∂η) p(y; η) dy.
(b) [5 points] Next, derive an expression for the variance of the distribution. In particular,
show that Var(Y | X; θ) can be expressed as the derivative of the mean w.r.t η (i.e., the
second derivative of the log-partition function a(η) w.r.t the natural parameter η.)
(c) [5 points] Finally, write out the loss function `(θ), the NLL of the distribution, as a function
of θ. Then, calculate the Hessian of the loss w.r.t θ, and show that it is always PSD. This
concludes the proof that NLL loss of GLM is convex.
Hint: Use the chain rule of calculus along with the results of the previous parts to simplify
your derivations.
Remark: The main takeaways from this problem are:
• Any GLM model is convex in its model parameters.
• The exponential family of probability distributions are mathematically nice. Whereas cal-
culating mean and variance of distributions in general involves integrals (hard), surprisingly
we can calculate them using derivatives (easy) for exponential family.

5. [25 points] Locally weighted linear regression

(a) [10 points] Consider a linear regression problem in which we want to “weight” different
training examples differently. Specifically, suppose we want to minimize
    J(θ) = (1/2) Σ_{i=1}^m w^(i) ( θ^T x^(i) − y^(i) )^2 .

In class, we worked out what happens for the case where all the weights (the w(i) ’s) are the
same. In this problem, we will generalize some of those ideas to the weighted setting.
i. [2 points] Show that J(θ) can also be written

J(θ) = (Xθ − y)T W (Xθ − y)

for an appropriate matrix W , and where X and y are as defined in class. Clearly specify
the value of each element of the matrix W .
ii. [4 points] If all the w(i) ’s equal 1, then we saw in class that the normal equation is

X T Xθ = X T y,

and that the value of θ that minimizes J(θ) is given by (X T X)−1 X T y. By finding
the derivative ∇θ J(θ) and setting that to zero, generalize the normal equation to this
weighted setting, and give the new value of θ that minimizes J(θ) in closed form as a
function of X, W and y.
iii. [4 points] Suppose we have a dataset {(x(i) , y (i) ); i = 1, . . . , m} of m independent ex-
amples, but we model the y (i) ’s as drawn from conditional distributions with different
levels of variance (σ (i) )2 . Specifically, assume the model

    p(y^(i) | x^(i); θ) = (1 / (√(2π) σ^(i))) exp( −(y^(i) − θ^T x^(i))^2 / (2 (σ^(i))^2) )

That is, each y (i) is drawn from a Gaussian distribution with mean θT x(i) and variance
(σ (i) )2 (where the σ (i) ’s are fixed, known, constants). Show that finding the maximum
likelihood estimate of θ reduces to solving a weighted linear regression problem. State
clearly what the w(i) ’s are in terms of the σ (i) ’s.
(b) [10 points] Coding problem. We will now consider the following dataset (the formatting
matches that of Datasets 1-4, except x(i) is 1-dimensional):
data/ds5_{train,valid,test}.csv
In src/p05b lwr.py, implement locally weighted linear regression using the normal equa-
tions you derived in Part (a) and using

    w^(i) = exp( −||x^(i) − x||_2^2 / (2τ^2) ).

Train your model on the train split using τ = 0.5, then run your model on the valid split
and report the mean squared error (MSE). Finally plot your model’s predictions on the
validation set (plot the training set with blue ‘x’ markers and the validation set with red
‘o’ markers). Does the model seem to be under- or overfitting?
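(For reference, the Gaussian weights above are straightforward to compute for a single query point; x_train and x_query below are placeholder names, not names from the starter code.)

    import numpy as np

    def lwr_weights(x_train, x_query, tau=0.5):
        # w^(i) = exp(-||x^(i) - x||_2^2 / (2 tau^2)) for one query point x.
        sq_dists = np.sum((x_train - x_query) ** 2, axis=1)
        return np.exp(-sq_dists / (2 * tau ** 2))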

(c) [5 points] Coding problem. We will now tune the hyperparameter τ . In src/p05c tau.py,
find the MSE value of your model on the validation set for each of the values of τ specified
in the code. For each τ , plot your model’s predictions on the validation set in the format
described in part (b). Report the value of τ which achieves the lowest MSE on the valid
split, and finally report the MSE on the test split using this τ -value.

CS 229, Fall 2018


Problem Set #2: Supervised Learning II

Due Wednesday, Oct 31 at 11:59 pm on Gradescope.

Notes: (1) These questions require thought, but do not require long answers. Please be as
concise as possible. (2) If you have a question about this homework, we encourage you to post
your question on our Piazza forum, at http://piazza.com/stanford/fall2018/cs229. (3) If
you missed the first lecture or are unfamiliar with the collaboration or honor code policy, please
read the policy on Handout #1 (available from the course website) before starting work. (4)
For the coding problems, you may not use any libraries except those defined in the provided
environment.yml file. In particular, ML-specific libraries such as scikit-learn are not permitted.
(5) To account for late days, the due date listed on Gradescope is Nov 03 at 11:59 pm. If you
submit after Oct 31, you will begin consuming your late days. If you wish to submit on time,
submit before Oct 31 at 11:59 pm.
All students must submit an electronic PDF version of the written questions. We highly recom-
mend typesetting your solutions via LATEX. If you are scanning your document by cell phone,
please check the Piazza forum for recommended scanning apps and best practices. All students
must also submit a zip file of their source code to Gradescope, which should be created using the
make zip.py script. In order to pass the auto-grader tests, you should make sure to (1) restrict
yourself to only using libraries included in the environment.yml file, and (2) make sure your
code runs without errors when running p05 percept.py and p06 spam.py. Your submission will
be evaluated by the auto-grader using a private test set.

1. [15 points] Logistic Regression: Training stability


In this problem, we will be delving deeper into the workings of logistic regression. The goal of
this problem is to help you develop your skills debugging machine learning algorithms (which
can be very different from debugging software in general).
We have provided an implementation of logistic regression in src/p01 lr.py, and two labeled
datasets A and B in data/ds1 a.txt and data/ds1 b.txt.
Please do not modify the code for the logistic regression training algorithm for this problem.
First, run the given logistic regression code to train two different models on A and B. You can
run the code by simply executing python p01 lr.py in the src directory.

(a) [2 points] What is the most notable difference in training the logistic regression model on
datasets A and B?
(b) [5 points] Investigate why the training procedure behaves unexpectedly on dataset B, but
not on A. Provide hard evidence (in the form of math, code, plots, etc.) to corroborate
your hypothesis for the misbehavior. Remember, you should address why your explanation
does not apply to A.
Hint: The issue is not a numerical rounding or over/underflow error.
(c) [5 points] For each of these possible modifications, state whether or not it would lead to
the provided training algorithm converging on datasets such as B. Justify your answers.
i. Using a different constant learning rate.
ii. Decreasing the learning rate over time (e.g. scaling the initial learning rate by 1/t2 ,
where t is the number of gradient descent iterations thus far).
iii. Linear scaling of the input features.
iv. Adding a regularization term ||θ||_2^2 to the loss function.
v. Adding zero-mean Gaussian noise to the training data or labels.
(d) [3 points] Are support vector machines, which use the hinge loss, vulnerable to datasets like
B? Why or why not? Give an informal justification.

Hint: Recall the distinction between functional margin and geometric margin.

2. [10 points] Model Calibration


In this question we will try to understand the output hθ (x) of the hypothesis function of a logistic
regression model, in particular why we might treat the output as a probability (besides the fact
that the sigmoid function ensures hθ (x) always lies in the interval (0, 1)).
When the probabilities outputted by a model match empirical observation, the model is said
to be well-calibrated (or reliable). For example, if we consider a set of examples x(i) for which
hθ (x(i) ) ≈ 0.7, around 70% of those examples should have positive labels. In a well-calibrated
model, this property will hold true at every probability value.
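Concretely, checking calibration on one range of predicted probabilities looks like the following sketch (illustrative only; probs holds the model outputs hθ(x(i)) and labels the corresponding y(i)):

    import numpy as np

    def bin_calibration(probs, labels, a, b):
        # Compare the mean predicted probability with the fraction of positives
        # among examples whose predictions fall in (a, b).
        in_bin = (probs > a) & (probs < b)
        return probs[in_bin].mean(), labels[in_bin].mean()

    # In a well-calibrated model the two returned numbers are approximately equal,
    # e.g. predictions near 0.7 should come with roughly 70% positive labels.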
Logistic regression tends to output well-calibrated probabilities (this is often not true with other
classifiers such as Naive Bayes, or SVMs). We will dig a little deeper in order to understand why
this is the case, and find that the structure of the loss function explains this property.
Suppose we have a training set {x^(i), y^(i)}_{i=1}^m with x^(i) ∈ R^{n+1} and y^(i) ∈ {0, 1}. Assume we
have an intercept term x_0^(i) = 1 for all i. Let θ ∈ R^{n+1} be the maximum likelihood parameters
learned after training a logistic regression model. In order for the model to be considered well-
calibrated, given any range of probabilities (a, b) such that 0 ≤ a < b ≤ 1, and training examples
x(i) where the model outputs hθ (x(i) ) fall in the range (a, b), the fraction of positives in that set
of examples should be equal to the average of the model outputs for those examples. That is,
the following property must hold:
    ( Σ_{i ∈ I_{a,b}} P(y^(i) = 1 | x^(i); θ) ) / |{i ∈ I_{a,b}}|  =  ( Σ_{i ∈ I_{a,b}} I{y^(i) = 1} ) / |{i ∈ I_{a,b}}| ,

where P(y = 1 | x; θ) = h_θ(x) = 1/(1 + exp(−θ^T x)), I_{a,b} = {i | i ∈ {1, ..., m}, h_θ(x^(i)) ∈ (a, b)} is
an index set of all training examples x^(i) where h_θ(x^(i)) ∈ (a, b), and |S| denotes the size of the
set S.

(a) [5 points] Show that the above property holds true for the described logistic regression
model over the range (a, b) = (0, 1).
Hint: Use the fact that we include a bias term.
(b) [3 points] If we have a binary classification model that is perfectly calibrated—that is, the
property we just proved holds for any (a, b) ⊂ [0, 1]—does this necessarily imply that the
model achieves perfect accuracy? Is the converse necessarily true? Justify your answers.
(c) [2 points] Discuss what effect including L2 regularization in the logistic regression objective
has on model calibration.

Remark: We considered the range (a, b) = (0, 1). This is the only range for which logistic
regression is guaranteed to be calibrated on the training set. When the GLM modeling assump-
tions hold, all ranges (a, b) ⊂ [0, 1] are well calibrated. In addition, when the training and test set
are from the same distribution and when the model has not overfit or underfit, logistic regression
tends to be well-calibrated on unseen test data as well. This makes logistic regression a very
popular model in practice, especially when we are interested in the level of uncertainty in the
model output.

3. [20 points] Bayesian Interpretation of Regularization


Background: In Bayesian statistics, almost every quantity is a random variable, which can
either be observed, or unobserved. For instance, parameters θ are generally unobserved random
variables, and data x and y are observed random variables. The joint distribution of all the ran-
dom variables is also called the model (e.g p(x, y, θ)). Every unknown quantity can be estimated
by conditioning the model on all the observed quantities. Such a conditional distribution over
the unobserved random variables, conditioned on the observed random variables, is called the
posterior distribution. For instance p(θ|x, y) is the posterior distribution in the machine learning
context. A consequence of this approach is that we are required to endow our model parameters,
i.e. p(θ), with a prior distribution. The prior probabilities are to be assigned before we see the
data – they need to capture our prior beliefs of what the model parameters might be before
observing any evidence, and must be a subjective opinion by the person building the model.
In the purest Bayesian interpretation, we are required to keep the entire posterior distribu-
tion over the parameters all the way until prediction, to come up with the posterior predictive
distribution, and the final prediction will be the expected value of the posterior predictive dis-
tribution. However in most situations, this is computationally very expensive, and we settle for
a compromise that is less pure (in the Bayesian sense).
The compromise is to estimate a point value of the parameters (instead of the full distribution)
which is the mode of the posterior distribution. Estimating the mode of the posterior distribution
is also called maximum a posteriori estimation (MAP). That is,

    θ_MAP = arg max_θ p(θ | x, y).

Compare this to the maximum likelihood estimation (MLE) we have seen previously:

    θ_MLE = arg max_θ p(y | x, θ).

In this problem, we explore the connection between MAP estimation, and common regularization
techniques that are applied with MLE estimation. In particular, you will show how the choice
of prior distribution over θ (e.g Gaussian, or Laplace prior) is equivalent to different kinds of
regularization (e.g L2 , or L1 regularization). To show this, we shall proceed step by step, showing
intermediate steps.

(a) [3 points] Show that θMAP = argmaxθ p(y|x, θ)p(θ) if we assume that p(θ) = p(θ|x). The
assumption that p(θ) = p(θ|x) will be valid for models such as linear regression where the
input x are not explicitly modeled by θ. (Note that this means x and θ are marginally
independent, but not conditionally independent when y is given.)
(b) [5 points] Recall that L2 regularization penalizes the L2 norm of the parameters while
minimizing the loss (i.e., negative log likelihood in case of probabilistic models). Now we
will show that MAP estimation with a zero-mean Gaussian prior over θ, specifically θ ∼
N (0, η 2 I), is equivalent to applying L2 regularization with MLE estimation. Specifically,
show that
    θ_MAP = arg min_θ ( − log p(y|x, θ) + λ||θ||_2^2 ).

Also, what is the value of λ?


(c) [7 points] Now consider a specific instance, a linear regression model given by y = θT x + ε
where ε ∼ N (0, σ 2 ). Like before, assume a Gaussian prior on this model such that θ ∼
N (0, η 2 I). For notation, let X be the design matrix of all the training example inputs

where each row vector is one example input, and ~y be the column vector of all the example
outputs.
Come up with a closed form expression for θMAP .
(d) [5 points] Next, consider the Laplace distribution, whose density is given by

 
    f_L(z | µ, b) = (1/(2b)) exp( −|z − µ| / b ).

As before, consider a linear regression model given by y = xT θ + ε where ε ∼ N (0, σ 2 ).


Assume a Laplace prior on this model where θ ∼ L(0, bI).
Show that θMAP in this case is equivalent to the solution of linear regression with L1
regularization, whose loss is specified as

    J(θ) = ||Xθ − ~y||_2^2 + γ||θ||_1


Also, what is the value of γ?
Note: A closed form solution for linear regression problem with L1 regularization does not
exist. To optimize this, we use gradient descent with a random initialization and solve it
numerically.

Remark: Linear regression with L2 regularization is also commonly called Ridge regression, and
when L1 regularization is employed, is commonly called Lasso regression. These regularizations
can be applied to any Generalized Linear Model just as above (by replacing log p(y|x, θ) with
the appropriate family likelihood). Regularization techniques of the above type are also called
weight decay, and shrinkage. The Gaussian and Laplace priors encourage the parameter values
to be closer to their mean (i.e., zero), which results in the shrinkage effect.
Remark: Lasso regression (i.e., L1 regularization) is known to result in sparse parameters, where
most of the parameter values are zero, with only some of them non-zero.

4. [18 points] Constructing kernels


In class, we saw that by choosing a kernel K(x, z) = φ(x)T φ(z), we can implicitly map data to a
high dimensional space, and have the SVM algorithm work in that space. One way to generate
kernels is to explicitly define the mapping φ to a higher dimensional space, and then work out
the corresponding K.
However in this question we are interested in direct construction of kernels. I.e., suppose we have
a function K(x, z) that we think gives an appropriate similarity measure for our learning problem,
and we are considering plugging K into the SVM as the kernel function. However for K(x, z)
to be a valid kernel, it must correspond to an inner product in some higher dimensional space
resulting from some feature mapping φ. Mercer’s theorem tells us that K(x, z) is a (Mercer)
kernel if and only if for any finite set {x(1) , . . . , x(m) }, the square matrix K ∈ Rm×m whose
entries are given by Kij = K(x(i) , x(j) ) is symmetric and positive semidefinite. You can find
more details about Mercer’s theorem in the notes, though the description above is sufficient for
this problem.
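A convenient numerical sanity check (evidence, not a proof) is to form the Gram matrix of a candidate K on a random sample and inspect its symmetry and smallest eigenvalue; the candidate kernel below is only a hypothetical example.

    import numpy as np

    def gram_matrix(kernel, xs):
        # K_ij = kernel(x_i, x_j) for an array of points xs.
        m = len(xs)
        return np.array([[kernel(xs[i], xs[j]) for j in range(m)] for i in range(m)])

    rng = np.random.default_rng(0)
    xs = rng.standard_normal((20, 3))
    candidate = lambda x, z: (1.0 + x @ z) ** 2        # a hypothetical candidate K(x, z)

    K = gram_matrix(candidate, xs)
    print(np.allclose(K, K.T))                          # symmetric?
    print(np.linalg.eigvalsh(K).min() >= -1e-8)         # no significantly negative eigenvalues?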
Now here comes the question: Let K1 , K2 be kernels over Rn × Rn , let a ∈ R+ be a positive
real number, let f : Rn → R be a real-valued function, let φ : Rn → Rd be a function mapping
from Rn to Rd , let K3 be a kernel over Rd × Rd , and let p(x) be a polynomial over x with positive
coefficients.
For each of the functions K below, state whether it is necessarily a kernel. If you think it is,
prove it; if you think it isn’t, give a counter-example.

(a) [1 point] K(x, z) = K1 (x, z) + K2 (x, z)

(b) [1 point] K(x, z) = K1 (x, z) − K2 (x, z)
(c) [1 point] K(x, z) = aK1 (x, z)
(d) [1 point] K(x, z) = −aK1 (x, z)
(e) [5 points] K(x, z) = K1 (x, z)K2 (x, z)
(f) [3 points] K(x, z) = f (x)f (z)
(g) [3 points] K(x, z) = K3 (φ(x), φ(z))
(h) [3 points] K(x, z) = p(K1 (x, z))

[Hint: For part (e), the answer is that K is indeed a kernel. You still have to prove it, though.
(This one may be harder than the rest.) This result may also be useful for another part of the
problem.]

5. [16 points] Kernelizing the Perceptron


Let there be a binary classification problem with y ∈ {0, 1}. The perceptron uses hypotheses
of the form hθ (x) = g(θT x), where g(z) = sign(z) = 1 if z ≥ 0, 0 otherwise. In this problem
we will consider a stochastic gradient descent-like implementation of the perceptron algorithm
where each update to the parameters θ is made using only one training example. However, unlike
stochastic gradient descent, the perceptron algorithm will only make one pass through the entire
training set. The update rule for this version of the perceptron algorithm is given by

θ(i+1) := θ(i) + α(y (i+1) − hθ(i) (x(i+1) ))x(i+1)

where θ(i) is the value of the parameters after the algorithm has seen the first i training examples.
Prior to seeing any training examples, θ(0) is initialized to ~0.

(a) [9 points] Let K be a Mercer kernel corresponding to some very high-dimensional feature
mapping φ. Suppose φ is so high-dimensional (say, ∞-dimensional) that it’s infeasible to
ever represent φ(x) explicitly. Describe how you would apply the “kernel trick” to the
perceptron to make it work in the high-dimensional feature space φ, but without ever
explicitly computing φ(x).
[Note: You don’t have to worry about the intercept term. If you like, think of φ as having
the property that φ0 (x) = 1 so that this is taken care of.] Your description should specify:
i. [3 points] How you will (implicitly) represent the high-dimensional parameter vector
θ(i) , including how the initial value θ(0) = 0 is represented (note that θ(i) is now a
vector whose dimension is the same as the feature vectors φ(x));
ii. [3 points] How you will efficiently make a prediction on a new input x(i+1) . I.e., how
you will compute hθ(i) (x(i+1) ) = g((θ(i) )^T φ(x(i+1) )), using your representation of θ(i) ;
and
iii. [3 points] How you will modify the update rule given above to perform an update to θ
on a new training example (x(i+1) , y (i+1) ); i.e., using the update rule corresponding to
the feature mapping φ:

θ(i+1) := θ(i) + α(y (i+1) − hθ(i) (x(i+1) ))φ(x(i+1) )

(b) [5 points] Implement your approach by completing the initial state, predict, and
update state methods of src/p05 percept.py.
(c) [2 points] Run src/p05 percept.py to train kernelized perceptrons on data/ds5 train.csv.
The code will then test the perceptron on data/ds5 test.csv and save the resulting pre-
dictions in the src/output folder. Plots will also be saved in src/output. We provide two
kernels, a dot product kernel and a radial basis function (RBF) kernel. One of the provided
kernels performs extremely poorly in classifying the points. Which kernel performs badly
and why does it fail?

6. [22 points] Spam classification


In this problem, we will use the naive Bayes algorithm and an SVM to build a spam classifier.
In recent years, spam on electronic media has been a growing concern. Here, we’ll build a
classifier to distinguish between real messages, and spam messages. For this class, we will be
building a classifier to detect SMS spam messages. We will be using an SMS spam dataset
developed by Tiago A. Almeida and José María Gómez Hidalgo, which is publicly available at
http://www.dt.fee.unicamp.br/~tiago/smsspamcollection 1
We have split this dataset into training and testing sets and have included them in this assignment
as data/ds6 spam train.tsv and data/ds6 spam test.tsv. See data/ds6 readme.txt for
more details about this dataset. Please refrain from redistributing these dataset files. The goal
of this assignment is to build a classifier from scratch that can tell the difference between spam and
non-spam messages using the text of the SMS message.

(a) [5 points] Implement code for processing the spam messages into numpy arrays that can
be fed into machine learning models. Do this by completing the get words, create dictionary,
and transform text functions within our provided src/p06 spam.py. Do note the corre-
sponding comments for each function for instructions on what specific processing is required.
The provided code will then run your functions and save the resulting dictionary into
output/p06 dictionary and a sample of the resulting training matrix into output/p06 sample train matrix.
(b) [10 points] In this question you are going to implement a naive Bayes classifier for spam
classification with multinomial event model and Laplace smoothing (refer to class notes on
Naive Bayes for details on Laplace smoothing).
Write your implementation by completing the fit naive bayes model and
predict from naive bayes model functions in src/p06 spam.py.
src/p06 spam.py should then be able to train a Naive Bayes model, compute your predic-
tion accuracy and then save your resulting predictions to output/p06 naive bayes predictions.
Remark. If you implement naive Bayes the straightforward way, you’ll find that the
computed p(x|y) = Π_i p(x_i|y) often equals zero. This is because p(x|y), which is the
product of many numbers less than one, is a very small number. The standard computer
representation of real numbers cannot handle numbers that are too small, and instead
rounds them off to zero. (This is called “underflow.”) You’ll have to find a way to compute
Naive Bayes’ predicted class labels without explicitly representing very small numbers such
as p(x|y). [Hint: Think about using logarithms.]
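(To see the underflow issue and the effect of working in the log domain, with made-up numbers:)

    import numpy as np

    p = np.full(500, 0.01)      # 500 made-up per-token probabilities p(x_i | y)

    print(np.prod(p))           # 0.0 -- the product underflows in double precision
    print(np.sum(np.log(p)))    # about -2302.6 -- the log of the same product is easily representable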
(c) [5 points] Intuitively, some tokens may be particularly indicative of an SMS being in a
particular class. We can try to get an informal sense of how indicative token i is for the
SPAM class by looking at:
 
p(xj = i|y = 1) P (token i|email is SPAM)
log = log .
p(xj = i|y = 0) P (token i|email is NOTSPAM)

Complete the get top five naive bayes words function within the provided code using
the above formula in order to obtain the 5 most indicative tokens.
The provided code will print out the resulting indicative tokens and then save them to
output/p06 top indicative words.
1 Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New

Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG’11),
Mountain View, CA, USA, 2011.

(d) [2 points] Support vector machines (SVMs) are an alternative machine learning model that
we discussed in class. We have provided you an SVM implementation (using a radial basis
function (RBF) kernel) within src/svm.py (You should not need to modify that code).
One important part of training an SVM parameterized by an RBF kernel is choosing an
appropriate kernel radius.
Complete the compute best svm radius function by writing code to compute the best SVM radius
which maximizes accuracy on the validation dataset.
The provided code will use your compute best svm radius to compute and then write the
best radius into output/p06 optimal radius.

CS 229, Fall 2018


Problem Set #3: Deep Learning & Unsupervised
learning

Due Wednesday, Nov 14 at 11:59 pm on Gradescope.

Notes: (1) These questions require thought, but do not require long answers. Please be as
concise as possible. (2) If you have a question about this homework, we encourage you to post
your question on our Piazza forum, at http://piazza.com/stanford/fall2018/cs229. (3) If
you missed the first lecture or are unfamiliar with the collaboration or honor code policy, please
read the policy on Handout #1 (available from the course website) before starting work. (4)
For the coding problems, you may not use any libraries except those defined in the provided
environment.yml file. In particular, ML-specific libraries such as scikit-learn are not permitted.
(5) To account for late days, the due date listed on Gradescope is Nov 17 at 11:59 pm. If you
submit after Nov 14, you will begin consuming your late days. If you wish to submit on time,
submit before Nov 14 at 11:59 pm.
All students must submit an electronic PDF version of the written questions. We highly recom-
mend typesetting your solutions via LaTeX. If you are scanning your document by cell phone,
please check the Piazza forum for recommended scanning apps and best practices. All students
must also submit a zip file of their source code to Gradescope, which should be created using the
make zip.py script. In order to pass the auto-grader tests, you should make sure to (1) restrict
yourself to only using libraries included in the environment.yml file, and (2) make sure your
code runs without errors when running p04 gmm.py and p05 kmeans.py. Your submission will
be evaluated by the auto-grader using a private test set.

1. [20 points] A Simple Neural Network


Let X = {x(1) , · · · , x(m) } be a dataset of m samples with 2 features, i.e., x(i) ∈ R2 . The samples
are classified into 2 categories with labels y (i) ∈ {0, 1}. A scatter plot of the dataset is shown in
Figure 1:

Figure 1: Plot of dataset X (a scatter plot of the training examples, with x1 on the horizontal axis and x2 on the vertical axis, both ranging from 0.0 to 4.0).

The examples in class 1 are marked as “×” and examples in class 0 are marked as “◦”. We want
to perform binary classification using a simple neural network with the architecture shown in
Figure 2:

Figure 2: Architecture for our simple neural network (inputs → hidden layer → output).

Denote the two features x1 and x2 , the three neurons in the hidden layer h1 , h2 , and h3 , and
the output neuron as o. Let the weight from xi to hj be w_{i,j}^[1] for i ∈ {1, 2}, j ∈ {1, 2, 3}, and the
weight from hj to o be w_j^[2] . Finally, denote the intercept weight for hj as w_{0,j}^[1] , and the intercept
weight for o as w_0^[2] . For the loss function, we’ll use average squared loss instead of the usual
negative log-likelihood:

    l = (1/m) Σ_{i=1}^m ( o^(i) − y^(i) )^2 ,

where o(i) is the result of the output neuron for example i.
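(For reference, a forward pass through this 2-3-1 architecture with sigmoid activations can be sketched as below; W1, b1, w2, b2 are placeholder array names for the weights w_{i,j}^[1], the intercepts w_{0,j}^[1], the weights w_j^[2], and the intercept w_0^[2] — they are not names from the starter code.)

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, W1, b1, w2, b2):
        # x: length-2 input; W1: 2x3 array; b1: length-3 array; w2: length-3 array; b2: scalar.
        h = sigmoid(W1.T @ x + b1)    # hidden activations h_1, h_2, h_3
        o = sigmoid(w2 @ h + b2)      # output neuron o
        return h, o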



(a) [5 points] Suppose we use the sigmoid function as the activation function for h1 , h2 , h3 and
o. What is the gradient descent update to w_{1,2}^[1] , assuming we use a learning rate of α? Your
answer should be written in terms of x(i) , o(i) , y (i) , and the weights.
(b) [10 points] Now, suppose instead of using the sigmoid function for the activation function
for h1 , h2 , h3 and o, we instead used the step function f (x), defined as
    f(x) = 1 if x ≥ 0,   and   f(x) = 0 if x < 0.

Is it possible to have a set of weights that allow the neural network to classify this dataset
with 100% accuracy?
If it is possible, please provide a set of weights that enable 100% accuracy by completing
optimal step weights within src/p01 nn.py and explain your reasoning for those weights
in your PDF.
If it is not possible, please explain your reasoning in your PDF. (There is no need to modify
optimal step weights if it is not possible.)
Hint: There are three sides to a triangle, and there are three neurons in the hidden layer.
(c) [10 points] Let the activation functions for h1 , h2 , h3 be the linear function f (x) = x and
the activation function for o be the same step function as before.
Is it possible to have a set of weights that allow the neural network to classify this dataset
with 100% accuracy?
If it is possible, please provide a set of weights that enable 100% accuracy by complet-
ing optimal linear weights within src/p01 nn.py and explain your reasoning for those
weights in your PDF.
If it is not possible, please explain your reasoning in your PDF. (There is no need to modify
optimal linear weights if it is not possible.)

2. [15 points] KL divergence and Maximum Likelihood


The Kullback-Leibler (KL) divergence is a measure of how much one probability distribution is
different from a second one. It is a concept that originated in Information Theory, but has made
its way into several other fields, including Statistics, Machine Learning, Information Geometry,
and many more. In Machine Learning, the KL divergence plays a crucial role, connecting various
concepts that might otherwise seem unrelated.
In this problem, we will introduce KL divergence over discrete distributions, practice some simple
manipulations, and see its connection to Maximum Likelihood Estimation.
The KL divergence between two discrete-valued distributions P (X), Q(X) over the outcome
space X is defined as follows1 :

    DKL(P||Q) = Σ_{x∈X} P(x) log( P(x) / Q(x) )

For notational convenience, we assume P (x) > 0, ∀x. (One other standard thing to do is to
adopt the convention that “0 log 0 = 0.”) Sometimes, we also write the KL divergence more
explicitly as DKL (P ||Q) = DKL (P (X)||Q(X)).
Background on Information Theory
Before we dive deeper, we give a brief (optional) Information Theoretic background on KL
divergence. While this introduction is not necessary to answer the assignment question, it may
help you better understand and appreciate why we study KL divergence, and how Information
Theory can be relevant to Machine Learning.
We start with the entropy H(P ) of a probability distribution P (X), which is defined as
    H(P) = − Σ_{x∈X} P(x) log P(x).

Intuitively, entropy measures how dispersed a probability distribution is. For example, a uni-
form distribution is considered to have very high entropy (i.e. a lot of uncertainty), whereas a
distribution that assigns all its mass on a single point is considered to have zero entropy (i.e.
no uncertainty). Notably, it can be shown that among continuous distributions over R, the
Gaussian distribution N (µ, σ 2 ) has the highest entropy (highest uncertainty) among all possible
distributions that have the given mean µ and variance σ 2 .
To further solidify our intuition, we present motivation from communication theory. Suppose we
want to communicate from a source to a destination, and our messages are always (a sequence
of) discrete symbols over space X (for example, X could be letters {a, b, . . . , z}). We want to
construct an encoding scheme for our symbols in the form of sequences of binary bits that are
transmitted over the channel. Further, suppose that in the long run the frequency of occurrence
of symbols follow a probability distribution P (X). This means, in the long run, the fraction of
times the symbol x gets transmitted is P (x).
A common desire is to construct an encoding scheme such that the average number of bits per
symbol transmitted remains as small as possible. Intuitively, this means we want very frequent
symbols to be assigned to a bit pattern having a small number of bits. Likewise, because we are
1 If P and Q are densities for continuous-valued random variables, then the sum is replaced by an integral,

and everything stated in this problem works fine as well. But for the sake of simplicity, in this problem we’ll just
work with this form of KL divergence for probability mass functions/discrete-valued distributions.

interested in reducing the average number of bits per symbol in the long term, it is tolerable for
infrequent words to be assigned to bit patterns having a large number of bits, since their low
frequency has little effect on the long term average. The encoding scheme can be as complex as
we desire, for example, a single bit could possibly represent a long sequence of multiple symbols
(if that specific pattern of symbols is very common). The entropy of a probability distribution
P (X) is its optimal bit rate, i.e., the lowest average bits per message that can possibly be
achieved if the symbols x ∈ X occur according to P (X). It does not specifically tell us how to
construct that optimal encoding scheme. It only tells us that no encoding can possibly give us
a lower long term bits per message than H(P ).
To see a concrete example, suppose our messages have a vocabulary of K = 32 symbols, and
each symbol has an equal probability of transmission in the long term (i.e, uniform probability
distribution). An encoding scheme that would work well for this scenario would be to have
log2 K bits per symbol, and assign each symbol some unique combination of the log2 K bits. In
fact, it turns out that this is the most efficient encoding one can come up with for the uniform
distribution scenario.
It may have occurred to you by now that the long term average number of bits per message
depends only on the frequency of occurrence of symbols. The encoding scheme of scenario A can
in theory be reused in scenario B with a different set of symbols (assume equal vocabulary size
for simplicity), with the same long term efficiency, as long as the symbols of scenario B follow
the same probability distribution as the symbols of scenario A. It might also have occurred to
you that reusing the encoding scheme designed to be optimal for scenario A, for messages in
scenario B having a different probability of symbols, will always be suboptimal for scenario B.
To be clear, we do not need know what the specific optimal schemes are in either scenarios. As
long as we know the distributions of their symbols, we can say that the optimal scheme designed
for scenario A will be suboptimal for scenario B if the distributions are different.
Concretely, if we reuse the optimal scheme designed for a scenario having symbol distribution
Q(X), into a scenario that has symbol distribution P (X), the long term average number of bits
per symbol achieved is called the cross entropy, denoted by H(P, Q):
H(P, Q) = − ∑_{x∈X} P (x) log Q(x).

To recap, the entropy H(P ) is the best possible long term average bits per message (optimal)
that can be achieved under a symbol distribution P (X) by using an encoding scheme (possibly
unknown) specifically designed for P (X). The cross entropy H(P, Q) is the long term average bits
per message (suboptimal) that results under a symbol distribution P (X), by reusing an encoding
scheme (possibly unknown) designed to be optimal for a scenario with symbol distribution Q(X).
Now, KL divergence is the penalty we pay, as measured in average number of bits, for using the
optimal scheme for Q(X), under the scenario where symbols are actually distributed as P (X).
It is straightforward to see this:

DKL (P ‖Q) = ∑_{x∈X} P (x) log [P (x)/Q(x)]
           = ∑_{x∈X} P (x) log P (x) − ∑_{x∈X} P (x) log Q(x)
           = H(P, Q) − H(P ).     (difference in average number of bits)
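The identity above is easy to verify numerically. Below is a small illustrative Python sketch (the two distributions are arbitrary and chosen only for this example):

    import numpy as np

    def H(p):                      # entropy of p, in bits
        return -np.sum(p * np.log2(p))

    def cross_entropy(p, q):       # H(P, Q)
        return -np.sum(p * np.log2(q))

    def kl(p, q):                  # DKL(P || Q)
        return np.sum(p * np.log2(p / q))

    P = np.array([0.5, 0.25, 0.25])
    Q = np.array([0.7, 0.2, 0.1])

    print(kl(P, Q))                       # extra bits per symbol paid for using Q's code
    print(cross_entropy(P, Q) - H(P))     # same number: DKL = H(P, Q) - H(P)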

If the cross entropy between P and Q is zero (and hence DKL (P ||Q) = 0) then it necessarily
means P = Q. In Machine Learning, it is a common task to find a distribution Q that is “close”
to another distribution P . To achieve this, we use DKL (Q||P ) as the loss function to be
optimized. As we will see in this question below, Maximum Likelihood Estimation, which is
a commonly used optimization objective, turns out to be equivalent to minimizing the KL divergence
between the training data (i.e., the empirical distribution over the data) and the model.
Now, we get back to showing some simple properties of KL divergence.

(a) [5 points] Nonnegativity. Prove the following:

∀P, Q : DKL (P ‖Q) ≥ 0

and

DKL (P ‖Q) = 0 if and only if P = Q.


Hint: You may use the following result, called Jensen’s inequality. If f is a convex
function, and X is a random variable, then E[f (X)] ≥ f (E[X]). Moreover, if f is strictly
convex (f is convex if its Hessian satisfies H ≥ 0; it is strictly convex if H > 0; for instance
f (x) = − log x is strictly convex), then E[f (X)] = f (E[X]) implies that X = E[X] with
probability 1; i.e., X is actually a constant.
(b) [5 points] Chain rule for KL divergence. The KL divergence between 2 conditional
distributions P (X|Y ), Q(X|Y ) is defined as follows:
DKL (P (X|Y )‖Q(X|Y )) = ∑_y P (y) ( ∑_x P (x|y) log [P (x|y)/Q(x|y)] )

This can be thought of as the expected KL divergence between the corresponding conditional
distributions on x (that is, between P (X|Y = y) and Q(X|Y = y)), where the expectation
is taken over the random y.
Prove the following chain rule for KL divergence:

DKL (P (X, Y )‖Q(X, Y )) = DKL (P (X)‖Q(X)) + DKL (P (Y |X)‖Q(Y |X)).

(c) [5 points] KL and maximum likelihood. Consider a density estimation problem, and
suppose we are given a training set {x(i) ; i = 1, . . . , m}. Let the empirical distribution be
P̂ (x) = (1/m) ∑_{i=1}^m 1{x(i) = x}. (P̂ is just the uniform distribution over the training set; i.e.,
sampling from the empirical distribution is the same as picking a random example from the
training set.)
Suppose we have some family of distributions Pθ parameterized by θ. (If you like, think of
Pθ (x) as an alternative notation for P (x; θ).) Prove that finding the maximum likelihood
estimate for the parameter θ is equivalent to finding Pθ with minimal KL divergence from
P̂ . I.e. prove:
arg min_θ DKL (P̂ ‖Pθ ) = arg max_θ ∑_{i=1}^m log Pθ (x(i) )

Remark. Consider the relationship between parts (b-c) and multi-variate Bernoulli Naive
Bayes parameter estimation. In the Naive Bayes model we assumed Pθ is of the following
form: Pθ (x, y) = p(y) ∏_{i=1}^n p(xi |y). By the chain rule for KL divergence, we therefore have:

DKL (P̂ ‖Pθ ) = DKL (P̂ (y)‖p(y)) + ∑_{i=1}^n DKL (P̂ (xi |y)‖p(xi |y)).

This shows that finding the maximum likelihood/minimum KL-divergence estimate of the
parameters decomposes into 2n + 1 independent optimization problems: One for the class
priors p(y), and one for each of the conditional distributions p(xi |y) for each feature xi
given each of the two possible labels for y. Specifically, finding the maximum likelihood
estimates for each of these problems individually results in also maximizing the likelihood
of the joint distribution. (If you know what Bayesian networks are, a similar remark applies
to parameter estimation for them.)
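As an ungraded sanity check of the chain rule from part (b), the following Python sketch compares both sides numerically for two small, arbitrary joint distributions over a 2 × 3 space (all numbers are made up for illustration):

    import numpy as np

    def kl(p, q):
        # KL divergence between two discrete distributions given as flat arrays.
        return np.sum(p * np.log(p / q))

    # Arbitrary joint distributions P(X, Y) and Q(X, Y); rows index x, columns index y.
    P = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.05, 0.30]])
    Q = np.array([[0.15, 0.15, 0.20],
                  [0.20, 0.10, 0.20]])

    lhs = kl(P.ravel(), Q.ravel())                      # DKL(P(X, Y) || Q(X, Y))

    Px, Qx = P.sum(axis=1), Q.sum(axis=1)               # marginals over X
    P_y_x, Q_y_x = P / Px[:, None], Q / Qx[:, None]     # conditionals P(Y|X), Q(Y|X)
    cond = np.sum(Px * np.sum(P_y_x * np.log(P_y_x / Q_y_x), axis=1))
    rhs = kl(Px, Qx) + cond                             # DKL(P(X)||Q(X)) + DKL(P(Y|X)||Q(Y|X))

    print(np.isclose(lhs, rhs))                         # True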

3. [25 points] KL Divergence, Fisher Information, and the Natural Gradient


As seen before, the Kullback-Leibler divergence between two distributions is an asymmetric
measure of how different two distributions are. Consider two distributions over the same space
given by densities p(x) and q(x). The KL divergence between two continuous distributions, q
and p is defined as,
DKL (p‖q) = ∫_{−∞}^{∞} p(x) log [p(x)/q(x)] dx
          = ∫_{−∞}^{∞} p(x) log p(x) dx − ∫_{−∞}^{∞} p(x) log q(x) dx
          = Ex∼p(x) [log p(x)] − Ex∼p(x) [log q(x)].

A nice property of KL divergence is that it is invariant to parametrization. This means KL
divergence evaluates to the same value no matter how we parametrize the distributions P and
Q. For example, if P and Q are in the exponential family, the KL divergence between them is
the same whether we are using natural parameters, or canonical parameters, or any arbitrary
reparametrization.
Now we consider the problem of fitting model parameters using gradient descent (or stochastic
gradient descent). As seen previously, fitting model parameters using Maximum Likelihood
is equivalent to minimizing the KL divergence between the data and the model. While KL
divergence is invariant to parametrization, the gradient w.r.t the model parameters (i.e, direction
of steepest descent) is not invariant to parametrization. To see its implication, suppose we
are at a particular value of parameters (either randomly initialized, or mid-way through the
optimization process). The value of the parameters corresponds to some probability distribution
(and in the case of regression, a conditional probability distribution). If we follow the direction of
steepest descent from the current parameters and take a small step along that direction to new
parameters, we end up with a new distribution corresponding to the new parameters. The non-
invariance to reparametrization means that a step of fixed size in the parameter space could end up
at a distribution that is extremely far away in DKL from the previous distribution,
or, on the other hand, barely moves at all w.r.t. DKL from the previous distribution.
This is where the natural gradient comes into the picture. It is best introduced in contrast with the
usual gradient descent. In the usual gradient descent, we first choose the direction by calculating
the gradient of the MLE objective w.r.t. the parameters, and then move a step of some fixed size
(where size is measured in the parameter space) along that direction. Whereas in natural gradi-
ent descent, we first choose a divergence amount by which we would like to move, in the DKL sense. This
effectively gives us a perimeter around the current parameters (of some arbitrary shape), such
that points along this perimeter correspond to distributions which are at an equal DKL -distance
away from the current parameters. Among the set of all distributions along this perimeter, we
move to the distribution that maximizes the objective (i.e., minimizes DKL between the data and
itself) the most. This approach makes the optimization process invariant to parametrization.
That means that, even under an arbitrary reparametrization, starting from a particular
distribution we always descend down the same sequence of distributions towards the optimum.
In the rest of this problem, we will construct and derive the natural gradient update rule. For
that, we will break down the process into smaller sub-problems, and give you hints to answer
them. Along the way, we will encounter important statistical concepts such as the score function
and Fisher Information (which play a prominent role in Statistical Learning Theory as well).
Finally, we will see how this new natural gradient based optimization is actually equivalent to
Newton’s method for Generalized Linear Models.

Let the distribution of a random variable Y parameterized by θ ∈ Rn be p(y; θ).

(a) [3 points] Score function


The score function associated with p(y; θ) is defined as ∇θ log p(y; θ), which signifies the
sensitivity of the likelihood function with respect to the parameters. Note that the score
function is actually a vector since it’s the gradient of a scalar quantity with respect to the
vector θ.
Recall that Ey∼p(y) [g(y)] = ∫_{−∞}^{∞} p(y)g(y) dy. Using this fact, show that the expected value
of the score is 0, i.e.

Ey∼p(y;θ) [∇θ′ log p(y; θ′ )|θ′ =θ ] = 0


(b) [2 points] Fisher Information
Let us now introduce a quantity known as the Fisher information. It is defined as the
covariance matrix of the score function,

I(θ) = Covy∼p(y;θ) [∇θ′ log p(y; θ′ )|θ′ =θ ]

Intuitively, the Fisher information represents the amount of information that a random
variable Y carries about a parameter θ of interest. When the parameter of interest is a
vector (as in our case, since θ ∈ Rn ), this information becomes a matrix. Show that the
Fisher information can equivalently be given by

I(θ) = Ey∼p(y;θ) [∇θ′ log p(y; θ′ ) ∇θ′ log p(y; θ′ )^T |θ′ =θ ]
Note that the Fisher Information is a function of the parameter. The parameter of the
Fisher information is both a) the parameter value at which the score function is evaluated,
and b) the parameter of the distribution with respect to which the expectation and variance
is calculated.
(c) [5 points] Fisher Information (alternate form)
It turns out that the Fisher Information can not only be defined as the covariance of the
score function, but in most situations it can also be represented as the expected negative
Hessian of the log-likelihood.
Show that Ey∼p(y;θ) [−∇²θ′ log p(y; θ′ )|θ′ =θ ] = I(θ).
Remark. The Hessian represents the curvature of a function at a point. This shows that
the expected curvature of the log-likelihood function is also equal to the Fisher information
matrix. If the curvature of the log-likelihood at a parameter is very steep (i.e, Fisher
Information is very high), this generally means you need fewer number of data samples to a
estimate that parameter well (assuming data was generated from the distribution with those
parameters), and vice versa. The Fisher information matrix associated with a statistical
model parameterized by θ is extremely important in determining how a model behaves as
a function of the number of training set examples.
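As an ungraded illustration of these definitions, the sketch below estimates the quantities from parts (a)–(c) by Monte Carlo for a univariate Gaussian Y ∼ N (θ, 1), whose Fisher information is known to be 1 (the numbers and sample size are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    theta = 2.0
    y = rng.normal(loc=theta, scale=1.0, size=1_000_000)  # samples from p(y; theta) = N(theta, 1)

    # For N(theta, 1): log p(y; theta) = -0.5 * (y - theta)**2 + const, so
    score = y - theta                  # d/dtheta log p(y; theta), evaluated at the true theta
    neg_hessian = np.ones_like(y)      # -d^2/dtheta^2 log p(y; theta) = 1 for every sample

    print(score.mean())                # ~0: the expected score (part a)
    print(np.mean(score ** 2))         # ~1: covariance of the score (part b)
    print(neg_hessian.mean())          # 1: expected negative Hessian (part c)
    # All three Monte Carlo estimates agree with the true Fisher information I(theta) = 1.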
(d) [5 points] Approximating DKL with Fisher Information
As we explained at the start of this problem, we are interested in the set of all distributions
that are at a small fixed DKL distance away from the current distribution. In order to
calculate DKL between p(y; θ) and p(y; θ + d), where d ∈ Rn is a small magnitude “delta”
vector, we approximate it using the Fisher Information at θ. Eventually d will be the
natural gradient update we will add to θ. To approximate the KL divergence with Fisher
Information, we will start with the Taylor series expansion of DKL and see that the Fisher
Information pops up in the expansion.
Show that DKL (pθ ‖pθ+d ) ≈ (1/2) d^T I(θ) d.
Hint: Start with the Taylor Series expansion of DKL (pθ ||pθ̃ ) where θ is a constant and θ̃
is a variable. Later set θ̃ = θ + d. Recall that the Taylor Series allows us to approximate a
scalar function f (θ̃) near θ by:
f (θ̃) ≈ f (θ) + (θ̃ − θ)^T ∇θ′ f (θ′ )|θ′ =θ + (1/2) (θ̃ − θ)^T ∇²θ′ f (θ′ )|θ′ =θ (θ̃ − θ)

(e) [8 points] Natural Gradient


Now we move on to calculating the natural gradient. Recall that we want to maximize the
log-likelihood by moving only by a fixed DKL distance from the current position. In the
previous sub-question we came up with a way to approximate DKL distance with Fisher
Information. Now we will set up the constrained optimization problem that will yield the
natural gradient update d. Let the log-likelihood objective be ℓ(θ) = log p(y; θ). Let the
DKL distance we want to move by be some small positive constant c. The natural gradient
update d∗ is

d∗ = arg max_d ℓ(θ + d) subject to DKL (pθ ‖pθ+d ) = c     (1)

First we note that we can use the Taylor approximation ℓ(θ + d) ≈ ℓ(θ) + d^T ∇θ′ ℓ(θ′ )|θ′ =θ .
Also note that we calculated the Taylor approximation of DKL (pθ ‖pθ+d ) in the previous sub-
problem. We shall substitute both these approximations into the above constrained opti-
mization problem.
In order to solve this constrained optimization problem, we employ the method of Lagrange
multipliers. If you are familiar with Lagrange multipliers, you can proceed directly to solve
for d∗ . If you are not familiar with Lagrange multipliers, here is a simplified introduction.
(You may also refer to a slightly more comprehensive introduction in the Convex Opti-
mization section notes, but for the purposes of this problem, the simplified introduction
provided here should suffice).
Consider the following constrained optimization problem

d∗ = arg max_d f (d) subject to g(d) = c

The function f is the objective function and g is the constraint. We instead optimize the
Lagrangian L(d, λ), which is defined as

L(d, λ) = f (d) − λ[g(d) − c]


with respect to both d and λ. Here λ ∈ R+ is called the Lagrange multiplier. In order to
optimize the above, we construct the following system of equations:

∇d L(d, λ) = 0, (a)
∇λ L(d, λ) = 0. (b)

So we have two equations (a and b above) with two unknowns (d and λ), which can
sometimes be solved analytically (in our case, they can).
The following steps guide you through solving the constrained optimization problem:
• Construct the Lagrangian for the constrained optimization problem (1) with the Taylor
approximations substituted in for both the objective and the constraint.
• Then construct the system of linear equations (like (a) and (b)) from the Lagrangian
you obtained.
• From (a), come up with an expression for d that involves λ.
At this stage we have already found the “direction” of the natural gradient d, since
λ is only a positive scaling constant. For most practical purposes, the solution we
obtain here is sufficient. This is because we almost always include a learning rate
hyperparameter in our optimization algorithms, or perform some kind of a line search
for algorithmic stability. This can make the exact calculation of λ less critical. Let’s
call this expression d̃ (involving λ) the unscaled natural gradient. Clearly state what
d̃ is as a function of λ.
The remaining steps are to figure out the value of the scaling constant λ along the
direction of d, for completeness.
• Plug that expression for d into (b). Now we have an equation that has λ but not d.
Come up with an expression for λ that does not include d.
• Plug that expression for λ (without d) back into (a). Now we have an equation that
has d but not λ. Come up with an expression for d that does not include λ.
The expression of d obtained this way will be the desired natural gradient update d∗ . Clearly
state and highlight your final expression for d∗ . This expression cannot include λ.
(f) [2 points] Relation to Newton’s Method
After going through all these steps to calculate the natural gradient, you might wonder if
this is something used in practice. We will now see that the familiar Newton’s method that
we studied earlier, when applied to Generalized Linear Models, is equivalent to natural
gradient on Generalized Linear Models. While the two methods (Newton’s and natural
gradient) agree on GLMs, in general they need not be equivalent.
Show that the direction of update of Newton’s method, and the direction of natural gradient,
are exactly the same for Generalized Linear Models. You may want to recall and cite the
results you derived in problem set 1 question 4 (Convexity of GLMs). For the natural
gradient, it is sufficient to use d̃, the unscaled natural gradient.

4. [30 points] Semi-supervised EM


Expectation Maximization (EM) is a classical algorithm for unsupervised learning (i.e., learning
with hidden or latent variables). In this problem we will explore one of the ways in which EM
algorithm can be adapted to the semi-supervised setting, where we have some labelled examples
along with unlabelled examples.
In the standard unsupervised setting, we have m ∈ N unlabelled examples {x(1) , . . . , x(m) }.
We wish to learn the parameters of p(x, z; θ) from the data, but z (i) ’s are not observed. The
classical EM algorithm is designed for this very purpose, where we maximize the intractable
p(x; θ) indirectly by iteratively performing the E-step and M-step, each time maximizing a
tractable lower bound of p(x; θ). Our objective can be concretely written as:

ℓunsup (θ) = ∑_{i=1}^m log p(x(i) ; θ)
           = ∑_{i=1}^m log ∑_{z(i)} p(x(i) , z (i) ; θ)

Now, we will attempt to construct an extension of EM to the semi-supervised setting. Let us


suppose we have an additional m̃ ∈ N labelled examples {(x̃(1) , z̃ (1) ), . . . , (x̃(m̃) , z̃ (m̃) )} where
both x̃ and z̃ are observed. We want to simultaneously maximize the marginal likelihood of the
parameters using the unlabelled examples, and full likelihood of the parameters using the labelled
examples, by optimizing their weighted sum (with some hyperparameter α). More concretely,
our semi-supervised objective ℓsemi-sup (θ) can be written as:

ℓsup (θ) = ∑_{i=1}^{m̃} log p(x̃(i) , z̃ (i) ; θ)

ℓsemi-sup (θ) = ℓunsup (θ) + α ℓsup (θ)
We can derive the EM steps for the semi-supervised setting using the same approach and steps
as before. You are strongly encouraged to show to yourself (no need to include in the write-up)
that we end up with:

E-step (semi-supervised)

For each i ∈ {1, . . . , m}, set


Qi(t) (z (i) ) := p(z (i) | x(i) ; θ(t) )

M-step (semi-supervised)

θ(t+1) := arg max_θ [ ∑_{i=1}^m ∑_{z(i)} Qi(t) (z (i) ) log ( p(x(i) , z (i) ; θ) / Qi(t) (z (i) ) ) + α ∑_{i=1}^{m̃} log p(x̃(i) , z̃ (i) ; θ) ]

(a) [5 points] Convergence. First we will show that this algorithm eventually converges. In
order to prove this, it is sufficient to show that our semi-supervised objective ℓsemi-sup (θ)
monotonically increases with each iteration of E and M step. Specifically, let θ(t) be the
parameters obtained at the end of t EM-steps. Show that ℓsemi-sup (θ(t+1) ) ≥ ℓsemi-sup (θ(t) ).

Semi-supervised GMM

Now we will revisit the Gaussian Mixture Model (GMM), to apply our semi-supervised EM al-
gorithm. Let us consider a scenario where data is generated from k ∈ N Gaussian distributions,
with unknown means µj ∈ Rd and covariances Σj ∈ Sd+ where j ∈ {1, . . . , k}. We have m data
points x(i) ∈ Rd , i ∈ {1, . . . , m}, and each data point has a corresponding latent (hidden/un-
known) variable z (i) ∈ {1, . . . , k} indicating which distribution x(i) belongs to. Specifically,
z (i) ∼ Multinomial(φ), such that ∑_{j=1}^k φj = 1 and φj ≥ 0 for all j, and x(i) |z (i) ∼ N (µz(i) , Σz(i) )
i.i.d. So, µ, Σ, and φ are the model parameters.
We also have an additional m̃ data points x̃(i) ∈ Rd , i ∈ {1, . . . , m̃}, and an associated observed
variable z̃ (i) ∈ {1, . . . , k} indicating the distribution x̃(i) belongs to. Note that z̃ (i) are known
constants (in contrast to z (i) which are unknown random variables). As before, we assume
x̃(i) |z̃ (i) ∼ N (µz̃(i) , Σz̃(i) ) i.i.d.
In summary we have m + m̃ examples, of which m are unlabelled data points x’s with unobserved
z’s, and m̃ are labelled data points x̃(i) with corresponding observed labels z̃ (i) . The traditional
EM algorithm is designed to take only the m unlabelled examples as input, and learn the model
parameters µ, Σ, and φ.
Our task now will be to apply the semi-supervised EM algorithm to GMMs in order to leverage
the additional m̃ labelled examples, and come up with semi-supervised E-step and M-step update
rules specific to GMMs. Whenever required, you can cite the lecture notes for derivations and
steps.

(b) [5 points] Semi-supervised E-Step. Clearly state all of the latent variables that
need to be re-estimated in the E-step. Derive the E-step to re-estimate all the stated
latent variables. Your final E-step expression must only involve x, z, µ, Σ, φ and universal
constants.
(c) [5 points] Semi-supervised M-Step. Clearly state all of the parameters that
need to be re-estimated in the M-step. Derive the M-step to re-estimate all the stated
parameters. Specifically, derive closed form expressions for the parameter update rules for
µ(t+1) , Σ(t+1) and φ(t+1) based on the semi-supervised objective.
(d) [5 points] [Coding Problem] Classical (Unsupervised) EM Implementation. For
this sub-question, we are only going to consider the m unlabelled examples. Follow the
instructions in src/p04_gmm.py to implement the traditional EM algorithm, and run it on
the unlabelled data-set until convergence.
Run three trials and use the provided plotting function to construct a scatter plot of the
resulting assignments to clusters (one plot for each trial). Your plot should indicate clus-
ter assignments with colors they got assigned to (i.e., the cluster which had the highest
probability in the final E-step).
Note: You only need to submit the three plots in your write-up. Your code will not be
autograded.
(e) [7 points] [Coding Problem] Semi-supervised EM Implementation. Now we will
consider both the labelled and unlabelled examples (a total of m + m̃), with 5 labelled
examples per cluster. We have provided starter code for splitting the dataset into
matrices x of unlabelled examples and x_tilde of labelled examples. Add to your code in
src/p04_gmm.py to implement the modified EM algorithm, and run it on the dataset until
convergence.
Create a plot for each trial, as done in the previous sub-question.

Note: You only need to submit the three plots in your write-up. Your code will not be
autograded.
(f) [3 points] Comparison of Unsupervised and Semi-supervised EM. Briefly describe
the differences you saw in unsupervised vs. semi-supervised EM for each of the following:
i. Number of iterations taken to converge.
ii. Stability (i.e., how much did assignments change with different random initializations?)
iii. Overall quality of assignments.
Note: The dataset was sampled from a mixture of three low-variance Gaussian distribu-
tions, and a fourth, high-variance Gaussian distribution. This should be useful in deter-
mining the overall quality of the assignments that were found by the two algorithms.

5. [20 points] K-means for compression


In this problem, we will apply the K-means algorithm to lossy image compression, by reducing
the number of colors used in an image.
We will be using the files data/peppers-small.tiff and data/peppers-large.tiff.
The peppers-large.tiff file contains a 512x512 image of peppers represented in 24-bit color.
This means that, for each of the 262144 pixels in the image, there are three 8-bit numbers (each
ranging from 0 to 255) that represent the red, green, and blue intensity values for that pixel. The
straightforward representation of this image therefore takes about 262144 × 3 = 786432 bytes (a
byte being 8 bits). To compress the image, we will use K-means to reduce the image to k = 16
colors. More specifically, each pixel in the image is considered a point in the three-dimensional
(r, g, b)-space. To compress the image, we will cluster these points in color-space into 16 clusters,
and replace each pixel with the closest cluster centroid.
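For illustration only, a minimal sketch of the final “replace each pixel by its nearest centroid” step might look like the following (the K-means loop itself, which you must implement yourself in part (a), is assumed to have already produced the 16 centroids; the data here is made up):

    import numpy as np

    def quantize(A, centroids):
        # Replace each pixel of image A (H x W x 3) with its nearest centroid (k x 3).
        pixels = A.reshape(-1, 3).astype(float)                               # (H*W, 3)
        d2 = ((pixels[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # (H*W, k)
        nearest = d2.argmin(axis=1)                                           # closest centroid index
        return centroids[nearest].reshape(A.shape).astype(A.dtype)

    # Made-up stand-ins; a real run would use centroids learned on peppers-small.tiff
    # and A = imread('peppers-large.tiff').
    A = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
    centroids = np.random.randint(0, 256, size=(16, 3)).astype(float)
    print(quantize(A, centroids).shape)   # (64, 64, 3)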
Follow the instructions below. Be warned that some of these operations can take a while (several
minutes even on a fast computer)!

(a) [15 points] [Coding Problem] K-Means Compression Implementation. From the
data directory, open an interactive Python prompt, and type
from matplotlib.image import imread; import matplotlib.pyplot as plt;
and run A = imread(’peppers-large.tiff’). Now, A is a “three dimensional matrix,”
and A[:,:,0], A[:,:,1] and A[:,:,2] are 512x512 arrays that respectively contain the
red, green, and blue values for each pixel. Enter plt.imshow(A); plt.show() to display
the image.
Since the large image has 262144 pixels and would take a while to cluster, we will instead run
vector quantization on a smaller image. Repeat (a) with peppers-small.tiff. Treating
each pixel’s (r, g, b) values as an element of R3 , run K-means2 with 16 clusters on the pixel
data from this smaller image, iterating (preferably) to convergence, but in no case for less
than 30 iterations. For initialization, set each cluster centroid to the (r, g, b)-values of a
randomly chosen pixel in the image.
Take the matrix A from peppers-large.tiff, and replace each pixel’s (r, g, b) values with
the value of the closest cluster centroid. Display the new image, and compare it visually
to the original image. Include in your write-up all your code and a copy of your
compressed image.
(b) [5 points] Compression Factor. If we represent the image with these reduced (16) colors,
by (approximately) what factor have we compressed the image?

2 Please implement K-means yourself, rather than using built-in functions.



CS 229, Fall 2018


Problem Set #4: EM, DL, & RL

Due Wednesday, Dec 05 at 11:59 pm on Gradescope.

Notes: (1) These questions require thought, but do not require long answers. Please be as
concise as possible. (2) If you have a question about this homework, we encourage you to post
your question on our Piazza forum, at http://piazza.com/stanford/fall2018/cs229. (3) If
you missed the first lecture or are unfamiliar with the collaboration or honor code policy, please
read the policy on Handout #1 (available from the course website) before starting work. (4)
For the coding problems, you may not use any libraries except those defined in the provided
environment.yml file. In particular, ML-specific libraries such as scikit-learn are not permitted.
(5) To account for late days, the due date listed on Gradescope is Dec 08 at 11:59 pm. If you
submit after Dec 05, you will begin consuming your late days. If you wish to submit on time,
submit before Dec 05 at 11:59 pm.
All students must submit an electronic PDF version of the written questions. We highly recom-
mend typesetting your solutions via LATEX. If you are scanning your document by cell phone,
please check the Piazza forum for recommended scanning apps and best practices. All students
must also submit a zip file of their source code to Gradescope, which should be created using the
make_zip.py script. In order to pass the auto-grader tests, you should make sure to (1) restrict
yourself to only using libraries included in the environment.yml file, and (2) make sure your
code runs without errors when running p01_nn.py, p04_ica.py and p06_cartpole.py. Your
submission will be evaluated by the auto-grader using a private test set.

1. [30 points] Neural Networks: MNIST image classification


In this problem, you will implement a simple convolutional neural network to classify grayscale
images of handwritten digits (0 - 9) from the MNIST dataset. The dataset contains 60,000
training images and 10,000 testing images of handwritten digits, 0 - 9. Each image is 28×28
pixels in size with only a single channel. It also includes labels for each example, a number
indicating the actual digit (0 - 9) handwritten in that image.
The following shows some example images from the MNIST dataset
(source: https://commons.wikimedia.org/wiki/File:MnistExamples.png):

[Figure: a grid of sample handwritten digits (0–9) from MNIST.]

The data for this problem can be found in the data folder as images_train.csv, images_test.csv,
labels_train.csv and labels_test.csv.
The code for this assignment can be found within p01_nn.py within the src folder.
The starter code splits the set of 60,000 training images and labels into a training set of 59,600
examples and a dev set of 400 examples.
To start, you will implement a simple convolutional neural network and cross entropy loss, and
train it with the provided data set.
The architecture is as follows:

(a) The first layer is a convolutional layer with 2 output channels with a convolution size of 4
by 4.
(b) The second layer is a max pooling layer of stride and width 5 by 5.
(c) The third layer is a ReLU activation layer.
(d) After the fourth layer, the data is flattened into a single dimension.
(e) The fifth layer is a single linear layer with output size 10 (the number of classes).
(f) The sixth layer is a softmax layer that computes the probabilities for each class.
(g) Finally, we use a cross entropy loss as our loss function.
We have provided all of the forward functions for these different layers so there is an unambiguous
definition of them in the code. Your job in this assignment will be to implement functions that
compute the gradients for these layers. However, here is some additional text that might be
helpful in understanding the forward functions.
We have discussed convolutional layers on the exam, but as a review, the following equation
defines what we mean by a 2d convolution:

output[out channel, x, y] = convolution bias[out channel] +
    ∑_{di, dj, in channel} input[in channel, x + di, y + dj] ∗ convolution weights[out channel, in channel, di, dj]

di and dj iterate through the convolution width and height respectively.


The output of a convolution is of size (# output channels, input width - convolution width + 1,
input height - convolution height + 1). Note that the dimension of the output is smaller because
no padding is applied at the borders.
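The forward pass for the convolution is already provided in the starter code; the following loop-based numpy sketch (not the starter code’s exact interface) is included purely to illustrate the equation above:

    import numpy as np

    def conv2d_forward(x, weights, bias):
        # x: (in_ch, H, W); weights: (out_ch, in_ch, kH, kW); bias: (out_ch,)
        out_ch, in_ch, kH, kW = weights.shape
        _, H, W = x.shape
        out = np.zeros((out_ch, H - kH + 1, W - kW + 1))
        for oc in range(out_ch):
            for i in range(H - kH + 1):
                for j in range(W - kW + 1):
                    # Sum over in_channel, di, dj, exactly as in the definition above.
                    out[oc, i, j] = bias[oc] + np.sum(x[:, i:i + kH, j:j + kW] * weights[oc])
        return out

    x = np.random.randn(1, 28, 28)        # one 28x28 single-channel image
    w = np.random.randn(2, 1, 4, 4)       # 2 output channels, 4x4 convolution
    b = np.zeros(2)
    print(conv2d_forward(x, w, b).shape)  # (2, 25, 25)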
Max pooling layers simply take the maximum element over a grid.
It’s defined by the following function:

output[out channel, x, y] = max_{di, dj} input[in channel, x ∗ pool width + di, y ∗ pool height + dj]

The ReLU (rectified linear unit) is our activation function. The ReLU is simply max(0, x) where
x is the input.
We use cross entropy loss as our loss function. Recall that for a single example (x, y), the cross
entropy loss is:
CE(y, ŷ) = − ∑_{k=1}^K yk log ŷk ,
where ŷ ∈ RK is the vector of softmax outputs from the model for the training example x, and
y ∈ RK is the ground-truth vector for the training example x such that y = [0, ..., 0, 1, 0, ..., 0]^T
contains a single 1 at the position of the correct class (also called a “one-hot” representation).
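Since the forward functions are already given in the starter code, the sketch below is only a reference illustration of the softmax and cross-entropy computation for one example (the logits are made up):

    import numpy as np

    def softmax(logits):
        # Subtract the max for numerical stability; the outputs sum to 1.
        z = logits - np.max(logits)
        e = np.exp(z)
        return e / e.sum()

    def cross_entropy_loss(y, y_hat):
        # y is a one-hot vector, y_hat is the softmax output: CE = -sum_k y_k log y_hat_k.
        return -np.sum(y * np.log(y_hat))

    logits = np.array([2.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # 10 classes
    y = np.zeros(10)
    y[0] = 1.0                      # the true class is 0
    print(cross_entropy_loss(y, softmax(logits)))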
We are also doing mini-batch gradient descent with a batch size of 16. Normally we would
iterate over the data multiple times with multiple epochs, but for this assignment we only do
400 batches to save time.
(a) [20 points]
Implement the following functions within p01 nn.py. We recommend that you start at the
top of the list and work your way down:
i. backward_softmax
ii. backward_relu
iii. backward_cross_entropy_loss
iv. backward_linear
v. backward_convolution
vi. backward_max_pool
(b) [10 points] Now implement a function that computes the full backward pass.
i. backward_prop

2. [15 points] Off Policy Evaluation And Causal Inference


In class we have discussed Markov decision processes (MDPs), methods for learning MDPs
from data, and ways to compute optimal policies from that MDP. However, before we use that
policy, we often want to get an estimate of its performance. In some settings such as
games or simulations, you are able to directly implement that policy and directly measure the
performance, but in many situations such as health care implementing and evaluating a policy
is very expensive and time consuming.
Thus we need methods for evaluating policies without actually implementing them. This task is
usually referred to as off-policy evaluation or causal inference. In this problem we will explore
different ways of estimating off policy performance and prove some of the properties of those
estimators.
Most of the methods we discuss apply to general MDPs, but for the sake of this problem, we will
consider MDPs with a single timestep. We consider a universe consisting of states S, actions A,
a reward function R(s, a) where s is a state and a is an action. One important factor is that
we often only observe a subset of the possible actions in our dataset. For example, each state s could represent a
patient, each action a could represent which drug we prescribe to that patient and R(s, a) be
their lifespan after prescribing that drug.
A policy is defined by a function πi (s, a) = p(a|s, πi ). In other words, πi (s, a) is the conditional
probability of an action given a certain state and a policy.
We are given an observational dataset consisting of (s, a, R(s, a)) tuples.
Let p(s) denote the probability density function for the distribution of state s values within that
dataset. Let π0 (s, a) = p(a|s) within our observational data. π0 corresponds to the baseline
policy present in our observational data. Going back to the patient example, p(s) would be the
probability of seeing a particular patient s and π0 (s, a) would be the probability of a patient
receiving a drug in the observational data.
We are also given a target policy π1 (s, a) which gives the conditional probability p(a|s) in our
optimal policy that we are hoping to evaluate. One particular note is that even though this is a
distribution, many of the policies that we hope to evaluate are deterministic, such that given a
particular state si , p(a|si ) = 1 for a single action and p(a|si ) = 0 for the other actions.
Our goal is to compute the expected value of R(s, a) in the same population as our observational
data, but with a policy of π1 instead of π0 . In other words, we are trying to compute:

E_{s∼p(s), a∼π1(s,a)} [R(s, a)]

Important Note About Notation And Simplifying Assumptions:


We haven’t really covered expected values over multiple variables such as E_{s∼p(s), a∼π1(s,a)} [R(s, a)] in
class yet. For the purpose of this question, you may make the simplifying assumption that our
states and actions are discrete distributions. This expected value over multiple variables simply
indicates that we are taking the expected value over the joint pair (s, a) where s comes from p(s)
and a comes from π1 (s, a). In other words, you have a p(s, a) term which is the probabilities of
observing that pair and we can factorize that probability to p(s)p(a|s) = p(s)π1 (s, a). In math
notation, this can be written as:

E_{s∼p(s), a∼π1(s,a)} [R(s, a)] = ∑_{(s,a)} R(s, a) p(s, a)
                                = ∑_{(s,a)} R(s, a) p(s) p(a|s)
                                = ∑_{(s,a)} R(s, a) p(s) π1 (s, a)
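For small discrete state and action spaces, this expectation is just a weighted sum; an illustrative sketch with made-up numbers:

    import numpy as np

    R = np.array([[1.0, 0.0],      # R[s, a]: rows are states, columns are actions
                  [0.5, 2.0]])
    p_s = np.array([0.6, 0.4])     # p(s)
    pi1 = np.array([[0.9, 0.1],    # pi_1(s, a) = p(a | s); each row sums to 1
                    [0.2, 0.8]])

    # E_{s ~ p(s), a ~ pi1(s, a)}[R(s, a)] = sum over (s, a) of R(s, a) p(s) pi1(s, a)
    value = np.sum(R * p_s[:, None] * pi1)
    print(value)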

Unfortunately, we cannot estimate this directly as we only have samples created under policy π0
and not π1 . For this problem, we will be looking at formulas that approximate this value using
expectations under π0 that we can actually estimate.
We will make one additional assumption that each action has a non-zero probability in the
observed policy π0 (s, a). In other words, for all actions a and states s, π0 (s, a) > 0.
Regression: The simplest possible estimator is to directly use our learned MDP parameters to
estimate our goal. This is usually called the regression estimator. While training our MDP, we
learn an estimator R̂(s, a) that estimates R(s, a). We can now directly estimate

E_{s∼p(s), a∼π1(s,a)} [R(s, a)]

with

E_{s∼p(s), a∼π1(s,a)} [R̂(s, a)]

If R̂(s, a) = R(s, a), then this estimator is trivially correct.


We will now consider alternative approaches and explore why you might use one estimator over
another.

(a) [2 points] Importance Sampling: One commonly used estimator is known as the impor-
tance sampling estimator. Let π̂0 be an estimate of the true π0 . The importance sampling
estimator uses that π̂0 and has the form:
E_{s∼p(s), a∼π0(s,a)} [ (π1 (s, a)/π̂0 (s, a)) R(s, a) ]

Please show that if π̂0 = π0 , then the importance sampling estimator is equal to:
E_{s∼p(s), a∼π1(s,a)} [R(s, a)]

Note that this estimator only requires us to model π0 as we have the R(s, a) values for the
items in the observational data.
(b) [2 points] Weighted Importance Sampling: One variant of the importance sampling es-
timator is known as the weighted importance sampling estimator. The weighted importance
sampling estimator has the form:
E_{s∼p(s), a∼π0(s,a)} [ (π1 (s, a)/π̂0 (s, a)) R(s, a) ]  /  E_{s∼p(s), a∼π0(s,a)} [ π1 (s, a)/π̂0 (s, a) ]

Please show that if π̂0 = π0 , then the weighted importance sampling estimator is equal to:

E_{s∼p(s), a∼π1(s,a)} [R(s, a)]

(c) [2 points] One issue with the weighted importance sampling estimator is that it can be
biased in many finite sample situations. In finite samples, we replace the expected value
with a sum over the seen values in our observational dataset. Please show that the weighted
importance sampling estimator is biased in these situations.
Hint: Consider the case where there is only a single data element in your observational
dataset.
(d) [7 points] Doubly Robust: One final commonly used estimator is the doubly robust
estimator. The doubly robust estimator has the form:

E_{s∼p(s), a∼π0(s,a)} [ (E_{a∼π1(s,a)} [R̂(s, a)]) + (π1 (s, a)/π̂0 (s, a)) (R(s, a) − R̂(s, a)) ]

One advantage of the doubly robust estimator is that it works if either π̂0 = π0 or R̂(s, a) =
R(s, a)
i. [4 points] Please show that the doubly robust estimator is equal to E_{s∼p(s), a∼π1(s,a)} [R(s, a)]
when π̂0 = π0 .
ii. [3 points] Please show that the doubly robust estimator is equal to E_{s∼p(s), a∼π1(s,a)} [R(s, a)]
when R̂(s, a) = R(s, a).
(e) [2 points] We will now consider several situations where you might have a choice between
the importance sampling estimator and the regression estimator. Please state whether the
importance sampling estimator or the regression estimator would probably work best in
each situation and explain why it would work better. In all of these situations, your states
s consist of patients, your actions a represent the drugs to give to certain patients and your
R(s, a) is the lifespan of the patient after receiving the drug.
i. [1 points] Drugs are randomly assigned to patients, but the interaction between the
drug, patient and lifespan is very complicated.
ii. [1 points] Drugs are assigned to patients in a very complicated manner, but the inter-
action between the drug, patient and lifespan is very simple.

3. [10 points] PCA


In class, we showed that PCA finds the “variance maximizing” directions onto which to project
the data. In this problem, we find another interpretation of PCA.
Suppose we are given a set of points {x(1) , . . . , x(m) }. Let us assume that we have as usual
preprocessed the data to have zero-mean and unit variance in each coordinate. For a given
unit-length vector u, let fu (x) be the projection of point x onto the direction given by u. I.e., if
V = {αu : α ∈ R}, then
fu (x) = arg min_{v∈V} ||x − v||2 .

Show that the unit-length vector u that minimizes the mean squared error between projected
points and original points corresponds to the first principal component for the data. I.e., show
that
arg min_{u: u^T u = 1} ∑_{i=1}^m ||x(i) − fu (x(i) )||_2^2 .

gives the first principal component.


Remark. If we are asked to find a k-dimensional subspace onto which to project the data so as
to minimize the sum of squares distance between the original data and their projections, then
we should choose the k-dimensional subspace spanned by the first k principal components of the
data. This problem shows that this result holds for the case of k = 1.
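An ungraded numerical illustration of this result (not a substitute for the proof): compare the reconstruction error of the top principal component against random unit directions on synthetic zero-mean data.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])  # one dominant direction
    X = X - X.mean(axis=0)                                              # center the data

    def recon_error(u):
        u = u / np.linalg.norm(u)
        proj = (X @ u)[:, None] * u        # f_u(x^(i)) for every example
        return np.sum((X - proj) ** 2)

    # First principal component: top eigenvector of the empirical covariance matrix.
    evals, evecs = np.linalg.eigh(X.T @ X / X.shape[0])
    u_pca = evecs[:, -1]

    random_errors = [recon_error(rng.normal(size=2)) for _ in range(100)]
    print(recon_error(u_pca) <= min(random_errors))   # True: the PC minimizes the error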

4. [20 points] Independent components analysis


While studying Independent Component Analysis (ICA) in class, we made an informal argu-
ment about why Gaussian distributed sources will not work. We also mentioned that any other
distribution (except Gaussian) for the sources will work for ICA, and hence used the logistic
distribution instead. In this problem, we will go deeper into understanding why Gaussian dis-
tributed sources are a problem. We will also derive ICA with the Laplace distribution, and apply
it to the cocktail party problem.
Reintroducing notation, let s ∈ Rd be source data that is generated from d independent sources.
Let x ∈ Rd be observed data such that x = As, where A ∈ Rd×d is called the mixing matrix.
We assume A is invertible, and W = A−1 is called the unmixing matrix. So, s = W x. The goal
of ICA is to estimate W . Similar to the notes, we denote wjT to be the j th row of W . Note
that this implies that the j th source can be reconstructed with wj and x, since sj = wjT x. We
are given a training set {x(1) , . . . , x(n) } for the following sub-questions. Let us denote the entire
training set by the design matrix X ∈ Rn×d where each example corresponds to a row in the
matrix.

(a) [5 points] Gaussian source


For this sub-question, we assume sources are distributed according to a standard normal
distribution, i.e., sj ∼ N (0, 1) for j ∈ {1, . . . , d}. The log-likelihood of our unmixing matrix, as
described in the notes, is

ℓ(W ) = ∑_{i=1}^n ( log |W | + ∑_{j=1}^d log g′(wj^T x(i) ) ) ,

where g is the cumulative distribution function, and g 0 is the probability density function of
the source distribution (in this sub-question it is a standard normal distribution). Whereas
in the notes we derive an update rule to train W iteratively, for the cause of Gaussian
distributed sources, we can analytically reason about the resulting W .
Try to derive a closed form expression for W in terms of X when g is the standard normal
CDF. Deduce the relation between W and X in the simplest terms, and highlight the
ambiguity (in terms of rotational invariance) in computing W .
(b) [10 points] Laplace source.
For this sub-question, we assume sources are distributed according to a standard Laplace
distribution, i.e., si ∼ L(0, 1). The Laplace distribution L(0, 1) has PDF fL (s) = (1/2) exp(−|s|).
With this assumption, derive the update rule for a single example in the form

W := W + α (. . .) .
(c) [5 points] Cocktail Party Problem
For this question you will implement the Bell and Sejnowski ICA algorithm, but assuming
a Laplace source (as derived in part-b), instead of the Logistic distribution covered in class.
The file mix.dat contains the input data which consists of a matrix with 5 columns, with
each column corresponding to one of the mixed signals xi . The code for this question can
be found in p04_ica.py.
Implement the update_W and unmix functions in p04_ica.py.

You can then run p04_ica.py in order to split the mixed audio into its components. The
mixed audio tracks are written to mixed_i.wav in the output folder. The split audio tracks
are written to split_i.wav in the output folder.
To make sure your code is correct, you should listen to the resulting unmixed sources.
(Some overlap or noise in the sources may be present, but the different sources should be
pretty clearly separated.)
If your implementation is correct, your output split_0.wav should sound similar to the file
correct_split_0.wav included with the source code.
Note: In our implementation, we anneal the learning rate α (slowly decrease it over
time) to speed up learning. In addition to using the variable learning rate to speed up
convergence, one thing that we also do is choose a random permutation of the training
data, and run stochastic gradient ascent visiting the training data in that order (each
of the specified learning rates is then used for one full pass through the data).

5. [15 points] Markov decision processes


Consider an MDP with finite state and action spaces, and discount factor γ < 1. Let B be the
Bellman update operator with V a vector of values for each state. I.e., if V ′ = B(V ), then

V ′ (s) = R(s) + γ max_{a∈A} ∑_{s′∈S} Psa (s′ )V (s′ ).
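For concreteness (this is an illustration, not part of the graded problem), one application of B on a small, randomly generated MDP might look like the following; the check at the end is consistent with the γ-contraction property you are asked to prove in part (a):

    import numpy as np

    def bellman_update(V, R, P, gamma):
        # V: (S,) values; R: (S,) rewards; P: (S, A, S) transition probabilities P_sa(s').
        # For each state s: R(s) + gamma * max_a sum_{s'} P[s, a, s'] * V[s'].
        return R + gamma * np.max(P @ V, axis=1)

    S, A, gamma = 3, 2, 0.9
    rng = np.random.default_rng(0)
    R = rng.normal(size=S)
    P = rng.random(size=(S, A, S))
    P /= P.sum(axis=2, keepdims=True)          # each P[s, a, :] is a distribution over s'

    V1, V2 = rng.normal(size=S), rng.normal(size=S)
    lhs = np.max(np.abs(bellman_update(V1, R, P, gamma) - bellman_update(V2, R, P, gamma)))
    rhs = gamma * np.max(np.abs(V1 - V2))
    print(lhs <= rhs)                          # True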

(a) [10 points] Prove that, for any two finite-valued vectors V1 , V2 , it holds true that

||B(V1 ) − B(V2 )||∞ ≤ γ||V1 − V2 ||∞ ,

where

||V ||∞ = max_{s∈S} |V (s)|.

(This shows that the Bellman update operator is a “γ-contraction in the max-norm.”)
(b) [5 points] We say that V is a fixed point of B if B(V ) = V . Using the fact that the
Bellman update operator is a γ-contraction in the max-norm, prove that B has at most
one fixed point—i.e., that there is at most one solution to the Bellman equations. You may
assume that B has at least one fixed point.

Remark: The result you proved in part (a) implies that value iteration converges geometrically
to the optimal value function V ∗ . That is, after k iterations, the distance between V and V ∗ is
at most γ k times its initial value.

6. [25 points] Reinforcement Learning: The inverted pendulum


In this problem, you will apply reinforcement learning to automatically design a policy for a
difficult control task, without ever using any explicit knowledge of the dynamics of the underlying
system.
The problem we will consider is the inverted pendulum or the pole-balancing problem.2
Consider the figure shown. A thin pole is connected via a free hinge to a cart, which can move
laterally on a smooth table surface. The controller is said to have failed if either the angle of
the pole deviates by more than a certain amount from the vertical position (i.e., if the pole falls
over), or if the cart’s position goes out of bounds (i.e., if it falls off the end of the table). Our
objective is to develop a controller to balance the pole with these constraints, by appropriately
having the cart accelerate left and right.

[Figure: the cart-pole (inverted pendulum) system: a pole attached by a hinge to a cart that slides along a track.]

We have written a simple simulator for this problem. The simulation proceeds in discrete time
cycles (steps). The state of the cart and pole at any time is completely characterized by 4
parameters: the cart position x, the cart velocity ẋ, the angle of the pole θ measured as its
deviation from the vertical position, and the angular velocity of the pole θ̇. Since it would
be simpler to consider reinforcement learning in a discrete state space, we have approximated
the state space by a discretization that maps a state vector (x, ẋ, θ, θ̇) into a number from 0 to
NUM_STATES-1. Your learning algorithm will need to deal only with this discretized representation
of the states.
At every time step, the controller must choose one of two actions - push (accelerate) the cart
right, or push the cart left. (To keep the problem simple, there is no do-nothing action.) These
are represented as actions 0 and 1 respectively in the code. When the action choice is made, the
simulator updates the state parameters according to the underlying dynamics, and provides a
new discretized state.
We will assume that the reward R(s) is a function of the current state only. When the pole
angle goes beyond a certain limit or when the cart goes too far out, a negative reward is given,
and the system is reinitialized randomly. At all other times, the reward is zero. Your program
must learn to balance the pole using only the state transitions and rewards observed.
The files for this problem are in the src directory. Most of the code has already been written
for you, and you need to make changes only to p06_cartpole.py in the places specified. This
file can be run to show a display and to plot a learning curve at the end. Read the comments at
the top of the file for more details on the working of the simulation.
2 The dynamics are adapted from http://www-anw.cs.umass.edu/rlr/domains.html

To solve the inverted pendulum problem, you will estimate a model (i.e., transition probabilities
and rewards) for the underlying MDP, solve Bellman’s equations for this estimated MDP to
obtain a value function, and act greedily with respect to this value function.
Briefly, you will maintain a current model of the MDP and a current estimate of the value func-
tion. Initially, each state has estimated reward zero, and the estimated transition probabilities
are uniform (equally likely to end up in any other state).
During the simulation, you must choose actions at each time step according to some current
policy. As the program goes along taking actions, it will gather observations on transitions and
rewards, which it can use to get a better estimate of the MDP model. Since it is inefficient to
update the whole estimated MDP after every observation, we will store the state transitions and
reward observations each time, and update the model and value function/policy only periodically.
Thus, you must maintain counts of the total number of times the transition from state si to state
sj using action a has been observed (similarly for the rewards). Note that the rewards at any
state are deterministic, but the state transitions are not because of the discretization of the state
space (several different but close configurations may map onto the same discretized state).
Each time a failure occurs (such as if the pole falls over), you should re-estimate the transition
probabilities and rewards as the average of the observed values (if any). Your program must then
use value iteration to solve Bellman’s equations on the estimated MDP, to get the value function
and new optimal policy for the new model. For value iteration, use a convergence criterion
that checks if the maximum absolute change in the value function on an iteration exceeds some
specified tolerance.
Finally, assume that the whole learning procedure has converged once several consecutive at-
tempts (defined by the parameter NO_LEARNING_THRESHOLD) to solve Bellman’s equation all con-
verge in the first iteration. Intuitively, this indicates that the estimated model has stopped
changing significantly.
The code outline for this problem is already in p06_cartpole.py, and you need to write code
fragments only at the places specified in the file. There are several details (convergence criteria
etc.) that are also explained inside the code. Use a discount factor of γ = 0.995.
Implement the reinforcement learning algorithm as specified, and run it.

• How many trials (how many times did the pole fall over or the cart fall off) did it take before
the algorithm converged? Hint: if your solution is correct, on the plot the red line indicating
smoothed log num steps to failure should start to flatten out at about 60 iterations.
• Plot a learning curve showing the number of time-steps for which the pole was balanced
on each trial. Python starter code already includes the code to plot. Include it in your
submission.
• Find the line of code that says np.random.seed, and rerun the code with the seed set to 1,
2, and 3. What do you observe? What does this imply about the algorithm?
