Giant Pile ML Problems
regression model or classification model.
(b) I need to go to the bank to deposit a check, but my schedule is tight and I want to go to the ATM that I know
has the shortest line. In my town there are 3 ATMs to choose from, and I have all the security footage from
each one. So I form a model that predicts, given the day of the week, the time of day, and the rotation of
the planets, which ATM will have the shortest line. I use this to predict which ATM I should visit on a warm
Tuesday afternoon when Jupiter is in retrograde.
This is an example of a
regression model or classification model.
(c) A
discriminative model or generative model
forms a probability distribution based on training data and uses it to predict future samples using maximum
likelihood estimation.
(d) A
discriminative model or generative model
produces discriminating lines that separate the different classes well, but does not really focus on the probability
distribution away from the class-separation points.
(e) I want to know if an image is a dog or a cat. So I go online, download all the image search results for cat, and
for dog, and train a model over my downloaded images to correctly predict cat or dog.
This is an example of
supervised learning, or unsupervised learning.
(f) I want to know if an image is a dog or a cat. So I adopt 100,000 animals that are all either cats or dogs, and
then put them in separate rooms based on their physical similarities, and attribute self-made labels to each
room.
This is an example of
supervised learning, or unsupervised learning.
(a) Linear regression can be used to fit nonlinear curves to data, as long as they can be written as linear
combinations of nonlinear functions.
(b) Bagging is inherently a sequential algorithm (i.e. not easily parallelizable), while boosting can be parallelized
in implementation.
(c) Decision trees are a special type of model that cannot overfit data.
(d) KNN is a type of model in which training time is very long but inference time is usually very short.
(e) KNN is a supervised method and Kmeans is an unsupervised method.
(f) KNN involves a pesky sequence of alternating minimization of what is inherently a nonconvex objective loss,
while Kmeans involves minimizing a simple convex objective and always converges to a true solution.
(g) Naive Bayes is, generally, more sample-efficient (i.e. requires fewer training samples for good performance)
than full Bayes models.
(h) Ensembling is when you take into account the predictions of more than one model, and can produce a stronger
model by merging their predictions wisely.
(i) Unlike Bayes solutions, minimax solutions are far more useful in low-risk applications, where there is profit
in any correct answer and basically no penalty for any wrong answer.
(j) The main goal of having a Bayesian model is to account for uncertainty in the parameter space.
(k) The main discriminating factor between supervised and unsupervised learning is that in supervised learning,
all the labels are wrong and in unsupervised learning, all the labels are 100% accurate.
(l) If we can show that a model has the same performance over its training sample, as it does over new unseen
samples drawn from the same distribution, then we do not need to worry as the model has generalized perfectly,
and should work well in the wild.
(m) Next-word prediction, inpainting, and letter unscrambling are examples of seemingly useless tasks that are
often used to pretrain large models.
(n) All good machine learning models are trained by minimizing a convex, smooth function.
(o) The primary benefit of decision tree models is that they never overfit data.
(p) Variance and bias describe the behavior of a model over random selections of limited training data. Without
increasing training size, often variance can only be reduced at the expense of increasing bias, and vice versa.
(q) I want to produce a machine learning model that learns to identify good code from bad code. To train this
model, I dig up past homework assignments from two students, one whom I deem a “good coder” and one whom
I deem a “bad coder”. The model achieves perfect prediction accuracy, and thus it is likely that this model
will predict well on any incoming Stony Brook student.
(r) Minimizing a convex function is usually pretty easy, but maximizing a concave function is usually pretty hard.
(s) I am a King of a great nation, and I have to decide whether to cut taxes on my people or not. I turn to my
100 advisers, each of whom, on their own, makes the right decision at least 60% of the time. As long as the
advisers behave independently, the majority vote is very likely to be the right answer, with probability higher
than 60%.
1.4. Vocabulary. Match the term to the definition.

Terms: regression, Hessian, overfitting, underfitting, variance

Definitions:
(a) describes whether a function is “bowl-shaped” or not
(e) the problem that arises when your model is so complex that you fit every training sample perfectly, but the model does not generalize well to unseen data
(i) the problem that arises when your model is too simple to fit complex patterns in the dataset
(j) the matrix collecting all the second partial derivatives of a function with multiple inputs and one output
(l) the task of ensuring that a model that performs well on a training set performs equally well on unseen data
1.5. Computation and memory. Rank the following multiclass classification methods in terms of memory complexity:
one vs one, one vs all, multiclass logistic regression. (If memory complexity is the same, write them in the same
column.)
1.6. Linear classification. Select all of the following possible discriminators that can result from solving a logistic
regression problem over θ ∈ R². (Do not consider linear combinations of nonlinear basis functions.)
1.7. (Generalized) linear regression. Study the datasets below. In each case, select one of the following:
(i) The function can be fit with a simple linear model (y = mx + b, where we learn m and b).
(ii) The function can be fit with a generalized linear model, as a weighted sum of basis functions (sines/cosines,
polynomials, spike trains, combinations, etc.)
(iii) The function can only be fit with a nonlinear model.
(Figure: five datasets, panels A–E, each a plot of y against x.)
Write your answers below. You do not need to provide a justification, but we will read it for partial credit.
2.1. Decide if the following ways of thinking fall under the umbrella of supervised, semi-supervised, or unsupervised
learning. Justify your answer.
There is exactly one correct answer (no “both” or “all 3”, no “none”).
(a) I have just arrived in Middle Earth, and I am shocked by all the
different creatures I see. Some are tall, some are short. Some have
thorny heads, others have hair, or are bald. I decide to group
everything I see into categories, and give them funny names (entpeople,
olephants, etc.) These names are entirely my design, and capture
feature similarities between clusters of species.
(b) I have now met an “entperson”, who tells me that there are specific
anatomical features that are unique to entmen and entwomen. For
example, entmen tend to have really thorny heads, while entwomen
have bushier feet. These are not hard and fast rules, in that there
are entmen with bushy feet and entwomen with thorny heads, but
they are rare. I therefore devise a classification scheme that decides
the gender of entpeople based on these features. My entfriend then
goes through a list of all the people in our neighborhood and helps
me label them as men or women.
(c) I have now lived in Middle Earth for 150 years, so I have a pretty
good grasp of the “main” species, such as dwarfs, or giants, or
elves. But every once in a while I see something completely new,
and am not sure exactly to which of these categories it belongs. So, I
use a mixture of the clustering strategy I used when I first arrived
to kind of lump this new species in some category, and then use
the labels I acquired by talking to a lot of different species over the
past 150 years, and guess an appropriate species name for this new
entity.
(credit: xkcd.com)
2.2. In the following scenarios, decide if the i.i.d. (independent, identically distributed) assumption is reasonable. Justify
your answer.
(a) I am trying to decide if a child has COVID or not. There are several features
I am measuring: fever presence, runny nose, sluggish behavior, complaints
of headache. Are these features independent of each other? identically
distributed across the children (conditioned on presence of COVID and no
other information)?
(b) Assume that poor Johnny has chronic asthma, and little Debbie has just
recovered from bronchitis. The rest of the children are healthy. I measure
the lung capacity of each child. Is this measurement identically distributed
across all the children? Is it independent, assuming that these conditions
are all noncontagious?
(c) I now look at three adults, Amber, Brenda, and Chloe. Amber and Brenda
both have children in the same kindergarten class. Chloe also has a child,
but hers is in a different class. Are their chances of getting COVID
independent? Identically distributed?
(source: xkcd)
2.3. In the following scenarios, discuss whether or not you believe that I did a good job in model generalization. Justify
your answer.
(a) I work in real estate. I have observed, over 10 years of experience, that large houses on Long Island tend to
sell at around $500,000 or more, while small houses sell at around $400,000 or less. I move to the Texas
Panhandle, and find a house that is very large, and conclude that it should be worth around $500,000 or more.
(b) I am a young upstart professional and I would love to live in New York City. I talk to my friend, who lived in
NYC in the late 2000’s, and she tells me how various factors affect rent prices (quality of the apartment, number
of bedrooms, neighborhood, whether pets are allowed, etc.) The year is now 2020. Statistically speaking, all the listed features have
stayed about constant in NYC, but inflation has hit the East Coast rather hard. Does her advice generalize
to my current situation?
(c) I am a computer science expert who has spent the past 5 years classifying images of dogs and cats. I’m pretty
good at this now, and my 1 million layer neural network achieves 100% accuracy (in testing and training) on
this task. Now, the government wants me to apply my neural network to remote sensing satellite images, to
detect underground marijuana farms. I use the same neural network, with no tweaking, on this new task, and
expect the same level of accuracy.
(credit: xkcd)
2.4. (a) I have spent the past decade of my life collecting data for a specific task,
and I have finally accrued 100 data points. I want to use all this data to
both design and train my model, thereby giving the best model. My best
friend thinks it’s a bad idea. Why is that?
(b) “Fine!” I say to my best friend. I will instead divide my dataset into two
sets: a train and test set. I will train my model over the train set and
only report the score over my test set. One caveat, this model is a little
sensitive to how many iterations I run gradient descent during training: if I
run it for too long or too short, the performance will degrade. So, I use my
test set to determine the best number of iterations to run. My best friend
scowls at me and scolds me again. Why is that? What is the appropriate
way to partition my data, to give the most generalizable evaluation?
(c) Today, many machine learning subfields use standardized datasets. These
databases are divided into three sets: training, validation, and testing. We
can assume that everyone is honest, and uses each set appropriately. Still,
people are saying that there is a danger of overfitting, simply by using the
same test base so many times. Discuss this: do you agree or disagree with
this fear? Can you give an example of when this type of methodology can
lead to bad real-world effects? Be specific.
(source: xkcd)
2.5. There is a growing need for machine learning models to have interpretability; that is, when a model tells you the label of your new
data sample, you want to also understand why it told you this.
One of the most common answers to the “why” is to find which features (or
which combination of features) best contributed to the answer given, or which
samples best contributed to the model. In each case, describe how you could
use parts of the learned model to return not only a new data label, but which
features (or samples) were most important. Be specific and detailed; if I can’t
implement your suggestion based on what you wrote, you won’t get full credit.
3 Basic probability
3.1. Silicon Valley and socks. I have 4 children, Alexa, Siri, Googs, and Zuckie. Every morning I tell them to put on
their socks.
• Alexa only listens to me on Mondays and Thursdays and puts on her socks. The rest of the days, she puts on
her socks only half of the time. She either puts on both her socks or none of her socks.
• Siri always runs and gets her socks, but half the time she only puts on 1 sock, and the other half the time she
just throws them in the garbage disposal.
• Googs tells me all this random trivia about socks, but never puts on his socks.
• Zuckie wears both his socks 4/7 of the time and sells the rest of them to CambridgeAnalytica.
Assume the children all act independently, and the probability of each day, if not specified, is 1/7 (uniform). Round
all answers to at least 3 significant digits.
(a) What are the chances that either Alexa or Zuckie is wearing a sock?
(b) On a random day, I see both girls, and notice that at least one girl is wearing a sock. What are the chances
that it’s Alexa?
(c) What is the expected number of socks being worn by each child?
(d) What is the variance in the number of socks being worn by each child?
(e) Elon, the kid living next door, has a huge crush on Siri, so whenever Siri doesn’t want to wear any socks, Elon
will just buy them from her and put them on himself. If Elon cannot get a sock from Siri, he will take a sock
from his mom exactly 1/3 of the time. However, like Siri, he never wears both socks; only one or none at a
time.
Using the random vector X = [S; E], write down the covariance matrix capturing the interactions between
S = # socks on Siri’s feet and E = # socks on Elon’s feet.
Here, we interpret the features of alien type and nose length as the model data or parameters, and the actual
presence of Sbovid as the model outcome or evidence.
A new alien, Esmeralda, walks into town: a Scrabble with a long nose.
Our goal is to compute the posterior probability that Esmeralda has Sbovid, i.e.
(a) Write down the prior probability of any alien contracting Sbovid. Of any alien not contracting Sbovid?
(b) Write down the likelihood of an alien with Sbovid being a Scrabble? An alien who is immune being a Scrabble?
(c) Compute the posterior probability that any Scrabble contracts Sbovid.
(d) Compute the posterior probability that Esmeralda has Sbovid, using the Naive Bayes assumption.
(e) Now use the table to compute a full-Bayes likelihood that Esmeralda has Sbovid.
(a) I pick up a baby animal at random. What is the probability that ... (fill in the table)
fur \ tail | furry | rope-like
blue       |       |
gray       |       |
brown      |       |

fur \ tail | furry | rope-like
blue       |       |
gray       |       |
brown      |       |
(d) Are the features “fur color” and “tail texture” correlated, now that I know the animal is Tom’s cherished baby
daughter? (Show mathematically.)
3.5. Consider the 8 scenarios below. The position of each data point is the feature vector x, and the color/marker is
the label y.
(Figure: eight scatter plots, panels A–D in the top row and E–H in the bottom row.)
The top row is generated via Gaussians, and the bottom row via uniform distributions.
(a) List all the instances in which the features (regardless of the labels) are independent.
(b) List all the instances in which the features, given the labels, are independent.
3.6. Independent or not independent. Two random variables A and B are independent if, given distributions pA and
pB , we have
pA,B (a, b) = pA (a)pB (b)
where pA,B is their joint distribution.
In the following scenarios, decide if A and B are independent. Justify your answer.
(a) A and B are discrete random variables and have the following p.m.f.s:

p_A(a) = { 0.25, a = red;  0.25, a = blue;  0.5, a = green },
p_B(b) = { 0.3, b = hat;  0.3, b = T-shirt;  0.2, b = skirt;  0.2, b = shoes }
and B = A · C, where

p_C(c) = { 0.9, c = 1;  0.1, c = −1 }
μ = [ E[A]; E[B] ],   Σ = [ E[(A − E[A])²], E[(A − E[A])(B − E[B])];  E[(A − E[A])(B − E[B])], E[(B − E[B])²] ]
3.7. The importance of i.i.d. When we talk about probability distributions and sampling via pdfs, there are a lot of
tools available for us to do so. In this problem, we will explore the importance of using independent samples, i.e.
samples drawn entirely independently. This will greatly affect the speed at which distributions are learned.
• Using Python, draw N points randomly from a Gaussian distribution with mean 0 and variance 1 (use
np.random.randn()), for N = 250, 1000, 2000, 4000. Plot the normalized histogram in each case, and plot
over it the known pdf of the Gaussian distribution

p_X(x) = (1/√(2π)) exp(−x²/2).

Comment on how N affects how closely your histogram fits the pdf curve. If you are in fact using the simplest
implementation tools, you should be able to ensure that all these points are drawn independently. (A sketch of
this experiment appears after this list.)
• Now, code up an imaginary coin, which flips heads or tails (or 0 or 1). Then, repeat the experiment. However,
half the time, draw a new random element, and the other half of the time, uniformly sample from the points
you’ve already drawn. Again, plot the histogram against the pdf for N = 250, 1000, 2000, 4000.
• Finally, code up an imaginary die, which randomly picks a number 1 through 6. Again, repeat the experiment.
If the number is 1, draw a new element randomly. If not, sample uniformly from the old points. Again, plot
the histogram against the pdf for N = 250, 1000, 2000, 4000.
4 Advanced probability
4.1. Jensen’s inequality. Jensen’s inequality is an application of convexity to expectations. Recall that the expectation
of a discrete random variable f (X) with pmf pX (x) is written as
E[f(X)] = Σ_{x∈X} f(x) p_X(x).
Assume that |X | is finite; that is, there are only a finite number of points where X = x with nonzero probability.
Show that if f is a strictly convex function; that is, if
f(θx + (1 − θ)y) < θf(x) + (1 − θ)f(y),   for all x ≠ y, 0 < θ < 1,
then
E[f (X)] > f (E[X]).
4.2. Exponential distribution. Wait time is often modeled as an exponential distribution, i.e.

Pr(I wait < x hours at the DMV) = { 1 − e^{−λx}, x > 0;  0, x ≤ 0 },

and this cumulative distribution function is parametrized by some constant λ > 0. A random variable X distributed
according to this cdf is denoted as X ∼ exp[λ].
(a) In terms of λ, give the probability density function for the exponential distribution.
(b) Show that if X ∼ exp(λ), then the mean of X is 1/λ and the variance is 1/λ2 .
(You may use a symbolic integration tool such as Wolfram Alpha. If you do wish to do the integral by hand,
my hint is to review integration by parts.)
(c) Now suppose I run a huge server farm, and I am monitoring the server’s ability to respond to web requests.
I have m observations of delay times, x1 , ..., xm , which I assume are i.i.d., distributed according to exp[λ] for
some λ. Given these m observations, what is the maximum likelihood estimate λ̂ of λ?
(d) Given the estimate λ̂ from your previous question, is 1/λ̂ an unbiased estimate of the mean wait time? Is 1/λ̂²
an unbiased estimate of the variance in wait time?
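The mean and variance facts stated in part (b) are easy to check empirically; a quick sketch, noting that numpy's exponential sampler is parametrized by the scale 1/λ rather than the rate λ:

    import numpy as np

    lam = 2.0
    x = np.random.exponential(scale=1/lam, size=1_000_000)  # scale = 1/lambda
    print(x.mean(), 1/lam)       # both should be ~0.5
    print(x.var(), 1/lam**2)     # both should be ~0.25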
4.3. Probably approximately correct bounds. Recall Hoeffding’s inequality for a sum S_m = X_1 + · · · + X_m, where the
X_i are i.i.d. with range a_i ≤ X_i ≤ b_i almost surely. Then, for all c > 0,

Pr(S_m − E[S_m] ≥ c) ≤ exp( −2c² / Σ_{i=1}^m (b_i − a_i)² ).
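As a quick illustration with made-up numbers (not the election setting below): for m fair coin flips, each X_i ∈ {0, 1}, so b_i − a_i = 1 and the denominator is just m.

    import numpy as np

    # Hoeffding bound on seeing 60+ heads in 100 fair flips (E[S_m] = 50, c = 10).
    m, c = 100, 10
    print(np.exp(-2 * c**2 / m))  # ~0.135, an upper bound on Pr(S_m - 50 >= 10)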
Suppose I am running for president and I want to see the chances that I will win. I go to a whole bunch of shopping
malls, assuming the customers inside are a good i.i.d. sample of the voting base, and do a bunch of polling. We
model the problem as follows: if person i says they’ll vote for me, we record X_i = 1 and otherwise, X_i = 0. If, at
the end of the day, S_m ≥ m/2, then I win the election; else, I will lose.
I do my polling and find that 60% of the people I poll say they will vote for me. How many people do I have to
poll to be 90% sure that I will win the election?
4.4. Entropy, conditional entropy, mutual information, information gain. I have a very messy sock drawer, containing
10 red socks, 5 blue socks, 4 yellow socks, and 1 black sock.
(a) Recall the formula for entropy:
H(X) = −Σ_x Pr(X = x) log₂(Pr(X = x)).
Define X as the random variable representing the color of a randomly picked sock. What is the entropy of X?
(b) My mom comes and tells me I must organize my socks better. So, I put all my red socks in the top drawer
and the rest in my bottom drawer. Recall the formula for conditional entropy:
H(X|Y) = −Σ_{x,y} Pr(X = x, Y = y) log₂(Pr(X = x|Y = y)).
What is the conditional entropy, where X is the color of a randomly picked sock, and Y is the drawer from which
I pick it? Assume that I pick the top drawer with twice the probability of picking the bottom drawer.
(c) The information gain (also called mutual information) can be defined in terms of the entropy and conditional
entropy:

I(X; Y) = H(X) − H(X|Y).
Give the mutual information between X the color of the sock and Y the drawer which it comes from.
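These three formulas can be sanity-checked numerically; a small sketch with made-up numbers (a noisy binary channel, not the sock drawer):

    import numpy as np

    # Joint Pr(X = x, Y = y): X uniform over {0, 1}, Y agrees with X 90% of the time.
    p_xy = np.array([[0.45, 0.05],
                     [0.05, 0.45]])
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)

    H_X = -np.sum(p_x * np.log2(p_x))              # entropy H(X) = 1 bit
    H_XgY = -np.sum(p_xy * np.log2(p_xy / p_y))    # conditional entropy H(X|Y) ~ 0.469
    print(H_X, H_XgY, H_X - H_XgY)                 # I(X;Y) = H(X) - H(X|Y) ~ 0.531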
4.5. The worst distribution. Let’s revisit the messy sock problem. We are using this problem to arrive at a well-known
result in information theory. Therefore, although you are free to google around, submit here full justifications for
each answer for full credit. (You cannot use the result to justify the result.)
(a) Suppose that my mom gives me some money to buy 10 socks, and I decide only to buy red and black socks.
(Stonybrook colors!) Define the random variable X as the color of the sock I randomly pull out of my shopping
bag. How many socks of each color should I buy to maximize the entropy of X?
As a reminder, the entropy of a random variable which can take 2 values is
H(X) = −Pr(X = red) log2 (Pr(X = red)) − Pr(X = black) log2 Pr(X = black).
(b) Show more generally that, for any 0 ≤ c ≤ 1, the constrained optimization problem
4.6. Gaining information at Hogwarts. It is a new year at Hogwarts, but unfortunately this year the Sorting Hat went
on vacation to the Bahamas, so you have been assigned as the unofficial sorter of the incoming students. To help
maintain consistency, you are given a year of past students, their key attributes, and what houses they were sorted
into:
*Some Harry Potter facts are altered to make computations less painful
Round all answers to the nearest 0.001. If a probability P = 0, use the convention P log(P ) = 0.
(a) Assuming uniform distribution across each “training sample” (e.g. past student), what is the entropy of labels
(house)?
(b) What is the conditional entropy of labels given hair color?
(c) If one of the three features is revealed to you (hair color, is a Weasley, pets) which one provides the largest
information gain? Report this information gain.
(d) Pick the next node with the lowest purity and split again, using the largest information gain to pick the
next feature. In each case, use n-ary splits (e.g. the number of branches is the number of individual feature
possibilities at that node.)
(e) Finish constructing the decision tree, until there does not exist a split that lowers training misclassification
rate anymore. Report this training misclassification rate.
(f) Now, classify some new samples
Name hair color Fearlessness pets house
Albus Dumbledore blonde high toad Gryffindor
Tom Riddle brown low toad Slytherin
Severus Snape brown high rat Slytherin
Minerva McGonagall red high rat Gryffindor
Sirius Black brown high rat Ravenclaw
Report the test error over this new set, using the decision tree you’d constructed in the previous parts.
13
4.7. Correlated mixture of sequential experts. I want to buy a yacht, but I’m not sure if it’s a good idea given the
economy. So, I decide to question m consultants. Each consultant has more-or-less the same qualifications, and
they come in one at a time.
The first consultant comes in my office. I ask, “Should I buy a yacht?” She says yes with probability p.
On her way out the building, she meets the second consultant. They chat briefly, and she leaves, he comes, and
the process repeats. Each time, I ask the consultant if I should buy a yacht, and receive “yes” with probability
p; each time, the consultant chats briefly with the next consultant. However, the answers now are not i.i.d., but
rather each expert’s answer is correlated with the answer of experts he/she chatted with in the lobby.
At the end of the day, I have met with m consultants. I will make a decision whether to buy a yacht based on
majority rule. We will now calculate the probability that I will buy a yacht.
(a) We model the answer of each consultant as Yi = 1 if the ith consultant recommended “yes”, and Yi = −1
otherwise. Show that if distribution
5 Graphical models
5.1. Covid at kindergarten. Alice, Bob, Carlos, Dima, and Estella are all elementary school kids, and they sit in the
classroom in a circle. Alice is closest to the teacher, Mrs. Novid, and next to her is Bob, then Carlos, then Dima,
then Estella.
The students are somewhat well-spaced apart, but, you know kids. They like to cough on each other, touch each
other’s stuff, there’s no privacy or social distancing. So disease transmission is indeed an issue.
Transmission probabilities
• If Mrs. Novid has Covid, then Alice has a 50% chance of getting it from Mrs. Novid.
• If Alice has Covid, then Bob has 50% chance of getting Covid from Alice.
• If Bob has Covid, then Carlos has 50% chance of getting Covid from Bob.
• If Carlos has Covid, then Dima has 50% chance of getting Covid from Carlos.
• If Dima has Covid, then Estella has 50% chance of getting Covid from Dima.
However, even if no one else in the class has Covid, each child still has a 25% chance of getting Covid, from an
outside source.
(a) Draw a graphical model in which each node stands for a random variable, which represents the probability
that a person has Covid. (e.g. node A may represent the event that Alice has Covid, node B that Bob has
covid, etc...)
(b) We find out one day that Mrs. Novid is home sick, and has had Covid for the past few days. What is the
probability that Estella has Covid? Give your answer to the nearest 0.001.
Hint: The question is a little easier if you first try to find the probability that each person doesn’t have Covid.
5.2. Fashion choice. I have 4 tops: a red sweater, a blue T-shirt, a green hoodie, and a white tank top. I need your
help to decide what to wear.
5.3. NNBA brackets. It’s the Neural Network Tournament of 2022! Eight neural networks will face off in a
battle-to-the-death against each other, and the winner will win free french fries at Arby’s!
Below is the tournament bracket schedule. To read this, first, GPT and Resnet will face off, and concurrently,
LSTM and Alexnet will face off, GAN and BERT will face off, and Sigmoid and Transformer will face off. The
winner of each quarterfinal match will then advance to the semifinal matches (red), and the winner of the two
semifinal matches will face each other in the final match, leading to a single winner.
loser \ winner | GPT | ResNet | LSTM | AlexNet | GAN | BERT | Sigmoid | Transformer
GPT            |  –  | 45%    | 100% | 10%     | 85% | 15%  | 0%      | 14%
ResNet         |     | –      | 15%  | 0%      | 75% | 10%  | 0%      | 75%
LSTM           |     |        | –    | 100%    | 50% | 100% | 10%     | 100%
AlexNet        |     |        |      | –       | 85% | 100% | 10%     | 100%
GAN            |     |        |      |         | –   | 5%   | 0%      | 5%
BERT           |     |        |      |         |     | –    | 50%     | 50%
Sigmoid        |     |        |      |         |     |      | –       | 100%
Transformer    |     |        |      |         |     |      |         | –
(a) Draw a graphical model where each node represents the state of the winner at each stage. Label each node
precisely with what you want it to represent. No need to label the edges.
Hint: There should be as many nodes in your plot as there are circles in the above diagram.
(b) In the next few questions, your goal is to predict the probability that ResNet will win. Round all answers to
the nearest 0.01 of a percent.
First, infer the probability that ResNet will make it to the semifinals.
(c) Next, infer the probability that ResNet will make it to the finals.
(d) Infer the probability that ResNet will win the final.
5.4. Trouble (again) in Silicon Valley. Alexa, Googs, and Zuckie are three mischievous children living in California.
This year, these three children are all in summer camp together, and one day, the prized camp flag is missing.
The flag was first discovered missing on Monday, and it is now Wednesday.
The three children are very conniving, and they know only to exchange the flag at night. There is only enough time
for one exchange to happen, if at all. Assume Markov property; the children’s behavior does not change based on
past behavior.
• On Monday, the flag was discovered missing, and all three kids had equal access to it.
• The remaining probabilities describe what could happen each night.
• Alexa is a rather greedy child, and if she gets the flag, she will just hoard it and never share with anyone else.
• Googs believes in freedom of information, so if Googs gets the flag, he will keep it with probability 1/2, and
give it to each of his social media friends (Alexa and Zuckie) with probability 1/4 each.
• Zuckie is somewhat selfish and would keep the flag with probability 80%. However, he also has a huge crush
on Alexa, so with 20% chance he would give it to her.
(b) Report all the transition probabilities by filling out the table below (row gives to column).
gives to → | Alexa | Googs | Zuckie
Alexa      |       |       |
Googs      |       |       |
Zuckie     |       |       |
(c) What is the probability that Zuckie has the flag on Wednesday?
5.5. The brown fox jumped over the brown fence over the baby fox.

features: the | brown | fox | jumped | over | the | brown | fence | over | the | baby
label:    brown | fox | jumped | over | the | brown | fence | over | the | baby | fox

features: (the, brown) | (brown, fox) | (fox, jumped) | (jumped, over) | (over, the) | (the, brown) | (brown, fence) | (fence, over) | ···
label:    fox | jumped | over | the | brown | fence | over | the | ···
and so forth.
That is, if we think of each task as taking a list of features and spitting out a label, then this becomes your usual
classification task.
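A small sketch of this framing (the helper name is my own): sliding a window over the token list produces exactly these (features, label) pairs.

    corpus = "the brown fox jumped over the brown fence over the baby fox".split()

    def make_pairs(words, n):
        # Each pair: (the n preceding words, the word that immediately follows them).
        return [(tuple(words[i:i + n]), words[i + n]) for i in range(len(words) - n)]

    print(make_pairs(corpus, 1)[:2])  # [(('the',), 'brown'), (('brown',), 'fox')]
    print(make_pairs(corpus, 2)[:2])  # [(('the', 'brown'), 'fox'), (('brown', 'fox'), 'jumped')]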
(a) Toy corpus. Let us now use the following tokenization of the sample corpus

the brown fox jumped over the brown fence over the baby fox

token: 0 = the, 1 = brown, 2 = fox, 3 = jumped, 4 = over, 5 = fence, 6 = baby
i. Write down the counts vector for this problem, i.e. q ∈ R⁷, where q_i is the number of occurrences of the
ith word.
ii. The prior probability, i.e.

prior = Pr(word = x),

is computed as

prior(i) = q_i / Σ_j q_j.
Write down the prior vector for this problem.
iii. Write down the co-occurrence matrix for this problem, i.e. C ∈ R^{7×7}, where C[i, j] = # times that word j
follows word i.
iv. Bigram classifier. Let us now construct the pieces of a next-word classifier based on one previous word.
We will define the likelihood matrix for this problem as

likelihood[i, j] = Pr(previous word = i | next word = j).

Compute the likelihood matrix for this problem. Note that if a probability cannot be computed because
there isn’t any observation for this (i, j) pair, we set likelihood[i, j] = 0.
v. Now, let us compute the posterior matrix for this problem, defined as

posterior[i, j] = Pr(next word = j | previous word = i).

Write down the formula for computing the posterior, given the likelihood and the prior probability
computations. Hint: review Bayes’ rule.
vi. Compute the posterior matrix, using the formula you wrote above.
vii. Trigram classifier. Now, repeat the process for a trigram classifier. First, write down the likelihood.
In particular, we will use the Naive Bayes assumption, which allows us to decompose the likelihood as

likelihood(i, j, k) = Pr(previous word = i | next word = k) · Pr(previous second word = j | next word = k).

Compute a matrix that represents Pr(previous second word = j | next word = k).
viii. Next, use these computations to compute the posterior

Pr(next word = y | previous words = x1, x2)

via Bayes’ rule. While you may use the Naive Bayes assumption to assume features are independent
conditioned on the label, do not assume that features are independent when unconditioned. Write down the
formula for computing the posterior, using vectors and matrices previously computed.
ix. Write down the posterior probabilities for all words y given the following choices for x1 , x2 :
A. the brown
B. over the
C. brown baby
(b) Naive Bayes and Alice in Wonderland. We are now going to apply all these principles to a much larger corpus,
in code. To complete this exercise, I would encourage you to first do all these steps using the super short
corpus (in part (a)), and check to make sure the answers are consistent with what you would expect. Then,
you should be able to just plug and play the new corpus to get the result you want.
• Open the python notebook alice naivebayes release.ipynb . After running the first couple boxes, you
should have loaded the entire text of “Alice in Wonderland” by Lewis Carroll, as an ordered list of words.
Our task today will be to do word prediction based on this corpus. Throughout this exercise, this corpus
will serve as both our training and testing data.
• Tokenize While the exact word means a lot to us, for a (primitive) computer, a word is just some object;
in particular, we represent each unique word as a unique number. This is the word’s token. Run the 3rd
block to tokenize the data, and understand what it is doing.
• Counting of past words. In the next box, I give you code that generates a lookup table, which counts
how many times each word precedes another word. This matrix is then V × V, where V = 2637 is the
number of distinct words. Each row of that matrix is indexed by y, the label, and each column by x, the
feature.
i. Bigram classifier. Use the table given to create a bigram classifier, e.g. predicting the next word using
only the previous word (n = 1). To check your work, the following words should be the highest likely
bigrams.
• word: alice. Most frequent next word: was, with probability 0.00071.
• word: the. Most frequent next word: queen, with probability 0.00276.
• word: cheshire. Most frequent next word: cat, with probability 0.000197.
• word: mock. Most frequent next word: turtle, with probability 0.00225.
Report the classification accuracy over the entire corpus of this classifier.
ii. Well, that was pretty terrible. Let’s try and incorporate not just the past word, but the past k words,
where k can be up to 30. (You can use a tensor and a loop to do this.) Use the same strategy as for the
k = 3 case in part (a) to do this.
Here are some sanity checks, for k = 2:
• seed: ’before’, ’she’, ’found’, ’herself’, ’falling’, ’down’, ’a’, ’very’, ’deep’. Next
most likely word: well
• seed: ’what’, ’an’, ’ignorant’, ’little’. Next most likely word: girl,
• seed: ’four’,’thousand’. Next most likely word: miles
All, after normalization, with probabilities very close to 1.
This is the n = 2 case.
iii. Report the classification accuracy over the entire corpus of this classifier, for n = 3, 5, 10. (Note that
this is not an n-gram classifier, which would be the not-Naive-Bayes version of what we are doing here.)
iv. Text generation Using the likelihoods computed from the classifier using n = 3 (3 past words), and
starting with a seed phrase “the mad hatter”, generate the next 25 words by always picking the most
likely next word.
v. Text generation Using the likelihoods computed from the n = 3 classifier, and starting with a seed phrase
“the mad hatter”, generate the next 25 words by always picking the next word by sampling according to
that probability. (Hint: use random.choices().)
5.6. Hidden Markov Model spellchecker. In this exercise we will make a spell-checker using a HMM. To do this, download
alice hmm release.ipynb and follow the instructions.
• Read through the first two blocks to get an idea of what the task is. The idea is to go through the corrupted
corpus, identify words which have probably been corrupted, and correct them probabilistically.
• In the 4th box, fill in the functions to construct the word probabilities (weighted frequencies in uncorrupted
corpus) and transition matrix (which gives Pr(word | prev word)). If done correctly, the lines printed out
should read
[’alice’, ’abide’, ’voice’, ’above’, ’alive’, ’twice’, ’dunce’, ’prize’, ’smile’, ’since’]
• Construct and run your Hidden Markov Model spell checker using the functions computed for the prior
probabilities, emission probabilities, and transition probabilities. List some words whose spelling was corrected
correctly, and some examples where the spell-corrector did not work as expected. Report the recovery rate of
the “fixed” corpus.
Xθ = y
6.2. Consider the following:

A = [1 2; 3 4; 5 6],   x = [10; 20],   b = [−5; −6; −7]
Compute the following, or explain why it’s not possible to compute it. When it is possible, specify the dimension
of the output.
(a) Ax
(b) Ab
(c) AT b
(d) xT A
(e) xT AT
(f) ∥x∥₂
(g) ∥b∥₂³
(h) ∥Ax − b∥₂²
6.3. For general A ∈ Rm×n , b ∈ Rm , and x ∈ Rn , give the symbolic representation of each gradient. Clearly give the
dimensions at each step, and simplify as much as possible.
6.4. Write out the specific gradients given the following matrix and vector assignments. (Hint: do previous problem
first!)
A = [1 2; 3 4; 5 6],   x = [10; 20],   b = [−5; −6; −7],   c = [1; −1]
Clearly give the dimensions at each step.
6.5. Consider the following linear systems. Decide if the solution is unique or not, and if it exists or not.
(Note that if a solution does not exist, you should not mark that it is unique.)
(a) [1 1; 1 1] [x1; x2] = [5; 5]
(b) [1 1; 1 1] [x1; x2] = [5; −5]
(c) [0 0; 0 1] [x1; x2] = [5; 5]
(d) [1 −1] [x1; x2] = 10
(e) [1 0; 0 4] [x1; x2] = [1; −1]
6.6. Linear regression. Consider the following loss functions. In each case,
i. Write down the linear system ∇f (θ) = 0 as two equations over two variables (θ1 and θ2 ).
ii. Solve for θ1 and θ2 (or find one solution, if not unique.)
iii. Is this answer unique?
(a) f(θ) = ½ ∥ [1 −1; −1 1] [θ1; θ2] − [−3; 6] ∥₂²
(b) f(θ) = ½ ∥ [1 1; 1 1; 1 1] [θ1; θ2] − [10; 20; 30] ∥₂²
(c) f(θ) = ½ ∥ [1 2; −2 1; −1 2] [θ1; θ2] − [−3; 1; 0] ∥₂²
(a) Are the following functions linear? Justify your answer with either a proof if true, or counterexample if false.
i. f(x) = ∥x∥₂²
ii. f(x) = ∥x∥₁
iii. f(x) = ½ xᵀQx + pᵀx + r
iv. f(x) = cᵀx + bᵀAx
(b) Show that the gradient operator is a linear operator.
7.3. Norms. Using the properties of norms, verify that the following are norms, or prove that they are not norms
7.4. Gradients. Compute the gradients of the following functions. Give the exact dimension of the output.
(a) Linear regression. f(x) = (1/40) ∥Ax − b∥₂², A ∈ R^{20×10}
(b) Sigmoid. f(x) = σ(cᵀx), c ∈ R⁵, σ(s) = 1/(1 + exp(−s)). Hint: Start by showing that σ′(s) = σ(s)(1 − σ(s)).
(c) Quadratic function. f(x) = ½ xᵀQx + pᵀx + r, Q ∈ R^{12×12} and Q is symmetric (Q_{i,j} = Q_{j,i}).
(d) Softmax function. f(x) = (1/μ) log( Σ_{i=1}^8 exp(μ x_i) ), x ∈ R⁸, μ is a positive scalar.⁴
7.5. Cauchy-Schwarz, Hölder, and beyond. A p-norm for vectors x ∈ Rⁿ is defined as
(a) The Cauchy-Schwarz inequality says that for any two vectors x ∈ Rⁿ, y ∈ Rⁿ,

xᵀy ≤ ∥x∥_p ∥y∥_q.

When p = 1, the corresponding q is q = +∞, and ∥x∥_∞ is the max-norm, i.e. ∥x∥_∞ = max_i |x_i|. Prove
Hölder’s inequality for this choice of p and q; that is, prove that for any x and any y,

xᵀy ≤ ∥x∥₁ ∥y∥_∞.   (∗∗)

Prove (∗∗) and list the set of y such that, given x, (∗∗) is true with equality.
4 For those of you who train neural networks for multiclass classification, you probably recognize this as the softmax layer. Well, this is
where the name “softmax” comes from! (The function f (x) gives a “soft approximation” of the maximum element of xi .)
(d) The singular value decomposition of a matrix X ∈ R^{m×n} decomposes X into its singular values and vectors. It
is usually written as

X = Σ_{i=1}^r s_i u_i v_iᵀ,

where the u_i ∈ R^m are the left singular vectors, the v_i ∈ R^n are the right singular vectors, and the s_i are positive scalars.
Here, r is the rank of X. The singular vectors are orthonormal, i.e.

u_iᵀu_j = { 1 if i = j; 0 else },    v_iᵀv_j = { 1 if i = j; 0 else }.

The trace norm of X is the sum of its singular values (denoted ∥X∥∗). The spectral norm of X is its largest
singular value (denoted ∥X∥₂).
Prove the following generalization of the Cauchy-Schwarz inequality for matrices X ∈ R^{m×n} and Y ∈ R^{m×n}:

tr(XᵀY) ≤ ∥X∥∗ ∥Y∥₂.
7.6. Am I positive semidefinite? A symmetric matrix X is positive semidefinite if for all u, uᵀXu ≥ 0. For each of the
following, either prove that the matrix is positive semidefinite, or find a counterexample.
(a) X = [1 1 1; 1 1 1; 1 1 1]
(b) X = [−1 −1 −1; −1 1 −1; −1 −1 1]
(c) X = [1 2 3; 2 1 4; 3 4 1]
(d) X = [1 0 0; 0 2 0; 0 0 3]
8 ML models
8.1. K-nearest neighbors classification.
(a) KNN for MNIST. We will now try to use the KNN classifier to classify MNIST digits.
• Open mnist release.ipynb. Load the necessary packages and the data, and take a look at how the data
is formatted and structured. I have done all the “data cleaning” needed for this assignment (which is
very minimal for this exercise). I have also included a function get small dataset which will return a
subset of the training data (60000 samples!) so that we can reasonably train some things on even the
worst laptops.
• Distance function. The first step in establishing a KNN classifier is deciding what is going to be
your metric for “distance”, and writing a function that, given the training data Xtrain and query point
zquery, can as efficiently as possible return a vector of distances between zquery and all of the datapoints
in Xtrain.
There are many ways to do this, some faster than others. In general, if your implementation involves a
for loop, you may be in for a lot of waiting and some very warm laptops. One implementation that
avoids for loops is to really try to use the optimized numerical linear algebra functions of the numpy
library as much as possible, e.g. using functions like np.dot, np.sum, etc.
The distance function we will use is the 2-norm squared; specifically:

d(x_i, z) = Σ_{j=1}^n (x_i[j] − z[j])².
There are several different ways of implementing this in your code, some more efficient than others. (Hint:
think about what you can preprocess.)
As a sanity check, the following box
print(get_dist(Xtrain,Xtrain[0,:])[0])
print(get_dist(Xtrain,Xtest[0,:])[10])
print(get_dist(Xtrain,Xtest[10,:])[50])
should produce
0.0
6069462.0
5661744.0
• Prediction. Implement a K-nearest-neighbor classification predictor, which takes a test data point, finds
the K closest points (in terms of Euclidean distance) in the train data set, and returns the KNN prediction.
Use a majority vote scheme to decide which label to return; use whatever scheme you wish to break ties.
Hint: take a look at scipy.stats.mode()
As a sanity check, for K = 3 and m = 100, the output for the lines
print(ytest_pred[:20])
print(ytest[:20])
should be
[7 2 1 0 4 1 4 4 6 9 0 0 9 0 1 9 7 7 3 4]
[7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5 9 7 3 4]
• Evaluate based on classification accuracy. Now write a function that returns the classification
accuracy given a list of true labels (ytrue) and predicted labels (ypred).
In your writeup, print classification accuracy of the test set.
• Hyperparameter tuning. I have included in the next box an experiment in which your KNN predictor
is tested for a training dataset with 100, 1000, and 2500 data samples. In each case, the code will run your
predictor and return three numbers: m, the prediction accuracy, and the prediction runtime. Run this
box and return the classification accuracy and runtimes for m = 100, 1000, 2500 and K = 1, 3, 5.
m K accuracy runtime (sec)
100 1
100 3
100 5
1000 1
1000 3
1000 5
2500 1
2500 3
2500 5
Comment a bit on the performance of the model for these different hyperparameter choices. In particular:
– Is it feasible, in terms of runtime, to run this model on the full m = 60000 training dataset? Is it advisable?
– How does the accuracy depend on K for different values of m?
(b) Linearly hashed KNN. If your computer is anything like my dinky laptop, that KNN exercise was a
pain to run. As we discussed in class, in general, KNN is tough on runtime memory and computation,
because in general we don’t like doing things like training during prediction. One way to get around this is
dimensionality reduction. While there are many clever ways of implementing such a thing, for this exercise
we will just try something very simple, yet sometimes shockingly effective: linear hashing.
Linear hashing basically follows the principle (sometimes formally stated as the Johnson-Lindenstrauss lemma):

dist(x, y) ≈ dist(Ax, Ay),

where A ∈ R^{d×n} is some random matrix with d ≪ n. That is to say, if I take some points in space, and hit
them with a random projection to a lower dimensional space, their configuration should be roughly preserved.
We won’t really go into this until later lectures, but for now we can evaluate this principle empirically, by
creating linearly hashed KNN.
Add a new box in your KNN ipython notebook. Code in two hashing functions:
• Random subsampling Pick a random subset S ⊂ {1, ..., n}, |S| = d and reduce each feature vector so
that only the features in S remain.
• Random matrix multiply Generate a random matrix A ∈ Rd×n where each Aij is i.i.d. generated
according to N (0, 1). Then the new feature is z = Ax.
Code up KNN over the hashed vectors. Make sure you also hash vectors before prediction. Return the
classification accuracy and runtimes for m = 100, 1000, 10000, with lower dimension d = 50, and number of
neighbors = 1,3,5.
What are your thoughts on using hashing for KNN? In particular:
• Does the reduction in runtime justify the accuracy hit? In what scenarios would we prefer to use hashing
vs not hashing?
• Comment on the performance difference between hashing via subsampling and hashing via random ma-
trices.
8.2. Coding up a decision tree. In real life, you would use one of many highly optimized packages to program decision
trees, but for the sake of understanding, here we will build a tiny decision tree on a simplified version of a multiclass
classification problem, using our greedy method.
• Download covtype decisiontree release.zip, which is a remote sensing classification problem (more details
here https://archive.ics.uci.edu/ml/datasets/covertype). Inside there are two raw files: covtype.info
and covtype.data. Skim covtype.info to understand the basic task.
• The file covtype reduced.mat has all the data already loaded and separated into a train and test set, and is
subsampled severely to reduce computational complexity. (Like I said, this is a simple implementation. Go
ahead and use scikit-learn for a “full” version.)
• The task is now to construct a decision tree classifier, that uses information gain to pick the best next split.
Open the iPython notebook covtype.ipynb, and follow the instructions there.
• Fill in the box to return functions that computes entropy and conditional entropy for a sequence of labels.
Remember to include special cases for when one is required to take the log of 0. (We assume that 0 log2 (0) = 0.)
As a sanity check, you should get the following values as a result of the test problem provided in the notebook.
entropy = 3.314182323161083, conditional entropy = 3.3029598816135173
• Fill in the box to return a function that, at each step, given X, y, and a set of data sample indices to consider,
returns the best feature to split and the best split value, as well as the indices split into two sets. (Follow
the template in the iPython notebook). Again, return the solution to the test problem given. Don’t forget to
handle the special case if the split results in an empty set–something special should happen, so the algorithm
knows to reject this split.
The rest of the implementation is up to you. I have given you some guidance in breaking down the task into 3 tree
versions.
(a) A toy tree, which splits the leaf node with the median nodeid, and then traverses to the leaf by always following
the child with the larger id.
(b) A decision tree that splits the node with the lowest purity and gives the best information gain split value and
split feature at each split.
(c) A decision tree that does all of the previous, plus proper inference.
If you follow all these parts step by step, you should be able to code up your own decision tree!
• Draw the outputted tree after 10 iterations, labeling the purity and number of samples in each leaf node.
• Report your train and test misclassification rate for 100 steps of training. Describe your resulting tree. Report
how many nodes there are in your resulting tree, and how many are leaves. List, for the leaves, how many samples
ended up on each leaf, and report also their “purity”, i.e. (# of samples with the most frequent label) / (total size of the set).
• Did your tree overfit? What are some hints that this may have happened?
8.3. (a) Download weatherDewTmp.mat. Plot the data (plot(weeks,dew)). It should look like the following:
(Figure: dew point temperature (C) plotted against weeks after first reading.)
(b) We want to form a polynomial regression of this data. That is, given w = weeks and d = dew readings, we
want to find θ1, ..., θp as the solution to

minimize_{θ∈R^p}  Σ_{i=1}^m ½ (θ1 + θ2 w_i + θ3 w_i² + · · · + θp w_i^{p−1} − d_i)².   (1)
Find X and y such that (1) is equivalent to the least squares problem

minimize_{θ∈R^p}  ½ ∥Xθ − y∥₂².   (2)
(c) What are the normal equations for problem (2)? In particular, if θ∗ is the minimizer of (2), then θ∗ solves a
linear system Aθ∗ = b. What are A and b?
(d) Ridge regression. Oftentimes, it is helpful to add a regularization term to (2), to improve stability. This also
has an interpretation as Bayesian linear regression with a Gaussian 0-mean prior. In other words, we solve

minimize_{θ∈R^p}  ½ ∥Xθ − y∥₂² + (ρ/2) ∥θ∥₂²   (3)
for some ρ > 0. The θ∗ that minimizes (3) is the solution to a different linear system Areg θ∗ = breg . What are
Areg and breg ?
(e) If A is a positive semidefinite matrix with condition number 5 and largest eigenvalue 1, what is the condition
number of A + ρI for some ρ > 0?
(f) In Python, write a function that takes as argument p and returns X and y so that (1) is equivalent to (2).
After solving this linear system, you should get a checknumber of 1.759.
Report the condition numbers for A and Areg by filling out this table:
p A (ρ = 0) Areg , ρ = 0.1 · m Areg , ρ = m Areg , ρ = 10 · m Areg , ρ = 100 · m
1
2
5
10
(g) Compute a polynomial fit by solving (2) for polynomials of order 1, 2, 3, 10, 100. Plot all the fits on separate
plots (use subplot). Comment on your observations.
(h) Now compute a regularized polynomial fit by solving (3) for polynomials of order 1, 2, 10, 100, 150, and 200,
for ρ = 0.0001. Plot all the fits on separate plots (use subplot). Comment on your observations. How does
this compare to the unregularized polynomial fit?
(i) Picking your favorite set of hyperparameters (p, ρ), forecast the next week’s dew point temperature. Plot the
forecasted data over the current observations. Do you believe your forecast? Why?
8.4. Binary classification and logistic regression. Consider the following dataset.

i | x_iᵀ     | y_i | dist to θA | dist to θB | dist to θC
1 | (1, 1)   | 1   |            |            |
2 | (−2, 0)  | 1   |            |            |
3 | (−2, 1)  | 1   |            |            |
4 | (1, −1)  | −1  |            |            |
5 | (−2, −2) | −1  |            |            |
6 | (3, −2)  | −1  |            |            |
(a) Draw the points xi on an axis, in solid dots when yi = 1 and circles when yi = −1.
(b) Now consider the following three choices of θ:

θA = [−1; 3],   θB = [−6; 3],   θC = [−20; 25]

Draw the three discriminators on your previous plot, with DA solid, DB dashed, and DC dotted. Clearly label
them as well.

DA = {z : zᵀθA = 0},   DB = {z : zᵀθB = 0},   DC = {z : zᵀθC = 0}
(c) Report the misclassification error over this training set if we were to use each choice of θ to generate a
classification rule
ypred = sign(xT θ)
(d) Which discriminator(s) have the best classification accuracy over the provided training set?
(e) Compute the shortest distance from x_i to DA, DB, and DC for all i (fill in the table). Remember that the
formula for the distance between a point x and the discriminator formed by θ is

distance to margin = |xᵀθ| / ∥θ∥₂.

Round your answers to the nearest 0.1. (Round 0.05 up to 0.1.) Include your work below.
(f) Which correctly classified point(s) have the minimum margin distance to discriminator A?
(g) Which correctly classified point(s) have the minimum margin distance to discriminator B?
(h) Which correctly classified point(s) have the minimum margin distance to discriminator C?
(i) Now assume that each of the training set points xi are jostled by a bit of noise, e.g. instead, we have x̂i = xi +zi
for a very small magnitude random vector zi . (∥zi ∥2 ≤ 0.1.) Which discriminator is most robust against this
kind of noise?
(j) Logistic regression. Recall that logistic regression minimizes the following objective function:

f(θ) = −Σ_{i=1}^m log(σ(y_i x_iᵀθ)),   σ(s) = 1/(1 + exp(−s)).
(k) Fill in the following table.

i | x_iᵀ     | y_i | y_i x_iᵀ | y_i x_iᵀθA | y_i x_iᵀθB | y_i x_iᵀθC | σ(y_i x_iᵀθA) | σ(y_i x_iᵀθB) | σ(y_i x_iᵀθC)
1 | (1, 1)   | 1   |          |            |            |            |               |               |
2 | (−2, 0)  | 1   |          |            |            |            |               |               |
3 | (−2, 1)  | 1   |          |            |            |            |               |               |
4 | (1, −1)  | −1  |          |            |            |            |               |               |
5 | (−2, −2) | −1  |          |            |            |            |               |               |
6 | (3, −2)  | −1  |          |            |            |            |               |               |
(l) Compute the numeric values of the objective values f (θA ), f (θB ), f (θC ).
(m) Compute the numeric values of the gradient norms ∥∇f (θA )∥2 , ∥∇f (θB )∥2 , ∥∇f (θC )∥2 .
(n) Which classifier(s) has the smaller loss?
(o) Which classifier(s) has the smaller gradient norm?
(p) What is the significance of a classifier having a smaller gradient norm? Explain.
(q) Is it possible that a classifier might have a smaller loss but not actually do a better job in classifying (e.g., not
more robust against noise)? Explain.
For example, for the function f(θ) = θ1² + 2θ1θ2 + θ3³, the gradient and Hessian are

∇f(θ) = [2θ1 + 2θ2; 2θ1; 3θ3²],   ∇²f(θ) = [2 2 0; 2 0 0; 0 0 6θ3].
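Gradients like the one above are easy to spot-check with central finite differences; a small sketch:

    import numpy as np

    f = lambda t: t[0]**2 + 2*t[0]*t[1] + t[2]**3
    grad = lambda t: np.array([2*t[0] + 2*t[1], 2*t[0], 3*t[2]**2])

    t, eps = np.array([1.0, -2.0, 0.5]), 1e-6
    fd = np.array([(f(t + eps*e) - f(t - eps*e)) / (2*eps) for e in np.eye(3)])
    print(grad(t))  # analytic: [-2, 2, 0.75]
    print(fd)       # central differences should agree to many digits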
• Write a script that returns the classification accuracy given θ.
• Use gradient descent to minimize the logistic loss for this classification problem. Use a step size of 10⁻⁶.
• Run for 1500 iterations. Report the final train and test accuracy values.
(c) Coding stochastic gradient descent. Do not use scikit-learn or other built-in tools for this exercise.
Please only use the packages that are already imported in the notebook. Now, fill in the next box a function
that takes in θ and a minibatch B (as either a list of numbers or as an np.array), and returns the minibatch
gradient

∇_B f(θ) = (1/|B|) Σ_{i∈B} ∇f_i(θ)
Run the script. If done correctly, you should see the number 5803.5 printed out.
(d) Write a script to run stochastic gradient descent over logistic regression. When coding up the minibatching,
make sure you cycle through the entire training set once before moving on to the next epoch (see the skeleton
after this part). Additionally, use time() to record the runtime, and submit a plot which compares the performance of gradient descent
and stochastic gradient descent, using a minibatch size of 50 data samples and running for 50000 iterations.
Return a plot that compares the objective loss, train accuracy, and test accuracy between the two optimization
methods, as a function of runtime. Comment on the pros and cons of the two methods.
Important Remember that calculating the loss function and train/test accuracy requires making full passes
through the data. If you do this at each iteration, you will not see any runtime benefit between stochastic
gradient descent and gradient descent. Therefore I recommend you log these values every 10 iterations for
gradient descent, and every 100 iterations for stochastic gradient descent.
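One way to organize the epoch cycling described in this part; a skeleton only, where minibatch_grad is a placeholder for your own minibatch-gradient function:

    import numpy as np

    def sgd(theta, m, minibatch_grad, batch_size=50, iters=50000, lr=1e-6):
        perm, pos = np.random.permutation(m), 0
        for it in range(iters):
            if pos + batch_size > m:                     # epoch finished:
                perm, pos = np.random.permutation(m), 0  # reshuffle, start over
            batch = perm[pos:pos + batch_size]           # next chunk of this epoch
            pos += batch_size
            theta = theta - lr * minibatch_grad(theta, batch)
            # log loss/accuracy only occasionally; full passes are expensive
        return theta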
8.6. Overfitting SVMs. Before the world of deep neural nets, the Arnold Schwarzenegger of the machine learning world
was actually kernel SVMs. These things are big, bulky, and will fit anything you want–for a price. In this problem
you will play around with your own encoded kernel SVM, around a seemingly easy dataset, and see what troubles
may arise.
(a) Recall that the soft primal kernel SVM problem proposes to solve

minimize_{θ∈R^n, s∈R^m}   ½ ∥θ∥₂² + ρ Σ_{i=1}^m s_i
subject to                y_i φ(x_i)ᵀθ ≥ 1 − s_i,  i = 1, ..., m,     (p-KSVM)
                          s_i ≥ 0,  i = 1, ..., m.
Pick an appropriate step size α and number of iterations through experimentation. Give the resulting train
and test errors for the following choices of ρ: ρ = 0.01, 0.1, 1. Also give the train/test error when setting ρ to
np.inf, simulating the hard margin scenario.
(c) Next, code up an RBF kernel, i.e. a function

K(x1, x2) = exp( −∥x1 − x2∥₂² / (2σ²) ).
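One possible vectorized construction of the full kernel matrix (a sketch, not the only way to do it):

    import numpy as np

    def rbf_kernel(X1, X2, sigma):
        # Pairwise squared distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
        sq = (X1**2).sum(1)[:, None] - 2 * X1 @ X2.T + (X2**2).sum(1)[None, :]
        return np.exp(-sq / (2 * sigma**2))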
Sweep σ across values σ = 1, 0.1, 0.001, 0.0001, and for each value, pick the best hyperparameters (α, ρ, and
maximum iterations), and return the train and test errors of the learned RBF model for each value of σ.
(d) Plot the contour plots of the best-learned linear and RBF model under the choices of σ: σ = 1, 0.1, 0.001, 0.0001.
(e) In your own words, describe what is happening to the model as σ → 0.
8.7. Multiclass classification. Consider the multiclass logistic regression optimization problem
maximize_{Θ∈R^{n×K}}  f(Θ) = (1/m) Σ_{i=1}^m [ Σ_{k=1}^K yik xiᵀθk − log( Σ_{k=1}^K exp(xiᵀθk) ) ],
where yik = 1 if data sample i is in class k, and 0 otherwise. As usual, xi ∈ Rn is the ith data feature. Here, we
write the entire matrix variable as
Θ = [θ1 θ2 · · · θK].
Show that both terms vᵀdiag(d)v and vᵀddᵀv must be in the range [0, 1].
• Then, for a sample point x = xi, show that the maximum eigenvalue of ∇²_Θ f̂(Θ) is at most ∥x∥₂², where

f̂(Θ) = log( Σ_{l=1}^K exp(xᵀθl) ) = g(Θᵀx).

The function g is sometimes called the log-sum-exp function. As we saw in lecture, it has the nice property of acting like a
soft-max function, by “pulling” away the largest values of θi, to somewhat exaggerate their “lead”.
A downside of using the log-sum-exp function is that it can have numerical issues. If θi is somewhat big, then
exp(θi ) becomes very big, and can cause overflow. Conversely if θi is very negative, then all the values may
be too close to 0 and cause underflow.
The “log-sum-exp-trick” is a numerical trick which deals with this issue, by adding and subtracting a constant
whenever necessary. In effect, we simply do
f(θ) = log( Σ_{i=1}^m exp(θi − D) ) + D,

where the log term (everything except the trailing +D) is denoted f1(θ).
Then, for the right choice of D, we can prevent overflow and underflow.
Propose a value of D such that f1 (θ) ≤ c (preventing overflow), and another value such that f1 (θ) ≥ c
(preventing underflow), for some reasonably sized constant c.
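As a sketch, the customary choice is D = max_i θi, which makes every exponent ≤ 0 (so nothing can overflow) while guaranteeing at least one term equals exp(0) = 1 (so the whole sum cannot underflow to 0):

import numpy as np

def logsumexp(theta):
    D = np.max(theta)                 # customary choice: the largest entry
    return np.log(np.sum(np.exp(theta - D))) + D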
(d) Coding. Run multiclass logistic regression on the MNIST dataset, against each of the 10 classes.
While usually we pick a step size of 2/L, I have tried this and found that a larger step size of 10^−5 works well.
Use this step size and run for 500 iterations, or however many you need to see reasonable “working” behavior.
Show the train/test loss plot and the train/test classification accuracy plot.
8.8. Kmeans and beyond. In this problem, you will code up the Kmeans method and experiment with the effect of
different initializations on the final outcome.
• Download kmeans_release.ipynb. I have generated a multiclass classification problem for you in 2-D, which
you can easily visualize. Load that data.
• First, let’s fill in our helper functions. Code up a function that takes in two matrices, X ∈ R^{m,n} and z ∈ R^{K,n},
and returns a matrix D where

D_{i,j} = ∥X_{i,:} − z_{j,:}∥₂².

Technically this is the distance squared, but it will work for our purposes. (Sketches of all of these helpers appear after this list.)
• Next, code up a function that identifies the closest center point. Given X ∈ Rm,n and z ∈ RK,n , it returns a
vector ŷ ∈ Rm where ŷi = the index of the row in z that is closest to Xi,: .
• Finally, code up a function that recomputes cluster centers, given label guesses. That is, it takes X and ŷ and
returns z where zk,: is the average of all the rows Xi,: of X where ŷi = k.
• To evaluate our success, we will compute cluster purities:

purity(set) = max_k (# items in the set with label k) / (size of the set),   Avg. purity = (1/K) Σ_{k=1}^K purity(cluster k).
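Minimal sketches of these helpers, under the shape conventions above (X is m × n, z is K × n, and labels are integer class ids); the names and signatures are placeholders rather than the notebook's required interface:

import numpy as np

def dist2(X, z):
    # D[i, j] = ||X[i, :] - z[j, :]||^2, without explicit loops
    return (np.sum(X**2, axis=1)[:, None]
            + np.sum(z**2, axis=1)[None, :]
            - 2.0 * X @ z.T)

def closest_center(X, z):
    return np.argmin(dist2(X, z), axis=1)   # yhat[i] = nearest row of z

def recompute_centers(X, yhat, K):
    # note: an empty cluster would need special handling (e.g. re-seeding)
    return np.vstack([X[yhat == k].mean(axis=0) for k in range(K)])

def avg_purity(y, yhat, K):
    purities = [np.max(np.bincount(y[yhat == k])) / np.sum(yhat == k)
                for k in range(K) if np.any(yhat == k)]
    return np.mean(purities)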
Armed with these helper functions, we may now begin some experiments.
(a) Code up a generic Kmeans method that recomputes cluster ids and centers for T iterations, where we set
T = 100. Run this for any initialization of z, and plot the outcome. Turn in at least one plot of a successful
run.
(b) Run Kmeans for T = 25 iterations, for P = 100 random trials, each where z is initialized as
z = np.random.randn(K,2)*10.
that is, a random Gaussian distribution in 2-D, of standard deviation 10. Give a plot showing the trajectories
of the purities across iterations, across trials, and comment on the “success ratio” of such a run.
(c) Run Kmeans for the same choices of T and P , where the initialization are randomly selected points from the
dataset, e.g.
idx = np.random.choice(m,K); z = X[idx,:]
Again, give a plot showing the trajectories of the purities across iterations, across trials, and comment on the
“success ratio” of such a run.
(d) Finally, we compare these “simple initialization schemes” with the (still pretty simple, but slightly more
thought-out) Kmeans++ initialization scheme. Concretely, the method works as follows 6 (see the sketch after this list):
• Choose one center uniformly at random among the data points. Call it z1.
• Compute the distance between every point xi and this first center z1, as di = D(xi, z1).
• Now, using the probability mass function pi = di² / Σ_j dj², pick another point to be the next center z2.
• Next, recompute di for i = 1, ..., m as the minimum distance between point xi and any previously chosen
center. Again forming pi = di² / Σ_j dj², pick another point to be the next center z3.
• Repeat the previous step until all initial centers z1, ..., zK are chosen.
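A sketch of this initialization, reusing the dist2 helper from before:

import numpy as np

def kmeanspp_init(X, K):
    m = X.shape[0]
    centers = [X[np.random.choice(m)]]      # first center: uniform over the data
    for _ in range(K - 1):
        d2 = np.min(dist2(X, np.vstack(centers)), axis=1)  # nearest-center distances
        p = d2 / d2.sum()                   # p_i = d_i^2 / sum_j d_j^2
        centers.append(X[np.random.choice(m, p=p)])
    return np.vstack(centers)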
Again, give a plot showing the trajectories of the purities across iterations, across trials, and comment on the
“success ratio” of such a run.
8.9. Gaussian mixture models of socks. I have three kids, Alexa, Elon, and Zuckie. It’s Christmastime, and all of my
friends bought socks for the childrens’ presents.
Recall that the usual p.d.f. for a 1-D Gaussian is

pG(x; µ, σ) = (1/(σ√(2π))) exp( −(1/2)((x − µ)/σ)² ).
(a) Alexa is an infant, and her feet are only 3 inches long. Elon is 5 years old, with feet about 6 inches long.
Zuckie is a full-blown teenager, with feet 10 inches long.
Now, Alexa received 2x as many socks as Elon, and Zuckie received 1/3 as many socks as Elon.
Assume that all the socks, for each child, have sizes that can be fit with a Gaussian model, with their exact
foot size as the mean, and 1 inch² as the variance.
i. Propose a Gaussian mixture model probability density function for this scenario. You may express your
answer in terms of pG , to save space.
ii. I reach into the pile and pick a sock at random. What’s the probability that the sock is for a foot between
6 and 8 inches long? Round your answer to the nearest 0.001.
(b) Now I visit my twin sister’s home. I forgot how many children she has, or how old they are, so I decide to
guess based on the sizes of socks under her Christmas tree.
There are 8 socks, of sizes (in inches): 8, 2, 7, 3, 2, 4, 3, 6.
i. Starting with a guess of two children, with foot sizes µ1 = 2 inches and µ2 = 8 inches, use one iteration
of k-means to cluster the socks into two piles. Write down the sizes of each cluster below.
Hint: Write down all your intermediate calculations, as they may help you with later calculations.
Cluster 1:
Cluster 2:
ii. Recompute the new means µ1 and µ2. Compute the variances σ1² and σ2² of each cluster, and the weightings
α1 and α2 of each cluster, such that these parameters (µ1, µ2, σ1², σ2², α1, α2) form the best-fitting
Gaussian mixture model, given the cluster assignments. Round all answers to the nearest 0.001.
iii. In order to make it up to my sister for constantly forgetting her kids’ birthdays, I buy a whole bunch of
mittens, each matching the socks. So, you can think of each present as paired, e.g.
socks (length in inches):    8   2   7   3   2   4   3   6
mittens (weight in ounces): 12   4  16   7   4   8   6  14
Using the same cluster assignment as above, compute the mean vectors µ1 ∈ R2 , µ2 ∈ R2 , covariance
matrices Σ1 ∈ R2×2 , Σ2 ∈ R2×2 , and weightings α1 and α2 for these presents, which are now feature
vectors of length 2.
8.10. Gaussian mixture model. Given x1 , ..., xm ∈ Rn , the Gaussian mixture model fits the probabilistic model of
xi ∼ N (µk , Σk )
whenever sample point xi belongs to cluster k.
6 Reference: Wikipedia https://en.wikipedia.org/wiki/K-means
(a) Maximization. Define zi,k = 1 if sample i is in cluster k, and zi,k = 0 otherwise.
We write the likelihood of the xi and zi,k as

Pr({xi, zi,k}_{i,k} | αk, µk, Σk) = Pr({xi}_{i,k} | αk, µk, Σk, zi,k) · Pr({zi,k}_{i,k} | αk, µk, Σk),

where we call the first factor on the right (A) and the second (B).
Here, µk and Σk are the mean and covariances of the kth cluster, and αk is the prior probability that any
point belongs to cluster k.
Show that term (A) is maximized when

µk = ( Σ_i zi,k xi ) / ( Σ_j zj,k ),   Σk = ( Σ_i zi,k (xi − µk)(xi − µk)ᵀ ) / ( Σ_j zj,k ).
Hint: you may use the shorthand for the multivariate Gaussian pdf:

fN(x; µ, Σ) = (2π)^{−n/2} |Σ|^{−1/2} exp( −(1/2)(x − µ)ᵀΣ⁻¹(x − µ) ).
(d) Finally, we will implement the procedure on some problems. First, we use a curated 2-D problem, which will
help us work out bugs, and visualize our result, before figuring out how to extend this to a higher dimensional
problem.
• Open mnist_GMM.ipynb. We will load the MNIST dataset, and build a Gaussian mixture model over a
2-D PCA representation of it. The reduced dimensionality will help us with some numerical stability, and
can make the landscape easier to visualize.
• Before doing anything, fill in the gaussfn(x,mu,C) function with the formula for the pdf of a multivariate
Gaussian with mean vector µ and covariance matrix C. The input x should be of dimension 2 × m, where
m is the number of training samples in 2 dimensions, and the output should be a vector of length m,
representing the pdf at the m points. (A sketch is given after this list.)
• I have provided for you two functions: plot_label(y), which plots the datapoints with different colors
for each label, and plot_gauss(alpha,mu,Clist), which plots contours of Gaussians, weighted by α, with
centers µ and covariances in the list Clist. Note that this function depends on gaussfn being coded
correctly!
• Fill in the four functions get_pi, get_alpha, get_mu, get_Clist.
• If you’ve done the previous part correctly, you should just be able to run the last two boxes and get the
GMM!
• Turn in two plots, one with the Gaussian contour plot overlaying the points colored by their true labels,
and one overlaying the points colored by their learned labels. Comment on any discrepancies.
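For reference, a minimal sketch of gaussfn under the stated convention (x is 2 × m, mu is a length-2 vector, C is 2 × 2):

import numpy as np

def gaussfn(x, mu, C):
    n = x.shape[0]                               # here n = 2
    diff = x - np.reshape(mu, (-1, 1))           # broadcast the mean over columns
    quad = np.sum(diff * (np.linalg.inv(C) @ diff), axis=0)  # (x-mu)^T C^{-1} (x-mu)
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(C))
    return np.exp(-0.5 * quad) / norm            # pdf at each of the m points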
(e) Now, we will extend this to the full mnist dataset, without any curation. This problem is harder because, in
its full 784 dimensional form, we get a lot of singular matrices.
• The issue that arises in the higher-dimensional regime is that the covariance matrix C is often so close
to singular, especially in the unimportant dimensions, that inverting it is a huge pain. So, we will use a
“lazy linear algebra trick”, which is to add a tiny diagonal, +εI for some small ε > 0, to the covariance matrices C each time
we compute them. This will get your code to converge much better.
• Another issue that often presents problems in probabilistic modeling is that the actual data samples are
very sparse in the metric space. One “trick” that helps smooth out the computations is to add a tiny
bit of noise to the samples; this avoids, basically, dividing by 0. I have already done this for you in the
released code.
• Now, for 5 different initializations, run the GMM method on the n = 196 dataset for 100 trials. Plot the
average purity, e.g.

purity(S) = (# of the most frequent element in S) / (size of S),

where the average purity is the average of purity(Sk) over the sets Sk = {y[i] : ŷ[i] = k}.
8.11. Clustering
• Open the python notebook mnist_dimred_clustering.ipynb, and load the MNIST data by running the first
cell. Don’t change the way I’ve formatted it, or the checksums won’t work.
• Fill in the function for determining the Euclidean distance between any sample point and the entire training
data. If done successfully, your checksum should read 160239119987.1912.
• Using this function, code up a K-Means method, using an initialization of
mu = X[:K,:]
where K = 10 is the number of clusters. After convergence (10 iterations should be sufficient), plot the centers
and the 10 closest points to each center.
• Define the purity of a set S as the following fraction:

class purity(S) = max_i |{j ∈ S : yj = i}| / |S|,

that is, it is the size of the largest same-label subset of S divided by the total size of S.
The entire clustering purity will be defined as the average class purity amongst the K clustered sets.
Compute the purity of the Kmeans algorithm on the vanilla dataset. After 10 iterations, it should be 0.62.
• Now, we will repeat this previous step, but also experiment with a variety of dimensionality reduction methods.
First, de-mean the data:
X = X - np.outer(np.ones(X.shape[0]),np.mean(X,axis=0)).
– Using np.linalg.svd, implement PCA, and reduce the data dimension to d = 10, 100, and 500. In each
case, re-run K-Means to get new class memberships. Report all the purity measures. (You may want to
enact the option full_matrices=False when computing the SVD; a sketch is given after this list.)
– Use random hashing (as promoted by the JL lemma) to reduce the feature dimension to 10, 100, 500
dimensions, and re-cluster MNIST.
– Use spectral embeddings to reduce the dimension to 10, 100, 500, and re-cluster MNIST.
– Use t-SNE to reduce dimensions to 1,2,3, and cluster with K-Means. You may use sklearn here.
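For the PCA bullet above, one possible sketch on the de-meaned X (d is the target dimension):

import numpy as np

def pca_reduce(X, d):
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:d].T     # project onto the top-d right singular vectors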
Report all the average purities by filling out the table below (an “x” marks a cell that does not apply):

                       | 10 (1) | 100 (2) | 500 (3) | full
No dim reduction       |   x    |    x    |    x    |
PCA                    |        |         |         |  x
JL Hashing             |        |         |         |  x
Spectral embedding     |        |         |         |  x
t-SNE (dim in parens)  |        |         |         |  x
• Comment on the effect of the different dimensionality reduction schemes in the MNIST clustering task. What
are the tradeoffs, in terms of performance and computational complexity?
• Plot the t-SNE visualizations for the three dimensions, colored by their original class label and comment on
whether you think t-SNE respected the natural cluster structure.
9 Ensemble methods
9.1. Bootstrap Aggregation and Random Forest.
(a) Using the training data below, construct a decision tree by following the steps below.
animal label location color outer
Hello cat house pink fur
Garfield cat house orange fur
Tom cat farm blue fur
Butch cat farm black fur
Babe pig farm pink skin
Wilbur pig jungle pink skin
Nemo fish ocean orange scales
Ursula fish ocean black skin
Marlin fish ocean orange scales
Mighty armadillo jungle pink shell
i. Give the entropy across labels, rounding to 3 significant digits. That is, assuming that Pr(cat) = 4/10,
Pr(pig) = 2/10, Pr(fish) = 3/10, and Pr(armadillo) = 1/10, what is the entropy across the whole
label space? (A helper sketch for these computations follows part iii.)
ii. For the first split, pick the feature that, when split, maximizes information gain. Split across all the
feature values, creating a 2-level n-ary tree. You will have to try all three splits, so report the information
gain in each case, and identify clearly which feature had the highest information gain when split.
iii. Now, you should have a tree with 1 root and k leaves. Across each leaf, report the node purity. Which
node do we split next?
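A small helper for the entropy and information-gain arithmetic in parts i and ii (a sketch of the bookkeeping only, not the graded answers; labels and feature values are assumed to be np.arrays):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))      # entropy in bits

def information_gain(labels, feature_values):
    gain = entropy(labels)
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain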
(b) Using this 2-layer decision tree, infer the labels of the animals in the following test set and report the test
error.
animal true label location color outer
Pumbaa pig jungle pink skin
Cheshire cat jungle pink fur
Salem cat house black fur
(c) I close my eyes and pick three animals from my train set at random (a minibatch), and I do this 3 times. The
names that come up are
a. {Hello, Wilbur, Ursula}
b. {Butch, Babe, Mighty}
c. {Nemo, Garfield, Ursula}
Now, we construct a small random forest as follows:
i. For each minibatch, create a binary decision stump, e.g. find the best discriminating feature and create
a tree with 1 root and k leaves, fully describing one feature. Pick the best feature, with the strongest
discriminating power. (There is a unique right answer.)
ii. Now propose a classification scheme that aggregates the decision stumps, using a majority rule aggregation
scheme. For ties, flip a coin randomly. Use this to give a labeling on the entire train and test set, below.
animal true label tree 1 label tree 2 label tree 3 label majority vote label
Hello cat
Garfield cat
Tom cat
Butch cat
Babe pig
Wilbur pig
Nemo fish
Ursula fish
Marlin fish
Mighty armadillo
Pumbaa pig
Cheshire cat
Salem cat
9.2. Adaboost. A popular and computationally cheap boosting method is Adaboost, described in Algorithm 1. In
particular, it is a greedy coordinate-wise method that minimizes the empirical exponential loss; e.g. given a
predictor ŷ = h(x), we find the h which minimizes

f(h) = (1/m) Σ_{i=1}^m exp(−yi h(xi)).
In this problem, we will implement Adaboost and analyze its greedy structure.
Update α:

α(t) = (1/2) log( (1 − ε(t)) / ε(t) ),   ε(t) := Σ_{i : h(t)(xi) ≠ yi} wi(t−1).

Update weights:

ŵi(t) = wi(t−1) exp(−yi α(t) h(t)(xi)),   wi(t) = ŵi(t) / Σ_j ŵj(t).

end
(a) Greedy behavior. Show that the update for α indeed minimizes the empirical risk over the already-trained
classifiers, using the exponential loss function. That is, at some time t, show that
α(t) = argmin_α Σ_{i=1}^m L( H(t)(xi); yi ),   L(y; ŷ) = exp(−y ŷ),

where H(t)(x) = Σ_{t′=1}^t α(t′) h(t′)(x) is the current aggregated predictor. Note that y, ŷ, and h(t)(x) all take
binary values in {−1, 1}.
Hint. You do not need to follow this way, but here are some hints to get you started.
• Start by writing the loss function as a function of α = α(t) .
• Show that this function is convex in α by taking its second derivative, and arguing that it is nonnegative
everywhere.
• When setting the gradient to 0, it will be frustrating because it will look something like

f′(α) = Σ_{i=1}^m ci e^{α zi}.

Note, however, that in that scenario, zi ∈ {−1, 1}, and you can separate the sum to deal with both
scenarios separately. That is,

f′(α) = Σ_{zi>0} ci e^{α} + Σ_{zi<0} ci e^{−α} = 0.
So, our goal is to show that the adaboost algorithm update for α(t) is indeed the α that minimizes f (α).
Before going further, we first do some simplifications. We notice that

wi(t−1) = exp( −yi Σ_{t′=1}^{t−1} α(t′) h(t′)(xi) ),

so we can simplify f as

f(α) = Σ_{i=1}^m wi(t−1) exp(−α zi),   zi = yi h(t)(xi) ∈ {−1, 1}.

In particular, in the f″(α) form, each element in the sum is nonnegative, so f″(α) ≥ 0 for all α. So, since α
is unconstrained, we know we are just looking for α where f′(α) = 0. We use the last hint to break apart the
sum:

f′(α) = −Σ_{i : zi=1} wi(t−1) e^{−α} + Σ_{i : zi=−1} wi(t−1) e^{α} = 0,

which, scaling both terms by e^{α} and solving for e^{2α}, yields

e^{2α} = ( Σ_{i : zi=1} wi(t−1) ) / ( Σ_{i : zi=−1} wi(t−1) ) = (1 − ε(t)) / ε(t),

since Σ_{i : zi=−1} wi(t−1) is just the weighted misclassification rate ε(t). Solving for α now yields the claimed result.
(b) Coding. Open the mnist_adaboost_release directory and, in the iPython notebook, download the data. We
are again going to do 4/9 handwriting disambiguation. Only minimal preprocessing was used in this dataset,
since for decision trees, normalization is less of an issue.
• Borrowing code from the previous exercise, implement a decision stump (tree with depth = 1). Initialize
weights as wi = 1/m for all i = 1, ..., m, and fit the decision tree over the weighted misclassification error,
using the code snippet
clf = clf.fit(Xtrain, ytrain, sample_weight=w)
Report the train and test misclassification rates using just this decision stump. Report also the train
exponential loss value.
• Now implement the Adaboost method, as shown in algorithm 1. Plot the training exponential loss, and
train and test misclassification rate. How would you say the performance compares to previous versions
of this task (e.g. using logistic regression, 1-NN, etc)?
Plot also ϵ(t) and α(t) as a function of t. For what values of ϵ(t) is α(t) really large and positive? really
large and negative? close to 0? Interpret this mechanism; what is it saying about how boosting uses
classifiers, in terms of their weighted performance?
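A sketch of the overall loop, using sklearn decision stumps as the weak learners (variable names are placeholders, and the ε = 0 corner case is glossed over):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(Xtrain, ytrain, T):
    m = Xtrain.shape[0]
    w = np.ones(m) / m
    stumps, alphas, eps_hist = [], [], []
    for t in range(T):
        clf = DecisionTreeClassifier(max_depth=1)
        clf.fit(Xtrain, ytrain, sample_weight=w)
        pred = clf.predict(Xtrain)
        eps = np.sum(w[pred != ytrain])          # weighted misclassification rate
        alpha = 0.5 * np.log((1 - eps) / eps)
        w = w * np.exp(-ytrain * alpha * pred)   # downweight correct, upweight wrong
        w = w / w.sum()
        stumps.append(clf); alphas.append(alpha); eps_hist.append(eps)
    return stumps, np.array(alphas), np.array(eps_hist)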
10 Generalization
10.1. Take out the trash. My mom bought me a telescope for my birthday, so now I use this telescope to spy on my
neighbors. In particular, I’m trying to figure out when is the best time for neighbors to put out their trash.
I have five neighbors, and every day, I record the time they take out their trash. I do this for 7 days. I count their
time based on how many minutes past 6:00pm.
Round your answers to the nearest 0.01.
(a) On Monday,
• Alice took out her trash at 7:00 on the dot (60 minutes past 6pm).
• Bob took it out at 6:55 pm (55 minutes past)
• Chuck took it out at 7:03 pm (63 minutes past)
• Danesh took it out at 7:39 (99 minutes past)
• Esmeralda took it out at 8pm (120 minutes past). (She’s a night owl.)
Let’s define the random variable X = time that trash is taken out. What is the sample mean of X, based on
Monday’s data? What is the sample variance, in minutes squared?
Hint: It might be easier to think of all these quantities as “minutes after 6:00pm” and then transform back
afterwards.
Recall: Sample mean is the average value. Sample variance is the average of the per-point variances, where
the variance of one point is

var(xi) = (xi − E[X])².
(b) Recall that an estimator Y is a biased estimator of X if
E[Y ] ̸= E[X]
Z = (3/4)·Y + (1/4)·60.
Compute the bias and variance of Z, in terms of the expectation and variance of X.
(d) Is this second estimator biased? Why or why not?
(e) How does the variance of this second estimator Z compare to the variance of Y ?
10.2. Wish upon a star. Suppose I want to identify the location of a star, which lives at coordinates defined by x ∈ R.
Every day I go to the telescope, and I receive a new measurement yi = x + zi , where zi ∼ N (0, 1) is the noise in
each measurement.
After m days, I receive m measurements, y1 , ..., ym ∈ R.
(a) First, assume that without observations, we have no idea what x is, and can only impose a uniform prior over
some reasonably large support. 7 Denote xMLE as the maximum likelihood estimate of x given yi and compute
xMLE in terms of yi and m.
(b) Compute the bias and variance of xMLE . Describe the behavior of the bias and variance as m → +∞.
(c) A colleague walks in the room and scoffs at my experiment. “I already know where this star is!” the colleague
exclaims, and gives me a new set of measurements x̄ ∈ Rn . “You can just cancel your experiment now!”
Trouble is, I know the colleague is full of hot air, so while this is valuable information, I’m not willing to take
it without any verification. Instead, I estimate x by solving a linear regression problem
minimize_x  (1/(2m)) Σ_{i=1}^m (yi − x)² + (ρ/2)(x − x̄)²
for some ρ > 0. Denote xMAP as the solution to this linear regression problem. Write out xMAP in terms of
yi , x̄, ρ, and m.
(d) Compute also the bias and variance of xMAP , its behavior as m → +∞, and how it compares to the bias and
variance of the MLE estimator.
(e) Show that for any estimator x̂, the expected squared error E[(x − x̂)2 ] = B 2 + V where B = x − E[x̂] the
estimator bias and V = E[(E[x̂] − x̂)2 ] the estimator variance.
(f) In terms of constants ∆ = x − x̄ and m, what value of ρ minimizes the squared error of the estimator xMAP ?
Hint: if a function is not convex in α, it may be convex in β where β = f (α) is a 1-1 function of α.
(g) The xMAP is the maximum a posteriori solution, brought about by incorporating the obnoxious colleague’s
observations as the mean of a Gaussian prior on x. Suppose we also model the posterior of each observation
as a Gaussian, e.g.
yi |x ∼ N (x; 1).
Compute the variance of the prior distribution assumed by this xMAP calculation.
(b) Suppose we have external reasons why we would like to have small, short trees, so we add a tree depth
penalty, in the form of
mixed penalty = test misclassification rate + (1/10)·log(tree depth).

Note that this is the fitted tree depth, not the max depth set by you (which is just an upper limit). Plot
this mixed penalty as a function of the max tree depth.
(c) Finally, after performing a proper K-fold cross validation using this mixed penalty, plot the K validation
losses and their average, as well as the separate test penalty, and include it in your writeup.
In each case, describe the plots briefly in your writeup.
10.4. Generalization error of linear regression. Consider a model of data/label pairs as
where θ0 is some vector that is constant, but unknown, and ∥θ0 ∥2 = 1. Suppose I draw m data/label pairs, and
pack them into the following matrix and vector
X = [x1ᵀ; x2ᵀ; ⋯; xmᵀ],   y = [y1; y2; ⋯; ym].
I now consider a linear regression model, which attempts to minimize the loss
Suppose now that I have a θ̂ which is the unique solution to the problem
(a) Suppose (LinReg) is feasible. Show that the linear regression models formed by using θ̂ for any v̂ all have the
same training loss.
(b) Show that the solution to (LinReg) can be written as
θ̂ = θ1 + θ2
where θ1 is the projection of v on the nullspace of XT X, and θ2 is the min-norm solution to (LinReg); e.g. θ2
solves (LinReg) for v̂ = 0. Hint: Use Lagrange multipliers and KKT conditions.
(c) We now define the generalization loss of a model defined by θ as
Derive a simplified formula for L(θ) in terms of the model constants Σx , σy , θ0 and variable θ. Show that this
quantity is lower bounded by σy2 .
(d) Show that this lower bound is achieved by any solution θ̂ where XT Xθ̂ = XT y when the nullspace of the
matrix XT X is identical to the nullspace of Σx , provided that XT Xθ̂0 = XT y.
(e) Now suppose that the nullspace of XT X contains subspaces that are in the range of Σx . We will now show
that this generalization error quantity can be quite large.
(f) Now suppose that XT Xθ̃ = XT y, and θ̃ is trained by minimizing (LinReg) for some v where Xv̂ = 0, ∥θ̃∥2 = 1,
but ∥Σx ∥2 = 1, ∥v∥2 = c, and v̂ T Σx v̂ = c2 ρ. (Here, 0 < ρ ≤ 1 gives the weight of the component of v̂ in the
range of the first eigenvector of Σx .) Show that the generalization error is now lower bounded by c2 ρ − 4c + σy2 .
Remark on what this bound implies if c is very large and ρ is close to 1.
11 Optimization analysis
11.1. Convex or not convex. Are the following sets convex or not convex? Justify your answer.
(a) Norm balls. Consider any norm ∥x∥. Show that the norm properties dictate that the norm balls, defined as

Br = {x : ∥x∥ ≤ r},

are always convex sets. Show that for any r > 0, the complement set Brᶜ = {x : ∥x∥ > r} is nonconvex.
(b) Level sets. For a function f : Rd → R, we define the level sets as
Sr = {x : f (x) ≤ r}.
Show that if f is convex, then all of its level sets Sr are convex. Is the opposite true? (If a function has only
convex level sets, is it necessarily convex?)
(a) f (x) = cT x
(b) f (x) = ∥Ax − b∥22
(c) f (x) = exp(x)
(d) f (x) = log(x), for x ≥ 1.
11.6. Medians. We will show that the solution to the scalar optimization problem
minimize_x  Σ_{i=1}^m |zi − x|

is in fact achieved when x∗ is the median of the zi. We can do this using subdifferentials, which are a generalization
of gradients. In particular, for a function that is differentiable almost everywhere, you can find the (Clarke)
subdifferential 8 by taking the convex hull of all the derivatives approaching that point; e.g.

∂f(x) = conv{ lim_{ϵ→0} ∇f(x + ϵz) : for all z }.

For reference, the convex hull of S is defined as the set that contains all convex combinations of elements in S:

conv(S) = { Σ_i λi si : si ∈ S, λi ≥ 0, Σ_i λi = 1 }.
(a) Show that the subdifferential of the absolute value function f(x) = |x| is

∂f(x) = {1} if x > 0,   {−1} if x < 0,   [−1, 1] if x = 0.
Hint: The eigenvalue decomposition of a symmetric matrix can be written as C = VΛVᵀ, where
V = [v1 v2 · · · vn] contains the eigenvectors vi of C, and Λ = diag(λ1, λ2, · · ·, λn) contains the corresponding
eigenvalues λi. (That is, Cvi = λivi.) Under certain conditions, which we will just assume, 9 V is an
orthonormal matrix, e.g. VᵀV = VVᵀ = I. Then, we can use this to form projections; e.g. pick any vector
u. Then VVᵀu = VᵀVu = u.
(b) Show that if n > m and C = AᵀA, then λmin(C) = 0. Hint: The nonzero eigenvalues of AAᵀ and of
AᵀA are the same.
8 A convex subdifferential (set of all supporting hyperplanes) is a specific case of this that applies here, but for this problem I actually find
(c) We say a function f : Rⁿ → R is L-smooth if its gradient is L-Lipschitz, e.g. for some L > 0,
∥∇f(x) − ∇f(y)∥₂ ≤ L∥x − y∥₂ for all x, y.

∥x(t+1) − x∗∥₂ ≤ c∥x(t) − x∗∥₂
11.8. Linear regression without strong convexity still gets linear convergence. Now consider linear regression with

A = [1 1; 1 1],   b = [1; −1],

and ρ = 0 (no ridge). That is, we consider only

f(x) = (1/2)∥Ax − b∥₂².
(a) Linear algebra. For a matrix A, the nullspace of A is the set
null(A) = {x : Ax = 0},
(d) Linear regression doesn’t pick up nullspace components. Suppose x(0) = [0; 0]. Now we run the gradient descent
method
x(t+1) = x(t) − α∇f (x(t) ) (5)
to get x(1) , x(2) , ...
Show that using this initial point, P x(t) = 0 for all t, where P is the matrix computed in the previous question.
(e) Linear regression doesn’t pick up nullspace components, another example. Now suppose that x(0) = [1.23; 1.23].
Again, we run the gradient descent method. Show that
[Figure: gradient descent trajectories in the (x[1], x[2]) plane, axes from −2 to 2, with Range(A) and Null(A) drawn.]
(f) Consider a reduced gradient descent problem, where we minimize over a scalar variable v:

minimize_{v∈R}  g(v) = (1/2)∥ASv − Pb∥₂²,   S = (1/√2)[1; 1].
Argue that, using gradient descent
v (t+1) = v (t) − α∇g(v (t) ),
the iterates x(t) = Sv (t) are exactly those outputted by (5) where x(0) = Sv (0) .
Hint: Start by showing that SS T ∇f (x) = ∇f (x).
(g) Finally, show that g(v) is strongly convex in v.
(h) Show that for this specific choice of A, gradient descent converges linearly, e.g. f (x(t) ) − f (x∗ ) = O(ct ), and
carefully state what c is.
(i) Now consider any matrix A. Argue that this entire problem basically shows that gradient descent, minimizing
linear regression, will always converge at a linear rate, and describe what c is.
12 Optimization methods
12.1. Sparse regularizer. In this problem we will investigate a sparse regularizer, in which we replace the 2-norm regularizer
with a 1-norm regularizer. In other words, given A ∈ R^{m×n}, b ∈ R^m, and λ ∈ R, we will solve

minimize_{x∈Rⁿ}  (1/(2m))∥Ax − b∥₂² + λ∥x∥₁.   (6)
(a) This objective function is composed of a smooth (everywhere differentiable) and nonsmooth (not everywhere
differentiable) term. Show that ∥x∥1 is nonsmooth by describing all the points x where g(x) = ∥x∥1 is not
differentiable.
(b) Because the objective has a nonsmooth point, gradient descent will not converge to the global minimum. To
see that this is true, consider the case of m = n = 1, with A = b = 1, λ = 2. In other words, we consider
minimize_x  (1/2)(x − 1)² + 2|x|.   (7)
Start with x(0) = 2, and with a step size t = 1, write out the iterates x(k) for k = 1, 2, 3, 4. Is there a limit
point lim_{k→+∞} x(k)? If so, is this limit point the problem’s global minimum?
(c) We therefore will introduce a new method called the proximal gradient descent method. This method is similar
to gradient descent, except we break away the nonsmooth term and deal with it separately. Explicitly, for
solving
minimize f (x) + g(x)
x
where f(x) is smooth and g(x) is nonsmooth, the proximal gradient descent method picks a random point
x(0) and iterates

x(k+1) = prox_{tg}( x(k) − t∇f(x(k)) ),

where the mapping prox_{tg} is the proximal operator

prox_{tg}(z) = argmin_x  g(x) + (1/(2t))∥x − z∥₂².
We can interpret this as finding the variable x that trades off minimizing the nonsmooth term g(x) and a
proximity term (e.g. doesn’t want to deviate too far from z).
Show that the proximal operator of the 1-norm can be computed in closed form, as
prox_{tg}(z) = [x1; ⋯; xn],   xk = (|zk| − t)·sign(zk) if |zk| > t, and 0 otherwise.
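A sketch of this operator and of one proximal gradient step for (6); A, b, lam, and the step size t are placeholders:

import numpy as np

def prox_l1(z, t):
    # soft-thresholding: shrink toward 0, zero out anything within t of 0
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_grad_step(x, A, b, lam, t):
    m = A.shape[0]
    grad = A.T @ (A @ x - b) / m          # gradient of the smooth term
    return prox_l1(x - t * grad, t * lam) # prox of t * (lam * ||.||_1)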
Comparison with 2-norm regularization. We can also consider a 2-norm regularized version as well,
where we solve

minimize_x  (1/(2m))∥Ax − b∥₂² + λ∥x∥₂.   (8)
(f) Show that this regularization term ∥x∥2 is also nonsmooth.
(g) Derive the proximal operator proxtg for g(x) = ∥x∥2 .
(h) Use proximal gradient descent to solve (8), using the same choices of A and b as in the previous section.
Histogram the final solutions x∗ for λ = 0, 1, 2, 5, 10. Comment on the sparsifying properties of the 2-norm vs
the 1-norm.
12.2. How to deal with multiple advisers. I have m advisers, and I don’t know which one of them to trust. Every day
t = 1, ..., T, I ask all m advisers a prediction question, which they answer yes (yi(t) = 1) or no (yi(t) = −1), for i = 1, ..., m.
The true answer on each day is denoted by y⋆(t) ∈ {−1, 1}.

Each day, I have to make a guess as to what y⋆(t) has to be, which I do using the linear predictor

ŷpred(w, y(t)) = sign( f(t)(w) ),   f(t)(w) := Σ_{i=1}^m wi yi(t),

where the weights w satisfy the constraints

0 ≤ w ≤ 1,   Σ_{i=1}^m wi = 1.   (9)
(a) Online learning method. To learn the correct weighting, I run the following method. I start by setting
wi(0) = 1/m for all i. Then, every day, if an adviser is correct, I update that adviser’s weight as ŵi(t) = wi(t−1) e^µ
for some µ ≥ 0. If the adviser is not correct, I leave the weight alone: ŵi(t) = wi(t−1). Then I reweight, e.g.

wi(t) = ŵi(t) / Σ_{j=1}^m ŵj(t).
Hint: Note that the method doesn’t really change if you reweight at each iteration, vs reweighting at the very
end.
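A sketch of one day of this update (y_preds is the m-vector of adviser answers on day t and y_true the realized answer; both names are placeholders):

import numpy as np

def update_weights(w, y_preds, y_true, mu):
    w_hat = w * np.exp(mu * (y_preds == y_true))  # e^mu if correct, unchanged if not
    return w_hat / w_hat.sum()                    # renormalize onto the simplex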
(b) Projected incremental gradient descent. We can also think of a gradient descent method to solve (9),
where at each iteration, we take a gradient step

w(t+1) = proj_{∆_{m−1}}( w(t) − η∇f(t)(w(t)) ).
(c) Mirror descent. The set ∆_{m−1} = {w : 0 ≤ w ≤ 1, Σ_{i=1}^m wi = 1} is called the probability simplex.
Projecting onto this set is actually not very trivial, and involves sorting all the elements (which, if we remember
from algorithms, is at least O(m log m)). So, while we could run a projected gradient descent method to solve
(9), we will try a different trick instead.
We do something called the mirror descent method. Basically, this method is a projected gradient method,
but with a transformed variable. In particular, our mirror map is going to be the gradient of the strictly
convex function

g(u) = Σ_{i=1}^m ui log(ui)   (negative entropy function).
Derive the mirror map, which is ∇g(u), and the inverse mirror map. That is, find Φ and Φ⁻¹ where

ŵ = Φ⁻¹( Φ(w(t)) − ∇f(w(t)) ),
w(t+1) = Φ⁻¹( Φ( proj_{∆_{m−1}}(ŵ) ) ).
Show that this method is equivalent to the exponential weighted algorithm in part (a).
[Figure: six labeled points plotted in the plane, axes from −4 to 4: A and E near the top, D near the middle, C and F just below the horizontal axis, and B further down.]
The labels correspond to blue points (-1) and red points (+1), e.g.
yA = yB = yC = −1, yD = yE = yF = 1.
(a) Draw some decision boundaries. On the plot above, draw a line (dashed) corresponding to the set
S1 = {x : xT θ1 = 0}
where θ1 = [2; 1]. Also, draw a line (solid) corresponding to the set
S2 = {x : xT θ2 = 0}
where θ2 = [1; 1].
(b) For each point (A,B,C,D,E,F), give the distance from each point to the set S1 , and to set S2 .
(c) I use the usual linear predictor to deal with new points: y = sign(θT x).
i. If I pick θ = θ1 , which points (A,B,C,D,E,F) are my support vectors?
ii. What is this predictor’s (θ = θ1 ) minimum margin?
iii. If I pick θ = θ2 , which points (A,B,C,D,E,F) are my support vectors?
iv. What is this predictor’s (θ = θ2 ) minimum margin?
v. Which choice of θ maximizes the minimum margin?
(d) Consider all the following candidates for θi, each of which forms a decision boundary Si = {x : xᵀθi = 0}. Circle all the choices
of θi which lead to the maximum minimum margin solution. (Hint: read as “maximum (minimum margin)”
solution.)

θ1 = [2; 1],  θ2 = [1; 1],  θ3 = [−1; −1],  θ4 = [2; 2],  θ5 = [−4; −2].
13.2. Logistic regression and the Fisher Information Matrix. Recall that the training of logistic regression models is
equivalent to minimizing the following convex optimization problem
minimize_{θ∈Rⁿ}  f(θ) := (1/m) Σ_{i=1}^m fi(θ),   fi(θ) = log(1 + exp(−yi xiᵀθ)),   (LogReg)
for data features x1 , ..., xm ∈ Rn and binary labels y1 , ..., ym ∈ {−1, 1}.
(a) Derive the stochastic gradient of f with respect to θ. (That is, give the gradient of fi (θ).)
Hint: It may help to use the sigmoid function σ(s) = (1 + exp(−s))−1 .
(b) Derive the stochastic Hessian of f w.r.t. θ. (That is, give the Hessian of fi (θ).)
Hint: Don’t attempt this without first solving part (a) successfully.
(c) The Fisher information matrix is often used to describe the amount of information about a parameter θ that
is carried by the observables (xi, yi). Formally, it is written as

I(θ) = E_{y|x}[ (∇_θ log pθ(x, y)) (∇_θ log pθ(x, y))ᵀ | θ ],
where pθ (x, y) is the PDF distribution of x, y under parameter choice θ. Now consider the logistic model,
where
pθ (y|x) = σ(yxT θ).
Show that the Fisher information matrix for yi |xi is equivalent to the Hessian of fi in (LogReg).
Note that this implies that, for logistic models, we can identify the Hessian using only gradient information,
which may be a significant computational benefit.
Hint: It helps to recall that 1 − σ(s) = σ(−s).
Hint 2: Recall the exact formula for an expectation of a continuous distribution. Don’t try to guess this part.
13.3. Decision theory. I run a factory that makes widgets and gadgets. Despite best efforts, manufacturing defects can
always occur. I would like to inspect each of these items individually, but the cost of inspection is pretty high, so
I cannot inspect each individual widget and gadget.
The widgets and gadgets are printed on disks. A disk has a 10% chance of being warped. There are two printing
presses, a blue one and a red one. The table below gives the probability that, given a disk of a particular state
printed by a particular press, a widget or gadget printed on that disk is defective.
(To interpret the table: the probability that a gadget is defective, if it were on a disk that is not warped and printed
by a red press, is 5%.)
(a) First, we consider only the loss of quality in a product. That is, if we ship a widget or gadget that is defective, we
incur a loss of +1. Otherwise, we incur no losses. Assuming I do not do any inspection and just sell all the
widgets and gadgets, compute the Bayes risk if we use the red press, as well as the Bayes risk of using the
blue press.
i. Without inspecting anything (that is, we ship out everything we make), what is the Bayes risk of using a
red press? a blue press? Which machine would I use to minimize the Bayes’ risk?
ii. Without inspecting anything, what is the Minimax risk of using a red press? a blue press?
iii. Suppose I invest the effort into inspecting disks, and remove all warped disks. For now, suppose that
there is no cost in removing disks. (We are currently not considering losses in profit.) What is the Bayes
risk of using a red press? a blue press?
iv. After removing all warped disks, what is the Minimax risk of using a red press? a blue press?
(b) Widgets are primarily used in online advertising. If they are defective, they will end up sending an ad that
is undesirable. However, if they are removed, then no ad is sent out. Therefore, the revenue gained from a
widget is estimated at
revenue per widget =
  $1   if the widget is sold and is not defective,
  −$1  if the widget is sold and is defective,
  $0   if the widget is not sold.

There is no cost to rejecting a disk, but the cost of inspection is $c/100 per percent of disks inspected. (In other
words, if I inspect all the disks, that costs me $c per widget/gadget.) Every widget that is not on a disk that
was found to be warped is sold.
i. Compute the Bayes reward (e.g. the expected profit per widget) as a function of x = Pr(inspection) for
widgets, when using the blue press. Compute the same for the red press.
ii. If you were a consultant for my factory, at what value of c would you recommend that I inspect some of
the widgets? How much inspection? Does your answer change depending on whether we are using the
red or blue press?
(c) Gadgets are primarily used in medical care. If they are defective, someone will die. However, if they are
removed, then someone waits a day longer to get a much-needed test. While we can never assign monetary
value to a human life, in terms of insurance costs experts have estimated the following value:
revenue per gadget =
  $500      if the gadget is sold and is not defective,
  −$10,000  if the gadget is sold and is defective,
  $0        if the gadget is not sold.
Again, there is no cost to rejecting a disk, and again, the cost of inspection is $1 per percent of disks inspected.
Every gadget that is not on a disk that was found to be warped is sold.
i. Compute the Bayes reward (e.g. the expected profit per day) as a function of x = Pr(inspection) for
gadgets, when using the blue press. Compute the same for the red press.
ii. If you were a consultant for my factory, at what value of c would you recommend that I inspect some of
the gadgets? How much inspection? Does your answer change depending on whether we are using the
red or blue press?
13.4. Bayes Risk and 1-NN bounds. In lecture we saw that if RNN is the risk of a 1-NN classifier and R∗
the risk of the Bayes classifier, then the Bayes risk can be used to bound the 1-NN classifier, in that R∗ ≤
RNN ≤ 2R∗(1 − R∗). Here we will investigate this bound more carefully, to make sure we really understand all the
components of the proof.
We will analyze this bound in terms of a 2-cluster model, defined as follows:
• Y is a random variable taking values in {+1, −1}, and Pr(Y = 1) = p.
• X ∈ R is a random variable taking any value in R, defined by a scalar Gaussian distribution X ∼ N (Y · µ, σ).
That is, if Y = 1 then X has mean µ and variance σ 2 ; if Y = −1, then X has mean −µ and variance σ 2 .
The goal will be to perform binary classification on X, and analyze how the performance of 1-NN and Bayes
classifier works as we increase / decrease µ and σ.
(a) Exploring the model. Plot some histograms of X, by drawing m datapoints and labels x1 , ..., xm and
y1 , ..., ym according to our model. Use a “large enough” value of m so that the histogram well represents the
model at the limit m → +∞.
i. Sweep label balance. Do this for several choices of µ = 2, σ = 1 and p = 0.1, 0.25, 0.5
ii. Sweep separation width. Repeat for σ = 1, p = 0.5 and µ = 0.1, 2.0, 10.0.
iii. Sweep cluster variance. Repeat for µ = 2, p = 0.5 and σ = 0.1, 1.0, 3.0.
(b) Bayes classifier. Recall that we denote η(x) = Pr(Y = 1|x).
i. Write out this probability (that is, find an expression for η) in terms of µ, σ, p, and x.
Hint: Use Bayes’ rule. Additionally, you can use the property that, for two different PDFs p_{g(X)} and
p_{h(X)},

Pr(g(X) = g(x)) / Pr(h(X) = h(x)) = p_{g(x)} / p_{h(x)}.

Note that in general, Pr(g(X) = g(x)) ≠ p_{g(x)} when X is a continuous random variable! 11
ii. For µ = 1, σ = 1, p = 0.25, fill out, to 2 significant digits, the first three columns in this table:

x     | Pr(y = 1|x) (η(x)) | Pr(y = −1|x) (1 − η(x)) | Bayes risk (β(x)) | 1-NN risk
−2    |                    |                         |                   |
−1    |                    |                         |                   |
−0.5  |                    |                         |                   |
0.5   |                    |                         |                   |
iii. Given a new vector x drawn from this distribution, describe the action of the Bayes classifier. What is
the rule in choosing a label ŷ = 1 or ŷ = −1?
iv. Write out an expression for the Bayes risk (R∗ ) , in terms of µ, σ, and p. Your answer may involve an
integral that you do not need to evaluate.
v. Numerically estimate the Bayes risk (R∗ ) for p = 0.25, σ = 1, µ = 1. You can do this by generating 1000
points according to this distribution, calculating the Bayes risk for each point, and reporting the average.
(c) 1-NN classifier. Now we consider the limiting case of a 1-NN classifier; that is, we consider a scenario where,
for any test data point x, there exists a labeled training point zx that is arbitrarily close to x. We assume
that the training and test data are drawn i.i.d., so, conditioned on their respective labels, x and zx are not
correlated.
11 While in general the PDF does not tell us about the probability of a continuous variable taking a specific value, we can arrive at this
equivalence of their ratios through the use of Radon–Nikodym derivatives. Think of it as something similar to the chain rule: for two functions F
and G applied over a measurable set A, assuming F and G are absolutely continuous, F(A) = G(A) · ∂F(A)/∂G(A). Anyway, we do not need
to go into measure theory in this class; just rest assured that this operation in the hint is allowed!
i. In this regime, the probability of error should somehow be high in regions where the label could be 1 or
-1 with equal probability, but pretty low when the label is more likely 1 or -1. Write an expression, in
terms of p, µ, σ, and x, of the “limiting error”, e.g. the error of 1-NN if there always exists a labeled
point arbitrarily close to x.
ii. Fill out the last column in the table above, again for µ = 1, σ = 1, p = 0.25.
iii. Write out an expression for the 1-NN risk (RN N ) , in terms of µ, σ, and p. Your answer may involve an
integral that you do not need to evaluate.
iv. Numerically estimate the 1-NN risk (RNN) for p = 0.25, σ = 1, µ = 1. You can do this by generating
1000 points according to this distribution, calculating the limiting 1-NN error for each point, and reporting the
average.
(d) Ok, now we have all the pieces of the puzzle, and all the code snippets needed to do a more involved analysis!
In particular, we want to see under what regimes we would expect the 1-NN risk to approach its lower bound
(Bayes risk) or upper bound (2R∗ (1 − R∗ )).
• Write a function that takes a value σ, p, and µ, returns a numerical estimate of the Bayes risk R∗ and
1-NN risk RN N . Pick m “large enough” (I find m = 1000 works fine, but for certain regimes even fewer
points is sufficient.)
• Write a function that takes a value σ, p, and µ, generates x1 , ..., xm and y1 , ..., ym according to this 2-
cluster model, and using the first half of the data as a train set and the second half as a test set, returns
the test misclassification error for a 1-NN classifier.
• For σ = 1., p = 0.25, sweep µ as logspace(-2,2,10) * sigma. Plot as a function of µ the quantities
RN N , R∗ , the upper bound 2R∗ (1−R∗ ), and the misclassification rate of the implemented 1-NN. Comment
on what you see.
• Pick either 2 other values of σ or 2 other values of p and repeat this experiment. How do these other
parameters affect the bound and tightness? Can you venture a guess as to what kind of scenarios would
hit the upper bound, and what would hit the lower bound?
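A sketch of the Monte Carlo estimates, assuming the standard per-point expressions min(η, 1 − η) for the Bayes error and 2η(1 − η) for the limiting 1-NN error, which parts (b) and (c) ask you to derive:

import numpy as np
from scipy.stats import norm

def estimate_risks(p, mu, sigma, m=1000):
    y = np.where(np.random.rand(m) < p, 1, -1)       # labels, Pr(Y = 1) = p
    x = np.random.randn(m) * sigma + y * mu          # X | Y ~ N(Y mu, sigma^2)
    num = p * norm.pdf(x, mu, sigma)
    den = num + (1 - p) * norm.pdf(x, -mu, sigma)
    eta = num / den                                  # Pr(Y = 1 | x)
    bayes_risk = np.mean(np.minimum(eta, 1 - eta))
    nn_risk = np.mean(2 * eta * (1 - eta))
    return bayes_risk, nn_risk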
14 Deep learning
14.1. Movie recommendations. We are going to try to replicate the famous Netflix prize recommendation system,
but using an incredibly baby version of the model. (You don’t have two years to complete this assignment, after
all.) Open movie_recommendations.ipynb. I have downloaded data from the MovieLens databank 12 partitioned
into 33% train, 33% validation, and 34% test.
We will build a movie recommendation system using solely a matrix factorization technique, but with as much
respect to memory compression as possible. See also this benchmark 13 which gives an idea of what kind of values
of RMSE we are looking for.
where average user is a vector computed on the training data and gives each user’s average rating, after the
global average is subtracted away.
The second predictor should do the same thing, but average over movies. Report the RMSE of the train,
validation, and test set here.
Note that some users or movies from the train set may not appear in the validation or test set. In those cases,
you are stuck, and can only compute the global average score.
• Finally, build a simple predictor that uses the global average, user average, and movie average vectors. Do
this via the following
ratings(u,m) = global average + (1/2)·average user(u) + (1/2)·average movie(m).
Report the RMSE of the train, validation, and test set here.
• Now we run the matrix completion method, by minimizing

minimize_{U∈R^{#users,r}, V∈R^{#movies,r}}  f(U, V) := (1/2) Σ_{(i,j)∈Ω} ( uiᵀvj + c + m̄j/2 + w̄i/2 − R_{i,j} )²
Here, c is the global average, m̄ is the average movie rating after global average is subtracted away, and w̄ is
the average user rating after global average is subtracted away. We represent ui as the ith row of U and vj as
the jth row of V .
The matrix R ∈ R# users, # movies is an extremely sparse rating matrix, that contains the observed ratings.
We now do this in several steps. First, write down the gradient of this objective function w.r.t. ui and vj .
• Now code up the gradients, by filling in the two helper functions
def get_grad_ui(U,V,i):
    return np.zeros(r) # replace me

def get_grad_vj(U,V,j):
    return np.zeros(r) # replace me
Note that given a row i or j, the gradient with respect to ui should only involve the rows vj of V where R_{i,j} is observed.
Carefully read through the scipy.sparse.coo_matrix documentation to see the correct way of pulling out those
indices.
Hint: maybe something like...
j_nonzero = R.col[R.row==i]
Note also that you should never create a dense matrix R, and just work with its sparse form.
• Finally, code up the full gradient descent method. Note that you cannot use 0 initializations here for U and
V , which would render all gradients 0 (super suboptimal saddle point).
Pick a value of r = 5, step size 0.05, initialization of Ui,j ∼ N (0, 1/5) and Vi,j ∼ N (0, 1/5), i.i.d. Report the
RMSE of the train, validation, and test set after 20 iterations.
• Now, use the validation set to pick the best value of r and maxiter. Train the model under these hyperpa-
rameters. Plot the train, validation, and test RMSE for each iteration, for each value of r. Comment on this
trend.
14.2. Fully connected neural network. In this exercise, you will implement backpropagation on a fully connected neural
network, using ReLU nonlinearities. This is kind of the “textbook neural network” (also called a multilayer percep-
tron, or MLP) that is not used super often in practice, but can be very instructive for learning backpropagation.
More complicated neural network training is heavily based off of this core process.
The neural network we will consider will be constructed for regression, using the following loss function:
L(θ; x, y) = (B(x, θ) − y)² / 2,   x ∈ R, y ∈ R,

where B(x, θ) forms the representation of x:

B(x, θ) = θLᵀ r(θ_{L−1}ᵀ r(θ_{L−2}ᵀ · · · r(θ2ᵀ r(θ1ᵀ x + b1) + b2) · · · + b_{L−2}) + b_{L−1}) + bL.
(Here, you should think of θ = (θ1 , θ2 , ..., θL , b1 , ..., bL ) as collecting all the parameters across weights. The terms
b1 , ..., bL ∈ R are the bias weights.) L is the number of layers (aka depth-1) of the neural network, and we will just
consider r to be some differentiable function. (Usually, r(s) = max{s, 0} element-wise, e.g. the ReLU activation
function, which isn’t actually differentiable everywhere, but it’s close enough.)
Then, in training, we minimize the empirical loss function

minimize_θ  (1/m) Σ_{i=1}^m L(θ; xi, yi).
(a) Start with L = 1, so that B(x, θ) = θ1x + b1. This case reduces to our usual linear regression. Note that the two parameters, θ1 and b1, are both scalars.
Write down the gradient of L with respect to θ1 and b1 in this case:

∇_{θ1} L = ?,   ∇_{b1} L = ?
(b) Now, extend this to the case of L = 2; e.g., B(x, θ) = θ2ᵀ r(θ1x + b1) + b2. You may write r′(u) = v to represent
the element-wise derivative of r, e.g. v[i] = ∂r(u[i])/∂u[i]. We will consider some width w ≥ 1, and θ1 ∈ R^w, θ2 ∈ R^w.
Now, if we assign s1 = xθ1 + b1·1, write out the gradient of L with respect to each of the parameters.
Your answers can include the term s1 and other defined dummy variables.
(Hint: In all cases, write out the dimensions of everything. If you want to first try this on scratch paper, you
can practice with, say, w = 3. If you want to submit your whole answer with w = 3 to keep things cleaner,
that is acceptable – just know that in your code, you need to be able to adjust w.)
(Notation: In this question, you will be faced with a new (well, not that new) character called the “Jacobian”,
which maps derivatives from vectors to vectors. Specifically, for two vectors u ∈ Rn and v ∈ Rm , we may
write
J_u(v) ∈ R^{m×n},   (J_u(v))[i,j] = ∂vi/∂uj.
This will be useful in keeping with the matrix/vector spirit, which when coding up reduces for-loops and keeps
your code clean and efficient!)
(Notation: you can use the symbol ⊙ to indicate element-by-element multiplication. So, A ⊙ B = C means
Cij = Aij Bij .)
(c) Finally, extend this to a case of arbitrarily large L and w, where si = θiᵀ r(s_{i−1}) + bi·1 for i = 2, ..., L − 1 and
s1 = θ1ᵀx + b1·1. Note the dimensions of the parameters:

b1 ∈ R,   b2, ..., b_{L−1} ∈ R^w (vectors),   bL ∈ R.
(Hint: in this part, it is especially important to keep track of all the dimensions, and know exactly when to
transpose θi s. I am not being nitpicky; if you do this part wrong, your code will not work.)
(Notational hint: At some point, you will also meet a tensor, e.g. an element that is in Rw×w×w . This is
super annoying to work with, so I recommend computing ∇θi si[j] ∈ Rw (e.g. the gradient one element at a
time w.r.t. the output) rather than take on ∇θi si ∈ Rw×w×w .)
(Notation: you may also use the notation ej ∈ Rw to mean a 1-hot vector of length w; that is, ej [i] = 0 for all
i ̸= j, and ej [j] = 1.)
(Final hint: Remember that when taking gradients, things get transposed. So:
∇x cT x = ∇x xT c = c, JA (Ax) = AT , JB (B T x) = B
and so forth.)
(d) Download fitsinewave_release.ipynb and complete the exercise of fitting a sine wave using a 1-D deep
neural network, using r(s) = max{0, s}. Include a figure of the fit for L = 1, 2, 3, 4, 5 layers. How many layers
does it take to get a good fit?
(Hint: do not try to do this without first carefully doing the first part of this problem. I have been coding
up this thing for 8 years now and I have never done it successfully without writing it all out first on scratch
paper, dimensions and all.)
(Hint: Remember that when you are coding this up for a lot of training data, every gradient you compute
has to be averaged over all the training data. Parts (a)-(d) are about how to compute ∇θ L(θ; x, y), but you
actually need to back-propagate via
θnext = θ − (α/m) Σ_{i=1}^m ∇θ L(θ; xi, yi).
You will have to try several values of α > 0 to get the best choice.)
Submit a figure showing the sine wave, and your neural network’s fit. If you cannot get a perfect fit, then
submit 10 fits from different random initializations.
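As a sanity check before tackling general L, here is a minimal full-batch sketch of the L = 2 case from part (b), fitting the sine wave; the width w, step size alpha, and iteration count are all placeholders to tune:

import numpy as np

np.random.seed(0)
m, w, alpha = 200, 32, 1e-2
x = np.linspace(-np.pi, np.pi, m)
y = np.sin(x)

theta1, b1 = np.random.randn(w) * 0.5, np.zeros(w)
theta2, b2 = np.random.randn(w) * 0.5, 0.0

for it in range(20000):
    s1 = np.outer(x, theta1) + b1          # m x w pre-activations
    h = np.maximum(s1, 0.0)                # ReLU
    pred = h @ theta2 + b2                 # network outputs, length m
    err = pred - y
    # backpropagate, averaging the gradient over the m samples
    g_theta2 = h.T @ err / m
    g_b2 = err.mean()
    dh = np.outer(err, theta2) * (s1 > 0)  # chain rule through the ReLU
    g_theta1 = x @ dh / m
    g_b1 = dh.mean(axis=0)
    theta1 -= alpha * g_theta1; b1 -= alpha * g_b1
    theta2 -= alpha * g_theta2; b2 -= alpha * g_b2
# pred now roughly tracks np.sin(x); plot x against pred to inspect the fit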
14 url: https://www.gutenberg.org/ebooks/28885
15 url: https://archive.ics.uci.edu/ml/datasets/covertype
References
[1] xkcd comics. xkcd.com.
[2] Movielens. https://grouplens.org/datasets/movielens/, Dec 2021.
[3] Jock A Blackard and Denis J Dean. Comparative accuracies of artificial neural networks and discriminant analysis
in predicting forest cover types from cartographic variables. Computers and electronics in agriculture, 24(3):131–151,
1999.