
10-701 Introduction to Machine Learning

Homework 4, version 1.0
Due Nov 13, 11:59 am

Rules:

1. Homework submission is done via CMU Autolab system. Please package your writeup and code into
a zip or tar file, e.g., let submit.zip contain writeup.pdf and the code. Submit the package to
https://autolab.cs.cmu.edu/courses/10701-f15.
2. As with conference submission sites, repeated submission is allowed, so please feel free to refine your
answers. We will only grade the latest version.
3. Autolab may allow submissions after the deadline; note, however, that this is only to accommodate the
late-day policy. Please see the course website for the policy on late submissions.
4. We recommend that you typeset your homework using appropriate software such as LaTeX. If you
write by hand, please make sure your homework is clean and legible. The TAs will not invest undue
effort to decipher bad handwriting.
5. You are allowed to collaborate on the homework, but you should write up your own solution and code.
Please indicate your collaborators in your submission.

1 VC dimension (20 Points) (Xun)
To show a concept class H has VC dimension d, we need to prove both the lower bound VCdim(H) ≥ d and
the upper bound VCdim(H) ≤ d.

1. Show that linear classifiers h(x) = 1{a^⊤ x + b ≥ 0} in R^n have VC dimension n + 1.


Hint: the following theorem might be useful in proving the upper bound. A set of n + 2 points in
R^n can be partitioned into two disjoint subsets S_1 and S_2 such that their convex hulls intersect. The
convex hull conv(C) of a set C is defined as the set of all convex combinations of points in C:

conv(C) = { ∑_{i=1}^{k} α_i x_i : x_i ∈ C, α_i ≥ 0, ∑_{i=1}^{k} α_i = 1 }.  (1)

You do not need to know anything about convexity beyond this hint to solve this problem. (A concrete
instance of the hinted theorem is given after problem 2 below.)
2. Show that axis-aligned boxes h(x) = 1{a_i ≤ x_i ≤ b_i, ∀i} in R^n have VC dimension 2n.
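As a concrete instance of the theorem in the hint (an illustration only, not something you need to prove):
take the n + 2 = 4 points (0, 0), (1, 1), (1, 0), (0, 1) in R² and the partition S_1 = {(0, 0), (1, 1)},
S_2 = {(1, 0), (0, 1)}. Their convex hulls are the two diagonals of the unit square, which intersect at
(1/2, 1/2), since

(1/2, 1/2) = (1/2)(0, 0) + (1/2)(1, 1) ∈ conv(S_1)   and   (1/2, 1/2) = (1/2)(1, 0) + (1/2)(0, 1) ∈ conv(S_2).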

2 AdaBoost (30 Points) (Xun)


Consider m training examples S = {(x_1, y_1), ..., (x_m, y_m)}, where x ∈ X and y ∈ {−1, 1}. Suppose we
have a weak learning algorithm A which produces a hypothesis h : X → {−1, 1} given any distribution D of
examples. AdaBoost works as follows (slightly different from the lecture slides, but they are equivalent):

• Begin with the uniform distribution D_1(i) = 1/m, i = 1, ..., m.

• At each round t = 1, ..., T,

  – Run A on D_t and get h_t.

  – Update D_{t+1}(i) = (D_t(i)/Z_t) · e^{−α_t y_i h_t(x_i)}, where Z_t is the normalizer and i = 1, ..., m.
Note that since A is a weak learning algorithm, the produced h_t at round t is only slightly better than
random guessing, say, by a margin γ_t:

ε_t = err_{D_t}(h_t) = Pr_{x∼D_t}[y ≠ h_t(x)] = 1/2 − γ_t.  (2)

In the end, AdaBoost outputs H = sign(∑_{t=1}^{T} α_t h_t) as the learned hypothesis. We will now prove that
the training error err_S(H) of AdaBoost decreases to zero at a very fast rate. In your answers, please state
clearly why each step of the derivation holds, for instance “by Cauchy-Schwarz, ...”.
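The following is not part of the assignment, but for concreteness, here is a minimal Python sketch of one
round of the reweighting described above. The weak learner is left abstract; h_pred stands for the
predictions of whatever hypothesis A returned, and the choice of α_t anticipates part 1 below.

```python
import numpy as np

def adaboost_reweight(D, y, h_pred):
    """One AdaBoost round: weighted error, alpha_t, and the updated distribution.

    D      : current distribution D_t over the m examples (non-negative, sums to 1)
    y      : labels in {-1, +1}
    h_pred : predictions of the weak hypothesis h_t on the m training examples
    """
    eps = np.sum(D[h_pred != y])               # eps_t = err_{D_t}(h_t)
    alpha = 0.5 * np.log((1 - eps) / eps)      # alpha_t = (1/2) log((1 - eps_t) / eps_t)
    D_new = D * np.exp(-alpha * y * h_pred)    # unnormalized D_{t+1}
    Z = D_new.sum()                            # normalizer Z_t
    return D_new / Z, alpha, eps
```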

1. Let’s first justify the update rule. Imagine there is an adversary who wants to fool h_t in the next
round by adjusting the distribution. More formally, given h_t, the adversary wants to set D_{t+1} such
that err_{D_{t+1}}(h_t) = 1/2. Show that the particular choice of α_t = (1/2) log((1 − ε_t)/ε_t) achieves
this goal.

Note: why do we want such an adversarial setting? Because otherwise A might as well return h_t or
−h_t again in round t + 1 and still be slightly better than random guessing, which means it essentially
learns nothing.
2. Show that D_{T+1}(i) = (m · ∏_{t=1}^{T} Z_t)^{−1} · e^{−y_i f(x_i)}, where f(x) = ∑_{t=1}^{T} α_t h_t(x).

3. Show that err_S(H) ≤ ∏_{t=1}^{T} Z_t.

4. Show that ∏_{t=1}^{T} Z_t ≤ e^{−2 ∑_{t=1}^{T} γ_t²}.

[Figure omitted from this text version: a scatter plot of nine labeled toy points x_1, ..., x_9 in the plane.]

Figure 1: Toy data for AdaBoost.

5. Now let γ = min_t γ_t. From parts 3 and 4, we know the training error approaches zero at an exponential
rate with respect to T. How many rounds are needed to achieve a training error of at most ε > 0? Please
express your answer in big-O notation, T = O(·).
6. Consider the data set in Figure 1. Run T = 3 iterations of AdaBoost with decision stumps (axis-aligned
separators) as the base learners. Illustrate the learned weak hypotheses {h_t} in Figure 1 and fill in
Table 1. The MATLAB code that generates Figure 1 is available on the course website.
We recommend writing a simple program, as the calculation can be tedious by hand; it will also help
you understand how AdaBoost works in practice. (A minimal decision-stump sketch is given after Table 1.)

t | ε_t | α_t | D_t(1) | D_t(2) | D_t(3) | D_t(4) | D_t(5) | D_t(6) | D_t(7) | D_t(8) | D_t(9) | err_S(H)
1 |     |     |        |        |        |        |        |        |        |        |        |
2 |     |     |        |        |        |        |        |        |        |        |        |
3 |     |     |        |        |        |        |        |        |        |        |        |

Table 1: AdaBoost results (to be filled in, one row per boosting round).
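Not part of the assignment: a minimal Python sketch of the exhaustive decision-stump search over a
weighted sample, which can be combined with the reweighting step sketched earlier in this section. The
function and variable names are illustrative; plug in the coordinates and labels read off Figure 1 (the
threshold grid below assumes the toy points lie on integer coordinates).

```python
import numpy as np

def best_stump(X, y, D):
    """Search all axis-aligned decision stumps and return the one with minimal weighted error.

    X : (m, d) data matrix, y : labels in {-1, +1}, D : distribution over the m examples.
    """
    m, d = X.shape
    best, best_err = None, np.inf
    for j in range(d):                            # feature (axis) to split on
        for thr in np.unique(X[:, j]) - 0.5:      # thresholds between integer grid values (assumption)
            for s in (+1, -1):                    # orientation of the stump
                pred = s * np.where(X[:, j] >= thr, 1, -1)
                err = np.sum(D[pred != y])        # weighted error under D
                if err < best_err:
                    best_err, best = err, (j, thr, s)
    return best, best_err
```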

3 Gaussian Mixture Model (10 Points) (Hao)


Consider a multivariate Gaussian Mixture Model with K components:
p(x) = ∑_{k=1}^{K} π_k N(x | µ_k, Σ_k).  (3)

1. Show that E[x] = ∑_k π_k µ_k.

2. Show that Cov[x] = ∑_k π_k (Σ_k + µ_k µ_k^⊤) − E[x] E[x]^⊤.
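Not part of the assignment: a quick Monte Carlo sanity check in Python of the identity in part 1. The
two-component parameters below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-component GMM in R^2 (parameters are arbitrary)
pis = np.array([0.3, 0.7])
mus = np.array([[0.0, 0.0], [3.0, -1.0]])
covs = [np.eye(2), 2.0 * np.eye(2)]

# Sample: draw the component index first, then the corresponding Gaussian
n = 100_000
ks = rng.choice(len(pis), size=n, p=pis)
x = np.empty((n, 2))
for k in range(len(pis)):
    mask = ks == k
    x[mask] = rng.multivariate_normal(mus[k], covs[k], size=mask.sum())

print("empirical mean  :", x.mean(axis=0))
print("sum_k pi_k mu_k :", pis @ mus)   # should agree up to Monte Carlo error
```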

4 K-means (40 Points) (Hao)


Given n data samples in X ⊆ R^d and an integer K, we showed in class that the K-means algorithm tries to
determine K clusters {C_k}_{k=1}^{K} with centers U_K = {µ_k}_{k=1}^{K} ⊆ R^d, and a mapping function
f : X → {1, ..., K} which assigns each x_i ∈ X to one of the clusters, so as to optimize the following
objective:

φ = ∑_{k=1}^{K} (1/n_k) ∑_{i=1}^{n_k} ∑_{j=1}^{n_k} ‖x_{ki} − x_{kj}‖²,  (4)

where x_{ki} denotes the i-th sample in C_k and n_k is the number of data samples in C_k.

4.1 Theory
1. Prove the following Lemma.
Lemma 1. Let X ⊆ R^d be a set of points with mean (center) x̄. Then for any point s,

∑_{x∈X} ‖x − s‖² − ∑_{x∈X} ‖x − x̄‖² = |X| · ‖x̄ − s‖².  (5)

2. Use Lemma 1 to prove that minimizing the objective in Eq. (4) is equivalent to minimizing the following
objective:

ω(U_K, f; X) = ∑_{k=1}^{K} ∑_{i=1}^{n} 1(f(x_i) = k) ‖x_i − µ_k‖².  (6)

(A small numerical comparison of the two objectives is sketched at the end of this subsection.)

3. Algorithm 1 presents how K-means proceeds. Show that each of Step 1 and Step 2 can only decrease
(or leave unchanged) the objective φ (or ω).

Algorithm 1: K-means Algorithm

1 Initialize {µ_k}_{k=1}^{K} (randomly, if necessary).
2 repeat
3     Step 1: Decide the cluster memberships of {x_i}_{i=1}^{n} by assigning each point to its nearest
      cluster center.
4     Step 2: For each k ∈ {1, ..., K}, set µ_k to be the center of mass of all points in C_k:
      µ_k = (1/n_k) ∑_{i=1}^{n_k} x_{ki}.
5 until the objective no longer changes

4. Let Ω(K) = min_{U_K, f} ω(U_K, f; X). Show that Ω is non-increasing in K.

5. In K-means (as in Algorithm 1), we terminate the iterative process when the objective no longer
changes. Prove that K-means terminates in a finite number of iterations.
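Not part of the assignment: a small Python snippet that numerically compares the two objectives on
random data, which may help build intuition for part 2. The data and the random assignment are
placeholders; the centers are taken to be the cluster means.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 200, 5, 4
X = rng.normal(size=(n, d))
f = rng.integers(K, size=n)                                   # an arbitrary assignment x_i -> cluster
mu = np.stack([X[f == k].mean(axis=0) for k in range(K)])     # center of mass of each cluster

# Eq. (4): for each cluster, (1/n_k) times the sum of pairwise squared distances
phi = sum(
    ((X[f == k][:, None, :] - X[f == k][None, :, :]) ** 2).sum() / (f == k).sum()
    for k in range(K)
)

# Eq. (6): sum of squared distances of each point to its assigned center
omega = ((X - mu[f]) ** 2).sum()

print(phi, omega)   # compare the two values; part 2 asks you to establish how they relate
```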

4.2 Implementation
Now you are ready to implement K-means yourself. A dataset of 2429 human faces is provided in
the file kmeans data.csv. Each of the 2429 lines in this file corresponds to a 19 × 19 image of a human face.
Every image is represented as a 361-dimensional vector of grayscale values, in column-major format.
1. Implement the K-means algorithm, as detailed in Algorithm 1. Your implementation should initialize
{µ_k}_{k=1}^{K} by choosing points uniformly at random from X. Compute the objective value in Eq. (4)
at each iteration. Your K-means algorithm should terminate when a given number of iterations M has
been reached. (A minimal sketch of the core loop is given at the end of this section.)

2. Run your implementation 15 times, using K = 5, M = 50. Plot the objective vs. iterations for all
15 runs in one figure. Have they converged? How many iterations does each run take to converge?
Choose the run with the minimal objective value and compute the mean faces for this run, i.e., the
centers of the clusters. Visualize the mean faces.
3. The clustering results of K-means can often be greatly improved by carefully choosing an initialization
strategy. K-means++ is a randomized seeding technique that can improve both the speed and the
accuracy of K-means [1]. Algorithm 2 describes how K-means++ initializes the cluster centers
{µ_k}_{k=1}^{K}.

Algorithm 2: K-means++ Initialization

1 Take one center µ_1, chosen uniformly at random from X.
2 Take a new center µ_k (k > 1) from X, so that Pr(µ_k = x_i) = D(x_i)² / ∑_{j=1}^{n} D(x_j)², where D(x)
  is the distance from x to its nearest center among those already chosen, {µ_1, ..., µ_{k−1}}.
3 Repeat the above step until all of {µ_k}_{k=1}^{K} have been initialized.
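Not part of the assignment: a minimal Python sketch of the D²-weighted seeding above, including a
hand-rolled categorical (single-draw multinomial) sampler as required by the instructions below. Names
and structure are illustrative only.

```python
import numpy as np

def sample_index(p, rng):
    """Draw one index from the discrete distribution p by inverse-CDF sampling."""
    return int(np.searchsorted(np.cumsum(p), rng.random()))

def kmeanspp_init(X, K, rng):
    """K-means++ seeding: pick K rows of X by D^2-weighted sampling (Algorithm 2)."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                       # mu_1: uniformly at random
    for _ in range(1, K):
        # squared distance of every point to its nearest already-chosen center
        diffs = X[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=-1).min(axis=1)
        centers.append(X[sample_index(d2 / d2.sum(), rng)])
    return np.asarray(centers)
```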

Implement K-means++ on top of your K-means implementation. Note that you need to implement the
sampler from the multinomial distribution yourself. Then run your implementation with K-means++
initialization 15 times, using K = 5, M = 50. Plot the objective vs. iterations for all 15 runs in one
figure. How many iterations do they take to converge? Compute the mean faces for the run with the
minimal objective and visualize them. Compare your curves and mean faces to the previous ones and
summarize your observations.
Submit both the write-up and your code.
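Not part of the assignment: a minimal Python sketch of the core K-means loop referenced in part 1 of
Section 4.2, with uniform random initialization and a fixed iteration budget M. The objective tracked
here is the form in Eq. (6); the file-loading line is a commented-out assumption about the CSV format.

```python
import numpy as np

def kmeans(X, K, M, rng):
    """Algorithm 1 run for at most M iterations; returns centers, assignments, objectives."""
    n, d = X.shape
    mu = X[rng.choice(n, size=K, replace=False)]          # uniform random initialization from X
    objectives = []
    for _ in range(M):
        # Step 1: assign each point to its nearest center
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)   # (n, K) squared distances
        f = d2.argmin(axis=1)
        # Step 2: move each center to the mean of its cluster (keep the old center if empty)
        mu = np.stack([X[f == k].mean(axis=0) if np.any(f == k) else mu[k] for k in range(K)])
        objectives.append(((X - mu[f]) ** 2).sum())        # objective in the form of Eq. (6)
    return mu, f, objectives

# Hypothetical usage on the faces data (file name and delimiter assumed from the text):
# X = np.loadtxt("kmeans data.csv", delimiter=",")
# mu, f, obj = kmeans(X, K=5, M=50, rng=np.random.default_rng())
```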

References
[1] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the
eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial
and Applied Mathematics, 2007.
