Jeff M. Phillips
Mathematical Foundations
for Data Analysis
Springer Series in the Data Sciences
Series Editors
David Banks, Duke University, Durham, NC, USA
Jianqing Fan, Department of Financial Engineering, Princeton University,
Princeton, NJ, USA
Michael Jordan, University of California, Berkeley, CA, USA
Ravi Kannan, Microsoft Research Labs, Bangalore, India
Yurii Nesterov, CORE, Université Catholique de Louvain, Louvain-la-Neuve,
Belgium
Christopher Ré, Department of Computer Science, Stanford University, Stanford,
USA
Ryan J. Tibshirani, Department of Statistics, Carnegie Mellon University,
Pittsburgh, PA, USA
Larry Wasserman, Department of Statistics, Carnegie Mellon University,
Pittsburgh, PA, USA
Springer Series in the Data Sciences focuses primarily on monographs and graduate
level textbooks. The target audience includes students and researchers working in
and across the fields of mathematics, theoretical computer science, and statistics.
Data Analysis and Interpretation is a broad field encompassing some of the
fastest-growing subjects in interdisciplinary statistics, mathematics and computer
science. It encompasses a process of inspecting, cleaning, transforming, and
modeling data with the goal of discovering useful information, suggesting
conclusions, and supporting decision making. Data analysis has multiple facets
and approaches, including diverse techniques under a variety of names, in different
business, science, and social science domains. Springer Series in the Data Sciences
addresses the needs of a broad spectrum of scientists and students who are utilizing
quantitative methods in their daily research. The series is broad but structured,
including topics within all core areas of the data sciences. The breadth of the series
reflects the variation of scholarly projects currently underway in the field of
machine learning.
Mathematical Foundations
for Data Analysis
Jeff M. Phillips
School of Computing
University of Utah
Salt Lake City, UT, USA
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
to Bei, Stanley, and Max
Preface
This book is meant for data science students, and in particular for preparing students
mathematically for advanced concepts in data analysis. It can be used for a
self-contained course that introduces many of the basic mathematical principles
and techniques needed for modern data analysis, and can go deeper in a variety of
topics; the shorthand math for data may be appropriate. In particular, it was
constructed from material taught mainly in two courses. The first is an early
undergraduate course which is designed to prepare students to succeed in rigorous
Machine Learning and Data Mining courses. The second course is the advanced
Data Mining course. It should be useful for any combination of such courses. The
book introduces key conceptual tools which are often absent, or covered only briefly,
in the undergraduate curriculum, and which for most students are helpful to see multiple times. On
top of these, it introduces the generic versions of the most basic techniques that
comprise the backbone of modern data analysis. And then it delves deeper into a
few more advanced topics and techniques—still focusing on clear, intuitive, and
lasting ideas, instead of specific details in the ever-evolving state of the art.
Notation
Consistent, clear, and crisp mathematical notation is essential for intuitive learning.
The domains which comprise modern data analysis (e.g., statistics, machine
learning, and algorithms) until recently had matured somewhat separately with
their own conventions for ways to write the same or similar concepts. Moreover, it
is commonplace for researchers to adapt notation to best highlight ideas within
specific papers. As such, much of the existing literature on the topics covered in
this book has varied, sometimes inconsistent notation, and as a whole can be
confusing. This text attempts to establish a common, simple, and consistent
notation for these ideas, yet not veer too far from how concepts are consistently
represented in the research literature, and as they will be in more advanced courses.
Indeed, the most prevalent sources of confusion in earlier uses of this text in class
have arisen around overloaded notation.
This book is written for students who have already taken calculus; for several
topics it relies on integration (continuous probability) and differentiation (gradient
descent). However, this book does not lean heavily on calculus, and as data science
introductory courses are being developed which are not calculus-forward, students
following this curriculum may still find many parts of this book useful.
For some advanced material in Sampling, Nearest Neighbors, Regression,
Clustering, Classification, and especially Big Data, some basic familiarity with
programming and algorithms will be useful to fully understand these concepts.
These topics are deeply integrated with computational techniques beyond numer-
ical linear algebra. When the implementation is short and sweet, several imple-
mentations are provided in Python. This is not meant to provide a full introduction
to programming, but rather to help break down any barriers among students
worrying that the programming aspects need to be difficult—many times they do
not!
Probability and Linear Algebra are essential foundations for much of data
analysis, and hence also in this book. This text includes reviews of these topics.
This is partially to keep the book more self-contained, and partially to ensure that
there is a consistent notation for all critical concepts. This material should be
suitable for a review on these topics but is not meant to replace full courses. It is
recommended that students take courses on these topics before, or potentially
concurrently with, a more introductory course from this book.
If appropriately planned for, it is the hope that a first course taught from this
book could be taken as early as the undergraduate sophomore level, so that more
rigorous and advanced data analysis classes can be taken during the junior year.
Although we touch on Bayesian inference, we do not cover most of classical
statistics; neither frequentist hypothesis testing nor similar Bayesian perspectives.
Most universities have well-developed courses on these topics that are also very
useful, and provide a complementary view of data analysis.
Vital concepts introduced include the concentration of measure and PAC bounds,
cross-validation, gradient descent, a variety of distances, principal component
analysis, and graph-structured data. These ideas are essential for modern data
analysis, but are not often taught in other introductory mathematics classes in a
computer science or math department, or if these concepts are taught, they are
presented in a very different context.
[Figure: an overview of data analysis tasks. Labeled data (X, y) feeds regression, with a scalar prediction outcome; unlabeled data X feeds clustering (set outcome) and dimensionality reduction (structure outcome).]
On Data
While this text is mainly focused on mathematical preparation, what would data
analysis be without data? As such we provide a discussion on how to use these
tools and techniques on actual data, with simple examples given in Python. We
choose Python since it has increasingly many powerful libraries often with efficient
backends in low-level languages like C or Fortran. So for most data sets, this
provides the proper interface for working with these tools. Data sets can be found
here: https://mathfordata.github.io/data/.
But arguably, more important than writing the code itself is a discussion on
when and when not to use techniques from the immense toolbox available. This is
one of the main ongoing questions a data scientist must ask. And so, the text
attempts to introduce the readers to this ongoing discussion—but resists diving into
an explicit discussion of the actual tools available.
Three themes that this text highlights to aid in a broader understanding of these
fundamentals are examples, geometry, and ethical connotations. These are each
offset in colored boxes.
This book provides numerous simple and sometimes fun examples to demonstrate
key concepts. It aims to be as simple as possible (but not simpler), and make data
examples small, so they can be fully digested. These are illustrated with figures
and plots, and often the supporting Python code is integrated when it is illustrative.
For brevity, the standard import commands from Python are only written once per
chapter, and state is assumed carried forward throughout the examples within a
chapter, although most such Python parts are fully self-contained.
Many of the ideas in this text are inherently geometric, and hence we attempt to
provide many geometric illustrations and descriptions to use visual intuition to shed
light on otherwise abstract concepts. These boxes often go more in depth into what
is going on, and include the most technical proofs. Occasionally the proofs are not
really geometric, yet for consistency the book retains this format in those cases.
As data analysis glides into an abstract, nebulous, and ubiquitous place, with a
role in automatic decision making, the surrounding ethical questions are becoming
more important. As such, we highlight various ethical questions which may arise
in the course of using the analysis described in this text. We intentionally do not
offer solutions, since there may be no single good answer to some of the dilemmas
presented. Moreover, we believe the most important part of instilling positive ethics
is to make sure analysts at least think about the consequences, which we hope is
partially achieved via these highlighting boxes and ensuing discussions.
I would like to thank the gracious support from NSF in the form of grants
CCF-1350888, IIS-1251019, ACI-1443046, CNS-1514520, CNS-1564287, and
IIS-1816149, which have funded my cumulative research efforts during the writing
of this text. I would also like to thank the University of Utah, as well as the Simons
Institute for Theory of Computing, for providing excellent work environments
while this text was written. And thanks to Natalie Cottrill, Yang Gao, Koriann
South, and many other students for a careful reading and feedback.
Contents

1 Probability Review
  1.1 Sample Spaces
  1.2 Conditional Probability and Independence
  1.3 Density Functions
  1.4 Expected Value
  1.5 Variance
  1.6 Joint, Marginal, and Conditional Distributions
  1.7 Bayes' Rule
    1.7.1 Model Given Data
  1.8 Bayesian Inference
  Exercises

3.5 Rank
3.6 Square Matrices and Properties
3.7 Orthogonality
Exercises

5 Linear Regression
  5.1 Simple Linear Regression
  5.2 Linear Regression with Multiple Explanatory Variables
  5.3 Polynomial Regression
  5.4 Cross-Validation
    5.4.1 Other Ways to Evaluate Linear Regression Models
  5.5 Regularized Regression
    5.5.1 Tikhonov Regularization for Ridge Regression
    5.5.2 Lasso
    5.5.3 Dual Constrained Formulation
    5.5.4 Matching Pursuit
  Exercises

8 Clustering
  8.1 Voronoi Diagrams
    8.1.1 Delaunay Triangulation
    8.1.2 Connection to Assignment-Based Clustering
  8.2 Gonzalez's Algorithm for k-Center Clustering
  8.3 Lloyd's Algorithm for k-Means Clustering
    8.3.1 Lloyd's Algorithm
    8.3.2 k-Means++
    8.3.3 k-Medoid Clustering
    8.3.4 Soft Clustering
  8.4 Mixture of Gaussians
    8.4.1 Expectation-Maximization
  8.5 Hierarchical Clustering

9 Classification
  9.1 Linear Classifiers
    9.1.1 Loss Functions
    9.1.2 Cross-Validation and Regularization
  9.2 Perceptron Algorithm
  9.3 Support Vector Machines and Kernels
    9.3.1 The Dual: Mistake Counter
    9.3.2 Feature Expansion
    9.3.3 Support Vector Machines
  9.4 Learnability and VC Dimension
  9.5 kNN Classifiers
  9.6 Decision Trees
  9.7 Neural Networks
    9.7.1 Training with Back-propagation
  Exercises

Index
Chapter 1
Probability Review
Abstract Probability is a critical tool for modern data analysis. It arises in dealing
with uncertainty, in prediction, in randomized algorithms, and Bayesian analysis.
To understand any of these concepts correctly, it is paramount to have a solid and
rigorous statistical foundation. This chapter reviews the basic definitions necessary
for data analysis foundations.
We define probability through set theory, starting with a sample space Ω. This
represents the set of all things that might happen in the setting we consider. One such
potential outcome ω ∈ Ω is a sample outcome; it is an element of the set Ω. We are
usually interested in an event that is a subset A ⊆ Ω of the sample space.
Consider rolling a single fair, 6-sided die. Then Ω = {1, 2, 3, 4, 5, 6}. One roll may
produce an outcome ω = 3, rolling a 3. An event might be A = {1, 3, 5}, any odd
number.
The probability of rolling an odd number is then the ratio of the size of the event
to the size of the sample space: Pr(A) = |A|/|Ω| = 3/6 = 1/2.
• The probability of the union of disjoint events is equivalent to the sum of their
individual probabilities. Formally, for any sequence A1, A2, . . . where for all i ≠ j
we have Ai ∩ Aj = ∅, then

Pr(⋃_{i=1}^∞ Ai) = ∑_{i=1}^∞ Pr(Ai).
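For discrete sample spaces like this, the definitions translate directly into code. Here is a minimal sketch in Python (the variable names are ours):

Omega = {1, 2, 3, 4, 5, 6}     # sample space of one roll of a fair die
A = {1, 3, 5}                  # event: rolling an odd number
print(len(A) / len(Omega))     # Pr(A) = 3/6 = 0.5

A1, A2 = {1, 2}, {5, 6}        # two disjoint events
assert A1.isdisjoint(A2)
# for disjoint events, the probability of the union is the sum
print(len(A1 | A2) / len(Omega), len(A1)/len(Omega) + len(A2)/len(Omega))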
Now consider flipping a biased coin with two possible events Ω = {H, T } (i.e.,
heads H = A1 and tails T = A2 ). The coin is biased so the probabilities Pr(H) =
0.6 and Pr(T) = 0.4 of these events are not equal. However, we still notice that 0 ≤
Pr(T), Pr(H) ≤ 1, and that Pr(Ω) = Pr(H ∪ T) = Pr(H) + Pr(T) = 0.6 + 0.4 = 1.
That is, the sample space Ω is the union of these two events, which cannot both occur
(i.e., H ∩ T = ∅), so they are disjoint. Thus Ω’s probability can be written as the sum
of the probability of those two events.
Sample spaces Ω can also be continuous, representing some quantity like water,
time, or landmass, which does not come in discrete units. All of the above definitions
hold for this setting.
Assume you are riding a Swiss train that is always on time, but its departure is only
specified to the minute (specifically, 1:37 pm). The true departure is then in the state
space Ω = [1:37:00, 1:38:00). A continuous event may be A = [1:37:00, 1:37:40),
the first 40 seconds of that minute.
Perhaps the train operators are risk averse, so Pr(A) = 0.80. That indicates that
a 0.8 fraction of trains depart in the first 2/3 of that minute (more than the 2/3 ≈ 0.667
expected from a uniform distribution).
Consider flipping a fair coin with Λ = {H, T }. If it lands as heads H, then I get 1
point, and if it lands as T, then I get 4 points. Thus the mapped-to sample space
is Ω = {1, 4}. This describes the random variable X, defined as X(H) = 1 and
X(T) = 4.
[Figure: four panels (a)–(d) illustrating events A and B within a sample space Ω: the events themselves, their intersection A ∩ B, and the conditioning A | B, which restricts the effective sample space to B.]
Now consider two events A and B. The conditional probability of A given B is written
as Pr(A | B), and can be interpreted as the probability of A, restricted to the setting
where we know B is true. It is defined in simpler terms as Pr(A | B) = Pr(A ∩ B)/Pr(B),
that is, the probability that A and B are both true divided by (normalized by) the
probability that B is true. Be careful, this is only defined when Pr(B) ≠ 0.
Two events A and B are independent of each other if and only if
Pr(A | B) = Pr(A).
Consider two random variables T and C. Variable T is 1 if a test for cancer is positive,
and 0 otherwise. Variable C is 1 if a patient has cancer, and 0 otherwise. The joint
probability of the events is captured in the following table:
cancer no cancer
C=1 C=0
tests positive for cancer T = 1 0.1 0.02
tests negative for cancer T = 0 0.05 0.83
Note that the sum of all cells (the joint sample space Ω) is 1. The conditional
probability of having cancer, given a positive test, is Pr(C = 1 | T = 1) =
0.1/(0.1 + 0.02) ≈ 0.8333. The probability of cancer (ignoring the test) is
Pr(C = 1) = 0.1 + 0.05 = 0.15. Since Pr(C = 1 | T = 1) ≠ Pr(C = 1), the events
T = 1 and C = 1 are not independent.
Two random variables X and Y are independent if and only if, for all possible
events A ⊆ ΩX and B ⊆ ΩY , these events A and B are independent: Pr(A ∩ B) =
Pr(A) Pr(B).
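As a concrete check, here is a quick Python sketch of the cancer example above (the joint table values are from the text):

# joint probabilities Pr(T = t, C = c) from the table above
joint = {(1, 1): 0.10, (1, 0): 0.02, (0, 1): 0.05, (0, 0): 0.83}
pr_T1 = joint[(1, 1)] + joint[(1, 0)]    # Pr(T = 1) = 0.12
pr_C1 = joint[(1, 1)] + joint[(0, 1)]    # Pr(C = 1) = 0.15
pr_C1_given_T1 = joint[(1, 1)] / pr_T1   # Pr(C = 1 | T = 1) = 0.8333...
# independence would require these two values to be equal; they are not
print(pr_C1_given_T1, pr_C1)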
Discrete random variables can often be defined through tables (as in the above
cancer example), or we can define a function fX (k) as the probability that random
variable X is equal to k. For continuous random variables, we need to be more
careful: we will use calculus. We will next develop probability density functions
(pdfs) and cumulative density functions (cdfs) for continuous random variables;
the same constructions are sometimes useful for discrete random variables as well,
which basically just replace an integral with a sum.
We consider a continuous sample space Ω, and a random variable X with outcomes
in Ω. The probability density function of a random variable X is written as fX. It
is defined with respect to any event A ⊂ Ω, so that the probability X is an element of
A is Pr(X ∈ A) = ∫_{ω ∈ A} fX(ω) dω. The value fX(ω) is not equal to Pr(X = ω) in
general, since for continuous functions Pr(X = ω) = 0 for any single value ω ∈ Ω.
Yet, we can interpret fX as a likelihood function; its values have no units, but they can
be compared, and outcomes ω with larger fX(ω) values are more likely.
Next we will define the cumulative density function FX(t); it is the probability
that X takes on a value of t or smaller. Here it is typical to have Ω = ℝ, the set of
real numbers. Now define FX(t) = ∫_{ω=−∞}^{t} fX(ω) dω.
When the cdf is differentiable, we can then define a pdf in terms of a cdf as
fX(ω) = dFX(ω)/dω.
Example: Normal Random Variable
Consider the standard normal random variable X, with mean 0 and variance 1; its pdf
is fX(ω) = (1/√(2π)) exp(−ω²/2), and its cdf FX is the integral of this pdf. We
have plotted the cdf and pdf in the range [−3, 3] where most of the mass lies.
import math
import numpy as np
from scipy.stats import norm

mean, variance = 0, 1
sigma = math.sqrt(variance)
x = np.linspace(-3, 3, 201)   # evaluation points in [-3, 3]
pdf, cdf = norm.pdf(x, mean, sigma), norm.cdf(x, mean, sigma)
Linearity of Expectation
A key property of expectation is that it is a linear operation. That means for two
random variables X and Y , we have E[X + Y ] = E[X] + E[Y ]. For a scalar value α,
we also have E[αX] = αE[X].
Example: Expectation
Let H be the random variable of the height of a man in meters without shoes. Let
the pdf fH of H be a normal distribution with expected value μ = 1.755m and
with standard deviation 0.1m. Let S be the random variable of the height added by
wearing a pair of shoes in centimeters (1 meter is 100 centimeters); its pdf is given
by the following table:
S=1 S=2 S=3 S=4
0.1 0.1 0.5 0.3
Then the expected height of someone wearing shoes in centimeters is
E[100 · H + S] = 100 · E[H] + E[S] = 100 · 1.755 + (1 · 0.1 + 2 · 0.1 + 3 · 0.5 + 4 · 0.3) = 175.5 + 3.0 = 178.5.
Note how the linearity of expectation allowed us to decompose the expression 100 ·
H + S into its components, and take the expectation of each one individually. This
trick is immensely powerful when analyzing complex scenarios with many factors.
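A quick numerical check of this computation in Python (values from the tables above):

E_H_cm = 100 * 1.755                       # E[100 * H] in centimeters
f_S = {1: 0.1, 2: 0.1, 3: 0.5, 4: 0.3}     # pdf of S from the table
E_S = sum(s * p for s, p in f_S.items())   # E[S] = 3.0
print(E_H_cm + E_S)                        # E[100*H + S] = 178.5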
1.5 Variance
The variance of a random variable X describes how spread out it is, with respect to
its mean E[X]. It is defined as

Var[X] = E[(X − E[X])²] = E[X²] − E[X]².

The equivalence of those two common forms uses that E[X] is a fixed scalar:

E[(X − E[X])²] = E[X² − 2X E[X] + E[X]²] = E[X²] − 2E[X] E[X] + E[X]² = E[X²] − E[X]².
Example: Variance
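As a small stand-in illustration, let X be the value of one roll of a fair 6-sided die; a short Python sketch computes Var[X] with both equivalent forms:

vals = [1, 2, 3, 4, 5, 6]
E_X = sum(vals) / 6                             # E[X] = 3.5
E_X2 = sum(x * x for x in vals) / 6             # E[X^2] = 91/6
var1 = sum((x - E_X) ** 2 for x in vals) / 6    # E[(X - E[X])^2]
var2 = E_X2 - E_X ** 2                          # E[X^2] - E[X]^2
print(var1, var2)                               # both 35/12 = 2.9166...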
In this discrete case, the domain of fX,Y is restricted so fX,Y ∈ [0, 1] and so
∑_{(x,y) ∈ ΩX×ΩY} fX,Y(x, y) = 1, i.e., the sum of probabilities over the joint sample space
is 1. Sometimes this discrete function is referred to as a probability mass function
(pmf) instead of a probability density function; we will use pdf for both.
The marginal pdf in the discrete case is defined by summing over one variable

fX(x) = ∑_{y ∈ ΩY} fX,Y(x, y) = ∑_{y ∈ ΩY} Pr(X = x, Y = y).
Consider a student who randomly chooses his pants and shirt every day. Let P be a
random variable for the color of pants, and S a random variable for the color of the
shirt. Their joint probability is described by this table:
S=green S=red S=blue
P=blue 0.3 0.1 0.2
P=white 0.05 0.2 0.15
Adding up along columns, the marginal distribution fS for the color of the shirt
is described by the following table:
S=green S=red S=blue
0.35 0.3 0.35
Isolating and renormalizing the middle “S=red” column, the conditional distri-
bution fP |S (· | S= red) is described by the following table:
P=blue: 0.1/0.3 ≈ 0.3333
P=white: 0.2/0.3 ≈ 0.6666
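These marginalization and renormalization steps are mechanical; here is a sketch with NumPy (rows indexed by P, columns by S, with values from the joint table above):

import numpy as np
# rows: P = blue, white; columns: S = green, red, blue
joint = np.array([[0.30, 0.10, 0.20],
                  [0.05, 0.20, 0.15]])
f_S = joint.sum(axis=0)                # marginal of S: [0.35, 0.3, 0.35]
f_P_given_red = joint[:, 1] / f_S[1]   # condition on S = red and renormalize
print(f_S, f_P_given_red)              # conditional: [0.3333..., 0.6666...]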
where ‖v − μ‖ is the Euclidean distance between v and μ (see Section 4.2). For the
2-dimensional case where v = (vx, vy) and μ = (μx, μy), this is defined as

G₂(v) = (1/(2πσ²)) exp( −((vx − μx)² + (vy − μy)²) / (2σ²) ).
A magical property about the Gaussian distribution is that all conditional versions
of it are also Gaussian, of a lower dimension. For instance, in the 2-dimensional case,
G2 (vx | vy = 1) is a 1-dimensional Gaussian, which is precisely a normal distribution
N = G1 . There are many other essential properties of the Gaussian that we will see
throughout this text, including that it is invariant under all basis transformations and
that it is the limiting distribution for central limit theorem bounds.
When the dimension d is implicit, but it is useful to write the mean μ and variance
σ² (or standard deviation σ) explicitly in the definition of the pdf, we can write
Gμ,σ or Nμ,σ.
Bayes’ rule is the key component in how to build likelihood functions, which when
optimized are essential to evaluating “models” based on data. Bayesian reasoning is
a much broader area that can go well beyond just finding the single “most optimal”
model. This line of work, which this section will only introduce, reasons about the
many possible models and can make predictions with this uncertainty taken into
account.
Given two events M and D, Bayes’ rule states that
Pr(M | D) = Pr(D | M) · Pr(M) / Pr(D).
Mechanically, this provides an algebraic way to invert the direction of the condi-
tioning of random variables, from (D given M) to (M given D). It assumes nothing
about the independence of M and D (otherwise, it is pretty uninteresting). To derive
this, we use
Pr(M ∩ D) = Pr(M | D) Pr(D)
and also
Pr(M ∩ D) = Pr(D ∩ M) = Pr(D | M) Pr(M).
Combining these, we obtain Pr(M | D) Pr(D) = Pr(D | M) Pr(M), from which we
can divide by Pr(D) to solve for Pr(M | D). So Bayes’ rule is uncontroversially true;
any “frequentist versus Bayesian” debate is about how to model data and perform
analysis, not the specifics or correctness of this rule.
Consider two events M and D with the following joint probability table:
M=1 M=0
D = 1 0.25 0.5
D = 0 0.2 0.05
We can observe that indeed Pr(M | D) = Pr(M ∩ D)/Pr(D) = 0.25/0.75 = 1/3, which is
equal to

Pr(D | M) · Pr(M) / Pr(D) = (0.25/(0.2 + 0.25)) · (0.2 + 0.25) / (0.25 + 0.5) = 0.25/0.75 = 1/3.
But Bayes’ rule is not very interesting in the above example. In that example, it
is actually more complicated to calculate the right side of Bayes’ rule than the left
side.
Consider you bought a new car and its windshield was cracked, the event W. If the
car was assembled at one of the three factories A, B, or C, you would like to know
which factory was the most likely point of origin.
Assume that near you 50% of cars are from factory A (that is, Pr(A) = 0.5) and
30% are from factory B (Pr(B) = 0.3), and 20% are from factory C (Pr(C) = 0.2).
Then you look up statistics online and find the following rates of cracked wind-
shields for each factory—apparently this is a problem! In factory A, only 1% are
cracked, in factory B 10% are cracked, and in factory C 2% are cracked. That is,
Pr(W | A) = 0.01, Pr(W | B) = 0.1, and Pr(W | C) = 0.02.
We can now calculate the probability that the car came from each factory:
• Pr(A | W) = Pr(W | A) · Pr(A)/ Pr(W) = 0.01 · 0.5/ Pr(W) = 0.005/Pr(W).
• Pr(B | W) = Pr(W | B) · Pr(B)/ Pr(W) = 0.1 · 0.3/ Pr(W) = 0.03/Pr(W).
• Pr(C | W) = Pr(W | C) · Pr(C)/ Pr(W) = 0.02 · 0.2/ Pr(W) = 0.004/Pr(W).
We did not calculate Pr(W), but it must be the same for all factory events,
so to find the highest probability factory we can ignore it. The probability Pr(B | W) =
0.03/Pr(W) is the largest, and B is the most likely factory.
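A short Python sketch of this calculation (priors and crack rates taken from the example); since Pr(W) is common to all three factories, comparing the unnormalized products Pr(W | factory) · Pr(factory) suffices:

priors = {'A': 0.5, 'B': 0.3, 'C': 0.2}    # Pr(factory)
rates = {'A': 0.01, 'B': 0.1, 'C': 0.02}   # Pr(W | factory)
scores = {f: rates[f] * priors[f] for f in priors}
print(scores)                         # {'A': 0.005, 'B': 0.03, 'C': 0.004}
print(max(scores, key=scores.get))    # 'B', the most likely factory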
Thus, by using Bayes’ rule, we can maximize Pr(M | D) using Pr(M) and Pr(D | M).
We do not need Pr(D) since our data is given to us and fixed for all models.
In some settings, we may also ignore Pr(M), as we may assume all possible
models are equally likely. This is not always the case, and we will come back to this.
In this setting, we just need to calculate Pr(D | M). This function L(M) = Pr(D | M)
is called the likelihood of model M.
¹ Consider a set S and a function f : S → ℝ. The max_{s∈S} f(s) returns the value f(s*) for some
element s* ∈ S which results in the largest valued f(s). The argmax_{s∈S} f(s) returns the element
s* ∈ S which results in the largest valued f(s); if this is not unique, it may return any such s* ∈ S.
A model is usually a simple pattern from which we think data is generated, but then
observed with some noise. Classic examples, which we will explore in this text,
include the following:
• The model M is a single point in Rd ; the data is a set of points in Rd near M.
• linear regression: The model M is a line in R2 ; the data is a set of points such that
for each x-coordinate, the y-coordinate is the value of the line at that x-coordinate
with some added noise in the y value.
• clustering: The model M is a small set of points in Rd ; the data is a large set of
points in Rd , where each point is near one of the points in M.
• PCA: The model M is a k-dimensional subspace in Rd (for k ≪ d); the data is a
set of points in Rd , where each point is near M.
• linear classification: The model M is a halfspace in Rd ; the data is a set of
labeled points (with labels + or −), so the + points are mostly in M, and the −
points are mainly not in M.
Log-likelihoods
An important trick used in understanding the likelihood, and in finding the MAP
model M ∗ , is to take the logarithm of the posterior. Since the logarithm oper-
ator log(·) is monotonically increasing on positive values, and all probabilities
(and more generally pdf values) are non-negative (treat log(0) as −∞), then
argmax_{M ∈ ΩM} Pr(M | D) = argmax_{M ∈ ΩM} log(Pr(M | D)). It is commonly
applied on only the likelihood function L(M), and log(L(M)) is called the log-
likelihood. Since log(a · b) = log(a) + log(b), this is useful in transforming def-
initions of probabilities, which are often written as products Π_{i=1}^k Pi, into sums
log(Π_{i=1}^k Pi) = ∑_{i=1}^k log(Pi), which are easier to manipulate algebraically.
Moreover, the base of the log is unimportant in model selection using the MAP
estimate because log_{b1}(x) = log_{b2}(x)/log_{b2}(b1), and so 1/log_{b2}(b1) is a coefficient
that does not affect the choice of M*. The same is true for the maximum likelihood
estimate (MLE): M* = argmax_{M ∈ ΩM} L(M).
Let the data D be a set of points in ℝ¹: {1, 3, 12, 5, 9}. Let ΩM be ℝ so that the
model is parametrized by a point M ∈ ℝ. If we assume that each data point is
observed with independent Gaussian noise (with σ = 2, its pdf is described as
N_{M,2}(x) = g(x) = (1/√(8π)) exp(−(1/8)(M − x)²)), then

Pr(D | M) = Π_{x ∈ D} g(x) = Π_{x ∈ D} (1/√(8π)) exp(−(1/8)(M − x)²).
Recall that we can take the product Π_{x ∈ D} g(x) since we assume independence of
x ∈ D! To find M* = argmax_M Pr(D | M) is equivalent to argmax_M ln(Pr(D | M)),
the log-likelihood, which is

ln(Pr(D | M)) = ln( Π_{x ∈ D} (1/√(8π)) exp(−(1/8)(M − x)²) )
             = −(1/8) ∑_{x ∈ D} (M − x)² + |D| ln(1/√(8π)).

We can ignore the last term in the sum since it does not depend on M. The first
term is maximized when ∑_{x ∈ D} (M − x)² is minimized, which occurs precisely at
M = E[D] = (1/|D|) ∑_{x ∈ D} x, the mean of the data set D. That is, the maximum likelihood
model is exactly the mean of the data D, and is quite easy to calculate.
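We can check this numerically; the sketch below evaluates the log-likelihood (dropping the constant term) on a grid of candidate models M, with the data and σ = 2 from the example:

import numpy as np
D = np.array([1, 3, 12, 5, 9])
M = np.linspace(0, 14, 1401)                      # candidate models, step 0.01
loglik = [-np.sum((m - D) ** 2) / 8 for m in M]   # log-likelihood up to a constant
print(M[np.argmax(loglik)], np.mean(D))           # grid argmax and mean, both 6.0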
Here we will show that the sum of squared errors from n values X = {x1, x2, . . . , xn} is
minimized at their average. We do so by showing that the first derivative is 0 at the
average. Consider the sum of squared errors cost sX(z) at some value z:

sX(z) = ∑_{i=1}^n (xi − z)² = ∑_{i=1}^n (z² − 2 xi z + xi²) = n z² − (2 ∑_{i=1}^n xi) z + ∑_{i=1}^n xi².

Setting its derivative with respect to z to 0, we have

d sX(z)/dz = 2nz − 2 ∑_{i=1}^n xi = 0.

Solving for z yields z = (1/n) ∑_{i=1}^n xi, and this is exactly the average X̄. The second
derivative (d²sX(z)/dz² = 2n) is positive, so this is a minimum point, not a maximum
point.
[Figure: the individual squared error parabolas si(z) = (xi − z)² for points x1, x2, x3, and their sum sX(z), a parabola minimized at the average of X.]
A single squared error function si(z) = (xi − z)² is a parabola. And the sum
of squared error functions is also a parabola, shown in green in the figure. This is
because we can expand the sum of squared errors as a single quadratic equation
in z, with coefficients determined by the xi values. This green parabola is exactly
minimized at the average of X, at the marked point.
The symbol ∝ means proportional to; that is, there is a fixed (but possibly unknown)
constant factor c multiplied on the right (in this case, c = 1/Pr(D)) to make them
equal: Pr(M | D) = c · Pr(D | M) · Pr(M).
However, we may want to use continuous random variables, so strictly using
probability Pr at a single point is not always correct. So we can replace each of these
with pdfs
p(M | D) ∝ f(D | M) · π(M),
where (reading left to right) the terms are the posterior, the likelihood, and the prior.
Each of these terms has a common name. As above, the conditional probability or
pdf Pr(D | M) ∝ f (D | M) is called the likelihood; it evaluates the effectiveness
of the model M using the observed data D. The probability or pdf of the model
Pr(M) ∝ π(M) is called the prior; it is the assumption about the relative propensity
of a model M, before or independent of the observed data. And the left-hand side
Pr(M | D) ∝ p(M | D) is called the posterior; it is the combined evaluation
of the model that incorporates how well the observed data fits the model and the
independent assumptions about the model.
Again it is common to be in a situation where, given a fixed model M, it is possible
to calculate the likelihood f(D | M). And again, the goal is to be able to compute
p(M | D), as this allows us to evaluate potential models M given the data we have
seen, D.
The main difference is a careful analysis of π(M), the prior—which is not neces-
sarily assumed uniform or “flat.” The prior allows us to encode our assumptions.
Let us estimate the height H of a typical student from our university. We can construct
a data set D = {x1, . . . , xn } by measuring the height of a set of random students in
inches. There may be an error in the measurement, and this is an incomplete set, so
we do not entirely trust the data.
So we introduce a prior π(M). Consider we read that the average height of a fully
grown person is 66 inches, with a standard deviation of σ = 6 inches. So we assume
that
π(M) = N66,6(M) = (1/√(72π)) exp(−(μM − 66)²/(2 · 6²))

is normally distributed around 66 inches.
Now, given this knowledge, we adjust the MLE example from the last subsection
using this prior.
• What if our MLE estimate without the prior (e.g., (1/|D|) ∑_{x ∈ D} x) provides a value
of 5.5?
The data is very far from the prior! Usually this means something is wrong. We
could find argmax M p(M | D) using this information, but that may give us an
estimate of say 20 (that does not seem correct). A more likely explanation is a
mistake somewhere: probably we measured in feet instead of inches!
So how important is the prior? In the average height example, it will turn out to be
worth only (1/9)th of one student’s measurement. But we can give it more weight.
Let us continue the example about the height of an average student from our univer-
sity, and assume (as in the MLE example) the data is generated independently from
a model M with Gaussian noise with σ = 2. Thus the likelihood of the model, given
the data, is
f(D | M) = Π_{x ∈ D} g(x) = Π_{x ∈ D} (1/√(8π)) exp(−(1/8)(μM − x)²).

Now using that the prior of the model is π(M) = (1/√(72π)) exp(−(μM − 66)²/72), the
posterior is given by

p(M | D) ∝ f(D | M) · (1/√(72π)) exp(−(μM − 66)²/72).

It is again easier to work with the log-posterior, which is monotonic with the posterior,
using some unspecified constant C (which can be effectively ignored):

ln(p(M | D)) ∝ −∑_{x ∈ D} (1/8)(μM − x)² − (1/72)(μM − 66)² + C
            ∝ −∑_{x ∈ D} 9(μM − x)² − (μM − 66)² + C.
Weighted Average
When combining the effects of iid data and a prior, we may need to use a weighted
average. Given a set of n values x1, x2, . . . , xn as well as n corresponding weights
w1, w2, . . . , wn , the weighted average is defined as
n
wi xi
i=1 .
i=1 wi
n
n
Note that the denominator W = i=1 wi is the total weight. Thus, for the special case
when all wi = 1, this is the same as the regular average since W = n and each term
in the numerator is simply 1xi = xi ; hence the entire quantity is n1 i=1 n
xi .
My favorite barbecue spice mix is made from 2 cups salt, 1.5 cups black pepper, and
1 cup cayenne pepper. At the store, salt costs 16¢ per cup, black pepper costs 50¢
per cup, and cayenne pepper costs 175¢ per cup. How much does my barbecue spice
mix cost per cup?
This would be a weighted average with weights w1 = 2, w2 = 1.5, and w3 = 1 for
salt, black pepper, and cayenne pepper, respectively. The values being averaged are
x1 = 16¢, x2 = 50¢, and x3 = 175¢ again for salt, black pepper, and cayenne pepper,
respectively. The units for the weights cancel out, so they do not matter (as long as
they are the same). But the units for the values do matter. The resulting cost per cup
is
(2 · 16¢ + 1.5 · 50¢ + 1 · 175¢) / (2 + 1.5 + 1) = (32¢ + 75¢ + 175¢) / 4.5 = 282¢ / 4.5 ≈ 63¢.
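NumPy computes weighted averages directly; a one-line check of this example:

import numpy as np
weights = [2, 1.5, 1]    # cups of salt, black pepper, and cayenne pepper
costs = [16, 50, 175]    # cents per cup
print(np.average(costs, weights=weights))   # 62.666..., about 63 cents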
Another aspect of Bayesian inference is that we not only can calculate the maximum
likelihood model M ∗ , but can also provide a posterior value for any model! This
value is not an absolute probability (it is not normalized, and regardless it may be of
measure 0), but it is powerful in other ways:
• We can say (under our model assumptions, which are now clearly stated) that one
model M1 is twice as likely as another M2 , if p(M1 | D)/p(M2 | D) = 2.
• We can now use more than one model for the prediction of a value. Given a new
data point x′, we may want to map it onto our model as M(x′) or assign it a
score of fit. Instead of doing this for just one “best” model M*, we can take a
weighted average of all models, weighted by their posterior, that is, marginalizing
over models.
• We can define a range of parameter values (with more model assumptions) that
likely contains the true model.
Consider three possible models M1 , M2 , and M3 that assume data D is observed with
normal noise with standard deviation 2. The difference in the models is the mean
of the distribution which is assumed to be 1.0, 2.0, and 3.0, respectively in these
models.
We also are given a prior for each model, in the following table:
π(M1 ) π(M2 ) π(M3 )
0.4 0.5 0.1
Now given a single data point x = 1.6 in D, the posterior for each model is
p(M1 | D) = C (1/√(8π)) exp(−(x − 1.0)²/8) · π(M1) = C′ (0.96)(0.4) = C′ · 0.384
p(M2 | D) = C (1/√(8π)) exp(−(x − 2.0)²/8) · π(M2) = C′ (0.98)(0.5) = C′ · 0.49
p(M3 | D) = C (1/√(8π)) exp(−(x − 3.0)²/8) · π(M3) = C′ (0.78)(0.1) = C′ · 0.078
where C accounts for the unknown probability of the data, and C′ folds in the other
fixed normalizing constants.
As before, it follows easily that M2 is the maximum posterior estimate—and our
single best choice given the prior and our data. We also know it is about 6 times
more likely than model M3 , but only about 1.3 times more likely than model M1 .
Say we want to predict the square of the mean of the true model. Instead of just
returning the maximum posterior (i.e., 4 for M2), we can take the weighted average:
(0.384 · 1² + 0.49 · 2² + 0.078 · 3²) / (0.384 + 0.49 + 0.078) = 3.046/0.952 ≈ 3.2.
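The full computation as a Python sketch (model means, priors, and the data point from the example); the unknown constants C and C′ cancel when we normalize:

import numpy as np
means = np.array([1.0, 2.0, 3.0])
priors = np.array([0.4, 0.5, 0.1])
x = 1.6
post = np.exp(-(x - means) ** 2 / 8) * priors   # proportional to p(M_i | D)
post /= post.sum()                              # normalize to sum to 1
print(post)                       # about [0.40, 0.52, 0.08]
print(np.sum(post * means ** 2))  # posterior-weighted prediction, about 3.2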
Exercises
1.1 Consider the probability table below for the random variables X and Y . One
entry is missing, but you should be able to derive it. Then calculate the following
values.
1. Pr(X = 3 ∩ Y = 2)
2. Pr(Y = 1)
3. Pr(X = 2 ∩ Y = 1)
4. Pr(X = 2 | Y = 1)
X=1 X=2 X=3
Y = 1 0.25 0.1 0.15
Y = 2 0.1 0.2 ??
1.2 An “adventurous” athlete has the following running routine every morning: He
takes a bus to a random stop, then hitches a ride, and then runs all the way home.
The bus, described by a random variable B, has four stops where the stops are at a
distance of 1, 3, 4, and 7 miles from his house—he chooses each with probability
1/4. Then the random hitchhiking takes him further from his house with a uniform
distribution between −1 and 4 miles; that is, it is represented as a random variable
H with pdf described as
f(H = x) = 1/5 if x ∈ [−1, 4], and f(H = x) = 0 if x ∉ [−1, 4].
What is the expected distance he runs each morning (all the way home)?
1.3 Consider rolling two fair dice D1 and D2; each has a probability space of Ω =
{1, 2, 3, 4, 5, 6} with each value equally likely. What is the probability that D1 has
a larger value than D2? What is the expected value of the sum of the two dice?
1.4 Let X be a random variable with a uniform distribution over [0, 2]; its pdf is
described as f(X = x) = 1/2 if x ∈ [0, 2], and f(X = x) = 0 if x ∉ [0, 2].
What is the probability f(X = 1)?
1.6 Let L be a random variable describing the outcome of a lottery ticket. The lottery
ticket costs 1 dollar, and with probability 0.75 returns 0 dollars. With probability 0.15, it
returns 1 dollar. With probability 0.05, it returns 2 dollars. With probability 0.04, it
returns 5 dollars. And with probability 0.01, it returns 100 dollars.
1. Given a lottery ticket (purchased for you), what is the expected money returned
by playing it?
2. Factoring in the cost of buying the ticket, what is the expected money won by
buying and playing the lottery ticket?
3. If you were running the lottery, and wanted to have the player expect to lose 0.01
dollars (1 cent) per ticket, how much should you charge?
1.7 Use Python to plot the pdf and cdf of the Laplace distribution (f(x) =
(1/2) exp(−|x|)) for values of x in the range [−3, 3]. The function scipy.stats.laplace
may be useful.
1.8 Consider the random variables X and Y described by the joint probability table.
X=1 X=2 X=3
Y = 1 0.10 0.05 0.10
Y = 2 0.30 0.25 0.20
Derive the following values:
1. Pr(X = 1)
2. Pr(X = 2 ∩ Y = 1)
3. Pr(X = 3 | Y = 2)
Compute the following probability distributions:
4. What is the marginal distribution for X?
5. What is the conditional probability for Y , given that X = 2?
Answer the following questions about the joint distribution:
6. Are random variables X and Y independent?
7. Is Pr(X = 1) independent of Pr(Y = 1)?
1.9 Consider two models M1 and M2 , where from prior knowledge we believe that
Pr(M1 ) = 0.25 and Pr(M2 ) = 0.75. We then observe a data set D. Given each model,
we assess the likelihood of seeing that data, given the model, as Pr(D | M1 ) = 0.5
and Pr(D | M2 ) = 0.01. Now that we have the data, which model has a higher
probability of being correct?
1.11 Consider a data set D with 10 data points {−1, 6, 0, 2, −1, 7, 7, 8, 4, −2}. We want
to find a model for M from a restricted sample space Ω = {0, 2, 4}. Assume the data
has Laplace noise, so from a model M a data point's probability distribution
is described as f(x) = (1/4) exp(−|M − x|/2). Also assume we have a prior assumption
on the models so that Pr(M = 0) = 0.25, Pr(M = 2) = 0.35, and Pr(M = 4) = 0.4.
Assuming all data points in D are independent, which model is most likely?
1.12 A student realizes she needs to file some registration paperwork 5 minutes from
the deadline. She will be fined $1 for each minute that it is late. So if it takes her 7
minutes, she will be 2 minutes late, and be fined $2. The time is rounded down to
the nearest minute late (so 2.6 minutes late has a $2 fine).
Her friend tells her that the usual time to finish is a uniform distribution between
4 minutes and 8 minutes. That is, the time to finish represents a random variable T
with pdf
f(T = x) = 1/4 if x ∈ [4, 8], and f(T = x) = 0 if x ∉ [4, 8].
1. What is the expected amount of money she will be fined if she starts immediately?
2. What is the expected amount of money she will be fined if she starts in 2 minutes?
1.13 A glassblower creates an intricate artistic glass bottle and attempts to measure
its volume in liters; the 10 measures are:
{1.82, 1.71, 2.34, 2.21, 2.01, 1.95, 1.76, 1.94, 2.02, 1.89}.
To sell the bottle, by regulation, she must label its volume up to 0.1 Liters (it could
be 1.7L or 1.8L or 1.9L, and so on). Her prior estimate is that it is 2.0L, but with a
normal distribution with a standard deviation of 0.1; that is, her prior for the volume
V being x is described by the pdf f(V = x) = C · N2.0,0.1(x),
for some unknown constant C (since it is only valid at increments of 0.1). Assuming
the 10 empirical estimates of the volume are unbiased, but have a normal error with
a standard deviation of 0.2, what is the most likely model for the volume V?
Chapter 2
Convergence and Sampling
Abstract This topic will overview a variety of extremely powerful analysis results
that span statistics, estimation theory, and big data. It provides a framework to think
about how to aggregate more and more data to get better and better estimates. It
will cover the Central Limit Theorem (CLT), Chernoff-Hoeffding bounds, Probably
Approximately Correct (PAC) algorithms, as well as analysis of importance sampling
techniques which improve the concentration of random samples.
Most data analysis starts with some data set; we will call this data set P. It will be
composed of a set of n data points P = {p1, p2, . . . , pn}.
But underlying this data is almost always a very powerful assumption, that this
data comes iid from a fixed, but usually unknown, pdf, called f. Let's unpack this:
What does “iid” mean: Identically and Independently Distributed. “Identically”
means each data point was drawn from the same f. “Independently” means that
the first points have no bearing on the value of the next point.
Example: Polling
Consider a poll which asks n randomly selected people whether they support a ballot
measure. Each response can be modeled as an iid draw from a fixed but unknown
distribution f, and the fraction of positive responses serves to estimate the true level
of support across the whole population.
Here we will talk about estimating the mean of f. To discuss this, we will now
introduce a random variable X ∼ f: a hypothetical new data point. The mean of f
is the expected value of X: E[X].
We will estimate the mean of f using the sample mean, defined as P̄ = (1/n) ∑_{i=1}^n pi.
The following diagram represents this common process: from an unknown process
f, we consider n iid random variables {Xi} corresponding to a set of n independent
observations {pi}, and take their average P̄ = (1/n) ∑_{i=1}^n pi to estimate the mean of f:

f  --(iid)-->  {Xi}  --(realize)-->  {pi}  --(average)-->  P̄ = (1/n) ∑_{i=1}^n pi
The central limit theorem is about how well the sample mean approximates the true
mean. But to discuss the sample mean P̄ (which is a fixed value), we need to discuss
random variables {X1, X2, . . . , Xn} and their mean X̄ = (1/n) ∑_{i=1}^n Xi. Note that again
X̄ is a random variable. If we are to draw a new iid data set P′ and calculate a new
sample mean P̄′, it will likely not be exactly the same as P̄; however, the distribution
of where this P̄′ is likely to be is precisely X̄. Arguably, this distribution is more
important than P̄ itself.
There are many formal variants of the central limit theorem, but the basic form is
as follows:
Central Limit Theorem
Consider n iid random variables X1, X2, . . . , Xn, where each Xi ∼ f for a fixed but
unknown pdf f with mean μ and bounded variance σ². Then the sample mean
X̄ = (1/n) ∑_{i=1}^n Xi converges to the normal distribution with mean μ = E[Xi] and
variance σ²/n as n goes to infinity.
The leftmost chart shows this process for n = 2, and the next ones are for n = 3,
n = 10, and n = 30. Each histogram shows sample distributions with the same
approximate mean of 50. And as n increases, the variance decreases: for n = 2, there
is some support close to 0 and 100, but for n = 30 the support is between 30 and 70,
and mostly between 40 and 60. Moreover, all distributions appear vaguely like the
normal distribution, with of course some error due to the histogram representation,
because it is based on 30,000 samples of P̄, and because n is finite.
This process is made more concrete in the following Python code.
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import random

numBins = 100
numTrials = 30000
n = 10  # change for n = 2, 3, 10, and 30

sampleMeans = []
for j in range(numTrials):
    sampleSum = 0
    for i in range(n):
        # uniform random integer in [0, numBins - 1]
        sampleSum += random.choice(range(numBins))
    sampleMeans.append(float(sampleSum) / float(n))

# histogram of the 30,000 sample means
plt.hist(sampleMeans, bins=numBins)
plt.show()
Remaining Mysteries
There should still be at least a few aspects of this not clear yet: (1) What does
“convergence” mean? (2) How do we formalize or talk about this notion of error?
(3) What does this say about our data P̄?
First, convergence refers to what happens as some parameter increases, in this
case n. As the number of data points increases and as n “goes to infinity,” the
above statement ( X̄ looks like a normal distribution) becomes more and more true.
For small n, the distribution may not quite look normal; it may be more bumpy or
maybe even multi-modal. The statistical definitions of convergence are varied, and
we will not go into them here; we will instead replace it with more useful phrasing
in explaining aspects (2) and (3).
Second, the error now has two components. We cannot simply say that P̄ is at
most some distance ε from μ. Something unusual may have happened (the sample
is random after all). And it is not useful to try to write the probability that P̄ = μ;
for equality in continuous distributions, this probability is indeed 0. But we can
combine these notions. We can say the distance between P̄ and μ is more than ε,
with probability at most δ. This is called “probably approximately correct” or PAC.
Third, we want to generate some sort of PAC bound (which is far more useful
than “ X̄ looks kind of like a normal distribution”). While a frequentist may be happy
with a confidence interval and a Bayesian may want a full normal posterior, these
two options are not directly available since again, X̄ is not exactly normal. So we will
discuss some very common concentration of measure tools. These do not exactly
capture the shape of the normal distribution, but provide upper bounds for its tails,
and will allow us to state PAC bounds.
We will introduce shortly the three most common concentration of measure bounds,
which provide increasingly strong bounds on the tails of distributions but require
more and more information about the underlying distribution f . Each provides a
probably approximately correct (PAC) bound of the following form:

Pr[|X̄ − E[X̄]| ≥ ε] ≤ δ.

That is, the probability is at most δ that X̄ (which is some random variable, often a
sum of iid random variables) is further than ε from its expected value (which is μ, the
expected value of f where Xi ∼ f). Note that we do not try to say this probability is
exactly δ; this is often too hard. In practice, there are a variety of tools, and a user
may try each one and see which one gives the best bound.
It is useful to think of ε as the error tolerance and δ as the probability of failure,
i.e., failure meaning that we exceed the error tolerance. However, often these bounds
will allow us to write the required sample size n in terms of ε and δ. This allows us
to trade these two terms off for any fixed known n; we can guarantee a smaller error
tolerance if we are willing to allow more probability of failure and vice-versa.
We will formally describe these bounds and give some intuition of why they are true
(but not full proofs). But what will be the most important is what they imply. If you
just know the distance of the expectation from the minimal value, you can get a very
weak bound. If you know the variance of the data, you can get a stronger bound. If
you know that the distribution f has a small and bounded range, then you can make
the probability of failure (the δ in PAC bounds) very very small.
Let X be a random variable such that X ≥ 0, that is, it cannot take on negative values.
Then the Markov inequality states that for any parameter α > 0,

Pr[X > α] ≤ E[X]/α.
Note this is a PAC bound with ε = α − E[X] and δ = E[X]/α, or we can rephrase
this bound as follows: Pr[X − E[X] > ε] ≤ δ = E[X]/(ε + E[X]).
Consider balancing the pdf of some random variable X on your finger at E[X], like
a waiter balances a tray. If your finger is not under a value μ, E[X] = μ, then the pdf
(and the waiter’s tray) will tip and fall in the direction of μ—the “center of mass.”
Now for some amount of probability α, how large can we increase its location so
that we retain E[X] = μ? For each part of the pdf we increase, we must decrease
some in proportion. However, by the assumption X ≥ 0, the pdf must not have a
mass below 0. In the limit of this, we can set Pr[X = 0] = 1 − α, and then move the
remaining α probability as far as possible, to a location t so that E[X] = α · t = μ;
that is, t = μ/α, exactly the extreme case in which the Markov inequality is tight.
[Figure: two example pdfs, each balancing at E[X] = 2.]
Consider the pdf f drawn in blue in the following figures, with E[X] for X ∼ f
marked as a red dot. The probability that X is greater than 5 (e.g., Pr[X ≥ 5]) is the
shaded area.
Notice that in both cases, Pr[X ≥ 5] is about 0.1. This is the quantity we want to
bound above by δ. But since E[X] is much larger in the first case (about 3.25), then
the bound δ = E[X]/α is much larger, about 0.65. In the second case, E[X] is much
smaller (about 0.6), so we get a much better bound of δ = 0.12.
Let R be a random variable describing the number of millimeters of rain that will
fall in Salt Lake City the next June. Say we know E[R] = 20 mm.
We use the Markov inequality to bound the probability that more than 50 mm
will fall as

Pr[R ≥ 50] ≤ E[R]/50 = 20/50 = 0.4.

Hence, based on the expected value alone, we can bound the probability that it will
rain more than 50 mm next June by 0.4.
Now let X be a random variable where we know Var[X] and E[X]. Then the Cheby-
shev inequality states that for any parameter ε > 0,

Pr[|X − E[X]| ≥ ε] ≤ Var[X]/ε².
Again, this clearly is a PAC bound with δ = Var[X]/ε 2 . This bound is typically
stronger than the Markov one since δ decreases quadratically in ε instead of linearly.
Compared to the Markov inequality, it requires knowledge of Var[X], but it does not
require X ≥ 0.
Again let R be a random variable for the millimeters of rain in Salt Lake City next
June with E[R] = 20. If we also know that the variance is not too large, specifically
Var[R] = 9 (millimeters squared), then we can apply the Chebyshev inequality to
get an improved bound:
Pr[R ≥ 50] ≤ Pr[|R − E[R]| ≥ 30] ≤ Var[R]/30² = 9/900 = 0.01.
That is, by using the expectation (E[R] = 20) and variance (Var[R] = 9), we can
reduce the probability of exceeding 50 mm to at most probability 0.01.
Note that in the first inequality, we convert from a one-sided expression R ≥ 50 to
a two-sided expression |R−E[R]| ≥ 30 (that is either R−E[R] ≥ 30 or E[R]−R ≥ 30).
This is a bit wasteful, and stricter one-sided variants of Chebyshev inequality exist;
we will not discuss these here in an effort for simplicity.
A useful extension applies to the sample mean X̄ = (1/n) ∑_{i=1}^n Xi of n iid random
variables X1, X2, . . . , Xn, each with variance σ² = Var[Xi]; since Var[X̄] = σ²/n, the
Chebyshev inequality provides

Pr[|X̄ − E[Xi]| ≥ ε] ≤ σ²/(nε²).

Consider now that we have input parameters ε and δ, our desired error tolerance and
probability of failure. If we can draw Xi ∼ f (iid) for an unknown f (with known
expected value and variance σ²), then we can solve for how large n needs to be:
n = σ²/(ε²δ).
Since E[X̄] = E[Xi] for iid random variables X1, X2, . . . , Xn, there is not a similar
meaningfully different extension for the Markov inequality which demonstrates
improved concentration with increase in data size n.
Following the above extension of the Chebyshev inequality, we can consider a set
of n iid random variables X1, X2, . . . , Xn where X̄ = (1/n) ∑_{i=1}^n Xi. Now assume we
know that each Xi lies in a bounded domain [b, t], and let Δ = t − b. Then the
Chernoff-Hoeffding inequality states for any parameter ε > 0,

Pr[|X̄ − E[X̄]| > ε] ≤ 2 exp(−2ε²n/Δ²).

Again this is a PAC bound, now with δ = 2 exp(−2ε²n/Δ²). For a desired error
tolerance ε and failure probability δ, we can set n = (Δ²/(2ε²)) ln(2/δ). Note that
this has a similar relationship with ε as the Chebyshev bound, but the dependence
of n on δ is exponentially less for this bound.
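These sample size formulas can be compared directly; a small sketch (the function names are ours):

import math

def n_chebyshev(eps, delta, sigma2):
    # n = sigma^2 / (eps^2 * delta)
    return math.ceil(sigma2 / (eps ** 2 * delta))

def n_chernoff_hoeffding(eps, delta, Delta):
    # n = (Delta^2 / (2 eps^2)) * ln(2 / delta)
    return math.ceil((Delta ** 2 / (2 * eps ** 2)) * math.log(2 / delta))

# e.g., error tolerance eps = 0.05 and failure probability delta = 0.01,
# for a distribution with sigma^2 = 1 and range Delta = 1
print(n_chebyshev(0.05, 0.01, 1.0))           # 40000
print(n_chernoff_hoeffding(0.05, 0.01, 1.0))  # 1060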
Consider rolling a fair die 120 times and recording how many times a 3 is returned.
Let T be the random variable for the total number of 3s rolled. Each roll is a 3 with
probability 1/6, so the expected number of 3s is E[T] = 20. We would like to answer
what the probability is that more than 40 rolls return a 3.
To do so, we analyze n = 120 iid random variables T1, T2, . . . , Tn associated with
each roll. In particular, Ti = 1 if the ith roll is a 3 and is 0 otherwise; each Ti lies in
[0, 1], so Δ = 1. Thus E[Ti] = 1/6 for each roll. Using T̄ = T/n = (1/n) ∑_{i=1}^n Ti and
noting that Pr[T ≥ 40] = Pr[T̄ ≥ 1/3], we can now apply our Chernoff-Hoeffding
bound as

Pr[T ≥ 40] ≤ Pr[ |T̄ − E[Ti]| ≥ 1/6 ] ≤ 2 exp(−2(1/6)² · 120 / 1²) = 2 exp(−20/3) ≤ 0.0026.
So we can say that less than 3 out of 1000 times of running these 120 rolls, we should
see at least 40 returned 3s.
In comparison, we could have also applied a Chebyshev inequality. The variance of
a single random variable Ti is Var[Ti] ≤ 5/36, and hence Var[T] = n · Var[Ti] = 50/3.
Hence we can bound

Pr[T ≥ 40] ≤ Pr[|T − E[T]| ≥ 20] ≤ Var[T]/20² = (50/3)/400 ≈ 0.042.

That is, using the Chebyshev inequality, we were only able to claim that this event
should occur at most 42 times out of 1000 trials.
Finally, we note that in both of these analyses, we only seek to bound the prob-
ability that the number of rolls of 3 exceeds some threshold (≥ 40), whereas the
inequality we used bounded the absolute value of the deviation from the mean. That
is, our goal was one-way, and the inequality was a stronger two-way bound. Indeed
these results can be improved by roughly a factor of 2 by using similar one-way
inequalities that we do not formally state here.
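As a sanity check, the exact probability in this example can be estimated by simulation; the following sketch (an illustration we add, not from the text) estimates Pr[T ≥ 40] empirically.

import numpy as np

rng = np.random.default_rng(0)
trials = 1_000_000
# the number of 3s in 120 fair rolls is Binomial(120, 1/6)
counts = rng.binomial(120, 1/6, size=trials)
print((counts >= 40).mean())  # empirically tiny, well below 0.0026 and 0.042

Both bounds are valid but conservative; the Chernoff-Hoeffding bound is simply far less conservative here.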
Relating this all back to the Gaussian distribution in the CLT, the Chebyshev
bound only uses the variance information about the Gaussian, but the Chernoff-
Hoeffding bound uses all of the “moments”: this allows the probability of failure to
decay exponentially.
These are the most basic and common PAC concentration of measure bounds, but
are by no means exhaustive.
Consider a random variable X drawn from the uniform distribution on [0, 2], so that
μ = E[X] = 1, σ² = Var[X] = 1/3, and X always lies in a range of size Δ = 2.
• Using the Markov inequality, we can say that Pr[X > 1.5] ≤ 1/1.5 ≈ 0.6666 and
Pr[X > 3] ≤ 1/3 ≈ 0.3333, or equivalently Pr[X − μ > 0.5] ≤ 2/3 and Pr[X − μ > 2] ≤ 1/3.
• Using the Chebyshev inequality, we can say that Pr[|X − μ| > 0.5] ≤ (1/3)/0.5² =
4/3 (which is meaningless). But Pr[|X − μ| > 2] ≤ (1/3)/2² = 1/12 ≈ 0.0833.
Now consider a set of n = 100 random variables X1, X2, . . . , Xn all drawn iid from
the same pdf f as above. Now we can examine the random variable X̄ = (1/n) Σ_{i=1}^n Xi.
We know that μn = E[X̄] = μ and that σn² = Var[X̄] = σ²/n = 1/(3n) = 1/300.
• Using the Chebyshev inequality, we can say that Pr[|X̄ − μ| > 0.5] ≤ σn²/(0.5)² =
1/75 ≈ 0.01333 and Pr[|X̄ − μ| > 2] ≤ σn²/2² = 1/1200 ≈ 0.0008333.
• Using the Chernoff-Hoeffding bound, we can say that Pr[|X̄ − μ| > 0.5] ≤
2 exp(−2(0.5)²n/Δ²) = 2 exp(−100/8) ≈ 0.0000074533 and Pr[|X̄ − μ| > 2] ≤
2 exp(−2(2)²n/Δ²) = 2 exp(−200) ≈ 2.76 · 10−87.
Union Bound
Consider s possibly dependent random events Z1, . . . , Zs. The probability that all
events occur is at least

1 − Σ_{j=1}^s (1 − Pr[Zj]).
Returning to the example of rolling a fair die n = 120 times, and bounding the prob-
ability that a 3 was returned more than 40 times, let’s now consider the probability
that no number was returned more than 40 times.¹ Each number corresponds with a
random event Z1, Z2, Z3, Z4, Z5, and Z6, of that number occurring at most 40 times.
These events are not independent, but nonetheless we can apply the union bound.

¹ I suppose these situations are the villains in this analogy, like “Riddler,” “Joker,” and “Poison Ivy.” The union bound can also aid other concentration inequalities like Chebyshev, which I suppose is like “Catwoman.”
Using our Chebyshev inequality result that Pr[Z3] ≥ 1 − 0.042 = 0.958, we can
apply this symmetrically to all Zj. Then by the union bound, the probability
that all numbers occur at most 40 times on n = 120 independent rolls is at least

1 − Σ_{j=1}^6 (1 − 0.958) = 0.748.
Alternatively, we can use the result from the Chernoff-Hoeffding bound that
Pr[Zj] ≥ 1 − 0.0026 = 0.9974 inside the union bound to obtain that all numbers
occur no more than 40 times with probability at least

1 − Σ_{j=1}^6 (1 − 0.9974) = 0.9844.
So this joint event (of all numbers occurring at most 40 times) occurs more than 98%
of the time, but using Chebyshev, we were unable to claim that it happened more
than 75% of the time.
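The following small simulation (ours) checks the joint event directly: the fraction of experiments in which every face appears at most 40 times in 120 rolls.

import numpy as np

rng = np.random.default_rng(1)
trials = 100_000
ok = 0
for _ in range(trials):
    faces = rng.integers(1, 7, size=120)          # 120 fair-die rolls
    counts = np.bincount(faces, minlength=7)[1:]  # counts for faces 1..6
    ok += int(counts.max() <= 40)
print(ok / trials)  # empirically very close to 1, consistent with >= 0.9844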
Quantile Estimation
An important use case of concentration inequalities and the union bound is to estimate
distributions. For random variable X, let fX describe its pdf and FX its cdf. Suppose
now we can draw iid samples P = {p1, p2, . . . , pn } from fX , then we can use these
n data points to estimate the cdf FX . To understand this approximation, recall that
FX (t) is the probability that random variable X takes a value at most t. For any choice
of t, we can estimate this from P as nrank P (t) = |{pi ∈ P | pi ≤ t}|/n, i.e., as the
fraction of samples with value at most t. The quantity nrank P (t) is the normalized
rank of P at value t, and the value of t for which nrank P (t) ≤ φ < nrank P (t + η), for
any η > 0, is known as the φ-quantile of P. For instance, when nrank P (t) = 0.5, it
is the 0.5-quantile and thus the median of the data set. The interquartile range is the
interval [t1, t2 ] such that t1 is the 0.25-quantile and t2 is the 0.75-quantile. We can
similarly define the φ-quantile (and hence the median and interquartile range) for a
distribution fX as the value t such that FX (t) = φ.
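These definitions translate directly to code. The following minimal sketch (our own; nrank and quantile are not library functions) computes the normalized rank and a φ-quantile of a sample.

import numpy as np

def nrank(P, t):
    # fraction of samples with value at most t
    return np.sum(P <= t) / len(P)

def quantile(P, phi):
    # smallest sample value whose normalized rank reaches phi (0 < phi <= 1)
    return np.sort(P)[int(np.ceil(phi * len(P))) - 1]

P = np.array([3.0, 1.0, 4.0, 1.5, 9.0, 2.6, 5.3, 5.8])
print(nrank(P, 3.0))     # 0.5
print(quantile(P, 0.5))  # 3.0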
The following illustration shows a cdf FX (in blue) and its approximation via nor-
malized rank nrank P on a sampled point set P (in green). The median of the P and
its interquartile range are marked (in red).
For a given value t, we can quantify how well nrank P (t) approximates FX (t) using
a Chernoff-Hoeffding bound. For a given t, for each sample pi , describe a random
variable Yi which is 1 if pi ≤ t and 0 otherwise. Observe that E[Yi ] = FX (t), since it
is precisely the probability that a random variable Xi ∼ fX (representing data point
pi, not yet realized) is at most t. Moreover, the random variable for nrank P(t) on a
future iid sample P is precisely Ȳ = (1/n) Σ_{i=1}^n Yi. Hence, we can provide a PAC bound
on the probability (δ) of achieving more than ε error for any given t, as

Pr[|Ȳ − FX(t)| ≥ ε] ≤ 2 exp(−2ε²n/1²) = δ.
If we have a desired error ε (e.g., ε = 0.05) and probability of failure δ (e.g.,
δ = 0.01), we can solve for how many samples are required: these values are
satisfied with
n ≥ (1/(2ε²)) ln(2/δ).
However, the above analysis only works for a single value of t. What if we wanted
to show a similar analysis simultaneously for all values of t, that is, with how many
samples n can we then ensure that with probability at least 1 − δ, for all values of t
we will have | nrank P (t) − FX (t)| ≤ ε?
We will apply the union bound, but there is another challenge we face: there are an
infinite number of possible values t for which we want this bound to hold! We address
this by splitting the error component into two pieces ε1 and ε2 so ε = ε1 + ε2; we
can set ε1 = ε2 = ε/2. Now we consider 1/ε1 different quantiles {φ1, φ2, . . . , φ_{1/ε1}}
where φj = j · ε1 − ε1/2. This divides up the probability space (i.e., the interval
[0, 1]) into 1/ε1 + 1 parts, so the gap between the boundaries of the parts is ε1.
We will guarantee that each of these φj-quantiles are ε2-approximated. Each φj
corresponds with a tj so tj is the φj-quantile of fX. We do not need to know what
the precise values of tj are; however, we do know that tj ≤ tj+1, so they grow
monotonically. In particular, this implies that for any t with tj ≤ t ≤ tj+1, it must be
that FX(tj) ≤ FX(t) ≤ FX(tj+1) and nrankP(tj) ≤ nrankP(t) ≤ nrankP(tj+1), since both
functions are monotonic.
The illustration shows the set {φ1, φ2, . . . , φ8} of quantile points overlayed on the cdf
FX. With ε1 = 1/8, these occur at values φ1 = 1/16, φ2 = 3/16, φ3 = 5/16, . . .,
and evenly divide the y-axis. The corresponding values {t1, t2, . . . , t8 } non-uniformly
divide the x-axis. But as long as any consecutive pair t j and t j+1 is approximated,
because the cdf FX and nrank are monotonic, then all intermediate values t ∈ [t j , t j+1 ]
are also approximated.
So what remains is to show that for all tj ∈ T = {t1, . . . , t_{1/ε1}}, the random variable
Ȳ(tj) for nrankP(tj) satisfies |Ȳ(tj) − FX(tj)| ≤ ε2. By the above Chernoff-
Hoeffding bound, this holds with probability 1 − 2 exp(−2(ε2)²n) for each
tj. Applying the union bound over these s = 1/ε1 events, we find that they all hold
with probability at least

1 − Σ_{j=1}^s 2 exp(−2(ε2)²n) = 1 − (1/ε1) · 2 exp(−2(ε2)²n).
Setting ε1 = ε2 = ε/2, a sample of size n will provide a nrankP function so for any t
we have |nrankP(t) − FX(t)| ≤ ε, with probability at least 1 − (4/ε) exp(−ε²n/2). Setting
the probability of failure (4/ε) exp(−ε²n/2) = δ, we can solve for n to see that we get at
most ε error with probability at least 1 − δ using n = (2/ε²) ln(4/(εδ)) samples. Using VC
dimension-based arguments (see Section 9.4), we can reduce the number of samples
required from proportional to (1/ε²) log(1/ε) to 1/ε².
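The uniform guarantee can be observed empirically; this sketch (ours) draws n samples from the uniform distribution on [0, 1], where FX(t) = t, and computes the maximum deviation max_t |nrankP(t) − FX(t)|, which always occurs at a sample point.

import numpy as np

rng = np.random.default_rng(2)
n = 1337  # about (2/eps^2) ln(4/(eps*delta)) for eps = 0.1, delta = 0.05
P = np.sort(rng.uniform(0, 1, size=n))
ranks = np.arange(1, n + 1) / n  # nrank evaluated at each sorted sample
max_dev = max(np.max(np.abs(ranks - P)), np.max(np.abs(ranks - 1/n - P)))
print(max_dev)  # typically well below eps = 0.1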
Many important convergence bounds deal with approximating the mean of a distri-
bution. When the samples are all uniformly drawn from an unknown distribution, the
above bounds (and their generalizations) are the best way to understand convergence,
up to small factors. In particular, this is true when the only access to the data is a
new iid sample.
However, when more control can be made over how the sample is generated, then
in some cases simple changes can dramatically improve the accuracy of the estimates.
The key idea is called importance sampling and has the following principle: sample
larger weight data points more frequently, but in the estimate weigh them inverse to
the sampling probability.
Consider a discrete and very large set A = {a1, a2, . . . , an } where each element ai
has an associated weight w(ai ). Our goal will be to estimate the expected (or average)
weight
w̄ = E[w(ai)] = (1/n) Σ_{ai ∈ A} w(ai).
This set may be so large that we do not want to explicitly compute the sum, or
perhaps soliciting the weight is expensive (e.g., like conducting a customer survey).
So we want to avoid doing this calculation over all items. Rather, we sample k items
iid {a′1, a′2, . . . , a′k} (each a′j uniformly and independently chosen from A, so some
may be taken more than once), solicit the weight w(a′j) of each a′j, and estimate the
average weight as

ŵ = (1/k) Σ_{j=1}^k w(a′j).
How accurately does ŵ approximate w̄? If all of the weights are roughly uniform
or well-distributed in [0, 1], then we can apply a Chernoff-Hoeffding bound so that
Pr[|ŵ − w̄| ≥ ε] ≤ 2 exp(−2ε²k).
Importance Sampling
We slightly recast the problem assuming a bit more information. There is a large set of
items A = {a1, a2, . . . , an}, and on sampling an item a′j, its weight w(a′j) is revealed.
Our goal is to estimate w̄ = (1/n) Σ_{ai ∈ A} w(ai). We can treat Wi as a random variable
for each weight w(ai); then w̄ = (1/n) Σ_{i=1}^n Wi is also a random variable. In
this setting, we also know for each ai (before sampling) some information about the
range of its weight. That is, we have a bounded range [0, ψi], so 0 ≤ w(ai) ≤ ψi.
This upper bound serves as an importance ψi for each ai. Let Ψ = Σ_{i=1}^n ψi be the
sum of all importances.
As alluded to above, the solution is the following two-step procedure called
importance sampling: (1) sample k items a′1, a′2, . . . , a′k independently, where each
a′j is chosen to be ai with probability proportional to its importance, ψi/Ψ; and
(2) estimate the average weight as wI = (1/k) Σ_{j=1}^k (Ψ/(nψj)) w(a′j), reweighting
each solicited weight inversely to its sampling probability.
We will first show that importance sampling provides an unbiased estimate, that
is, E[wI] = w̄. Define a random variable Zj to be the value (Ψ/(nψj)) w(a′j). By linearity
of expectation and the independence of the samples, E[wI] = (1/k) Σ_{j=1}^k E[Zj] =
E[Zj]. Sampling proportional to ψi means object ai is chosen with probability ψi/Ψ.
Summing over all elements, we get

E[wI] = E[Zj] = Σ_{i=1}^n Pr[a′j = ai] · (Ψ/(nψi)) · w(ai) = Σ_{i=1}^n (ψi/Ψ) · (Ψ/(nψi)) · w(ai)
       = (1/n) Σ_{i=1}^n w(ai) = w̄.
Note that this worked for any choice of ψi . Indeed, uniform sampling (which
implicitly has ψi = 1 for each ai ) also is an unbiased estimator. The real power of
importance sampling is that it improves the concentration of the estimates.
Improved Concentration
To improve the concentration, the critical step is to analyze the range of each estimator
(Ψ/(nψj)) · w(a′j). Since we have that w(a′j) ∈ [0, ψj], then as a result

(Ψ/(nψj)) · w(a′j) ∈ (Ψ/(nψj)) · [0, ψj] = [0, Ψ/n].
Now applying a Chernoff-Hoeffding bound over the k samples (each with range Δ = Ψ/n),
we can upper-bound the probability that wI has more than ε error with respect to w̄:

Pr[|wI − w̄| ≥ ε] ≤ 2 exp(−2ε²k/(Ψ/n)²) = δ.
Fixing the allowed error ε and probability of failure δ, we can solve for the number
of samples required as

k = ((Ψ/n)²/(2ε²)) ln(2/δ).
Now instead of depending quadratically on the largest possible value Δ as in
uniform sampling, this now depends quadratically on the average upper bound on
all values, Ψ/n. In other words, with importance sampling, we reduce the sample
complexity from depending on the maximum importance maxi ψi to depending on
the average importance Ψ/n.
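A minimal sketch of this two-step procedure (our illustration; the example weights loosely echo the salary example that follows) looks as follows.

import numpy as np

def importance_sampling(w, psi, k, rng):
    n, Psi = len(w), np.sum(psi)
    # step 1: draw k indices, each equal to i with probability psi_i / Psi
    idx = rng.choice(n, size=k, p=psi / Psi)
    # step 2: average the reweighted solicited weights (Psi/(n psi_j)) w(a'_j)
    return np.mean(Psi / (n * psi[idx]) * w[idx])

rng = np.random.default_rng(3)
w = np.concatenate(([2_000_000.0], np.full(9_999, 45_000.0)))    # true values
psi = np.concatenate(([2_000_000.0], np.full(9_999, 50_000.0)))  # upper bounds
print(w.mean(), importance_sampling(w, psi, 100, rng))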
Consider a company with n = 10,000 employees and we want to estimate the average
salary. However the salaries are very imbalanced; the CEO makes way more than
the typical employee does. Say we know the CEO makes at most 2 million a year,
but the other 9,999 employees make at most 50 thousand a year.
Using just uniform sampling of k = 100 employees, we can apply a Chernoff-
Hoeffding bound to estimate the average salary ŵ from the true average salary w̄
with error more than $8,000 with probability
Pr[|ŵ − w̄| ≥ 8,000] ≤ 2 exp(−2(8,000)² · 100/(2 million)²) = 2 exp(−2/625) ≈ 1.99.
This is a useless bound, since the probability is greater than 1. If we increase the
error tolerance to half a million, we still only get a good estimate with probability
0.42. The problem hinges on whether we sample the CEO: if we do, our estimate is
too high; if we do not, the estimate is too low.
Now using importance sampling, the CEO gets an importance of 2 million, and
the other employees all get an importance of 50 thousand. The average importance
is now Ψ/n = 50,195, and we can then bound the probability that the new estimate
wI is more than $8,000 from w̄ as at most

Pr[|wI − w̄| ≥ 8,000] ≤ 2 exp(−2(8,000)² · 100/(50,195)²) ≤ 0.013.
So now at least 98.7% of the time, we get an estimate within $8,000. In fact, we get
an estimate within $4,000 at least 43% of the time. Basically, this works because
we sample the CEO far more often than under uniform sampling (about 40 times
more frequently), but then weigh that contribution down by the same factor. On the
other hand, when we sample a different employee, we increase the effect of their
salary by only about 0.4%.
Implementation
In this illustration, 6 elements with normalized weights w(ai )/W are depicted in a
bar chart on the left. These bars are then stacked end-to-end in a unit interval on the
right, precisely stretched from 0.00 to 1.00. The ti values mark the accumulation
of probability that one of the first i values is chosen. Now when a random value
u ∼ unif(0, 1] is chosen at random, it maps into this “partition of unity” and selects
an item. In this case, it selects item a4 since u = 0.68 and t3 = 0.58 and t4 = 0.74
for t3 < u ≤ t4 .
[Figure: the six normalized weights w(ai)/W shown as a bar chart (left), and stacked end-to-end on the unit interval with boundaries t0 < t1 < . . . < t6 (right); the draw u = 0.68 falls in the interval (t3, t4] and selects item a4.]
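In code, this partition-of-unity sampler is a cumulative sum plus a binary search; a minimal numpy sketch (ours) follows.

import numpy as np

def weighted_sample(w, k, rng):
    t = np.cumsum(w) / np.sum(w)   # boundaries t_1, ..., t_n, with t_n = 1
    u = rng.uniform(0, 1, size=k)  # k independent draws
    return np.searchsorted(t, u)   # index i such that t_{i-1} < u <= t_i

rng = np.random.default_rng(4)
w = np.array([2.0, 3.0, 10.0, 8.0, 1.0])
print(weighted_sample(w, 5, rng))  # indices drawn proportional to w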
Many examples discussed in this book analyze data assumed to be k elements drawn
iid from a distribution. However, when algorithmically generating samples, it can be
advantageous to sample k elements without replacement from a known distribution.
While this can make the analysis slightly more complicated, variants of the Chernoff-
Hoeffding bound exist for without-replacement random samples instead of indepen-
dent random variables. Moreover, in practice the concentration and approximation
quality is often improved using without-replacement samples; this is especially true
when drawing weighted samples.
When sampling data proportional to weights, if elements exist with sufficiently
large weights, then it is best to always sample these high-weight elements. The
low-weight ones need to be selected with some probability, and this should be
proportional to their weights, and then re-weighted as in importance sampling. A
technique called priority sampling elegantly combines these properties. It proceeds
as follows: each element ai is assigned a priority ρi = w(ai)/ui, where each
ui ∼ unif(0, 1] is drawn independently; let τ be the (k + 1)st largest priority; each of
the k elements with priority above τ is retained with a new weight w′(ai) = max(w(ai), τ),
and every other element gets new weight w′(ai) = 0.
The new weight function w′ : A → [0, ∞) has only k items with non-zero values;
only those need to be retained in the sample A′. This has many nice properties, most
importantly E[w′(ai)] = w(ai). Thus for any subset S ⊂ A, we can estimate the sum
of weights in that subset Σ_{ai ∈ S} w(ai) using only Σ_{ai ∈ S ∩ A′} w′(ai), and this has the
correct expected value. Thus for wP = (1/n) Σ_{i=1}^n w′(ai) as an analog to importance
sampling, we also have E[wP] = w̄.
Additionally, the elements with very large weights (those with weight above τ) are
always retained. This is because ρi ≥ w(ai) for all i (since 1/ui ≥ 1), so if w(ai) > τ
then ρi > τ and it is always chosen, and its new weight is w′(ai) = max(w(ai), τ) =
w(ai), which is the same as before. Hence, for a fixed τ this item has no variance in its
effect on the estimate. The remaining items have weights assigned as if in importance
sampling, and so the overall estimate has small (and indeed near-optimal) variance.
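A compact sketch of priority sampling, consistent with the description above (our implementation; the weights below are illustrative):

import numpy as np

def priority_sample(w, k, rng):
    n = len(w)
    u = 1.0 - rng.uniform(0, 1, size=n)   # u_i ~ unif(0, 1]
    rho = w / u                           # priorities rho_i = w_i / u_i
    order = np.argsort(-rho)              # decreasing priority
    keep, tau = order[:k], rho[order[k]]  # retain top k; tau = (k+1)st priority
    w_new = np.zeros(n)
    w_new[keep] = np.maximum(w[keep], tau)
    return w_new

rng = np.random.default_rng(5)
w = np.array([1.80, 1.51, 1.20, 0.85, 0.65, 0.51, 0.42, 0.39, 0.18, 0.08])
print(priority_sample(w, 4, rng))  # E[w'(a_i)] = w(a_i) for each item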
In this example, 10 items are shown with weights from w(a10) = 0.08 to w(a1) =
1.80. For a clearer picture, they are sorted in decreasing order. Each is then given
a priority by dividing its weight by a different ui ∼ unif(0, 1] for each element.
[Figure: the original weights, the resulting priorities with the threshold τ marked, and the new weights w′(ai) of the retained elements.]
Exercises
2.1 Consider a pdf f so that a random variable X ∼ f has expected value
E[X] = 3 and variance Var[X] = 10. Now consider n = 10 iid random variables
X1, X2, . . . , X10 drawn from f. Let X̄ = (1/10) Σ_{i=1}^{10} Xi.
1. What is E[ X̄]?
2. What is Var[ X̄]?
3. What is the standard deviation of X̄?
4. Which is larger, Pr[X > 4] or Pr[ X̄ > 4]?
5. Which is larger, Pr[X > 2] or Pr[ X̄ > 2]?
2.2 Let X be a random variable that you know is in the range [−1, 2] and you know
has an expected value of E[X] = 0. Use the Markov inequality to upper-bound
Pr[X > 1.5].
(Hint: you will need to use a change of variables.)
2.3 For a random variable X, let Y = Var[X] = E[(X − E[X])2 ] be another random
variable representing its variance. Apply the Markov inequality to Y , and some
algebra, to derive the Chebyshev inequality for X.
2.5 Consider a (parked) self-driving car that returns n iid estimates of the distance
to a tree. We will model these n estimates as a set of n scalar random variables
X1, X2, . . . , Xn taken iid from an unknown pdf f, which we assume models the true
distance plus unbiased noise (the sensor can take many iid estimates in rapid-fire
fashion). The sensor is programmed to only return values between 0 and 20 feet,
and the variance of the sensing noise is 64 square feet. Let X̄ = (1/n) Σ_{i=1}^n Xi. We
want to understand, as a function of n, how close X̄ is to μ, which is the true distance
to the tree.
1. Use Chebyshev’s inequality to determine a value n so that Pr[| X̄ − μ| ≥ 1] ≤ 0.5.
2. Use Chebyshev’s inequality to determine a value n so that Pr[| X̄ − μ| ≥ 0.1] ≤ 0.1.
3. Use the Chernoff-Hoeffding bound to determine a value n so that Pr[| X̄ − μ| ≥
1] ≤ 0.5.
4. Use the Chernoff-Hoeffding bound to determine a value n so that Pr[| X̄ − μ| ≥
0.1] ≤ 0.1.
2.6 Consider two random variables C and T describing how many coffees and teas I
will buy in the coming week; clearly neither can be smaller than 0. Based on personal
experience, I know the following summary statistics about my coffee and tea buying
habits: E[C] = 3 and Var[C] = 1; also E[T] = 2 and Var[T] = 5.
1. Use Markov’s inequality to upper-bound the probability that I buy 4 or more
coffees, and the same for teas: Pr[C ≥ 4] and Pr[T ≥ 4].
2. Use Chebyshev’s inequality to upper-bound the probability that I buy 4 or more
coffees, and the same for teas: Pr[C ≥ 4] and Pr[T ≥ 4].
2.7 The average score on a test is 82 with a standard deviation of 4 percentage points.
All tests have scores between 0 and 100.
1. Using Chebyshev’s inequality, what percentage of the tests have a grade of at least
70 and at most 94?
2. Using Markov’s inequality, what is the highest percentage of tests which could
have a score less than 60?
2.8 Consider a random variable X with expected value E[X] = 7 and variance
Var[X] = 2. We would like to upper-bound the probability Pr[X < 5].
1. Which bound can and cannot be used with what we know about X (Markov,
Chebyshev, or Chernoff-Hoeffding), and why?
2. Using that bound, calculate an upper bound for Pr[X < 5].
3. Describe a probability distribution for X where the other two bounds are definitely
not applicable.
2.9 Consider n iid random variables X1, X2, . . . , Xn with expected value E[Xi] = 20
and variance Var[Xi] = 2. Assume we also know that each Xi must satisfy 15 ≤ Xi ≤
22. We now want to analyze the random variable of their average X̄ = (1/n) Σ_{i=1}^n Xi.
Assume first that n = 20 (the number of random variables).
1. Use the Chebyshev inequality to upper-bound Pr[ X̄ > 21].
2. Use the Chernoff-Hoeffding inequality to upper-bound Pr[ X̄ > 21].
Now assume instead that n = 200 (the number of random variables).
3. Use the Chebyshev inequality to upper-bound Pr[ X̄ > 21].
4. Use the Chernoff-Hoeffding inequality to upper-bound Pr[ X̄ > 21].
2.10 Consider 5 items a1, a2, a3, a4, a5 with weights w1 = 2, w2 = 3, w3 = 10,
w4 = 8, and w5 = 1. But before sampling, we only know upper bounds on the
weights with w1 ∈ [0, 4], w2 ∈ [0, 4], w3 ∈ [0, 10], w4 ∈ [0, 10], and w5 ∈ [0, 2].
1. In the context of importance sampling, list the importance ψi for each item.
2. Use the Partition of Unity approach to sample 2 items according to their impor-
tance. Use u1 = 0.377 and u2 = 0.852 as the “random” values in unif(0, 1].
3. Report the estimate of the average weight using importance sampling, based on
the two items sampled.
Chapter 3
Linear Algebra Review
Abstract We briefly review many key definitions and aspects of linear algebra that
will be necessary for the remainder of the book. This review includes basic operations
for vectors and matrices. A highlight is the dot product and its intuitive geometric
properties.
For the context of data analysis, the critical part of linear algebra deals with vectors
and matrices of real numbers.
In this context, a vector v = (v1, v2, . . . , vd) is equivalent to a point in Rd. By
default, a vector will be a column of d numbers (where d is context specific)

v = [v1; v2; . . . ; vd],
but in some cases we will assume the vector is a row
vT = [v1 v2 . . . vd ].
An n × d matrix A can be written as a set of n row vectors stacked, A = [a1; a2; . . . ; an],
where vector ai = [Ai,1, Ai,2, . . . , Ai,d], and Ai,j is the element of the matrix in the
ith row and jth column. We can write A ∈ Rn×d when it is defined on the reals.
Note that a vector will always print as a row vector even if it is a column vector. The
transpose option reverses the role of columns and rows, and the operation in Python
can be observed using
print(v.T)
print(A.T)
However, again since Python always prints vectors as rows, for the vector v those
results will display the same.
An element of a vector or matrix, or submatrix can be printed as
print(v[2])
print(A[1, 2])
print(A[:, 1:3])
[Figure: example points a1 = (−0.5, 1.5), a2 = (2.5, 0.75), and a3 = (1, 1) plotted in R2.]
A transpose operation (·)T reverses the roles of the rows and columns, as seen
above with vector v. For a matrix, we can write

AT = [a1 a2 . . . an]  (the rows of A laid out as columns),

so that the entry in row j and column i of AT is Ai,j; that is,

AT = [A1,1 A2,1 . . . An,1; A1,2 A2,2 . . . An,2; . . . ; A1,d A2,d . . . An,d].
For example, a system of linear equations can be written in matrix form as

Ax = b,

where

b = [−2; 6],   x = [x1; x2; x3],   and   A = [3 −7 2; −1 2 −5].
We can add together two vectors or two matrices only if they have the same dimen-
sions. For vectors x = (x1, x2, . . . , xd ) ∈ Rd and y = (y1, y2, . . . , yd ) ∈ Rd , then
vector
z = x + y = (x1 + y1, x2 + y2, . . . , xd + yd ) ∈ Rd .
Vector addition can be geometrically realized as just chaining two vectors together.
It is easy to see that this operation is commutative. That is x + y = y + x, since it
does not matter which order we chain the vectors, both result in the same summed-to
point.
Matrix addition can be performed in Python as

import numpy as np
A = np.array([[1, 3, 7], [2, 0, 4]])  # any matrix with the same shape as B
B = np.array([[2, 5, 3], [4, 8, 1]])
print(A + B)
Matrix-Matrix Products
Multiplication only requires alignment along one dimension. For two matrices A ∈
Rn×d and B ∈ Rd×m, we can obtain a new matrix C = AB ∈ Rn×m where Ci,j, the
element in the ith row and jth column of C, is defined as

Ci,j = Σ_{k=1}^d Ai,k Bk,j.
For example, multiplying a 2 × 2 matrix G by the 2 × 3 matrix B from above results in

C = [14 29 6; 36 76 19],

where the upper left corner C1,1 = G1,1 · B1,1 + G1,2 · B2,1 = 1 · 2 + 3 · 4 = 14.
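In Python, this product is the @ operator. The left matrix G is not shown in the surviving text, but its entries are forced by the displayed computation (C1,1 = 1 · 2 + 3 · 4 and the other entries of C), giving the following check.

import numpy as np

G = np.array([[1, 3], [4, 7]])        # inferred from the entries of C
B = np.array([[2, 5, 3], [4, 8, 1]])
print(G @ B)                          # [[14 29  6], [36 76 19]]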
Vector-Vector Products
There are two types of vector-vector products, and their definitions follow directly
from that of matrix-matrix multiplication (since a vector is a matrix where one of
the dimensions is 1). But it is worth highlighting these.
Given two column vectors x, y ∈ Rd , the inner product or dot product is written
as
xTy = x · y = ⟨x, y⟩ = [x1 x2 . . . xd] [y1; y2; . . . ; yd] = Σ_{i=1}^d xi yi,
where xi is the ith element of x, and similarly yi for y. This text will prefer the notation
⟨x, y⟩ since the same can be used for row vectors, and there is no confusion with
scalar multiplication in using x · y. Whether a vector is a row or a column is often
arbitrary; in a computer, they are typically stored the same way in memory.
Note that this dot product operation produces a single scalar value. And it is a
linear operator. So this means for any scalar value α and three vectors x, y, z ∈ Rd,
we have

⟨αx, y + z⟩ = α⟨x, y + z⟩ = α(⟨x, y⟩ + ⟨x, z⟩).

This operation is associative, distributive, and commutative.
[Figure: the unit vector u = (3/5, 4/5), the vector v = (2, 1), and the projection πu(v) of v onto the line through u.]
Moreover, since ‖u‖ = length(u) = 1, then we can also interpret ⟨u, v⟩ as the
length of v projected onto the line through u. That is, let πu(v) be the closest point
to v on the line through u (the line through u and the line segment from v to πu(v)
make a right angle). Then

⟨u, v⟩ = length(πu(v)) = ‖πu(v)‖.
Matrix-Vector Products
3.3 Norms
The Euclidean norm of a vector v = (v1, v2, . . . , vd) ∈ Rd is ‖v‖ = ‖v‖2 =
√(Σ_{i=1}^d vi²) = √⟨v, v⟩. This measures the “straight-line” distance from the origin to
the point at v. A vector v with norm ‖v‖ = 1 is said to be a unit vector; sometimes a
vector x with ‖x‖ = 1 is said to be normalized.
However, a “norm” is a more general concept. A class called Lp norms is well-
defined for any parameter p ∈ [1, ∞) as

‖v‖p = (Σ_{i=1}^d |vi|^p)^{1/p}.
The Frobenius norm of a matrix A ∈ Rn×d is defined

‖A‖F = √(Σ_{i=1}^n Σ_{j=1}^d Ai,j²) = √(Σ_{i=1}^n ‖ai‖²),

where Ai,j is the element in the ith row and jth column of A, and where ai is the ith
row vector of A. The spectral norm is defined for a matrix A ∈ Rn×d as

‖A‖2 = max_{x ∈ Rd, x ≠ 0} ‖Ax‖/‖x‖ = max_{y ∈ Rn, y ≠ 0} ‖yTA‖/‖y‖.
It is useful to think of these x and y vectors as being unit vectors, then the denominators
can be ignored (as they are 1). Then we see that x and y only contain “directional”
information, and the arg max vectors (e.g., the x which maximizes ‖Ax‖/‖x‖) point
in the directions that maximize the norm.
Consider a set of k vectors x1, x2, . . . , xk ∈ Rd and a set of k scalars α1, α2, . . . , αk ∈
R. Then a central property of linear algebra allows us to write a new vector in Rd as
a linear combination

z = Σ_{i=1}^k αi xi.

The span of X = {x1, x2, . . . , xk} is the set of all such linear combinations. If
span(X) = Rd (that is, every vector in Rd can be written as a linear combination of
the vectors in X), then we say X forms a basis.
A set of vectors X = {x1, x2, . . . , xn} is linearly independent if no vector xi ∈ X
can be written as a linear combination

xi = Σ_{j=1, j≠i}^n αj xj

of the other vectors in the set.
[Figure: example points a1 = (1.5, 0), a2 = (2, 0.5), and a3 = (1, 1).]
In general, a linear subspace must include the origin 0 = (0, 0). The line
(or in general, the k-dimensional subspace) which passes through all data points, but
not necessarily the origin, is known as an affine subspace.
3.5 Rank
The rank of a set of vectors X = {x1, . . . , xn} is the size of the largest subset X′ ⊆ X
which is linearly independent. Usually we report rank(A) as the rank of a matrix A.
It is defined as the rank of the rows of the matrix, or the rank of its columns; it turns
out these quantities are always the same.
If A ∈ Rn×d, then rank(A) ≤ min{n, d}. If rank(A) = min{n, d}, then A is said to
be full rank. For instance, if d < n, then using the rows of A = [a1; a2; . . . ; an], we
can describe any vector z ∈ Rd as a linear combination of these rows: z = Σ_{i=1}^n αi ai
for some set {α1, . . . , αn}. In fact, if A is full rank, we can do so and set all but d of
these scalars to 0.
have rank 2. Hence, both are full rank. We can compute this in Python as

np.linalg.matrix_rank(A)
np.linalg.matrix_rank(G)
A matrix A is said to be square if it has the same number of columns as it has rows.
Inverse
A square matrix A ∈ Rn×n is invertible (or non-singular) if there exists a matrix
A−1 ∈ Rn×n, called its inverse, so that

A−1 A = I = AA−1.
A positive definite matrix M ∈ Rn×n is a symmetric matrix with all real and positive
eigenvalues. Another characterization is that for every nonzero vector x ∈ Rn, the value
xTMx is positive. A positive semidefinite matrix M ∈ Rn×n may have some eigenvalues at 0,
but they are otherwise positive, and still all real; equivalently, for any vector x ∈ Rn,
the value xTMx may be zero or positive.
Determinant
The determinant |A| of a square matrix A can be computed by a cofactor expansion
along the first row,

|A| = Σ_{i=1}^n (−1)^{i+1} A1,i · |Ă1,i| = A1,1 · |Ă1,1| − A1,2 · |Ă1,2| + A1,3 · |Ă1,3| − A1,4 · |Ă1,4| + . . . ,

where Ă1,i is the submatrix of A with its first row and ith column removed.
For a 3 × 3 matrix
⎡a b c ⎤
⎢ ⎥
M = ⎢⎢ d e f ⎥⎥
⎢g h i ⎥
⎣ ⎦
the determinant is defined as
|M| = a · |[e f; h i]| − b · |[d f; g i]| + c · |[d e; g h]|.
3.7 Orthogonality
Two vectors x, y ∈ Rd are orthogonal if ⟨x, y⟩ = 0. This means those vectors are at
a right angle to each other.
Example: Orthogonality
Consider two vectors x = (2, −3, 4, −1, 6) and y = (4, 5, 3, −7, −2). They are orthog-
onal since

⟨x, y⟩ = 2·4 + (−3)·5 + 4·3 + (−1)·(−7) + 6·(−2) = 8 − 15 + 12 + 7 − 12 = 0.
A set of columns which are normalized and all orthogonal to each other are said
to be orthonormal. If V ∈ Rn×d has orthonormal columns, then VTV = I (here
I is d × d) but VVT ≠ I.
A square matrix U ∈ Rn×n that has all orthonormal rows and all orthonormal
columns is orthogonal. It follows that

UTU = I = UUT,

since for any normalized vector u we have ⟨u, u⟩ = ‖u‖² = 1, and for any two distinct
orthonormal columns ui ≠ uj we have ⟨ui, uj⟩ = 0. Orthogonal matrices are norm
preserving under multiplication. That means for an orthogonal matrix U ∈ Rn×n and
any vector x ∈ Rn, ‖Ux‖ = ‖x‖.
Moreover, the columns [u1, u2, . . . , un] of an orthogonal matrix U ∈ Rn×n form
a basis for Rn. This means that for any vector x ∈ Rn, there exists a set of scalars
α1, . . . , αn such that x = Σ_{i=1}^n αi ui. More interestingly, since the ui are unit vectors, we
also have ‖x‖² = Σ_{i=1}^n αi².
This can be interpreted as U describing a rotation (with possible mirror flips) to
a new set of coordinates. That is, the old coordinates of x are (x1, x2, . . . , xn), and the
coordinates in the new orthogonal basis [u1, u2, . . . , un] are (α1, α2, . . . , αn).
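This rotation interpretation is easy to verify numerically; in the sketch below (ours), an orthogonal U is generated via a QR decomposition.

import numpy as np

rng = np.random.default_rng(6)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # orthogonal matrix
x = rng.standard_normal(4)
alpha = U.T @ x                  # coordinates of x in the basis of U's columns
print(np.linalg.norm(U @ x), np.linalg.norm(x))  # equal: norm preserving
print(np.sum(alpha**2), np.linalg.norm(x)**2)    # equal: ||x||^2 = sum alpha_i^2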
Exercises
3.2 Consider two vectors u = (0.5, 0.4, 0.4, 0.5, 0.1, 0.4, 0.1) and
v = (−1, −2, 1, −2, 3, 1, −5).
1. Check if u or v is a unit vector.
2. Calculate the dot product u, v.
3. Are u and v orthogonal?
v = (1, 2, 5, 2, −3, 1, 2, 6, 2)
u = (−4, 3, −2, 2, 1, −3, 4, 1, −2)
w = (3, 3, −3, −1, 6, −1, 2, −5, −7)
p = (4, 2, −6, x)
q = (2, −4, 1, −2)
Chapter 4
Distances and Nearest Neighbors
Abstract At the core of most data analysis tasks and their formulations is a distance.
This choice anchors the meaning and the modeling inherent in the patterns found and
the algorithms used. However, there are an enormous number of distances to choose
from. We attempt to survey those most common within data analysis. This chapter
also provides an overview of the most important properties of distances (e.g., is it a
metric?) and how they are related to the dual notion of a similarity. We provide some
common modeling dynamics which motivate some of the distances, and overview
their direct uses in nearest neighbor approaches and how to algorithmically deal with
the challenges that arise.
4.1 Metrics
So what makes a good distance? There are two aspects to answer this question. The
first aspect is whether it captures the “right” properties of the data, but this is a
sometimes ambiguous modeling problem. The second aspect is more well-defined;
that is, does it satisfy the properties of a metric?
A distance d : X × X → R+ is a bivariate operator (it takes in two arguments, say
a ∈ X and b ∈ X) that maps to R+ = [0, ∞). It is a metric if
(M1) d(a, b) ≥ 0 (non-negativity)
(M2) d(a, b) = 0 if and only if a = b (identity)
(M3) d(a, b) = d(b, a) (symmetry)
(M4) d(a, b) ≤ d(a, c) + d(c, b) (triangle inequality)
A distance that satisfies (M1), (M3), and (M4) (but not necessarily (M2)) is called
a pseudometric.
A distance that satisfies (M1), (M2), and (M4) (but not necessarily (M3)) is called
a quasimetric.
In the next few sections we outline a variety of common distances used in data
analysis, and provide examples of their use cases.
4.2.1 Lp Distances

• The most common is the L2 distance, also known as the Euclidean distance,

d2(a, b) = ‖a − b‖ = ‖a − b‖2 = √(Σ_{i=1}^d (ai − bi)²).
• Another common choice is the L1 distance,

d1(a, b) = ‖a − b‖1 = Σ_{i=1}^d |ai − bi|.

This is also known as the Manhattan distance since it is the sum of lengths on
each coordinate axis; the distance you would need to walk in a city like Manhattan
since you must stay on the streets and cannot cut through buildings.
• Another common modeling goal is the L0 distance,

d0(a, b) = ‖a − b‖0 = d − Σ_{i=1}^d 1(ai = bi),

where 1(ai = bi) = 1 if ai = bi and 0 if ai ≠ bi. Unfortunately, d0 is not convex.
When each coordinate ai or bi is either 0 or 1, then this is known as the Hamming
distance.
• Finally, another useful variation is the L∞ distance,

d∞(a, b) = ‖a − b‖∞ = max_{i=1,...,d} |ai − bi|,

the maximum difference along any single coordinate.

[Figure: unit balls for several Lp distances (including L1) centered at example points a1 and a2.]
The L2 ball is the only Lp ball invariant to the choice of axes. This means, for
instance, that if it is rotated it stays the same. This is not true for any other Lp ball.
It is also possible to draw Lp balls for p < 1. However, these balls are not
convex; they “curve in” between the coordinate axes. Algorithmically, this makes
them difficult to work with. It also, in effect, is the reason those Lp distances are not
a metric; they violate the triangle inequality.
All of the above distances are metrics, and in general all L p distances are metrics
for p ∈ [1, ∞).
Some of the metric requirements are easy to show hold for all L p distances. (M1)
and (M2) hold since the distances are, at the core, a sum of non-negative terms, and
are only all 0 if all coordinates are identical. (M3) holds since |ai − bi | = |bi − ai |,
the vector subtraction is symmetric.
Property (M4—triangle inequality) is a bit trickier to show, and the general proof
is beyond the scope of this book. The proof for L2 is more straightforward, and nat-
urally follows from the geometry of a triangle. Consider the line ℓa,b that goes through
two points a, b ∈ Rd. Let πa,b(c) be the orthogonal projection of any point c ∈ Rd
onto this line ℓa,b. Now we can decompose a squared distance d2(a, c)² (and symmet-
rically d2(c, b)²) into d2(a, c)² = d2(a, πa,b(c))² + d2(πa,b(c), c)² by the Pythagorean
theorem. Moreover, the distance d2(a, b) ≤ d2(a, πa,b(c)) + d2(πa,b(c), b), where
there is an equality when πa,b(c) is on the line segment between a and b. Thus we
can conclude

d2(a, b) ≤ d2(a, πa,b(c)) + d2(πa,b(c), b) ≤ d2(a, c) + d2(c, b).

[Figure: points a and b on the line ℓa,b, with the projection πa,b(c) between them.]
These L p distances should not be used to model data when the units and meaning
on each coordinate are not the same. For instance, consider representing two people
p1 and p2 as points in R3 where the x-coordinate represents height in inches, the
y-coordinate represents weight in pounds, and the z-coordinate represents income
in dollars per year. Then most likely this distance is dominated by the z-coordinate
income which might vary on the order of 10,000 while the others vary on the order
of 10.
However, for the same data we could change the units, so the x-coordinate rep-
resents height in meters, the y-coordinate represents weight in centigrams, and the
z-coordinate represents income in dollars per hour. The information may be exactly
the same, only the unit changed. It is now likely dominated by the y-coordinate
representing weight.
These sorts of issues can hold for distances other than Lp as well. Some heuristics
to overcome this are: set hand-tuned scaling of each coordinate, “standardize” the
coordinates so they all have the same min and max value (e.g., all in the range [0, 1]), or
“normalize” the coordinates so they all have the same mean and variance. Applying any
of these mechanically without any inspection of their effects may have unintended
consequences. For instance, the [0, 1] standardization is at the mercy of outliers, and
mean-variance normalization can have strange effects in multi-modal distributions.
Again, these are not solutions, they are modeling choices or heuristics! This modeling
challenge will be highlighted again with data matrices in Section 7.1.
With some additional information about which points are “close” or “far” one
may be able to use the field of distance metric learning to address some of these
problems. A simple solution can then be derived over all Mahalanobis distances
(defined below), using some linear algebra and gradient descent. But without this
information, there is no one right answer. If your axes are the numbers of apples
(x-axis) and number of oranges (y-axis), then it is literally comparing apples to
oranges!
4.2.2 Mahalanobis Distance

An extension to the L2 distance is the Mahalanobis distance, defined for two vectors
a, b ∈ Rd and a d × d matrix M as

dM(a, b) = √((a − b)TM(a − b)).

When M = I, this is exactly the L2 distance; more generally, M skews the space
through the eigenvectors and by the eigenvalues of M (see Chapter 7). As long as
all eigenvalues are positive and real (implying M is positive definite), then dM is a
metric, since then the skew is well-defined and full-dimensional.
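A minimal sketch of this distance (ours; the matrix M below is an arbitrary positive definite example):

import numpy as np

def mahalanobis(a, b, M):
    d = a - b
    return np.sqrt(d @ M @ d)

M = np.array([[2.0, 0.5], [0.5, 1.0]])  # symmetric, positive definite
a, b = np.array([1.0, 2.0]), np.array([3.0, 0.0])
print(mahalanobis(a, b, M))             # with M = I this reduces to d2(a, b)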
4.2.3 Cosine and Angular Distance

The cosine distance measures 1 minus the cosine of the “angle” between vectors
a = (a1, a2, . . . , ad) and b = (b1, b2, . . . , bd) in Rd:

dcos(a, b) = 1 − ⟨a, b⟩/(‖a‖‖b‖) = 1 − (Σ_{i=1}^d ai bi)/(‖a‖‖b‖).

Recall that if θa,b is the angle between vectors a and b, then cos(θa,b) = ⟨a, b⟩/(‖a‖‖b‖).
Hence dcos(a, b) = 1 − cos(θa,b).
Note that dcos(a, b) ∈ [0, 2] and it does not depend on the magnitude ‖a‖ of
the vectors, since this is normalized out. It only cares about their directions. This is
useful when vectors represent data sets of different sizes and we want to compare how
similar those distributions are, but not their size. This makes dcos at best a pseudo-
metric, since two vectors a and a′ = (2a1, 2a2, . . . , 2ad), where a′ = 2a, have
dcos(a, a′) = 0, but they are not equal.
Sometimes dcos is defined only with respect to normalized vectors a, b ∈ Sd−1,
where

Sd−1 = {x ∈ Rd | ‖x‖ = 1}.

In this case, then more simply

dcos(a, b) = 1 − ⟨a, b⟩.

Restricted to vectors in Sd−1, dcos does not have the issue of two vectors a ≠ b
such that dcos(a, b) = 0. However, it is not yet a metric, since it can violate the triangle
inequality.
A simple isometric transformation of the cosine distance (this means the distance-
induced ordering between any pairs of points is the same) is called the angular
distance dang; that is, for a, b, c, d ∈ Sd−1, if dcos(a, b) < dcos(c, d) then dang(a, b) <
dang(c, d). Specifically, we define for any a, b ∈ Sd−1

dang(a, b) = arccos(⟨a, b⟩) = θa,b.

That is, this undoes the cosine-interpretation of the cosine distance, and only mea-
sures the angle. Since the inner product for any a, b ∈ Sd−1 is in the range [−1, 1],
the value of dcos is in the range [0, 2], but the value of dang(a, b) ∈ [0, π]. Moreover,
dang is a metric over Sd−1.
The angular interpretation of the cosine distance dcos(a, b) = 1 − cos(θa,b) and
angular distance dang(a, b) = θa,b is convenient to think of for points on the sphere.
To understand the geometry of the angular distance between two points, it is sufficient
to consider them as lying on the unit sphere S1 ⊂ R2 . Now dang (a, b) is the radians
of the angle θ a,b , or equivalently the arclength traveled between a and b if walking
on the sphere.
Now we can see why dang is a metric if restricted to vectors on Sd−1 . (M1) and
(M3) hold by definition, and (M2) on Sd−1 holds because no two distinct unit vectors
have an angle of 0 radians between them. To show the triangle inequality (M4)
(here we need to think in S2 ⊂ R3 ), observe that since dang measures the shortest
distance restricted to the sphere, there is no way that dang (a, b) can be longer than
dang (a, c) + dang (c, b) since that would imply going through point c ∈ Sd−1 makes
the path from a to b shorter—which is not possible.
4.2.4 KL Divergence
Consider two discrete random variables Xa and Xb over d possible values, represented
by points a, b in the open (d − 1)-dimensional simplex

Δ◦d−1 = {x ∈ Rd | xi > 0 for all i, and Σ_{i=1}^d xi = 1}.

That is, Δ◦d−1 defines the set of d-dimensional discrete probability distributions, where
for a ∈ Δ◦d−1, the coordinate ai is the probability that Xa takes the ith value. The
(d − 1)-dimensional (closed) simplex Δd−1 differs in that it also allows values ai to
be 0, i.e., to have 0 probability to take the ith value. But the KL divergence is only
defined over vectors on Δ◦d−1.
Then we can define the Kullback-Leibler divergence (often written dKL(Xa ‖ Xb))
as

dKL(Xa, Xb) = dKL(a, b) = Σ_{i=1}^d ai ln(ai/bi).
It is the expectation (over the probability distribution a) of the natural log of the
ratios between a and b, and can be interpreted as the “information gain” in using
distribution b in place of a.
Note that dK L is not a metric, violating (M3) since it is not symmetric. It also
violates the triangle inequality (M4).
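A small numeric check (ours) makes the asymmetry concrete:

import numpy as np

def d_kl(a, b):
    return np.sum(a * np.log(a / b))

a = np.array([0.5, 0.3, 0.2])
b = np.array([0.2, 0.5, 0.3])
print(d_kl(a, b), d_kl(b, a))  # two different values, so (M3) fails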
We now introduce some more general notions of distance, in particular ones that
are heavily used to understand the relationship between text data: strings of words
or characters. There are other techniques which draw more heavily on the semantic
and fine-grained structural properties of text. We focus here on the ones which have
simple mathematical connections and as a result are often more scalable and flexible.
4.3 Distances for Sets and Strings

Observe the example with A = {1, 3, 5} and B = {2, 3, 4}. These are represented as
a Venn diagram with a blue region for A and a red one for B. Element 6 is in neither
set. Then the cardinality of A is |A| = 3, and it is the same for B. The intersection
A ∩ B = {3}, since 3 is the only object in both sets, and is visually represented as the
purple region. The union A ∪ B = {1, 2, 3, 4, 5}. The set difference A \ B = {1, 5} and
B \ A = {2, 4}. The complement Ā = {2, 4, 6}, and the complement of the union
A ∪ B is {6}. Finally, the symmetric difference is A △ B = {1, 2, 4, 5}.
Consider two sets A = {0, 1, 2, 5, 6} and B = {0, 2, 3, 5, 7, 9}. The Jaccard distance
between A and B is
dJ(A, B) = 1 − |A ∩ B|/|A ∪ B| = 1 − |{0, 2, 5}|/|{0, 1, 2, 3, 5, 6, 7, 9}| = 1 − 3/8 = 0.625.
Notice that if we add an element 7 to A (call this set A′) that is already in B, then
the numerator increases, but the denominator stays the same. So then dJ(A′, B) =
1 − 4/8 = 0.5, and the distance is smaller—they are closer to each other.
On the other hand, if we add an element 4 to A (call this set A′′) which is in
neither A nor B, then the numerator stays the same, but the denominator increases.
So then dJ(A′′, B) = 1 − 3/9 ≈ 0.666, and the distance is larger—the sets are farther
from each other.
The Jaccard distance is a popular distance between sets since it is a metric, and it
is invariant to the size of the sets. It only depends on the fraction of the items among
both sets which are the same. It also does not require knowledge of some larger
universe Ω of elements that the sets may be from. For instance, as in the example,
we can implicitly require that the sets contain only positive integers, but do not need
to know an upper-bound on the largest positive integer allowed.
To show that dJ is a metric, we need to show that the 4 properties each hold. The first
three are direct. In particular (M1) and (M2) follow from dJ (A, B) ∈ [0, 1] and only
being 0 if A = B. Property (M3) holds by the symmetry of set operations ∩ and ∪.
The triangle inequality (M4) requires a bit more effort to show, namely for any
sets A, B, C that dJ(A, C) + dJ(C, B) ≥ dJ(A, B). We will use the notation that

dJ(A, B) = 1 − |A ∩ B|/|A ∪ B| = |A △ B|/|A ∪ B|.
We first rule out that there are elements c ∈ C which are not in A or not in B.
Removing these elements from C will only decrease the left-hand side of the triangle
inequality while not affecting the right-hand side. So if C violates this inequality,
we can assume there is no such c ∈ C and it will still violate it. So now we assume
C ⊆ A and C ⊆ B.
Now we have
dJ(A, C) + dJ(C, B) = |A \ C|/|A| + |B \ C|/|B|
                    ≥ (|A \ C| + |B \ C|)/|A ∪ B|
                    ≥ |A △ B|/|A ∪ B| = dJ(A, B).
The first inequality follows since |A|, |B| ≤ |A ∪ B|. The second inequality holds
since we assume C ⊆ A and C ⊆ B, hence C ⊆ A ∩ B, and thus |A \ C| + |B \ C| ≥
|A \ B| + |B \ A| = |A △ B|.
4.3.2 Edit Distance

Let Σ be a set, in this case an alphabet of possible characters (e.g., all ASCII
characters, or all lowercase letters so Σ = {a, b, . . . , z}). Then we can say a string a
of length d is an element in Σᵈ; that is, an ordered sequence of characters from the
alphabet Σ. The edit distance considers two strings a, b ∈ Σᵈ, and

ded(a, b) = the minimum number of operations to transform a into b,

where an operation can delete a letter or insert a letter. In fact, the strings are not
required to have the same length, since we can insert items in the shorter one to make
up the difference.
mines
1: minles (insert l)
2: miles (delete n)
3: smiles (insert s)
There are many alternative variations of operations. The insert operation may
cost more than the delete operation. Or we could allow a replace operation at the
same unit cost as either insert or delete; in this case the edit distance of mines and
smiles is only 2.
Edit distance is a metric. (M1) holds since the number of edits is always non-
negative. (M2) is true because no edits are needed only if the strings are the same. (M3) holds
because the operations can be reversed. (M4) follows because if c is an intermediate
“word” then ded(a, c) + ded(c, b) = ded(a, b); otherwise going through c requires more edits.
Is this good for large text documents? Not really. It is slow to compute—basically
requiring quadratic-time dynamic programming in the worst case to find the smallest
set of edits. And removing one sentence can cause a large edit distance without
changing the meaning. But it is good for small strings. Some version is used in most
spelling recommendation systems (e.g., a search engine’s auto-correct). It is a good
guide that usually ded(a, b) > 3 is pretty large since, e.g., ded(cart, score) = 4.
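The standard dynamic program for this insert/delete edit distance is short; the sketch below (ours) matches the definition above, which does not include a replace operation.

def edit_distance(a, b):
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # delete all of a[:i]
    for j in range(m + 1):
        D[0][j] = j  # insert all of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                D[i][j] = D[i - 1][j - 1]          # characters match
            else:
                D[i][j] = 1 + min(D[i - 1][j],     # delete from a
                                  D[i][j - 1])     # insert into a
    return D[n][m]

print(edit_distance("mines", "smiles"))  # 3, as in the example above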
There are many, many choices of distances. Which one to choose (it is definitely
a choice) is based on (a) computational efficiency and (b) modeling effectiveness.
The Euclidean distance d2 and sometimes the Jaccard distance dJ are often chosen
because various algorithmic and computational benefits are available for these—
as we will see, this efficiency comes not just from time to compute the distance
once, but how it can be used within more complex operations. However, each has
other benefits due to modeling factors. Sometimes this modeling is just based on
mathematical properties (is it a metric? are my vectors normalized?), sometimes it is
intuitive, and sometimes it can be empirically validated by measuring performance on
downstream applications. In this section we show how to arrive at various distances
as the logical choice in an example case of modeling text.
As mentioned, edit distance ded is useful for shorter strings. The other variants
will all be more useful when dealing with much larger texts.
In this section we will use as a running example the text from the following 4
short documents. In practice, these approaches are typically applied to much longer
documents (e.g., text on a webpage, a newspaper article, and a person’s bio).
D1 : I am Pam.
D2 : Pam I am.
D3 : I do not like jelly and ham.
D4 : I do not, do not, like them, Pam I am.
The simplest model for converting text into an abstract representation for applying
a distance is the bag-of-words approach. Intuitively, each document creates a “bag”
and throws each word in that “bag” (a multi-set data structure), and maintains only
the count of each word. This transforms each document into a multi-set. However it
is convenient to think of it as a (very sparse, meaning mostly 0s) vector.
That is, consider a vector v ∈ RD for a very large D, where D is the number of
all possible words. Each coordinate of such a vector corresponds to one word and
records the count of that word.
These vector representations naturally suggest that one could use an L p distance,
most commonly d2 , to measure their distance. However, it is more common to use
the cosine distance dcos . This has the advantage of not penalizing certain documents
4.4 Modeling Text with Distances 71
for their length, in principle focusing more on their content. For instance, a document
with a simple phrase would be identical under dcos to another document that repeated
that phrase multiple times. Or two documents about, say, baseball would typically
draw from a similar set of words (e.g., {bat, ball, hit, run, batter, inning})
and likely be close even if their lengths differ.
Example: Bag-of-Words
For the running example, consider D-dimensional space with D = 11; it could be
much higher. For each coordinate, we list the corresponding word as
(am, and, do, ham, I, jelly, like, not, Pam, them, zebra).
v1 = (1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0)
v2 = (1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0)
v3 = (0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0)
v4 = (1, 0, 2, 0, 2, 0, 1, 2, 1, 1, 0).
We can now use the Euclidean distance d2 between these vectors (or any other L p
distance). We notice that d2(v1, v2) = 0, even though the text is different; because
the bag-of-words only measures which words are present, and not how they
are used, it cannot distinguish between these cases.
Also notice that the 11th coordinate for the word zebra is never used, and is
0 in all coordinates of the vectors. If this coordinate was omitted and only a 10-
dimensional representation was used it would not change these distances. On the
other hand, these vectors can be much larger and represent many other words, and
this will not affect the distance.
        (v1, v2)  (v1, v3)  (v1, v4)  (v2, v3)  (v2, v4)  (v3, v4)
d2      0         2.83      3.32      2.83      3.32      3
dcos    0         0.781     0.423     0.781     0.423     0.339
Alternatively, we can use the cosine distance dcos. This models the text differently;
the main difference is that it normalizes the vectors. So, for instance, D3 and D4 are
no longer far apart simply because they contain more words, as is the case
with d2. This metric still treats D1 and D2 as identical.
These distance calculations are simple to perform in Python. With each vector vi
stored as a numpy array, and normalized as vin = vi/LA.norm(vi), the cosine
distances are computed as

import numpy as np
from numpy import linalg as LA
# v1n, v2n, v3n, v4n are the normalized versions of v1, ..., v4 above
print([1 - (v1n @ v2n), 1 - (v1n @ v3n), 1 - (v1n @ v4n),
       1 - (v2n @ v3n), 1 - (v2n @ v4n), 1 - (v3n @ v4n)])
Smoothing each count vector with a small additive regularizer (here 0.01) and
normalizing so the coordinates sum to 1 turns the counts into probability distributions,
suitable for the KL divergence:

v̄1 = (0.325, 0.003, 0.003, 0.003, 0.325, 0.003, 0.003, 0.003, 0.325, 0.003, 0.003)
v̄2 = (0.325, 0.003, 0.003, 0.003, 0.325, 0.003, 0.003, 0.003, 0.325, 0.003, 0.003)
v̄3 = (0.001, 0.142, 0.142, 0.142, 0.142, 0.142, 0.142, 0.142, 0.001, 0.001, 0.001)
v̄4 = (0.100, 0.001, 0.199, 0.001, 0.199, 0.001, 0.100, 0.199, 0.100, 0.100, 0.001).
import scipy as sp
from scipy import stats
reg = 0.01
v1r = v1 + reg
v2r = v2 + reg
v3r = v3 + reg
v4r = v4 + reg
# then, e.g., stats.entropy(v1r, v3r) computes dKL between the smoothed
# vectors (scipy normalizes each argument to sum to 1 internally)
4.4.2 k-Grams
Using the running example, we show the 2-word grams of each of the documents.
Each gram is shown as a set of two words in square brackets.
There are many variants of how to construct k-grams for words. The most prominent
of which is to use k consecutive characters instead of k consecutive words; we call
these character k-grams.
Many other modeling decisions go into constructing a k-gram. Should punctuation
be included? Should whitespace be used as a character, or should sentence breaks be
used as a word in word grams? Should differently capitalized characters or words
represent distinct objects? And most notoriously, how large should k be?
Through all of these variants, a few common rules apply. First, the more expressive
the k-grams (e.g., keeping track of punctuation, capitalization, and whitespace), the
larger the quantity of data that is required to get meaningful results—otherwise most
documents will have a Jaccard distance 1 or very close to it unless large blocks of text
are verbatim repeated (for instance, plagiarism). But for long and structured articles
(e.g., newspaper articles with a rigid style guide), some of these more expressive
choices can be effective.
Second, for longer articles it is better to use words and larger values of k, while
for shorter articles (like tweets) it is better to use smaller k and possibly character
k-grams. The values used for k are often perhaps-surprisingly short, such as k = 3
or k = 4 for both characters and words.
Finally, there are a few structured tricks which come up. It can be useful to
particularly keep track of or emphasize starts of sentences, capitalized words, or
word k-grams which start with “stop words” (that is, very common words like {a,
for, the, to, and, that, it, . . . } that often signal starts of expressions).
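A small sketch (ours) of word and character k-grams as sets, combined with the Jaccard distance; the tokenization choices (lowercasing, stripping punctuation) are exactly the kind of modeling decisions discussed above.

def word_kgrams(text, k):
    words = text.lower().replace(",", "").replace(".", "").split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def char_kgrams(text, k):
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard_distance(A, B):
    return 1 - len(A & B) / len(A | B)

D3 = "I do not like jelly and ham."
D4 = "I do not, do not, like them, Pam I am."
print(jaccard_distance(word_kgrams(D3, 2), word_kgrams(D4, 2)))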
Continuous Bag-of-Words
These word vector embeddings have led to dramatic improvements for various natu-
ral language processing tasks, such as translation between languages and comparing
meaning at sentence-level structures. As a result, they are now commonly incor-
porated into state-of-the-art text models. However, they have also been shown to
implicitly encode bias into the embeddings, coming from the text corpuses on which
they were built. For instance in many such embeddings, the word man is significantly
closer to the word engineer than is the word woman. Such biases, even if uninten-
tional, may affect automated hiring decisions based on gender. As a data analyst,
should you use these models to help make predictions if they are known to include
biases, even if they actually lead to better prediction and generalization?
4.5 Similarities
A large family of set similarities, parameterized by x, y, z, and z′, can be written as

s_{x,y,z,z′}(A, B) = (x|A ∩ B| + y|(A ∪ B)ᶜ| + z|A △ B|) / (x|A ∩ B| + y|(A ∪ B)ᶜ| + z′|A △ B|),

where (A ∪ B)ᶜ = Ω \ (A ∪ B) denotes the complement of the union. For instance,
sJ = s_{1,0,0,1}. Note that this family includes a complement operation of A ∪ B
and thus seems to require knowing the size of the entire domain Ω from which
A and B are subsets. However, in the case of the Jaccard similarity and others where
y = 0, this term is not required, and thus the domain Ω is not required to be defined.
Other common set similarities in this family include the following.
• Hamming: sHam(A, B) = s_{1,1,0,1}(A, B) = (|A ∩ B| + |(A ∪ B)ᶜ|)/(|A ∩ B| + |(A ∪ B)ᶜ| + |A △ B|)
• Andberg: sAndb(A, B) = s_{1,0,0,2}(A, B) = |A ∩ B|/(|A ∩ B| + 2|A △ B|)
• Rogers-Tanimoto: sRT(A, B) = s_{1,1,0,2}(A, B) = (|A ∩ B| + |(A ∪ B)ᶜ|)/(|A ∩ B| + |(A ∪ B)ᶜ| + 2|A △ B|)
• Sørensen-Dice: sDice(A, B) = s_{2,0,0,1}(A, B) = 2|A ∩ B|/(2|A ∩ B| + |A △ B|)
For sJ, sHam, sAndb, and sRT, then d(A, B) = 1 − s(A, B) is a metric. In particular,
dHam(A, B) = |Ω|(1 − sHam(A, B)) is known as the Hamming distance; it is typically
applied between bit vectors, where it counts the number of bits on which the
vectors differ. Indeed, if we represent each object i ∈ Ω by a coordinate bi in an
|Ω|-dimensional vector b = (b1, b2, . . . , b_|Ω|), then these notions are equivalent.
For example, for the sets A = {0, 1, 2, 5, 6} and B = {0, 2, 3, 5, 7, 9} from before
(so |A ∩ B| = 3, |A| = 5, and |B| = 6), the Sørensen-Dice similarity is
sDice(A, B) = 2(3)/(5 + 6) ≈ 0.545.
The most basic normed similarity is the Euclidean dot product. For two vectors
p, q ∈ Rd, we can define the dot product similarity as

sdot(p, q) = ⟨p, q⟩;

that is, just the dot product. And indeed this converts to the Euclidean distance as

d2(p, q) = ‖p − q‖2 = √(sdot(p, p) + sdot(q, q) − 2 sdot(p, q))
         = √(⟨p, p⟩ + ⟨q, q⟩ − 2⟨p, q⟩) = √(‖p‖2² + ‖q‖2² − 2⟨p, q⟩).
However, in this case the similarity could be arbitrarily large, and indeed the
similarity of a vector p with itself, sdot(p, p) = ⟨p, p⟩ = ‖p‖², is the squared Euclidean
norm. To enforce that the similarity is at most 1, we can normalize the vectors first,
and we obtain the cosine similarity as

scos(p, q) = ⟨p/‖p‖, q/‖q‖⟩ = ⟨p, q⟩/(‖p‖‖q‖).
Converting to a distance via the normed transformation,

√(scos(p, p) + scos(q, q) − 2 scos(p, q)) = d2(p/‖p‖, q/‖q‖),

is still the Euclidean distance, but now between the normalized vectors p/‖p‖ and q/‖q‖.
However, for this similarity, it is more common to instead use the set transformation
to obtain the cosine distance:

dcos(p, q) = 1 − scos(p, q) = 1 − ⟨p, q⟩/(‖p‖‖q‖).
For points, a kernel K induces the kernel distance dK(p, q) = √(K(p, p) + K(q, q) − 2K(p, q)).
For Gaussian kernels, and a larger class called characteristic kernels (a substantial
subset of positive definite kernels), this distance is a metric. A positive definite kernel
K is one where for any point set P = {p1, p2, . . . , pn}, the n × n “gram” matrix G
with elements Gi,j = K(pi, pj) is positive definite.
One can also extend these normed similarities to be between sets of vectors. Now let
P = {p1, p2, . . . , pn} ⊂ Rd and Q = {q1, q2, . . . , qm} ⊂ Rd be two sets of vectors.
We use a kernel K to define a similarity κ between sets as a double sum over their
kernel interactions,

κ(P, Q) = Σ_{p∈P} Σ_{q∈Q} K(p, q).
This can again induce a generalization of the kernel distance between point sets as

dK(P, Q) = √(κ(P, P) + κ(Q, Q) − 2κ(P, Q)).
A kernel density estimate is a powerful modeling tool that allows one to take a finite
set of points P ⊂ Rd and a smoothing operator, a kernel K, and transform them into
a continuous function kde P : Rd → R.
The most common kernel is the (normalized) Gaussian kernel
K(p, x) = (1/(σᵈ√((2π)ᵈ))) exp(−‖x − p‖²/(2σ²)),
which is precisely the evaluation of the Gaussian distribution Gd(x) with mean
p and standard deviation σ evaluated at x. The coefficient ensures the integral
∫_{x∈Rd} K(p, x) dx = 1 for all choices of p. Thus it can be interpreted as taking the
“mass” of a point p and spreading it out according to this probability distribution.
Now, a kernel density estimate is defined at a point x ∈ Rd as

kdeP(x) = (1/|P|) Σ_{p∈P} K(p, x) = (1/|P|) κ(P, {x}).
That is, it gives each point p ∈ P a mass of 1/|P| (each weighted uniformly), then
spreads each of these |P| masses out using the Gaussian kernels, and sums them
up. While P is discrete in nature, the kernel density estimate kde P allows one to
interpolate between these points in a smooth natural way.
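For concreteness, here is a minimal sketch (the point set P and bandwidth are hypothetical) that evaluates a Gaussian kernel density estimate at a query point:

import numpy as np

def kde(P, x, sigma=1.0):
    # Gaussian KDE: average of normalized Gaussian kernels centered at each p in P
    d = P.shape[1]
    coef = 1.0 / (sigma**d * (2*np.pi)**(d/2))
    sq_dists = np.sum((P - x)**2, axis=1)
    return np.mean(coef * np.exp(-sq_dists / (2*sigma**2)))

P = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
print(kde(P, np.array([1.0, 0.5]), sigma=0.7))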
These similarities between sets of vectors, and related structures, play a large role in non-linear data analysis. There are two ways to conceptualize this. First, each use of the Euclidean dot product (which will be quite ubiquitous in this book) is simply replaced with a kernel, which is not a linear operator on R^d; this is often called the “kernel trick”. Second, one can think of the functions K(p, ·) or more generally κ(P, ·) as an element of a function space, which acts like an infinite-dimensional vector space. For positive definite kernels, this space is called a reproducing kernel Hilbert space (RKHS) where κ(P, Q) serves as the inner product, and thus κ(P, P) is the norm. Although one cannot represent this data exactly as vectors in this space, the structure and properties of this RKHS provide the mathematical validation for replacing the Euclidean inner product ⟨p, x⟩ with K(p, x).
This chapter so far has surveyed numerous different distances and similarities, which
are themselves a small subset of all distances and similarities one may consider
when modeling data and an analysis task. While metric properties and other nice
mathematical properties are useful, another key concern is the computational cost
associated with using the distances and similarities. This is not just a matter of a single evaluation, but of comparing objects across a very large set (say of n = 100 million objects). How can one quickly determine which ones are close to each other? Or, given a query object q, which ones in the large set are close to this query?
In 1-dimensional Euclidean data, these problems are relatively simple to address.
Start by sorting all objects, and storing them in sorted order. Then the nearby objects
are the adjacent ones in the sorted order. And if this ordering induces a balanced
binary tree, then on a query, nearby objects can be found in logarithmic running time
(i.e., time proportional to log n).
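A minimal sketch of this 1-dimensional approach, using Python's built-in bisect module for the logarithmic-time search (the data values are hypothetical):

import bisect

data = sorted([7.2, 0.3, 2.8, 9.5, 1.1, 4.6, 7.0])   # store in sorted order
q = 5.0
i = bisect.bisect_left(data, q)                      # binary search: O(log n)
# the nearest neighbor must be one of the values adjacent to this position
candidates = data[max(0, i-1):i+1]
print(min(candidates, key=lambda v: abs(v - q)))     # prints 4.6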
Next we introduce locality sensitive hashing to address these questions for higher-dimensional vectorized objects and for metrics on set-based representations. The main idea is, for a family of objects B and a similarity s : B × B → [0, 1], to define a family of random hash functions H with roughly the following property.
For a similarity function s, a locality sensitive hash family H is a set of hash functions so that for any two objects p, q ∈ B,

Pr_{h ∈ H}[h(p) = h(q)] ≈ s(p, q).

That is, for any p, q ∈ B, a randomly chosen hash function h ∈ H will cause those two objects to collide with probability roughly the same as their similarity.
[Figure: the triangle similarity s△(q, ·) plotted on the real line around q, with points q and p marked and s△(q, p) = 0.32.]
A simple hash family for this is a set of randomly shifted grids, where items in the same grid cell hash together. Define H = {h_η | η ∈ [0, 1)} where h_η(x) = ⌈x + η⌉. That is, each h_η maps the input to an integer (representing the index of a hash bucket), where a grid defines consecutive sets (intervals) of numbers of length 1 which are all mapped to the same hash bucket. The parameter η randomly shifts these grid boundaries. So whether or not a pair of points are in the same grid cell depends on the random choice of η, and more similar points are more likely to be in the same grid cell, and hence the same hash bucket.
[Figure: five randomly shifted unit grids, for shifts η_1, η_2, η_3, η_4, η_5, drawn over the interval [−1, 4] with points q and p marked.]
The example above shows 5 hash functions hη1 , hη2 , hη3 , hη4 , hη5 ∈ H . In this
example p and q are in the same hash bucket for 2 of the 5 hash functions (for hη1
and hη4 ), so we would estimate the similarity between p and q as 2/5 = 0.4.
Indeed, we can verify that for any p, q ∈ R,

Pr_{h_η ∈ H}[h_η(p) = h_η(q)] = s△(p, q) = max{0, 1 − |p − q|}.

For any p, q with |p − q| > 1, both the probability and the similarity are 0; and if p = q, then both the probability and similarity are 1. Otherwise, the probability that p and q are not hashed together is the probability a randomly shifted grid boundary falls between them, which is precisely |p − q| given that |p − q| ≤ 1. Hence the probability and similarity in this case are both 1 − |p − q|, as desired.
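This can also be confirmed empirically; a minimal simulation, assuming the grid hash h_η(x) = ⌈x + η⌉ as above:

import math, random

def h(x, eta):
    # index of the unit grid cell containing x, with the grid shifted by eta
    return math.ceil(x + eta)

p, q = 2.4, 3.0            # |p - q| = 0.6, so s△(p, q) = 0.4
trials = 100000
collisions = 0
for _ in range(trials):
    eta = random.random()  # eta ~ Unif[0, 1)
    collisions += (h(p, eta) == h(q, eta))
print(collisions / trials) # concentrates around 0.4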
A common and important data structure is a hash table, which relies on a different type of hash function; to distinguish it from an LSH, we call it a separating hash function. These hash functions h : B → U map an object b ∈ B into a fixed universe U; typically U can be represented as a set of integers {0, 1, 2, . . . , u − 1}, representing indexes of an array of size u. Hash tables are again defined with respect to a family H, and we consider a random choice h ∈ H. Given this random choice, a perfect separating hash function guarantees that for any two distinct b, b′ ∈ B,

Pr_{h ∈ H}[h(b) = h(b′)] = 1/u.
It is important to distinguish these two data structures and types of hash functions.
Separating hash functions are powerful and useful, but are not locality sensitive hash
functions.
More generally, locality sensitive hash families are defined with respect to distance d.
A hash family H is (γ, φ, α, β)-sensitive with respect to d when it has the following
properties:
• Pr_{h ∈ H}[h(p) = h(q)] > α if d(p, q) < γ
• Pr_{h ∈ H}[h(p) = h(q)] < β if d(p, q) > φ
For this to make sense we need α > 0 and β < 1 for γ ≤ φ. Ideally we want α to
be large, β to be small, and φ − γ to be small. Then we can repeat this with more
random hash functions to amplify the effect using concentration of measure bounds;
this can be used to make α larger and β smaller while fixing γ and φ. Ultimately, if
α becomes close to 1, then we almost always hash items closer than γ in the same
bucket, and as β becomes close to 0, then we almost never hash items further than φ
in the same bucket. Thus to define similar items (within a φ − γ tolerance) we simply
need to look at which ones are hashed together.
Revisit the triangle similarity s△ and its associated grid-based hash family H. Recall for any objects p, q we have Pr_{h ∈ H}[h(p) = h(q)] = s△(p, q). Then for some distance threshold τ = d△(p, q) = 1 − s△(p, q) we can set τ = γ = φ and have α = 1 − τ and β = τ.
[Figure: the collision probability Pr_{h ∈ H}[h(p) = h(q)] plotted as a function of the distance d△(p, q) (left) and of the similarity s△(q, p) (right); at the threshold γ = φ = τ the collision probability is 1 − τ, giving α = 1 − τ and β = τ.]
This same setting (and picture) can be used for any threshold τ ∈ (0, 1) where the
similarity, hash family pair (s, H) satisfies Prh ∈H [h(p) = h(q)] = s(p, q).
This LSH machinery is useful to answer three types of tasks given a large data set P (which could have size |P| in the hundreds of millions):
• For a threshold τ, determine all pairs p, p′ ∈ P so that d(p, p′) ≤ τ (or symmetrically, so that s(p, p′) ≥ τ′, e.g., with τ′ = 1 − τ).
• For a threshold τ and a query q, return all p ∈ P so that d(p, q) ≤ τ.
• For a query q, return a point p̃ ∈ P so that it is approximately argmin_{p ∈ P} d(p, q).
The first two tasks are important in deduplication and plagiarism detection. For
instance, this task occurs when a search engine wants to avoid returning two webpages
which are very similar in content, or an instructor wants to quickly check if two
students have turned in very similar assignments to each other, or very similar to one
from an online solution repository.
In each case, we desire these tasks to take time roughly proportional to the number of items within the distance threshold τ. In very high-dimensional Euclidean space, or in the space of sets, a brute force check of all distances would require time proportional to |P|^2 for the first task, and proportional to |P| for the second task. This can be untenable when, for instance, |P| is in the hundreds of millions.
The third task, as we will see in the classification section, is an essential tool
for data analysis. If we know how the objects p ∈ P in the database P behave,
then we can guess that q will behave similarly to the closest known example p∗ =
argmin p ∈P d(p, q). We can solve for p∗ by invoking the second task multiple times
while doing a binary-like search on τ; if the threshold is too large, we short cut the
operation of the second task as soon as we find more than a constant number of
τ-close items, and compare those directly.
The first task can also essentially be reduced to the second one, by just running the second task with each p ∈ P as a query. So our description below will focus on the second task.
In all of these tasks, we will be able to solve them by allowing some approximation.
We may not return the closest item, but one that is not too much further than the
closest item. We may return some items slightly beyond our distance threshold and
may miss some which are barely closer than it. Given that the choice of distance or
similarity used is a modeling choice (why not use a different one?), we should not
take its exact value too seriously.
Example: Banding
both p4 and q have h2,1 = 1 and h2,2 = 2. Note that although p2 collides with q on
3 individual hash functions (h1,1 = 3, h2,1 = 1, and h3,2 = 2) it never has an entire
banded-hash function collide, so it is not marked as close.
Analysis of Banding
We will analyze the simple case where there is a hash family H so that Pr_{h ∈ H}[h(p) = h(q)] = s(p, q). Let s = s(p, q), and we will analyze the case where we use r bands with b hash functions each. A band marks p and q as colliding only if all b of its hash functions collide, which happens with probability s^b; so a single band fails to fully collide with probability 1 − s^b, and all r bands fail with probability (1 − s^b)^r. So this function f_{b,r}(s) = 1 − (1 − s^b)^r describes the probability that two objects of similarity s are marked as similar in the banding approach. A similar, but slightly messier, analysis can be applied for any (γ, φ, α, β)-sensitive hash family.
We plot f_{b,r} as an S-curve where the x-axis is s = s(p, q) and the y-axis represents the probability that the pair p, q is detected as close. In this example we have t = r · b = 15 with r = 5 and b = 3.

s          | 0.1   | 0.2  | 0.3  | 0.4  | 0.5  | 0.6  | 0.7  | 0.8  | 0.9
f_{3,5}(s) | 0.005 | 0.04 | 0.13 | 0.28 | 0.48 | 0.70 | 0.88 | 0.97 | 0.998

[Figure: the S-curve f_{3,5}(s) plotted over s ∈ [0, 1].]
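This S-curve is easy to reproduce; a minimal Python sketch of f_{b,r}, matching the table above:

def f(s, b=3, r=5):
    # probability a pair with similarity s is marked close by r bands of b hashes
    return 1 - (1 - s**b)**r

for s in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    print(s, round(f(s), 3))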
Next we change the r and b parameters and observe what happens to the plot of the S-curve f_{b,r}(s). We show combinations of values r ∈ {4, 8, 16, 32} and b ∈ {2, 4, 6, 8}. The value b is fixed in each row, and increases as the rows go down. The value r is fixed in each column, and increases as the columns go from left to right.
[Figure: a 4 × 4 grid of S-curves f_{b,r}(s) over s ∈ [0, 1], for r ∈ {4, 8, 16, 32} (columns) and b ∈ {2, 4, 6, 8} (rows).]
Choice of r and b
Locality sensitive hash families can be developed for more than just the simplistic
1-dimensional triangle similarity. Here we describe such a family for the angular
distance dang and similarity sang . Recall these functions operate on vectors a, b ∈
Sd−1 , which are unit vectors in Rd .
A hash h_u from the hash family H_ang is defined as

H_ang = { h_u(·) = sign(⟨·, u⟩) | u ∼ Unif(S^{d−1}) }.

Let us unpack this. First choose a uniform random unit vector u ∈ S^{d−1}, that is, so any unit vector in R^d is equally likely to be chosen (u ∼ Unif(S^{d−1})). Then h_u(a) = sign(⟨a, u⟩); that is, if the dot product of a and u is positive, it returns +1, and if it is negative, it returns −1.
We will show that Pr_{h_u ∈ H_ang}[h_u(a) = h_u(b)] = s_ang(a, b)/π for any a, b ∈ S^{d−1}. Thus, after scaling by π, the amplification analysis for this similarity and hash family pair is exactly the same as with the simple triangle similarity example.
Here we see why Pr_{h_u ∈ H_ang}[h_u(a) = h_u(b)] = s_ang(a, b)/π.
First consider data points a, b ∈ S^1, the circle. Recall they induce a distance d_ang(a, b) ∈ [0, π] which is the minimum arclength between them. Each point a corresponds to a set F_{a,⊥} ⊂ S^1 of all points u ∈ S^1 so that ⟨a, u⟩ ≥ 0; it corresponds to a halfspace with boundary through the origin, and with normal a, when embedded in R^2. A vector u results in h_u(a) ≠ h_u(b) only if it is in exactly one of F_{a,⊥} or F_{b,⊥}. The arclength of the xor of these regions is 2 · d_ang(a, b); it includes the example vector u but not u′.
[Figure: points a and b on the circle S^1 with angle θ_{a,b} and arclength d_ang(a, b), the halfspaces F_{a,⊥} and F_{b,⊥}, and example vectors u (in exactly one halfspace) and u′ (in both).]
A random unit vector u ∈ S^1 can be generated by drawing a uniform value ξ ∈ [0, 2π] and walking along the circle clockwise from (1, 0) an arclength of ξ. Hence, the fraction of such u for which h_u(a) ≠ h_u(b) is (2 d_ang(a, b))/(2π). And since s_ang(a, b)/π = 1 − d_ang(a, b)/π, then as desired, the probability h_u(a) = h_u(b) is s_ang(a, b)/π.
The more general case for points a, b ∈ S^{d−1} reduces to this analysis on S^1. The great circle which passes through a, b ∈ S^{d−1} is an instance of S^1. The hash signs of h_u(a) and h_u(b) are still induced by halfspaces F_{a,⊥} and F_{b,⊥} in R^d mapped to halfspaces in R^2. And while there are now several ways to generate the random unit vectors in S^{d−1}, a sufficient way is to first generate the “coordinate” ξ ∈ [0, 2π] as it pertains to the great circle through a, b on S^1, and then generate the remaining “coordinates” which do not affect the inclusion in F_{a,⊥} and F_{b,⊥}.
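A minimal simulation of this hash family (the function name is hypothetical); it draws directions as Gaussian vectors, which suffices since sign(⟨x, u⟩) is unchanged by positive scaling of u:

import numpy as np

def angular_collision_rate(a, b, trials=100000):
    # estimate Pr[h_u(a) = h_u(b)] for h_u(x) = sign(<x, u>)
    U = np.random.normal(size=(trials, len(a)))   # rows: unnormalized directions
    return np.mean(np.sign(U @ a) == np.sign(U @ b))

a = np.array([1.0, 0.0, 0.0])
b = np.array([np.cos(1.0), np.sin(1.0), 0.0])     # d_ang(a, b) = 1 radian
print(angular_collision_rate(a, b))               # about 1 - 1/pi ≈ 0.682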
If we have two uniform random numbers u_1, u_2 ∼ Unif[0, 1], then we can generate two independent 1-dimensional Gaussian random variables as (using the Box-Muller transform)

y_1 = √(−2 ln(u_1)) cos(2π u_2)
y_2 = √(−2 ln(u_1)) sin(2π u_2).
A unit Gaussian (mean 0, standard deviation 1) has the (amazing!) property that all coordinates (in any orthogonal basis) are independent of each other. Thus to generate a point x ∈ R^d from a d-dimensional Gaussian, for each coordinate i we assign it the value of an independent 1-dimensional Gaussian random variable.
Moreover, this amazing property for a d-dimensional Gaussian random variable x ∼ G_d implies that if it is projected to any 1-dimensional subspace, the result is a 1-dimensional Gaussian distribution. That is, for any unit vector v and x ∼ G_d, then ⟨x, v⟩ ∼ G_1. This means it does not favor any directions (as the d-variate uniform distribution (Unif[−1, 1])^d favors the “corners”), and hence if we normalize u ← x/‖x‖_2 for x ∼ G_d, then u ∼ Unif(S^{d−1}), as desired.
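Putting these pieces together, a minimal sketch that generates u ∼ Unif(S^{d−1}) using only uniform random numbers (via the Box-Muller transform), as asked for in Exercise 4.9:

import numpy as np

def random_unit_vector(d):
    m = (d + 1) // 2
    u1 = 1.0 - np.random.uniform(size=m)     # in (0, 1], so log(u1) is finite
    u2 = np.random.uniform(size=m)
    y1 = np.sqrt(-2*np.log(u1)) * np.cos(2*np.pi*u2)   # Box-Muller pair
    y2 = np.sqrt(-2*np.log(u1)) * np.sin(2*np.pi*u2)
    x = np.concatenate([y1, y2])[:d]          # x ~ G_d, built coordinate-wise
    return x / np.linalg.norm(x)              # normalize onto S^{d-1}

u = random_unit_vector(10)
print(u, np.linalg.norm(u))                   # the norm is 1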
The LSH hash family for Euclidean distance d2 is a combination of the ideas for
angular distance and for the triangle similarity. However, the probability does not
work out quite as nicely so we require the more general notion of (γ, φ, α, β)-
sensitivity.
The hash family H_{L2,τ} requires a desired similarity parameter τ, which was implicit in the scaling of the triangle similarity, but not present in the angular similarity. The set is defined as

H_{L2,τ} = { h_{u,η}(·) = ⌈(⟨·, u⟩ + η)/τ⌉ | u ∼ Unif(S^{d−1}), η ∼ Unif[0, τ] }.

That is, it relies on a random unit vector u and a random offset η. The hash operation h_{u,η}(x) first projects x onto direction u, then offsets by η in this direction, and rounds up to the integer index of a grid cell of width τ. For a large enough integer t (e.g., if there are at most n data points, then using t = n^3), it is also common to instead map to a finite domain 0, 1, 2, . . . , t − 1 as h_{u,η}(x) mod t.
The hash family H_{L2,τ} is (τ/2, 2τ, 1/2, 1/3)-sensitive with respect to d_2.
[Figure: vectors a and b, their difference a − b, and the angle θ_{u,a−b} between a random direction u and a − b.]
To see that we do not observe too many spurious collisions, we can use an argument similar to that for angular distance. If ‖a − b‖ > 2τ = φ and they collide, then the vector u must be sufficiently orthogonal to (a − b) so that the projected length is less than half the norm; otherwise, if |⟨a, u⟩ − ⟨b, u⟩| ≥ τ, they cannot be in the same bin. Formally

|⟨a, u⟩ − ⟨b, u⟩| = |⟨a − b, u⟩| = ‖a − b‖ ‖u‖ |cos θ_{u,a−b}|,

where θ_{u,a−b} is the angle (in radians) between u and a − b. We have |cos(θ_{u,a−b})| ≤ 1/2 only if θ_{u,a−b} ≥ π/3. Fixing the 2-dimensional subspace which includes 0, u, and a − b, then we can see that the probability that θ_{u,a−b} ≥ π/3 is at most 1/3 = β (that is, there are (2π/3) options for u out of 2π total).
There exists a beautiful extension to create similar LSH hash families H L p ,τ for any
L p distance with p ∈ (0, 2]. The main difference is to replace the u ∼ Unif(Sd−1 )
which is really some x ∼ Gd with another d-dimensional random variable from what
is known as a p-stable distribution.
A distribution μ over R is p-stable (for p ≥ 0) if the following holds. Consider d + 1 random variables X_0, X_1, . . . , X_d ∼ μ. Then for any d real values {v_1, . . . , v_d}, the random variable Σ_i v_i X_i has the same distribution as (Σ_i |v_i|^p)^{1/p} X_0. Such p-stable distributions exist for p ∈ (0, 2]. Special cases are
• The Gaussian distribution G(x) = (1/√(2π)) exp(−x^2/2) is 2-stable.
• The Cauchy distribution C(x) = (1/π) · 1/(1 + x^2) is 1-stable.
Intuitively, this allows us to replace a sum of random variables with a single random variable by adjusting coefficients carefully. But actually, it is the composition of the coefficients which is interesting. In particular, for p = 2 and a vector v = (v_1, v_2, . . . , v_d) where we want to estimate ‖v‖_2, we can use (1/g_0) Σ_i g_i v_i = (1/g_0)⟨g, v⟩, where g = (g_1, . . . , g_d) and g_0, g_1, . . . , g_d ∼ G. Note that by dividing by g_0, we are approximately correctly normalizing the d-dimensional Gaussian random variable. This turns out to be sufficient for Euclidean LSH and other similar methods for high-dimensional data we will see later.
Using Cauchy random variables c_0, c_1, . . . , c_d ∼ C in place of the Gaussian ones allows us to estimate ‖v‖_1 = Σ_i |v_i| as (1/c_0)⟨c, v⟩ with c = (c_1, . . . , c_d).
The final example of LSH we will provide is for the Jaccard distance, which is defined on sets. This hash family H_J again has the nice property that Pr_{h_σ ∈ H_J}[h_σ(A) = h_σ(B)] = s_J(A, B), so the previous amplification analysis applies directly. Consider sets defined over a domain Ω of size n, and define Π_n as the set of all permutations from Ω to distinct integers in [n] = {1, 2, 3, . . . , n}. So for each element ω ∈ Ω, each σ ∈ Π_n maps σ(ω) = i for some i ∈ [n], and for ω ≠ ω′ then σ(ω) ≠ σ(ω′). In practice we can relax this restriction, as we discuss below. Then we define

H_J = { h_σ(A) = min_{ω ∈ A} σ(ω) | σ ∼ Unif(Π_n) }.
That is, the hash function hσ applies σ to each ω in A, and then returns the smallest
value out of all of them. Due to the return of the smallest value, this type of approach
is often called a min hash.
Observe that the Jaccard similarity is s_J(A, B) = 2/8. On the hash functions there is a collision for h_{σ_1}(A) = h_{σ_1}(B), but not for h_{σ_2}, so they would estimate the similarity as 1/2.
It is useful to understand why Pr_{h_σ ∈ H_J}[h_σ(A) = h_σ(B)] = s_J(A, B). We think of three types of elements from Ω: the objects Ω_{A∩B} which are in both A and B; the objects Ω_{A△B} which are in either A or B, but not both; and the objects Ω_{¬A,¬B} which are in neither A nor B. Note that the Jaccard similarity is precisely s_J(A, B) = |Ω_{A∩B}| / (|Ω_{A∩B}| + |Ω_{A△B}|). Recall, the similarity has no dependence on the set Ω_{¬A,¬B}, hence we actually do not need to know all of Ω. But we can also see that the probability of a hash collision is precisely |Ω_{A∩B}| / (|Ω_{A∩B}| + |Ω_{A△B}|). Any value returned by h_σ can only come from an element of Ω_{A∩B} ∪ Ω_{A△B}, these are all equally likely to achieve the minimum, and it is only a collision if the minimum is from Ω_{A∩B}.
In practice, σ need not be a truly random permutation; instead one can use a separating hash function, which may induce a few collisions among elements of A, but it will be small enough to have insignificant effect on the analysis. Moreover, most programming languages include built-in such hash functions that can operate on strings or large integers or double precision values (e.g., using the SHA-1 hash with a salt). Let f be a built-in hash function that operates on strings Σ^k, and maps to some range [N]. Let ⊕ be a concatenation function so for an input string ω ∈ Σ^k and another string S, then ω ⊕ S is their concatenation, also a string. Then we can define a hash function f_S as

f_S(ω) = f(ω ⊕ S).
Now we can define our hash family for Jaccard similarity using min hashing, defined on a set of objects A, using a random salt S, as

h_S(A) = min_{ω ∈ A} f_S(ω).
This formulation leads to a simple algorithm, where given a set A, we can scan
over the set and maintain the value of the element that is the smallest. When using
several hash functions h_1, . . . , h_k ∈ H_J, this leads to a minhash signature, a vector v(A) ∈ [N]^k; that is, the jth coordinate is v(A)_j = h_j(A). Given just a single built-in separating hash function f, and a set of k salts S_1, S_2, . . . , S_k, we can still make a single scanning pass over A, and maintain all k coordinates.
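A minimal sketch of this single-pass signature computation, using Python's hashlib as the built-in hash f and string salts (the helper names f_salt and minhash_signature are hypothetical):

import hashlib

def f_salt(omega, salt):
    # separating hash: built-in SHA-1 on omega concatenated with a salt
    return int(hashlib.sha1((omega + salt).encode()).hexdigest(), 16)

def minhash_signature(A, salts):
    # one scanning pass over A, maintaining the minimum value per salt
    sig = [float('inf')] * len(salts)
    for omega in A:
        for j, S in enumerate(salts):
            sig[j] = min(sig[j], f_salt(omega, S))
    return sig

A = {"the", "quick", "brown", "fox"}
B = {"the", "quick", "red", "fox"}        # s_J(A, B) = 3/5
salts = [str(i) for i in range(200)]
sA, sB = minhash_signature(A, salts), minhash_signature(B, salts)
print(sum(a == b for a, b in zip(sA, sB)) / len(salts))   # near 0.6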
If the set A is the k-gram representation of a text document, then this process
can be even more streamlined. Rather than first creating set A, and then using this
approach to calculate the minhash signature, one can scan the document as the
first for loop above. Each k-gram ω generated on the fly can be hashed by the k
hash functions, and compared against the maintained minhash signature. Duplicate
k-grams will not affect the minhash value (since they are the same or greater, and never
less than the existing value).
Exercises
4.1 Consider two vectors a = (1, 2, −4, 3, −6) and b = (1, 2, 5, −2, 0) ∈ R5 .
1. Calculate the d1 (a, b), d2 (a, b), d0 (a, b), and d∞ (a, b), and sort the answers.
2. Without calculating d3 (a, b), explain where it will fall in the sorted order.
3. Does it make sense to normalize these vectors so they lie in Δ^4 (i.e., by dividing by ‖a‖_1 and ‖b‖_1, respectively), and use the Kullback-Leibler divergence d_KL on them? Why or why not?
4. Normalize the data sets to live on S4 , and compute the cosine distance between
them.
4.3 Find a set of 3 points in R2 so that d1/2 (the L p distance with p = 1/2) does not
satisfy the triangle inequality. Show that d1/2 still satisfies the other properties of a
metric.
4.4 Find a set of 3 points in S1 so that dcos does not satisfy the triangle inequality.
4.7 Find 4 news articles or websites on the internet (each at least 500 words) so that
the k-grams of their text have significantly different pairwise Jaccard distances. Two
should be very similar (almost duplicates), one similar to the duplicates, and one
fairly different.
1. Report the texts used, the type of k-gram used, and the resulting Jaccard distances.
2. Build Min Hash signatures of length t (using t hash functions) to estimate the
Jaccard distance between these sets of k-grams. Try various sizes t (between 10
and 500), and report a value sufficient to reliably empirically distinguish the three
classes of distances (almost-duplicate, similar, and different).
4.8 Consider computing an LSH using t = 216 hash functions. We want to find all
object pairs which have Jaccard similarity above threshold τ = .75.
1. Estimate the best values of hash functions b within each of r bands to provide the S-curve

f(s) = 1 − (1 − s^b)^r

with good separation at τ. Report these values b, r.
2. Consider the 4 objects A, B, C, D, with the following pairwise similarities:
A B C D
A 1 0.75 0.25 0.35
B 0.75 1 0.1 0.45
C 0.25 0.1 1 0.92
D 0.35 0.45 0.92 1
Use your choice of r and b and f (·) designed to find pairs of objects with similarity
greater than τ. What is the probability, for each pair of the four objects, of being
estimated as similar (i.e., similarity greater than τ = 0.75)? Report 6 numbers.
4.9 Explore random unit vectors: their generation and their average similarity.
1. Provide the code to generate a single random unit vector in d = 10 dimensions
using only the operation u ← Unif(0, 1) which generates a uniform random
variable between 0 and 1. (This u ← Unif(0, 1) operation will be called multiple
times.) Other operations (linear algebraic, trigonometric, etc) are allowed, but no
other sources of randomness.
2. Generate t = 150 unit vectors in R^d for d = 10. Plot the cdf of their pairwise dot products (yes, calculate all (t choose 2) dot products).
3. Repeat the above experiment to generate a cdf of pairwise dot products in d = 100.
4. Now repeat the above step with d = 100 using Angular Similarity instead of just
the dot product.
Abstract This chapter introduces the model of linear regression and some simple
algorithms to solve for it. In the simplest form it builds a linear model to predict one
variable from one other variable or from a set of other variables. We will demonstrate
how this simple technique can extend to building potentially much more complex
polynomial models. To harness this model complexity, we introduce the central and
extremely powerful idea of cross-validation; this method fundamentally changes the
statistical goal of validating a model, to characterizing how it interacts with the
data. Next we introduce basic techniques for regularized regression including ridge
regression and lasso, which can improve the performance, for instance as measured
under cross-validation, by introducing bias into the derived model. Some of these
approaches, like lasso, also bias toward sparse solutions which can be essential for
high-dimensional and over-parameterized data sets. We provide simple algorithms
toward these goals using matching pursuit, and explain how these can be used toward
compressed sensing.
We will begin with the simplest form of linear regression. The input is a set of n
2-dimensional data points (X, y) = {(x1, y1 ), (x2, y2 ), . . . , (xn, yn )}. The ultimate
goal will be to predict the y values using only the x values. In this case x is the
explanatory variable and y is the dependent variable.
The notation (X, y) with an uppercase X and lowercase y will become clear
later. This problem commonly generalizes where the x-part (explanatory vari-
ables) becomes multidimensional, but the y-part (the dependent variable) remains
1-dimensional.
In order to do this, we will “fit” a line ℓ through the data of the form

y = ℓ(x) = ax + b,

where a (the slope) and b (the intercept) are parameters of this line. The line ℓ is our “model” for this input data.
Consider the following data set that describes a set of heights and weights.
height (in) weight (lbs)
66 160
68 170
60 110
70 178
65 155
61 120
74 223
73 215
75 235
67 164
69 ?
Note that in the last entry, we have a height of 69, but we do not have a weight. If we
were to guess the weight in the last row, how should we do this?
We can draw a line (the red one) through the data points. Then we can guess the
weight for a data point with height 69, by the value of the line at height 69 inches:
about 182 pounds.
Measuring Error
The purpose of this line is not just to be close to all of the data (for this we will
have to wait for PCA and dimensionality reduction). Rather, its goal is prediction;
specifically, using the explanatory variable x to predict the dependent variable y.
In particular, for every value x ∈ R, we can predict a value ŷ = ℓ(x). Then on our data set, we can examine for each x_i how close ŷ_i is to y_i. This difference is called a residual:

r_i = y_i − ŷ_i = y_i − ℓ(x_i).

Note that this residual is not the orthogonal distance from y_i to the line ℓ, but the (signed) vertical distance from y_i to the corresponding point on the line ℓ with the same x value. Again, this is because our only goal is prediction of y. And this will be important as it allows techniques to be immune to the choice of units (e.g., inches or feet, pounds or kilograms).
So the residual measures the error of a single data point, but how should we measure the overall error of the entire data set? The common approach is the sum of squared errors:

SSE((X, y), ℓ) = Σ_{i=1}^n r_i^2 = Σ_{i=1}^n (y_i − ŷ_i)^2 = Σ_{i=1}^n (y_i − ℓ(x_i))^2.
Solving for ℓ
To solve for the line ℓ which minimizes SSE((X, y), ℓ) there is a very simple solution, in two steps. Calculate averages x̄ = (1/n) Σ_{i=1}^n x_i and ȳ = (1/n) Σ_{i=1}^n y_i, and create centered n-dimensional vectors X̄ = (x_1 − x̄, x_2 − x̄, . . . , x_n − x̄) for all x-coordinates and Ȳ = (y_1 − ȳ, y_2 − ȳ, . . . , y_n − ȳ) for all y-coordinates.
1. Set a = ⟨Ȳ, X̄⟩ / ‖X̄‖_2^2.
2. Set b = ȳ − a x̄.
This defines ℓ(x) = ax + b.
We will provide the proof for why this is the optimal solution for the high-
dimensional case (in short, it can be shown by expanding out the SSE expression,
taking the derivative, and solving for 0). We will only provide some intuition here.
First let us examine the intercept

b = (1/n) Σ_{i=1}^n (y_i − a x_i) = ȳ − a x̄.
This setting of b ensures that the line y = ℓ(x) = ax + b goes through the point (x̄, ȳ) at the center of the data set, since ȳ = ℓ(x̄) = a x̄ + b.
Second, to understand how the slope a is chosen, it is illustrative to reexamine the dot product as

a = ⟨Ȳ, X̄⟩ / ‖X̄‖^2 = (‖Ȳ‖ ‖X̄‖ cos θ) / ‖X̄‖^2 = (‖Ȳ‖ / ‖X̄‖) cos θ,

where θ is the angle between the n-dimensional vectors Ȳ and X̄. Now in this expression, the ‖Ȳ‖/‖X̄‖ captures how much on (root-squared) average Ȳ increases as X̄ does (the rise-over-run interpretation of slope). However, we may want this to be negative if there is a negative correlation between X̄ and Ȳ, or really this does not matter much if there is no correlation. So the cos θ term captures the correlation after normalizing the units of X̄ and Ȳ.
This analysis is simple to carry out in Python. First we can plot the data, and its mean.

import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline   # use this line in notebooks

x = np.array([66, 68, 60, 70, 65, 61, 74, 73, 75, 67])
y = np.array([160, 170, 110, 178, 155, 120, 223, 215, 235, 164])

plt.scatter(x, y, c='blue')                       # the data points
plt.scatter(np.mean(x), np.mean(y), c='red')      # the mean point (x-bar, y-bar)
plt.show()
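Continuing this snippet, a minimal sketch of the two-step solution above:

xbar, ybar = np.mean(x), np.mean(y)
Xc, Yc = x - xbar, y - ybar               # the centered vectors
a = np.dot(Yc, Xc) / np.dot(Xc, Xc)       # slope: <Ybar, Xbar> / ||Xbar||^2
b = ybar - a*xbar                         # intercept
print(a, b)
print(a*69 + b)                           # the predicted weight at height 69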
Magically, using linear algebra, everything extends gracefully to using more than one explanatory variable. Consider a data set (X, y) = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)} where each data point has x_i = (x_{i,1}, x_{i,2}, . . . , x_{i,d}) ∈ R^d and y_i ∈ R. That is, there are d explanatory variables, as the coordinates of x_i, and one dependent variable in y_i. We would now like to use all of these variables at once to make a single (linear) prediction about the variable y_i. That is, we would like to create a model

ŷ_i = M_α(x_i) = M_α(x_{i,1}, x_{i,2}, . . . , x_{i,d}) = α_0 + Σ_{j=1}^d α_j x_{i,j}.
In the above equivalent notations, α_0 serves the purpose of the intercept b, and all of the α_j's replace the single coefficient a in the simple linear regression. Indeed, we can write this model as a dot product between the (d + 1)-dimensional vectors α = (α_0, α_1, . . . , α_d) and (1, x_{i,1}, x_{i,2}, . . . , x_{i,d}) = (1, x_i). As promised, the magic of linear algebra has allowed us to describe a more complex linear model M_α. Next we will see how to solve it.
Given a data point xi = (xi,1, xi,2, . . . , xi,d ), we can again evaluate our prediction
ŷi = M(xi ) using the residual value ri = yi − ŷi = yi − M(xi ). And to evaluate a set
of n data points, it is standard to consider the sum of squared errors as

SSE(X, y, M) = Σ_{i=1}^n r_i^2 = Σ_{i=1}^n (y_i − M(x_i))^2.
To obtain the coefficients which minimize this error, we can now do so with very
simple linear algebra.
First we construct an n × (d + 1) data matrix X̃ = [1, X_1, X_2, . . . , X_d], where the first column 1 is the n-dimensional all-ones column vector [1; 1; . . . ; 1]. Each of the next d columns is a column vector X_j, where X_{i,j} = x_{i,j} is the ith entry of X_j and represents the jth coordinate of data point x_i. Then we let y be an n-dimensional column vector containing all the dependent variables. Now we can simply calculate the (d + 1)-dimensional column vector α = (α_0, α_1, . . . , α_d) as
α = (X̃^T X̃)^{−1} X̃^T y.
Let us compare to the simple case where we have 1 explanatory variable. The (X̃^T X̃)^{−1} term replaces the 1/‖X̄‖_2^2 term. The X̃^T y replaces the dot product ⟨Ȳ, X̄⟩. And we do not need to separately solve for the intercept b, since we have created a new column in X̃ of all 1s. The α_0 replaces the intercept b; it is multiplied by a value 1 in X̃, equivalent to b just standing by itself.
Often the matrices X and X̃ are used interchangeably, and hence we drop the
˜ from X̃ in most situations. We can either simply treat all data points xi as one-
dimension larger (with always a 1 in the first coordinate), or we can fit a model on
the original matrix X and ignore the offset parameter α0 , which is then by default 0.
The former approach, where each xi is just assumed one dimension larger, is more
common since it automatically handles the offset parameter.
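As a minimal sketch with hypothetical data, the whole pipeline is a few lines of numpy; in practice np.linalg.lstsq is preferred over forming the inverse explicitly:

import numpy as np

X = np.array([[3.0, 5.0], [1.0, 2.0], [4.0, 1.0], [2.0, 6.0]])  # hypothetical data
y = np.array([12.0, 5.0, 7.0, 13.0])
Xt = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend the all-ones column
alpha = np.linalg.inv(Xt.T @ Xt) @ Xt.T @ y      # the normal equations
alpha2 = np.linalg.lstsq(Xt, y, rcond=None)[0]   # numerically stabler route
print(alpha, alpha2)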
To build a model, we recast the data as an 11 × 4 matrix X = [1, X1, X2, X3 ]. We let
y be the 11-dimensional column vector.
X = [ 1  232   33  402 ]      y = [ 2201 ]
    [ 1   10   22  160 ]          [    0 ]
    [ 1 6437  343  231 ]          [ 7650 ]
    [ 1  512  101   17 ]          [ 5599 ]
    [ 1  441  212   55 ]          [ 8900 ]
    [ 1  453   53   99 ]          [ 1742 ]
    [ 1    2    2   10 ]          [    0 ]
    [ 1  332   79  154 ]          [ 1215 ]
    [ 1  182   20   89 ]          [  699 ]
    [ 1  123  223   12 ]          [ 2101 ]
    [ 1  424   32   15 ]          [ 8789 ]
That is, we seek the parameters α so that y ≈ Xα.
Fixing the data X and y, and representing M by its parameters α, we can consider a function S(α) = SSE(X, y, M). Then we observe that S(α) is a quadratic function in α, and it is convex (see Section 6.1), so its minimum is when the gradient is 0. This will be true when each partial derivative is 0. Let X_{i,j} be the jth coordinate of x_i, so the residual r_i = (y_i − ⟨α, x_i⟩) has partial derivative dr_i/dα_j = −X_{i,j}. Then set each partial derivative to 0:

0 = dS(α)/dα_j = 2 Σ_{i=1}^n r_i (dr_i/dα_j) = 2 Σ_{i=1}^n r_i (−X_{i,j}) = 2 Σ_{i=1}^n (y_i − ⟨α, x_i⟩)(−X_{i,j})
Multiplying both sides by the inverse of the gram matrix X^T X reveals the desired α = (X^T X)^{−1} X^T y.
Geometry: To see why these are called the normal equations, consider the form

0 = (y − Xα)^T X,

so that for any vector v ∈ R^{d+1}, also 0 = (y − Xα)^T Xv. This includes when v = α; under this setting Xv = Xα = ŷ. Notice that r = y − Xα is the vector in R^n that stores all residuals (so y = ŷ + r). Then the normal equations imply that

0 = (y − Xα)^T Xα = ⟨r, ŷ⟩;

that is, for the optimal α, the prediction ŷ and the residual vector r are orthogonal. Since ŷ = Xα is restricted to the (d + 1)-dimensional span of the columns of X, and α minimizes ‖r‖_2, this orthogonality implies that r is the normal vector to this (d + 1)-dimensional subspace.
Sometimes linear relations are not sufficient to capture the true pattern in the data, even with a single explanatory variable x. Instead we would like to build a model of the form

ŷ = M_2(x) = α_0 + α_1 x + α_2 x^2,

or more generally, for some polynomial of degree p,

ŷ = M_p(x) = α_0 + α_1 x + α_2 x^2 + . . . + α_p x^p = α_0 + Σ_{i=1}^p α_i x^i.
We found more height and weight data, in addition to the ones in the height-weight
example above.
height (in) weight (lbs)
61.5 125
73.5 208
62.5 138
63 145
64 152
71 180
69 172
72.5 199
72 194
67.5 172
Again we can measure the error for a single data point (x_i, y_i) as a residual r_i = ŷ_i − y_i = M_α(x_i) − y_i, and the error on n data points as the sum of squared residuals

SSE(P, M_α) = Σ_{i=1}^n r_i^2 = Σ_{i=1}^n (M_α(x_i) − y_i)^2.

Under this error measure, it turns out we can again find a simple solution for the coefficients α = (α_0, α_1, . . . , α_p). For each explanatory variable data value x we create a (p + 1)-dimensional vector
v = (1, x, x^2, . . . , x^p).

And then for n data points (x_1, y_1), . . . , (x_n, y_n) we can create an n × (p + 1) data matrix

X̃_p = [ 1  x_1  x_1^2  . . .  x_1^p ]      y = [ y_1 ]
      [ 1  x_2  x_2^2  . . .  x_2^p ]          [ y_2 ]
      [ .   .     .    . . .    .   ]          [  .  ]
      [ 1  x_n  x_n^2  . . .  x_n^p ]          [ y_n ]
Then we can solve the same way as if each data value raised to a different power was a different explanatory variable. That is, we can solve for the coefficients α = (α_0, α_1, α_2, . . . , α_p) as

α = (X̃_p^T X̃_p)^{−1} X̃_p^T y.
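A minimal sketch of this construction on the expanded height and weight data above, using np.vander to build X̃_p (the degree p = 2 here is just an illustrative choice):

import numpy as np

height = np.array([61.5, 73.5, 62.5, 63, 64, 71, 69, 72.5, 72, 67.5])
weight = np.array([125, 208, 138, 145, 152, 180, 172, 199, 194, 172])
p = 2
Xp = np.vander(height, p + 1, increasing=True)     # columns: 1, x, x^2
alpha = np.linalg.lstsq(Xp, weight, rcond=None)[0]
print(alpha)                                       # (alpha_0, alpha_1, alpha_2)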
5.4 Cross-Validation
So it appears, as we increase p larger and larger, the data is fit better and better. The
only downside appears to be the number of columns needed in the matrix X̃p , right?
Unfortunately, that is not right. Then how (and why?) do we choose the correct value
of p, the degree of the polynomial fit?
A (very basic) statistical (hypothesis testing) approach may be to choose a model
of the data (the best fit curve for some polynomial degree p, and assume Gaussian
noise), then calculate the probability that the data fell outside the error bounds of
that model. But maybe many different polynomials are a good fit?
x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
y | 4 | 6 | 8.2 | 9 | 9.5 | 11 | 11.5 | 12 | 11.2
With the following polynomial fits for p = {1, 2, 3, 4, 5, 8}. Believe your eyes, for
p = 8, the curve actually passes through each and every point exactly.
Recall, our goal was for a new data point with only an x value to predict its y
value. Which do you think does the best job?
• Why randomly? Because you do not want to bias the model to do better on some
parts than others in how you choose the split. Also, since we assume the data
elements are iid from some underlying distribution, then the test data is also iid
from that distribution if chosen randomly.
• How large should the test data be? It depends on the data set. Both 10% and 33% are common. Intuitively, the goal is to estimate the SSE which, after dividing by the test set's size, is an average squared error. That is, the goal is to estimate an average of an unknown distribution, and this maps to the central limit theorem and to concentration of measure bounds. We could use these bounds, via the number of test samples and properties of the observed distributions, to estimate the accuracy of the generalization prediction.
Let (X, y) be the full data set (with n rows of data), and we split it into data sets (X_train, y_train) and (X_test, y_test) with n_train and n_test rows, respectively, with n = n_train + n_test. Next we build a model with the training data, e.g.,

α = (X_train^T X_train)^{−1} X_train^T y_train.

Then we evaluate the model M_α on the test data X_test, often using SSE(X_test, y_test, M_α) as

SSE(X_test, y_test, M_α) = Σ_{(x_i, y_i) ∈ (X_test, y_test)} (y_i − M_α(x_i))^2 = Σ_{(x_i, y_i) ∈ (X_test, y_test)} (y_i − ⟨(1; x_i), α⟩)^2.
Now split our data sets into a train set and a test set:

train: x | 2 | 3 | 4 | 6 | 7 | 8        test: x | 1 | 5 | 9
       y | 6 | 8.2 | 9 | 11 | 11.5 | 12       y | 4 | 9.5 | 11.2

With the following polynomial fits for p = {1, 2, 3, 4, 5, 8}, each generating a model M_{α_p} on the training data, we then calculate the SSE(X_test, y_test, M_{α_p}) score for each (as shown):
And the polynomial model with degree p = 2 has the lowest SSE score of 0.916.
It is also the simplest model that does a very good job by the “eye-ball” test. So we
would choose this as our model.
Using Python, it is easy to generate the model M on the training data, in this case with polynomial degree 3:

import numpy as np
import numpy.linalg as LA
import matplotlib.pyplot as plt

xT = np.array([2, 3, 4, 6, 7, 8])
yT = np.array([6, 8.2, 9, 11, 11.5, 12])
xE = np.array([1, 5, 9])
yE = np.array([4, 9.5, 11.2])

p = 3
coefs = np.polyfit(xT, yT, p)
M = np.poly1d(coefs)

We can then calculate the residual and error on the test data

resid = M(xE) - yE
print(resid)
SSE = LA.norm(resid)**2
print(SSE)

Finally, we can plot the result on the data

plt.axis([0, 10, 0, 15])
s = np.linspace(0, 10, 101)
plt.plot(s, M(s), 'r-', linewidth=2.0)
plt.show()
Leave-one-out Cross-Validation
But, not training on the test data means that you use less data, and your model is
worse! If your data is limited, you may not want to “waste” this data!
If your data is very large, then leaving out 10% is not a big deal. But if you only
have 9 data points it can be. The smallest the test set could be is 1 point. But then it
is not a very good representation of the full data set.
The alternative, called leave-one-out cross-validation, is to create n different training sets, each of size n − 1 (X_{1,train}, X_{2,train}, . . . , X_{n,train}), where X_{i,train} contains all points except for x_i, which is a one-point test set. Then we build n different models M_1, M_2, . . . , M_n, evaluate each model M_i on the one test point x_i to get an error E_i = (y_i − M_i(x_i))^2, and average their errors E = (1/n) Σ_{i=1}^n E_i. Again, the parameter with the smallest associated average error E is deemed the best. This allows you to build a model on as much data as possible, while still using all of the data to test.
However, this requires roughly n times as long to compute as the other techniques,
so is often too slow for really big data sets.
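A minimal sketch of leave-one-out cross-validation, here selecting the polynomial degree p on the small data set from above:

import numpy as np

def loocv_error(x, y, p):
    # average squared error over n models, each trained with one point left out
    n = len(x)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        M = np.poly1d(np.polyfit(x[mask], y[mask], p))
        errs.append((y[i] - M(x[i]))**2)
    return np.mean(errs)

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([4, 6, 8.2, 9, 9.5, 11, 11.5, 12, 11.2])
for p in [1, 2, 3, 4, 5]:
    print(p, loocv_error(x, y, p))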
The statistics literature has developed numerous tests to evaluate the effectiveness
of linear models for regression. For instance, an F-test evaluates if the residuals are
normally distributed—as one would expect if minimizing SSE.
A popular statistic is the coefficient of determination R^2. It captures the proportion of variance in the data explained by the model. Again let ȳ = (1/n) Σ_{i=1}^n y_i be the average dependent variable; then the population variance in y is defined as

SS_tot = Σ_{i=1}^n (y_i − ȳ)^2.
However, these statistics have drawbacks compared to cross-validation. They do not directly predict how well the model will work on new data, and they do not seamlessly allow one to compare across any set of models (e.g., compare to a model M which might be something other than a linear regression model).
We can replace the L_2 measure in the SSE with other cost functions, like the mean absolute error

MAE(M_α, (X, y)) = Σ_{i=1}^n |y_i − M_α(x_i)| = Σ_{i=1}^n |r_i|.
This is less sensitive to outliers and corresponds with a Laplace noise model. And
while algorithms exist for minimizing this, they are more complicated, and far less
common than SSE.
Gauss-Markov Theorem
Consider the simple example with one explanatory variable. In this case, the α_1 parameter corresponds to the slope of the line. Models with very large slope (like α^(1)) are inherently unstable: a small change in the x value will lead to large changes in the y value (e.g., if ŷ = α^(1) x, then ŷ_δ = α^(1)(x + δ) = ŷ + α^(1) δ). The figure shows two models, α^(1) with large slope and α^(2) with small slope, both with similar sums of squared residuals. However, changing x_3 to x_3 + δ results in a much larger change in the predicted value using model α^(1).
This picture only shows the effect of α with one explanatory variable, but this is
most useful with many explanatory variables. In this higher-dimensional setting, it
is more common to extrapolate: make predictions on new x which are outside the
region occupied by the data X. In higher dimensions, it is hard to cover the entire
domain with observations. And in these settings where predictions are made past the
extent of data, it is even more dangerous to have parameters α j which seem to fit the
data locally, but change rapidly beyond the extent of the data.
Hence, smaller α are simpler models that are more cautious in extrapolated
predictions, and so the predictions ŷ are less sensitive to changes in x.
Moreover, it turns out that solving for the solution α_s^◦ which minimizes W_◦(X, y, α, s) is as simple as for the ordinary least squares where s = 0. In particular,

α_s^◦ = argmin_α W_◦(X, y, α, s) = (X^T X + (s/n) I)^{−1} X^T y,
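A minimal sketch of this closed form (the data is hypothetical, and the all-ones offset column is assumed to already be part of X):

import numpy as np

def ridge(X, y, s):
    # closed-form ridge regression solution, matching the formula above
    n, d = X.shape
    return np.linalg.inv(X.T @ X + (s/n)*np.eye(d)) @ X.T @ y

X = np.hstack([np.ones((4, 1)), np.array([[3.0], [1.0], [4.0], [2.0]])])
y = np.array([12.0, 5.0, 14.0, 9.0])
print(ridge(X, y, 0.0))    # s = 0 recovers ordinary least squares
print(ridge(X, y, 0.5))    # s > 0 shrinks the coefficients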
Improved Generalization
Indeed, it can be shown that appropriately setting some s > 0 can “get around”
the Gauss-Markov Theorem. Formally, consider data drawn iid from a distribution
{(60, 110), (61, 120), (65, 180), (66, 160), (68, 170), (70, 225), (73, 215), (75, 235)}
On the held out test set, the model without regularization α(0) achieved a sum of
squared errors of 524.05, and the model with regularization α(0.01) achieved a sum
of squared errors of 361.30.
5.5.2 Lasso
This has the same general form as ridge regression, but it makes the subtle change in the norm of the regularizer from ‖·‖_2^2 to ‖·‖_1.
This alternative form of regularization has all of the basic robustness and stronger
generalization properties associated with biasing toward a smaller normed choice
of α in ridge regression. However, it also has two additional effects. First, the
function with L1 norm is less simple to optimize. Although there are algorithmic
and combinatorial approaches, there is not a simple closed form using the matrix
inverse available anymore. We will discuss a couple of ways to optimize this object
soon, and then in the more general context of the following chapter. Second, this
formulation has the additional benefit that it also biases toward sparser solutions. That is, for large enough values of s, the optimal choice α_s will have multiple coordinates α_{s,j} = 0. And, in general, as s increases, the number of indexes j with α_{s,j} = 0 will increase.
Sparser models α, those with several coordinates α j = 0, are useful for various
reasons. If the number of non-zero coordinates is small, it is simpler to understand
and more efficient to use. That is, to make a prediction from a new data point x, only
the coordinates which correspond to non-zero α j values need to be considered. And
while this does not definitively say which coordinates are actually meaningful toward
understanding the dependent variable, it provides a small and viable set. The process
of selecting such a subset is known as variable selection. It is known that under
some often reasonable assumptions, if the data is indeed drawn from a sparse model
with noise, then using lasso to recover αs for the proper s can recover the true model
exactly. This striking result is relatively recent, and already has many important
implications in statistical inference, signal processing, and machine learning.
There is an equivalent formulation to both ridge regression and lasso that will make it easier to understand their optimization properties, and to describe a common approach used to solve for their solution. That is, instead of solving the modified objective functions W_◦ and W, we can solve the original least squares objective ‖Xα − y‖_2^2, but instead provide a hard constraint on the allowable values of α. Specifically, we can reformulate ridge regression for some parameter t > 0 as

α_t^◦ = argmin_{α : ‖α‖_2^2 ≤ t} ‖Xα − y‖_2^2, and lasso as α_t = argmin_{α : ‖α‖_1 ≤ t} ‖Xα − y‖_2^2.
For ridge regression, for each parameter s > 0 and solution α_s^◦ there is a corresponding parameter t > 0 so that α_t^◦ = α_s^◦. And respectively, for lasso, for each s there is a t so that α_s = α_t. To see this, for a fixed s, find the solution α_s^◦ (e.g., using the closed-form solution), then set t = ‖α_s^◦‖_2^2. Now using this value of t, the solution α_t^◦ will match that of α_s^◦ since it satisfies the hard norm constraint, and if there was another solution α′ satisfying that norm constraint with a smaller ‖Xα′ − y‖ value, this would be smaller in both terms of W_◦ than α_s^◦, a contradiction to the optimality of α_s^◦.
Hence, we can focus on solving either the soft norm formulations W◦ and W or
the hard norm formulations, for ridge regression or lasso. Ultimately, we will search
over the choices of parameters s or t using cross-validation, so we do not need to
know the correspondence ahead of time.
Consider the hard constraint variants of lasso and ridge regression with some parameter t. We will visualize the geometry of this problem when there are d = 2 explanatory variables. The value of t constrains the allowable values of α to be within metric balls around the origin. These are shown below as an L_1-ball for lasso (in green) and an L_2-ball for ridge regression (in blue). Note that the constraint for ridge is actually ‖α‖_2^2 ≤ t, where the norm is squared (this is the most common formulation, which we study), but we can always just consider the square root of that value t as the one we picture; the geometry is unchanged.
Then we can imagine the (unconstrained) ordinary least squares solution α* as outside of both of these balls. The cost function ‖Xα − y‖_2^2 is quadratic, so it must vary as a parabola as a function of α, with minimum at α*. This is shown with red shading. Note that the values of α which have a fixed value of ‖Xα − y‖_2^2 form concentric ellipses centered at α*.
The key difference between ridge regression and lasso is now apparent in their
solutions. Because the L1 -ball is “pointier,” and specifically along the coordinate
axes, it reaches out along these coordinate axes. Thus the innermost ellipse around
α∗ which the L1 -ball reaches tends to be along these coordinate axes—whereas there
is no such property with the L2 -ball. These coordinate axes correspond with values
of α which have 0 coordinates, thus the lasso solution with L1 constraint explicitly
biases toward sparse solutions.
The same phenomenon holds in higher dimensions (but which are harder to draw).
However, in this setting, the L1 -ball is even pointier. The quadratic cost function now
may not always first intersect the L1 -ball along a single coordinate axes (which has
only a single non-zero coordinate), but it will typically intersect the L1 -ball along
multidimensional ridges between coordinate axes, with only those coordinates as
non-zero.
As t becomes smaller, it is more and more likely that the solution is found on a high-degree corner or ridge (with fewer non-zero coordinates), since it is harder for the convex ellipses to sneak past the pointy corners. As t becomes larger, the solution approaches the OLS solution α*, and unless α* itself has coordinates very close to 0, it will tend to reach the L_1 ball away from those corners.
We next describe a simple procedure called matching pursuit (MP) that provides a
simple greedy alternative toward solving regression problems. This is particularly
helpful for the lasso variants for which there is no closed-form solution. In particular,
when MP is run with the lasso objective, then it is sometimes called basis pursuit
or forward subset selection because it iteratively reveals a meaningful set of features
or a basis within the data matrix X which captures the trends with respect to the
dependent variable.
Lasso and Matching Pursuit can be used for feature selection, that is choosing a
limited number of dimensions (the features) with which predictions from regression
can generalize nearly as well, or in some cases better, than if all of the features
are used. However, these approaches can be unstable when two features (or sets of
features) are correlated. That is, there could be multiple subsets of features of the
same size which provide nearly as good generalization.
Now consider you are working in a large business that pays for various sorts of
indicators or features for customers or products it is trying to model. You notice that
by small changes in the regularization parameter, two different subsets of features
are selected, with both providing approximately the same generalization estimates
when cross-validating. Changing which subsets you choose to purchase and use will
dramatically affect the business of one of the companies trying to sell these features.
Do you have an obligation to keep the financial benefits of this company in mind as
you select features?
Alternatively, while the two subsets of features provide similar generalization
predictions when averaged across the entire data set, on many individual data sets, it
changes the predictions drastically. For instance, this may update the prediction for
a product line to go from profitable to unprofitable. How should you investigate this
prognostication before acting on it?
In each round, matching pursuit selects the coordinate whose column is most aligned with the current residual r,

j* = argmax_j |⟨r, X_j⟩|,

where X_j is the n-dimensional jth column of the data matrix X. It is then possible to solve for α_{j*} as

α_{j*} = argmin_γ ‖r − X_{j*} γ‖^2 + s|γ| = (1/‖X_{j*}‖_2^2) (⟨r, X_{j*}⟩ ± s/2).

The choice of ± (either addition or subtraction of the s/2 term) needs to be checked in the full expression.
This algorithm is greedy, and may not result in the true optimal solution: it may
initially choose a coordinate which is not in the true optimum, and it may assign it
a value α j which is not the true optimum value. However, there are situations when
the noise is small enough that this approach will still find the true optimum. When
using the W_◦ objective for ridge regression, then at each step when solving for α_j, we can solve for all coordinates selected so far; this is more robust to local minima and is often called orthogonal matching pursuit.
For either objective, this can be run until a fixed number of k coordinates have been chosen (as in Algorithm 5.5.1), or until the residual's norm ‖r‖_2^2 is below some threshold. For the feature selection goal, these coordinates are given an ordering, where the earlier-chosen ones are deemed more relevant for the modeling problem.
Consider the data set (X, y) where X has n = 5 data points and d = 7 dimensions.
X = [ 1   8   −3   5   4  −9   4 ]      y = [  43.22 ]
    [ 1  −2    4   8  −2  −3   2 ]          [  46.11 ]
    [ 1   9    6  −7   4  −5  −5 ]          [ −24.63 ]
    [ 1   6  −14  −5  −3   9  −2 ]          [ −42.61 ]
    [ 1  −2   11  −6   3  −5   1 ]          [ −19.76 ]
This was generated by applying a model α^T = [0, 0, 0, 5, 0, −2, 0] onto X and adding a small amount of noise to obtain y. Running ordinary least squares would fail since the system has more unknowns (d = 7) than equations (n = 5). However, we could run ridge regression; setting s = 0.5 fits a dense model

α_{0.5}^◦ = [−0.30, 0.26, −0.17, 4.94, −0.22, −1.87, 0.31].

Running MP with the W objective, again using regularization parameter s = 0.5, does recover a sparse model. The first step identifies index j = 4 as having its column X_4^T = [5, 8, −7, −5, −6] most correlated with r = y. Solving for α_4^* = 5.47, which is larger than the optimal 5.0 but reasonably close, we then update r ← r − X_4 α_4^*. This suboptimal choice of α_4^* is still enough to reduce the norm of r from ‖r‖ = 82.50 to ‖r‖ = 29.10.
The next step again correctly selects j = 6 for column X_6^T = [−9, −3, −5, 9, −5] having the most remaining alignment with r. It solves for α_6^* = −1.93, less (in absolute value) than the ideal −2.0, but again fairly close. This updates r ← r − X_6 α_6^*.
print("norm:", LA.norm(y), y)
r = y
for i in range(k):
118 5 Linear Regression
# udpate residual
r = r - X[:,j]*aj
print(" update :", j, aj , LA.norm(r))
In practice, it is useful to first center each of the data columns (other than the first offset column) by subtracting its mean before initiating the algorithms. Otherwise large constant effects (absorbed in the offset) appear as meaningful coefficients.
Compressed Sensing
A demonstration of what sort of data can be recovered exactly using lasso is shown in the compressed sensing problem. There are variants of this problem that show up in computer vision, astronomy, and medical imaging; we will examine a simple form.
First consider an unknown signal vector s ∈ R^d. It is d-dimensional, but is known to have most of its coordinates as 0. In our case, we consider m ≪ d non-zero coordinates, and for simplicity assume these have value 1. For example, imagine a telescope scanning the sky. For most snapshots, the telescope sees nothing (a 0), but occasionally it sees a star (a 1). This maps to our model when it takes d snapshots, and only sees m stars. Many other examples exist: a customer will buy m out of d products a store sells; an earthquake registers on only m out of d days on record; or only m out of d people test positive for a rare genetic marker. We will use as an example

s^T = [0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 1 0 0].
The sensing of s with x_i is recorded as y_i = ⟨x_i, s⟩, which is a single scalar value (in our setting an integer). In our example

y_i = ⟨x_i, s⟩ = 0+0+0+0+1+0+0+0+0+0+0+0+0+0+0+1+0+0+0+1−1+0−1+0+0+0+0+0+0+1+0+0 = 2.
The simplest recovery approach is with Matching Pursuit; this (actually the OMP
variant) requires the measurement size constant C closer to 20. This modified version
of MP is sketched in Algorithm 5.5.2. The key step is modified to choose the
dimension j where the column in the measurement matrix X j has the largest dot
product with r. We guess this has a 1 bit, and then factor out the effect of the
measurement matrix witnessing a 1 bit at that location in the residual r. This is
repeated either for a fixed (m) number of steps, or until the residual becomes all 0.
Consider a specific example for running Matching Pursuit, this has d = 10, m = 3,
and n = 6. Let the (unknown) input signal be
s = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0].
⟨x_1, s⟩ = 0 + 0 + 1 + 0 + 0 + 0 + 0 + 0 + (−1) + 0 = 0.

y = X s^T = [0, 0, 0, 1, 1, −2]^T.
Finally, we observe column 6 has the most explanatory power for the new r. We set s_6 = 1 and update r = r − X_6 γ_3 = (0, 0, 0, 0, 0, 0). We have now completely explained y using only 3 data elements.
This will not always work so cleanly on a small example. Using MP typically needs something like n = 20 m log d measurements (instead of n = 6). Larger measurement sets act to concentrate the random variables, and the larger the sets, the more likely that at each step we choose the correct index j as most explanatory.
Exercises
5.1 Let the first column of the data set D be the explanatory variable x, and let the
fourth column be the dependent variable y. [That is: ignore columns 2 and 3 for
now]
1. Run simple linear regression to predict y from x. Report the linear model you
found. Predict the value of y for new x values of 0.3, 0.5, and 0.8.
2. Use cross-validation to predict generalization error, with error of a single data
point (x, y) from a model M as (M(x) − y)2 . Describe how you did this, and which
data was used for what.
3. On the same data, run polynomial regression for p = 2, 3, 4, 5. Report the polynomial models for each. With each of these models, predict the value of y for new x values of 0.3, 0.5, and 0.8.
4. Cross-validate to choose the best model. Describe how you did this, and which
data was used for what.
5.2 Now let the first three columns of the data set D be separate explanatory variables
x1 , x2 , x3 . Again let the fourth column be the dependent variable y.
• Run linear regression simultaneously using all three explanatory variables. Report
the linear model you found. Predict the value of y for new (x1 , x2 , x3 ) values of
(0.3, 0.4, 0.1), (0.5, 0.2, 0.4), and (0.8, 0.2, 0.7).
• Use cross-validation to predict generalization error; as usual define the error of a
single data point (x1 , x2 , x3 , y) from a model M as (M(x1, x2, x3 ) − y)2 . Describe
how you did this, and which data was used for what.
5.3 Let data set x ∈ Rn hold the data for an explanatory variable, and data set y ∈ Rn
be the data for the dependent variable. Here n = 100.
1. Run simple linear regression to predict y from x. Report the linear model you
find. Predict the value of y for the new x values of 0.4 and 0.7.
2. Split the data into a training set (the first 70 values) and the test set (the last 30
values). Run simple linear regression on the training set, and report the linear
model. Again predict the y value at x value of 0.4 and of 0.7.
3. Using the testing data, report the residual vector (it should be 30-dimensional; use the absolute value for each entry) for the model built on the full data, and another for the model built just from the training data. Report the 2-norm of each vector.
Also compute the 2-norm of the residual vector for the training data (a 70-dimensional vector) for the model built on the full data, and also for the model built on the training data.
4. Expand data set x into a n×(p+1) matrix X̃p using standard polynomial expansion
for p = 5. Report the first 5 rows of this matrix.
Build and report the degree-5 polynomial model using this matrix on the training
data.
Report the 2-norm of the residual vector built for the testing data (a 30-dimensional vector) and for the training data (a 70-dimensional vector).
5.4 Consider a data set (X, y) where X ∈ R^{n×3}, and its decomposition into a test set (Xtest, ytest) and a training set (Xtrain, ytrain). Assume that Xtrain is not just a subset of X, but also prepends a column of all 1s. We build a linear model

α = (Xtrain^T Xtrain)^{-1} Xtrain^T ytrain,
where α ∈ R4 . The test data (Xtest, ytest ) consists of two data points: (x1, y1 ) and
(x2, y2 ), where x1, x2 ∈ R3 . Explain how to use (write a mathematical expression)
this test data to estimate the generalization error. That is, if one new data point x
arrives, how much squared error would we expect the model α to have compared to
the unknown true value y?
5.5 Instead of prepending a column of all 1s to X to form X̃, consider prepending a column of all 2s. Explain whether this approach is more powerful, less powerful, or the same as the standard approach. How would the coefficients be interpreted?
5.6 Consider input data (X, y) where X ∈ Rn×d , and assume the rows are drawn iid
from some fixed and unknown distribution.
1. Describe three computational techniques to solve for a model α ∈ R^{d+1} which minimizes ‖Xα − y‖ (so no regularization; you may use ideas from Chapter 6). For each, briefly describe what the method is (do not just list different commands in Python) and explain its potential advantages; these advantages may depend on the values of n and d, or the variant of the model being optimized.
2. Now contrast the above models (no regularization) with two regularized models (ridge regression and lasso). Explain the advantages of each scenario.
5.7 Assume that for a regression problem the dependent variable y values are in
[0, 1], and so we can assume the residuals of any reasonable model built will also
be in [0, 1]. Use a Chernoff-Hoeffding bound to calculate how many test set points are required to ensure we estimate the generalization error of our model on a single new point within ε = 0.02, with probability at least 1 − δ = 0.99.
1. Solve for the coefficients α or αs using Least Squares and Ridge Regression with s ∈ {1, 5, 10, 15, 20, 25, 30}; that is, generate the one least squares solution (implicitly s = 0), and 7 different ridge regression solutions. For each set of coefficients, report the error in the estimate ŷ of y as ‖y − Xαs‖2 = ‖y − ŷ‖2.
2. Create three row-subsets of X and y
• X1 and y1 : rows 1...66 of both data sets.
• X2 and y2 : rows 34...100 of both data sets.
• X3 and y3 : rows 1...33 and 67...100 of both data sets.
Repeat the above procedure to build models αs on these subsets, and cross-validate the solution on the remainder of X and y that was held out. Specifically, learn the coefficients αs using, say, X1 and y1, and then measure the norm ‖ȳ1 − X̄1 αs‖, where X̄1 and ȳ1 are rows 67...100.
3. Which approach works best (averaging the results from the three subsets): Least
Squares, or for which value of s using Ridge Regression?
5.10 Consider the measurement matrix M ∈ R8×10 (which was populated with
random values from {−1, 0, +1}) and observation vector y ∈ R8 , generated as y = M s
using a sparse signal s ∈ R10 .
M = [  0   0  −1  −1   0   0   1   0   1  −1 ]            [  0 ]
    [  0   1   1   1  −1   0   1  −1   0   1 ]            [  1 ]
    [ −1  −1   0   1  −1   1   1   0   0   0 ]            [  0 ]
    [ −1   1   0  −1   1   1   1   1   0  −1 ]            [  2 ]
    [ −1  −1   1   1   1   0   0   1   1   1 ]   and y =  [ −1 ]
    [ −1   1   0  −1   1  −1   1  −1   1  −1 ]            [  0 ]
    [  0   1   0   1  −1   1   1   0   1   0 ]            [  2 ]
    [ −1   0   1   0   0  −1   1  −1  −1   1 ]            [ −1 ]
Use Matching Pursuit (as described in Algorithm 5.5.2) to recover the non-zero
entries from s. Record the order in which you find each entry and the residual vector
after each step.
Chapter 6
Gradient Descent
Abstract In this topic, we will discuss optimizing over general functions f . Typically, the function is defined as f : R^d → R; that is, its domain is multidimensional (in this case a d-dimensional point α) and its output is a real scalar (R). This often arises to describe the “cost” of a model which has d parameters (e.g., degree (d − 1)-polynomial regression), and the goal is to find the parameters α = (α1, α2, . . . , αd) with minimum cost. Although there are special cases where we can solve for these optimal parameters exactly, there are many cases where we cannot. What remains in these cases is to analyze the function f , and try to find its minimum point. The most common solution for this is gradient descent, where we try to “walk” in a direction so the function decreases until we no longer can.
6.1 Functions
Consider a function f : R^2 → R visualized as a colored contour plot, where the function value is illustrated by color relative to the function value at a marked red point: where it is blue indicates the function value is above that of the red point, and where it is green it is below the red point.
Minima and maxima are special points where the function values in some neighborhood ball are all the same color (either all blue for a minimum or all green for a maximum). Saddle points are special points where the neighborhood colors are not connected.
Convex Functions
In many cases we will assume (or at least desire) that our function is convex.
To define this, it is useful to define a line ℓ ⊂ R^d through any two points p, q ∈ R^d. The line ℓ_{p,q} is the set of all points parametrized by a scalar λ ∈ R as

ℓ_{p,q} = {x = λp + (1 − λ)q | λ ∈ R}.

When λ ∈ [0, 1], this defines the line segment between p and q.
A function is convex if, for any two points p, q ∈ R^d, its value on the line segment between them is at most the corresponding weighted average of its values at p and q. That is, it is convex if

for all p, q ∈ R^d and for all λ ∈ [0, 1],   f(λp + (1 − λ)q) ≤ λ f(p) + (1 − λ) f(q).

Removing the “or equal” condition, the function becomes strictly convex.
There are many very cool properties of convex functions. For instance, for two
convex functions f and g, then h(α) = f (α) + g(α) is convex and so is h(α) =
max{ f (α), g(α)}. But one will be most important for us:
Any local minimum of a convex function will also be a global minimum. A strictly
convex function will have at most a single minimum: the global minimum.
This means if we find a minimum, then we must have also found a global minimum
(our goal).
6.2 Gradients
For a function f(α) = f(α1, α2, . . . , αd), and a unit vector u = (u1, u2, . . . , ud) which represents a direction, the directional derivative is defined as

∇_u f(α) = lim_{h→0} ( f(α + hu) − f(α) ) / h.
We are interested in functions f which are differentiable; this implies that ∇u f (α)
is well-defined for all α and u.
Let e1, e2, . . . , ed ∈ R^d be a specific set of unit vectors defined so that ei = (0, 0, . . . , 0, 1, 0, . . . , 0), where for ei the 1 is in the ith coordinate. Then define

∇_i f(α) = ∇_{ei} f(α) = d f(α) / dαi.

It is the derivative in the ith coordinate, treating all other coordinates as constants.
We can now, for a differentiable function f, define the gradient of f as

∇f = (df/dα1) e1 + (df/dα2) e2 + . . . + (df/dαd) ed = ( df/dα1, df/dα2, . . . , df/dαd ).
Example: Gradient
For instance, for f(α) = f(α1, α2, α3) = α1² + 3α1α2 − sin(α3), the gradient is ∇f = (2α1 + 3α2, 3α1, −cos(α3)).
Linear Approximation
From the gradient we can easily recover the directional derivative of f at point α, for any direction (unit vector) u, as

∇_u f(α) = ⟨∇f(α), u⟩.

This implies the gradient describes the linear approximation of f at a point α. The slope of the tangent plane of f at α in any direction u is provided by ∇_u f(α).
Hence, the direction in which f increases the most at a point α is the unit vector u for which ∇_u f(α) = ⟨∇f(α), u⟩ is largest. This occurs at u = ∇f(α)/‖∇f(α)‖, the normalized gradient vector.
To find the minimum of a function f, we then typically want to move from any point α in the direction −∇f(α); at regular points this is the direction of steepest descent.
6.3 Gradient Descent
Basically, for any starting point α(0) the algorithm moves to another point in the
direction opposite to the gradient, in the direction that locally decreases f the fastest.
How fast it moves depends on the scalar learning rate γk and the magnitude of the
gradient vector ∇ f (α(k) ).
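As a minimal sketch of this loop (assuming a function grad_f that computes ∇f is provided), the whole algorithm is only a few lines:

import numpy as np

def gradient_descent(grad_f, a0, gamma=0.1, steps=100):
    a = np.array(a0, dtype=float)
    for _ in range(steps):
        a = a - gamma * grad_f(a)   # move opposite the gradient
    return a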
Stopping Condition
In practice, gradient descent is typically run either for a fixed number of steps, or until the gradient magnitude ‖∇f(α^(k))‖ (or the change in function value) falls below a small threshold.
The most critical parameter of gradient descent is γ, the learning rate. In many cases
the algorithm will keep γk = γ fixed for all k. It controls how fast the algorithm
works. But if it is too large, as we approach the minimum, then the algorithm may
go too far, and overshoot it.
Lipschitz Bound
Suppose the gradient ∇f is L-Lipschitz, meaning ‖∇f(α) − ∇f(α′)‖ ≤ L‖α − α′‖ for all α, α′. Then, for convex f, running gradient descent with a fixed learning rate γ = 1/L for k = C/ε steps, for some constant C, guarantees

f(α^(k)) − f(α*) ≤ ε.

For the k = C/ε claim (and others stated below), we assume that f(α^(0)) − f(α*) is less than some absolute constant, which relates to the constant C. Intuitively, the closer we start to the optimum, the fewer steps it will take. For a convex quadratic function f (e.g., most cost functions derived by sum of squared errors), the gradient ∇f is L-Lipschitz.
If f is additionally η-strongly convex, only k = C log(1/ε) steps are needed; the constant C in k = C log(1/ε) also depends on the condition number L/η. The conditions of this bound imply that f is sandwiched between two convex quadratic functions; specifically, for any α, p ∈ R^d we can bound

f(α) + ⟨∇f(α), p − α⟩ + (η/2)‖p − α‖²  ≤  f(p)  ≤  f(α) + ⟨∇f(α), p − α⟩ + (L/2)‖p − α‖².
Line Search
In some settings, we can search for a (nearly) optimal learning rate γk in each step. If the best value is known to lie in some range [0, γ′], then we keep subdividing the region [0, γ′] into pieces, and excluding ones which cannot contain the minimum.
For instance, the golden section search divides a range [b, t] containing the optimal γk into three sub-intervals (based on the golden ratio) so [b, t] = [b, b′) ∪ [b′, t′] ∪ (t′, t]. At each step, we can determine that either γk ∉ [b, b′) if f(t′) < f(b′), or γk ∉ (t′, t] if f(b′) < f(t′). This reduces the range to [b′, t] or [b, t′], respectively, and we recurse.
In the accompanying figure, a function f is shown in blue on the domain [b, t]. Observe that in the first step, f(b′) < f(t′), so the domain containing the minimum shrinks to [b, t′]. In the next step, with new interior points b″ and t″, f(t″) < f(b″), so the domain containing the minimum shrinks to [b″, t′].
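A sketch of golden section search (assuming f is unimodal on [b, t]; the names here are illustrative):

import math

def golden_section(f, b, t, tol=1e-6):
    phi = (1 + math.sqrt(5)) / 2
    while t - b > tol:
        bp = t - (t - b) / phi      # interior point b'
        tp = b + (t - b) / phi      # interior point t'
        if f(bp) < f(tp):
            t = tp                  # the minimum cannot be in (t', t]
        else:
            b = bp                  # the minimum cannot be in [b, b')
    return (b + t) / 2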
Adjustable Rate
In practice, line search is often slow. Also, we may not have a Lipschitz bound. It is often better to try a few fixed γ values, probably being a bit conservative. As long as f(α^(k)) keeps decreasing, this works well. This may also alert us that there is more than one local minimum, if the algorithm converges to different locations.
An algorithm called backtracking line search automatically tunes the parameter
γ. It uses a fixed parameter β ∈ (0, 1) (preferably in (0.1, 0.8); for instance, use
β = 3/4). Start with a large step size γ (e.g., γ = 1, but “large” very much depends
on the scale of f). Then at each step of gradient descent at location α, if

f(α − γ∇f(α)) > f(α) − (γ/2)‖∇f(α)‖²,

then update γ = βγ. This shrinks γ over the course of the algorithm, and if f is strongly convex, it will eventually decrease γ until it satisfies the condition for linear convergence.
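A sketch of gradient descent with backtracking line search (the function names are illustrative; γ persists and only shrinks across the run, as described):

import numpy as np

def gd_backtracking(f, grad_f, a0, gamma=1.0, beta=0.75, steps=100):
    a = np.array(a0, dtype=float)
    for _ in range(steps):
        g = grad_f(a)
        while f(a - gamma * g) > f(a) - (gamma / 2) * np.dot(g, g):
            gamma = beta * gamma    # shrink the step size
        a = a - gamma * g
    return a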
Consider the function

f(x, y) = ( (3/4)x − 3/2 )² + (y − 2)² + (1/4)xy,

which has gradient

∇f(x, y) = ( (9/8)x − 9/4 + (1/4)y,  2y − 4 + (1/4)x ).
We run gradient descent for 10 iterations from the initial position (5, 4), while varying the learning rate over γ ∈ {0.01, 0.1, 0.2, 0.3, 0.5, 0.75}.
We see that with γ very small, the algorithm does not get close to the minimum.
When γ is too large, then the algorithm jumps around a lot, and is in danger of
not converging. But at a learning rate of γ = 0.3, it converges fairly smoothly and
reaches a point where ∇ f (x, y) is very small. Using γ = 0.5 almost overshoots in
the first step; γ = 0.3 is smoother, and it is probably best to use a curve that looks
smooth like that one, but with a few more iterations.
These contour plots can be easily generated in Python:
import numpy as np
import matplotlib.pyplot as plt
def func(x, y):
    return (0.75*x - 1.5)**2 + (y - 2.0)**2 + 0.25*x*y
xs, ys = np.meshgrid(np.linspace(0, 6, 100), np.linspace(0, 6, 100))   # plot window (assumed range)
plt.contour(xs, ys, func(xs, ys))
plt.show()
6.4 Fitting a Model to Data
For data analysis, the most common use of gradient descent is to fit a model to data. In this setting, we have a data set (X, y) = {(x1, y1), (x2, y2), . . . , (xn, yn)} ⊂ R^d × R and a family of models M, so each possible model Mα is defined by a k-dimensional vector α = (α1, α2, . . . , αk) of k parameters.
Next we define a loss function L((X, y), Mα ) which measures the difference
between what the model predicts and what the data values are. To choose which
parameters generate the best model, we let f (α) : Rk → R be our function of inter-
est, defined f (α) = L((X, y), Mα ). Then, we can run gradient descent to find our
model Mα∗ . For instance, we can set
f(α) = L((X, y), Mα) = SSE((X, y), Mα) = Σ_{(xi,yi)∈(X,y)} (yi − Mα(xi))².   (6.1)
This class of cost functions includes, for instance, maximum likelihood estimators assuming Gaussian noise (where we had a closed-form solution), but also many other variants (often where there is no known closed-form solution). It also includes least squares regression and its many variants; we will see this in much more detail next. And it will include other topics (including clustering, PCA, classification) we will see later in the book.
Now, we will work through how to use gradient descent for simple quadratic regression on one-dimensional explanatory variables. That is, this will specify the function f(α) in equation (6.1) to have k = 3 parameters as α = (α0, α1, α2), and so for each
(xi, yi) ∈ (X, y) we have xi ∈ R. Then, we can write the model again as a dot product

Mα(xi) = ⟨α, (1, xi, xi²)⟩ = α0 + α1 xi + α2 xi².
To run the gradient descent update

α := α − γ∇f(α),

we only need to define ∇f(α). We will first show this for the case where n = 1, that is, when there is a single data point (x1, y1). For quadratic regression, the cost function
f(α) = f1(α) = (α0 + α1x1 + α2x1² − y1)² is convex. Next derive

d/dαj f(α) = d/dαj f1(α) = d/dαj (Mα(x1) − y1)²
   = 2(Mα(x1) − y1) · d/dαj (Mα(x1) − y1)
   = 2(Mα(x1) − y1) · d/dαj ( Σ_{z=0}^{2} αz x1^z − y1 )
   = 2(Mα(x1) − y1) x1^j.
Using this convenient form (which generalizes to any polynomial model), we define
∇f(α) = ( d/dα0 f(α), d/dα1 f(α), d/dα2 f(α) )
   = 2( (Mα(x1) − y1), (Mα(x1) − y1) x1, (Mα(x1) − y1) x1² ).
To generalize this to multiple data points (n > 1), there are two standard ways. Both
of these take strong advantage of the cost function f (α) being decomposable. That
is, we can write
f(α) = Σ_{i=1}^n fi(α),

where each fi depends only on the ith data point (xi, yi) ∈ (X, y). In particular, for quadratic regression, each fi(α) = (Mα(xi) − yi)².
First notice that since f is the sum of fi s, where each is convex, then f must also
be convex; in fact the sum of these usually becomes strongly convex (it is strongly
convex as long as the corresponding feature vectors are full rank).
[Figure: for simple linear regression, a single cost fi(α) = (⟨α, (1, xi)⟩ − yi)² forms a convex “halfpipe” over the (α0, α1) plane, flat along the line ⟨α, (1, xi)⟩ = yi.]
Thus, while each fi is convex, it is not strongly convex, since along that line (shown in green in the figure) it has a 0 directional derivative. If we add a second function fi′ (the dashed one) defined by another point (xi′, yi′), then f(α) = fi(α) + fi′(α); that is, it is the sum of the blue halfpipe and the dashed-purple halfpipe. As long as these are not parallel, then f becomes strongly convex. Linear algebraically, this happens as long as the matrix

[ 1  xi  ]
[ 1  xi′ ]

is full rank.
Two approaches toward gradient descent will take advantage of this decomposition
in slightly different ways. This decomposable property holds for most loss functions
for fitting a model to data.
The first technique, called batch gradient descent, simply extends the definition of
∇ f (α) to the case with multiple data points. Since f is decomposable, then we use
the linearity of the derivative to define
d/dαj f(α) = Σ_{i=1}^n d/dαj fi(α) = Σ_{i=1}^n 2(Mα(xi) − yi) xi^j,
and thus
∇f(α) = ( 2Σ_{i=1}^n (Mα(xi) − yi), 2Σ_{i=1}^n (Mα(xi) − yi) xi, 2Σ_{i=1}^n (Mα(xi) − yi) xi² )
   = Σ_{i=1}^n 2( (Mα(xi) − yi), (Mα(xi) − yi) xi, (Mα(xi) − yi) xi² )
   = 2 Σ_{i=1}^n (Mα(xi) − yi) (1, xi, xi²).
That is, the step is now just the sum of the terms from each data point. Since f is (strongly) convex, we can apply all of the nice convergence results discussed above for (strongly) convex f. However, computing ∇f(α) at each step takes time proportional to n, which can be slow for large n (i.e., for large data sets).
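A sketch of batch gradient descent for this quadratic model (the learning rate here is illustrative and may need tuning):

import numpy as np

def batch_gd(x, y, gamma=0.001, steps=1000):
    X = np.column_stack([np.ones_like(x), x, x**2])   # rows (1, x_i, x_i^2)
    alpha = np.zeros(3)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ alpha - y)   # sum of per-point gradients
        alpha = alpha - gamma * grad
    return alpha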
The second technique is called incremental gradient descent; see Algorithm 6.4.1. It
avoids computing the full gradient each step, and only computes ∇ fi (α) for a single
data point (xi, yi) ∈ (X, y). For quadratic regression with Mα(x) = ⟨α, (1, x, x²)⟩, it is

∇fi(α) = 2(Mα(xi) − yi)(1, xi, xi²).
A more common variant of this is called stochastic gradient descent; see Algo-
rithm 6.4.2. Instead of choosing the data points in order, it selects a data point (xi, yi )
at random each iteration (the term “stochastic” refers to this randomness).
On very large data sets (i.e., big data!), these algorithms are often much faster than the batch version, since each iteration now takes constant time. However, they do not automatically inherit all of the nice convergence results known for (strongly) convex functions.
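A matching sketch of the stochastic variant, where each step uses the gradient ∇fi of one random data point:

import numpy as np

def stochastic_gd(x, y, gamma=0.01, steps=1000):
    X = np.column_stack([np.ones_like(x), x, x**2])
    alpha = np.zeros(3)
    for _ in range(steps):
        i = np.random.randint(len(y))                 # random data point
        grad_i = 2 * (X[i] @ alpha - y[i]) * X[i]     # gradient of f_i only
        alpha = alpha - gamma * grad_i
    return alpha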
Exercises
6.2 Consider running gradient descent with a fixed learning rate γ. For each of the
following, we plot the function value over 10 steps (the function is different each
time). Decide whether the learning rate is probably too high, too low, or about
right.
1. f1 : 100, 99, 98, 97, 96, 95, 94, 93, 92, 91
2. f2 : 100, 50, 75, 60, 65, 45, 75, 110, 90, 85
3. f3 : 100, 80, 65, 50, 40, 35, 31, 29, 28, 27.5, 27.3
4. f4 : 100, 80, 60, 40, 20, 0, –20, –40, –60, –80, –100
Starting with (x, y) = (0, 2) run the gradient descent algorithm for each function.
Run for T iterations, and report the function value at the end of each step.
1. First, run with a fixed learning rate of γ = 0.05 for f1 and γ = 0.0015 for f2 .
2. Second, run with any variant of gradient descent you want. Try to get the smallest
function value after T steps.
For f1 you are allowed only T = 10 steps. For f2 you are allowed T = 100 steps.
6.4 In the first G.csv dataset provided, use the first three columns as explanatory
variables x1, x2, x3 , and the fourth as the dependent variable y. Run gradient descent
on α ∈ R4 , using the dataset provided to find a linear model
ŷ = α0 + α1 x1 + α2 x2 + α3 x3
minimizing the sum of squared errors. Run for as many steps as you feel necessary.
On each step of your run, print on a single line: (1) the model parameters α(i) = [α0(i), α1(i), α2(i), α3(i)], (2) the value of the function f(α(i)), estimating the sum of squared errors, and (3) the gradient ∇f(α(i)). (These are the sorts of things you would do to check/debug a gradient descent algorithm; you may also want to plot some of these.)
1. First run batch gradient descent.
2. Second run incremental gradient descent.
Choose one method which you preferred (either is ok to choose), and explain why you preferred it to the other.
6.5 Explain which parts of the above procedures on data set G.csv would change if you instead minimized the sum of absolute values of the residuals, not the sum of squared residuals.
• Is the function still convex?
• Does the gradient always exist?
6.6 Consider the quadratic (polynomial of degree 2) regression on a data set (X, y)
where each of n data points (xi, yi ) has xi ∈ R2 and yi ∈ R. To simplify notation, let
each xi = (ai, bi ).
1. Expand xi = (ai, bi ) and write the model Mα (xi ) as a single dot product of the
form
Mα(xi) = ⟨α, (?, ?, . . . , ?)⟩
where α is a vector, and you need to fill in the appropriate ?s.
2. Write the batch (of size n) gradient ∇f(α) for this problem, where

f(α) = Σ_{i=1}^n (Mα(xi) − yi)².
Your expression for ∇ f (α) should use the term (Mα (xi )− yi ) as part of its solution.
Chapter 7
Dimensionality Reduction
Abstract This chapter will build a series of techniques to deal with high-dimensional
data. Unlike regression problems, our goal is not to predict a value (the y-coordinate),
it is to understand the “shape” of the data, for instance a low-dimensional rep-
resentation that captures most of the meaning of the high-dimensional data. This
is sometimes referred to as unsupervised learning (as opposed to regression and
classification, where the data has labels, known as supervised learning). Like most
unsupervised scenarios (as most parents know), you can have a lot of fun, but it is
easy to get yourself into trouble if you are not careful. We will cover many inter-
connected tools, including the singular value decomposition (SVD), eigenvectors
and eigenvalues, the power method, principal component analysis (PCA), multidimensional scaling, as well as several other use cases of the SVD. The chapter will
conclude by contrasting PCA with the main alternative method for dimensionality
reduction: random projections.
We will start with data in a matrix A ∈ Rn×d , and will call upon linear algebra to
rescue us. It is useful to think of each row ai of A as a data point in Rd , so there are
n data points. Each dimension j ∈ 1, 2, . . . , d corresponds with an attribute of the
data points.
• In movie ratings, we may consider n users who have rated each of d movies on a score of 1 to 5. Then each row ai represents a user, and the jth entry of that row is the score the user gave to the jth movie.
• Consider the price of a stock measured over time (say the closing price each day).
Many time-series models consider some number of days (d days, for instance 25
days) to capture the pattern of the stock at any given time. So for a given closing
day, we consider the d previous days. If we have data on the stock for 4 years
(about 1000 days the stock market is open), then we can create a d-dimensional data point (the previous d = 25 days) for each day (except the first 25 or so). The data matrix is then comprised of n data points ai, where each corresponds to a closing day and the previous d days. The jth entry of ai is the value (j − 1) days before the closing day i.
• Finally, consider a series of pictures of a shape (say the Utah teapot). The camera position is fixed, as is the background, but we vary two things: the rotation of the teapot and the amount of light. Here each picture is a set of say d pixels (say 10,000 if it is 100 × 100), and there are n pictures. Each picture is a row of length d, and each pixel corresponds to a column of the matrix. Similar, but more complicated, scenarios frequently occur with pictures of a person's face, or 3d imaging of an organ.
In each of these scenarios, there are many (n) data points, each with d attributes.
The next goal is to uncover a pattern, or a model M. In this case, the model will be a
low-dimensional subspace B. It will represent a k-dimensional space, where k ≪ d. For instance, in the example with images, there are only two parameters which are changing (rotation and lighting), so despite having d = 10,000 dimensions of data, 2 dimensions should be enough to represent everything about the data in that controlled environment.
7.1.1 Projections
Unlike in linear regression, this family of techniques will measure error as a projection from ai ∈ R^d to the closest point πB(ai) on B. We next define these concepts more precisely using linear algebra.
First recall that, given a unit vector v ∈ R^d and any data point p ∈ R^d, the dot product

⟨v, p⟩

is the norm of p after it is projected onto the line through v. If we multiply this scalar by v, then

πv(p) = ⟨v, p⟩ v;

it results in the point on the line through the unit vector v that is closest to the data point p. This is a projection of p onto the vector v.
To understand this for a subspace B, we will need to define an orthogonal basis.
For now we will assume that B contains the origin (0, 0, 0, . . . , 0) (as did the line
through v). Then if B is k-dimensional, there is a k-dimensional orthogonal basis
VB = {v1, v2, . . . , vk} so that
• For each vj ∈ VB we have ‖vj‖ = 1; that is, vj is a unit vector.
• For each pair vi, vj ∈ VB with i ≠ j we have ⟨vi, vj⟩ = 0; the pairs are orthogonal.
• For any point x ∈ B we can write x = Σ_{j=1}^k αj vj; in particular αj = ⟨x, vj⟩.
Given such a basis, the projection of any point p ∈ R^d onto B is

πB(p) = Σ_{j=1}^k ⟨vj, p⟩ vj = Σ_{j=1}^k πvj(p).
Geometry of Projection
For a two-dimensional subspace B ⊂ R³ with basis {v1, v2}, the projection is πB(p) = α1(p)v1 + α2(p)v2, where α1(p) = ⟨p, v1⟩ and α2(p) = ⟨p, v2⟩. Thus, we can also represent πB(p) in the basis {v1, v2} with the two coordinates (α1(p), α2(p)).
[Figure: a point p ∈ R³ and its projection πB(p) onto a two-dimensional subspace B spanned by v1 and v2, with coordinates α1(p) and α2(p).]
The other powerful part of the basis VB is that it defines a new coordinate
system. Instead of using the d original coordinates, we can use the new coordinates (α1(p), α2(p), . . . , αk(p)), where αi(p) = ⟨vi, p⟩. To be clear, πB(p) is still in R^d, but there is a k-dimensional representation if we restrict to B.
When B is d-dimensional, this operation can still be interesting. The basis we
choose VB = {v1, v2, . . . , vd } could be the same as the original coordinate axis, that
is we could have vi = ei = (0, 0, . . . , 0, 1, 0, . . . , 0) where only the ith coordinate is 1.
But if it is another basis, then this acts as a rotation (with possibility of also a mirror
flip). The first coordinate is rotated to be along v1 ; the second along v2 ; and so on.
In πB (p), the point p does not change, just its representation.
As usual our goal will be to minimize the sum of squared errors (SSE). In this case,
we define this objective as

SSE(A, B) = Σ_{ai∈A} ‖ai − πB(ai)‖².
As compared to linear regression, this is much less a “proxy goal” where the true
goal was prediction. Now, we have no labels (the yi values), so we simply try to fit a
model through all of the data.
The standard goal for a projection subspace B is to minimize the sum of squared
errors between each ai to its projected approximation πB (ai ).
How do we solve for this? Our methods for linear regression do not solve the
correct problem. The cost function is different in that it only measures error in one
coordinate (the dependent variable), not the projection cost. It is also challenging
to use the standard versions of gradient descent. The restriction that each vi ∈ VB
is a unit vector is a non-convex constraint. So ... linear algebra will come to the
rescue—in the form of the SVD.
7.2 Singular Value Decomposition
A really powerful and useful linear algebra operation is called the singular value decomposition (the SVD). It extracts an enormous amount of information about a matrix A. This section will define it and discuss many of its uses. Then, we will describe one algorithm (the power method) which constructs the SVD. But in general, one simply calls the procedure in your favorite programming language, and it calls the same highly optimized back-end from the Fortran LAPACK library.
The SVD takes in a matrix A ∈ Rn×d and outputs three matrices U ∈ Rn×n ,
S ∈ Rn×d and V ∈ Rd×d , so that A = USV T .
[U, S, V] = svd(A)
For a “short and fat” matrix A with more columns than rows (d > n), the picture is similar. In fact, we can recover the tall and thin view in this case by considering A^T and swapping the roles of U → V^T and V^T → U.
The matrix S is mainly all zeros; it only has non-zero elements along its diagonal. So Si,j = 0 if i ≠ j. The remaining values σ1 = S1,1, σ2 = S2,2, . . ., σr = Sr,r are known as the singular values. They are non-increasing, so

σ1 ≥ σ2 ≥ . . . ≥ σr ≥ 0

where r ≤ min{n, d} is the rank of the matrix A. That is, the number of non-zero singular values reports the rank (this is a numerical way of computing the rank of a matrix).
The matrices U and V are orthogonal. Thus, their columns are all unit vectors
and orthogonal to each other (within each matrix). The columns of U, written
u1, u2, . . . , un , are called the left singular vectors; and the columns of V (i.e., rows of
V T ), written v1, v2, . . . , vd , are called the right singular vectors.
This means that for any vector x ∈ R^d, the columns of V (the right singular vectors) provide a basis. That is, we can write x = Σ_{i=1}^d αi vi for αi = ⟨x, vi⟩. Similarly, for any vector y ∈ R^n, the columns of U (the left singular vectors) provide a basis. This also implies that ‖x‖ = ‖V^T x‖ and ‖y‖ = ‖U^T y‖. So these matrices and their vectors do not capture any “scale” information about A; that is all in S. Rather, they provide the bases to describe the scale in a special form.
The left singular vectors are columns in U shown as thin vertical boxes, and right
singular vectors are columns in V and thus thin horizontal boxes in V T . The singular
values are shown as small boxes along the diagonal of S, in decreasing order. The
grey boxes in S and U illustrate the parts which are unimportant (aside from keeping
the orthogonality in U); indeed in some programming languages (like Python), the
standard representation of S is as a square matrix to avoid maintaining all of the
irrelevant 0s in the grey box in the lower part.
One can picture the geometry of the SVD by how a unit vector x ∈ R^d interacts with the matrix A as a whole. The aggregate shape of A is a contour, the set of points

{ Ax | ‖x‖ = 1 }.
Now consider a vector x = (0.243, 0.97) (scaled very slightly so it is a unit vector, ‖x‖ = 1). Multiplying by V^T rotates (and flips) x to p = V^T x; still ‖p‖ = 1. Next, multiplying by S stretches p along the coordinate axes to q = Sp, and finally multiplying by U rotates q to y = Uq = Ax.
import numpy as np
from numpy import linalg as LA
A = np.array([[4.0, 3.0], [2.0, 2.0], [-1.0, -3.0], [-5.0, -2.0]])
U, s, Vt = LA.svd(A)
print(U)
print(s)
print(Vt)
Then for a vector x we can observe its progress as it moves through the components:
x = np.array([0.243, 0.97])
x = x / LA.norm(x)
p = Vt @ x
print(p)
q = np.zeros(4)    # q = S p; padded with zeros to match U's 4 rows
q[:2] = s * p
y = U @ q
print(y)
print(A @ x)       # equals y, since A x = U S V^T x
So how does this help solve the initial problem of finding B∗ , which minimized the
SSE? The singular values hold the key.
It turns out that the singular value decomposition is unique, up to ties in the singular values. This means there is exactly one (up to ties) set of right singular vectors which rotate into a basis so that ‖Ax‖ = ‖SV^T x‖ for all x ∈ R^d (recall that U is orthogonal, so it does not change the norm: ‖Uq‖ = ‖q‖).
Next we realize that the singular values come in sorted order σ1 ≥ σ2 ≥ . . . ≥ σr. In fact, they are defined so that we choose v1 to maximize ‖Av1‖, with ‖Av1‖ = σ1 (and thus ‖A‖2 = σ1). Then, we find the next singular vector v2, which is orthogonal to v1 and maximizes ‖Av2‖, as ‖Av2‖ = σ2, and so on. In general, σj = ‖Avj‖.
If we define B with the basis VB = {v1, v2, . . . , vk}, then for any vector a

‖a − πB(a)‖² = ‖ Σ_{j=1}^d vj⟨a, vj⟩ − Σ_{j=1}^k vj⟨a, vj⟩ ‖² = Σ_{j=k+1}^d ⟨a, vj⟩².

So the projection error of this VB is the part of a along the last (d − k) right singular vectors.
But we are not trying to directly predict new data here (like in regression). Rather, we are trying to approximate the data we have. We want to minimize Σ_i ‖ai − πB(ai)‖². But for any vector v, we recall now that
‖Av‖² = ‖ ( ⟨a1, v⟩, ⟨a2, v⟩, . . . , ⟨an, v⟩ ) ‖² = Σ_{i=1}^n ⟨ai, v⟩².
Thus, the projection error can be measured with a set of orthonormal vectors w1, w2, . . . , w_{d−k} which are each orthogonal to B, as Σ_{j=1}^{d−k} ‖A wj‖². When defining B by the first k right singular vectors, these orthogonal vectors can be the remaining (d − k) right singular vectors (that is, wj = v_{j+k}), so the projection error is

Σ_{i=1}^n ‖ai − πB(ai)‖² = Σ_{i=1}^n Σ_{j=k+1}^d ⟨ai, vj⟩²
   = Σ_{j=k+1}^d Σ_{i=1}^n ⟨ai, vj⟩² = Σ_{j=k+1}^d ‖A vj‖² = Σ_{j=k+1}^d σj².
And thus, by how the right singular vectors are defined, the above expression is equivalent to the sum of squared errors

SSE(A, B) = Σ_{ai∈A} ‖ai − πB(ai)‖².

This is minimized (restricting that B contains the origin) when B is defined as the span of the first k right singular vectors.
A natural next task is the best rank-k approximation of A: the rank-k matrix Ak minimizing both ‖A − Ak‖2 and ‖A − Ak‖F. In fact, the solution Ak will satisfy ‖A − Ak‖2 = σ_{k+1} and ‖A − Ak‖F² = Σ_{j=k+1}^d σj².
This Ak matrix also comes from the SVD. If we set Sk as the matrix S in the
decomposition so that all except the first k singular values are 0, then it has rank
k. Hence, Ak = USk V T also has rank k and is our solution. But we can notice that
when we set most of Sk to 0, then the last (d − k) columns of V are meaningless
since they are only multiplied by 0s in USk V T , so we can also set those to all 0s, or
remove them entirely (along with the last (d − k) columns of Sk ). Similarly, we can
make 0 or remove the last (n − k) columns of U. These matrices are referred to as Vk
and Uk respectively, and also Ak = Uk Sk VkT .
Another way to understand this construction is to rewrite A as
A = USV^T = Σ_{j=1}^r σj uj vj^T,
where each u j vTj is an n × d matrix with rank 1, the result of an outer product. We
can ignore terms for j > r since then σj = 0. This makes clear how the rank of A is
at most r, since it is the sum of r rank-1 matrices. Extending this, Ak is simply

Ak = Uk Sk Vk^T = Σ_{j=1}^k σj uj vj^T.
This implies that ‖A‖F² = Σ_{j=1}^d σj² and ‖Ak‖F² = Σ_{j=1}^k σj². We can see this since

‖A‖F² = ‖U S V^T‖F² = ‖S‖F² = Σ_{j=1}^d σj²    and
‖Ak‖F² = ‖U Sk V^T‖F² = ‖Sk‖F² = Σ_{j=1}^k σj².
For a matrix A ∈ Rn×d , the best rank-k approximation is a matrix Ak ∈ Rn×d derived
from the SVD. We can zero-out all but the first k singular values in S to create a
matrix Sk . If we similarly zero out all but the first k columns of U (the top k left
singular vectors) and the first k rows of V T (the top k right singular vectors), these
represent matrices Uk and Vk . The resulting product Uk Sk VkT is Ak .
[Figure: Ak = Uk Sk Vk^T, where the trailing singular values of S are set to 0, and the corresponding trailing columns of U and rows of V^T are zeroed out or ignored.]
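A sketch of this construction in numpy:

import numpy as np
from numpy import linalg as LA

def rank_k_approx(A, k):
    U, s, Vt = LA.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # A_k = U_k S_k V_k^T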
The matrix Vk also encodes the projection onto the best k-dimensional subspace. Simply multiply any data point x ∈ R^d by Vk^T, and the resulting vector Vk^T x ∈ R^k is the k-dimensional representation of x in the best k-dimensional subspace (that contains the origin), in the sense of maximizing the sum of squared variation retained among the points.
Each row of Vk^T, the right singular vector vj, produces the jth coordinate of the new vector as ⟨vj, x⟩; that is,

Vk^T x = ( ⟨v1, x⟩, ⟨v2, x⟩, . . . , ⟨vk, x⟩ ).
For many data matrices A ∈ R^{n×d}, most of the variation in the data will occur within a low-rank approximation. That is, for a small k ≪ d it will be that ‖A − Ak‖F² / ‖A‖F² is very small. For instance, for n = 1000, d = 50, and k = 3 it may be that ‖A − Ak‖F² / ‖A‖F² ≤ 0.05, or for k = 10 that ‖A − Ak‖F² / ‖A‖F² ≤ 0.001 (less than 0.1% of all variation is removed). Such matrices have light-tailed distributions of singular values.
However, for many other internet-scale data sets, this is very much not the case. Even for large values of k (say k = 100), ‖A − Ak‖F² / ‖A‖F² is large (say > 0.4). These are known as heavy-tailed distributions, discussed more in Section 11.2.
7.3 Eigenvalues and Eigenvectors
Recall that a vector v is an eigenvector of a square matrix M, with eigenvalue λ, if Mv = λv. For a data matrix A ∈ R^{n×d}, two natural square, symmetric matrices to consider are

MR = A^T A   and   ML = A A^T.
Example: Eigendecomposition
Consider the matrix M = A^T A ∈ R^{2×2}, using the same A ∈ R^{4×2} as above. We get the symmetric, square matrix

M = [ 46  29 ]
    [ 29  26 ].

Its eigenvectors are the columns of the matrix

V = [ 0.814  −0.581 ]
    [ 0.581   0.814 ],

with corresponding eigenvalues λ1 ≈ 66.7 and λ2 ≈ 5.3.
Next consider the SVD of A so that [U, S, V] = svd(A). Then we can write

MR V = A^T A V = (V S U^T)(U S V^T) V = V S².

Note that the last step follows because, for orthogonal matrices U and V, we have U^T U = I and V^T V = I, where I is the identity matrix, which has no effect in the product. The matrix S is a diagonal square¹ matrix S = diag(σ1, σ2, . . . , σd). Then S² = SS (the product of S with S) is again diagonal, with entries S² = diag(σ1², σ2², . . . , σd²).
Now consider a single column vj of V (which is the jth right singular vector of A). Extracting this column's role in the linear system MR V = V S², we obtain

MR vj = vj σj².

This means that vj, the jth right singular vector of A, is an eigenvector (in fact the jth eigenvector) of MR = A^T A. Moreover, the jth eigenvalue λj of MR is the jth singular value of A squared: λj = σj².
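This relationship can be checked numerically, for instance with the A from the earlier example:

import numpy as np
from numpy import linalg as LA

A = np.array([[4.0, 3.0], [2.0, 2.0], [-1.0, -3.0], [-5.0, -2.0]])
U, s, Vt = LA.svd(A)
l, V = LA.eig(A.T @ A)
print(np.sort(l)[::-1])   # eigenvalues of A^T A, largest first
print(s**2)               # the squared singular values: the same values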
Similarly, we can derive ML U = A A^T U = (U S V^T)(V S U^T) U = U S²,
1 Technically, the S matrix from the svd has dimensions S ∈ R^{n×d}. To make this simple argument work with this technicality, let us first assume w.l.o.g. (without loss of generality) that d ≤ n. Then the bottom n − d rows of S are all zeros, which means the last n − d columns of U do not matter. So we can ignore both these n − d rows and columns. Then S is square. This makes U no longer orthogonal, so U^T U is then a projection, not the identity; but it turns out this is a projection onto the span of A, so the argument still works.
and hence the left singular vectors of A are the eigenvectors of ML = AAT and the
eigenvalues of ML are also the squared singular values of A.
Eigendecomposition
In general, the eigenvectors provide a basis for a matrix M ∈ R^{n×n} in the same way that the right singular vectors V or left singular vectors U provide a basis for a matrix A ∈ R^{n×d}. In fact, it is again a very special basis, unique up to the multiplicity of eigenvalues. For symmetric matrices such as MR and ML, the eigenvectors are moreover all orthogonal to each other.
Let V = [v1, v2, . . . , vn ] be the eigenvectors of the matrix M ∈ Rn×n , as columns
in the matrix V. Also let L = diag(λ1, λ2, . . . , λn ) be the eigenvalues of M stored on
the diagonal of matrix L. Then, we can decompose M as
M = V L V^{-1}.

If M is invertible, this gives M^{-1} = V L^{-1} V^{-1}. Moreover, when the eigenvectors are orthonormal, so V^{-1} = V^T, this simplifies to

M^{-1} = V L^{-1} V^T,

which was required in our almost closed-form solution for linear regression. Now we just need to compute the eigendecomposition, which we will discuss next.
7.4 The Power Method
The power method refers to what is probably the simplest algorithm to compute the first eigenvector and eigenvalue of a matrix; it is shown in Algorithm 7.4.1. By factoring out the effect of the first eigenvector, we can then recursively repeat the process on the remainder until we have found all eigenvectors and eigenvalues. Moreover, this implies we can also reconstruct the singular value decomposition.
We will consider M ∈ Rn×n , a positive semidefinite matrix: M = AT A.
We can unroll the for loop to reveal another interpretation: directly set u^(q) = M^q u^(0), so that all iterations are incorporated into one repeated matrix-vector multiplication. In practice, with M = A^T A, each iteration is best computed as u^(i) := A^T (A u^(i−1)), so that M never needs to be constructed explicitly.
To find subsequent eigenvectors, factor out the effect of the first: set

A1 := A − A v1 v1^T   and   M1 := A1^T A1.

Then we run PowerMethod(M1 = A1^T A1, q) to recover v2 and λ2; factor them out of M1 to obtain M2, and iterate.
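A minimal sketch of PowerMethod, re-normalizing at each iteration for numerical stability:

import numpy as np
from numpy import linalg as LA

def power_method(M, q=100):
    u = np.random.randn(M.shape[0])
    u = u / LA.norm(u)            # random unit start vector u^(0)
    for _ in range(q):
        u = M @ u
        u = u / LA.norm(u)        # keep the iterate a unit vector
    return u, u @ (M @ u)         # eigenvector estimate and its eigenvalue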
To understand why the power method works, assume we know the eigenvectors
v1, v2, . . . , vn and eigenvalues λ1, λ2, . . . , λn of M ∈ Rn×n .
Since the eigenvectors form a basis for M, and assuming it is full rank, then it is
also a basis for all of Rn (if not, then it does not have n eigenvalues, and we can fill out
the rest of the basis of Rn arbitrarily). Hence any vector, including the initialization
random vector u(0) , can be written as
u^(0) = Σ_{j=1}^n αj vj.
Recall that αj = ⟨u^(0), vj⟩, and since u^(0) is random, it is possible to claim(*) that with probability at least 1/2, for any fixed αj we have |αj| ≥ 1/(2√n). We will now assume that this holds for j = 1, so α1 > 1/(2√n).
[(*) Since u^(0) is a unit vector, its norm is 1, and because {v1, . . . , vn} is a basis, then 1 = ‖u^(0)‖² = Σ_{j=1}^n αj². Since u^(0) is random, E[αj²] = 1/n for each j. Applying a concentration of measure argument (almost a Markov inequality, but one needs to be a bit more careful), we can argue that with probability 1/2, αj² > (1/4)·(1/n), and hence |αj| > (1/2)·(1/√n).]
Next, toward formalizing this, since we can interpret the algorithm as v = M^q u^(0), we analyze M^q. If M has jth eigenvector vj and eigenvalue λj, that is, M vj = λj vj, then M^q has jth eigenvalue λj^q, since

M^q vj = M · M · . . . · M vj = M^{q−1}(vj λj) = M^{q−2}(vj λj)λj = . . . = vj λj^q.
This holds for each eigenvalue of M^q. Hence, we can rewrite the output by summing over the terms in the eigenbasis as

v = ( Σ_{j=1}^n αj λj^q vj ) / √( Σ_{j=1}^n (αj λj^q)² ).
Finally, we would like to show our output v is close to the first eigenvector v1 . We
can measure closeness with the dot product (actually we will need to use its absolute
value since we might find something close to −v1 ). If they are almost the same, the
dot product will be close to 1 (or −1).
|⟨M^q u^(0), v1⟩| = α1 λ1^q / √( Σ_{j=1}^n (αj λj^q)² )
   ≥ α1 λ1^q / √( α1² λ1^{2q} + n λ2^{2q} )
   ≥ α1 λ1^q / ( α1 λ1^q + λ2^q √n ) = 1 − ( λ2^q √n ) / ( α1 λ1^q + λ2^q √n )
   ≥ 1 − 2n ( λ2 / λ1 )^q.
The first inequality holds because λ1 ≥ λ2 ≥ λj for all j > 2. The last inequality holds by dropping the λ2^q √n term in the denominator and using α1 > 1/(2√n).
Thus if there is “gap” between the first two eigenvalues (λ1 /λ2 is large), then this
algorithm converges quickly (at an exponential rate) to where |v, v1 | = 1.
7.5 Principal Component Analysis
Recall that the original goal of this topic was to find the k-dimensional subspace B to minimize

‖A − πB(A)‖F² = Σ_{ai∈A} ‖ai − πB(ai)‖².
We have not actually solved this problem yet. The top k right singular values Vk of A
only provided this bound assuming that B contains the origin: (0, 0, . . . , 0). However,
this might not be the case!
To fix this, principal component analysis (PCA) first centers the data; the top right singular vectors of the centered matrix are then known as the principal components.
Given an initial matrix

A = [ 1   5 ]
    [ 2   3 ]
    [ 3  10 ],

its center vector is c̃ = [2, 6]. In Python we can calculate c̃ explicitly, then subtract it:
import numpy as np
A = np.array([[1, 5], [2, 3], [3, 10]])
c = A.mean(0)
Ac = A - np.outer(np.ones(3), c)
Or use the centering matrix

C3 = [  2/3  −1/3  −1/3 ]
     [ −1/3   2/3  −1/3 ]
     [ −1/3  −1/3   2/3 ]

C = np.eye(3) - np.ones((3, 3))/3
Ac = C @ A
The resulting centered matrix is

Ã = [ −1  −1 ]
    [  0  −3 ]
    [  1   4 ].
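Putting these pieces together gives a minimal sketch of PCA, following the steps above: center A, take the SVD of the centered matrix, and use the top k right singular vectors as the principal components.

import numpy as np
from numpy import linalg as LA

def pca(A, k):
    c = A.mean(0)                          # the center vector
    Ac = A - c                             # centered data
    U, s, Vt = LA.svd(Ac, full_matrices=False)
    return Ac @ Vt[:k].T, Vt[:k], c        # k-dim coordinates, components, center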
7.6 Multidimensional Scaling
Dimensionality reduction maps a high-dimensional point set P to a lower-dimensional point set Q that retains its essential structure; low-rank approximations through direct SVD and through PCA are examples of this: Q = πVk(P).
start with. In some cases, we are only presented P through distances. There are two
common variants:
• We are provided a set of n objects X, and a bivariate function d : X × X → R
that returns a distance between them. For instance, we can put two cities into an
airline website, and it may return a dollar amount for the cheapest flight between
those two cities. This dollar amount is our “distance.”
• We are simply provided a matrix D ∈ Rn×n , where each entry Di, j is the distance
between the ith and jth point. In the first scenario, we can calculate such a
matrix D.
Multidimensional scaling (MDS) has the goal of taking such a distance matrix D
for n points and giving (typically) low-dimensional Euclidean coordinates to these
points so that the resulting points have similar spatial relations to that described in
D. If we had some original data set A which resulted in D, we could just apply PCA
to find the embedding. It is important to note, in the setting of MDS we are typically
just given D, and not the original data A. However, as we will show next, we can
derive a matrix that will act like AAT using only D.
Classical MDS
The most common variant of MDS is known as classical MDS. It uses the centering matrix Cn = I − (1/n)11^T from PCA, but applies it on both sides of D^(2), the matrix which squares all entries of D so that D^(2)_{i,j} = D²_{i,j}. This double centering creates a matrix M = −(1/2) Cn D^(2) Cn. Then, the embedding uses the top k eigenvectors of M and scales them appropriately by the square root of the eigenvalues. That is, let V L V^T = M be its eigendecomposition, and let Vk and Lk represent the top k eigenvectors and eigenvalues, respectively. The final point set is Q = Vk Lk^{1/2} ∈ R^{n×k}; the n rows q1, . . . , qn ∈ R^k are the embeddings of points to represent the distance matrix D. This algorithm is sketched in Algorithm 7.6.1.
If D is not a Euclidean distance matrix, M may fail to be positive semidefinite: there may be fewer than n real eigenvectors, or some may be associated with negative or (numerically) complex eigenvalues. So if our goal is an embedding into k = 3 or k = 10, this is not totally assured to work; yet classical MDS is used a lot nonetheless.
A similarity matrix S is an n × n matrix where entry Si, j is the similarity between the
ith and the jth data point. The similarity often associated with the Euclidean distance ‖ai − aj‖ is the standard inner product (i.e., dot product) ⟨ai, aj⟩. In particular, we can expand the squared Euclidean distance as

‖ai − aj‖² = ‖ai‖² + ‖aj‖² − 2⟨ai, aj⟩,

and hence

⟨ai, aj⟩ = (1/2)( ‖ai‖² + ‖aj‖² − ‖ai − aj‖² ).   (7.1)
Next we observe that for the n × n matrix AA^T, the entry [AA^T]_{i,j} = ⟨ai, aj⟩. So it seems hopeful we can derive AA^T from S using equation (7.1): we can set ‖ai − aj‖² = D²_{i,j}. However, we also need values for ‖ai‖² and ‖aj‖².
Since the embedding has an arbitrary shift to it (if we add a shift vector s to all embedding points, then no distances change), we can arbitrarily choose a1 to be at the origin. Then ‖a1‖² = 0 and ‖aj‖² = ‖a1 − aj‖² = D²_{1,j}. Using this assumption and equation (7.1), we can then derive the similarity matrix AA^T. Specifically, each entry is set to

[AA^T]_{i,j} = (1/2)( D²_{i,1} + D²_{j,1} − D²_{i,j} ).
Let Ei be the n × n matrix that is all 0s except the ith column, which is all 1s. Then for a matrix Z ∈ R^{n×n}, the product Ei Z has each row equal to the ith row of Z, and the product Z Ei^T has each column equal to the ith column of Z. Using this notation, we can reformulate the above expression as a single matrix operation, where D^(2) is defined so that D^(2)_{i,j} = D²_{i,j}, as

AA^T = (1/2)( E1 D^(2) + D^(2) E1^T − D^(2) ).

However, it is now clear there is nothing special about the first row, and we can replace E1 with any Ei, or indeed the average over them. Using the so-called mean matrix Ē = (1/n) Σ_{i=1}^n Ei = (1/n) 11^T, this results in
AA^T = (1/(2n)) Σ_{i=1}^n ( Ei D^(2) + D^(2) Ei^T − D^(2) )
   = −(1/2) ( D^(2) − (1/n) Σ_{i=1}^n Ei D^(2) − (1/n) Σ_{i=1}^n D^(2) Ei^T )
   = −(1/2) ( D^(2) − Ē D^(2) − D^(2) Ē )
   = −(1/2) ( (I − Ē) D^(2) (I − Ē) − Ē D^(2) Ē )
   = −(1/2) Cn D^(2) Cn + (1/2) D̄^(2).
The last line follows since (I − Ē) = Cn is the centering matrix, and using

D̄^(2) = Ē D^(2) Ē = ( (1/n²) Σ_{i=1}^n Σ_{j=1}^n D²_{i,j} ) 11^T,

which is an n × n matrix with all entries the same, specifically the average of all squared distances c = (1/n²) Σ_{i=1}^n Σ_{j=1}^n D²_{i,j}.
We have almost derived classical MDS, but need to remove the D̄^(2) term. So we consider an alternative set of points x1, x2, . . . , xn, stacked as the rows of a matrix X = [x1; x2; . . . ; xn], chosen so that ⟨xi, xj⟩ = ⟨ai, aj⟩ − c/2. Thus

X X^T = AA^T − (1/2) D̄^(2) = −(1/2) Cn D^(2) Cn = M.
Since c is a fixed constant, we observe for any i, j that

‖xi − xj‖² = ‖xi‖² + ‖xj‖² − 2⟨xi, xj⟩ = ‖ai‖² + ‖aj‖² − 2⟨ai, aj⟩ = ‖ai − aj‖² = D²_{i,j},

since the −c/2 terms cancel.
X X T instead of AAT ? First, it is simpler. But more importantly, it “centers” the data
points x1, . . . xn . That is if we reapply the centering matrix on X to get Z = Cn X,
then we recover X again, so the points are already centered:
Z Z^T = (Cn X)(X^T Cn^T) = −(1/2) Cn Cn D^(2) Cn Cn = −(1/2) Cn D^(2) Cn = X X^T.

This follows since Cn Cn = (I − Ē)(I − Ē) = I + Ē² − 2Ē = I + Ē − 2Ē = Cn. The key observation is that Ē² = (1/n²) 11^T 11^T = (1/n²) 1 (1^T 1) 1^T = (1/n) 11^T = Ē, since the inner 1^T 1 = n is a dot product of two all-ones vectors.
Thus, classical MDS is not only a low-rank Euclidean representation of the
distances, but results in data points being centered, as in PCA.
M = -0.5 * C @ (D**2) @ C    # double centering; D is the distance matrix, C = np.eye(n) - np.ones((n,n))/n
l, V = LA.eig(M)
s = np.real(np.power(l, 0.5))
Note we converted the eigenvalues l to singular values, and forced them to be real
valued, since some numerical error can occur generating small imaginary terms.
Then, we project the data to the most informative two dimensions so we can plot
it.
V2 = V[:, [0, 1]]
s2 = np.diag(s[0:2])
Q = V2 @ s2
7.7 Linear Discriminant Analysis
Another tool that can be used to learn a Euclidean distance for data is linear discriminant analysis (or LDA). This term has a few variants; we focus on the multi-class setting. This means we begin with a data set X ⊂ R^d, and a known partition of X into k classes (or clusters) S1, S2, . . . , Sk ⊂ X, so that ∪i Si = X and Si ∩ Sj = ∅ for i ≠ j.
Let μi = (1/|Si|) Σ_{x∈Si} x be the mean of class i, and let Σi = (1/|Si|) Σ_{x∈Si} (x − μi)(x − μi)^T be its covariance.
Similarly, we can represent the overall mean as μ = (1/|X|) Σ_{x∈X} x. Then, we can represent the between-class covariance as

ΣB = (1/|X|) Σ_{i=1}^k |Si| (μi − μ)(μi − μ)^T,

and the within-class covariance as

ΣW = (1/|X|) Σ_{i=1}^k |Si| Σi = (1/|X|) Σ_{i=1}^k Σ_{x∈Si} (x − μi)(x − μi)^T.
The figure shows three color-coded classes with their means: μ1 for blue, μ2 for red, and μ3 for green. The vectors whose sums of outer products generate the class covariances Σ1, Σ2, and Σ3 are color-coded as well. The overall mean μ, and the vectors for the between-class covariance ΣB, are shown in black.
The goal of LDA is then to find directions u which maximize the ratio

( u^T ΣB u ) / ( u^T ΣW u ).
For any k′ ≤ k − 1, we can directly find the orthogonal basis Vk′ = {v1, v2, . . . , vk′} that maximizes the above goal with an eigendecomposition. In particular, Vk′ is the set of top k′ eigenvectors of ΣW^{-1} ΣB. Then, to obtain the best representation of X, we set the new data set as the projection onto the k′-dimensional space spanned by Vk′:

X̃ = Vk′^T X,   so   x̃ = Vk′^T x = ( ⟨x, v1⟩, ⟨x, v2⟩, . . . , ⟨x, vk′⟩ ) ∈ R^{k′}.
This retains the dimensions which show differences between the classes, relative to the similarity within the classes. The removed dimensions tend to show variance within classes without adding much difference between the classes. Conceptually, if the data set can be well-clustered under the k-means clustering formulation, then Vk′ (say with rank k′ = k − 1) describes a subspace which should pass through the k sites {s1 = μ1, s2 = μ2, . . . , sk = μk}, capturing the essential information needed to separate the centers.
7.8 Distance Metric Learning
When using PCA, one should enforce that the input matrix A has the same units in each column (and each row). What should one do if this is not the case?
Let us reexamine the root of the problem: the Euclidean distance. It takes two vectors p, q ∈ R^d (perhaps rows of A) and measures

dEuc(p, q) = ‖p − q‖ = √⟨p − q, p − q⟩ = √( Σ_{i=1}^d (pi − qi)² ).
If each row of a data set A represents a data point, and each column an attribute, then the operation (pi − qi)² is fine, since pi and qi have the same units (they quantify the same attribute). However, the sum Σ_{i=1}^d over these terms adds together quantities that may have different units.
The naive solution is to just brush away those units. These are normalization
approaches where all values pi and qi (in column i) are divided by a constant si with
the same units as the elements in that column, and maybe adding a constant. In one
approach si is chosen to standardize all values in each column to lie in [0, 1]. In the
other common approach, the si values are chosen to normalize: so that the standard
deviation in each column is 1. Note that both of these approaches are affected oddly
by outliers—a single outlier can significantly change the effect of various data points.
Moreover, if new dimensions are added which have virtually the same value for each data point, then these values are inflated to be as meaningful as another signal direction.
As a result, these normalization approaches can be brittle and affect the meaning
of the data in unexpected ways. However, they are also quite common, for better or
worse.
So, given X and sets of pairs C (close) and F (far), the goal is to find M, and its induced distance dM(p, q) = √( (p − q)^T M (p − q) ), which makes the close pairs have small dM distance and the far pairs have large dM distance. Specifically, we will consider finding the optimal distance dM* as

dM* = argmax_M  min_{{xi,xj}∈F} dM(xi, xj)   subject to   Σ_{{xi,xj}∈C} dM²(xi, xj) ≤ κ.

That is, we want to maximize the distance of the closest pair in the far set F, while restricting that all pairs in the close set C have their sum of squared distances at most κ, some constant. We will not explicitly set κ, but rather restrict M in some way so that on average it does not cause much stretch. There are other reasonable similar formulations, but this one will allow for simple optimization.
Notational Setup
Let H = Σ_{{xi,xj}∈C} (xi − xj)(xi − xj)^T; note that this is a sum of outer products, so H is in R^{d×d}. For this to work, we will need to assume that H is full rank; otherwise we do not have enough close pairs to measure. Or we can set H ← H + δI for a small scalar δ.
Further, we can restrict M to have trace Tr(M) = d, hence satisfying some constraint on the close points and fixing the scaling of the distance. Recall that the trace of a matrix M is the sum of its eigenvalues. Let P be the set of all positive semidefinite matrices with trace d; hence the identity matrix I is in P. Also, let Δ denote the simplex

Δ = { α ∈ R^{|F|} | Σ_τ ατ = 1 and all ατ ≥ 0 }.
Let τi,j ∈ F (or simply τ ∈ F when the indices are not necessary) represent a far pair {xi, xj}. And let Xτ = (xi − xj)(xi − xj)^T ∈ R^{d×d}, an outer product. Let X̃τ = H^{-1/2} Xτ H^{-1/2}. It turns out our optimization goal is now equivalent (up to scaling factors, depending on κ) to finding

argmax_{M∈P}  min_{α∈Δ}  Σ_{τ∈F} ατ ⟨X̃τ, M⟩.

Here ⟨X, M⟩ = Σ_{s,t} Xs,t Ms,t, a dot product over matrices; but since X will be related to an outer product between two data points, it makes sense to think of this as dM(X).
Optimization Procedure
Given the formulation above, we will basically try to find an M which stretches the far
points as much as possible while keeping M ∈ P. We do so using a general procedure
referred to as Frank-Wolfe optimization, which increases our solution using one data
point (in this case a far pair) at a time.
Set σ = d · 10^{−5} as a small smoothing parameter. Define a gradient as

gσ(M) = ( Σ_{τ∈F} exp(−⟨X̃τ, M⟩/σ) X̃τ ) / ( Σ_{τ∈F} exp(−⟨X̃τ, M⟩/σ) ).

Observe this is a weighted average over the X̃τ matrices. Let vσ,M be the maximal eigenvector of gσ(M), the direction of maximal gradient.
Then the algorithm is simple. Initialize M0 ∈ P arbitrarily, for instance as M0 = I. Then repeatedly, for t = 1, 2, . . ., perform two steps: (1) find vt = vσ,M_{t−1}, and (2) set Mt = ((t − 1)/t) M_{t−1} + (d/t) vt vt^T, where vt vt^T is an outer product. This is summarized in Algorithm 7.8.1.
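A sketch of this procedure; the inputs are the precomputed matrices X̃τ, and the softmax weights use a standard max-shift for numerical stability:

import numpy as np
from numpy import linalg as LA

def frank_wolfe_metric(Xtil, d, T=100):
    sigma = d * 1e-5
    M = np.eye(d)                                        # M_0 = I is in P
    for t in range(1, T + 1):
        ips = np.array([np.sum(Xt * M) for Xt in Xtil])  # <X~_tau, M>
        w = np.exp(-(ips - ips.min()) / sigma)
        w = w / w.sum()                                  # weights of g_sigma(M)
        g = sum(wi * Xt for wi, Xt in zip(w, Xtil))
        _, V = LA.eigh(g)
        v = V[:, -1]                                     # maximal eigenvector v_t
        M = ((t - 1) / t) * M + (d / t) * np.outer(v, v) # keeps Tr(M) = d
    return M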
7.9 Matrix Completion
A common scenario is that a data set P is provided, but is missing some (or many!) of its attributes. Let us focus on the case where P ∈ R^{n×d} is an n × d matrix, which we can perhaps think of as n data points in R^d, that is, with d attributes each. For
instance, Netflix may want to recommend a subset of n movies to d customers. Each
movie has only been seen, or even rated, by a small subset of all d customers, so for
each (movie, customer) pair (i, j), it could be a rating (saying a score from [0, 10],
or some other value based on their measured interaction), denoted Pi, j . But most
pairs (i, j) are empty, there is no rating yet. A recommendation would be based on
the predicted score for unseen movies.
The typical notation defines a set Ω = {(i, j) | Pi,j ≠ ∅} of the rated (movie, customer) pairs. Then let ΠΩ(P) describe the subset of pairs with scores and ΠΩ⊥(P) the complement, the subset of pairs without scores. The goal is to somehow fill in the values ΠΩ⊥(P).
Consider a matrix where the 5 rows represent different users, and the 4 columns represent movies. The ratings are described numerically, with a larger score indicating a better rating. For some entries there is no rating, and these are denoted as x:

P = [ 3  4  x  8 ]                   [ 1  1  0  1 ]
    [ 1  x  5  x ]                   [ 1  0  1  1 ]
    [ x  4  x  9 ]   with mask Ω =   [ 0  1  0  1 ]
    [ 9  7  1  x ]                   [ 1  1  1  0 ]
    [ 2  2  x  x ]                   [ 1  1  0  0 ].
The simplest variants fill in the missing values as the average of all existing values
in a row (the average rating of a movie). Or it could be the average of existing scores
of a column (the average rating given by a customer). Or an average of these averages. But these
approaches are not particularly helpful for personalizing the rating for a customer
(e.g., a customer who likes horror movies but not rom-coms is likely to score things
differently than one with the opposite preferences).
A common assumption in this area is that there are some simple “latent factors”
which determine a customer’s preferences (e.g., they like horror, and thrillers, but
not rom-coms). A natural way to capture this is to assume there is some low-rank
structure in P. That is we would like to find a low-rank model for P that fits the
observed scores ΠΩ (P). The most common formulation looks like ridge regression:
P* = argmin_{X∈R^{n×d}} (1/2)‖ΠΩ(P − X)‖_F² + λ‖X‖_*.

Here ‖X‖_* is the nuclear norm of a matrix, which corresponds to the sum of its singular
values (recall the squared Frobenius norm is different, since it is the sum of the squared
singular values), and it serves as a regularization term which biases the solution
toward being low-rank.
A simple and common way to approach this problem is iterative, and outlined in
Algorithm 7.9.1. Start with some guess for X (e.g., averages of rows and of columns
of ΠΩ(P)). Take the svd of X to obtain USV^T ← svd(X). Shrink all singular values by
λ, but no smaller than 0 (similar to Frequent Directions in Section 11.3, but not on
the squared values). This operation φλ replaces each diagonal entry S_{i,i} of a diagonal
matrix S with (S_{i,i} − λ)_+, where (x − λ)_+ = max{0, x − λ}. Then we update X̂ = U φλ(S) V^T; this provides a
lower rank estimate since φλ(S) will set some of the singular values to 0. Finally, we
refill the known values Ω as

X ← ΠΩ(P) + ΠΩ⊥(X̂),

and repeat until things do not change much on an update step (it has “converged”).
Algorithm 7.9.1 Matrix Completion
initialize X from ΠΩ(P), filling unknown entries with averages
repeat
  USV^T ← svd(X)
  X̂ ← U φλ(S) V^T
  X ← ΠΩ(P) + ΠΩ⊥(X̂)
until “converged”
return X̂
This typically does not need too many iterations, but if n and d are large, then
computing the SVD can be expensive. Various matrix sketching approaches (again
see Section 11.3) can be used in its place to estimate a low-rank approximation more
efficiently—especially those that pay attention to matrix sparsity. This is appropriate
since the rank will be reduced in the φλ step regardless, and is the point of the
modeling.
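As a concrete sketch (not the book’s exact listing), the following numpy implementation assumes Ω is given as a boolean mask array, initializes the unknown entries with the observed mean rather than row/column averages, and runs a fixed number of iterations in place of the convergence test.

import numpy as np

def complete_matrix(P, mask, lam, iters=100):
    # mask: boolean n x d array, True exactly on the observed entries Omega
    X = np.where(mask, P, P[mask].mean())    # simple initial guess for the unknowns
    for _ in range(iters):
        U, S, Vt = np.linalg.svd(X, full_matrices=False)
        X_hat = (U * np.maximum(S - lam, 0.0)) @ Vt  # phi_lambda: shrink singular values
        X = np.where(mask, P, X_hat)                 # refill the known values
    return X_hat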
7.10 Random Projections

The idea to create the map ν : R^d → R^k is very simple: choose a linear one at random! Specifically,
to create ν, we create k random unit vectors v1, v2, . . . , vk, then project onto the
subspace spanned by these vectors. Finally, we need to renormalize by √(d/k) so the
expected norm is preserved.
A classic result, named after its authors, called the Johnson-Lindenstrauss Lemma,
shows that if k is roughly (1/ε²) log(n/δ), then for all a, a′ ∈ A equation (7.2) is
satisfied with probability at least 1 − δ. The proof can almost be directly derived
via a Chernoff-Hoeffding bound plus a union bound; however, it requires a more
careful concentration of measure bound based on Chi-squared random variables.
For each distance, each random projection (after appropriate normalization) gives
an unbiased estimate; this requires the 1/ε² term, to average these errors so the
concentrated difference from the unbiased estimate is small. Then, the union bound
over all (n choose 2) < n² distances yields the log n term.
Interpretation of Bounds
It is pretty amazing that this bound does not depend on d. Moreover, it is essentially
tight; that is, there are known point sets for which dimension proportional
to 1/ε² is required to satisfy equation (7.2).
Although the log n component can be quite reasonable, the 1/ε² part can be quite
onerous. For instance, if we want the error to be within 1%, we may need k to be
about 10,000 times log n. It requires a very large d for setting k = 10,000 to be useful.
Ultimately, this may be useful when k > 200, and not too much precision is needed,
and PCA is too slow. Otherwise, SVD or its approximations are often a better choice
in practice.
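A sketch of such a projection in numpy follows; since Algorithm 7.10.1 is not reproduced here, the exact interface is an assumption. It draws k random unit vectors, takes the dot product of each point with each vector, and rescales by √(d/k).

import numpy as np

def random_project(A, k, seed=0):
    # A: n x d data matrix; returns an n x k representation
    n, d = A.shape
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(d, k))           # k random directions
    V /= np.linalg.norm(V, axis=0)        # normalize each to a unit vector
    return (A @ V) * np.sqrt(d / k)       # rescale so expected norms are preserved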
Exercises
7.1 Read data set A.csv as a matrix A ∈ R^{30×6}. Compute the SVD of A and report
1. the third right singular vector,
2. the second singular value, and
3. the fourth left singular vector.
4. What is the rank of A?
Compute Ak for k = 2.
5. What is ‖A − Ak‖_F²?
6. What is ‖A − Ak‖_2²?
Center A. Run PCA to find the best two-dimensional subspace B to minimize
‖A − πB(A)‖_F². Report
7. ‖A − πB(A)‖_F² and
8. ‖A − πB(A)‖_2².
7.2 Consider another matrix A ∈ R^{8×4} with squared singular values σ1² = 10,
σ2² = 5, σ3² = 2, and σ4² = 1.
1. What is the rank of A?
2. What is ‖A − A2‖_F², where A2 is the best rank-2 approximation of A?
3. What is ‖A − A2‖_2², where A2 is the best rank-2 approximation of A?
4. What is ‖A‖_2²?
5. What is ‖A‖_F²?
Let v1, v2, v3, v4 be the right singular vectors of A.
6. What is ‖Av2‖?
7. What is ⟨v1, v3⟩?
8. What is ‖v4‖?
9. What is the second eigenvector of A^T A?
10. What is the third eigenvalue of A^T A?
11. What is the fourth eigenvalue of A A^T?
Let a1 ∈ R^4 be the first row of A.
12. Write a1 in the basis defined by the right singular vectors of A.
7.3 Consider two matrices A1 and A2, both in R^{10×3}. A1 has singular values σ1 = 20,
σ2 = 2, and σ3 = 1.5. A2 has singular values σ1 = 8, σ2 = 4, and σ3 = 0.001.
1. For which matrix will the power method most likely converge faster to the top
eigenvector of A1^T A1 (or A2^T A2, respectively), and why?
Suppose you are given the eigenvectors v1, v2, v3 of A^T A. Explain step by step how to recover the
following. Specifically, you should write the answers as linear algebraic expressions
in terms of v1, v2, v3, and A; they can involve taking norms, matrix multiplication, addition, and
subtraction, but not something more complex like the SVD.
2. the second singular value of A
3. the first right singular vector of A
4. the third left singular vector of A
7.4 Describe what will happen if the power method is run after the initialization u(0)
is set to be the second eigenvector?
7.6 Consider again data set A as a matrix A ∈ R^{30×6}. Treat the first 10 rows as class
1, the next 10 rows (rows 11–20) as class 2, and the last 10 rows as class 3. Use LDA
to embed the 30 points into R² so as to best separate these classes.
7.9 Consider again the input data set A as a matrix A, representing 30 data points in
6 dimensions. Here, we will compare random projections to best rank-k approximations
and PCA (as in Exercise 7.1). (Typically, a much larger scale is needed to see
the benefits of random projection.)
1. Compute the 30 × 30 matrix D which measures all pairwise distances between
points in A (in the original six-dimensional representation).
2. For k = 2, find Ak, the best rank-k approximation of A. Compute the distance
matrix D2, and compute both the summed squared error ‖D − D2‖_F² and the worst
case error ‖D − D2‖_∞ = max_{i,j} |Di,j − (D2)i,j|.
3. Find the best two-dimensional subspace B to project A onto with PCA, and its
representation πB(A). Use this representation to compute another distance matrix
DB, and again compute ‖D − DB‖_F² and ‖D − DB‖_∞.
4. Randomly project A onto a two-dimensional subspace as a new point set Q (using
Algorithm 7.10.1). Compute a new distance matrix DQ and compute ‖D − DQ‖_F²
and ‖D − DQ‖_∞.
5. Since the previous random-projection step was randomized, repeat it 10 times,
and calculate the average values for ‖D − DQ‖_F² and ‖D − DQ‖_∞.
6. Increase the dimension used in the random projection until each of ‖D − DQ‖_F²
and ‖D − DQ‖_∞ (averaged over 10 trials) matches or is less than the errors reported
by PCA (it may be larger than the original 6 dimensions).
1. Use the standard notation for the SVD of A as [U, S, V^T] = svd(A), and its
components to describe the set Q.
2. Let D ∈ R^{n×n} be the distance matrix of A, and let DQ ∈ R^{n×n} be the distance
matrix of Q. Use the notation for the representation of Q using the SVD of A to
derive an expression for

‖D − DQ‖_F² = Σ_{i,j} (Di,j − (DQ)i,j)².
Chapter 8
Clustering

Abstract Clustering is the automatic grouping of data points into subsets of similar
points. There are numerous ways to define this problem, and most of them are quite
messy. And many techniques for clustering actually lack a mathematical formulation.
We will initially focus on what is probably the cleanest and most used formulation:
assignment-based clustering, which includes k-center and the notorious k-means
clustering. For background, we will begin with a mathematical detour into Voronoi
diagrams. Then we will also describe various other clustering formulations to give
insight into the breadth and variety of approaches in this subdomain.
On Clusterability
8.1 Voronoi Diagrams

However, this may be slow (naively, we check the distance to all k sites for each point
x to evaluate the nearest-site map φS(x) = argmin_{s∈S} d(x, s)), and it does not provide a general representation or understanding for all points. The
“correct” solution to this problem is the Voronoi diagram.
The Voronoi diagram decomposes R^d into k regions (the Voronoi cells), one for
each site. The region for site si is defined as
Ri = {x ∈ R^d | φS(x) = si}.
For four points {s1, s2, s3, s4} in R¹, the boundaries between the Voronoi cells are
shown; the Voronoi cell for s3 is highlighted. The points x ∈ R¹ such that φS(x) = s3
comprise this highlighted region.
If these regions are nicely defined, this solves the post office problem. For any
point x, we just need to determine which region it lies in (for instance in R², once
we have defined these regions, through an extension of binary search, we can locate
the region containing any x ∈ R² in only roughly log k distance computations and
comparisons). But what do these regions look like, and what properties do they have?
We will start our discussion in R². Further, we will assume that the sites S are in
general position: in this setting, it means that no set of three points lies on a common
line, and that no set of four points lies on a common circle.
The boundary between two regions Ri and Rj, called a Voronoi edge, is a line or
line segment. This edge ei,j is defined as the set of all points equidistant from si and sj, and not closer to any other point sℓ.
Why is the set of points ei,j a line or line segment? If we only have two points in S, then it
is the bisector between them. Draw a circle centered at any point x on this bisector;
if it intersects one of si or sj, it will also intersect the other. This is true since
we can decompose the squared distance from x to si along orthogonal components:
along the edge, and perpendicular to the edge from si to πei,j(si).
Similarly, a Voronoi vertex vi,j,ℓ is a point where three sites si, sj, and sℓ are all
equidistant, and no other point is closer. This vertex is the intersection (and end point) of three Voronoi edges ei,j, ei,ℓ, and
ej,ℓ. Think of sliding a point x along an edge ei,j, maintaining the circle centered
at x and touching si and sj. When this circle grows to where it also touches sℓ, then
ei,j stops.
See the following example with k = 6 sites in R². Notice the following properties:
edges may be unbounded, and the same holds for regions. The circle centered at v1,2,3
passes through s1, s2, and s3. Also, the Voronoi cell R3 has 5 = k − 1 vertices and edges.
Size Complexity
So how complicated can these Voronoi diagrams get? A single Voronoi cell can have
k − 1 vertices and edges. So can the entire complex be of size that grows quadratically
in k (each of k regions requiring complexity of roughly k)? No. The Voronoi vertices
and edges describe a planar graph (i.e., one that can be drawn in the plane, R², with no edges
crossing). And planar graphs have asymptotically the same number of edges, faces,
and vertices. In particular, Euler’s Formula for a planar graph with n vertices, m
edges, and k faces is that k + n − m = 2. To apply this rule to Voronoi diagrams,
either one needs to ignore the infinite edges (e.g., e1,2 in the example) or connect
them all to a single imaginary vertex “at infinity.” Moreover, a standard consequence of
Euler’s Formula for simple planar graphs is that for n ≥ 3, then m ≤ 3n − 6. Hence, k ≤ 2n − 4 for n ≥ 3. The duality construction to
Delaunay triangulations (discussed below) will complete the argument. Since there
are k faces (the k Voronoi cells, one for each site), then there are also asymptotically
roughly (3/2)k edges and k/2 vertices.
However, this does not hold in R³. In particular, for R³ and R⁴, the complexity
(number of cells, vertices, edges, faces, etc.) grows quadratically in k. This means
there could be roughly as many edges as there are pairs of vertices!
But it can get much worse. In R^d (for general d), the complexity is exponential
in d, asymptotically roughly k^{⌈d/2⌉}. This is a lot. Hence, this structure is
impractical to construct in high dimensions.
Curse of Dimensionality
The curse of dimensionality refers to when a problem has a nice, simple, and
low-complexity structure in low dimensions, but then becomes intractable and unintuitive
in high dimensions. For instance, many geometric properties, like the size
complexity of Voronoi diagrams, have linear complexity and are easy to draw in low
dimensions, but are unintuitive, and have size complexity that grows exponentially
as the dimension grows.
Moreover, since this structure is explicitly tied to the post office problem, and the
nearest neighbor function φS , it indicates that in R2 this function is nicely behaved,
but in high dimensions, it is quite complicated.
A fascinating aspect of the Voronoi diagram is that it can be converted into a very
special graph called the Delaunay triangulation, where the sites S are the vertices. This
is the dual of the Voronoi diagram.
• Each face Ri of the Voronoi diagram maps to a vertex si in the Delaunay triangulation.
• Each vertex vi,j,ℓ in the Voronoi diagram maps to a triangular face fi,j,ℓ in the
Delaunay triangulation.
• Each edge ei,j in the Voronoi diagram maps to an edge ēi,j in the Delaunay
triangulation.
See the following example with 6 sites in R². Notice that every edge, face, and vertex
in the Delaunay triangulation corresponds to an edge, vertex, and face in the Voronoi
diagram. Interestingly, the associated edges may not intersect; see e2,6 and ē2,6.
Because of the duality between the Voronoi diagram and the Delaunay triangulation,
their complexities are the same. That means the Voronoi diagram is of size
roughly k for k sites in R², but more generally is of asymptotic size k^{⌈d/2⌉} in R^d.
The existence of the Delaunay triangulation shows that there always exists a
triangulation: a graph with the vertices of a given set of points S ⊂ R² so that all
edges are straight-line segments between the vertices, and each face is a triangle. In
fact, there are many possible triangulations: one can always simply construct some
triangulation greedily, drawing any possible edge that does not cross other edges until
no more can be drawn.
The Delaunay triangulation, however, is quite special. It is the triangulation that
maximizes the smallest angle over all triangles; for meshing applications in graphics
and simulation, skinny triangles (with small angles) cause numerical issues, and so
such triangulations are very useful.
In-Circle Property

Another cool way to define the Delaunay triangulation is through the in-circle
property. For any three points, the smallest enclosing ball either has all three points
on the boundary, or has two points on the boundary which are antipodal to each
other. If a circle has two antipodal points si and sj on its boundary (i.e., si and
sj are at exactly opposite spots on the circle) and contains no other points, then the
edge ei,j is in the Delaunay triangulation. The set of edges defined by pairs of points
defining such empty circles is a subset of the Delaunay triangulation called the Gabriel
graph.
If a circle has three points si, sj, and sℓ on its boundary, and no points in its
interior, then the face fi,j,ℓ is in the Delaunay triangulation, as well as its three edges
ei,j, ei,ℓ, and ej,ℓ. But this does not imply those edges are in the Gabriel graph.
For instance, on a quick inspection, (in the example above) it may not be clear if
edge e3,5 or e4,6 should be in the Delaunay triangulation. Clearly, it cannot be both
since they cross. But the ball with boundary through s3 , s4 , and s6 would contain s5 ,
so the face f3,4,6 cannot be in the Delaunay triangulation. On the other hand, the ball
with boundary through s3 , s6 , and s5 does not contain s4 or any other points in S, so
the face f3,5,6 is in the Delaunay triangulation.
So we want every point assigned to the closest center, and we want to minimize the sum
of the squared distances of all such assignments.
There are several other useful variants, including:
• the k-center clustering problem: minimize max_{x∈X} d(φS(x), x)
• the k-median clustering problem: minimize Σ_{x∈X} d(φS(x), x)
The k-medoid variant is similar to k-median, but restricts that the centers S must
be a subset of X.
Moreover, the mixture of Gaussians approach will allow for more flexibility in the
assignment and more modeling power in the shape of each cluster.
8.2 Gonzalez’s Algorithm for k-Center Clustering
We begin with what is arguably the simplest and most general clustering algorithm:
Gonzalez’s algorithm. This algorithm directly maps to the k-center formulation,
where again every point is assigned to the closest center, and the goal is to minimize
the length of the longest distance of any such assignment pairing.
Unfortunately, the k-center clustering problem is NP-hard to solve exactly.1 In
fact, for the general case, it is NP-hard to find a clustering within any factor less than 2 of the
optimal cost!
Luckily, there is a simple, elegant, and efficient algorithm that achieves this factor-2
approximation. That is, for a value k and a set X, it finds a set of k sites Ŝ whose
k-center cost is at most twice that of the optimal choice of k sites.
Moreover, in practice, this often works better than the worst case theoretical guarantee.
This algorithm, presented as Algorithm 8.2.1, is usually attributed to Teofilo
F. Gonzalez (1985), hence the name. It can be described with the following miserly
maxim: Be greedy, and avoid your neighbors!
The algorithm is iterative, building up the set of sites S over the run of the algorithm.
It initializes the first site s1 arbitrarily, and it maintains the set of sites S, which has i
sites after i steps; initially the set only contains the one site s1.
Then it repeatedly adds to the set the point x ∈ X which is furthest from any of the current sites;
that is, the one with the largest d(x, φS(x)) value. This function φS changes over the course of
the algorithm as more sites are added to S.
1 The term NP-hard (i.e., non-deterministic polynomial time hard) refers to being as hard as a set
of problems in terms of runtime with respect to the size of their input. For these problems, if the
correct solution is found, it can be verified quickly (in time polynomial in the size of the input), but
to find that correct solution the only known approaches are essentially equivalent to a brute-force
search over an exponentially large set of possible solutions, taking time exponential in the input
size. It is not known if it must take time exponential in the input size, or if there may be a solution
which takes time polynomial in the input size (the class of problems P); but in practice, it is often
assumed this is not possible. In short, these problems are probably very hard to solve efficiently.
In other words, back to our maxim, it always adds a new site that is furthest from
the current set of sites.
This simple, elegant, and efficient algorithm works for any metric d, and the
2-approximation guarantee will still hold. The resulting set of sites S is a subset
of the input points X; this means it does not rely on the input points being part of
a nice, easy-to-visualize space, e.g., R^d. However, this algorithm biases the choice
of centers to be on the “edges” of the dataset: each chosen site is as far away from
existing sites as possible, and then is never adjusted. There are heuristics to adjust
centers afterwards, including the algorithms for k-means and k-medoid clustering.
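A compact numpy sketch of this greedy process follows (the interface is an assumption; Algorithm 8.2.1 itself is not reproduced here). Each round adds the point furthest from the current sites and updates the distances d(x, φS(x)) in linear time.

import numpy as np

def gonzalez(X, k):
    # X: n x d array of points; returns the indices of the k chosen sites
    sites = [0]                                   # s1: an arbitrary first site
    dists = np.linalg.norm(X - X[0], axis=1)      # d(x, phi_S(x)) for S = {s1}
    for _ in range(k - 1):
        s_new = int(np.argmax(dists))             # furthest point from current sites
        sites.append(s_new)
        dists = np.minimum(dists, np.linalg.norm(X - X[s_new], axis=1))
    return sites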
Let s_{k+1} = argmax_{x∈X} d(x, φS(x)), and let the cost of the solution found be R =
max_{x∈X} d(x, φS(x)) = d(s_{j′}, s_{k+1}), where s_{j′} = φS(s_{k+1}) (in the picture s_{j′} = s1). We
know that no two sites can be within a distance R of each other; that is, min_{s,s′∈S} d(s, s′) ≥ R;
otherwise, the later of the two would not have been selected as a site, since s_{k+1} (inducing the
radius R) would have been selected first.
Thus, if the optimal set of k sites {o1, o2, . . . , ok} has cost less than R/2, each optimal site
must be assigned to a distinct site in S. Assume otherwise, that some site s ∈ S (e.g., s5
in the figure) has none of the optimal sites assigned to it, and the closest optimal site is
oj, assigned to center sj ∈ S (e.g., s2 in the figure). Since we assume d(oj, sj) < R/2
and d(s, sj) ≥ R, then by the triangle inequality d(sj, oj) + d(oj, s) ≥ d(sj, s), and hence
d(oj, s) > R/2, violating the assumption (e.g., oj must be assigned
to s5).
However, if each distinct site in S has one optimal site assigned to it, then consider the site
s_{j′} which has s_{k+1} assigned to it; s_{j′} and s_{k+1} cannot both be covered by the one center o_{j′}
with radius less than R/2. This is again because d(s_{j′}, s_{k+1}) = R, so one of d(s_{j′}, o_{j′})
and d(s_{k+1}, o_{j′}) must be at least R/2. Since all other sites in S are at distance at least
R from s_{k+1}, the same issue occurs in trying to cover them and some optimal site
with radius less than R/2 as well.
8.3 Lloyd’s Algorithm for k-Means Clustering

So we want every point assigned to the closest site, and we want to minimize the sum
of the squared distances of all such assignments.
We emphasize that the term “k-means” refers to a problem formulation, not to any
one algorithm. There are many algorithms with the aim of solving the k-means problem
formulation, exactly or approximately. We will mainly focus on the most common:
Lloyd’s algorithm. Unfortunately, the data mining literature commonly writes
“the k-means algorithm,” which should more clearly be stated as Lloyd’s
algorithm.
When people think of the k-means problem, they usually think of the following
algorithm, attributed to Stuart P. Lloyd from a document in 1957, although it was
not published until 1982.2
The algorithm is again fairly simple and elegant; however, unlike Gonzalez’s
algorithm for k-center, the sites are all iteratively updated. It initializes with any set
of k sites, and then iteratively updates the locations of these sites. Moreover, this
assumes that the input data X lies in Rd and the implicit distance d is Euclidean; it
can be generalized to X in a few other Euclidean-like spaces as well.
As shown in Algorithm 8.3.1, each iteration is split into two steps. In the first step,
each point x ∈ X is explicitly mapped to the closest site si = φS (x). In the second
step, the set of points Xi = {x ∈ X | φS (x) = si } which are mapped to the ith site si
are gathered, and that site is adjusted to the average of all points in Xi .
This second step is why we assume X ∈ R^d (and that d is the Euclidean distance),
since the average operation can be naturally defined for R^d (as (1/|Xi|) Σ_{x∈Xi} x). This results
in a point in R^d, but not necessarily a point in X. So, differing from Gonzalez’s
algorithm, this does not in general return a set of sites S which are drawn from the
input set X; the sites could be anywhere.
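The two steps translate directly into numpy; the sketch below uses a fixed iteration count in place of a convergence test, and leaves a site unmoved if its cluster becomes empty (one of the edge cases discussed later).

import numpy as np

def lloyds(X, S, iters=100):
    # X: n x d points; S: initial k x d sites
    S = np.asarray(S, dtype=float).copy()
    for _ in range(iters):
        # step 1: map each point to its closest site
        d2 = ((X[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # step 2: move each site to the average of its assigned points
        for i in range(len(S)):
            Xi = X[labels == i]
            if len(Xi) > 0:
                S[i] = Xi.mean(axis=0)
    return S, labels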
After initializing the sites S as k = 4 points from the data set, the algorithm assigns
the data points in X to each site. Then in each round, it updates each site as the average of
the assigned points, and reassigns points (represented by the Voronoi cells). After
4 rounds it has converged to the optimal solution.
2 Apparently, the IBM 650 computer Lloyd was using in 1957 did not have enough computational
power to run the (very simple) experiments he had planned. This was replaced by the IBM 701,
but it did not have quite the same “quantization” functionality as the IBM 650, and the work was
forgotten. Lloyd was also worried about some issues regarding the k-means problem not having a
unique minimum.
Convergence
The number of rounds is finite. This is true since the cost function cost(X, S) will
always decrease. To see this, write it as a sum over S:

cost(X, S) = Σ_{x∈X} ‖φS(x) − x‖²    (8.1)
           = Σ_{si∈S} Σ_{x∈Xi} ‖si − x‖².    (8.2)
Then in each step of the repeat-until loop, this cost must decrease. This holds in
the first step since it moves each x ∈ X to a subset Xi whose corresponding center
si is closer to (or the same distance from) x than before. So in equation (8.1), as φS(x) is
updated for each x, the term ‖x − si‖ = ‖x − φS(x)‖ is reduced (or stays the same). This
holds in the second step since in equation (8.2), for each inner sum Σ_{x∈Xi} ‖si − x‖²,
the single point si which minimizes this cost is precisely the average of Xi. This
follows by the same argument as in 1 dimension (see Section 1.7.1), since by the
Pythagorean formula, the squared distance to the mean can be decomposed along
the coordinates. So reassigning si as described also decreases the cost (or keeps it
the same).
Importantly, since the cost decreases at each step, the algorithm cannot have the same set of
centers S on two different steps (which would also imply the same assignment sets {Xi}).
For such a repeat to happen, the cost would need to
decrease after the first occurrence, and then increase back to obtain the second occurrence,
which is not possible.
Since there are finitely many ways the points can be assigned to different clusters,
the algorithm terminates in a finite number of steps.
Two examples of local minima for Lloyd’s algorithm are shown. In these
configurations, Lloyd’s algorithm has converged, but these do not achieve the optimal
k-means cost.
There are two typical ways to help manage the consequences of this property:
careful initialization and random restarts.
A random restart uses the insight that the initialization can be chosen
randomly, and this starting point determines which minimum is reached. So
the basic Lloyd’s algorithm can be run multiple times (e.g., 3–5 times) with
different random initializations. Then, ultimately, the set of sites S, across all random
restarts, which obtains the lowest cost(X, S) is returned.
Initialization
The initial paper by Lloyd advocates choosing the initial partition of X into disjoint
subsets X1, X2, . . . , Xk arbitrarily. However, some choices will not be very good. For
instance, if we randomly place each x ∈ X into some Xi, then (by the central limit
theorem) we expect all si = (1/|Xi|) Σ_{x∈Xi} x to be close to the mean of the full data
set (1/|X|) Σ_{x∈X} x.
A somewhat safer way to initialize is to choose a set S ⊂ X at random.
Since each si is chosen separately (not as an average of data points), there is no
convergence-to-the-mean phenomenon. However, even with this initialization, we may run
Lloyd’s algorithm to completion and find a sub-optimal solution (a local minimum!).
Indeed, there are scenarios where such random initialization is unlikely to find the
best local minimum, even with several random restarts.
A more principled way to choose an initial set S is to use an algorithm like
Gonzalez’s (see Section 8.2) or k-means++ (see Section 8.3.2). In particular, the
initialization by Gonzalez’s algorithm is guaranteed to be within a factor of 2 of the optimal
k-center objective. While this objective is different from the k-means objective, in
that sense it cannot be too bad; and then Lloyd’s algorithm will only improve from
that point on. k-means++ is similar, but randomized, and more tuned to the specific
k-means objective.
Example: k-Means
To demonstrate k-means, we leverage the powerful Python learning and data analysis
library sklearn. Unlike several other algorithms in this book, k-means has a number
of tricky edge cases. How should one determine convergence? What should be done if there is
a cluster with no data points? When should a random restart be done? The methods in sklearn
apply a number of heuristics to handle these issues out of view.
First, we can generate and plot a data set with 4 nicely shaped clusters:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
And then plot the clusters, by color, as well as the center points in red.
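A sketch of those steps follows, continuing from the imports above; the particular calls to make_blobs and the KMeans parameters are illustrative assumptions, not the book’s exact listing.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)  # 4 nicely shaped clusters
km = KMeans(n_clusters=4, n_init=10).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=km.labels_)                  # points colored by cluster
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], c='red')
plt.show()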
Number of Clusters
So what is the right value of k? As with PCA, where there is no perfect answer for
how many dimensions the subspace should be, there is no perfect answer here either. When k is not given to
you, typically you would run the algorithm with many different values of k, and then create a plot of
cost(X, S) as a function of k. This cost will always decrease with larger k; but of
course k = n is of no use. At some point, the cost will not decrease much between
consecutive values (this implies that probably two centers are used in the same grouping of data,
so the squared distance to either is similar). Then this is a good place to choose k.
We consider two data sets with 300 points in R². The first has 7 well-defined clusters;
the second has 1 blob of points.
We run Lloyd’s algorithm for k-means on both sets for values of k ranging from k = 1
to k = 15, and plot the mean squared error of the result for both. As can be seen,
as k increases, the error always decreases, and eventually starts to flatten out. We
mark with a red • the true number of clusters. For the k = 7 example (left), the
curve decreases rapidly and then levels out, and the true number of clusters is at the
“elbow” of the plot. However, when there is only k = 1 cluster, the MSE curve does
not decrease as rapidly, and there is no real “elbow.”
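Such an elbow plot can be produced as in the following sketch, assuming a data set X and the imports from the earlier example; sklearn exposes the summed squared cost as inertia_, which we divide by n to get the mean squared error.

from sklearn.cluster import KMeans

costs = []
for k in range(1, 16):
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    costs.append(km.inertia_ / len(X))    # cost(X, S) / n, the mean squared error
plt.plot(range(1, 16), costs)
plt.xlabel('k'); plt.ylabel('mean squared error')
plt.show()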
8.3.2 k-Means++

The initialization approach for Lloyd’s algorithm that is most tuned to the k-means
objective is known as k-means++, or D²-sampling. This algorithm is randomized,
unlike Gonzalez’s, so it is compatible with random restarts. Indeed, analyses which
argue for approximation guarantees of Lloyd’s algorithm require several random
restarts with k-means++.
As Algorithm 8.3.2 describes, the structure is like Gonzalez’s algorithm, but it is not
completely greedy. It iteratively chooses each next center randomly: the further (in
squared distance) a point is from the existing centers, the more likely it is to be chosen. For a large
set of points (perhaps grouped together) which are far from an existing center, it
is very likely that one of them (it does not matter so much which one) will be chosen
as the next center. This makes it likely that any “true” cluster will find some point as
a suitable representative.
The critical step in this algorithm, and the difference from Gonzalez’s algorithm, is choosing
the new center proportional to some value (the value d(x, φS(x))²). Recall that this
task has an elegant solution called the partition of unity, discussed in the context of
Importance Sampling (Section 2.4).
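The sketch below implements D²-sampling with numpy’s built-in weighted choice standing in for the partition-of-unity construction; the interface is an assumption, as Algorithm 8.3.2 itself is not reproduced here.

import numpy as np

def kmeanspp(X, k, seed=0):
    # X: n x d points; returns k initial sites chosen by D^2-sampling
    rng = np.random.default_rng(seed)
    n = len(X)
    sites = [X[rng.integers(n)]]                   # first site chosen uniformly
    d2 = ((X - sites[0]) ** 2).sum(axis=1)         # d(x, phi_S(x))^2 for each x
    for _ in range(k - 1):
        j = rng.choice(n, p=d2 / d2.sum())         # probability proportional to d^2
        sites.append(X[j])
        d2 = np.minimum(d2, ((X - X[j]) ** 2).sum(axis=1))
    return np.array(sites)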
Like many algorithms in this book, Lloyd’s depends on the use of an SSE cost function.
Critically, this works because for X ⊂ R^d,

average(X) = (1/|X|) Σ_{x∈X} x = argmin_{s∈R^d} Σ_{x∈X} ‖s − x‖².

There are, in general, no similar properties for other cost functions, or when X is
not in R^d. For instance, one may want to solve the k-median problem, where one
just minimizes the sum of (non-squared) distances. In particular, this minimizing point has
no closed-form solution for X ⊂ R^d for d > 1. Most often, gradient descent-type
approaches are used for finding an updated center under a k-median objective.
An alternative to the averaging step is to choose

si = argmin_{s∈Xi} Σ_{x∈Xi} d(x, s).
The k-medoid clustering formulation is often used because it is general, but also
fairly robust to outliers. A single outlier point may be chosen as a separate cluster
center for k-center or k-means clustering, with no other points in that cluster.
However, for k-medoid or k-median clustering, it is more likely to be added to an
existing cluster without as dramatically increasing the cost function. Typically, this
property is viewed as a positive for k-medoids, in that the resulting cluster subsets
are not greatly distorted by a single outlier.
However, if such an outlier is added to a cluster in which it does not fit, it
may cause other problems. For instance, if a model is fit to each cluster, then this
outlier may distort that model.
Consider the case where each data point is a person applying for a loan. An
outlier might be included in a cluster who has had multiple bankruptcies that caused them
to default on loans, costing previous banks a lot of money. This outlier
may distort the model of that cluster so it predicts that, in expectation, the customers in it are not
likely to repay the loan. On the other hand, excluding that data
point would allow the model to predict more accurately that most of the customers
will repay the loan, and be profitable. As a data modeler, what obligations do you
have to check for such outliers within a clustering? What is a good way to mitigate
the effects of instability in the way data points are clustered? Does the answer to the
previous questions remain the same, even if the expected overall profit level for the
bank should stay the same?
Sometimes it is not desirable to assign each point to exactly one cluster. Instead, we
may split a point among multiple clusters, assigning a fractional value to each.
This is known as soft clustering, whereas the original formulation is known as hard
clustering.
There are many ways to achieve a soft clustering. For instance, consider the
following Voronoi diagram-based approach using natural neighbor interpolation
(NNI). Let V(S) be the Voronoi diagram of the sites S (which decomposes R^d). Then
construct V(S ∪ x) for a particular data point x: the Voronoi diagram of the sites S
with the addition of the one point x. For the region Rx defined by the point x in V(S ∪ x),
overlay it on the original Voronoi diagram V(S). This region Rx will overlap with
regions Ri in the original Voronoi diagram; compute the volume vi of the overlap
with each such region. Then the fractional weight for x into each site si is defined as
wi(x) = vi / Σ_{i′=1}^k v_{i′}.
We can plug any such step into Lloyd’s algorithm, and then recalculate si as the
weighted average of all points partially assigned to the ith cluster.
For the hard clustering assignment, it is convenient to represent the assignment
as a single array of integers, which indicates the index of the site:
point 1 2 3 4 5 6 7 8 9 10
cluster index 1 1 2 2 2 2 3 3 3 3
On the other hand, for a soft clustering assignment, it is better to represent a
probability distribution over the sites for each point. That is, each point xi maps to a
discrete probability distribution pi ∈ Δk. For simplicity, the assignment probabilities
shown are fairly sparse, with a lot of 0 probabilities; but in other cases (like for the mixture
of Gaussians shown next) these distributions pi ∈ Δ°k, where no assignment can have
0 probability.
point 1 2 3 4 5 6 7 8 9 10
cluster 1 probability 1 1 0.4 0 0 0 0.3 0 0 0
cluster 2 probability 0 0 0.6 1 1 1 0.3 0 0 0
cluster 3 probability 0 0 0 0 0 0 0.4 1 1 1
8.4 Mixture of Gaussians

The k-means formulation tends to define clusters of roughly equal size, as the squared
cost discourages points far from any center. It also does not adapt much to the
density of individual clusters.
An extension is to fit each cluster Xi with a Gaussian distribution Gd(μi, Σi),
defined by a mean μi and a covariance matrix Σi. Recall that the pdf of a d-dimensional
Gaussian distribution is defined as

f_{μ,Σ}(x) = (1 / ((2π)^{d/2} √|Σ|)) exp(−(1/2) (x − μ)^T Σ^{−1} (x − μ)),

where |Σ| is the determinant of Σ. Previously we had mainly considered this distribution
where Σ = I was the identity matrix, and it was ignored in the notation.
Now the goal is, given a parameter k, to find a set of k pdfs F = {f1, f2, . . . , fk},
where fi = f_{μi,Σi}, to maximize

Π_{x∈X} max_{fi∈F} fi(x),

or equivalently to minimize

Σ_{x∈X} min_{fi∈F} −log(fi(x)).
For the special case where we restrict that Σi = I (the identity matrix) for each
mixture, the second formulation (the log-likelihood version) is equivalent to
the k-means problem (depending on the choice of hard or soft clustering). In particular,
with Σ = I,

−log(f_{μ,I}(x)) = (1/2)‖x − μ‖² + (d/2) log(2π),

so up to an additive constant the cost of a point is its squared distance to the mean;
and in the general case (x − μ)^T Σ^{−1} (x − μ) = d_{Σ^{−1}}(x, μ)², the squared Σ^{−1}-Mahalanobis
distance between x and μ.
This hints that we can adapt Lloyd’s algorithm toward this problem as well. To
replace the first step of the inner loop, we assign each x ∈ X to the Gaussian which
maximizes fi(x). But for the second step, we need to replace the simple average with an estimation
of the best-fitting Gaussian to a data set Xi. This is also simple. First, calculate the
mean as μi = (1/|Xi|) Σ_{x∈Xi} x. Then calculate the covariance matrix Σi of Xi as the average
of outer products

Σi = (1/|Xi|) Σ_{x∈Xi} (x − μi)(x − μi)^T.
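This second-step update is a few lines of numpy; the sketch below uses the averaged (maximum-likelihood) covariance, as above.

import numpy as np

def fit_gaussian(Xi):
    # best-fit Gaussian for the points Xi assigned to one cluster
    mu = Xi.mean(axis=0)
    diff = Xi - mu
    Sigma = (diff.T @ diff) / len(Xi)   # average of outer products (x - mu)(x - mu)^T
    return mu, Sigma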
8.4.1 Expectation-Maximization
8.5 Hierarchical Clustering

Clustering can provide more than a partition; it can also provide a hierarchy of the
clusters. That is, at the root of the hierarchy all data points are in the same cluster, and
at the leaf nodes each data point is in its own cluster. The intermediate layers then
provide various levels of refinement, depending on how many or how tightly
related clusters are desired.
This small example shows the hierarchy of clusters found on n = 5 data points.
Associated with these clusters and this progression is a tree, showing the local order
in which points are merged into clusters.
There are two basic approaches toward this: top-down and bottom-up. The top-down
approach starts with all data points in the same cluster, and iteratively partitions
the data into 2 (or sometimes more) parts. The main task, then, is: given a data set, how
does one split it into two components as best as possible? The canonical top-down clustering
approach is spectral clustering. This is primarily based on graph ideas, but can be
generalized to work with any choice of similarity function s (or implicitly a linked
distance function d), so the base object is an affinity matrix instead of the adjacency
matrix of a graph. An in-depth discussion is deferred to Section 10.3.
The bottom-up approach, often called hierarchical agglomerative clustering, takes
the opposite tack, outlined in Algorithm 8.5.1. Starting with a cluster for each data
point, its main task is to determine which pair of clusters to join into a single cluster.
As the standard choice is the closest two clusters, the key element is to study
distances between clusters.
There are many ways to define such a distance dC between clusters. Each is
defined with respect to another general distance function d which can be applied
to individual data points, and in some cases also representative objects. The most
common are as follows; a short computational sketch follows the list.
• Single Link: dC(Si, Sj) = min_{si∈Si, sj∈Sj} d(si, sj).
This takes the distance between the closest two points among all points in the
clusters. This allows the clusters to adjust to the shape and density of the data,
but can lead to oddly shaped clusters as well.
• Mean Link: dC(Si, Sj) = Σ_{si∈Si} Σ_{sj∈Sj} d(si, sj).
This takes the distance between pairs of points in the clusters (written
unnormalized; it is often normalized by 1/(|Si||Sj|) to be the average distance). It behaves similarly to the k-means objective.
• Complete Link: dC(Si, Sj) = max_{si∈Si, sj∈Sj} d(si, sj).
This enforces “round” clusters. It can only join two clusters if all points are similar.
• Center Link: dC(Si, Sj) = d(ci, cj), where ci and cj are central points representing
Si and Sj.
The choice of how to define the central point varies. For instance, it could
be the average, median, or representative median as in the k-means, k-median, or
k-medoid objectives. Or it could be any other easy-to-determine or robust way
to represent the clusters.
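In practice these variants are available off the shelf; the sketch below uses scipy, where method='average' is the (normalized) mean link, and stand-in data replaces a real data set.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(19, 2))   # stand-in data
Z = linkage(X, method='single')                     # also: 'complete', 'average'
labels = fcluster(Z, t=4, criterion='maxclust')     # cut the merge tree into 4 clusters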
These clustering variants provide additional structural information beyond center-based
clusters, in that they also convey which clusters are close to each other, and
allow a user to refine to different levels of granularity. However, this comes at the
cost of efficiency. For instance, on n data points, these algorithms naively take time
proportional to n³. Some improvements or approximations can reduce this to closer
to time proportional to n², but may add instability to the results.
Hierarchies can create powerful visuals to help make sense of complex data sets that
are compared using a complex distance function over abstract data representations.
For instance, phylogenetic trees are the dominant way to present evolutionary
connections between species. However, these connections can provide false
insights to those over-eager to make new scientific or business connections.
A common way to model and predict the careers of professional athletes is to use
their early years to cluster and group athletes with others who had careers before
them. Then an up-and-coming athlete can be predicted to have a career similar to
those in the same cluster. However, a new mold of athlete—e.g., basketball players
who shoot many 3-pointers, and are very tall—may have many junior players, but no
senior players to use to predict their careers. If they are grouped with other tall
players, a coach or general manager may treat them as rebounders, minimizing their
effectiveness. Or if they are treated as shooting guards, they may be given more
value. How might the use of the clustering and its hierarchy mitigate the
potential downsides of these possible models?
Such effects are amplified in less scrutinized hiring and management situations.
Consider a type of job applicant who has traditionally not been hired in a role; how
will their clustering among past applicants be harmful or helpful for their chance at
being hired, or placed in a management role? And as a data modeler, what is your
role in aiding these decisions?
8.6 Density-Based Clustering and Outliers
Example: DBScan
An example data set X of n = 15 points is clustered using DBScan with a
density parameter of τ = 3, and radius r as illustrated by the balls. For instance, x4
is a core point, since there are at least τ = 3 points in the highlighted ball of radius r around it. All
core points are marked, and the induced graph is shown.
Note that even though x1 and x2 are within a distance r, they are not connected
since neither is a core point. The points with no edges, like x2 and x3 , are not in any
clusters. And there are two distinct clusters from the two connected components of
the graph: point x5 is in a different cluster than x1 and x4 , which are in the same
cluster.
So all core points (the points which are in dense enough regions) are in clusters,
as well as any points close enough to those points. And points are connected in the
same clusters if they are connected by a path through dense regions. This allows the
clusters to adapt to the shape of the data.
However, there is no automated way to choose the parameters r and τ. This choice
will greatly affect the shape, size, and meaning of the clusters. This model also
assumes that the notion of density is consistent throughout the dataset.
A major advantage of DBScan is that it can usually be computed efficiently. For
each x ∈ X, we need to determine if it is a core point and find its neighbors, but these
tasks can usually be accomplished quickly with fast nearest neighbor search algorithms
(e.g., LSH in Section 4.6). Then usually the connected components (the clusters)
can be recovered without building the entire graph. However, in the worst case, the
algorithm will have runtime quadratic in n, the size of the data set.
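A sketch with sklearn follows, whose eps and min_samples parameters play the roles of the radius r and the density threshold τ; the stand-in data is an assumption.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(100, 2))  # stand-in data
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
labels = db.labels_    # cluster index per point; -1 marks points in no cluster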
8.6.1 Outliers
In general, outliers are the cause of, and solution to, all of data mining’s problems.
That is, for any problem in data mining, one can try to blame it on outliers. And one
can claim that if they were to remove the outliers the problem would go away. That
is, when data is “nice” (there are no outliers) then the structure we are trying to find
should be obvious. There are many solutions, and each has advantages and pitfalls.
Density-Based Approaches
Conceptually, regular points have dense neighborhoods, and outlier points have
non-dense neighborhoods. One can run a density-based clustering algorithm, and all
points which are not in clusters are the outliers. Variants of this idea define density
thresholds in other ways, for instance, for each point x ∈ X, counting the number of
points within a radius r, evaluating the kernel density estimate kde X (x), or measuring
the distance to the kth nearest neighbor. Each has a related version of distance and
density threshold.
Given a point set X, for each point x ∈ X we can calculate its nearest neighbor p
(or more robustly, its kth nearest neighbor). Then for p we can calculate its nearest
neighbor q in X (or its kth nearest neighbor). Now if d(x, p) ≈ d(p, q), then x is
probably not an outlier.
This approach adapts to the density in different regions of the point set. However,
it relies crucially on the choice of distance d. For generic distances this may be quite
slow, but for others we can employ LSH-type procedures to make this more efficient.
Furthermore, density (or relative density) may not tell you how far a point is from a
model, for instance in the case of regression.
Clustering and outlier removal can be performed over population databases (e.g., of
a company’s customers or users). This may provide a better understanding of those
people who are assigned centrally to a cluster; but for those who fit a cluster poorly
or are marked as outliers, the outcome may not be as favorable. Is it acceptable to
not provide any recommendations for some people? Is that better than providing less accurate
ones?
8.7 Mean Shift Clustering

Now for something completely different. Clustering is a very, very broad field with
no settled-upon approach. To demonstrate this, we will quickly review an algorithm
called mean shift clustering. This algorithm shifts each data point individually to its
weighted center of mass. It terminates when all points converge to isolated sets.
Consider a kernel, such as the Gaussian kernel

K(x, p) = exp(−‖x − p‖²/σ²),

for some given bandwidth parameter σ. The weighted center of mass around each
point p ∈ X is then defined as

μ(p) = (Σ_{x∈X} K(x, p) x) / (Σ_{x∈X} K(x, p)).
The algorithm just shifts each point to its center of mass: p ← μ(p).
This algorithm does not require a parameter k. However, it has other parameters,
most notably the choice of kernel K and its bandwidth σ. With the Gaussian
kernel (since it has infinite support, K(x, p) > 0 for all x, p), it will only stop
when all x are at the same point. Thus, the termination condition is also important.
Alternatively, a different kernel with bounded support may terminate automatically
(without a specific condition); for this reason (and for speed) truncated Gaussians
(i.e., Kτ^{trunc}(x, p) = 0 if ‖x − p‖ > τ, and otherwise K(x, p)) are often used.
This algorithm not only clusters the data, but is also a key technique for de-noising
data. This is a process that does not just remove noise (often thought of as outliers),
but attempts to adjust points to where they would have been before being perturbed
by noise—similar to mapping a point to its cluster center.
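A minimal numpy sketch of the shift p ← μ(p) follows; it keeps the kernel weights anchored at the original points X and uses a fixed iteration count in place of a termination condition — both simplifying assumptions.

import numpy as np

def mean_shift(X, sigma, iters=50):
    P = np.asarray(X, dtype=float).copy()
    for _ in range(iters):
        for j in range(len(P)):
            w = np.exp(-((X - P[j]) ** 2).sum(axis=1) / sigma ** 2)  # K(x, p)
            P[j] = (w[:, None] * X).sum(axis=0) / w.sum()            # p <- mu(p)
    return P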
Exercises
We will use four datasets, here:
https://mathfordata.github.io/data/P.csv
https://mathfordata.github.io/data/Q.csv
https://mathfordata.github.io/data/C-HAC.csv
https://mathfordata.github.io/data/Ck.csv
8.1 Download data sets P and Q. Both have 120 data points, each in 6 dimensions,
and can be thought of as data matrices in R^{120×6}. For each, run some algorithm to
construct the k-means clustering of them. Diagnose how many clusters you think
each data set should have by finding the solution for k equal to 1, 2, 3, . . . , 10.
8.2 Draw the Voronoi diagram of the following set of points.
8.3 Consider the drawn data set where the black circles are data points and red
stars are the centers. What should you do, if running Lloyd’s algorithm for k-means
clustering (k = 2), and you reach this scenario, where the algorithm terminates?
8.4 Construct a data set X with 5 points in R² and a set S of k = 3 sites so that
Lloyd’s algorithm will have converged, but there is another set S′, of size k = 3,
so that cost(X, S′) < cost(X, S). Explain why S′ is better than S, but that Lloyd’s
algorithm will not move from S.
8.5 Consider this set of 3 sites: S = {s1 = (0, 0), s2 = (3, 4), s3 = (−3, 2)} ⊂ R2 .
We will consider the following 5 data points X = {x1 = (1, 3), x2 = (−2, 1), x3 =
(10, 6), x4 = (6, −3), x5 = (−1, 1)}.
For each of the points compute the closest site (under Euclidean distance):
1. φS (x1 ) =
2. φS (x2 ) =
3. φS (x3 ) =
4. φS (x4 ) =
5. φS (x5 ) =
Now consider that we have 3 Gaussian distributions defined with each site sj
as a center μj. The corresponding standard deviations are σ1 = 2.0, σ2 = 4.0,
and σ3 = 5.0, and we assume the covariance matrices are diagonal:

Σj = ⎡ σj 0  ⎤
     ⎣ 0  σj ⎦ .
6. Write out the probability density function (its likelihood f j (x)) for each of the
Gaussians.
Now we want to assign each xi to each site in a soft assignment. For each site sj,
define the weight of a point as wj(x) = fj(x) / (Σ_{j′=1}^3 f_{j′}(x)). Finally, for each of the
points, calculate the 3 weights for the 3 sites:
7. w1 (x1 ), w2 (x1 ), w3 (x1 ) =
8. w1 (x2 ), w2 (x2 ), w3 (x2 ) =
9. w1 (x3 ), w2 (x3 ), w3 (x3 ) =
10. w1 (x4 ), w2 (x4 ), w3 (x4 ) =
11. w1 (x5 ), w2 (x5 ), w3 (x5 ) =
8.6 Write down the cost function associated with the Mixture of Gaussians problem
and Algorithm 8.4.1 so that at each phase of the algorithm, this cost cannot increase.
8.7 There are many variants of hierarchical clustering; here we explore 3 using
dataset C-HAC, which we treat as n = 19 points in R². The key difference in these
variants is how to measure the distance d(S1, S2) between two clusters S1 and S2. We
will consider Single Link, Complete Link, and Mean Link with the base distance d as
the Euclidean distance.
1. Run all hierarchical clustering variants on the provided data until there are k = 4
clusters, and report the results as sets. Plot the clusters using a different color and
marker to denote each clustering assignment.
2. Also run DBScan on this data with radius r = 5.0 and density τ = 3.
3. Provide a subjective answer as to which variant you believe did the best job, and
which was the easiest to compute (consider if the data were much larger). Explain
why.
8.8 We will use data set Ck, which we treat as n = 1040 points in R2 , to explore
assignment-based clustering.
1. Run Gonzalez’s algorithm (Algorithm 8.2.1) and k-Means++ (Algorithm 8.3.2)
on the data set for k = 3. To avoid too much variation in the results, choose the
first point in the data file as the first site s1.
Report the centers and the subsets (as pictures) for Gonzalez’s algorithm. For both also report:
• the 3-center cost max_{x∈X} d(x, φS(x)), and
• the 3-means cost (1/|X|) Σ_{x∈X} (d(x, φS(x)))²
(Note this has been normalized, so it is easy to compare to the 3-center cost.)
2. For k-Means++, the algorithm is randomized, so you will need to report the
variation in this algorithm. Run it several trials (at least 20) and plot the cumulative
density function of the 3-means cost. Also report what fraction of the time the
subsets are the same as the result from Gonzalez.
3. Recall that Lloyd’s algorithm for k-means clustering starts with a set of k sites S
and runs as described in Algorithm 8.3.1.
a. Run Lloyd’s algorithm with S initially set to the first 3 points in the data set.
Report the final subset and the 3-means cost.
b. Run Lloyd’s algorithm with S initially set to the output of the Gonzalez algorithm
found above. Report the final subset and the 3-means cost.
c. Run Lloyd’s algorithm with S initially set to the output of each run of k-Means++
above. Plot a cumulative density function of the 3-means cost. Also report the
fraction of the trials in which the subsets are the same as the input (where the input
is the result of the k-Means++ algorithm).
8.10 Run mean shift clustering on data set Ck for 4 iterations. Do so 3 different times
for the bandwidth parameter σ taking values {1, 5, 50}.
1. Plot the output at the end of each iteration for each value of σ.
2. Which value of σ does the best job of grouping similar data (finding clusters)?
Explain your opinion.
Chapter 9
Classification
Abstract This chapter returns to prediction. Unlike linear regression, where we were
predicting a numeric value, in this case we are predicting a class: winner or loser,
yes or no, positive or negative. Ideas from linear regression can be applied here,
but for once least squares is not the dominant error paradigm, as it can lead to some
structurally inconsistent results. Instead, we will organize our discussion around the
beautiful geometry of linear classification, and let the error model be chosen as
more abstract loss functions. This method can be generalized through support vector
machines to non-linear classifiers. We follow that with a discussion of learnability,
using VC dimension to quantify the sample complexity of classifiers. Again this
topic is wide and varied, and the back end of the chapter surveys other common and
diverse methods in KNN classifiers, decision trees, and neural networks.
This is perhaps the central problem in data analysis. For instance, you may want
to predict:
• will a sports team win a game?
• will a politician be elected?
• will someone like a movie?
• will someone click on an ad?
• will I get a job? (If you can build a good classifier, then probably yes!)
Each of these is typically solved by building a general purpose classifier (about sports
or movies etc), then applying it to the problem in question.
Our input here is a point set X ⊂ R^d, where each element xi ∈ X also has an
associated label yi ∈ {−1, +1}.
As in regression, our goal is prediction and generalization. We assume each
(xi, yi) ∼ μ; that is, each data point pair is drawn iid from some fixed but unknown
distribution. As a starting point, consider a linear function

g(x) = α0 + x^(1) α1 + x^(2) α2 + . . . + x^(d) αd = α0 + Σ_{j=1}^d x^(j) αj,
for some set of scalar parameters α = (α0, α1, α2, . . . , αd ). Typically, different nota-
tion is used: we set b = α0 and w = (w1, w2, . . . , wd ) = (α1, α2, . . . , αd ) ∈ Rd . Then
we write

g(x) = b + x^(1) w1 + x^(2) w2 + . . . + x^(d) wd = ⟨w, x⟩ + b.
We can now interpret (w, b) as defining a halfspace in R^d. Here, w is the normal
of the halfspace boundary (the single direction orthogonal to it) and b is the distance
from the origin 0 = (0, 0, . . . , 0) to the halfspace boundary in the direction w/‖w‖.
Because w is normal to the halfspace boundary, b is also the distance from the closest
point on the halfspace boundary to the origin (in any direction).
We typically ultimately use w as a unit vector, but this is not important, since it
can be adjusted by changing b. Let w, b define the desired halfspace with ‖w‖ = 1. Now
assume we have another w′, b′ with ‖w′‖ = β ≠ 1 and w = w′/‖w′‖, so they point
in the same direction, and b′ set so that they define the same halfspace. This implies
b′ = b · β, and b = b′/β. So the normalization of w can simply be done post hoc
without changing any structure.
Recall, our goal is g(x) ≥ 0 if y = +1 and g(x) ≤ 0 if y = −1. So if x lies directly
on the halfspace boundary, then g(x) = 0.
For each data point (xi, yi) ∈ R^d × R, we can immediately represent xi as the values
of d explanatory variables, and yi as the single dependent variable. Then we can set
up an n × (d + 1) matrix X̃, where the ith row is (1, xi); that is, the first coordinate is 1,
and the next d coordinates come from the vector xi. Then with a vector y ∈ R^n, we
can solve for

α = (X̃^T X̃)^{−1} X̃^T y,

and we have a set of d + 1 coefficients α = (α0, α1, . . . , αd) describing a linear function
gα : R^d → R defined by

gα(x) = ⟨α, (1, x)⟩.

Hence b = α0 and w = (α1, α2, . . . , αd). For x such that gα(x) > 0, we predict
y = +1; and for gα(x) < 0, we predict y = −1.
However, this approach is optimizing the wrong problem. It is minimizing how
close our predictions gα(x) are to −1 or +1, by minimizing the sum of squared errors.
But our goal is to minimize the number of mispredicted values, not how close the
numerical values are.
We show 6 positive points and 7 negative points in Rd mapped to Rd+1 . All of the
d-coordinates are mapped to the x-axis. The last coordinate is mapped to the y-axis
and is either +1 (a positive point) or −1 (a negative point). Then the best linear
regression fit is shown (in purple), and the points where it has y-coordinate 0 defines
the boundary of the halfspace (the vertical black line). Note, despite there being a
linear separator, this method misclassifies two points because it is optimizing the
wrong measure.
Since the linear regression SSE cost function is not the correct one, what is the
correct one? We might define a cost function Δ which counts the number of mispredictions:
$$\Delta(g_\alpha, (X, y)) = \sum_{i=1}^{n} \big(1 - \mathbf{1}(\mathrm{sign}(y_i) = \mathrm{sign}(g_\alpha(x_i)))\big).$$
To use gradient descent for classifier learning, we will use a proxy for Δ called a
loss function L. These are sometimes implied to be convex, and their goal is to
approximate Δ. And in most cases, they are decomposable, so we can write
$$f(\alpha) = L(g_\alpha, (X, y)) = \sum_{i=1}^{n} \ell(g_\alpha, (x_i, y_i)) = \sum_{i=1}^{n} \ell(z_i) \quad \text{where } z_i = y_i g_\alpha(x_i)$$
$$= \sum_{i=1}^{n} f_i(\alpha) \quad \text{where } f_i(\alpha) = \ell(z_i) = \ell(y_i g_\alpha(x_i)).$$
Note that the clever expression zi = yi gα (xi ) handles when the function gα (xi )
correctly predicts the positive or negative example in the same way. If yi = +1, and
correctly gα (xi ) > 0, then zi > 0. On the other hand, if yi = −1, and correctly
g_α(x_i) < 0, then also z_i > 0. For instance, the desired cost function Δ is written as
$$\Delta(z) = \begin{cases} 0 & \text{if } z \geq 0 \\ 1 & \text{if } z < 0. \end{cases}$$
Most loss functions ℓ(z) which are convex proxies for Δ mainly focus on how to
deal with the case z_i < 0 (or z_i < 1). The most common ones include
• hinge loss: ℓ(z) = max(0, 1 − z)
• smoothed hinge loss: $\ell(z) = \begin{cases} 0 & \text{if } z \geq 1 \\ (1-z)^2/2 & \text{if } 0 < z < 1 \\ \frac{1}{2} - z & \text{if } z \leq 0 \end{cases}$
• squared hinge loss: ℓ(z) = max(0, 1 − z)²
• logistic loss: ℓ(z) = ln(1 + exp(−z))
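For quick reference, a minimal Python sketch of these four losses (directly transcribing the definitions above):

import numpy as np

def hinge(z):         return np.maximum(0.0, 1.0 - z)
def squared_hinge(z): return np.maximum(0.0, 1.0 - z)**2
def logistic(z):      return np.log(1.0 + np.exp(-z))
def smoothed_hinge(z):
    z = np.asarray(z, dtype=float)
    return np.where(z >= 1, 0.0,
           np.where(z > 0, (1.0 - z)**2 / 2.0,   # quadratic near the hinge
                    0.5 - z))                    # linear for z <= 0

z = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
for f in (hinge, smoothed_hinge, squared_hinge, logistic):
    print(f.__name__, f(z))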
Loss functions are designed as proxies for the {0, 1} cost function Δ, shown in blue.
The most common loss functions, shown in red, are convex, mostly differentiable,
and in the range of z ∈ [−1, 1] fairly closely match Δ. Moreover, they do not penalize
well-classified data points, those with large z values.
[Figure panels, left to right: hinge, smoothed hinge, squared hinge, and logistic (with ReLU) losses, each plotted against z alongside Δ.]
The hinge loss is the closest convex function to Δ; in fact it strictly upper bounds
Δ. However, it is non-differentiable at the “hinge-point,” (at z = 1) so it takes some
care to use it in gradient descent. The smoothed hinge loss and squared hinge loss
are approximations to this which are differentiable everywhere. The squared hinge
loss is quite sensitive to outliers (similar to SSE). The smoothed hinge loss (related
to the Huber loss) is a nice combination of these two loss functions.
The logistic loss can be seen as a continuous approximation to the ReLU (rectified
linear unit) loss function, which is the hinge loss shifted to have the hinge point at
z = 0. The logistic loss also has easy-to-take derivatives (does not require case
analysis) and is smooth everywhere. Minimizing this loss for classification is called
logistic regression.
The writing of a cost function as
$$f(\alpha) = \sum_{i=1}^{n} f_i(\alpha) = \sum_{i=1}^{n} \ell(z_i)$$
where z_i = y_i g_α(x_i) stresses that this is usually used within a gradient descent
framework. So the differentiability is of critical importance. And by the chain rule,
since each y_i and x_i are constants, if ∇g_α and dℓ/dz are defined, then we can
invoke the powerful variants of gradient descent for decomposable functions: batch
and stochastic gradient descent.
Ultimately, in running gradient descent for classification, one typically defines the
overall cost function f also using a regularization term r(α). For instance, r(α) = ‖α‖₂²
is easy to use (it has nice derivatives) and r(α) = ‖α‖₁ (the L1 norm) induces
sparsity, as discussed in the context of regression. In general, the regularizer typically
penalizes larger values of α, resulting in some bias, but less overfitting of the data.
The regularizer r(α) is added to a decomposable loss function $L(g_\alpha, (X, y)) = \sum_{i=1}^{n} \ell(g_\alpha, (x_i, y_i))$ as
$$f(\alpha) = L(g_\alpha, (X, y)) + \eta r(\alpha),$$
where η ∈ R is a regularization parameter that controls how drastically to regularize
the solution.
Note that this function f (α) is still decomposable, so one can use batch, incre-
mental, or most commonly stochastic gradient descent.
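For concreteness, here is a minimal Python sketch of stochastic gradient descent on the regularized logistic loss; the step size, η value, and epoch count are illustrative assumptions.

import numpy as np

# minimize f(alpha) = sum_i ln(1 + exp(-z_i)) + eta * ||alpha||_2^2, z_i = y_i <alpha, (1, x_i)>
def sgd_logistic(X, y, eta=0.1, step=0.05, epochs=100, seed=0):
    n, d = X.shape
    Xt = np.hstack([np.ones((n, 1)), X])     # prepend 1 so alpha = (b, w)
    alpha = np.zeros(d + 1)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(n):         # one stochastic step per data point
            z = y[i] * (Xt[i] @ alpha)
            # gradient of ln(1+exp(-z)) is -y_i x~_i / (1 + exp(z));
            # the regularizer's gradient 2*eta*alpha is spread across the n steps
            grad = -y[i] * Xt[i] / (1.0 + np.exp(z)) + 2.0 * eta * alpha / n
            alpha -= step * grad
    return alpha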
Cross-Validation
Backing up a bit, the true goal is not minimizing f or L, but predicting the class
for new data points. For this, we again assume all data is drawn iid from some
fixed but unknown distribution. To evaluate how well our results generalize, we can
use cross-validation (holding out some data from the training, and calculating the
expected error of Δ on these held out “testing” data points).
We can also choose the regularization parameter η via cross-validation: train with
each candidate value of η on the training data, and choose the one that results in the
best generalization on the test data.
Generalization goals are typically phrased so a data point has the smallest expected
loss. However, this can lead to issues when the data is imbalanced.
Consider a data set consisting of two types of students applying to college. Let
us use hair color as a proxy, and assume each applicant has blue hair or green
hair. If each type has different criteria that would make them successful, a classifier
can still use the hair color attribute in conjunction with other traits to make useful
predictions across both types of applicants. However, if there are significantly more
applicants with blue hair, then the classifier will have a smaller expected loss (and
even better generalization error) if it increases weights for the traits that correspond
with blue-haired applicants succeeding. This will make it more accurate on the blue-
haired applicants, but perhaps less accurate on the green-haired applicants, yet more
accurate overall on average across all applicants. This provides worse prediction and
more uncertainty for the green-haired students due only to the color of their hair; it
may even lower their overall acceptance rate.
What can be done as a data analyst to find such discrepancies? And should
something be changed in the modeling if these discrepancies are found?
Of the above algorithms, generic linear regression is not solving the correct problem,
and gradient descent methods do not really use any structure of the problem. In fact,
as we will see, we can replace the linear function g_α(x) = ⟨α, (1, x)⟩ with any function
g (even non-linear ones) as long as we can take the gradient.
Now, we will introduce the perceptron algorithm which explicitly uses the linear
structure of the problem. (Technically, it only uses the fact that there is an inner
product—which we will exploit in generalizations.)
Simplifications
For simplicity, we will make several assumptions about the data. First, we will
assume that the best linear classifier (w*, b*) defines a halfspace whose boundary
passes through the origin. This implies b* = 0, and we can ignore it. This is basically
equivalent to, for each data point (x_i, y_i) ∈ R^d × R, using x′_i = (1, x_i) ∈ R^{d′} where d′ = d + 1.
Second, we assume that for all data points (x_i, y_i), ‖x_i‖ ≤ 1 (e.g., all data
points live in a unit ball). This can be done by choosing the point x_max ∈ X with
largest norm ‖x_max‖, and dividing all data points by ‖x_max‖ so that point has norm
1, and all other points have smaller norms.
Finally, we assume that there exists a perfect linear classifier: one that classifies
each data point to the correct class. There are variants of the perceptron algorithm to
deal with the case without perfect classifiers, but they are beyond the scope of this text.
The Algorithm
Now to run the algorithm, we start with some normal direction w (initialized as any
positive point), and then add misclassified points to w one at a time.
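A minimal Python sketch of this loop under the simplifications above (data separable through the origin, all points scaled into the unit ball); the initialization and mistake-scan order are illustrative choices.

import numpy as np

def perceptron(X, y):
    X = X / np.max(np.linalg.norm(X, axis=1))    # scale all points into the unit ball
    w = y[0] * X[0]                              # initialize with a (label-signed) data point
    while True:
        mistakes = [i for i in range(len(y)) if y[i] * (X[i] @ w) <= 0]
        if not mistakes:                         # no misclassified points remain
            return w / np.linalg.norm(w)         # normalize w post hoc
        i = mistakes[0]
        w = w + y[i] * X[i]                      # the perceptron update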
The Margin
It is the minimum distance of any data point x_i to the boundary of the halfspace
for a perfect classifier. If the classifier misclassifies a point, under this definition,
the margin might be negative. For this framework, the optimal classifier (or the
maximum margin classifier) (w*, b*) is the one that maximizes the margin
$$\gamma^* = \min_{(x_i, y_i) \in X} y_i \big( \langle w^*, x_i \rangle + b^* \big).$$
A max-margin classifier is one that not just classifies all points correctly but does
so with the most “margin for error.” That is, if we perturbed any data point or the
classifier itself, this is the classifier which can account for the most perturbation and
still predict all points correctly. It also tends to generalize (in the cross-validation
sense) to new data better than other perfect classifiers.
For a set X of 13 points in R² and a linear classifier defined by (w, b), we illustrate
the margin as a pink strip. The margin γ = min_{(x_i, y_i)} y_i(⟨w, x_i⟩ + b) is drawn with an
↔ for each support point.
The points which attain this minimum, with γ* = y_i(⟨w*, x_i⟩ + b*), are known as the
support points, since they “support” the margin strip around the classifier boundary.
Here, we will show that after at most T = (1/γ*)² steps (where γ* is the margin of
the maximum margin classifier), there can be no more misclassified points.
To show this we will bound two terms as a function of t, the number of mistakes
found. The terms are ⟨w, w*⟩ and ‖w‖² = ⟨w, w⟩; this is before we ultimately
normalize w in the return step.
First, we can argue that ‖w‖² ≤ t, since each step increases ‖w‖² by at most 1.
Second, we can argue that ⟨w, w*⟩ ≥ tγ*, since each step increases it by at least γ*;
recall that ‖w*‖ = 1. This inequality follows from the margin of each point being at
least γ* with respect to the max-margin classifier w*.
Combining these facts (⟨w, w*⟩ ≥ tγ* and ‖w‖² ≤ t) together we obtain
$$t\gamma^* \leq \langle w, w^* \rangle \leq \left\langle w, \frac{w}{\|w\|} \right\rangle = \|w\| \leq \sqrt{t}.$$
Solving tγ* ≤ √t for t yields t ≤ (1/γ*)², so there can be at most (1/γ*)² mistakes.
It turns out, all we need to get any of the above perceptron machinery to work
is a well-defined (generalized) inner product. For two vectors p = (p₁, . . . , p_d), x =
(x₁, . . . , x_d) ∈ R^d, we have always used as the inner product the standard dot product:
$$\langle p, x \rangle = \sum_{i=1}^{d} p_i \cdot x_i.$$
However, we can define inner products more generally as a similarity function, and
typically as a kernel K(p, x). For instance, we can use the following non-linear
functions
• K(p, x) = exp(−‖p − x‖²/σ²) for the Gaussian kernel, with bandwidth σ,
• K(p, x) = exp(−‖p − x‖/σ) for the Laplace kernel, with bandwidth σ, and
• K(p, x) = (⟨p, x⟩ + c)^r for the polynomial kernel of power r, with control parameter
c > 0.
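These can be written directly; a minimal Python sketch with illustrative default parameters:

import numpy as np

def gaussian(p, x, sigma=1.0):                   # Gaussian kernel, bandwidth sigma
    return np.exp(-np.sum((p - x)**2) / sigma**2)

def laplace(p, x, sigma=1.0):                    # Laplace kernel, bandwidth sigma
    return np.exp(-np.linalg.norm(p - x) / sigma)

def polynomial(p, x, c=1.0, r=2):                # polynomial kernel of power r
    return (p @ x + c)**r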
Does replacing the dot product with such a kernel break this machinery? No. In
fact, this is the standard way to model several forms of non-linear classifiers.
The “decision boundary” is no longer described by the boundary of a halfspace. For
the polynomial kernel, the boundary must now be a polynomial surface of degree r.
For the Gaussian and Laplace kernels it can be even more complex; the σ parameter
essentially controls the curvature of the boundary. As with polynomial regression,
these can fit classifiers to more complex types of data, but this comes at the expense of
somewhat more computational cost, and perhaps worse generalization if we allow too
complex a model and overfit the data.
Consider a labeled data set (X, y) with X ⊂ R¹ that is not linearly separable because
all of the negatively labeled points are in the middle of two sets of positive points.
These are the points drawn along the x-axis.
However, if each x_i is “lifted” to a two-dimensional point (x_i, x_i²) ∈ R² (with
yellow highlight), then it can be linearly separated in this extended space of degree-2
polynomial classifiers. Of course, in the original space R¹, this classifier is not linear.
To use a more general kernel within the perceptron algorithm, we use a different
interpretation of how to keep track of the weight vector w. Recall that each step
we increment w by y_i x_i for some misclassified data point (x_i, y_i). Instead we will
maintain a length-n vector α = (α₁, α₂, . . . , α_n), where α_i represents the number of
times that data point (x_i, y_i) has been misclassified. Then we can rewrite
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i.$$
Then the decision function becomes
$$g(p) = \langle w, p \rangle = \left\langle \sum_{i=1}^{n} \alpha_i y_i x_i, \; p \right\rangle = \sum_{i=1}^{n} \alpha_i y_i \langle x_i, p \rangle.$$
Observe that only points which have been misclassified at some step have non-zero α_i,
and there are at most (1/γ*)² of these. So this is not significantly more space, depending on the
relationship between d and 1/γ*.
The beauty of this form is that now we can easily replace ⟨x_i, p⟩ with any other
kernel K(x_i, p). That is, the function g(p) now becomes generalized to
$$g(p) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, p).$$
Note this g is precisely a kernel density estimate, with some elements having negative
weights (if yi = −1). Then a point p is classified as positive if g(p) > 0 and negative
otherwise.
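A minimal Python sketch of this mistake-counter form; the fixed number of passes T is an illustrative stopping condition, and any kernel K from the earlier sketch can be passed in.

import numpy as np

def kernel_perceptron(X, y, K, T=100):
    n = len(y)
    alpha = np.zeros(n)                   # alpha_i counts mistakes on (x_i, y_i)
    for _ in range(T):
        for j in range(n):
            g = sum(alpha[i] * y[i] * K(X[i], X[j]) for i in range(n))
            if y[j] * g <= 0:             # point j is (still) misclassified
                alpha[j] += 1
    return alpha

def kernel_predict(p, X, y, alpha, K):    # g(p) = sum_i alpha_i y_i K(x_i, p)
    g = sum(alpha[i] * y[i] * K(X[i], p) for i in range(len(y)))
    return 1 if g > 0 else -1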
The example shows 6 support points: 3 with a positive label {(3.5, 1), (4, 4), (1.5, 2)}
shown as pink +, and 3 with a negative label {(0.5, 0.5), (1, 4.5), (1.5, 2.5)} shown as black points.
A Gaussian kernel (with σ = 1) SVM is used to generate a separator shown as the
thick green line.
We also plot the height function (from negative red values to positive blue ones) of
the (unnormalized) kernel density estimate of the 6 points. It is actually the difference
of the kernel density estimate of the negative labeled points from that of the positive
labeled points.
If the margin is small, the data is not separable, or we simply do not want to deal
with the unknown size of a mistake counter, there is another option to use these
non-linear kernels. We can take all data, and apply a non-linear transformation to a
higher-dimensional space so the problem is linear again.
For a polynomial kernel, on d-dimensional data points, this is equivalent to the
polynomial regression expansion, described in Section 5.3. For a two-dimensional
data point p = (p₁, p₂), the degree-2 polynomial expansion can be taken as
q = (p₁, p₂, p₁², p₂², p₁p₂) ∈ R⁵.
Then, we search over a six-dimensional parameter space (b, w) with b a scalar and
w = (w₁, w₂, w₃, w₄, w₅), and the kernel defined K↑(p, w) = ⟨q, w⟩. Note we are
slightly abusing notation here since in this case, K↑ : R^d × R^{dim(w)} → R, where the
dimensions of p ∈ R^d and w ∈ R^{dim(w)} do not match. Now, the z associated with a
data point (p, y) as input to a loss function ℓ(z) is defined as z = y(⟨q, w⟩ + b).
Note that the number of free parameters in the feature expansion version of the
polynomial kernel is larger than when retaining the kernel in the form K(x, p) =
(⟨x, p⟩ + b)^r, where it is only d + 1. In particular, when p ∈ R^d and the polynomial
degree r is large, then this dimensionality can get high very quickly; the dimension
of q is dim(q) = O(d^r).
Such expansion is also possible for many other radial basis kernels (e.g., Gaussian,
Laplace), but the constructions are only approximate and randomized. Usually
it requires the dimensionality of q to be about 100 or more to get a faithful representation.
Generalization
A more general way to work with complex kernels is called a support vector machine
or SVM. Like with the illustration of the margin for linear classifiers, there are a
small number of data points which determine the margin, the support vectors. Just
these points are enough to determine the optimal margin.
In the case of complex non-linear kernels (e.g., Gaussians), all of the points
may be support vectors. Worse, the associated linearly expanded space, the result of
complete variable expansion, is actually infinite-dimensional! This means the true weight vector
w would be infinite as well, so there is no feasible way to run gradient descent on its
parameters.
However, in most cases the actual number of support vectors is small. Thus, it will
be useful to represent the weight vector w as a linear combination of these support
vectors, without ever explicitly constructing them. Consider a linear expansion of
a kernel K to an m-dimensional space (think of m as being sufficiently large that
it might as well be infinite). However, consider if there are only k support vectors
{s₁, s₂, . . . , s_k} ⊂ X where X is the full data set. Each support vector s_i ∈ R^d has
a representation q_i ∈ R^m. But the normal vector w ∈ R^m can be written as a linear
combination of the q_i s; that is, for some parameters α₁, α₂, . . ., we must be able to
write
$$w = \sum_{i=1}^{k} \alpha_i q_i.$$
Thus, given the support vectors S = {s1, . . . , sk } we can represent w in the span of S
(and the origin), reparametrized as a k-dimensional vector α = (α1, α2, . . . , αk ). This
α vector is precisely the mistake counter for only the support vectors (the non-zero
components), although in this case the coordinates need not be integers.
More concretely, we can apply this machinery without ever constructing the q_i
vectors. Each can be implicitly represented as the function q_i = K(s_i, ·). Recall, we
only ever need to use the q_i in ⟨q_i, w⟩. And we can expand w to
$$w = \sum_{i=1}^{k} \alpha_i q_i = \sum_{i=1}^{k} \alpha_i K(s_i, \cdot).$$
Given this expansion, if we consider positive definite kernels, which include Gaussian
and Laplace kernels, then they have the so-called reproducing property, where each
generalized inner product K(w, p) can be written as a weighted sum of inner products
with respect to the support vectors:
$$K(w, p) = \sum_{i=1}^{k} \alpha_i K(s_i, p).$$
That is, the reproducing property ensures that for every representation of a classifier
w, there are always sets {α₁, α₂, . . .} and {s₁, s₂, . . .} to satisfy the above equation,
and thus we can optimize over the choice of α to find the best w. Ultimately, for a data
point (p, y), the z in the loss function ℓ(z) is defined as
$$z = y K(w, p) = y \sum_{i=1}^{k} \alpha_i K(s_i, p).$$
There are multiple ways to actually optimize SVMs: the task of finding the support
vectors S = {s1, . . . , sk }, and assigning their weights α = {α1, . . . , αk }. One is to run
the kernelized Perceptron algorithm, as outlined above. Alternatively, given a fixed
set of support vectors S, one can directly optimize over α using gradient descent,
including any loss function and regularizer as before with linear classifiers. Thus, if
we do not know S, we can just assume S = X, the full data set. Then, we can apply
standard gradient descent over α ∈ Rn . As mentioned, in most cases, most αi values
are 0 (and those close enough to 0 can often be rounded to 0). Only the points with
non-zero weights are kept as support vectors.
Alternatively, stochastic gradient descent works like the perceptron, and may only use
a fraction of the data points. If we use a version of hinge loss, only misclassified points
or those near the boundary have a non-zero gradient. The very strongly classified
points have zero gradient, and the associated α_i coordinates may remain 0. The
proper choice of loss function and regularizer can induce sparsity on the α values;
the data points not used are not support vectors.
Models for classifiers can quickly become very complicated, with high degree poly-
nomial models or kernel models. Following this discussion we will explore even
more complex models in KNN classifiers, random forests, and deep neural nets. Is
there a mathematical model to understand if these are too complicated, other than
the data-driven cross-validation? Yes, in fact, there are several. The most common
one is based on the idea of VC dimension and sample complexity. This provides
a way to capture the complexity of the family of classifiers being considered, and
relate this to how much data is required to feasibly train a model with such a family.
If you have n = 10 data points in 4-dimensions, you should probably not use a
degree 3 polynomial. But what about 100 data points, or 1000, or 1 million?
VC dimension
This is best described with notions of combinatorial geometry. Consider a set of data
Z and a family of classifiers H (e.g., linear classifiers in R^d), where each h ∈ H maps
h : Z → {−1, +1}. A subset Y ⊆ Z is shattered by H if every possible labeling of Y by
{−1, +1} is realized by some h ∈ H. The VC dimension of H is the size of the largest
subset of Z which can be shattered by H.
Example: VC dimension
Linear separators can shatter a set of 3 points in R², as shown in the 8 panels below.
Each possible subset of these points (including the full and empty sets) is shown in
red, as well as the halfspace (in green) which corresponds with that subset. Thus the
VC dimension is at least 3.
However, linear separators cannot shatter a set of size 4. The two canonical
configurations are shown in the panels below, in addition to a labeling which cannot
be realized using a linear separator. Thus the VC dimension for points in R2 and
linear separators as H is exactly 3.
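To make shattering concrete, here is a small Python sketch (an illustration, not from the text) which brute-forces whether halfspaces shatter a point set, testing each labeling for linear separability as a feasibility LP via scipy.optimize.linprog.

import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    # feasibility of y_i(<w, x_i> + b) >= 1 over variables (w, b)
    n, d = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(X):
    # every one of the 2^|X| labelings must be realized by some halfspace
    return all(separable(X, np.array(labels))
               for labels in itertools.product([-1.0, 1.0], repeat=len(X)))

print(shattered(np.array([[0., 0.], [1., 0.], [0., 1.]])))            # True: 3 points
print(shattered(np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])))  # False: 4 points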
Sample Complexity
When the VC dimension for a family of classifiers is bounded, then we can apply
powerful theorems to grapple with the learnability of the problem. These are again
modeled under the assumption that data X is drawn iid from an unknown distribution
μ. The question is if we learn a classifier h ∈ H on X, and apply it to a new point
x ∼ μ (not in X, e.g., a test set or yet to be seen data), then how likely are we
to misclassify it. We let ε ∈ (0, 1) be the acceptable rate of misclassification (e.g.,
ε = 0.01 means we misclassify it 1% of the time). We also need a probability of
failure δ, controlling the probability that X (also a random sample from μ) is a poor
representative of μ.
First consider a family of classifiers H for which there exists a perfect classifier
h∗ ∈ H , so that for any x, it will always return the correct label. If this is the case
and the VC dimension is ν, then n = O((ν/ε) log(ν/δ)) samples X ∼ μ are sufficient
to learn a classifier h which, with probability at least 1 − δ, will predict the correct
label on a (1 − ε)-fraction of new points drawn from μ.
If there does not exist a perfect classifier, then the situation is more challenging.
The best classifier h∗ ∈ H may still classify α-fraction of new points incorrectly.
Then if the VC dimension is ν, drawing n = O((1/ε 2 )(ν+log(1/δ))) samples X ∼ μ is
sufficient to learn a classifier h which, with probability at least 1 − δ, will misclassify
at most α + ε of new points drawn from μ.
These theorems do not explain how to learn these classifiers. In the perfect clas-
sifier setting, the perceptron algorithm can be used as long as there is an appropriate
inner product, and the margin γ is sufficiently large. Otherwise there may be more
than ε error induced in the learning process. This sort of analysis is more useful
in warning against certain overly complex classifiers, as an enormous number of
training data points may be required to be able to accurately learn the approximately
best classifier. Despite that, many modern techniques (like deep neural networks
and random forests, introduced below) which have no or enormous VC dimension
bounds have gained significant popularity and seem to generalize well. While these
tend to work best when training on huge data sets, the data set sizes often still do not
quite approach that required by the theoretical bounds.
Now for something completely different. There are many ways to define a classifier,
and we have just touched on some of them.
The k-NN classifier (or k-nearest neighbors classifier) works as follows. Choose
a scalar parameter k (it will be far simpler to choose k as an odd number, say
k = 5). Next define a majority function maj : {−1, +1}^k → {−1, +1}. For a set
Y = (y₁, y₂, . . . , y_k) ∈ {−1, +1}^k it is defined as
$$\mathrm{maj}(Y) = \begin{cases} +1 & \text{if more than } k/2 \text{ elements of } Y \text{ are } +1 \\ -1 & \text{if more than } k/2 \text{ elements of } Y \text{ are } -1. \end{cases}$$
Then for a data set X where each element xi ∈ X has an associated label yi ∈
{−1, +1}, define a k-nearest neighbor function φ X,k (q) that returns the k distinct
points in X which are closest to a query point q. Next let sign report yi for any input
point xi ; for a set of inputs xi , it returns the set of values yi .
Finally, the k-NN classifier is
$$g(q) = \mathrm{maj}\big(\mathrm{sign}(\phi_{X,k}(q))\big).$$
That is, it finds the k-nearest neighbors of query point q, and considers all of the
class labels of those points, and returns the majority vote of those labels.
A query point q near many other positive points will almost surely return +1, and
symmetrically for negative points. This classifier works surprisingly well for many
problems but relies on a good choice of distance function to define φ X,k .
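A minimal Python sketch of this classifier, using Euclidean distance and a full sort (an illustrative rather than efficient choice):

import numpy as np

def knn_classify(q, X, y, k=5):
    dists = np.linalg.norm(X - q, axis=1)        # distance from q to each x_i
    nearest = np.argsort(dists)[:k]              # indices of phi_{X,k}(q)
    return 1 if np.sum(y[nearest]) > 0 else -1   # maj() of the labels, k odd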
Unfortunately, the model for the classifier depends on all of X. So it may take
a long time to evaluate on a large data set X. In contrast the functions g for non-
kernelized methods above take time linear in d to evaluate for points in Rd , and thus
are very efficient.
A decision tree is a general classification tool for multidimensional data (X, y) with
X ∈ R^{n×d} and y ∈ {−1, +1}^n. The classifier has a binary tree structure, where each
internal node has two subtrees which contain disjoint subsets of X. In particular,
each node has a rule on R^d for how to split the subset of X in that subtree into two
parts, one for each child. Each leaf node of the tree is given a label −1 or +1, based
on the majority label from the subset S ⊂ X in that node. Then to use the tree to
classify an unknown value x ∈ R^d, it starts at the root of the tree and follows each rule at
the node to choose an appropriate subtree; it follows these subtrees recursively until
it reaches a leaf, and then uses the label at the leaf.
So far, this structure should seem basically the same as other trees used for
nearest neighbor searching (e.g., kd-trees, an alternative to LSH in Section 4.6) or
for clustering (e.g., HAC in Section 8.5 or Spectral Clustering in Section 10.3).
Different from most clustering, and similar to kd-trees, decision trees are made
tractable by splitting on a single coordinate (out of the d choices) at each node.
[Figure: an example decision tree on points including x₁, x₆, x₉, x₁₁, where an internal node splits on a single coordinate with a rule such as y < 5 versus y ≥ 5.]
The construction of the decision tree is made top-down. It starts at the root which
contains all data X, and then splits recursively until some convergence property is
achieved. At each split, each coordinate is considered. For that coordinate, the best
split can be found by scanning the possible splits from low to high; whatever
statistic is being optimized can be updated easily as a single point is moved from one
side to the other in a linear scan. So the remaining considerations are what statistic
to use to decide on the coordinate (and the split point), and when to terminate the
process.
Splitting Criteria
There are two main criteria used to determine which coordinate to use in a split.
Given a criterion, the best split for that criterion is found for each coordinate, and the
best one is used. The ultimate goal is to construct the tree so that it is not very deep,
but so each leaf node has nearly all of the same labels—or at least is a uniform mix
that is not easy to split.
Each criterion is defined for a subset S ⊂ X, which we can categorize in
two subsets: the positively labeled S₊ = {x_i ∈ S | y_i = +1} and the negatively labeled
S₋ = {x_i ∈ S | y_i = −1}. Further denote the probability of each class as p₊ = |S₊|/|S|
and p₋ = |S₋|/|S|.
• The Gini impurity of a subset S is defined as
$$I_G(S) = p_+ (1 - p_+) + p_- (1 - p_-).$$
Each term represents the probability that a member of that class will be chosen, times
the probability it will be misclassified. The probability it will be misclassified is
based on choosing a label for that node not just by the majority weight, but by
selecting a random item and using its class.
• The information gain is based on the entropy of the classes. The entropy of a
subtree set S is defined as
$$H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-,$$
and for a split of S into a left part S_l and right part S_r, the information gain is the value
$$IG(S) = H(S) - \frac{|S_l|}{|S|} H(S_l) - \frac{|S_r|}{|S|} H(S_r),$$
which a good split maximizes.
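A minimal Python sketch of these two criteria for a label vector y ∈ {−1, +1}^{|S|} (the helper names are illustrative):

import numpy as np

def gini(y):
    p_pos = np.mean(y == 1)
    p_neg = 1.0 - p_pos
    return p_pos * (1 - p_pos) + p_neg * (1 - p_neg)

def entropy(y):
    p = np.array([np.mean(y == 1), np.mean(y == -1)])
    p = p[p > 0]                          # treat 0 log 0 as 0
    return -np.sum(p * np.log2(p))

def info_gain(y, y_left, y_right):        # gain of splitting S into S_l, S_r
    n = len(y)
    return entropy(y) - (len(y_left) / n) * entropy(y_left) \
                      - (len(y_right) / n) * entropy(y_right)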
One could continue splitting until each leaf node only has a single label. But this
would typically overfit. It is common to regularize the solution by fixing the total
number of nodes in a decision tree. At each step, any node can be split, and the
one which scores the highest on the splitting criterion (i.e., smallest Gini impurity
or largest information gain) is chosen. Alternatively, one can stop splitting when no
available split achieves a certain threshold.
Decision trees are quite general, and also extend directly to categorical data, or
data that is mixed with some coordinates as real values and some as categorical. It
can handle more than two labels. Moreover, it can naturally handle coordinates with
different units.
The procedural structure makes them easy to explain, and perhaps audit. The
classification decision is made using a trait-by-trait explanation that users even
unfamiliar with linear algebra or probability may be able to understand.
On the other hand, more high-powered methods can often provide better prediction
results in aggregate. A common way to make decision trees more powerful is to
build several decision trees and then let them vote on the final classification using
majority voting, like in a k-NN classifier. The two most common ways are bagging
(which resamples the data, then builds a new tree) and boosting (which builds each
subsequent tree based on what was poorly classified in previous ones). A more complex
method to build and combine trees is the random forest technique, which often
achieves among the best classification results. However, with these more complex
combinations, the explainability is sacrificed as a result.
Consider analyzing medical history, genetic, dietary, and exposure data related to
cancer patients. Your goal is to predict if a patient with a certain medical history
will get cancer. After splitting data into a test and training set, on the training data
you build a decision tree with a small number of branching points. It achieves 89%
accuracy on the test data.
However, you hope you can get a better performance result, and so also try a random
forest approach, which is far more complex. It improves the accuracy to 92% on
the test data. You need to report back to your medical colleagues to try to provide
actionable decisions to improve diagnosis and treatment. Which model should you
report? What percentage improvement is worth sacrificing the explainable model of
a decision tree to a more opaque model of a random forest or similar approach?
Now consider the application was not medical, such as predicting cancer, but was
in predicting which ad to place on a website so people are more likely to click on it.
Would your decision on which classifier type to use be the same?
Geometry of Neurons
[Figure: a single neuron. Inputs x₁, x₂, . . . , x_d arrive with weights w₁, w₂, . . . , w_d, and the neuron outputs a value in {−1, +1} depending on whether $\sum_{j=1}^{d} w_j x_j - b = \langle x, w \rangle - b > 0$.]
A neural network is then just a network or directed graph of neurons like these.
Typically, these are arranged in layers. In the first layer, there may be d input values
x₁, x₂, . . . , x_d. These may provide the input to t neurons (each neuron might use
fewer than all inputs). Each layer produces an output y₁, y₂, . . . , y_t. These outputs
then serve as the input to the second layer, and so on.
In a neural net, typically each x_i and y_i is restricted to a range [−1, 1] or [0, 1] or
[0, ∞), not just the two classes {−1, +1}. Since a linear function does not guarantee
this range on its output, to achieve this at the output of each node (instead of a binary
threshold), one typically adds an activation function φ(y). Common ones are
• hyperbolic tangent: $\phi(y) = \tanh(y) = \frac{e^y - e^{-y}}{e^y + e^{-y}} \in [-1, 1]$
• sigmoid: $\phi(y) = \frac{1}{1 + e^{-y}} = \frac{e^y}{e^y + 1} \in [0, 1]$
• ReLU: $\phi(y) = \max(0, y) \in [0, \infty)$.
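Each is one line of Python; a minimal sketch:

import numpy as np

def tanh_act(y): return np.tanh(y)                # range [-1, 1]
def sigmoid(y):  return 1.0 / (1.0 + np.exp(-y))  # range [0, 1]
def relu(y):     return np.maximum(0.0, y)        # range [0, inf)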
These functions are not linear, and not binary. They act as a “soft” version of binary.
The hyperbolic tangent and sigmoid stretch values near 0 away from 0. Large values
stay large (in the context of the range). So it makes most values almost on the
boundaries of the range. And importantly, they are differentiable.
The ReLU has become very popular. It is not everywhere differentiable, but it is
convex. It is basically the most benign version of activation, and is most like the
original neuron, in that if the value y is negative, it gets snapped to 0, and if it is
positive, it keeps its value.
A two-layer neural network is already a powerful way to build classifiers, and
with enough neurons in the middle layer, is capable of learning any function—its VC
dimension is infinite. However, deep neural nets with 3 or often many more layers
(say 20 or 100 or more) have become very popular due to their effectiveness in many
tasks, ranging from image classification to language understanding. To be effective,
typically, this requires heavy engineering in how the layers are defined, and how the
connections between layers are made.
[Figure: a deep neural network. The input x passes through layers of neurons with activations φ, producing hidden layer values h₁, h₂, h₃ ∈ R^k and finally a single output a in {−1, +1}; a constant −1 input provides the offsets.]
Once the connections are determined, then the goal is to learn the weights on
each neuron so that for a given input, a final neuron fires if the input satisfies some
pattern (e.g., the input are pixels from a picture, and it fires if the picture contains a
car). This is theorized to be “loosely” how the human brain works, although neural
nets have largely diverged in how they learn, compared to attempts to replicate
the structure and process of learning in the human brain.
Given a data set X with labeled data points (x, y) ∈ X (with x ∈ Rd and y ∈
{−1, +1}), we already know how to train a single neuron so for input x it tends to
fire if y = 1 and not fire if y = −1. It is just a linear classifier! So, we can use the
perceptron algorithm, or gradient descent with a well-chosen loss function.
However, for neural networks to attain more power than simple linear classifiers,
they need to be at least two layers, and are often deep (e.g., for “deep learning”) with
20, 100, or more layers. For these networks, the perceptron algorithm no longer works
since it does not properly propagate across layers. However, a version of gradient
descent called back-propagation can be used. In short, it computes the gradient across
the edge weights in a network by chaining partial derivatives backwards through the
network.
Ultimately the network represents a function g : Rd → R, from which the sign
sign(g(x)) predicts the label of x. We can evaluate a given pair (xi, yi ) using a loss
function (yi g(xi )), and to use within a (stochastic) gradient descent framework, we
need to calculate the gradient of (yi g(xi )). However, there are many model param-
eters (each edge weight of each neuron in each layer), and they have complicated
dependencies. Back-propagation has the insight that the derivative of each model
parameter (which compose to form the gradient) can be calculated from the last
layer, then the second last layer, and so on. This is an application of the chain rule
from calculus. In particular, we can write the derivative with respect to a weight in
the last layer by the chain rule.
Now set z = yg(x) = y a^L; for notational coherence, we denote the output g(x) as the
last activation layer a^L. Then for l = L, a^L is a scalar, so we set i = 1 for consistent
notation, and we can derive
$$\frac{d\ell(z)}{dW^L_{1,j}} = \frac{d\ell(z)}{dz} \cdot \frac{dz}{da^L_1} \cdot \frac{da^L_1}{dW^L_{1,j}} = \ell'(z) \cdot y \cdot h^{L-1}_j,$$
where ℓ′ is the derivative of the loss function, and noting that the derivative
$da^L_1 / dW^L_{1,j}$ is linear (from a dot product), so it is simply $h^{L-1}_j$. Since each of
these terms was chosen to be computable (since $h^{L-1}_j$ and $a^L_1$ are all computed in
the process of computing g(x)), we can derive the gradient for each weight $W^L_{1,j}$
in the last layer. If the neural net was only one layer, then $h^{L-1}_j = x_j$, and this is
nothing more than gradient descent for a linear classifier.
Now the derivative of the weights in the second-to-last layer L − 1 is computed
in a similar way as the last layer L. That is, by the chain rule, the derivative of ℓ(z)
with respect to a weight $W^{L-1}_{i,j}$ is the product of the already-computed derivative
of ℓ(z) with respect to the activation $a^{L-1}_i$, a factor involving (φ^{L−2})′, the
derivative of the activation function φ^{L−2}, and the stored hidden value feeding that weight;
again each of these terms has been or can be computed. Note that we have computed
$\frac{d\ell(z)}{da^{L-1}_k}$ for each k, and in this process we have computed each
$\frac{d\ell(z)}{da^{L-2}_i}$. The second big insight (other than reusing each
$\frac{d\ell(z)}{da^{l}_i}$) is that in each following layer there is only one sum
over the previous activation vector—it is not a sum of sums and so on.
In summary, to compute the derivative for a specific data point (x, y), we “feed
forward” x into the network and compute all activation a^l and hidden layer h^l
vectors. Then we use y to evaluate the loss function, and “back propagate” the loss
on the network, using the chain rule to compute relevant partial derivatives as we
go. Thus, the cost of one step is ultimately proportional to the size of the network,
and is reasonably efficient.
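The following is a minimal Python sketch of one feed-forward/back-propagate step on a two-layer network with sigmoid hidden units and the logistic loss; the shapes, initialization, and step size are illustrative assumptions, not the text's prescription.

import numpy as np

def phi(a):                                   # sigmoid activation
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, y, W1, W2, step=0.1):
    # feed forward: compute and store activations
    a1 = W1 @ x                               # hidden pre-activations, shape (k,)
    h1 = phi(a1)                              # hidden layer values
    a2 = W2 @ h1                              # final scalar output g(x), shape (1,)
    z = y * a2[0]
    # back propagate: chain partial derivatives from the last layer backwards
    dz = -1.0 / (1.0 + np.exp(z))             # derivative of ln(1 + exp(-z))
    da2 = dz * y                              # dl/da2 (a scalar)
    dW2 = da2 * h1[None, :]                   # dl/dW2 reuses the stored h1
    da1 = da2 * W2[0] * h1 * (1 - h1)         # chain through phi'(a1) = h1(1-h1)
    dW1 = np.outer(da1, x)                    # dl/dW1 reuses the stored input x
    return W1 - step * dW1, W2 - step * dW2   # one gradient step on both layers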
Exercises
9.1 Consider the following “loss” function: ℓ_i(z_i) = (1 − z_i)²/2, where for a data
point (x_i, y_i) and prediction function f, z_i = y_i · f(x_i). Predict how this might
work within a gradient descent algorithm for classification.
9.2 Consider a data set (X, y), where each data point (x1,i, x2,i, yi ) is in R2 × {−1, +1}.
Provide the pseudo-code for the Perceptron Algorithm using a polynomial kernel of
degree 2. You can have a generic stopping condition, where the algorithm simply
runs for T steps for some parameter T. (There are several correct ways to do this,
but be sure to explain how to use a polynomial kernel clearly.)
9.4 Consider the following Algorithm 9.7.1, called the Double-Perceptron. We will
run this on an input set X consisting of points X ∈ Rn×d and corresponding labels
y ∈ {−1, +1}.
For each of the following questions, the answer can be faster, slower, the same,
or not at all, and should be accompanied by an explanation.
1. Compared with Algorithm 9.2.1, explain how this algorithm will converge.
2. Next consider transforming the input data set X (not the y labels) so that all
coordinates are divided by 2. Now if we run Double-Perceptron, how will the
results compare to regular Perceptron (Algorithm 9.2.1) on the original data set
X?
3. Finally, consider taking the original data set (X, y) and multiplying all entries in y
by −1, then running the original Perceptron algorithm. How will the convergence
compare to running the same Perceptron algorithm on the original data set?
9.5 Consider a matrix A ∈ Rn×4 . Each row represents a customer (there are n
customers in the database). The first column is the age of the customer in years, the
second column is the number of days since the customer entered the database, the
third column is the total cost of all purchases ever by the customer in dollars, and the
last column is the total profit in dollars generated by the customer. So each column
has a different unit.
For each of the following operations, decide if it is reasonable or unreasonable
(think about the units).
1. Run simple linear regression (no regularization) using the first three columns to
build a model to predict the fourth column.
2. Use k-means clustering to group the customers into 4 types using Euclidean
distance between rows as the distance.
3. Use PCA to find the best two-dimensional subspace, so we can draw the customers
in R² in a way that has the least projection error.
4. Use the perceptron to build a linear classification model based on the first three
columns to predict if the customer will make a profit +1 or not −1.
5. Use a decision tree to build a model based on the first three columns to predict if
the customer will make a profit +1 or not −1.
6. Use a KNN classifier (with Euclidean distance and k = 3) to build a model based
on the first three columns to predict if the customer will make a profit +1 or not
−1.
9.7 This question explores linear classifiers for non-linearly separable point sets.
1. Construct and report a set of labeled points (X, y) in R2 that is not linearly
separable (provide a plot).
2. Explain what will happen if you run the perceptron algorithm for a linear classifier
on this data set? (Do not set a fixed upper bound on T the number of steps)
3. Describe another algorithm which would provide an acceptable linear classifier
(but not necessarily a separator) for the set of points.
9.9 Consider a decision tree algorithm where at each split node a linear SVM is run
to construct a linear classifier, but not necessarily aligned with a single coordinate.
List at least one advantage and two disadvantages of this algorithm compared to the
standard decision tree.
Chapter 10
Graph Structured Data
[Figure: the running example graph, with vertices a, b, c, d, e, f, g, h and edges {a,b}, {a,c}, {a,d}, {b,d}, {c,d}, {c,e}, {e,f}, {e,g}, {f,g}, {f,h}.]
A diagonal matrix D raised to a power p, written D^p, simply replaces each diagonal
entry D_{i,i} with D_{i,i}^p. In particular, this means it is easy to invert a diagonal matrix, since D^{-1}
just replaces each D_{i,i} with 1/D_{i,i}. One can check that this definition is consistent
with the more general definition of a matrix power, that is, (D^p)^{1/p} = D.
Again using our example graph, we can define the adjacency matrix A and degree matrix D:
$$A = \begin{bmatrix} 0&1&1&1&0&0&0&0 \\ 1&0&0&1&0&0&0&0 \\ 1&0&0&1&1&0&0&0 \\ 1&1&1&0&0&0&0&0 \\ 0&0&1&0&0&1&1&0 \\ 0&0&0&0&1&0&1&1 \\ 0&0&0&0&1&1&0&0 \\ 0&0&0&0&0&1&0&0 \end{bmatrix} \qquad D = \begin{bmatrix} 3&0&0&0&0&0&0&0 \\ 0&2&0&0&0&0&0&0 \\ 0&0&3&0&0&0&0&0 \\ 0&0&0&3&0&0&0&0 \\ 0&0&0&0&3&0&0&0 \\ 0&0&0&0&0&3&0&0 \\ 0&0&0&0&0&0&2&0 \\ 0&0&0&0&0&0&0&1 \end{bmatrix}.$$
Taking D to the power −1/2 is now
$$D^{-1/2} = \mathrm{diag}(0.577,\; 0.707,\; 0.577,\; 0.577,\; 0.577,\; 0.577,\; 0.707,\; 1).$$
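As a check, a short Python sketch builds A, D, and D^{−1/2} with numpy; the edge list and the vertex order (a, b, . . . , h) follow the example above.

import numpy as np

edges = [(0,1),(0,2),(0,3),(1,3),(2,3),(2,4),(4,5),(4,6),(5,6),(5,7)]
n = 8
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0                      # undirected adjacency

D = np.diag(A.sum(axis=1))                       # degree matrix
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))  # invert each diagonal entry
print(np.round(np.diag(D_inv_sqrt), 3))          # [0.577 0.707 0.577 ... 1.]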
The initial state q represents a probability distribution over vertices. For instance,
if the mental model of this process is a random walk around the vertices of a graph,
then an initial state q (i.e., specifically at a vertex b ∈ V with probability 1, in our
running example) is
qT = [0 1 0 0 0 0 0 0].
A more general model is when there is not a single vertex which dictates the state,
but a distribution over possible vertices. Then if there is a 10% chance of being in
state a, a 30% chance of being in state d, and a 60% change of being in state f , this
is represented as
qT = [0.1 0 0 0.3 0 0.6 0 0].
In general we need to enforce that q ∈ Δ^{|V|}, that is, it represents a probability
distribution, so
• each q[i] ≥ 0, and
• $\sum_i q[i] = 1$.
Now the probability transition matrix P is a column-normalized matrix. That is,
each column must describe a probability distribution; each column P_j ∈ Δ^{|V|}, so the
entries are non-negative and sum to 1. Each column i represents the probability distribution
of where the next vertex would be, conditioned on starting at vertex v_i. That is, entry
P_{j,i} describes that from vertex v_i the probability is P_{j,i} that the next vertex is v_j.
Using an adjacency matrix A, we can derive P by normalizing all of the columns, so
P_j = A_j / ‖A_j‖₁.
The running example adjacency matrix A can derive a probability transition matrix
P as
$$P = \begin{bmatrix} 0 & 1/2 & 1/3 & 1/3 & 0 & 0 & 0 & 0 \\ 1/3 & 0 & 0 & 1/3 & 0 & 0 & 0 & 0 \\ 1/3 & 0 & 0 & 1/3 & 1/3 & 0 & 0 & 0 \\ 1/3 & 1/2 & 1/3 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/3 & 0 & 0 & 1/3 & 1/2 & 0 \\ 0 & 0 & 0 & 0 & 1/3 & 0 & 1/2 & 1 \\ 0 & 0 & 0 & 0 & 1/3 & 1/3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1/3 & 0 & 0 \end{bmatrix}.$$
We can also present this as a graph, where directed edges show the transition
probability from one node to another. The lack of an edge represents 0
probability of a transition. This sort of representation is already very cluttered, and
thus does not scale to much larger graphs.
[Figure: the directed transition graph for P on vertices a–h, with each edge labeled by its transition probability (1/3, 1/2, or 1).]
Note that although the initial graph was undirected, this new graph is directed. For
instance, edge (a, b) from a to b has probability 1/3 while edge (b, a) from b to a has
probability 1/2, or more dramatically, edge (f, h) from f to h has probability 1/3,
whereas edge (h, f) from h to f has probability 1; that is, vertex h always transitions
to f.
Usually, only one of these two interpretations is considered. They correspond to quite
different algorithms and purposes, each with its own advantages. We will discuss
both.
A Markov chain is ergodic if there exists some t such that for all n ≥ t, each entry
in P n is positive. This means that from any starting position, after t steps there is
always a chance that we are in every state. That is, for any q, qn = P n q is positive in
all entries. It is important to make the distinction in the definition that it is not that
we have some positive entry for some n ≥ t, but for all n ≥ t, as we will see.
To characterize when a Markov chain is ergodic, it is simpler to rule out the cases
when it is not ergodic, and then if it does not satisfy these properties, it must be
ergodic. There are three such non-ergodic properties:
• It is cyclic. This means that it alternates between different sets of states every 2
or 3 or in general p steps. This is strict; even a single low probability event that
deviates from this makes it not cyclic. The cyclic nature does not need to be on
the entire graph; it may only be on a disconnected part of the graph.
• It has absorbing and transient states. This corresponds to some Markov chains
which can separate V into two classes A, T ⊂ V so that if a random walk leaves
some node in T and lands in a state in A, then it never returns to any state in T.
In this case, the nodes A are absorbing, and the nodes in T are transient. Note
that this only happens when the initial graph is directed, so the walk cannot go
backwards on an edge.
Two such examples written as transition matrices (left: a cyclic chain; right: a chain where states {3, 4, 5, 6} are absorbing and states {1, 2} are transient, via the small 1/100 leak):
$$\begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix} \qquad \begin{bmatrix} 1/2 & 1/2 & 0 & 0 & 0 & 0 \\ 1/2 & 49/100 & 0 & 0 & 0 & 0 \\ 0 & 1/100 & 1/4 & 1/4 & 1/4 & 1/4 \\ 0 & 0 & 1/4 & 1/4 & 1/4 & 1/4 \\ 0 & 0 & 1/4 & 1/4 & 1/4 & 1/4 \\ 0 & 0 & 1/4 & 1/4 & 1/4 & 1/4 \end{bmatrix}$$
• It is not connected. This property indicates that there are two sets of nodes
A, B ⊂ V such that there is no possible way to transition from any node in A to
any node in B.
Two such examples written as transition matrices; in each, the blocks have no transitions between them:
$$\begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \begin{bmatrix} 1/2 & 1/2 & 0 & 0 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/3 & 1/2 & 1/3 & 0 \\ 0 & 0 & 1/3 & 0 & 1/3 & 0 \\ 0 & 0 & 1/3 & 1/2 & 1/3 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$
When it IS Ergodic
From now on, we will assume that the Markov chain is ergodic. At a couple of critical
points, we will show simple modifications to existing chains that ensure this is true.
Now there is an amazing property that happens when a Markov chain is ergodic.
The limit P* = limₙ→∞ Pⁿ is well-defined; the powers Pⁿ converge to a limiting matrix
P*. Now let q* = P*q. That is, there is also a limiting state q*, and it does not depend
on the choice of q.
In Python, this limiting state can be recovered from the eigenvector of P for its largest eigenvalue (here assumed to be the first one returned):
import numpy as np
from numpy import linalg as LA
l, V = LA.eig(P)                       # eigenvalues l and eigenvectors V (columns) of P
print(V[:, 0] / LA.norm(V[:, 0], 1))   # normalize that eigenvector to sum to 1
Note that this distribution is not uniform and is also not directly proportional
to the transition probabilities. It is a global property about the connectivity of the
underlying graph. Vertex h which is the most isolated has the smallest probability,
and more central nodes have higher probability.
The second eigenvalue of P is 0.875 which is small enough that the convergence
is fast. If the part of the graph containing {a, b, c, d} was made harder to transition to
or from the other part of the graph, this value could be much larger (e.g., 0.99). On
the other hand, if another edge was added between say d and f , then this would be
10.1 Markov Chains 245
much smaller (e.g., 0.5), and the convergence would be even faster. We can retrieve
the second eigenvalue in Python as
print(l[1].real)
This implicitly defines a Markov chain on the state space V. The transition matrix
is defined by the algorithm but is not realized as a matrix. Importantly, if the chain
is ergodic, then there exists some t such that for all i ≥ t, Pr[v_i = v] = w(v)/W.
This not only holds as a limiting notion, but also for some finite t (even
if V is continuous), through a property called “coupling from the past.” However,
determining when such a t has occurred, or analytically placing an upper bound on the
value required for t, can be challenging.
Often the goal is to create many samples from w, which can then be used as
a proxy for the unknown w to estimate various quantities via the concentration of
measure bounds. The most formal analysis of such algorithms often dictates that it is
run for t steps: take one sample, then run for another t steps and take one additional
sample, and so on, for tk steps in total to get k samples. This of course seems wasteful
but is necessary to strictly ensure samples are independent.
In practice, it is more common to run for t = 1000 steps (the “burn in” period),
then take the next k = 5000 steps as a random sample. This repeats a total of only
t + k steps. This second method has “auto-correlation,” as samples vi and vi+1 are
likely to be “near” each other (either since K is local, or because it did not accept a
new state). Officially, we should take only one point every s steps, where s depends
on the degree of auto-correlation. But in practice, we take all k samples but treat
them (for the purpose of concentration bounds) as k/s samples.
10.2 PageRank
Search engines were revolutionized in the late 1990s when Google was formed, with
the PageRank algorithm as the basis for ranking webpages within its search engine.
Before PageRank, other search engines (e.g., Altavista, Lycos, and Infoseek) and
indexes (e.g., Yahoo! and LookSmart) were based almost entirely on a combination
of the content of the pages and manually curated lists. These aspects are still used as
part of an overall search and ranking method, but PageRank added the perspective
of also considering the importance of pages based on the global structure of the
webgraph.
The webgraph is a graph where each vertex is a webpage, and directed edges are
created when one webpage links to another one. Search engines implicitly had
stored these graphs already, since they ran “crawlers.” These were algorithms which
randomly followed links from one webpage to another, in this case, with the purpose
of cataloging the content of each webpage so it could be put into a large nearest
neighbor search algorithm (often based on cosine similarity using bag-of-words
models or Jaccard similarity using k-gram models).
Intuitively, the PageRank model extended the idea of a crawler to be a “random
surfer,” someone who randomly browses webpages. The goal was to identify web-
pages which a random surfer would commonly reach, and to mark them as more
important in the search engine. This model can be formalized as a Markov chain,
and the importance is given by the limiting state distribution q∗ .
However, this only works when the graph is ergodic, and the webgraph is very
non-ergodic. It is not connected. And there are webpages which are linked to, but do
not link to, anything else; these are absorbing states. Worse, spammers could (and
do!) intentionally create such sets of webpages to capture the attention of crawlers
and random surfers.
Example: Webgraph
This includes a large part with black and gray edges; this two-layer structure
mirrors those used by spammers to attract the traffic of random surfers and automated
ranking in search engines. The gray edges are planted links to the gray pages they
control. These gray edges might be in comments on blogs, or links on Twitter, or
any other clickbait. Then the gray pages can be easily updated to direct traffic to
the black pages which pay the spammers to get promoted. The effect of this sort of
structure can be reduced with PageRank but is not completely eliminated.
So PageRank adds one additional, but crucial, change to the standard Markov
chain analysis of the webgraph: teleportation. Roughly for every seven steps (15% of
the time), it instructs the random surfer to jump to a completely random node in the
webgraph. This makes the graph connected, eliminates absorbing states, increases
convergence rates, and jumps out of traps set up by webgraph spammers. That is, it
is now ergodic, so the limiting state q∗ exists. Moreover, there is an efficient way to
implement this without making the webgraph artificially dense, and exploding the
memory requirements.
Specifically, instead of P, the random-surfer chain uses the matrix
$$R = (1 - \beta)P + \beta Q, \qquad \text{where } Q = \tfrac{1}{n} \mathbf{1}\mathbf{1}^T \text{ has every entry } 1/n.$$
Here 1 represents the length-n all-1s vector. Since Q has identical columns, it does
not matter which column is chosen, and q is eliminated from the second term:
$$Rq = (1 - \beta)Pq + \frac{\beta}{n}\mathbf{1}.$$
More importantly, the second term is now a constant; it just adds β/n to each coordinate
created by the sparse matrix multiply (1 − β)Pq. The state after i rounds can be
computed inductively as
$$q^{(i)} = (1 - \beta) P q^{(i-1)} + \frac{\beta}{n} \mathbf{1}.$$
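A minimal Python sketch of this iteration (the iteration count is an illustrative choice; in practice one iterates until q stops changing):

import numpy as np

def pagerank(P, beta=0.15, iters=100):
    n = P.shape[0]
    q = np.ones(n) / n                        # start from the uniform distribution
    for _ in range(iters):
        q = (1 - beta) * (P @ q) + beta / n   # sparse multiply plus a constant
    return q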
It is generally accepted that some webpages are more trustworthy than others,
for instance, Wikipedia pages, top-level pages at reputable universities, established
newspapers, and government pages. These sources typically have a formal review
process and a reputation to uphold. On the other hand, pages which are parts of blogs,
comment sections on news articles, and personal social media pages are typically
much less trustworthy.
With this in mind, variants on PageRank have been developed as a way to help
combat advanced spamming attempts. This works by having trusted pages more
frequently teleported to, and less likely to be teleported from.
This is easy to implement by adjusting the 1 vector in (β/n)1. Denote the limiting
vector of this process by the trustRank score. It has been suggested that webpages
which deviate in their standard PageRank score and their trustRank score are more
likely to be spam webpages, and can be down-weighted in the search results.
What are the ethical considerations one should consider when choosing to imple-
ment this? And what considerations should be taken in how to build the updated
trust weighting?
And as a result, the key step in top-down graph clustering is to find the cluster S
(and complement T = V \ S) that has the minimum ncut(S, T). Dividing by vol(S)
and vol(T) prevents this measure from finding either S or T that is too small, and
the cut(S, T) in the numerator ensures that a small number of edges cross this
partition.
In the running example graph, the cluster S = {a, b, c, d} and the singleton cluster
S′ = {h} both have a small cut value, with cut(S, S̄) = 1 and cut(S′, S̄′) = 1. The
volumes however are very different, with vol(S) = 6 and vol(S̄) = 5, while vol(S′) = 1
and vol(S̄′) = 10.
[Figure: the example graph with the partition S = {a, b, c, d} (and S̄ = {e, f, g, h}) and the singleton partition S′ = {h} marked.]
The difference in volumes shows up in the normalized cut scores for S and S′.
Specifically ncut(S′, S̄′) = 1/1 + 1/10 = 1.1, whereas ncut(S, S̄) = 1/6 + 1/5 ≈ 0.367. Overall
S results in the smallest normalized cut score; this aligns with how it intuitively is
the partition which best divides the vertices in a balanced way without separating
along too many edges.
Affinity Matrix
This algorithm will start with the adjacency matrix A of a graph, and then transform
it further. However, the adjacency matrix A need not be 0–1. It can be filled with the
similarity value defined by a similarity function s : X × X → [0, 1] between elements
of a data set X; then A stands for affinity. The degree matrix is still diagonal but is
now defined as the sum of elements in a row (or column—it must be symmetric).
The remainder of the spectral clustering formulation and algorithm will be run the
same way; however, we continue the description with the graph representation as it
is cleaner. This generalization allows us to apply spectral clustering to an arbitrary
data set with an appropriate similarity measure s.
When the similarity of a pair is very small, it is often a good heuristic to round the
values down to 0 in the matrix to allow algorithms to take advantage of fast sparse
linear algebraic subroutines.
The key step in spectral clustering is found by mapping a graph to its Laplacian,
and then using the top eigenvector to guide the normalized cut. The term “spectral”
refers to the use of eigenvalues. We start by defining the (unnormalized) Laplacian
matrix of a graph G with n vertices as L0 = D − A, where as before A is the adjacency
(or affinity) matrix and D is the degree matrix. When A is an affinity matrix (each
entry a similarity), D is again a diagonal matrix, where the jth diagonal entry is
defined as $D_{j,j} = \sum_{i=1;\, i \neq j}^{n} A_{i,j}$.
$$L_0 = D - A = \begin{bmatrix} 3 & -1 & -1 & -1 & 0 & 0 & 0 & 0 \\ -1 & 2 & 0 & -1 & 0 & 0 & 0 & 0 \\ -1 & 0 & 3 & -1 & -1 & 0 & 0 & 0 \\ -1 & -1 & -1 & 3 & 0 & 0 & 0 & 0 \\ 0 & 0 & -1 & 0 & 3 & -1 & -1 & 0 \\ 0 & 0 & 0 & 0 & -1 & 3 & -1 & -1 \\ 0 & 0 & 0 & 0 & -1 & -1 & 2 & 0 \\ 0 & 0 & 0 & 0 & 0 & -1 & 0 & 1 \end{bmatrix}.$$
Note that the entries in each row and column of L0 sum up to 0. We can think of
D as the flow into a vertex, and think of A as the flow out of the vertex (related
to the Markov chain formulation). This describes a process where the “flow keeps
flowing,” so it does not get stuck anywhere. That is, as much flows in flows out of
each vertex.
Again following our running example, we compute the eigenvectors and eigenvalues
of the Laplacian L₀: the eigenvalues λ₁, . . . , λ₈ with eigenvectors u₁, . . . , u₈.
Note that λ₁ = 0, and u₁ is a constant vector (each entry is 1/√8). The
second eigenvalue is λ₂ = 0.278, and its eigenvector u₂ is known as the Fiedler vector.
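A short Python sketch (reusing A and D from the earlier sketch) recovers these values numerically:

import numpy as np

L0 = D - A                                 # assumes A, D built as above
lam, U = np.linalg.eigh(L0)                # eigh: ascending eigenvalues, real vectors
print(np.round(lam[:2], 3))                # lambda_1 = 0, lambda_2 = 0.278
fiedler = U[:, 1]                          # the Fiedler vector u_2 (sign is arbitrary)
S = np.where(fiedler < 0)[0]               # one side of the suggested cut at tau = 0
print(S)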
[Figure: the vertices plotted by their u₂ coordinate (horizontal axis, from v₂ = −1 to v₂ = 1) and u₃ coordinate (vertical axis, with v₃ = −1 below).]
This k-dimensional representation hints at how to perform the cut into two subsets.
In fact, the typical strategy is to only use a single eigenvector, u2 . This provides a
1-dimensional representation of the vertices, a sorted order. There are then two
common approaches to find the cut.
The first approach is to just select vertices with values less than 0 in S, and those
greater or equal to 0 in S̄. For large complex graphs, this does not always work as
well as the next approach; in particular, there may be two vertices which have values
both very close to 0, but one is negative and one is positive. Often, we would like to
place these into the same cluster.
The second approach is to consider the cut defined by any threshold in this sorted
order. For instance, we can define Sτ = {v_i ∈ V | u_{2,i} ≤ τ} for some threshold
τ. In particular, it is easy to find the choice of Sτ which minimizes ncut(Sτ, S̄τ).
Normalized Laplacian
However, to optimize the cut found using this family of approaches to minimize
the normalized cut, it is better to use a different form of the Laplacian known as
the normalized Laplacian L of a graph. For a graph G, an identity matrix I, and
the graph’s diagonal matrix D and adjacency matrix A, its normalized Laplacian is
defined as
L = I − D^{−1/2} A D^{−1/2}.
We can also convert L0 to the normalized Laplacian L using the D^{−1/2} matrix, as
L = D^{−1/2} L0 D^{−1/2}. Using the normalized Laplacian is mechanically the same
as using the (unnormalized) Laplacian: the second eigenvector provides a sorted
order of the vertices, and the best normalized cut can be found among the subsets
according to this sorted order.
Then, to complete the spectral clustering, this procedure is recursively repeated
on each subset until some desired resolution has been reached.
The following table shows the eigenvalues λ1, . . . , λ8 and eigenvectors u1, . . . , u8 of
the normalized Laplacian L for our example graph.
[Figure: the vertices plotted by their coordinates in u_2 (horizontal) and u_3 (vertical), computed from the normalized Laplacian L.]
Compared to the plot based on L0, this one is even more dramatically stretched
out along u_2. Also note that while the suggested cut along u_2 is still at τ = 0, the
orientation is flipped: the vertices {a, b, c, d} are now all positive,
while the others are negative. This is because the eigendecomposition is not unique
in the choice of signs of the eigenvectors, even when the eigenvalues are distinct. It is also
worth observing that in the second coordinate, defined by u_3, vertex f now has a
different sign than vertices g and e, because the normalized Laplacian values large-cardinality
cuts more than the unnormalized Laplacian does.
10.4 Communities in Graphs

Finding relationships and communities from a social network has been a holy grail
of the Internet data explosion since the late 90s, before Facebook, and even before
MySpace. This information is important for targeted advertising, for identifying
influential people, and for predicting trends before they happen, or spurring them on
to make them happen.
At the most general level, a social network is a large directed graph G = (V, E).
For decades, psychologists and others studied small-scale networks (100 friends,
seniors in a high school); anything larger was too hard to collect and work with.
Mathematicians, in turn, studied large graphs, famously as properties of random
graphs. For instance, the Erdős-Rényi model assumes that each pair of vertices has
an edge with probability p ∈ (0, 1). As p increases as a function of |V|, one can
study properties of the connectedness of the graph: one large connected component
forms, then the entire graph becomes connected, then cycles and more complex structures
appear at greater rates.
There are many large active and influential social networks (Facebook, Twitter, Insta-
gram, etc.). These are influential because they are an important source of information
and news for their users. But in many cases, the full versions of these networks are
closely guarded secrets by the companies. These networks are expensive to host and
maintain, and most of these hosting companies derive value and profit by applying
targeted advertising to the users.
Various example networks are available through inventive and either sanctioned
or unsanctioned information retrieval and scraping techniques. Is it ethical to try to
scrape public or semi-private posts of users for either academic or entrepreneurial
reasons?
Some social media companies attempted to quantify the happiness derived by
users based on how they interacted with the social network. This included some
approaches by which the study potentially decreased the happiness of the users (e.g.,
showing only negative posts). Was it useful to perform these experiments, and how
and what could go wrong with this ability?
Before large social networks on the Internet, collecting networks was a painstaking
and noisy process, and tracking them over time was even more difficult. Now it is
possible to study the formation of these structures, and to collect this data at a large
scale on the fly. This allows for more quantitative studies of many questions. An
example anthropological question is: why do people join groups?
Consider a group of people C that have tight connections (say in a social network).
Consider two people X (Xavier) and Y (Yolonda). Who is more likely to join group
C?
• X has three friends in C, all connected.
• Y has three friends in C, none connected.
Before witnessing large social networks, both sides had viable theories with
reasonable arguments. Arguments for X were that there is safety and trust in
friends who know each other. Arguments for Y were that there is independent
support for joining the group; hearing about it from completely different sources
might be more convincing than from a single group of the same size. For static
network data, it is probably impossible to distinguish empirically between these
cases. But this can be verified by simply observing which scenario is more common as
networks form and users decide to join pre-defined labeled groups.
It turns out the answer is: X. Vertices tend to form tightly connected subsets of
graphs.
10.4.2 Betweenness
The betweenness of an edge (a, b), written betw(a, b), is defined as the number of
pairs of vertices whose shortest path passes through (a, b).
A large score may indicate that an edge does not represent a central part of a
community. If to get between two communities you need to take this edge, its
betweenness score will be high: it is a "facilitator edge," but not a community edge.
Similarly, the betweenness of a vertex is the number of shortest paths that go
through this vertex.
Calculating betw(a, b) for all edges can be time consuming. It typically requires
computing, for each vertex, the shortest paths to all other vertices (the all-pairs shortest
path problem). Then, for each vertex, its effect on every edge can be calculated by
running careful dynamic programming on the directed acyclic graph (DAG) defined
by its shortest paths.
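For illustration, the following minimal sketch computes edge betweenness with the NetworkX library (an assumed dependency, not one this text prescribes), whose implementation follows this shortest-path counting approach (Brandes' algorithm).

    import networkx as nx

    # A small illustrative graph: two triangles joined by one "facilitator" edge.
    G = nx.Graph([("a", "b"), ("b", "c"), ("c", "a"),
                  ("d", "e"), ("e", "f"), ("f", "d"),
                  ("c", "d")])  # the bridge between the two communities

    # One shortest-path DAG per source vertex, with dynamic programming to
    # accumulate, for each edge, the number of shortest paths through it.
    betw = nx.edge_betweenness_centrality(G, normalized=False)
    print(max(betw, key=betw.get))  # should report the bridge edge ('c', 'd')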
10.4.3 Modularity
Communities can be defined and formed without providing an entire partition of the
vertices, as in a clustering. The common alternative approach is to define a score on
each potential community C ⊂ V, and then search for subsets with a large score. The
most common score for this approach is called modularity; at a high level, it compares
the number of edges within C to the number expected in a random graph with the
same degree distribution. Searching over all subsets is infeasible, so instead
communities are found incrementally: once an initial guess is formed, or
starting from a single-vertex community C, it can be incrementally updated
to look for high-modularity communities. One can add either one vertex at a time
that most increases the score, or all vertices which individually increase the score.
Alternatively, a random walk, similar to the Metropolis algorithm, can be used to
explore the space of communities.
Exercises

10.1 Consider a Markov chain on 10 nodes defined by a 10 × 10 transition matrix M.
There are several ways to compute or estimate its stationary distribution q∗:

Matrix Power: Choose some large enough value t and create M^t. Then apply
q∗ = (M^t) q_0. There are two ways to create M^t: first, we can just let M^{i+1} = M^i · M,
repeating this process t − 1 times. Alternatively (for simplicity, assume t is a power
of 2), in log2 t steps create M^{2i} = M^i · M^i.
State Propagation: Iterate q_{i+1} = M q_i for some large enough number t of iterations.
Random Walk: Start with a fixed state q_0 = [0, 0, . . . , 1, . . . , 0, 0]^T where there
is only a 1 at the ith entry, and then transition to a new state with only a 1 in the
jth entry by choosing a new location proportional to the values in the ith column
of M. Iterate this some large number t_0 of steps to reach a state q′. (This is the
burn-in period.)
Now make t new steps starting at q′ and record the location after each step. Keep
track of how many times you have recorded each location, and estimate q∗ as the
normalized version (recall ∥q∗∥_1 = 1) of the vector of these counts.
Eigenanalysis: Compute the eigendecomposition of M, take the first eigenvector, and
L1-normalize it.
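A minimal sketch of the four approaches, assuming a column-stochastic 10 × 10 transition matrix M (the specific matrix of the exercise data is not reproduced here):

    import numpy as np

    rng = np.random.default_rng(0)

    def matrix_power(M, q0, t=1024):
        # Repeated squaring: log2(t) multiplications, assuming t is a power of 2.
        P = M.copy()
        for _ in range(int(np.log2(t))):
            P = P @ P
        return P @ q0

    def state_propagation(M, q0, t=1024):
        q = q0.copy()
        for _ in range(t):
            q = M @ q
        return q

    def random_walk(M, i0=0, t0=100, t=1024):
        # Assumes M is column-stochastic: column i holds node i's out-probabilities.
        counts = np.zeros(M.shape[0])
        state = i0
        for step in range(t0 + t):
            state = rng.choice(M.shape[0], p=M[:, state])
            if step >= t0:                    # record only after the burn-in period
                counts[state] += 1
        return counts / counts.sum()

    def eigen_analysis(M):
        vals, vecs = np.linalg.eig(M)
        v = np.real(vecs[:, np.argmax(np.real(vals))])  # eigenvector for lambda = 1
        return v / v.sum()                    # L1-normalize (also fixes a flipped sign)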
1. Run each method (with t = 1024, q0 = [1, 0, 0, . . ., 0]T , and t0 = 100 when
needed) and report the answers.
2. Rerun the Matrix Power and State Propagation techniques with initialization
q_0 = [0.1, 0.1, . . . , 0.1]^T. What value of t is required to get as close to the true
answer as with the other initial state?
3. Explain at least one Pro and one Con of each approach. The Pro should explain
a situation when it is the best option to use. The Con should explain why another
approach may be better for some situation.
4. Is the Markov chain ergodic? Explain why or why not.
5. Each matrix M row and column represents a node of the graph; label these from
1 to 10 starting from the top and from the left. What nodes can be reached from
node 4 in one step, and with what probabilities?
6. Repeat the above questions using PageRank with a teleportation rate of β = 0.15.
10.2 Design a cyclic graph with a cycle length of 4 that also has transient and
absorbing states.
11 Big Data and Sketching

Abstract The term big data refers to the easy collection and storage of large amounts
of data which, by virtue of its scale, forces a rethinking of linearly CPU-bound computational
approaches and analysis methodologies. These challenges, and the development
of approaches to handle them, were in many ways a necessary precursor to the
data science revolution. For complex phenomena, with small data, the central limit
theorem may not reach convergence, and there may not be enough data points to
successfully hold out data to faithfully cross-validate. With small data, abstract
analysis models are less essential: low-dimensional data does not need its dimension
reduced, and data may be clustered or classified by hand (perhaps with the help of
plotting).
Most of the data science analysis methods discussed work fine for moderate to
large data sets. But when data is humongous, one may need to limit the number of times
a single data set is read end-to-end. Without specialized strategies to deal with this
scale, humongous data may strangely limit analysis to very simplistic approaches.
However, there are three main simple and effective strategies to deal with this scale.
3. The third approach is sketching. This can be thought of as a refined and more
flexible version of sampling. Like sampling, this approach also reduces the size
footprint of the data, but stores it in some form of data structure S(X). These
structures are not lossless; one typically cannot recover the original data, but one
can ask the sketch data structure queries about the data. Moreover, the data
structure size typically does not depend on the size of the data, but rather on
the accuracy of the answers returned. A random sample is a type of sketch (e.g.,
retaining an (ε, δ)-PAC style bound with size roughly (1/ε²) log(1/δ)), but as we
will see, sketches can take many other forms.
Sketching is typically associated with the so-called streaming setting, where one
can make a single pass over the data. This means these sketch data structures can be
built in roughly as much time as it takes to read the data once, or can be maintained as
data is continually arriving. However, sketches can be considered in other scenarios;
for instance, in the distributed, parallel data setting, if we sketch each subset of data,
then algorithms which minimize data movement (e.g., MapReduce) may only
need to send the sketches between the parallel processors. These are much smaller,
and hence become more scalable in this setting as well.
11.1 The Streaming Model

Streaming is a model of computation that emphasizes space over all else. The goal
is to compute something using as little storage space as possible. So much so that we
cannot even store the input. Typically, you get to read the data once, you can update
some limited memory based on it, and then let the data go forever! Or sometimes,
less dramatically, you can make 2 or more passes on the data.
Formally, there is a stream A = a1, a2, . . . , an of n items. We will mainly
consider settings where either each ai ∈ [m] (is an element from a large domain of
size m) or each ai ∈ Rd is a d-dimensional vector (in some cases d = 1, so it is a
scalar). This means, the size of each ai is about log m bits (to represent an encoding
of which element) or d floating-point values (in the vector setting), and just to count
exactly how many items you have seen requires space log n bits. Unless otherwise
specified, log is used to represent log2 that is the base-2 logarithm. In the setting with
items from a discrete domain, the goal is usually to compute a summary S_A using
space that is only polynomial in log n and log m (e.g., log^{c1} n + log^{c2} m for constants
c1, c2; note that this is far better than an exponential bound, where say 2^{log n} = n
amounts to basically just storing the entire data set).
Hashing
Many streaming algorithms use separating hash functions to compress data. This is
not to be confused with locality-sensitive hash functions discussed in Section 4.6.
2. Multiplicative Hashing: h_a(x) = ⌊m · frac(x · a)⌋
Here a is a real number (it should be large, with a binary representation that is a good mix of
0s and 1s), frac(·) takes the fractional part of a number, e.g., frac(15.234) = 0.234,
and ⌊·⌋ takes the integer part of a number, rounding down, so ⌊15.234⌋ = 15. It
can sometimes be more efficiently implemented as ⌊xa/2^q⌋ mod m, where q
essentially replaces the frac(·) operation and determines the number of bits of
precision.
If a user wants something simple, concrete, and fast to implement themselves,
this is a fine choice. In this case, the randomness is the choice of a. Once that is
chosen (randomly), then the function ha (·) is deterministic.
3. Modular Hashing: h(x) = x mod m
(This is not recommended, but it is a common first approach. It is listed here to
advise against it.)
This roughly evenly distributes a number x to [m]; however, the problem is that
numbers that are the same mod m always hash to the same location. One
can alleviate this problem by using a random large prime m′ < m. This will leave
bins {m′ + 1, m′ + 2, . . . , m} always empty, but has less regularity. Still,
similar numbers, like 71 and 73, will never hash together.
Separating hash functions basically map data items randomly into a fixed (and
often reduced) domain, where some will fall on top of each other. These collisions
lead to some error, but if one is careful, the large important items show through.
The unimportant items get distributed randomly, and essentially add a layer of noise.
Because it is random, this noise can be bounded.
An added benefit of the use of hash functions is their obliviousness. The algorithm
is completely independent of the order and content of the data. This means it is easy to
change items later if there was a mistake, to sketch different parts of the data separately
and combine them later; and if data is reordered due to race conditions,
it has no effect on the output.
As a first warm-up task, consider a stream A = a1, a2, . . . , an where each element
ai ∈ R is a real value. A common task is to compute the sample mean and sample
variance over A.
In the case of mean(A), the mean of A, we cannot simply maintain the one value
mean(A_s) = (1/s) Σ_{i=1}^{s} a_i through s elements, since when a new element a_{s+1}
arrives, averaging it with mean(A_s) as mean(A_{s+1}) = (1/2)(mean(A_s) + a_{s+1}) gives the
wrong answer. Instead we maintain two values: the sum M_s = Σ_{i=1}^{s} a_i and the total
count s. Then we can construct the mean on demand as

    mean(A_s) = M_s / s.
The sample variance over A, var(A), is defined as (1/n) Σ_{i=1}^{n} (mean(A) − a_i)². This
again appears tricky, since we do not know mean(A) until we have witnessed the
entire stream. However, we can maintain three values: the sum of squared values
S_s = Σ_{i=1}^{s} a_i², and the running sum M_s and count s as before. Then the sample
variance can be recovered as

    var(A_s) = S_s/s − (M_s/s)².
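A minimal sketch of this constant-space approach:

    class StreamingMeanVar:
        # Maintain the count s, the running sum M_s, and the sum of squares S_s.
        def __init__(self):
            self.s, self.M, self.S = 0, 0.0, 0.0

        def update(self, a):
            self.s += 1
            self.M += a
            self.S += a * a

        def mean(self):                  # mean(A_s) = M_s / s
            return self.M / self.s

        def var(self):                   # var(A_s) = S_s/s - (M_s/s)^2
            return self.S / self.s - (self.M / self.s) ** 2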
Another important task is to draw (or maintain) a random sample from a stream. For
uniform random samples (i.e., each element is equally likely), the standard approach
is called reservoir sampling. It maintains a single sample x ∈ As = a1, a2, . . . , as .
At the start of the stream (for s = 1), it always sets x = a1 . Otherwise, on each step
it performs the following simple update rule. For the (s + 1)th step,
    x ← a_{s+1} with probability 1/(s + 1); otherwise, x is left unchanged.
Observe that at this point, element as+1 has been selected as x with probability
1/(s + 1) as desired. And by an inductive argument, this holds for every previously
witnessed element.
If one desired to maintain k samples x1, x2, . . . , xk , this approach can be extended.
If these should be chosen with replacement (for iid sampling, some points might be
selected more than once), then one can simply maintain k of these samplers in
parallel (each with their own randomness).
If these should be chosen without replacement, then one can simultaneously
maintain the full set X = {x1, x2, . . . , xk }, the reservoir. Then to initialize the first k
elements in the stream, {a1, a2, . . . , ak } are selected as the initial sample. Afterwards,
the update step is similar for element as+1 . With probability k/(s + 1), this item as+1
is put in the sample; and if this is the case, one of the current reservoir items is
randomly selected to be discarded. Otherwise, the reservoir set X stays the same.
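A minimal sketch of this size-k (without replacement) reservoir, assuming Python's random module:

    import random

    def reservoir_sample(stream, k):
        # Maintain a uniform without-replacement sample of size k from a stream.
        X = []
        for s, a in enumerate(stream):          # s counts items already seen
            if len(X) < k:
                X.append(a)                     # the first k items fill the reservoir
            elif random.random() < k / (s + 1): # keep a_{s+1} with probability k/(s+1)
                X[random.randrange(k)] = a      # evict a uniformly random resident
            # otherwise the reservoir is unchanged
        return X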
Weighted Sampling
If one desires to maintain a weighted sample (e.g., the elements ai are selected
proportional to assigned weights wi ), then in the single element and with replacement
setting, the approach is similar. Instead of maintaining a total count s, we maintain
a total weight W_s = Σ_{i=1}^{s} w_i. A newly arriving item is selected with probability
w_{s+1}/(W_s + w_{s+1}) = w_{s+1}/W_{s+1}; otherwise, as before, the selected item stays the
same.
For the weighted without-replacement setting, one can use priority sampling,
described in Section 2.4. Recall, in this approach, each item ai is independently
assigned a (random) priority ρi , and the top k of these are then the selected sample.
In the stream, we can assign each element a_i its priority ρ_i as it arrives; maintaining
the largest k of these can be done by comparing each newly observed item's priority
ρ_{s+1} to the maintained set. If it is larger than the smallest kept priority, it pushes
out that smallest one. In most cases, one only needs to compare to the smallest kept
priority value, so on a discard this comparison takes constant time.
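A minimal sketch of this streaming priority sampler, using a min-heap so the smallest kept priority is always at the root; the priority ρ_i = w_i/u_i for uniform u_i ∈ (0, 1] is one common choice (an assumption here, since the exact form is fixed in Section 2.4.1):

    import heapq, random

    def priority_sample(stream, k):
        # Weighted without-replacement sampling: keep the k largest priorities.
        heap = []  # min-heap of (priority, index, item, weight); root = smallest kept
        for idx, (item, w) in enumerate(stream):
            rho = w / (1.0 - random.random())   # u uniform in (0, 1]; rho = w / u
            entry = (rho, idx, item, w)         # idx breaks ties safely
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif rho > heap[0][0]:              # compare only to the smallest priority
                heapq.heapreplace(heap, entry)  # discard it in constant-ish time
        return [(item, w) for _, _, item, w in heap]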
11.2 Frequent Items

A core data mining problem is to find items that occur more than one would expect.
These may be called outliers, anomalies, or other terms. Statistical models can be
layered on top of or underneath these notions. In the streaming setting, we consider
a very simple version of this problem. There are n elements and they come from a
domain [m] (but both n and m might be very large, and we do not want to use on the
order of n or m space). Some items in the domain occur more than once, and
we want to find the items which occur the most frequently.
If we can keep a counter for each item in the domain, this is easy. But we will
assume m is huge (like all possible IP addresses), and n is also huge, the number
of packets passing through a router in a day. So we cannot maintain a counter for
each possible item, and we cannot keep track of each individual element that passes
through. We must somehow maintain some (approximate) aggregate.
It will be useful to discuss this problem in terms of the frequency f_j = |{a_i ∈ A |
a_i = j}|. The term f_j represents the number of elements in the stream that have value
j. This notation leads to several useful aggregate values: the frequency moments. Let
F_1 = Σ_j f_j = n be the total number of elements seen. Let F_2 = √(Σ_j f_j²) be the root
of the sum of squares of element counts. Let F_0 = Σ_j f_j⁰ be the number of distinct
elements. Note that typically both F_2 and F_0 are much less than F_1.
Our goal, in a streaming setting, will be to maintain approximate frequency counts
fˆj for all items j ∈ [m]. Note, even storing all of these numbers explicitly will require
too much space, so we will build a sketch data structure S which stores a few values
and implicitly induces values for the remaining indexes j (especially the items with
small frequency). Specifically, we desire an ε-approximate frequency data structure
S so that for any j ∈ [m], S provides an estimate S(j) = fˆ_j which satisfies

    f_j − εn ≤ fˆ_j ≤ f_j + εn.

That is, the returned value fˆ_j is always within εn (that is, an ε fraction of n) of the true
frequency count.
From another view, a φ-heavy hitter is an element j ∈ [m] such that f j > φn. We
want to build a data structure for ε-approximate φ-heavy-hitters so that
• it returns all j such that f j > φn;
• it returns no j such that f j < φn − εn;
• any j such that φn − εn ≤ f j < φn can be returned, but might not be.
An ε-approximate frequency data structure can be used to solve the associated heavy
hitter problem, which identifies all of the anomalously large elements. For instance,
in the case of a router monitoring IP address destinations, such sketch data structures
are used to monitor for denial of service attacks, where a huge number of packets
are sent to the same IP address overloading its processing capabilities.
We consider a data set with n = 60 elements and a domain of size m = 13. The
frequency of each item j ∈ [m] is labeled and represented by the height of the blue
bars; for instance, f5 = 7, since the 5th blue bar is of height 7.
These frequencies are approximated with red lines; each such approximate frequency
fˆ_j is within 2 of the true value, so |f_j − fˆ_j| ≤ 2 = (1/30) n. Thus this is a
(1/30)-approximate frequency.
Finally, we are interested in the 0.2-heavy-hitters. With φ = 0.2, these are the
items with f j > φn = (0.2) · 60 = 12. These are the bars which are above the green
line. Items 3 and 9 are φ-heavy hitters, since their frequencies f3 = 18 and f9 = 13
are both greater than 12. The ε-approximate frequencies would report the same items,
since the same subset of items (3 and 9) is above the frequency threshold of 12.
However, under an ε-approximate φ-heavy-hitter guarantee it would have been acceptable to
report only j = 3, since |φn − f_9| ≤ εn = 2.
[Figure: bar chart of the frequencies f_j for j = 1, . . . , 13; blue bars (including f_3 = 18 and f_9 = 13), red lines for the approximate frequencies, and a green heavy-hitter threshold line at 12.]
Heavy-Tailed Distributions
For many “Internet-scale” data sets, most of the data objects are associated with a
relatively small number of items. However, in other cases, the distribution is more
diverse; these are known as heavy-tailed distributions.
To illustrate this phenomenon, it is useful to consider a multiset A so each a_i ∈ A
has a_i = j ∈ [m]; for instance, all words in a book. Then the normalized frequency
of the jth element is defined as h_j = |{a_i ∈ A | a_i = j}|/|A| = f_j/n. These may
follow Zipf's law, where the frequency decays very gradually.
Zipf’s Law
Zipf’s Law states that: the frequency of data is inversely proportional to its rank.
This can be quantified so the jth most frequent item will have normalized frequency
h j ≈ c/ j for a fixed constant c.
In the Brown corpus, an enormous English language text corpus, the three most
common words are “the” at 7% (hthe = 0.07); “of” at 3.5% (hof = 0.035); and “and”
at 2.8% (hand = 0.028). So for this application, the constant in Zipf’s law is roughly
c = 0.07.
The domination of some long-tail-built companies (like Amazon and Netflix) put
smaller more-focused companies out of business (local book stores or video rental
shops). What are some benefits and disadvantages of such an accumulation of con-
trol? And how should we try to weigh the corresponding pros and cons? If the cons
outweigh the pros, what are the mechanisms to try to prevent such domination, or in
the other case to encourage it?
We now analyze why this algorithm solves the streaming majority problem. We only
need to consider the case where some j ∈ [m] has f_j > n/2; otherwise, anything can
be returned.
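For reference, a minimal sketch of the one-counter, one-label majority algorithm being analyzed (often credited to Boyer and Moore); the labels and counters match the extension described next:

    def majority(stream):
        # One label and one counter; correct whenever some j has f_j > n/2.
        label, count = None, 0
        for a in stream:
            if count == 0:
                label, count = a, 1      # adopt a new candidate
            elif a == label:
                count += 1               # matching element: increment
            else:
                count -= 1               # non-matching element: decrement
        return label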
We will next extend the majority algorithm to solve the ε-approximate frequency
estimation problem. Basically, it simply uses k − 1 counters and labels, instead
of one of each, and sets k = 1/ε. Thus the total space will be proportional to
(k − 1)(log n + log m) ≈ (1/ε)(log n + log m). We use C, an array of (k − 1) counters
C[1], C[2], . . . , C[k − 1], and also L, an array of (k − 1) locations L[1], L[2], . . . ,
L[k − 1]. There are now three simple rules, with pseudocode in Algorithm 11.2.2:
• If we see a stream element that matches a label, we increment the associated
counter.
• If not, and a counter is 0, we reassign the associated label, and increment the
counter.
• Finally, if all counters are non-zero, and no labels match, then we decrement all
counters.
For any query q ∈ [m], the resulting estimate fˆ_q (the value of the counter whose
label is q, or 0 if no counter has label q) satisfies

    f_q − εn ≤ fˆ_q ≤ f_q.
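A minimal sketch of Algorithm 11.2.2, using a Python dictionary to play the role of the paired arrays L (keys) and C (values):

    def misra_gries(stream, k):
        # Holds at most k-1 labeled counters at any time.
        C = {}
        for a in stream:
            if a in C:
                C[a] += 1                # rule 1: matching label, increment
            elif len(C) < k - 1:
                C[a] = 1                 # rule 2: an unused counter exists
            else:
                for label in list(C):    # rule 3: decrement all counters
                    C[label] -= 1
                    if C[label] == 0:
                        del C[label]
        return C                         # estimate: f_hat_q = C.get(q, 0)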
After running Algorithm 11.2.3 on a stream A, the count-min sketch earns its
name: on a query q ∈ [m], it returns the minimum associated count

    fˆ_q = min_{i ∈ [t]} C_{i, h_i(q)}.

That is, for each ith hash function (row in the table), it looks up the appropriate
counter C_{i, h_i(q)}, and returns the one with minimum value.
Now for a query q ∈ [m], the value fˆ_q is never an underestimate (f_q ≤ fˆ_q), since each
counter C_{i, h_i(q)} associated with query q has been incremented for each instance of
q, and never decremented. It may, however, have been incremented for other items;
we claim that fˆ_q ≤ f_q + W for some bounded overcount value W.
Consider just one hash function h_i. The overcount is increased by f_j if h_i(q) =
h_i(j), and for each j ∈ [m] this happens with probability 1/k. We can now bound the
overcount using the Markov inequality.
Define a random variable Y_{i,j} that represents the overcount caused on h_i for q
because of element j ∈ [m]. Each instance of j has the same value h_i(j), so all of
these counts are added together or none are:

    Y_{i,j} = f_j with probability 1/k, and Y_{i,j} = 0 otherwise.

Hence E[Y_{i,j}] = f_j/k. We can then define a random variable X_i = Σ_{j ∈ [m], j ≠ q} Y_{i,j},
which is the total overcount in row i. By linearity of expectation and n = Σ_{j=1}^{m} f_j = F_1, we have

    E[X_i] = E[Σ_{j ≠ q} Y_{i,j}] = Σ_{j ≠ q} f_j/k ≤ F_1/k = εF_1/2.
Now we recall the Markov inequality: for a random variable X ≥ 0 and a value
α > 0, Pr[X ≥ α] ≤ E[X]/α. We apply this with X_i ≥ 0 and α = εF_1.
Noting that E[X_i]/α ≤ (εF_1/2)/(εF_1) = 1/2, we have

    Pr[X_i ≥ εF_1] ≤ 1/2.

But this was for just one hash function h_i. Extending to t = log(1/δ)
independent hash functions, the probability that all rows have too large an
overcount is

    Pr[fˆ_q ≥ f_q + εF_1] = ∏_{i=1}^{t} Pr[X_i ≥ εF_1] ≤ (1/2)^{log(1/δ)} = δ.
Putting this all together, we have the PAC bound for the count-min sketch: for any
query q ∈ [m],

    f_q ≤ fˆ_q ≤ f_q + εF_1,

where the first inequality always holds, and the second holds with probability at least
1 − δ.
Since there are kt counters, and each requires log n space, the total counter space
is kt log n. But we also need to store t hash functions; these can be made to take
(roughly) log m space each. Then since t = log(1/δ) and k = 2/ε, it follows the
overall total space is roughly t(k log n + log m) = ((2/ε) log n + log m) log(1/δ).
While the space for the count-min sketch is larger than the Misra-Gries sketch,
it has an advantage that its guarantees hold under the so-called turnstile streaming
model. Under this model, each element ai ∈ A can either add one or subtract one
from the corpus (like a turnstile at the entrance of a football game), but each count f j
must remain non-negative. Thus count-min has the same guarantees in the turnstile
model, but Misra-Gries does not.
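A minimal sketch of the count-min sketch, with Python's built-in hash salted per row as a stand-in for the separating hash functions of Section 11.1 (an assumption; any universal family works):

    import math, random

    class CountMin:
        def __init__(self, eps, delta):
            self.k = math.ceil(2 / eps)               # counters per row
            self.t = math.ceil(math.log2(1 / delta))  # number of rows
            self.C = [[0] * self.k for _ in range(self.t)]
            self.salts = [random.getrandbits(64) for _ in range(self.t)]

        def _h(self, i, x):
            # Salted built-in hash as a stand-in for a separating hash function.
            return hash((self.salts[i], x)) % self.k

        def update(self, x, count=1):                 # turnstile: count may be -1
            for i in range(self.t):
                self.C[i][self._h(i, x)] += count

        def query(self, q):
            return min(self.C[i][self._h(i, q)] for i in range(self.t))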
A predecessor of the count-min sketch is the count sketch. Its structure is very similar
to the count-min sketch; it again maintains a 2-dimensional array of counters, but
now with t = log(2/δ) and k = 4/ε².
In addition to the t hash functions hi : [m] → [k], it maintains t sign hash functions
si : [m] → {−1, +1}. Then for a stream element a ∈ A, each hashed-to counter
Ci,hi (a) is incremented by si (a). So it might add 1 or subtract 1.
To query this sketch, it takes the median of all values, instead of the minimum.
Unlike the biased count-min sketch, the other items hashed to the same counter as
the query are unbiased. Half the time the values are added, and half the time they are
subtracted. So then the median of all rows provides a better estimate. This ensures
the following bound with probability at least 1 − δ for all q ∈ [m]:
| fq − fˆq | ≤ εF2 .
Note that this required k to be about 1/ε² instead of 1/ε, but usually F_2 = √(Σ_j f_j²) is
much smaller than F_1 = Σ_j f_j, especially for skewed (light-tailed) distributions.
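A minimal sketch of the count sketch; note the query multiplies each counter by the sign hash before taking the median across rows:

    import random

    class CountSketch:
        def __init__(self, k, t):       # k about 4/eps^2, t about log(2/delta)
            self.k, self.t = k, t
            self.C = [[0] * k for _ in range(t)]
            self.salts = [random.getrandbits(64) for _ in range(t)]

        def _h(self, i, x):             # bucket hash h_i : [m] -> [k]
            return hash((self.salts[i], 0, x)) % self.k

        def _s(self, i, x):             # sign hash s_i : [m] -> {-1, +1}
            return 1 if hash((self.salts[i], 1, x)) % 2 == 0 else -1

        def update(self, x):
            for i in range(self.t):
                self.C[i][self._h(i, x)] += self._s(i, x)

        def query(self, q):
            # Unbias each row by the sign, then take the median over the t rows.
            vals = sorted(self._s(i, q) * self.C[i][self._h(i, q)]
                          for i in range(self.t))
            return vals[len(vals) // 2]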
Bloom Filters
There are many variants of these hashing-based sketches, including techniques for
locality-sensitive hashing (see Section 4.6) and the extensions for matrices discussed
next. Another well-known and related idea is a Bloom filter, which approximately
maintains a set but not the frequencies of the items. This does not allow false
negatives (if an item is in the set, the data structure always reports that it is), but can
allow false positives (it sometimes reports an item in the set when it is not).
A Bloom filter processes a stream A = a_1, a_2, . . . where each a_i ∈ [m] is an
element of a domain of size m. The structure maintains only a single array B of b
bits, each initialized to 0. It uses k hash functions {h_1, h_2, . . . , h_k} from H, with each
h_j : [m] → [b]. It then runs streaming Algorithm 11.2.5 on A, where each a_i uses
the k hash functions to set k bits to 1.
Then on a query to see if q ∈ A, it returns Yes only if B[h_j(q)] = 1 for all
j ∈ [k]; otherwise it returns No. Since any item which was put in B has
all associated hashed-to bits as 1, on a query it will always return Yes (no false
negatives). However, it may return Yes for an item even if it does not appear in the
stream. It does not even need to have the exact same hash values from another item
in the set; each hash collision could occur because of a different item.
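A minimal sketch of a Bloom filter, again with salted built-in hashing standing in for the k hash functions:

    import random

    class BloomFilter:
        def __init__(self, b, k):
            self.b, self.k = b, k
            self.B = [0] * b            # b bits, each initialized to 0
            self.salts = [random.getrandbits(64) for _ in range(k)]

        def add(self, a):
            for salt in self.salts:     # force k bits to 1
                self.B[hash((salt, a)) % self.b] = 1

        def query(self, q):
            # Yes only if all k hashed-to bits are 1; hence no false negatives.
            return all(self.B[hash((salt, q)) % self.b] for salt in self.salts)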
11.3 Matrix Sketching

The singular value decomposition (SVD) can be interpreted as finding the most
dominant directions in an n × d matrix A (or n points in R^d), with n > d. Recall that
the SVD results in [U, S, V] = svd(A), where U = [u_1, . . . , u_n], S = diag(σ_1, . . . , σ_d),
and V = [v_1, . . . , v_d], so that A = U S V^T, and in particular A = Σ_{j=1}^{d} σ_j u_j v_j^T.
To approximate A, we use just the first k components to find A_k = Σ_{j=1}^{k} σ_j u_j v_j^T
= U_k S_k V_k^T, where U_k = [u_1, . . . , u_k], S_k = diag(σ_1, . . . , σ_k), and V_k = [v_1, . . . , v_k].
Then the vectors v_j (starting with smaller indexes) provide the best subspace
representation of A.
But, although the SVD has been heavily optimized for data sets that fit in memory,
it can sometimes be improved upon for very large data sets. The traditional SVD takes
time proportional to min{nd², n²d} to compute, which can be prohibitive for large n
and/or d.
We next consider computing approximate versions of the SVD in the streaming
model. In the setting we focus on, A arrives in the stream, one row a_i at step i. So
our input is a_1, a_2, . . . , a_i, . . . , a_n, and at any point i in the stream, we would like
to maintain a sketch matrix B which somehow approximates all rows up to
that point.
The first regime we focus on is when n is extremely large, but d is moderate. For
instance, n = 100 million and d = 100. The simple approach in a stream is to make
one pass using d² space, and just maintain the sum of outer products C = Σ_{i} a_i a_i^T,
the d × d covariance matrix of A, exactly.
We have that at any point i in the stream, where A_i = [a_1; a_2; . . . ; a_i], the maintained
matrix C is precisely C = A_i^T A_i. Thus the eigenvectors of C are the right
singular vectors of A_i, and the eigenvalues of C are the squared singular values of A_i.
This only requires d² space and nd² total time, and incurs no error.
We can choose the top k eigenvectors of C as V_k, and on a second pass of the
data, project each row a_i onto V_k to obtain the best k-dimensional embedding
of the data set.
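A minimal NumPy sketch of this exact one-pass approach (Algorithm 11.3.1 in the text):

    import numpy as np

    def stream_covariance(rows, d):
        # One pass and d*d space: maintain C = sum_i a_i a_i^T = A_i^T A_i exactly.
        C = np.zeros((d, d))
        for a in rows:                  # each a is a length-d vector
            C += np.outer(a, a)
        return C

    # The eigenvectors of C are the right singular vectors of A; for instance:
    # eigvals, eigvecs = np.linalg.eigh(C)   # ascending eigenvalue order
    # Vk = eigvecs[:, -k:]                   # top-k right singular vectors of A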
The next regime assumes that n is extremely large (say n = 100 million), and that d
is also uncomfortably large (say d = 100 thousand), and our goal is something like
a best rank-k approximation with k ≈ 10, so k ≪ d ≪ n. In this regime, perhaps d is
so large that d² space is too much, but something close to dk space and ndk time is
reasonable. We will not be able to solve things exactly in the streaming setting under
these constraints, but we can provide a provable approximation with slightly more
space and time.
This approach, called Frequent Directions, can be viewed as an extension of the
Misra-Gries sketch. We will consider a matrix A one row (one point a_i ∈ R^d) at a
time. We will maintain a matrix B that is 2ℓ × d; that is, it only has 2ℓ rows (directions).
We say a row of B is empty if it contains all 0s. We maintain that one row of B is
always empty at the end of each round (this will always include the last row B_{2ℓ}).
We initialize B with the first 2ℓ − 1 rows a_i of A, keeping the last row B_{2ℓ} as all
zeros. Then on each new row a_i ∈ A, we replace an empty row of B with a_i. If after this
step there are no remaining empty rows, then we need to create more with the following
steps. We set [U, S, V] = svd(B). Now examine S = diag(σ_1, . . . , σ_{2ℓ}), which is a
length-2ℓ diagonal matrix. Then subtract δ = σ_ℓ² from each squared entry in S; that is,
σ′_j = √(max{0, σ_j² − δ}), so

    S′ = diag(√(σ_1² − δ), √(σ_2² − δ), . . . , √(σ_{ℓ−1}² − δ), 0, . . . , 0).

Now we set B = S′ V^T. Notice that since S′ only has non-zero elements in the
first ℓ − 1 entries on the diagonal, B is at most rank ℓ − 1 and the last ℓ + 1 rows are
empty.
The result of Algorithm 11.3.2 is a matrix B such that for any (direction) unit
vector x ∈ R^d,

    0 ≤ ∥Ax∥² − ∥Bx∥² ≤ ∥A − A_k∥_F² / (ℓ − k)

and

    ∥A − A Π_{B_k}∥_F² ≤ (ℓ / (ℓ − k)) ∥A − A_k∥_F²,

for any k < ℓ, including when k = 0, where Π_{B_k} projects onto the span of B_k.
So setting ℓ = 1/ε, in any direction in R^d, the squared norm in that direction is
preserved up to ε∥A∥_F² (that is, ε times the total squared norm) using the first bound.
Using ℓ = k + 1/ε, the squared norm of the direction is preserved up to the squared
Frobenius norm of the tail, ε∥A − A_k∥_F².
In the second bound, if we set ℓ = k/ε + k, then we have ∥A − A Π_{B_k}∥_F² ≤
(1 + ε) ∥A − A_k∥_F². That is, the span of B_k, the best rank-k approximation of B, only
misses a (1 + ε)-factor more variation (measured by squared Frobenius norm) than
the best rank-k subspace A_k.
Just like with Misra-Gries, when some mass is deleted from one counter, it is deleted
from all counters, and none can become negative. So here, when one direction has
its mass (squared norm) decreased, at least ℓ directions (with non-zero mass) are
decreased by the same amount. So no direction can have more than a 1/ℓ fraction of
the total squared norm ∥A∥_F² decreased from it.
Finally, since squared norm can be summed independently along any set of
orthogonal directions (by the Pythagorean theorem), we can subtract each of them
without affecting the others. Setting ℓ = 1/ε implies that no direction x (e.g., assume
∥x∥ = 1 and measure ∥Ax∥²) decreases its squared norm (as ∥Bx∥²) by more than
ε∥A∥_F². By a more careful analysis, showing that we only shrink the total norm
proportional to the "tail" ∥A − A_k∥_F², we can obtain the bound described above.
[Figure: the norm ellipse of B with axes v_1 and v_2; each axis length σ_j shrinks to √(σ_j² − δ), so any direction x loses at most δ in squared norm.]
Why do we use the SVD? The SVD defines the true axes of the ellipse associated with the
norm of B (its aggregate shape) at each step. If we shrink along any other basis (or
a set of non-orthogonal vectors), we will warp the ellipse, and will not be able to
ensure that each direction x of B shrinks in squared norm by at most δ.
The frequent directions algorithm calls the SVD operation multiple times, so
it is useful to bound its runtime to ensure it does not take much longer than a
single decomposition of A, which would take time proportional to nd². In this case,
the matrix B which has its SVD computed is only 2ℓ × d, so this operation can be
performed in time proportional to only dℓ². We do this step every ℓ rows, so it occurs
n/ℓ times. In total, this takes time proportional to (n/ℓ) · dℓ² = ndℓ. So when ℓ is much
smaller than d, this is considerably faster than a single full SVD of A.
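A minimal NumPy sketch of Frequent Directions (Algorithm 11.3.2), assuming d ≥ 2ℓ so that svd returns all 2ℓ singular values:

    import numpy as np

    def frequent_directions(rows, ell, d):
        # Maintain B with 2*ell rows; shrink via the SVD whenever B fills up.
        B = np.zeros((2 * ell, d))
        next_empty = 0
        for a in rows:
            B[next_empty] = a                         # fill an empty row with a_i
            next_empty += 1
            if next_empty == 2 * ell:                 # no empty rows remain: shrink
                U, sig, Vt = np.linalg.svd(B, full_matrices=False)
                delta = sig[ell - 1] ** 2             # delta = sigma_ell^2
                sig = np.sqrt(np.maximum(sig ** 2 - delta, 0.0))
                B = sig[:, None] * Vt                 # B = S' V^T, rank <= ell - 1
                next_empty = ell - 1                  # last ell + 1 rows are now empty
        return B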
We next move to a regime where n and d are again both large, and so might be k.
But a runtime of ndk may be too large—that is, we can read the data, but maybe a
factor of k times reading the data is also large. The next algorithms have runtime
slightly more than nd; they are almost as fast as reading the data. In particular, if
there are nnz(A) non-zero entries in a very sparse matrix, and we only need to read
these entries, then the runtime is almost proportional to nnz(A).
The goal is to approximate A up to the accuracy of Ak . But in Ak , the directions
vi are linear combinations of features. As a model, this may not make sense; for
example, what is a linear combination of genes? What is a linear combination of
typical grocery purchases? Instead, our goal is to choose V so that the rows of V are
also rows of A.
For each row a_i ∈ A, set w_i = ∥a_i∥². Then select ℓ = (k/ε)² · log(1/δ) rows of
A, each chosen proportional to w_i. Then "stack" these rows x_1, x_2, . . . , x_ℓ to derive a
matrix

    R = [x_1; x_2; . . . ; x_ℓ],

with the ℓ selected rows stacked one above another.
These rows will jointly act in place of V_k^T. But while the columns v_i, v_j ∈ V_k were
orthogonal (since V was orthogonal), the rows of R are not, so
we need to consider an orthogonalization of them. Let Π_R = R^T (R R^T)^{−1} R be the
projection matrix for R, so that A_R = A Π_R describes the projection of A onto the
subspace spanned by the rows of R. Finally, we claim that

    ∥A − A Π_R∥_F ≤ ∥A − A_k∥_F + ε ∥A∥_F
with probability at least 1−δ. The analysis roughly follows from importance sampling
and concentration of measure bounds. This sketch can be computed in a stream using
weighted reservoir sampling or priority sampling.
Leverage Scores
A slightly better bound can be achieved, near the bound from Frequent Directions,
if a different sampling probability on the rows is used: the leverage scores. To define
these, first consider the SVD [U, S, V] = svd(A). Now let U_k be the top k left
singular vectors, and let U_k(i) be the ith row of U_k. The leverage score of data point
a_i is then s_i = ∥U_k(i)∥². Sampling ℓ rows proportional to these scores, the same
stacked-row construction satisfies

    ∥A − A Π_R∥_F ≤ (1 + ε) ∥A − A_k∥_F.
However, computing the full SVD to derive these weights prevents a streaming algorithm,
and defeats part of the motivation for sketching. Yet there exist ways to approximate
these scores, and to sample roughly proportional to them in a stream; moreover,
the approximation via selected rows still provides better interpretability.
A significant downside of these row sampling approaches is that the 1/ε²
coefficient can be quite large for a small error tolerance. If ε = 0.01, meaning 1%
error, then this part of the coefficient alone is 10,000. In practice, these approaches
are typically most useful when ℓ can be 100 or even larger.
Random Projections

Another family of approaches creates the sketch as B = SA, where S ∈ R^{ℓ×n} is a
random matrix, for instance one filled with appropriately scaled Gaussian random
variables. Using the top right singular vectors V of such a sketch, with ℓ = k/ε one
can guarantee

    ∥A − [AV]_k V^T∥_F ≤ (1 + ε) ∥A − A_k∥_F.

A factorized form (like the output of the SVD) can be computed for the product
[AV]_k V^T in time which scales linearly in the matrix size nd, and polynomially in
ℓ = k/ε.
Moreover, this preserves a stronger bound, called an oblivious subspace embedding,
using ℓ ≈ d/ε², so that for all x ∈ R^d

    (1 − ε) ≤ ∥Ax∥ / ∥Bx∥ ≤ (1 + ε).
This is a very strong bound that also ensures that given a matrix A of d-dimensional
explanatory variables, and a vector b of dependent variables, the result of linear
regression on S A and Sb provides a (1 ± ε) approximation to the result on the full A
and b.
By increasing the size of the sketch to ℓ ≈ k² + k/ε for the rank-k approximation
result, or to ℓ ≈ d²/ε² for the oblivious subspace embedding result, a faster count
sketch-based approach can be used. In this approach, S has each column S_j as all 0s,
except for one randomly chosen entry (this can be viewed as a hash to a row of B)
that is either −1 or +1 at random. This works just like a count sketch, but for matrices.
The runtime of these count-sketch approaches becomes proportional to nnz(A),
the number of non-zeros in A. For very sparse data, such as those generated from
bag-of-words approaches, this is as fast as only reading the few relevant entries of
the data. Each row is now randomly accumulated onto one row of the output sketch
B instead of onto all rows as when using the Gaussian random variables approach.
However, this sketch B is not as interpretable as the row selection methods, and in
practice for the same space, it often works a bit worse (due to extra factors necessary
for the concentration of measure bounds to kick in) than the frequent directions
approach.
Exercises
11.3 Consider the two data sets S1 and S2, which contain m = 3,000,000 characters
and m = 4,000,000 characters, respectively. The order of the file represents the order
of the stream.
1. Run the Misra-Gries algorithm (Algorithm 11.2.2) with (k − 1) = 9 counters on
streams S1 and S2. Report the output of the counters at the end of the stream.
In each stream, use just the counters to report how many objects might occur more
than 20% of the time, and which must occur more than 20% of the time.
2. Build a count-min sketch (Algorithm 11.2.3) with k = 10 counters using t = 5
hash functions. Run it on streams S1 and S2.
For both streams, report the estimated counts for objects a, b, and c. Just from the
output of the sketch, which of these objects, with probability at least 1 − δ = 31/32 (that
is, assuming the randomness in the algorithm does not do something bad), might
occur more than 20% of the time?
3. Describe one advantage of the count-min sketch over the Misra-Gries algorithm.
11.4 Consider the data set S1; we will estimate properties of this stream using stream
sampling approaches.
1. Run k independent reservoir samplers (each with reservoir size 1) to select k items
for a set S. Determine the count of each character selected into S at the end of the
stream. Use the counts of each character in S to provide an estimate for the count
of those characters in the full stream. Run this for values of k = {10, 100, 1000}. You
may want to repeat with different randomness a few times, and take the average,
to get better estimates.
2. Repeat the above experiment with a single reservoir of size k for the same values
of k.
3. Now treat the characters as weights based on alphabetical order. So “a” has a
weight of 1, “b” has a weight of 2, and ultimately “z” has a weight of 26. Now
create k samples (again using k = {10, 100, 1000}) using k independent weighted
reservoir samplers. Compute the estimate for the total weight of the stream using
importance sampling (Section 2.4). Repeat a few times, and average, so the
estimate concentrates.
4. Repeat this experiment using without-replacement sampling via priority sampling
for the same values of k (see Section 2.4.1).
11.5 The exact φ-heavy-hitter problem is as follows: return all objects that occur
more than φ% of the time. It cannot return any false positives or any false negatives.
In the streaming setting, this requires space at least proportional to min{m, n}, if there are
n objects that can occur and the stream is of length m.
A 2-pass streaming algorithm is one that is able to read all of the data in order
exactly twice, but still only has limited memory. Describe a small-space 2-pass
streaming algorithm to solve the exact φ-heavy-hitter problem, i.e., with ε = 0 (say
for a φ = 10% threshold).
11.6 Consider a data set Z, which can be interpreted as a matrix Z ∈ R^{n×d} with
n = 10,000 and d = 2,000, where each ith row is an element z_i ∈ R^d of a stream
z_1, z_2, . . . , z_n. For each of the following approaches (with ℓ = 10),
• create a sketch B ∈ R^{ℓ×d} or B^T B ∈ R^{d×d},
• report the covariance error ∥Z^T Z − B^T B∥_2, and
• describe analytically (in terms of n and d, then plug in those values) how much
space was required.
The approaches are:
1. Read all of Z and directly calculate B^T B = Z^T Z.
2. Summed Covariance (Algorithm 11.3.1) to create B^T B.
3. Frequent Directions (Algorithm 11.3.2) with parameter ℓ to create B.
4. Row Sampling (proportional to ∥z_i∥²) using reservoir samplers to create B.
5. Random Projections using scaled Gaussian random variables to create B.