Springer Series in the Data Sciences
Series Editors
David Banks, Duke University, Durham, NC, USA
Jianqing Fan, Department of Financial Engineering, Princeton University,
Princeton, NJ, USA
Michael Jordan, University of California, Berkeley, CA, USA
Ravi Kannan, Microsoft Research Labs, Bangalore, India
Yurii Nesterov, CORE, Universite Catholique de Louvain, Louvain-la-Neuve,
Belgium
Christopher Ré, Department of Computer Science, Stanford University, Stanford,
USA
Ryan J. Tibshirani, Department of Statistics, Carnegie Mellon University,
Pittsburgh, PA, USA
Larry Wasserman, Department of Statistics, Carnegie Mellon University,
Pittsburgh, PA, USA
Springer Series in the Data Sciences focuses primarily on monographs and graduate
level textbooks. The target audience includes students and researchers working in
and across the fields of mathematics, theoretical computer science, and statistics.
Data Analysis and Interpretation is a broad field encompassing some of the
fastest-growing subjects in interdisciplinary statistics, mathematics, and computer
science. It involves the process of inspecting, cleaning, transforming, and
modeling data with the goal of discovering useful information, suggesting
conclusions, and supporting decision making. Data analysis has multiple facets
and approaches, including diverse techniques under a variety of names, in different
business, science, and social science domains. Springer Series in the Data Sciences
addresses the needs of a broad spectrum of scientists and students who are utilizing
quantitative methods in their daily research. The series is broad but structured,
including topics within all core areas of the data sciences. The breadth of the series
reflects the variation of scholarly projects currently underway in the field of
machine learning.
Mathematical Foundations for Data Analysis
Jeff M. Phillips
School of Computing
University of Utah
Salt Lake City, UT, USA
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
to Bei, Stanley, and Max
Preface
This book is meant for data science students, and in particular for preparing them
mathematically for advanced concepts in data analysis. It can be used for a
self-contained course that introduces many of the basic mathematical principles
and techniques needed for modern data analysis, and it can go deeper into a variety of
topics; the shorthand "math for data" may be appropriate. In particular, it was
constructed from material taught mainly in two courses. The first is an early
undergraduate course which is designed to prepare students to succeed in rigorous
Machine Learning and Data Mining courses. The second is an advanced
Data Mining course. It should be useful for any combination of such courses. The
book introduces key conceptual tools that are often absent from, or covered only briefly in, the
undergraduate curriculum, and that for most students are helpful to see multiple times. On
top of these, it introduces the generic versions of the most basic techniques that
comprise the backbone of modern data analysis. And then it delves deeper into a
few more advanced topics and techniques—still focusing on clear, intuitive, and
lasting ideas, instead of specific details in the ever-evolving state of the art.
Notation
Consistent, clear, and crisp mathematical notation is essential for intuitive learning.
The domains which comprise modern data analysis (e.g., statistics, machine
learning, and algorithms) until recently had matured somewhat separately with
their own conventions for ways to write the same or similar concepts. Moreover, it
is commonplace for researchers to adapt notation to best highlight ideas within
specific papers. As such, much of the existing literature on the topics covered in
this book has varied, sometimes inconsistent notation, and as a whole can be
confusing. This text attempts to establish a common, simple, and consistent
notation for these ideas, while not veering too far from how concepts are typically
represented in the research literature, and as they will be in more advanced courses.
Indeed, the most prevalent sources of confusion in earlier uses of this text in class
have arisen around overloaded notation.
This book is written for students who have already taken calculus; for several
topics it relies on integration (continuous probability) and differentiation (gradient
descent). However, this book does not lean heavily on calculus, and as introductory
data science courses are being developed that are not calculus-forward, students
following such a curriculum may still find many parts of this book useful.
For some advanced material in Sampling, Nearest Neighbors, Regression,
Clustering, Classification, and especially Big Data, some basic familiarity with
programming and algorithms will be useful to fully understand these concepts.
These topics are deeply integrated with computational techniques beyond numerical
linear algebra. When the implementation is short and sweet, several implementations
are provided in Python. This is not meant to provide a full introduction
to programming, but rather to help break down any barriers for students who worry
that the programming aspects need to be difficult; many times they do not!
Probability and Linear Algebra are essential foundations for much of data
analysis, and hence also in this book. This text includes reviews of these topics.
This is partially to keep the book more self-contained, and partially to ensure that
there is a consistent notation for all critical concepts. This material should be
suitable as a review of these topics, but it is not meant to replace full courses. It is
recommended that students take courses on these topics before, or potentially
concurrently with, a more introductory course from this book.
If appropriately planned for, the hope is that a first course taught from this
book could be taken as early as the undergraduate sophomore level, so that more
rigorous and advanced data analysis classes can be taken during the junior year.
Although we touch on Bayesian inference, we do not cover most of classical
statistics; neither frequentist hypothesis testing nor similar Bayesian perspectives.
Most universities have well-developed courses on these topics that are also very
useful, and provide a complementary view of data analysis.
Vital concepts introduced include the concentration of measure and PAC bounds,
cross-validation, gradient descent, a variety of distances, principal component
analysis, and graph-structured data. These ideas are essential for modern data
analysis, but are not often taught in other introductory mathematics classes in a
computer science or math department, or if these concepts are taught, they are
presented in a very different context.
(Figure: overview of the main data analysis settings. Labeled data (X, y) is used for regression to make predictions, a scalar outcome; unlabeled data X is used for clustering and dimensionality reduction to find structure, a set outcome.)
On Data
While this text is mainly focused on mathematical preparation, what would data
analysis be without data? As such we provide a discussion on how to use these
tools and techniques on actual data, with simple examples given in Python. We
choose Python since it has an increasing number of powerful libraries, often with efficient
backends in low-level languages like C or Fortran. So for most data sets, this
provides the proper interface for working with these tools. Data sets can be found
here: https://mathfordata.github.io/data/.
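As a minimal sketch of this workflow (the file name below is hypothetical; the actual files are listed at the URL above), a small comma-separated data set can be loaded into a numpy array and inspected in a few lines:

import numpy as np

# Hypothetical file name; substitute one of the data sets listed at
# https://mathfordata.github.io/data/ (here assumed to be comma-separated).
X = np.loadtxt("example.csv", delimiter=",")

print(X.shape)         # (number of data points, number of features)
print(X.mean(axis=0))  # per-feature means, a quick sanity check on the data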
But arguably, more important than writing the code itself is a discussion on
when and when not to use techniques from the immense toolbox available. This is
one of the main ongoing questions a data scientist must ask. And so, the text
attempts to introduce the readers to this ongoing discussion—but resists diving into
an explicit discussion of the actual tools available.
Three themes that this text highlights to aid in a broader understanding of these
fundamentals are examples, geometry, and ethical connotations. These are each
offset in colored boxes.
This book provides numerous simple and sometimes fun examples to demonstrate
key concepts. It aims to be as simple as possible (but not simpler), and to make data
examples small, so they can be fully digested. These are illustrated with figures
and plots, and often the supporting Python code is integrated when it is illustrative.
For brevity, the standard Python import commands are only written once per
chapter, and the state is assumed to carry forward throughout the examples within a
chapter, although most such Python parts are otherwise fully self-contained.
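As a minimal sketch of this convention, assuming numpy and matplotlib are the chapter's standard libraries (the array x below is a made-up example), the imports appear once and later snippets in the chapter reuse them:

import numpy as np
import matplotlib.pyplot as plt

# Imports like these are written once per chapter; the examples that follow
# assume they are already in scope.
x = np.array([1.0, 2.0, 3.0, 4.0])

# A later example in the same chapter can build on this state directly,
# for instance plotting x against its squares without re-importing anything.
plt.plot(x, x * x)
plt.show()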
Many of the ideas in this text are inherently geometric, and hence we attempt to
provide many geometric illustrations and descriptions to use visual intuition to shed
light on otherwise abstract concepts. These boxes often go more in depth into what
is going on, and include the most technical proofs. Occasionally the proofs are not
really geometric, yet for consistency the book retains this format in those cases.
As data analysis glides into an abstract, nebulous, and ubiquitous place, with a
role in automatic decision making, the surrounding ethical questions are becoming
more important. As such, we highlight various ethical questions which may arise
in the course of using the analysis described in this text. We intentionally do not
offer solutions, since there may be no single good answer to some of the dilemmas
presented. Moreover, we believe the most important part of instilling positive ethics
is to make sure analysts at least think about the consequences, which we hope is
partially achieved via these highlighting boxes and ensuing discussions.
I would like to thank the gracious support from NSF in the form of grants
CCF-1350888, IIS-1251019, ACI-1443046, CNS-1514520, CNS-1564287, and
IIS-1816149, which have funded my cumulative research efforts during the writing
of this text. I would also like to thank the University of Utah, as well as the Simons
Institute for the Theory of Computing, for providing excellent work environments
while this text was written. And thanks to Natalie Cottrill, Yang Gao, Koriann
South, and many other students for a careful reading and feedback.
Contents
1 Probability Review
  1.1 Sample Spaces
  1.2 Conditional Probability and Independence
  1.3 Density Functions
  1.4 Expected Value
  1.5 Variance
  1.6 Joint, Marginal, and Conditional Distributions
  1.7 Bayes’ Rule
    1.7.1 Model Given Data
  1.8 Bayesian Inference
  Exercises
  3.5 Rank
  3.6 Square Matrices and Properties
  3.7 Orthogonality
  Exercises
5 Linear Regression
  5.1 Simple Linear Regression
  5.2 Linear Regression with Multiple Explanatory Variables
  5.3 Polynomial Regression
  5.4 Cross-Validation
    5.4.1 Other Ways to Evaluate Linear Regression Models
  5.5 Regularized Regression
    5.5.1 Tikhonov Regularization for Ridge Regression
    5.5.2 Lasso
    5.5.3 Dual Constrained Formulation
    5.5.4 Matching Pursuit
  Exercises
8 Clustering
  8.1 Voronoi Diagrams
    8.1.1 Delaunay Triangulation
    8.1.2 Connection to Assignment-Based Clustering
  8.2 Gonzalez’s Algorithm for k-Center Clustering
  8.3 Lloyd’s Algorithm for k-Means Clustering
    8.3.1 Lloyd’s Algorithm
    8.3.2 k-Means++
    8.3.3 k-Medoid Clustering
    8.3.4 Soft Clustering
  8.4 Mixture of Gaussians
    8.4.1 Expectation-Maximization
  8.5 Hierarchical Clustering
9 Classification
  9.1 Linear Classifiers
    9.1.1 Loss Functions
    9.1.2 Cross-Validation and Regularization
  9.2 Perceptron Algorithm
  9.3 Support Vector Machines and Kernels
    9.3.1 The Dual: Mistake Counter
    9.3.2 Feature Expansion
    9.3.3 Support Vector Machines
  9.4 Learnability and VC Dimension
  9.5 kNN Classifiers
  9.6 Decision Trees
  9.7 Neural Networks
    9.7.1 Training with Back-propagation
  Exercises
Index