Springer Series in the Data Sciences
Series Editors
David Banks, Duke University, Durham, NC, USA
Jianqing Fan, Department of Financial Engineering, Princeton University,
Princeton, NJ, USA
Michael Jordan, University of California, Berkeley, CA, USA
Ravi Kannan, Microsoft Research Labs, Bangalore, India
Yurii Nesterov, CORE, Universite Catholique de Louvain, Louvain-la-Neuve,
Belgium
Christopher Ré, Department of Computer Science, Stanford University, Stanford,
USA
Ryan J. Tibshirani, Department of Statistics, Carnegie Mellon University,
Pittsburgh, PA, USA
Larry Wasserman, Department of Statistics, Carnegie Mellon University,
Pittsburgh, PA, USA
Springer Series in the Data Sciences focuses primarily on monographs and graduate
level textbooks. The target audience includes students and researchers working in
and across the fields of mathematics, theoretical computer science, and statistics.
Data Analysis and Interpretation is a broad field encompassing some of the
fastest-growing subjects in interdisciplinary statistics, mathematics, and computer
science. It involves the process of inspecting, cleaning, transforming, and
modeling data with the goal of discovering useful information, suggesting
conclusions, and supporting decision making. Data analysis has multiple facets
and approaches, including diverse techniques under a variety of names, in different
business, science, and social science domains. Springer Series in the Data Sciences
addresses the needs of a broad spectrum of scientists and students who are utilizing
quantitative methods in their daily research. The series is broad but structured,
including topics within all core areas of the data sciences. The breadth of the series
reflects the variation of scholarly projects currently underway in the field of
machine learning.
Mathematical Foundations for Data Analysis
Jeff M. Phillips
School of Computing
University of Utah
Salt Lake City, UT, USA
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
to Bei, Stanley, and Max
Preface
This book is meant for data science students, and in particular for preparing them
mathematically for advanced concepts in data analysis. It can be used for a
self-contained course that introduces many of the basic mathematical principles
and techniques needed for modern data analysis, and it can go deeper into a variety of
topics; the shorthand "math for data" may be appropriate. In particular, it was
constructed from material taught mainly in two courses. The first is an early
undergraduate course which is designed to prepare students to succeed in rigorous
Machine Learning and Data Mining courses. The second is an advanced
Data Mining course. It should be useful for any combination of such courses. The
book introduces key conceptual tools that are often absent from, or covered only briefly in, the
undergraduate curriculum, and that for most students are helpful to see multiple times. On
top of these, it introduces the generic versions of the most basic techniques that
comprise the backbone of modern data analysis. And then it delves deeper into a
few more advanced topics and techniques—still focusing on clear, intuitive, and
lasting ideas, instead of specific details in the ever-evolving state of the art.
Notation
Consistent, clear, and crisp mathematical notation is essential for intuitive learning.
The domains which comprise modern data analysis (e.g., statistics, machine
learning, and algorithms) until recently had matured somewhat separately with
their own conventions for ways to write the same or similar concepts. Moreover, it
is commonplace for researchers to adapt notation to best highlight ideas within
specific papers. As such, much of the existing literature on the topics covered in
this book has varied, sometimes inconsistent notation, and as a whole can be
confusing. This text attempts to establish a common, simple, and consistent
notation for these ideas, while not veering too far from how concepts are typically
represented in the research literature, and as they will be in more advanced courses.
Indeed, the most prevalent sources of confusion in earlier uses of this text in class
have arisen around overloaded notation.
This book is written for students who have already taken calculus; for several
topics it relies on integration (continuous probability) and differentiation (gradient
descent). However, this book does not lean heavily on calculus, and as introductory
data science courses are being developed that are not calculus-forward, students
following such a curriculum may still find many parts of this book useful.
For some advanced material in Sampling, Nearest Neighbors, Regression,
Clustering, Classification, and especially Big Data, some basic familiarity with
programming and algorithms will be useful to fully understand these concepts.
These topics are deeply integrated with computational techniques beyond numerical
linear algebra. When the implementation is short and sweet, several implementations
are provided in Python. This is not meant to provide a full introduction
to programming, but rather to help break down any barriers for students who worry
that the programming aspects need to be difficult; many times they do not!
Probability and Linear Algebra are essential foundations for much of data
analysis, and hence also in this book. This text includes reviews of these topics.
This is partially to keep the book more self-contained, and partially to ensure that
there is a consistent notation for all critical concepts. This material should be
suitable as a review of these topics, but it is not meant to replace full courses. It is
recommended that students take courses on these topics before, or potentially
concurrently with, a more introductory course from this book.
If appropriately planned for, the hope is that a first course taught from this
book could be taken as early as the undergraduate sophomore level, so that more
rigorous and advanced data analysis classes can be taken during the junior year.
Although we touch on Bayesian inference, we do not cover most of classical
statistics; neither frequentist hypothesis testing nor similar Bayesian perspectives.
Most universities have well-developed courses on these topics that are also very
useful, and provide a complementary view of data analysis.
Vital concepts introduced include the concentration of measure and PAC bounds,
cross-validation, gradient descent, a variety of distances, principal component
analysis, and graph-structured data. These ideas are essential for modern data
analysis, but are not often taught in other introductory mathematics classes in a
computer science or math department, or if these concepts are taught, they are
presented in a very different context.
(Figure: overview of the main data analysis settings. Labeled data (X, y) is used for regression to make predictions, a scalar outcome; unlabeled data X is used for clustering and dimensionality reduction to find structure, a set outcome.)
On Data
While this text is mainly focused on mathematical preparation, what would data
analysis be without data? As such we provide a discussion on how to use these
tools and techniques on actual data, with simple examples given in Python. We
choose Python since it has an increasing number of powerful libraries, often with efficient
backends in low-level languages like C or Fortran. So for most data sets, this
provides the proper interface for working with these tools. Data sets can be found
here: https://mathfordata.github.io/data/.
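As a minimal sketch of this workflow (the file name below is hypothetical; the actual files are listed at the URL above), a small comma-separated data set can be loaded into a numpy array and inspected in a few lines:

import numpy as np

# Hypothetical file name; substitute one of the data sets listed at
# https://mathfordata.github.io/data/ (here assumed to be comma-separated).
X = np.loadtxt("example.csv", delimiter=",")

print(X.shape)         # (number of data points, number of features)
print(X.mean(axis=0))  # per-feature means, a quick sanity check on the data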
But arguably, more important than writing the code itself is a discussion on
when and when not to use techniques from the immense toolbox available. This is
one of the main ongoing questions a data scientist must ask. And so, the text
attempts to introduce the readers to this ongoing discussion—but resists diving into
an explicit discussion of the actual tools available.
Three themes that this text highlights to aid in a broader understanding of these
fundamentals are examples, geometry, and ethical connotations. These are each
offset in colored boxes.
This book provides numerous simple and sometimes fun examples to demonstrate
key concepts. It aims to be as simple as possible (but not simpler), and to make data
examples small, so they can be fully digested. These are illustrated with figures
and plots, and often the supporting Python code is integrated when it is illustrative.
For brevity, the standard Python import commands are only written once per
chapter, and the state is assumed to carry forward throughout the examples within a
chapter, although most such Python parts are otherwise fully self-contained.
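As a minimal sketch of this convention, assuming numpy and matplotlib are the chapter's standard libraries (the array x below is a made-up example), the imports appear once and later snippets in the chapter reuse them:

import numpy as np
import matplotlib.pyplot as plt

# Imports like these are written once per chapter; the examples that follow
# assume they are already in scope.
x = np.array([1.0, 2.0, 3.0, 4.0])

# A later example in the same chapter can build on this state directly,
# for instance plotting x against its squares without re-importing anything.
plt.plot(x, x * x)
plt.show()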
Many of the ideas in this text are inherently geometric, and hence we attempt to
provide many geometric illustrations and descriptions to use visual intuition to shed
light on otherwise abstract concepts. These boxes often go more in depth into what
is going on, and include the most technical proofs. Occasionally the proofs are not
really geometric, yet for consistency the book retains this format in those cases.
As data analysis glides into an abstract, nebulous, and ubiquitous place, with a
role in automatic decision making, the surrounding ethical questions are becoming
more important. As such, we highlight various ethical questions which may arise
in the course of using the analysis described in this text. We intentionally do not
offer solutions, since there may be no single good answer to some of the dilemmas
presented. Moreover, we believe the most important part of instilling positive ethics
is to make sure analysts at least think about the consequences, which we hope is
partially achieved via these highlighting boxes and ensuing discussions.
I would like to thank the gracious support from NSF in the form of grants
CCF-1350888, IIS-1251019, ACI-1443046, CNS-1514520, CNS-1564287, and
IIS-1816149, which have funded my cumulative research efforts during the writing
of this text. I would also like to thank the University of Utah, as well as the Simons
Institute for the Theory of Computing, for providing excellent work environments
while this text was written. And thanks to Natalie Cottrill, Yang Gao, Koriann
South, and many other students for a careful reading and feedback.
Contents
1 Probability Review
  1.1 Sample Spaces
  1.2 Conditional Probability and Independence
  1.3 Density Functions
  1.4 Expected Value
  1.5 Variance
  1.6 Joint, Marginal, and Conditional Distributions
  1.7 Bayes’ Rule
    1.7.1 Model Given Data
  1.8 Bayesian Inference
  Exercises
  3.5 Rank
  3.6 Square Matrices and Properties
  3.7 Orthogonality
  Exercises
5 Linear Regression
  5.1 Simple Linear Regression
  5.2 Linear Regression with Multiple Explanatory Variables
  5.3 Polynomial Regression
  5.4 Cross-Validation
    5.4.1 Other Ways to Evaluate Linear Regression Models
  5.5 Regularized Regression
    5.5.1 Tikhonov Regularization for Ridge Regression
    5.5.2 Lasso
    5.5.3 Dual Constrained Formulation
    5.5.4 Matching Pursuit
  Exercises
8 Clustering
  8.1 Voronoi Diagrams
    8.1.1 Delaunay Triangulation
    8.1.2 Connection to Assignment-Based Clustering
  8.2 Gonzalez’s Algorithm for k-Center Clustering
  8.3 Lloyd’s Algorithm for k-Means Clustering
    8.3.1 Lloyd’s Algorithm
    8.3.2 k-Means++
    8.3.3 k-Medoid Clustering
    8.3.4 Soft Clustering
  8.4 Mixture of Gaussians
    8.4.1 Expectation-Maximization
  8.5 Hierarchical Clustering
9 Classification
  9.1 Linear Classifiers
    9.1.1 Loss Functions
    9.1.2 Cross-Validation and Regularization
  9.2 Perceptron Algorithm
  9.3 Support Vector Machines and Kernels
    9.3.1 The Dual: Mistake Counter
    9.3.2 Feature Expansion
    9.3.3 Support Vector Machines
  9.4 Learnability and VC Dimension
  9.5 kNN Classifiers
  9.6 Decision Trees
  9.7 Neural Networks
    9.7.1 Training with Back-propagation
  Exercises
Index