Machine Learning: An Overview
Seth Flaxman
Department of Mathematics and
Data Science Institute
2 July 2019
About me
[Photo credit: Palsson on Flickr; source: [Link]]
What is machine learning?
[Link]
What is machine learning?
For our purposes, follow Tom Mitchell:
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."
What is statistical machine learning?
Both computer science and statistics provide methods for learning from data.
Computer science takes an algorithmic perspective: propose an algorithm for the data, study the algorithm formally.
Statistics takes an inferential perspective: propose a model for the data, study the model formally.
Statistical machine learning (and computational statistics) is the intersection: an algorithmic perspective on statistical methods, and a statistical perspective on algorithms.
Teaching staff
- Dr Seth Flaxman, Lecturer in Machine Learning and Big Data Analytics, Department of Mathematics and Data Science Institute
- Mariana Clare, PhD student in Mathematics of Planet Earth CDT
- Adriaan Hilberts, PhD student in Mathematics of Planet Earth CDT
- Jonathan Ish-Horowicz, PhD student in Theoretical Systems Biology and Bioinformatics
- Lekha Patel, PhD student in Statistics in the Department of Mathematics
- Tim Wolock, PhD student in Statistics in the Department of Mathematics
Schedule
- Today: lecture 10am-1pm and 2pm-3pm; problem class 3pm-4pm or 4pm-5pm
- Wednesday: problem class 10am-11am; lecture 11am-1pm and 2pm-4pm
- Thursday: lecture 10am-1pm and 2pm-3pm; problem class 3pm-4pm or 4pm-5pm
Assessment
- Test on Friday at 2pm (20 questions, 1 hour)
Learning Objectives
By the end of this module, students should be able to:
- Understand what machine learning is and how it relates to statistics and computer science
- Work with relevant calculus and linear algebra (gradients, vectors, matrices, norms)
- Understand linear models, loss functions, and regularization
- Understand bias vs. variance and assess how various models and hyperparameters will increase or decrease each
- Characterize algorithms as appropriate for supervised vs. unsupervised learning
- Be familiar with the basic ideas of support vector machines, decision trees, and random forests
- Critically reflect on ethical issues in the application of machine learning
- Demonstrate familiarity with neural networks and deep learning
Supervised vs. unsupervised learning: terminology
- Supervised learning, also known as: regression, classification, pattern recognition, recovery, sensing, ...
- Unsupervised learning, also known as: clustering, data mining, dimensionality reduction, ...
- Inputs, also known as: independent variables, predictors, covariates, patterns, x, X, ...
- Outputs, also known as: dependent variables, responses, labels, y, Y, ...

x → f → y
Supervised learning
Supervised learning, most basic setup

x → f → y

Given training inputs x ∈ X and outputs y ∈ Y:
    (x_i, y_i), i = 1, ..., n    (1)
Learn a function (algorithm, black box, decision rule, classifier, probability distribution)
    f : X → Y    (2)
i.e. on the training inputs, we would like our function f to approximately recover the training outputs:
    f(x_i) ≈ y_i    (3)
Unsupervised learning, clustering and dim. reduction
Given training inputs x ∈ X, learn:
- Clustering: a function f giving cluster assignments 1, ..., K,
      f(x) ∈ {1, ..., K}    (4)
  such that C_k = {x_i | f(x_i) = k} is homogeneous for each k.
- Dimensionality reduction: if X ∈ R^p for large p, learn a latent representation Z ∈ R^d, d ≪ p, such that Z explains most of the variance in X (a PCA sketch follows below).
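To make the dimensionality-reduction idea concrete, here is a minimal sketch (not from the slides) that computes a PCA-style latent representation Z via the singular value decomposition; the data dimensions and noise level are invented for illustration:

```python
# Minimal sketch (illustrative only): dimensionality reduction with PCA via the SVD.
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 200, 10, 2                                     # n points in R^p, reduced to R^d
X = rng.normal(size=(n, 2)) @ rng.normal(size=(2, p))    # data lying near a 2-d subspace of R^p
X = X + 0.05 * rng.normal(size=(n, p))                   # plus a little noise

Xc = X - X.mean(axis=0)                                  # centre the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:d].T                                        # latent representation Z in R^d

explained = (S[:d] ** 2).sum() / (S ** 2).sum()
print(f"fraction of variance explained by {d} components: {explained:.3f}")
```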
Supervised learning: k-nearest neighbors
[Figure: k-nearest neighbors illustration]
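A minimal NumPy sketch (illustrative only, not the lecture's implementation) of k-nearest-neighbour classification by majority vote; the toy data and the helper knn_predict are invented for this example:

```python
# Minimal sketch (illustrative only): k-nearest-neighbour classification.
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Predict a label for each test point by majority vote among its k nearest training points."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean distance to every training point
        nearest = np.argsort(dists)[:k]                   # indices of the k closest points
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])           # majority vote
    return np.array(preds)

# toy example with two well-separated classes
rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
print(knn_predict(X_train, y_train, np.array([[0.1, 0.2], [2.9, 3.1]]), k=5))  # expect [0 1]
```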
Unsupervised learning: k-means clustering
[Figure: k-means clustering illustration]
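A minimal sketch (illustrative only) of Lloyd's algorithm, the standard alternating procedure behind k-means; the toy data and the kmeans helper are invented for this example, and edge cases such as empty clusters are not handled:

```python
# Minimal sketch (illustrative only): Lloyd's algorithm for k-means clustering.
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # initialise centres at random data points
    for _ in range(n_iters):
        # assignment step: each point joins its nearest centre
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centre moves to the mean of its assigned points
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
labels, centers = kmeans(X, K=2)
print(centers)   # roughly [0, 0] and [5, 5]
```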
Supervised learning: further considerations
- Loss function: the standard choice in regression is squared error (L2) loss:
      L(x, y, f) := (y − f(x))^2    (5)
- The standard choice in classification is the misclassification rate (1 − accuracy):
      L(x, y, f) := 1 − I(y = f(x))    (6)
Loss is bad: you want to avoid loss, so smaller loss is better! (Some losses are always positive, others can be positive or negative.) Both losses are illustrated in the sketch below.
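As a quick illustration (not from the slides), the two loss functions above written out in NumPy on made-up values:

```python
# Minimal sketch (illustrative only): the two standard loss functions, evaluated pointwise.
import numpy as np

def squared_error_loss(y, y_hat):
    """L2 loss for regression: (y - f(x))^2."""
    return (y - y_hat) ** 2

def misclassification_loss(y, y_hat):
    """0-1 loss for classification: 1 - I(y = f(x))."""
    return 1.0 - (y == y_hat).astype(float)

print(squared_error_loss(np.array([1.0, 2.0]), np.array([1.5, 2.0])))    # [0.25 0.  ]
print(misclassification_loss(np.array([0, 1, 1]), np.array([0, 0, 1])))  # [0. 1. 0.]
```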
Supervised learning: further considerations
Quiz: what value of k for k-nearest neighbors gives training loss = 0? Does this make sense?
- Risk: expected loss
      R(f) := E_{X,Y}[L(x, y, f)]    (7)
- Empirical risk: average over the data, e.g. "ordinary least squares" (computed in the sketch below):
      R̂(f) := Σ_{i=1}^{n} (y_i − f(x_i))^2    (8)
  (The sum and the average differ only by a factor of 1/n, which does not change the minimizer.)
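A minimal sketch (illustrative only) of the empirical risk in (8), evaluated for a deliberately crude candidate function on toy data:

```python
# Minimal sketch (illustrative only): empirical risk as squared-error loss summed over the training data.
import numpy as np

def empirical_risk(f, X, y):
    """Sum of squared-error losses over the data, i.e. the ordinary least squares objective (8)."""
    return np.sum((y - f(X)) ** 2)

# toy data and a crude candidate f(x) = 2x
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 2.1, 3.9, 6.2])
print(empirical_risk(lambda x: 2 * x, X, y))
```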
An algorithmic vs. statistical perspective
- K-nearest neighbors and k-means clustering are algorithms for handling data.
- Algorithmic questions: what is their time complexity in terms of p and n? What is their storage complexity?
- Statistical perspective: can the performance of either algorithm be analyzed with reference to an underlying probabilistic model?
- Statistical questions: what kind of performance do we expect on unseen data (generalization)? How does performance vary with n and p? How robust is the model to outliers?
The curse of dimensionality (Bellman 1961)
As p increases, all points are about equally distant from one another:
[Figure: the fraction of points that are "close" shrinks as p grows; y-axis: fraction close (0.0 to 0.4), x-axis: p (5 to 20)]
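A small simulation sketch (illustrative only; the sample sizes are arbitrary) of the same phenomenon: as p grows, the nearest and farthest neighbours of a point become almost equally far away:

```python
# Minimal sketch (illustrative only): pairwise distances concentrate as the dimension p grows,
# so the ratio of the nearest to the farthest distance approaches 1.
import numpy as np

rng = np.random.default_rng(3)
n = 500
for p in [2, 10, 100, 1000]:
    X = rng.uniform(size=(n, p))                      # n points in the unit hypercube [0, 1]^p
    d = np.linalg.norm(X - X[0], axis=1)[1:]          # distances from the first point to all others
    print(f"p={p:5d}  nearest/farthest distance ratio: {d.min() / d.max():.3f}")
```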
Linear regression as a statistical machine learning method
Given (x_i, y_i), i = 1, ..., n, we consider fitting a linear model:
    f(x) = α + βx    (9)
Finding f in this case means finding values for α and β.
- Algorithmic perspective: assuming squared error loss, find α and β to minimize the empirical risk:
      R̂(f) := Σ_{i=1}^{n} (y_i − f(x_i))^2    (10)
  Closed-form solutions exist for α̂ and β̂ which minimize R̂(f). Exercise: find them! Hint: you will need to solve ∇_β R̂(f) = 0 and ∇_α R̂(f) = 0. (A numerical check follows below.)
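For reference, a minimal numerical check (a sketch with synthetic data, not from the slides) using the standard textbook simple-regression estimates β̂ = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)^2 and α̂ = ȳ − β̂x̄:

```python
# Minimal sketch (illustrative only): closed-form least-squares estimates for simple linear regression.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=100)
y = 1.5 + 2.0 * x + rng.normal(scale=0.5, size=100)   # true alpha = 1.5, beta = 2.0

xbar, ybar = x.mean(), y.mean()
beta_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
alpha_hat = ybar - beta_hat * xbar
print(alpha_hat, beta_hat)                            # should be close to 1.5 and 2.0

# cross-check against numpy's polynomial least-squares fit
print(np.polyfit(x, y, deg=1))                        # returns [beta_hat, alpha_hat]
```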
Linear regression as a statistical machine learning method
- Statistical perspective: assume that the errors are iid N(0, σ^2), or equivalently:
      p(y | x) = N(f(x), σ^2)    (11)
- Use maximum likelihood to estimate α̂ and β̂.
- The statistical and algorithmic perspectives coincide! (See the numerical check below.)
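A quick numerical check (a sketch using synthetic data and scipy.optimize, not from the slides) that maximising the Gaussian likelihood recovers the least-squares fit:

```python
# Minimal sketch (illustrative only): maximising the Gaussian log-likelihood recovers the least-squares fit.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=200)
y = 1.5 + 2.0 * x + rng.normal(scale=0.5, size=200)

def neg_log_likelihood(params):
    alpha, beta, log_sigma = params
    sigma = np.exp(log_sigma)                          # parameterise sigma > 0 via its log
    resid = y - (alpha + beta * x)
    return 0.5 * np.sum(resid ** 2) / sigma ** 2 + len(y) * np.log(sigma)

mle = minimize(neg_log_likelihood, x0=[0.0, 0.0, 0.0])
print("MLE:", mle.x[:2])                               # alpha_hat, beta_hat
print("OLS:", np.polyfit(x, y, deg=1)[::-1])           # same values, up to optimisation tolerance
```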
Linear regression as a statistical machine learning method
- Closed-form optima aren't always available: then we need some sort of optimization method (e.g. gradient descent) to learn the parameters of a model (a gradient-descent sketch follows below).
- Many machine learning papers back in the day contained pages of math deriving gradients.
- These days it is more common to rely on automatic differentiation (see the deep learning revolution).
- Distinction between parameters (usually fit with optimization) and hyperparameters (usually learned by cross-validation).
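A minimal sketch (illustrative only; the step size and iteration count are arbitrary choices) of gradient descent for the simple linear model:

```python
# Minimal sketch (illustrative only): gradient descent for f(x) = alpha + beta*x.
# We minimise the mean squared error (the empirical risk divided by n), which has the same
# minimiser as the sum but makes a fixed step size easier to choose.
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=200)
y = 1.5 + 2.0 * x + rng.normal(scale=0.5, size=200)    # true alpha = 1.5, beta = 2.0

alpha, beta = 0.0, 0.0
lr = 0.01                                              # step size (a hyperparameter)
for _ in range(10000):
    resid = y - (alpha + beta * x)
    grad_alpha = -2.0 * resid.mean()                   # d/d alpha of the mean squared error
    grad_beta = -2.0 * (resid * x).mean()              # d/d beta of the mean squared error
    alpha -= lr * grad_alpha
    beta -= lr * grad_beta

print(alpha, beta)                                     # approaches the closed-form least-squares fit
```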
A quick tour of classic supervised learning methods
- k-nearest neighbors [Friedman, Tibshirani, Hastie 2009]
- Linear regression
- Naive Bayes [Mitchell 1997]
- Logistic regression
- Linear Discriminant Analysis
- Support Vector Machines (SVMs) [Scholkopf and Smola 2002]
- Gaussian process regression and classification [Rasmussen and Williams 2006]
- Neural networks [Goodfellow, Bengio, Courville 2016]
- Random forests [Breiman 2001]
- Probabilistic Graphical Models [Murphy 2012]
A quick tour of classic unsupervised learning methods
- k-means clustering [Friedman, Tibshirani, Hastie 2009]
- Spectral clustering [von Luxburg 2007]
- Principal Components Analysis
- Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]
- Gaussian Mixture Models
- Neural networks, especially VAEs and GANs [Goodfellow, Bengio, Courville 2016]