
Machine learning: An Overview

Seth Flaxman
Department of Mathematics and Data Science Institute
2 July 2019

About me
(http://www.sethrf.com)

[Photo; credit: Palsson on Flickr]
What is machine learning?

https://www.youtube.com/watch?v=f_uwKZIAeM0

What is machine learning?

For our purposes, follow Tom Mitchell:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

For example: T might be classifying emails as spam or not, P the classification accuracy, and E a corpus of emails labeled by users.
What is statistical machine learning?

Both computer science and statistics provide methods for learning from data.

- Computer science takes an algorithmic perspective: propose an algorithm for data, study the algorithm formally.
- Statistics takes an inferential perspective: propose a model for data, study the model formally.
- Statistical machine learning (and computational statistics) is the intersection: an algorithmic perspective on statistical methods, and a statistical perspective on algorithms.
Teaching staff

- Dr Seth Flaxman, Lecturer in Machine Learning and Big Data Analytics, Department of Mathematics and Data Science Institute
- Mariana Clare, PhD student in Mathematics of Planet Earth CDT
- Adriaan Hilberts, PhD student in Mathematics of Planet Earth CDT
- Jonathan Ish-Horowicz, PhD student in Theoretical Systems Biology and Bioinformatics
- Lekha Patel, PhD student in Statistics in the Department of Mathematics
- Tim Wolock, PhD student in Statistics in the Department of Mathematics
Schedule

- Today: lecture 10am-1pm, 2pm-3pm; problem class 3pm-4pm or 4pm-5pm
- Wednesday: problem class 10am-11am; lecture 11am-1pm, 2pm-4pm
- Thursday: lecture 10am-1pm, 2pm-3pm; problem class 3pm-4pm or 4pm-5pm
Assessment

- Test on Friday at 2pm (20 questions, 1 hour)
Learning Objectives
By the end of this module, students should be able to:
- Understand what machine learning is and how it relates to statistics and computer science
- Work with relevant calculus and linear algebra (gradients, vectors, matrices, norms)
- Understand linear models, loss functions, and regularization
- Understand bias vs. variance and be able to assess how various models and hyperparameters will increase/decrease each
- Be able to characterize algorithms as appropriate for supervised vs. unsupervised learning
- Be familiar with the basic ideas of support vector machines, decision trees, and random forests
- Be able to critically reflect on ethical issues in the application of machine learning
- Demonstrate familiarity with neural networks and deep learning
Supervised vs. unsupervised learning: terminology

- Supervised learning, also known as: regression, classification, pattern recognition, recovery, sensing, ...
- Unsupervised learning, also known as: clustering, data mining, dimensionality reduction, ...
- Inputs, also known as: independent variables, predictors, covariates, patterns, x, X, ...
- Outputs, also known as: dependent variables, responses, labels, y, Y, ...

x → f → y
Supervised learning

Supervised learning, most basic setup

x → f → y

Given training inputs x ∈ X and outputs y ∈ Y:

  (x_i, y_i), i = 1, ..., n   (1)

learn a function (algorithm, black box, decision rule, classifier, probability distribution)

  f : X → Y   (2)

i.e. on the training inputs, we would like our function f to approximately recover the training outputs:

  f(x_i) ≈ y_i   (3)
Unsupervised learning, clustering and dim. reduction

Given training inputs x ∈ X, learn:

- Clustering: a function f giving cluster assignments 1, ..., K

    f(x) ∈ {1, ..., K}   (4)

  such that C_k = {x_i | f(x_i) = k} is homogeneous for each k.

- Dimensionality reduction: if X ∈ R^p for large p, learn a latent representation Z ∈ R^d, d ≪ p, such that Z explains most of the variance in X.
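To make the dimensionality-reduction setting concrete, here is a minimal sketch in Python with NumPy (our own illustration, not from the slides; PCA is one standard method among several, and the helper name pca is ours):

    import numpy as np

    def pca(X, d):
        # Project n points in R^p onto the top-d directions of variance.
        Xc = X - X.mean(axis=0)                          # center the data
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        Z = Xc @ Vt[:d].T                                # latent representation Z in R^d
        explained = (S[:d] ** 2).sum() / (S ** 2).sum()  # fraction of variance kept
        return Z, explained

The returned Z plays the role of the latent representation above, and explained reports how much of the variance in X it captures.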
Supervised learning: k-nearest neighbors

[Figures: k-nearest-neighbors classification illustrated step by step on example data]
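Since the figures do not survive in this transcript, here is a minimal k-nearest-neighbors sketch in Python with NumPy (our own illustration; the toy data and the helper name knn_predict are invented for the example):

    import numpy as np

    def knn_predict(X_train, y_train, X_test, k=3):
        # Classify each test point by majority vote among its k nearest
        # training points, using Euclidean distance.
        preds = []
        for x in X_test:
            dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
            nearest = y_train[np.argsort(dists)[:k]]      # labels of the k closest
            preds.append(np.bincount(nearest).argmax())   # majority vote
        return np.array(preds)

    # Toy data: two well-separated Gaussian clusters (invented for illustration).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
    y = np.array([0] * 20 + [1] * 20)
    print(knn_predict(X, y, np.array([[0.0, 0.0], [4.0, 4.0]]), k=3))  # expect [0 1]

Note that f here is defined implicitly by the training data itself: making a prediction requires scanning all n training points.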
Unsupervised learning: k-means clustering

[Figures: k-means clustering illustrated step by step (initialization, assignment, and update steps)]
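Again as a sketch of the idea behind the missing figures, here is a minimal k-means (Lloyd's algorithm) implementation in Python with NumPy (our own illustration; kmeans and its defaults are invented for the example):

    import numpy as np

    def kmeans(X, K, n_iters=50, seed=0):
        # Lloyd's algorithm: alternate between assigning points to the
        # nearest centroid and moving each centroid to its cluster mean.
        # X is assumed to be a float array of shape (n, p).
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
        for _ in range(n_iters):
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)                 # assignment step
            for k in range(K):
                if np.any(labels == k):                   # skip empty clusters
                    centroids[k] = X[labels == k].mean(axis=0)
        return labels, centroids

Each iteration can only decrease the within-cluster sum of squares, so the algorithm converges, though only to a local optimum that depends on the initialization.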
Supervised learning: further considerations

- Loss function: the standard choice in regression is squared error (L2) loss:

    L(x, y, f) := (y − f(x))²   (5)

- The standard choice in classification is the misclassification rate (1 − accuracy):

    L(x, y, f) := 1 − I(y = f(x))   (6)

Loss is bad: you want to avoid loss, so smaller loss is better! (Some losses are always positive; others can be positive or negative.)
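In code, these two losses are one-liners; a small sketch assuming NumPy arrays (the function names are ours):

    import numpy as np

    def squared_error_loss(y, f_x):
        # Regression loss, eq. (5).
        return (y - f_x) ** 2

    def misclassification_loss(y, f_x):
        # 0/1 loss, eq. (6): 0 when correct, 1 when wrong.
        return 1.0 - (y == f_x)

    y_true = np.array([1, 0, 1, 1])
    y_pred = np.array([1, 1, 1, 0])
    print(misclassification_loss(y_true, y_pred).mean())  # misclassification rate: 0.5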
Supervised learning: further considerations

Quiz: what value of k for k-nearest neighbors gives training loss = 0? Does this make sense?

- Risk: expected loss

    R(f) := E_{X,Y}[L(x, y, f)]   (7)

- Empirical risk: average over the data, e.g. "ordinary least squares":

    R̂(f) := Σ_{i=1}^n (y_i − f(x_i))²   (8)

  (Written here as a sum rather than an average; dividing by n does not change the minimizer.)
An algorithmic vs. statistical perspective

- K-nearest neighbors and k-means clustering are algorithms for handling data
- Algorithmic questions: what is their time complexity in terms of p and n? Storage complexity?
- Statistical perspective: can the performance of either algorithm be analyzed with reference to an underlying probabilistic model?
- Statistical questions: what kind of performance do we expect on unseen data (generalization)? How does performance vary with n and p? How robust is the model to outliers?
The curse of dimensionality (Bellman 1961)

As p increases, all points are about equally distant from one another:

[Figure: fraction of nearby points vs. dimension p (x-axis: p from 5 to 20; y-axis: "fraction close", from 0.0 to 0.4)]
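One way to reproduce a plot like this numerically, as a sketch under our own assumptions (points drawn uniformly on [0,1]^p, with an arbitrary closeness threshold of 0.5):

    import numpy as np

    rng = np.random.default_rng(0)
    n, threshold = 200, 0.5
    for p in [1, 2, 5, 10, 20]:
        X = rng.uniform(size=(n, p))                               # n points uniform on [0,1]^p
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # all pairwise distances
        close = D[np.triu_indices(n, k=1)] < threshold             # each unordered pair once
        print(f"p={p:2d}: fraction of pairs closer than {threshold}: {close.mean():.3f}")

As p grows, the fraction of "close" pairs collapses toward zero, which is exactly why nearest-neighbor methods struggle in high dimensions.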
Linear regression as a statistical machine learning method

Given (x_i, y_i), i = 1, ..., n we consider fitting a linear model:

  f(x) = α + βx   (9)

Finding f in this case means finding values for α and β.

- Algorithmic perspective: assuming squared error loss, find α and β to minimize the empirical risk:

    R̂(f) := Σ_{i=1}^n (y_i − f(x_i))²   (10)

Closed-form solutions exist for α̂ and β̂ which minimize R̂(f). Exercise: find them! Hint: you will need to solve ∇_β R̂(f) = 0 and ∇_α R̂(f) = 0.
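If you want to check your derivation numerically, here is a sketch (the formulas below are the standard ordinary least squares solution, so treat this as a spoiler for the exercise; the toy data are invented):

    import numpy as np

    def fit_linear(x, y):
        # Closed-form minimizers of R̂(f) in eq. (10).
        beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
        alpha = y.mean() - beta * x.mean()
        return alpha, beta

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=100)
    y = 2.0 + 3.0 * x + rng.normal(0, 1, size=100)  # true alpha = 2, beta = 3
    print(fit_linear(x, y))                          # should be close to (2, 3)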
Linear regression as a statistical machine learning method

Given (x_i, y_i), i = 1, ..., n we consider fitting a linear model:

  f(x) = α + βx   (11)

Finding f in this case means finding values for α and β.

- Statistical perspective: assume that the errors are iid N(0, σ²), or equivalently:

    p(y|x) = N(f(x), σ²)   (12)

- Use maximum likelihood to estimate α̂ and β̂.
- The statistical and algorithmic perspectives coincide! (Maximizing the Gaussian log-likelihood amounts to minimizing Σ_{i=1}^n (y_i − f(x_i))², the same squared-error objective.)
Linear regression as a statistical machine learning method

- Closed-form optima aren't always available: we then need some sort of optimization method (e.g. gradient descent) to learn the parameters of a model.
- Many machine learning papers back in the day contained pages of math deriving gradients.
- It is more common these days to rely on autodifferentiation methods (see the deep learning revolution).
- Distinguish parameters (usually fit with optimization) from hyperparameters (usually learned by cross-validation).
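A minimal gradient-descent sketch for the same linear model (our own illustration; the learning rate and step count are arbitrary choices that happen to work on data scaled like the example above):

    import numpy as np

    def gd_linear(x, y, lr=0.01, n_steps=5000):
        # Minimize the empirical risk of eq. (10) by gradient descent
        # instead of using the closed form.
        alpha, beta = 0.0, 0.0
        for _ in range(n_steps):
            resid = y - (alpha + beta * x)
            alpha -= lr * (-2.0 * resid.mean())       # ∂R̂/∂α, scaled by 1/n
            beta -= lr * (-2.0 * (resid * x).mean())  # ∂R̂/∂β, scaled by 1/n
        return alpha, beta

In a modern framework the two gradient lines would be produced by autodifferentiation rather than derived by hand.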
A quick tour of classic supervised learning methods

- k-nearest neighbors [Friedman, Tibshirani, Hastie 2009]
- Linear regression
- Naive Bayes [Mitchell 1997]
- Logistic regression
- Linear Discriminant Analysis
- Support Vector Machines (SVMs) [Scholkopf and Smola 2002]
- Gaussian process regression and classification [Rasmussen and Williams 2006]
- Neural networks [Goodfellow, Bengio, Courville 2016]
- Random forests [Breiman 2001]
- Probabilistic Graphical Models [Murphy 2012]
A quick tour of classic unsupervised learning methods

- k-means clustering [Friedman, Tibshirani, Hastie 2009]
- Spectral clustering [von Luxburg 2007]
- Principal Components Analysis
- Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]
- Gaussian Mixture Models
- Neural networks, especially VAEs and GANs [Goodfellow, Bengio, Courville 2016]
