Theoretical Foundations of Machine Learning
Vianney Perchet
29th January 2024
Lecture 1/12
Structure of the course
12 lectures of 1h30 + 5 practical sessions (TP) + 3 tutorials (TD) of 1h30
1. Introduction
2. Plug-in methods & over/under-fitting
3. Model selection & penalization
4. Empirical Risk Minimization
5. Decision Trees & Random Forest
6. Neural Nets & Deep Learning (2 sessions)
7. Transformers, implicit regularization, double descent
8. Reinforcement learning
9. Clustering & PCA
10. Ethics: Privacy and Fairness (2 sessions)
Machine Learning is everywhere
• Image Recognition
• Web search
• Recommendation
• Advertisement
• Scoring
• Market Segmentation
• Translation
• Speech Recognition
• Self-Driving Cars...
• Healthcare
• Generative AI (text - ChatGPT, image - DALL-E...)
4 typical ML tasks
1. Supervised Learning
• Data Xi ∈ X (images) have labels Yi ∈ Y (cat/dog)
• Predict the label of future/new/unseen data X
• Examples: digit classification, advertisement, speech recognition,...
2. Unsupervised Learning
• Data Xi ∈ X (users) are just vectors without labels
• Find some “structure”
• Ex.: small groups (clustering) or ambient space (dimension reduction)
3. Reinforcement Learning
• The learner affects and interacts with the environment
• Examples: robots, self-driving cars, drones
• What you see depends on what you do
4. Generative AI
• Dataset (Xi, Yi) ∈ X × Y (images, labels)
• Create new data (X′j, Y′j) that “looks like” the original data
The prediction task of supervised learning
• “Attributes/Features” space X ⊂ Rd & “label” space Y ⊂ R
• Training data-set: Dn = {(X1, Y1), . . . , (Xn, Yn)}
• Future data ≃ Past data
• (Xi, Yi) are i.i.d. of unknown joint law P on X × Y
From Dn, predict the “probable” label Yn+1 of Xn+1
• Predictor f : X → Y
• If Y = {0, 1}: classifier
• If Y = R: regressor & scoring rule
Performance of a predictor
• Risk based on some “local” loss ℓ : Y × Y → R+
• Cost of predicting Y′ instead of Y
• 0-1 loss: ℓ(Y, Y′) = 1{Y ̸= Y′} (classification)
• quad-loss: ℓ(Y, Y′) = ∥Y − Y′∥2 (linear regression)
• logistic-loss: ℓ(Y, Y′) = log(1 + exp(−YY′)) (logistic reg.)
Risk: R(f) = E(X,Y)∼P[ℓ(f(X), Y)]
• Optimal risk and Bayes predictor
f∗ = arg minf R(f) and R∗ = R(f∗)
• Remark: R(f) cannot be evaluated.
• Test set D′m = {(X′1, Y′1), . . . , (X′m, Y′m)} INDEPENDENT from the training set
R(f) ≃ (1/m) Σi=1,...,m ℓ(f(X′i), Y′i) thanks to the CLT
• Recommendation: 80% of the data in the training set, 20% in the test set
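As an illustration of the test-set estimate above, here is a minimal numpy sketch (my own example, not course code) of an 80/20 split and the empirical risk under the 0-1 loss:

```python
import numpy as np

def train_test_split(X, Y, test_ratio=0.2, seed=0):
    """Shuffle the data-set and hold out a fraction of it as an independent test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(len(X) * (1 - test_ratio))
    return X[idx[:cut]], Y[idx[:cut]], X[idx[cut:]], Y[idx[cut:]]

def empirical_risk_01(f, X_test, Y_test):
    """(1/m) sum_i 1{f(X'_i) != Y'_i}, an estimate of R(f) for the 0-1 loss."""
    return np.mean(f(X_test) != Y_test)
```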
Optimal/Bayes predictor
η(x) = E[Y|X = x]
• Linear Regression
• ℓ(y, y′) = (y − y′)2 with y ∈ R
• “closed” form Bayes regressor
f∗(x) = η(x)
• Binary Classification
• ℓ(y, y′) = 1{y ̸= y′}
• “closed” form Bayes classifier
f∗(x) = 1{η(x) ≥ 1/2}
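To make the Bayes classifier concrete, here is a toy example with a synthetic distribution where η is known exactly (η(x) = x for X uniform on [0, 1]); the empirical risk should be close to the Bayes risk R∗ = E[min(η(X), 1 − η(X))] = 1/4 for this distribution:

```python
import numpy as np

def eta(x):
    """Regression function eta(x) = P(Y = 1 | X = x); here simply eta(x) = x."""
    return x

def bayes_classifier(x):
    """f*(x) = 1{eta(x) >= 1/2}: predict the more probable label."""
    return (eta(x) >= 0.5).astype(int)

rng = np.random.default_rng(0)
X = rng.uniform(size=100_000)
Y = rng.binomial(1, eta(X))

print("empirical 0-1 risk:", np.mean(bayes_classifier(X) != Y))  # close to R* = 1/4
```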
Refined losses. Type I/II, Precision / Recall
• “Unbalanced” data (almost only 0’s) or effect (0 = credit fraud)
Predicting 0 instead of 1 is way worse than 1 instead of 0
• Precision: ♯{j : Yj = 1 and f(Xj) = 1} / ♯{i : f(Xi) = 1} ≃ P{Y = 1 | f(X) = 1}
• Recall: ♯{j : Yj = 1 and f(Xj) = 1} / ♯{i : Yi = 1} ≃ P{f(X) = 1 | Y = 1}
• Not “local” losses but global
• more difficult to control
• (in theory as in practice)
• Many variants
• False Discovery Rate P{Y = 0 | f(X) = 1}
• F1 score = 2 · Precision · Recall / (Precision + Recall)
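A short sketch (a hypothetical helper, assuming 0/1 numpy label arrays) computing precision, recall and the F1 score exactly as defined above:

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision ~ P(Y=1 | f(X)=1), Recall ~ P(f(X)=1 | Y=1), F1 = their harmonic mean."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    precision = tp / max(np.sum(y_pred == 1), 1)
    recall = tp / max(np.sum(y_true == 1), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1
```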
Scoring Rules - Area Under the Curve
• Training data-set: Dn = {(X1, Y1), . . . , (Xn, Yn)}
• Score f : X → R
• Threshold θ ∈ R. Above θ, a user is “accepted” (below, rejected)
• Tuning θ balances True Positive Rate vs False Positive Rate
• TPR: P{f(X) = 1|Y = 1}; FPR: P{f(X) = 1|Y = 0}
• High θ: few users, pretty confident
• Low θ: many users, low confidence
ROC: True positives as a function of False positives
• Parameterized by θ: the higher the ROC the better
AUC: Area Under the ROC Curve
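A rough sketch of how the ROC curve and the AUC can be computed by sweeping the threshold θ over the observed scores (assuming numpy arrays of real-valued scores and 0/1 labels):

```python
import numpy as np

def roc_auc(scores, y_true):
    """True/False Positive Rates as theta decreases, and the area under the ROC curve."""
    thresholds = np.sort(np.unique(scores))[::-1]  # from high theta (strict) to low theta
    tpr = [np.mean(scores[y_true == 1] >= t) for t in thresholds]
    fpr = [np.mean(scores[y_true == 0] >= t) for t in thresholds]
    fpr, tpr = np.array([0.0] + fpr + [1.0]), np.array([0.0] + tpr + [1.0])
    auc = np.trapz(tpr, fpr)  # trapezoidal area under the piecewise-linear ROC curve
    return fpr, tpr, auc
```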
Unsupervised learning
• “Attributes/Features” space X ⊂ Rd but no labels
• Data-set: Dn = {X1, . . . , Xn}; the Xi might not be i.i.d.!
Find a “good small dimension” representation of Dn
• Clustering: regroup Dn into k “groups” of points
• How to choose k?
• Possible metrics
• Low intracluster distance (average distance within a cluster)
• High intercluster distance (average distance between 2 different clusters)
• Dimension reduction: project the Xi on a d′-dimensional linear subspace (d′ < d)
• How to choose d′?
• Metric? Average distance between a point and its projection
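As a concrete instance of the clustering task, here is a minimal k-means (Lloyd's algorithm) sketch; this is an illustration under the metrics stated above, not the course's reference implementation:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center (drives intracluster distances down).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each center as the mean of its cluster (keep old center if empty).
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers
```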
Statistics vs. ML: Linear Regression
• Dn = {(Xi, Yi); i = 1, . . . , n}, with Xi ∈ Rd and Yi ∈ R
with quadratic loss: ℓ(y, y′) = |y − y′|2
• Local methods too slow. What about global methods?
• Linear predictor: fβ(x) = β⊤x with β ∈ Rd
• Best linear pred. β∗ = arg minβ E[ℓ(Y, β⊤X)]
R(fβ) − R(f∗) = [R(fβ) − R(fβ∗)] + [R(fβ∗) − R(f∗)]
              = Estimation Error + Approximation Error
• Empirical error: R̂(fβ) = (1/n) Σi=1,...,n (Yi − β⊤Xi)2 = (1/n) ∥Y − Xβ∥2
Closed form: β̂ = (X⊤X)−1 X⊤Y
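A minimal numpy sketch of the closed-form estimator β̂ = (X⊤X)−1X⊤Y and of the empirical quadratic risk (solving the normal equations rather than inverting X⊤X explicitly, a standard numerical choice):

```python
import numpy as np

def ols(X, Y):
    """Least-squares estimate: solve (X^T X) beta = X^T Y for beta."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

def empirical_quadratic_risk(beta, X, Y):
    """(1/n) sum_i (Y_i - beta^T X_i)^2."""
    return np.mean((Y - X @ beta) ** 2)
```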
Statistics vs. ML. Pros/Cons
• Statistics. The model is correct (Approx. Error = 0)
• Can compute the law of the residuals Yi − β̂⊤Xi and of ∥β̂ − β∗∥2
• Machine Learning. The model is incorrect (Approx. Error > 0)
• Can add/create features (X²i, 2Xi + 3Xj, . . . ), as illustrated after this list
• ✓ Pros
• Simple: closed form solution & easily generalizable
• Good first approximation
• Rather intuitive
• ✗ Cons
• Potentially huge approximation error
• Non-robust to outliers (high generalization error)
• Makes sense only for Y = R, not for Y = {0, 1}
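A small illustration of the “add/create features” point: a hypothetical helper that augments X with squares and pairwise products of its columns, after which the same closed-form regression applies:

```python
import numpy as np

def add_quadratic_features(X):
    """Augment X (shape (n, d)) with squared columns and pairwise products."""
    n, d = X.shape
    crosses = [X[:, i] * X[:, j] for i in range(d) for j in range(i + 1, d)]
    return np.column_stack([X, X ** 2] + crosses)
```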
Logistic Regression
• Most datasets: Y = {0, 1} & η(x) = P(Y = 1|X = x)
Linear regression outputs a “probability” in R...
Logistic Reg.: ηβ(x) = exp(β⊤x) / (1 + exp(β⊤x))
• Maximize the log-likelihood, i.e. minimize the empirical “log-loss”
log-loss(β) = (1/n) Σi=1,...,n log(1 + exp(−Ỹi β⊤Xi)) with Ỹi = 2Yi − 1
Logistic Regression. The upsides
log-loss(β) = (1/n) Σi=1,...,n log(1 + exp(−Ỹi β⊤Xi)) with Ỹi = 2Yi − 1
• ✓ log2(1 + exp(u)): smooth & convex surrogate of the 0-1 loss
• ✓ log-loss(·) is convex and differentiable.
• Can be optimized
• ∇log-loss(β) = −(1/n) Σi Ỹi Xi / (1 + exp(Ỹi β⊤Xi))
• Unbiased stochastic grad.: ∇̂l.-l.(β) = −Ỹi∗ Xi∗ / (1 + exp(Ỹi∗ β⊤Xi∗)) with i∗ random
• ✓ ✓ Works very well in practice (most of “AI” is log. regression)
• ✗ Cons: no closed form for β̂; needs computational power
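A minimal sketch of stochastic gradient descent on the log-loss above, using the unbiased one-sample gradient (the constant step size and step count are arbitrary choices for illustration):

```python
import numpy as np

def sgd_logistic(X, Y, n_steps=10_000, step=0.1, seed=0):
    """Minimize log-loss(beta) with one-sample stochastic gradients.

    X has shape (n, d); Y contains 0/1 labels, relabelled to Y_tilde in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Y_tilde = 2 * Y - 1
    beta = np.zeros(d)
    for _ in range(n_steps):
        i = rng.integers(n)  # i* uniform, so the gradient estimate is unbiased
        grad = -Y_tilde[i] * X[i] / (1 + np.exp(Y_tilde[i] * X[i] @ beta))
        beta -= step * grad
    return beta

def predict_proba(beta, X):
    """eta_beta(x) = exp(beta^T x) / (1 + exp(beta^T x))."""
    return 1.0 / (1.0 + np.exp(-X @ beta))
```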
Take home message - most important settings
• “Attributes/Features” space X ⊂ Rd & “label” space Y ⊂ R
• Training data-set: Dn = {(X1, Y1), . . . , (Xn, Yn)}
• Risk w.r.t. loss ℓ : Y × Y → R+
Risk: R(f) = E(X,Y)∼P[ℓ(f(X), Y)]
• Optimal risk and Bayes predictor
f∗ = arg minf R(f) and R∗ = R(f∗)
• Restricted class of predictors/classifiers: {fβ ; β ∈ B}
R(fβ) − R(f∗) = [R(fβ) − R(fβ∗)] + [R(fβ∗) − R(f∗)]
              = Estimation Error + Approximation Error