
Theoretical Foundations of Machine Learning

Vianney Perchet
29th January 2024

Lecture 1/12
Structure of the course

12 lectures (1h30 each) + 5 lab sessions (TP) + 3 tutorials (TD), all 1h30

1. Introduction
2. Plug-in methods & over/under-fitting
3. Model selection & penalization
4. Empirical Risk Minimization
5. Decision Trees & Random Forest
6. Neural Nets & Deep Learning (2 sessions)
7. Transformers, implicit regularization, double descent
8. Reinforcement learning
9. Clustering & PCA
10. Ethics: Privacy and Fairness (2 sessions)

Machine Learning is everywhere

• Image Recognition
• Web search
• Recommendation
• Advertisement
• Scoring
• Market
Segmentation
• Translation
• Speech Recognition
• Self-Driving Cars...
• Healthcare
• Generative AI (text - ChatGPT, image - DALL·E, ...)
4 typical ML tasks

1. Supervised Learning
• Data Xi ∈ X (images) have labels Yi ∈ Y (cat/dog)
• Predict the label of future/new/unseen data X
• Examples: Digit classification, Advertisement, Speech Recognition,...
2. Un-supervised Learning
• Data Xi ∈ X (users) are just vectors without labels
• Find some “structure”
• Examples: small groups (clustering) or ambient space (dimension reduction)
3. Reinforcement Learning
• Learner affects and interacts with the environment
• Examples: Robots, driving cars, drone
• What you see depends on what you do
4. Generative AI
• Dataset (Xi, Yi) ∈ X × Y (images, labels)
• Create new data (X′j, Y′j) that “look like” the original data

The prediction task of supervised learning

• “Attributes/Features” space X ⊂ Rd & “label” space Y ⊂ R


• Training data-set: Dn = {(X1, Y1), . . . , (Xn, Yn)}
• Future data ≃ Past data
• (Xi, Yi) are i.i.d. with unknown joint law P on X × Y

From Dn, predict the “probable” label Yn+1 of a new point Xn+1

• Predictor f : X → Y
• If Y = {0, 1} : classifier
• If Y = R : regressor & scoring rule

Performance of a predictor

• Risk based on some “local” loss ℓ : Y × Y → R+


• Cost of predicting Y′ instead of Y
• 0-1 loss: ℓ(Y, Y′) = 1{Y ≠ Y′} (classification)
• quadratic loss: ℓ(Y, Y′) = ∥Y − Y′∥² (linear regression)
• logistic loss: ℓ(Y, Y′) = log(1 + exp(−YY′)) (logistic regression)

Risk: R(f) = E(X,Y)∼P [ℓ(f(X), Y)]

• Optimal risk and Bayes predictor


f∗ = arg min_f R(f) and R∗ = R(f∗)

• Remark: R(f) cannot be evaluated (P is unknown).
• Test set Dm = {(X′i, Y′i) : i = 1, . . . , m}, INDEPENDENT from the training set:

R(f) ≃ (1/m) Σ_{i=1}^m ℓ(f(X′i), Y′i) thanks to the CLT

• Recommendation: 80% of the data in the training set, 20% in the test set
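To make the train/test protocol concrete, here is a minimal numpy sketch (not from the slides): the synthetic distribution, the threshold-classifier family and the exact split are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: X uniform on [0, 1] and Y ~ Bernoulli(eta(X)) with eta(x) = x
n = 10_000
X = rng.uniform(0.0, 1.0, size=n)
Y = (rng.uniform(size=n) < X).astype(int)

# 80% / 20% train / test split
perm = rng.permutation(n)
n_train = int(0.8 * n)
train_idx, test_idx = perm[:n_train], perm[n_train:]

# "Fit" a simple threshold classifier f(x) = 1{x >= t} on the training data only
grid = np.linspace(0.0, 1.0, 21)
train_risks = [np.mean((X[train_idx] >= t) != Y[train_idx]) for t in grid]
t_hat = grid[int(np.argmin(train_risks))]

# Empirical risk on the independent test set with the 0-1 loss,
# i.e. (1/m) * sum_i 1{f(X'_i) != Y'_i}, which estimates R(f)
test_risk = np.mean((X[test_idx] >= t_hat) != Y[test_idx])
print(f"threshold {t_hat:.2f}, estimated R(f) = {test_risk:.3f}")  # about 0.25 here
```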


Optimal/Bayes predictor

η(x) = E[Y|X = x]

• Linear Regression
• ℓ(y, y′) = (y − y′)² with y ∈ R
• “closed” form Bayes regressor

f∗ (x) = η(x)

• Binary Classification
• ℓ(y, y′) = 1{y ≠ y′}
• “closed” form Bayes classifier

f∗(x) = 1{η(x) ≥ 1/2}
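A short justification of the 1/2 threshold, written out as a math block for completeness (standard argument, not spelled out on the slide):

```latex
% requires amsmath/amssymb
% Pointwise risk of a classifier f at x under the 0-1 loss:
%   predicting 1 costs P(Y = 0 | X = x) = 1 - eta(x),
%   predicting 0 costs P(Y = 1 | X = x) =     eta(x).
\[
  \mathbb{E}\big[\mathbf{1}\{f(X) \neq Y\} \mid X = x\big]
  = \begin{cases}
      1 - \eta(x) & \text{if } f(x) = 1,\\
      \eta(x)     & \text{if } f(x) = 0,
    \end{cases}
\]
% so predicting 1 is (weakly) better exactly when 1 - eta(x) <= eta(x),
% which gives f^*(x) = \mathbf{1}\{\eta(x) \ge 1/2\}.
```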

Refined losses. Type I/II, Precision / Recall

• “Unbalanced” data (almost only 0’s) or effect (0 = credit fraud)

Predicting 0 instead of 1 is much worse than predicting 1 instead of 0

• Precision: ♯{i : Yi = 1 and f(Xi) = 1} / ♯{i : f(Xi) = 1} ≃ P(Y = 1 | f(X) = 1)
• Recall: ♯{i : Yi = 1 and f(Xi) = 1} / ♯{i : Yi = 1} ≃ P(f(X) = 1 | Y = 1)
• Not “local” losses but global ones
• more difficult to control (both in theory and in practice)
• Many variants
• False Discovery Rate P{Y = 0|f(X) = 1}
• F1 score = 2 · Precision · Recall / (Precision + Recall)
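A small numpy sketch of precision, recall and F1 (the helper name and the toy labels are illustrative, not from the slides):

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels in {0, 1}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))    # true positives
    precision = tp / max(np.sum(y_pred == 1), 1)  # denominator: #{f(X_i) = 1}
    recall = tp / max(np.sum(y_true == 1), 1)     # denominator: #{Y_i = 1}
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1

# Toy example with rare positives (unbalanced data)
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
print(precision_recall_f1(y_true, y_pred))  # precision = recall = F1 = 2/3 here
```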

Scoring Rules - Area Under the Curve
• Training data-set: Dn = {(X1, Y1), . . . , (Xn, Yn)}
• Score f : X → R
• Threshold θ ∈ R: users scored above θ are “accepted” (below, rejected)
• Tuning θ balances True Positive Rate vs False Positive Rate
• TPR: P{f(X) ≥ θ | Y = 1}; FPR: P{f(X) ≥ θ | Y = 0}
• High θ: few users, pretty confident
• Low θ: many users, low confidence

ROC curve: True Positive Rate as a function of the False Positive Rate

• Parameterized by θ: the higher the ROC curve, the better

AUC: Area Under the ROC Curve
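A minimal numpy sketch of the ROC construction and the AUC, sweeping the threshold θ over the observed scores (function name and toy data are illustrative):

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC points and AUC for a real-valued score, sweeping the threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    thresholds = np.sort(np.unique(scores))[::-1]   # high theta -> low theta
    tpr = np.array([np.mean(scores[labels == 1] >= t) for t in thresholds])
    fpr = np.array([np.mean(scores[labels == 0] >= t) for t in thresholds])
    # Prepend the (0, 0) corner and integrate with the trapezoidal rule
    fpr = np.concatenate(([0.0], fpr))
    tpr = np.concatenate(([0.0], tpr))
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
    return fpr, tpr, auc

# Toy example: positives should tend to receive higher scores
labels = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])
print(roc_auc(scores, labels)[2])   # about 0.67 on this toy data
```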

Unsupervised learning

• “Attributes/Features” space X ⊂ Rd but no labels


• Data-set: Dn = {X1, . . . , Xn}; the Xi might not be i.i.d.!

Find a “good” low-dimensional representation of Dn

• Clustering: Regroup Dn into k “groups” of points


• How to choose k?
• Possible metrics:
• low intra-cluster distance (average distance within a cluster)
• high inter-cluster distance (average distance between two different clusters)
• Dimension reduction: project the Xi onto a d′-dimensional linear subspace (d′ < d)
• How to choose d′?
• Metric? Average distance between a point and its projection
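A minimal numpy sketch of the clustering part: plain Lloyd (k-means) iterations and the intra-cluster distance metric mentioned above; the toy data and function names are illustrative, not the course's reference implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations: assign points to the nearest center, then
    move each center to the mean of its cluster (assumes no cluster empties)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distances of every point to every center, shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def intra_cluster_distance(X, labels, centers):
    """Average distance of each point to its own center (should be low)."""
    return np.mean(np.linalg.norm(X - centers[labels], axis=1))

# Toy data: two well-separated blobs in R^2, so k = 2 is the natural choice
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(3.0, 0.3, (50, 2))])
labels, centers = kmeans(X, k=2)
print(intra_cluster_distance(X, labels, centers))
```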

Statistics vs. ML: Linear Regression

• Dn = {(Xi, Yi) : i = 1, . . . , n}, with Xi ∈ Rd and Yi ∈ R


with the quadratic loss ℓ(y, y′) = |y − y′|²
• Local methods are too slow. What about global methods?
• Linear predictor: fβ (x) = β ⊤ x with β ∈ Rd
• Best linear predictor: β∗ = arg min_β E[ℓ(Y, β⊤X)]

R(fβ) − R(f∗) = [R(fβ) − R(fβ∗)] + [R(fβ∗) − R(f∗)]
                (Estimation Error)  (Approximation Error)

• Empirical error: R̂(fβ) = (1/n) Σ_{i=1}^n (Yi − β⊤Xi)² = (1/n) ∥Y − Xβ∥²

Closed form: β̂ = (X⊤X)⁻¹ X⊤Y
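A minimal numpy sketch of this closed form on synthetic data (dimensions, true coefficients and noise level are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3

# Synthetic linear model: Y = X beta_true + noise
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=n)

# Closed-form least-squares estimate: beta_hat = (X^T X)^{-1} X^T Y
# (solve the normal equations rather than inverting X^T X explicitly)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Empirical quadratic risk of the fitted linear predictor
emp_risk = np.mean((Y - X @ beta_hat) ** 2)
print(beta_hat, emp_risk)
```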

Statistics vs. ML. Pros/Cons

• Statistics. The model is correct (Approx. Error = 0)


• Can compute the law of the residuals Yi − β̂⊤Xi and of ∥β̂ − β∗∥²
• Machine Learning The model is incorrect (Approx. Error > 0)
• Can add/create features (Xi², 2Xi + 3Xj, . . . )
• ✓ Pros
• Simple: closed form solution & easily generalizable
• Good first approximation
• Rather intuitive
• ✗ Cons
• Potential huge approximation error
• Non-robust to outliers (high generalization error)
• Makes sense only for Y = R, not for Y = {0, 1}

Logistic Regression

• Most datasets: Y = {0, 1} & η(x) = P(Y = 1|X = x)


Linear regression would output a “probability” anywhere in R...

Logistic regression: ηβ(x) = exp(β⊤x) / (1 + exp(β⊤x))

• Maximize the log-likelihood, i.e. minimize the empirical “log-loss”

log-loss(β) = (1/n) Σ_{i=1}^n log(1 + exp(−Ỹi β⊤Xi)) with Ỹi = 2Yi − 1
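A direct numpy transcription of this formula (labels in {0, 1} recoded to ±1; the toy inputs are arbitrary):

```python
import numpy as np

def log_loss(beta, X, Y):
    """Empirical log-loss for labels Y in {0, 1}, recoded to Y~ = 2Y - 1."""
    Y_tilde = 2 * Y - 1
    margins = Y_tilde * (X @ beta)                 # Y~_i * beta^T X_i
    return np.mean(np.log1p(np.exp(-margins)))     # (1/n) sum log(1 + exp(-m_i))

# Tiny usage example with arbitrary numbers
X = np.array([[1.0, 0.5], [-0.3, 2.0], [0.8, -1.0]])
Y = np.array([1, 0, 1])
print(log_loss(np.zeros(2), X, Y))   # equals log(2) when beta = 0
```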

Logistic Regression. The upsides

log-loss(β) = (1/n) Σ_{i=1}^n log(1 + exp(−Ỹi β⊤Xi)) with Ỹi = 2Yi − 1

• ✓ log₂(1 + exp(u)) is a smooth & convex surrogate of the 0-1 loss


• ✓ log-loss(·) is convex and differentiable
• It can therefore be optimized by gradient methods (see the sketch below)
• ∇log-loss(β) = −(1/n) Σ_{i=1}^n [Ỹi / (1 + exp(Ỹi β⊤Xi))] Xi
• Unbiased stochastic gradient: ∇̂log-loss(β) = −[Ỹi∗ / (1 + exp(Ỹi∗ β⊤Xi∗))] Xi∗ with i∗ drawn uniformly at random
• ✓✓ Works very well in practice (most of “AI” is logistic regression)
• ✗ Cons: no closed form for β̂; requires computational power
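A self-contained sketch of stochastic gradient descent on this log-loss, using the unbiased gradient above (toy data generated from a logistic model; learning rate and step count are illustrative, not tuned):

```python
import numpy as np

def fit_logistic_sgd(X, Y, lr=0.1, n_steps=5000, seed=0):
    """Minimize the empirical log-loss by stochastic gradient descent.
    Labels Y are in {0, 1} and recoded to Y~ = 2Y - 1 in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Y_tilde = 2 * Y - 1
    beta = np.zeros(d)
    for _ in range(n_steps):
        i = rng.integers(n)                        # i* drawn uniformly at random
        margin = Y_tilde[i] * (X[i] @ beta)
        # Unbiased stochastic gradient of the log-loss at (X_i, Y_i)
        grad = -Y_tilde[i] * X[i] / (1.0 + np.exp(margin))
        beta -= lr * grad
    return beta

# Toy data generated from a logistic model with a known beta
rng = np.random.default_rng(1)
n, d = 1000, 2
X = rng.normal(size=(n, d))
beta_true = np.array([2.0, -1.0])
p = 1.0 / (1.0 + np.exp(-(X @ beta_true)))         # eta_beta(x)
Y = (rng.uniform(size=n) < p).astype(int)
print(fit_logistic_sgd(X, Y))                      # should land close to beta_true
```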

Take home message - most important settings

• “Attributes/Features” space X ⊂ Rd & “label” space Y ⊂ R


• Training data-set: Dn = {(X1, Y1), . . . , (Xn, Yn)}
• Risk w.r.t. loss ℓ : Y × Y → R+
Risk: R(f) = E(X,Y)∼P [ℓ(f(X), Y)]

• Optimal risk and Bayes predictor


f∗ = arg min_f R(f) and R∗ = R(f∗)
• Restricted class of predictors/classifiers: {fβ ; β ∈ B}

R(fβ) − R(f∗) = [R(fβ) − R(fβ∗)] + [R(fβ∗) − R(f∗)]
                (Estimation Error)  (Approximation Error)
