Week11_regularization and optimization

NTU EE6483 Week11_regularization and optimization

Uploaded by

yimingxiao2000

Artificial Intelligence & Data Mining

Week 11

WEN Bihan (Asst Prof)


Homepage: https://personal.ntu.edu.sg/bihan.wen/

1
Recap: K-Means

• Inputs:

1. A collection of data without labels - Unsupervised!

2. K - number of clusters - Need to provide as the algorithm parameter

3. Distance metrics - Euclidean distance, used in this course

• Initialization:

• Centers / Centroids of each cluster - Initialization affects the final result

• Outputs:

• Clusters / membership of each data point + centroids

• Stopping criterion
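
The recap above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the course's reference implementation: the function name `kmeans`, the random choice of initial centroids, the "centroids stop moving" stopping rule, and the toy data are all assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Plain K-Means with Euclidean distance (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stopping criterion (assumed): the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups (made-up data).
data = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                 [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans(data, k=2)
print(labels, centroids)
```

Changing `seed` changes the initial centroids, which is one way to see how initialization can affect the final result on less clearly separated data.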

2
Recap: HAC

• Inputs:

1. A collection of data without labels

2. Distance metric - Centroid/Max/Min/Average Distance

• No Initialization - Deterministic Algorithm

• Merge clusters: Distance Table + Visualization

• Dendrogram

• Stop when #clusters = K, or when only 1 cluster is left.

3
Example: HAC
points #1 #2 #3 #4
x 1.9 1.8 2.3 2.3
y 1.0 0.9 1.6 2.1

What are the distance metrics between clusters?

1. Single Linkage (MIN distance)


2. Complete Linkage (MAX distance)
3. Centroid Distance (distance between the centers)
4. Average Linkage (average over all pairs of points from two clusters)

4
Example: HAC
points #1 #2 #3 #4
x 1.9 1.8 2.3 2.3
y 1.0 0.9 1.6 2.1

Step 1: Same for all linkages / distances


      #1     #2     #3     #4
#1    0
#2    0.14   0
#3    0.72   0.86   0
#4    1.17   1.3    0.5    0

Decide to merge #1 and #2


6
Example: HAC

Distance table from Step 1:

      #1     #2     #3     #4
#1    0
#2    0.14   0
#3    0.72   0.86   0
#4    1.17   1.3    0.5    0

Step 2 for Average / MIN / MAX distance, no need to compute centroids

Average Distance:
• D(1+2, 3) = Average {D(1,3), D(2,3)} = 0.5 * D(1,3) + 0.5 * D(2,3) = 0.36 + 0.43 = 0.79
• D(1+2, 4) = Average {D(1,4), D(2,4)} = 0.5 * D(1,4) + 0.5 * D(2,4) = 0.59 + 0.65 = 1.24

           #1 + #2   #3     #4
#1 + #2    0
#3         0.79      0
#4         1.24      0.5    0
8
Example: HAC

Distance table from Step 1:

      #1     #2     #3     #4
#1    0
#2    0.14   0
#3    0.72   0.86   0
#4    1.17   1.3    0.5    0

Step 2 for Average / MIN / MAX distance, no need to compute centroids

MIN Distance:
• D(1+2, 3) = MIN {D(1,3), D(2,3)} = MIN {0.72, 0.86} = 0.72
• D(1+2, 4) = MIN {D(1,4), D(2,4)} = MIN {1.17, 1.3} = 1.17

           #1 + #2   #3     #4
#1 + #2    0
#3         0.72      0
#4         1.17      0.5    0
9
Example: HAC

Distance table from Step 1:

      #1     #2     #3     #4
#1    0
#2    0.14   0
#3    0.72   0.86   0
#4    1.17   1.3    0.5    0

Step 2 for Average / MIN / MAX distance, no need to compute centroids

MAX Distance:
• D(1+2, 3) = MAX {D(1,3), D(2,3)} = MAX {0.72, 0.86} = 0.86
• D(1+2, 4) = MAX {D(1,4), D(2,4)} = MAX {1.17, 1.3} = 1.3

           #1 + #2   #3     #4
#1 + #2    0
#3         0.86      0
#4         1.3       0.5    0
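
The Step 2 tables for the three linkages can be checked with a short script. This is a minimal sketch using the four points of the example; the helper names `dist` and `linkage` are made up for illustration, and values are rounded to two decimals as on the slides.

```python
import numpy as np

# The four points of the example, keyed by their index.
points = {1: (1.9, 1.0), 2: (1.8, 0.9), 3: (2.3, 1.6), 4: (2.3, 2.1)}

def dist(i, j):
    """Euclidean distance between points #i and #j."""
    (xi, yi), (xj, yj) = points[i], points[j]
    return float(np.hypot(xi - xj, yi - yj))

def linkage(cluster_a, cluster_b, mode):
    """Cluster-to-cluster distance computed from all cross-cluster pairs."""
    pairs = [dist(i, j) for i in cluster_a for j in cluster_b]
    if mode == "min":        # single linkage
        return min(pairs)
    if mode == "max":        # complete linkage
        return max(pairs)
    if mode == "average":    # average linkage
        return sum(pairs) / len(pairs)
    raise ValueError(f"unknown linkage mode: {mode}")

merged = (1, 2)  # Step 1 merged #1 and #2, the closest pair (distance 0.14)
table = {
    mode: {other: round(linkage(merged, (other,), mode), 2) for other in (3, 4)}
    for mode in ("average", "min", "max")
}
print(table)
```

With all three linkages the smallest remaining entry is D(3, 4) = 0.5, so the next merge in this example would be #3 and #4 regardless of the linkage chosen.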
10
Recap: Linear Regression

• Linear regression: 𝑦 can be determined by 𝒘 𝑇 𝒙 + 𝑏

• Training: to find the best 𝒘 , based on training data.

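• Given a set of N data points 𝒙1 , … , 𝒙𝑖 , … 𝒙𝑁 , and their labels 𝑦1 , … , 𝑦𝑖 , … 𝑦𝑁 .

• Loss Function = mean squared error between 𝒘ᵀ𝒙𝑖 + 𝑏 and 𝑦𝑖 .

\[
\min_{\mathbf{w}} \hat{L}(f_{\mathbf{w}}) = \min_{\mathbf{w}} \frac{1}{N} \sum_{i=1}^{N} \left( \mathbf{w}^T \mathbf{x}_i + b - y_i \right)^2
\]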

• Minimizing the Vertical Offset

11
Recap: Linear Regression

• Linear regression: 𝑦 can be determined by 𝒘 𝑇 𝒙 + 𝑏

• 𝒘 ∈ 𝑅𝑑 and 𝒘 𝑇 denotes its transpose (row vector).

\[
\mathbf{w}^T \mathbf{x} + b = \left[\, \mathbf{w}^T \mid b \,\right] \begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix}
\]

• To simplify the notation,

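\[
\mathbf{w}^T \leftarrow \left[\, \mathbf{w}^T \mid b \,\right], \qquad \mathbf{x} \leftarrow \begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix}
\]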

• Thus, 𝑦 can be determined by 𝒘 𝑇 𝒙.

12
Recap: Linear Regression

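• Write the problem in matrix form:

\[
\hat{L}(f_{\mathbf{w}}) = \frac{1}{N} \sum_{i=1}^{N} \left( \mathbf{w}^T \mathbf{x}_i - y_i \right)^2 = \frac{1}{N} \left\| \mathbf{X}\mathbf{w} - \mathbf{y} \right\|_2^2
\]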

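• Concatenate the rows: row 𝑖 of 𝑿 is 𝒙𝑖ᵀ, and entry 𝑖 of 𝒚 is 𝑦𝑖.

• Matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$

• Vector $\mathbf{y} \in \mathbb{R}^{N}$

• Vector $\mathbf{w} \in \mathbb{R}^{d}$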

13
Recap: Linear Regression

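• Write the problem in matrix form:

\[
\hat{L}(f_{\mathbf{w}}) = \frac{1}{N} \sum_{i=1}^{N} \left( \mathbf{w}^T \mathbf{x}_i - y_i \right)^2 = \frac{1}{N} \left\| \mathbf{X}\mathbf{w} - \mathbf{y} \right\|_2^2
\]

• Find the gradient w.r.t. 𝒘 (the term $\mathbf{y}^T\mathbf{y}$ does not depend on 𝒘, so its gradient is zero):

\[
\begin{aligned}
\nabla_{\mathbf{w}} \left\| \mathbf{X}\mathbf{w} - \mathbf{y} \right\|_2^2
&= \nabla_{\mathbf{w}} \, (\mathbf{X}\mathbf{w} - \mathbf{y})^T (\mathbf{X}\mathbf{w} - \mathbf{y}) \\
&= \nabla_{\mathbf{w}} \left( \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w} - 2 \mathbf{w}^T \mathbf{X}^T \mathbf{y} + \mathbf{y}^T \mathbf{y} \right) \\
&= 2 \mathbf{X}^T \mathbf{X} \mathbf{w} - 2 \mathbf{X}^T \mathbf{y}
\end{aligned}
\]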

• Set gradient to zero to get the minimizer:


𝑿𝑇 𝑿𝒘 = 𝑿𝑇 𝒚

𝒘 = (𝑿𝑇 𝑿)−1 𝑿𝑇 𝒚

• Note: here we assume (𝑿𝑇 𝑿)−1 exists unless otherwise specified.


14
Example: Linear Regression

• Least Square (LS) solution:


𝒘 = (𝑿𝑇 𝑿)−1 𝑿𝑇 𝒚

Example:

𝒙    𝒚
0    1
1    2
2    3

Step 1: Use the modified 𝒙 for all data samples

\[
\mathbf{x} \leftarrow \begin{bmatrix} x \\ 1 \end{bmatrix}
\]

𝒙         𝒚
(0, 1)    1
(1, 1)    2
(2, 1)    3

15
Example: Linear Regression

• Step 2: Construct the data matrix 𝑿 and label vector 𝒚

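\[
\mathbf{X} = \begin{bmatrix} 0 & 1 \\ 1 & 1 \\ 2 & 1 \end{bmatrix}, \qquad
\mathbf{y} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}
\]

Row 𝑖 of 𝑿 is the modified sample 𝒙𝑖, and entry 𝑖 of 𝒚 is 𝑦𝑖:

𝒙         𝒚
(0, 1)    1
(1, 1)    2
(2, 1)    3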

16
Example: Linear Regression

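• Step 3: Apply the Least Square (LS) solution:

\[
\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
\]

\[
\mathbf{X}^T \mathbf{X} = \begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \end{bmatrix}
\begin{bmatrix} 0 & 1 \\ 1 & 1 \\ 2 & 1 \end{bmatrix}
= \begin{bmatrix} 5 & 3 \\ 3 & 3 \end{bmatrix}, \qquad
\mathbf{X}^T \mathbf{y} = \begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \end{bmatrix}
\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}
= \begin{bmatrix} 8 \\ 6 \end{bmatrix}
\]

\[
\mathbf{w} = \begin{bmatrix} 5 & 3 \\ 3 & 3 \end{bmatrix}^{-1} \begin{bmatrix} 8 \\ 6 \end{bmatrix}
= \begin{bmatrix} 3/6 & -3/6 \\ -3/6 & 5/6 \end{bmatrix} \begin{bmatrix} 8 \\ 6 \end{bmatrix}
= \begin{bmatrix} 1 \\ 1 \end{bmatrix}
\]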
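
The three steps of this worked example can be reproduced numerically. The sketch below uses NumPy and solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly; that is a common numerical choice, not something prescribed by the slides. The variable names are chosen for this illustration.

```python
import numpy as np

# Training data from the example.
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 2.0, 3.0])

# Steps 1 and 2: append a constant 1 to every sample and stack the rows into X.
X = np.column_stack([x, np.ones_like(x)])

# Step 3: solve the normal equations X^T X w = X^T y.
XtX = X.T @ X
Xty = X.T @ y
w = np.linalg.solve(XtX, Xty)

print(XtX)  # 2x2 matrix X^T X
print(Xty)  # vector X^T y
print(w)    # [slope, intercept]
```

The result w = (1, 1) means slope 1 and bias b = 1, i.e. the fitted line y = x + 1, which passes through all three training points.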

17
Regularization and Optimization

18
Outline

• Learning and Supervision

• Bias and Variance

• Overfitting and Underfitting

• Theoretical Analysis of Statistical Learning Theory (optional)

• Model Diagnosis and Optimization

19
Carry-on Questions

• What are the types of supervision?

• How to measure the degree of overfitting?

• What are methods to prevent overfitting?

20
Supervised vs Unsupervised Learning

From what we have learned so far:

• Classification is supervised:

• A training dataset with pre-defined class labels is provided.

• Clustering is unsupervised:

• No training dataset with pre-existing labels is given.

21
Types of Supervisions

• Unsupervised: no labels

• Weakly supervised: noisy labels, or labels not exactly for the task of interest

• Semi-supervised: labels for a small portion of the training data

• Supervised: clean, complete training labels for the task of interest

22
Revisit Supervised Image Classification
input (image) -> desired output (class label)

Example class labels: apple, pear, tomato, cow, dog, horse

23
Framework of Supervised Learning
Training time: Training Samples -> Features -> Training (with Training Labels) -> Learned model

Testing time: Features -> Learned model -> Prediction
24
The Basic Supervised Learning Framework

𝑦 = 𝑓𝜃(𝒙), where 𝑦 is the output, 𝑓𝜃 is the model, and 𝒙 is the input.

• Learning / Training: given a training set of labeled examples {(𝒙1 , 𝑦1 ), … , (𝒙𝑁 , 𝑦𝑁 )}, estimate the parameters 𝜽 of the prediction function / model 𝑓𝜃 .

• Inference / Testing: apply the learned 𝑓𝜃 to an unseen test example 𝒙 and output the predicted value 𝑦 = 𝑓𝜃(𝒙).

25
Learning Effectiveness

• Potential Problems

1. Do you have sufficient data for supervision? - Overfitting

2. Is your model complex / rich enough for the problem? - Underfitting

• We wish to understand what happened

• Where does the error come from?

28
Where does the learning error come from?

• Challenges:

1. We never know the exact model mapping from the inputs to outputs.

2. We never know the distribution of data.

• What we do in practice:

1. We make certain assumptions about the model

• E.g., We assume that classifier is linear, when using linear classifier.

2. We use empirical risk minimization

• E.g., We minimize the prediction error averaged over the training dataset.
31
What are the learning errors?

• Both approximations introduce potential errors in learning

1. Bias: Error caused by the wrong assumptions made in the learning algorithms or models.

• E.g., Approximating a 2nd order function using linear regression.

2. Variance: Error due to the learning sensitivity to small fluctuations in the training set.

• E.g., fitting to the noise of the limited training data.
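
Both examples can be made concrete with a small simulation. The sketch below is an illustration under assumed settings, not part of the lecture: the target is the 2nd-order function x², the noise level, grid size and number of trials are arbitrary, and "bias²" and "variance" are estimated by refitting the model on many independently drawn noisy training sets.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)   # fixed training inputs
f_true = x ** 2                   # 2nd-order target function
sigma = 0.3                       # assumed noise level
n_trials = 200                    # number of re-drawn training sets

def bias2_and_variance(degree):
    """Estimate squared bias and variance of a degree-`degree` polynomial fit."""
    preds = np.empty((n_trials, x.size))
    for t in range(n_trials):
        # A fresh training set: same inputs, new noise on the labels.
        y_noisy = f_true + sigma * rng.standard_normal(x.size)
        coeffs = np.polyfit(x, y_noisy, degree)
        preds[t] = np.polyval(coeffs, x)
    mean_pred = preds.mean(axis=0)
    bias2 = float(np.mean((mean_pred - f_true) ** 2))   # systematic error
    variance = float(np.mean(preds.var(axis=0)))        # sensitivity to the noise
    return bias2, variance

bias2_lin, var_lin = bias2_and_variance(1)     # linear model: too simple
bias2_deg6, var_deg6 = bias2_and_variance(6)   # flexible model: fits noise

print(f"degree 1: bias^2 = {bias2_lin:.4f}, variance = {var_lin:.4f}")
print(f"degree 6: bias^2 = {bias2_deg6:.4f}, variance = {var_deg6:.4f}")
```

Under these assumptions the linear fit shows the larger squared bias (it cannot represent x²), while the degree-6 fit shows the larger variance (it follows the noise in each training set).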

33
What are the learning errors?

• Both approximations introduce potential errors in learning

• What is the total “error” we care about?

• Expected Error

For a future test sample drawn at random from the underlying distribution,
Expected Error = the probability that it is misclassified by 𝑓𝜃 (. ).

• In practice, how to measure the expected error?

• Testing Error / Validation Error

34
Bias and Variance

• Training a classifier 𝑓𝜃 (𝑥)

• Expected Error:

• For an item drawn at random from the underlying distribution, the probability that it is
misclassified by 𝑓𝜃 (. ).

35
Bias and Variance

• Training a classifier 𝑓𝜃 (𝑥)

• Model Complexity (informally):


• How many free parameters in 𝑓𝜃 (. ) do we have to learn?

• For example, complexity of a neural network depends on #hidden neurons.

36
Bias and Variance

• Training a classifier 𝑓𝜃 (𝑥)

• Bias:
• Type of error that occurs due to wrong / inaccurate assumptions made in the
learning algorithm.

• Bias is high when the model is (too) simple


37
Bias and Variance

• Training a classifier 𝑓𝜃 (𝑥)

• Variance:
• Type of error that occurs due to a model's sensitivity to small fluctuations in the
training set.

• Variance increases with model complexity.


38
Bias and Variance

• Training a classifier 𝑓𝜃 (𝑥)

• Expected error of a classifier ≈ bias + variance (+noise)

39
Bias and Variance

• Training a classifier 𝑓𝜃 (𝑥)

Simple Model: high bias and low variance

Complex Model: low bias and high variance
40
Bias and Variance

• The trade-off between bias and variance (of a model)

• Bullseye (center) = target model; Darts (crosses) = learned models

41
Basics on statistical learning theory (optional)

• Expected error of a classifier ≈ bias + variance

• How to derive it mathematically?

• We need to use the tool of statistical machine learning.

• How Empirical Risk Minimization approximates Expected/True Risk.


• “Empirical”: calculated from finite amount of data – We can obtain
• “Expected”: calculated from true data distribution – cannot obtain in practice

• Optional for those who are interested

42
Basics on statistical learning theory

• Why do we need to study statistical learning?

• We cannot know exactly how well an algorithm will work in practice (the true "risk", a measure of effectiveness).

• Because we do not know the true distribution of data that the algorithm will work on.

• But, we can instead measure its performance on a known set of data (the "empirical" risk).

• Empirical Risk Minimization is the core idea.

43
Basics on statistical learning theory

• Expected (true) Risk:

• ℎ(𝑥) is the function predicting 𝑦.

• 𝑙(ℎ(𝑥), 𝑦) measures the distance between 𝑦 and the predicted ℎ(𝑥).

• (𝑥, 𝑦) follows some underlying distribution 𝑝(𝑥, 𝑦): some (𝑥, 𝑦) appear more often in practice, thus need higher weight.

• The expected (true) risk measures how well ℎ(𝑥) approximates 𝑦.

• In practice, we do NOT have full access to such distribution 𝒑(𝒙, 𝒚).

• We do NOT have access to expected risk explicitly.

44
Basics on statistical learning theory

• Expected Risk:

• Empirical Risk:

• Though we do not have full access to the distribution 𝑝(𝑥, 𝑦), we can collect a labeled dataset: a limited number of samples (𝑥^(𝑖), 𝑦^(𝑖)) from 𝑝(𝑥, 𝑦).

• Instead of integration, we take the average distance between 𝑦^(𝑖) and the predicted ℎ(𝑥^(𝑖)): all samples have equal weights.

• In practice, we can calculate the empirical risk given a labeled dataset.

• Empirical Risk approximates the Expected Risk.
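
The formulas for the two risks did not survive in this copy of the slides. The standard textbook definitions that match the bullets above are sketched below; the symbols $R$ and $\hat{R}_I$ are names assumed here for illustration, with $I$ denoting the number of labeled samples.

```latex
% Expected (true) risk: loss weighted by the data distribution p(x, y)
R(h) = \mathbb{E}_{(x, y) \sim p}\big[\, l(h(x), y) \,\big]
     = \int l(h(x), y) \, p(x, y) \, dx \, dy

% Empirical risk: equal-weight average over the I collected samples
\hat{R}_I(h) = \frac{1}{I} \sum_{i=1}^{I} l\big(h(x^{(i)}), y^{(i)}\big)
```

The integral weights each pair (x, y) by how often it occurs, whereas the sum gives every collected sample the same weight 1/I.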

45
Basics on statistical learning theory

• Expected Risk:

• Empirical Risk:

• Assumptions of learning the function ℎ(𝑥) in practice:

1. We need assumptions on ℎ(𝑥) to be learned, ℎ ∈ ℋ (the specific model you use, e.g., linear regressor, neural networks).

2. We can only minimize the empirical risk instead of expected risk.

46
Basics on statistical learning theory

• Expected Risk:

• Empirical Risk:

• Limitations of learning the function ℎ(𝑥):

[Figure: the best possible ℎ(. ), the best ℎ(. ) with limitation 1, and the learned ℎ(. ) with limitations 1 + 2]
47
Basics on statistical learning theory

• Expected Risk:

• Empirical Risk:
• Total Learning Error = error by limitation 1 + error by limitation 2

48
Basics on statistical learning theory

• Expected Risk:

• Empirical Risk:

• Total Learning Error: depends on how much data you have and what model you are using.

49
Basics on statistical learning theory

• ℎ ∈ ℋ: ℋ is the learnable function space based on our assumptions (all possible algorithms you can learn using a specific model).

• 𝐼 is the size / complexity of the training dataset.


50
Basics on statistical learning theory


• 𝜀𝑎𝑝𝑝 (ℋ) is formally defined as the bias.

• 𝜀𝑒𝑠𝑡 (ℋ) is formally defined as the variance.

• Total Learning Error = 𝜀𝑎𝑝𝑝 ℋ + 𝜀𝑒𝑠𝑡 (ℋ) = bias + variance.
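
The definitions behind this decomposition are not shown in this copy of the slides. A common textbook formulation is sketched below; it is an assumption about what the missing formula contained, and the symbols $R$, $\hat{R}_I$, $h^{*}$, $h^{*}_{\mathcal{H}}$ and $\hat{h}_I$ are introduced only for this illustration.

```latex
% Best possible predictor, best predictor within the model class,
% and the predictor actually learned from I samples
h^{*} = \arg\min_{h} R(h), \qquad
h^{*}_{\mathcal{H}} = \arg\min_{h \in \mathcal{H}} R(h), \qquad
\hat{h}_I = \arg\min_{h \in \mathcal{H}} \hat{R}_I(h)

% Total learning error split into approximation and estimation parts
R(\hat{h}_I) - R(h^{*})
  = \underbrace{R(h^{*}_{\mathcal{H}}) - R(h^{*})}_{\varepsilon_{app}(\mathcal{H})}
  + \underbrace{R(\hat{h}_I) - R(h^{*}_{\mathcal{H}})}_{\varepsilon_{est}(\mathcal{H})}
```

In this reading, the approximation error depends only on the choice of ℋ, while the estimation error shrinks as the number of samples I grows.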


51
Overfitting vs Underfitting

• What is a good model?

Simple Model Complex Model

Good Model!
52
Overfitting vs Underfitting

• Simple Model

• High Bias

• Causes an algorithm to miss relevant relations between the input features and the target outputs.

• Complex Model

• High Variance

• Causes an algorithm to model the noise in the training set.

53
Overfitting vs Underfitting

• Simple Model

• High Bias - Underfitting

• Complex Model

• High Variance - Overfitting

54
Overfitting vs Underfitting

• Training a classifier 𝑓𝜃 (𝑥)

Simple Model: high bias and low variance

Complex Model: low bias and high variance
55
Overfitting vs Underfitting

• Training a classifier 𝑓𝜃 (𝑥)

Underfitting: high bias and low variance

Overfitting: low bias and high variance
56
Overfitting vs Underfitting

• Measure overfitting by training and testing / validation errors

57
Overfitting vs Underfitting

• Overfitting ≈ Testing / Validation Error – Training Error

The gap measures the degree of overfitting.

58
Overfitting vs Underfitting

• Overfitting ≈ Testing / Validation Error – Training Error

Overfitting: large gap between training and test errors

Underfitting: small gap between training and test errors

59
Overfitting vs Underfitting

• Bias-Variance Tradeoff:

• The fundamental dilemma of trying to minimize, at the same time, two sources of error that prevent ML algorithms from generalizing beyond their training set.

• The bias is error from erroneous assumptions in the learning algorithm. High
bias can cause an algorithm to miss the relevant relations between features
and target outputs (e.g., model is too simple -> underfitting).

• The variance is error from sensitivity to small fluctuations in the training set.
High variance can cause an algorithm to model the random noise in the
training data, rather than the intended outputs (e.g., model is too
complicated -> overfitting).

60
Optimize and Regularize Learning

• How to monitor the expected error?

• Separate a validation dataset from the data with known labels.

• Learn parameters on the training data.

• Measure accuracy on the validation data, which is a “simulated” testing set.

• Peek at the validation set to prevent overfitting and underfitting.

[Figure: the data with the known labels is split into a training dataset and a validation dataset]

• Why is this important?
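
One way to see why the held-out set matters is to compare training and validation errors for models of different complexity. The sketch below is an illustration under assumed settings (made-up noisy quadratic data, a 40/20 split, polynomial models of degree 1, 2 and 10); it reports the gap "validation error minus training error" for each model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up labeled data: a noisy 2nd-order function.
x = rng.uniform(-1.0, 1.0, size=60)
y = x ** 2 + 0.2 * rng.standard_normal(60)

# Hold out one third of the labeled data as the validation set.
order = rng.permutation(60)
train, val = order[:40], order[40:]

def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on the given samples."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

results = {}
for degree in (1, 2, 10):
    coeffs = np.polyfit(x[train], y[train], degree)  # learn on training data only
    train_err = mse(coeffs, x[train], y[train])
    val_err = mse(coeffs, x[val], y[val])            # "simulated" testing error
    results[degree] = {"train": train_err, "val": val_err, "gap": val_err - train_err}

for degree, r in results.items():
    print(f"degree {degree:2d}: train {r['train']:.4f}  "
          f"val {r['val']:.4f}  gap {r['gap']:.4f}")
```

The training error can only go down as the degree increases, so it cannot reveal overfitting on its own; the validation error and the gap are the quantities to monitor.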


61
Model training diagnosis

• Important statistics:

• Training / Validation / Testing Error Curves

• Training parameters:

1. Learning Rate

2. Model Regularization

3. Number of Iterations / Epochs

62
Diagnosing learning rates

Image source: Stanford CS231n


Regularization to prevent overfitting

• Solutions, in the context of learning neural networks:

1. Limit the model complexity by reducing the model expressiveness.

• Early Stopping: Do not train the network until the training error is as low as possible; stop at the minimal
validation loss
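
A minimal sketch of this rule, applied to a recorded validation curve, is shown below. The `patience` parameter (how many checks without improvement to tolerate before stopping) is a common practical addition and an assumption here, as are the function name and the made-up error values.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation errors.

    Training stops once the validation error has not improved for
    `patience` consecutive checks; the model saved at `best_epoch` is kept.
    """
    best_err = float("inf")
    best_epoch = -1
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            # New minimum of the validation error: remember this checkpoint.
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` checks: stop training here.
            return best_epoch, epoch
    # The curve ended before the stopping rule triggered.
    return best_epoch, len(val_errors) - 1

# Made-up validation errors: they fall, reach a minimum, then rise again.
val_curve = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
best_epoch, stop_epoch = early_stopping_epoch(val_curve, patience=3)
print(best_epoch, stop_epoch)  # 3 6
```

In a real training loop the model parameters would be saved whenever a new minimum is found, so the checkpoint from `best_epoch` can be restored after stopping.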

66
Regularization to prevent overfitting

• Solutions, in the context of learning neural networks:

1. Limit the model complexity by reducing the model expressiveness.

• Dropout: During training, some number of layer outputs are randomly ignored or
“dropped out”.

• Early Stopping: Sample the model every few iterations of training, check how well it
works with the validation set, and stop when the validation error reaches the minimum.

• Weight Sharing: Instead of training each neuron independently, we can force their
parameters to be the same. Examples: Recurrent Neural Networks (RNN).
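
The dropout bullet above can be illustrated with a few lines of NumPy. The sketch uses the "inverted dropout" variant, which rescales the kept outputs during training so that nothing needs to change at test time; this scaling choice, the function name and the drop probability are assumptions for illustration.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero layer outputs during training (inverted dropout)."""
    if not training or drop_prob == 0.0:
        # At test time every output is kept unchanged.
        return activations
    keep_prob = 1.0 - drop_prob
    # Each output is kept independently with probability keep_prob.
    mask = rng.random(activations.shape) < keep_prob
    # Rescale the survivors so the expected output stays the same.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((1000, 50))                             # made-up layer outputs
h_train = dropout(h, 0.5, rng, training=True)       # about half are zeroed
h_test = dropout(h, 0.5, rng, training=False)       # unchanged

print(h_train.mean(), h_test.mean())
```

Because a different random subset of outputs is dropped at every training step, the network cannot rely on any single unit, which limits its effective expressiveness.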

67
Regularization to prevent overfitting

• Solutions, in the context of learning neural networks:

1. Limit the model complexity by reducing the model expressiveness.

2. Increase the training data complexity / size, to reduce the variance.

• Add more “real” training data

• Data Augmentation: modify the data available in a realistic but randomized way, to
increase the variety of data seen during training

68
Data augmentation
• Introduce transformations not adequately sampled in the training data

• Geometric: flipping, rotation, shearing, multiple crops

[Figure panels: Flipping & Rotation; Cropping]
69
Data augmentation
• Introduce transformations not adequately sampled in the training data

• Geometric: flipping, rotation, shearing, multiple crops

• Photometric: color transformations

• Other: scaling, add noise, compression artifacts, lens distortions, etc.

• Limited only by data assumptions + time/memory constraints!

• Avoid introducing obvious artifacts
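
A few of the listed transformations can be sketched directly on image arrays. The example below is an illustration only: the function name `augment`, the parameter ranges, and the random "image" are assumptions, and rotations are restricted to multiples of 90 degrees to avoid interpolation artifacts.

```python
import numpy as np

def augment(image, rng, crop_size, noise_std=0.02):
    """Return one randomly augmented copy of an H x W x C image in [0, 1]."""
    out = image
    # Geometric: random horizontal flip.
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    # Geometric: random rotation by 0, 90, 180 or 270 degrees.
    out = np.rot90(out, k=int(rng.integers(0, 4)), axes=(0, 1))
    # Geometric: random crop of size crop_size x crop_size.
    h, w = out.shape[:2]
    top = int(rng.integers(0, h - crop_size + 1))
    left = int(rng.integers(0, w - crop_size + 1))
    out = out[top:top + crop_size, left:left + crop_size, :]
    # Photometric: random brightness gain; Other: additive Gaussian noise.
    gain = rng.uniform(0.8, 1.2)
    out = out * gain + noise_std * rng.standard_normal(out.shape)
    # Keep pixel values in the valid range.
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))   # stand-in for a real training image
batch = [augment(image, rng, crop_size=24) for _ in range(4)]
print([b.shape for b in batch])
```

Whether a given transformation is realistic depends on the data assumptions: for example, vertical flips or rotations may be inappropriate for digits or text.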

72
Regularization to prevent overfitting

• Solutions, in the context of learning neural networks:

1. Limit the model complexity by reducing the model expressiveness.

2. Increase the training data complexity / size, to reduce the variance.

3. Simplify data distribution and dimensionality.

• Dimensionality Reduction

[Figure panels: k = 200, k = 50, k = 2]
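
The slide does not name a specific method. As one common choice, the sketch below uses PCA (via the singular value decomposition) and assumes that k denotes the number of retained dimensions; the data are random and the function name is made up for illustration.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto their top-k principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean                                  # center the data
    # Rows of Vt are the principal directions, sorted by importance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    Z = Xc @ components.T                          # k-dimensional representation
    X_hat = Z @ components + mean                  # reconstruction from k dims
    return Z, X_hat

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))                 # made-up data: 100 samples, 20 dims

errors = {}
for k in (2, 10, 20):
    Z, X_hat = pca_reduce(X, k)
    errors[k] = float(np.mean((X - X_hat) ** 2))   # reconstruction error

print(errors)
```

Smaller k gives a simpler representation at the cost of a larger reconstruction error; keeping all 20 dimensions reproduces the data exactly up to rounding.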
73
What we have learned

• Learning and Supervision

• Types of learning
• Examples of each learning type

• Bias and Variance

• Basics of statistical learning theory

• Overfitting and Underfitting


• How to measure degree of overfitting
• How to prevent overfitting

• Model training diagnosis and optimization

74
Carry-on Questions

• What are the types of supervision?

• Unsupervised / Weakly Supervised / Semi-Supervised / Supervised Learning

• How to measure the degree of overfitting?

• The gap between the testing error and training error

• What are methods to prevent overfitting?

• Reduce model expressiveness: dropout, early stopping, weight sharing, etc.


• Increase data richness: add more training data, data augmentation, etc.

75
