Model Generalization
Objectives
• Explain the difference between overfitting and underfitting a model
• Describe bias-variance tradeoffs
• Find the optimal training and test data set splits, cross-validation, and model complexity versus error
• Apply a linear regression model for supervised learning
• Apply Intel® Extension for Scikit-learn* to leverage underlying compute capabilities of hardware
K Value Affects Decision Boundary
[Figure: KNN decision boundaries for K = 1 and K = 34; x-axis: Number of Malignant Nodes, y-axis: Age]
Choosing Between Different Complexities
[Figure: three panels for Polynomial Degree = 1, 4, and 15, each showing the model, the true function, and the samples; axes X and Y]
How Well Does the Model Generalize?
[Figure: three panels for Polynomial Degree = 1, 4, and 15 showing the model, the true function, and the samples, labeled Underfitting, Just Right, and Overfitting respectively; axes X and Y]
Bias-Variance Tradeoff
• Technically, we can define bias as the error between average model
prediction and the ground truth.
• Moreover, it describes how well the model matches the training
data set:
• A model with a higher bias would not match the data set closely.
• A low bias model will closely match the training data set.
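The definition of bias above (the gap between the average model prediction and the ground truth) can be estimated numerically. The sketch below is only an illustration: the cosine "true function", the noise level, and the helper name `squared_bias` are assumptions and do not come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_function(x):
    # Assumed ground truth, chosen only for illustration
    return np.cos(1.5 * np.pi * x)


x_grid = np.linspace(0.0, 1.0, 50)


def squared_bias(degree, n_repeats=200, n_samples=30, noise=0.1):
    """Mean squared gap between the average prediction and the ground truth."""
    predictions = np.empty((n_repeats, x_grid.size))
    for r in range(n_repeats):
        # Draw a fresh noisy training set and fit a polynomial of the given degree
        x = rng.uniform(0.0, 1.0, n_samples)
        y = true_function(x) + rng.normal(0.0, noise, n_samples)
        coeffs = np.polyfit(x, y, degree)
        predictions[r] = np.polyval(coeffs, x_grid)
    average_prediction = predictions.mean(axis=0)
    return float(np.mean((average_prediction - true_function(x_grid)) ** 2))


bias_degree_1 = squared_bias(1)  # a straight line cannot follow the curve
bias_degree_4 = squared_bias(4)  # a more flexible model tracks it more closely
print(bias_degree_1, bias_degree_4)
```

Under these assumptions the degree-1 model should show the larger squared bias, matching the "higher bias does not match the data closely" bullet.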
Confusion matrix
                     | Predicted Class = Yes | Predicted Class = No
Actual Class = Yes   | True Positive (TP)    | False Negative (FN)
Actual Class = No    | False Positive (FP)   | True Negative (TN)
Evaluation Metrics
Type I and II error
Evaluation Metrics
Sensitivity or True Positive Rate (TPR)
🞑 TP/(TP+FN)
Specificity or True Negative Rate (TNR)
🞑 TN/(FP+TN)
Precision or Positive Predictive Value (PPV)
🞑 TP/(TP+FP)
Negative Predictive Value (NPV)
🞑 TN/(TN+FN)
Accuracy
🞑 (TP+TN)/(TP+FP+TN+FN)
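The formulas above translate directly into code. This is a minimal sketch; the function name `classification_metrics` and the example counts are made up for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics listed above from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),           # true positive rate
        "specificity": tn / (fp + tn),           # true negative rate
        "precision": tp / (tp + fp),             # positive predictive value
        "npv": tn / (tn + fn),                   # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts, not taken from the slides
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```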
Limitation of Accuracy
Consider a binary classification problem
🞑 Number of Class 0 examples = 9990
🞑 Number of Class 1 examples = 10
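To see why accuracy can mislead on this data set, consider a trivial model that predicts Class 0 for every example. The sketch below works through the arithmetic; the "always predict Class 0" model is an assumption used for illustration.

```python
n_class0, n_class1 = 9990, 10

# A trivial model that labels every example as Class 0 (Class 1 is "positive")
tp, fn = 0, n_class1   # every Class 1 example is missed
tn, fp = n_class0, 0   # every Class 0 example is correct

accuracy = (tp + tn) / (tp + fp + tn + fn)
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.3f}")  # 0.999
print(f"recall   = {recall:.3f}")    # 0.000
```

The accuracy is 99.9% even though the model never detects a single Class 1 example.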
Precision
🞑 TP/(TP+FP)
Recall
🞑 TP/(TP+FN)
Weighted Accuracy
🞑 $\dfrac{w_{TP}\,TP + w_{TN}\,TN}{w_{TP}\,TP + w_{FP}\,FP + w_{TN}\,TN + w_{FN}\,FN}$
Evaluation
Model Selection
🞑 How to evaluate the performance of a model?
🞑 How to obtain reliable estimates?
Performance estimation
🞑 How to compare the relative performance with competing models?
Motivation
• Ref: https://towardsdatascience.com/understanding-8-types-of-cross-validation-80c935a4976d
Holdout
Split the dataset into two groups for training and testing
🞑 Training dataset: used to train the model
🞑 Test dataset: used to estimate the error rate of the model
[Figure: the entire dataset divided into a training portion and a test portion]
Drawback
🞑 When an “unfortunate split” happens, the holdout estimate of the error rate will be misleading
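The drawback can be observed by repeating the holdout split with different random seeds. The sketch below assumes scikit-learn's bundled breast cancer data set and a 5-nearest-neighbor classifier, neither of which is prescribed by the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

error_rates = []
for seed in range(5):
    # Each seed produces a different random 70/30 holdout split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    error_rates.append(1.0 - model.score(X_test, y_test))

# The spread of these values shows how much one holdout estimate depends on the split
print(error_rates)
```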
Random Subsampling CV
Split the data set into two groups
🞑 Randomly select a number of samples without replacement
Usually, one third is used for testing and the rest for training
K-Fold Cross-validation
K-fold Partition
🞑 Partition the data into K equal-sized subgroups
🞑 Use K-1 groups for training and the remaining one for testing
[Figure: Experiments 1 to 4, each using a different subgroup as the test set]
K-fold cross-validation
• With 10-fold cross-validation, each round uses a 90-10 train-test split, whereas 2-fold cross-validation uses a 50-50 train-test split.
• Using more folds gives the model more data to train on in each round, but takes more time because the model has to be trained and validated K separate times.
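The split proportions described above can be checked with scikit-learn's `KFold`. This is a small sketch on a made-up data set of 100 rows; the helper name `split_sizes` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold

# 100 placeholder samples; only the number of rows matters here
X = np.arange(100).reshape(-1, 1)


def split_sizes(n_splits):
    """Return (train size, test size) for every fold."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    return [(len(train_idx), len(test_idx)) for train_idx, test_idx in kfold.split(X)]


sizes_10 = split_sizes(10)  # ten rounds, each 90 train / 10 test
sizes_2 = split_sizes(2)    # two rounds, each 50 train / 50 test
print(sizes_10)
print(sizes_2)
```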
Stratified k-folds cross-validation
Ref: http://pages.cs.wisc.edu/~dpage/cs760/evaluating
Bootstrapping
Oversampling
• Amplify the minority class samples so that the classes are equally distributed
• A sampling technique for imbalanced data
Undersampling
• Use fewer samples from the majority class so that the classes are equally distributed
• A sampling technique for imbalanced data
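Both ideas can be sketched with `sklearn.utils.resample`. The synthetic 90/10 class sizes and variable names below are assumptions for illustration; other tools could be used instead.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_major = rng.normal(0.0, 1.0, size=(90, 2))  # majority class (synthetic)
X_minor = rng.normal(3.0, 1.0, size=(10, 2))  # minority class (synthetic)

# Oversampling: draw minority samples with replacement up to the majority size
X_minor_over = resample(
    X_minor, replace=True, n_samples=len(X_major), random_state=0
)

# Undersampling: keep a random subset of the majority class, without replacement
X_major_under = resample(
    X_major, replace=False, n_samples=len(X_minor), random_state=0
)

print(X_minor_over.shape, X_major_under.shape)
```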
Cross-validation with normalization
$x_{norm} = \dfrac{x - \mu}{\sigma}$
$\mu$ is the mean and $\sigma$ is the standard deviation (standard score)
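One common way to combine normalization with cross-validation is to place the scaler inside a pipeline, so that $\mu$ and $\sigma$ are recomputed from the training folds in every round. The sketch below assumes scikit-learn's breast cancer data set and a KNN classifier; these choices are illustrative, not taken from the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# StandardScaler applies (x - mean) / std; inside a pipeline it is fitted
# on the training folds only, then applied to the held-out fold
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```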
Training and Test Splits
[Figure: the data set divided into Training Data and Test Data]
Using Training and Test Data
[Figure: two panels, Training Data and Test Data, with the vertical axis scaled by 10^8]
• Fit the model on the training data
• Make predictions on the test data
Fitting Training and Test Data
Training Data: X_train, Y_train → KNN(X_train, Y_train).fit() → model
Test Data: X_test → model.predict(X_test) → Y_predict
Y_test, Y_predict → error_metric(Y_test, Y_predict) → test error
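The fit, predict, and error-metric flow can be written out with scikit-learn. In this sketch the data set, the choice of K = 5, and the use of misclassification rate as the error metric are assumptions; the slides leave `error_metric` generic.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit on the training data only
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, Y_train)

# Predict on the held-out test data
Y_predict = model.predict(X_test)

# Compare predictions with the true test labels
test_error = 1.0 - accuracy_score(Y_test, Y_predict)
print(test_error)
```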
Train and Test Splitting: The Syntax
Import the train and test split function
from sklearn.model_selection import train_test_split
Split the data and put 30% into the test set
train, test = train_test_split(data, test_size=0.3)
Beyond a Single Test Set: Cross Validation
[Figure: the data set is split four times into Training Data 1 to 4 and Validation Data 1 to 4, with a different portion held out for validation each time]
Split 1: Training Split | Training Split | Training Split | Test Split
Split 2: Training Split | Training Split | Test Split | Training Split
Split 3: Training Split | Test Split | Training Split | Training Split
Split 4: Test Split | Training Split | Training Split | Training Split
The results from the four splits are combined.
Model Complexity vs Error
[Figure: cross-validation error $J_{cv}(\theta)$ and training error $J_{train}(\theta)$ plotted against model complexity, shown next to the polynomial fits of degree 1 and degree 4 (model, true function, samples)]
Overfitting: training error is low, cross-validation error is high
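The relationship between model complexity and the two error curves can be reproduced on synthetic data. The sketch below assumes a cosine true function with added noise and uses polynomial degree as the complexity measure; the data and variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 30)
y = np.cos(1.5 * np.pi * x) + rng.normal(0.0, 0.1, 30)  # assumed true function + noise
X = x.reshape(-1, 1)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
errors = {}
for degree in (1, 4, 15):
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False), LinearRegression()
    )
    result = cross_validate(
        model, X, y, cv=cv,
        scoring="neg_mean_squared_error", return_train_score=True,
    )
    # Scores are negated MSE values, so flip the sign back
    errors[degree] = (-result["train_score"].mean(), -result["test_score"].mean())

for degree, (train_mse, cv_mse) in errors.items():
    print(f"degree {degree}: train MSE = {train_mse:.4g}, CV MSE = {cv_mse:.4g}")
```

With this setup the degree-15 model is expected to show a much larger gap between training and cross-validation error than the lower degrees.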
Linear Regression
Correlation (r)
Measures the linear association between two variables
Shows both the nature (direction) and the strength of the relationship between two variables
Correlation lies between -1 and +1
Zero correlation indicates that there is no linear relationship between the variables
Pearson correlation coefficient
🞑 the most familiar measure of dependence between two quantities
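The Pearson coefficient can be computed in a few lines of NumPy. This sketch uses made-up data; the helper name `pearson_r` is hypothetical, and `np.corrcoef` gives the same value.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return float(
        np.sum(x_centered * y_centered)
        / np.sqrt(np.sum(x_centered ** 2) * np.sum(y_centered ** 2))
    )


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r_positive = pearson_r(x, 2.0 * x + 1.0)          # perfectly increasing line
r_negative = pearson_r(x, -3.0 * x)               # perfectly decreasing line
r_nonlinear = pearson_r(x - 3.0, (x - 3.0) ** 2)  # symmetric parabola

print(r_positive, r_negative, r_nonlinear)
```

The last case returns zero even though y is fully determined by x, which is why a zero coefficient only rules out a linear relationship.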
Reference:
https://en.wikipedia.org/wiki/Correlation_and_dependence
Linear Regression
[Figure: samples with ONE independent variable (left) and samples with TWO independent variables (right)]
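Linear Regression
[Figure: Box Office (×10^8) plotted against Budget, with a fitted line]
$y_\beta(x) = \beta_0 + \beta_1 x$
where $y_\beta(x)$ is the box office revenue, $x$ is the movie budget, $\beta_0$ is coefficient 0, and $\beta_1$ is coefficient 1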
Predicting from Linear Regression
[Figure: Box Office (×10^8) plotted against Budget, with the fitted line]
$y_\beta(x) = \beta_0 + \beta_1 x$, with $\beta_0$ = 80 million and $\beta_1$ = 0.6
Predict about 176 million gross for a 160 million budget (80 + 0.6 × 160 = 176)
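The prediction is a direct evaluation of the fitted line. As a minimal sketch with the coefficients given above (the function name `predict_box_office` is hypothetical), a 160 million budget gives 80 + 0.6 × 160 = 176 million:

```python
beta_0 = 80e6  # intercept: 80 million
beta_1 = 0.6   # slope: revenue gained per unit of budget


def predict_box_office(budget):
    """Evaluate y = beta_0 + beta_1 * x for a given budget."""
    return beta_0 + beta_1 * budget


prediction = predict_box_office(160e6)
print(f"{prediction / 1e6:.0f} million")
```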
Linear Regression
Simple linear regression
🞑 A single independent variable is used to predict the dependent variable
Multiple linear regression
🞑 Two or more independent variables are used to predict the dependent variable
Linear Regression
How to represent the data as a vector/matrix
🞑 We assume a model:
$\mathbf{y} = b_0 + \mathbf{b}\mathbf{X} + \epsilon$,
where $b_0$ and $\mathbf{b}$ are the intercept and slope, known as coefficients or parameters, and $\epsilon$ is the error term (typically assumed to follow $\epsilon \sim N(\mu, \sigma^2)$)
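Linear Regression
How to represent the data as a vector/matrix
🞑 Include the bias constant (intercept) in the input vector
$\mathbf{X} \in \mathbb{R}^{n \times (p+1)}$, $\mathbf{y} \in \mathbb{R}^{n}$, $\mathbf{b} \in \mathbb{R}^{p+1}$, and $\mathbf{e} \in \mathbb{R}^{n}$
$\mathbf{y} = \mathbf{X} \cdot \mathbf{b} + \mathbf{e}$
$\mathbf{X} = [\mathbf{1}, \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p]$, $\mathbf{b} = \{b_0, b_1, b_2, \ldots, b_p\}^T$
$\mathbf{y} = \{y_1, y_2, \ldots, y_n\}^T$, $\mathbf{e} = \{e_1, e_2, \ldots, e_n\}^T$
$\cdot$ is a dot product
equivalent to
$y_i = 1 \cdot b_0 + x_{i1} b_1 + x_{i2} b_2 + \cdots + x_{ip} b_p + e_i \quad (1 \le i \le n)$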
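Linear Regression
Find the coefficient vector $\mathbf{b}$ whose predictions are most similar to the observations
$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} b_0 \\ \vdots \\ b_p \end{pmatrix} + \begin{pmatrix} e_1 \\ \vdots \\ e_n \end{pmatrix}$
equivalent to
$y_i = 1 \cdot b_0 + x_{i1} b_1 + x_{i2} b_2 + \cdots + x_{ip} b_p + e_i \quad (1 \le i \le n)$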
Which Model Fits the Best?
[Figure: Box Office (×10^8) plotted against Budget, with candidate fitted lines]
Ordinary Least Squares (OLS)
$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$
Used to estimate the unknown parameters ($\mathbf{b}$) in a linear regression model
Minimizes the sum of the squares of the differences between the observed responses and the responses predicted by a linear function:
$\sum_{i=1}^{n} (y_i - \mathbf{x}_i \mathbf{b})^2$
Calculating the Residuals
[Figure: Box Office (×10^8) plotted against Budget, with the vertical distance from each point to the fitted line]
$y_\beta(x_{obs}^{(i)}) - y_{obs}^{(i)}$ (predicted value minus observed value)
$\beta_0 + \beta_1 x_{obs}^{(i)} - y_{obs}^{(i)}$
Mean Squared Error
[Figure: Box Office (×10^8) plotted against Budget]
$\dfrac{1}{m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{obs}^{(i)} - y_{obs}^{(i)} \right)^2$
Minimum Mean Squared Error
$\min_{\beta_0, \beta_1} \dfrac{1}{m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{obs}^{(i)} - y_{obs}^{(i)} \right)^2$
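The residual and mean squared error formulas can be checked numerically. This sketch uses four made-up observations and `np.polyfit` as the minimizer; both are assumptions chosen for illustration.

```python
import numpy as np

# Made-up observations, for illustration only
x_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_obs = np.array([3.1, 4.9, 7.2, 8.8])


def mean_squared_error(beta_0, beta_1):
    """Average of the squared residuals (predicted minus observed)."""
    residuals = beta_0 + beta_1 * x_obs - y_obs
    return float(np.mean(residuals ** 2))


# np.polyfit returns the highest-degree coefficient first: slope, then intercept
beta_1_hat, beta_0_hat = np.polyfit(x_obs, y_obs, 1)

mse_best = mean_squared_error(beta_0_hat, beta_1_hat)
mse_other = mean_squared_error(beta_0_hat + 0.5, beta_1_hat)  # shifted intercept

print(mse_best, mse_other)
```

Any other choice of coefficients, such as the shifted intercept here, yields a larger mean squared error than the least-squares fit.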
Optimization
Need to minimize the error
[Figure: cross-validation error $J_{cv}(\mathbf{b})$ and training error $J_{train}(\mathbf{b})$]
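Optimization
$J = \mathbf{e}'\mathbf{e} = (\mathbf{y} - \mathbf{X}\mathbf{b})'(\mathbf{y} - \mathbf{X}\mathbf{b})$
$= (\mathbf{y}' - \mathbf{b}'\mathbf{X}')(\mathbf{y} - \mathbf{X}\mathbf{b})$
$= \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}\mathbf{b} - \mathbf{b}'\mathbf{X}'\mathbf{y} + \mathbf{b}'\mathbf{X}'\mathbf{X}\mathbf{b}$
$= \mathbf{y}'\mathbf{y} - 2\mathbf{b}'\mathbf{X}'\mathbf{y} + \mathbf{b}'\mathbf{X}'\mathbf{X}\mathbf{b}$
$\dfrac{\partial \mathbf{e}'\mathbf{e}}{\partial \mathbf{b}} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\mathbf{b} = 0$
$(\mathbf{X}'\mathbf{X})\mathbf{b} = \mathbf{X}'\mathbf{y}$
$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$
Matrix Cookbook: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf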
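The closed-form solution can be verified numerically. The sketch below builds a design matrix with a leading column of ones on synthetic data (the sizes, coefficients, and noise level are assumptions) and compares the normal-equation solution with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2

# Design matrix with a leading column of ones for the intercept b_0
features = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), features])

b_true = np.array([1.0, 2.0, -3.0])        # assumed coefficients
y = X @ b_true + rng.normal(0.0, 0.1, n)   # y = Xb + e

# Normal equations: solve (X'X) b = X'y rather than forming the inverse explicitly
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b_normal)
print(b_lstsq)
```

Solving the linear system is generally preferred over computing $(\mathbf{X}'\mathbf{X})^{-1}$ directly, since it is numerically more stable.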
Comparing Linear Regression and KNN
Linear Regression K Nearest Neighbors
Linear Regression
Scaling is a Type of Feature Transformation
[Figure: two panels plotting Age against Number of Surgeries, drawn with different axis scales]
Scaling is a Type of Feature Transformation
Why is data transformation needed?
• An algorithm is more likely to be biased when the data distribution is skewed
• Transforming data onto the same scale lets the algorithm compare the relative relationships between data points more reliably
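Two common ways of putting features on the same scale are standard scaling and min-max scaling. The sketch below applies both to a small made-up table of age and number of surgeries; the values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Columns: age, number of surgeries (made-up values)
X = np.array([
    [18.0, 1.0],
    [22.0, 2.0],
    [24.0, 3.0],
    [40.0, 4.0],
    [60.0, 5.0],
])

# Standard scaling: each column ends up with mean 0 and standard deviation 1
X_standard = StandardScaler().fit_transform(X)

# Min-max scaling: each column is mapped onto the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

print(X_standard.round(2))
print(X_minmax.round(2))
```

After either transformation, a distance-based method such as KNN no longer lets the age column dominate simply because its raw values are larger.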
Transformation of Data Distributions
• Predictions from linear regression models assume residuals are normally distributed
• Features and predicted data are often skewed
• Data transformations can solve this issue
Reference: https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/ch04.html
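A log transformation is one frequently used way to reduce skew. The sketch below draws a synthetic log-normal sample (an assumption made for illustration) and compares its skewness before and after taking the logarithm; the helper `skewness` is hypothetical.

```python
import numpy as np


def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)


rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed sample

skew_before = skewness(skewed)
skew_after = skewness(np.log(skewed))  # log of a log-normal sample is normal

print(skew_before, skew_after)
```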
Reference: https://towardsdatascience.com/data-transformation-and-feature-engineering-e3c7dfbb4899
Feature Types
Feature Type | Transformation
• Continuous: numerical values | • Standard Scaling, Min-Max Scaling
• Nominal: categorical, unordered features (True or False) | • One-hot encoding (0, 1)
Addition of Polynomial Features
• "Linear regression" means linear combinations of features
• Capture higher order features of data by adding polynomial features
• Example: $y_\beta(x) = \beta_0 + \beta_1 \log(x)$
[Figure: Box Office plotted against Budget, with the fitted model]
Equation of the Polynomial Regression Model:
Simple Linear Regression equation: $y = b_0 + b_1 x$ .........(a)
Multiple Linear Regression equation: $y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \cdots + b_n x_n$ .........(b)
Polynomial Regression equation: $y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \cdots + b_n x^n$ .........(c)
Reference: https://www.javatpoint.com/machine-learning-polynomial-regression
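Polynomial regression is still linear regression, applied to expanded features. The sketch below uses noise-free data generated from an assumed quadratic, so the fitted coefficients can be compared with the ones used to create the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(-2.0, 2.0, 40)
y = 1.0 + 2.0 * x - 0.5 * x ** 2  # assumed quadratic, no noise
X = x.reshape(-1, 1)

# PolynomialFeatures expands x into [x, x^2]; LinearRegression then fits
# a linear combination of those columns plus an intercept
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False), LinearRegression()
)
model.fit(X, y)

linear_step = model.named_steps["linearregression"]
intercept = linear_step.intercept_   # estimate of b0
coefficients = linear_step.coef_     # estimates of b1, b2

print(intercept, coefficients)
```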
Addition of Polynomial Features
• Can also include variable interactions
$y_\beta(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$
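An interaction column such as $x_1 x_2$ can be generated with `PolynomialFeatures` by setting `interaction_only=True`. The two-row input below is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two samples with features x1 and x2 (made-up values)
X = np.array([
    [2.0, 3.0],
    [4.0, 5.0],
])

# interaction_only=True keeps x1, x2 and the product x1*x2, but no squared terms
transformer = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interactions = transformer.fit_transform(X)

print(interactions)  # columns: x1, x2, x1*x2
```

Fitting a linear regression on these three columns corresponds to the interaction model above, with the intercept supplied by the regressor.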
𝑓 𝑥 ≥ 𝜆
<