Model Generalization

Uploaded by

Ehab Emam

Learning Objectives
• Explain the difference between overfitting and underfitting a model
• Describe bias-variance tradeoffs
• Find the optimal training and test data set splits, cross-validation, and model complexity versus error
• Apply a linear regression model for supervised learning
• Apply Intel® Extension for Scikit-learn* to leverage underlying compute capabilities of hardware
K Value Affects Decision Boundary
[Figure: two KNN decision-boundary plots, K = 1 and K = 34; x-axis: Number of Malignant Nodes (0 to 20), y-axis: Age (20 to 60)]
Choosing Between Different Complexities
[Figure: three panels, Polynomial Degree = 1, 4, and 15, each plotting the model, the true function, and the samples; x-axis: X, y-axis: Y]
How Well Does the Model Generalize?
[Figure: the same three panels, annotated as follows]
• Polynomial Degree = 1: poor at training, poor at predicting
• Polynomial Degree = 4: just right
• Polynomial Degree = 15: good at training, poor at predicting
Underfitting vs Overfitting
• The terms underfitting and overfitting describe the two ways a model can fail to match the data.
• How well a model is fitted determines whether it returns accurate predictions on a given data set.
• Underfitting:
  • Occurs when the model is unable to capture the relationship between the input data and the target data.
  • This happens when the model is not complex enough to match the available data, so it performs poorly even on the training data set.
• Overfitting:
  • Occurs when the model matches noise in the training data, that is, patterns that do not exist in the underlying relationship.
  • This happens with highly complex models, which match almost all the given data points and perform well on the training data set.
  • However, the model does not generalize to the data points in the test data set, so it cannot predict their outcomes accurately.
Underfitting vs Overfitting
[Figure: three panels plotting the model, the true function, and the samples. Polynomial Degree = 1: underfitting; Polynomial Degree = 4: just right; Polynomial Degree = 15: overfitting]
Bias – Variance Tradeoff
• Technically, we can define bias as the error between the average model prediction and the ground truth.
• Moreover, it describes how well the model matches the training data set:
  • A model with a higher bias would not match the data set closely.
  • A low bias model will closely match the training data set.
• Characteristics of a high bias model include:
  • Failure to capture proper data trends
  • Potential towards underfitting
  • More generalized/overly simplified
  • High error rate
Bias – Variance Tradeoff
• Variance refers to the changes in the model when using different portions of the training data set.
• Simply stated, variance is the variability in the model prediction: how much the ML function can adjust depending on the given data set.
• Variance comes from highly complex models with a large number of features.
  • Models with high bias tend to have low variance.
  • Models with high variance tend to have low bias.
• Both contribute to the flexibility of the model.
• For instance, a high bias model that does not match the data set is inflexible and has low variance, which results in a suboptimal machine learning model.
• Characteristics of a high variance model include:
  • Sensitivity to noise in the data set
  • Potential towards overfitting
  • Complex models
  • Trying to fit all data points as closely as possible
Bias – Variance Tradeoff
[Figure: three panels plotting the model, the true function, and the samples. Polynomial Degree = 1: high bias, low variance; Polynomial Degree = 4: just right; Polynomial Degree = 15: low bias, high variance]
EVALUATION
Evaluation for Classification
Evaluation Metrics
• Confusion Matrix: shows the performance of an algorithm, especially its predictive capability,
🞑 rather than how fast it classifies, how fast it builds models, or how well it scales.

                     Predicted Class = Yes   Predicted Class = No
Actual Class = Yes   True Positive           False Negative
Actual Class = No    False Positive          True Negative

Evaluation Metrics
• Type I and II error: a Type I error is a false positive; a Type II error is a false negative.
Evaluation Metrics
• Sensitivity or True Positive Rate (TPR)
🞑 TP/(TP+FN)
• Specificity or True Negative Rate (TNR)
🞑 TN/(FP+TN)
• Precision or Positive Predictive Value (PPV)
🞑 TP/(TP+FP)
• Negative Predictive Value (NPV)
🞑 TN/(TN+FN)
• Accuracy
🞑 (TP+TN)/(TP+FP+TN+FN)
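The formulas above can be checked with a few lines of plain Python. This is a minimal sketch: the function name `classification_metrics` and the four counts are hypothetical, chosen only for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics listed above from the four confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),                 # True Positive Rate (recall)
        "specificity": tn / (fp + tn),                 # True Negative Rate
        "precision": tp / (tp + fp),                   # Positive Predictive Value
        "npv": tn / (tn + fn),                         # Negative Predictive Value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts, not taken from any real data set
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In practice the four counts would come from comparing predicted labels with actual labels on a test set.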
Limitation of Accuracy
• Consider a binary classification problem
🞑 Number of Class 0 examples = 9990
🞑 Number of Class 1 examples = 10
🞑 If every example is predicted as class 0, accuracy is 9990/10000 = 99.9%, even though no class 1 example is detected
• Precision
• Recall
• Weighted Accuracy
$\text{Weighted Accuracy} = \frac{w_{TP}\,TP + w_{TN}\,TN}{w_{TP}\,TP + w_{FP}\,FP + w_{TN}\,TN + w_{FN}\,FN}$
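The 9990/10 example can be worked through numerically. In this sketch class 1 is treated as the positive class, and the weights of 999 on the positive-class terms are an assumption chosen so that both classes carry equal total weight; the slides do not prescribe any particular weights.

```python
def weighted_accuracy(tp, fp, tn, fn, w_tp=1.0, w_fp=1.0, w_tn=1.0, w_fn=1.0):
    """Weighted accuracy as defined above; all weights equal to 1 gives plain accuracy."""
    numerator = w_tp * tp + w_tn * tn
    denominator = w_tp * tp + w_fp * fp + w_tn * tn + w_fn * fn
    return numerator / denominator


# A classifier that predicts class 0 for all 10000 examples (class 1 = positive)
tp, fp, tn, fn = 0, 0, 9990, 10

plain = weighted_accuracy(tp, fp, tn, fn)   # ordinary accuracy
recall = tp / (tp + fn)                     # fraction of class 1 examples found
# Hypothetical weights: each class 1 example counts 999 times as much as a class 0 example
weighted = weighted_accuracy(tp, fp, tn, fn, w_tp=999, w_fn=999)

print(f"accuracy = {plain:.3f}, recall = {recall:.3f}, weighted accuracy = {weighted:.3f}")
```

Plain accuracy looks excellent here while recall and the weighted score expose that the minority class is never detected.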
Evaluation
• Model Selection
🞑 How to evaluate the performance of a model?
🞑 How to obtain reliable estimates?
• Performance estimation
🞑 How to compare the relative performance with competing models?
Motivation
• We often have a finite set of data
🞑 If the entire training data is used to find the best model,
  • The model normally overfits the training data, where it often gives almost 100% correct classification results on training data
  • Better to split the training data into disjoint subsets
• Note that the test data must not be used in any way to create the classifier: doing so is cheating!
Methods of Validation
• Holdout
🞑 Use 2/3 for training and 1/3 for testing
• Cross-validation
🞑 Random subsampling
🞑 K-Fold Cross-validation
🞑 Leave-one-out
• Stratified cross-validation
🞑 Stratified 10-fold cross-validation is often the best
• Bootstrapping
🞑 Sampling with replacement
🞑 Oversampling vs undersampling

• Ref: https://towardsdatascience.com/understanding-8-types-of-cross-validation-80c935a4976d
Holdout
• Split the dataset into two groups for training and test
🞑 Training dataset: used to train the model
🞑 Test dataset: used to estimate the error rate of the model
[Figure: the entire dataset split into a training set and a test set]
• Drawback
🞑 When an “unfortunate split” happens, the holdout estimate of the error rate will be misleading
Random Subsampling CV
• Split the data set into two groups
🞑 Randomly select a number of samples without replacement
• Usually, one third for testing, the rest for training
K-Fold Cross-validation
• K-fold Partition
🞑 Partition the data into K equal-sized subgroups
🞑 Use K-1 groups for training and the remaining one for testing
[Figure: Experiments 1 to 4, each using a different subgroup as the test set]
K-fold cross-validation
• Suppose that $E_i$ is the performance (error rate) in the i-th experiment
• The average error rate is
$E = \frac{1}{K} \sum_{i=1}^{K} E_i$
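The averaging step can be sketched with scikit-learn's `KFold`. The synthetic data, the choice of a K nearest neighbors classifier, and K = 4 are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (hypothetical): label is 1 when the two features sum to > 0
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

K = 4
kf = KFold(n_splits=K, shuffle=True, random_state=0)

fold_errors = []  # E_i for each experiment
for train_idx, test_idx in kf.split(X):
    model = KNeighborsClassifier(n_neighbors=5).fit(X[train_idx], y[train_idx])
    # Error rate = 1 - accuracy on the held-out fold
    fold_errors.append(1.0 - model.score(X[test_idx], y[test_idx]))

E = sum(fold_errors) / K  # average error rate over the K experiments
print("per-fold error rates:", fold_errors)
print("average error rate:", E)
```

Each sample is used for testing exactly once and for training K-1 times.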
How many folds?
• If a large number of folds
🞑 The estimator will be accurate (as the training folds will be closer to the total dataset)
🞑 Computationally expensive
• If a small number of folds
🞑 Cheap computational time for experiments
🞑 Variance of the estimator will be small
• 5 or 10-Fold CV is a common choice for K-fold CV
Leave-one-out cross-validation
• Use N-1 samples for training and the remaining sample for testing (i.e., there is only one sample for testing)
• The average error rate is
$E = \frac{1}{N} \sum_{i=1}^{N} E_i$
where N is the total number of samples.
How many folds?
A smaller value of K means that the dataset is split into fewer parts, but each part contains a larger percentage of the dataset.
Take a dataset with 100 rows:
• 2-fold cross validation: each fold will contain 50 rows.
• 10-fold cross validation: each fold will contain 10 rows.
• This way, when training, the 10-fold cross validation will have a 90-10 train-test split,
• whereas the 2-fold cross validation will have a 50-50 train-test split.
Using more folds presents the model with more data to train on, but requires much more time, as the model has to be trained and validated K separate times.
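The split sizes in the 100-row example can be confirmed directly. This short sketch only inspects the first split produced by `KFold` for each K; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(100)  # stand-in for a dataset with 100 rows

split_sizes = {}
for k in (2, 10):
    # Take the first of the k train/test splits
    train_idx, test_idx = next(KFold(n_splits=k).split(rows))
    split_sizes[k] = (len(train_idx), len(test_idx))
    print(f"{k}-fold: {len(train_idx)} training rows, {len(test_idx)} test rows")
```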
Stratified k-folds cross-validation
• When randomly selecting training or test sets, ensure that class proportions are maintained in each selected set.
1. Stratify instances by class
2. Randomly select instances from each class proportionally

Ref: http://pages.cs.wisc.edu/~dpage/cs760/evaluating
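A small sketch of the two steps above using scikit-learn's `StratifiedKFold`. The 90/10 class ratio and the 5 folds are assumptions for illustration; the feature matrix is a placeholder because only the labels drive the stratification.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 90 examples of class 0 and 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

test_positive_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Count how many class 1 examples land in each test fold
    test_positive_counts.append(int(y[test_idx].sum()))

print("class 1 examples per test fold:", test_positive_counts)
```

Every test fold of 20 rows keeps the original 10% share of class 1, which a plain random split does not guarantee.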
Bootstrapping
• Oversampling
  • Amplify the minority class samples so that the classes are equally distributed
  • Sampling technique for imbalanced data
• Undersampling
  • Use fewer samples from the majority class so that the classes are equally distributed
  • Sampling technique for imbalanced data
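Both strategies can be sketched with `sklearn.utils.resample`. The 90/10 class sizes and the variable names are hypothetical; this is one simple way to balance classes, not the only one.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced data: 90 majority rows and 10 minority rows
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)
X_major, X_minor = X[y == 0], X[y == 1]

# Oversampling: draw minority rows WITH replacement until they match the majority size
X_minor_over = resample(X_minor, replace=True, n_samples=len(X_major), random_state=0)

# Undersampling: draw majority rows WITHOUT replacement down to the minority size
X_major_under = resample(X_major, replace=False, n_samples=len(X_minor), random_state=0)

print("oversampled minority size:", len(X_minor_over))
print("undersampled majority size:", len(X_major_under))
```

Resampling should be applied to the training portion only, so that the test set keeps the original class distribution.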
Cross-validation with normalization
🞑 The model is optimized to the normalized data rather than the original data
🞑 How to evaluate via CV with normalization (e.g., z-score normalization)?
  • Normalize the training data (obtain mean and std)
  • Normalize the validation or test data with the mean and std obtained from the training data
  • Otherwise (normalizing before splitting), the test data are not independent from the training data, because their values contribute to the mean and std. This is a weak form of cheating.
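One way to follow this procedure in scikit-learn is to put the scaler and the model in a pipeline, so that the mean and std are re-estimated from the training folds inside every cross-validation round. The synthetic data and the KNN classifier below are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3)) * np.array([1.0, 100.0, 0.01])
y = (X[:, 0] > 0).astype(int)

# The scaler is fitted on the training folds only, then applied to the held-out fold
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(model, X, y, cv=5)

print("accuracy per fold:", scores)
```

Calling `StandardScaler().fit_transform(X)` on the full data before cross-validation would be the "normalize before splitting" case the slide warns about.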
Standardization
• Standardization or Z-score normalization
🞑 Rescale the data so that the mean is zero and the standard deviation is one; the rescaled values are the standard scores
$x_{norm} = \frac{x - \mu}{\sigma}$
where $\mu$ is the mean and $\sigma$ is the standard deviation of the data
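The formula can be applied by hand with NumPy. In this sketch the data are synthetic, and the mean and standard deviation are taken from the training portion only and then reused for the test portion.

```python
import numpy as np

# Hypothetical one-dimensional feature, split into training and test portions
rng = np.random.default_rng(0)
x_train = rng.normal(loc=50.0, scale=10.0, size=200)
x_test = rng.normal(loc=50.0, scale=10.0, size=50)

mu = x_train.mean()    # mean estimated from the training data
sigma = x_train.std()  # standard deviation estimated from the training data

x_train_norm = (x_train - mu) / sigma
x_test_norm = (x_test - mu) / sigma  # reuse the training statistics

print("train mean/std after scaling:", x_train_norm.mean(), x_train_norm.std())
print("test mean/std after scaling:", x_test_norm.mean(), x_test_norm.std())
```

The training data end up with mean zero and standard deviation one exactly; the test data are only approximately standardized, which is expected.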
Training and Test Splits
[Figure: the data set divided into Training Data and Test Data]
Using Training and Test Data
• Training data: fit the model
• Test data: measure performance
  - predict label with model
  - compare with actual value
  - measure error
[Figure: Training Data and Test Data scatter plots (both axes ×10^8), shown in three steps: fit the model, make predictions, measure error]
Fitting Training and Test Data
• Training data (X_train, Y_train): KNN(X_train, Y_train).fit() → model
• Test data (X_test): model.predict(X_test) → Y_predict
• Test data (Y_test): error_metric(Y_test, Y_predict) → test error
Train and Test Splitting: The Syntax
Import the train and test split function
from sklearn.model_selection import train_test_split

To use the Intel® Extension for Scikit-learn* variant of this algorithm:
• Install Intel® oneAPI AI Analytics Toolkit (AI Kit)
• Add the following two lines of code before importing from sklearn:
from sklearnex import patch_sklearn
patch_sklearn()
Train and Test Splitting: The Syntax
Import the train and test split function
from sklearn.model_selection import train_test_split

Split the data and put 30% into the test set
train, test = train_test_split(data, test_size=0.3)

Other method for splitting data:
from sklearn.model_selection import ShuffleSplit
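A runnable version of the syntax above, assuming a small NumPy array as `data` and a fixed `random_state` for reproducibility (both are additions for illustration).

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, train_test_split

# Hypothetical data set with 10 rows and 2 columns
data = np.arange(20).reshape(10, 2)

# Split the data and put 30% into the test set
train, test = train_test_split(data, test_size=0.3, random_state=42)
print("train rows:", len(train), "test rows:", len(test))

# ShuffleSplit repeats the same kind of random split several times
ss = ShuffleSplit(n_splits=3, test_size=0.3, random_state=42)
split_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in ss.split(data)]
print("ShuffleSplit sizes:", split_sizes)
```

`train_test_split` also accepts several arrays at once (for example features and labels) and splits them consistently.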
Beyond a Single Test Set: Cross Validation
[Figure: the data set divided into Training Data and Validation Data]
[Figure: Training Data and Test Data scatter plots (both axes ×10^8); the fitted model is the best model for this test set]
[Figure: four different splits of the same data, Training Data 1 to 4 paired with Validation Data 1 to 4]
Split 1: Training Split | Training Split | Training Split | Test Split
Split 2: Training Split | Training Split | Test Split | Training Split
Split 3: Training Split | Test Split | Training Split | Training Split
Split 4: Test Split | Training Split | Training Split | Training Split
Average cross validation results.
Model Complexity vs Error
[Figure: cross validation error $J_{cv}(\theta)$ and training error $J_{train}(\theta)$ (y-axis: error) plotted against model complexity (x-axis), shown next to the polynomial fits of the samples]
• Polynomial Degree = 1: Underfitting: training and cross validation errors are high
• Polynomial Degree = 15: Overfitting: training error is low, cross validation error is high
• Polynomial Degree = 4: Just right: training and cross validation errors are low
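The three regimes can be reproduced numerically. The sketch below is modeled on scikit-learn's well-known underfitting/overfitting example; the cosine "true function", the noise level, the 30 samples, and the 10 folds are all assumptions and are not taken from the slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy samples from an assumed true function
np.random.seed(0)
X = np.sort(np.random.rand(30))
y = np.cos(1.5 * np.pi * X) + np.random.randn(30) * 0.1
X = X[:, np.newaxis]

train_mse, cv_mse = {}, {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False), LinearRegression())
    model.fit(X, y)
    # Training error: evaluated on the same data the model was fitted to
    train_mse[degree] = mean_squared_error(y, model.predict(X))
    # Cross validation error: averaged over 10 held-out folds
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=10)
    cv_mse[degree] = -scores.mean()
    print(f"degree {degree:2d}: train MSE = {train_mse[degree]:.3g}, CV MSE = {cv_mse[degree]:.3g}")
```

The training error keeps falling as the degree grows, while the cross validation error is lowest at the intermediate degree.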
Cross Validation: The Syntax
Import the cross-validation function
from sklearn.model_selection import cross_val_score

Perform cross-validation with a given model (KNN is an estimator instance)
cross_val = cross_val_score(KNN, X_data, y_data, cv=4, scoring='neg_mean_squared_error')

Other methods for cross validation:
from sklearn.model_selection import KFold, StratifiedKFold
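A runnable version of the call above. Because the scoring is a mean squared error, `KNN` is assumed here to be a `KNeighborsRegressor`; the synthetic `X_data` and `y_data` are also assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical regression data: y is roughly 2 * x plus noise
rng = np.random.default_rng(0)
X_data = rng.uniform(0, 10, size=(80, 1))
y_data = 2.0 * X_data[:, 0] + rng.normal(scale=1.0, size=80)

KNN = KNeighborsRegressor(n_neighbors=5)  # an unfitted estimator instance
cross_val = cross_val_score(KNN, X_data, y_data, cv=4,
                            scoring='neg_mean_squared_error')

# scikit-learn reports the NEGATIVE mean squared error so that higher is better
mse_per_fold = -cross_val
print("MSE per fold:", mse_per_fold)
print("average MSE:", mse_per_fold.mean())
```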
Introduction to Linear Regression
Correlation (r)
• Linear association between two variables
• Shows how to determine both the nature and strength of the relationship between two variables
• Correlation lies between -1 and +1
• Zero correlation indicates that there is no linear relationship between the variables
• Pearson correlation coefficient
🞑 most familiar measure of dependence between two quantities
Correlation (r)
$\rho_{X,Y} = \mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$
where E is the expected value operator, cov(,) means covariance, and corr(,) is a widely used alternative notation for the correlation coefficient

Reference:
https://en.wikipedia.org/wiki/Correlation_and_dependence
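The Pearson coefficient can be computed directly from its definition as covariance divided by the product of the standard deviations. The two small arrays below are made-up values for illustration.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Covariance as the expected product of deviations from the means
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Pearson correlation: covariance scaled by both standard deviations
r = cov_xy / (x.std() * y.std())

print("r from the definition:", r)
print("r from np.corrcoef:   ", np.corrcoef(x, y)[0, 1])
```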
Linear Regression
[Figure: two panels, samples with ONE independent variable and samples with TWO independent variables]
Linear Regression
[Figure: scatter plot of Box Office (y-axis, ×10^8) against Budget (x-axis, ×10^8) with a fitted line]
$y_\beta(x) = \beta_0 + \beta_1 x$
where $y_\beta(x)$ is the box office revenue, $x$ is the movie budget, $\beta_0$ is coefficient 0, and $\beta_1$ is coefficient 1
Linear Regression
$\beta_0$ = 80 million, $\beta_1$ = 0.6
Predicting from Linear Regression
Predict about 175 Million Gross for 160 Million Budget (with the rounded coefficients above, 80 million + 0.6 × 160 million = 176 million)
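The prediction is plain arithmetic on the fitted line. The sketch below uses the rounded coefficients quoted on the slide, which is why it yields 176 million rather than exactly the 175 million shown; the function name is illustrative.

```python
beta_0 = 80e6  # intercept: 80 million
beta_1 = 0.6   # slope: box office revenue gained per unit of budget


def predict_box_office(budget):
    """Evaluate y = beta_0 + beta_1 * x for a given movie budget."""
    return beta_0 + beta_1 * budget


prediction = predict_box_office(160e6)
print(f"predicted gross for a 160 million budget: {prediction / 1e6:.0f} million")
```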
Linear Regression
• Simple linear regression
🞑 A single independent variable is used to predict the dependent variable
• Multiple linear regression
🞑 Two or more independent variables are used to predict the dependent variable
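Linear Regression
• How to represent the data as a vector/matrix
🞑 We assume a model:
$\mathbf{y} = b_0 + \mathbf{X}\mathbf{b} + \epsilon$,
where $b_0$ and $\mathbf{b}$ are the intercept and slope, known as coefficients or parameters, and $\epsilon$ is the error term (typically assumed to follow $\epsilon \sim N(0, \sigma^2)$)
Linear Regression
• How to represent the data as a vector/matrix
🞑 Include the bias constant (intercept) in the input vector
• $\mathbf{X} \in \mathbb{R}^{n \times (p+1)}$, $\mathbf{y} \in \mathbb{R}^{n}$, $\mathbf{b} \in \mathbb{R}^{p+1}$, and $\mathbf{e} \in \mathbb{R}^{n}$
$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$
$\mathbf{X} = [\mathbf{1}, \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p]$, $\mathbf{b} = (b_0, b_1, b_2, \ldots, b_p)^{\top}$
$\mathbf{y} = (y_1, y_2, \ldots, y_n)^{\top}$, $\mathbf{e} = (e_1, e_2, \ldots, e_n)^{\top}$
where $\mathbf{X}\mathbf{b}$ is a matrix-vector product
Linear Regression
• Find the optimal coefficient vector $\mathbf{b}$ that best reproduces the observations
$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} b_0 \\ \vdots \\ b_p \end{pmatrix} + \begin{pmatrix} e_1 \\ \vdots \\ e_n \end{pmatrix}$
equivalent to
$y_i = 1 \cdot b_0 + x_{i1} b_1 + x_{i2} b_2 + \cdots + x_{ip} b_p + e_i \quad (1 \le i \le n)$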
Which Model Fits the Best?
[Figure: scatter plot of Box Office (y-axis, ×10^8) against Budget (x-axis, ×10^8)]
Ordinary Least Squares (OLS)
$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$
• Used to estimate the unknown parameters ($\mathbf{b}$) in a linear regression model
• Minimizes the sum of the squares of the differences between the observed responses and the responses predicted by a linear function
$\text{Sum squared error} = \sum_{i=1}^{n} (y_i - \mathbf{x}_i \mathbf{b})^2$
where $\mathbf{x}_i$ is the i-th row of $\mathbf{X}$
Calculating the Residuals
[Figure: scatter plot of Box Office (y-axis, ×10^8) against Budget (x-axis, ×10^8) with the fitted line and the residuals]
Residual = predicted value minus observed value:
$y_\beta(x_{obs}^{(i)}) - y_{obs}^{(i)} = \beta_0 + \beta_1 x_{obs}^{(i)} - y_{obs}^{(i)}$
Mean Squared Error
$\frac{1}{m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{obs}^{(i)} - y_{obs}^{(i)} \right)^2$
Minimum Mean Squared Error
$\min_{\beta_0, \beta_1} \frac{1}{m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{obs}^{(i)} - y_{obs}^{(i)} \right)^2$
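Optimization
• Need to minimize the error
$\min_{\mathbf{b}} J(\mathbf{b}) = \sum_{i=1}^{n} (y_i - \mathbf{x}_i \mathbf{b})^2$
• To obtain the optimal set of parameters ($\mathbf{b}$), the derivatives of the error with respect to each parameter must be zero.
Optimization
[Figure: cross validation error $J_{cv}(\mathbf{b})$ and training error $J_{train}(\mathbf{b})$ curves; y-axis: error]
Optimization
$J = \mathbf{e}^{\top}\mathbf{e} = (\mathbf{y} - \mathbf{X}\mathbf{b})^{\top}(\mathbf{y} - \mathbf{X}\mathbf{b})$
$= (\mathbf{y}^{\top} - \mathbf{b}^{\top}\mathbf{X}^{\top})(\mathbf{y} - \mathbf{X}\mathbf{b})$
$= \mathbf{y}^{\top}\mathbf{y} - \mathbf{y}^{\top}\mathbf{X}\mathbf{b} - \mathbf{b}^{\top}\mathbf{X}^{\top}\mathbf{y} + \mathbf{b}^{\top}\mathbf{X}^{\top}\mathbf{X}\mathbf{b}$
$= \mathbf{y}^{\top}\mathbf{y} - 2\mathbf{b}^{\top}\mathbf{X}^{\top}\mathbf{y} + \mathbf{b}^{\top}\mathbf{X}^{\top}\mathbf{X}\mathbf{b}$
$\frac{\partial\, \mathbf{e}^{\top}\mathbf{e}}{\partial \mathbf{b}} = -2\mathbf{X}^{\top}\mathbf{y} + 2\mathbf{X}^{\top}\mathbf{X}\mathbf{b} = 0$
$(\mathbf{X}^{\top}\mathbf{X})\,\mathbf{b} = \mathbf{X}^{\top}\mathbf{y}$
$\mathbf{b} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}$
Matrix Cookbook: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf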
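The closed-form solution can be verified numerically with NumPy. The synthetic data (intercept 3, slope 2) are an assumption, and the sketch solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, which gives the same $\mathbf{b}$ with better numerical behavior.

```python
import numpy as np

# Hypothetical data generated from y = 3 + 2x plus noise
rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Design matrix with a leading column of ones for the intercept: shape n x (p + 1)
X = np.column_stack([np.ones(n), x])

# Solve the normal equations (X'X) b = X'y
b = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient -2X'y + 2X'Xb should vanish at the solution
gradient = -2 * X.T @ y + 2 * X.T @ X @ b

residuals = y - X @ b
sse = residuals @ residuals  # sum squared error J(b)
print("b =", b, " SSE =", sse, " MSE =", sse / n)
```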
Comparing Linear Regression and KNN
Linear Regression
• Fitting involves minimizing cost function (slow)
• Model has few parameters (memory efficient)
• Prediction involves calculation (fast)
K Nearest Neighbors
• Fitting involves storing training data (fast)
• Model has many parameters (memory intensive)
• Prediction involves finding closest neighbors (slow)
Linear Regression: The Syntax
Import the class containing the regression method
from sklearn.linear_model import LinearRegression

Create an instance of the class
LR = LinearRegression()

Fit the instance on the data and then predict the expected value
LR = LR.fit(X_train, y_train)
y_predict = LR.predict(X_test)
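An end-to-end version of the syntax above. The synthetic budget and revenue data loosely mirror the movie example (intercept 80 million, slope 0.6) but are generated here purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical budgets (x) and box office revenues (y)
rng = np.random.default_rng(0)
X = rng.uniform(0, 2e8, size=(100, 1))
y = 8e7 + 0.6 * X[:, 0] + rng.normal(scale=1e7, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

LR = LinearRegression()
LR = LR.fit(X_train, y_train)    # estimate intercept and slope from the training data
y_predict = LR.predict(X_test)   # predict the held-out revenues

test_mse = mean_squared_error(y_test, y_predict)
print("intercept:", LR.intercept_, "slope:", LR.coef_[0])
print("test MSE:", test_mse)
```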
Advanced Linear Regression
Scaling is a Type of Feature Transformation
[Figure: scatter plots of Age against Number of Surgeries (1 to 5), drawn with different Age axis scales]
Scaling is a Type of Feature Transformation
Why is data transformation needed?
• The algorithm is more likely to be biased when the data distribution is skewed
• Transforming data onto the same scale allows the algorithm to better compare the relative relationships between data points
Transformation of Data Distributions
• Predictions from linear regression models assume residuals are normally distributed
• Features and predicted data are often skewed
• Data transformations can solve this issue
from numpy import log, log1p
from scipy.stats import boxcox

Reference: https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/ch04.html
Reference: https://towardsdatascience.com/data-transformation-and-feature-engineering-e3c7dfbb4899
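The effect of these transformations on a skewed variable can be measured with the sample skewness. The log-normal sample below is an assumption standing in for a right-skewed feature; `scipy.stats.skew` is used only to quantify the change.

```python
import numpy as np
from scipy.stats import boxcox, skew

# Hypothetical right-skewed, strictly positive data
rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

log_data = np.log1p(data)                   # log(1 + x), safe when values are near zero
boxcox_data, fitted_lambda = boxcox(data)   # Box-Cox picks its own power parameter lambda

print("skewness before:        ", skew(data))
print("skewness after log1p:   ", skew(log_data))
print("skewness after Box-Cox: ", skew(boxcox_data), "(lambda =", fitted_lambda, ")")
```

Box-Cox requires strictly positive input, which is why the sample here is generated from a log-normal distribution.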
Feature Types
Feature Type                                                    Transformation
• Continuous: numerical values                                  • Standard Scaling, Min-Max Scaling
• Nominal: categorical, unordered features (True or False)      • One-hot encoding (0, 1)
• Ordinal: categorical, ordered features (movie ratings)        • Ordinal encoding (0, 1, 2, 3)

One-hot encoding example:
            Computer  Biology  Physics
Computer       1         0        0
Biology        0         1        0
Physics        0         0        1

from sklearn.preprocessing import LabelEncoder, LabelBinarizer, OneHotEncoder
from sklearn.feature_extraction import DictVectorizer
from pandas import get_dummies
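Both encodings can be sketched with pandas. The subject names follow the table above; the rating labels and their numeric order are hypothetical.

```python
import pandas as pd

# Nominal feature: one-hot encoding gives one 0/1 column per category
df = pd.DataFrame({"subject": ["Computer", "Biology", "Physics", "Biology"]})
one_hot = pd.get_dummies(df["subject"]).astype(int)  # columns are sorted alphabetically
print(one_hot)

# Ordinal feature: map each level to an integer that preserves the order
rating_order = {"bad": 0, "ok": 1, "good": 2, "great": 3}  # hypothetical rating scale
ratings = pd.Series(["good", "bad", "great"])
ordinal = ratings.map(rating_order)
print(ordinal.tolist())
```

One-hot encoding avoids implying an order between categories, whereas ordinal encoding deliberately keeps it.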
Addition of Polynomial Features
The need for Polynomial Regression in ML can be understood from the points below:
• If we apply a linear model to a linear dataset, it gives a good result, as we have seen in Simple Linear Regression. If we apply the same model without any modification to a non-linear dataset, it fits poorly: the loss function increases, the error rate is high, and accuracy decreases.
• So for such cases, where data points are arranged in a non-linear fashion, we need the Polynomial Regression model.
• Hence, "In Polynomial regression, the original features are converted into Polynomial features of required degree (2,3,..,n) and then modeled using a linear model."
Addition of Polynomial Features
• Capture higher order features of data by adding polynomial features
• "Linear regression" means linear combinations of features
[Figure: Box Office (y-axis) against Budget (x-axis) with fitted curves for each of the following forms]
$y_\beta(x) = \beta_0 + \beta_1 x + \beta_2 x^2$
$y_\beta(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$
$y_\beta(x) = \beta_0 + \beta_1 \log(x)$
Addition of Polynomial Features
Equation of the Polynomial Regression Model:
Simple Linear Regression equation: $y = b_0 + b_1 x$ .........(a)
Multiple Linear Regression equation: $y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \cdots + b_n x_n$ .....(b)
Polynomial Regression equation: $y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \cdots + b_n x^n$ .....(c)

Reference: https://www.javatpoint.com/machine-learning-polynomial-regression
Addition of Polynomial Features
• Can also include variable interactions
$y_\beta(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$
• How is the correct functional form chosen?
  Check relationship of each variable or with outcome
Polynomial Features: The Syntax
Import the class containing the transformation method
from sklearn.preprocessing import PolynomialFeatures

Create an instance of the class
polyFeat = PolynomialFeatures(degree=2)

Create the polynomial features and then transform the data
polyFeat = polyFeat.fit(X_data)
X_poly = polyFeat.transform(X_data)
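A runnable version of the syntax above on a single made-up row with two features, so the generated columns are easy to read. With `degree=2` the output contains the constant term, the original features, their squares, and their interaction.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical sample with two features: x1 = 2, x2 = 3
X_data = np.array([[2.0, 3.0]])

polyFeat = PolynomialFeatures(degree=2)
polyFeat = polyFeat.fit(X_data)
X_poly = polyFeat.transform(X_data)

# Column order: 1, x1, x2, x1^2, x1*x2, x2^2
print(X_poly)
```

The transformed matrix can then be passed to `LinearRegression`, which is what makes polynomial regression a linear model in the new features.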
Linear regression for classification
• For binary classification
🞑 Encode class labels as $y \in \{0, 1\}$ or $y \in \{-1, 1\}$
🞑 Apply OLS
🞑 Check which class the prediction is closer to
  • If class 1 is encoded as 1 and class 2 as -1:
    class 1 if $f(x) \ge 0$
    class 2 if $f(x) < 0$
🞑 Linear models are NOT optimized for classification
  → Logistic regression (NEXT)
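The recipe above (encode labels as +1 and -1, apply OLS, threshold at zero) can be sketched as follows. The two well-separated Gaussian clusters are an assumption chosen so the example is easy to classify.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical two-class data: class 1 centered at (2, 2), class 2 at (-2, -2)
rng = np.random.default_rng(0)
X_class1 = rng.normal(loc=2.0, size=(50, 2))
X_class2 = rng.normal(loc=-2.0, size=(50, 2))
X = np.vstack([X_class1, X_class2])
y = np.array([1] * 50 + [-1] * 50)  # class 1 encoded as 1, class 2 as -1

# Apply OLS to the encoded labels
f = LinearRegression().fit(X, y)
scores = f.predict(X)

# Class 1 if f(x) >= 0, class 2 otherwise
predicted = np.where(scores >= 0, 1, -1)
accuracy = np.mean(predicted == y)
print("training accuracy:", accuracy)
```

The regression output is an unbounded score rather than a probability, which is one reason logistic regression is usually preferred for classification.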
Linear regression for classification
• ROC for classification → Later with logistic regression.
Compare $f(x)$ with a threshold $\lambda$:
If $f(x) \ge \lambda$, class 1. Otherwise ($f(x) < \lambda$), class 2. How can we know the optimal $\lambda$?
→ Let’s revisit EVALUATION.
Linear regression for classification
• Multi-class classification
🞑 Encode class labels as:

            Computer  Biology  Physics
Computer       1         0        0
Biology        0         1        0
Physics        0         0        1

🞑 Perform linear (binary) regression for each class

Assumptions in Linear regression
• Linearity: the dependent variable is a linear function of the independent variables
🞑 normally a good approximation, especially for high-dimensional data
• Error has a normal distribution, with mean zero and constant variance
🞑 important for tests
• Independent variables are independent from each other
🞑 Otherwise, it causes a multicollinearity problem: two or more predictor variables are highly correlated.
🞑 Should remove one of them
Think more!
              Feature 1  Feature 2  Feature 3  Feature 4
Coefficient      5.2        0.1       -6.6        0

• How can we interpret this model?
• What is the most useless feature?
🞑 Is it always useless to explain the dependent variable?
• What do negative coefficients represent?
• What is the most informative feature?
Different views between Statistics and CS
• In Statistics, description of the model is often more important.
🞑 Which variables are more informative and reliable to describe the responses? → p-values
🞑 How much information do the variables have?
• In Computer Science, the accuracy of prediction and classification is more important.
🞑 How well can we predict/classify?
Discussion
• What if the data is imbalanced?
• Weighted MSE
Links
• Handwritten example:
• https://www.youtube.com/watch?v=orQ-QGaOPIg
• LR Calculator:
• https://www.socscistatistics.com/tests/regression/default.aspx
• Scikit learn tutorial
• https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

No ratings yet
C18 Biodiversity, Classification & Conservation PDF
95 pages
Methods of Determining Reliability
No ratings yet
Methods of Determining Reliability
22 pages
The Relationship Between Parents Child Rearing ST
No ratings yet
The Relationship Between Parents Child Rearing ST
5 pages
pr3 Reviewer With Answers
No ratings yet
pr3 Reviewer With Answers
5 pages
(Ebook PDF) Fundamentals of Biostatistics 8th Edition Instant Download
100% (5)
(Ebook PDF) Fundamentals of Biostatistics 8th Edition Instant Download
56 pages
Statistics & Probability For Engineers (Gs-201)
No ratings yet
Statistics & Probability For Engineers (Gs-201)
2 pages
CLS565 - Sprin 2025
No ratings yet
CLS565 - Sprin 2025
4 pages
Reward Systems Boost Hotel Staff Motivation
No ratings yet
Reward Systems Boost Hotel Staff Motivation
13 pages
CIVI6731 Week4
No ratings yet
CIVI6731 Week4
16 pages
SSRN 4879150
No ratings yet
SSRN 4879150
22 pages
25-Diversity Elements in The Workplace
No ratings yet
25-Diversity Elements in The Workplace
26 pages
Validity and Reliability of The Student Work Readiness Scale
No ratings yet
Validity and Reliability of The Student Work Readiness Scale
10 pages
Open Stat Reference
No ratings yet
Open Stat Reference
403 pages
B. Com. Semester-II Business Mathematics and Statistics (Code: 52411202)
No ratings yet
B. Com. Semester-II Business Mathematics and Statistics (Code: 52411202)
3 pages
Sprint Mechanics in World-Class Athletes: A New Insight Into The Limits of Human Locomotion
No ratings yet
Sprint Mechanics in World-Class Athletes: A New Insight Into The Limits of Human Locomotion
12 pages


