Model Generalization

Uploaded by

Ehab Emam
Learning Objectives
• Explain the difference between overfitting and underfitting a model
• Describe bias-variance tradeoffs
• Find the optimal training and test data set splits, cross-validation, and model complexity versus error
• Apply a linear regression model for supervised learning
• Apply Intel® Extension for Scikit-learn* to leverage underlying compute capabilities of hardware
K Value Affects Decision Boundary
[Figure: KNN decision boundaries for K = 1 and K = 34; x-axis: Number of Malignant Nodes, y-axis: Age]
Choosing Between Different Complexities
[Figure: three panels with Polynomial Degree = 1, 4, and 15, each showing the model, the true function, and the samples; x-axis: X, y-axis: Y]
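How Well Does the Model Generalize?
[Figure: three panels with Polynomial Degree = 1, 4, and 15, each showing the model, the true function, and the samples; x-axis: X, y-axis: Y]
• Polynomial Degree = 1: poor at training, poor at predicting
• Polynomial Degree = 4: just right
• Polynomial Degree = 15: good at training, poor at predicting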
Underfitting vs Overfitting
• The terms underfitting and overfitting refer to how the model fails to match the data.
• How well a model is fitted determines whether it will return accurate predictions for a given data set.
• Underfitting:
  - Occurs when the model is unable to map the input data to the target data.
  - This happens when the model is not complex enough to match the available data, so it performs poorly even on the training dataset.
• Overfitting:
  - Occurs when the model tries to match patterns that do not really exist, such as noise in the training data.
  - This occurs with highly complex models, which match almost all the given data points and perform well on the training dataset.
  - However, the model does not generalize to the data points in the test data set, so it cannot predict their outcomes accurately.
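The difference between the two failure modes can be illustrated numerically. The sketch below is not from the slides: it assumes a synthetic true function cos(1.5πx), Gaussian noise with standard deviation 0.1, and NumPy polynomial fits of degree 1, 4, and 15; all names are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial


def true_function(x):
    # Assumed ground truth for this illustration only.
    return np.cos(1.5 * np.pi * x)


def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0.0, 1.0, 30))
y_train = true_function(x_train) + rng.normal(0.0, 0.1, 30)
x_test = np.sort(rng.uniform(0.0, 1.0, 200))
y_test = true_function(x_test) + rng.normal(0.0, 0.1, 200)

# errors[degree] = (training MSE, test MSE)
errors = {}
for degree in (1, 4, 15):
    model = Polynomial.fit(x_train, y_train, deg=degree)
    errors[degree] = (mse(y_train, model(x_train)), mse(y_test, model(x_test)))
    print(f"degree={degree:2d}  train MSE={errors[degree][0]:.4f}  "
          f"test MSE={errors[degree][1]:.4f}")
```

The training error can only shrink as the degree grows, so a low training error alone does not show that a model generalizes; the test error is the quantity to watch.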
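Underfitting vs Overfitting
[Figure: three panels with Polynomial Degree = 1, 4, and 15, each showing the model, the true function, and the samples; x-axis: X, y-axis: Y]
• Polynomial Degree = 1: underfitting
• Polynomial Degree = 4: just right
• Polynomial Degree = 15: overfitting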
Bias-Variance Tradeoff
• Technically, bias is the error between the average model prediction and the ground truth.
• It also describes how well the model matches the training data set:
  - A model with higher bias does not match the data set closely.
  - A low-bias model closely matches the training data set.
• Characteristics of a high-bias model include:
  - Failure to capture proper data trends
  - Potential towards underfitting
  - More generalized/overly simplified
  - High error rate
Bias-Variance Tradeoff
• Variance refers to the changes in the model when using different portions of the training data set.
• Simply stated, variance is the variability in the model prediction: how much the ML function can adjust depending on the given data set.
• Variance comes from highly complex models with a large number of features.
  - Models with high bias tend to have low variance.
  - Models with high variance tend to have low bias.
• All these contribute to the flexibility of the model.
  - For instance, a high-bias model that does not match the data set is inflexible and has low variance, which results in a suboptimal machine learning model.
• Characteristics of a high-variance model include:
  - Sensitivity to noise in the data set
  - Potential towards overfitting
  - Complex models
  - Trying to fit all data points as closely as possible
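Bias and variance can be estimated empirically by refitting a model on many independently drawn noisy training sets. The following is a rough sketch under assumptions that are not in the slides: a synthetic true function cos(1.5πx), fixed inputs with fresh Gaussian noise on each repetition, and polynomial models of degree 1 and 9. The function name `bias_variance` is hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial


def true_function(x):
    # Assumed ground truth for this illustration only.
    return np.cos(1.5 * np.pi * x)


def bias_variance(degree, n_repeats=200, n_samples=30, noise=0.1, seed=0):
    """Estimate squared bias and variance of a polynomial model of a given degree."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_samples)
    x_grid = np.linspace(0.05, 0.95, 50)
    predictions = np.empty((n_repeats, x_grid.size))
    for r in range(n_repeats):
        # Each repetition is a new noisy training set at the same inputs.
        y_train = true_function(x_train) + rng.normal(0.0, noise, n_samples)
        predictions[r] = Polynomial.fit(x_train, y_train, deg=degree)(x_grid)
    mean_prediction = predictions.mean(axis=0)
    # Bias: gap between the average prediction and the ground truth.
    bias_sq = float(np.mean((mean_prediction - true_function(x_grid)) ** 2))
    # Variance: spread of the predictions across training sets.
    variance = float(np.mean(predictions.var(axis=0)))
    return bias_sq, variance


bias_simple, var_simple = bias_variance(degree=1)
bias_complex, var_complex = bias_variance(degree=9)
print(f"degree 1: bias^2={bias_simple:.5f}  variance={var_simple:.5f}")
print(f"degree 9: bias^2={bias_complex:.5f}  variance={var_complex:.5f}")
```

In this setup the simple model shows the larger bias and the flexible model the larger variance, which is the tradeoff described above.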
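Bias-Variance Tradeoff
[Figure: three panels with Polynomial Degree = 1, 4, and 15, each showing the model, the true function, and the samples; x-axis: X, y-axis: Y]
• Polynomial Degree = 1: high bias, low variance
• Polynomial Degree = 4: just right
• Polynomial Degree = 15: low bias, high variance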
EVALUATION
Evaluation for Classification
Evaluation Metrics
• Confusion matrix: shows the performance of an algorithm, especially its predictive capability
  - rather than how fast it classifies or builds models, or how well it scales

                      Predicted Class = Yes    Predicted Class = No
Actual Class = Yes    True Positive (TP)       False Negative (FN)
Actual Class = No     False Positive (FP)      True Negative (TN)

Evaluation Metrics
• Type I and Type II errors: a Type I error is a false positive and a Type II error is a false negative
Evaluation Metrics
• Sensitivity or True Positive Rate (TPR)
  - TP/(TP+FN)
• Specificity or True Negative Rate (TNR)
  - TN/(FP+TN)
• Precision or Positive Predictive Value (PPV)
  - TP/(TP+FP)
• Negative Predictive Value (NPV)
  - TN/(TN+FN)
• Accuracy
  - (TP+TN)/(TP+FP+TN+FN)
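The formulas above translate directly into code. This is a minimal sketch with made-up confusion matrix counts; the function name `classification_metrics` is hypothetical and the code does not guard against zero denominators.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the listed metrics from confusion matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (fp + tn),          # true negative rate
        "precision": tp / (tp + fp),            # positive predictive value
        "npv": tn / (tn + fn),                  # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Illustrative counts only, not taken from the slides.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```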
Limitation of Accuracy
• Consider a binary classification problem
  - Number of Class 0 examples = 9990
  - Number of Class 1 examples = 10
  - If the model predicts everything as class 0, accuracy is 9990/10000 = 99.9%, although no class 1 example is detected
• Alternatives: precision, recall, and weighted accuracy

Weighted Accuracy = \frac{w_{TP} TP + w_{TN} TN}{w_{TP} TP + w_{FP} FP + w_{TN} TN + w_{FN} FN}
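The 9990/10 example can be worked through in a few lines. In this sketch class 1 is treated as the positive class, and the weights (999 for true positives and false negatives, 1 otherwise) are an assumption chosen to offset the 9990:10 class ratio; the slides do not prescribe specific weights.

```python
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    return tp / (tp + fn)


def weighted_accuracy(tp, fp, tn, fn, w_tp=1.0, w_fp=1.0, w_tn=1.0, w_fn=1.0):
    numerator = w_tp * tp + w_tn * tn
    denominator = w_tp * tp + w_fp * fp + w_tn * tn + w_fn * fn
    return numerator / denominator


# Predicting every example as class 0: all 9990 negatives are correct,
# all 10 positives are missed.
tp, fp, tn, fn = 0, 0, 9990, 10

plain = accuracy(tp, fp, tn, fn)
positive_recall = recall(tp, fn)
# Assumed weights: errors and hits on the rare class count 999 times more.
weighted = weighted_accuracy(tp, fp, tn, fn, w_tp=999.0, w_fn=999.0)

print(f"accuracy={plain:.3f}  recall={positive_recall:.3f}  "
      f"weighted accuracy={weighted:.3f}")
```

With these weights the trivial classifier scores 0.5 instead of 0.999, which reflects that it is right about one class and wrong about the other.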
Evaluation
• Model selection
  - How to evaluate the performance of a model?
  - How to obtain reliable estimates?
• Performance estimation
  - How to compare the relative performance of competing models?
Motivation
• We often have a finite set of data
  - If the entire data set is used for training to pick the best model, the model normally overfits the training data and often gives almost 100% correct classification results on that data
• It is better to split the data into disjoint subsets
• Note that the test data must not be used in any way to create the classifier; doing so is cheating!
Methods of Validation
• Holdout
  - Use 2/3 for training and 1/3 for testing
• Cross-validation
  - Random subsampling
  - K-fold cross-validation
  - Leave-one-out
• Stratified cross-validation
  - Stratified 10-fold cross-validation is often the best
• Bootstrapping
  - Sampling with replacement
  - Oversampling vs undersampling

Ref: https://towardsdatascience.com/understanding-8-types-of-cross-validation-80c935a4976d
Holdout
• Split the dataset into two groups for training and testing
  - Training dataset: used to train the model
  - Test dataset: used to estimate the error rate of the model
[Figure: the entire dataset is split into a training set and a test set]
• Drawback
  - When an "unfortunate split" happens, the holdout estimate of the error rate will be misleading
Random Subsampling CV
• Split the dataset into two groups
  - Randomly select a number of samples without replacement
• Usually, one third is used for testing and the rest for training
K-Fold Cross-validation
• K-fold partition
  - Partition the data into K equal-sized subgroups
  - Use K-1 groups for training and the remaining one for testing
[Figure: four experiments (Experiment 1 to Experiment 4), each using a different subgroup as the test set]
K-fold cross-validation
• Suppose that E_i is the performance in the i-th experiment
• The average error rate is
E = \frac{1}{K} \sum_{i=1}^{K} E_i
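The K-fold procedure and the averaged error E can be written out by hand. The sketch below makes several assumptions that are not in the slides: synthetic linear data, a least-squares line as the model, mean squared error as E_i, and the hypothetical helper names `k_fold_indices` and `k_fold_mse`.

```python
import numpy as np


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and partition them into k folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)


def k_fold_mse(X, y, k=4):
    """Return the average error E and the per-experiment errors E_i."""
    folds = k_fold_indices(len(y), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit a least-squares line (with intercept) on the K-1 training folds.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Evaluate on the held-out fold.
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        errors.append(float(np.mean((y[test_idx] - A_test @ coef) ** 2)))
    return float(np.mean(errors)), errors


rng = np.random.default_rng(1)
X = rng.uniform(0.0, 10.0, size=(40, 1))
y = 3.0 + 2.0 * X[:, 0] + rng.normal(0.0, 1.0, 40)

average_error, fold_errors = k_fold_mse(X, y, k=4)
print("E_i:", [round(e, 3) for e in fold_errors])
print("E  :", round(average_error, 3))
```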
How many folds?
• With a large number of folds
  - The estimator will be accurate (low bias), as each training set is close to the total dataset
  - Computationally expensive
• With a small number of folds
  - Cheap computational time for experiments
  - The variance of the estimator will be small
  - The bias of the estimator tends to be larger, as each model is trained on a smaller share of the data
• 5-fold or 10-fold CV is a common choice for K-fold CV
Leave-one-out cross-validation
• Use N-1 samples for training and the remaining sample for testing (i.e., there is only one sample for testing)
• The average error rate is
E = \frac{1}{N} \sum_{i=1}^{N} E_i
where N is the total number of samples.
How many folds?
• A smaller value of K means that the dataset is split into fewer parts, but each part contains a larger percentage of the dataset.
Take a dataset with 100 rows:
• 2-fold cross-validation: each fold will contain 50 rows.
• 10-fold cross-validation: each fold will contain 10 rows.

• This way, when training, 10-fold cross-validation has a 90-10 train-test split,
• whereas 2-fold cross-validation has a 50-50 train-test split.
Using more folds presents the model with more data to train on, but requires much more time, as the model has to be trained and validated K separate times.
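The arithmetic for the 100-row example can be checked with a small helper. The function `fold_sizes` is a hypothetical name; it spreads any remainder over the first folds so that sizes differ by at most one row.

```python
def fold_sizes(n_rows, k):
    """Sizes of k folds that are as equal as possible."""
    base, extra = divmod(n_rows, k)
    return [base + 1 if i < extra else base for i in range(k)]


for k in (2, 10):
    sizes = fold_sizes(100, k)
    test_rows = sizes[0]
    train_rows = 100 - test_rows
    print(f"K={k}: {k} fits, each with {train_rows} training rows "
          f"and {test_rows} test rows")
```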
Stratified k-fold cross-validation
• When randomly selecting training or test sets, ensure that class proportions are maintained in each selected set.
1. Stratify instances by class
2. Randomly select instances from each class proportionally

Ref: http://pages.cs.wisc.edu/~dpage/cs760/evaluating
Bootstrapping
• Oversampling
  - Amplify the minority class samples so that the classes are equally distributed
  - A sampling technique for imbalanced data
• Undersampling
  - Use fewer samples from the majority class so that the classes are equally distributed
  - A sampling technique for imbalanced data
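Both resampling strategies can be sketched with NumPy. This is one simple random variant, assumed for illustration; the function names and the 90/10 toy data are hypothetical, and dedicated libraries offer more refined methods.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Resample every class with replacement up to the largest class size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    selected = []
    for label, count in zip(classes, counts):
        members = np.flatnonzero(y == label)
        # Draw the missing samples with replacement from the same class.
        extra = rng.choice(members, size=target - count, replace=True)
        selected.append(np.concatenate([members, extra]))
    idx = np.concatenate(selected)
    return X[idx], y[idx]


def random_undersample(X, y, seed=0):
    """Keep a random subset of every class down to the smallest class size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    selected = [rng.choice(np.flatnonzero(y == label), size=target, replace=False)
                for label in classes]
    idx = np.concatenate(selected)
    return X[idx], y[idx]


# Toy imbalanced data: 90 samples of class 0 and 10 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print("oversampled class counts: ", np.bincount(y_over))
print("undersampled class counts:", np.bincount(y_under))
```

When combined with cross-validation, resampling is normally applied to the training folds only, so that the held-out data keeps its original class distribution.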
Cross-validation with normalization
• The model is optimized for the normalized data rather than the original data
• How to evaluate via CV with normalization (e.g., z-score normalization)?
  - Normalize the training data (obtain the mean and std)
  - Normalize the validation or test data with the mean and std obtained from the training data
  - Otherwise (normalizing before splitting), the test data are not independent of the training data. This is a weak form of cheating.
Standardization
• Standardization or z-score normalization
  - Rescale the data so that the mean is zero and the standard deviation is one; the rescaled values are called standard scores

x_{norm} = \frac{x - \mu}{\sigma}
where \mu is the mean and \sigma is the standard deviation
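A minimal sketch of z-score normalization in which the mean and standard deviation come from the training data only and are then reused for the test data. The helper names and the toy numbers are assumptions for illustration.

```python
import numpy as np


def fit_standardizer(X_train):
    """Estimate the per-feature mean and standard deviation on training data."""
    return X_train.mean(axis=0), X_train.std(axis=0)


def standardize(X, mean, std):
    """Apply x_norm = (x - mean) / std with previously estimated statistics."""
    return (X - mean) / std


X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 500.0]])
X_test = np.array([[2.5, 350.0]])

mean, std = fit_standardizer(X_train)
X_train_norm = standardize(X_train, mean, std)
# The test data reuses the training statistics; it never contributes to them.
X_test_norm = standardize(X_test, mean, std)

print("training mean:", mean, "training std:", std)
print("normalized test row:", X_test_norm)
```

Inside cross-validation the same two steps would be repeated for every split, with the statistics recomputed from that split's training folds.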
Training and Test Splits
[Figure: the dataset divided into training data and test data]
Using Training and Test Data
• Training data: fit the model
• Test data: measure performance
  - predict label with model
  - compare with actual value
  - measure error
Using Training and Test Data
[Figure sequence: side-by-side scatter plots of the training data and the test data (both axes ×10^8), illustrating three steps]
1. Fit the model to the training data
2. Make predictions on the test data
3. Measure the error on the test data
Fitting Training and Test Data
• Training data: X_train, Y_train → KNN().fit(X_train, Y_train) → model
• Test data: X_test → model.predict(X_test) → Y_predict
• Test error: error_metric(Y_test, Y_predict)
Train and Test Splitting: The Syntax
Import the train and test split function
from sklearn.model_selection import train_test_split

To use the Intel® Extension for Scikit-learn* variant of this algorithm:
• Install Intel® oneAPI AI Analytics Toolkit (AI Kit)
• Add the following two lines of code before the scikit-learn import above, so that the patch is in effect when the function is imported:
from sklearnex import patch_sklearn
patch_sklearn()
Split the data and put 30% into the test set
train, test = train_test_split(data, test_size=0.3)

Other method for splitting data:
from sklearn.model_selection import ShuffleSplit
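A small runnable version of the split, using a made-up array of 100 rows. The `random_state` values are an addition for reproducibility and are not part of the slide syntax.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, train_test_split

# Hypothetical data: 100 rows, 2 columns.
data = np.arange(200).reshape(100, 2)

# Hold out 30% of the rows as the test set.
train, test = train_test_split(data, test_size=0.3, random_state=42)
print("train rows:", len(train), "test rows:", len(test))

# ShuffleSplit yields several independent random splits as index arrays.
splitter = ShuffleSplit(n_splits=3, test_size=0.3, random_state=42)
split_sizes = [(len(train_idx), len(test_idx))
               for train_idx, test_idx in splitter.split(data)]
print("ShuffleSplit sizes:", split_sizes)
```

Passing features and labels together, as in `train_test_split(X, y, test_size=0.3)`, returns four arrays split with the same row assignment.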
Beyond a Single Test Set: Cross Validation
[Figure: the dataset divided into training data and validation data]
[Figure: scatter plots of the training data and the test data (both axes ×10^8); caption: best model for this test set]
Beyond a Single Test Set: Cross Validation
[Figure sequence: four different partitions of the same dataset into Training Data 1 to 4 and Validation Data 1 to 4, with the validation block in a different position each time]
Beyond a Single Test Set: Cross Validation
Split 1: Training Split | Training Split | Training Split | Test Split
Split 2: Training Split | Training Split | Test Split | Training Split
Split 3: Training Split | Test Split | Training Split | Training Split
Split 4: Test Split | Training Split | Training Split | Training Split

Average the cross validation results.
Model Complexity vs Error
[Figure: training error J_train(θ) and cross validation error J_cv(θ) plotted against model complexity, shown next to polynomial fits of degree 1, 15, and 4]
• Underfitting (Polynomial Degree = 1): training and cross validation errors are both high
• Overfitting (Polynomial Degree = 15): training error is low, cross validation error is high
• Just right (Polynomial Degree = 4): training and cross validation errors are both low
Cross Validation: The Syntax
Import the cross validation function
from sklearn.model_selection import cross_val_score

Perform cross-validation with a given model (KNN must be an estimator instance, for example a K nearest neighbors regressor)
cross_val = cross_val_score(KNN, X_data, y_data, cv=4,
                            scoring='neg_mean_squared_error')

Other methods for cross validation:
from sklearn.model_selection import KFold, StratifiedKFold
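A runnable sketch of the call above. The synthetic sine data and the choice of `KNeighborsRegressor(n_neighbors=5)` for the `KNN` object are assumptions; the slides do not show how `KNN`, `X_data`, or `y_data` are defined.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data, assumed for illustration.
rng = np.random.default_rng(0)
X_data = rng.uniform(0.0, 10.0, size=(80, 1))
y_data = np.sin(X_data[:, 0]) + rng.normal(0.0, 0.1, 80)

KNN = KNeighborsRegressor(n_neighbors=5)

# One score per fold; scikit-learn reports the negated MSE so that
# larger values are always better.
cross_val = cross_val_score(KNN, X_data, y_data, cv=4,
                            scoring='neg_mean_squared_error')
mean_mse = -cross_val.mean()

print("scores per fold:", cross_val)
print("average MSE:", mean_mse)
```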
Introduction to Linear Regression
Correlation (r)
• Linear association between two variables
• Shows how to determine both the nature and the strength of the relationship between two variables
• Correlation lies between -1 and +1
• Zero correlation indicates that there is no linear relationship between the variables
• Pearson correlation coefficient
  - the most familiar measure of dependence between two quantities
Correlation (r)

\rho_{X,Y} = \operatorname{corr}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}

where E is the expected value operator, cov(,) means covariance, and corr(,) is a widely used alternative notation for the correlation coefficient

Reference:
https://en.wikipedia.org/wiki/Correlation_and_dependence
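The Pearson coefficient can be computed directly from its definition as covariance divided by the product of the standard deviations. A minimal sketch with made-up data; the function name `pearson_r` is hypothetical.

```python
import numpy as np


def pearson_r(x, y):
    """Sample Pearson correlation: cov(x, y) / (std(x) * std(y))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    covariance = np.mean((x - x.mean()) * (y - y.mean()))
    return float(covariance / (x.std() * y.std()))


budget = [10.0, 20.0, 30.0, 40.0, 50.0]
revenue = [25.0, 31.0, 48.0, 52.0, 70.0]

r = pearson_r(budget, revenue)
print("r =", round(r, 4))
print("numpy check:", round(float(np.corrcoef(budget, revenue)[0, 1]), 4))
```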
Linear Regression
[Figure: two panels, samples with ONE independent variable and samples with TWO independent variables]
Linear Regression
[Figure: scatter plot of Box Office revenue (y-axis, ×10^8) against Budget (x-axis, ×10^8) with a fitted line]
y_\beta(x) = \beta_0 + \beta_1 x
where y_\beta(x) is the box office revenue, x is the movie budget, \beta_0 is coefficient 0, and \beta_1 is coefficient 1
Example fit: \beta_0 = 80 million, \beta_1 = 0.6
Predicting from Linear Regression
Predict about 176 million gross for a 160 million budget (80 + 0.6 × 160 = 176)
Linear Regression
• Simple linear regression
  - A single independent variable is used to predict the dependent variable
• Multiple linear regression
  - Two or more independent variables are used to predict the dependent variable
Linear Regression
• How to represent the data as a vector/matrix
  - We assume a model:
\mathbf{y} = b_0 + \mathbf{X}\mathbf{b} + \boldsymbol{\epsilon},
where b_0 and \mathbf{b} are the intercept and slope, known as coefficients or parameters. \boldsymbol{\epsilon} is the error term (it is typically assumed that \epsilon \sim N(0, \sigma^2)).
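Linear Regression
• How to represent the data as a vector/matrix
  - Include the bias constant (intercept) in the input vector
  - \mathbf{X} \in \mathbb{R}^{n \times (p+1)}, \mathbf{y} \in \mathbb{R}^{n}, \mathbf{b} \in \mathbb{R}^{p+1}, and \mathbf{e} \in \mathbb{R}^{n}

\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}

\mathbf{X} = [\mathbf{1}, \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_p], \mathbf{b} = (b_0, b_1, b_2, \dots, b_p)^T
\mathbf{y} = (y_1, y_2, \dots, y_n)^T, \mathbf{e} = (e_1, e_2, \dots, e_n)^T
Here \mathbf{X}\mathbf{b} is a matrix-vector product.

• Find the optimal coefficient vector \mathbf{b} that makes the predictions closest to the observations

\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} b_0 \\ \vdots \\ b_p \end{pmatrix} + \begin{pmatrix} e_1 \\ \vdots \\ e_n \end{pmatrix}

equivalent to
y_i = 1 \cdot b_0 + x_{i1} b_1 + x_{i2} b_2 + \cdots + x_{ip} b_p + e_i  (1 \le i \le n)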
Which Model Fits the Best?
[Figure: scatter plot of Box Office revenue (y-axis, ×10^8) against Budget (x-axis, ×10^8) with several candidate lines]
Ordinary Least Squares (OLS)
\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}
• Used to estimate the unknown parameters (\mathbf{b}) in a linear regression model
• Minimizes the sum of the squares of the differences between the observed responses and the responses predicted by a linear function

Sum squared error = \sum_{i=1}^{n} (y_i - \mathbf{x}_i \mathbf{b})^2
where \mathbf{x}_i is the i-th row of \mathbf{X}
Calculating the Residuals
[Figure: scatter plot of Box Office revenue (y-axis, ×10^8) against Budget (x-axis, ×10^8) with the fitted line and the vertical distances to each point]
Residual = predicted value - observed value:
y_\beta(x_{obs}^{(i)}) - y_{obs}^{(i)} = (\beta_0 + \beta_1 x_{obs}^{(i)}) - y_{obs}^{(i)}
Mean Squared Error
[Figure: scatter plot of Box Office revenue (y-axis, ×10^8) against Budget (x-axis, ×10^8) with the fitted line]
\frac{1}{m} \sum_{i=1}^{m} \left( (\beta_0 + \beta_1 x_{obs}^{(i)}) - y_{obs}^{(i)} \right)^2
Minimum Mean Squared Error
\min_{\beta_0, \beta_1} \frac{1}{m} \sum_{i=1}^{m} \left( (\beta_0 + \beta_1 x_{obs}^{(i)}) - y_{obs}^{(i)} \right)^2
Optimization
• Need to minimize the error

\min_{\mathbf{b}} J(\mathbf{b}) = \sum_{i=1}^{n} (y_i - \mathbf{x}_i \mathbf{b})^2

• To obtain the optimal set of parameters (\mathbf{b}), the derivatives of the error with respect to each parameter must be zero.
Optimization
[Figure: training error J_train(b) and cross validation error J_cv(b) plotted as error curves]
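Optimization
J = \mathbf{e}'\mathbf{e} = (\mathbf{y} - \mathbf{X}\mathbf{b})'(\mathbf{y} - \mathbf{X}\mathbf{b})
  = (\mathbf{y}' - \mathbf{b}'\mathbf{X}')(\mathbf{y} - \mathbf{X}\mathbf{b})
  = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}\mathbf{b} - \mathbf{b}'\mathbf{X}'\mathbf{y} + \mathbf{b}'\mathbf{X}'\mathbf{X}\mathbf{b}
  = \mathbf{y}'\mathbf{y} - 2\mathbf{b}'\mathbf{X}'\mathbf{y} + \mathbf{b}'\mathbf{X}'\mathbf{X}\mathbf{b}
\frac{\partial \mathbf{e}'\mathbf{e}}{\partial \mathbf{b}} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\mathbf{b} = 0
(\mathbf{X}'\mathbf{X})\mathbf{b} = \mathbf{X}'\mathbf{y}
\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}
Matrix Cookbook: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf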
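The closed-form solution b = (X'X)^{-1} X'y can be checked numerically. The sketch below uses synthetic data with assumed true coefficients (1, 2, -3); it solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, and compares the result with NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2

# Design matrix with a leading column of ones for the intercept b_0.
features = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), features])

# Assumed true coefficients, used only to generate the data.
b_true = np.array([1.0, 2.0, -3.0])
y = X @ b_true + rng.normal(0.0, 0.1, n)

# Normal equations: (X'X) b = X'y
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from the library least-squares solver.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient -2 X'y + 2 X'X b should vanish at the solution.
gradient = -2 * X.T @ y + 2 * X.T @ X @ b_normal

print("normal equations:", b_normal)
print("lstsq:           ", b_lstsq)
print("max |gradient|:  ", np.abs(gradient).max())
```

When X'X is close to singular, for example with highly correlated features, a least-squares solver is usually the numerically safer choice.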
Comparing Linear Regression and KNN
Linear Regression                                       K Nearest Neighbors
• Fitting involves minimizing cost function (slow)      • Fitting involves storing training data (fast)
• Model has few parameters (memory efficient)           • Model has many parameters (memory intensive)
• Prediction involves calculation (fast)                • Prediction involves finding closest neighbors (slow)
Linear Regression: The Syntax
Import the class containing the regression method
from sklearn.linear_model import LinearRegression

Create an instance of the class
LR = LinearRegression()

Fit the instance on the data and then predict the expected value
LR = LR.fit(X_train, y_train)
y_predict = LR.predict(X_test)
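An end-to-end version of the syntax above on synthetic data. The data generation (budgets in millions, an assumed intercept of 80 and slope of 0.6 with Gaussian noise) is an assumption that loosely mirrors the box office example; the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data in millions: gross = 80 + 0.6 * budget + noise (assumed).
rng = np.random.default_rng(0)
budget = rng.uniform(10.0, 200.0, size=(100, 1))
gross = 80.0 + 0.6 * budget[:, 0] + rng.normal(0.0, 5.0, 100)

X_train, X_test, y_train, y_test = train_test_split(
    budget, gross, test_size=0.3, random_state=0)

LR = LinearRegression()
LR = LR.fit(X_train, y_train)
y_predict = LR.predict(X_test)

test_mse = mean_squared_error(y_test, y_predict)
print("intercept:", LR.intercept_, "slope:", LR.coef_[0])
print("test MSE:", test_mse)
```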
Advanced Linear Regression
Scaling is a Type of Feature Transformation
[Figure: two panels plotting Age against Number of Surgeries with different axis scales]
Scaling is a Type of Feature Transformation
Why is data transformation needed?
• The algorithm is more likely to be biased when the data distribution is skewed
• Transforming data onto the same scale allows the algorithm to better compare the relative relationships between data points
Transformation of Data Distributions
• Predictions from linear regression models assume residuals are normally distributed
• Features and predicted data are often skewed
• Data transformations can solve this issue

from numpy import log, log1p
from scipy.stats import boxcox

References:
https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/ch04.html
https://towardsdatascience.com/data-transformation-and-feature-engineering-e3c7dfbb4899
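The effect of a log transformation on a skewed feature can be measured with a simple skewness statistic. This sketch assumes lognormal synthetic data and a hand-written `skewness` helper (a hypothetical name); it uses only `log1p` from the imports above.

```python
import numpy as np


def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)


# Right-skewed synthetic feature (assumed lognormal for illustration).
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

# log1p computes log(1 + x), which stays finite at x = 0.
transformed = np.log1p(skewed)

print("skewness before:", round(skewness(skewed), 3))
print("skewness after: ", round(skewness(transformed), 3))

# expm1 inverts log1p, so predictions can be mapped back to the original scale.
recovered = np.expm1(transformed)
```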
Feature Types
Feature Type                                                 Transformation
• Continuous: numerical values                               • Standard Scaling, Min-Max Scaling
• Nominal: categorical, unordered features (True or False)   • One-hot encoding (0, 1)
• Ordinal: categorical, ordered features (movie ratings)     • Ordinal encoding (0, 1, 2, 3)

One-hot encoding example:
            Computer   Biology   Physics
Computer    1          0         0
Biology     0          1         0
Physics     0          0         1

from sklearn.preprocessing import LabelEncoder, LabelBinarizer, OneHotEncoder
from sklearn.feature_extraction import DictVectorizer
from pandas import get_dummies
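A short sketch of the two categorical encodings with scikit-learn. The category orders are passed explicitly so that the one-hot columns follow the Computer, Biology, Physics order of the table above; the rating labels ("poor" to "excellent") are made up for illustration, and `OrdinalEncoder` is used here although it does not appear in the import list above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Nominal feature: no order between the categories.
majors = np.array([["Computer"], ["Biology"], ["Physics"], ["Biology"]])
one_hot = OneHotEncoder(categories=[["Computer", "Biology", "Physics"]])
# fit_transform returns a sparse matrix by default; convert it for display.
majors_encoded = one_hot.fit_transform(majors).toarray()

# Ordinal feature: the listed order defines the integer codes 0, 1, 2, 3.
ratings = np.array([["good"], ["poor"], ["excellent"], ["fair"]])
ordinal = OrdinalEncoder(categories=[["poor", "fair", "good", "excellent"]])
ratings_encoded = ordinal.fit_transform(ratings)

print(majors_encoded)
print(ratings_encoded.ravel())
```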
Addition of Polynomial Features
The need for polynomial regression in ML can be understood from the points below:
• If we apply a linear model to a linear dataset, it gives a good result, as seen in simple linear regression. If we apply the same model without any modification to a non-linear dataset, the fit is poor: the loss increases, the error rate is high, and accuracy decreases.
• For such cases, where the data points are arranged in a non-linear fashion, we need the polynomial regression model.
• Hence, "In Polynomial regression, the original features are converted into Polynomial features of required degree (2,3,..,n) and then modeled using a linear model."
Addition of Polynomial Features
• Capture higher order features of data by adding polynomial features
• "Linear regression" means linear combinations of features
[Figure: Box Office revenue plotted against Budget, with a curved fit for each of the models below]
y_\beta(x) = \beta_0 + \beta_1 x + \beta_2 x^2
y_\beta(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3
y_\beta(x) = \beta_0 + \beta_1 \log(x)
Addition of Polynomial Features
Equation of the polynomial regression model:
Simple linear regression equation:    y = b_0 + b_1 x   ...(a)
Multiple linear regression equation:  y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + ... + b_n x_n   ...(b)
Polynomial regression equation:       y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + ... + b_n x^n   ...(c)

Reference: https://www.javatpoint.com/machine-learning-polynomial-regression
Addition of Polynomial Features
• Can also include variable interactions:
  y_\beta(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2
• How is the correct functional form chosen? Check the relationship of each variable with the outcome.
Polynomial Features: The Syntax
Import the class containing the transformation method
from sklearn.preprocessing import PolynomialFeatures

Create an instance of the class
polyFeat = PolynomialFeatures(degree=2)

Create the polynomial features and then transform the data
polyFeat = polyFeat.fit(X_data)
X_poly = polyFeat.transform(X_data)
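The transformed features can be passed straight to a linear model. In this sketch the data follow an assumed noise-free quadratic y = 1 + 2x + 0.5x², so the fitted coefficients should reproduce those values; `fit_intercept=False` is used because `PolynomialFeatures` already adds a constant column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Assumed quadratic relationship, without noise, for illustration.
X_data = np.linspace(-3.0, 3.0, 25).reshape(-1, 1)
y_data = 1.0 + 2.0 * X_data[:, 0] + 0.5 * X_data[:, 0] ** 2

polyFeat = PolynomialFeatures(degree=2)
polyFeat = polyFeat.fit(X_data)
X_poly = polyFeat.transform(X_data)  # columns: 1, x, x^2

# The model is still linear in the coefficients, only the features changed.
LR = LinearRegression(fit_intercept=False)
LR.fit(X_poly, y_data)

print("feature matrix shape:", X_poly.shape)
print("coefficients:", LR.coef_)
```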
Linear regression for classification
• For binary classification
  - Encode class labels as y ∈ {0, 1} or y ∈ {−1, 1}
  - Apply OLS
  - Check which class the prediction is closer to
• If class 1 is encoded as 1 and class 2 as −1:
    class 1 if f(x) ≥ 0
    class 2 if f(x) < 0
  - Linear models are NOT optimized for classification
• Logistic regression (NEXT)
Linear regression for classification
• ROC for classification → later with logistic regression
• More generally, compare f(x) with a threshold λ: if f(x) ≥ λ, predict class 1; otherwise predict class 2. How can we know the optimal λ?
• Let's revisit EVALUATION.
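The binary scheme can be sketched with plain least squares. The two well-separated Gaussian clusters, the ±1 encoding, and the `predict` helper with its `threshold` parameter (playing the role of λ) are assumptions for illustration.

```python
import numpy as np

# Two synthetic, well-separated clusters (assumed for illustration).
rng = np.random.default_rng(0)
class_1 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
class_2 = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))
features = np.vstack([class_1, class_2])
labels = np.concatenate([np.ones(50), -np.ones(50)])  # class 1 -> 1, class 2 -> -1

# Ordinary least squares on the encoded labels, with an intercept column.
X = np.column_stack([np.ones(len(labels)), features])
b, *_ = np.linalg.lstsq(X, labels, rcond=None)


def predict(new_features, threshold=0.0):
    """Return 1 where f(x) >= threshold and -1 otherwise."""
    X_new = np.column_stack([np.ones(len(new_features)), new_features])
    scores = X_new @ b
    return np.where(scores >= threshold, 1, -1)


training_accuracy = float(np.mean(predict(features) == labels))
print("training accuracy at threshold 0:", training_accuracy)

# Raising the threshold trades class 1 predictions for class 2 predictions.
print("class 1 predictions at threshold 0.5:",
      int(np.sum(predict(features, threshold=0.5) == 1)))
```

Sweeping the threshold and recording the true positive and false positive rates at each value is what produces an ROC curve.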
Linear regression for classification
• Multi-class classification
  - Encode class labels as:

            Computer   Biology   Physics
Computer    1          0         0
Biology     0          1         0
Physics     0          0         1

  - Perform linear (binary) regression for each class
Assumptions in Linear regression
• Linearity of the independent variables in the predictor
  - normally a good approximation, especially for high-dimensional data
• The error has a normal distribution, with mean zero and constant variance
  - important for tests
• Independent variables are independent of each other
  - Otherwise, a multicollinearity problem arises: two or more predictor variables are highly correlated
  - One of them should be removed
Think more!
               Feature 1   Feature 2   Feature 3   Feature 4
Coefficient    5.2         0.1         -6.6        0

• How can we interpret this model?
• What is the most useless feature?
  - Is it always useless for explaining the dependent variable?
• What do negative coefficients represent?
• What is the most informative feature?
Different views between Statistics and CS
• In Statistics, description of the model is often more important.
  - Which variables are more informative and reliable to describe the responses? → p-values
  - How much information do the variables have?
• In Computer Science, the accuracy of prediction and classification is more important.
  - How well can we predict/classify?
Discussion
• What if the data is imbalanced?
• Weighted MSE
Links
• Handwritten example: https://www.youtube.com/watch?v=orQ-QGaOPIg
• LR calculator: https://www.socscistatistics.com/tests/regression/default.aspx
• Scikit-learn tutorial: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
