Model Generalization
Legal Notices and Disclaimers
This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES,
EXPRESS OR IMPLIED, IN THIS SUMMARY.
Intel technologies’ features and benefits depend on system configuration and may require
enabled hardware, software or service activation. Performance varies depending on system
configuration. Check with your system manufacturer or retailer or learn more at intel.com.
This sample source code is released under the Intel Sample Source Code License Agreement.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2017, Intel Corporation. All rights reserved.
K Value Affects Decision Boundary
[Figure: two KNN decision boundaries on the same data, plotting Age (20-60) against Number of Malignant Nodes (0-20); one panel with K = 1, the other with K = 34]
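A sketch (not from the slides) of how K changes a KNN classifier's behavior; the data below is synthetic, standing in for the age and malignant-node features, and the label rule is invented:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.uniform(low=[0, 20], high=[20, 60], size=(200, 2))  # columns: nodes, age (made up)
y = (X[:, 0] > 8).astype(int)                               # synthetic label rule

for k in (1, 34):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # K = 1 memorizes the training points (training accuracy 1.0);
    # K = 34 averages over many neighbors, giving a smoother boundary
    print(k, knn.score(X, y))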
Choosing Between Different Complexities
[Figure: three panels fitting noisy samples of a true function with polynomial models of degree 1, 4, and 15; legend: Model, True Function, Samples]
How Well Does the Model Generalize?
[Figure: the same three polynomial fits, annotated:]
• Polynomial Degree = 1: poor at training, poor at predicting
• Polynomial Degree = 4: just right
• Polynomial Degree = 15: good at training, poor at predicting
Underfitting vs Overfitting
[Figure: the same three polynomial fits, annotated:]
• Polynomial Degree = 1: underfitting
• Polynomial Degree = 4: just right
• Polynomial Degree = 15: overfitting
Bias-Variance Tradeoff
[Figure: the same three polynomial fits, annotated:]
• Polynomial Degree = 1: high bias, low variance
• Polynomial Degree = 4: just right
• Polynomial Degree = 15: low bias, high variance
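A sketch of the degree comparison using numpy.polyfit; the true function and noise level are assumptions chosen to mimic the figure. The pattern to look for: degree 1 gives high error on both sets, degree 4 gives low error on both, and degree 15 typically gives low training error but a much higher error on fresh samples:

import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 30))
y = np.cos(1.5 * np.pi * x) + rng.normal(0, 0.1, 30)         # assumed true function + noise

x_new = np.sort(rng.uniform(0, 1, 30))                       # fresh samples to test generalization
y_new = np.cos(1.5 * np.pi * x_new) + rng.normal(0, 0.1, 30)

for degree in (1, 4, 15):
    coefs = np.polyfit(x, y, degree)                         # degree 15 may warn: poorly conditioned
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_new) - y_new) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")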
Training and Test Splits
[Diagram: the data is divided into Training Data and Test Data]
Using Training and Test Data
• Training Data: fit the model
• Test Data: measure performance
  - predict label with model
  - compare with actual value
  - measure error
Using Training and Test Data
[Figure: side-by-side scatter plots of the Training Data and Test Data (axes in units of 10^8), shown in three steps: fit the model on the training data, make predictions on the test data, then measure the error between predictions and the test values]
Fitting Training and Test Data
[Diagram: the Training Data supplies X_train and Y_train; the Test Data supplies X_test and Y_test]
model = KNN().fit(X_train, Y_train)            # fit the model on the training data
Y_predict = model.predict(X_test)              # predict labels for the test features
test_error = error_metric(Y_test, Y_predict)   # compare predictions with Y_test
Train and Test Splitting: The Syntax
Import the train and test split function:
from sklearn.model_selection import train_test_split
Split the data and put 30% into the test set:
train, test = train_test_split(data, test_size=0.3)
Other method for splitting data:
from sklearn.model_selection import ShuffleSplit
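In practice the features and the labels are usually split together; a minimal sketch with made-up arrays:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features (made up)
y = np.arange(10)

# random_state makes the shuffle reproducible; 30% of the rows go to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)   # (7, 2) (3, 2)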
Beyond a Single Test Set: Cross Validation
[Diagram: the data is divided into Training Data and Validation Data]
[Figure: training and test scatter plots (axes in units of 10^8) with the caption "Best model for this test set": a model tuned against one fixed split may be the best model only for that particular test set]
Beyond a Single Test Set: Cross Validation
[Diagram: the split is repeated, producing Training Data 1-4 and Validation Data 1-4, each round holding out a different portion of the data]
Beyond a Single Test Set: Cross Validation
[Diagram: 4-fold cross validation; each of the four rows splits the data into three Training Splits and one Test Split, with the Test Split rotating through all four positions; the per-fold results are added up and averaged]
Average cross validation results.
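This fold-and-average loop can be written out directly with scikit-learn's KFold; a sketch in which the data and the KNN regressor are stand-ins:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)   # made-up linear target + noise

scores = []
for train_idx, test_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(X):
    model = KNeighborsRegressor(n_neighbors=3).fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    scores.append(np.mean((y_pred - y[test_idx]) ** 2))   # per-fold MSE

print(np.mean(scores))   # average cross validation result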
Model Complexity vs Error
[Figure: error plotted against model complexity; $J_{cv}(\theta)$ is the cross validation error, $J_{train}(\theta)$ is the training error; the training error falls steadily with complexity while the cross validation error falls and then rises]
Model Complexity vs Error
Underfitting: training and cross validation error are both high.
[Figure: the degree-1 polynomial fit; legend: Model, True Function, Samples]
Model Complexity vs Error
Overfitting: training error is low, cross validation error is high.
[Figure: the degree-15 polynomial fit; legend: Model, True Function, Samples]
Model Complexity vs Error
Just right: training and cross validation errors are both low.
[Figure: the degree-4 polynomial fit; legend: Model, True Function, Samples]
Cross Validation: The Syntax
Import the cross validation score function:
from sklearn.model_selection import cross_val_score
Perform cross-validation with a given model:
cross_val = cross_val_score(KNN, X_data, y_data, cv=4,
                            scoring='neg_mean_squared_error')
Other methods for cross validation:
from sklearn.model_selection import KFold, StratifiedKFold
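Note that KNN must be an instantiated estimator, and that 'neg_mean_squared_error' reports scores as negative MSE so that larger is always better. A usage sketch with made-up data:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_data = rng.normal(size=(100, 2))
y_data = X_data @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=100)

KNN = KNeighborsRegressor(n_neighbors=3)           # an estimator instance
cross_val = cross_val_score(KNN, X_data, y_data, cv=4,
                            scoring='neg_mean_squared_error')
mse_per_fold = -cross_val                          # flip the sign back to MSE
print(mse_per_fold.mean())                         # average cross validation error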
Introduction to Linear Regression

$y_\beta(x) = \beta_0 + \beta_1 x$

where $y$ is the box office revenue, $x$ is the movie budget, and $\beta_0$, $\beta_1$ are the coefficients. For the example fit, $\beta_0 = 80$ million and $\beta_1 = 0.6$.
[Figure: scatter plot of BoxOffice vs. Budget, both in units of $10^8$, with a fitted line]
Predicting from Linear Regression

$y_\beta(x) = \beta_0 + \beta_1 x$, with $\beta_0 = 80$ million and $\beta_1 = 0.6$

Predict a 176 million gross for a 160 million budget: $80 + 0.6 \times 160 = 176$ million.
[Figure: reading the prediction off the fitted line]
Which Model Fits the Best?
[Figure: candidate fits to the BoxOffice vs. Budget data]
Calculating the Residuals

$y_\beta(x_{obs}^{(i)}) - y_{obs}^{(i)} = \beta_0 + \beta_1 x_{obs}^{(i)} - y_{obs}^{(i)}$

where $y_\beta(x_{obs}^{(i)})$ is the predicted value and $y_{obs}^{(i)}$ is the observed value.
[Figure: vertical distances between the fitted line and the observed points]
Mean Squared Error

$\frac{1}{m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{obs}^{(i)} - y_{obs}^{(i)} \right)^2$

[Figure: the BoxOffice vs. Budget data with a fitted line]
Minimum Mean Squared Error

$\min_{\beta_0, \beta_1} \frac{1}{m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{obs}^{(i)} - y_{obs}^{(i)} \right)^2$

[Figure: the line that minimizes the mean squared error]
Cost Function

$J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \beta_0 + \beta_1 x_{obs}^{(i)} - y_{obs}^{(i)} \right)^2$

[Figure: the BoxOffice vs. Budget data with the best-fit line]
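The cost function translates directly into numpy; a sketch using made-up budget and gross figures (in units of 10^8 dollars):

import numpy as np

x_obs = np.array([0.2, 0.5, 1.0, 1.6])   # budgets (made up)
y_obs = np.array([0.9, 1.1, 1.5, 1.8])   # grosses (made up)

def cost(beta0, beta1):
    # J(beta0, beta1) = (1 / 2m) * sum((beta0 + beta1 * x - y)^2)
    residuals = beta0 + beta1 * x_obs - y_obs
    return (residuals ** 2).sum() / (2 * len(x_obs))

print(cost(0.8, 0.6))   # cost at beta0 = 0.8e8, beta1 = 0.6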
Modelling Best Practice
• Use the cost function to fit the model
• Develop multiple models
• Compare results and choose the best one
Other Measures of Error

Sum of Squared Error (SSE): $\sum_{i=1}^{m} \left( y_\beta(x^{(i)}) - y_{obs}^{(i)} \right)^2$

Total Sum of Squares (TSS): $\sum_{i=1}^{m} \left( \bar{y}_{obs} - y_{obs}^{(i)} \right)^2$

Coefficient of Determination ($R^2$): $1 - \frac{SSE}{TSS}$
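These three metrics are a few lines of numpy; the sketch below uses made-up observed and predicted values and checks the hand-computed $R^2$ against scikit-learn's r2_score:

import numpy as np
from sklearn.metrics import r2_score

y_obs = np.array([1.0, 2.0, 3.0, 4.0])        # observed values (made up)
y_pred = np.array([1.1, 1.9, 3.2, 3.8])       # model predictions (made up)

sse = ((y_pred - y_obs) ** 2).sum()           # Sum of Squared Error
tss = ((y_obs.mean() - y_obs) ** 2).sum()     # Total Sum of Squares
r2 = 1 - sse / tss
print(r2, r2_score(y_obs, y_pred))            # the two values agree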
Comparing Linear Regression and KNN

Linear Regression:
• Fitting involves minimizing a cost function (slow)
• Model has few parameters (memory efficient)
• Prediction involves a calculation (fast)

K Nearest Neighbors:
• Fitting involves storing the training data (fast)
• Model has many parameters (memory intensive)
• Prediction involves finding the closest neighbors (slow)
Linear Regression: The Syntax
Import the class containing the regression method:
from sklearn.linear_model import LinearRegression
Create an instance of the class:
LR = LinearRegression()
Fit the instance on the data and then predict the expected value:
LR = LR.fit(X_train, y_train)
y_predict = LR.predict(X_test)
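Putting the pieces together, a sketch on synthetic data standing in for the budget and box-office example (the true coefficients and the noise are made up):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0.1, 2.0, size=(100, 1))            # budgets, units of 1e8 (made up)
y = 0.8 + 0.6 * X[:, 0] + rng.normal(0, 0.1, 100)   # gross = beta0 + beta1 * budget + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
LR = LinearRegression().fit(X_train, y_train)
y_predict = LR.predict(X_test)
print(LR.intercept_, LR.coef_)                      # recovered beta0, beta1
print(mean_squared_error(y_test, y_predict))        # test error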
Advanced Linear Regression
Scaling is a Type of Feature Transformation
[Figure: two scatter plots of Age against Number of Surgeries (1-5) drawn at different axis scales, showing that the apparent spread of the data depends on the units of each feature]
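A sketch of the two standard scalers on made-up age and surgery-count columns (the scaler classes are from scikit-learn; the data is invented):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up feature columns: age (years) and number of surgeries
X = np.array([[25., 1.], [40., 3.], [60., 5.], [33., 2.]])

# Standard scaling: zero mean, unit variance per column
print(StandardScaler().fit_transform(X))

# Min-max scaling: each column rescaled to [0, 1]
print(MinMaxScaler().fit_transform(X))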
Transformation of Data Distributions
• Predictions from linear regression models assume residuals are normally distributed
• Features and predicted data are often skewed
• Data transformations can solve this issue
from numpy import log, log1p
from scipy.stats import boxcox
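A sketch applying the imports above to made-up right-skewed data:

import numpy as np
from scipy.stats import boxcox

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # right-skewed data (made up)

logged = np.log1p(skewed)    # log(1 + x): handles zeros, pulls in the long tail
bc, lam = boxcox(skewed)     # Box-Cox chooses the power transform lambda by MLE
print(lam)                   # lambda near 0 behaves like a log transform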
Feature Types
• Continuous: numerical values
• Nominal: categorical, unordered features (True or False)
• Ordinal: categorical, ordered features (movie ratings)

Feature Type Transformation
• Continuous → Standard Scaling, Min-Max Scaling
• Nominal → One-hot encoding (0, 1)
• Ordinal → Ordinal encoding (0, 1, 2, 3)
from sklearn.preprocessing import LabelEncoder, LabelBinarizer, OneHotEncoder
from sklearn.feature_extraction import DictVectorizer
from pandas import get_dummies
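A sketch of one-hot and ordinal encoding using pandas (the columns and the rating order are invented for illustration):

import pandas as pd

df = pd.DataFrame({
    "is_sequel": [True, False, True],    # nominal feature
    "rating": ["PG", "R", "PG-13"],      # ordinal feature, ordering assumed below
})

# One-hot encode the nominal column: one 0/1 column per category
print(pd.get_dummies(df["is_sequel"], prefix="is_sequel"))

# Ordinal encode the ordered column by mapping categories to 0, 1, 2
order = {"PG": 0, "PG-13": 1, "R": 2}
print(df["rating"].map(order))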
Addition of Polynomial Features
• Capture higher order features of the data by adding polynomial features
• "Linear regression" means linear combinations of features

$y_\beta(x) = \beta_0 + \beta_1 x + \beta_2 x^2$
$y_\beta(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$
$y_\beta(x) = \beta_0 + \beta_1 \log(x)$

[Figure: curved fits to the BoxOffice vs. Budget data]

Addition of Polynomial Features
• Can also include variable interactions:
$y_\beta(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$
• How is the correct functional form chosen? Check the relationship of each variable with the outcome.
Polynomial Features: The Syntax
Import the class containing the transformation method:
from sklearn.preprocessing import PolynomialFeatures
Create an instance of the class:
polyFeat = PolynomialFeatures(degree=2)
Create the polynomial features and then transform the data:
polyFeat = polyFeat.fit(X_data)
X_poly = polyFeat.transform(X_data)
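A usage sketch chaining the transform with LinearRegression in a pipeline; the quadratic target is made up:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X_data = rng.uniform(-2, 2, size=(100, 1))
y_data = 1.0 + 0.5 * X_data[:, 0] + 2.0 * X_data[:, 0] ** 2   # assumed quadratic target

# degree=2 adds the columns [1, x, x^2]; the regression stays linear in those features
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_data, y_data)
print(model.predict([[1.0]]))   # approx 1.0 + 0.5 + 2.0 = 3.5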