Session 3 - Chapter 06 Linear Reg

Chapter 6 discusses multiple linear regression, focusing on the distinction between explanatory and predictive modeling. It emphasizes the importance of selecting a subset of predictors to enhance model accuracy and robustness, using the example of predicting used Toyota Corolla prices. The chapter also outlines various methods for variable selection and the significance of metrics like AIC and BIC in model evaluation.


Chapter 6: Multiple Linear Regression

We assume a linear relationship between predictors and outcome:

Y = β0 + β1 x1 + β2 x2 + … + βp xp + ε

where Y is the outcome, β0 is the constant, β1, …, βp are the coefficients, x1, …, xp are the predictors, and ε is the error (noise).
Topics
• Explanatory vs. predictive modeling with regression
• Example: prices of Toyota Corollas
• Fitting a predictive model
• Assessing predictive accuracy
• Selecting a subset of predictors
Explanatory Modeling
Goal: Explain relationship between predictors (explanatory variables) and target

● Familiar use of regression in data analysis
● Model Goal: Fit the data well and understand the contribution of explanatory variables to the model
● Metrics: “goodness-of-fit” - R², residual analysis, p-values
Predictive Modeling
Goal: Predict target values in new data where we have predictor values, but not target values

● Classic data mining context
● Model Goal: Optimize predictive accuracy
● Train model on training data
● Assess performance on validation (hold-out) data
● Explaining the role of predictors is not the primary purpose (but useful)
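A minimal sketch of this train/validate workflow in R (the data frame car.df and the column Price are placeholder names, not from the slides):

# Minimal sketch: fit on training data, assess on a hold-out set.
# car.df and Price are placeholder names.
set.seed(1)
train.idx <- sample(nrow(car.df), round(0.6 * nrow(car.df)))  # 60% training
train.df <- car.df[train.idx, ]
valid.df <- car.df[-train.idx, ]

fit <- lm(Price ~ ., data = train.df)        # train on training data only
pred <- predict(fit, newdata = valid.df)     # predict the hold-out records

err <- valid.df$Price - pred                 # errors on the hold-out set
c(ME = mean(err), RMSE = sqrt(mean(err^2)))  # predictive accuracy measures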
Explanatory vs. Predictive Modeling
1. A good explanatory model: fits the data closely
2. A good predictive model: predicts new records accurately
3. In explanatory models, the entire dataset is used for estimating the best-fit model.
4. For predictive models, the data are typically split into a training set
and a validation set.
5. Performance measures for explanatory models measure how close
the data fit the model, whereas in predictive models, performance
is measured by predictive accuracy.
6. In explanatory models the focus is on the coefficients (β), whereas
in predictive models the focus is on the predictions (yˆ).
It is extremely important to know the goal of the analysis before
beginning the modeling process.
Estimating Regression Equation and Prediction
• Predictions are best if they are unbiased (a prediction is unbiased if its expected value equals the true value)
• Predictions will have the smallest mean squared error among unbiased estimates if we make the following assumptions:
1. Linearity: The relationship between X and Y is linear.
2. Independence of errors: The residuals are independent of one another.
3. Normality of errors: The residuals are approximately normally distributed.
4. Equal variances: The variance of the residuals is constant for all values of X.
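These assumptions can be checked visually from the residuals of a fitted model (fit is the lm object from the earlier sketch):

par(mfrow = c(1, 2))
plot(fitted(fit), resid(fit),            # constant spread, no pattern => OK
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(resid(fit))                       # points near the line => normality
qqline(resid(fit))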
Example: Prices of Toyota Corolla
Problem Statement: A large Toyota car dealership offers purchasers of new Toyota cars the option to buy their used car as part of a trade-in. In particular, a new promotion promises to pay high prices for used Toyota Corolla cars for purchasers of a new car. The dealer then sells the used cars for a small profit. To ensure a reasonable profit, the dealer needs to be able to predict the price that the dealership will get for the used cars.

Goal: Predict prices of used Toyota Corollas based on their specification
Data: Prices of used Toyota Corollas, with their specification information
Variables Used
Price in Euros
Age in months as of 8/04
KM (kilometers)
Fuel Type (diesel, petrol, CNG)
HP (horsepower)
Metallic color (1=yes, 0=no)
Automatic transmission (1=yes, 0=no)
CC (cylinder volume)
Doors
Quarterly_Tax (road tax)
Weight (in kg)
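A sketch of fitting this model in R; the file name and variable names follow the exhaustive-search output shown later in the deck and are assumptions, not confirmed by the slides:

toyota.df <- read.csv("ToyotaCorolla.csv")   # placeholder file name
# Fuel Type is categorical; lm() expands it into dummy variables
# (Diesel and Petrol, with CNG as the reference level)
fit <- lm(Price ~ Age_08_04 + KM + Fuel_Type + HP + Met_Color + Automatic +
            CC + Doors + Quarterly_Tax + Weight, data = toyota.df)
summary(fit)   # coefficients, R-squared, p-values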
Distribution of Residuals (Holdout Set)

[Histogram of residuals on the holdout set: roughly symmetric distribution, with a few outliers]
Observations
• Note that the mean error (ME) is $19.6 and the RMSE is $1325.
• A histogram of the residuals shows that most of the errors
are between ±$2000.
• This error magnitude might be small relative to the car
price but should be taken into account when considering
the profit.
• Measures such as the mean error, RMSE, and error percentiles are used to assess the predictive performance of a model and to compare models.
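The histogram and error percentiles can be reproduced from the hold-out errors err computed in the earlier sketch:

hist(err, breaks = 25, main = "Holdout residuals")      # most within ±2000
quantile(err, probs = c(0.05, 0.25, 0.50, 0.75, 0.95))  # error percentiles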
Variable Selection: Reducing No. of
Predictors
Why bother to select a subset? Can we use all the variables in the model?
• A previously hidden relationship might emerge.
• Ex: a company found that customers who had purchased anti-scuff
protectors for chair and table legs had lower credit risks.
Subset selection is needed.
• It may be expensive or not feasible to collect a full complement of
predictors.
• We may be able to measure fewer predictors more accurately.
• The more predictors, the higher the chance of missing values in the data.
• Parsimony is an important property of good models. We obtain more
insight into the influence of predictors in models with few parameters.
Variable Selection: Reducing No. of
Predictors
• One very rough rule of thumb is to have a number of records n
larger than 5(p + 2), where p is the number of predictors.
• Using predictors that are uncorrelated with the outcome
variable increases the variance of predictions.
• Dropping predictors that are actually correlated with the
outcome variable can increase the average error (bias) of
predictions.
• There is a trade-off between too few and too many predictors.
In general, accepting some bias can reduce the variance in
predictions.
• Methods for reducing the number of predictors p to a smaller
set are often used.
Feature (Variable, Predictor) Selection
• Why select a subset of attributes to predict the target?
• Problems with more predictors/attributes:
• Expensive data collection
• More missing data
• Multicollinearity – some predictors behave the same way
• Some predictors may be uncorrelated with the target variable
• The goal
• Find parsimonious model (simplest model that performs
sufficiently well)
• More robust & higher predictive accuracy
• Variable selection methods
• Exhaustive search
• Partial Subset selection: Forward
• Partial Subset selection: Backward
• Partial Subset selection: Stepwise
Exhaustive Search = Best Subset
● All possible subsets of predictors assessed (single, pairs, triplets, etc.)
● Computationally intensive, not feasible for big data
● Judge by “adjusted R²”
● Adjusted R² applies a penalty for the number of predictors:

R²adj = 1 − [(n − 1) / (n − p − 1)] (1 − R²)

● The factor (n − 1)/(n − p − 1) is the penalty for the number of predictors: it prevents the artificial increase in R² that occurs whenever a predictor is added, without discarding real explanatory information.
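summary() on a fitted lm reports both the plain and the adjusted R²; the penalty can be verified by hand from the formula above (fit is the lm object from earlier):

s <- summary(fit)
n <- nobs(fit)
p <- length(coef(fit)) - 1            # number of predictors
c(R2 = s$r.squared, adjR2 = s$adj.r.squared)
1 - (n - 1) / (n - p - 1) * (1 - s$r.squared)   # equals s$adj.r.squared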
Specialized Metrics Used in Regression
(lower values are better)
Criteria for balancing over-fitting and under-fitting:
• Akaike Information Criterion (AIC)
• AIC = n ln(SSE/n) + n(1 + ln(2π)) + 2(p + 1)
• Bayesian Information Criterion (BIC)
• BIC = n ln(SSE/n) + n(1 + ln(2π)) + ln(n)(p + 1)
• AIC and BIC measure the goodness of fit of a model, but also
include a penalty that is a function of the number of parameters
in the model.
• Smaller AIC and BIC values are considered better.
• Mallows’ Cp
• Cp = SSE/σ²_full + 2(p + 1) − n, where σ²_full is the estimated error variance of the full model (with all predictors)
• Mallows’ Cp is equivalent to AIC for large samples
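In R, both criteria are available directly for a fitted model; computing the slide's AIC formula by hand shows the penalty at work (fit from earlier):

AIC(fit)   # Akaike Information Criterion
BIC(fit)   # Bayesian Information Criterion

n <- nobs(fit)
p <- length(coef(fit)) - 1            # number of predictors
SSE <- sum(resid(fit)^2)
n * log(SSE / n) + n * (1 + log(2 * pi)) + 2 * (p + 1)   # slide's AIC
# R's AIC() also counts the error variance as a parameter, so it is
# larger by a constant 2; model rankings are unaffected.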
Exhaustive output shows best model for each
number of predictors
sum$which
(Intercept) Age_08_04 KM HP Met_Color Auto CC Doors Q_Tax Weight Diesel Petrol
1 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2 TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
3 TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
4 TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
5 TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
6 TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE
7 TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
8 TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
9 TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
10 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
11 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Each row is the best model for a given # of predictors, “TRUE” and
“FALSE” show whether the variable is included
Adjusted R2 and CP for the models with 1 predictor, 2 predictors, 3
predictors, etc. (exhaustive search method)

> sum$adjr2
[1] 0.753 0.794 0.843 0.862 0.865 0.868 0.869 0.868 0.868 0.868 0.868
> sum$cp
[1] 520.44 333.23 114.69 28.75 18.29 4.16 3.96 5.26 7.08 9.01 11.00

Metrics improve until you hit 6-7 predictors, then stabilize, so choose a model with 6-7 predictors.
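Output of this kind can be produced with an exhaustive search such as regsubsets() from the leaps package (a sketch; train.df is the training data frame from the earlier split):

library(leaps)
search <- regsubsets(Price ~ ., data = train.df, nbest = 1, nvmax = 11,
                     method = "exhaustive")
sum <- summary(search)
sum$which    # best model for each number of predictors (TRUE/FALSE)
sum$adjr2    # adjusted R-squared for each model size
sum$cp       # Mallows' Cp for each model size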
Exhaustive search may be computationally
infeasible - some alternatives:

FORWARD SELECTION
● Start with no predictors
● Add them one by one (add the one with the largest contribution)
● Stop when the addition is not statistically significant

BACKWARD ELIMINATION
● Start with all predictors
● Successively eliminate the least useful predictors one by one
● Stop when all remaining predictors have statistically significant contributions

STEPWISE
● Like Forward Selection
● Except at each step, also consider dropping non-significant predictors
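In R, all three searches can be run with step(), which by default uses AIC rather than p-values as the stopping criterion (a sketch continuing from the training data above):

full <- lm(Price ~ ., data = train.df)   # model with all predictors
null <- lm(Price ~ 1, data = train.df)   # intercept-only model

back <- step(full, direction = "backward")                        # backward
fwd  <- step(null, scope = formula(full), direction = "forward")  # forward
both <- step(null, scope = formula(full), direction = "both")     # stepwise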
Summary
● Linear regression models are very popular tools, not only for explanatory modeling, but also for prediction
● A good predictive model has high predictive accuracy (to a useful practical level)
● Predictive models are fit to training data, and predictive accuracy is evaluated on a separate validation data set
● Removing redundant predictors is key to achieving predictive accuracy and robustness
● Subset selection helps find “good” candidate models.
