13 Regression

The document discusses simple linear regression, specifically exploring the relationship between the heights of husbands and wives. It explains the concepts of slope, intercept, goodness of fit, and the importance of statistics like R² and p-value in assessing the strength and significance of the relationship. The document emphasizes that while regression provides a predictive model, it is essential to evaluate the accuracy and association of the predictions.

Uploaded by

MissCameraShy

Regression

Deepayan Chakrabarti

Simple linear regression
Let’s start with an innocuous question:
Do taller husbands have taller wives?

df.describe()
[Figure: the paired observations, and their summary statistics from df.describe()]

Or, more generally: do people of similar heights tend to marry each other?

First idea:
Group husbands into short and tall
Plot the heights of the wives of each group
Are wives of the tall-husband group taller than the wives of the short-husband group?
[Figure: mean wife heights for the short-husband and tall-husband groups]
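The two-group comparison can be sketched in a few lines of Python. The heights below are made-up illustrative values in cm, not the data behind the slides, and splitting at the median husband height is only an assumed way of defining "short" and "tall".

```python
from statistics import mean, median

# Hypothetical paired observations in cm (illustrative only, not the slide data).
husbands = [160, 165, 170, 172, 175, 178, 180, 185]
wives = [155, 158, 160, 163, 161, 166, 165, 170]

# Assumption: "short" means at or below the median husband height, "tall" means above it.
cutoff = median(husbands)

wives_of_short = [w for h, w in zip(husbands, wives) if h <= cutoff]
wives_of_tall = [w for h, w in zip(husbands, wives) if h > cutoff]

mean_short = mean(wives_of_short)
mean_tall = mean(wives_of_tall)

# The single number this approach measures: the gap between the two group means.
difference = mean_tall - mean_short

print(f"Mean wife height, short-husband group: {mean_short:.1f} cm")
print(f"Mean wife height, tall-husband group:  {mean_tall:.1f} cm")
print(f"Difference: {difference:.1f} cm")
```

A positive difference suggests that taller husbands tend to have taller wives, but every husband within a group is treated as identical.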
Advantages
Simple: just measure the difference between mean wife heights
Disadvantages
Throws away a lot of information!
Why only two groups?

Four groups of husbands
[Figure: wife heights for four groups of husbands]
Advantages
More fine-grained
Disadvantages
What do we measure now?
How do we conclude that wives also get taller alongside husbands?

Idea 2
Try to find a linear relationship between heights
Regression is a choice. You are trying to fit only a linear relationship!
Wife's height = 41.93 + 0.70 * husband's height
[Figure: scatter plot of wife's height against husband's height, with the fitted line]

Wife's height = 41.93 + 0.70 * husband's height
Intercept: 41.93
Slope: 0.70
Slope is positive ➔ positive association in heights

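Slope and intercept
The equation of a line: y = β0 + β1 x
The intercept β0 is the value of y at x = 0
The slope β1 is the rate of change: y changes by β1 for every unit increase in x
The sign of β1 matters!
β1 = 0: no association
β1 > 0: positive association
β1 < 0: negative association
[Figure: three example lines, with β1 > 0, β1 = 0, and β1 < 0]
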
Finding the best fit
Why choose this particular slope and intercept?
The data isn't exactly a line; we just took the "best fit" line
What does "best fit" mean?
Related to the concept of the "outlier"
Given husband's height = x, what is wife's height y?
We expect: yexp = 41.93 + 0.70 * x
We observe: some other ytrue
Error = (ytrue - yexp)²
The "best fit" line is the one that minimizes the sum of all errors
[Figure: scatter plot with the fitted line; the vertical gap between one point and the line is marked as the "error" for this point]

Defining the error in this particular manner is a choice. You could have done other calculations!
You always define a loss function for fitting any model, based on your assumptions!
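As a minimal sketch of what "minimizes the sum of all errors" means, the code below fits a line with the standard closed-form least-squares formulas and checks that nudging the line in any direction makes the total squared error larger. The data are made-up illustrative heights in cm (not the slide data), and the function names are hypothetical.

```python
# Hypothetical paired observations in cm (illustrative only, not the slide data).
husbands = [160, 165, 170, 172, 175, 178, 180, 185]
wives = [155, 158, 160, 163, 161, 166, 165, 170]


def sum_of_squared_errors(xs, ys, intercept, slope):
    """Sum of (ytrue - yexp)^2 over all points, where yexp = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def fit_line(xs, ys):
    """Closed-form least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(husbands, wives)
best_error = sum_of_squared_errors(husbands, wives, intercept, slope)

# Nudge the fitted line in every direction; each nudged line should fit worse.
nudged_errors = [
    sum_of_squared_errors(husbands, wives, intercept + di, slope + ds)
    for di in (-1.0, 0.0, 1.0)
    for ds in (-0.05, 0.0, 0.05)
    if (di, ds) != (0.0, 0.0)
]

print(f"Best fit: y = {intercept:.2f} + {slope:.3f} * x")
print(f"Sum of squared errors at the best fit: {best_error:.2f}")
print(f"Smallest error among nudged lines:     {min(nudged_errors):.2f}")
```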
Why pick this particular formula?
Error = (ytrue - yexp)²
because...
history
mathematical convenience
a belief that errors are "distributed" as a bell curve
but alternate formulas are also used, and there are many of them!
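To illustrate how the choice of formula changes the picture, here is a small sketch comparing the squared error with one common alternative, the absolute error |ytrue - yexp|. The slides do not say which alternatives they have in mind, so the absolute error is an assumed example, and the error values are made up.

```python
# Hypothetical errors (ytrue - yexp) in cm; the last point is a deliberate outlier.
errors = [1.0, -1.0, 2.0, 10.0]

squared_losses = [e ** 2 for e in errors]
absolute_losses = [abs(e) for e in errors]

# Share of the total loss contributed by the outlier under each formula.
outlier_share_squared = squared_losses[-1] / sum(squared_losses)
outlier_share_absolute = absolute_losses[-1] / sum(absolute_losses)

print(f"Outlier share of squared loss:  {outlier_share_squared:.0%}")
print(f"Outlier share of absolute loss: {outlier_share_absolute:.0%}")
```

The squared error lets a single far-away point dominate the total, so the fitted line is pulled toward outliers more strongly than it would be under the absolute error.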

Goodness of fit
How would you check for goodness of fit?
1. Prediction - How accurate are your predictions?
2. Association - Is there a positive or negative relationship?

Suppose we are told of a husband's height (say, 170 cm)
Predict wife's height
41.93 + 0.70 * 170 = 160.93 ≈ 161 cm
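The prediction step is just the fitted line evaluated at a new husband's height. A minimal sketch, using the intercept and slope quoted in the slides; the function name is hypothetical.

```python
# Intercept and slope of the fitted line, as quoted in the slides (heights in cm).
INTERCEPT = 41.93
SLOPE = 0.70


def predict_wife_height(husband_height_cm):
    """Predicted wife's height (cm) from the fitted line."""
    return INTERCEPT + SLOPE * husband_height_cm


predicted = predict_wife_height(170)
print(f"Predicted wife's height: {predicted:.2f} cm (about {round(predicted)} cm)")
```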

Thanks to regression:
We can predict wife's height given husband's height
Slope = 0.7 > 0 ➔ positive association
Should we believe these two statements?
These are different questions

For predictions
the noise around the best fit line matters
captured by a statistic called R²

How would you check for goodness of fit?
1. Check for prediction using R², i.e. how good are your predictions compared to just predicting the mean?
2. Check for association using the p-value, i.e. is there an association, or is it random?

High-level idea:
R² compares the regression prediction to just predicting the mean
Calculate (ytrue - ypred)², and sum it up ➔ the residual sum of squares
Calculate (ytrue - mean)², and sum it up ➔ the total sum of squares (the total variation around the mean)
R² = 1.0 - (residual sum of squares / total sum of squares)
R² is between 0 and 1
Higher is better
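The R² recipe above can be sketched directly. The data are made-up illustrative heights in cm (not the slide data), and the function names are hypothetical.

```python
# Hypothetical paired observations in cm (illustrative only, not the slide data).
husbands = [160, 165, 170, 172, 175, 178, 180, 185]
wives = [155, 158, 160, 163, 161, 166, 165, 170]


def fit_line(xs, ys):
    """Closed-form least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(xs, ys, intercept, slope):
    """1 - (residual sum of squares / total sum of squares) for the given line."""
    mean_y = sum(ys) / len(ys)
    residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - residual / total


intercept, slope = fit_line(husbands, wives)
r2 = r_squared(husbands, wives, intercept, slope)

# Baseline: a flat line at the mean wife height, i.e. "just predicting the mean".
mean_wife = sum(wives) / len(wives)
r2_mean_only = r_squared(husbands, wives, mean_wife, 0.0)

print(f"R² of the best-fit line:      {r2:.3f}")
print(f"R² of predicting the mean:    {r2_mean_only:.3f}")
```

Predicting the mean scores exactly 0 by construction, so R² reads as the fraction of the variation around the mean that the line accounts for.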

To check if the positive association really exists or not
we are just asking: are we sure the slope is > 0?
captured by another statistic called the p-value

Main idea:
If there really were no association, what are the chances that we would see an association at least this positive just due to "randomness"?
Smaller values are better
You may want it to be less than 0.05
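One way to make the main idea concrete is a permutation test: re-pair husbands and wives at random, which simulates "no association", and count how often the slope comes out at least as positive as the one observed. This is a swapped-in illustration, not necessarily how the p-value in the slides was computed; regression software usually reports a p-value based on a t-test instead. The data are made-up illustrative heights in cm.

```python
from itertools import permutations

# Hypothetical paired observations in cm (illustrative only, not the slide data).
husbands = [160, 165, 170, 172, 175, 178, 180, 185]
wives = [155, 158, 160, 163, 161, 166, 165, 170]


def best_fit_slope(xs, ys):
    """Least-squares slope of the line predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


observed_slope = best_fit_slope(husbands, wives)

# With 8 couples there are 8! = 40320 possible re-pairings, so we can try them all.
pairings = 0
at_least_as_positive = 0
for shuffled_wives in permutations(wives):
    pairings += 1
    # Small tolerance so floating-point noise does not exclude exact ties.
    if best_fit_slope(husbands, shuffled_wives) >= observed_slope - 1e-12:
        at_least_as_positive += 1

# One-sided p-value: fraction of "no association" pairings with a slope this positive.
p_value = at_least_as_positive / pairings

print(f"Observed slope: {observed_slope:.3f}")
print(f"Permutation p-value: {p_value:.5f}")
```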

R² and p-value measure different things!
All points on line ➔ R² = 1.0, but not sure of positive association ➔ large p-value
Points are far from line ➔ R² is small, but clearly positive association ➔ small p-value
[Figure: two scatter plots, one for each of the two cases above]

Relationship to correlation
Recall that the correlation coefficient also measured the degree to which the data fits a line
[Figure: example scatter plots illustrating correlation; source: Wikipedia]
But linear regression is also trying to fit a line
The R² statistic tries to measure just this goodness of fit
Are they related?
Very closely!
R² = square of the correlation coefficient
-1 <= Correlation <= 1
0 <= R² <= 1
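The identity R² = (correlation)² for simple linear regression can be checked numerically. A minimal sketch on made-up illustrative heights in cm (not the slide data); the function names are hypothetical.

```python
from math import sqrt

# Hypothetical paired observations in cm (illustrative only, not the slide data).
husbands = [160, 165, 170, 172, 175, 178, 180, 185]
wives = [155, 158, 160, 163, 161, 166, 165, 170]


def centered_sums(xs, ys):
    """Return (sxx, syy, sxy): sums of squares and cross-products around the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxx, syy, sxy


def correlation(xs, ys):
    """Pearson correlation coefficient."""
    sxx, syy, sxy = centered_sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def r_squared_of_best_fit(xs, ys):
    """R² of the least-squares line, computed from residuals rather than from r."""
    sxx, syy, sxy = centered_sums(xs, ys)
    slope = sxy / sxx
    intercept = sum(ys) / len(ys) - slope * sum(xs) / len(xs)
    residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - residual / syy


r = correlation(husbands, wives)
r2 = r_squared_of_best_fit(husbands, wives)

print(f"Correlation:        {r:.4f}")
print(f"Correlation ** 2:   {r ** 2:.4f}")
print(f"R² of the best fit: {r2:.4f}")
```

Note that squaring discards the sign: R² alone cannot tell a positive association from a negative one, while the correlation (like the slope) can.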
