Regression analysis
▷ A regression problem is composed of
• an outcome or response variable 𝑌
• a number of risk factors or predictor variables 𝑋𝑖 that affect 𝑌
  • also called explanatory variables, or features in the machine learning community
• a question about 𝑌, such as “how can we predict 𝑌 under different conditions?”
▷ 𝑌 is sometimes called the dependent variable and the 𝑋𝑖 the independent variables
• this is not the same meaning as statistical independence
• the terminology comes from an experimental setting in which the 𝑋𝑖 variables can be modified and the resulting changes in 𝑌 observed
Regression analysis: objectives

▷ Prediction: we want to estimate 𝑌 at some specific values of 𝑋𝑖
▷ Model inference: we want to learn about the relationship between 𝑌 and 𝑋𝑖, such as the combination of predictor variables which has the most effect on 𝑌
Univariate linear regression (when all you have is a single predictor variable)
Linear regression

▷ Linear regression: one of the simplest and most commonly used statistical modeling techniques
▷ Makes strong assumptions about the relationship between the predictor variables (𝑋𝑖) and the response (𝑌):
• a linear relationship, a straight line when plotted
• only valid for continuous outcome variables (not applicable to categorical outcomes such as success/failure)

“Fitting a line through data”:

𝑦 = 𝛽0 + 𝛽1 × 𝑥 + error

where 𝑦 is the outcome variable, 𝑥 the predictor variable, 𝛽0 the intercept and 𝛽1 the slope.

[Figure: a straight line fitted through scattered data points]
Linear regression

▷ Assumption: 𝑦 = 𝛽0 + 𝛽1 × 𝑥 + error
▷ Our task: estimate 𝛽0 and 𝛽1 based on the available data
▷ The resulting model is 𝑦̂ = 𝛽̂0 + 𝛽̂1 × 𝑥
• the “hats” on the variables indicate that they are estimated from the available data
• 𝑦̂ is read as “the estimator for 𝑦”
▷ 𝛽0 and 𝛽1 are called the model parameters or coefficients
▷ Objective: minimize the error, the difference between our observations and the predictions made by our linear model
• minimize the length of the red vertical segments in the figure (called the “residuals”)

[Figure: scatterplot with a fitted regression line; red vertical segments show the residuals]
Ordinary Least Squares regression

▷ Ordinary Least Squares (OLS) regression: a method for selecting the model parameters
• 𝛽0 and 𝛽1 are chosen to minimize the square of the distance between the predicted values and the actual values
• equivalent to minimizing the area of the red rectangles in the figure
▷ An application of a quadratic loss function
• in statistics and optimization theory, a loss function, or cost function, maps from an observation or event to a number that represents some form of “cost”

[Figure: scatterplot with a fitted line; squared residuals drawn as red rectangles]
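As a sketch of what OLS does under the hood (with made-up illustrative data, not from the slides): for a single predictor, the parameters minimizing the quadratic loss have a closed form, 𝛽̂1 = Σ(𝑥𝑖 − 𝑥̄)(𝑦𝑖 − 𝑦̄) / Σ(𝑥𝑖 − 𝑥̄)² and 𝛽̂0 = 𝑦̄ − 𝛽̂1𝑥̄:

```python
import numpy as np

# illustrative data (hypothetical values, for demonstration only)
x = np.array([40.0, 50.0, 60.0, 70.0, 80.0])
y = np.array([150.0, 160.0, 172.0, 178.0, 190.0])

# closed-form OLS estimates for a single predictor
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# these values minimize the sum of squared residuals (the quadratic loss)
residuals = y - (beta0 + beta1 * x)
print(beta0, beta1, np.sum(residuals ** 2))
```

Any other choice of intercept and slope gives a larger sum of squared residuals; `numpy.polyfit(x, y, 1)` returns the same estimates.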
Simple linear regression: example

▷ The British Doctors’ Study followed the health of a large number of physicians in the UK over the period 1951–2001
▷ Provided conclusive evidence of linkage between smoking and lung cancer, myocardial infarction, respiratory disease and other illnesses
▷ Provides data on annual mortality for a variety of diseases at four levels of cigarette smoking:
1 never smoked
2 1–14 per day
3 15–24 per day
4 25 or more per day
More information: ctsu.ox.ac.uk/research/british-doctors-study
Simple linear regression: the data

cigarettes smoked      CVD mortality               lung cancer mortality
(per day)              (per 100 000 men per year)  (per 100 000 men per year)
0                      572                         14
10 (actually 1–14)     802                         105
20 (actually 15–24)    892                         208
30 (actually >24)      1025                        355

CVD: cardiovascular disease
Source: British Doctors’ Study
Simple linear regression: plots

import pandas
import matplotlib.pyplot as plt

data = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                         "CVD": [572, 802, 892, 1025],
                         "lung": [14, 105, 208, 355]})
data.plot("cigarettes", "CVD", kind="scatter")
plt.title("Deaths for different smoking intensities")
plt.xlabel("Cigarettes smoked per day")
plt.ylabel("CVD deaths")

[Figure: scatterplot of CVD deaths against cigarettes smoked per day]

Quite tempting to conclude that cardiovascular disease deaths increase linearly with cigarette consumption…
Aside: beware assumptions of causality

1964: the US Surgeon General issues a report claiming that cigarette smoking causes lung cancer, based mostly on correlation data similar to that on the previous slide.

However, correlation is not sufficient to demonstrate causality. There might be some hidden genetic factor that causes both lung cancer and the desire for nicotine.

[Diagram: a hidden factor with arrows pointing to both smoking and lung cancer]
Beware assumptions of causality

▷ To demonstrate causality, you need a randomized controlled experiment
▷ Assume we have the power to force people to smoke or not smoke
• and ignore the moral issues for now!
▷ Take a large group of people and divide them randomly into two groups
• one group is obliged to smoke
• the other group is not allowed to smoke (the “control” group)
▷ Observe whether the smoker group develops more lung cancer than the control group
▷ The randomization eliminates any possible hidden factor causing both smoking and lung cancer
▷ More information: read about design of experiments
Fitting a linear model in Python

▷ In these examples, we use the statsmodels library for statistics in Python
• another possibility: the scikit-learn library for machine learning
▷ We use the formula interface to OLS regression, in statsmodels.formula.api
▷ Formulas are written outcome ~ observation
• meaning “build a linear model that predicts the variable outcome as a function of input data on the variable observation”
Fitting a linear model

import numpy, pandas
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                       "CVD": [572, 802, 892, 1025],
                       "lung": [14, 105, 208, 355]})
df.plot("cigarettes", "CVD", kind="scatter")
lm = smf.ols("CVD ~ cigarettes", data=df).fit()
xmin = df.cigarettes.min()
xmax = df.cigarettes.max()
X = numpy.linspace(xmin, xmax, 100)
# params[0] is the intercept (beta0)
# params[1] is the slope (beta1)
Y = lm.params[0] + lm.params[1] * X
plt.plot(X, Y, color="darkgreen")

[Figure: scatterplot of CVD deaths against cigarettes smoked per day, with the fitted regression line]
Parameters of the linear model

▷ 𝛽0 is the intercept of the regression line (where it meets the 𝑥 = 0 axis)
▷ 𝛽1 is the slope of the regression line (𝛽1 = Δ𝑦/Δ𝑥)
▷ Interpretation of 𝛽1 = 0.0475: a “unit” increase in cigarette smoking is associated with a 0.0475 “unit” increase in deaths from lung cancer

[Figure: regression line annotated with the intercept 𝛽0 and the slope 𝛽1 = Δ𝑦/Δ𝑥]
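A quick check of this interpretation (a sketch using the lung cancer data from the earlier slides): the fitted slope is exactly the change in the predicted value for a one-unit increase in the predictor.

```python
import pandas
import statsmodels.formula.api as smf

# lung cancer mortality data from the British Doctors' Study slide
df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                       "lung": [14, 105, 208, 355]})
lm = smf.ols("lung ~ cigarettes", data=df).fit()

# the slope: change in predicted mortality per extra cigarette per day
beta1 = lm.params["cigarettes"]
# a one-unit increase in the predictor shifts the prediction by beta1
pred = lm.predict({"cigarettes": [10, 11]})
print(beta1, pred[1] - pred[0])
```

With this data the slope is in “deaths per 100 000 men per year per additional daily cigarette”; the units of 𝛽1 are always (units of 𝑦) / (units of 𝑥).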
Scatterplot of lung cancer deaths

import pandas
import matplotlib.pyplot as plt

data = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                         "CVD": [572, 802, 892, 1025],
                         "lung": [14, 105, 208, 355]})
data.plot("cigarettes", "lung", kind="scatter")
plt.xlabel("Cigarettes smoked per day")
plt.ylabel("Lung cancer deaths")

[Figure: scatterplot of lung cancer deaths against cigarettes smoked per day]

Quite tempting to conclude that lung cancer deaths increase linearly with cigarette consumption…
Fitting a linear model

import numpy, pandas
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                       "CVD": [572, 802, 892, 1025],
                       "lung": [14, 105, 208, 355]})
df.plot("cigarettes", "lung", kind="scatter")
lm = smf.ols("lung ~ cigarettes", data=df).fit()
xmin = df.cigarettes.min()
xmax = df.cigarettes.max()
X = numpy.linspace(xmin, xmax, 100)
# params[0] is the intercept (beta0)
# params[1] is the slope (beta1)
Y = lm.params[0] + lm.params[1] * X
plt.plot(X, Y, color="darkgreen")

[Figure: scatterplot of lung cancer deaths against cigarettes smoked per day, with the fitted regression line]

Download the associated Python notebook at risk-engineering.org
Using the model for prediction

Q: What is the expected lung cancer mortality risk for a group of people who smoke 15 cigarettes per day?

import numpy, pandas
import statsmodels.formula.api as smf

df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                       "CVD": [572, 802, 892, 1025],
                       "lung": [14, 105, 208, 355]})
# create and fit the linear model
lm = smf.ols(formula="lung ~ cigarettes", data=df).fit()
# use the fitted model for prediction
lm.predict({"cigarettes": [15]}) / 100000.0
# probability of mortality from lung cancer, per person per year
array([ 0.001705])
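The fitted model can return predictions for several smoking levels at once; as a sketch, using the same data:

```python
import pandas
import statsmodels.formula.api as smf

df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                       "lung": [14, 105, 208, 355]})
lm = smf.ols("lung ~ cigarettes", data=df).fit()

# expected lung cancer mortality (per 100 000 men per year)
# at three intermediate smoking levels
pred = lm.predict({"cigarettes": [5, 15, 25]})
print(pred / 100000.0)  # as a probability per person per year
```

Note that these are interpolations within the observed range of the predictor; extrapolating far beyond it (say, to 80 cigarettes per day) would rest on the linearity assumption alone.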
Assessing model quality

▷ How do we assess how well the linear model fits our observations?
• make a visual check on a scatterplot
• use a quantitative measure of “goodness of fit”
▷ Coefficient of determination 𝑟²: a number that indicates how well data fit a statistical model
• it’s the proportion of the total variation of the outcomes explained by the model
• 𝑟² = 1: regression line fits perfectly
• 𝑟² = 0: regression line does not fit at all
▷ For simple linear regression, 𝑟² is simply the square of the sample correlation coefficient 𝑟
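Both facts can be checked numerically; a sketch using statsmodels and numpy on the lung cancer data from the earlier slides:

```python
import numpy, pandas
import statsmodels.formula.api as smf

df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                       "lung": [14, 105, 208, 355]})
lm = smf.ols("lung ~ cigarettes", data=df).fit()

# r-squared reported by the fitted model
print(lm.rsquared)

# for simple linear regression, it equals the squared sample correlation
r = numpy.corrcoef(df.cigarettes, df.lung)[0, 1]
print(r ** 2)
```

The two printed values agree; an 𝑟² close to 1 here reflects how closely the four data points lie to the fitted line.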