Regression analysis
▷ A regression problem is composed of
• an outcome or response variable 𝑌
• a number of risk factors or predictor variables 𝑋𝑖 that affect 𝑌
  • also called explanatory variables, or features in the machine learning community
• a question about 𝑌, such as “how can we predict 𝑌 under different conditions?”
▷ 𝑌 is sometimes called the dependent variable and the 𝑋𝑖 the independent variables
• this is not the same meaning as statistical independence
• the terminology comes from an experimental setting in which the 𝑋𝑖 variables can be modified and the resulting changes in 𝑌 observed
Regression analysis: objectives

▷ Prediction: we want to estimate 𝑌 at some specific values of 𝑋𝑖
▷ Model inference: we want to learn about the relationship between 𝑌 and 𝑋𝑖, such as the combination of predictor variables which has the most effect on 𝑌
Univariate linear regression (when all you have is a single predictor variable)
Linear regression

▷ Linear regression: one of the simplest and most commonly used statistical modeling techniques
▷ Makes strong assumptions about the relationship between the predictor variables (𝑋𝑖) and the response (𝑌):
• a linear relationship, a straight line when plotted
• only valid for continuous outcome variables (not applicable to categorical outcomes such as success/failure)

“Fitting a line through data”:

𝑦 = 𝛽0 + 𝛽1 × 𝑥 + error

where 𝑦 is the outcome variable, 𝑥 the predictor variable, 𝛽0 the intercept and 𝛽1 the slope.

[Figure: a straight line fitted through scattered data points]
Linear regression

▷ Assumption: 𝑦 = 𝛽0 + 𝛽1 × 𝑥 + error
▷ Our task: estimate 𝛽0 and 𝛽1 based on the available data
▷ The resulting model is 𝑦̂ = 𝛽̂0 + 𝛽̂1 × 𝑥
• the “hats” on the variables indicate that they are estimated from the available data
• 𝑦̂ is read as “the estimator for 𝑦”
▷ 𝛽0 and 𝛽1 are called the model parameters or coefficients
▷ Objective: minimize the error, the difference between our observations and the predictions made by our linear model
• minimize the length of the red vertical segments in the figure (called the “residuals”)

[Figure: scatterplot with a fitted regression line; red vertical segments show the residuals]
Ordinary Least Squares regression

▷ Ordinary Least Squares (OLS) regression: a method for selecting the model parameters
• 𝛽0 and 𝛽1 are chosen to minimize the square of the distance between the predicted values and the actual values
• equivalent to minimizing the area of the red rectangles in the figure
▷ An application of a quadratic loss function
• in statistics and optimization theory, a loss function, or cost function, maps from an observation or event to a number that represents some form of “cost”

[Figure: scatterplot with a fitted line; squared residuals drawn as red rectangles]
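As a sketch of what OLS does under the hood (with made-up illustrative data, not from the slides): for a single predictor, the parameters minimizing the quadratic loss have a closed form, 𝛽̂1 = Σ(𝑥𝑖 − 𝑥̄)(𝑦𝑖 − 𝑦̄) / Σ(𝑥𝑖 − 𝑥̄)² and 𝛽̂0 = 𝑦̄ − 𝛽̂1𝑥̄:

```python
import numpy as np

# illustrative data (hypothetical values, for demonstration only)
x = np.array([40.0, 50.0, 60.0, 70.0, 80.0])
y = np.array([150.0, 160.0, 172.0, 178.0, 190.0])

# closed-form OLS estimates for a single predictor
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# these values minimize the sum of squared residuals (the quadratic loss)
residuals = y - (beta0 + beta1 * x)
print(beta0, beta1, np.sum(residuals ** 2))
```

Any other choice of intercept and slope gives a larger sum of squared residuals; `numpy.polyfit(x, y, 1)` returns the same estimates.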
Simple linear regression: example

▷ The British Doctors’ Study followed the health of a large number of physicians in the UK over the period 1951–2001
▷ Provided conclusive evidence of linkage between smoking and lung cancer, myocardial infarction, respiratory disease and other illnesses
▷ Provides data on annual mortality for a variety of diseases at four levels of cigarette smoking:
1 never smoked
2 1–14 per day
3 15–24 per day
4 25 or more per day
More information: ctsu.ox.ac.uk/research/british-doctors-study
Simple linear regression: the data

cigarettes smoked      CVD mortality               lung cancer mortality
(per day)              (per 100 000 men per year)  (per 100 000 men per year)
0                      572                         14
10 (actually 1–14)     802                         105
20 (actually 15–24)    892                         208
30 (actually >24)      1025                        355

CVD: cardiovascular disease
Source: British Doctors’ Study
Simple linear regression: plots

import pandas
import matplotlib.pyplot as plt

data = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                         "CVD": [572, 802, 892, 1025],
                         "lung": [14, 105, 208, 355]})
data.plot("cigarettes", "CVD", kind="scatter")
plt.title("Deaths for different smoking intensities")
plt.xlabel("Cigarettes smoked per day")
plt.ylabel("CVD deaths")

[Figure: scatterplot of CVD deaths against cigarettes smoked per day]

Quite tempting to conclude that cardiovascular disease deaths increase linearly with cigarette consumption…
Aside: beware assumptions of causality

1964: the US Surgeon General issues a report claiming that cigarette smoking causes lung cancer, based mostly on correlation data similar to that on the previous slide.

However, correlation is not sufficient to demonstrate causality. There might be some hidden genetic factor that causes both lung cancer and the desire for nicotine.

[Diagram: a hidden factor with arrows pointing to both smoking and lung cancer]
Beware assumptions of causality

▷ To demonstrate causality, you need a randomized controlled experiment
▷ Assume we have the power to force people to smoke or not smoke
• and ignore the moral issues for now!
▷ Take a large group of people and divide them randomly into two groups
• one group is obliged to smoke
• the other group is not allowed to smoke (the “control” group)
▷ Observe whether the smoker group develops more lung cancer than the control group
▷ The randomization eliminates any possible hidden factor causing both smoking and lung cancer
▷ More information: read about design of experiments
Fitting a linear model in Python

▷ In these examples, we use the statsmodels library for statistics in Python
• another possibility: the scikit-learn library for machine learning
▷ We use the formula interface to OLS regression, in statsmodels.formula.api
▷ Formulas are written outcome ~ observation
• meaning “build a linear model that predicts the variable outcome as a function of input data on the variable observation”
Fitting a linear model

import numpy, pandas
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                       "CVD": [572, 802, 892, 1025],
                       "lung": [14, 105, 208, 355]})
df.plot("cigarettes", "CVD", kind="scatter")
lm = smf.ols("CVD ~ cigarettes", data=df).fit()
xmin = df.cigarettes.min()
xmax = df.cigarettes.max()
X = numpy.linspace(xmin, xmax, 100)
# params[0] is the intercept (beta0)
# params[1] is the slope (beta1)
Y = lm.params[0] + lm.params[1] * X
plt.plot(X, Y, color="darkgreen")

[Figure: scatterplot of CVD deaths against cigarettes smoked per day, with the fitted regression line]
Parameters of the linear model

▷ 𝛽0 is the intercept of the regression line (where it meets the 𝑥 = 0 axis)
▷ 𝛽1 is the slope of the regression line (𝛽1 = Δ𝑦/Δ𝑥)
▷ Interpretation of 𝛽1 = 0.0475: a “unit” increase in cigarette smoking is associated with a 0.0475 “unit” increase in deaths from lung cancer

[Figure: regression line annotated with the intercept 𝛽0 and the slope 𝛽1 = Δ𝑦/Δ𝑥]
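A quick check of this interpretation (a sketch using the lung cancer data from the earlier slides): the fitted slope is exactly the change in the predicted value for a one-unit increase in the predictor.

```python
import pandas
import statsmodels.formula.api as smf

# lung cancer mortality data from the British Doctors' Study slide
df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                       "lung": [14, 105, 208, 355]})
lm = smf.ols("lung ~ cigarettes", data=df).fit()

# the slope: change in predicted mortality per extra cigarette per day
beta1 = lm.params["cigarettes"]
# a one-unit increase in the predictor shifts the prediction by beta1
pred = lm.predict({"cigarettes": [10, 11]})
print(beta1, pred[1] - pred[0])
```

With this data the slope is in “deaths per 100 000 men per year per additional daily cigarette”; the units of 𝛽1 are always (units of 𝑦) / (units of 𝑥).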
Scatterplot of lung cancer deaths

import pandas
import matplotlib.pyplot as plt

data = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                         "CVD": [572, 802, 892, 1025],
                         "lung": [14, 105, 208, 355]})
data.plot("cigarettes", "lung", kind="scatter")
plt.xlabel("Cigarettes smoked per day")
plt.ylabel("Lung cancer deaths")

[Figure: scatterplot of lung cancer deaths against cigarettes smoked per day]

Quite tempting to conclude that lung cancer deaths increase linearly with cigarette consumption…
Fitting a linear model

import numpy, pandas
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                       "CVD": [572, 802, 892, 1025],
                       "lung": [14, 105, 208, 355]})
df.plot("cigarettes", "lung", kind="scatter")
lm = smf.ols("lung ~ cigarettes", data=df).fit()
xmin = df.cigarettes.min()
xmax = df.cigarettes.max()
X = numpy.linspace(xmin, xmax, 100)
# params[0] is the intercept (beta0)
# params[1] is the slope (beta1)
Y = lm.params[0] + lm.params[1] * X
plt.plot(X, Y, color="darkgreen")

[Figure: scatterplot of lung cancer deaths against cigarettes smoked per day, with the fitted regression line]

Download the associated Python notebook at risk-engineering.org
Using the model for prediction

Q: What is the expected lung cancer mortality risk for a group of people who smoke 15 cigarettes per day?

import numpy, pandas
import statsmodels.formula.api as smf

df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                       "CVD": [572, 802, 892, 1025],
                       "lung": [14, 105, 208, 355]})
# create and fit the linear model
lm = smf.ols(formula="lung ~ cigarettes", data=df).fit()
# use the fitted model for prediction
lm.predict({"cigarettes": [15]}) / 100000.0
# probability of mortality from lung cancer, per person per year
array([ 0.001705])
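The fitted model can return predictions for several smoking levels at once; as a sketch, using the same data:

```python
import pandas
import statsmodels.formula.api as smf

df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                       "lung": [14, 105, 208, 355]})
lm = smf.ols("lung ~ cigarettes", data=df).fit()

# expected lung cancer mortality (per 100 000 men per year)
# at three intermediate smoking levels
pred = lm.predict({"cigarettes": [5, 15, 25]})
print(pred / 100000.0)  # as a probability per person per year
```

Note that these are interpolations within the observed range of the predictor; extrapolating far beyond it (say, to 80 cigarettes per day) would rest on the linearity assumption alone.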
Assessing model quality

▷ How do we assess how well the linear model fits our observations?
• make a visual check on a scatterplot
• use a quantitative measure of “goodness of fit”
▷ Coefficient of determination 𝑟²: a number that indicates how well data fit a statistical model
• it’s the proportion of the total variation of the outcomes explained by the model
• 𝑟² = 1: regression line fits perfectly
• 𝑟² = 0: regression line does not fit at all
▷ For simple linear regression, 𝑟² is simply the square of the sample correlation coefficient 𝑟
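Both facts can be checked numerically; a sketch using statsmodels and numpy on the lung cancer data from the earlier slides:

```python
import numpy, pandas
import statsmodels.formula.api as smf

df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                       "lung": [14, 105, 208, 355]})
lm = smf.ols("lung ~ cigarettes", data=df).fit()

# r-squared reported by the fitted model
print(lm.rsquared)

# for simple linear regression, it equals the squared sample correlation
r = numpy.corrcoef(df.cigarettes, df.lung)[0, 1]
print(r ** 2)
```

The two printed values agree; an 𝑟² close to 1 here reflects how closely the four data points lie to the fitted line.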