MBBS Lecture 5

The document outlines the methods of measuring relationships between variables through regression and correlation analysis, focusing on simple linear, multiple linear, and logistic regression. It details the learning objectives, statistical principles, and practical applications, including hypothesis testing and interpretation of results. Additionally, it emphasizes the importance of checking assumptions for valid regression models and the distinction between correlation and causation.


Measures of relationship

Presentation outline

• Learning objectives
• Introduction
• Regression Analysis
- Simple linear regression
- Multiple linear regression
- Logistic regression

• Correlation Analysis
- Curvilinear Relationship
- Coefficient of Determination
Learning objectives

After this topic, you should be able to:


• Distinguish between the basic purpose of regression
analysis and correlation analysis
• Compute and interpret a regression equation
• Interpret the coefficient of simple linear regression
• Know how to extend simple linear regression to consider
multiple risk variables
• Know how to apply some of the ideas from linear regression
when the outcome of interest is a binary outcome
• Know how and when to apply logistic regression and how to
interpret a relationship represented by an odds ratio
Introduction

• Statistical analysis is concerned not only with summarizing data but also with investigating relationships
• Some of our most intriguing scientific questions deal with the relationship between two variables
• Does a relationship exist between use of oral contraceptives and the incidence of disease?
• What is the relationship of a mother’s weight to her baby’s birth weight?
• These are typical of countless questions we pose in seeking to understand the relationship between two variables
• There is an all-too-human tendency to attribute a cause-and-effect relationship to variables that might be related
• We discuss the methods of measuring the relationships of bivariate data
• Determine the strength of the relationships
• Make inferences to the population from which the sample was drawn
Regression Analysis
• Sir Francis Galton coined the term regression during his study of heredity laws
• He observed that physical characteristics of children were correlated with those of their fathers
• He found that tall fathers tended to have shorter sons, whereas short fathers tended to have taller sons
• A phenomenon he called “regression toward the mean”
• Subsequently, statisticians embraced the term regression line to describe a linear relationship between two variables
Simple linear regression

• Gives the equation of the straight line that best describes the linear relationship between two numerical variables
• i.e. how one variable will behave as another variable changes
• Enables the prediction of one variable from another using the equation of the straight line
Types of variable in linear regression

• The dependent variable (y) is the variable to be predicted (i.e. usually the measured health outcome of interest)
• The independent variable (x) or explanatory variable is the variable used for predicting the dependent variable
• NB: In correlation it does not matter which variable is which, but in regression it matters
Research questions

• How does systolic blood pressure change as age increases?
• Can a subject’s diastolic blood pressure predict their systolic blood pressure?
• Can body fat be predicted from abdomen circumference measurements?
• In each of these research questions, which is the dependent variable?
Equation of a straight line

• The equation of a straight line is: y’ = a + bx
• y’ is the predicted value (of the dependent
variable y)
• a is the intercept
• b is the slope (or gradient) of the line
• x is the independent (explanatory) variable
Simple example
X Y
0 2
1 5
2 8
3 ?
4 14
5 17
6 ?

• Equation of line is y = 2 + 3x
Linear regression equation
• y’ = a + b * x
• y’ = intercept + ( slope * x )
• We want the residuals (distances between the
observations and the line) to be small
• Residual: the difference between the actual value of the dependent variable and the predicted value from the regression line: ε = (y − y’)
• A residual is calculated for each observation
• The values of a and b for the regression equation are calculated to minimise the sum of the squared residuals – called the least squares fit
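The least-squares idea can be sketched directly from its defining formulas, slope b = Cov(x, y)/Var(x) and intercept a = ȳ − b·x̄. A minimal sketch using the five known points from the simple example above (which lie exactly on a line, so the residuals vanish):

```python
import numpy as np

# Least-squares fit: slope b = Cov(x, y) / Var(x), intercept a = mean(y) - b * mean(x).
x = np.array([0, 1, 2, 4, 5], dtype=float)
y = np.array([2, 5, 8, 14, 17], dtype=float)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

residuals = y - (a + b * x)
print(a, b)                     # recovers the line y = 2 + 3x
print(np.sum(residuals ** 2))   # sum of squared residuals is 0 for exactly linear data
```

With real data the residuals are not zero; the least-squares fit is the line that makes their squared sum as small as possible.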
Regression coefficient (b)

• The slope, b, is often called the regression coefficient
• It has the same sign as the correlation coefficient
• When there is no correlation between x and y, then the regression coefficient, b, equals 0
Predicted value (y’)

• The predicted value, y’, is subject to sampling variation
• Its precision can be estimated (prediction error) by the standard error of the estimate
• The greater the standard error, the greater the dispersion of predicted y values around the regression line and hence the larger the prediction error
Statistical inference in regression
• Regression coefficients calculated from a sample
of observations are estimates of the population
regression coefficients
• Hypothesis tests and confidence intervals can be
constructed using sample estimates to make
inferences about population regression coefficients
• For valid use of these inferential approaches, it is
necessary to check the underlying assumptions of
the model (linearity, normality, constant variance) –
discussed later
Process for simple linear regression
• Check that there is a linear relationship (scatter
plot)
• Use SPSS to fit the simple linear regression model
to find the best straight line through the data
• Check R² to see the amount of variation in the dependent variable explained by the explanatory variable; for a good fit it should be close to 1
• Write down the regression equation
• Check the assumptions to ensure that the equation
can be used to make predictions
Example

• A fitness gym wishes to assess their clients’ body fat. An accurate method of measuring body fat is using an underwater weighing technique. This is not a practical method for the fitness instructors to carry out on the premises.
• They would like to be able to predict their clients’ body fat from other measurements, e.g. abdomen circumference
• 252 men had their body fat and abdomen circumference measured
Testing hypothesis

• H0: There is no linear relationship between body fat and abdomen circumference in the population
• H1: There is a linear relationship between body fat and abdomen circumference in the population
Or this can be rephrased as:
• H0: Abdomen circumference does not account for any variability in body fat in the population
• H1: Abdomen circumference does account for some of the variability in body fat in the population
Simple linear regression in SPSS

• Analyze
–Regression
–Linear
• The dependent variable is body fat
• The independent variable is abdomen
circumference
SPSS: linear regression
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       0.814   0.662      0.661               4.5144

• R is the correlation between the two variables = 0.814
• R Square (R x R) is the proportion of variability in body fat measurements that can be explained by differences in abdomen circumference = 0.662 or 66.2%
SPSS: linear regression
Anova

Model 1   Source of variation   Sum of squares   df    Mean square   F
          Regression            9984.086         1     9984.086      489.903
          Residual              5094.931         250   20.380
          Total                 15079.017        251

• A statistically significant (p < 0.001) proportion of the variability in body fat measurements can be attributed to the regression model (i.e. abdomen circumference)
SPSS: regression equation

Coefficients

Model      B         Std. Error   Standardized coeff. (Beta)   t         Sig.
Constant   -35.197   2.462                                     -14.294   0.000
Abdomen    0.585     0.026        0.814                        22.134    0.000

• Predicted body fat = constant + B x abdomen circum.
• Predicted body fat = -35.197 + 0.585 x abdomen circum.
Prediction
• How do you use linear regression for prediction?
• The regression equation allows you to predict the value
of the dependent variable (Y) for a particular value of the
independent variable (X)
• Predicted body fat = -35.197 + 0.585 abdomen circum
• What is the predicted body fat content for a man with an
abdomen circumference of 100cm?
• Predicted body fat = -35.197 + 0.585 x 100cm
= -35.197 + 58.5
• = 23.3%
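The prediction step above can be sketched in a few lines, using the intercept and slope from the body-fat example (the function name `predicted_body_fat` is just illustrative):

```python
# Prediction from the fitted equation: body fat = -35.197 + 0.585 * abdomen (cm).
INTERCEPT = -35.197
SLOPE = 0.585

def predicted_body_fat(abdomen_cm):
    """Predicted body fat (%) for a given abdomen circumference in cm."""
    return INTERCEPT + SLOPE * abdomen_cm

print(round(predicted_body_fat(100), 1))  # 23.3, matching the worked example
```

Note that predictions are only trustworthy within the range of abdomen circumferences observed in the sample.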
Assumptions of linear regression

• There should be a linear relationship between the dependent variable and the independent variable
• For any value of the independent variable the dependent variable values should follow a Normal distribution (i.e. normally distributed residuals)
• The variance of the dependent variable values should be the same for all independent variable values
Checking the assumptions
• After the regression model has been fitted to
the data it is essential to check that the
assumptions of linear regression have not
been violated
• If any of the assumptions have been violated
then the regression model is likely to be invalid
• INVALID ASSUMPTIONS MEAN THAT THE
PREDICTIONS BASED ON THIS MODEL
MAY BE POOR
Assumptions

• Plot the dependent variable against the independent variable
- A linear pattern (sausage shape) should be seen for the linearity assumption to hold
• Plot the residuals against the predicted values
- No curvature in the plot should be seen for the linearity assumption to hold
• Normally distributed residuals can be checked by looking at a histogram of the residuals or at a normal probability plot (Normal p-p plot)
• Constant variance of the residuals can be assessed by plotting the residuals against the predicted values
- There should be an even spread of residuals around zero
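As a sketch of how these residual checks work, the snippet below fits a line to synthetic data (the variable names and simulated values are illustrative, not from the lecture's dataset) and computes the residuals that would feed the plots:

```python
import numpy as np

# Synthetic example data: a linear signal plus Normal noise with constant variance.
rng = np.random.default_rng(0)
x = rng.uniform(70, 130, 100)                   # e.g. abdomen circumference (cm)
y = -35.0 + 0.6 * x + rng.normal(0, 4.5, 100)   # hypothetical linear relationship

# Least-squares fit, then residuals = observed - predicted.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)

# Least-squares residuals always average zero; the assumption checks are about
# their shape: plot residuals vs predicted values and look for no curvature
# and an even spread, and histogram the residuals to check Normality.
print(abs(residuals.mean()) < 1e-8)
```

In practice the residuals would come from your fitted model (e.g. saved from the SPSS output) rather than simulated data.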
Summary: simple linear regression
• Simple linear regression gives the equation of the
straight line that best describes the association
between two variables
– A linear relationship between the dependent
variable and the independent variable is required
– For any value of the independent variable the
dependent variable values should follow a Normal
distribution
– The variance of the dependent variable values
should be the same for all independent variable values
Multiple regression

• Extend the principles learnt today to multiple linear regression
• To explore the dependency of one outcome variable on two or more explanatory variables simultaneously
• To study the relationship between two variables after removing (adjusting for) the possible effects of other “nuisance” variables of less interest
Logistic regression

• A logistic regression model is simply a statistical model that describes the relationship between a qualitative dependent variable (e.g. presence or absence of disease) and independent variables (continuous and/or categorical)
• Continuous variables are not used as the dependent variable in logistic regression
• The logistic model uses the odds ratio to determine the effect a predictor variable has on the outcome
• An odds ratio is simply the ratio of 2 odds and is used extensively in medical studies as a measure of effect for categorical data
• Odds are usually expressed in terms of the probability of an event
• If the probability of an event is p, then odds = p/(1 − p)
• Similarly, odds can be converted to probability by p = odds/(1 + odds)
• As probability goes from 0 to 1, odds vary from 0 to infinity
Example

Renal dysfunction

Sex      Yes   No   Total
Male     16    14   30
Female   12    18   30
Total    28    32   60

• What is the odds of renal dysfunction?
• Compute the odds ratio
Solution
• P(renal dysfunction) = 28/60 = 0.47
• odds(renal dysfunction) = 0.47/(1 − 0.47) = 0.875
• odds(male) = (16/30)/(1 − 16/30) = 16/14 = 1.143
• odds(female) = (12/30)/(1 − 12/30) = 12/18 = 0.667
• Therefore odds ratio = odds(male)/odds(female) = 1.143/0.667 = 1.71
• i.e. the odds of renal dysfunction are 71% higher in males than in females
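The odds and odds-ratio arithmetic from the worked solution can be sketched as follows (the helper name `odds` is illustrative):

```python
# Odds and odds ratio from the 2x2 table above (renal dysfunction by sex).
def odds(p):
    """Convert a probability to odds: p / (1 - p)."""
    return p / (1 - p)

odds_male = odds(16 / 30)    # 16/14, about 1.143
odds_female = odds(12 / 30)  # 12/18, about 0.667
odds_ratio = odds_male / odds_female
print(round(odds_ratio, 2))  # 1.71, matching the solution
```

Fitting a logistic regression of renal dysfunction on sex would return the logarithm of this same odds ratio as the coefficient for sex.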
Correlation

• Measures the strength of linear association between two continuous/discrete variables
• Can be positive or negative
• Can vary between -1 and +1
• Does not imply causation (there may be some other factor that can explain the association)
Correlation coefficient

• The sample correlation coefficient is represented using r and calculated as
• r = Covariance between X and Y / (Standard deviation of X × Standard deviation of Y)
• This is called the Pearson correlation coefficient
Pearson correlation coefficient

• r = -1: Perfect negative linear relationship. As the value of X increases the value of Y decreases
• r = 0: No linear relationship between X and Y
• r = +1: Perfect positive linear relationship. As the value of X increases the value of Y increases
Hypothesis test for correlation coefficient
• It is possible to test whether a population
correlation coefficient differs significantly from
zero
• The significance of the correlation coefficient will
depend on the size of the correlation coefficient
and the number of observations in the sample
• The validity of this test requires that the variables
are observed on a random sample of
individuals and at least one of the variables
follows a normal distribution
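A common form of this test compares t = r·√(n − 2)/√(1 − r²) against a t distribution with n − 2 degrees of freedom. A minimal sketch on illustrative data (the values below are made up for demonstration):

```python
import numpy as np

# Illustrative paired observations with a strong positive linear relationship.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# Pearson's r, then the test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2),
# referred to a t distribution on n - 2 degrees of freedom.
r = np.corrcoef(x, y)[0, 1]
n = len(x)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
print(round(r, 3), round(t, 1))
```

As the slide notes, the resulting p-value depends on both the size of r and the number of observations: a modest r can be significant in a large sample, while a large r may not be in a small one.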
