Measures of relationship
Presentation outline
• Learning objectives
• Introduction
• Regression Analysis
- Simple linear regression
- Multiple linear regression
- Logistic regression
• Correlation Analysis
- Curvilinear Relationship
- Coefficient of Determination
Learning objectives
After this topic, you should be able to:
• Distinguish between the basic purposes of regression
analysis and correlation analysis
• Compute and interpret a regression equation
• Interpret the coefficient of simple linear regression
• Know how to extend simple linear regression to consider
multiple risk variables
• Know how to apply some of the ideas from linear regression
when the outcome of interest is a binary outcome
• Know how and when to apply logistic regression and how to
interpret a relationship represented by an odds ratio
Introduction
• Statistical analysis is
concerned not only
with summarizing data but also with
investigating relationships
• Some of our most intriguing scientific
questions deal with the relationship between
two variables
• Does a relationship exist between use of
oral contraceptives and the incidence of disease?
• What is the relationship of a mother’s weight to
her baby’s birth weight?
• These are typical of countless questions we
pose in seeking to understand the relationship
between two variables
• There is an all-too-human tendency to attribute
a cause-and-effect relationship to variables
that might be related
• We discuss methods of measuring the
relationships in bivariate data, in order to:
- Determine the strength of the relationships, and
- Make inferences about the population from which
the sample was drawn
Regression Analysis
• Sir Francis Galton coined the term regression
during his study of heredity laws
• He observed that physical characteristics of
children were correlated with those of their
fathers
• He found that the sons of tall fathers tended to be
shorter than their fathers, whereas the sons of short
fathers tended to be taller than theirs
• A phenomenon he called “regression toward the
mean”
• Subsequently, statisticians embraced the term
regression line to describe a linear relationship
between two variables
Simple linear regression
• Gives the equation of the straight line that best
describes the linear relationship between two
numerical variables
• i.e. how one variable will behave as another
variable changes
• Enables the prediction of one variable using
another variable using the equation of the
straight line
Types of variable in linear regression
• The dependent variable (y) is the variable to
be predicted (i.e. usually the measured health
outcome of interest)
• The independent variable (x) or explanatory
variable is the variable used for predicting the
dependent variable
• NB: In correlation it does not matter which
variable is which, but in regression it matters
Research questions
• How does systolic blood pressure change as
age increases?
• Can a subject’s diastolic blood pressure
predict their systolic blood pressure?
• Can body fat be predicted from abdomen
circumference measurements?
• In each of these research questions which is
the dependent variable?
Equation of a straight line
• The equation of a straight line is
• y’ = a + bx
• y’ is the predicted value (of the dependent
variable y)
• a is the intercept
• b is the slope (or gradient) of the line
• x is the independent (explanatory) variable
Simple example
X Y
0 2
1 5
2 8
3 ?
4 14
5 17
6 ?
• Equation of line is y = 2 + 3x
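As a quick check, the line can be evaluated in code; the sketch below (Python) fills in the two missing values using the stated equation y = 2 + 3x.

```python
# Evaluate the example line y = 2 + 3x at each x in the table.
def predict(x):
    return 2 + 3 * x

for x in range(7):
    print(x, predict(x))
# x = 3 gives y = 11 and x = 6 gives y = 20, filling in the "?" rows.
```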
Linear regression equation
• y’ = a + b * x
• y’ = intercept + ( slope * x )
• We want the residuals (distances between the
observations and the line) to be small
• Residual: the difference between the actual value of the
dependent variable and the predicted value from the
regression line: ε = (y − y’)
• A residual is calculated for each observation
• The values of a and b for the regression equation are
calculated to minimise the sum of the squared residuals
–called the least squares fit
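A minimal sketch of what the least squares fit computes, using the closed-form estimates b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄; the data here are illustrative.

```python
import numpy as np

# Illustrative data; in practice x and y come from your sample.
x = np.array([0, 1, 2, 4, 5], dtype=float)
y = np.array([2, 5, 8, 14, 17], dtype=float)

# Least squares estimates: these minimise the sum of squared residuals.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

y_pred = a + b * x        # predicted values y'
residuals = y - y_pred    # residual = actual - predicted
print(f"a = {a:.3f}, b = {b:.3f}")  # a = 2.000, b = 3.000 for this data
```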
Regression coefficient (b)
• The slope, b, is often called the regression
coefficient
• It has the same sign as the correlation
coefficient
• When there is no correlation between x and y,
then the regression coefficient, b, equals 0
Predicted value (y’)
• The predicted value, y’, is subject to sampling
variation
• Its precision can be estimated (prediction
error) by the standard error of the estimate
• The greater the standard error, the greater the
dispersion of predicted y values around the
regression line and hence the larger the
prediction error
Statistical inference in regression
• Regression coefficients calculated from a sample
of observations are estimates of the population
regression coefficients
• Hypothesis tests and confidence intervals can be
constructed using sample estimates to make
inferences about population regression coefficients
• For valid use of these inferential approaches, it is
necessary to check the underlying assumptions of
the model (linearity, normality, constant variance) –
discussed later
Process for simple linear regression
• Check that there is a linear relationship (scatter
plot)
• Use SPSS to fit the simple linear regression model
to find the best straight line through the data
• Check R² to see the amount of variation in the
dependent variable explained by the explanatory
variable (for a good fit, R² should be close to 1)
• Write down the regression equation
• Check the assumptions to ensure that the equation
can be used to make predictions
Example: predicting body fat
• A fitness gym wishes to assess its clients’ body
fat. An accurate method of measuring body fat is
an underwater weighing technique, but this is
not a practical method for the fitness instructors to
carry out on the premises.
• They would like to be able to predict their clients’
body fat from other measurements, e.g. abdomen
circumference
• 252 men had their body fat and abdomen
circumference measured
Testing hypotheses
• H0: There is no linear relationship between body fat
and abdomen circumference in the population
• H1: There is a linear relationship between body fat
and abdomen circumference in the population
Or this can be rephrased as
• H0: Abdomen circumference does not account for
any variability in body fat in the population
• H1: Abdomen circumference does account for some
of the variability in body fat in the population
Simple linear regression in SPSS
• Analyze
–Regression
–Linear
• The dependent variable is body fat
• The independent variable is abdomen
circumference
SPSS: linear regression
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       0.814   0.662      0.661               4.5144
• R is the correlation between the two variables
= 0.814
• R square (R x R) is the proportion of variability
in body fat measurements that can be
explained by differences in abdomen
circumference = 0.662 or 66.2%
SPSS: linear regression
ANOVA (Model 1)
Source of variation   Sum of squares    df    Mean square   F
Regression                 9984.086      1       9984.086   489.903
Residual                   5094.931    250         20.380
Total                     15079.017    251
• A statistically significant (p<0.001) proportion
of the variability in body fat measurements can
be attributed to the regression model (i.e.
abdomen)
SPSS: regression equation
Coefficients (Model 1)
            B         Std. Error   Standardized coeff. (Beta)   t         Sig.
Constant    -35.197   2.462                                     -14.294   0.000
Abdomen     0.585     0.026        0.814                        22.134    0.000
• Predicted body fat = constant + B x abdomen circum.
• Predicted body fat = -35.197 + 0.585 x abdomen
circum.
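For readers working outside SPSS, the same model can be fitted in Python with statsmodels; the arrays below are placeholders rather than the real 252 measurements, so the printed coefficients would match the SPSS tables only with the actual data.

```python
import numpy as np
import statsmodels.api as sm

# Placeholder data: substitute the 252 body fat (%) and abdomen (cm) values.
abdomen = np.array([85.2, 83.0, 87.9, 86.4, 100.0])  # illustrative only
bodyfat = np.array([12.3, 6.1, 25.3, 10.4, 28.7])    # illustrative only

X = sm.add_constant(abdomen)      # adds the intercept term (constant)
model = sm.OLS(bodyfat, X).fit()  # least squares fit
print(model.summary())            # R-squared, ANOVA F, coefficients, p-values
```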
Prediction
• How do you use linear regression for prediction?
• The regression equation allows you to predict the value
of the dependent variable (Y) for a particular value of the
independent variable (X)
• Predicted body fat = -35.197 + 0.585 abdomen circum
• What is the predicted body fat content for a man with an
abdomen circumference of 100cm?
• Predicted body fat = -35.197 + 0.585 x 100cm
= -35.197 + 58.5
• = 23.3%
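The prediction is just the fitted equation evaluated at x = 100; a one-line check in Python:

```python
# Predicted body fat for an abdomen circumference of 100 cm,
# using the regression equation from the SPSS output above.
predicted = -35.197 + 0.585 * 100
print(f"{predicted:.1f}%")  # 23.3%
```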
Assumptions of linear regression
• There should be a linear relationship between the
dependent variable and the independent variable
• For any value of the independent variable the
dependent variable values should follow a Normal
distribution (i.e. normally distributed residuals)
• The variance of the dependent variable values
should be the same for all independent variable
values
Checking the assumptions
• After the regression model has been fitted to
the data it is essential to check that the
assumptions of linear regression have not
been violated
• If any of the assumptions have been violated
then the regression model is likely to be invalid
• INVALID ASSUMPTIONS MEAN THAT THE
PREDICTIONS BASED ON THIS MODEL
MAY BE POOR
Assumptions
• Plot the dependent variable against the
independent variable
- A linear (“sausage”-shaped) pattern should be seen
if the linearity assumption is to hold
• Plot the residuals against the predicted
values
- No curvature in the plot should be seen for the
linearity assumption to hold
• Normally distributed residuals can be tested
by looking at a histogram of the residuals
• Normally distributed residuals can be tested
by looking at a normal probability plot (Normal
p-p plot)
• Constant variance of the residuals can be
assessed by plotting the residuals against the
predicted values
-There should be an even spread of residuals
around zero
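A sketch of these three diagnostic plots in Python (matplotlib/scipy), using simulated data in place of a real sample:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

# Simulated data standing in for a real sample.
rng = np.random.default_rng(0)
x = rng.uniform(70, 110, 100)                # e.g. abdomen circumference
y = -35 + 0.6 * x + rng.normal(0, 4.5, 100)  # e.g. body fat plus noise

model = sm.OLS(y, sm.add_constant(x)).fit()
fitted, residuals = model.fittedvalues, model.resid

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# 1. Residuals vs predicted: no curvature, even spread around zero.
axes[0].scatter(fitted, residuals)
axes[0].axhline(0, linestyle="--")
axes[0].set(xlabel="Predicted values", ylabel="Residuals")

# 2. Histogram of residuals: should look roughly Normal.
axes[1].hist(residuals, bins=15)
axes[1].set(xlabel="Residuals")

# 3. Normal probability plot: points should lie close to the line.
stats.probplot(residuals, plot=axes[2])

plt.tight_layout()
plt.show()
```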
Summary: simple linear regression
• Simple linear regression gives the equation of the
straight line that best describes the association
between two variables
– A linear relationship between the dependent
variable and the independent variable is required
– For any value of the independent variable the
dependent variable values should follow a Normal
distribution
– The variance of the dependent variable values
should be the same for all independent variable values
Multiple regression
• Extend the principles learnt today to multiple
linear regression
• To explore the dependency of one outcome
variable on two or more explanatory variables
simultaneously
• To study the relationship between two
variables after removing (adjusting for) the
possible effects of other “nuisance” variables
of less interest
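A minimal sketch of the extension in statsmodels, assuming a hypothetical second predictor (age) alongside abdomen circumference; the abdomen coefficient is then adjusted for age.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: one outcome and two explanatory variables.
abdomen = np.array([85.2, 83.0, 87.9, 86.4, 100.0, 92.1])
age     = np.array([23.0, 22.0, 35.0, 46.0, 51.0, 33.0])
bodyfat = np.array([12.3, 6.1, 25.3, 10.4, 28.7, 19.5])

# Stack the predictors column-wise and add an intercept.
X = sm.add_constant(np.column_stack([abdomen, age]))
model = sm.OLS(bodyfat, X).fit()
print(model.params)  # intercept, abdomen slope (adjusted for age), age slope
```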
Logistic regression
• A logistic regression model is simply a statistical
model that describes the relationship between a
qualitative dependent variable (e.g. presence or
absence of disease) and independent variables
(continuous and/or categorical)
• Continuous variables are not used as the dependent
variable in logistic regression
• The logistic model uses the odds ratio to determine
the effect a predictor variable has on the outcome
• An odds ratio is simply the ratio of 2 odds and is
used extensively in medical studies as a measure
of effect for categorical data
• Odds are usually expressed in terms of the probability
of an event
• If the probability of an event is p, then
• odds = p / (1 − p)
• Similarly, odds can be converted to a probability by
• p = odds / (1 + odds)
• As the probability goes from 0 to 1, the odds vary from
0 to infinity
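The two conversions are easy to express as functions; a small sketch:

```python
def odds_from_prob(p):
    """odds = p / (1 - p); grows without bound as p approaches 1."""
    return p / (1 - p)

def prob_from_odds(odds):
    """p = odds / (1 + odds); the inverse conversion."""
    return odds / (1 + odds)

print(odds_from_prob(0.5))  # 1.0 (even odds)
print(prob_from_odds(1.0))  # 0.5
```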
Example
               Renal dysfunction
Sex        Yes    No    Total
Male        16    14       30
Female      12    18       30
Total       28    32       60
• What are the odds of renal dysfunction?
• Compute the odds ratio
Solution
• P(renal dysfunction) = 28/60 = 0.467
• odds(renal dysfunction) = 0.467 / (1 − 0.467) = 28/32 = 0.875
• odds(male) = (16/30) / (1 − 16/30) = 16/14 = 1.143
• odds(female) = (12/30) / (1 − 12/30) = 12/18 = 0.667
• Therefore odds ratio = odds(male) / odds(female)
= 1.143/0.667 = 1.71
• i.e. the odds of renal dysfunction are 71% higher
in males than in females
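The worked example can be verified in code, and the same odds ratio recovered from a logistic regression with sex as the single (binary) predictor; exp(slope) reproduces the OR.

```python
import numpy as np
import statsmodels.api as sm

# 2x2 table counts: rows = sex, columns = renal dysfunction (yes/no).
male_yes, male_no = 16, 14
female_yes, female_no = 12, 18

# Odds ratio directly from the table.
odds_male = male_yes / male_no            # 16/14 = 1.143
odds_female = female_yes / female_no      # 12/18 = 0.667
print(round(odds_male / odds_female, 2))  # 1.71

# Same OR via logistic regression (male = 1, female = 0).
sex = np.array([1] * 30 + [0] * 30)
renal = np.array([1] * 16 + [0] * 14 + [1] * 12 + [0] * 18)
fit = sm.Logit(renal, sm.add_constant(sex)).fit(disp=0)
print(round(np.exp(fit.params[1]), 2))    # exp(slope) = 1.71
```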
Correlation
• Measures the strength of linear association
between two numerical (continuous or discrete) variables
• Can be positive or negative
• Can vary between -1 and +1
• Does not imply causation (there may be some
other factor that can explain the association)
Correlation coefficient
• The sample correlation coefficient is
represented using r and calculated as
• r = covariance between X and Y / (standard deviation
of X × standard deviation of Y)
• This is called the Pearson correlation coefficient
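A sketch of the calculation on illustrative data, checking the hand formula against numpy's built-in:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # illustrative data

# r = covariance(X, Y) / (SD of X * SD of Y)
r = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(round(r, 4))
print(round(np.corrcoef(x, y)[0, 1], 4))  # same value from numpy directly
```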
Pearson correlation coefficient
• r = −1: perfect negative linear relationship; as the
value of X increases, the value of Y decreases
• r = 0: no linear relationship between X and Y
• r = +1: perfect positive linear relationship; as the
value of X increases, the value of Y increases
Hypothesis test for correlation coefficient
• It is possible to test whether a population
correlation coefficient differs significantly from
zero
• The significance of the correlation coefficient will
depend on the size of the correlation coefficient
and the number of observations in the sample
• The validity of this test requires that the variables
are observed on a random sample of
individuals and at least one of the variables
follows a normal distribution
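In practice this test is a single call in scipy; the data below are illustrative:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.9])  # illustrative data

# Tests H0: population correlation = 0, assuming a random sample
# and (approximate) normality of at least one variable.
r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p_value:.4f}")
```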