Regression Analysis
What is Correlational Research
Correlational study:
When we use correlational designs we can't look for cause-effect relationships because we haven't manipulated any of the variables
Way of predicting the value of one variable from another
(often using multiple questionnaires)
Allows us to estimate beyond the data we possess
Model is linear so we summarise data by using a straight line
The linear model
The only equation we ever really need is this one:
outcome_i = (model) + error_i
We also saw that we often fit a linear model, which in its
simplest form can be written as:
Y_i = b_0 + b_1 X_i + ε_i    (Eq. 1)
The fundamental idea is that an outcome for an entity can
be predicted from a model and some error associated with
that prediction.
Y_i: outcome variable
X_i: predictor variable
b_1: a parameter associated with the predictor variable that quantifies the relationship it has with the outcome variable
b_0: a parameter that tells us the value of the outcome when the predictor is zero
A linear model: Y_i = b_0 + b_1 X_i + ε_i
b1
Parameter for the predictor
Gradient (slope) of the line
Direction/Strength of Relationship/Effect
b0
The value of outcome when predictor(s) = 0
(intercept)
Linear Models: Straight Line
Any straight line can be defined by two things:
(1) Slope: the slope (or gradient) of the line (usually denoted by b1); and
(2) Intercept: the point at which the line crosses the vertical
axis of the graph (known as the intercept of the line, b0).
These parameters b1 and b0 are known as the regression
coefficients.
Regression coefficients
Slope (or gradient), b1: the steepness and direction of the line
Intercept, b0: where the line crosses the vertical (y) axis
Same b0, different b1
[Figure: three lines of Sales vs Budget ($) sharing the same intercept but with positive, zero ("none") and negative slopes.]
the gradient (b1) tells us what the model looks like (its shape) and the
intercept (b0) tells us where the model is (its location in geometric space).
Straight Lines
Y_i = b_0 + b_1 X_i + ε_i
Y_i: outcome variable
b_0: intercept (the point where the line crosses the y axis)
b_1: slope (direction/strength of the relationship)
X_i: ith participant's score on the predictor variable
ε_i: error
Example – album sales
Predict number of albums you would sell from how much
you spend on advertising
Example – album sales
If we spend nothing on advertising, we sell 50 albums (b0)
What if you spend £5 on advertising?
Sales = 50 + 100*5 = 550 albums
This value of 550 album sales is known as a predicted value.
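This prediction arithmetic can be sketched in Python (a hypothetical helper for illustration; the handout's analyses are run in SPSS):

```python
# Toy coefficients from the slide's example: b0 = 50, b1 = 100.
def predict_sales(budget, b0=50.0, b1=100.0):
    """Predicted value from the linear model: sales = b0 + b1 * budget."""
    return b0 + b1 * budget

print(predict_sales(5))  # 550.0 albums for a £5 advertising spend
```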
The linear model with several
predictors
Y_i = b_0 + b_1 X_1i + b_2 X_2i + ε_i    (Eq. 4)
album sales_i = b_0 + b_1 advertising budget_i + b_2 airplay_i + ε_i
Fitting a line to the data
Simplest Model: the mean
Without other data, the best guess of the outcome (Y) is
always the mean
Ordinary Least Squares (OLS) regression:
Fits a line of best fit to the data
Estimates the constant (b0) and parameters of each
predictor (b for each X)
SPSS finds the values of the parameters that have the least
amount of error
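As a sketch of what OLS line-fitting does, here is the same idea in Python with NumPy on made-up data (the data values are assumptions for illustration, not those in Album_sales.sav):

```python
import numpy as np

# Made-up budget/sales pairs, for illustration only.
budget = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sales = np.array([2.0, 3.5, 4.0, 5.5, 7.0])

# np.polyfit with degree 1 is ordinary least squares for a straight line:
# it returns the slope (b1) and intercept (b0) that minimise the squared residuals.
b1, b0 = np.polyfit(budget, sales, 1)
print(b0, b1)
```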
Total Sum of Squares, SST
[Figure: Sales vs Budget ($) scatterplot with the deviation of each score from the mean highlighted.]
SST
Total variability (variability between scores and the mean)
Residual Sum of Squares, SSR
SSR
Residual/Error variability (variability between the regression model and the
actual data)
[Figure: Sales vs Budget ($) scatterplot with the deviation of each score from the regression line highlighted.]
Model Sum of Squares, SSM
SSM
Model variability (difference in variability between the model and the mean)
Testing the Fit of the Model
We need to see whether the model is a reasonable ‘fit’ of the
actual data.
SST
Total variability (variability between scores and the mean)
SSR
Residual/Error variability (variability between the regression
model and the actual data)
SSM
Model variability (difference in variability between the model
and the mean)
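The three sums of squares can be computed directly; this sketch uses made-up data and an OLS-fitted line (all values are illustrative assumptions):

```python
import numpy as np

# Made-up data and an OLS-fitted line (b0 = 0.8, b1 = 1.2), for illustration.
budget = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sales = np.array([2.0, 3.5, 4.0, 5.5, 7.0])
predicted = 0.8 + 1.2 * budget

sst = np.sum((sales - sales.mean()) ** 2)      # total variability
ssr = np.sum((sales - predicted) ** 2)         # residual variability
ssm = np.sum((predicted - sales.mean()) ** 2)  # model variability
print(sst, ssm, ssr)  # for an OLS fit, SST = SSM + SSR
```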
Testing the Model: ANOVA
SST (total variance) = SSM (improvement due to the model) + SSR (error in the model)
Testing the Model: ANOVA
If the model results in better prediction than using
the mean, then SSM should be greater than SSR
Mean Squared Error
Sums of squares are totals, so we use mean squares (each sum of squares divided by its degrees of freedom) instead:
F = MS_M / MS_R
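A sketch of the F-ratio with assumed toy values (SSM = 14.4 with 1 predictor; SSR = 0.3 with n − 2 = 3 residual degrees of freedom for n = 5 cases):

```python
# Assumed toy sums of squares and degrees of freedom.
ss_m, df_m = 14.4, 1   # model: one predictor
ss_r, df_r = 0.3, 3    # residual: n - 2 for simple regression (n = 5)

ms_m = ss_m / df_m     # mean square for the model
ms_r = ss_r / df_r     # mean square for the residuals
f = ms_m / ms_r
print(round(f, 2))  # 144.0
```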
Testing the Model: R2
R2
The proportion of variance accounted for by the regression
model.
The squared Pearson correlation between the observed and predicted scores
Adjusted R2
An estimate of R2 in the population (shrinkage)
R² = SSM / SST
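Both quantities can be sketched with assumed toy values (n = 5 cases, k = 1 predictor):

```python
# Assumed toy sums of squares.
ss_t, ss_m = 14.7, 14.4
n, k = 5, 1  # sample size and number of predictors

r2 = ss_m / ss_t
# Adjusted R² shrinks R² to estimate the population value;
# the gap widens as predictors are added relative to sample size.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(r2, 3), round(adj_r2, 3))
```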
Summary
We can fit linear models predicting an outcome from one or
more predictors
Parameter estimates (b)
Tell us about the shape of the model
Tell us about size and direction of relationship between predictor
and outcome
Can significance test
CI tells us about population value
Use bootstrapping if assumptions are in doubt
Model Fit
ANOVA
R2
Running the analysis
FILE: Album_sales.sav
What are our IV and DV?
How many participants/data points are there?
What kind of variables do we have? (nominal, interval or scale)
Does the scatterplot (on p7) show a positive or negative relationship between the two variables?
Running the analysis
Analyse → Regression → Linear…
Predictor (IV) goes in “Independent(s)”
Outcome (DV) goes in “Dependent”
Running the analysis
Click on “Bootstrap…”
Re-runs the analysis on resampled versions of your data (1000 samples by default)
Check “Perform Bootstrapping…”
and choose BCa
“Continue” then “OK” to run
Interpretation:
Simple Regression
Navigating the output
Model Summary: how useful is our model?
ANOVA: is our model better than the mean?
Coefficients: What are the numbers?
Bootstrap for coefficients
Model summary
First, is this model better than using the mean?
For simple regression, R = correlation coefficient
Compare errors (differences between predicted and
observed values) for both the mean model and the
regression model
amount of variance explained by the model vs the mean (R2)
Expressed as a percentage
R: values range from –1 to 1, so this is a large positive correlation
R²: how much of the variability in the outcome is accounted for by the predictors. Here, the predictor accounts for 33.5% of the variance in the outcome (.335 × 100)
Adjusted R²: gives us some idea of how our model generalizes, and is ideally very close to our value for R²
ANOVA
F-ratio measures how well the model predicts the outcome
(MSM) compared to error in the model (MSR)
Tells us if using our model is significantly better than using
the mean alone
F(1, 198) = 99.59, p < .001
Coefficients
Assess individual predictors using t-tests
H0: our value of b1 is zero
The t-test should therefore be significant if the predictor is genuinely related to the outcome
If b1 = 0, the outcome would be unchanged by that predictor variable
Examines whether our value of b is big compared to its error
b0: intercept; b1: slope. Is budget a significant predictor?
T-test: Are our variables significant predictors of our outcome?
In this case, the t-test tells us the same thing as the ANOVA, because there is only one predictor
We can also use this table to form our equation
Intercept (b0): if no money is spent on advertising how
many albums will be sold? (units are in 1,000s)
134,140 albums sold when advertising is 0 (134.14 × 1000)
Coefficient (b1): if we increase our predictor by 1 unit
(£1000), how many more albums will we sell?
96 additional albums sold for each £1,000 of advertising
budget spent (0.096 * 1000)=96
Bias
We need to meet four assumptions:
Linearity: the relationship to model is actually linear
Additivity: the outcome can be predicted by adding together all
predictors
Normality: residuals to be normally distributed for optimal b
estimates, normal sampling distribution for accurate CI and
statistical tests
Homoscedasticity: the variance of the residuals is constant across values of the predictor(s)
If we meet these assumptions, we can trust our estimates of b and their associated confidence intervals and significance tests
If not, then we can bootstrap to compute robust parameters and
confidence intervals instead
The bootstrap CI: the population value for b is likely to fall
between .08 and .11
The boundaries do not include zero, so there is a genuine positive relationship between advertising budget and album sales
If it contained 0, the true value might be 0 [i.e. no effect] or a
negative number [the opposite of our sample]
The p value associated with the confidence interval is also
highly significant (p=.001)
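A rough sketch of the bootstrap idea in Python with made-up data (this is a simple percentile bootstrap; SPSS's BCa interval additionally corrects for bias and skew):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up predictor/outcome data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.2, 2.1, 2.8, 4.2, 4.9, 6.3, 6.8, 8.1])

# Refit the model on 1000 resamples (cases drawn with replacement)
# and collect the slope each time.
slopes = []
for _ in range(1000):
    idx = rng.integers(0, len(x), size=len(x))
    b1, b0 = np.polyfit(x[idx], y[idx], 1)
    slopes.append(b1)

lo, hi = np.percentile(slopes, [2.5, 97.5])  # 95% bootstrap CI for the slope
print(lo, hi)
```

If the interval excludes zero, as it does here, we conclude the slope is genuinely positive.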
Album Sales: More Predictors
Analyse → Regression → Linear…
Add a second block
for new predictors
Using the Model
If a company wanted to spend £100,000 on advertising,
how many albums would we predict they would sell?
Hint: units are in 1,000s!
Sales = 134.14 + 0.096 × 100 (the units are 1,000s, so £100,000 = 100 units)
Sales = 143.74 (i.e. 143.74 thousand albums)
Make a prediction: approximately 143,740 albums would be sold if the company spent £100,000 on advertising
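A quick check of this calculation (coefficients taken from the simple regression output; the unit handling is the key point):

```python
# Coefficients from the simple regression output (both sales and budget
# are measured in units of 1,000).
b0, b1 = 134.14, 0.096

budget_units = 100_000 / 1_000        # £100,000 expressed in £1,000 units
sales_units = b0 + b1 * budget_units  # predicted sales, in 1,000s of albums
albums = sales_units * 1_000
print(round(albums))  # 143740
```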
Album Sales: More Predictors
Advertising only accounted for 33.5% of variance in albums
sales, leaving 66.5% variance unaccounted for
The data file includes 2 additional predictors:
Amount of airplay the band receives on the radio
The attractiveness ratings of the band
Add these to the model to see if the model improves
Interpretation:
Multiple Regression
In this output we have 2 models:
1. Only advertising
2. All 3 predictors
R: the multiple correlation coefficient between the predictors and the outcome. In a hierarchical regression, this can change with the addition of new variables into the model.
R²: how much of the variability in the outcome is accounted for by the predictors. Advertising accounts for 33.5%, while attractiveness and airplay account for an extra 33%.
Adjusted R²: how our model generalizes.
Model 1: F(1, 198) = 99.59, p < .001; Model 2: F(3, 196) = 129.50, p < .001
Both models significantly improved our ability to predict the outcome variable
compared to not fitting the model (using the mean model)
Assess the contribution of each predictor using t-tests
Advertising budget: t(196)= 12.26, p<.001
Did the other predictors contribute significantly to the model?
No. of radio plays: t(196) = 12.12, p < .001
Attractiveness of band: t(196)= 4.55, p<.001
Remember: significance tests are only reliable if we have met our
assumptions!
Advertising budget: (b1= 0.09)
As advertising budget increases by 1 unit (£1000), album sales
increase by 0.09 units
Airplay: (b2= 3.37)
As the number of plays per week on Radio 1 increases by 1 unit (1 play), album sales increase by 3.37 units
Attractiveness: (b3= 11.09)
As attractiveness rating of band increases by 1 unit album sales
increase by 11.09 units
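These per-unit interpretations can be sketched as a small helper (slopes taken from the slides; the intercept is not reported here, so the helper only gives the change in predicted sales, not the total):

```python
# Slopes from the three-predictor model (sales in units of 1,000).
b_advert, b_airplay, b_attract = 0.09, 3.37, 11.09

def sales_change(d_advert=0.0, d_airplay=0.0, d_attract=0.0):
    """Change in predicted album sales (1,000s) for given changes in the
    predictors, holding the others constant."""
    return b_advert * d_advert + b_airplay * d_airplay + b_attract * d_attract

print(sales_change(d_airplay=1))  # 3.37 -- one extra radio play per week
```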
If assumptions are not met use bootstrap CIs
Advertising: (b=0.09) [0.07, 0.10], p=.001
Number of radio plays (b=3.37) [2.80, 3.99], p=.001
Attractiveness of band (b=11.09) [6.25, 15.10], p=.001
Bootstrap CIs do not cross zero
Can conclude confidently that bs are positive (do contribute)
Tasks!
Please complete all tasks (1 – 4) at the end of the handout
Write out all tests in APA style
Complete all calculations
We will give answers at the end of the session
Task 1
Correlation of .35 between suicide and listening to heavy metal
The model explains 12.5% of variance in suicidal tendencies (.125 x 100)
The model is significantly better than the mean model at predicting
suicide rates
F(1, 2135) = 304.78, p < .001, with heavy metal listening predicting suicide
risk, t(2135) = -17.46, p <.001, [-0.70, -0.53]
There is a negative relationship between listening to heavy metal and
suicide risk
As listening increases, suicide risk decreases
Suicide risk = 16.04 + (-0.61* heavy metal listening)
As listening increases by 1 unit, suicide risk decreases by 0.61 units
Task 2
Correlation of .08 between tea drinking and cognitive function
The model explains 0.6% of variance in cognitive functioning (.006 x
100)
Drinking tea significantly predicts cognitive function
F(1, 714) = 4.33, p = .038
Positive relationship between drinking tea and cognitive scores
As tea drinking increases, so does cognitive function
t(714) = 2.08, p = .038
Cognitive function = 49.22 + (0.46 x 10)
Score after 10 cups of tea = 53.82
Task 3
Correlation of .81 between mortality and number of pubs
The model explains 64.9% of variance in mortality (.649 x 100)
Number of pubs significantly predicts mortality
F(1, 6) = 11.12, p = .016
Positive relationship between number of pubs and number of
deaths
As pubs increase, so does mortality, t(6) = 3.33, p = .016
Mortality = 3351.96 + (14.34*pubs)
Task 4
The model explains 69.1% of variance in dishonesty ratings
(.691 x 100)
The model is significantly better than using the mean to predict
dishonesty ratings
F(1, 98) = 219.10, p< .001
There is a negative relationship between rating of likeability and
ratings of dishonesty
t(98) = 14.80, p < .001
The scales are written so as likeability increases DIShonesty decreases
(honesty increases)
Dishonesty = -1.86 + (0.94* likeableness)