Statistics Correlation Analysis
Statistics (EDM 503) 2nd sem 2013-2014: Correlation and Regression Analysis
Mr. Darwin P. Paguio
CORRELATION ANALYSIS
Correlation is the degree of relationship between variables. Correlation analysis seeks to determine how well a linear or other equation describes or
explains that relationship. It also implies “association” between two variables.
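$$ r = \frac{N\sum xy - \sum x \sum y}{\sqrt{\left[N\sum x^2 - \left(\sum x\right)^2\right]\left[N\sum y^2 - \left(\sum y\right)^2\right]}} $$
(Note: For EDM 503 purposes, solutions will be computer-aided.)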
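As a rough illustration of the raw-score formula for r, the following Python sketch computes each sum separately and then combines them. The function name `pearson_r` and the toy data are assumptions made for this example; they do not come from the lecture, which relies on SPSS.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r from the raw-score (computational) formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical toy data, not taken from the handout.
print(pearson_r([1, 2, 3, 4], [1, 3, 2, 5]))
```

If either variable is constant, the denominator is zero and the function raises `ZeroDivisionError`, which matches the fact that r is undefined in that case.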
The Pearson correlation coefficient, r, can take a range of values from +1 to -1. A value of 0 indicates that there is no
linear association between the two variables. This is shown in Figure 1.
Figure 1
Ŷ = a + bx, where
Ŷ = predicted value of y
a = y-intercept
b = slope of the regression line
x = the value of the predictor from which y is predicted
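The handout leaves the computation of a and b to SPSS. As a minimal sketch, assuming the usual least-squares formulas (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x), the line can be fitted and used for prediction as follows. The names `fit_line` and `predict` and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope of the regression line
    a = mean_y - b * mean_x    # y-intercept
    return a, b

def predict(a, b, x):
    """Predicted value of y for a given x."""
    return a + b * x

# Hypothetical data lying exactly on y = 1 + 2x.
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b, predict(a, b, 10))
```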
Reference: Pangilinan, Diana B. PhD Data Analysis Lecture, University of the Assumption, 2011
Gordon College
1. Conduct correlation analysis only when two variables are “suspected” to have a relationship, or there
is sound theory that may back up the hypothesis. Your target is to determine whether or not there is
sufficient evidence that your suspicion is true.
2. No attempt whatsoever at correlation analysis should be made between two variables whose
hypothesized relationship has no sound theoretical underpinnings, e.g., the relationship between your
IQ and the size of your foot, or between Philippine gross domestic product & the IQ of the Philippine president
(or is there???).
3. In a graduate thesis, only relationships between or among variables that have not been explored in
any part of the world, as far as we know, may be considered. Otherwise, the researcher may just be
duplicating what already exists. The researcher is, therefore, challenged to be innovative and
creative towards exploring “unexplored” but theoretically sound suspected relationships
between/among variables. This would mean the introduction of new or more variables into the
conceptual framework.
The following illustrations present (1) SPSS data layouts of pairs of predictor and criterion variables, both in the interval/ratio
scale, (2) the respective resulting scatter plots, & (3) their corresponding interpretations.
(SPSS: click GRAPHS, click SCATTER, click SIMPLE, click DEFINE, enter the predictor in the X-axis, enter the criterion variable in the Y-axis, click OK)
----------------------------------------------------------------------------------- xxxxxxxxxxxxxx------------------------------------------------------------------------------------------
Illustration 1:
SPSS Data Layout:
x       y
1.00    2.00
2.00    4.00
3.00    6.00
4.00    8.00
5.00    10.00
6.00    12.00
7.00    14.00
8.00    16.00
Resulting scatter plot: [figure not reproduced; horizontal axis X]
By inspection of the data, you will notice that “y (criterion) doubles given the value of x (predictor) in a perfect manner.” Hence, we can get y exactly, given x (e.g., if x is 8.5, then y is 17). This set of data, if plotted on a graph in a one-to-one correspondence, will form a perfect or straight line with a positive slope.
y = 2x and r = +1
The data set, scatter plot & the equation are one & the same, i.e., the perfect positive linear relationship
between x & y evident in the data set also becomes evident in the scatter plot, which means that an
equation can be formulated to represent the relationship. Such a relationship will yield a Pearson
correlation coefficient (r) of +1.
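The claim for Illustration 1 can be checked numerically. The sketch below uses the eight (x, y) pairs from the data layout, confirms that every point satisfies y = 2x, and computes r from deviation sums; the variable names are chosen for this example only.

```python
from math import sqrt

# Data from the Illustration 1 layout.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 6, 8, 10, 12, 14, 16]

# Every point should lie on the line y = 2x.
all_on_line = all(yi == 2 * xi for xi, yi in zip(x, y))

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
r = sxy / sqrt(sxx * syy)

print(all_on_line, round(r, 6))
print(2 * 8.5)  # y for x = 8.5, the example given in the text
```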
------------------------x0x---------------------------
Illustration 2:
SPSS Data Layout:
x       y
10.00   2.00
9.00    4.00
8.00    6.00
7.00    8.00
6.00    10.00
5.00    12.00
4.00    14.00
3.00    16.00
2.00    18.00
1.00    20.00
Resulting scatter plot: [figure not reproduced; horizontal axis X1, vertical axis Y1]
By inspection of the data, you will notice that “as x decreases by 1, y increases by 2 in a perfect manner.” Hence, we can get y exactly, given x (e.g., if x is 4.5, then y is 13). This set of data, if plotted on a graph in a one-to-one correspondence, will form a perfect or straight line with a negative slope. The perfect linear relationship between x & y can be further expressed in the equation:
y = 22 - 2x and r = -1
Similarly, the above data set, scatter plot & the equation are one & the same, i.e., the perfect
negative linear relationship between x & y evident in the data set also becomes evident in the scatter plot,
which means that an equation can be formulated to represent the relationship. (Note: Formulation of equation
given a set of data is not within the scope of EDM; for further exploration, refer to Algebra references). Such a relationship will
yield a Pearson correlation coefficient (r) of -1.
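Although formulating the equation is outside the scope of EDM, a short Python sketch shows how y = 22 - 2x can be recovered from the Illustration 2 data. It assumes the standard least-squares formulas for slope and intercept, which the handout does not state.

```python
from math import sqrt

# Data from the Illustration 2 layout.
x = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
y = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = sxy / sxx                      # expected to be -2
intercept = mean_y - slope * mean_x    # expected to be 22
r = sxy / sqrt(sxx * syy)              # expected to be -1

print(intercept, slope, r)
print(intercept + slope * 4.5)  # y for x = 4.5, the example given in the text
```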
The preceding illustrations (#1 & #2) show perfect linear relationships between two variables in the
interval or ratio scale. Perfect correlation is characterized by a scatter plot where all points (x, y) fall on a
straight line, an r = +/- 1, and whose relationship can be expressed as a linear equation. In the real world,
however, we rarely find two interval/ratio variables that are perfectly related. Most, if not all, linear
relationships fall into one of the following categories: strong/high positive, strong/high negative, moderate positive,
moderate negative, slight/low positive, slight/low negative, or none at all. These classifications imply that a
linear relationship measured by a correlation coefficient has both magnitude (strong, moderate, slight) and
direction (+, -). Specifically, r falls between -1 & +1, inclusive. (Note: A manually computed r
greater than 1 or less than -1 is certainly due to erroneous computations.) The
arbitrary scale for the interpretation of r is given below.
Illustration 3:
SPSS Data Layout:
Resulting scatter plot: [figure not reproduced; horizontal axis X2, vertical axis Y2]
By inspection of the data, you will notice that “a higher value of x tends to [...] arbitrary scale, x & y appear to be strongly correlated.
Illustration 4:
SPSS Data Layout:
Resulting scatter plot: [figure not reproduced]
By inspection of the data, an upward linear trend between x & y is still [...]
Illustration 5:
SPSS Data Layout:
54.00   240.00
43.00   150.00
89.00   120.00
34.00   50.00
55.00   100.00
77.00   40.00
90.00   120.00
Resulting scatter plot: [figure not reproduced; horizontal axis X3, vertical axis Y3]
By inspection of the data, you will notice that [...] is obviously negative. Resulting correlation coefficient (r) is -0.728 (SPSS output). It may be deduced that there is a moderate to strong negative linear correlation between x & y.
Illustration 6:
SPSS Data Layout:
Resulting scatter plot: [figure not reproduced; horizontal axis X4]
The data show that y is constant while x varies. [...]
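When one variable is constant, its sum of squared deviations is zero, so the denominator of r is zero and r cannot be computed. The following sketch makes that explicit. The data values are hypothetical (the original data layout for this illustration did not survive), and the helper name `pearson_r_or_none` is an assumption for this example.

```python
from math import sqrt

def pearson_r_or_none(xs, ys):
    """Return Pearson r, or None when either variable has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # r is undefined: the denominator would be zero
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

x4 = [20, 40, 60, 80, 100, 120]        # hypothetical x values
y4 = [100, 100, 100, 100, 100, 100]    # y held constant

print(pearson_r_or_none(x4, y4))  # y constant while x varies
print(pearson_r_or_none(y4, x4))  # roles swapped: x constant while y varies
```

Both calls return `None`, since in each case one of the two variables does not vary.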
Illustration 7:
SPSS Data Layout:
Resulting scatter plot: [figure not reproduced; horizontal axis X5]
This time, the data show that x is constant [...] also not a function of x
Illustration 8:
SPSS Data Layout:
15.00    150.00
130.00   20.00
37.00    200.00
Resulting scatter plot: [figure not reproduced; horizontal axis X6, vertical axis Y6]
No linear trend [...] will approach zero (0), away from either +1 or -1.
Illustration 10:
SPSS Data Layout:
x       y
20.00   5.00
16.00   5.80
17.00   4.50
40.00   30.00
14.00   5.00
38.00   24.00
17.00   6.00
20.00   6.00
19.00   5.20
20.00   4.20
18.00   3.50
28.00   20.00
26.00   19.00
29.00   25.00
18.00   4.20
26.00   29.00
35.00   25.00
23.00   20.00
16.00   23.00
13.00   5.60
Resulting scatter plot: [figure not reproduced; horizontal axis X, vertical axis Y]
Any trend from the data may not be evident. But, in the resulting scatter plot, a trend is seen. Notice what seem to be two clusters of points. Two separate r’s & analyses may be more appropriate than representing the relationship/s using a single r. It appears in the scatter plot that we are dealing with two different “populations” (further discussion is beyond the scope of EDM). Suffice it to say that scatter plots show pictures of relationships that may not be evident from the raw data or from computed statistics. Graphs have their rightful place in statistical analysis.
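To see why a single r may be misleading for Illustration 10, the sketch below computes r for all 20 points and then separately for two groups. The split at y = 15 is a hypothetical cut-off chosen by eye for this example; the handout does not say how the clusters should be separated.

```python
from math import sqrt

# (x, y) pairs as listed in the Illustration 10 data layout.
data = [
    (20, 5.0), (16, 5.8), (17, 4.5), (40, 30.0), (14, 5.0),
    (38, 24.0), (17, 6.0), (20, 6.0), (19, 5.2), (20, 4.2),
    (18, 3.5), (28, 20.0), (26, 19.0), (29, 25.0), (18, 4.2),
    (26, 29.0), (35, 25.0), (23, 20.0), (16, 23.0), (13, 5.6),
]

def pearson_r(pairs):
    """Pearson r for a list of (x, y) pairs, using deviation sums."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return sxy / sqrt(sxx * syy)

Y_SPLIT = 15.0  # hypothetical cut-off, not from the handout
low = [p for p in data if p[1] < Y_SPLIT]
high = [p for p in data if p[1] >= Y_SPLIT]

r_all, r_low, r_high = pearson_r(data), pearson_r(low), pearson_r(high)
print(f"all points:   n = {len(data)}, r = {r_all:.3f}")
print(f"low cluster:  n = {len(low)}, r = {r_low:.3f}")
print(f"high cluster: n = {len(high)}, r = {r_high:.3f}")
```

Comparing the three values shows how much of the overall r comes from the gap between the clusters rather than from any trend within each one.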
COMPLETE PEARSON CORRELATION & REGRESSION ANALYSIS
Problem:
Below are the scores of 10 college students in Mathematics & English proficiency tests of 100 items each. Is Mathematics ability linearly related to English proficiency? Do people inclined to Mathematics have a good command of English?
Math (x)   English (y)
95.00      90.00
65.00      75.00
78.00      88.00
88.00      90.00
46.00      58.00
90.00      88.00
82.00      84.00
85.00      80.00
74.00      65.00
75.00      70.00
Analysis
Hypotheses:
Scatter Plot: [figure not reproduced; horizontal axis MATH]
Computed r: (SPSS: click ANALYZE, CORRELATE, BIVARIATE (Pearson is the default and is automatically checked),
click the variables & enter them in the box, click OK)
Correlations
                             MATH      ENG
MATH   Pearson Correlation   1         .852**
       Sig. (2-tailed)       .         .002
       N                     10        10
ENG    Pearson Correlation   .852**    1
       Sig. (2-tailed)       .002      .
       N                     10        10
**. Correlation is significant at the 0.01 level (2-tailed).
Null hypothesis is rejected. The linear relationship between math & English is significant at the 0.01 level (as indicated by the double asterisks; p = 0.002 < 0.01). The linear relationship can be further described as “high” or “strong” positive based on r = 0.852. It can be deduced that college students with relatively higher math proficiency tend to have higher English proficiency, while low math ability students tend to show a low command of English.
Coefficient of determination:
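The SPSS correlation output can be cross-checked by hand. The sketch below computes r and its square for the ten Math and English scores, then tests significance with the t statistic t = r * sqrt(n - 2) / sqrt(1 - r^2). This t-test route is an assumption for illustration: it is a standard equivalent of the significance test, but the handout only reports the SPSS p-value. The critical value 3.355 (two-tailed, df = 8, alpha = 0.01) is taken from a standard t table.

```python
from math import sqrt

# Scores from the Problem table.
math_scores = [95, 65, 78, 88, 46, 90, 82, 85, 74, 75]
eng_scores = [90, 75, 88, 90, 58, 88, 84, 80, 65, 70]

n = len(math_scores)
mean_x = sum(math_scores) / n
mean_y = sum(eng_scores) / n
sxx = sum((x - mean_x) ** 2 for x in math_scores)
syy = sum((y - mean_y) ** 2 for y in eng_scores)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(math_scores, eng_scores))

r = sxy / sqrt(sxx * syy)
r_squared = r ** 2                               # coefficient of determination
t_stat = r * sqrt(n - 2) / sqrt(1 - r_squared)   # test statistic with df = n - 2

T_CRIT_01 = 3.355  # two-tailed critical t for df = 8 at alpha = 0.01 (t table)

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}, t = {t_stat:.3f}")
print("significant at the 0.01 level:", abs(t_stat) > T_CRIT_01)
```

The values of r and t should agree, to three decimals, with the .852 and 4.606 that appear in the SPSS tables.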
SPSS: click ANALYZE, REGRESSION, LINEAR; enter math (x) as independent, enter eng (y) as dependent; method ENTER
(the default method); click OK. (Note: For multiple regression, best methods are backward selection &
stepwise; other methods are not advisable; discussion is beyond the scope of EDM.)
Model Summary
Coefficients(a)
                 Unstandardized Coefficients   Standardized Coefficients
Model            B          Std. Error         Beta                        t        Sig.
1  (Constant)    25.725     11.695                                         2.200    .059
   MATH          .682       .148               .852                        4.606    .002
a. Dependent Variable: ENG
Ŷ = 25.725 + 0.682x (a = 25.725, b = 0.682)
The regression model indicates that for every 1 point increase/decrease in math score (x), English score (y) is expected to increase/decrease by 0.682. Also, given a math score of 85, English score is estimated to be 83.695, computed as:
Ŷ = 25.725 + 0.682 (85) = 83.695
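The regression coefficients can be reproduced from the raw scores. The sketch below assumes the usual least-squares formulas (the handout obtains a and b from SPSS) and then repeats the prediction for a math score of 85, both with the rounded coefficients used in the text and with the unrounded ones.

```python
# Scores from the Problem table.
math_scores = [95, 65, 78, 88, 46, 90, 82, 85, 74, 75]
eng_scores = [90, 75, 88, 90, 58, 88, 84, 80, 65, 70]

n = len(math_scores)
mean_x = sum(math_scores) / n
mean_y = sum(eng_scores) / n
sxx = sum((x - mean_x) ** 2 for x in math_scores)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(math_scores, eng_scores))

b = sxy / sxx              # regression coefficient (slope)
a = mean_y - b * mean_x    # regression constant (y-intercept)

# Prediction with the rounded coefficients reported in the SPSS table.
pred_rounded = 25.725 + 0.682 * 85
# Prediction with the unrounded coefficients.
pred_exact = a + b * 85

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"rounded coefficients:   {pred_rounded:.3f}")
print(f"unrounded coefficients: {pred_exact:.3f}")
```

The two predictions differ slightly in the second decimal place because the text rounds b to 0.682 before multiplying by 85.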
Table 1
Correlation Analysis between Mathematics Ability and English Proficiency
Variable                                      Coefficient of Correlation (r)   R Square   Probability Value
Mathematics Ability and English Proficiency   0.852                            0.726      0.002**
(**Significant at alpha = 0.01)
There is a significant relationship between Mathematics Ability and English Proficiency. The coefficient of
correlation of 0.852 is further described as a “high” or strong positive relationship, with a corresponding probability value
of 0.002, which is significant at alpha = 0.01. This means that college students with relatively higher math proficiency
tend to have higher English proficiency, while low math ability students tend to show a low command of English.
About 72.6% of the variation in English scores can be explained by or attributed to the variation in
Mathematics scores. In other words, variation in math proficiency accounts for 72.6% of the variation in English
proficiency. Only 27.4% is accounted for by other factors.
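The "explained variation" reading of R Square can be illustrated by splitting the total variation in English scores into a part captured by the fitted line and a residual part. This is a sketch under the assumption of a simple least-squares fit; the variable names are chosen for this example.

```python
# Scores from the Problem table.
math_scores = [95, 65, 78, 88, 46, 90, 82, 85, 74, 75]
eng_scores = [90, 75, 88, 90, 58, 88, 84, 80, 65, 70]

n = len(math_scores)
mean_x = sum(math_scores) / n
mean_y = sum(eng_scores) / n
sxx = sum((x - mean_x) ** 2 for x in math_scores)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(math_scores, eng_scores))
b = sxy / sxx
a = mean_y - b * mean_x

predicted = [a + b * x for x in math_scores]

# Total variation in English scores, and the part the line leaves unexplained.
ss_total = sum((y - mean_y) ** 2 for y in eng_scores)
ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(eng_scores, predicted))

r_squared = 1 - ss_residual / ss_total
explained_pct = 100 * r_squared
unexplained_pct = 100 - explained_pct

print(f"explained by math scores: {explained_pct:.1f}%")
print(f"other factors:            {unexplained_pct:.1f}%")
```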
Table 2
Regression Model of Mathematics Ability and English Proficiency
Parameters                                  Values
Correlation Coefficient (r)                 0.852
Coefficient of Determination (R Square)     0.726
Regression Constant                         25.725
Regression Coefficient                      0.682
Ŷ = 25.725 + 0.682x
Where
x = score on mathematics ability (predictor)
Ŷ = estimated English score
The regression model indicates that for every 1 point increase/decrease in math score (x), English score (y) is
expected to increase/decrease by 0.682. Also, given a math score of 85, English score is estimated to be 83.695,
computed as Ŷ = 25.725 + 0.682 (85).
Integration