
Statistics Correlation Analysis

This document discusses correlation and linear regression analysis. Correlation determines the relationship between two variables and can range from +1 to -1, where 0 indicates no association. Pearson's correlation coefficient (r) specifically measures the strength and direction of the linear association between interval/ratio variables. Linear regression analysis predicts the value of a dependent variable based on the value of an independent variable. The regression equation is Y = a + bx, where a is the y-intercept, b is the slope, and x and Y are the predictor and predicted variables. Two examples demonstrate perfect linear relationships between variables, where the data points fall exactly on a line and r = ±1. In reality, most relationships are not perfect and r can indicate strong

Uploaded by

joan

Gordon College

Statistics (EDM 503) 2nd sem 2013-2014: Correlation and Regression Analysis
Mr. Darwin P. Paguio

CORRELATION ANALYSIS

Correlation is a measure of the degree of relationship between variables; it seeks to determine how well a linear or other equation
describes or explains that relationship. It also implies “association” between two variables.

PEARSON PRODUCT MOMENT CORRELATION COEFFICIENT


The Pearson product-moment correlation coefficient (or Pearson r for short) is a measure of the strength of a linear association
between two variables measured on an interval or ratio scale.

$$ r = \frac{N\sum xy - \sum x \sum y}{\sqrt{\left[N\sum x^2 - \left(\sum x\right)^2\right]\left[N\sum y^2 - \left(\sum y\right)^2\right]}} $$

(Note: For EDM 503 purposes, solutions will be computer-aided.)
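
As a rough illustration of the raw-sum formula for Pearson r, the following Python sketch computes r directly from N, Σx, Σy, Σxy, Σx² and Σy². The function name `pearson_r` and the sample data are assumptions made for this example only; they do not come from the lecture, where SPSS is the intended tool.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r computed from the raw-sum (computational) formula."""
    if len(xs) != len(ys):
        raise ValueError("x and y must have the same number of values")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data, invented for this sketch only.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]
r = pearson_r(hours, score)
print(round(r, 3))
```

The sketch assumes both variables actually vary; if either one is constant, the denominator is zero and the division fails.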

The Pearson correlation coefficient, r, can take a range of values from +1 to -1. A value of 0 indicates that there is no
linear association between the two variables. This is shown in figure 1.

Figure 1

Linear Regression Analysis


Objectives
Regression analysis is the next step up after correlation; it is used when we want to predict the value of a variable based on the
value of another variable. In this case, the variable we are using to predict the other variable's value is called the independent
variable or sometimes the predictor variable. The variable we are wishing to predict is called the dependent variable or sometimes
the outcome variable.
Example
A salesman for a large car brand is interested in determining whether there is a relationship between an individual's income and the
price they pay for a car. They will use this information to determine which cars to offer potential customers in new areas where
average income is known.
Assumptions
• Variables are measured at the interval or ratio level (continuous).
• Variables are approximately normally distributed.
• There is a linear relationship between the two variables.

The equation for a fitted line is:

Y = a + bx where
Y = predicted value
a = y-intercept
b = slope of the regression line
x = the value of the predictor (independent) variable
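
The notes leave the computation of a and b to software. For reference, a minimal sketch of the usual least-squares estimates, b = Sxy / Sxx and a = (mean of y) - b (mean of x), is given below. The function name `fit_line` and the income and price figures are hypothetical and were invented to echo the car-sales example; they are not data from the lecture.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx              # slope
    a = mean_y - b * mean_x      # y-intercept
    return a, b

# Hypothetical incomes and car prices (in thousands), invented for illustration.
income = [20, 30, 40, 50, 60]
price = [11, 16, 21, 26, 31]     # lies exactly on price = 1 + 0.5 * income
a, b = fit_line(income, price)
predicted = a + b * 45           # predicted price for an income of 45
print(a, b, predicted)
```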

Reference: Pangilinan, Diana B. PhD Data Analysis Lecture, University of the Assumption, 2011

Some more notes:

1. Conduct correlation analysis only when two variables are “suspected” to have a relationship, or there
is sound theory that may back up the hypothesis. Your target is to determine whether or not there is
sufficient evidence that your suspicion is true.

2. No attempt, whatsoever, at correlation analysis must be carried out between two variables whose
hypothesized relationship has no sound theoretical underpinnings. E.g., relationship between your
IQ and the size of your foot; between Philippine gross domestic product & IQ of Philippine president
(or is there???)

3. In a graduate thesis, only relationships between or among variables that have not been explored in
any part of the world, as far as we know, may be considered. Otherwise, the researcher may just be
duplicating what already exists. The researcher is, therefore, challenged to be innovative and
creative towards exploring “unexplored” but theoretically sound suspected relationships
between/among variables. This would mean the introduction of new or more variables into the
conceptual framework.

The following illustrations present SPSS data layouts of pairs of predictor and criterion variables, both in the
interval/ratio scale, together with the resulting scatter plots & their corresponding interpretations.

(SPSS: CLICK GRAPHS, CLICK SCATTER, CLICK SIMPLE, CLICK DEFINE, ENTER PREDICTOR IN X-AXIS, ENTER CRITERION VARIABLE IN Y-AXIS, CLICK OK)
----------------------------------------------------------------------------------- xxxxxxxxxxxxxx------------------------------------------------------------------------------------------

Illustration 1:

SPSS Data Layout:

    x        y
  1.00     2.00
  2.00     4.00
  3.00     6.00
  4.00     8.00
  5.00    10.00
  6.00    12.00
  7.00    14.00
  8.00    16.00
  9.00    18.00
 10.00    20.00

Resulting scatter plot: [figure not reproduced; Y on the vertical axis, X on the horizontal axis]

By inspection of the data, you will notice that “y (criterion) is double the given value of x (predictor) in a perfect manner.” Hence, we can get y exactly, given x (e.g., if x is 8.5, then y is 17). This set of data, if plotted on a graph in a one-to-one correspondence, will form a perfect or straight line with a positive slope. The perfect linear relationship between x & y can be further expressed in the equation:

y = 2x
and r = +1

The data set, scatter plot & the equation are one & the same, i.e., the perfect positive linear relationship
between x & y evident in the data set also becomes evident in the scatter plot, which means that an
equation can be formulated to represent the relationship. Such a relationship will yield a Pearson
correlation coefficient (r) of +1.
------------------------x0x---------------------------

Illustration 2:

SPSS Data Layout:

    x        y
 10.00     2.00
  9.00     4.00
  8.00     6.00
  7.00     8.00
  6.00    10.00
  5.00    12.00
  4.00    14.00
  3.00    16.00
  2.00    18.00
  1.00    20.00

Resulting scatter plot: [figure not reproduced; Y1 on the vertical axis, X1 on the horizontal axis]

By inspection of the data, you will notice that “as x decreases by 1, y increases by 2 in a perfect manner.” Hence, we can get y exactly, given x (e.g., if x is 4.5, then y is 13). This set of data, if plotted on a graph in a one-to-one correspondence, will form a perfect or straight line with a negative slope. The perfect linear relationship between x & y can be further expressed in the equation:

y = 22 - 2x
and r = -1

Similarly, the above data set, scatter plot & the equation are one & the same, i.e., the perfect
negative linear relationship between x & y evident in the data set also becomes evident in the scatter plot,
which means that an equation can be formulated to represent the relationship. (Note: Formulation of equation
given a set of data is not within the scope of EDM; for further exploration, refer to Algebra references). Such a relationship will
yield a Pearson correlation coefficient (r) of -1.
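
The two perfect cases can be checked numerically. The sketch below rebuilds the data of Illustrations 1 and 2 from their equations (y = 2x and y = 22 - 2x, for x = 1 to 10) and computes r with a small helper; the helper name `pearson_r` is an assumption of this example, not something used in the lecture.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

x = list(range(1, 11))
y_up = [2 * v for v in x]          # Illustration 1: y = 2x
y_down = [22 - 2 * v for v in x]   # Illustration 2: y = 22 - 2x

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(round(r_up, 6), round(r_down, 6))
```

Because of floating-point rounding, the computed values may differ from +1 and -1 in the last decimal places, which is why they are rounded before printing.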

The preceding illustrations (#1 & #2) show perfect linear relationships between two variables in the
interval or ratio scale. Perfect correlation is characterized by a scatter plot where all points (x, y) fall on a
straight line, an r = +/- 1, and a relationship that can be expressed as a linear equation. In the real world,
however, we rarely find two interval/ratio variables that are perfectly related. Most, if not all, linear
relationships fall into one of the following: strong/high positive, strong/high negative, moderate positive,
moderate negative, slight/low positive, slight/low negative or none at all. These classifications imply that a
linear relationship measured by a correlation coefficient has both magnitude (strong, moderate, slight) and
direction (+, -). Specifically, r falls between -1 & +1, inclusive. (Note: A manually computed r
greater than 1 or less than -1 is certainly due to erroneous computation.) The
arbitrary scale for the interpretation of r is given below.

Range of computed r        Interpretation

± 1.0                      Perfect Relationship
± 0.70 to ± 0.99           Strong/High Relationship
± 0.40 to ± 0.69           Moderate Relationship
± 0.01 to ± 0.39           Slight/Low Relationship
0                          No Correlation
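
The interpretation scale can be written as a small lookup. The sketch below is one possible reading of it: the function name `interpret_r` is hypothetical, and values that fall between the printed ranges (for example 0.695 or 0.005) are assigned using cut-offs at 0.40 and 0.70, which is an assumption the scale itself does not state.

```python
def interpret_r(r):
    """Label r using the arbitrary scale, applied to the size of r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1; recheck the computation")
    size = abs(r)
    if size == 0.0:
        return "No Correlation"
    if size == 1.0:
        label = "Perfect Relationship"
    elif size >= 0.70:          # assumed cut-off between moderate and strong
        label = "Strong/High Relationship"
    elif size >= 0.40:          # assumed cut-off between slight and moderate
        label = "Moderate Relationship"
    else:
        label = "Slight/Low Relationship"
    direction = "positive" if r > 0 else "negative"
    return f"{label} ({direction})"

print(interpret_r(0.852))
print(interpret_r(-0.35))
```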
-------------------------------- x --- 0 -----x -----------------------------------

Illustration 3:

SPSS Data Layout:

    x         y
 23.00    103.00
 45.00    120.00
 76.00    144.00
 54.00    132.00
 43.00    100.00
 89.00    180.00
 34.00    100.00
 55.00    128.00
 77.00    139.00
 90.00    156.00

Resulting scatter plot: [figure not reproduced; Y2 on the vertical axis, X2 on the horizontal axis]

By inspection of the data, you will notice that “a higher value of x tends to yield a higher value of y.” Notice that y cannot be perfectly determined given x. However, a positive linear trend is noticeable. Thus, it may be surmised that a strong positive correlation exists between x & y as can be gleaned in the scatter plot. Correlation coefficient (r) is 0.925 (SPSS output). Using the arbitrary scale, x & y appear to be strongly correlated.

Illustration 4:

SPSS Data Layout:

    x         y
  3.50    120.20
  6.70     35.60
  7.00    100.35
  9.60     45.60
 10.90    180.40
 15.00    145.00
 21.20    230.90
 23.00    175.80
 25.60    120.60
 30.20    350.30

Resulting scatter plot: [figure not reproduced; Y2 on the vertical axis, X2 on the horizontal axis]

By inspection of the data, an upward linear trend between x & y is still quite evident, i.e., “a higher value of x tends to yield a higher value of y.” However, compared to Illustration 3, notice that points here are “more dispersed”. This indicates that, here, the linear relationship between x & y is lesser in strength than that in Illustration 3. Correlation coefficient (r) is 0.732 (SPSS output).
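
For readers without SPSS, the correlation coefficients of Illustrations 3 and 4 can be recomputed from the listed data. The sketch below uses a hypothetical helper `pearson_r`; the results should land close to the SPSS values of 0.925 and 0.732, up to rounding.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Illustration 3 data
x3 = [23, 45, 76, 54, 43, 89, 34, 55, 77, 90]
y3 = [103, 120, 144, 132, 100, 180, 100, 128, 139, 156]

# Illustration 4 data
x4 = [3.5, 6.7, 7.0, 9.6, 10.9, 15.0, 21.2, 23.0, 25.6, 30.2]
y4 = [120.2, 35.6, 100.35, 45.6, 180.4, 145.0, 230.9, 175.8, 120.6, 350.3]

r3 = pearson_r(x3, y3)
r4 = pearson_r(x4, y4)
print(f"Illustration 3: r = {r3:.3f}")
print(f"Illustration 4: r = {r4:.3f}")
```

The smaller r for Illustration 4 matches the visual impression that its points are more dispersed about the trend line.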

Illustration 5:

SPSS Data Layout:

    x         y
 23.00    280.00
 45.00    120.00
 76.00    177.00
 54.00    240.00
 43.00    150.00
 89.00    120.00
 34.00     50.00
 55.00    100.00
 77.00     40.00
 90.00    120.00

Resulting scatter plot: [figure not reproduced; Y3 on the vertical axis, X3 on the horizontal axis]

By inspection of the data, you will notice that “as x increases, y tends to decrease”. This is further elucidated in the scatter plot where line slope is obviously negative. Resulting correlation coefficient (r) is -0.728 (SPSS output). It may be deduced that there is a moderate to strong negative linear correlation between x & y.

-------------------------------- x --- 0 -----x -----------------------------------

Illustration 6:

SPSS Data Layout:

     x         y
  45.00    100.00
  78.00    100.00
  90.00    100.00
 120.00    100.00
  10.00    100.00
  40.00    100.00
  34.00    100.00
  89.00    100.00
 109.00    100.00
  60.00    100.00

Resulting scatter plot: [figure not reproduced; Y4 on the vertical axis, X4 on the horizontal axis]

The data show that y is constant while x varies. Therefore, x does not have any influence on y at all since changes in x do not result in changes in y. The resulting scatter plot shows a horizontal line since y is constant. Correlation coefficient (r) is taken as zero (0), which means no linear relationship. (Strictly, the formula for r has a zero denominator when one variable is constant, so software may report r as undefined rather than 0.) In equation form, y is not a function of x; simply y = 100.

-------------------------------- x --- 0 -----x ----------------------------------

Illustration 7:

SPSS Data Layout:

    x         y
 55.00     45.00
 55.00     78.00
 55.00     90.00
 55.00    120.00
 55.00     10.00
 55.00     40.00
 55.00     34.00
 55.00     89.00
 55.00    109.00
 55.00     60.00

Resulting scatter plot: [figure not reproduced; Y5 on the vertical axis, X5 on the horizontal axis]

This time, the data show that x is constant while y varies. Therefore, y is not influenced by x at all since y changes even if x does not change. Change in y is explained by factors other than x. The resulting scatter plot shows a vertical line since x is constant. Correlation coefficient (r) is also taken as zero (0), which means no linear relationship (again, the formula for r has a zero denominator when one variable is constant). Y is also not a function of x.
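
Applying the Pearson formula literally to Illustrations 6 and 7 gives a zero denominator, because one of the two variables does not vary. The sketch below makes that explicit by returning `None` in such cases; the function name `pearson_r_or_none` and the choice of `None` as the result are assumptions of this example. The practical reading stays the same: there is no linear relationship to measure.

```python
from math import sqrt

def pearson_r_or_none(xs, ys):
    """Return Pearson r, or None when either variable has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        return None                      # denominator of the formula is zero
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return s_xy / sqrt(s_xx * s_yy)

varying = [45, 78, 90, 120, 10, 40, 34, 89, 109, 60]

r6 = pearson_r_or_none(varying, [100] * 10)   # Illustration 6: y constant
r7 = pearson_r_or_none([55] * 10, varying)    # Illustration 7: x constant
print(r6, r7)
```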

Illustration 8:

SPSS Data Layout:

     x         y
  17.00     56.00
  87.00    156.00
  35.00    280.00
  90.00    250.00
  28.00     23.00
  46.00      5.00
   8.00    180.00
  15.00    150.00
 130.00     20.00
  37.00    200.00

Resulting scatter plot: [figure not reproduced; Y6 on the vertical axis, X6 on the horizontal axis]

No linear trend between x & y is evident based on both the data and the resulting scatter plot. Points are “scattered”, suggesting that y is not influenced by x. It appears that x is not related to y. Correlation coefficient (r) will approach zero (0), away from either +1 or -1.

---------------------------------- x ----- 0 -----x ----------------------------------


Illustration 9:

SPSS Data Layout:

     x         y
   4.00     15.00
   5.00     28.00
   6.00     39.00
  -9.00     82.00
  -2.00      5.00
    .00       .20
  10.00    103.00
  12.00    140.00
 -14.00    195.00
   9.00     78.00

Resulting scatter plot: [figure not reproduced; Y7 on the vertical axis, X7 on the horizontal axis]

You will notice that in the data, y seems to be approximately the square of x. The resulting scatter plot shows a curvilinear relationship between x & y, i.e., a quadratic function, rather than a linear function, approximates the relationship between x & y. It can only be said that there is NO LINEAR RELATIONSHIP BETWEEN x & y. To say that there is no relationship at all is not accurate; there is a relationship, but it is not linear. In this case, r will approach 0. But unlike in illustration 8 in which x & y have no relationship, illustration 9 shows a case of no linear relationship only. It would call for the exploration of another kind of relationship (but discussion is beyond the scope of EDM), as the graph clearly suggests.

The importance of using a scatter plot to describe relationships is implied in the preceding examples, i.e., it is only with a graph that one may discover possible linear or non-linear relationships between variables. Before computing any statistics, for that matter, the scatter plot would provide “reason to believe so or otherwise”.
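
To see why a clear curvilinear pattern can still give an r near zero, the sketch below computes r for two data sets: an assumed toy parabola (y = x² on x values symmetric about zero, invented for this example) and the Illustration 9 data. The helper name `pearson_r` is hypothetical. For the toy parabola r is exactly 0 even though y is fully determined by x; for the Illustration 9 data this computation gives a small value of roughly -0.2, which is not an SPSS output.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Assumed toy data: an exact parabola on x values symmetric about zero.
x_sym = [-3, -2, -1, 0, 1, 2, 3]
y_sym = [v * v for v in x_sym]
r_sym = pearson_r(x_sym, y_sym)

# Illustration 9 data
x9 = [4, 5, 6, -9, -2, 0, 10, 12, -14, 9]
y9 = [15, 28, 39, 82, 5, 0.2, 103, 140, 195, 78]
r9 = pearson_r(x9, y9)

print(round(r_sym, 3), round(r9, 3))
```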

Illustration 10:

SPSS Data Layout:

    x        y
 20.00     5.00
 16.00     5.80
 17.00     4.50
 40.00    30.00
 14.00     5.00
 38.00    24.00
 17.00     6.00
 20.00     6.00
 19.00     5.20
 20.00     4.20
 18.00     3.50
 28.00    20.00
 26.00    19.00
 29.00    25.00
 18.00     4.20
 26.00    29.00
 35.00    25.00
 23.00    20.00
 16.00    23.00
 13.00     5.60

Resulting scatter plot: [figure not reproduced; Y on the vertical axis, X on the horizontal axis]

Any trend from the data may not be evident. But, in the resulting scatter plot, a trend is seen. Notice seemingly two clusters of points as shown. Two separate r’s & analyses may be more appropriate than representing the relationship/s using a single r. It appears in the scatter plot that we are dealing with two different “populations” (Further discussion is beyond the scope of EDM). Suffice it to say that scatter plots show pictures of relationships that may not be evident from the raw data or from computed statistics. Graphs have their rightful place in statistical analysis.

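
The idea of computing separate r's for the two apparent clusters in Illustration 10 can be sketched as follows. The split rule used here (y below 10 versus y of 10 or more) is purely an assumption made for illustration, as is the helper name `pearson_r`; the lecture does not say how the two "populations" should be identified.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Illustration 10 data as (x, y) pairs
pairs = [(20, 5.0), (16, 5.8), (17, 4.5), (40, 30.0), (14, 5.0),
         (38, 24.0), (17, 6.0), (20, 6.0), (19, 5.2), (20, 4.2),
         (18, 3.5), (28, 20.0), (26, 19.0), (29, 25.0), (18, 4.2),
         (26, 29.0), (35, 25.0), (23, 20.0), (16, 23.0), (13, 5.6)]

# Assumed split rule: low cluster has y < 10, high cluster has y >= 10.
low = [p for p in pairs if p[1] < 10]
high = [p for p in pairs if p[1] >= 10]

r_all = pearson_r([x for x, _ in pairs], [y for _, y in pairs])
r_low = pearson_r([x for x, _ in low], [y for _, y in low])
r_high = pearson_r([x for x, _ in high], [y for _, y in high])

print(f"all points: r = {r_all:.3f}")
print(f"low cluster ({len(low)} points): r = {r_low:.3f}")
print(f"high cluster ({len(high)} points): r = {r_high:.3f}")
```

A single r over all points is driven largely by the gap between the two clusters, so comparing it with the within-cluster values shows how much of the apparent relationship is due to the grouping itself.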
COMPLETE PEARSON CORRELATION & REGRESSION ANALYSIS

Problem:

Below are the scores of 10 college students in Mathematics & English proficiency tests of 100 items each.

Is Mathematics ability linearly related to English proficiency? Do people inclined to Mathematics have good command of English?

 Math (x)   English (y)
  95.00       90.00
  65.00       75.00
  78.00       88.00
  88.00       90.00
  46.00       58.00
  90.00       88.00
  82.00       84.00
  85.00       80.00
  74.00       65.00
  75.00       70.00

Analysis
Hypotheses:

Ho: Mathematics ability is not linearly related to English proficiency


H1: Mathematics ability is linearly related to English proficiency

Scatter Plot: [figure not reproduced; ENG on the vertical axis, MATH on the horizontal axis]

The scatter plot indicates an upward linear trend between Mathematics proficiency & English proficiency. Thus, “there is reason to believe that they are related.”
(Note: Refer to steps in complete correlation analysis.)


Computed r: (SPSS: click ANALYZE, CORRELATION, BIVARIATE (default Pearson; automatically checked),
Click variables & enter box, click OK)

Correlations

                                MATH      ENG
MATH   Pearson Correlation      1         .852**
       Sig. (2-tailed)          .         .002
       N                        10        10
ENG    Pearson Correlation      .852**    1
       Sig. (2-tailed)          .002      .
       N                        10        10
**. Correlation is significant at the 0.01 level (2-tailed).

Null hypothesis is rejected. Linear relationship between math & English is significant at the 0.01 level (as indicated by double asterisks; p = 0.002 < 0.01). Linear relationship can be further described as “high” or “strong” positive based on r = 0.852. It can be deduced that college students with relatively higher math proficiency tend to have higher English proficiency, while low math ability students tend to show low command of English.

Coefficient of determination:

r2 = 0.726

Further, 72.6% of the variation in English scores can be explained by (or can be attributed to) the variation in Math scores. In other words, variation in math proficiency accounts for 72.6% of the variation in English proficiency. Only 27.4% is accounted for by other factors.
(Note: A significant correlation between x & y indicates that x is a predictor of y. Correlation should precede regression (prediction); prediction is not proper when x is not related to y.)
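
The correlation results for the Math and English scores can be reproduced without SPSS. The sketch below recomputes r and r², plus the test statistic t = r·sqrt(n - 2) / sqrt(1 - r²), which is the usual t test for a correlation coefficient and is not spelled out in the notes. The names `pearson_r` and `T_CRITICAL_01_DF8` are assumptions of this example, and the critical value 3.355 (two-tailed, alpha = 0.01, 8 degrees of freedom) is taken from a standard t table rather than from the SPSS output.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

math_scores = [95, 65, 78, 88, 46, 90, 82, 85, 74, 75]
eng_scores = [90, 75, 88, 90, 58, 88, 84, 80, 65, 70]

n = len(math_scores)
r = pearson_r(math_scores, eng_scores)
r_squared = r ** 2
t_stat = r * sqrt(n - 2) / sqrt(1 - r_squared)

# Assumed constant: two-tailed critical t for alpha = 0.01, df = 8 (standard t table).
T_CRITICAL_01_DF8 = 3.355
reject_h0 = abs(t_stat) > T_CRITICAL_01_DF8

print(f"r = {r:.3f}, r squared = {r_squared:.3f}, t = {t_stat:.3f}")
print("Reject Ho at the 0.01 level:", reject_h0)
```

The values should agree with the SPSS output (r = .852, R Square = .726, t = 4.606) up to rounding.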
Regression Model (Ŷ = a + bx):

SPSS: click REGRESSION, LINEAR, enter math (x) as independent, enter eng as dependent, method ENTER
(default method is ENTER), click OK. (Note: For multiple regression, best methods are backward selection &
stepwise; other methods are not advisable; discussion is beyond the scope of EDM)

Model Summary

Model    R        R Square    Adjusted R Square    Std. Error of the Estimate
1        .852a    .726        .692                 6.31099
a. Predictors: (Constant), MATH

Coefficients(a)

                  Unstandardized Coefficients    Standardized Coefficients
Model             B          Std. Error          Beta                         t        Sig.
1   (Constant)    25.725     11.695                                           2.200    .059
    MATH          .682       .148                .852                         4.606    .002
a. Dependent Variable: ENG

Here a = 25.725 (the constant) and b = 0.682 (the coefficient of MATH), so that

Ŷ = 25.725 + 0.682x

The regression model indicates that for every 1 point increase/decrease in math score (x), English score (y) is expected to increase/decrease by 0.682. Also, given a math score of 85, English score is estimated to be 83.695, computed as:
Ŷ = 25.725 + 0.682 (85)
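
As a cross-check on the SPSS regression output, the sketch below fits the least-squares line to the Math and English scores and repeats the prediction for a math score of 85. The function name `fit_line` and the variable names are assumptions of this example. Note that 83.695 comes from the coefficients rounded to three decimals; using the unrounded coefficients gives a slightly different estimate (about 83.71).

```python
from math import sqrt

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b

math_scores = [95, 65, 78, 88, 46, 90, 82, 85, 74, 75]
eng_scores = [90, 75, 88, 90, 58, 88, 84, 80, 65, 70]

a, b = fit_line(math_scores, eng_scores)

# Prediction with coefficients rounded as in the SPSS table, and unrounded.
pred_rounded = round(a, 3) + round(b, 3) * 85
pred_full = a + b * 85

# Standard error of the estimate: sqrt(SSE / (n - 2)).
n = len(math_scores)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(math_scores, eng_scores))
std_error = sqrt(sse / (n - 2))

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"prediction at x = 85: {pred_rounded:.3f} (rounded), {pred_full:.3f} (unrounded)")
print(f"standard error of the estimate = {std_error:.5f}")
```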

Sample Tables: Linear Correlation and Regression Analysis

Table 1
Correlation Analysis between Mathematics Ability and English Proficiency

Variable                                       Coefficient of Correlation (r)    R Square    Probability Value
Mathematics Ability and English Proficiency    0.852                             0.726       0.002**
(**Significant at alpha = 0.01)

There is a significant relationship between Mathematics Ability and English Proficiency. The coefficient of
correlation of 0.852 is further described as a “high” or strong positive relationship, with a corresponding probability value
of 0.002, which is significant at alpha = 0.01. This means that college students with relatively higher math proficiency
tend to have higher English proficiency, while low math ability students tend to show low command of English.
Further, 72.6% of the variation in English scores can be explained by or can be attributed to the variation in
Mathematics scores. In other words, variation in math proficiency accounts for 72.6% of the variation in English
proficiency. Only 27.4% is accounted for by other factors.

Table 2
Regression Model of Mathematics Ability and English Proficiency

Parameters                                  Values
Correlation Coefficient (r)                 0.852
Coefficient of Determination (R Square)     0.726
Regression Constant (a)                     25.725
Regression Coefficient (b)                  0.682

The resulting regression equation is

Ŷ = 25.725 + 0.682x

Where
x = score on mathematics ability (predictor)
Ŷ = estimated English score

The regression model indicates that for every 1 point increase/decrease in math score (x), English score (y) is
expected to increase/decrease by 0.682. Also, given a math score of 85, English score is estimated to be 83.695,
computed as Ŷ = 25.725 + 0.682 (85).

Integration

Conceptualize your own research problem in your own field of specialization
