Lecture 8
Correlation and Simple Regression Analysis
Copyright © 2018, 2012, 2007 Pearson Education, Inc. All Rights Reserved
Product moment correlation

• The product moment correlation, r, summarises the strength of association between two metric (interval or ratio scaled) variables, say X and Y.
• It is an index used to determine whether a linear or straight-line relationship exists between X and Y.
• As it was originally proposed by Karl Pearson, it is also known as the Pearson correlation coefficient.
• It is also referred to as simple correlation, bivariate correlation or merely the correlation coefficient.
From a sample of n observations, X and Y, the product moment correlation, r, can be calculated as:

$$ r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2 \sum_{i=1}^{n}(Y_i - \bar{Y})^2}} $$

Division of the numerator and denominator by (n − 1) gives:

$$ r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) / (n-1)}{\sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}} \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n-1}}} = \frac{COV_{xy}}{s_x s_y} $$
• r varies between −1.0 and +1.0.
• The correlation coefficient between two variables will be the same regardless of their underlying units of measurement.
Table 22.1
Explaining attitude towards the city of residence
The correlation coefficient may be calculated as follows:

X̄ = (10 + 12 + 12 + 4 + 12 + 6 + 8 + 2 + 18 + 9 + 17 + 2)/12 = 9.333

Ȳ = (6 + 9 + 8 + 3 + 10 + 4 + 5 + 2 + 11 + 9 + 10 + 2)/12 = 6.583

$\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$
= (10 − 9.33)(6 − 6.58) + (12 − 9.33)(9 − 6.58) + (12 − 9.33)(8 − 6.58) + (4 − 9.33)(3 − 6.58) + (12 − 9.33)(10 − 6.58) + (6 − 9.33)(4 − 6.58) + (8 − 9.33)(5 − 6.58) + (2 − 9.33)(2 − 6.58) + (18 − 9.33)(11 − 6.58) + (9 − 9.33)(9 − 6.58) + (17 − 9.33)(10 − 6.58) + (2 − 9.33)(2 − 6.58)
= −0.3886 + 6.4614 + 3.7914 + 19.0814 + 9.1314 + 8.5914 + 2.1014 + 33.5714 + 38.3214 − 0.7986 + 26.2314 + 33.5714
= 179.6668
$\sum_{i=1}^{n}(X_i - \bar{X})^2$
= (10 − 9.33)² + (12 − 9.33)² + (12 − 9.33)² + (4 − 9.33)² + (12 − 9.33)² + (6 − 9.33)² + (8 − 9.33)² + (2 − 9.33)² + (18 − 9.33)² + (9 − 9.33)² + (17 − 9.33)² + (2 − 9.33)²
= 0.4489 + 7.1289 + 7.1289 + 28.4089 + 7.1289 + 11.0889 + 1.7689 + 53.7289 + 75.1689 + 0.1089 + 58.8289 + 53.7289
= 304.6668

$\sum_{i=1}^{n}(Y_i - \bar{Y})^2$
= (6 − 6.58)² + (9 − 6.58)² + (8 − 6.58)² + (3 − 6.58)² + (10 − 6.58)² + (4 − 6.58)² + (5 − 6.58)² + (2 − 6.58)² + (11 − 6.58)² + (9 − 6.58)² + (10 − 6.58)² + (2 − 6.58)²
= 0.3364 + 5.8564 + 2.0164 + 12.8164 + 11.6964 + 6.6564 + 2.4964 + 20.9764 + 19.5364 + 5.8564 + 11.6964 + 20.9764
= 120.9168

Thus,

$$ r = \frac{179.6668}{\sqrt{(304.6668)(120.9168)}} = 0.9361 $$
Decomposition of the total variation

$$ r^2 = \frac{\text{explained variation}}{\text{total variation}} = \frac{SS_x}{SS_y} = \frac{\text{total variation} - \text{error variation}}{\text{total variation}} = \frac{SS_y - SS_{error}}{SS_y} $$
• When it is computed for a population rather than a sample, the product moment correlation is denoted by ρ, the Greek letter rho. The coefficient r is an estimator of ρ.
• The statistical significance of the relationship between two variables measured by using r can be conveniently tested. The hypotheses are:

H0: ρ = 0
H1: ρ ≠ 0
The test statistic is:

$$ t = r\sqrt{\frac{n-2}{1-r^2}} $$

which has a t distribution with n − 2 degrees of freedom. For the correlation coefficient calculated from the data given in Table 22.1,

$$ t = 0.9361\sqrt{\frac{12-2}{1-(0.9361)^2}} = 8.414 $$

and the degrees of freedom are df = 12 − 2 = 10. From the t distribution table (Table 4 in the statistical appendix), the critical value of t for a two-tailed test and α = 0.05 is 2.228. Since 8.414 exceeds 2.228, the null hypothesis of no relationship between X and Y is rejected.
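To make these calculations concrete, here is a minimal Python sketch (assuming NumPy is available; the two vectors are the twelve duration/attitude observations used above) that reproduces r and the t statistic:

```python
import numpy as np

# Twelve observations from Table 22.1: X = duration of residence, Y = attitude.
x = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2], dtype=float)
y = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2], dtype=float)
n = len(x)

# Product moment correlation from the deviation cross-products.
sxy = np.sum((x - x.mean()) * (y - y.mean()))   # ≈ 179.67
sxx = np.sum((x - x.mean()) ** 2)               # ≈ 304.67
syy = np.sum((y - y.mean()) ** 2)               # ≈ 120.92
r = sxy / np.sqrt(sxx * syy)                    # ≈ 0.9361

# t statistic with n − 2 degrees of freedom.
t = r * np.sqrt((n - 2) / (1 - r ** 2))         # ≈ 8.41
print(r, t)
```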
Partial correlation

A partial correlation coefficient measures the association between two variables after controlling for or adjusting for the effects of one or more additional variables.

$$ r_{xy.z} = \frac{r_{xy} - (r_{xz})(r_{yz})}{\sqrt{1 - r_{xz}^2}\sqrt{1 - r_{yz}^2}} $$

• Partial correlations have an order associated with them. The order indicates how many variables are being adjusted or controlled.
• The simple correlation coefficient, r, has a zero-order, as it does not control for any additional variables while measuring the association between two variables.
Partial correlation

• The coefficient r_xy.z is a first-order partial correlation coefficient, as it controls for the effect of one additional variable, Z.
• A second-order partial correlation coefficient controls for the effects of two variables, a third-order for the effects of three variables and so on.
• The special case when a partial correlation is larger than its respective zero-order correlation involves a suppressor effect.
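The first-order formula translates directly into code. A sketch, assuming the three zero-order correlations have already been computed (the numeric values passed in are hypothetical, chosen only to illustrate):

```python
import math

def partial_corr(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation r_xy.z: the association between
    X and Y after controlling for Z, from the three zero-order correlations."""
    return (r_xy - r_xz * r_yz) / (
        math.sqrt(1 - r_xz ** 2) * math.sqrt(1 - r_yz ** 2)
    )

# Hypothetical illustration: a strong X-Y correlation that weakens
# once a common influence Z is partialled out.
print(partial_corr(0.9361, 0.80, 0.75))
```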
Regression analysis

Regression analysis examines associative relationships between a metric dependent variable and one or more independent variables in the following ways:
• Determine whether the independent variables explain a significant variation in the dependent variable: whether a relationship exists.
• Determine how much of the variation in the dependent variable can be explained by the independent variables: strength of the relationship.
• Determine the structure or form of the relationship: the mathematical equation relating the independent and dependent variables.
• Predict the values of the dependent variable.
• Control for other independent variables when evaluating the contributions of a specific variable or set of variables.

Regression analysis is concerned with the nature and degree of association between variables and does not imply or assume any causality.
Statistics associated with bivariate regression analysis

• Bivariate regression model. The basic regression equation is Yi = β0 + β1Xi + ei, where Y = dependent or criterion variable, X = independent or predictor variable, β0 = intercept of the line, β1 = slope of the line, and ei is the error term associated with the ith observation.
• Coefficient of determination. The strength of association is measured by the coefficient of determination, r². It varies between 0 and 1 and signifies the proportion of the total variation in Y that is accounted for by the variation in X.
• Estimated or predicted value. The estimated or predicted value of Yi is Ŷi = a + bXi, where Ŷi is the predicted value of Yi, and a and b are estimators of β0 and β1, respectively.
• Regression coefficient. The estimated parameter, b, is
usually referred to as the non-standardised regression
coefficient.
• Scattergram. A scatter diagram, or scattergram, is a plot
of the values of two variables for all the cases or
observations.
• Standard error of estimate. This statistic, SEE, is the
standard deviation of the actual Y values from the
predicted Ŷ values.
• Standard error. The standard deviation of b, SEb, is
called the standard error.
• Standardised regression coefficient. Also termed the
beta coefficient or beta weight, this is the slope obtained
by the regression of Y on X when the data are
standardised.
• Sum of squared errors. The distances of all the points from the regression line are squared and added together to arrive at the sum of squared errors, which is a measure of total error, Σe²ⱼ.
• t statistic. A t statistic with n − 2 degrees of freedom can be used to test the null hypothesis that no linear relationship exists between X and Y, or H0: β1 = 0, where t = b/SEb.
Conducting bivariate regression analysis
plot the scatter diagram

• A scatter diagram, or scattergram, is a plot of the values of two variables for all the cases or observations.
• The most commonly used technique for fitting a straight line to a scattergram is the least squares procedure. In fitting the line, the least squares procedure minimises the sum of squared errors, Σe²ⱼ.
Figure 22.2
Conducting bivariate regression
analysis

Conducting bivariate regression analysis
formulate the bivariate regression model

In the bivariate regression model, the general form of a straight line is:

Y = β0 + β1X

where
Y = dependent or criterion variable
X = independent or predictor variable
β0 = intercept of the line
β1 = slope of the line.

The regression procedure adds an error term to account for the probabilistic or stochastic nature of the relationship:

Yi = β0 + β1Xi + ei

where ei is the error term associated with the ith observation.
Figure 22.3
Plot of attitude with duration

Figure 22.4
Which straight line is best?

Figure 22.5
Bivariate regression

Figure 22.6
Decomposition of the total variation in
bivariate regression

Conducting bivariate regression analysis
estimate the parameters

In most cases, β0 and β1 are unknown and are estimated from the sample observations using the equation:

Ŷi = a + bXi

where Ŷi is the estimated or predicted value of Yi, and a and b are estimators of β0 and β1, respectively.

$$ b = \frac{COV_{xy}}{s_x^2} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} $$
The intercept, a, may then be calculated using:

a = Ȳ − bX̄

For the data in Table 22.1, the estimation of parameters may be illustrated as follows:

$\sum_{i=1}^{12} X_i Y_i$ = (10)(6) + (12)(9) + (12)(8) + (4)(3) + (12)(10) + (6)(4) + (8)(5) + (2)(2) + (18)(11) + (9)(9) + (17)(10) + (2)(2) = 917

$\sum_{i=1}^{12} X_i^2$ = 10² + 12² + 12² + 4² + 12² + 6² + 8² + 2² + 18² + 9² + 17² + 2² = 1350
It may be recalled from earlier calculations of the simple correlation that:

X̄ = 9.333
Ȳ = 6.583

Given n = 12, b can be calculated as:

$$ b = \frac{917 - (12)(9.333)(6.583)}{1350 - (12)(9.333)^2} = 0.5897 $$

a = Ȳ − bX̄ = 6.583 − (0.5897)(9.333) = 1.0793
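The slope and intercept formulas above translate directly into code. A minimal sketch, assuming NumPy and the Table 22.1 data:

```python
import numpy as np

x = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2], dtype=float)
y = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2], dtype=float)
n = len(x)

# Slope from the raw-sums form: b = (ΣXiYi − nX̄Ȳ) / (ΣXi² − nX̄²).
b = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
# Intercept: a = Ȳ − bX̄.
a = y.mean() - b * x.mean()
print(b, a)  # ≈ 0.5897, ≈ 1.0793
```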
Conducting bivariate regression analysis
estimate the standardised regression coefficient

• Standardisation is the process by which the raw data are transformed into new variables that have a mean of 0 and a variance of 1.
• When the data are standardised, the intercept assumes a value of 0.
• The term beta coefficient or beta weight is used to denote the standardised regression coefficient. In the bivariate case:

Byx = Bxy = rxy

• There is a simple relationship between the standardised and non-standardised regression coefficients:

Byx = byx (sx/sy)
Conducting bivariate regression analysis
test for significance

The statistical significance of the linear relationship between X and Y may be tested by examining the hypotheses:

H0: β1 = 0
H1: β1 ≠ 0

A t statistic with n − 2 degrees of freedom can be used, where

t = b/SEb

SEb denotes the standard deviation of b and is called the standard error.
Conducting bivariate regression analysis
test for significance

Using quantitative data analysis software, the regression of attitude on duration of residence, using the data shown in Table 22.1, yielded the results shown in Table 22.2. The intercept, a, equals 1.0793, and the slope, b, equals 0.5897. Therefore, the estimated equation is:

Attitude (Ŷ) = 1.0793 + 0.5897 (Duration of residence)

The standard error, or standard deviation of b, is estimated as 0.07008, and the value of the t statistic is t = 0.5897/0.0701 = 8.414, with n − 2 = 10 degrees of freedom.

From Table 4 in the statistical appendix, we see that the critical value of t with 10 degrees of freedom and α = 0.05 is 2.228 for a two-tailed test. Since the calculated value of t is larger than the critical value, the null hypothesis is rejected.
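In practice, the output in Table 22.2 would come from software. A sketch using scipy's linregress (assuming scipy is installed) that should reproduce the figures quoted above:

```python
import numpy as np
from scipy import stats

x = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2], dtype=float)
y = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2], dtype=float)

res = stats.linregress(x, y)
print(res.slope, res.intercept)  # ≈ 0.5897, 1.0793
print(res.rvalue)                # ≈ 0.9361
t = res.slope / res.stderr       # ≈ 8.41, with df = n − 2 = 10
print(t, res.pvalue)
```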
Conducting bivariate regression analysis
determine the strength and significance of association

The total variation, SSy, may be decomposed into the variation accounted for by the regression line, SSreg, and the error or residual variation, SSerror or SSres, as follows:

SSy = SSreg + SSres

where

$$ SS_y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \qquad SS_{reg} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \qquad SS_{res} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 $$
The strength of association may then be calculated as follows:

$$ r^2 = \frac{SS_{reg}}{SS_y} = \frac{SS_y - SS_{res}}{SS_y} $$

To illustrate the calculations of r², let us consider again the effect of the duration of residence on attitude towards the city. It may be recalled from earlier calculations of the simple correlation coefficient that:

$$ SS_y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = 120.9168 $$
The predicted values (Ŷ) can be calculated using the regression equation:

Attitude (Ŷ) = 1.0793 + 0.5897 (duration of residence)

For the first observation in Table 22.1, this value is:

Ŷ = 1.0793 + 0.5897 × 10 = 6.9763

For each successive observation, the predicted values are, in order, 8.1557, 8.1557, 3.4381, 8.1557, 4.6175, 5.7969, 2.2587, 11.6939, 6.3866, 11.1042 and 2.2587.
Therefore,

$$ SS_{reg} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 $$

= (6.9763 − 6.5833)² + (8.1557 − 6.5833)² + (8.1557 − 6.5833)² + (3.4381 − 6.5833)² + (8.1557 − 6.5833)² + (4.6175 − 6.5833)² + (5.7969 − 6.5833)² + (2.2587 − 6.5833)² + (11.6939 − 6.5833)² + (6.3866 − 6.5833)² + (11.1042 − 6.5833)² + (2.2587 − 6.5833)²
= 0.1544 + 2.4724 + 2.4724 + 9.8922 + 2.4724 + 3.8643 + 0.6184 + 18.7021 + 26.1182 + 0.0387 + 20.4385 + 18.7021
= 105.9524
$$ SS_{res} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 $$

= (6 − 6.9763)² + (9 − 8.1557)² + (8 − 8.1557)² + (3 − 3.4381)² + (10 − 8.1557)² + (4 − 4.6175)² + (5 − 5.7969)² + (2 − 2.2587)² + (11 − 11.6939)² + (9 − 6.3866)² + (10 − 11.1042)² + (2 − 2.2587)²
= 14.9644

It can be seen that SSy = SSreg + SSres. Furthermore,

$$ r^2 = \frac{SS_{reg}}{SS_y} = \frac{105.9524}{120.9168} = 0.8762 $$
Another, equivalent test for examining the significance of the linear relationship between X and Y (significance of b) is the test for the significance of the coefficient of determination. The hypotheses in this case are:

H0: R²pop = 0
H1: R²pop > 0
The appropriate test statistic is the F statistic:

$$ F = \frac{SS_{reg}}{SS_{res}/(n-2)} $$

which has an F distribution with 1 and (n − 2) df. The F test is a generalised form of the t test. If a random variable is t distributed with n degrees of freedom, then t² is F distributed with 1 and n df. Hence, the F test for testing the significance of the coefficient of determination is equivalent to testing the following hypotheses:

H0: β1 = 0
H1: β1 ≠ 0

or

H0: ρ = 0
H1: ρ ≠ 0
From Table 22.2, it can be seen that:

$$ r^2 = \frac{105.9524}{105.9524 + 14.9644} = 0.8762 $$

which is the same as the value calculated earlier. The value of the F statistic is:

$$ F = \frac{105.9524}{14.9644/10} = 70.80 $$

with 1 and 10 df. The calculated F statistic exceeds the critical value of 4.96 determined from Table 5 in the statistical appendix. Therefore, the relationship is significant at α = 0.05, corroborating the results of the t test.
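A short sketch (Python with NumPy, reusing the fitted equation above) that reproduces this decomposition:

```python
import numpy as np

x = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2], dtype=float)
y = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2], dtype=float)
n = len(x)

y_hat = 1.0793 + 0.5897 * x                 # predicted values
ss_y = np.sum((y - y.mean()) ** 2)          # total variation ≈ 120.92
ss_reg = np.sum((y_hat - y.mean()) ** 2)    # explained variation ≈ 105.95
ss_res = np.sum((y - y_hat) ** 2)           # residual variation ≈ 14.96
# With rounded coefficients the identity SSy = SSreg + SSres holds approximately.

r2 = ss_reg / ss_y                          # ≈ 0.8762
F = ss_reg / (ss_res / (n - 2))             # ≈ 70.8, with 1 and 10 df
print(r2, F)
```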
Table 22.2
Bivariate regression

Conducting bivariate regression analysis
check prediction accuracy

To estimate the accuracy of predicted values, Ŷ, it is useful to calculate the standard error of estimate, SEE:

$$ SEE = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-2}} = \sqrt{\frac{SS_{res}}{n-2}} $$

or, more generally, if there are k independent variables,

$$ SEE = \sqrt{\frac{SS_{res}}{n-k-1}} $$

For the data given in Table 22.2, the SEE is estimated as:

$$ SEE = \sqrt{14.9644/(12-2)} = 1.22329 $$
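A self-contained sketch (NumPy assumed) using the general n − k − 1 form with k = 1:

```python
import numpy as np

x = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2], dtype=float)
y = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2], dtype=float)

residuals = y - (1.0793 + 0.5897 * x)
n, k = len(y), 1                   # one independent variable
see = np.sqrt(np.sum(residuals ** 2) / (n - k - 1))
print(see)                         # ≈ 1.2233
```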
Assumptions

• The error term is normally distributed. For each fixed value of X, the distribution of Y is normal.
• The means of all these normal distributions of Y, given X, lie on a straight line with slope b.
• The mean of the error term is 0.
• The variance of the error term is constant. This variance does not depend on the values assumed by X.
• The error terms are uncorrelated. In other words, the observations have been drawn independently.

These assumptions can be checked informally from the residuals, as sketched below.
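A minimal sketch of such residual checks (assuming scipy is available; the Shapiro-Wilk test is one common normality check, used here purely as an illustration):

```python
import numpy as np
from scipy import stats

x = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2], dtype=float)
y = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2], dtype=float)
residuals = y - (1.0793 + 0.5897 * x)    # errors around the fitted line

print(residuals.mean())                  # should be close to 0
w, p = stats.shapiro(residuals)          # normality: small p suggests non-normality
print(w, p)

# Constant variance: compare residual spread for low versus high X.
low = residuals[x <= np.median(x)]
high = residuals[x > np.median(x)]
print(low.std(ddof=1), high.std(ddof=1))
```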
Lecture 4
Cross-tabulation

• While a frequency distribution describes one variable at a time, a cross-tabulation describes two or more variables simultaneously.
• Cross-tabulation results in tables that reflect the joint distribution of two or more variables with a limited number of categories or distinct values, for example, Table 20.3.
Table 20.3
Gender and internet usage

Two variables cross-tabulation

• Since two variables have been cross-classified, percentages could be computed either column-wise, based on column totals (Table 20.4), or row-wise, based on row totals (Table 20.5).
• The general rule is to compute the percentages in the direction of the independent variable, across the dependent variable. The correct way of calculating percentages is as shown in Table 20.4; a short sketch of this computation follows below.
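A sketch using pandas (assumed installed), with hypothetical respondent-level data echoing Table 20.3, showing percentages computed in the direction of the independent variable (gender):

```python
import pandas as pd

# Hypothetical data echoing Table 20.3: 15 males (10 heavy, 5 light internet
# users) and 15 females (5 heavy, 10 light).
df = pd.DataFrame({
    "gender": ["male"] * 15 + ["female"] * 15,
    "usage": ["heavy"] * 10 + ["light"] * 5 + ["heavy"] * 5 + ["light"] * 10,
})

counts = pd.crosstab(df["usage"], df["gender"])
print(counts)

# Column-wise percentages: each gender column sums to 100%.
print(pd.crosstab(df["usage"], df["gender"], normalize="columns") * 100)
```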
Table 20.4
Gender and internet usage – column
totals

Table 20.5
Gender and internet usage – row totals

Three variables cross-tabulation
refine an initial relationship

As shown in Figure 20.7, the introduction of a third variable can result in four possibilities:
• As can be seen from Table 20.6, 52% of unmarried participants fell in the high-purchase category, as opposed to 31% of the married participants. Before concluding that unmarried participants purchase more designer clothing than those who are married, a third variable, the buyer's gender, was introduced into the analysis.
• As shown in Table 20.7, in the case of females, 60% of the unmarried participants fall in the high-purchase category, as compared to 25% of those who are married. On the other hand, the percentages are much closer for males, with 40% of the unmarried participants and 35% of the married participants falling in the high-purchase category.
• Hence, the introduction of gender (third variable) has refined the relationship between marital status and purchase of luxury branded clothing (original variables). Unmarried participants are more likely to fall in the high-purchase category than married ones, and this effect is much more pronounced for females than for males.
Table 20.6
Purchase of luxury branded clothing by
marital status

Table 20.7
Purchase of luxury branded clothing by
marital status and gender

Three variables cross-tabulation
initial relationship was spurious

• Table 20.8 shows that 32% of those with university degrees own an expensive car, as compared to 21% of those without university degrees. Realising that income may also be a factor, the researcher decided to re-examine the relationship between education and ownership of expensive cars in the light of income level.
• In Table 20.9, the percentages of those with and without university degrees who own expensive automobiles are the same for each income group. When the data for the high-income and low-income groups are examined separately, the association between education and ownership of expensive automobiles disappears, indicating that the initial relationship observed between these two variables was spurious.
Table 20.8
Ownership of expensive cars by
education level

Table 20.9
Ownership of expensive cars by
education and income levels

Three variables cross-tabulation
reveal suppressed association
• Table 20.10 shows no association between desire to travel abroad
and age.
• When gender was introduced as the third variable, Table 20.11
was obtained. Among men, 60% of those under 45 indicated a
desire to travel abroad, as compared to 40% of those 45 or older.
The pattern was reversed for women, where 35% of those under
45 indicated a desire to travel abroad as opposed to 65% of those
45 or older.
• Since the association between desire to travel abroad and age
runs in the opposite direction for males and females, the
relationship between these two variables is masked when the data
are aggregated across gender as in Table 20.10.
• But when the effect of gender is controlled, as in Table 20.11, the
suppressed association between desire to travel abroad and age
is revealed for the separate categories of males and females.
Table 20.10
Desire to travel abroad by age

Table 20.11
Desire to travel abroad by age and
gender

Statistics associated with cross-tabulation

• To determine whether a systematic association exists, the probability of obtaining a value of chi-square as large as or larger than the one calculated from the cross-tabulation is estimated.
• An important characteristic of the chi-square statistic is the number of degrees of freedom (df) associated with it. That is, df = (r − 1) × (c − 1).
• The null hypothesis (H0) of no association between the two variables will be rejected only when the calculated value of the test statistic is greater than the critical value of the chi-square distribution with the appropriate degrees of freedom, as shown in Figure 20.8.
Figure 20.8
Chi-square test of association

Statistics associated with cross-tabulation
chi-square

• The chi-square statistic (χ²) is used to test the statistical significance of the observed association in a cross-tabulation.
• The expected frequency for each cell can be calculated by using a simple formula:

$$ f_e = \frac{n_r n_c}{n} $$

where nr = total number in the row, nc = total number in the column, and n = total sample size.
For the data in Table 20.3, the expected frequencies for the cells, going from left to right and from top to bottom, are:

15 × 15/30 = 7.50, 15 × 15/30 = 7.50, 15 × 15/30 = 7.50, 15 × 15/30 = 7.50

Then the value of χ² is calculated as follows:

$$ \chi^2 = \sum_{\text{all cells}} \frac{(f_o - f_e)^2}{f_e} $$
For the data in Table 20.3, the value of χ² is calculated as:

χ² = (5 − 7.5)²/7.5 + (10 − 7.5)²/7.5 + (10 − 7.5)²/7.5 + (5 − 7.5)²/7.5
= 0.833 + 0.833 + 0.833 + 0.833
= 3.333
• The chi-square distribution is a skewed distribution whose
shape depends solely on the number of degrees of freedom.
As the number of degrees of freedom increases, the chi-square
distribution becomes more symmetrical.
• Table 3 in the statistical appendix contains upper-tail areas of
the chi-square distribution for different degrees of freedom. For
1 degree of freedom the probability of exceeding a chi-square
value of 3.841 is 0.05.
• For the cross-tabulation given in Table 20.3, there are
(2 − 1) × (2 − 1) = 1 degree of freedom. The calculated chi-
square statistic had a value of 3.333. Since this is less than the
critical value of 3.841, the null hypothesis of no association
cannot be rejected, indicating that the association is not
statistically significant at the 0.05 level.
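The same test can be run with scipy (assumed installed). Note that chi2_contingency applies Yates' continuity correction to 2 × 2 tables by default, so correction=False is needed to reproduce the 3.333 computed above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies from Table 20.3 (rows: usage, columns: gender).
observed = np.array([[5, 10],
                     [10, 5]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected)      # all four cells: 7.5
print(chi2, dof, p)  # ≈ 3.333 with 1 df, p ≈ 0.07 → not significant at 0.05
```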
Lecture 3
Descriptive Statistics and
Hypothesis Testing

Frequency distribution

• In a frequency distribution, one variable is considered at a time.
• A frequency distribution for a variable produces a table of frequency counts, percentages and cumulative percentages for all the values associated with that variable.
Table 20.2
Frequency distribution of ‘Familiarity
with the internet’

Figure 20.1
Frequency histogram

Statistics associated with frequency distribution
measures of location

• The mean, or average value, is the most commonly used measure of central tendency. The mean, X̄, is given by:

$$ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} $$

where Xi = observed values of the variable X, and n = number of observations (sample size).

• The mode is the value that occurs most frequently. It represents the highest peak of the distribution. The mode is a good measure of location when the variable is inherently categorical or has otherwise been grouped into categories.
• The median of a sample is the middle value when the data are
arranged in ascending or descending order. If the number of data
points is even, the median is usually estimated as the midpoint
between the two middle values – by adding the two middle values
and dividing their sum by 2. The median is the 50th percentile.

Statistics associated with frequency distribution
measures of variability

• The range measures the spread of the data. It is simply the difference between the largest and smallest values in the sample:

Range = Xlargest − Xsmallest

• The interquartile range is the difference between the 75th and 25th percentiles. For a set of data points arranged in order of magnitude, the pth percentile is the value that has p% of the data points below it and (100 − p)% above it.
• The variance is the mean squared deviation from the mean. The variance can never be negative.
• The standard deviation is the square root of the variance:

$$ s_x = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}} $$

• The coefficient of variation is the ratio of the standard deviation to the mean expressed as a percentage, and it is a unitless measure of relative variability:

$$ CV = \frac{s_x}{\bar{X}} $$
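A sketch computing these measures in Python (NumPy assumed; the data vector reuses the twelve attitude scores from the regression lecture, purely for illustration):

```python
import numpy as np
from statistics import mode

data = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2], dtype=float)

mean = data.mean()
median = np.median(data)
most_frequent = mode(data.tolist())                     # the mode
data_range = data.max() - data.min()                    # largest minus smallest
iqr = np.percentile(data, 75) - np.percentile(data, 25) # interquartile range
variance = data.var(ddof=1)                             # (n − 1) denominator
sd = data.std(ddof=1)                                   # square root of variance
cv = sd / mean                                          # coefficient of variation

print(mean, median, most_frequent, data_range, iqr, variance, sd, cv)
```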
Statistics associated with frequency distribution
measures of shape

• Skewness. The tendency of the deviations from the mean to be larger in one direction than in the other. It can be thought of as the tendency for one tail of the distribution to be heavier than the other.
• Kurtosis is a measure of the relative peakedness or flatness of the curve defined by the frequency distribution. The kurtosis of a normal distribution is zero. If the kurtosis is positive, then the distribution is more peaked than a normal distribution. A negative value means that the distribution is flatter than a normal distribution.
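A sketch using scipy (assumed available); scipy's kurtosis function reports excess kurtosis by default, matching the convention above of zero for a normal distribution:

```python
import numpy as np
from scipy import stats

data = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2], dtype=float)

print(stats.skew(data))      # > 0: heavier right tail; < 0: heavier left tail
print(stats.kurtosis(data))  # excess kurtosis: 0 for a normal distribution
```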
Figure 20.2
Skewness of a distribution

Hypothesis testing

• A null hypothesis is a statement of the status quo, one of no difference or no effect. If the null hypothesis is not rejected, no changes will be made.
• An alternative hypothesis is one in which some difference or effect is expected. Accepting the alternative hypothesis will lead to changes in opinions or actions.
• The null hypothesis refers to a specified value of the population parameter (e.g. μ, σ, π), not a sample statistic (e.g. X̄).
• A null hypothesis may be rejected, but it can never be accepted based on a single test. In classical hypothesis testing, there is no way to determine whether the null hypothesis is true.
• The null hypothesis is formulated in such a way that its rejection leads to the acceptance of the desired conclusion. The alternative hypothesis represents the conclusion for which evidence is sought. For example:

H0: π ≤ 0.40
H1: π > 0.40
A general procedure for hypothesis testing
step 1: Formulate the hypothesis

• The test of the null hypothesis above is a one-tailed test, because the alternative hypothesis is expressed directionally. If that is not the case, then a two-tailed test would be required, and the hypotheses would be expressed as:

H0: π = 0.40
H1: π ≠ 0.40
A general procedure for hypothesis testing
step 2: Select an appropriate statistical technique

• The test statistic measures how close the sample has come to the null hypothesis.
• The test statistic often follows a well-known distribution, such as the normal, t, or chi-square distribution.
• In our example, the z statistic, which follows the standard normal distribution, would be appropriate:

$$ z = \frac{p - \pi}{\sigma_p} \qquad \text{where} \qquad \sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}} $$
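A sketch of this one-sample proportion z test in plain Python (the sample size and observed proportion are hypothetical, chosen only to illustrate the mechanics):

```python
import math

pi0 = 0.40   # hypothesised population proportion under H0
n = 500      # hypothetical sample size
p = 0.45     # hypothetical observed sample proportion

sigma_p = math.sqrt(pi0 * (1 - pi0) / n)  # standard error of the proportion
z = (p - pi0) / sigma_p
print(z)  # compare with the critical value 1.645 for a one-tailed test at α = 0.05
```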
A general procedure for hypothesis testing
step 3: Choose the level of significance, α

Type I error
• Type I error occurs when the sample results lead to the rejection of the null hypothesis when it is in fact true.
• The probability of type I error (α) is also called the level of significance.

Type II error
• Type II error occurs when, based on the sample results, the null hypothesis is not rejected when it is in fact false.
• The probability of type II error is denoted by β.
• Unlike α, which is specified by the researcher, the magnitude of β depends on the actual value of the population parameter (proportion).
Figure 20.6
A broad classification of hypothesis
testing procedures

Measurement and Scaling

When you can measure what you are speaking about and express it in numbers, you know something about it.
– Lord Kelvin
Measurement and scaling

Measurement means assigning numbers or other symbols to characteristics of objects according to certain pre-specified rules.
– There should be one-to-one correspondence between the numbers and the characteristics being measured.
– The rules for assigning numbers should be standardised and applied uniformly.
– Rules must not change over objects or time.
Scaling involves creating a continuum upon
which measured objects are located.
Consider an attitude scale from 1 to 100. Each
respondent is assigned a number from 1 to 100,
with 1 = extremely unfavourable, and 100 =
extremely favourable. Measurement is the actual
assignment of a number from 1 to 100 to each
respondent. Scaling is the process of placing the
respondents on a continuum, for example, with
respect to their attitude towards Formula One
racing.
Figure 12.1
An illustration of primary scales of
measurement

Nominal scale

• The numbers serve only as labels or tags for identifying and classifying objects.
• When used for identification, there is a strict one-to-one correspondence between the numbers and the objects.
• The numbers do not reflect the amount of the characteristic possessed by the objects.
• The only permissible operation on the numbers in a nominal scale is counting.
• Only a limited number of statistics, all of which are based on frequency counts, are permissible, e.g. percentages and mode.
Table 12.2
Illustration of primary scales of
measurement

Ordinal scale

• A ranking scale in which numbers are assigned to objects to indicate the relative extent to which the objects possess some characteristic.
• Can determine whether an object has more or less of a characteristic than some other object, but not how much more or less.
• Any series of numbers can be assigned that preserves the ordered relationships between the objects.
• In addition to the counting operation allowable for nominal scale data, ordinal scales permit the use of statistics based on centiles, e.g. percentile, quartile and median.
Interval scale

• Numerically equal distances on the scale represent equal values in the characteristic being measured.
• It permits comparison of the differences between objects.
• The location of the zero point is not fixed. Both the zero point and the units of measurement are arbitrary.
• Any positive linear transformation of the form y = a + bx will preserve the properties of the scale.
• It is not meaningful to take ratios of scale values.
• Statistical techniques that may be used include all of those that can be applied to nominal and ordinal data, in addition to the arithmetic mean, standard deviation and other statistics commonly used in marketing research.
Ratio scale

• Possesses all the properties of the nominal, ordinal and interval scales.
• It has an absolute zero point.
• It is meaningful to compute ratios of scale values.
• Only proportionate transformations of the form y = bx, where b is a positive constant, are allowed.
• All statistical techniques can be applied to ratio data.
Table 12.1
Primary scales of measurement

Figure 12.2
A classification of scaling techniques

A comparison of scaling techniques

• Comparative scales involve the direct comparison of stimulus objects. Comparative scale data must be interpreted in relative terms and have only ordinal or rank order properties.
• In non-comparative scales, each object is scaled independently of the others in the stimulus set. The resulting data are generally assumed to be interval or ratio scaled.
Table 12.4
Basic non-comparative scales

Itemised rating scales

• The respondents are provided with a scale that has a number or brief description associated with each category.
• The categories are ordered in terms of scale position, and the respondents are required to select the specified category that best describes the object being rated.
• The commonly used itemised rating scales are the Likert, semantic differential and Stapel scales.
Figure 12.6
Continuous rating scale

Figure 12.7
The Likert scale

Figure 12.8
Semantic differential scale

Figure 12.9
The Stapel scale

Table 12.6
Some commonly used scales in
marketing

Figure 12.13
Development of a multi-item scale

Figure 12.14
Scale evaluation

Measurement accuracy

The true score model provides a framework for understanding the accuracy of measurement:

XO = XT + XS + XR

where
XO = the observed score or measurement
XT = the true score of the characteristic
XS = systematic error
XR = random error.
Potential sources of error in measurement
Reliability

• Reliability can be defined as the extent to which measures are free from random error, XR. If XR = 0, the measure is perfectly reliable.
• In test-retest reliability, respondents are administered identical sets of scale items at two different times and the degree of similarity between the two measurements is determined.
• In alternative-forms reliability, two equivalent forms of the scale are constructed and the same respondents are measured at two different times, with a different form being used each time.
• Internal consistency reliability determines the extent
to which different parts of a summated scale are
consistent in what they indicate about the characteristic
being measured.
• In split-half reliability, the items on the scale are
divided into two halves and the resulting half scores are
correlated.
• The coefficient alpha, or Cronbach’s alpha, is the
average of all possible split-half coefficients resulting
from different ways of splitting the scale items. This
coefficient varies from 0 to 1, and a value of 0.6 or less
generally indicates unsatisfactory internal consistency
reliability.
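As an illustration, coefficient alpha is commonly computed from item and total-score variances rather than by averaging split-half coefficients directly; a minimal sketch of that standard computational form (NumPy assumed, with a small hypothetical respondents × items matrix):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical 6 respondents × 5 Likert items, purely for illustration.
scores = np.array([
    [4, 5, 4, 4, 5],
    [2, 2, 3, 2, 2],
    [5, 4, 5, 5, 4],
    [3, 3, 3, 4, 3],
    [1, 2, 1, 2, 2],
    [4, 4, 5, 4, 4],
])
print(cronbach_alpha(scores))  # values of 0.6 or below suggest poor reliability
```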
Validity

• The validity of a scale may be defined as the extent to which differences in observed scale scores reflect true differences among objects on the characteristic being measured, rather than systematic or random error. Perfect validity requires that there be no measurement error (XO = XT, XR = 0, XS = 0).
• Content validity is a subjective but systematic evaluation of how well the content of a scale represents the measurement task at hand.
• Criterion validity reflects whether a scale performs as expected in relation to other variables selected (criterion variables) as meaningful criteria.
• Construct validity addresses the question of what
construct or characteristic the scale is, in fact,
measuring. Construct validity includes convergent,
discriminant and nomological validity.
• Convergent validity is the extent to which the scale
correlates positively with other measurements of the
same construct.
• Discriminant validity is the extent to which a measure
does not correlate with other constructs from which it is
supposed to differ.
• Nomological validity is the extent to which the scale
correlates in theoretically predicted ways with measures
of different but related constructs.
Relationship between reliability and validity

• If a measure is perfectly valid, it is also perfectly reliable. In this case XO = XT, XR = 0 and XS = 0.
• If a measure is unreliable, it cannot be perfectly valid, since at a minimum XO = XT + XR. Furthermore, systematic error may also be present, i.e. XS ≠ 0. Thus, unreliability implies invalidity.
• If a measure is perfectly reliable, it may or may not be perfectly valid, because systematic error may still be present (XO = XT + XS).
• Reliability is a necessary, but not sufficient, condition for validity.
