Lecture 8
Correlation and Simple Regression Analysis
Copyright © 2018, 2012, 2007 Pearson Education, Inc. All Rights Reserved
Product moment correlation

• The product moment correlation, r, summarises the strength of association between two metric (interval or ratio scaled) variables, say X and Y.
• It is an index used to determine whether a linear or straight-line relationship exists between X and Y.
• As it was originally proposed by Karl Pearson, it is also known as the Pearson correlation coefficient.
• It is also referred to as simple correlation, bivariate correlation or merely the correlation coefficient.
From a sample of n observations, X and Y, the product moment correlation, r, can be calculated as:

$$ r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2 \sum_{i=1}^{n}(Y_i - \bar{Y})^2}} $$

Division of the numerator and denominator by (n − 1) gives:

$$ r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) / (n-1)}{\sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}} \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n-1}}} = \frac{COV_{xy}}{s_x s_y} $$
• r varies between −1.0 and +1.0.
• The correlation coefficient between two variables will be the same regardless of their underlying units of measurement.
Table 22.1
Explaining attitude towards the city of residence
The correlation coefficient may be calculated as follows:

X̄ = (10 + 12 + 12 + 4 + 12 + 6 + 8 + 2 + 18 + 9 + 17 + 2)/12 = 9.333

Ȳ = (6 + 9 + 8 + 3 + 10 + 4 + 5 + 2 + 11 + 9 + 10 + 2)/12 = 6.583

$\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$
= (10 − 9.33)(6 − 6.58) + (12 − 9.33)(9 − 6.58) + (12 − 9.33)(8 − 6.58) + (4 − 9.33)(3 − 6.58) + (12 − 9.33)(10 − 6.58) + (6 − 9.33)(4 − 6.58) + (8 − 9.33)(5 − 6.58) + (2 − 9.33)(2 − 6.58) + (18 − 9.33)(11 − 6.58) + (9 − 9.33)(9 − 6.58) + (17 − 9.33)(10 − 6.58) + (2 − 9.33)(2 − 6.58)
= −0.3886 + 6.4614 + 3.7914 + 19.0814 + 9.1314 + 8.5914 + 2.1014 + 33.5714 + 38.3214 − 0.7986 + 26.2314 + 33.5714
= 179.6668
$\sum_{i=1}^{n}(X_i - \bar{X})^2$
= (10 − 9.33)² + (12 − 9.33)² + (12 − 9.33)² + (4 − 9.33)² + (12 − 9.33)² + (6 − 9.33)² + (8 − 9.33)² + (2 − 9.33)² + (18 − 9.33)² + (9 − 9.33)² + (17 − 9.33)² + (2 − 9.33)²
= 0.4489 + 7.1289 + 7.1289 + 28.4089 + 7.1289 + 11.0889 + 1.7689 + 53.7289 + 75.1689 + 0.1089 + 58.8289 + 53.7289
= 304.6668

$\sum_{i=1}^{n}(Y_i - \bar{Y})^2$
= (6 − 6.58)² + (9 − 6.58)² + (8 − 6.58)² + (3 − 6.58)² + (10 − 6.58)² + (4 − 6.58)² + (5 − 6.58)² + (2 − 6.58)² + (11 − 6.58)² + (9 − 6.58)² + (10 − 6.58)² + (2 − 6.58)²
= 0.3364 + 5.8564 + 2.0164 + 12.8164 + 11.6964 + 6.6564 + 2.4964 + 20.9764 + 19.5364 + 5.8564 + 11.6964 + 20.9764
= 120.9168

Thus,

$$ r = \frac{179.6668}{\sqrt{(304.6668)(120.9168)}} = 0.9361 $$
Decomposition of the total variation

$$ r^2 = \frac{\text{explained variation}}{\text{total variation}} = \frac{SS_x}{SS_y} = \frac{\text{total variation} - \text{error variation}}{\text{total variation}} = \frac{SS_y - SS_{error}}{SS_y} $$
• When it is computed for a population rather than a sample, the product moment correlation is denoted by ρ, the Greek letter rho. The coefficient r is an estimator of ρ.
• The statistical significance of the relationship between two variables measured by using r can be conveniently tested. The hypotheses are:

H0: ρ = 0
H1: ρ ≠ 0
The test statistic is:

$$ t = r\sqrt{\frac{n-2}{1-r^2}} $$

which has a t distribution with n − 2 degrees of freedom. For the correlation coefficient calculated from the data given in Table 22.1,

$$ t = 0.9361\sqrt{\frac{12-2}{1-(0.9361)^2}} = 8.414 $$

and the degrees of freedom are df = 12 − 2 = 10. From the t distribution table (Table 4 in the statistical appendix), the critical value of t for a two-tailed test and α = 0.05 is 2.228. Since 8.414 exceeds 2.228, the null hypothesis of no relationship between X and Y is rejected.
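To make these calculations concrete, here is a minimal Python sketch (assuming NumPy is available; the two vectors are the twelve duration/attitude observations used above) that reproduces r and the t statistic:

```python
import numpy as np

# Twelve observations from Table 22.1: X = duration of residence, Y = attitude.
x = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2], dtype=float)
y = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2], dtype=float)
n = len(x)

# Product moment correlation from the deviation cross-products.
sxy = np.sum((x - x.mean()) * (y - y.mean()))   # ≈ 179.67
sxx = np.sum((x - x.mean()) ** 2)               # ≈ 304.67
syy = np.sum((y - y.mean()) ** 2)               # ≈ 120.92
r = sxy / np.sqrt(sxx * syy)                    # ≈ 0.9361

# t statistic with n − 2 degrees of freedom.
t = r * np.sqrt((n - 2) / (1 - r ** 2))         # ≈ 8.41
print(r, t)
```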
Partial correlation

A partial correlation coefficient measures the association between two variables after controlling for or adjusting for the effects of one or more additional variables.

$$ r_{xy.z} = \frac{r_{xy} - (r_{xz})(r_{yz})}{\sqrt{1 - r_{xz}^2}\sqrt{1 - r_{yz}^2}} $$

• Partial correlations have an order associated with them. The order indicates how many variables are being adjusted or controlled.
• The simple correlation coefficient, r, has a zero-order, as it does not control for any additional variables while measuring the association between two variables.
Partial correlation

• The coefficient r_xy.z is a first-order partial correlation coefficient, as it controls for the effect of one additional variable, Z.
• A second-order partial correlation coefficient controls for the effects of two variables, a third-order for the effects of three variables and so on.
• The special case when a partial correlation is larger than its respective zero-order correlation involves a suppressor effect.
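The first-order formula translates directly into code. A sketch, assuming the three zero-order correlations have already been computed (the numeric values passed in are hypothetical, chosen only to illustrate):

```python
import math

def partial_corr(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation r_xy.z: the association between
    X and Y after controlling for Z, from the three zero-order correlations."""
    return (r_xy - r_xz * r_yz) / (
        math.sqrt(1 - r_xz ** 2) * math.sqrt(1 - r_yz ** 2)
    )

# Hypothetical illustration: a strong X-Y correlation that weakens
# once a common influence Z is partialled out.
print(partial_corr(0.9361, 0.80, 0.75))
```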
Regression analysis

Regression analysis examines associative relationships between a metric dependent variable and one or more independent variables in the following ways:
• Determine whether the independent variables explain a significant variation in the dependent variable: whether a relationship exists.
• Determine how much of the variation in the dependent variable can be explained by the independent variables: strength of the relationship.
• Determine the structure or form of the relationship: the mathematical equation relating the independent and dependent variables.
• Predict the values of the dependent variable.
• Control for other independent variables when evaluating the contributions of a specific variable or set of variables.

Regression analysis is concerned with the nature and degree of association between variables and does not imply or assume any causality.
Statistics associated with bivariate regression analysis

• Bivariate regression model. The basic regression equation is Yi = β0 + β1Xi + ei, where Y = dependent or criterion variable, X = independent or predictor variable, β0 = intercept of the line, β1 = slope of the line, and ei is the error term associated with the ith observation.
• Coefficient of determination. The strength of association is measured by the coefficient of determination, r². It varies between 0 and 1 and signifies the proportion of the total variation in Y that is accounted for by the variation in X.
• Estimated or predicted value. The estimated or predicted value of Yi is Ŷi = a + bXi, where Ŷi is the predicted value of Yi, and a and b are estimators of β0 and β1, respectively.
• Regression coefficient. The estimated parameter, b, is
usually referred to as the non-standardised regression
coefficient.
• Scattergram. A scatter diagram, or scattergram, is a plot
of the values of two variables for all the cases or
observations.
• Standard error of estimate. This statistic, SEE, is the
standard deviation of the actual Y values from the
predicted Ŷ values.
• Standard error. The standard deviation of b, SEb, is
called the standard error.
• Standardised regression coefficient. Also termed the
beta coefficient or beta weight, this is the slope obtained
by the regression of Y on X when the data are
standardised.
• Sum of squared errors. The distances of all the points from the regression line are squared and added together to arrive at the sum of squared errors, which is a measure of total error, Σe²ⱼ.
• t statistic. A t statistic with n − 2 degrees of freedom can be used to test the null hypothesis that no linear relationship exists between X and Y, or H0: β1 = 0, where t = b/SEb.
Conducting bivariate regression analysis
plot the scatter diagram

• A scatter diagram, or scattergram, is a plot of the values of two variables for all the cases or observations.
• The most commonly used technique for fitting a straight line to a scattergram is the least squares procedure. In fitting the line, the least squares procedure minimises the sum of squared errors, Σe²ⱼ.
Figure 22.2
Conducting bivariate regression
analysis

Conducting bivariate regression analysis
formulate the bivariate regression model

In the bivariate regression model, the general form of a straight line is:

Y = β0 + β1X

where
Y = dependent or criterion variable
X = independent or predictor variable
β0 = intercept of the line
β1 = slope of the line.

The regression procedure adds an error term to account for the probabilistic or stochastic nature of the relationship:

Yi = β0 + β1Xi + ei

where ei is the error term associated with the ith observation.
Figure 22.3
Plot of attitude with duration

Figure 22.4
Which straight line is best?

Figure 22.5
Bivariate regression

Figure 22.6
Decomposition of the total variation in
bivariate regression

Conducting bivariate regression analysis
estimate the parameters

In most cases, β0 and β1 are unknown and are estimated from the sample observations using the equation:

Ŷi = a + bXi

where Ŷi is the estimated or predicted value of Yi, and a and b are estimators of β0 and β1, respectively.

$$ b = \frac{COV_{xy}}{s_x^2} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} $$
The intercept, a, may then be calculated using:

a = Ȳ − bX̄

For the data in Table 22.1, the estimation of parameters may be illustrated as follows:

$\sum_{i=1}^{12} X_i Y_i$ = (10)(6) + (12)(9) + (12)(8) + (4)(3) + (12)(10) + (6)(4) + (8)(5) + (2)(2) + (18)(11) + (9)(9) + (17)(10) + (2)(2) = 917

$\sum_{i=1}^{12} X_i^2$ = 10² + 12² + 12² + 4² + 12² + 6² + 8² + 2² + 18² + 9² + 17² + 2² = 1350
It may be recalled from earlier calculations of the simple correlation that:

X̄ = 9.333
Ȳ = 6.583

Given n = 12, b can be calculated as:

$$ b = \frac{917 - (12)(9.333)(6.583)}{1350 - (12)(9.333)^2} = 0.5897 $$

a = Ȳ − bX̄ = 6.583 − (0.5897)(9.333) = 1.0793
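The slope and intercept formulas above translate directly into code. A minimal sketch, assuming NumPy and the Table 22.1 data:

```python
import numpy as np

x = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2], dtype=float)
y = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2], dtype=float)
n = len(x)

# Slope from the raw-sums form: b = (ΣXiYi − nX̄Ȳ) / (ΣXi² − nX̄²).
b = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
# Intercept: a = Ȳ − bX̄.
a = y.mean() - b * x.mean()
print(b, a)  # ≈ 0.5897, ≈ 1.0793
```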
Conducting bivariate regression analysis
estimate the standardised regression coefficient

• Standardisation is the process by which the raw data are transformed into new variables that have a mean of 0 and a variance of 1.
• When the data are standardised, the intercept assumes a value of 0.
• The term beta coefficient or beta weight is used to denote the standardised regression coefficient. In the bivariate case:

Byx = Bxy = rxy

• There is a simple relationship between the standardised and non-standardised regression coefficients:

Byx = byx (sx/sy)
Conducting bivariate regression analysis
test for significance

The statistical significance of the linear relationship between X and Y may be tested by examining the hypotheses:

H0: β1 = 0
H1: β1 ≠ 0

A t statistic with n − 2 degrees of freedom can be used, where

t = b/SEb

SEb denotes the standard deviation of b and is called the standard error.
Conducting bivariate regression analysis
test for significance

Using quantitative data analysis software, the regression of attitude on duration of residence, using the data shown in Table 22.1, yielded the results shown in Table 22.2. The intercept, a, equals 1.0793, and the slope, b, equals 0.5897. Therefore, the estimated equation is:

Attitude (Ŷ) = 1.0793 + 0.5897 (Duration of residence)

The standard error, or standard deviation of b, is estimated as 0.07008, and the value of the t statistic is t = 0.5897/0.0701 = 8.414, with n − 2 = 10 degrees of freedom.

From Table 4 in the statistical appendix, we see that the critical value of t with 10 degrees of freedom and α = 0.05 is 2.228 for a two-tailed test. Since the calculated value of t is larger than the critical value, the null hypothesis is rejected.
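In practice, the output in Table 22.2 would come from software. A sketch using scipy's linregress (assuming scipy is installed) that should reproduce the figures quoted above:

```python
import numpy as np
from scipy import stats

x = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2], dtype=float)
y = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2], dtype=float)

res = stats.linregress(x, y)
print(res.slope, res.intercept)  # ≈ 0.5897, 1.0793
print(res.rvalue)                # ≈ 0.9361
t = res.slope / res.stderr       # ≈ 8.41, with df = n − 2 = 10
print(t, res.pvalue)
```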
Conducting bivariate regression analysis
determine the strength and significance of association

The total variation, SSy, may be decomposed into the variation accounted for by the regression line, SSreg, and the error or residual variation, SSerror or SSres, as follows:

SSy = SSreg + SSres

where

$$ SS_y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \qquad SS_{reg} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \qquad SS_{res} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 $$
The strength of association may then be calculated as follows:

$$ r^2 = \frac{SS_{reg}}{SS_y} = \frac{SS_y - SS_{res}}{SS_y} $$

To illustrate the calculations of r², let us consider again the effect of the duration of residence on attitude towards the city. It may be recalled from earlier calculations of the simple correlation coefficient that:

$$ SS_y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = 120.9168 $$
The predicted values (Ŷ) can be calculated using the regression equation:

Attitude (Ŷ) = 1.0793 + 0.5897 (duration of residence)

For the first observation in Table 22.1, this value is:

Ŷ = 1.0793 + 0.5897 × 10 = 6.9763

For each successive observation, the predicted values are, in order, 8.1557, 8.1557, 3.4381, 8.1557, 4.6175, 5.7969, 2.2587, 11.6939, 6.3866, 11.1042 and 2.2587.
Therefore,

$$ SS_{reg} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 $$

= (6.9763 − 6.5833)² + (8.1557 − 6.5833)² + (8.1557 − 6.5833)² + (3.4381 − 6.5833)² + (8.1557 − 6.5833)² + (4.6175 − 6.5833)² + (5.7969 − 6.5833)² + (2.2587 − 6.5833)² + (11.6939 − 6.5833)² + (6.3866 − 6.5833)² + (11.1042 − 6.5833)² + (2.2587 − 6.5833)²
= 0.1544 + 2.4724 + 2.4724 + 9.8922 + 2.4724 + 3.8643 + 0.6184 + 18.7021 + 26.1182 + 0.0387 + 20.4385 + 18.7021
= 105.9524
$$ SS_{res} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 $$

= (6 − 6.9763)² + (9 − 8.1557)² + (8 − 8.1557)² + (3 − 3.4381)² + (10 − 8.1557)² + (4 − 4.6175)² + (5 − 5.7969)² + (2 − 2.2587)² + (11 − 11.6939)² + (9 − 6.3866)² + (10 − 11.1042)² + (2 − 2.2587)²
= 14.9644

It can be seen that SSy = SSreg + SSres. Furthermore,

$$ r^2 = \frac{SS_{reg}}{SS_y} = \frac{105.9524}{120.9168} = 0.8762 $$
Another, equivalent test for examining the significance of the linear relationship between X and Y (significance of b) is the test for the significance of the coefficient of determination. The hypotheses in this case are:

H0: R²pop = 0
H1: R²pop > 0
The appropriate test statistic is the F statistic:

$$ F = \frac{SS_{reg}}{SS_{res}/(n-2)} $$

which has an F distribution with 1 and (n − 2) df. The F test is a generalised form of the t test. If a random variable is t distributed with n degrees of freedom, then t² is F distributed with 1 and n df. Hence, the F test for testing the significance of the coefficient of determination is equivalent to testing the following hypotheses:

H0: β1 = 0
H1: β1 ≠ 0

or

H0: ρ = 0
H1: ρ ≠ 0
From Table 22.2, it can be seen that:

$$ r^2 = \frac{105.9524}{105.9524 + 14.9644} = 0.8762 $$

which is the same as the value calculated earlier. The value of the F statistic is:

$$ F = \frac{105.9524}{14.9644/10} = 70.80 $$

with 1 and 10 df. The calculated F statistic exceeds the critical value of 4.96 determined from Table 5 in the statistical appendix. Therefore, the relationship is significant at α = 0.05, corroborating the results of the t test.
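A short sketch (Python with NumPy, reusing the fitted equation above) that reproduces this decomposition:

```python
import numpy as np

x = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2], dtype=float)
y = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2], dtype=float)
n = len(x)

y_hat = 1.0793 + 0.5897 * x                 # predicted values
ss_y = np.sum((y - y.mean()) ** 2)          # total variation ≈ 120.92
ss_reg = np.sum((y_hat - y.mean()) ** 2)    # explained variation ≈ 105.95
ss_res = np.sum((y - y_hat) ** 2)           # residual variation ≈ 14.96
# With rounded coefficients the identity SSy = SSreg + SSres holds approximately.

r2 = ss_reg / ss_y                          # ≈ 0.8762
F = ss_reg / (ss_res / (n - 2))             # ≈ 70.8, with 1 and 10 df
print(r2, F)
```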
Table 22.2
Bivariate regression

Conducting bivariate regression analysis
check prediction accuracy

To estimate the accuracy of predicted values, Ŷ, it is useful to calculate the standard error of estimate, SEE:

$$ SEE = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-2}} = \sqrt{\frac{SS_{res}}{n-2}} $$

or, more generally, if there are k independent variables,

$$ SEE = \sqrt{\frac{SS_{res}}{n-k-1}} $$

For the data given in Table 22.2, the SEE is estimated as:

$$ SEE = \sqrt{14.9644/(12-2)} = 1.22329 $$
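A self-contained sketch (NumPy assumed) using the general n − k − 1 form with k = 1:

```python
import numpy as np

x = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2], dtype=float)
y = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2], dtype=float)

residuals = y - (1.0793 + 0.5897 * x)
n, k = len(y), 1                   # one independent variable
see = np.sqrt(np.sum(residuals ** 2) / (n - k - 1))
print(see)                         # ≈ 1.2233
```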
Assumptions

• The error term is normally distributed. For each fixed value of X, the distribution of Y is normal.
• The means of all these normal distributions of Y, given X, lie on a straight line with slope b.
• The mean of the error term is 0.
• The variance of the error term is constant. This variance does not depend on the values assumed by X.
• The error terms are uncorrelated. In other words, the observations have been drawn independently.

These assumptions can be checked informally from the residuals, as sketched below.
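A minimal sketch of such residual checks (assuming scipy is available; the Shapiro-Wilk test is one common normality check, used here purely as an illustration):

```python
import numpy as np
from scipy import stats

x = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2], dtype=float)
y = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2], dtype=float)
residuals = y - (1.0793 + 0.5897 * x)    # errors around the fitted line

print(residuals.mean())                  # should be close to 0
w, p = stats.shapiro(residuals)          # normality: small p suggests non-normality
print(w, p)

# Constant variance: compare residual spread for low versus high X.
low = residuals[x <= np.median(x)]
high = residuals[x > np.median(x)]
print(low.std(ddof=1), high.std(ddof=1))
```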
Lecture 4
Cross-tabulation

• While a frequency distribution describes one variable at a time, a cross-tabulation describes two or more variables simultaneously.
• Cross-tabulation results in tables that reflect the joint distribution of two or more variables with a limited number of categories or distinct values, for example, Table 20.3.
Table 20.3
Gender and internet usage

Two variables cross-tabulation

• Since two variables have been cross-classified, percentages could be computed either column-wise, based on column totals (Table 20.4), or row-wise, based on row totals (Table 20.5).
• The general rule is to compute the percentages in the direction of the independent variable, across the dependent variable. The correct way of calculating percentages is as shown in Table 20.4; a short sketch of this computation follows below.
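A sketch using pandas (assumed installed), with hypothetical respondent-level data echoing Table 20.3, showing percentages computed in the direction of the independent variable (gender):

```python
import pandas as pd

# Hypothetical data echoing Table 20.3: 15 males (10 heavy, 5 light internet
# users) and 15 females (5 heavy, 10 light).
df = pd.DataFrame({
    "gender": ["male"] * 15 + ["female"] * 15,
    "usage": ["heavy"] * 10 + ["light"] * 5 + ["heavy"] * 5 + ["light"] * 10,
})

counts = pd.crosstab(df["usage"], df["gender"])
print(counts)

# Column-wise percentages: each gender column sums to 100%.
print(pd.crosstab(df["usage"], df["gender"], normalize="columns") * 100)
```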
Table 20.4
Gender and internet usage – column
totals

Table 20.5
Gender and internet usage – row totals

Three variables cross-tabulation
refine an initial relationship

As shown in Figure 20.7, the introduction of a third variable can result in four possibilities:
• As can be seen from Table 20.6, 52% of unmarried participants fell in the high-purchase category, as opposed to 31% of the married participants. Before concluding that unmarried participants purchase more designer clothing than those who are married, a third variable, the buyer's gender, was introduced into the analysis.
• As shown in Table 20.7, in the case of females, 60% of the unmarried participants fall in the high-purchase category, as compared to 25% of those who are married. On the other hand, the percentages are much closer for males, with 40% of the unmarried participants and 35% of the married participants falling in the high-purchase category.
• Hence, the introduction of gender (third variable) has refined the relationship between marital status and purchase of luxury branded clothing (original variables). Unmarried participants are more likely to fall in the high-purchase category than married ones, and this effect is much more pronounced for females than for males.
Table 20.6
Purchase of luxury branded clothing by
marital status

Table 20.7
Purchase of luxury branded clothing by
marital status and gender

Three variables cross-tabulation
initial relationship was spurious

• Table 20.8 shows that 32% of those with university degrees own an expensive car, as compared to 21% of those without university degrees. Realising that income may also be a factor, the researcher decided to re-examine the relationship between education and ownership of expensive cars in the light of income level.
• In Table 20.9, the percentages of those with and without university degrees who own expensive automobiles are the same for each income group. When the data for the high-income and low-income groups are examined separately, the association between education and ownership of expensive automobiles disappears, indicating that the initial relationship observed between these two variables was spurious.
Table 20.8
Ownership of expensive cars by
education level

Table 20.9
Ownership of expensive cars by
education and income levels

Three variables cross-tabulation
reveal suppressed association
• Table 20.10 shows no association between desire to travel abroad
and age.
• When gender was introduced as the third variable, Table 20.11
was obtained. Among men, 60% of those under 45 indicated a
desire to travel abroad, as compared to 40% of those 45 or older.
The pattern was reversed for women, where 35% of those under
45 indicated a desire to travel abroad as opposed to 65% of those
45 or older.
• Since the association between desire to travel abroad and age
runs in the opposite direction for males and females, the
relationship between these two variables is masked when the data
are aggregated across gender as in Table 20.10.
• But when the effect of gender is controlled, as in Table 20.11, the
suppressed association between desire to travel abroad and age
is revealed for the separate categories of males and females.
Table 20.10
Desire to travel abroad by age

Table 20.11
Desire to travel abroad by age and
gender

Statistics associated with cross-tabulation

• To determine whether a systematic association exists, the probability of obtaining a value of chi-square as large as or larger than the one calculated from the cross-tabulation is estimated.
• An important characteristic of the chi-square statistic is the number of degrees of freedom (df) associated with it. That is, df = (r − 1) × (c − 1).
• The null hypothesis (H0) of no association between the two variables will be rejected only when the calculated value of the test statistic is greater than the critical value of the chi-square distribution with the appropriate degrees of freedom, as shown in Figure 20.8.
Figure 20.8
Chi-square test of association

Statistics associated with cross-tabulation
chi-square

• The chi-square statistic (χ²) is used to test the statistical significance of the observed association in a cross-tabulation.
• The expected frequency for each cell can be calculated by using a simple formula:

$$ f_e = \frac{n_r n_c}{n} $$

where nr = total number in the row, nc = total number in the column, and n = total sample size.
For the data in Table 20.3, the expected frequencies for the cells, going from left to right and from top to bottom, are:

15 × 15/30 = 7.50, 15 × 15/30 = 7.50, 15 × 15/30 = 7.50, 15 × 15/30 = 7.50

Then the value of χ² is calculated as follows:

$$ \chi^2 = \sum_{\text{all cells}} \frac{(f_o - f_e)^2}{f_e} $$
For the data in Table 20.3, the value of χ² is calculated as:

χ² = (5 − 7.5)²/7.5 + (10 − 7.5)²/7.5 + (10 − 7.5)²/7.5 + (5 − 7.5)²/7.5
= 0.833 + 0.833 + 0.833 + 0.833
= 3.333
• The chi-square distribution is a skewed distribution whose
shape depends solely on the number of degrees of freedom.
As the number of degrees of freedom increases, the chi-square
distribution becomes more symmetrical.
• Table 3 in the statistical appendix contains upper-tail areas of
the chi-square distribution for different degrees of freedom. For
1 degree of freedom the probability of exceeding a chi-square
value of 3.841 is 0.05.
• For the cross-tabulation given in Table 20.3, there are
(2 − 1) × (2 − 1) = 1 degree of freedom. The calculated chi-
square statistic had a value of 3.333. Since this is less than the
critical value of 3.841, the null hypothesis of no association
cannot be rejected, indicating that the association is not
statistically significant at the 0.05 level.
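The same test can be run with scipy (assumed installed). Note that chi2_contingency applies Yates' continuity correction to 2 × 2 tables by default, so correction=False is needed to reproduce the 3.333 computed above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies from Table 20.3 (rows: usage, columns: gender).
observed = np.array([[5, 10],
                     [10, 5]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected)      # all four cells: 7.5
print(chi2, dof, p)  # ≈ 3.333 with 1 df, p ≈ 0.07 → not significant at 0.05
```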
Lecture 3
Descriptive Statistics and
Hypothesis Testing

Frequency distribution

• In a frequency distribution, one variable is considered at a time.
• A frequency distribution for a variable produces a table of frequency counts, percentages and cumulative percentages for all the values associated with that variable.
Table 20.2
Frequency distribution of ‘Familiarity
with the internet’

Figure 20.1
Frequency histogram

Statistics associated with frequency distribution
measures of location

• The mean, or average value, is the most commonly used measure of central tendency. The mean, X̄, is given by:

$$ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} $$

where Xi = observed values of the variable X, and n = number of observations (sample size).

• The mode is the value that occurs most frequently. It represents the highest peak of the distribution. The mode is a good measure of location when the variable is inherently categorical or has otherwise been grouped into categories.
• The median of a sample is the middle value when the data are
arranged in ascending or descending order. If the number of data
points is even, the median is usually estimated as the midpoint
between the two middle values – by adding the two middle values
and dividing their sum by 2. The median is the 50th percentile.

Statistics associated with frequency distribution
measures of variability

• The range measures the spread of the data. It is simply the difference between the largest and smallest values in the sample:

Range = Xlargest − Xsmallest

• The interquartile range is the difference between the 75th and 25th percentiles. For a set of data points arranged in order of magnitude, the pth percentile is the value that has p% of the data points below it and (100 − p)% above it.
• The variance is the mean squared deviation from the mean. The variance can never be negative.
• The standard deviation is the square root of the variance:

$$ s_x = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}} $$

• The coefficient of variation is the ratio of the standard deviation to the mean expressed as a percentage, and it is a unitless measure of relative variability:

$$ CV = \frac{s_x}{\bar{X}} $$
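A sketch computing these measures in Python (NumPy assumed; the data vector reuses the twelve attitude scores from the regression lecture, purely for illustration):

```python
import numpy as np
from statistics import mode

data = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2], dtype=float)

mean = data.mean()
median = np.median(data)
most_frequent = mode(data.tolist())                     # the mode
data_range = data.max() - data.min()                    # largest minus smallest
iqr = np.percentile(data, 75) - np.percentile(data, 25) # interquartile range
variance = data.var(ddof=1)                             # (n − 1) denominator
sd = data.std(ddof=1)                                   # square root of variance
cv = sd / mean                                          # coefficient of variation

print(mean, median, most_frequent, data_range, iqr, variance, sd, cv)
```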
Statistics associated with frequency distribution
measures of shape

• Skewness. The tendency of the deviations from the mean to be larger in one direction than in the other. It can be thought of as the tendency for one tail of the distribution to be heavier than the other.
• Kurtosis is a measure of the relative peakedness or flatness of the curve defined by the frequency distribution. The kurtosis of a normal distribution is zero. If the kurtosis is positive, then the distribution is more peaked than a normal distribution. A negative value means that the distribution is flatter than a normal distribution.
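A sketch using scipy (assumed available); scipy's kurtosis function reports excess kurtosis by default, matching the convention above of zero for a normal distribution:

```python
import numpy as np
from scipy import stats

data = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2], dtype=float)

print(stats.skew(data))      # > 0: heavier right tail; < 0: heavier left tail
print(stats.kurtosis(data))  # excess kurtosis: 0 for a normal distribution
```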
Figure 20.2
Skewness of a distribution

Hypothesis testing

• A null hypothesis is a statement of the status quo, one of no difference or no effect. If the null hypothesis is not rejected, no changes will be made.
• An alternative hypothesis is one in which some difference or effect is expected. Accepting the alternative hypothesis will lead to changes in opinions or actions.
• The null hypothesis refers to a specified value of the population parameter (e.g. μ, σ, π), not a sample statistic (e.g. X̄).
• A null hypothesis may be rejected, but it can never be accepted based on a single test. In classical hypothesis testing, there is no way to determine whether the null hypothesis is true.
• The null hypothesis is formulated in such a way that its rejection leads to the acceptance of the desired conclusion. The alternative hypothesis represents the conclusion for which evidence is sought. For example:

H0: π ≤ 0.40
H1: π > 0.40
A general procedure for hypothesis testing
step 1: Formulate the hypothesis

• The test of the null hypothesis above is a one-tailed test, because the alternative hypothesis is expressed directionally. If that is not the case, then a two-tailed test would be required, and the hypotheses would be expressed as:

H0: π = 0.40
H1: π ≠ 0.40
A general procedure for hypothesis testing
step 2: Select an appropriate statistical technique

• The test statistic measures how close the sample has come to the null hypothesis.
• The test statistic often follows a well-known distribution, such as the normal, t, or chi-square distribution.
• In our example, the z statistic, which follows the standard normal distribution, would be appropriate:

$$ z = \frac{p - \pi}{\sigma_p} \qquad \text{where} \qquad \sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}} $$
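A sketch of this one-sample proportion z test in plain Python (the sample size and observed proportion are hypothetical, chosen only to illustrate the mechanics):

```python
import math

pi0 = 0.40   # hypothesised population proportion under H0
n = 500      # hypothetical sample size
p = 0.45     # hypothetical observed sample proportion

sigma_p = math.sqrt(pi0 * (1 - pi0) / n)  # standard error of the proportion
z = (p - pi0) / sigma_p
print(z)  # compare with the critical value 1.645 for a one-tailed test at α = 0.05
```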
A general procedure for hypothesis testing
step 3: Choose the level of significance, α

Type I error
• Type I error occurs when the sample results lead to the rejection of the null hypothesis when it is in fact true.
• The probability of type I error (α) is also called the level of significance.

Type II error
• Type II error occurs when, based on the sample results, the null hypothesis is not rejected when it is in fact false.
• The probability of type II error is denoted by β.
• Unlike α, which is specified by the researcher, the magnitude of β depends on the actual value of the population parameter (proportion).
Figure 20.6
A broad classification of hypothesis
testing procedures

Measurement and Scaling

When you can measure what you are speaking about and express it in numbers, you know something about it.
– Lord Kelvin
Measurement and scaling

Measurement means assigning numbers or other symbols to characteristics of objects according to certain pre-specified rules.
– There should be one-to-one correspondence between the numbers and the characteristics being measured.
– The rules for assigning numbers should be standardised and applied uniformly.
– Rules must not change over objects or time.
Scaling involves creating a continuum upon
which measured objects are located.
Consider an attitude scale from 1 to 100. Each
respondent is assigned a number from 1 to 100,
with 1 = extremely unfavourable, and 100 =
extremely favourable. Measurement is the actual
assignment of a number from 1 to 100 to each
respondent. Scaling is the process of placing the
respondents on a continuum, for example, with
respect to their attitude towards Formula One
racing.
Figure 12.1
An illustration of primary scales of
measurement

Nominal scale

• The numbers serve only as labels or tags for identifying and classifying objects.
• When used for identification, there is a strict one-to-one correspondence between the numbers and the objects.
• The numbers do not reflect the amount of the characteristic possessed by the objects.
• The only permissible operation on the numbers in a nominal scale is counting.
• Only a limited number of statistics, all of which are based on frequency counts, are permissible, e.g. percentages and mode.
Table 12.2
Illustration of primary scales of
measurement

Ordinal scale

• A ranking scale in which numbers are assigned to objects to indicate the relative extent to which the objects possess some characteristic.
• Can determine whether an object has more or less of a characteristic than some other object, but not how much more or less.
• Any series of numbers can be assigned that preserves the ordered relationships between the objects.
• In addition to the counting operation allowable for nominal scale data, ordinal scales permit the use of statistics based on centiles, e.g. percentile, quartile and median.
Interval scale

• Numerically equal distances on the scale represent equal values in the characteristic being measured.
• It permits comparison of the differences between objects.
• The location of the zero point is not fixed. Both the zero point and the units of measurement are arbitrary.
• Any positive linear transformation of the form y = a + bx will preserve the properties of the scale.
• It is not meaningful to take ratios of scale values.
• Statistical techniques that may be used include all of those that can be applied to nominal and ordinal data, in addition to the arithmetic mean, standard deviation and other statistics commonly used in marketing research.
Ratio scale

• Possesses all the properties of the nominal, ordinal and interval scales.
• It has an absolute zero point.
• It is meaningful to compute ratios of scale values.
• Only proportionate transformations of the form y = bx, where b is a positive constant, are allowed.
• All statistical techniques can be applied to ratio data.
Table 12.1
Primary scales of measurement

Figure 12.2
A classification of scaling techniques

A comparison of scaling techniques

• Comparative scales involve the direct comparison of stimulus objects. Comparative scale data must be interpreted in relative terms and have only ordinal or rank order properties.
• In non-comparative scales, each object is scaled independently of the others in the stimulus set. The resulting data are generally assumed to be interval or ratio scaled.
Table 12.4
Basic non-comparative scales

Itemised rating scales

• The respondents are provided with a scale that has a number or brief description associated with each category.
• The categories are ordered in terms of scale position, and the respondents are required to select the specified category that best describes the object being rated.
• The commonly used itemised rating scales are the Likert, semantic differential and Stapel scales.
Figure 12.6
Continuous rating scale

Figure 12.7
The Likert scale

Figure 12.8
Semantic differential scale

Figure 12.9
The Stapel scale

Table 12.6
Some commonly used scales in
marketing

Figure 12.13
Development of a multi-item scale

Figure 12.14
Scale evaluation

Measurement accuracy

The true score model provides a framework for understanding the accuracy of measurement:

XO = XT + XS + XR

where
XO = the observed score or measurement
XT = the true score of the characteristic
XS = systematic error
XR = random error.
Potential sources of error in measurement
Reliability

• Reliability can be defined as the extent to which measures are free from random error, XR. If XR = 0, the measure is perfectly reliable.
• In test-retest reliability, respondents are administered identical sets of scale items at two different times and the degree of similarity between the two measurements is determined.
• In alternative-forms reliability, two equivalent forms of the scale are constructed and the same respondents are measured at two different times, with a different form being used each time.
• Internal consistency reliability determines the extent
to which different parts of a summated scale are
consistent in what they indicate about the characteristic
being measured.
• In split-half reliability, the items on the scale are
divided into two halves and the resulting half scores are
correlated.
• The coefficient alpha, or Cronbach’s alpha, is the
average of all possible split-half coefficients resulting
from different ways of splitting the scale items. This
coefficient varies from 0 to 1, and a value of 0.6 or less
generally indicates unsatisfactory internal consistency
reliability.
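As an illustration, coefficient alpha is commonly computed from item and total-score variances rather than by averaging split-half coefficients directly; a minimal sketch of that standard computational form (NumPy assumed, with a small hypothetical respondents × items matrix):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical 6 respondents × 5 Likert items, purely for illustration.
scores = np.array([
    [4, 5, 4, 4, 5],
    [2, 2, 3, 2, 2],
    [5, 4, 5, 5, 4],
    [3, 3, 3, 4, 3],
    [1, 2, 1, 2, 2],
    [4, 4, 5, 4, 4],
])
print(cronbach_alpha(scores))  # values of 0.6 or below suggest poor reliability
```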
Validity

• The validity of a scale may be defined as the extent to which differences in observed scale scores reflect true differences among objects on the characteristic being measured, rather than systematic or random error. Perfect validity requires that there be no measurement error (XO = XT, XR = 0, XS = 0).
• Content validity is a subjective but systematic evaluation of how well the content of a scale represents the measurement task at hand.
• Criterion validity reflects whether a scale performs as expected in relation to other variables selected (criterion variables) as meaningful criteria.
• Construct validity addresses the question of what
construct or characteristic the scale is, in fact,
measuring. Construct validity includes convergent,
discriminant and nomological validity.
• Convergent validity is the extent to which the scale
correlates positively with other measurements of the
same construct.
• Discriminant validity is the extent to which a measure
does not correlate with other constructs from which it is
supposed to differ.
• Nomological validity is the extent to which the scale
correlates in theoretically predicted ways with measures
of different but related constructs.
Relationship between reliability and validity

• If a measure is perfectly valid, it is also perfectly reliable. In this case XO = XT, XR = 0 and XS = 0.
• If a measure is unreliable, it cannot be perfectly valid, since at a minimum XO = XT + XR. Furthermore, systematic error may also be present, i.e. XS ≠ 0. Thus, unreliability implies invalidity.
• If a measure is perfectly reliable, it may or may not be perfectly valid, because systematic error may still be present (XO = XT + XS).
• Reliability is a necessary, but not sufficient, condition for validity.
