Topic 5. Mean separation: Multiple comparisons [ST&D Ch. 8, except 8.3]
5. 1. Basic concepts
If there are more than two treatments, the problem is to determine which means
are significantly different. This process can take two forms:
Planned F tests (orthogonal F tests, Topic 4):
* Motivated by the treatment structure
* Independent
* More precise and powerful
* Limited to (t - 1) comparisons
* Well-defined method

Effects suggested by the data (multiple comparison tests, Topic 5):
* No preliminary grouping is known
* Useful where no particular relationships exist among treatments
* No limit on the number of comparisons
* Many methods/philosophies to choose from
5. 2. Error rates
Selection of the most appropriate mean separation test is heavily influenced
by the error rate. Recall that a Type I error occurs when one incorrectly
rejects a true null hypothesis H0.
The Type I error rate is the fraction of times a Type I error is made.
In a single comparison this is the value α.
When comparing 3 or more treatment means there are at least 2 kinds of
Type I error rates:
Comparison-wise Type I error rate (CER): the number of Type I errors divided
by the total number of comparisons.
Experiment-wise Type I error rate (EER): the number of experiments in which
at least one Type I error occurs, divided by the total number of experiments.
Example
An experimenter conducts 100 experiments with 5 treatments each.
In each experiment, there are t(t - 1)/2 = 5(4)/2 = 10 possible pairwise comparisons.
Over 100 experiments, there are 1,000 possible pairwise comparisons.
Suppose that there are no true differences among the treatments (i.e. H0 is
true) and that in each of the 100 experiments one Type I error is made.
Then the CER over all experiments is:
CER= (100 mistakes) / (1000 comparisons) = 0.1 or 10%.
The EER is
EER= (100 experiments with mistakes) / (100 experiments) = 1 or 100%.
Relationship between CER and EER
1. EER is the probability of making a Type I error somewhere in the complete
experiment. As the number of means increases, the EER approaches 1 (100%).
2. To preserve a low EER, the CER has to be kept very low. Conversely, if the
CER is maintained at a reasonable level, the EER will be very large.
3. The relative importance of controlling these two Type I error rates depends on
the objectives of the study.
* When incorrectly rejecting one comparison jeopardizes the entire experiment,
or when the consequence of incorrectly rejecting one comparison is as serious
as incorrectly rejecting a number of comparisons, EER control is more important.
* When one erroneous conclusion will not affect other inferences in an
experiment, the CER is more important.
4. Different multiple comparison procedures have been developed based on
different philosophies of controlling these two kinds of errors.
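A minimal simulation sketch (mine, not part of the handout; the number of experiments, treatments, and replications are arbitrary choices) that illustrates the point numerically: when H0 is true everywhere and every pair is tested with an unprotected test at α = 0.05, the CER stays near 0.05 while the EER is far larger.

```python
# Hedged sketch: per-pair two-sample t tests are used here instead of LSD with
# the pooled MSE, which is close enough to illustrate CER vs. EER behavior.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, t, r, n_experiments = 0.05, 5, 5, 2000   # 5 treatments, 5 reps (arbitrary)
n_pairs = t * (t - 1) // 2

type1_comparisons = 0
experiments_with_error = 0
for _ in range(n_experiments):
    # H0 is true: every treatment is drawn from the same normal population
    data = rng.normal(loc=0.0, scale=1.0, size=(t, r))
    any_error = False
    for i in range(t):
        for j in range(i + 1, t):
            if stats.ttest_ind(data[i], data[j]).pvalue < alpha:
                type1_comparisons += 1
                any_error = True
    experiments_with_error += any_error

print("CER =", type1_comparisons / (n_experiments * n_pairs))   # near 0.05
print("EER =", experiments_with_error / n_experiments)          # much larger
```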
COMPUTATION OF EER
The EER is difficult to compute because Type I errors are not independent.
But it is possible to compute an upper bound for the EER by assuming that
the probability of a Type I error in any comparison is α and is independent
of the other comparisons. In this case:

Upper bound EER = 1 - (1 - α)^p

where
p = number of pairs to compare = t(t - 1)/2
(1 - α) = probability of not making an error in one comparison
(1 - α)^p = probability of not making an error in p comparisons
Example
With 10 treatments, p = 10(9)/2 = 45 pairwise comparisons.
If α = 0.05, the probability of not making a Type I error in one comparison is 0.95.
Probability of not making a Type I error in p comparisons: (1 - α)^p = 0.95^45 = 0.1
Probability of making 1 or more Type I errors in p comparisons: 1 - 0.1 = 0.9
Upper bound EER = 0.9.

This formula can also be used to fix the EER and then calculate the required
per-comparison α:
EER = 0.1 = 1 - (1 - α)^45, so (1 - α) = 0.9^(1/45) = 0.9977 and α ≈ 0.002.
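The same arithmetic can be scripted. Below is a short sketch (mine, not from the handout) of the upper-bound formula and of solving it backwards for the per-comparison α that holds the EER at a chosen level.

```python
# Upper bound EER = 1 - (1 - alpha)^p, with p = t(t-1)/2 pairwise comparisons.
alpha = 0.05
for t in (5, 9, 10):
    p = t * (t - 1) // 2
    print(f"t = {t:2d}  p = {p:2d}  upper-bound EER = {1 - (1 - alpha) ** p:.2f}")

# Per-comparison alpha needed to keep the upper-bound EER at 0.10 when t = 10:
target_eer, p = 0.10, 45
alpha_needed = 1 - (1 - target_eer) ** (1 / p)
print(f"alpha needed per comparison = {alpha_needed:.4f}")   # about 0.0023
```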
Partial null hypothesis: Suppose there are 10 treatments and one shows a
significant effect while the other 9 are equal. ANOVA will reject H0.
[Figure: treatment means Yi. plotted against treatment number (1 to 10)
together with the overall mean Y..; nine means lie close to Y.. and one
stands apart.]
There is a probability α of making a Type I error between each pair of the 9
similar treatments. The upper bound of the EER under this partial null
hypothesis is computed by setting t = 9: p = 9(9 - 1)/2 = 36 and
EER = 1 - 0.95^36 = 0.84.
The experimenter will incorrectly conclude that some pair of similar
effects are different 84% of the time.
Examples
Table 4.1. Results (mg dry weight) of an experiment (CRD) to determine
the effect of seed treatment by acids on the early growth of rice seedlings.
Replication    Control    HCl      Propionic    Butyric
1              4.23       3.85     3.75         3.66
2              4.38       3.78     3.65         3.67
3              4.10       3.91     3.82         3.62
4              3.99       3.94     3.69         3.54
5              4.25       3.86     3.73         3.71
Total (Yi.)    20.95      19.34    18.64        18.20
Mean           4.19       3.87     3.73         3.64
Overall: Y.. = 77.13, overall mean = 3.86
Table 4.2. ANOVA of data in Table 4.1.
Source of variation    df    Sum of squares    Mean square    F
Total                  19    1.0113
Treatment               3    0.8738            0.2912         33.87
Exp. error             16    0.1376            0.0086
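As a quick cross-check (a sketch I added, not part of the handout), the F value in Table 4.2 can be reproduced from the raw data in Table 4.1 with SciPy's one-way ANOVA:

```python
from scipy import stats

control   = [4.23, 4.38, 4.10, 3.99, 4.25]
hcl       = [3.85, 3.78, 3.91, 3.94, 3.86]
propionic = [3.75, 3.65, 3.82, 3.69, 3.73]
butyric   = [3.66, 3.67, 3.62, 3.54, 3.71]

f, p = stats.f_oneway(control, hcl, propionic, butyric)
print(f"F = {f:.2f}, p = {p:.2g}")   # F is about 33.9, matching Table 4.2
```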
Table 5.1. Unequal N. Weight gains (lb/animal/day) as affected by three
different feeding rations. CRD with unequal replications.
Replication    Control    Feed-A    Feed-B    Feed-C
1              1.21       1.34      1.45      1.31
2              1.19       1.41      1.45      1.32
3              1.17       1.38      1.51      1.28
4              1.23       1.29      1.39      1.35
5              1.29       1.36      1.44      1.41
6              1.14       1.42                1.27
7                         1.37                1.37
8                         1.32
N              6          8         5         7         Overall: 26
Total          7.23       10.89     7.24      9.31      Overall: 34.67
Mean           1.20       1.36      1.45      1.33      Overall: 1.33
Table 5-2. ANOVA of data in Table 5-1.
Source of variation    df    Sum of squares    Mean square    F
Total                  25    0.2202
Treatment               3    0.1709            0.05696        25.41
Exp. error             22    0.0493            0.00224
Complete and partial null hypothesis in SAS
EER under the complete null hypothesis: all population means are equal
EER under a partial null hypothesis: some means are equal but some
differ.
SAS subdivides the error rates into:
CER = comparison-wise error rate
EERC = experiment-wise error rate under complete null hypothesis (the
standard EER)
EERP = experiment-wise error rate under a partial null hypothesis.
MEER = maximum experiment-wise error rate under any complete or
partial null hypothesis.
5. 3. Multiple comparison tests
ST&D Ch. 8 and SAS/STAT (GLM)
Statistical methods for making two or more inferences while controlling the
probability of making at least one Type I error are called simultaneous
inference methods.
Multiple comparison techniques fall into two groups:
Fixed-range tests: those that can provide confidence intervals as well as
tests of hypotheses; one range is used for testing all differences in
balanced designs.
Multiple-stage tests: those that are essentially only tests of hypotheses;
the ranges vary.
5. 3. 1. Fixed-range tests
These tests provide one range for testing all differences in balanced
designs and can provide confidence intervals.
Many fixed-range procedures are available and considerable controversy
exists as to which procedure is most appropriate.
We will present four commonly used procedures, ordered from the least
conservative (highest power) to the most conservative (lowest power):
* LSD
* Dunnett
* Tukey
* Scheffe
5. 3. 1. 1. The repeated t and least significant difference: LSD
The LSD test is one of the simplest and one of the most widely misused tests.
The LSD test declares the difference between means Ȳi and Ȳi' significant when:

|Ȳi - Ȳi'| > LSD = t_{α/2, df MSE} sqrt(2 MSE / r)   for equal r (SAS: LSD test)

where MSE = pooled s², as calculated by PROC ANOVA or PROC GLM. The factor 2
comes from the distribution of the difference X̄1 - X̄2, whose variance is
s1² + s2²; if s1² = s2² = sp², then s1² + s2² = 2 sp².

From Table 4.1: α = 0.05, MSE = 0.0086 with 16 df.

LSD = t_{0.025, 16} sqrt(2 (0.0086) / 5) = 2.12 sqrt(0.00344) = 0.1243

If |Ȳi - Ȳi'| > 0.1243, the treatments are said to be significantly different.
A systematic procedure of comparison is to arrange the means in descending
or ascending order as shown below.
Treatment     Mean     LSD
Control       4.19     a
HCl           3.87     b
Propionic     3.73     c
Butyric       3.64     c
First compare the largest with the smallest mean.
If these two means are significantly different, then compare the next
largest with the smallest.
Repeat this process until a non-significant difference is found.
Identify these two and any means in between with a common lower case
letter by each mean.
When all the treatments are equally replicated, only one LSD value is
required to test all possible comparisons. One advantage of the LSD
procedure is its ease of application.
LSD is readily used to construct confidence intervals for mean differences.
The 1 - α confidence limits are (ȲA - ȲB) ± LSD.
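A minimal sketch (assumed, not from the handout) of the equal-r LSD computation for Table 4.1, using SciPy only for the t quantile:

```python
from math import sqrt
from scipy import stats

alpha, mse, df_error, r = 0.05, 0.0086, 16, 5
t_crit = stats.t.ppf(1 - alpha / 2, df_error)     # about 2.12
lsd = t_crit * sqrt(2 * mse / r)
print(f"LSD = {lsd:.4f}")                         # about 0.124

means = {"Control": 4.19, "HCl": 3.87, "Propionic": 3.73, "Butyric": 3.64}
names = list(means)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        diff = abs(means[names[i]] - means[names[j]])
        print(f"{names[i]} vs {names[j]}: |diff| = {diff:.2f} ->",
              "significant" if diff > lsd else "NS")
```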
LSD with different number of replications
Different LSD must be calculated for each comparison involving different
numbers of replications.
|Ȳi - Ȳi'| > t_{α/2, df MSE} sqrt(MSE (1/r1 + 1/r2))   for unequal r (SAS: repeated t test)

For Table 5.1 with unequal replications:

Treatment    Mean
Control      1.20    c
Feed-C       1.33    b
Feed-A       1.36    b
Feed-B       1.45    a

The 5% LSD for comparing the Control vs. Feed-B (1.45 - 1.20 = 0.25) is

LSD = t_{0.025, 22} sqrt(0.00224 (1/6 + 1/5)) = 0.0595

and since |Ȳi - Ȳi'| = 0.25 > 0.0595, we reject H0.
Because of the unequal replication, the LSD values and the lengths of the
confidence intervals vary among the different pairs of means:
A vs. Control = 0.0531
A vs. B = 0.0560
A vs. C = 0.0509
B vs. C = 0.0575
C vs. Control = 0.0546
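The pair-specific LSD values above can be generated in one loop; the following is a sketch (assumed, not part of the handout) using the replication numbers and MSE from Table 5.1:

```python
from math import sqrt
from scipy import stats

alpha, mse, df_error = 0.05, 0.00224, 22
reps = {"Control": 6, "Feed-A": 8, "Feed-B": 5, "Feed-C": 7}
t_crit = stats.t.ppf(1 - alpha / 2, df_error)     # about 2.074

names = list(reps)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        lsd = t_crit * sqrt(mse * (1 / reps[names[i]] + 1 / reps[names[j]]))
        print(f"{names[i]} vs {names[j]}: LSD = {lsd:.4f}")
```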
General considerations
The LSD test is much safer when the means to be compared are selected
in advance of the experiment.
It is primarily intended for use when there is no predetermined structure
to the treatments (e.g. in variety trials).
The LSD test is the only test for which the error rate equals the
comparison-wise error rate. This is often regarded as too liberal (i.e. too
ready to reject the null hypothesis).
It has been suggested that the EER can be held to the α level by
performing the overall ANOVA F test at the α level and making further
comparisons only if the F test is significant (Fisher's protected LSD
test). However, it was later demonstrated that this assertion is false if
there are more than three means.
A preliminary F test controls the EERC but not the EERP.
5. 3. 1. 2. Dunnett's Method
Dunnett's method compares a control with each of several other treatments.
Dunnett's test holds the maximum EER under any complete or partial
null hypothesis (MEER) at or below α.
In this method a t* value is calculated for each comparison.
The tabular t* value is given in Table A-9 (ST&D p625).
DLSD = t*_{α/2, df MSE} sqrt(2 MSE / r)   for equal r

DLSD = t*_{α/2, df MSE} sqrt(MSE (1/r1 + 1/r2))   for unequal r

From Table 4.1: MSE = 0.0086 with 16 df and p = 3, so t*_{α/2} = 2.59 (Table A-9).

DLSD = 2.59 sqrt(2 (0.0086) / 5) = 0.152

Note that DLSD = 0.152 > LSD = 0.124.
0.152 is the least significant difference between the control and any other treatment.
Control - HCl = 4.19 - 3.87 = 0.32. Since 0.32 > 0.152 (DLSD), the difference is significant.
The 95% simultaneous confidence intervals for all three differences are
computed as (Ȳ0 - Ȳi) ± DLSD. The limits of these differences are:
Control - HCl       = 0.32 ± 0.15
Control - propionic = 0.46 ± 0.15
Control - butyric   = 0.55 ± 0.15
That is, we have 95% confidence that the three true differences fall
simultaneously within the above ranges.
For unequal replication: Table 5-1
To compare the control with Feed-C: t*_{0.025, 22, p=3} = 2.517 (from SAS).

DLSD = 2.517 sqrt(0.00224 (1/6 + 1/7)) = 0.06627

Since |Ȳ0 - Ȳc| = 0.125 is larger than 0.06627, the difference is significant.
The other differences are also significant.
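For completeness, recent SciPy releases (1.11 or later, if I recall the version correctly) include a ready-made Dunnett test; the sketch below (mine, not part of the handout) applies it to the Table 4.1 data. SciPy computes the critical point numerically, so its adjusted p-values may differ slightly from results based on the tabulated t*.

```python
from scipy import stats

control   = [4.23, 4.38, 4.10, 3.99, 4.25]
hcl       = [3.85, 3.78, 3.91, 3.94, 3.86]
propionic = [3.75, 3.65, 3.82, 3.69, 3.73]
butyric   = [3.66, 3.67, 3.62, 3.54, 3.71]

# Each acid treatment is compared only with the control (many-to-one).
res = stats.dunnett(hcl, propionic, butyric, control=control)
print(res.statistic)   # one t statistic per acid vs. control
print(res.pvalue)      # Dunnett-adjusted p-values; all three are significant
```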
5. 3. 1. 3. Tukey's w procedure
Tukey's test was designed for all possible pairwise comparisons.
The test is sometimes called the "honestly significant difference" (HSD) test.
It controls the MEER when the sample sizes are equal.
It uses a statistic similar to the LSD but with a critical value q_{α}(p, df MSE)
obtained from Table A-8 (the distribution of (Ȳmax - Ȳmin)/s_Ȳ):

w = q_{α}(p, df MSE) sqrt(MSE / r)   for equal r
(the 2 is missing because it is already built into q; e.g. for p = 2, df = ∞,
α = 0.05: q = 2.77 = 1.96 √2)

w = q_{α}(p, df MSE) sqrt((MSE/2)(1/r1 + 1/r2))   for unequal r (SAS manual)

For Table 4.1: q_{0.05}(4, 16) = 4.05

w = 4.05 sqrt(0.0086 / 5) = 0.168   (note that w = 0.168 > DLSD = 0.152)
Tukey's critical value is larger than Dunnett's because the Tukey family
of contrasts is larger (all pairs of means).
Table 4.1
Treatment     Mean     Tukey
Control       4.19     a
HCl           3.87     b
Propionic     3.73     b c
Butyric       3.64     c
This test does not detect a significant difference between HCl and Propionic.
For unequal r, as in Table 5.1, the contrast of the control with Feed-C:

q_{0.05}(4, 22) = 3.93,  w = 3.93 sqrt((0.00224/2)(1/6 + 1/7)) = 0.0731

Since |Ȳ0 - Ȳc| = 0.125 is larger than 0.0731, the difference is significant.
As with the LSD, the only pairwise comparison that is not significant is
between Feed-C (Ȳ = 1.33) and Feed-A (Ȳ = 1.36).
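A sketch (assumed, not from the handout) of the equal-r Tukey computation for Table 4.1, using SciPy's studentized range distribution (available in SciPy 1.7+); the commented lines show statsmodels' ready-made test, where `values` and `group_labels` are placeholder names for the raw observations and their treatment labels.

```python
from math import sqrt
from scipy.stats import studentized_range

alpha, mse, df_error, r, t = 0.05, 0.0086, 16, 5, 4
q = studentized_range.ppf(1 - alpha, t, df_error)   # about 4.05
w = q * sqrt(mse / r)
print(f"q = {q:.2f}, w = {w:.3f}")                  # w is about 0.168

# Ready-made alternative on the raw data (placeholder variable names):
# from statsmodels.stats.multicomp import pairwise_tukeyhsd
# print(pairwise_tukeyhsd(values, group_labels, alpha=0.05))
```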
5. 3. 1. 4. Scheffe's F test
Scheffe's test is compatible with the overall ANOVA F test in that it
never declares a contrast significant if the overall F test is nonsignificant.
Scheffe's test controls the MEER for ANY set of contrasts including
pairwise comparisons.
Since this procedure allows for more kinds of comparisons, it is less
sensitive in finding significant differences than other pairwise
comparison procedures.
For pairwise comparisons with equal r, Scheffe's critical difference (SCD)
has a structure similar to that of the previous tests:

SCD = sqrt(df TRT · F_{α}(df TRT, df MSE)) sqrt(2 MSE / r)   for equal r

SCD = sqrt(df TRT · F_{α}(df TRT, df MSE)) sqrt(MSE (1/r1 + 1/r2))   for unequal r
From Table 4.1: MSE = 0.0086 with df TRT = 3, df MSE = 16, and r = 5.

SCD_{0.05} = sqrt(3 · 3.24) sqrt(2 (0.0086) / 5) = 0.183   (note that SCD = 0.183 > w = 0.168)

Treatment     Mean     Scheffe
Control       4.19     a
HCl           3.87     b
Propionic     3.73     b c
Butyric       3.64     c
This test does not detect a significant difference between HCl and Propionic.
Unequal replications: a different SCD is required for each comparison. For the
contrast Control vs. Feed-C (Table 5.1):

SCD_{0.05, (3, 22)} = sqrt(3 · 3.05) sqrt(0.00224 (1/6 + 1/7)) = 0.0796

Since |Ȳ0 - Ȳc| = 0.125 is larger than 0.0796, the difference is significant.
Scheffe's procedure is also readily used for interval estimation.
The 1 - α confidence limits are (ȲA - ȲB) ± SCD. The resulting intervals are
simultaneous: the probability is at least 1 - α that all of them are
simultaneously true.
Scheffe's comparisons among groups
The most important use of Scheffe's test is for arbitrary comparisons
among groups of means.
If we are interested only in testing the differences between all pairs of
means, Tukey is more sensitive than Scheffe.
To make comparisons among groups of means, we define a contrast as in Topic 4:

Q = Σ ci Ȳi.,  with Σ ci = 0 (or Σ ri ci = 0 for unequal replication)

We will reject the hypothesis (H0) that the contrast Q = 0 if the absolute
value of Q is larger than a critical value FS.
This is the general form of Scheffe's F test:

Critical value FS = sqrt(df TRT · F_{α}(df TRT, df MSE)) sqrt(MSE Σ(ci²/ri))

In the previous pairwise comparisons the contrast is 1 vs. -1, so Σ(ci²/ri) = 2/r.
Example
To compare the control vs. the average of the three acids in Table 4.1, the
contrast coefficients (+3, -1, -1, -1) are multiplied by the treatment means:

Q = 4.190 (3) + 3.868 (-1) + 3.728 (-1) + 3.640 (-1) = 1.334

Critical value:

FS_{0.05, (3, 16)} = sqrt(3 · 3.24) sqrt(0.0086 (3² + (-1)² + (-1)² + (-1)²) / 5) = 0.4479

Q > FS, therefore we reject H0. The control (4.190 mg) is significantly
different from the average of the three acid treatments (3.745 mg).
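A sketch (assumed, not from the handout) of this group contrast, using SciPy only for the F quantile:

```python
from math import sqrt
from scipy import stats

mse, df_trt, df_error, r = 0.0086, 3, 16, 5
means = {"Control": 4.190, "HCl": 3.868, "Propionic": 3.728, "Butyric": 3.640}
coef  = {"Control": 3, "HCl": -1, "Propionic": -1, "Butyric": -1}

Q = sum(coef[k] * means[k] for k in means)            # about 1.334
f_crit = stats.f.ppf(0.95, df_trt, df_error)          # about 3.24
FS = sqrt(df_trt * f_crit) * sqrt(mse * sum(c ** 2 for c in coef.values()) / r)
print(f"Q = {Q:.3f}, FS = {FS:.3f}, reject H0: {abs(Q) > FS}")
```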
5. 3. 2. Multiple-stage tests
By giving up the facility for simultaneous estimation with one value, it is
possible to obtain tests with greater power: multiple-stage tests (MSTs).
The best-known MSTs are Duncan's test and the Student-Newman-Keuls (SNK) test.
Both use the studentized range statistic.
Multiple range tests should only be used with balanced designs since they are
inefficient with unbalanced ones.
[Diagram: means Ȳ1, Ȳ2, Ȳ3, Ȳ4 arrayed from the smallest to the largest, with
brackets marking the number of means p spanned by each comparison: adjacent
means have p = 2, the extremes have p = 4.]
With the means arrayed from the lowest to the highest, a multiple-range test
gives significant ranges that become smaller as the pairwise means to be
compared are closer together in the array.
Imagine a Tukey test where you first test the most distant means and then
ignore them at the next level, resulting in a smaller p and a smaller critical
value in Table A-8.
5. 3. 2. 1. Duncan's multiple range tests: no longer accepted
The test is identical to LSD for adjacent means in an array but requires
progressively larger values for significance between means as they are
more widely separated in the array.
It controls the CER at the α level but has a high experiment-wise Type I
error rate (MEER).
Duncan's test used to be the most popular method, but many journals no
longer accept it and it is not recommended by SAS.
5. 3. 2. 2. The Student-Newman-Keuls (SNK) test
This test is more conservative than Duncan's in that the type I error rate is
smaller.
It is often accepted by journals that do not accept Duncan's test.
The SNK test controls the EERC at the α level
but has poor behavior in terms of the EERP and MEER. This method
is not recommended by SAS.
The procedure is to compute a set of critical values (ST&D Table A-8):

Wp = q_{α}(p, df MSE) sqrt(MSE / r),   p = t, t-1, ..., 2

For Table 4.1:

p                  2       3       4
q_{0.05}(p, 16)    3.00    3.65    4.05
Wp                 0.124   0.151   0.168

For p = 2, Wp = LSD.
Treatment     Mean     SNK
Control       4.19     a
HCl           3.87     b
Propionic     3.73     c
Butyric       3.64     c
Note that for p = t, Wp = Tukey's w = 0.168.
Remember that HCl vs. Propionic was not significant in Tukey's procedure.
SNK is more sensitive than Tukey, but the cost is a larger EERP.
Assume the following partial null hypothesis (means ordered from smallest to
largest): Ȳ1 = Ȳ2, Ȳ3 = Ȳ4, Ȳ5 = Ȳ6, Ȳ7 = Ȳ8, Ȳ9 = Ȳ10, with the five equal
pairs widely separated from one another. SNK then compares the 5 non-significant
independent pairs at the p = 2 level, i.e. with an LSD:
EERP = 1 - (1 - 0.05)^5 = 0.23, a higher EERP than Tukey's!
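A sketch (assumed, not from the handout) of the SNK critical ranges Wp for Table 4.1, taking the studentized range quantiles from SciPy instead of Table A-8:

```python
from math import sqrt
from scipy.stats import studentized_range

alpha, mse, df_error, r = 0.05, 0.0086, 16, 5
for p in (2, 3, 4):
    q = studentized_range.ppf(1 - alpha, p, df_error)
    print(f"p = {p}: q = {q:.2f}, Wp = {q * sqrt(mse / r):.3f}")
# p = 2 reproduces the LSD (0.124) and p = t = 4 reproduces Tukey's w (0.168).
```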
5. 3. 2. 3. The REGWQ method
A variety of MSTs that control MEER have been proposed, but these
methods are not as well known as those of Duncan and SNK.
The Ryan, Einot and Gabriel, and Welsch (REGWQ) method uses

α_p = 1 - (1 - α)^(p/t) for p < t - 1,  and  α_p = α for p ≥ t - 1 (= SNK).

In Table 4.1 (t = 4): α_p = α for p ≥ 3, and α_p = 1 - (1 - α)^(p/t) for p < 3.
The REGWQ method does the comparisons using a range test.
This method appears to be among the most powerful multiple range tests and
is recommended by SAS for equal replication.
Assuming the sample means have been arranged in order from Ȳ1 through Ȳk, the
homogeneity of the means Ȳi, ..., Ȳj (a range spanning p means) is rejected by
REGWQ if:

|Ȳi - Ȳj| ≥ q_{α_p}(p, df MSE) sqrt(MSE / r)   (use Table A-8, ST&D)
For Table 4.1 data:

p                 2        3       4
α_p               0.0253   0.05    0.05
q_{α_p}(p, 16)    3.49     3.65    4.05
Critical value    0.145    0.151   0.168
                  (>SNK)   (=SNK)  (=SNK)

α_2 = 1 - (1 - α)^(p/t) = 1 - 0.95^(2/4) = 0.0253 (not in Table A-8; use SAS).
For p < t - 1 the REGWQ critical value (0.145) is larger than the SNK value (0.124).
Note that the difference between HCl and Propionic is significant with SNK
but not significant with REGWQ (3.87 - 3.73 = 0.14 < 0.145).
Treatment     Mean     REGWQ
Control       4.19     a
HCl           3.87     b
Propionic     3.73     b c
Butyric       3.64     c
This test does not detect a significant difference between HCl and Propionic.
The price of the better EERP control is a lower sensitivity.
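A sketch (assumed, not from the handout) of the REGWQ critical values for Table 4.1, shrinking the range-wise level to α_p = 1 - (1 - α)^(p/t) when p < t - 1:

```python
from math import sqrt
from scipy.stats import studentized_range

alpha, t, mse, df_error, r = 0.05, 4, 0.0086, 16, 5
for p in (2, 3, 4):
    alpha_p = alpha if p >= t - 1 else 1 - (1 - alpha) ** (p / t)
    q = studentized_range.ppf(1 - alpha_p, p, df_error)
    print(f"p = {p}: alpha_p = {alpha_p:.4f}, critical value = {q * sqrt(mse / r):.3f}")
```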
5. 4. Conclusions and recommendations
There are many procedures available for multiple comparisons: at least 20
other parametric methods, plus many non-parametric and multivariate methods.
There is no consensus as to which one is the most appropriate procedure
to recommend to all users.
One main difficulty in comparing the procedures results from the
different kinds of Type I error rates used, namely, experiment-wise
versus comparison-wise.
The difference in performance of any two procedures is more likely to be due
to the different Type I error probabilities they control than to the techniques used.
To a large extent, the choice of a procedure will be subjective and will
hinge on a choice between a comparison-wise error rate (such as LSD)
and an experiment-wise error rate (such as protected LSD and
Scheffe's test).
Scheffe's method provides a very general technique to test all possible
comparisons among means. For just pairwise comparisons, Scheffe's
method is less appropriate than Tukey's test, as it is overly conservative.
Dunnett's test should be used if the experimenter only wants to make
comparisons between each of several treatments and a control.
The SAS manual makes the following additional recommendations: for
controlling the MEER use the REGWQ method in a balanced design
and Tukey method for unbalanced designs, which also gives confidence
intervals.
One point to note is that unbalanced designs can give strange results.
ST&D p. 200: 4 treatments, A, B, C, and D, with means A < B < C < D. A and D
each have 2 replications while B and C each have 11. No significant
difference was found between the extremes A and D, but a difference was
detected between B and C (LSD).
Treatment    Data                                          Mean
A            10, 17                                        13.5
B            10, 11, 12, 12, 13, 14, 15, 16, 16, 17, 18    14.0
C            14, 15, 16, 16, 17, 18, 19, 20, 20, 21, 22    18.0
D            16, 21                                        18.5
A vs. D: NS;  B vs. C: significant.
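A sketch (assumed, not from the handout) that reproduces this oddity from the raw data: the pooled MSE and the two relevant LSD values show the extreme pair A vs. D falling short of its (large) LSD while B vs. C exceeds its (small) LSD.

```python
from math import sqrt
from scipy import stats

groups = {
    "A": [10, 17],
    "B": [10, 11, 12, 12, 13, 14, 15, 16, 16, 17, 18],
    "C": [14, 15, 16, 16, 17, 18, 19, 20, 20, 21, 22],
    "D": [16, 21],
}
# Pooled within-group (error) mean square and its degrees of freedom
ss_error = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups.values())
df_error = sum(len(g) - 1 for g in groups.values())      # 22
mse = ss_error / df_error
t_crit = stats.t.ppf(0.975, df_error)

for a, b in (("A", "D"), ("B", "C")):
    ga, gb = groups[a], groups[b]
    diff = abs(sum(ga) / len(ga) - sum(gb) / len(gb))
    lsd = t_crit * sqrt(mse * (1 / len(ga) + 1 / len(gb)))
    verdict = "significant" if diff > lsd else "NS"
    print(f"{a} vs {b}: |diff| = {diff:.1f}, LSD = {lsd:.2f} -> {verdict}")
```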