
9
Statistical Considerations

David T. Mauger* and Gordon L. Kauffman, Jr.†
*Department of Health Evaluation Sciences and †Department of Surgery, The Milton S. Hershey Medical Center, Penn State College of Medicine, Hershey, Pennsylvania 17033

I. Introduction
II. Hypothesis Testing
III. Sample Size Calculations
IV. Summary
References

I. Introduction

Inference is the means by which conclusions can be drawn either from logic or deduction. Statistical inference is the means by which conclusions about an underlying process can be drawn from observed data. Statistics may be viewed as a set of tools for making decisions in the face of uncertainty. Even though the data have been observed without doubt in a given experiment, uncertainty exists because of the stochastic (random) nature of the underlying process giving rise to the data. If the same experiment were repeated under exactly the same conditions, no one would be surprised if the observed data were not exactly the same. The amount of unexplainable variation that can be expected in any given experiment depends largely on the nature of the process being studied. In a physics lab, for example, one would expect to find very little variation in measurements of force or gravity. Similar repeatability might also be expected in a well-run chemistry lab. Biomedical research (particularly in humans), on the other hand, is highly susceptible to significant unexplainable or naturally occurring variation. For example, the concentration of glucose in human blood is conceptually quantifiable, but in practice, measurements of blood glucose are subject to substantial variability. Major components of this variability are variation between people, variation within a person over time, and variation induced by measurement error, because glucose levels cannot be measured directly and must be estimated by assay.

In order to make statistical inference that can be meaningfully quantified, one must develop a framework for modeling natural variation. It is assumed that the randomness underlying the process being observed can be described mathematically. These mathematical models form the basis on which statistical theory is built. This chapter provides some insight into the statistical theory underlying quantitative issues in experimental design without going into mathematical detail or deriving results. From a practical point of view, the most important quantitative issues in experimental design are cost and benefit. The cost of a proposed study, in time and resources, is weighed against the expected gain in scientific knowledge. For example, suppose one is interested in studying the amount of time it takes to complete a cardiac bypass procedure. Not all cardiac bypass procedures take the same amount of time, and no one, certainly no surgeon, would accept the time taken to complete the next procedure (or any one particular procedure) as a definitive answer. One might conceivably undertake to determine the lengths of all cardiac bypass procedures done in the United States last year, and assuming that surgical practices have not changed very much during that time, these data would provide much scientific knowledge about the duration of such procedures. Such an undertaking would be very costly, and one might reasonably ask whether (nearly) equivalent scientific knowledge could have been obtained based on only a sample of cardiac bypass procedures done in the United States last year. In particular, how many observations are required to arrive at a definitive answer, or at least an answer with some high degree of quantifiable confidence? Methods for answering questions like this are the subject of this chapter. For both practical and ethical reasons it is important to be able to answer such questions precisely. In this age of high competition for limited biomedical research funding it is critical that investigators be able to convincingly justify their research plans.

There is another very important question to address: what method should be used to collect information on the proposed sample? In other words, what experimental design should be used? This question is at least partly qualitative in nature and in general cannot be answered based solely on a quantitative cost–benefit analysis. However, it must be answered before one can proceed to address the issue of sample size, because all sample size questions are design dependent. Even though the general principle that larger sample size always results in greater scientific knowledge is true, the exact mathematical form of this relationship depends on the manner in which the sample will be obtained. This chapter does not provide a pathway for choosing an experimental design, but focuses rather on the rationale for determining an appropriate sample size for a chosen experimental design.

The primary reason to consider statistical issues during study design is to ensure judicious use of limited resources, and the importance of careful consideration cannot be overstated. In all cases, the cost of a research study (in both time and money) is directly related to the sample size. A study that is underpowered because the sample size is too small has little chance of providing useful results and is a waste of valuable resources. At the other extreme, an overpowered study is also a waste of resources because the same information could have been obtained with fewer observations. This chapter develops the rationale behind sample size calculations, presents some approximate formulas that can be used to get ballpark sample size estimates, and provides guidance for using these formulas appropriately.

II. Hypothesis Testing

In many research settings the question of interest is the acceptance or rejection of a posited hypothesis. Statistical inference of this sort is termed "hypothesis testing." In this setting, the statistical analysis done at the end of the study will lead to a decision to either accept or reject the null hypothesis. This approach is predicated on the use of a model that can be framed in terms of a pair of alternative hypotheses. For example, one may wish to determine which of two drugs should be used to minimize blood loss during surgery. A complete but rather lengthy description of one such model is as follows: If all patients undergoing surgery were given drug A, the distribution of blood loss in the population would be similar to the normal distribution, the mean of the distribution would be μA, and the standard deviation would be σ. Similarly, if all patients undergoing surgery were given drug B, the distribution of blood loss in the population would be similar to the normal distribution, the mean of the distribution would be μB, and the standard deviation would be σ. This model is displayed visually in Fig. 1. Statistical texts use shorthand notation to express models of this sort:

XA ~ N(μA, σ) and XB ~ N(μB, σ).

Here X stands for the outcome of interest (amount of blood lost during surgery), the symbol "~" means "is distributed as," and N(μ, σ) denotes the normal distribution with mean μ and standard deviation σ. In statistical terms, this model is fully characterized by the three parameters μA, μB, and σ. That is, if one knows the numeric values for μA, μB, and σ, one knows everything possible about the model. This does not mean that the exact amount of blood loss for any particular patient is known, because that is subject to unexplainable, natural variability. However, one could determine the probability that a particular patient's blood loss would be above a certain value.

If one accepts that this model is reasonable, then the research question at hand (is there any difference between the drugs with respect to blood loss?) can be addressed by considering two special cases of the model.

(Case 1) XA ~ N(μA, σ) and XB ~ N(μB, σ) and μA ≠ μB.
(Case 2) XA ~ N(μA, σ) and XB ~ N(μB, σ) and μA = μB.

The second case can be written more simply as XA and XB ~ N(μ, σ); there is no need to distinguish between μA and μB because they are equal. In terms of the research question, μA and μB are the parameters of interest and σ is a nuisance parameter. By "nuisance parameter" it is meant that σ is not directly relevant to the research question. That is, the only thing that distinguishes case 1 from case 2 is whether μA and μB are equal. It is important to note that case 1 and case 2 are competing models. If one accepts the general model, then either case 1 is true or case 2 is true, but they cannot both be true or both be false.
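To make the notation concrete, the model can be simulated. The short Python sketch below is an illustration only; the chapter specifies no numeric values, so the values of μA, μB, and σ used here are invented for the example.

import numpy as np
from scipy.stats import norm

# Hypothetical parameter values; these numbers are assumptions chosen
# purely for illustration (units: mL of blood loss).
mu_a, mu_b, sigma = 450.0, 500.0, 100.0

rng = np.random.default_rng(0)
# Ten simulated patients receiving drug A: XA ~ N(mu_a, sigma).
print(rng.normal(mu_a, sigma, size=10).round(0))

# Because the three parameters fully characterize the model, any
# probability can be computed, e.g., P(blood loss on drug A > 600 mL):
print(norm.sf(600, loc=mu_a, scale=sigma))   # about 0.07

No individual patient's outcome is predictable, but every probability statement about the population is determined once μA, μB, and σ are known, exactly as described above.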

Figure 1 Illustration of the normal distribution model in the two-group setting. The standard deviation of the distribution is denoted by σ; μA and μB denote distribution means.

The question of whether μA and μB are equal can be phrased in terms of the following pair of competing hypotheses:

H0: μA = μB versus Ha: μA ≠ μB.

Once the data have been collected, a statistical analysis that directs a choice between these two hypotheses will be performed. In general terms, a statistical decision rule is stated as follows: reject the null hypothesis if the observed data are sufficiently inconsistent with it, otherwise do not reject the null hypothesis. This kind of decision rule is intuitively sensible, but there are many ways one could measure the inconsistency of the observed data with the null hypothesis, leading to many different potential decision rules. It is necessary to have some quantifiable way of comparing potential decision rules so that the optimal one can be identified. The situation is summarized in the following table.

              Reject H0       Do not reject H0
H0 is true    Type I error    Correct
Ha is true    Correct         Type II error

Either H0 or Ha is true, and the decision will be to either reject or not reject H0. The decision to reject H0 when it is true is called a type I error, and the decision not to reject H0 when it is false is called a type II error. The behavior of a decision rule is quantified in terms of its type I and type II error rates. The type I error rate (denoted α) of a decision rule is the probability that using it will result in a type I error, and likewise the type II error rate (denoted β) of a decision rule is the probability that using it will result in a type II error. The complement of the type II error rate (1 − β) is called the power of the decision rule. Power is the probability of correctly rejecting the null hypothesis. It is conventional to quantify the behavior of a decision rule in terms of its type I error rate (α) and its power (1 − β). The hypothetical model presented here, that blood loss follows the normal distribution, is only one example of the kinds of models that can be used. All hypothesis tests are framed in terms of competing model-based hypotheses, but not all hypothesis tests are based on parameters, nor on the normal distribution.

One statistic that measures how consistent the observed data are with the null hypothesis is called the p value, so it is reasonable that the p value could form the basis for a decision rule to reject or not reject the null hypothesis. Methods for calculating p values will be taken up in a later chapter. It is often the case that there is more than one reasonable way to calculate p values for a given set of data (e.g., two-sample t test or Mann–Whitney test). For now it suffices to know that if the null hypothesis is true, the p value will follow the standard uniform distribution. That is, the p value is equally likely to take any value between zero and one. The left-hand panel in Fig. 2 shows the distribution of p values that would be observed if the null hypothesis were true and the experiment were repeated many times. If the null hypothesis is not true, the p value will tend to be closer to zero than to one. For this reason, decision rules based on p values are constructed so that the null hypothesis is rejected if the observed p value is sufficiently small: reject the null hypothesis if the observed p value is less than some predetermined criterion, otherwise do not reject the null hypothesis.
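The uniformity of the p value under the null hypothesis is easy to verify by simulation. The following sketch, assuming NumPy and SciPy are available, repeatedly draws two samples from the same normal distribution and applies the two-sample t test; about 5% of the resulting p values fall below 0.05, and about half fall below 0.50.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_per_group, n_sims = 20, 5000
pvals = np.empty(n_sims)
for i in range(n_sims):
    # Both groups come from the SAME distribution, so H0 is true.
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(0.0, 1.0, n_per_group)
    pvals[i] = ttest_ind(a, b).pvalue

print((pvals < 0.05).mean())   # close to 0.05
print((pvals < 0.50).mean())   # close to 0.50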

Figure 2 Illustration of the p value distribution under the null hypothesis (left panel) and under a particular instance of the alternative hypothesis (right panel).

The type I error rate for this decision rule can be determined from the left-hand panel in Fig. 2. The darker shaded area corresponds to the probability that the p value will be less than 0.05 if the null hypothesis is true; the darker shaded area is 0.05. It is an interesting property of p values that if the null hypothesis is true, the probability that the p value will be less than some number X is X. So the type I error rate of this decision rule is simply the value of the predetermined criterion. It has become standard to set this criterion equal to 0.05, i.e., α = 0.05. This α is also sometimes called the significance level, because p values less than α are termed "statistically significant" and justify rejecting the null hypothesis. This decision rule is stated explicitly as follows: reject the null hypothesis if the p value for the observed data is less than 0.05, otherwise do not reject the null hypothesis.

It is important to understand that for decision rules based on p values the type I error rate is determined solely by the significance level of the decision rule; it is independent of study design, the method used to calculate the p value, sample size, and the amount of underlying variability. Power, on the other hand, depends on the distribution of the p value when the null hypothesis is false, which is directly influenced by study design, the method used to calculate the p value, sample size, and underlying variability. Determining the magnitude of power for this decision rule is quite complex. Part of the problem is that saying the null hypothesis is false is not completely informative. The alternative hypothesis in this example is ambiguous in that it says only that the two means are not equal, but does not say anything about the magnitude of the difference. The degree to which the null hypothesis is not true is called the effect size. The alternative hypothesis (Ha: μA ≠ μB) can be equivalently stated as Ha: μA − μB ≠ 0. The difference, μA − μB, is the effect size.

A basic principle is that power is directly related to effect size. It is intuitively sensible that power, the probability of correctly rejecting the null hypothesis, should get larger as the degree to which the null hypothesis is false gets larger. In addition to effect size, power also depends on sample size (the amount of information collected), unexplained variability (the degree to which randomness dilutes the information collected), study design (how the information will be collected), planned statistical analysis (the method that will be used to calculate the p value), and type I error rate (the criterion for rejecting the null hypothesis). The relationship between power and type I error is illustrated in the right-hand panel of Fig. 2. The distribution curve shown is hypothetical, but is characteristic of the distribution of p values under the alternative hypothesis in that p values are likely to be closer to zero than to one. In this example, the darker shaded area corresponds to power, the probability that a p value will be less than 0.05. If the type I error rate were decreased, the power (the area to the left of α) would also decrease.

Of all the quantities related to hypothesis testing, power and sample size are the two components that are under the unrestricted control of the investigator. Typically there are only a few study designs that can be used in any given setting, the appropriate analysis is generally determined by the study design, and α is usually set at 0.05 by convention. The other two pieces, effect size and unexplained variability, are determined by the nature of the process being studied. For example, one measuring device may be more precise than another, but no device can remove the natural variability between subjects in human research.
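These dependencies can be demonstrated with a small simulation. The sketch below (parameter values are illustrative, not from the chapter) estimates power for the p < 0.05 decision rule by generating data under a specific alternative and counting rejections; both a larger effect size and a larger sample size raise power substantially.

import numpy as np
from scipy.stats import ttest_ind

def simulated_power(effect, sigma, n, alpha=0.05, n_sims=2000, seed=2):
    """Estimate power of the two-sample t test under a given alternative."""
    rng = np.random.default_rng(seed)
    rejected = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, sigma, n)       # group A, mean 0
        b = rng.normal(effect, sigma, n)    # group B, shifted by the effect size
        if ttest_ind(a, b).pvalue < alpha:
            rejected += 1
    return rejected / n_sims

print(simulated_power(effect=1.0, sigma=1.0, n=20))   # roughly 0.87
print(simulated_power(effect=0.5, sigma=1.0, n=20))   # roughly 0.33
print(simulated_power(effect=0.5, sigma=1.0, n=80))   # roughly 0.88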

Because of this, it is useful to pose the following question: "For a given study design, analysis, α level, effect size, and amount of variability, what is the relationship between power and sample size?" The exact mathematical form of the relationship between power and sample size is not always tractable. Relatively simple formulas are available in some cases (e.g., the two-group mean comparison setting), but the use of computer software is unavoidable in many situations.

Inherent in the hypothesis testing approach is the assumption that the only possible explanation for inconsistency between the observed data and the null hypothesis is the alternative hypothesis. Another perfectly valid explanation is that the assumed model is incorrect, e.g., the outcome does not follow the normal distribution, the observed samples are not representative of the underlying population, or there are differences between the groups other than simply which treatment they received. If the assumed model is incorrect, a small p value may be reflective of just that. The question of whether the null hypothesis is true cannot be reasonably addressed, because the hypotheses themselves were predicated on the underlying model. Problems with the underlying model can sometimes be addressed by the statistical analysis so that the desired interpretation of the p value is preserved. However, some problems cannot be addressed by statistical analysis and therefore render the data incapable of yielding a conclusion. Most common among these problems are biases due to nonrepresentative sampling and lack of blinding. The validity of the assumption that the observed data are representative of the underlying population is absolutely critical to hypothesis testing. The best way of ensuring that the sample is representative is by taking a random sample. The use of selection criteria in determining the sample is separate from the issue of randomness. Selection criteria are used to define the population of interest; the relevant assumption is that the study participants represent a random sample of all persons who would fit the selection criteria. A related issue is that of randomization in a two-treatment design. A common design for a study comparing two treatments is to give one treatment to one-half of the subjects and the other treatment to the other half. A randomized study is one in which treatment assignment for each subject is random, i.e., does not depend on any characteristics of the individual. If treatment assignment is done in a nonrandom fashion (e.g., by gender), this produces a situation in which the two groups differ on some other factor in addition to treatment. In this case, the treatment effect is said to be confounded, because it is impossible to determine the extent to which observed differences can be attributed to the treatment as opposed to the confounding factor. The purpose of randomization is to eliminate confounding factors. Another way in which a study can be compromised is lack of blinding. A fully blinded study is one in which neither the subject nor the investigator knows which treatment the subject is receiving. If there is lack of blinding, there is a possibility of intentionally biasing the results, particularly if the outcome is subjective in nature. Even if the outcome is completely objective, lack of blinding allows the possibility of bias in the way a subject is treated during the study. Surgical research is one area where blinding can be very difficult for both ethical and practical reasons, and the issue of blinding must be carefully considered.

III. Sample Size Calculations

The usual way of expressing the relationship between power and sample size is to fix power at the desired level and then calculate the necessary sample size. The alternative approach is to fix the sample size and determine the resulting power. These two approaches are illustrated in Fig. 3. The two graphs shown in this figure are equivalent (i.e., the right-hand panel is the same as the left-hand panel rotated 90°), but it is instructive to consider the relationship both ways. Heuristically, the first approach is accomplished in the left-hand panel by choosing a point on the x axis (power) and then finding the associated sample size. In this hypothetical example, 90% power requires a sample size of approximately 21. Conversely, the right-hand panel illustrates the approach of first choosing the sample size (x axis) and then determining the resulting power. In this example, a sample size of 25 yields approximately 95% power. In biomedical research, adequate power is usually defined as anything greater than 80%. However, because power is the complement of the type II error rate, it might seem that 95% power would be more consistent with the standard 5% type I error rate.

Sample size calculations are very often done without direct consideration of the shape of the curve relating power and sample size. The investigator simply supplies the desired power and other parameters (α, effect size, variability, etc.), and the computer software calculates the necessary sample size. However, Fig. 3 also illustrates another important feature of the relationship between power and sample size: the curve relating them has a hyperbolic shape. This is generally true, although the curve shown in Fig. 3 corresponds to a particular hypothetical example. The consequence of this fact is that it is critical for the investigator to consider the entire curve rather than just identifying the sample size required for the specified power. Considering the entire curve is analogous to a cost–benefit analysis. The cost of a study (in dollars and time) is directly related to the sample size, and power is an intuitive measure of the value of a study. Looking at the right-hand panel in Fig. 3, in this example power increases linearly with sample size up to about 15. Above 15 there is a diminishing return in power for equivalent increases in sample size. From the sample size calculation point of view, the left-hand panel indicates that for this particular example it is certainly reasonable to choose a sample size corresponding to 80–90% power.

Figure 3 Hypothetical illustration of the relationship between sample size and power. Note the hyperbolic shape of
the curve, indicating a nonlinear trade-off between sample size and power.

Whether it would be sensible to design the study with 95% power depends on other considerations. If sample size is not a budgetary constraint or if it is critical to have a very low type II error rate, then it may be sensible to consider 95% power or higher. A general rule of thumb is to choose a sample size that ensures 80–90% power unless this is beyond the point on the curve where sample size increases rapidly. If so, it may still be reasonable to consider 80% power or higher, but this should be done in light of all aspects of the study. Alternatively, one could consider using a different study design. Not all computer programs for calculating sample size are capable of producing graphs like those in Fig. 3. If one does not have access to such software, a rough outline of the sample size–power curve can be generated by calculating the sample size for several different levels of power (e.g., 70, 75, 80, 85, 90, and 95%).

A. Two-Group Mean Comparison

The two-group design for which the planned analysis is to compare group means using the two-sample t test (see Chapter 82, this volume) is one setting for which the sample size formula is relatively simple and can be approximated well enough that one can do reasonable "back of the envelope" calculations. The approximate sample size required for each group is

n = (σA² + σB²)(z_{1−α/2} + z_{1−β})² / (μA − μB)²,

where (μA, σA) and (μB, σB) are the means and standard deviations of the two groups, and α and β are the type I and type II error rates. The values of z corresponding to the specified α and β can be obtained from a table of the cumulative normal distribution, available in most statistics texts and many scientific calculators. Several of the more commonly used values are as follows:

α       z_{1−α/2}        Power    z_{1−β}
0.10    1.64             0.80     0.84
0.05    1.96             0.85     1.04
0.01    2.58             0.90     1.28

When it can be reasonably assumed that the standard deviation is the same in each group (σA = σB = σ), and for the standard α = 0.05 and power = 0.8 (β = 0.2), this formula simplifies to

n = 2σ²(1.96 + 0.84)² / (μA − μB)² ≈ 16[σ/(μA − μB)]²

or

n = 2σ²(1.96 + 1.28)² / (μA − μB)² ≈ 21[σ/(μA − μB)]²,

for α = 0.05 and power = 0.9 (β = 0.1). In these formulas, power is fixed and sample size is expressed as a function of unexplained variability (σ) and effect size under the alternative hypothesis (μA − μB). The ratio (μA − μB)/σ is also called the relative effect size, because it represents the magnitude of the effect size relative to the unexplained variability. Sample size is inversely proportional to the square of the relative effect size in the two-group setting. The relationship between sample size and power for several different relative effect sizes is shown in Fig. 4. This figure also demonstrates the potential for differences in the shape of the sample size curve.
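These approximations translate directly into a few lines of code. The sketch below, assuming SciPy is available, implements the per-group formula, recovers the 16 and 21 rules of thumb for a relative effect size of 1, and tabulates the rough sample size–power curve suggested above.

from scipy.stats import norm

def n_per_group(delta, sigma_a, sigma_b, alpha=0.05, power=0.80):
    """Approximate per-group n for the two-sample t test; delta = muA - muB."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return (sigma_a**2 + sigma_b**2) * (z_alpha + z_beta)**2 / delta**2

# A relative effect size of 1 (delta = sigma) recovers the rules of thumb:
print(n_per_group(1.0, 1.0, 1.0, power=0.80))   # about 16
print(n_per_group(1.0, 1.0, 1.0, power=0.90))   # about 21

# A rough outline of the sample size-power curve:
for power in (0.70, 0.75, 0.80, 0.85, 0.90, 0.95):
    print(power, round(n_per_group(1.0, 1.0, 1.0, power=power), 1))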

Figure 4 Hypothetical illustration of the difference between sample size/power curves for various relative effect sizes in
the two-group setting.

When the relative effect size is 1.5, only a few more subjects are required to achieve 95% power instead of 80% power. On the other hand, when the relative effect size is 0.75, more than twice as many subjects are required to achieve 95% power instead of 80% power.

Two pieces of information must still be specified in order to use the formulas given above: the unexplained variability (σ) and the effect size under the alternative hypothesis (μA − μB). Generally speaking, only limited information about variability and effect size is available at the time of study design. The choice of values to use for the purpose of sample size calculation is more often based on an educated guess than on hard facts. After all, if the research area were so well understood that these quantities were already known, there would be little point in conducting the study. Selecting values for σ that are too small or values for μA − μB that are too big are two of the more common pitfalls in sample size calculations. Sample size calculations are not analogous to income tax returns: the goal is not to minimize the required sample size. Figure 5 clearly demonstrates that either lowering σ or raising μA − μB (i.e., raising the relative effect size) leads to a smaller sample size at the same level of power. The temptation to (intentionally) overestimate the relative effect size can be great, particularly for the investigator faced with insufficient resources. However, this strategy is doomed to failure because it artificially inflates the expected power of the planned study.

Selecting reasonable values for σ and μA − μB is not always easy. Ideally, estimates should be based on previously published results or data from a pilot study. Even estimates based on published results typically require some extrapolation to the research question at hand, e.g., a slightly different dosing scheme or different eligibility criteria. In the absence of any relevant external data, values for μA − μB can be determined based on minimal clinical relevance. For example, if the primary outcome measure is the amount of blood lost during surgery, the smallest difference in mean blood loss (between drug groups) that would be clinically relevant might be 0.5 units. By "clinically relevant" it is meant that the difference is likely to impact clinical practice. One could certainly design a study to find a mean blood loss difference of 0.01 units, but even if such a difference were shown to be statistically significant, it is not likely that the result would influence clinical practice. It is important to use the minimal clinically relevant value. A study designed to have 80% power to find a mean blood loss difference of 2.0 units would have less than 80% power to find a difference of 0.5 units. If the true difference is 0.5, and that is clinically relevant, then the study should be designed for that difference. Determination of clinical significance is clearly not a statistical issue and must be based on expertise in the clinical setting under study.

Figure 5 Illustration of the magnitude of the difference between underlying normal distributions associated with various relative effect sizes. Shaded curve is the reference group and dashed lines represent groups differing from the reference by the indicated amount.

Choosing values for σ in the absence of any relevant data is somewhat more difficult, and ideally a pilot study would be done at this point. If it is necessary to proceed without pilot data, the relative effect size (μA − μB)/σ can be estimated directly instead of estimating both μA − μB and σ separately. The relative effect size is unitless and does not have a direct clinical interpretation, but it may be possible to determine a reasonable value. Figure 5 illustrates one way of interpreting relative effect size. The shaded curve represents one of the treatment groups, and the six curve outlines represent the amount of discrimination between the groups associated with various relative effect sizes. The tabulation below gives the amount of overlap between each outlined curve and the shaded curve.

Relative effect size    Percent overlap
0.5                     80
1.0                     62
1.5                     45
2.0                     32
2.5                     21
3.0                     13
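For two normal curves with common standard deviation whose means are d standard deviations apart, the overlapping area works out to 2Φ(−d/2), where Φ is the cumulative normal distribution, so the tabulated percentages can be reproduced with a one-line sketch (assuming SciPy is available):

from scipy.stats import norm

# Overlap between two unit-variance normal densities whose means are
# d standard deviations apart: area = 2 * Phi(-d / 2).
for d in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
    print(d, round(200 * norm.cdf(-d / 2)))   # 80, 62, 45, 32, 21, 13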
As with minimal clinical relevance, choosing the relative effect size that is minimally relevant should be done by someone with expertise in the field. When sample size calculations are done without the benefit of external data, it is a good idea to do the calculations over a range of potential input values in order to determine how sensitive the results are to the assumed values. To quote John Tukey, "An approximate answer to the right question is worth a good deal more than an exact answer to the wrong question."

B. Multigroup Mean Comparison

When the goal of the study is to compare the mean outcome across more than two groups and the planned statistical analysis is analysis of variance (ANOVA; see Chapter 82, this volume), exact sample size calculations should be done using computer software. However, a reasonable approximation is to use the two-group formulas above to calculate the required sample size for each group based on the smallest effect size and then multiply by the number of groups. If the analysis will also include post hoc comparisons of individual groups, it is standard to adjust the significance level to account for the multiple testing (see Chapter 82, this volume). In this case, the adjustment must be taken into consideration at the design phase. For example, if post hoc analyses will be done at the 0.01 significance level instead of 0.05, sample size calculations must also be done using α = 0.01.
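A minimal sketch of this approximation follows; the group count, effect size, and adjusted α below are invented for illustration.

from scipy.stats import norm

def n_total_anova(smallest_delta, sigma, n_groups, alpha=0.05, power=0.80):
    """Rough total n for ANOVA: the two-group per-group n at the smallest
    effect size, multiplied by the number of groups."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    n_per = 2 * sigma**2 * (z_alpha + z_beta)**2 / smallest_delta**2
    return n_groups * n_per

# Three groups, smallest relevant difference of one standard deviation,
# alpha lowered to 0.01 to allow for post hoc pairwise comparisons:
print(n_total_anova(1.0, 1.0, 3, alpha=0.01, power=0.80))   # about 70 in total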

C. Two-Group Mean Comparison—Paired Design

In some settings, a paired design can be used to compare two treatments. Here, each subject receives both treatments and the planned analysis is a paired t test (see Chapter 82, this volume). In terms of sample size, the paired design has two advantages over the unpaired design. The first is that fewer total subjects are needed because each subject will be measured twice, i.e., there will be twice as many outcomes as subjects. If the primary cost of the study is in obtaining the measurements (e.g., a very expensive assay must be done), then this benefit is of minor consequence to the overall cost of the study. If the primary cost is the recruitment of subjects or if eligible subjects are scarce, this benefit of the paired design can be substantial. The second benefit is due to the fact that observations on the same subject are often positively correlated with each other. The effect of this can be seen explicitly in the sample size formula for the paired design. The approximate total sample size required is

n = 2(1 − ρ)σ²(z_{1−α/2} + z_{1−β})² / (μA − μB)²,

where μA − μB is the effect size, σ is the standard deviation, and ρ is the correlation between the two measurements. For the standard α = 0.05 and power = 0.8 (β = 0.2), this formula simplifies to

n = 2(1 − ρ)σ²(1.96 + 0.84)² / (μA − μB)² ≈ 16(1 − ρ)[σ/(μA − μB)]²

or

n = 2(1 − ρ)σ²(1.96 + 1.28)² / (μA − μB)² ≈ 21(1 − ρ)[σ/(μA − μB)]²,

for α = 0.05 and power = 0.9 (β = 0.1). The larger the correlation, the smaller the sample size. In the absence of pilot data or previously published results on which to base estimates of ρ, it may be difficult to justify selecting values of ρ larger than zero. However, it is quite often the case that biomedical outcomes are positively correlated, and using values for ρ in the neighborhood of 0.25 is not unreasonable. Note that if ρ = 0, the total sample size is exactly the same as that required for each group in the unpaired design. If ρ = 0 and the cost of the study is dependent only on the number of measurements taken (not the number of subjects), then the paired and unpaired designs are equivalent in terms of cost.
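A sketch of the paired formula, showing how positive correlation shrinks the requirement (the ρ values here are illustrative):

from scipy.stats import norm

def n_paired_total(delta, sigma, rho, alpha=0.05, power=0.80):
    """Approximate total n for the paired design; rho is the within-subject
    correlation between the two measurements."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * (1 - rho) * sigma**2 * (z_alpha + z_beta)**2 / delta**2

print(n_paired_total(1.0, 1.0, rho=0.00))   # about 16, the unpaired per-group n
print(n_paired_total(1.0, 1.0, rho=0.25))   # about 12
print(n_paired_total(1.0, 1.0, rho=0.50))   # about 8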
D. Two-Group Comparison of Proportions

If the primary outcome is binary (e.g., yes/no), the population proportions will typically be compared using the χ² test or Fisher's exact test (see Chapter 82, this volume). This is another setting for which the approximate sample size formula is relatively simple and close enough that one can do reasonable "back of the envelope" calculations. The approximate sample size required for each group is

n = [z_{1−α/2}√(2π̄(1 − π̄)) + z_{1−β}√(π1(1 − π1) + π2(1 − π2))]² / (π1 − π2)²,

where π1 and π2 are the expected proportions in each of the two groups and π̄ = (π1 + π2)/2. The quantity (π1 − π2) corresponds to the expected effect size, and the null hypothesis being tested in this case is H0: π1 − π2 = 0. For example, a study into the effectiveness of a new prophylactic drug for preventing postoperative infection might utilize a two-group design. One group would receive the new drug and the other would receive a placebo. The estimated value for π in the placebo group could be reasonably based on the incidence of postoperative infection under the current standard of care, say 10%. If the minimal clinically relevant effect of the new drug were 5%, then π1 = 0.10, π2 = 0.05 (it does not matter which is which), and π̄ = 0.075. In order to ensure 80% power to detect this difference at the α = 0.05 significance level, it would be necessary to enroll

n = [1.96√(2 × 0.075(1 − 0.075)) + 0.84√(0.1(1 − 0.1) + 0.05(1 − 0.05))]² / (0.10 − 0.05)² ≈ 434

subjects in each group. It is generally the case that larger sample sizes are required for binary outcomes than for continuous outcomes. This is because there is less information available (from a quantitative point of view) in a binary outcome than in a continuous outcome. It is advantageous, therefore, to utilize continuous outcomes whenever possible. This is not always possible, however, particularly when the primary outcome is the presence or absence of some disease or symptom.
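The postoperative infection example can be checked with a few lines of code (a sketch assuming SciPy; the result differs trivially from the hand calculation because the z values are not rounded):

from math import sqrt
from scipy.stats import norm

def n_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group n for comparing two proportions."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))**2
    return numerator / (p1 - p2)**2

# 10% infection on placebo versus a minimally relevant 5% on the new drug:
print(n_two_proportions(0.10, 0.05))   # about 434 per group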
E. Confidence Intervals

Not all research questions can be phrased in terms of hypothesis testing. In some cases the goal of the study is to describe the population of interest with respect to some parameter. For example, one might wish to assess the sensitivity of a new diagnostic test in a population of patients known to have the disease of interest. The sensitivity of a diagnostic test is the proportion of diseased patients who have a positive test result. A rough approximation to the exact formula for calculating confidence intervals for proportions, suitable for "back of the envelope" calculations, is

p ± z_{1−α/2}√(p(1 − p)/n),

where p is the observed proportion and 100(1 − α)% is the desired confidence level, so that α = 0.05 corresponds to a 95% confidence interval. The sample size of a proposed study can be chosen to yield a confidence interval with specified precision, e.g., p ± 0.05. The necessary sample size can be obtained by setting the right side of the above equation equal to the desired precision and solving for n:

n = p(1 − p)(z_{1−α/2}/w)²,

where w is the desired precision of the interval.

For example, the sample size required to yield a 95% confidence interval with precision w = 0.05 is

n = p(1 − p)(1.96/0.05)² = p(1 − p) × 1537.

Note that the sample size depends on the observed proportion. Obviously, the observed proportion is unknown before the study is completed. It turns out that the most conservative (leading to the largest sample size) answer is obtained by using p = 0.5. In this case, that approach would give a sample size of 385 (0.5 × 0.5 × 1537). If it is safe to assume that the true proportion is different from 0.5, one might reasonably guess that the observed proportion will also be different from 0.5. For example, one might expect the diagnostic test to have sensitivity close to 0.9 because of its similarity to another well-established test with that sensitivity. Using p = 0.9 would give a sample size of 139 (0.9 × 0.1 × 1537), less than half the sample size required when using p = 0.5. Clearly, using p = 0.5 can be extremely conservative and should not be done if other reasonable estimates of p are available. On the other hand, using values for p that are either too close to one or too close to zero would lead to a study design with an inadequate sample size. This example serves to underscore the fact that sample size calculations must be carefully and thoughtfully done if they are to be useful.
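A sketch of this calculation follows (assuming SciPy; the chapter's figures of 385 and 139 come from rounding (1.96/0.05)² up to 1537, so the unrounded results are a subject or two smaller):

from scipy.stats import norm

def n_for_proportion_ci(p, w, alpha=0.05):
    """n so that the 100(1 - alpha)% CI for a proportion has half-width w."""
    z = norm.ppf(1 - alpha / 2)
    return p * (1 - p) * (z / w)**2

print(n_for_proportion_ci(0.5, 0.05))   # about 384, the conservative choice
print(n_for_proportion_ci(0.9, 0.05))   # about 138, using the anticipated sensitivity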
Another setting in which the confidence interval approach is useful is when the goal of the study is to measure the correlation between two outcomes. A rough approximation to the exact formula for calculating confidence intervals for the correlation is

r ± z_{1−α/2}(1 − r²)/√n,

so that the sample size necessary to yield a confidence interval with precision w can be approximated by

n = (1 − r²)²(z_{1−α/2}/w)².

The sample size is largest when r = 0 and decreases as r increases. As with proportions, sample sizes based on the confidence interval approach for correlations must be chosen carefully and with thoughtful consideration.
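The corresponding calculation for correlations, in the same style (the r and w values below are illustrative; the n formula is obtained by solving the displayed interval for n):

from scipy.stats import norm

def n_for_correlation_ci(r, w, alpha=0.05):
    """n so that the 100(1 - alpha)% CI for a correlation has half-width w."""
    z = norm.ppf(1 - alpha / 2)
    return ((1 - r**2) * z / w)**2

print(n_for_correlation_ci(0.0, 0.10))   # about 384; largest at r = 0
print(n_for_correlation_ci(0.5, 0.10))   # about 216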
IV. Summary

This chapter was written for a wide readership of surgical investigators. The essential considerations required for the appropriate design of proposed studies are available in many biostatistical texts; a list of references (1–9) appears at the end of this chapter. It is recommended that the serious investigator spend time reading the cited narratives, because the format is designed to lead the reader through the assumptions and logic underlying the application of sample size calculations. Crucial to the proposal is the clear statement of a testable hypothesis or desired confidence interval. The design of the study must reflect the investigator's attention to the trade-off between study cost (sample size) and benefit (power or precision). The formulas presented here are good approximations that can be used in relatively simple design settings. The reader is encouraged to seek consultation with an experienced applied biostatistician for help with more complex designs. There are many pitfalls in the use of sample size calculations for the inexperienced or unaware user.

References

1. Bland, M. (1995). "An Introduction to Medical Statistics," 2nd Ed. Oxford Univ. Press, London.
2. Wassertheil-Smoller, S. (1995). "Biostatistics and Epidemiology: A Primer for Health Professionals," 2nd Ed. Springer-Verlag, New York.
3. Campbell, M. J., and Machin, D. (1993). "Medical Statistics: A Commonsense Approach," 2nd Ed. John Wiley & Sons, New York.
4. Altman, D. G. (1991). "Practical Statistics for Medical Research." Chapman & Hall, London.
5. Fisher, L. D., and Van Belle, G. (1993). "Biostatistics: A Methodology for the Health Sciences." John Wiley & Sons, New York.
6. Clarke, G. M. (1994). "Statistics and Experimental Design." Edward Arnold, London.
7. Pocock, S. J. (1983). "Clinical Trials: A Practical Approach." John Wiley & Sons, New York.
8. Woolson, R. F. (1987). "Statistical Methods for the Analysis of Biomedical Data." John Wiley & Sons, New York.
9. Friedman, L. M., Furberg, C. D., and DeMets, D. L. (1998). "Fundamentals of Clinical Trials," 3rd Ed. Springer-Verlag, New York.
