Shih 2009
To cite this article: Weichung Joe Shih (2009) Two-Stage Sample Size Reassessment Using
Perturbed Unblinding, Statistics in Biopharmaceutical Research, 1:1, 74-80, DOI: 10.1198/sbr.2009.0007
Two-Stage Sample Size Reassessment Using
Perturbed Unblinding
Sample size reassessment (SSR) is an increasingly popular strategy for designing and conducting clinical trials. In particular, SSR based on updating the variance estimate is a prudent practice accepted by the regulatory authorities to assure adequate power for a study. Since its development in the early 1990s, however, debate has continued over whether a treatment-blinded or unblinded approach should be used for SSR based on the variance estimate. A blind procedure is preferred from the regulatory standpoint, because it better preserves the study integrity; however, it does not provide the best-unbiased estimate of the variance. On the other hand, the usual unblinded analysis reveals the treatment effect, which leads to controversy regarding the interpretation of the targeted effect size as well as concerns of inflating the Type I error and possibly biasing the trial. In this article, we devise a novel solution to this problem, one that uses perturbed unblinding to estimate the variance but still keeps the treatment effect masked. We then give a bias-corrected final test, which preserves the Type I error rate. We also discuss several points of consideration, with a focus on the issues of SSR, that were raised at the recent workshop on adaptive designs held jointly by the U.S. Food and Drug Administration (FDA) and the Pharmaceutical Research and Manufacturers of America (PhRMA). In particular, we propose a switch of paradigm from SSR to a two-stage design for clinical trials to alleviate the concern of possible “change” of behavior due to “change” of sample size.

1. Introduction

It is a rare occurrence for a common or leading theme to appear in three peer-reviewed statistical journals around the same time. From August to October 2006, many authors contributed to the discussions in Statistics in Medicine (vol. 25, Oct. 15, 2006), Biometrical Journal (vol. 48, August 2006), and Biometrics (vol. 62, September 2006), regarding adaptive (or flexible) designs, showing how dear this subject is to clinical biostatisticians. Among many kinds of adaptive or flexible designs in clinical trials, sample size reassessment (SSR) is of the greatest interest to practitioners (Burman and Sonesson 2006), as well as the most frequently requested matter for regulatory comments at the U.S. Food and Drug Administration (FDA) from the pharmaceutical industry (Wang 2006). Several factors have contributed to SSR’s popularity. First, sample size (or, in a general sense, information related to study power) calculation is the most basic requirement for planning all studies, as it is critical to the success of a study and pertains to the budget consideration. Second, researchers are always confronted with the issue of uncertainty regarding the effect size and/or variability assumptions needed for sample size calculation. Regulatory guidelines (CPMP Working Party on Efficacy of Medical Products 1995; ICH-E9 Expert Working Group 1999; EMEA/CHMP 2006) recognize this issue very well. Third, compared to other mid-course changes in treatment arms, endpoints, superiority/noninferiority, phase II/III combination, and so on, methods for SSR were perhaps developed earliest in the literature (Shih
2001). Last but not least, several commercial or free computer software packages are now available for performing a number of SSR procedures (Wassmer and Vandemeulebroeke 2006).

Despite the various methods for sample size reassessment that have been proposed and used in practice, a basic debate remains: Should the interim efficacy data be examined with treatment group blinded or unblinded? Regulatory authorities certainly favor blinded SSR (CPMP Working Party on Efficacy of Medical Products 1995; ICH-E9 Expert Working Group; EMEA/CHMP 2006), since it does not unveil the treatment effect size, and hence provides more protection for the trial’s integrity. But blinding the treatment group obviously casts some doubt on the information obtained. Estimation of the variance is best done with unblinded data. To resolve this dilemma, we propose a simple method that uses the ungrouped interim data for vari-

conventional notation, let

$$S_W^2 = \frac{\sum_{i=1}^{2} \sum_{j=1}^{n_1} (y_{ij} - \bar{y}_{i\cdot})^2}{2(n_1 - 1)},$$

and

$$S_T^2 = \frac{\sum_{i=1}^{2} \sum_{j=1}^{n_1} (y_{ij} - \bar{y}_{\cdot\cdot})^2}{2n_1 - 1}$$

be the sample within-group and total variances, respectively, at the interim stage. When only blind data are used, the index $i$ for treatment group is masked, hence $y_{ij}$ would be condensed as $y_k$, where $k$ runs through 1 to $2n_1$, and the within-group sample mean $\bar{y}_{i\cdot}$ and variance $S_W^2$ would not be known. Gould and Shih (1992) proposed two methods that use only blind data to estimate $\sigma^2$ midstream. One is a simple formula based on
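As an illustration of the two interim variances defined above, the following minimal Python sketch computes $S_W^2$ and $S_T^2$ on simulated data. The data, the function name `interim_variances`, and all numeric settings are hypothetical. The final check uses the standard one-way ANOVA decomposition for two equal-sized groups (it is not a formula taken from the article); it shows that the blinded total variance absorbs the squared difference between the group means, which is why it is not an unbiased estimate of $\sigma^2$ when a treatment effect is present.

```python
import random
import statistics


def interim_variances(g1, g2):
    """Return (S_W^2, S_T^2) for two groups of equal size n1."""
    n1 = len(g1)
    m1, m2 = statistics.fmean(g1), statistics.fmean(g2)
    pooled = g1 + g2
    grand = statistics.fmean(pooled)
    # Within-group sum of squares uses each group's own mean (needs unblinding).
    ss_within = sum((y - m1) ** 2 for y in g1) + sum((y - m2) ** 2 for y in g2)
    # Total sum of squares uses only the overall mean (available under blinding).
    ss_total = sum((y - grand) ** 2 for y in pooled)
    return ss_within / (2 * (n1 - 1)), ss_total / (2 * n1 - 1)


random.seed(0)
n1 = 30  # hypothetical interim size per group
g1 = [random.gauss(0.0, 2.0) for _ in range(n1)]
g2 = [random.gauss(1.5, 2.0) for _ in range(n1)]

s_w2, s_t2 = interim_variances(g1, g2)
diff = statistics.fmean(g1) - statistics.fmean(g2)

# ANOVA identity: (2*n1 - 1) * S_T^2 = 2*(n1 - 1) * S_W^2 + n1 * diff^2 / 2
lhs = (2 * n1 - 1) * s_t2
rhs = 2 * (n1 - 1) * s_w2 + n1 * diff ** 2 / 2
print(f"S_W^2 = {s_w2:.3f}, S_T^2 = {s_t2:.3f}, identity gap = {abs(lhs - rhs):.2e}")
```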
Statistics in Biopharmaceutical Research 2009.1:74-80.
Statistics in Biopharmaceutical Research: Vol. 1, No. 1
Unblinding, or at least partial unblinding (i.e., representing treatment groups by dummy alphabets), is certainly an option. The unblinded pooled within-group sample variance, $S_W^2$, is the minimum variance unbiased estimator of $\sigma^2$ based on the first $2n_1$ patients. Some authors in this camp have advocated the use of an Independent Data Monitoring Committee (IDMC) to examine the unblinded or partially unblinded data. Others, however, feel that another separate independent committee should be appointed to carry out the SSR (Quinlan and Krams 2006; Chuang-Stein et al. 2006), since SSR is for a different purpose, and no other data than the primary endpoint should be involved. No matter who will be doing the unblinded SSR, it puts the unblinded group of experts in an awkward position to just use the observed sigma but somehow to disregard the observed delta, even though how to use the observed delta is also highly controversial (Hung 2006). Therefore, the best approach would be to unblind the data for estimating the variance but still to keep the treatment effect size masked. In the next section, we propose such a method.

3. Proposed Method

3.1 Unblinded Estimation of $\sigma^2$ Without Unmasking the Treatment Effect

It is well recognized that a successful conduct of any clinical trial requires cross-departmental coordination. In the pharmaceutical industry, the computer programming group usually handles the randomization schedule in a discreet way, after the project statistician provides the blocking factor, the number of sites, and other required specifications. The programmer chooses a seed number in the randomization generator. Standard operating procedures (SOPs) have long been established, for instance, to direct the linkage between the randomization and packaging department to properly ship the study drugs to investigators. They have also automated the process for the analysis programs to run when unblinding the data occurs. Firewalls have been established to prevent others from gaining access to the randomization schedule and the associated information that generates it, such as the seed number, the block size, and so on. This programming function may be handled in-house or externally by a contracted clinical research organization (CRO). The following novel method of “perturbed unblinding” allows one to extend this experience to the sample size reassessment.

During the interim analysis for SSR, the designated efficacy endpoint data, $Y$, will be unblinded, but with the following modification. Before the unblinding occurs, the programmer is instructed by the project statistician, according to an established SOP, to add a number of the programmer’s choice (say, $a$) to everyone’s $Y$ in one of the treatment groups, and another number (say, $b \neq a$) to everyone’s $Y$ in the other group. These constants $a$ and $b$ are protected by at least the same degree of confidentiality as the randomization schedule’s seed number and blocking factor are; no one in the project team, not even the statistician at this time, can know them. Since these constants are sealed, no “back-calculation” is possible to reveal the interim treatment effect. At the final data analysis, the programmer is to remove/subtract these constants from the individual patients. This modified unblinding with shift perturbation is termed “perturbed unblinding.” (I suppose one can even view that $a = b = 0$, that is, no perturbation at all, is a very special choice of the perturbation factors, when they are also kept secret.)

Of course, the calculation of the sample within-group variance is not affected by such translations of $Y$. But the unblinded treatment effect size will be shifted by a factor that is unknown to the project team. The (perturbed) unblinded sample variance, denoted by $S_1^2$, is the best (minimum variance) unbiased estimator of $\sigma^2$ based on the first $2n_1$ patients. That is,

$$S_1^2 = S_W^2 = \frac{\sum_{i=1}^{2} \sum_{j=1}^{n_1} (y_{ij} - \bar{y}_{i\cdot})^2}{2(n_1 - 1)}$$

numerically; however, it is calculated with the perturbed $y_{ij}$, especially with the shifted sample mean in each treatment group. With $S_1^2$ we will then recalculate the new sample size $n \geq n_1$ (per group) so that the final test will have power of $1 - \beta$ to detect $\delta = \delta^*$ at the significance level $\alpha$:

$$n = \max\left\{ n_1,\ \frac{2\,(t_{2(n_1-1),\alpha/2} + t_{2(n_1-1),\beta})^2\, S_1^2}{\delta^{*2}} + 1 \right\},$$

where $t_{j,k}$ is the upper $k$th percentile of the $t$-distribution with $j$ degrees of freedom. This new sample size formula allows for a decrease of sample size from the originally planned $n_0$. If we do not want to decrease from $n_0$, then

$$n = \max\left\{ n_0,\ \frac{2\,(t_{2(n_1-1),\alpha/2} + t_{2(n_1-1),\beta})^2\, S_1^2}{\delta^{*2}} + 1 \right\}.$$

A general expression is

$$n = \max\left\{ n_1 + n_{\min},\ \frac{2\,(t_{2(n_1-1),\alpha/2} + t_{2(n_1-1),\beta})^2\, S_1^2}{\delta^{*2}} + 1 \right\},$$

which includes $n_{\min} = 0$ for the former case, $n_{\min} = n_0 - n_1$ for the latter case, and other situations such as
$n_{\min}$ = a fraction of $n_1$ (due to some patients being enrolled but not yet reaching the endpoint for SSR). Furthermore, in practice, we may also need to cap the sample size to a resource permissible upper limit, say $n_{\max}$. Then we will have

$$n = \min\left\{ \max\left\{ n_1 + n_{\min},\ \frac{2\,(t_{2(n_1-1),\alpha/2} + t_{2(n_1-1),\beta})^2\, S_1^2}{\delta^{*2}} + 1 \right\},\ n_{\max} \right\}.$$

Of course, this modified unblinding procedure also works with non-normal data after suitable transformation. In the next subsection we further discuss the choice of these perturbation constants together with more consideration on the implementation aspects.

factors, $(a_1, b_1)$ and $(a_2, b_2)$, one for each of the subgroups. We will then obtain two unbiased estimates of $\sigma^2$ and, since they are based on the same number of patients, we simply average them. This extension adds complexity and masking protection, but loses some minor efficiency of the estimate. The relative efficiency to the previous estimate using no subgroup can easily be shown to be $(n_1 - 2)/(n_1 - 1)$.

Furthermore, the SOP can also guide some possible ways of making up these shifting constants. For example, the project statistician can do a preliminary calculation from the blind data to obtain the overall range, quintiles, and/or other summary statistics of the primary endpoint $Y$. Then the programmer can pick two from this list, use them either directly or with some modification (such as dividing the range by 2, 3, or 4, for example) as shifting constants $a$ and $b$. An advantage of using summary
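The relative efficiency $(n_1 - 2)/(n_1 - 1)$ quoted for the subgroup extension can be checked with a short calculation. The sketch below rests on assumptions not spelled out in the surviving text: normally distributed responses, and each treatment arm's $n_1$ interim patients split into two subgroups of $n_1/2$. It uses the standard normal-theory result that a variance estimator with $\nu$ degrees of freedom has variance $2\sigma^4/\nu$; the symbols $S_{(1)}^2$ and $S_{(2)}^2$ for the two subgroup estimates are introduced here for illustration.

$$\mathrm{Var}(S_1^2) = \frac{2\sigma^4}{2(n_1 - 1)} = \frac{\sigma^4}{n_1 - 1}$$

$$\mathrm{Var}\big(S_{(k)}^2\big) = \frac{2\sigma^4}{2(n_1/2 - 1)} = \frac{2\sigma^4}{n_1 - 2}, \qquad k = 1, 2$$

$$\mathrm{Var}\left(\frac{S_{(1)}^2 + S_{(2)}^2}{2}\right) = \frac{1}{4}\cdot 2 \cdot \frac{2\sigma^4}{n_1 - 2} = \frac{\sigma^4}{n_1 - 2}$$

$$\text{relative efficiency} = \frac{\sigma^4/(n_1 - 1)}{\sigma^4/(n_1 - 2)} = \frac{n_1 - 2}{n_1 - 1}$$

With $n_1 = 30$, for example, the ratio is $28/29 \approx 0.966$, which matches the description of the loss as minor.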
$(\bar{X}_N, S_1^2, S^2)$ is not complete, hence the existence of a best unbiased estimate of $\sigma^2$ is not guaranteed. Furthermore, since the (negative) bias depends on the unknown $\sigma^2$, it cannot be accordingly corrected. Miller (2005) found the lower bound of the bias

$$c = -\frac{f}{\nu (f - 2)},$$

where

$$\nu = \frac{2\,(t_{f,1-\alpha/2} + t_{f,1-\beta})^2}{\delta^{*2}},$$

and $f = 2(n_1 - 1)$, and proposed the following estimator:

$$S_{mc}^2 = \begin{cases} S^2 - c & \text{if } n > n_1 + n_{\min}, \\ S^2 & \text{if } n = n_1 + n_{\min}. \end{cases}$$

(Proof that Miller’s bias correction $c$ also applies to the

4. Discussion

For any adaptive design to be acceptable to regulatory authorities, not only must the methodology be valid, but also the process that implements the method needs to be set up so that no potential bias will be introduced by the interim data analysis. In this section, we discuss several “points for consideration” that were raised at the recent FDA/PhRMA Annual Workshop on adaptive designs (Wang 2006). We focus only on the issues pertinent to SSR. From this discussion, we also offer a possible future paradigm change to the concept of SSR.

A. Confirmatory Aspect of a Late Phase II or III Study

Regulatory concern: A foremost concern by the regulatory authority is that the late phase trials are supposed to confirm hypotheses generated in earlier trials. The need to reassess sample size (and other design specifications) “typically changes the emphasis from a confirmatory trial to an exploratory trial.”

Discussion: For sample size calculation based on a continuous endpoint, this concern is probably more on the hypothesized treatment effect size than on the variability (sigma), since the hypotheses are usually expressed in terms of the treatment effect size (delta). On the other hand, the sense of a phase III trial being confirmatory still has room for debate. For example, phase III trials are seldom a repetition of previous phase II trials for “confirmation.” A different patient population (e.g., mild to moderate disease condition for phase II and moderate to severe disease condition for phase III), a different version of instrument or technology (e.g., improved densitometer to measure the bone mineral density or sharpened questionnaires to count symptom scores for phase III after phase II trials), and so on, make the phase III study hardly an absolute confirmation of what was observed in phase II studies. Hence the notion of the confirmatory aspect of the phase III trial should not be in the sense of repetition (but in the sense of testing the same clinical hypothesis). Treatment effect size and variability are

B. Changing the Sample Size May Induce Behavior Changes in Study Personnel

Regulatory concern: With a sample size increase some study investigators may get the message that the new treatment is not working as well as predicted, which may cause some patients to withdraw from the study. If it were a decrease of sample size, then some investigators may guess that the new treatment works better than predicted. The sponsor may also be encouraged somehow. This fluctuation of originally planned sample size more or less leaks the treatment effect and may then induce certain behavior changes, which would ultimately bias the study.

Discussion: We recommend that the justification of SSR should be given in the protocol, and that the investigators are informed that the SSR will be conducted only to ensure sufficient power in the presence of uncertainty regarding the inter-patient variability; no one will have information regarding the treatment effect size. No other specific information about the SSR shall be provided to the investigators. We also recommend that when the new sample size is calculated based on the updated variance estimate, a new patient entry cut-off day can be estimated accordingly with the updated patient enrollment and dropout rates. This new patient entry cut-off day, but not the sample size, can be provided to the investigators later when the target number of patients is near. The investigators will not be informed whether the new cut-off date is due to slower-than-planned enrollment rate or the SSR.

Our ultimate solution to this concern is: do not think of change; think of a two-stage design. In reviewing the original sample size reassessment or internal pilot methods one finds that the starting literature was Stein’s 1945 two-stage sampling method (Shih 2001; Proschan 2005). The application and extension of Stein’s method to the current standard practice of providing sample size estimation in a study protocol leads to the term of sample size reassessment. However, in Stein’s idea, there was no “original sample size,” only the sample size of the first stage ($n_1$) and that of the next stage ($n_2$) derived from the results of the $n_1$ samples. Hence, if we adopt a similar two-stage design concept, there will be no “originally planned total sample size” to change from. A careful review of Sections 2 and 3, where the new method is proposed, shows that we never use the original sample size $n_0$ after it was introduced. This concept of “two-stage design, not a sample size recalculation” was delineated in a commentary paper by Shih (2006). If for budgetary reasons we need to give a sample size at the beginning of a study, then we give the expected sample size or a range of sample sizes. Actually, it is true that in any sequential design one cannot give a fixed sample size, but an expected sample size since sample size depends on interim results and is a random variable. In fact the practice of not giving a fixed but an expected sample size is quite common for Phase-I dose finding algorithm-based designs such as “3 + 3” or the general “A + B” designs (Lin and Shih 2001) as well as for the model-based designs such as CRM (Continual Reassessment Method) in cancer studies. In single-arm Phase IIA cancer studies, Simon’s procedure (Simon 1989) or its variation (Lin and Shih 2004) are also standard two-stage designs that do not provide a fixed sample size in the protocol, only rules to indicate how the second stage sample size depends on the possible outcome of the first-stage samples. This paradigm switch from SSR to two-stage designs is the future way to resolve this change of sample size issue.

C. Second-Stage Sample Size Depends on What Delta

Regulatory concern: It is unclear how to postulate a new delta. If the original delta (treatment difference) is the so-called minimum clinically important effect (MCIE), then the study should be planned to rule out the MCIE, not simply to rule out a zero effect. If the sample size is planned to detect an MCIE of magnitude $\delta$ with two-sided alpha = 0.05 and 90% power, then a p = 0.05 associated with an observed effect size at completion of the study means that the estimated effect size will be approximately only 60% of $\delta$ (Hung 2006), causing a controversy because a statistically significant result would no longer be clinically acceptable. There also could be concerns over bias resulting from knowledge of interim observed treatment effect.

Discussion: We believe that the original $\delta^*$ at which the power of the study is targeted should be used consistently throughout the study. A delta has to be given for a study regardless of whether the design is sequential or a fixed size. The debate as to whether the delta is in a sense of MCIE, an investigator’s expected/wishful effect, or marketable difference can go on without reference to flexible designs. Therefore, it is better to use the same approach one uses for the fixed sample size design. For the second stage, this delta should not change. The sample size for the second stage should use the original delta and the only updating information will be the variance estimate. This also agrees with Stein’s original method. SSR based on updating the variance estimate has no such controversy. If, however, there is much uncertainty about treatment effect, then the above proposed SSR can be followed by a group sequential plan with the resized total information, as recommended by Gould and Shih (1998).

D. Intentional or Unintentional Dissemination of the Interim Results

Regulatory concern (EMEA/CHMP 2006): “It is always difficult to convincingly demonstrate that no unblinded interim results have been released. Interim analyses, therefore, always introduce the possibility of damaging the integrity of a trial. A balance has to be achieved between the needs for assessing accumulating information and the risk of damaging the integrity of the trial. Routinely breaking the blind should be avoided.”

Discussion: This is more an issue for interim analyses with early stopping in mind (for efficacy, futility, or safety) than it is for SSR. The EMEA draft reflection paper (EMEA/CHMP 2006), where the excerpt comes from, actually supports what is recommended in this article and elsewhere, which is that SSR would be best done with treatment effect masked (by total blinding or the proposed perturbed unblinding), performed separately from the safety data monitoring process, and conducted only once during the trial. The discussion of item B also mentioned that the investigators are notified only of the updated patient enrollment cut-off date, not the new sample size. Therefore, with the proposed SSR, there is minimal analysis and little chance of dissemination of information on the treatment effect at the interim stage.

Acknowledgments

The work is partially supported by grant no. CA-72720-10. The author would like to thank Dirk Moore for his helpful review of the initial draft of this article and the Associate Editor and referees for their helpful comments. A part of this article was presented at the National Health Research Institute and Center for Drug Evaluation Joint Symposium on “Current Advanced Statistical Issues in Clinical Trials-Adaptive Designs” at Taipei, November 25, 2006. The author also benefited from discussions with Jim Hung, Sue-Jane Wang, and Gordon Lan.

[Received March 2007. Revised August 2007.]

References

Burman, C.-F., and Sonesson, C. (2006), “Are Flexible Designs Sound?” Biometrics, 62, 664–683.

Kieser, M., and Friede, T. (2003), “Simple Procedures for Blinded Sample Size Adjustment that do not Affect the Type I Error Rate,” Statistics in Medicine, 22, 3571–3581.

Lin, Y., and Shih, W. J. (2001), “Statistical Properties of the Traditional Algorithm-Based Designs for Phase-I Cancer Clinical Trials,” Biostatistics, 2, 203–215.

Lin, Y., and Shih, W. J. (2004), “Adaptive Two-Stage Designs for Single-Arm Phase IIA Cancer Clinical Trials,” Biometrics, 60, 482–490.

Miller, F. (2005), “Variance Estimation in Clinical Studies with Interim Sample Size Reestimation,” Biometrics, 61, 355–361.

Proschan, M. (2005), “Two-Stage Sample Size Re-estimation Based on a Nuisance Parameter: A Review,” Journal of Biopharmaceutical Statistics, 15, 559–574.

Proschan, M. A., and Wittes, J. (2000), “An Improved Double Sampling Procedure Based on the Variance,” Biometrics, 56, 1183–1187.

Quinlan, J. A., and Krams, M. (2006), “Implementing Adaptive Designs: Logistical and Operational Considerations,” Drug Information Journal, 40, 437–444.

Shih, W. J. (2001), “Commentary: Sample Size Re-estimation – Journey for a Decade,” Statistics in Medicine, 20, 515–518.