Bootstrapping Techniques in Excel
Hans Pottel
Subfaculty of Medicine, KU Leuven Kulak, Kortrijk, Belgium
Introduction
Statistics is concerned with methods and techniques for drawing inferences about populations from sample data. Using the information in (small) samples, a statistician draws conclusions about the whole population with a variety of tools, methods and techniques. Statistics is
changing as modern computers and software make it possible to look at data graphically and
numerically in ways previously inconceivable. The bootstrap, permutation tests and other
resampling methods are part of this revolution. Although statisticians have embraced
resampling methods for their own use, they have not, in general, included them in their
teaching. Resampling methods can be made accessible to students at virtually every level,
using Microsoft Excel.
It is accepted that spreadsheets are a useful and popular tool for processing and presenting
data. In fact, Microsoft Excel spreadsheets have become something of a standard for data
storage, at least for smaller data sets. This, together with the fact that the program is often
bundled with new computers, naturally encourages its use for statistical analysis. Many
statisticians find this unfortunate, since Excel is clearly not a statistical package; there is no
doubt about that, and Excel has never claimed to be one. But one should face the fact that,
owing to its ready availability, many people, including professional statisticians, use Excel,
even on a daily basis, for quick and easy statistical calculations.
The aim of this article is to show how resampling can be done in Microsoft Excel, using
standard functions and some simple macro functions. We make use of Excel Data Tables to
conduct simulations, and we illustrate the techniques with many examples.
Resampling methods
To use resampling techniques, the sample data are assumed representative of the population
from which they are taken. This is probably the only requirement for resampling techniques.
Hesterberg (1998) gives a very nice review of simulation and bootstrapping in teaching
statistics.
Where can resampling methods be used?
• Resampling methods allow us to quantify uncertainty by calculating standard errors
and confidence intervals
• Resampling methods let us tackle new inference settings easily (e.g. ratio of means)
• Resampling methods help us understand the concepts of statistical inference
(performing significance tests)
What are the advantages of resampling methods?
• Fewer assumptions: no requirement for normality or large sample sizes
• Greater accuracy
• Generality: the applicability of resampling methods is quite similar for a wide range of
statistics
• Promote understanding: they build intuition by providing concrete analogies to
theoretical concepts; e.g. look at how many times a confidence interval covers the true
population mean (if you repeatedly construct 95%CIs based on random samples, about
95% of them will cover the true population mean)
One issue is that RAND cannot be reseeded by the user; Microsoft has given no clue how the
seeding is done. In Excel 2003 and 2007 one has to live with the fact that RAND is a so-called
volatile function: every time the worksheet recalculates, a whole new set of values appears in
every cell containing "=RAND()". This can be circumvented with a simple trick. Put the
formula =IF(T1;RAND();V2) in cell V2, where cell T1 holds 0 or 1 (FALSE or TRUE). The
formula contains a deliberate circular reference (so iterative calculation must be allowed) and
will only recalculate RAND() when the value in cell T1 is set to 1 (TRUE).
Excel also ships with an add-in called the 'Analysis ToolPak'. This add-in is present but not
necessarily active; to activate it, choose Tools >> Add-Ins and check the boxes next to
'Analysis ToolPak' and 'Analysis ToolPak VBA'. The Analysis ToolPak has a random number
generator (accessible via Tools >> Data Analysis) and also delivers many additional
(statistical) functions. One of these is RANDBETWEEN(Minimum; Maximum), which
returns a random integer between a minimum and maximum integer value. The function is
equivalent to INT[(Maximum – Minimum + 1) * RAND() + Minimum], so there is not really
a need to use RANDBETWEEN. If one wants to resample with replacement from a range of
cells in a spreadsheet, one can use the formula
=SMALL($A$1:$A$15;INT(COUNT($A$1:$A$15)*RAND())+1). The function
SMALL(array; k) returns the k-smallest value in an array. The array in our example is
$A$1:$A$15, and the value of k is the integer part of (the number of elements in the array
multiplied by a random value between 0 and 1) plus 1. The number of elements in our
example array is 15, so INT(15*RAND()) is an integer between 0 and 14; to obtain a random
value between 1 and 15 we add 1.
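The same draw-with-replacement logic is easy to check outside Excel. The following Python fragment (an illustrative sketch, not part of the original spreadsheet) mimics the SMALL/RAND construction: INT(n*RAND()) picks a random position in the sorted data, which is equivalent to drawing uniformly with replacement.

```python
import random

def resample_with_replacement(data):
    """Mimic n applications of =SMALL(range; INT(COUNT(range)*RAND())+1):
    each draw picks the k-th smallest element for a random k in 1..n,
    which is the same as drawing uniformly with replacement."""
    ordered = sorted(data)
    n = len(ordered)
    # int(n * random.random()) is an index 0..n-1, i.e. the (k-1)-th
    # smallest element for k in 1..n
    return [ordered[int(n * random.random())] for _ in range(n)]

sample = [3, 8, 14, 2, 15, 17]
boot = resample_with_replacement(sample)
# boot has the same size as the original and contains only original values
```

Sorting is of course unnecessary in Python (one could index the raw list directly); it is kept here only to mirror the SMALL construction.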
Procedure for bootstrapping
The sample data are assumed representative of the population from which they are taken; this
original sample serves as the starting point for resampling. Statistical inference is based on
the sampling distribution of sample statistics, and the bootstrap approximates this sampling
distribution, the so-called bootstrap distribution. The way to proceed is as follows: draw a
large number of resamples with replacement from the original sample, each of the same size
as the original sample; compute the statistic of interest for each resample; the collection of
these values forms the bootstrap distribution.
Confidence intervals
Example 1: bootstrap confidence limits
We consider the serum creatinine values of 601 women between 20 and 25 years old, as
obtained by an enzymatic method. The distribution of the values is given in the histogram
below.
[Histogram: frequency distribution of the 601 serum creatinine values (mg/dL)]
The mean and median are 0.679 mg/dL and 0.680 mg/dL respectively; the standard deviation
is 0.1194 mg/dL and the standard error is 0.004871 mg/dL. The 95%CI based on the normal
distribution is [0.670; 0.689] mg/dL. From the graph, we may consider the data normally
distributed.
If we construct 100 bootstrap samples and plot the histogram of the mean statistic, we obtain
the graph below.
The mean is 0.679 mg/dL, the standard deviation is 0.00431 mg/dL and the 95% CI is [0.672 ;
0.686 mg/dL]. The standard deviation of the bootstrap distribution is close to the standard
error of the original sample.
[Histogram: bootstrap distribution of the mean, 100 resamples (frequency vs. bin, mg/dL)]
Of course, we will obtain better estimates if we increase the number of bootstrap samples. We
present the bootstrap distribution of 500 bootstrap samples in the histogram below. The mean
is 0.679 mg/dL, standard deviation is 0.00464 mg/dL and the 95%CI is [0.671; 0.686 mg/dL].
[Histogram: bootstrap distribution of the mean, 500 resamples (frequency vs. bin, mg/dL)]
Remember that the bootstrap idea is to use resample means to estimate how the sample mean
of a sample of size 601 from this population varies because of random sampling. That is, we
are using the resampled observations as if they were real data.
To do this in Excel, we should be able to draw with replacement from a set of data. Excel has
a function SMALL (array; k) or LARGE(array; k), that gives the k-smallest or k-largest
element in an array. For example, if the elements in an array are {3, 8, 14, 2, 15, 17} and
these values are in cells A1:A6, the function SMALL(A1:A6; 3) will return the value 8, as
this value is the third smallest value in the list.
Now consider our sample of n = 601 in cells A2:A602, with the variable name 'CREA' in cell
A1. We will call the range A2:A602 the 'data_range'; you can use Excel's built-in feature to
name ranges if you want (Insert >> Name >> Define). We now use the random function
RAND(), which returns a random value between 0 and 1, and modify it so that it returns a
random integer between 1 and 601. The formula INT(COUNT(data_range)*RAND())+1 does
the trick: COUNT(data_range) returns 601, the size of our sample; multiplying by RAND()
gives a random floating-point number in [0; 601); taking the integer part gives an integer
between 0 and 600; and adding 1 gives a random integer between 1 and 601.
Clearly, the same random number generator could be obtained with the LARGE function.
So, if we assume that our data is in cells A2:A602, then we use cells B2:B602 for our
resample. In cell B1, we put ‘Resample’. In the cells B2:B602, we put
=SMALL($A$2:$A$602;INT(COUNT($A$2:$A$602)*RAND())+1). This formula gives us a
resample with replacement of the same size as our original sample.
Taking the average (or median, or whatever statistic) of B2:B602 we obtain the mean of our
resample. The bootstrap procedure tells us to take hundreds of resamples and calculate the
statistic, so that we can obtain the bootstrap distribution of the statistic. As a statistic we use
the mean for this example.
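As a sketch of what the spreadsheet is doing, the whole procedure (resample with replacement, take the mean, repeat hundreds of times) can be written in a few lines of Python; this is an illustration only, and the data set here is a small hypothetical stand-in for the 601 creatinine values.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=300):
    """Draw n_resamples resamples with replacement (each the size of the
    original sample) and return the mean of each resample."""
    n = len(data)
    return [statistics.fmean(random.choices(data, k=n))
            for _ in range(n_resamples)]

# hypothetical small data set standing in for the 601 creatinine values
data = [0.55, 0.62, 0.68, 0.68, 0.70, 0.71, 0.74, 0.80]
means = bootstrap_means(data)
# the standard deviation of `means` estimates the standard error of the mean
boot_se = statistics.stdev(means)
```

The standard deviation of `means` plays the role of the bootstrap standard error discussed above.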
We could do this using macros (VBA), but there is another way to create the bootstrap
sampling distribution: Excel's Data >> Table. Our data table will be a rectangle two columns
wide, with a formula for the resampled statistic in the top right-hand corner of the rectangle.
Put the word 'Mean' in cell F2 and in cell G2 enter the formula =AVERAGE($B$2:$B$602)
to calculate the mean of the resample. Then select F2:G301 and choose Data >> Table.
A small dialog window appears, asking for the Row input cell (which you may ignore) and
the Column input cell, where you enter a reference to cell F2. Then click 'OK' and the data
table fills with the averages of 300 resamples.
Each cell in G3:G301 refers to the reference cell F2 and the formula in G2. Each time you
change a cell in the spreadsheet, the data table will recalculate. This may take several
seconds, depending on the size of the resample and the number of bootstrap resamples. It may
therefore be wise to copy the range G2:G301 and use Paste Special >> Values; this removes
the data-table reference and stops the recalculation. Alternatively, you can turn off the
automatic recalculation of tables via Tools >> Options >> Calculation >> Automatic except
tables >> OK, and press F9 to recalculate at any time.
Another way to obtain this result, is by using VBA. The macro code is given below:
Sub stats1()
Dim R As Range
Dim Routput As Range
Dim Rresample As Range
Dim i As Integer
'original data range and offset for output: where we place the resample
Set R = Range("A2:A602")
Set Routput = Range("B1")
Routput.Value = "Resample"
Application.ScreenUpdating = False
Application.Calculation = xlCalculationManual
For i = 1 To R.Rows.Count
    Routput.Offset(i, 0).Formula = "=SMALL(" & R.Address & ",INT(COUNT(" & R.Address & ")*RAND())+1)"
Next i
Set Rresample = Routput.Offset(1, 0).Resize(R.Rows.Count, 1)
Application.Calculation = xlCalculationAutomatic
Application.ScreenUpdating = True
'start bootstrap
Routput.Offset(0, 2).Value = "Median"
Routput.Offset(0, 3).Value = "Pct2.5"
Routput.Offset(0, 4).Value = "Pct97.5"
Routput.Offset(0, 5).Value = "Average"
'300 bootstrap resamples: 4 statistics are calculated for each resample: median, 2.5% Pct, 97.5% Pct and mean
For i = 1 To 300
    Routput.Offset(i, 2).Value = WorksheetFunction.Median(Rresample)
    Routput.Offset(i, 3).Value = WorksheetFunction.Percentile(Rresample, 0.025)
    Routput.Offset(i, 4).Value = WorksheetFunction.Percentile(Rresample, 0.975)
    Routput.Offset(i, 5).Value = WorksheetFunction.Average(Rresample)
Next i
'summary statistics: mean, 90% confidence interval and standard deviation are calculated for each statistic
'for the median
Set Rresample = Range(Routput.Offset(1, 2), Routput.Offset(i - 1, 2))
Routput.Offset(i + 2, 2).Formula = "=Average(" & Rresample.Address & ")"
Routput.Offset(i + 3, 2).FormulaR1C1 = "=NORMINV(0.05,R[-1]C,R[+2]C)"
Routput.Offset(i + 4, 2).FormulaR1C1 = "=NORMINV(0.95,R[-2]C,R[+1]C)"
Routput.Offset(i + 5, 2).Formula = "=Stdev(" & Rresample.Address & ")"
'for Pct2.5
Set Rresample = Range(Routput.Offset(1, 3), Routput.Offset(i - 1, 3))
Routput.Offset(i + 2, 3).Formula = "=Average(" & Rresample.Address & ")"
Routput.Offset(i + 3, 3).FormulaR1C1 = "=NORMINV(0.05,R[-1]C,R[+2]C)"
Routput.Offset(i + 4, 3).FormulaR1C1 = "=NORMINV(0.95,R[-2]C,R[+1]C)"
Routput.Offset(i + 5, 3).Formula = "=Stdev(" & Rresample.Address & ")"
'for Pct97.5
Set Rresample = Range(Routput.Offset(1, 4), Routput.Offset(i - 1, 4))
Routput.Offset(i + 2, 4).Formula = "=Average(" & Rresample.Address & ")"
Routput.Offset(i + 3, 4).FormulaR1C1 = "=NORMINV(0.05,R[-1]C,R[+2]C)"
Routput.Offset(i + 4, 4).FormulaR1C1 = "=NORMINV(0.95,R[-2]C,R[+1]C)"
Routput.Offset(i + 5, 4).Formula = "=Stdev(" & Rresample.Address & ")"
'for the mean
Set Rresample = Range(Routput.Offset(1, 5), Routput.Offset(i - 1, 5))
Routput.Offset(i + 2, 5).Formula = "=Average(" & Rresample.Address & ")"
Routput.Offset(i + 3, 5).FormulaR1C1 = "=NORMINV(0.05,R[-1]C,R[+2]C)"
Routput.Offset(i + 4, 5).FormulaR1C1 = "=NORMINV(0.95,R[-2]C,R[+1]C)"
Routput.Offset(i + 5, 5).Formula = "=Stdev(" & Rresample.Address & ")"
'numbers are formatted to 3 digits
Routput.Offset(i + 2, 2).Resize(4, 4).NumberFormat = "0.000"
End Sub
The macro first generates the resample by placing the formula
=SMALL(data_range;INT(COUNT(data_range)*RAND())+1) in cells B2:B602. We turn off
automatic screen updating and automatic recalculation to speed up filling the cells with this
formula.
Then we calculate 300 bootstrap resamples and put the statistics median, Pct2.5, Pct97.5 and
mean in cells D2:D301, E2:E301, F2:F301 and G2:G301 respectively. While the macro is
running, you can follow this process on the screen (for small resamples this will not be
visible, because resampling is too fast): you see the resampling, followed by the calculated
statistics. Finally, summary statistics are calculated for each bootstrap statistic: the mean of
the bootstrap statistic, the 90% confidence interval for it, and the standard deviation of the
bootstrap distribution, which corresponds to the standard error of the statistic in the original
sample. We used NORMINV(0.05; Mean; Stdev) to calculate the value that corresponds to
the 5% lower tail of the normal distribution with mean 'Mean' and standard deviation 'Stdev';
this can be seen as the lower limit of the 90% confidence interval for the mean.
In this example, the original sampling distribution closely corresponds to the Gaussian bell-
shaped distribution we expect for normally distributed data. In some cases, however, the
distribution may be skewed. You may then ask yourself whether the mean and standard
deviation are the right statistics, what you can do instead, and for which statistic confidence
intervals should be calculated. The mean might not be appropriate. The bootstrap provides a
way out of this dilemma: it can produce a resampling distribution that can be used to set
confidence limits.
Example 2: a resistant measure of center
We are interested in the sales prices of residential property in Seattle (example from Tim
Hesterberg). The data available from the county assessor's office do not distinguish
residential property from commercial property, so a few large commercial sales are present in
the sample and may greatly increase the mean selling price. Therefore, we prefer to use a
measure of center that is more resistant than the mean.
Selling prices (in €1000) for real estates (from Tim Hesterberg Table 18.1)
142 232 132,5 200 362 244,95 335 324,5 222 225
175 50 215 260 307 210,95 1370 215,5 179,8 217
197,5 146,5 116,7 449,9 266 265 256 684,5 257 570
149,5 155 244,9 66,407 166 296 148,5 270 252,95 507
705 1850 290 165,95 375 335 987,5 330 149,95 190
The data from the table are presented in a histogram, showing a strongly skewed distribution,
with several outliers, which may be commercial sales.
[Histogram: frequency distribution of the selling prices, 0–2000 (€1000)]
The bootstrap distribution of the mean based on 1000 resamples is shown in the histogram
below.
[Histogram: bootstrap distribution of the mean, 1000 resamples (frequency vs. bin, ≈ 210–450 €1000)]
This distribution is skewed to the right, so the mean of the resamples cannot be considered
normally distributed. This reflects the fact that, for most statistics, the bootstrap distribution
approximates the shape, spread and bias of the actual sampling distribution. We have two
alternatives: use a confidence interval not based on normality, or choose a measure of center
whose distribution is closer to normal. Here we show the bootstrap distribution of a different
statistic, one that is more resistant to skewness and outliers. One such statistic is the median,
the mean of the 1 or 2 middle observations. Another is the trimmed mean, e.g. the 25%
trimmed mean, which averages the middle 50% of the observations (25% trimmed from each
tail). In Excel you can use the functions MEDIAN(array) and TRIMMEAN(array; 0.5) for
this; note that TRIMMEAN's second argument is the total fraction of observations excluded,
split over both tails.
When we use the median statistic, we obtain the following histogram for the bootstrap
distribution:
[Histogram: bootstrap distribution of the median (frequency vs. bin, 200–310 €1000)]
This shows that the median does not always work well: the shape of the distribution is not
easy to characterize. Bootstrapping trimmed means works better than bootstrapping medians,
because the bootstrap does not work well for statistics that depend on only 1 or 2
observations. When we use the 25% trimmed mean (50% of the data is trimmed off, 25% on
each side), we obtain:
[Histogram: bootstrap distribution of the 25% trimmed mean (frequency vs. bin, ≈ 200–340 €1000)]
The bootstrap confidence interval can be calculated from the mean of the bootstrap statistic
and the bootstrap standard deviation:
Statistic ± t * SE
where t is the critical value of the t(n−1) distribution (here n = 50), obtained in Excel with
TINV(0.05; 49) = 2.009575. Here the mean of the 1000 trimmed means was 244.12 and the
standard deviation was 16.45, so the 95%CI for the trimmed mean becomes (211.06; 277.18).
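The trimmed-mean calculation and the t-based interval can be sketched in Python (an illustration using the figures reported above; the `trimmed_mean` helper is ours, not an Excel function):

```python
import statistics

def trimmed_mean(values, trim=0.25):
    """25% trimmed mean: drop trim*n observations from each tail and
    average the remainder (the middle 50% when trim = 0.25)."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    return statistics.fmean(ordered[k:len(ordered) - k])

# t-based bootstrap CI, Statistic ± t*SE, with the figures from the text:
# bootstrap mean of the 1000 trimmed means = 244.12, bootstrap SD (= SE)
# = 16.45, and the critical value TINV(0.05; 49) = 2.009575
stat, se, t = 244.12, 16.45, 2.009575
ci = (stat - t * se, stat + t * se)
# ci reproduces the interval (211.06; 277.18) reported above
```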
Example 3: comparing two samples
In a two-sample problem, we wish to compare two populations, based on separate samples
from each population. When both populations are roughly normal, we can use the two-sample
t-test to compare the population means. The bootstrap can also compare two populations,
without the normality condition and without the restriction to comparing means.
We proceed as follows. Suppose we have two samples of sizes n and m from the two
populations. We draw a resample of size n with replacement from the first sample and a
separate resample of size m from the second sample. We compute a statistic that compares
the two groups, e.g. the difference between the two sample means. We repeat this resampling
process hundreds of times, construct the bootstrap distribution of the statistic, and inspect its
shape, bias and bootstrap standard error as we did previously.
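As an illustration of this two-sample procedure (in Python rather than Excel, and with made-up group data standing in for the nucleus-area measurements):

```python
import random
import statistics

def bootstrap_diff_means(sample1, sample2, n_resamples=500):
    """Resample each group separately with replacement (keeping the
    original group sizes n and m) and return the difference in means
    for every combined resample."""
    n, m = len(sample1), len(sample2)
    return [statistics.fmean(random.choices(sample1, k=n))
            - statistics.fmean(random.choices(sample2, k=m))
            for _ in range(n_resamples)]

# hypothetical data standing in for the 10 mM and 5 mM groups
group_high = [52, 60, 75, 90, 110, 130, 95, 80]
group_low = [40, 45, 55, 70, 85, 60, 50, 65]
diffs = bootstrap_diff_means(group_high, group_low)
# diffs is the bootstrap distribution of the difference in means
```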
We consider data on the nucleus area of cells treated with 5 mM and 10 mM FeCl2 (oxidative
stress). We expect more aggregation with increasing FeCl2 concentration, which results in a
larger total nucleus area. The distribution of the nucleus area for cells treated with 5 mM
FeCl2 is given in the first histogram below, and for 10 mM FeCl2 in the second.
[Histogram: nucleus area for cells treated with 5 mM FeCl2 (frequency vs. bin)]
[Histogram: nucleus area for cells treated with 10 mM FeCl2 (frequency vs. bin)]
Both distributions are right skewed. We want to estimate the difference of population means
µ1 − µ2, but since both distributions are quite skewed we should be reluctant to use the two-
sample t confidence interval. To compute the bootstrap distribution for the difference in
sample means, we resample separately from the two samples. Each of our 500 resamples
consists of two resamples, one of size n and one of size m. For each combined resample, we
calculate the difference in means. The 500 differences form the bootstrap distribution, which
is shown in the histogram below.
[Histogram: bootstrap distribution of the difference in means, 500 resamples (frequency vs. bin, ≈ 3–12)]
The bootstrap mean is 7.93 and the bootstrap standard deviation 2.21; the observed difference
is 7.92. The bootstrap normal probability plot, shown in the figure below, indicates that the
bootstrap distribution is approximately normal, so the t confidence limits may be calculated.
In our example the 95%CI of the mean difference is (3.59; 12.27).
[Normal probability plot of the bootstrap differences (observed vs. expected value, 1.0–15.0): the points lie close to the identity line]
The bootstrap distribution is not always close to the bell-shaped normal curve; in that case the
bootstrap t confidence interval should not be used, because this method is based on the
normality assumption. If the bootstrap distribution is approximately normal and the bias is
small, we can use the bootstrap t confidence interval, statistic ± t*SE. In other cases, the 95%
bootstrap percentile confidence interval can be calculated; in our example this becomes
(3.25; 12.09).
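The percentile interval is simply a matter of sorting the bootstrap statistics and cutting off 2.5% in each tail. A Python sketch (with simulated stand-in bootstrap values; Excel's PERCENTILE function interpolates slightly differently than this simple order-statistic version):

```python
import random

def percentile_interval(boot_stats, alpha=0.05):
    """(1 - alpha) bootstrap percentile interval: cut off alpha/2 of the
    ordered bootstrap statistics in each tail."""
    ordered = sorted(boot_stats)
    k = int(len(ordered) * alpha / 2)  # number trimmed from each tail
    return ordered[k], ordered[len(ordered) - 1 - k]

# simulated stand-in for 1000 bootstrap differences of means
random.seed(1)
boot = [random.gauss(7.9, 2.2) for _ in range(1000)]
lo, hi = percentile_interval(boot)
```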
Example 4: bootstrapping the correlation coefficient
We use the serum creatinine data measured in the subgroup of 90–95 year old men (n = 48)
with two different methods: an enzymatic method and a Jaffé method. The methods differ
from each other, but there should still be a high correlation between the results. A scatterplot
of the data is shown below.
[Scatterplot: serum creatinine (mg/dL), Jaffé method vs. enzymatic method, with fitted line y = 0.9069x + 0.1744, R² = 0.9025]
We can use the same bootstrap procedure to find a confidence interval for the correlation
coefficient, which is R = 0.95. One point needs special attention: because each observation
consists of the serum creatinine value obtained with both methods, we must be careful not to
break the pairing between the two methods during resampling. Therefore, we give each
patient an identification number, called PatID, and we resample the patients.
Resampling then gives, for example, a list of randomly drawn PatIDs with the corresponding
paired measurements.
We use Data >> Table with the built-in Excel function CORREL to calculate the correlation
coefficient of 500 resamples. To obtain this list, first put 'Correlation' in cell I3 and the
formula =CORREL(F3:F50;G3:G50) in cell J3. Then select I3:J502, choose Data >> Table,
set the Column Input Cell to I3 and click 'OK'.
It is recommended to break the link with the data-table formula after it has generated the
correlation coefficients of the resampled data, because every time you change a cell in the
spreadsheet the SMALL function will resample (RAND() is a so-called volatile function) and
the data table will recalculate all 500 correlation coefficients of the resamples. In my
experience, Excel is not always bulletproof against this kind of continuous recalculation; it
can produce unexpected errors or even shut down unexpectedly.
The bootstrap distribution of the correlation coefficient is given in the histogram below. It is
not unexpected that this distribution deviates from the normal bell-shaped curve, as the
correlation coefficient can never be larger than 1.
[Histogram: bootstrap distribution of the correlation coefficient (frequency vs. correlation coefficient, 0.90–1.00)]
The normal probability plot shows deviations from the identity line that confirm the non-
normality of the data. Therefore, it is recommended not to use the t confidence intervals, but
to use the bootstrap percentile interval.
[Normal probability plot of the bootstrap correlation coefficients (observed vs. expected value, 0.90–1.00): the points deviate from the identity line]
The 95% t confidence interval would be (0.922; 0.975). The bootstrap percentile interval
(Pct2.5 – Pct97.5) is (0.918 – 0.972), which is only slightly different from the t confidence
interval.
Example 5: bootstrapping regression
Regression models may be bootstrapped in exactly the same way as shown in example 4. The
original data consists of x, y pairs and the statistic computed from bootstrap replications
consists of paired estimates of slope and intercept. Often, the main interest is in estimates of
the slope, but we may also want to set confidence limits on an estimate of some value of x
computed from the estimates of slope and intercept. To set confidence limits on some
regression estimate by bootstrapping, one simply needs to follow the procedure presented
above, with the "statistic" being the estimate of interest in the study at hand. A problem with
this approach is often the sample size: with smallish samples, bootstrapping pairs may give
strange and variable results. We then need to consider bootstrapping the residuals instead.
The procedure is simple: fit a regression model to the original data, calculate the residuals
about the fitted line, and bootstrap those residuals. Consider the result of fitting a simple
linear regression to n original pairs of x, y observations. The outcome is a fitted regression
line y = a + bx, where a and b represent the estimates of intercept and slope. The residuals
about the regression line are ei = yi – a – bxi (i = 1, 2, 3, …, n). We now bootstrap the
residuals: we take repeated random samples with replacement of n observations from the
residuals, add these to the fitted regression line and calculate a new set of n values of yi.
Combined with the original set of x-values (unchanged throughout), these new pairs
constitute the bootstrap samples. We then calculate the bootstrap replication by fitting a new
regression line to each bootstrap sample. The only tricky part is to remember that each new
value of yi is computed from the i-th value of xi, so that the same residual may be associated
with several values of xi, depending on the random selection. That is, the new set of yi-values
is computed from yi = a + bxi + ei* (i = 1, 2, 3, …, n), with a and b coming from the
regression line fitted to the original data and the values ei* drawn at random with replacement
from the n residuals ei = yi – a – bxi (i = 1, 2, 3, …, n).
First we calculate the regression line from the original data.
[Scatterplot: CREA (Enz) vs. CREA (Jaffé), 0.5–1.9 mg/dL, with fitted line y = 0.9069x + 0.1744, R² = 0.9025]
The 95% CI of the slope obtained via t-statistics is: [0.8184; 0.9953].
This gives us a slope of 0.9069 and an intercept of 0.1744. With these values we calculate the
residuals ei = yi – 0.1744 – 0.9069 xi. Then we resample these residuals and add the resampled
residual to the regression equation to obtain the new yi = 0.1744 + 0.9069 xi + ei*, where the *
denotes that the residual is resampled.
We then use Excel’s Data >> Table option to calculate the bootstrap distribution of slopes of
the resamples. This gives us the following graph:
[Histogram: bootstrap distribution of the slope (frequency vs. bin, 0.75–1.03)]
Had we resampled the original pairs of data (instead of calculating the residuals and
resampling these), we would have obtained:
mean: 0.9076
stdev: 0.0389
t-based 95%CI: [0.8295; 0.9857]
bootstrap 95%CI: [0.8341; 0.9851]
We used the Fisher Iris dataset to demonstrate how we can calculate bootstrap confidence
intervals for logistic regression parameters. In this dataset we only used the data of two
species (Versicolor and Virginica), coded as 1 and 0, and the sepal length (x).
OBSERVED: b0 = 12.5708, b1 = -2.01293

ID  x    Species  p(x)     p(x)^y    (1-p(x))^(1-y)  product   ln(product)
1   7.0  1        0.1795   0.179504  1               0.179504  -1.71756
2   6.4  1        0.42264  0.422637  1               0.422637  -0.86124
3   6.9  1        0.21108  0.211081  1               0.211081  -1.55551
4   5.5  1        0.81753  0.817527  1               0.817527  -0.20147
5   6.5  1        0.37443  0.374432  1               0.374432  -0.98234
Logistic regression differs from regular regression in that the dependent variable is binary
(versicolor versus virginica, coded as 1 or 0). Based on the sepal length, we want to predict
the probability that the species is versicolor. The model should fit the data and predict the
probability that the species is versicolor, given the sepal length. This example has only one
factor or variable x, and the function that gives the probability for each value of x is:
p(x) = exp(b0 + b1x) / [1 + exp(b0 + b1x)]
The issue is to determine the parameters b0 and b1. Unlike regular regression problems, we
cannot use the method of least squares to estimate these parameters. The method of
estimation usually applied here is called maximum likelihood. To apply this technique, we
must first construct a likelihood function, and we estimate the parameters in our regression
equation by choosing the values that maximize it. The likelihood function L(b0, b1) is
defined as:
L(b0, b1) = ∏i=1..n p(xi)^yi [1 − p(xi)]^(1−yi)
or, taking logarithms,
ln L(b0, b1) = Σi=1..n [ yi ln p(xi) + (1 − yi) ln(1 − p(xi)) ]
The function L – or equivalently ln L - should be maximized. This can be done using Excel’s
Solver Add-in. This add-in comes with the Excel software but should be installed separately.
When installed correctly, one should find the menu item Tools >> Solver.
To maximize the function ln L via Excel’s Solver, we first set up the spreadsheet as follows:
a) In column F we set an ID for each data row. The data are in columns G (the sepal
length) and H (the coded value for the species).
b) In column I we calculate p(x), based on start values for b0 and b1, using the function
p(x) = exp(b0 + b1x) / [1 + exp(b0 + b1x)], where x is the sepal length value.
c) In column J we calculate p(x)^y, where y is the coded value of the species; in column
K we calculate (1 − p(x))^(1−y); in column L we take the product of the cells in
columns J and K; and in column M we take the logarithm of the cells in column L. Of
course, you can also calculate y·ln p(x) + (1 − y)·ln(1 − p(x)) directly and sum these
terms; this is sometimes preferable to avoid numerical problems in Excel.
d) We sum all cells in column M, thus obtaining Σ ln L. This value should be
maximized by changing the cells containing b0 and b1, which can be done using
Excel's Solver.
Using Tools >> Solver, a dialog window appears. Select the cell containing the Σ ln L result
(here cell $M$104). This value should be maximized (set 'Equal To:' to Max). In the textbox
'By Changing Cells:' you should refer to the cells containing b0 and b1. Clicking 'Solve'
invokes an iterative non-linear procedure that maximizes the value in cell $M$104.
The logistic regression parameters for our example are b0 = 12.57 and b1 = −2.01.
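For readers outside Excel, the same maximum-likelihood fit can be sketched in pure Python; plain gradient ascent on ln L stands in for Solver's iterative search (an illustration only, with a tiny made-up data set, not the iris data):

```python
import math

def fit_logistic(x, y, lr=0.01, n_iter=20000):
    """Maximum-likelihood fit of p(x) = exp(b0+b1*x)/(1+exp(b0+b1*x))
    by plain gradient ascent on ln L; the gradient of
    ln L = sum(y*ln p + (1-y)*ln(1-p)) is (sum(y-p), sum((y-p)*x))."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - p
            g1 += (yi - p) * xi
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# tiny made-up data set: the probability of y = 1 increases with x
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(xs, ys)
```

Gradient ascent is a simple stand-in; Solver uses a more sophisticated non-linear search, but both maximize the same log-likelihood.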
To obtain the confidence intervals for these parameters via bootstrapping, we proceed as
follows:
We made (e.g.) 100 resamples from the originally observed data by making use of an Excel
macro, called solvertest. In this macro we first set the cells R1 and R2 equal to 1, which are
the start values of the logistic regression parameters for our resample. We then generate
random IDs from the original sample IDs (which were values from 1 to 100). These random
IDs were obtained using the VBA random function ‘Rnd()’ and put in cells O4 till O103. In
column P and Q we used the VLOOKUP function to find the corresponding sepal length (x)
and coded value for species (y) from the originally observed data (in cells F4:H103).
Automatically, the values of p(x) and Σ ln L (in cell $V$104) were then calculated on the
spreadsheet for the resample.
Sub solvertest()
Dim i As Integer
Dim j As Integer
For i = 1 To 100
    Range("R1").Value = 1
    Range("R2").Value = 1
    'generate random patient IDs
    For j = 1 To 100
        Range("O4").Offset(j - 1, 0).Value = Int(100 * Rnd()) + 1
    Next j
    SolverReset
    SolverOk SetCell:="$V$104", MaxMinVal:=1, ValueOf:="0", ByChange:="$R$1:$R$2"
    SolverSolve UserFinish:=True
    'store the fitted parameters of this resample (results in column X and following)
    Range("X1").Offset(i - 1, 0).Value = Range("R1").Value
    Range("X1").Offset(i - 1, 1).Value = Range("R2").Value
Next i
End Sub
For this resample, we calculated the logistic regression parameters b0 and b1, using Excel’s
Solver. To use Excel’s Solver in a VBA macro we first need to set a reference to it. You can
do this from the Visual Basic Editor by choosing Tools >> References which invokes a dialog
window in which you can check the ‘SOLVER’ reference in the available references list. This
reference only needs to be set once.
In the VBA code the following statements are used:
SolverReset, which resets all Solver settings;
SolverOK, which is equivalent to clicking Solver on the Tools menu and then specifying the
options in the Solver Parameters dialog box;
SolverSolve, which is equivalent to clicking Solve in the Solver Parameters dialog box.
By setting UserFinish to True, the Solver Results dialog window will not appear.
When Excel’s Solver has finished its iterative process, the results are printed in column X and
the whole procedure is repeated, that is, a new resample is taken and new fit parameters are
obtained.
The histogram of the values for b0 is shown below:
[Histogram: bootstrap distribution of b0, bins from 5 to 25]
The bootstrap 95% CI for b0 is [7.56; 21.11], and for b1 it is [-3.40; -1.23].
Example 6: confidence intervals for Cohen’s Kappa
Kappa provides a measure of the degree to which two raters concur in their respective sortings
of N items into k mutually exclusive categories. A 'rater' in this context can be an individual
human being, a set of individuals who sort the N items collectively, or some non-human
agency, such as a computer program or diagnostic test, that performs a sorting on the basis of
specified criteria.
The original and simplest version of kappa is the unweighted kappa coefficient introduced by
J. Cohen in 1960. To illustrate, suppose that our raters are two clinical tests, A and B,
independently employed to sort each of N=100 subjects into one or the other of k=3
diagnostic categories. The table below shows a cross-tabulation of the sortings actually
observed.
OBSERVED                  Method A
                      1     2     3   Total
            1        44     5     1     50
Method B    2         7    20     3     30
            3         9     5     6     20
        Total        60    30    10    100
The table below shows the cell frequencies that would have been expected by mere chance,
given the observed marginal totals.
EXPECTED                  Method A
                      1     2     3   Total
            1        30    15     5     50
Method B    2        18     9     3     30
            3        12     6     2     20
        Total        60    30    10    100
observed concordant: 70
chance-expected concordant: 41
excess: 29
chance-expected non-concordant: 59
kappa = 29/59 = 0.4915
We now resample the paired observations using the method described earlier (actually, we
resample the SampleID and use VLOOKUP to assign the results of methods A and B to the
corresponding SampleID).
An example of the resampled frequency table is:
RESAMPLE                  Method A
                      1     2     3   Total
            1        45     6     3     54
Method B    2         9    19     3     31
            3         8     3     4     15
        Total        62    28    10    100
To obtain this table in Excel we make use of array functions, which, unlike pivot tables,
recalculate automatically whenever the data are resampled.
As an example, consider cell L3, which holds the number of concordant 1’s. This number is
obtained using the array formula =SUM((B2:B101=1)*(C2:C101=1)), which must be entered
using CTRL + SHIFT + ENTER. If you do this properly, {} brackets appear around the
formula. Array formulas are a powerful tool in Excel: an array formula works with an
array, or series, of data values rather than a single data value.
In our example, the first array is a series of TRUE or FALSE values which are the results of
comparing B2:B101 to the value 1. The second array is also a series of TRUE or FALSE
values, the result of comparing C2:C101 to 1. These two arrays are multiplied together. When
you multiply two arrays, the result is itself an array, each element of which is the product of
the corresponding elements of the two arrays being multiplied. The SUM function simply
adds up the elements of the array and returns a result of 44, the number of concordant 1’s
for methods A and B. By filling the frequency table for the resampled data, we can
calculate Cohen’s kappa from this frequency table, in the same way we calculated it for
the originally observed data. By using Data >> Tables we then obtain the bootstrap
distribution of kappa for our 1000 resamples.
[Histogram: bootstrap distribution of kappa for 1000 resamples; bins approximately 0.28 to 0.68]
Mean of the 1000 resampled kappa values: 0.4894 (stdev 0.0693)
T-based 95% CI: [0.3518; 0.6269]
Bootstrap percentile 95% CI: [0.3487; 0.6211]
Note that kappa = 0.4915 for the originally observed data, and the 95% CI obtained for
kappa via the asymptotic standard error (ASE) is [0.3475; 0.6355].
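The whole bootstrap can be mimicked outside Excel. The sketch below (Python, purely illustrative; the seed and the percentile method are our own choices) rebuilds the 100 rating pairs from the observed frequency table, resamples them with replacement, and takes a percentile 95% CI for kappa.

```python
import random

# rows: method B categories 1-3, columns: method A categories 1-3
table = [[44, 5, 1], [7, 20, 3], [9, 5, 6]]

# rebuild the 100 individual (A, B) rating pairs from the frequency table
pairs = [(a, b) for b in range(3) for a in range(3) for _ in range(table[b][a])]

def kappa(pairs):
    """Unweighted Cohen's kappa from a list of (rater A, rater B) pairs."""
    n = len(pairs)
    counts = [[0] * 3 for _ in range(3)]
    for a, b in pairs:
        counts[b][a] += 1
    row = [sum(r) for r in counts]                                   # B marginals
    col = [sum(counts[r][c] for r in range(3)) for c in range(3)]    # A marginals
    observed = sum(counts[i][i] for i in range(3))                   # concordant
    expected = sum(row[i] * col[i] / n for i in range(3))            # by chance
    return (observed - expected) / (n - expected)

random.seed(1)
# 1000 bootstrap resamples of the 100 pairs, drawn with replacement
boot = sorted(kappa([random.choice(pairs) for _ in pairs]) for _ in range(1000))
ci = (boot[25], boot[975])        # percentile 95% CI
```

With the observed table this reproduces kappa = 0.4915 exactly, and the percentile interval should come out in the neighbourhood of the [0.3487; 0.6211] interval reported above.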
Significance testing using permutation tests
In some cases, we want to determine whether an observed effect, such as the difference
between two means, could reasonably be ascribed to the randomness introduced in selecting a
sample. If not, we have evidence that the effect observed in the sample reflects an effect that
is present in the population. We will proceed as follows:
• We start by choosing a statistic that measures the effect we are looking for
• We construct the sampling distribution that this statistic would have if the effect were
not present in the population
• We locate the observed statistic on this distribution. A value in the tail of the
distribution would rarely occur by chance, and so is evidence that something other
than chance is operating.
Resampling for significance tests requires that we resample in a manner consistent with the
null hypothesis. The statement that the effect we seek is not present in the population is the
null hypothesis. The probability, calculated taking the null hypothesis to be true, that we
would observe a statistic value as extreme or more extreme than the one we did observe is the
p-value. Because p-values are calculated assuming that the null hypothesis is true, we cannot
resample from the observed sample as we did earlier. Here, we must resample to create a
distribution centered at the parameter value stated by the null hypothesis.
The table below shows the results from a small experiment where 7 mice out of 16 were
randomly selected and treated with a new drug, while the other 9 mice served as the control
group. The treatment was intended to prolong survival after surgery, expressed in days. The
question that now arises is: does the new drug prolong survival?
Treated   Control
   94        52
   38        10
   23        40
  197       104
   99        50
   16        27
  141       146
             31
             46

          Treated   Control
Mean        86.86     56.22
Stdev       66.77     42.42
SE          25.24     14.14
• Choose 7 out of 16 mice at random to be the treatment group; the other 9 mice are the
control group. Choose without replacement! This is called a permutation resample.
Calculate the mean in each group. The difference between these means is our statistic.
• Repeat this resampling from the 16 mice hundreds of times. The distribution of the
statistic (difference in means) forms the sampling distribution under the condition that
the null hypothesis is true. It is called a permutation distribution.
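The two steps above translate directly into code; the sketch below (Python for illustration, using the mice data from the table; the seed and resample count are arbitrary choices) estimates the one-sided p-value.

```python
import random

treated = [94, 38, 23, 197, 99, 16, 141]
control = [52, 10, 40, 104, 50, 27, 146, 31, 46]
pooled = treated + control
obs_diff = sum(treated) / 7 - sum(control) / 9   # observed statistic, about 30.64

random.seed(0)
count = 0
n_resamples = 1000
for _ in range(n_resamples):
    random.shuffle(pooled)                       # permute: choose without replacement
    diff = sum(pooled[:7]) / 7 - sum(pooled[7:]) / 9
    if diff >= obs_diff:
        count += 1
p_value = count / n_resamples                    # proportion at least as extreme
```

With these data the estimate lands near 0.15, in line with the t-test comparison discussed in the text.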
To do this in Excel, we need a VBA array function. From Chip Pearson’s website ([Link]) we
used the VBA function UniqueRandomLongs. This array function returns Number unique random
values between a Minimum and a Maximum value. We arrange the data as follows:
In column H we introduce the ‘MiceID’, that is, we number the mice results from 1 to 16,
where the first 7 entries are the data from the treated mice and the next 9 entries are
the data of the control mice. Then we use the function from Chip Pearson to randomize
these numbers, each number appearing exactly once. Resample1 is taken from the data in
column J, using the VLOOKUP function.
The VLOOKUP function looks up the value in H2 (which is the MiceID = 1) in the random
sequence of miceID of column I and returns the value next to this miceID in column J where
the data is placed. For Resample1, we use the MiceID from 1 to 7, for Resample2 we use the
MiceID from 8 to 16. Then we calculate the statistic as the difference between the means of
both resamples.
Next, we have to repeat that hundreds of times. We can do that using the data >> tables option
in Excel and referring to cell M12 (the calculated difference between means of the resamples)
for the Column input cell, or we could use a small VBA macro, like this:
Sub MakeBootstrapStat()
Dim i As Integer
For i = 1 To 1000
    Range("O1").Offset(i, 0).Value = Range("M12").Value
Next i
End Sub
The macro code above simply fills column O with the value of cell M12, which contains the
calculated difference between the means of the resamples. By entering a value in a cell in
column O, the recalculation is triggered each time and the resamples are refreshed because the
UniqueRandomLongs function has a dummy variable to make it volatile.
The permutation distribution of our difference in means is shown in the histogram below.
[Histogram: permutation distribution of the difference in means; bins from -70 to 70]
The p-value for the one-sided test (mean of treatment group > mean of control group) is based
on 1000 permutation resamples. The observed difference was 86.86 – 56.22 = 30.64. The p-
value for the one-sided test is the probability that the difference in means is 30.64 or greater,
calculated taking the null hypothesis to be true. The histogram above shows how the statistic
would vary if the null hypothesis were true. The proportion of observations greater than 30.64
estimates the p-value. From the resampling results we can find that 147 of the 1000 results
gave a value of 30.64 or larger. The proportion of samples that exceed the observed value of
30.64 is thus 147/1000 = 0.147. In fact, a small refinement can be made: it can be shown
that counting the observed statistic itself as one extra extreme resample improves the
estimated p-value. The permutation test estimate of the p-value is then
(147 + 1) / (1000 + 1) = 0.148.
Using the one-sided two-sample t-test to compare the means of the treated and control group,
we obtain a p-value of 0.140, which is very similar to the p-value of the permutation test.
Permutation tests have these advantages over t tests:
• The t test gives accurate p-values if the sampling distribution of the difference in
means is at least roughly normal. The permutation test gives accurate p-values even
when the sampling distribution is not close to normal.
• We can directly check the normality of the sampling distribution by looking at the
permutation distribution.
If the two p-values differ considerably, it usually indicates that the conditions for the two
sample t-test don’t hold for these data. Permutation tests give more accurate p-values than t-
tests, especially when the sampling distribution is skewed.
An alternative way to reshuffle the original data from both samples, is to make use of an extra
column (column I), with random numbers. Then these random numbers are ranked from small
to large or vice versa in column J, giving a list of random integer numbers. Theoretically,
tied ranks might occur, but in practice this will be very rare. We then proceed as
described earlier. The advantage is that no macros are needed.
As an example, consider the data of a new drug to reduce blood pressure. The data for 10
patients is shown in the table below, before and after treatment with the new drug.
[Table: blood pressure before and after treatment for 10 patients; mean difference -29.2 mm Hg]
The mean difference is -29.2 mm Hg. The t-test for matched pairs gives a p-value of
0.00002088.
If we want to perform a permutation test, we have to keep in mind that the key step in the
general procedure for permutation tests is to form permutation resamples in a way that is
consistent with the study design and with the null hypothesis. The null hypothesis says that
the drug has no effect, so that the labels “before” and “after” have no meaning. Therefore,
we should resample by randomly assigning “before” or “after” to each patient, but we should
not mix scores
from different people, because that isn’t consistent with the pairing in the study design.
We can do this as follows in Excel:
Under the assumption of the null hypothesis, we randomly assign the blood pressure of one
patient to the “before” or “after” situation, by using the formula in cell E3: =IF(RAND()>0.5;
B2; A2) which assigns the value of the observed “After” situation to cell E3 if a random value
greater than 0.5 is returned by the RAND() function, otherwise the observed “Before”
situation is entered in cell E3.
In cell F3 we then have to set the ‘other’ value, the one that is not assigned to cell E3. We use
the formula =IF(E3=A2;B2;A2).
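Since the article's before/after table is not reproduced here, the sketch below uses hypothetical blood-pressure values (an assumption, chosen only so that the mean difference is close to the -29.2 mm Hg in the text) and applies the same random before/after swap within each pair.

```python
import random

# hypothetical data: systolic pressure before and after treatment for 10 patients
before = [160, 155, 170, 148, 165, 172, 158, 150, 168, 162]
after  = [128, 130, 142, 120, 135, 140, 131, 122, 138, 133]
diffs = [a - b for a, b in zip(after, before)]
obs_mean = sum(diffs) / len(diffs)               # observed mean paired difference

random.seed(0)
count = 0
n_resamples = 1000
for _ in range(n_resamples):
    # null hypothesis: "before"/"after" labels are meaningless, so randomly
    # swap them within each pair, which flips the sign of that difference
    m = sum(d if random.random() > 0.5 else -d for d in diffs) / len(diffs)
    if m <= obs_mean:
        count += 1
p_value = count / n_resamples
```

Because every patient improves in this (hypothetical) data set, hardly any sign-flipped resample reaches the observed mean difference and the estimated p-value is essentially zero, as in the text.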
We repeat this resampling and the calculation of the mean of differences hundreds of times.
We can do this by using data >> tables and referring to cell G14, where we calculated the
mean of the differences of the resample. The permutation distribution of 1000 resamples
looks like:
[Histogram: permutation distribution of the mean paired difference, 1000 resamples; bins from -40 to 40]
As none of the resample mean pair differences is below the observed mean difference of -29.2
mm Hg, the permutation test p-value equals 0.
Permutation tests can also be used to test the significance of a relationship between two
variables. For example, we looked at the relationship between two methods to determine the
serum creatinine concentration in 90-95 year old men (see example 4). The null hypothesis
would be that there is no relationship. In that case, the value of serum creatinine obtained for
the same patient by method 1 would have nothing to do with the value obtained by method 2.
We thus can resample in a way consistent with the null hypothesis by permuting the observed
values of method 2 among the patients at random.
As a test statistic we take the correlation. For every resample, we calculate the correlation
between the serum creatinine obtained by method 1 (in its original order) and by method 2 (in
a randomly reshuffled order). The p-value is the proportion of the resamples with correlation
larger than the original observed correlation.
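In Python the same permutation scheme looks as follows; the creatinine values here are simulated stand-ins (the original example 4 data are not reproduced in this section), so only the logic, not the numbers, matches the text.

```python
import random

def correl(xs, ys):
    """Pearson correlation coefficient, the analogue of Excel's CORREL."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(0)
# simulated paired creatinine-like values for 47 patients: method 2 is
# method 1 plus a little measurement noise, so they correlate strongly
method1 = [round(random.uniform(0.6, 2.5), 2) for _ in range(47)]
method2 = [round(x + random.gauss(0, 0.1), 2) for x in method1]
obs_r = correl(method1, method2)

count = 0
n_resamples = 1000
shuffled = method2[:]
for _ in range(n_resamples):
    random.shuffle(shuffled)          # permute method 2 among the patients
    if correl(method1, shuffled) >= obs_r:
        count += 1
p_value = count / n_resamples
```

As in the text, a strong observed correlation lies far outside the permutation distribution, so the estimated p-value is 0.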
To do this in Excel, we reshuffle the PatID numbers using the array function
UniqueRandomLongs we used previously in Example 4. In column E we thus generate a new
list of PatIDs but in a random order.
In cell G2 (and the cells G3:G48) we use the VLOOKUP function to get the value of method
2 corresponding to PATID = 10.
For each resample, we calculate the correlation, using CORREL between the data of method 1
(in its original order) and the data of method 2 (in its reshuffled order). The permutation
distribution for 100 resamples is obtained using Data >> Tables and referring to the cell
where CORREL is used. This gives us the following histogram:
[Histogram: permutation distribution of the correlation, 100 resamples; bins from -0.4 to 0.4]
The observed correlation coefficient was 0.95 which is far from any observed correlation
under the null hypothesis. Therefore, the permutation test p-value is 0, meaning that the null
hypothesis of no correlation is rejected.
In this example we use the data of a study of Post Traumatic Stress Disorder in rape survivors.
This study was carried out by Foa, Rothbaum, Riggs, and Murdock (1991), as part of a long
series of studies that Foa has conducted on this topic. Each of the participants completed a
symptom inventory at the start of treatment, and again at the end. We will use the data from
the end of treatment. There were four conditions. Group SIT (Stress Inoculation Therapy)
learned techniques for dealing with stress. Group PE (Prolonged Exposure) reviewed the rape
incident in their heads over and over again, until it lost some of its negative valence. Group
SC (Supportive Counseling) was a group that just received standard counseling, and Group
(WL) was a waiting list control group.
The data for this example are shown below:
Observed data
SIT   PE   SC   WL
  3   18   24   12
 13    6   14   30
 13   21   21   27
  8   34    5   20
 11   26   17   17
  9   11   17   23
 12    2   23   13
  7    5   19   28
 16    5    7   12
 15   26   27   13
 18        25
 12
  8
 10

        SIT       PE       SC      WL
Mean   11.07    15.40    18.09   19.50
Var    15.61   123.60    50.89   50.50
A standard one-way analysis of variance on these data would produce F = 3.046, p = .039,
which would lead us to reject the null hypothesis. However, with very little data in each cell,
we don't really have a good way to convince ourselves that normality is a reasonable
assumption. At best, all we can say is that the data are not so extreme as to lead us to believe
that they are not normally distributed. However, we can use the permutation procedure to
avoid having to make that assumption. I should also point out that there is a problem with
homoscedasticity (compare the variances of Groups SIT and PE.)
Here is the output we obtain via Tools >> Data Analysis >> Anova: Single Factor, where we
used the Analysis Toolpak Add-in in Excel.
SUMMARY
Groups   Count    Sum   Average   Variance
SIT         14    155    11.071     15.610
PE          10    154    15.400    123.600
SC          11    199    18.091     50.891
WL          10    195    19.500     50.500

ANOVA
Source of Variation        SS    df        MS        F    P-value
Between Groups        507.840     3   169.280   3.0458    0.03936
Within Groups        2278.738    41    55.579
Total                2786.578    44
If we are willing to assume that the data we have are a reasonable representation of the
populations from which they were drawn, then we can use those data to reproduce the
population under the null hypothesis of no differences between groups and then draw
resamples from that population. Notice, however, that we have clearly made an assumption.
We have assumed that the sample data reflect the population. That is just as much of an
assumption as the assumption of normality. This is an important point to keep in mind,
especially with small samples. However, using single factor anova assumes 1) representative
data, 2) normality and 3) equal variances. The last two assumptions are not needed for a
permutation test.
The advantage of bootstrapping and permutation tests for teaching statistics is that, to do
the simulations, one really has to know how to calculate the statistics.
We proceed as follows:
1) we first list all the observed data in one column, e.g. I4:I48, and give each
observation a unique ID (e.g. in H4:H48)
2) in column J4:J48 we generate 45 random numbers with =RAND()
3) in column K4:K48 we calculate the rank of each of the random numbers
4) in columns M till P we present the resample, using the VLOOKUP function to look up
the data corresponding to the randomly assigned rank. In other words, the values in
M till P are the reshuffled original data (that is, we permute the data and reassign
them randomly to each category, assuming the null hypothesis is true (all categories
are equal))
5) then we calculate the F-statistic: by permuting the data, the overall mean stays equal
to 15.6222 but the group means will change. Therefore, SStotal remains the same but we
have to recalculate SSwithin groups for each resample. The F-statistic is MSbg/MSwg.
6) We calculate 1000 or more F-values for the resamples. A histogram of these values is
shown below. The observed F-value was 3.046. The permutation test p-value is simply the
number of times the resample F-value is greater than the observed F-value, divided by the
number of resamples. We found a p-value of 0.038, which is very close to the value
obtained from single factor anova (p = 0.03936).
[Histogram: permutation distribution of the F-statistic; bins from 0.5 to 6.5]
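Steps 1) to 6) condense to the following sketch (Python for illustration, using the observed data; the seed and resample count are arbitrary). For simplicity the F-statistic is recomputed from scratch for every resample rather than reusing SStotal.

```python
import random

groups = {
    "SIT": [3, 13, 13, 8, 11, 9, 12, 7, 16, 15, 18, 12, 8, 10],
    "PE":  [18, 6, 21, 34, 26, 11, 2, 5, 5, 26],
    "SC":  [24, 14, 21, 5, 17, 17, 23, 19, 7, 27, 25],
    "WL":  [12, 30, 27, 20, 17, 23, 13, 28, 12, 13],
}

def f_statistic(samples):
    """One-way ANOVA F = MS between / MS within for a list of groups."""
    n = sum(len(g) for g in samples)
    grand = sum(sum(g) for g in samples) / n
    ss_between = 0.0
    ss_within = 0.0
    for g in samples:
        m = sum(g) / len(g)
        ss_between += len(g) * (m - grand) ** 2
        ss_within += sum((x - m) ** 2 for x in g)
    return (ss_between / (len(samples) - 1)) / (ss_within / (n - len(samples)))

sizes = [len(g) for g in groups.values()]
pooled = [x for g in groups.values() for x in g]
obs_f = f_statistic(list(groups.values()))       # about 3.046

random.seed(0)
count = 0
n_resamples = 1000
for _ in range(n_resamples):
    random.shuffle(pooled)                       # permute across all categories
    resample, start = [], 0
    for s in sizes:                              # reassign to groups of same sizes
        resample.append(pooled[start:start + s])
        start += s
    if f_statistic(resample) >= obs_f:
        count += 1
p_value = count / n_resamples
```

With these data the permutation p-value comes out close to the 0.038 reported above.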
In a first example we consider the relationship between the treatment of a specific disease
and the health status of the patient after a certain period of time, in terms of cured or
not. We are quite convinced of our new drug, so the majority of patients is in the treated
group and the patient distribution among both groups is quite unbalanced. The data are:

              Cured   Not cured   Total
Treatment      108        34       142
Control         15        13        28
Total          123        47       170
We claim that our new treatment increases the chance of being cured after a certain period of
time. The null hypothesis is: Ho: p1 = p2 and the alternative hypothesis is Ha: p1 > p2. The p-
value for the comparison of independent proportions is p = 0.0075 (one-sided chi-square test;
χ2 = 5.911). Under the null hypothesis, all 170 patients are equally likely to be cured. That is,
curing occurs for reasons that have nothing to do with whether the patient has taken the new
drug or not. We should resample in a way consistent with the null hypothesis. This can be
done by reshuffling the data and assigning randomly 142 to the first group and 28 to the other
group.
In Excel, we proceed as follows:
Using VLOOKUP, we find the cure status in the original data for the reshuffled patid.
We then take the first 28 patients for the control group and the last 142 patients for the
treatment group. We count the number of cured patients in each group and calculate the
proportion. To avoid rounding errors, it is easier to calculate the number of times we find a
value of 15 or less in the control group. Such a value will always result in a difference greater
than or equal to 0.225. In other words, instead of calculating the difference in proportions
between the cured patients in the treated and control group, we just calculate the number of
cured patients in the control group, for each resample. Then we calculate the number of times
this number is smaller than or equal to 15. Each time this happens, the difference in
proportions will be at least 0.225. This happens 18 times in 1000 resamples, resulting in a p-
value of 0.018.
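The same counting argument can be sketched as follows; note that the 2x2 cell counts (108 cured of 142 treated, 15 cured of 28 controls) are inferred from the totals and proportions quoted in the text.

```python
import random

# 123 cured among 170 patients; the first 142 entries are the treated group
status = [1] * 108 + [0] * 34 + [1] * 15 + [0] * 13
obs_diff = 108 / 142 - 15 / 28               # observed difference, about 0.225

random.seed(0)
count = 0
n_resamples = 1000
for _ in range(n_resamples):
    random.shuffle(status)                   # null: cure unrelated to treatment
    control = status[:28]                    # first 28 reshuffled patients
    # 15 or fewer cured in the control group implies a difference >= 0.225
    if sum(control) <= 15:
        count += 1
p_value = count / n_resamples
```

With these counts the estimate lands near the 0.018 quoted in the text.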
Another (although much more complex) way to arrive at the same result is by making use of
the built-in Excel function ‘binomdist’. Under the null hypothesis, we expect 72.4%
(123/170) of the patients to be cured in each group. We can calculate the cumulative binomial
distribution, once for 142 patients and once for 28 patients.
Using =BINOMDIST(n; 142; 0.724; TRUE) gives the cumulative probability distribution (0
or more persons are cured) in the treatment group, given that the probability is 72.4% to be
cured. The value of n goes from 0 to 142. Analogously, =BINOMDIST(n; 28; 0.724; TRUE)
calculates the cumulative distribution in the control group.
We then generate random numbers (see column G) and assign (via VLOOKUP) the number of
cured patients corresponding to the cumulative value of the binomial distribution closest to that random
number. E.g. the random number 0.742241815 lies between 0.6940 = Binomdist(105; 142;
0.724; TRUE) and 0.7574 = Binomdist(106; 142; 0.724; TRUE). So, we assign the value 105
to it and calculate 105/142 = 0.739437 as the first proportion (prop1). We do that 1000
times. We proceed analogously in the control group and find 1000 values for prop2. We then
take the difference and calculate how many times this difference is larger than or equal to
0.225. We obtain 21, giving a p-value of 0.021. Note that this value is not as stable as the
value we found using the previous method.
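A sketch of this binomial variant, where an inverse-CDF draw plays the role of the BINOMDIST table plus the VLOOKUP assignment (Python for illustration; the seed is arbitrary):

```python
import random

def binomial_draw(n, p):
    """Inverse-CDF draw from Binomial(n, p), mirroring the BINOMDIST lookup:
    return the smallest k whose cumulative probability covers a uniform draw."""
    u = random.random()
    cum = 0.0
    prob = (1 - p) ** n                      # P(X = 0)
    for k in range(n + 1):
        cum += prob
        if u <= cum:
            return k
        # P(X = k+1) from P(X = k) via the binomial recurrence
        prob *= (n - k) * p / ((k + 1) * (1 - p))
    return n                                 # guard against float rounding

random.seed(0)
p0 = 123 / 170                               # 72.4% cured under the null hypothesis
count = 0
n_resamples = 1000
for _ in range(n_resamples):
    prop1 = binomial_draw(142, p0) / 142     # treated-group proportion
    prop2 = binomial_draw(28, p0) / 28       # control-group proportion
    if prop1 - prop2 >= 0.225:
        count += 1
p_value = count / n_resamples
```

As the text notes, this estimate fluctuates more between runs than the reshuffling method, because the two group totals are no longer constrained to sum to 123 cured patients.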
We consider the data presented in the article by Regan, Hellmann and Stone (“Treatment of
Wegener’s granulomatosis”, 2001, Rheumatic Diseases Clinics of North America, 27(4), 863-
886). There are two treatment groups, one with 17 patients and the other with 19 patients.
Patients are being treated for Wegener’s granulomatosis. The data consists of the number of
patients in remission (Yes) and not in remission (No), for each treatment. The data are
reproduced in the table below:

              Yes    No   Total
Treatment 1     6    11      17
Treatment 2    14     5      19
Total          20    16      36
The proportion in remission for treatment 1 is 0.353 or 35.3%, while the proportion for
treatment 2 is 0.737 or 73.7%. The question we want to answer is how likely it is to get a
difference in proportions as large or larger than the observed difference of 0.384, if there is
actually no difference in the population remission proportions?
If the population remission proportions for the two treatments are not different, then we would expect to see
20/36 = 0.556 or 55.6% of the patients in each treatment group in remission.
We disregard treatment group (under the assumption of the null hypothesis both treatment
groups are equal) and reshuffle the patient IDs. The first 17 are now assigned to treatment
group 1 and the next 19 to treatment group 2. We calculate the difference in proportions for
the resampled data. We do this 1000 times or more, and the number of times we find a
difference as large or larger than 0.384 can be used to calculate a p-value, which here
equals 21/1000 = 0.021.
[1] Derek Christie, Resampling with Excel, Teaching Statistics, 26(1), Spring 2004.
[2] Wichman, B.A. and Hill, I.D., Algorithm AS 183: An Efficient and Portable Pseudo-Random
Number Generator, Applied Statistics, 31, 188-190, 1982.