Bootstrapping Techniques in Excel
Hans Pottel
Subfaculty of Medicine, KU Leuven Kulak, Kortrijk, Belgium
Introduction
Statistics is concerned with methods and techniques for drawing inferences about populations from sample data. Using the information in (small) samples, a statistician draws conclusions about the whole population with a variety of tools, methods and techniques. Statistics is
changing as modern computers and software make it possible to look at data graphically and
numerically in ways previously inconceivable. The bootstrap, permutation tests and other
resampling methods are part of this revolution. Although statisticians have embraced
resampling methods for their own use, they have not, in general, included them in their
teaching. Resampling methods can be made accessible to students at virtually every level,
using Microsoft Excel.
It is accepted that spreadsheets are a useful and popular tool for processing and presenting
data. In fact, Microsoft Excel spreadsheets have become something of a standard for data
storage, at least for smaller data sets. This, together with the fact that the program is often
bundled with new computers, naturally encourages its use for statistical analysis. Many
statisticians find this unfortunate, since Excel is clearly not a statistical package; there is no
doubt about that, and Excel has never claimed to be one. But one should face the fact that,
owing to its ready availability, many people, including professional statisticians, use Excel,
even on a daily basis, for quick and easy statistical calculations.
The aim of this article is to show how resampling can be done in Microsoft Excel, using
standard functions and some simple macro functions. We make use of Excel Data Tables to
conduct simulations, and we illustrate the techniques with many examples.
Resampling methods
To use resampling techniques, the sample data are assumed representative of the population
from which they are taken. This is probably the only requirement for resampling techniques.
Hesterberg (1998) gives a very nice review of simulation and bootstrapping in teaching
statistics.
Where can resampling methods be used?
• Resampling methods allow us to quantify uncertainty by calculating standard errors
and confidence intervals
• Resampling methods let us tackle new inference settings easily (e.g. ratio of means)
• Resampling methods help us understand the concepts of statistical inference
(performing significance tests)
What are the advantages of resampling methods?
• Fewer assumptions: no requirement for normality or large sample sizes
• Greater accuracy
• Generality: the applicability of resampling methods is quite similar for a wide range of
statistics
• Promote understanding: they build intuition by providing concrete analogies to
theoretical concepts; e.g. look at how many times a confidence interval covers the true
population mean (if you repeatedly construct 95%CIs based on random samples, about
95% of them will cover the true population mean)
One issue is that RAND cannot be reseeded by the user; Microsoft has given no clue how the
seeding is done. In Excel 2003 and 2007 one has to live with the fact that RAND is a so-called
volatile function: every time the worksheet recalculates, a whole new set of values appears in
every cell containing "=RAND()". This can be circumvented with a simple trick. Put the
formula =IF(T1;RAND();V2) in cell V2, where cell T1 holds 0 or 1 (FALSE or TRUE). The
formula contains a deliberate circular reference (so iterative calculation must be allowed) and
will only recalculate RAND() when the value in cell T1 is set to 1 (TRUE).
Excel also ships with an add-in called the 'Analysis ToolPak'. This add-in is present but not
necessarily active; to activate it, choose Tools >> Add-Ins and check the boxes next to
'Analysis ToolPak' and 'Analysis ToolPak VBA'. The Analysis ToolPak has a random number
generator (accessible via Tools >> Data Analysis) and also delivers many additional
(statistical) functions. One of these is RANDBETWEEN(Minimum; Maximum), which
returns a random integer between a minimum and maximum integer value. The function is
equivalent to INT[(Maximum – Minimum + 1) * RAND() + Minimum], so there is not really
a need to use RANDBETWEEN. If one wants to resample with replacement from a range of
cells in a spreadsheet, one can use the formula
=SMALL($A$1:$A$15;INT(COUNT($A$1:$A$15)*RAND())+1). The function
SMALL(array; k) returns the k-smallest value in an array. The array in our example is
$A$1:$A$15, and the value of k is the integer part of (the number of elements in the array
multiplied by a random value between 0 and 1) plus 1. The number of elements in our
example array is 15, so INT(15*RAND()) is an integer between 0 and 14; to obtain a random
value between 1 and 15 we add 1.
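The same draw-with-replacement logic is easy to check outside Excel. The following Python fragment (an illustrative sketch, not part of the original spreadsheet) mimics the SMALL/RAND construction: INT(n*RAND()) picks a random position in the sorted data, which is equivalent to drawing uniformly with replacement.

```python
import random

def resample_with_replacement(data):
    """Mimic n applications of =SMALL(range; INT(COUNT(range)*RAND())+1):
    each draw picks the k-th smallest element for a random k in 1..n,
    which is the same as drawing uniformly with replacement."""
    ordered = sorted(data)
    n = len(ordered)
    # int(n * random.random()) is an index 0..n-1, i.e. the (k-1)-th
    # smallest element for k in 1..n
    return [ordered[int(n * random.random())] for _ in range(n)]

sample = [3, 8, 14, 2, 15, 17]
boot = resample_with_replacement(sample)
# boot has the same size as the original and contains only original values
```

Sorting is of course unnecessary in Python (one could index the raw list directly); it is kept here only to mirror the SMALL construction.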
Procedure for bootstrapping
The sample data are assumed representative of the population from which they are taken; this
original sample serves as the starting point for resampling. Statistical inference is based on
the sampling distribution of sample statistics, and the bootstrap approximates this sampling
distribution, the so-called bootstrap distribution. The way to proceed is as follows: draw a
large number of resamples with replacement from the original sample, each of the same size
as the original sample; compute the statistic of interest for each resample; the collection of
these values forms the bootstrap distribution.
Confidence intervals
Example 1: bootstrap confidence limits
We consider the serum creatinine values of 601 women between 20 and 25 years old, as
obtained by an enzymatic method. The distribution of the values is given in the histogram
below.
[Histogram: frequency distribution of the 601 serum creatinine values (mg/dL)]
The mean and median are 0.679 mg/dL and 0.680 mg/dL respectively; the standard deviation
is 0.1194 mg/dL and the standard error is 0.004871 mg/dL. The 95%CI based on the normal
distribution is [0.670; 0.689] mg/dL. From the graph, we may consider the data normally
distributed.
If we construct 100 bootstrap samples and plot the histogram of the mean statistic, we obtain
the graph below.
The mean is 0.679 mg/dL, the standard deviation is 0.00431 mg/dL and the 95% CI is [0.672 ;
0.686 mg/dL]. The standard deviation of the bootstrap distribution is close to the standard
error of the original sample.
[Histogram: bootstrap distribution of the mean, 100 resamples (frequency vs. bin, mg/dL)]
Of course, we will obtain better estimates if we increase the number of bootstrap samples. We
present the bootstrap distribution of 500 bootstrap samples in the histogram below. The mean
is 0.679 mg/dL, standard deviation is 0.00464 mg/dL and the 95%CI is [0.671; 0.686 mg/dL].
[Histogram: bootstrap distribution of the mean, 500 resamples (frequency vs. bin, mg/dL)]
Remember that the bootstrap idea is to use resample means to estimate how the sample mean
of a sample of size 601 from this population varies because of random sampling. That is, we
are using the resampled observations as if they were real data.
To do this in Excel, we should be able to draw with replacement from a set of data. Excel has
a function SMALL (array; k) or LARGE(array; k), that gives the k-smallest or k-largest
element in an array. For example, if the elements in an array are {3, 8, 14, 2, 15, 17} and
these values are in cells A1:A6, the function SMALL(A1:A6; 3) will return the value 8, as
this value is the third smallest value in the list.
Now consider our sample of n = 601 in cells A2:A602, with the variable name 'CREA' in cell
A1. We will call the range A2:A602 the 'data_range'; you can use Excel's built-in feature to
name ranges if you want (Insert >> Name >> Define). We now use the random function
RAND(), which returns a random value between 0 and 1, and modify it so that it returns a
random integer between 1 and 601. The formula INT(COUNT(data_range)*RAND())+1 does
the trick: COUNT(data_range) returns 601, the size of our sample; multiplying by RAND()
gives a random floating-point number in [0; 601); taking the integer part gives an integer
between 0 and 600; and adding 1 gives a random integer between 1 and 601.
Clearly, the same random number generator could be obtained with the LARGE function.
So, if we assume that our data is in cells A2:A602, then we use cells B2:B602 for our
resample. In cell B1, we put ‘Resample’. In the cells B2:B602, we put
=SMALL($A$2:$A$602;INT(COUNT($A$2:$A$602)*RAND())+1). This formula gives us a
resample with replacement of the same size as our original sample.
Taking the average (or median, or whatever statistic) of B2:B602 we obtain the mean of our
resample. The bootstrap procedure tells us to take hundreds of resamples and calculate the
statistic, so that we can obtain the bootstrap distribution of the statistic. As a statistic we use
the mean for this example.
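As a sketch of what the spreadsheet is doing, the whole procedure (resample with replacement, take the mean, repeat hundreds of times) can be written in a few lines of Python; this is an illustration only, and the data set here is a small hypothetical stand-in for the 601 creatinine values.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=300):
    """Draw n_resamples resamples with replacement (each the size of the
    original sample) and return the mean of each resample."""
    n = len(data)
    return [statistics.fmean(random.choices(data, k=n))
            for _ in range(n_resamples)]

# hypothetical small data set standing in for the 601 creatinine values
data = [0.55, 0.62, 0.68, 0.68, 0.70, 0.71, 0.74, 0.80]
means = bootstrap_means(data)
# the standard deviation of `means` estimates the standard error of the mean
boot_se = statistics.stdev(means)
```

The standard deviation of `means` plays the role of the bootstrap standard error discussed above.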
We could do this using macros (VBA), but there is another way to create the bootstrap
sampling distribution: Excel's Data >> Table. Our data table will be a rectangle two columns
wide, with a formula for the resampled statistic in the top right-hand corner of the rectangle.
Put the word 'Mean' in cell F2 and in cell G2 enter the formula =AVERAGE($B$2:$B$602)
to calculate the mean of the resample. Then select F2:G301 and choose Data >> Table.
A small dialog window appears, asking for the Row input cell (which you may ignore) and
the Column input cell, where you enter a reference to cell F2. Then click 'OK' and the data
table fills with the averages of 300 resamples.
Each cell in G3:G301 refers to the reference cell F2 and the formula in G2. Each time you
change a cell in the spreadsheet, the data table will recalculate. This may take several
seconds, depending on the size of the resample and the number of bootstrap resamples. It may
therefore be wise to copy the range G2:G301 and use Paste Special >> Values; this removes
the data-table reference and stops the recalculation. Alternatively, you can turn off the
automatic recalculation of tables via Tools >> Options >> Calculation >> Automatic except
tables >> OK, and press F9 to recalculate at any time.
Another way to obtain this result, is by using VBA. The macro code is given below:
Sub stats1()
Dim R As Range
Dim Routput As Range
Dim Rresample As Range
Dim i As Integer
'original data range and offset for output: where we place the resample
Set R = Range("A2:A602")
Set Routput = Range("B1")
Routput.Value = "Resample"
Application.ScreenUpdating = False
Application.Calculation = xlCalculationManual
For i = 1 To R.Rows.Count
    Routput.Offset(i, 0).Formula = "=SMALL(" & R.Address & ",INT(COUNT(" & R.Address & ")*RAND())+1)"
Next i
Set Rresample = Routput.Offset(1, 0).Resize(R.Rows.Count, 1)
Application.Calculation = xlCalculationAutomatic
Application.ScreenUpdating = True
'start bootstrap
Routput.Offset(0, 2).Value = "Median"
Routput.Offset(0, 3).Value = "Pct2.5"
Routput.Offset(0, 4).Value = "Pct97.5"
Routput.Offset(0, 5).Value = "Average"
'300 bootstrap resamples: 4 statistics are calculated for each resample: median, 2.5% Pct, 97.5% Pct and mean
For i = 1 To 300
    Routput.Offset(i, 2).Value = WorksheetFunction.Median(Rresample)
    Routput.Offset(i, 3).Value = WorksheetFunction.Percentile(Rresample, 0.025)
    Routput.Offset(i, 4).Value = WorksheetFunction.Percentile(Rresample, 0.975)
    Routput.Offset(i, 5).Value = WorksheetFunction.Average(Rresample)
Next i
'summary statistics: mean, 90% confidence interval and standard deviation are calculated for each statistic
'for the median
Set Rresample = Range(Routput.Offset(1, 2), Routput.Offset(i - 1, 2))
Routput.Offset(i + 2, 2).Formula = "=Average(" & Rresample.Address & ")"
Routput.Offset(i + 3, 2).FormulaR1C1 = "=NORMINV(0.05,R[-1]C,R[+2]C)"
Routput.Offset(i + 4, 2).FormulaR1C1 = "=NORMINV(0.95,R[-2]C,R[+1]C)"
Routput.Offset(i + 5, 2).Formula = "=Stdev(" & Rresample.Address & ")"
'for Pct2.5
Set Rresample = Range(Routput.Offset(1, 3), Routput.Offset(i - 1, 3))
Routput.Offset(i + 2, 3).Formula = "=Average(" & Rresample.Address & ")"
Routput.Offset(i + 3, 3).FormulaR1C1 = "=NORMINV(0.05,R[-1]C,R[+2]C)"
Routput.Offset(i + 4, 3).FormulaR1C1 = "=NORMINV(0.95,R[-2]C,R[+1]C)"
Routput.Offset(i + 5, 3).Formula = "=Stdev(" & Rresample.Address & ")"
'for Pct97.5
Set Rresample = Range(Routput.Offset(1, 4), Routput.Offset(i - 1, 4))
Routput.Offset(i + 2, 4).Formula = "=Average(" & Rresample.Address & ")"
Routput.Offset(i + 3, 4).FormulaR1C1 = "=NORMINV(0.05,R[-1]C,R[+2]C)"
Routput.Offset(i + 4, 4).FormulaR1C1 = "=NORMINV(0.95,R[-2]C,R[+1]C)"
Routput.Offset(i + 5, 4).Formula = "=Stdev(" & Rresample.Address & ")"
'for the mean
Set Rresample = Range(Routput.Offset(1, 5), Routput.Offset(i - 1, 5))
Routput.Offset(i + 2, 5).Formula = "=Average(" & Rresample.Address & ")"
Routput.Offset(i + 3, 5).FormulaR1C1 = "=NORMINV(0.05,R[-1]C,R[+2]C)"
Routput.Offset(i + 4, 5).FormulaR1C1 = "=NORMINV(0.95,R[-2]C,R[+1]C)"
Routput.Offset(i + 5, 5).Formula = "=Stdev(" & Rresample.Address & ")"
'numbers are formatted to 3 digits
Routput.Offset(i + 2, 2).Resize(4, 4).NumberFormat = "0.000"
End Sub
The macro first generates the resample by placing the formula
=SMALL(data_range;INT(COUNT(data_range)*RAND())+1) in cells B2:B602. We turn off
automatic screen updating and automatic recalculation to speed up filling the cells with this
formula.
Then we calculate 300 bootstrap resamples and put the statistics median, Pct2.5, Pct97.5 and
mean in cells D2:D301, E2:E301, F2:F301 and G2:G301 respectively. While the macro is
running, you can follow this process on the screen (for small resamples this will not be
visible, because resampling is too fast): you see the resampling, followed by the calculated
statistics. Finally, summary statistics are calculated for each bootstrap statistic: the mean of
the bootstrap statistic, the 90% confidence interval for it, and the standard deviation of the
bootstrap distribution, which corresponds to the standard error of the statistic in the original
sample. We used NORMINV(0.05; Mean; Stdev) to calculate the value that corresponds to
the 5% lower tail of the normal distribution with mean 'Mean' and standard deviation 'Stdev';
this can be seen as the lower limit of the 90% confidence interval for the mean.
In this example, the original sampling distribution closely corresponds to the Gaussian bell-
shaped distribution we expect for normally distributed data. In some cases, however, the
distribution may be skewed. You may then ask yourself whether the mean and standard
deviation are the right statistics, what you can do instead, and for which statistic confidence
intervals should be calculated. The mean might not be appropriate. The bootstrap provides a
way out of this dilemma: it can produce a resampling distribution that can be used to set
confidence limits.
Example 2: a resistant measure of center
We are interested in the sales prices of residential property in Seattle (example from Tim
Hesterberg). The data available from the county assessor's office do not distinguish
residential property from commercial property, so a few large commercial sales are present in
the sample and may greatly increase the mean selling price. Therefore, we prefer to use a
measure of center that is more resistant than the mean.
Selling prices (in €1000) for real estates (from Tim Hesterberg Table 18.1)
142 232 132,5 200 362 244,95 335 324,5 222 225
175 50 215 260 307 210,95 1370 215,5 179,8 217
197,5 146,5 116,7 449,9 266 265 256 684,5 257 570
149,5 155 244,9 66,407 166 296 148,5 270 252,95 507
705 1850 290 165,95 375 335 987,5 330 149,95 190
The data from the table are presented in a histogram, showing a strongly skewed distribution,
with several outliers, which may be commercial sales.
[Histogram: frequency distribution of the selling prices, 0–2000 (€1000)]
The bootstrap distribution of the mean based on 1000 resamples is shown in the histogram
below.
[Histogram: bootstrap distribution of the mean, 1000 resamples (frequency vs. bin, ≈ 210–450 €1000)]
This distribution is skewed to the right, so the mean of the resamples cannot be considered
normally distributed. This reflects the fact that, for most statistics, the bootstrap distribution
approximates the shape, spread and bias of the actual sampling distribution. We have two
alternatives: use a confidence interval not based on normality, or choose a measure of center
whose distribution is closer to normal. Here we show the bootstrap distribution of a different
statistic, one that is more resistant to skewness and outliers. One such statistic is the median,
the mean of the 1 or 2 middle observations. Another is the trimmed mean, e.g. the 25%
trimmed mean, which averages the middle 50% of the observations (25% trimmed from each
tail). In Excel you can use the functions MEDIAN(array) and TRIMMEAN(array; 0.5) for
this; note that TRIMMEAN's second argument is the total fraction of observations excluded,
split over both tails.
When we use the median statistic, we obtain the following histogram for the bootstrap
distribution:
[Histogram: bootstrap distribution of the median (frequency vs. bin, 200–310 €1000)]
This shows that the median does not always work well: the shape of the distribution is not
easy to characterize. Bootstrapping trimmed means works better than bootstrapping medians,
because the bootstrap does not work well for statistics that depend on only 1 or 2
observations. When we use the 25% trimmed mean (50% of the data is trimmed off, 25% on
each side), we obtain:
[Histogram: bootstrap distribution of the 25% trimmed mean (frequency vs. bin, ≈ 200–340 €1000)]
The bootstrap confidence interval can be calculated from the mean of the bootstrap statistic
and the bootstrap standard deviation:
Statistic ± t * SE
where t is the critical value of the t(n−1) distribution (here n = 50), obtained in Excel with
TINV(0.05; 49) = 2.009575. Here the mean of the 1000 trimmed means was 244.12 and the
standard deviation was 16.45, so the 95%CI for the trimmed mean becomes (211.06; 277.18).
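The trimmed-mean calculation and the t-based interval can be sketched in Python (an illustration using the figures reported above; the `trimmed_mean` helper is ours, not an Excel function):

```python
import statistics

def trimmed_mean(values, trim=0.25):
    """25% trimmed mean: drop trim*n observations from each tail and
    average the remainder (the middle 50% when trim = 0.25)."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    return statistics.fmean(ordered[k:len(ordered) - k])

# t-based bootstrap CI, Statistic ± t*SE, with the figures from the text:
# bootstrap mean of the 1000 trimmed means = 244.12, bootstrap SD (= SE)
# = 16.45, and the critical value TINV(0.05; 49) = 2.009575
stat, se, t = 244.12, 16.45, 2.009575
ci = (stat - t * se, stat + t * se)
# ci reproduces the interval (211.06; 277.18) reported above
```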
Example 3: comparing two samples
In a two-sample problem, we wish to compare two populations, based on separate samples
from each population. When both populations are roughly normal, we can use the two-sample
t-test to compare the population means. The bootstrap can also compare two populations,
without the normality condition and without the restriction to comparing means.
We proceed as follows. Suppose we have two samples of sizes n and m from the two
populations. We draw a resample of size n with replacement from the first sample and a
separate resample of size m from the second sample. We compute a statistic that compares
the two groups, e.g. the difference between the two sample means. We repeat this resampling
process hundreds of times, construct the bootstrap distribution of the statistic, and inspect its
shape, bias and bootstrap standard error as we did previously.
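As an illustration of this two-sample procedure (in Python rather than Excel, and with made-up group data standing in for the nucleus-area measurements):

```python
import random
import statistics

def bootstrap_diff_means(sample1, sample2, n_resamples=500):
    """Resample each group separately with replacement (keeping the
    original group sizes n and m) and return the difference in means
    for every combined resample."""
    n, m = len(sample1), len(sample2)
    return [statistics.fmean(random.choices(sample1, k=n))
            - statistics.fmean(random.choices(sample2, k=m))
            for _ in range(n_resamples)]

# hypothetical data standing in for the 10 mM and 5 mM groups
group_high = [52, 60, 75, 90, 110, 130, 95, 80]
group_low = [40, 45, 55, 70, 85, 60, 50, 65]
diffs = bootstrap_diff_means(group_high, group_low)
# diffs is the bootstrap distribution of the difference in means
```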
We consider data on the nucleus area of cells treated with 5 mM and 10 mM FeCl2 (oxidative
stress). We expect more aggregation with increasing FeCl2 concentration, which results in a
larger total nucleus area. The distribution of the nucleus area for cells treated with 5 mM
FeCl2 is given in the first histogram below, and for 10 mM FeCl2 in the second.
[Histogram: nucleus area for cells treated with 5 mM FeCl2 (frequency vs. bin)]
[Histogram: nucleus area for cells treated with 10 mM FeCl2 (frequency vs. bin)]
Both distributions are right skewed. We want to estimate the difference of population means
µ1 − µ2, but since both distributions are quite skewed we should be reluctant to use the two-
sample t confidence interval. To compute the bootstrap distribution for the difference in
sample means, we resample separately from the two samples. Each of our 500 resamples
consists of two resamples, one of size n and one of size m. For each combined resample, we
calculate the difference in means. The 500 differences form the bootstrap distribution, which
is shown in the histogram below.
[Histogram: bootstrap distribution of the difference in means, 500 resamples (frequency vs. bin, ≈ 3–12)]
The bootstrap mean is 7.93 and the bootstrap standard deviation 2.21; the observed difference
is 7.92. The bootstrap normal probability plot, shown in the figure below, indicates that the
bootstrap distribution is approximately normal, so the t confidence limits may be calculated.
In our example the 95%CI of the mean difference is (3.59; 12.27).
[Normal probability plot of the bootstrap differences (observed vs. expected value, 1.0–15.0): the points lie close to the identity line]
The bootstrap distribution is not always close to the bell-shaped normal curve; in that case the
bootstrap t confidence interval should not be used, because this method is based on the
normality assumption. If the bootstrap distribution is approximately normal and the bias is
small, we can use the bootstrap t confidence interval, statistic ± t*SE. In other cases, the 95%
bootstrap percentile confidence interval can be calculated; in our example this becomes
(3.25; 12.09).
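The percentile interval is simply a matter of sorting the bootstrap statistics and cutting off 2.5% in each tail. A Python sketch (with simulated stand-in bootstrap values; Excel's PERCENTILE function interpolates slightly differently than this simple order-statistic version):

```python
import random

def percentile_interval(boot_stats, alpha=0.05):
    """(1 - alpha) bootstrap percentile interval: cut off alpha/2 of the
    ordered bootstrap statistics in each tail."""
    ordered = sorted(boot_stats)
    k = int(len(ordered) * alpha / 2)  # number trimmed from each tail
    return ordered[k], ordered[len(ordered) - 1 - k]

# simulated stand-in for 1000 bootstrap differences of means
random.seed(1)
boot = [random.gauss(7.9, 2.2) for _ in range(1000)]
lo, hi = percentile_interval(boot)
```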
Example 4: bootstrapping the correlation coefficient
We use the serum creatinine data measured in the subgroup of 90–95 year old men (n = 48)
with two different methods: an enzymatic method and a Jaffé method. The methods differ
from each other, but there should still be a high correlation between the results. A scatterplot
of the data is shown below.
[Scatterplot: serum creatinine (mg/dL), Jaffé method vs. enzymatic method, with fitted line y = 0.9069x + 0.1744, R² = 0.9025]
We can use the same bootstrap procedure to find a confidence interval for the correlation
coefficient, which is R = 0.95. One point needs special attention: because each observation
consists of the serum creatinine value obtained with both methods, we must be careful not to
break the pairing between the two methods during resampling. Therefore, we give each
patient an identification number, called PatID, and we resample the patients.
Resampling then gives, for example, a list of randomly drawn PatIDs with the corresponding
paired measurements.
We use Data >> Table with the built-in Excel function CORREL to calculate the correlation
coefficient of 500 resamples. To obtain this list, first put 'Correlation' in cell I3 and the
formula =CORREL(F3:F50;G3:G50) in cell J3. Then select I3:J502, choose Data >> Table,
set the Column Input Cell to I3 and click 'OK'.
It is recommended to break the link with the data-table formula after it has generated the
correlation coefficients of the resampled data, because every time you change a cell in the
spreadsheet the SMALL function will resample (RAND() is a so-called volatile function) and
the data table will recalculate all 500 correlation coefficients of the resamples. In my
experience, Excel is not always bulletproof against this kind of continuous recalculation; it
can produce unexpected errors or even shut down unexpectedly.
The bootstrap distribution of the correlation coefficient is given in the histogram below. It is
not unexpected that this distribution deviates from the normal bell-shaped curve, as the
correlation coefficient can never be larger than 1.
[Histogram: bootstrap distribution of the correlation coefficient (frequency vs. correlation coefficient, 0.90–1.00)]
The normal probability plot shows deviations from the identity line that confirm the non-
normality of the data. Therefore, it is recommended not to use the t confidence intervals, but
to use the bootstrap percentile interval.
[Normal probability plot of the bootstrap correlation coefficients (observed vs. expected value, 0.90–1.00): the points deviate from the identity line]
The 95% t confidence interval would be (0.922; 0.975). The bootstrap percentile interval
(Pct2.5 – Pct97.5) is (0.918 – 0.972), which is only slightly different from the t confidence
interval.
Example 5: bootstrapping regression
Regression models may be bootstrapped in exactly the same way as shown in example 4. The
original data consists of x, y pairs and the statistic computed from bootstrap replications
consists of paired estimates of slope and intercept. Often, the main interest is in estimates of
the slope, but we may also want to set confidence limits on an estimate of some value of x
computed from the estimates of slope and intercept. To set confidence limits on some
regression estimate by bootstrapping, one simply needs to follow the procedure presented
above, with the "statistic" being the estimate of interest in the study at hand. A problem with
this approach is often the sample size: with smallish samples, bootstrapping pairs may give
strange and variable results. We then need to consider bootstrapping the residuals instead.
The procedure is simple: fit a regression model to the original data, calculate the residuals
about the fitted line, and bootstrap those residuals. Consider the result of fitting a simple
linear regression to n original pairs of x, y observations. The outcome is a fitted regression
line y = a + bx, where a and b represent the estimates of intercept and slope. The residuals
about the regression line are ei = yi – a – bxi (i = 1, 2, 3, …, n). We now bootstrap the
residuals: we take repeated random samples with replacement of n observations from the
residuals, add these to the fitted regression line and calculate a new set of n values of yi.
Combined with the original set of x-values (unchanged throughout), these new pairs
constitute the bootstrap samples. We then calculate the bootstrap replication by fitting a new
regression line to each bootstrap sample. The only tricky part is to remember that each new
value of yi is computed from the i-th value of xi, so that the same residual may be associated
with several values of xi, depending on the random selection. That is, the new set of yi-values
is computed from yi = a + bxi + ei* (i = 1, 2, 3, …, n), with a and b coming from the
regression line fitted to the original data and the values ei* drawn at random with replacement
from the n residuals ei = yi – a – bxi (i = 1, 2, 3, …, n).
First we calculate the regression line from the original data.
[Scatterplot: CREA (Enz) vs. CREA (Jaffé), 0.5–1.9 mg/dL, with fitted line y = 0.9069x + 0.1744, R² = 0.9025]
The 95% CI of the slope obtained via t-statistics is: [0.8184; 0.9953].
This gives us a slope of 0.9069 and an intercept of 0.1744. With these values we calculate the
residuals ei = yi – 0.1744 – 0.9069 xi. Then we resample these residuals and add the resampled
residual to the regression equation to obtain the new yi = 0.1744 + 0.9069 xi + ei*, where the *
denotes that the residual is resampled.
We then use Excel’s Data >> Table option to calculate the bootstrap distribution of slopes of
the resamples. This gives us the following graph:
[Histogram: bootstrap distribution of the slope (frequency vs. bin, 0.75–1.03)]
Had we resampled the original pairs of data (instead of calculating the residuals and
resampling these), we would have obtained:
mean: 0.9076
stdev: 0.0389
t-based 95%CI: [0.8295; 0.9857]
bootstrap 95%CI: [0.8341; 0.9851]
We used the Fisher Iris dataset to demonstrate how we can calculate bootstrap confidence
intervals for logistic regression parameters. In this dataset we only used the data of two
species (Versicolor and Virginica), coded as 1 and 0, and the sepal length (x).
OBSERVED: b0 = 12.5708, b1 = -2.01293

ID  x    Species  p(x)     p(x)^y    (1-p(x))^(1-y)  product   ln(product)
1   7.0  1        0.1795   0.179504  1               0.179504  -1.71756
2   6.4  1        0.42264  0.422637  1               0.422637  -0.86124
3   6.9  1        0.21108  0.211081  1               0.211081  -1.55551
4   5.5  1        0.81753  0.817527  1               0.817527  -0.20147
5   6.5  1        0.37443  0.374432  1               0.374432  -0.98234
Logistic regression differs from regular regression in that the dependent variable is binary
(versicolor versus virginica, coded as 1 or 0). Based on the sepal length, we want to predict
the probability that the species is versicolor. The model should fit the data and predict the
probability that the species is versicolor, given the sepal length. This example has only one
factor or variable x, and the function that gives the probability for each value of x is:
p(x) = exp(b0 + b1x) / [1 + exp(b0 + b1x)]
The issue is to determine the parameters b0 and b1. Unlike regular regression problems, we
cannot use the method of least squares to estimate these parameters. The method of
estimation usually applied here is called maximum likelihood. To apply this technique, we
must first construct a likelihood function, and we estimate the parameters in our regression
equation by choosing the values that maximize it. The likelihood function L(b0, b1) is
defined as:
L(b0, b1) = ∏i=1..n p(xi)^yi [1 − p(xi)]^(1−yi)
or, taking logarithms,
ln L(b0, b1) = Σi=1..n [ yi ln p(xi) + (1 − yi) ln(1 − p(xi)) ]
The function L – or equivalently ln L - should be maximized. This can be done using Excel’s
Solver Add-in. This add-in comes with the Excel software but should be installed separately.
When installed correctly, one should find the menu item Tools >> Solver.
To maximize the function ln L via Excel’s Solver, we first set up the spreadsheet as follows:
a) In column F we set an ID for each data row. The data are in columns G (the sepal
length) and H (the coded value for the species).
b) In column I we calculate p(x), based on start values for b0 and b1, using the function
p(x) = exp(b0 + b1x) / [1 + exp(b0 + b1x)], where x is the sepal length value.
c) In column J we calculate p(x)^y, where y is the coded value of the species; in column
K we calculate (1 − p(x))^(1−y); in column L we take the product of the cells in
columns J and K; and in column M we take the logarithm of the cells in column L. Of
course, you can also calculate y·ln p(x) + (1 − y)·ln(1 − p(x)) directly and sum these
terms; this is sometimes preferable to avoid numerical problems in Excel.
d) We sum all cells in column M, thus obtaining Σ ln L. This value should be
maximized by changing the cells containing b0 and b1, which can be done using
Excel's Solver.
Using Tools >> Solver, a dialog window appears. Select the cell containing the Σ ln L result
(here cell $M$104). This value should be maximized (set 'Equal To:' to Max). In the textbox
'By Changing Cells:' you should refer to the cells containing b0 and b1. Clicking 'Solve'
invokes an iterative non-linear procedure that maximizes the value in cell $M$104.
The logistic regression parameters for our example are b0 = 12.57 and b1 = −2.01.
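For readers outside Excel, the same maximum-likelihood fit can be sketched in pure Python; plain gradient ascent on ln L stands in for Solver's iterative search (an illustration only, with a tiny made-up data set, not the iris data):

```python
import math

def fit_logistic(x, y, lr=0.01, n_iter=20000):
    """Maximum-likelihood fit of p(x) = exp(b0+b1*x)/(1+exp(b0+b1*x))
    by plain gradient ascent on ln L; the gradient of
    ln L = sum(y*ln p + (1-y)*ln(1-p)) is (sum(y-p), sum((y-p)*x))."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - p
            g1 += (yi - p) * xi
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# tiny made-up data set: the probability of y = 1 increases with x
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(xs, ys)
```

Gradient ascent is a simple stand-in; Solver uses a more sophisticated non-linear search, but both maximize the same log-likelihood.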
To obtain the confidence intervals for these parameters via bootstrapping, we proceed as
follows:
We made (e.g.) 100 resamples from the originally observed data by making use of an Excel
macro, called solvertest. In this macro we first set the cells R1 and R2 equal to 1, which are
the start values of the logistic regression parameters for our resample. We then generate
random IDs from the original sample IDs (which were values from 1 to 100). These random
IDs were obtained using the VBA random function ‘Rnd()’ and put in cells O4 till O103. In
column P and Q we used the VLOOKUP function to find the corresponding sepal length (x)
and coded value for species (y) from the originally observed data (in cells F4:H103).
Automatically, the values of p(x) and Σ ln L (in cell $V$104) were then calculated on the
spreadsheet for the resample.
Sub solvertest()
Dim i As Integer
Dim j As Integer
For i = 1 To 100
    Range("R1").Value = 1
    Range("R2").Value = 1
    'generate random patient IDs
    For j = 1 To 100
        Range("O4").Offset(j - 1, 0).Value = Int(100 * Rnd()) + 1
    Next j
    SolverReset
    SolverOk SetCell:="$V$104", MaxMinVal:=1, ValueOf:="0", ByChange:="$R$1:$R$2"
    SolverSolve UserFinish:=True
    'store the fitted parameters of this resample (results in column X and following)
    Range("X1").Offset(i - 1, 0).Value = Range("R1").Value
    Range("X1").Offset(i - 1, 1).Value = Range("R2").Value
Next i
End Sub
For this resample, we calculated the logistic regression parameters b0 and b1, using Excel’s
Solver. To use Excel’s Solver in a VBA macro we first need to set a reference to it. You can
do this from the Visual Basic Editor by choosing Tools >> References which invokes a dialog
window in which you can check the ‘SOLVER’ reference in the available references list. This
reference only needs to be set once.
In the VBA code the following statements are used:
SolverReset, which resets all Solver settings;
SolverOK, which is equivalent to clicking Solver on the Tools menu and then specifying the
options in the Solver Parameters dialog box;
SolverSolve, which is equivalent to clicking Solve in the Solver Parameters dialog box.
By setting UserFinish to True, the Solver Results dialog window will not appear.
When Excel’s Solver has finished its iterative process, the results are printed in column X and
the whole procedure is repeated, that is, a new resample is taken and new fit parameters are
obtained.
The histogram of the values for b0 is shown below:
[Histogram: bootstrap distribution of b0, bins from 5 to 25]
The bootstrap 95% CI for b0 is [7.56; 21.11], and for b1 it is [-3.40; -1.23].
Example 6: confidence intervals for Cohen’s Kappa
Kappa provides a measure of the degree to which two raters concur in their respective sortings
of N items into k mutually exclusive categories. A 'rater' in this context can be an individual
human being, a set of individuals who sort the N items collectively, or some non-human
agency, such as a computer program or diagnostic test, that performs a sorting on the basis of
specified criteria.
The original and simplest version of kappa is the unweighted kappa coefficient introduced by
J. Cohen in 1960. To illustrate, suppose that our raters are two clinical tests, A and B,
independently employed to sort each of N=100 subjects into one or the other of k=3
diagnostic categories. The table below shows a cross-tabulation of the sortings actually
observed.
OBSERVED                  Method A
                      1     2     3   Total
            1        44     5     1     50
Method B    2         7    20     3     30
            3         9     5     6     20
        Total        60    30    10    100
The table below shows the cell frequencies that would have been expected by mere chance,
given the observed marginal totals.
EXPECTED                  Method A
                      1     2     3   Total
            1        30    15     5     50
Method B    2        18     9     3     30
            3        12     6     2     20
        Total        60    30    10    100
observed concordant: 70
chance-expected concordant: 41
excess: 29
chance-expected non-concordant: 59
kappa = 29/59 = 0.4915
We now resample the paired observations using the method described earlier (actually, we
resample the SampleID and use VLOOKUP to assign the results of methods A and B to the
corresponding SampleID).
An example of the resampled frequency table is:
RESAMPLE                  Method A
                      1     2     3   Total
            1        45     6     3     54
Method B    2         9    19     3     31
            3         8     3     4     15
        Total        62    28    10    100
To obtain this table in Excel we make use of array functions, which, unlike pivot tables,
recalculate automatically whenever the data are resampled.
As an example, consider cell L3, which holds the number of concordant 1’s. This number is
obtained using the array formula =SUM((B2:B101=1)*(C2:C101=1)), which must be entered
using CTRL + SHIFT + ENTER. If you do this properly, {} brackets appear around the
formula. Array formulas are a powerful tool in Excel: an array formula works with an
array, or series, of data values rather than a single data value.
In our example, the first array is a series of TRUE or FALSE values which are the results of
comparing B2:B101 to the value 1. The second array is also a series of TRUE or FALSE
values, the result of comparing C2:C101 to 1. These two arrays are multiplied together. When
you multiply two arrays, the result is itself an array, each element of which is the product of
the corresponding elements of the two arrays being multiplied. The SUM function simply
adds up the elements of the array and returns a result of 44, the number of concordant 1’s
for methods A and B. By filling the frequency table for the resampled data, we can
calculate Cohen’s kappa from this frequency table, in the same way we calculated it for
the originally observed data. By using Data >> Tables we then obtain the bootstrap
distribution of kappa for our 1000 resamples.
[Histogram: bootstrap distribution of kappa for 1000 resamples; bins approximately 0.28 to 0.68]
Mean of the 1000 resampled kappa values: 0.4894 (stdev 0.0693)
T-based 95% CI: [0.3518; 0.6269]
Bootstrap percentile 95% CI: [0.3487; 0.6211]
Note that kappa = 0.4915 for the originally observed data, and the 95% CI obtained for
kappa via the asymptotic standard error (ASE) is [0.3475; 0.6355].
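The whole bootstrap can be mimicked outside Excel. The sketch below (Python, purely illustrative; the seed and the percentile method are our own choices) rebuilds the 100 rating pairs from the observed frequency table, resamples them with replacement, and takes a percentile 95% CI for kappa.

```python
import random

# rows: method B categories 1-3, columns: method A categories 1-3
table = [[44, 5, 1], [7, 20, 3], [9, 5, 6]]

# rebuild the 100 individual (A, B) rating pairs from the frequency table
pairs = [(a, b) for b in range(3) for a in range(3) for _ in range(table[b][a])]

def kappa(pairs):
    """Unweighted Cohen's kappa from a list of (rater A, rater B) pairs."""
    n = len(pairs)
    counts = [[0] * 3 for _ in range(3)]
    for a, b in pairs:
        counts[b][a] += 1
    row = [sum(r) for r in counts]                                   # B marginals
    col = [sum(counts[r][c] for r in range(3)) for c in range(3)]    # A marginals
    observed = sum(counts[i][i] for i in range(3))                   # concordant
    expected = sum(row[i] * col[i] / n for i in range(3))            # by chance
    return (observed - expected) / (n - expected)

random.seed(1)
# 1000 bootstrap resamples of the 100 pairs, drawn with replacement
boot = sorted(kappa([random.choice(pairs) for _ in pairs]) for _ in range(1000))
ci = (boot[25], boot[975])        # percentile 95% CI
```

With the observed table this reproduces kappa = 0.4915 exactly, and the percentile interval should come out in the neighbourhood of the [0.3487; 0.6211] interval reported above.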
Significance testing using permutation tests
In some cases, we want to determine whether an observed effect, such as the difference
between two means, could reasonably be ascribed to the randomness introduced in selecting a
sample. If not, we have evidence that the effect observed in the sample reflects an effect that
is present in the population. We will proceed as follows:
• We start by choosing a statistic that measures the effect we are looking for
• We construct the sampling distribution that this statistic would have if the effect were
not present in the population
• We locate the observed statistic on this distribution. A value in the tail of the
distribution would rarely occur by chance, and so is evidence that something other
than chance is operating.
Resampling for significance tests requires that we resample in a manner consistent with the
null hypothesis. The statement that the effect we seek is not present in the population is the
null hypothesis. The probability, calculated taking the null hypothesis to be true, that we
would observe a statistic value as extreme or more extreme than the one we did observe is the
p-value. Because p-values are calculated assuming that the null hypothesis is true, we cannot
resample from the observed sample as we did earlier. Here, we must resample to create a
distribution centered at the parameter value stated by the null hypothesis.
The table below shows the results from a small experiment where 7 mice out of 16 were
randomly selected and treated with a new drug, while the other 9 mice served as the control
group. The treatment was intended to prolong survival after surgery, expressed in days. The
question that now arises is: does the new drug prolong survival?
Treated   Control
   94        52
   38        10
   23        40
  197       104
   99        50
   16        27
  141       146
             31
             46

          Treated   Control
Mean        86.86     56.22
Stdev       66.77     42.42
SE          25.24     14.14
• Choose 7 out of 16 mice at random to be the treatment group; the other 9 mice are the
control group. Choose without replacement! This is called a permutation resample.
Calculate the mean in each group. The difference between these means is our statistic.
• Repeat this resampling from the 16 mice hundreds of times. The distribution of the
statistic (difference in means) forms the sampling distribution under the condition that
the null hypothesis is true. It is called a permutation distribution.
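The two steps above translate directly into code; the sketch below (Python for illustration, using the mice data from the table; the seed and resample count are arbitrary choices) estimates the one-sided p-value.

```python
import random

treated = [94, 38, 23, 197, 99, 16, 141]
control = [52, 10, 40, 104, 50, 27, 146, 31, 46]
pooled = treated + control
obs_diff = sum(treated) / 7 - sum(control) / 9   # observed statistic, about 30.64

random.seed(0)
count = 0
n_resamples = 1000
for _ in range(n_resamples):
    random.shuffle(pooled)                       # permute: choose without replacement
    diff = sum(pooled[:7]) / 7 - sum(pooled[7:]) / 9
    if diff >= obs_diff:
        count += 1
p_value = count / n_resamples                    # proportion at least as extreme
```

With these data the estimate lands near 0.15, in line with the t-test comparison discussed in the text.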
To do this in Excel, we need a VBA array function. From Chip Pearson’s website ([Link]) we
used the VBA function UniqueRandomLongs. This array function returns Number unique random
values between a Minimum and a Maximum value. We arrange the data as follows:
In column H we introduce the ‘MiceID’, that is, we number the mice results from 1 to 16,
where the first 7 entries are the data from the treated mice and the next 9 entries are
the data of the control mice. Then we use the function from Chip Pearson to randomize
these numbers, each number appearing exactly once. Resample1 is taken from the data in
column J, using the VLOOKUP function.
The VLOOKUP function looks up the value in H2 (which is the MiceID = 1) in the random
sequence of miceID of column I and returns the value next to this miceID in column J where
the data is placed. For Resample1, we use the MiceID from 1 to 7, for Resample2 we use the
MiceID from 8 to 16. Then we calculate the statistic as the difference between the means of
both resamples.
Next, we have to repeat that hundreds of times. We can do that using the data >> tables option
in Excel and referring to cell M12 (the calculated difference between means of the resamples)
for the Column input cell, or we could use a small VBA macro, like this:
Sub MakeBootstrapStat()
Dim i As Integer
For i = 1 To 1000
    Range("O1").Offset(i, 0).Value = Range("M12").Value
Next i
End Sub
The macro code above simply fills column O with the value of cell M12, which contains the
calculated difference between the means of the resamples. By entering a value in a cell in
column O, the recalculation is triggered each time and the resamples are refreshed because the
UniqueRandomLongs function has a dummy variable to make it volatile.
The permutation distribution of our difference in means is shown in the histogram below.
[Histogram: permutation distribution of the difference in means; bins from -70 to 70]
The p-value for the one-sided test (mean of treatment group > mean of control group) is based
on 1000 permutation resamples. The observed difference was 86.86 – 56.22 = 30.64. The p-
value for the one-sided test is the probability that the difference in means is 30.64 or greater,
calculated taking the null hypothesis to be true. The histogram above shows how the statistic
would vary if the null hypothesis were true. The proportion of observations greater than 30.64
estimates the p-value. From the resampling results we can find that 147 of the 1000 results
gave a value of 30.64 or larger. The proportion of samples that exceed the observed value of
30.64 is thus 147/1000 = 0.147. In fact, a small refinement can be made: it can be shown
that counting the observed statistic itself as one extra extreme resample improves the
estimated p-value. The permutation test estimate of the p-value is then
(147 + 1) / (1000 + 1) = 0.148.
Using the one-sided two-sample t-test to compare the means of the treated and control group,
we obtain a p-value of 0.140, which is very similar to the p-value of the permutation test.
Permutation tests have these advantages over t tests:
• The t test gives accurate p-values if the sampling distribution of the difference in
means is at least roughly normal. The permutation test gives accurate p-values even
when the sampling distribution is not close to normal.
• We can directly check the normality of the sampling distribution by looking at the
permutation distribution.
If the two p-values differ considerably, it usually indicates that the conditions for the two
sample t-test don’t hold for these data. Permutation tests give more accurate p-values than t-
tests, especially when the sampling distribution is skewed.
An alternative way to reshuffle the original data from both samples, is to make use of an extra
column (column I), with random numbers. Then these random numbers are ranked from small
to large or vice versa in column J, giving a list of random integer numbers. Theoretically,
tied ranks might occur, but in practice this will be very rare. We then proceed as
described earlier. The advantage is that no macros are needed.
As an example, consider the data of a new drug to reduce blood pressure. The data for 10
patients is shown in the table below, before and after treatment with the new drug.
[Table: blood pressure before and after treatment for 10 patients; mean difference -29.2 mm Hg]
The mean difference is -29.2 mm Hg. The t-test for matched pairs gives a p-value of
0.00002088.
If we want to perform a permutation test, we have to keep in mind that the key step in the
general procedure for permutation tests is to form permutation resamples in a way that is
consistent with the study design and with the null hypothesis. The null hypothesis says that
the drug has no effect, so that the labels “before” and “after” have no meaning. Therefore,
we should resample by randomly assigning “before” or “after” to each patient, but we should
not mix scores
from different people, because that isn’t consistent with the pairing in the study design.
We can do this as follows in Excel:
Under the assumption of the null hypothesis, we randomly assign the blood pressure of one
patient to the “before” or “after” situation, by using the formula in cell E3: =IF(RAND()>0.5;
B2; A2) which assigns the value of the observed “After” situation to cell E3 if a random value
greater than 0.5 is returned by the RAND() function, otherwise the observed “Before”
situation is entered in cell E3.
In cell F3 we then have to set the ‘other’ value, the one that is not assigned to cell E3. We use
the formula =IF(E3=A2;B2;A2).
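Since the article's before/after table is not reproduced here, the sketch below uses hypothetical blood-pressure values (an assumption, chosen only so that the mean difference is close to the -29.2 mm Hg in the text) and applies the same random before/after swap within each pair.

```python
import random

# hypothetical data: systolic pressure before and after treatment for 10 patients
before = [160, 155, 170, 148, 165, 172, 158, 150, 168, 162]
after  = [128, 130, 142, 120, 135, 140, 131, 122, 138, 133]
diffs = [a - b for a, b in zip(after, before)]
obs_mean = sum(diffs) / len(diffs)               # observed mean paired difference

random.seed(0)
count = 0
n_resamples = 1000
for _ in range(n_resamples):
    # null hypothesis: "before"/"after" labels are meaningless, so randomly
    # swap them within each pair, which flips the sign of that difference
    m = sum(d if random.random() > 0.5 else -d for d in diffs) / len(diffs)
    if m <= obs_mean:
        count += 1
p_value = count / n_resamples
```

Because every patient improves in this (hypothetical) data set, hardly any sign-flipped resample reaches the observed mean difference and the estimated p-value is essentially zero, as in the text.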
We repeat this resampling and the calculation of the mean of differences hundreds of times.
We can do this by using data >> tables and referring to cell G14, where we calculated the
mean of the differences of the resample. The permutation distribution of 1000 resamples
looks like:
[Histogram: permutation distribution of the mean paired difference, 1000 resamples; bins from -40 to 40]
As none of the resample mean pair differences is below the observed mean difference of -29.2
mm Hg, the permutation test p-value equals 0.
Permutation tests can also be used to test the significance of a relationship between two
variables. For example, we looked at the relationship between two methods to determine the
serum creatinine concentration in 90-95 year old men (see example 4). The null hypothesis
would be that there is no relationship. In that case, the value of serum creatinine obtained for
the same patient by method 1 would have nothing to do with the value obtained by method 2.
We thus can resample in a way consistent with the null hypothesis by permuting the observed
values of method 2 among the patients at random.
As a test statistic we take the correlation. For every resample, we calculate the correlation
between the serum creatinine obtained by method 1 (in its original order) and by method 2 (in
a randomly reshuffled order). The p-value is the proportion of the resamples with correlation
larger than the original observed correlation.
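In Python the same permutation scheme looks as follows; the creatinine values here are simulated stand-ins (the original example 4 data are not reproduced in this section), so only the logic, not the numbers, matches the text.

```python
import random

def correl(xs, ys):
    """Pearson correlation coefficient, the analogue of Excel's CORREL."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(0)
# simulated paired creatinine-like values for 47 patients: method 2 is
# method 1 plus a little measurement noise, so they correlate strongly
method1 = [round(random.uniform(0.6, 2.5), 2) for _ in range(47)]
method2 = [round(x + random.gauss(0, 0.1), 2) for x in method1]
obs_r = correl(method1, method2)

count = 0
n_resamples = 1000
shuffled = method2[:]
for _ in range(n_resamples):
    random.shuffle(shuffled)          # permute method 2 among the patients
    if correl(method1, shuffled) >= obs_r:
        count += 1
p_value = count / n_resamples
```

As in the text, a strong observed correlation lies far outside the permutation distribution, so the estimated p-value is 0.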
To do this in Excel, we reshuffle the PatID numbers using the array function
UniqueRandomLongs we used previously in Example 4. In column E we thus generate a new
list of PatIDs but in a random order.
In cell G2 (and the cells G3:G48) we use the VLOOKUP function to get the value of method
2 corresponding to PATID = 10.
For each resample, we calculate the correlation, using CORREL between the data of method 1
(in its original order) and the data of method 2 (in its reshuffled order). The permutation
distribution for 100 resamples is obtained using Data >> Tables and referring to the cell
where CORREL is used. This gives us the following histogram:
[Histogram: permutation distribution of the correlation, 100 resamples; bins from -0.4 to 0.4]
The observed correlation coefficient was 0.95 which is far from any observed correlation
under the null hypothesis. Therefore, the permutation test p-value is 0, meaning that the null
hypothesis of no correlation is rejected.
In this example we use the data of a study of Post Traumatic Stress Disorder in rape survivors.
This study was carried out by Foa, Rothbaum, Riggs, and Murdock (1991), as part of a long
series of studies that Foa has conducted on this topic. Each of the participants completed a
symptom inventory at the start of treatment, and again at the end. We will use the data from
the end of treatment. There were four conditions. Group SIT (Stress Inoculation Therapy)
learned techniques for dealing with stress. Group PE (Prolonged Exposure) reviewed the rape
incident in their heads over and over again, until it lost some of its negative valence. Group
SC (Supportive Counseling) was a group that just received standard counseling, and Group
(WL) was a waiting list control group.
The data for this example are shown below:
Observed data
SIT   PE   SC   WL
  3   18   24   12
 13    6   14   30
 13   21   21   27
  8   34    5   20
 11   26   17   17
  9   11   17   23
 12    2   23   13
  7    5   19   28
 16    5    7   12
 15   26   27   13
 18        25
 12
  8
 10

        SIT       PE       SC      WL
Mean   11.07    15.40    18.09   19.50
Var    15.61   123.60    50.89   50.50
A standard one-way analysis of variance on these data would produce F = 3.046, p = .039,
which would lead us to reject the null hypothesis. However, with very little data in each cell,
we don't really have a good way to convince ourselves that normality is a reasonable
assumption. At best, all we can say is that the data are not so extreme as to lead us to believe
that they are not normally distributed. However, we can use the permutation procedure to
avoid having to make that assumption. I should also point out that there is a problem with
homoscedasticity (compare the variances of Groups SIT and PE.)
Here is the output we obtain via Tools >> Data Analysis >> Anova: Single Factor, where we
used the Analysis Toolpak Add-in in Excel.
SUMMARY
Groups   Count    Sum   Average   Variance
SIT         14    155    11.071     15.610
PE          10    154    15.400    123.600
SC          11    199    18.091     50.891
WL          10    195    19.500     50.500

ANOVA
Source of Variation        SS    df        MS        F    P-value
Between Groups        507.840     3   169.280   3.0458    0.03936
Within Groups        2278.738    41    55.579
Total                2786.578    44
If we are willing to assume that the data we have are a reasonable representation of the
populations from which they were drawn, then we can use those data to reproduce the
population under the null hypothesis of no differences between groups and then draw
resamples from that population. Notice, however, that we have clearly made an assumption.
We have assumed that the sample data reflect the population. That is just as much of an
assumption as the assumption of normality. This is an important point to keep in mind,
especially with small samples. However, using single factor anova assumes 1) representative
data, 2) normality and 3) equal variances. The last two assumptions are not needed for a
permutation test.
The advantage of bootstrapping and permutation tests for teaching statistics is that, to do
the simulations, one really has to know how to calculate the statistics.
We proceed as follows:
1) we first list all the observed data in one column, e.g. I4:I48, and give each
observation a unique ID (e.g. in H4:H48)
2) in column J4:J48 we generate 45 random numbers with =RAND()
3) in column K4:K48 we calculate the rank of each of the random numbers
4) in columns M till P we present the resample, using the VLOOKUP function to look up
the data corresponding to the randomly assigned rank. In other words, the values in
M till P are the reshuffled original data (that is, we permute the data and reassign
them randomly to each category, assuming the null hypothesis is true (all categories
are equal))
5) then we calculate the F-statistic: by permuting the data, the overall mean stays equal
to 15.6222 but the group means will change. Therefore, SStotal remains the same but we
have to recalculate SSwithin groups for each resample. The F-statistic is MSbg/MSwg.
6) We calculate 1000 or more F-values for the resamples. A histogram of these values is
shown below. The observed F-value was 3.046. The permutation test p-value is simply the
number of times the resample F-value is greater than the observed F-value, divided by the
number of resamples. We found a p-value of 0.038, which is very close to the value
obtained from single factor anova (p = 0.03936).
[Histogram: permutation distribution of the F-statistic; bins from 0.5 to 6.5]
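Steps 1) to 6) condense to the following sketch (Python for illustration, using the observed data; the seed and resample count are arbitrary). For simplicity the F-statistic is recomputed from scratch for every resample rather than reusing SStotal.

```python
import random

groups = {
    "SIT": [3, 13, 13, 8, 11, 9, 12, 7, 16, 15, 18, 12, 8, 10],
    "PE":  [18, 6, 21, 34, 26, 11, 2, 5, 5, 26],
    "SC":  [24, 14, 21, 5, 17, 17, 23, 19, 7, 27, 25],
    "WL":  [12, 30, 27, 20, 17, 23, 13, 28, 12, 13],
}

def f_statistic(samples):
    """One-way ANOVA F = MS between / MS within for a list of groups."""
    n = sum(len(g) for g in samples)
    grand = sum(sum(g) for g in samples) / n
    ss_between = 0.0
    ss_within = 0.0
    for g in samples:
        m = sum(g) / len(g)
        ss_between += len(g) * (m - grand) ** 2
        ss_within += sum((x - m) ** 2 for x in g)
    return (ss_between / (len(samples) - 1)) / (ss_within / (n - len(samples)))

sizes = [len(g) for g in groups.values()]
pooled = [x for g in groups.values() for x in g]
obs_f = f_statistic(list(groups.values()))       # about 3.046

random.seed(0)
count = 0
n_resamples = 1000
for _ in range(n_resamples):
    random.shuffle(pooled)                       # permute across all categories
    resample, start = [], 0
    for s in sizes:                              # reassign to groups of same sizes
        resample.append(pooled[start:start + s])
        start += s
    if f_statistic(resample) >= obs_f:
        count += 1
p_value = count / n_resamples
```

With these data the permutation p-value comes out close to the 0.038 reported above.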
In a first example we consider the relationship between the treatment of a specific disease
and the health status of the patient after a certain period of time, in terms of cured or
not. We are quite convinced of our new drug, so the majority of patients is in the treated
group and the patient distribution among both groups is quite unbalanced. The data are:

              Cured   Not cured   Total
Treatment      108        34       142
Control         15        13        28
Total          123        47       170
We claim that our new treatment increases the chance of being cured after a certain period of
time. The null hypothesis is: Ho: p1 = p2 and the alternative hypothesis is Ha: p1 > p2. The p-
value for the comparison of independent proportions is p = 0.0075 (one-sided chi-square test;
χ2 = 5.911). Under the null hypothesis, all 170 patients are equally likely to be cured. That is,
curing occurs for reasons that have nothing to do with whether the patient has taken the new
drug or not. We should resample in a way consistent with the null hypothesis. This can be
done by reshuffling the data and assigning randomly 142 to the first group and 28 to the other
group.
In Excel, we proceed as follows:
Using VLOOKUP, we find the cure status in the original data for the reshuffled patid.
We then take the first 28 patients for the control group and the last 142 patients for the
treatment group. We count the number of cured patients in each group and calculate the
proportion. To avoid rounding errors, it is easier to calculate the number of times we find a
value of 15 or less in the control group. Such a value will always result in a difference greater
than or equal to 0.225. In other words, instead of calculating the difference in proportions
between the cured patients in the treated and control group, we just calculate the number of
cured patients in the control group, for each resample. Then we calculate the number of times
this number is smaller than or equal to 15. Each time this happens, the difference in
proportions will be at least 0.225. This happens 18 times in 1000 resamples, resulting in a p-
value of 0.018.
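The same counting argument can be sketched as follows; note that the 2x2 cell counts (108 cured of 142 treated, 15 cured of 28 controls) are inferred from the totals and proportions quoted in the text.

```python
import random

# 123 cured among 170 patients; the first 142 entries are the treated group
status = [1] * 108 + [0] * 34 + [1] * 15 + [0] * 13
obs_diff = 108 / 142 - 15 / 28               # observed difference, about 0.225

random.seed(0)
count = 0
n_resamples = 1000
for _ in range(n_resamples):
    random.shuffle(status)                   # null: cure unrelated to treatment
    control = status[:28]                    # first 28 reshuffled patients
    # 15 or fewer cured in the control group implies a difference >= 0.225
    if sum(control) <= 15:
        count += 1
p_value = count / n_resamples
```

With these counts the estimate lands near the 0.018 quoted in the text.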
Another (although much more complex) way to arrive at the same result is by making use of
the built-in Excel function ‘binomdist’. Under the null hypothesis, we expect 72.4%
(123/170) of the patients to be cured in each group. We can calculate the cumulative binomial
distribution, once for 142 patients and once for 28 patients.
Using =BINOMDIST(n; 142; 0.724; TRUE) gives the cumulative probability distribution (0
or more persons are cured) in the treatment group, given that the probability is 72.4% to be
cured. The value of n goes from 0 to 142. Analogously, =BINOMDIST(n; 28; 0.724; TRUE)
calculates the cumulative distribution in the control group.
We then generate random numbers (see column G) and assign (via VLOOKUP) the number of
cured patients corresponding to the cumulative value of the binomial distribution closest to that random
number. E.g. the random number 0.742241815 lies between 0.6940 = Binomdist(105; 142;
0.724; TRUE) and 0.7574 = Binomdist(106; 142; 0.724; TRUE). So, we assign the value 105
to it and calculate 105/142 = 0.739437 as the first proportion (prop1). We do that 1000
times. We proceed analogously in the control group and find 1000 values for prop2. We then
take the difference and calculate how many times this difference is larger than or equal to
0.225. We obtain 21, giving a p-value of 0.021. Note that this value is not as stable as the
value we found using the previous method.
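A sketch of this binomial variant, where an inverse-CDF draw plays the role of the BINOMDIST table plus the VLOOKUP assignment (Python for illustration; the seed is arbitrary):

```python
import random

def binomial_draw(n, p):
    """Inverse-CDF draw from Binomial(n, p), mirroring the BINOMDIST lookup:
    return the smallest k whose cumulative probability covers a uniform draw."""
    u = random.random()
    cum = 0.0
    prob = (1 - p) ** n                      # P(X = 0)
    for k in range(n + 1):
        cum += prob
        if u <= cum:
            return k
        # P(X = k+1) from P(X = k) via the binomial recurrence
        prob *= (n - k) * p / ((k + 1) * (1 - p))
    return n                                 # guard against float rounding

random.seed(0)
p0 = 123 / 170                               # 72.4% cured under the null hypothesis
count = 0
n_resamples = 1000
for _ in range(n_resamples):
    prop1 = binomial_draw(142, p0) / 142     # treated-group proportion
    prop2 = binomial_draw(28, p0) / 28       # control-group proportion
    if prop1 - prop2 >= 0.225:
        count += 1
p_value = count / n_resamples
```

As the text notes, this estimate fluctuates more between runs than the reshuffling method, because the two group totals are no longer constrained to sum to 123 cured patients.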
We consider the data presented in the article by Regan, Hellmann and Stone (“Treatment of
Wegener’s granulomatosis”, 2001, Rheumatic Diseases Clinics of North America, 27(4), 863-
886). There are two treatment groups, one with 17 patients and the other with 19 patients.
Patients are being treated for Wegener’s granulomatosis. The data consists of the number of
patients in remission (Yes) and not in remission (No), for each treatment. The data are
reproduced in the table below:

              Yes    No   Total
Treatment 1     6    11      17
Treatment 2    14     5      19
Total          20    16      36
The proportion in remission for treatment 1 is 0.353 or 35.3%, while the proportion for
treatment 2 is 0.737 or 73.7%. The question we want to answer is how likely it is to get a
difference in proportions as large or larger than the observed difference of 0.384, if there is
actually no difference in the population remission proportions?
If the population remission proportions for the two treatments are not different, then we would expect to see
20/36 = 0.556 or 55.6% of the patients in each treatment group in remission.
We disregard treatment group (under the assumption of the null hypothesis both treatment
groups are equal) and reshuffle the patient IDs. The first 17 are now assigned to treatment
group 1 and the next 19 to treatment group 2. We calculate the difference in proportions for
the resampled data. We do this 1000 times or more, and the number of times we find a
difference as large or larger than 0.384 can be used to calculate a p-value, which here
equals 21/1000 = 0.021.
[1] Derek Christie, Resampling with Excel, Teaching Statistics, 26(1), Spring 2004.
[2] Wichman, B.A. and Hill, I.D., Algorithm AS 183: An Efficient and Portable Pseudo-Random
Number Generator, Applied Statistics, 31, 188-190, 1982.