CE Update
Received 12.26.05 | Revisions Received 1.5.06 | Accepted 1.6.06
Downloaded from [Link] by Universidade Federal de Minas Gerais user on 25 December 2021
Statistical Methods for Establishing and Validating Reference Intervals
Roger L. Bertholf, PhD
(University of Florida Health Science Center/Jacksonville, Jacksonville, FL)
DOI: 10.1309/CBMHPRFNLU1XA4XV
Abstract
Reference intervals are an essential part of laboratory medicine, and accreditation standards require that every laboratory result be accompanied by an appropriate reference interval to provide guidance in the interpretation of the test. Furthermore, the Clinical Laboratory Improvement Amendments of 1988 (CLIA ’88) require laboratories to verify that the reference interval accompanying a laboratory result is appropriate for the patient population the laboratory serves. These 2 tasks are fundamental to providing quality laboratory services. In this review, we will consider the statistical methods that can be applied to establish and validate reference intervals.
After reading this article, the reader should understand the statistical components behind establishing reference intervals.
Generalist exam 90601 questions and corresponding answer form are located after the CE Update section on p. 311.
Reference intervals delimit the expected results of various laboratory tests in healthy individuals, and provide some guidance in the interpretation of patient results. It can be difficult, however, to define the appropriate reference interval for any particular laboratory test. Many factors, including age, gender, race, posture during specimen collection, geographical location, diurnal variations, and even seasonal changes may influence the results of laboratory tests.1 These factors are partially responsible for the intra- and inter-individual variations observed in the results of laboratory tests, and reference intervals should reflect these variations. Reference populations that are selectively enriched with individuals predisposed to higher or lower values will result in a significantly biased reference range. A reference population should be composed of healthy individuals who are demographically matched to the patient population the laboratory serves, but this type of ideal reference population may not be accessible. For example, pediatric reference intervals are difficult to establish because of ethical concerns over performing unnecessary venipuncture on children, who cannot legally provide consent. Similarly, it may be difficult to recruit healthy elderly subjects to donate specimens for a reference interval study due to the high incidence of chronic disease in this group. For these and other reasons, ideal reference populations are often unavailable, and some compromises may be necessary in the selection of a suitable population on which to base reference intervals.
Manufacturers of in vitro diagnostic reagents are required to establish reference intervals as part of the application for approval by the Food and Drug Administration,2 and this requirement is ordinarily met by collecting data from several sites at which the product is tested prior to market. The population used for these studies is usually larger and more diverse than the reference population available to any particular laboratory, but may not faithfully represent the patients in specific geographical and demographic areas. For this reason, laboratory practice standards included in the 1988 revision of the Clinical Laboratory Improvement Act (CLIA), originally passed by Congress in 1967, required clinical laboratories to verify that reference intervals were appropriate for their specific patient populations.3
Statistical methods can be applied to the task of establishing reference intervals, as well as their validation in individual laboratories.

Establishing Reference Intervals
Reference intervals customarily represent the central 95% of values obtained from the reference population. Consequently, 2.5% of “normal” individuals will exceed the reference range, and 2.5% will be below it. It is tempting to assume that normal values for clinical laboratory measurements conform to a Gaussian distribution, in which the central 95% of the area under the probability distribution curve corresponds to the population mean (µ) ± 1.96 standard deviations (usually rounded to 2 SD, or 2σ). However, this approach is often misguided, since the concentrations of various biochemicals in the body rarely follow a Gaussian distribution, due to physiological factors that influence the concentration in a unidirectional manner; intra-individual variations are not strictly random. Statistical approaches that are based on a predictable distribution of data, such as the Gaussian (or “Normal”) distribution, are called “parametric,” since they make certain assumptions about the data derived from the population. Non-parametric methods make no assumptions about how the data are distributed, and provide ways to analyze and compare data sets that have unknown or unpredictable distributions.
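To make the distinction between the two approaches concrete, the following Python sketch computes a parametric interval (mean ± 1.96 SD) and a distribution-free percentile interval for the same sample. The function names, the linear-interpolation percentile rule, and the right-skewed data are illustrative assumptions and are not taken from any cited study.

```python
import statistics

def parametric_interval(values, z=1.96):
    """Central 95% limits assuming a Gaussian distribution: mean ± z * SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return mean - z * sd, mean + z * sd

def percentile_interval(values, lower=0.025, upper=0.975):
    """Distribution-free limits: linearly interpolated percentiles."""
    ordered = sorted(values)

    def pct(p):
        pos = p * (len(ordered) - 1)          # fractional position in the array
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

    return pct(lower), pct(upper)

# Hypothetical right-skewed results (arbitrary units), for illustration only.
skewed = [4, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 10, 11, 13, 16, 20, 27, 38, 55, 90]

print("parametric:    ", parametric_interval(skewed))
print("non-parametric:", percentile_interval(skewed))
```

For skewed data such as these, the Gaussian lower limit falls below zero, which is impossible for a concentration, while the percentile limits stay within the observed range.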
306 LABMEDICINE 䊏 Volume 37 Number 5 䊏 May 2006 [Link]
The Gaussian Distribution
Probability distributions are powerful statistical tools that allow predictions to be made about population data. A statistical distribution is a mathematical probability function that describes the relationship between the value of a particular measurement and the probability that randomly selected data in the population will have that value. The familiar Gaussian distribution is simply a mathematical probability function that expresses the relationship between the mean (µ) and standard deviation (σ) of a set of data, and the probability that a randomly selected data point will have a particular value, x. Gaussian distributions are characteristic of data that are influenced by multiple, random, independent errors in measurement. The first property that is noticeable when P(x) (probability) is plotted against (x - µ)/σ (Figure 1) is the symmetry of the resulting bell-shaped curve; in a Gaussian distribution, P(x - µ) = P(-(x - µ)). This property follows directly from the way in which the Gaussian probability function is derived, requiring that factors influencing individual measurements are random and independent. In a Gaussian distribution, the central 95% of data are bounded by the approximate limits µ ± 2σ, where µ is the population mean and σ is the population standard deviation. Therefore, if a reference range study for plasma glucose concentrations in healthy, non-diabetic individuals generated a mean of 91 mg/dL and a standard deviation of 8 mg/dL, and a Gaussian distribution of the data was assumed, then the reference interval would be 91 ± 2(8), or 75 to 107 mg/dL.
Is it a valid assumption that plasma glucose in healthy individuals would be distributed in a Gaussian fashion? Probably not, because the factors that influence plasma glucose concentration are neither strictly random, nor entirely independent. Age and obesity, for example, are factors associated with impaired glucose tolerance even in non-diabetic individuals, and these factors do not have a random influence on plasma glucose (nor are they completely independent, since older patients are more likely to be overweight). Both of these factors increase glucose levels, so any distribution of plasma glucose concentrations in non-diabetic individuals would almost certainly be skewed toward higher values, rather than symmetrically distributed around the mean. Glucose is not an atypical example. The concentrations of most clinically relevant analytes in healthy individuals have distributions that are skewed toward higher or lower values, owing to physiological factors that have a strictly unidirectional influence on their concentration.
Figure 1. The Gaussian probability distribution curve (also called the “normal” or “bell-shaped” curve). The population mean (µ) is at the center of the symmetrical distribution, and approximately 68% of the area under the curve falls within one standard deviation (σ) on either side of the mean. Approximately 95% of the area under the curve is between µ - 2σ and µ + 2σ. Gaussian probability distributions predict the variability in measurements that are affected by multiple, independent, random errors. Laboratory quality control is a good example of a process to which a Gaussian distribution should apply. Inter-individual variations in clinical analytes do not ordinarily follow this distribution.

Log Transformation
In cases where the distribution of normal values is heavily skewed toward higher results, a plot of the log concentration vs. frequency may produce a curve that is more symmetrical and similar to a Gaussian distribution (Figure 2). If the resulting log-transformed distribution appears Gaussian, then some of the useful properties of Gaussian distributions, such as the ± 2σ = 95% rule, may be applied. In 1972, Harris and DeMets4 proposed log transformation as a means for generating a symmetrical distribution of reference values. It is important to remember, however, that whether or not data conform to a Gaussian distribution is determined by the randomness and independent nature of the influences that cause variation in the data points, and mathematical transformation of the data does not change those fundamental influences.
Figure 2. Log-transformation of data skewed toward higher values. In some cases, skewed data can be made more symmetrical by log transformation, usually for the purpose of applying Gaussian statistics to the data.

Non-Parametric Reference Ranges
If a distribution is not Gaussian, the central 95% of the data can be determined by ordering the array from the lowest to the highest values, and eliminating the highest 2.5% and lowest 2.5% of values; the remaining highest and lowest values delimit the reference interval. Non-parametric methods do not make any assumptions about the distribution of values in the data set, such as whether it is symmetric about the mean and whether the distribution is skewed toward higher or lower values. Although non-parametric determination of reference intervals is a simple and straightforward procedure that does not rely on any assumptions about the distribution, the method has some limitations.
One limitation is that non-parametric methods ignore any errors associated with individual measurements. Using the example of plasma glucose measured in 40 healthy volunteers, if the data are arranged from lowest to highest glucose concentrations, the non-parametric reference interval would be defined by the 2nd and 39th values in the ordered array, since 2.5% of 40 = 1.
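The ordering-and-trimming procedure just described can be sketched in a few lines of Python. This is a minimal illustration rather than a validated laboratory procedure: the function name `rank_based_interval`, the `tail_fraction` parameter, and the 40 glucose values are assumptions made for the example.

```python
def rank_based_interval(values, tail_fraction=0.025):
    """Return (lower, upper, k): the limits left after trimming k values per tail."""
    ordered = sorted(values)
    n = len(ordered)
    # Number of results to discard from each end (2.5% of n, rounded down).
    # The small epsilon guards against floating-point error in the product.
    k = int(tail_fraction * n + 1e-9)
    if n - 2 * k < 2:
        raise ValueError("too few values to define an interval")
    return ordered[k], ordered[n - 1 - k], k

# Hypothetical plasma glucose results (mg/dL) from 40 volunteers, unsorted.
glucose = list(range(109, 69, -1))  # 109, 108, ..., 70

lower, upper, trimmed = rank_based_interval(glucose)
print(lower, upper, trimmed)
```

With 40 results, one value is trimmed from each tail, so the limits are the 2nd and 39th ordered values, as in the glucose example.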
But there is some variance associated with each of those data points. The 2nd and 39th values in the ordered array will only be approximations, within the limits of the precision of the assay, of the 2.5th and 97.5th percentiles; the non-parametric statistical method does not account for those variations. In contrast, the Gaussian distribution takes into account all random influences in determining the upper and lower limits of the central 95% of values.
The variability associated with individual data points can be minimized, to a degree, if the dataset is very large. Therefore, in order to produce reliable limits for the 2.5th and 97.5th percentiles, non-parametric methods require fairly large reference populations. The Clinical and Laboratory Standards Institute (CLSI; formerly the National Committee on Clinical Laboratory Standards, NCCLS) recommends that reference intervals be determined by a non-parametric method, with data from at least 120 appropriately selected subjects. The 3 highest and 3 lowest values are eliminated, and the 4th and 117th numbers in the ordered array define the reference interval.

Validation of Reference Ranges
Part of the data supplied to the Food and Drug Administration (FDA) in a 510(k) application for approval of an in vitro diagnostic method is a reference interval determined with the proposed method. These reference interval studies may be conducted in the hospital laboratories where the reagents are evaluated, and may use patient specimens or healthy volunteers. Manufacturer-determined reference intervals are typically based on a large number of specimens (often a thousand or more), and the proposed normal range is included in the product literature. Current CLIA guidelines require that laboratories using a manufacturer’s reference interval (or, for that matter, any reference interval that is transferred from an external source) verify that it is appropriate for the population served by the laboratory. Laboratories must determine whether a reference interval based on data from the manufacturer’s “healthy” population is the same as the reference interval for the population that the laboratory serves. Although it is possible to meet this requirement without gathering reference data from a local population, validation of a reference interval ordinarily involves collection of specimens from healthy volunteers, and comparison of the results to the proposed reference interval. Alternatively, a laboratory may establish its own reference interval by collecting 120 specimens, as recommended by CLSI, but this may be an impractical alternative for many laboratories.
If the concentrations of various analytes in healthy individuals followed Gaussian probability distributions, then the reference intervals could be compared by several parametric statistical methods. The Student’s t test, for example, estimates the degree to which a small sample selected from a population predicts the properties of the entire population (specifically, the µ and σ). With regard to reference intervals, the question is whether the statistical characteristics (µ and σ) of a small sample of healthy individuals selected locally match the population statistics on which the manufacturer’s (or other laboratory’s) reference interval is based. Parametric methods provide ways to make those comparisons, based on the variability in the mean and standard deviation that is predicted when a subset is randomly selected from population data. But parametric methods assume that the data have a predictable distribution, and as mentioned before, this is not usually the case for laboratory tests.

Non-Parametric Methods for Comparing Data
Just as there are both parametric and non-parametric statistical methods for determining the reference interval, both approaches exist for comparing data sets, as well. Non-parametric methods can be applied to determine whether 2 data sets have essentially the same, or significantly different properties. In the case of validating a reference interval, the 2 data sets may be the manufacturer’s data, used to determine the suggested reference interval, and a sample of healthy individuals recruited locally by the laboratory.

The Mann-Whitney Test
An example of a non-parametric statistical method to compare data sets is the Mann-Whitney test. In this method, the 2 data sets to be compared (x1, x2 . . . xN and y1, y2 . . . yN) are ordered, together, from the lowest to highest values. The array might look something like:
x1, y1, x2, x3, y2, x4, y3, y4, y5, x5 . . . etc.
For the Mann-Whitney test, the total number of y values that follow each x value are summed, and likewise for the x values that follow each y. If these sums, Ux and Uy, are similar, then the 2 samples appear to be equivalent. Large differences between Ux and Uy indicate that the 2 data sets are not equivalent. The Mann-Whitney test is also called the U-test, Rank Sum test, or Wilcoxon’s test.

The Run Test
Another non-parametric approach to comparing data sets is the Run test. As with the Mann-Whitney test, data from both arrays are ordered from lowest to highest, and the numbers of “runs,” or sequential data elements from one or the other array, are counted. Two data sets selected randomly from a common population will be thoroughly intermixed, so their runs will be short and numerous, whereas a significant bias between the 2 data sets produces fewer, longer runs and is reflected in the magnitude and inequality when the 2 run sums are compared.
It may be helpful to think about the Mann-Whitney and Run tests as statistical methods not so much for determining whether 2 data sets have the same mean and standard deviation, but rather as a reflection of the degree to which the 2 data sets have the same distribution of values, which for non-parametric distributions is the more important question.

The Monte Carlo Method
Monte Carlo simulations make use of random selection to generate a representative statistical distribution that can be applied to solve a quantitative statistical problem. Although random sampling, as a method to generate statistical distributions, had been used by mathematicians since the 19th Century, credit for refining (and naming) this technique is usually given to Stanislaw Ulam, a Polish-born mathematician who worked for John von Neumann on the Manhattan Project during World War II, and his collaborator Nicholas Metropolis, who published their description of Monte Carlo simulations in 1949.5 The Monte Carlo method is an elegant approach to validating reference intervals, and an application to this problem was described by Holmes and colleagues in 1994.6
In the Monte Carlo approach, a limited normal range study is performed, perhaps involving 20 healthy volunteers selected from the local population served by the laboratory. The mean and standard deviation are calculated based on the
in-house study. Then, using the larger data set on which the manufacturer’s reference interval is based, 20 individual data points are randomly selected and the mean and standard deviation of this random sample are calculated. This procedure is repeated many times using computer algorithms for randomly selecting data points and calculating the mean and standard deviation based on those data subsets. When a sufficient number of samples have been selected from the parent (or population) data set, then the variance associated with the mean and standard deviation of a randomly selected 20 data point subset can be calculated. If the results for the local sample are truly representative of the population on which the manufacturer’s reference interval is based, then the mean and standard deviation of the in-house study sample will fall within limits predicted by the Monte Carlo simulation. In other words, the statistical properties (mean and standard deviation) of the local population will appear equivalent to a randomly selected subset of the larger population on which the manufacturer based its reference interval. The power of this method is that it is entirely non-parametric, but requires only a small set of in-house data.

CLSI-Recommended Methods for Validation of Reference Intervals
Guidelines are available for the validation of reference intervals from the CLSI document C28-A, which describes 3 methods for meeting the CLIA-specified requirement.7 Some of these recommendations have a basis in statistical theory, whereas others do not. Barry and Westgard reviewed extensively the CLSI recommendations for reference interval validation on the Westgard QC Web site.8
Inspection method. The demographic and geographic factors associated with the reference population are examined to determine whether they are consistent with the population served by the laboratory. If there are no credible reasons to suspect that the population served by the laboratory differs from the reference population in any manner that would affect the predicted results of a particular test, then use of the reference range may be justified. The CLIA guidelines allow the medical director of a laboratory to make that assessment. The inspection method is not a statistical approach, and transference of a reference interval from one laboratory to another should not be done without a firm basis on which to conclude that the reference populations are similar. This method should only be used when reference data from local volunteers are unavailable. This may be the case, for example, with age-specific reference ranges for pediatric populations.
Limited validation. In a limited validation study, approximately 20 reference samples are collected from healthy volunteers selected from the population served by the laboratory. If no more than 2 measurements fall outside the reference interval, the range is validated. If 3 or more reference specimens are outside of the reference range, 20 additional reference samples can be obtained, and if 3 or more of the second reference sample are out of the reference interval, the laboratory should consider establishing its own reference range.
The limited validation is based on the statistical prediction that 19 of 20 randomly-selected data points should fall within the central 95% of values in a population. This prediction holds regardless of whether the reference interval was obtained by parametric or non-parametric methods. The probability that fewer than 3 out of 20 randomly selected data points will fall outside of the central 95% limits is more than 90%. The probability of randomly selecting 3 or more values outside of the central 95% of the array on 2 consecutive trials of 20 is only about 1%, so failure of the second trial would lead one to conclude that the populations are sufficiently different to warrant a local reference interval.
Extended validation. Sixty reference specimens are obtained from healthy volunteers within the laboratory’s catchment area, and the reference interval for the local population is calculated. If the reference interval is calculated parametrically with the assumption that the population data have a Gaussian distribution (95% limits = µ ± 2σ), then a sample of 60 data points randomly selected from the population should produce essentially the same reference interval. This is because, as a general rule, samples of greater than 30 data points randomly selected from a Gaussian population will have statistical properties that are representative of the entire population (this is predicted by the Student’s t distribution). In other words, the mean and standard deviation of a subset of 30 or more data points are very close to the mean and standard deviation of the population. The Student’s t distribution takes into account deviations from Gaussian behavior when the number of sample data is fewer than 30.
Non-parametric statistical methods, such as those described above, also can be used to compare the locally generated reference interval with the manufacturer’s proposed interval. In either the limited or extended validation methods, outliers may be removed from the dataset by application of the “Reed rule”: If the difference between the extreme value and the next closest value in the array is D, and the range between the lowest and highest values in the entire array is R, Reed’s rule is violated when the ratio D/R exceeds one-third, and data points that violate this criterion can be eliminated.

Summary
Statistical analysis is helpful for characterizing, and in some instances predicting, the behavior of data sets. Establishing and validating reference intervals are tasks to which statistical analysis can be applied, since the fundamental purpose of a reference interval is to predict the results of laboratory tests in healthy patients. The “central 95% of healthy individuals” that customarily defines reference intervals is a compromise between the sensitivity (ability to detect disease) and specificity (ability to rule out disease) of a laboratory test. Adopting this definition of reference intervals means that approximately 5% of results from healthy individuals will be falsely positive, but allows for some overlap between the distributions of positive and negative (“normal”) results in order to improve the clinical sensitivity of the test. Because the limits of the reference interval ultimately define the sensitivity and specificity of a laboratory test, it is very important to apply the appropriate statistical method when determining these limits.
The distributions of most clinically relevant analytes in blood, urine, or other body fluids, do not have mathematically predictable properties. As a result, parametric statistical methods, which are based on mathematical probability functions that assume a predictable distribution of data, are not ordinarily applicable to the determination of reference intervals. Non-parametric statistical methods, which are applicable to any distribution of data, are preferable for determining reference intervals, but have limitations of their own, including the large number of data points necessary for generating a
valid range. CLSI recommends that non-parametric reference intervals be based on 120 specimens from healthy volunteers representing a broad demographic profile.
Many laboratories use manufacturer-specified reference intervals, since these are based on large data sets. However, CLIA requires clinical laboratories to verify that their reference ranges are appropriate for the patient population they serve. Non-parametric statistical methods exist for comparing data sets, and these can be applied to validation of reference intervals when reference data are obtained from the local healthy population. Monte Carlo simulation is another non-parametric method for validating reference intervals when a small sampling is obtained locally. The CLSI provides some guidance on validating reference intervals, including simple inspection, limited validation, and extended validation.

1. Fraser CG. Inherent biological variation and reference values. Clin Chem Lab Med. 2004;42:758-764.
2. FDA 510(k) requirements. Available at: [Link]. Accessed on March 10, 2006.
3. CLIA ‘88. Available at: [Link] Accessed on March 10, 2006.
4. Harris EK, DeMets DL. Estimation of normal ranges and cumulative proportions by transforming observed distributions to gaussian form. Clin Chem. 1972;18:605-612.
5. Metropolis N, Ulam S. The Monte Carlo method. J Am Stat Assoc. 1949;44:335-341.
6. Holmes EW, Kahn SE, Molnar PA, et al. Verification of reference ranges by using a Monte Carlo sampling technique. Clin Chem. 1994;40:2216-2222.
7. CLSI Document C28-A: How to define, determine, and utilize reference intervals in the clinical laboratory; Approved guideline. 1995.
8. Barry PL, Westgard JO. Method validation: Reference interval transference. Available at: [Link] Accessed on March 10, 2006.