Assessment of Bias with Emphasis on Method Comparison
Roger Johnson
Department of Chemical Pathology, LabPlus, Auckland City Hospital, Auckland, New Zealand.
For correspondence: Dr Roger Johnson e-mail: [email protected]
Summary
• Definition of bias - distinct from accuracy, bias is an average deviation from a true value.
• Method comparison - a set of specimens is assayed by both an existing method and the new candidate method, and the
results compared. The following list describes the testing procedures and data handling required in a method comparison
study for the assessment of bias:
− Test material
− Number and disposition of specimens
− Summary of findings
− The problem with correlation, and the difference plot
− Statistics of difference
− Log transformation of the difference plot
− Statistics of difference with logs
− Linear regression
− Deming and Passing-Bablok models
− The value of r in linear regression
− Choice of statistics
− Examples of suitable computer programs
• Acceptable bias criteria are discussed.
• Linearity and recovery - failing either of these criteria should serve as a warning that method comparison data may conceal
an unrecognised bias.
• Finally, all steps in the assessment of bias need to be considered in deciding whether or not the method comparison
is acceptable.
Definition of Bias
“Bias is used to express numerically the degree of trueness”, trueness being “the closeness of agreement between the average
value obtained from a large series of measurements and the true value”.1
“Bias” and “inaccuracy” are often used synonymously. However, contemporary usage by ISO and CLSI1 makes a distinction
between these terms: inaccuracy relates to how closely a single measurement agrees with the true value whereas bias
relates to how an average of a series of measurements agrees with the true value. In the first case, imprecision contributes
to the lack of agreement whereas in the second, imprecision is minimised (ideally removed entirely) by use of an average.
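The distinction can be illustrated with a small simulation. This is a minimal sketch in Python and is not part of the cited definitions; the true value, bias and SD are invented for illustration only.

```python
import random
import statistics

random.seed(1)        # fixed seed so the sketch is repeatable
TRUE_VALUE = 100.0    # assumed true (reference) value
METHOD_BIAS = 2.0     # assumed systematic offset of the method
METHOD_SD = 5.0       # assumed imprecision (SD) of the method

# Simulate a large series of measurements of the same material
results = [random.gauss(TRUE_VALUE + METHOD_BIAS, METHOD_SD) for _ in range(10_000)]

# Inaccuracy: deviation of one single result from the true value
single_error = results[0] - TRUE_VALUE

# Bias: deviation of the average of many results from the true value
estimated_bias = statistics.mean(results) - TRUE_VALUE

print(f"error of the first single result: {single_error:+.2f}")
print(f"bias estimated from the mean:     {estimated_bias:+.2f}")
```

Imprecision contributes to the error of any single result, whereas the mean of the series settles close to the assumed bias of 2.0.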
Introduction
AACB members may be familiar with the paper on method evaluation by Nick Balazs and Des Geary and published in 1981,2 fittingly as a technical report from the AACB “Scientific and Technical Committee”. Their recommendations, used as a basis for method evaluation in my own laboratory for more than 20 years, addressed “inaccuracy” (what we should now call bias), making global assessments by method comparison, and separately assessing interference, linearity and recovery.
Interference is considered elsewhere in this issue (see Interference Testing in this issue). Linearity and recovery will be covered briefly but first method comparison will be discussed in more detail.
Clin Biochem Rev Vol 29 Suppl (i) August 2008 I S37
Method Comparison
Test Material
The cornerstone of many method evaluations is a method comparison in which a set of specimens is assayed by both
an existing method and the new candidate method, and the
results compared. For reasons of suitability and convenience,
the specimens used are often excess patient specimens. In this
case, they will have no known value other than that found in
the existing assay which itself may have shortcomings. For this
reason, it is informative to include specimens of known value
which may be external quality assurance specimens, possibly
from an RCPA QAP scheme or from reference sources such
as the Centers for Disease Control and Prevention (CDC),
or National Institute of Standards and Technology (NIST).
Disadvantages of these sorts of specimen are that the matrix
may be inappropriate and that costs may be significant for
some materials.
Number and Disposition of Specimens
The number of specimens need not be large (for example,
CLSI suggest 20,3 although a more thorough investigation
requires 40 or more4). The more critical aspects are that they
span the range of interest and are determined with greater
certainty than might be done routinely, by using multiple
determinations, at least duplicates. Comparing multiple small
batches with the two procedures run at the same time over
several days is preferred to single larger runs,3,4 as between-
day variations can be accommodated.
Summary of Findings
The data should be displayed on an x-y plot, with the results from the existing method plotted on the x axis (conventionally the independent variable) and those from the candidate method plotted as y (Figure 1A). Inspection by eye may reveal aberrant points or nonlinear behaviour that may warrant further investigation. But beyond that, opinions differ as to how the data should be analysed.
Figure 1. Method comparison. A. Conventional x-y plot with perfect correlation (r = 1.0); the lack of agreement between the two sets of data may go unremarked unless the line of agreement is also shown. B. The difference plot highlights the lack of agreement immediately and encourages statistical analysis of the difference.
The Problem with Correlation, and the Difference Plot
Least squares linear regression analysis (for example in Microsoft® Excel) calculates among other things a correlation coefficient, r, which has everything to do with scatter about the line (commonly representing imprecision) and nothing to do with agreement. Undue focus on this statistic alone as a measure of a candidate assay’s worth has been rightly criticised, and caused Annals of Clinical Biochemistry to ban its presentation.5 Annals’ editors favoured the difference plot6 in which differences between the comparison estimates are plotted against the mean of their values. The distinction in approach is illustrated in Figure 1: the r value shows perfect correlation in the presence of perfect disagreement. The disagreement can be hard to see in x-y plots (e.g. Part A) but the difference plot (Part B) emphasises the lack of agreement as a consistent difference. The difference plot allows a more sensitive visual review of the data than is possible with an x-y plot.
Statistics of Difference
If the data as displayed in a difference plot shows even scatter at different concentrations, the difference plot is amenable to statistical analysis in which the bias between the two sets of results can be described by a mean and SEM.5 If the 95 % confidence interval for the mean difference (mean ±2 SEM) includes zero, a statistician would say that there is no evidence of bias.
Log Transformation of the Difference Plot
The example shown in Figure 1 is of constant or systematic bias and is unusual in clinical chemistry. More common is the situation in Figure 2A. Here there is a progressive deviation with concentration, a proportional bias, in which the difference plot (Figure 2B) is unhelpful until the data are transformed. One possible transformation is by plotting proportional rather than absolute differences.7 This plot has a familiar feel but the
derived data are inherently non-Gaussian. The transformation
preferred by statisticians is to convert the experimental data to
logarithms and then proceed exactly as before5 (Figure 2C).
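The contrast between r and the statistics of difference (mean, SEM, and the interval mean ±2 SEM) can be sketched numerically. The Python below is an illustration only: the paired results are invented (a candidate method reading about 1 unit higher than the existing method) and the function names are hypothetical, not taken from any of the software mentioned in this article.

```python
import math
import statistics

def pearson_r(x, y):
    """Correlation coefficient r: reflects scatter about a line, not agreement."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def difference_statistics(x, y):
    """Mean of (y - x), its SEM, and the interval mean ± 2 SEM."""
    diffs = [b - a for a, b in zip(x, y)]
    mean_diff = statistics.mean(diffs)
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, sem, (mean_diff - 2 * sem, mean_diff + 2 * sem)

# Hypothetical paired results: existing method (x) and candidate method (y)
existing = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
candidate = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8]

# Coordinates for a difference plot: mean of each pair against its difference
plot_points = [((a + b) / 2, b - a) for a, b in zip(existing, candidate)]

r = pearson_r(existing, candidate)
mean_diff, sem, (low, high) = difference_statistics(existing, candidate)
bias_evident = not (low <= 0.0 <= high)   # interval excludes zero

print(f"r = {r:.4f}")
print(f"mean difference = {mean_diff:.2f}, SEM = {sem:.3f}")
print(f"mean ± 2 SEM = {low:.2f} to {high:.2f}; bias evident: {bias_evident}")
```

With these invented data r is very close to 1 even though the interval for the mean difference excludes zero, which is the situation the difference plot is intended to expose.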
Statistics of Difference with Logs
The log data have to be transformed back to be intelligible.
The mean difference of about 0.08 in Figure 2C shows that
the slope of the line is $10^{0.08}$, i.e. 1.2. If the interval given by the mean
±2 SEM of the log differences includes zero, an argument is provided
for accepting the slope as $10^{0}$, i.e. 1.0, and therefore without
proportional bias. However, to be valid such calculations
require a Gaussian distribution of data.8
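The back-transformation can be sketched as follows. This is an illustrative Python example under stated assumptions: the paired results are invented to show a proportional bias of roughly 1.2-fold, logarithms are to base 10 as in Figure 2C, and the function name is hypothetical.

```python
import math
import statistics

def log_difference_statistics(x, y):
    """Statistics of difference on log10-transformed data, with back-transformation."""
    log_diffs = [math.log10(b) - math.log10(a) for a, b in zip(x, y)]
    mean_log = statistics.mean(log_diffs)
    sem = statistics.stdev(log_diffs) / math.sqrt(len(log_diffs))
    low, high = mean_log - 2 * sem, mean_log + 2 * sem
    return {
        "mean_log_diff": mean_log,
        "ratio": 10 ** mean_log,                  # y/x ratio implied by the mean log difference
        "ratio_limits": (10 ** low, 10 ** high),  # back-transformed mean ± 2 SEM
        "includes_zero": low <= 0.0 <= high,      # zero on the log scale means a ratio of 1.0
    }

# Hypothetical paired results spanning more than two decades
existing = [1.0, 2.0, 5.0, 10.0, 20.0, 50.0, 100.0, 200.0]
candidate = [1.25, 2.35, 6.1, 11.8, 24.5, 59.0, 122.0, 236.0]

stats = log_difference_statistics(existing, candidate)
ratio = stats["ratio"]
includes_zero = stats["includes_zero"]

print(f"mean log difference = {stats['mean_log_diff']:.3f}")
print(f"ratio y/x = {ratio:.2f}")
print("limits on ratio = {:.2f} to {:.2f}".format(*stats["ratio_limits"]))
```

A mean log difference near 0.08 back-transforms to a ratio near 1.2; if the limits on the log scale had included zero, the back-transformed limits would have included 1.0.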
Linear Regression
When the comparison data contain elements of both systematic
and proportional bias, the difference plot whether direct or
transformed can be difficult to interpret, and some form of
linear regression may give a clearer result. But which model
to use? Least squares linear regression (as done in Microsoft®
Excel) considers error only in the “y” direction and minimises
this component in constructing the line. Invariably “x” is
also subject to error, a fact that becomes clear if x and y are
reversed: the regression line in this case is not equivalent to
the first because the errors minimised are not the same.5
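The effect of reversing x and y can be checked directly. The following Python sketch uses invented data with scatter in both variables and a hypothetical helper function; it fits y on x, then x on y, and re-expresses the second line in the form y = a + bx so that the two slopes can be compared.

```python
import statistics

def least_squares(x, y):
    """Ordinary least squares of y on x: minimises vertical (y) distances only."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical comparison data with error in both methods
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.4, 1.7, 3.5, 3.6, 5.4, 5.7, 7.5, 7.6]

# Regression of y on x (errors minimised in the y direction)
slope_yx, intercept_yx = least_squares(x, y)

# Regression of x on y (errors minimised in the x direction) ...
slope_xy, intercept_xy = least_squares(y, x)
# ... re-expressed as y = a + b*x for comparison with the first line
slope_reversed = 1 / slope_xy
intercept_reversed = -intercept_xy / slope_xy

print(f"y on x: y = {intercept_yx:.3f} + {slope_yx:.3f} x")
print(f"x on y: y = {intercept_reversed:.3f} + {slope_reversed:.3f} x")
```

The two lines coincide only when all points fall exactly on a line; otherwise the y-on-x slope is the shallower of the two.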
Deming and Passing-Bablok Models
Two models often used to overcome this difficulty are those
of Deming and of Passing and Bablok. The first takes into
account variability in both x and y whereas the second is a
non-parametric approach in which the median slope of all
possible lines between individual data points is found. Both
approaches have their champions.
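Both ideas can be sketched in a few lines of Python. This is a simplified illustration and not the implementation used by any program named in this article. The Deming function assumes a known ratio of error variances (set to 1.0 here, an assumption). The second function takes the plain median of all pairwise slopes, as described above, and omits the further adjustments of the published Passing-Bablok procedure. The data and function names are hypothetical.

```python
import math
import statistics

def deming(x, y, error_variance_ratio=1.0):
    """Deming regression; error_variance_ratio = var(y errors) / var(x errors)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    d = error_variance_ratio
    slope = (syy - d * sxx + math.sqrt((syy - d * sxx) ** 2 + 4 * d * sxy ** 2)) / (2 * sxy)
    return slope, my - slope * mx

def median_slope_line(x, y):
    """Median of the slopes of all lines joining pairs of points (simplified)."""
    n = len(x)
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(n) for j in range(i + 1, n) if x[j] != x[i]]
    slope = statistics.median(slopes)
    intercept = statistics.median(b - slope * a for a, b in zip(x, y))
    return slope, intercept

# Hypothetical comparison data with error in both methods
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.4, 1.7, 3.5, 3.6, 5.4, 5.7, 7.5, 7.6]

deming_slope, deming_intercept = deming(x, y)
median_slope, median_intercept = median_slope_line(x, y)

print(f"Deming:       y = {deming_intercept:.3f} + {deming_slope:.3f} x")
print(f"Median slope: y = {median_intercept:.3f} + {median_slope:.3f} x")
```

For routine work a validated statistics package is the safer choice; the sketch is meant only to show what each model calculates.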
The Value of r in Linear Regression
Yet another argument proposes that the regular least squares
approach is valid provided that the line is sufficiently well-
defined.9 Definition in this case is judged by r being high
(>0.975 for values spanning one decade; >0.99 for values
spanning three decades because r is affected by the range of
values). In fact, the authors suggest that if r is not sufficiently
high (i.e. below the cut-off values mentioned), either more
data need to be collected or existing data need more careful
scrutiny.
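The dependence of r on the range of values can be shown with a short calculation. In this Python sketch (invented data, hypothetical function name) the same pattern of errors is applied to a narrow and to a wide range of concentrations.

```python
import math
import statistics

def pearson_r(x, y):
    """Correlation coefficient r for paired results."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# The same hypothetical errors are added to both data sets
errors = [0.4, -0.3, 0.5, -0.4, 0.4, -0.3, 0.5, -0.4]

narrow_x = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]   # narrow range
wide_x = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]     # wide range

narrow_y = [a + e for a, e in zip(narrow_x, errors)]
wide_y = [a + e for a, e in zip(wide_x, errors)]

r_narrow = pearson_r(narrow_x, narrow_y)
r_wide = pearson_r(wide_x, wide_y)

print(f"r over the narrow range = {r_narrow:.4f}")
print(f"r over the wide range   = {r_wide:.4f}")
```

Identical analytical error gives a higher r simply because the specimens span a wider range, which is why any cut-off for r has to be stated together with the range of the data.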
Choice of Statistics
Given this array of techniques and our generally amateurish knowledge of their validity, what is the safest approach? Westgard has recommended that considering the ease with which data can be manipulated by computer, many different techniques should be applied, and: “When in doubt about the validity of the statistical technique, see whether the choice of statistics changes the outcome or decision on acceptability”.10
Figure 2. Method comparison. A. Conventional x-y plot with perfect correlation (r = 1.0); as in Figure 1, the lack of agreement between the two sets of data may go unremarked unless the line of agreement is also shown. B. The difference plot shows a lack of agreement that changes as a proportion of the mean. C. The data in B have been replotted after first taking logs of the data sets and then recalculating the differences and means. The constant difference shown (about 0.08) means that y differs from x by $10^{0.08}$, i.e. 1.2-fold.
Examples of Suitable Computer Programs
In our laboratory, we use a version of method comparison software now marketed as MultiQC11 which allows easy transition between difference plot, linear regression, Deming and Passing-Bablok models, so that Westgard’s advice can easily be followed. I have seen similar results from Analyse-it.12 Both websites have instructive animations that offer advice in use of the respective programs, and both allow for a free trial of the software. A Google™ search reveals other companies that may offer similar programs.
Acceptable Bias
The quotation from Westgard10 (above) raises the question of what is acceptable bias. Clearly if no analytical goal is decided before a comparison is done, the exercise is purely descriptive. So what is an appropriate goal? Biological variation offers a realistic approach based on population data. The underlying consideration is that bias causes more than the expected 5% of a reference population’s results to fall outside a pre-determined (95%) reference interval. By limiting bias to no more than a quarter of the reference group’s biological variation, the proportion outside the reference interval is restricted to no more than 5.8% (a relative increase of 16% over the expected 5%), and is judged a “desirable” standard of performance.13
The limits on bias provided on Westgard’s website14 are for desirable performance; “optimum” and “minimum” performance standards are also recognised, respectively 50% and 150% of desirable.13 This means that for a desirable bias of 4%, optimally it should be 2% and at worst no more than 6%.
If a new method is being introduced and the bias compared to the old method exceeds an acceptable limit, then the reference interval should be reviewed and clinicians notified that the results may be different to those previously issued.
For particular cut-points (e.g. as with plasma glucose concentration in defining diabetes), deviation at these points is of more concern than an average deviation over the full range of the assay.
Linearity
Whatever the shape of the calibration line, the expectation is that a concentration (or activity) of analyte should be matched by its assay result. Any limitation on the linearity of this relationship can be assessed by selecting a specimen with a high concentration of analyte and mixing it in linearly related proportions with one containing a low concentration: suitable mixtures contain 0, 10, 20 … up to 100% of the high specimen, giving 11 specimens to test. The high concentration should exceed the expected or desired upper limit of the assay to test whether, as an ideal, linearity extends beyond that point; the low concentration does not have to be zero although a clearer result can be expected the closer it is to zero. Analysis of these specimens should be at least in duplicate to lessen variation.
An x-y plot of concentration against proportion of high specimen is then drawn and inspected by eye when non-linearity may be evident (Figure 3A). An upper limit should be no higher than the highest concentration (or activity) that falls on the apparently linear segment. In fact it may be prudent to select an even lower concentration to allow for sub-optimal performance in routine use. Choosing a lower limit follows similar reasoning. (See article by Armbruster and Pry in this issue.)
Figure 3. Assessment of linearity. A. Conventional x-y plot for assessing linearity by eye. The curved response might be made “linear” by restricting the measuring range to no more than 150 units. B. Residual plot, residuals being the individual differences between the experimental results (y values) and the results predicted from linear regression, reveals the non-linear behaviour as a continuum. Note that residuals are negative at low proportions of “HI” because the regression line has a positive intercept on the y axis.
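The mixing experiment lends itself to a short calculation. The Python below is a sketch under stated assumptions: the 11 measured values are invented (each taken to be the mean of duplicates) and flatten slightly at high concentration, and the function names are hypothetical. It fits a least squares line to result against proportion of the high specimen and lists the residuals (measured minus fitted) with their signs.

```python
import statistics

def fit_line(x, y):
    """Least squares line of y on x; returns (slope, intercept)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def residual_signs(x, y):
    """Residuals (measured minus fitted) and their sign sequence."""
    slope, intercept = fit_line(x, y)
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    signs = "".join("+" if r > 0 else "-" for r in residuals)
    return residuals, signs

# Mixtures containing 0, 10, 20 ... 100% of the high specimen
proportion_high = [i / 10 for i in range(11)]
# Hypothetical mean results for each mixture (curved response)
measured = [5.0, 25.0, 44.0, 62.0, 79.0, 95.0, 110.0, 124.0, 137.0, 149.0, 160.0]

slope, intercept = fit_line(proportion_high, measured)
residuals, signs = residual_signs(proportion_high, measured)

print(f"fitted line: result = {intercept:.1f} + {slope:.1f} x proportion")
for p, r in zip(proportion_high, residuals):
    print(f"{p:4.0%}  residual {r:+.1f}")
print("sign sequence:", signs)
```

A run of residuals of one sign at each end and the opposite sign in the middle points to curvature; randomly scattered signs would be expected from a linear response.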
An alternative means of data analysis is by residual plot (in effect a difference plot): residuals are differences between the actual values found and those predicted for them from a least squares regression line. Curvature is suggested by the shape of this plot, or in less clear cases the sign sequence of the residuals,9 a greater sensitivity compared with the direct plot being explained by the finer scale on the ordinate axis (Figure 3B).
Whether this performance is useful or can be made more satisfactory by using a restricted range should be considered in relation to acceptable bias (above).
Recovery
Measurement of recovery involves the assay of exogenous analyte in the specimen matrix. In essence, a base material (e.g. serum) is assayed before and after addition of a known amount of analyte (often called spiking). The difference in concentration between these measurements should equate to the known amount added.
This test is useful in deciding whether calibrators need to be made in a matrix more closely resembling the specimens to be analysed, or whether some interference that needs further investigation is present. It is not straightforward to do, however. It requires: analyte in a suitably concentrated form so that addition causes minimal disruption of the base material (≤10% by volume and with consideration of the solvent used); measurements with low imprecision (duplicate determinations at least); and addition sufficient to make a measurable difference without exceeding the assay range. The last of these requirements may be difficult to achieve when using random specimens, possibly with unknown content of analyte.
Experience suggests that the arithmetic associated with this sort of experiment can be demanding unless the base material is subject to a blank (solvent) addition to account for the dilution that occurs with spiking. Subtraction of the value in the adjusted base from that in the spiked material gives the added concentration directly and hence the amount of the spike recovered. Recovery is then [Final (Spike) Concentration - Initial (Base) Concentration]/Added Concentration.
Acceptability of recovery can be judged against Logan’s criteria,15 although nowadays we should take note of acceptable bias too. Exceeding either of these criteria should serve as a warning that method comparison data may conceal an unrecognised bias.
Summary of steps in assessment of bias
1. Criteria of acceptable performance established
2. Comparison of test method with reference method using patient material ± reference material
3. x-y Plot of data with examination by eye
4. Consideration of difference plot and statistics of difference
5. Consideration of regression analysis and statistics of regression
6. Test of interference
7. Test of linearity
8. Test of recovery
9. Judgement of acceptability
Competing Interests: None declared.
References
1. Tate J, Panteghini M. Standardisation – The theory and the practice. Clin Biochem Rev 2007;28:127-30.
2. Balazs ND, Geary TD. Guidelines for the selection and evaluation of analytical methods. Clin Biochem Rev 1980;1:51-7.
3. Clinical and Laboratory Standards Institute. User verification of performance for precision and trueness; approved guideline - second edition. CLSI document EP15-A2. Wayne, PA, USA: CLSI; 2005.
4. Clinical and Laboratory Standards Institute. Method comparison and bias estimation using patient samples; approved guideline - second edition. CLSI document EP9-A2. Wayne, PA, USA: CLSI; 2002.
5. Hollis S. Analysis of method comparison studies. Ann Clin Biochem 1996;33:1-4.
6. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;1:307-10.
7. Pollock MA, Jefferson SG, Kane JW, Lomax K, MacKinnon G, Winnard CB. Method comparison - a different approach. Ann Clin Biochem 1992;29:556-60.
8. Twomey PJ. How to use difference plots in quantitative method comparison studies. Ann Clin Biochem 2006;43:124-9.
9. Stockl D, Dewitte K, Thienpont LM. Validity of linear regression in method comparison studies: is it limited by the statistical model or the quality of the analytical input data? Clin Chem 1998;44:2340-6.
10. Westgard JO. Points of care in using statistics in method comparison studies. Clin Chem 1998;44:2240-2.
11. MultiQC, Medical Laboratory Quality Control Software. www.multiqc.com (Accessed 27 December 2007).
12. Analyse-it® Statistical analysis add-in software for
Microsoft Excel. www.analyse-it.com (Accessed 27 December 2007).
13. Fraser CG. Biological Variation: From Principles to Practice. Washington DC, USA: AACC Press; 2001. p. 52-5.
14. Ricós C, García-Lario J-V, Alvarez V, Cava F, Domenech M, Hernández A, et al. Biological variation database, and quality specifications for imprecision, bias and total error. The 2006 update. http://www.westgard.com/guest32.htm (Accessed 27 December 2007).
15. Logan JE. Evaluation of commercial kits. CRC Crit Rev Clin Lab Sci 1972;3:271-89.
Appendix: An example of data handling and interpretation in method comparison studies.
Please see (http://www.aacb.asn.au/web/Resources/Tools/).
The data are presented in a Microsoft Excel spreadsheet format under several tabs best viewed in the order given:
Data – representative comparison data and scatter plot
Difference – calculation and presentation of the difference plot
log Difference – log transformation of the data and calculation of a log difference plot
Statistics of log Difference – calculation of error and limits of error
Regression – least squares presentation with limits on slope and intercept
Summary – comparison with acceptable bias of statistical data calculated here, and for Deming and Passing-Bablok models
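For readers without the spreadsheet to hand, the kind of calculation summarised under the Regression tab can be sketched in Python. How the spreadsheet itself derives its limits is not stated here, so this sketch makes its own assumptions: limits are taken as the estimate ±2 standard errors, the data are invented, and the function name is hypothetical.

```python
import math
import statistics

def regression_with_limits(x, y):
    """Least squares slope and intercept with approximate limits (± 2 SE)."""
    n = len(x)
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual standard deviation about the fitted line (n - 2 degrees of freedom)
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    s_resid = math.sqrt(sse / (n - 2))
    se_slope = s_resid / math.sqrt(sxx)
    se_intercept = s_resid * math.sqrt(1 / n + mx ** 2 / sxx)
    return {
        "slope": slope,
        "slope_limits": (slope - 2 * se_slope, slope + 2 * se_slope),
        "intercept": intercept,
        "intercept_limits": (intercept - 2 * se_intercept, intercept + 2 * se_intercept),
    }

# Hypothetical comparison data
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.4, 1.7, 3.5, 3.6, 5.4, 5.7, 7.5, 7.6]

fit = regression_with_limits(x, y)
slope_low, slope_high = fit["slope_limits"]
int_low, int_high = fit["intercept_limits"]

print(f"slope = {fit['slope']:.3f} ({slope_low:.3f} to {slope_high:.3f})")
print(f"intercept = {fit['intercept']:.3f} ({int_low:.3f} to {int_high:.3f})")
```

Limits on the slope that include 1.0 and limits on the intercept that include zero give no evidence of proportional or constant bias respectively, although with few data points the limits are wide.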