Three Measures of Precision in Floating Point Arithmetic
Nick Higham April 13, 1991
This note is about three quantities that relate to the precision of floating point arithmetic. For t-digit, rounded base b arithmetic the quantities are

1. machine epsilon ε_M, defined as the distance from 1.0 to the smallest floating point number bigger than 1.0 (and given by ε_M = b^(1-t), which is the spacing of the floating point numbers between 1.0 and b),

2. ε = the smallest floating point number x such that fl(1 + x) > 1, and

3. unit roundoff u = (1/2) b^(1-t) (which is a bound for the relative error in rounding a real number to floating point form).
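For IEEE double precision (b = 2, t = 53) the three quantities can be examined directly in Python, whose floats are IEEE 754 doubles. This is an illustrative sketch; `math.nextafter` (Python 3.9+) gives the neighbouring floating point number:

```python
import math
import sys

b, t = 2, 53                       # IEEE 754 double precision
eps_M = b ** (1 - t)               # machine epsilon: gap from 1.0 to the next float
u = 0.5 * b ** (1 - t)             # unit roundoff

# The geometric definition agrees with the language's built-in constant:
assert eps_M == sys.float_info.epsilon
assert math.nextafter(1.0, 2.0) - 1.0 == eps_M

# With round to even, 1 + u is a tie and rounds back down to 1.0,
# so u itself does not satisfy fl(1 + x) > 1 ...
print(1.0 + u > 1.0)               # False
# ... but the next float above u does:
eps = math.nextafter(u, 1.0)       # u * (1 + eps_M)
print(1.0 + eps > 1.0)             # True
```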
The terminology I have used is not an accepted standard; for example, the name machine epsilon is sometimes given to the quantity in (2). My definition of unit roundoff is as in Golub and Van Loan's book Matrix Computations [1] and is widely used. I chose the notation ε_M in (1) because it conforms with MATLAB, in which the permanent variable eps is the machine epsilon. [Ed. note: Well, not quite. See my comments below. Cleve]

The purpose of this note is to point out that it is not necessarily the case that ε = ε_M, or that ε = u, as is sometimes claimed in the literature, and that, moreover, the precise value of ε is difficult to predict.

It is helpful to consider binary arithmetic with t = 3. Using binary notation we have 1 + u = 1.00 + .001 = 1.001, which is exactly half way between the adjacent floating point numbers 1.00 and 1.01. Thus fl(1 + u) = 1.01 if we round away from zero when there is a tie, while fl(1 + u) = 1.00 if we round to an even last digit on a tie. It follows that ε <= u with round away from zero (and it is easy to see that ε = u), whereas ε > u for round to even. I believe that round away from zero used to be the more common choice in computer arithmetic, and this may explain why some authors define or characterize u as in (2). However, the widely used IEEE standard 754 binary arithmetic uses round to even.
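The t = 3 tie behaviour can be simulated with exact rationals. The `fl` below is a hypothetical toy rounder written for this note, not MATLAB's arithmetic; its `tie` parameter selects between the two tie-breaking rules discussed above:

```python
from fractions import Fraction

def fl(x, t=3, tie='even'):
    """Round positive x to t significant base-2 digits (a toy model of fl)."""
    x = Fraction(x)
    e = 0
    while x >= 2:                  # normalize x into [1, 2)
        x /= 2; e += 1
    while x < 1:
        x *= 2; e -= 1
    scaled = x * 2 ** (t - 1)      # integer part now holds the t digits
    n, frac = divmod(scaled, 1)
    half = Fraction(1, 2)
    if frac > half or (frac == half and (n % 2 == 1 if tie == 'even' else True)):
        n += 1
    return Fraction(n, 2 ** (t - 1)) * 2 ** e

one_plus_u = Fraction(9, 8)                  # 1.001 in binary, the halfway case
print(fl(one_plus_u, tie='even'))            # 1    (rounds down to 1.00)
print(fl(one_plus_u, tie='away'))            # 5/4  (rounds up to 1.01)
```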
So far, then, it is clear that the way in which ties are resolved in rounding affects the value of ε. Let us now try to determine the value of ε with round to even. A little thought may lead one to suspect that ε = u(1 + ε_M). For in the b = 2, t = 3 case we have

x = u(1 + ε_M) = .001 x (1 + .01) = .00101

and

fl(1 + x) = fl(1.00101) = 1.01,
assuming perfect rounding. I reasoned this way, and decided to check this putative value of ε in 386-MATLAB on my PC. MATLAB uses IEEE standard 754 binary arithmetic, which has t = 53 (taking into account the implicit leading bit of 1). Here is what I found:
>> format compact; format hex
>> x = 2^(-53)*(1+2^(-52)); y = [1+x 1 x]
y =
   3ff0000000000000   3ff0000000000000   3ca0000000000001
>> x = 2^(-53)*(1+2^(-11)); y = [1+x 1 x]
y =
   3ff0000000000000   3ff0000000000000   3ca0020000000000
>> x = 2^(-53)*(1+2^(-10)); y = [1+x 1 x]
y =
   3ff0000000000001   3ff0000000000000   3ca0040000000000
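The same experiment can be reproduced in Python, whose floats are IEEE 754 doubles; `hexbits` is a small helper, written for this sketch, that plays the role of MATLAB's `format hex`. Note that a modern machine that rounds once directly to double precision may give different results for 1 + x than the 386 transcript above, which is precisely the point of this note:

```python
import struct

def hexbits(x):
    """Raw IEEE 754 double bits in hex, like MATLAB's format hex."""
    return struct.pack('>d', x).hex()

for i in (52, 11, 10):
    x = 2.0**-53 * (1 + 2.0**-i)
    # On SSE2 hardware all three comparisons print True;
    # under x87 double rounding only i = 10 does.
    print(i, hexbits(1 + x), hexbits(x), 1 + x > 1)
```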
Thus the guess is wrong, and it appears that ε = u(1 + 2^42 ε_M) in this environment! What is the explanation?
The answer is that we are seeing the effect of double-rounding, a phenomenon that I learned about from an article by Cleve Moler [2]. The Intel floating-point chips used on PCs implement internally the optional extended precision arithmetic described in the IEEE standard, with 64 bits in the mantissa [3]. What appears to be happening in the example above is that 1 + x is first rounded to 64 bits; if x = u(1 + 2^-i) and i > 10 then the least significant bit is lost in this rounding. The extended precision number is now rounded to 53 bit precision; but when i > 10 there is a rounding tie (since we have lost the original least significant bit), which is resolved to 1.0, which has an even last bit.

The interesting fact, then, is that the value of ε can vary even between machines that implement IEEE standard arithmetic.

Finally, I'd like to stress an important point that I learned from the work of Vel Kahan: the relative error in addition and subtraction is not necessarily bounded by u. Indeed on machines such as Crays that lack a guard digit this relative error can be as large as 1. For example, if b = 2 and t = 3, then subtracting from 1.0 the next smaller floating point number we have
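The double-rounding mechanism just described can be simulated exactly with rationals. The `round_bits` helper below is illustrative (ties to even, arguments assumed in [1, 2)); the two-stage call mimics rounding to 64 bits of mantissa and then to 53:

```python
from fractions import Fraction

def round_bits(x, t):
    """Round positive x in [1, 2) to t significant bits, ties to even."""
    scaled = x * 2 ** (t - 1)
    n, frac = divmod(scaled, 1)
    half = Fraction(1, 2)
    if frac > half or (frac == half and n % 2 == 1):
        n += 1
    return Fraction(n, 2 ** (t - 1))

u = Fraction(1, 2**53)
for i in (52, 11, 10):
    x = u * (1 + Fraction(1, 2**i))
    once  = round_bits(1 + x, 53)                  # single rounding (Sparc, SSE2)
    twice = round_bits(round_bits(1 + x, 64), 53)  # extended then double (x87 PC)
    print(i, once > 1, twice > 1)
# prints: 52 True False / 11 True False / 10 True True,
# reproducing the 386-MATLAB results above.
```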
Exactly:

      1.00
    - 0.111
    -------
      0.001

Computed, without a guard digit (the least significant bit of 0.111 is dropped):

      1.00
    - 0.11
    -------
      0.01
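The arithmetic above can be checked with exact rationals; this sketch models the missing guard digit by truncating the subtrahend to the three digit positions of 1.00 before subtracting:

```python
from fractions import Fraction

a = Fraction(1)           # 1.00 in binary, t = 3
b = Fraction(7, 8)        # 0.111 in binary, the next smaller float

exact = a - b             # 0.001 = 1/8

# Without a guard digit, b is truncated to a's digit positions
# (2^0, 2^-1, 2^-2), losing its last bit:
b_trunc = Fraction(int(b * 4), 4)   # 0.11 = 3/4
computed = a - b_trunc              # 0.01 = 1/4

rel_err = abs(computed - exact) / exact
print(computed, exact, rel_err)     # 1/4 1/8 1
```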
The computed answer is too big by a factor 2 and so has relative error 1! According to Vel Kahan, the example I have given mimics what happens on a Cray X-MP or Y-MP, but the Cray 2 behaves differently and produces the answer zero. Although the relative error in addition/subtraction is not bounded by the unit roundoff u for machines without a guard digit, it is nevertheless true that

fl(a + b) = a(1 + e) + b(1 + f),

where e and f are bounded in magnitude by u.

[1] G. H. Golub and C. F. Van Loan, Matrix Computations, Second Edition, Johns Hopkins University Press, Baltimore, 1989.
[2] C. B. Moler, Technical note: Double-rounding and implications for numeric computations, The MathWorks Newsletter, Vol. 4, No. 1 (1990), p. 6.
[3] R. Startz, 8087/80287/80387 for the IBM PC & Compatibles, Third Edition, Brady, New York, 1988.
Editor's addendum: [Cleve Moler]
I agree with everything Nick has to say, and have a few more comments. MATLAB on a PC has IEEE floating point with extended precision implemented in an Intel chip. The C compiler generates code with double rounding. MATLAB on a Sun Sparc also has IEEE floating point with extended precision, but it is implemented in a Sparc chip. The C compiler generates code which avoids double rounding. On both the PC and the Sparc

ε_M = 2^-52 = 3cb0000000000000 = 2.220446049250313e-16
However, on the PC

ε = 2^-53 (1 + 2^-10) = 3ca0040000000000 = 1.111307226797642e-16

while on the Sparc

ε = 2^-53 (1 + 2^-52) = 3ca0000000000001 = 1.110223024625157e-16

Note that ε is not 2 raised to a negative integer power.

MATLAB on a VAX usually uses D floating point (there is also a G version under VMS). Compared to IEEE floating point, the D format has 3 more bits in the fraction and 3 fewer bits in the exponent. So ε_M should be 2^-55, but MATLAB says ε_M is 2^-56. It is actually using the 1 + x > 1 trick to compute what we're now calling ε. There is no extended precision or double rounding, and ties between two floating point values are chopped, so we can find ε by just trying powers of 2. On the VAX with D float

ε_M = 2^-55 = 2.775557561562891e-17
ε   = 2^-56 = 1.387778780781446e-17

The definition of ε_M as the distance from 1.0 to the next floating point number is a purely geometric quantity depending only on the structure of the floating point numbers. The point Nick is making is that the more common definition, of what we here call ε, involves a comparison between 1.0 + x and 1.0 and subtle rounding properties of floating point addition. I now much prefer the simple geometric definition, even though I've been as responsible as anybody for the popularity of the definition involving addition.

Cleve
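The IEEE hex constants quoted above can be checked with a short Python sketch (Python floats are IEEE 754 doubles; `struct` exposes the raw bits, and `hexbits` is a helper written for this note):

```python
import struct

def hexbits(x):
    """Raw IEEE 754 double bits in hex, like MATLAB's format hex."""
    return struct.pack('>d', x).hex()

eps_pc    = 2.0**-53 * (1 + 2.0**-10)   # PC (double-rounded) value of epsilon
eps_sparc = 2.0**-53 * (1 + 2.0**-52)   # Sparc (single-rounded) value

print(hexbits(eps_pc))       # 3ca0040000000000
print(hexbits(eps_sparc))    # 3ca0000000000001
print(hexbits(2.0**-52))     # 3cb0000000000000  (eps_M)
```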
This is a LaTeXed version of the original post to NA Digest, 13 April 1991. Derek O'Connor, www.derekroconnor.net
D EREK OC ONNOR , J ULY 29, 2011