
Floating Points

IEEE 754 is the standard that defines floating point number representations. It specifies single and double precision formats. Single precision uses 32 bits: 1 sign bit, 8 exponent bits, and 23 mantissa bits. Double precision uses 64 bits: 1 sign bit, 11 exponent bits, and 52 mantissa bits. A normalized floating point number represents the value (-1)^s * m * 2^e, where s is the sign, m is the normalized mantissa, and e is the actual exponent (the stored, biased exponent minus the bias). Special values such as infinities and NaNs are also supported. Floating point arithmetic is approximate, but it is widely used in applications such as scientific computing.

Uploaded by

Abdalrhman juber
Copyright © All Rights Reserved

Floating Point

• An IEEE floating point representation consists of


– A Sign Bit (no surprise)
– An Exponent (“times 2 to the what?”)
– A Mantissa (“Significand”), which is assumed to be 1.xxxxx (thus, one bit of the mantissa is implied as 1)
– This is called a normalized representation
• So a mantissa field of all 0s is interpreted as 1.0, and a mantissa field of all 1s (1111) is interpreted as 1.1111
• Special cases are used to represent denormalized mantissas (true mantissa = 0), NaN, etc., as will be discussed.
Floating Point Standard
• Defined by IEEE Std 754-1985
• Developed in response to divergence of
representations
– Portability issues for scientific code
• Now almost universally adopted
• Two representations
– Single precision (32-bit)
– Double precision (64-bit)
IEEE Floating-Point Format
S      Exponent            Fraction
       single: 8 bits      single: 23 bits
       double: 11 bits     double: 52 bits

x = (-1)^S × (1 + Fraction) × 2^(Exponent - Bias)

• S: sign bit (0 ⇒ non-negative, 1 ⇒ negative)
• Normalize significand: 1.0 ≤ |significand| < 2.0
– Always has a leading pre-binary-point 1 bit, so no need to represent it explicitly (hidden bit)
– Significand is Fraction with the “1.” restored
• Exponent: excess representation: stored exponent = actual exponent + Bias
– Ensures exponent is unsigned
– Single: Bias = 127; Double: Bias = 1023
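The value formula for a normalized single-precision number can be sketched in a few lines of Python. This is only an illustration: the function name `single_value` and its three integer arguments (the raw S, Exponent, and Fraction fields) are assumptions made for the example, not part of the standard.

```python
BIAS = 127  # single-precision bias

def single_value(sign, exponent, fraction):
    """Evaluate (-1)^S x (1 + Fraction) x 2^(Exponent - Bias) for a normalized single.

    sign is 0 or 1, exponent is the stored 8-bit field (1..254),
    fraction is the stored 23-bit field as an integer.
    """
    significand = 1 + fraction / 2**23      # restore the hidden "1."
    return (-1) ** sign * significand * 2.0 ** (exponent - BIAS)

print(single_value(0, 127, 0))         # 1.0
print(single_value(1, 126, 1 << 22))   # -0.75
```

The second call uses a fraction field whose top bit is set, i.e. significand 1.1 in binary, with actual exponent 126 - 127 = -1.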
Single-Precision Range
• Exponents 00000000 and 11111111 reserved
• Smallest (normalized) value
– Exponent: 00000001 ⇒ actual exponent = 1 – 127 = –126
– Fraction: 000…00 ⇒ significand = 1.0
– ±1.0 × 2^-126 ≈ ±1.2 × 10^-38
• Largest value
– Exponent: 11111110 ⇒ actual exponent = 254 – 127 = +127
– Fraction: 111…11 ⇒ significand ≈ 2.0
– ±2.0 × 2^+127 ≈ ±3.4 × 10^+38
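These two limits can be checked by building the bit patterns directly. The sketch below assumes that Python's `struct` format `"f"` is an IEEE 754 single (true on common platforms); `bits_to_single` is a helper name chosen for this example.

```python
import struct

def bits_to_single(bits):
    """Reinterpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

smallest = bits_to_single(0x00800000)  # exponent 00000001, fraction 000...0
largest = bits_to_single(0x7F7FFFFF)   # exponent 11111110, fraction 111...1

print(smallest == 2.0 ** -126)                       # True
print(largest == (2 - 2.0 ** -23) * 2.0 ** 127)      # True
```

The largest significand is 2 - 2^-23 rather than exactly 2.0, which is why the slide writes "≈ 2.0".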
Double-Precision Range
• Exponents 0000…00 and 1111…11 reserved
• Smallest (normalized) value
– Exponent: 00000000001 ⇒ actual exponent = 1 – 1023 = –1022
– Fraction: 000…00 ⇒ significand = 1.0
– ±1.0 × 2^-1022 ≈ ±2.2 × 10^-308
• Largest value
– Exponent: 11111111110 ⇒ actual exponent = 2046 – 1023 = +1023
– Fraction: 111…11 ⇒ significand ≈ 2.0
– ±2.0 × 2^+1023 ≈ ±1.8 × 10^+308
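For double precision, Python exposes the same limits through `sys.float_info`. The check below is a sketch that assumes the interpreter's `float` is an IEEE 754 double, which is the case on mainstream platforms.

```python
import sys

smallest_normal = sys.float_info.min   # smallest positive normalized double
largest = sys.float_info.max           # largest finite double

print(smallest_normal == 2.0 ** -1022)                 # True
print(largest == (2 - 2.0 ** -52) * 2.0 ** 1023)       # True
```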
Representation of Floating Point Numbers
• IEEE 754 single precision

Bit 31: Sign | Bits 30-23: Biased exponent | Bits 22-0: Normalized Mantissa (implicit 24th bit = 1)

Exponent   Mantissa    Object Represented
0          0           0
0          non-zero    denormalized
1-254      anything    FP number: (-1)^s × F × 2^(E-127)
255        0           ± infinity
255        non-zero    NaN
Why biased exponent?
• For faster comparisons (for sorting, etc.), allow integer comparisons of floating point numbers:
• Unbiased exponent (stored in two's complement): compared as integers, 1/2 would look larger than 2
  1/2   0 1111 1111 000 0000 0000 0000 0000 0000
  2     0 0000 0001 000 0000 0000 0000 0000 0000
• Biased exponent: integer order matches numeric order
  1/2   0 0111 1110 000 0000 0000 0000 0000 0000
  2     0 1000 0000 000 0000 0000 0000 0000 0000
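The ordering property can be observed by reinterpreting floats as unsigned integers. This is a small sketch for non-negative values only (negative numbers use sign-magnitude, so they need extra handling); `single_bits` is a helper name invented for the example.

```python
import struct

def single_bits(x):
    """Return the 32-bit single-precision pattern of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

values = [2.0, 0.5, 3.0, 0.75, 1.0]

# Sorting by the raw bit pattern gives the same order as sorting numerically.
by_value = sorted(values)
by_bits = sorted(values, key=single_bits)
print(by_value == by_bits)                # True

print(f"{single_bits(0.5):032b}")         # exponent field 01111110
print(f"{single_bits(2.0):032b}")         # exponent field 10000000
```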
Basic Technique

• Represent the decimal number in the form ± 1.xxx₂ × 2^y
• And “fill in the fields”
– Remember biased exponent and implicit “1.” mantissa!
• Examples:
– 0.0: 0 00000000 00000000000000000000000
– 1.0 (1.0 x 2^0): 0 01111111 00000000000000000000000
– 0.5 (0.1 binary = 1.0 x 2^-1): 0 01111110 00000000000000000000000
– 0.75 (0.11 binary = 1.1 x 2^-1): 0 01111110 10000000000000000000000
– 3.0 (11 binary = 1.1*2^1): 0 10000000 10000000000000000000000
– -0.375 (-0.011 binary = -1.1*2^-2): 1 01111101 10000000000000000000000
– 1 10000011 01000000000000000000000 = - 1.01 * 2^4 = -20.0
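The examples above can be reproduced by asking Python for the single-precision bit pattern and splitting it into the three fields. A sketch, assuming `struct`'s `"f"` format is IEEE 754 single; `single_fields` is a name chosen for illustration.

```python
import struct

def single_fields(x):
    """Return 'S EEEEEEEE MMM...M' (1, 8 and 23 bits) for the single-precision form of x."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = f"{bits:032b}"
    return f"{s[0]} {s[1:9]} {s[9:]}"

for v in (0.0, 1.0, 0.5, 0.75, 3.0, -0.375, -20.0):
    print(f"{v:>7}: {single_fields(v)}")
```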
Basic Technique

• The mantissa can be computed in much the same way as one converts decimal whole numbers to binary.
• Take the fractional part of the decimal number and repeatedly multiply it by 2. The whole-number portion of each result (0 or 1) is the next binary bit; drop it and continue with the remaining fraction.
• For the whole-number part, place its binary form in front of the fraction bits and shift the binary point, adjusting the exponent, until the mantissa is in normalized form.
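The repeated-doubling procedure can be sketched as follows. The function name `fraction_to_binary` and the `max_bits` cut-off are assumptions for this example; the cut-off is needed because many decimal fractions (such as 0.1) never terminate in binary.

```python
def fraction_to_binary(frac, max_bits=23):
    """Convert a fraction in [0, 1) to a string of binary digits by repeated doubling."""
    bits = []
    while frac and len(bits) < max_bits:
        frac *= 2
        bit = int(frac)        # whole-number portion is the next binary bit
        bits.append(str(bit))
        frac -= bit            # keep only the fractional component
    return "".join(bits)

print(fraction_to_binary(0.5625))                   # 1001
print(fraction_to_binary(0.1, 8))                   # 00011001
print(f"{36:b}." + fraction_to_binary(0.5625))      # 100100.1001
```

The last line combines the whole-number part (36) with the fraction bits; normalizing then means moving the binary point five places left and using exponent 5.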
Floating-Point Example
• Represent –0.75
– –0.75 = (–1)^1 × 1.1₂ × 2^-1
– S = 1
– Fraction = 1000…00₂
– Exponent = –1 + Bias
• Single: –1 + 127 = 126 = 01111110₂
• Double: –1 + 1023 = 1022 = 01111111110₂
• Single: 1 01111110 1000…00
• Double: 1 01111111110 1000…00
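Both encodings of –0.75 can be checked side by side. A sketch assuming `struct`'s `"f"` and `"d"` formats are IEEE 754 single and double:

```python
import struct

single = struct.unpack(">I", struct.pack(">f", -0.75))[0]   # 32-bit pattern
double = struct.unpack(">Q", struct.pack(">d", -0.75))[0]   # 64-bit pattern

print(f"{single:032b}")   # 1 | 01111110 | 1000...0
print(f"{double:064b}")   # 1 | 01111111110 | 1000...0
print(f"{single:08X} {double:016X}")
```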
Floating-Point Example
• What number is represented by the single-precision float 1 10000001 01000…00?
– S = 1
– Fraction = 01000…00₂
– Exponent = 10000001₂ = 129
• x = (–1)^1 × (1 + 0.01₂) × 2^(129 – 127)
= (–1) × 1.25 × 2^2
= –5.0
Converting to Floating Point
• E.g., Express 36.5625₁₀ as a 32-bit floating point number (in hexadecimal)
• Step 1
– Express original value in binary
36.5625₁₀ = 100100.1001₂
• Step 2
– Normalize
100100.1001₂ = 1.001001001₂ × 2^5
• Step 3
– Determine S, E, and M
+1.001001001₂ × 2^5
S = 0 (because the value is positive)
M = 001001001 (the bits after the “1.”)
n = 5, so E = n + 127 = 5 + 127 = 132 = 10000100₂
• Step 4
– Put S, E, and M together to form 32-bit binary result
0 10000100 00100100100000000000000₂
S E M
• Step 5
– Express in hexadecimal
0 10000100 00100100100000000000000₂ =
0100 0010 0001 0010 0100 0000 0000 0000₂ =
4 2 1 2 4 0 0 0₁₆

Answer: 42124000₁₆
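The five steps can be followed in code. This sketch uses `math.frexp` to do the normalization and only handles values that are exactly representable as normalized singles (as 36.5625 is); the variable names mirror the S, E, M and n of the slides.

```python
import math
import struct

value = 36.5625

# Steps 1-2: normalize. frexp gives value = m * 2**e with 0.5 <= m < 1,
# so the 1.xxx form is (2*m) * 2**(e - 1).
m, e = math.frexp(abs(value))
n = e - 1                              # actual exponent (5)

# Step 3: determine S, E and M
S = 0 if value >= 0 else 1
E = n + 127                            # biased exponent (132)
M = int((2 * m - 1) * 2**23)           # 23 fraction bits after the "1."

# Steps 4-5: assemble the 32-bit word and show it in hexadecimal
bits = (S << 31) | (E << 23) | M
print(f"{bits:032b}")
print(f"{bits:08X}")                   # 42124000

# Cross-check against the platform's own single-precision encoding
print(bits == struct.unpack(">I", struct.pack(">f", value))[0])   # True
```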
Converting from Floating Point
• E.g., What decimal value is represented by the following 32-bit floating point number?
C17B0000₁₆
• Step 1
– Express in binary and find S, E, and M
C17B0000₁₆ = 1 10000010 11110110000000000000000₂
S E M
S = 1 (1 = negative, 0 = positive)
• Step 2
– Find “real” exponent, n
– n = E – 127
= 10000010₂ – 127
= 130 – 127
= 3
• Step 3
– Put S, M, and n together to form binary result
– (Don’t forget the implied “1.” on the left of the mantissa.)
-1.1111011₂ × 2^n =
-1.1111011₂ × 2^3 =
-1111.1011₂
• Step 4
– Express result in decimal
-1111.1011₂: 1111₂ = 15, and .1011₂ = 2^-1 + 2^-3 + 2^-4 = 0.5 + 0.125 + 0.0625 = 0.6875

Answer: -15.6875
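The decoding steps translate directly into shifts and masks. A minimal sketch for normalized single-precision values (it does not handle the reserved exponents 0 and 255); the variable names follow the S, E, M and n used above.

```python
import struct

bits = 0xC17B0000

# Step 1: split the word into S, E and M
S = bits >> 31              # 1 sign bit
E = (bits >> 23) & 0xFF     # 8 exponent bits
M = bits & 0x7FFFFF         # 23 mantissa bits

# Step 2: remove the bias to find the "real" exponent
n = E - 127

# Steps 3-4: restore the implied "1." and evaluate
value = (-1) ** S * (1 + M / 2**23) * 2.0 ** n
print(S, E, n, value)       # 1 130 3 -15.6875

# Cross-check by letting struct reinterpret the same 32 bits
print(value == struct.unpack(">f", struct.pack(">I", bits))[0])   # True
```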
Denormal Numbers
• Exponent = 000...0 ⇒ hidden bit is 0

x = (-1)^S × (0 + Fraction) × 2^(1 - Bias)

(the exponent is fixed at 1 – Bias, the smallest normal exponent: –126 for single precision)
• Smaller than normal numbers
– allow for gradual underflow, with diminishing precision
• Denormal with fraction = 000...0

x = (-1)^S × (0 + 0) × 2^(1 - Bias) = ±0.0

Two representations of 0.0!
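Denormals and the two zeros can be inspected from their bit patterns. A sketch assuming `struct`'s `"f"` format is IEEE 754 single; `bits_to_single` is a helper name chosen for the example. For single precision the denormal exponent is 1 - 127 = -126.

```python
import math
import struct

def bits_to_single(bits):
    """Reinterpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

tiny = bits_to_single(0x00000001)          # exponent 0, fraction 000...01
biggest_denormal = bits_to_single(0x007FFFFF)

print(tiny == 2.0 ** -23 * 2.0 ** -126)    # True: (0 + 2^-23) x 2^-126 = 2^-149
print(biggest_denormal < 2.0 ** -126)      # True: below the smallest normal value

pos_zero = bits_to_single(0x00000000)
neg_zero = bits_to_single(0x80000000)
print(pos_zero == neg_zero)                # True: the two zeros compare equal
print(math.copysign(1.0, neg_zero))        # -1.0: but the sign bit is preserved
```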
Infinities and NaNs
• Exponent = 111...1, Fraction = 000...0
– ±Infinity
– Can be used in subsequent calculations, avoiding
need for overflow check
• Exponent = 111...1, Fraction ≠ 000...0
– Not-a-Number (NaN)
– Indicates illegal or undefined result
• e.g., 0.0 / 0.0
– Can be used in subsequent calculations
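A short sketch of how these special values behave. Note one Python-specific detail: evaluating `0.0 / 0.0` raises `ZeroDivisionError` in Python instead of returning NaN, so the example produces NaN as infinity minus infinity. `bits_to_single` is a helper name chosen for illustration.

```python
import math
import struct

def bits_to_single(bits):
    """Reinterpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

inf = bits_to_single(0x7F800000)   # exponent 111...1, fraction 000...0
nan = bits_to_single(0x7FC00000)   # exponent 111...1, fraction non-zero

print(math.isinf(inf), math.isnan(nan))   # True True
print(inf + 1.0 == inf)                   # True: infinity propagates
print(math.isnan(inf - inf))              # True: undefined result gives NaN
print(nan == nan)                         # False: NaN never compares equal
```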
Representation of Floating Point Numbers
• IEEE 754 double precision

Upper 32-bit word: Bit 31: Sign | Bits 30-20: Biased exponent | Bits 19-0: start of the Normalized Mantissa (implicit 53rd bit)

Exponent   Mantissa    Object Represented
0          0           0
0          non-zero    denormalized
1-2046     anything    FP number: (-1)^s × F × 2^(E-1023)
2047       0           ± infinity
2047       non-zero    NaN
Is FP addition associative?
• Associativity law for addition: a + (b + c) = (a + b) + c

• Let a = –2.7 × 10^23, b = 2.7 × 10^23, and c = 1.0
• a + (b + c) = –2.7 × 10^23 + (2.7 × 10^23 + 1.0) = –2.7 × 10^23 + 2.7 × 10^23 = 0.0
(1.0 is too small to change 2.7 × 10^23 at this precision, so b + c rounds back to b)
• (a + b) + c = (–2.7 × 10^23 + 2.7 × 10^23) + 1.0 = 0.0 + 1.0 = 1.0
• Beware – Floating Point addition not associative!
• The result is approximate…
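The same experiment can be run directly. This sketch uses Python floats, which are assumed to be IEEE 754 doubles; the effect is identical because 1.0 is far below the spacing between adjacent doubles near 2.7 × 10^23.

```python
a = -2.7e23
b = 2.7e23
c = 1.0

left = a + (b + c)    # c is absorbed when added to b
right = (a + b) + c   # a and b cancel exactly first

print(left)            # 0.0
print(right)           # 1.0
print(left == right)   # False
```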


Floating point addition
Start
1. Compare the exponents of the two numbers. Shift the smaller number to the right until its exponent would match the larger exponent.
2. Add the significands.
3. Normalize the sum, either shifting right and incrementing the exponent or shifting left and decrementing the exponent.
Overflow or underflow? Yes: Exception. No: continue with step 4.
4. Round the significand to the appropriate number of bits.
Still normalized? No: return to step 3. Yes: Done.
Floating-Point Addition
• Now consider a 4-digit binary example
– 1.000₂ × 2^-1 + –1.110₂ × 2^-2 (0.5 + –0.4375)
• 1. Align binary points
– Shift number with smaller exponent
– 1.000₂ × 2^-1 + –0.111₂ × 2^-1
• 2. Add significands
– 1.000₂ × 2^-1 + –0.111₂ × 2^-1 = 0.001₂ × 2^-1
• 3. Normalize result & check for over/underflow
– 1.000₂ × 2^-4, with no over/underflow
• 4. Round and renormalize if necessary
– 1.000₂ × 2^-4 (no change) = 0.0625
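The four steps can be sketched on a toy format with 4-digit significands. This is an illustration under simplifying assumptions, not how real hardware is built: numbers are hypothetical `(sign, significand, exponent)` tuples, there are no guard bits, rounding is plain truncation, and overflow/underflow checks are omitted.

```python
FRAC_BITS = 3  # significand 1.xxx is held as an integer scaled by 2**3

def fp_add(a, b):
    """Add two toy floats given as (sign, significand, exponent) tuples.

    sign is +1 or -1; significand is an integer in 8..15 (1.000 to 1.111 binary).
    """
    (sa, ma, ea), (sb, mb, eb) = a, b
    # Step 1: align binary points by shifting the number with the smaller exponent
    if ea < eb:
        (sa, ma, ea), (sb, mb, eb) = (sb, mb, eb), (sa, ma, ea)
    mb >>= ea - eb                    # shifted-out bits are dropped (no guard bits)
    # Step 2: add the signed significands
    total = sa * ma + sb * mb
    if total == 0:
        return (1, 0, 0)              # toy encoding of zero
    sign, mag, exp = (1 if total > 0 else -1), abs(total), ea
    # Step 3: normalize back to 1.xxx form
    while mag >= 2 << FRAC_BITS:      # 10.xxx or more: shift right, increment exponent
        mag >>= 1
        exp += 1
    while mag < 1 << FRAC_BITS:       # 0.xxx: shift left, decrement exponent
        mag <<= 1
        exp -= 1
    # Step 4: rounding is truncation here, so the result is still normalized
    return (sign, mag, exp)

def to_float(x):
    """Convert a toy (sign, significand, exponent) tuple to a Python float."""
    sign, mag, exp = x
    return sign * mag / 2**FRAC_BITS * 2.0**exp

result = fp_add((1, 0b1000, -1), (-1, 0b1110, -2))   # 0.5 + -0.4375
print(result, to_float(result))                       # (1, 8, -4) 0.0625
```

Real adders keep extra guard, round and sticky bits during alignment so that the final rounding is correct; the truncating shift here is only adequate for examples like this one where no bits are lost.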
FP Adder Hardware
• Much more complex than integer adder
• Doing it in one clock cycle would take too long
– Much longer than integer operations
• FP adder usually takes several cycles
– Can be pipelined
Floating Point Multiplication Algorithm
Floating-Point Multiplication
• Now consider a 4-digit binary example
– 1.000₂ × 2^-1 × –1.110₂ × 2^-2 (0.5 × –0.4375)
• 1. Add exponents
– Unbiased: –1 + –2 = –3
– Biased: (–1 + 127) + (–2 + 127) = –3 + 254; subtract one Bias so it is not counted twice: –3 + 254 – 127 = –3 + 127
• 2. Multiply significands
– 1.000₂ × 1.110₂ = 1.110₂ ⇒ 1.110₂ × 2^-3
• 3. Normalize result & check for over/underflow
– 1.110₂ × 2^-3 (no change) with no over/underflow
• 4. Round and renormalize if necessary
– 1.110₂ × 2^-3 (no change)
• 5. Determine sign: +ve × –ve ⇒ –ve
– –1.110₂ × 2^-3 = –0.21875
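The multiplication steps can be sketched on the same kind of toy 4-digit format. As before, this is a simplified illustration: the `(sign, significand, exponent)` tuples are hypothetical, exponents are kept unbiased, rounding is truncation, and overflow/underflow checks are left out.

```python
FRAC_BITS = 3  # significand 1.xxx is held as an integer scaled by 2**3

def fp_mul(a, b):
    """Multiply two toy floats given as (sign, significand, exponent) tuples.

    sign is +1 or -1; significand is an integer in 8..15 (1.000 to 1.111 binary).
    """
    (sa, ma, ea), (sb, mb, eb) = a, b
    # Step 1: add the (unbiased) exponents
    exp = ea + eb
    # Step 2: multiply significands; the product carries 2 * FRAC_BITS fraction bits
    prod = ma * mb
    # Step 3: normalize; a product of two values in [1, 2) lies in [1, 4)
    if prod >= 2 << (2 * FRAC_BITS):
        prod >>= 1
        exp += 1
    # Step 4: round back to FRAC_BITS fraction bits (truncation in this sketch)
    mag = prod >> FRAC_BITS
    # Step 5: the sign is the product of the signs
    return (sa * sb, mag, exp)

def to_float(x):
    """Convert a toy (sign, significand, exponent) tuple to a Python float."""
    sign, mag, exp = x
    return sign * mag / 2**FRAC_BITS * 2.0**exp

result = fp_mul((1, 0b1000, -1), (-1, 0b1110, -2))   # 0.5 x -0.4375
print(result, to_float(result))                       # (-1, 14, -3) -0.21875
```

With biased exponents, step 1 would add the two stored fields and subtract one bias, as in the worked example above.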
