Simulating Subtraction in Adders

The document discusses the design and functionality of a carry look-ahead adder, which reduces propagation delay in binary addition by using carry generate and carry propagate logic. It also explains the representation of unsigned and signed binary numbers, including sign-magnitude, 1's complement, and 2's complement forms. Additionally, it covers binary multiplication, floating point arithmetic, and methods for binary subtraction, including using 1's and 2's complement.

Uploaded by

nihalthakur1715
Copyright
© © All Rights Reserved

Look-ahead Adder:

A carry look-ahead adder reduces the propagation delay of a ripple carry adder at the cost of more
complex hardware. The ripple carry design is transformed so that the carry logic over fixed
groups of bits of the adder is reduced to two-level logic. Let us discuss the design in detail.

Consider the full adder circuit shown above with its corresponding truth table. We define two
variables, ‘carry generate’ Gi and ‘carry propagate’ Pi, as follows (here . denotes AND and +
denotes OR):

Pi = Ai XOR Bi

Gi = Ai.Bi

The sum output and carry output can be expressed in terms of carry generate Gi and carry
propagate Pi as

Si = Pi XOR Ci

Ci+1 = Gi + Pi.Ci

where Gi produces a carry when both Ai and Bi are 1, regardless of the input carry, and Pi is
associated with the propagation of the carry from Ci to Ci+1.

The carry output Boolean function of each stage in a 4-stage carry look-ahead adder can be
expressed as

C1 = G0 + P0Cin

C2 = G1 + P1C1 = G1 + P1G0 + P1P0Cin

C3 = G2 + P2C2 = G2 + P2G1 + P2P1G0 + P2P1P0Cin

C4 = G3 + P3C3 = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0Cin

From the above Boolean equations we can observe that C4 does not have to wait for C3
and C2 to propagate; C4 is produced at the same time as C3 and C2, because every carry depends
only on the Gi, the Pi and Cin. Since the Boolean expression for each carry output is a sum of
products, each one can be implemented with one level of AND gates followed by an OR gate.
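The carry equations above can be checked with a short Python sketch. This is only an illustration: the function name cla_4bit and the list-based bit handling are assumptions made here, not part of the original notes.

```python
def cla_4bit(a, b, cin=0):
    """Add two 4-bit numbers with the carry look-ahead equations (illustrative sketch)."""
    A = [(a >> i) & 1 for i in range(4)]
    B = [(b >> i) & 1 for i in range(4)]
    P = [A[i] ^ B[i] for i in range(4)]  # carry propagate: Pi = Ai XOR Bi
    G = [A[i] & B[i] for i in range(4)]  # carry generate:  Gi = Ai.Bi

    # Every carry is a two-level sum of products of G, P and Cin only,
    # so no carry has to wait for the previous stage.
    c1 = G[0] | (P[0] & cin)
    c2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & cin)
    c3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & cin)
    c4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0])
          | (P[3] & P[2] & P[1] & P[0] & cin))

    carries = [cin, c1, c2, c3]
    total = sum((P[i] ^ carries[i]) << i for i in range(4))  # Si = Pi XOR Ci
    return total, c4
```

Calling cla_4bit(a, b, cin) returns the 4-bit sum and the carry out C4; comparing it with ordinary integer addition for all 4-bit inputs is a quick way to confirm the equations.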

Binary numbers fall into two groups: unsigned numbers and signed numbers.

Unsigned Numbers
Unsigned numbers contain only the magnitude of the number; they have no sign. That means
all unsigned binary numbers are non-negative. As in the decimal number system, writing a positive
sign in front of a number is optional. Therefore, all positive numbers, including zero, can be
treated as unsigned numbers when no sign is written in front of them.

Signed Numbers
Signed numbers contain both the sign and the magnitude of the number. Generally, the sign is placed
in front of the number: a positive sign for positive numbers and a negative sign for negative
numbers. Therefore, any number can be treated as a signed number once the corresponding sign is
placed in front of it.

In binary, the sign is carried by a sign bit. If the sign bit is zero, the binary number is
positive. If the sign bit is one, the binary number is negative.
Representation of Un-Signed Binary Numbers
The bits of an un-signed binary number hold the magnitude of the number. That means,
if the un-signed binary number contains ‘N’ bits, then all N bits represent the magnitude of the
number, since it doesn’t have a sign bit.

Example
Consider the decimal number 108. The binary equivalent of this number is 1101100. This is the
unsigned binary representation.
(108)10 = (1101100)2
It has 7 bits. These 7 bits represent the magnitude of the number 108.

Representation of Signed Binary Numbers


The Most Significant Bit (MSB) of a signed binary number is used to indicate the sign of the
number. Hence, it is also called the sign bit. The positive sign is represented by placing ‘0’ in the
sign bit. Similarly, the negative sign is represented by placing ‘1’ in the sign bit.
If the signed binary number contains ‘N’ bits, then only N−1 bits represent the magnitude of the
number, since one bit (the MSB) is reserved for the sign.

There are three types of representations for signed binary numbers:

• Sign-Magnitude form
• 1’s complement form
• 2’s complement form

The representation of a positive number is the same in all three forms. Only the representation of
a negative number differs from form to form.

Example
Consider the positive decimal number +108. The binary equivalent of the magnitude of this
number is 1101100. These 7 bits represent the magnitude of the number 108. Since it is a positive
number, the sign bit is zero, placed on the leftmost side of the magnitude.
(+108)10 = (01101100)2

Therefore, the signed binary representation of the positive decimal number +108 is 01101100.
The same representation is valid in sign-magnitude form, 1’s complement form and 2’s
complement form for +108.

Sign-Magnitude form
In sign-magnitude form, the MSB represents the sign of the number and the remaining
bits represent the magnitude. So, just place the sign bit on the leftmost side of the
unsigned binary number. This representation is similar to the representation of signed decimal
numbers.

Example
Consider the negative decimal number -108. The magnitude of this number is 108. We know
the unsigned binary representation of 108 is 1101100. It has 7 bits, and all of them represent
the magnitude.
Since the given number is negative, the sign bit is one, placed on the leftmost side
of the magnitude.

(−108)10 = (11101100)2
Therefore, the sign-magnitude representation of -108 is 11101100.

1’s complement form


The 1’s complement of a number is obtained by complementing all the bits of the signed binary
number. So, the 1’s complement of a positive number gives a negative number. Similarly, the 1’s
complement of a negative number gives a positive number.
That means, if you take the 1’s complement of a binary number (including the sign bit) twice,
you get the original signed binary number back.

Example
Consider the negative decimal number -108. The magnitude of this number is 108. We know
the signed binary representation of +108 is 01101100.
It has 8 bits. The MSB of this number is zero, which indicates a positive number.
The complement of zero is one and vice versa. So, replace zeros by ones and ones by zeros in order
to get the negative number.
(−108)10 = (10010011)2
Therefore, the 1’s complement of (108)10 is (10010011)2.

2’s complement form


The 2’s complement of a binary number is obtained by adding one to the 1’s complement of the
signed binary number. So, the 2’s complement of a positive number gives a negative number.
Similarly, the 2’s complement of a negative number gives a positive number.

That means, if you take the 2’s complement of a binary number (including the sign bit) twice,
you get the original signed binary number back.

Example
Consider the negative decimal number -108.
We know the 1’s complement of (108)10 is (10010011)2.
2’s complement of (108)10 = 1’s complement of (108)10 + 1
= 10010011 + 1
= 10010100
Therefore, the 2’s complement of (108)10 is (10010100)2.
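The three encodings of -108 worked out above can be reproduced with a small Python sketch. The function names and the default width of 8 bits are assumptions chosen to match the examples, not something the notes prescribe.

```python
def sign_magnitude(value, bits=8):
    """Sign bit followed by the magnitude (illustrative helper)."""
    sign = 1 if value < 0 else 0
    return format((sign << (bits - 1)) | abs(value), f"0{bits}b")


def ones_complement(value, bits=8):
    """Negative values are the bitwise complement of the positive pattern."""
    mask = (1 << bits) - 1
    pattern = abs(value)
    if value < 0:
        pattern = ~pattern & mask
    return format(pattern, f"0{bits}b")


def twos_complement(value, bits=8):
    """Negative values are the 1's complement plus one, i.e. value mod 2^bits."""
    mask = (1 << bits) - 1
    return format(value & mask, f"0{bits}b")
```

For +108 all three helpers give 01101100; for -108 they give 11101100, 10010011 and 10010100 respectively, matching the examples above. The sketch does not check that the magnitude actually fits in bits − 1 bits.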

BINARY MULTIPLICATION

Binary multiplication is one of the four basic binary arithmetic operations. The other three
are addition, subtraction and division. A binary operation deals with only two digits, 0 and 1.
Finding the binary product works like the conventional decimal multiplication method. The four
rules of binary digit multiplication are:

• 0 × 0 = 0
• 0 × 1 = 0
• 1 × 0 = 0
• 1 × 1 = 1

Note: The binary product of the two binary digits 1 and 1 is just 1. Nothing is borrowed or
carried forward in this operation.

Examples of Binary Multiplication

Some binary multiplication examples are given below for a better understanding of this concept.

Example 1:

110 × 100

     110
   × 100
   -----
     000
    000x
   110xx
   -----
   11000

Example 2:
1010 × 101 (10 × 5)

Solution:

      1010
    ×  101
    ------
      1010
     0000x
    ------
     01010   ... first intermediate sum
    1010xx
    ------
    110010

Comparison with decimal values:

(1010)2 = (10)10

(101)2 = (5)10

10 × 5 = (50)10

(110010)2 = (50)10

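Example 3

1011.01 × 110.1

Solution:
Multiply the operands as if they were integers, then place the binary point. The operands have
2 and 1 fractional bits, so the product has 3 fractional bits.

       101101
    ×    1101
    ---------
       101101
      000000x
     101101xx
    101101xxx
    ---------
   1001001001

Placing the binary point three places from the right gives 1001001.001, i.e.
(11.25)10 × (6.5)10 = (73.125)10.

For signed-magnitude operands, the sign of the product is determined from the signs of the
multiplicand and the multiplier. If they are alike, the sign of the product is positive, else negative.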

Hardware Implementation:
The following components are required for the hardware implementation of the multiplication
algorithm:

Registers:
Two registers, B and Q, store the multiplicand and the multiplier, respectively. Register A
stores the partial product during multiplication. The sequence counter register (SC) stores the
number of bits in the multiplier.

Flip Flop:
Three flip-flops (A sign, B sign and Q sign) store the sign bits of the registers. Flip-flop E
stores the carry bit generated during partial product addition.

Complement and Parallel adder:
This hardware unit is used in calculating the partial product, i.e. it performs the required addition.

1. Initially, the multiplicand is stored in register B and the multiplier in register Q.

2. The signs of registers B (Bs) and Q (Qs) are compared using XOR (if both signs are alike,
the output of the XOR is 0, otherwise 1) and the output is stored in As (the sign of register A).
Note: Initially 0 is assigned to register A and to flip-flop E. The sequence counter is initialized
with the value n, where n is the number of bits in the multiplier.

3. The least significant bit of the multiplier (Qn) is checked. If it is 1, the content of register A
is added to the multiplicand (register B) and the result is placed in register A, with the carry bit
in flip-flop E. The content of E A Q is then shifted right by one position, i.e. the content of E
moves into the most significant bit (MSB) of A and the least significant bit of A moves into the
most significant bit of Q.

4. If Qn = 0, only the shift-right operation on the content of E A Q is performed, in the same fashion.

5. The content of the sequence counter is decremented by 1.

6. The content of the sequence counter (SC) is checked. If it is 0, the process ends and the final
product is in registers A and Q; otherwise the process repeats from step 3.

Example: Multiplicand = 10111, Multiplier = 10011
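As a rough software model of the register procedure above, the following Python sketch runs the add-and-shift loop on the magnitudes only (the sign would be obtained separately by XOR-ing the sign bits, as in step 2). The function name and the use of plain integers for registers A, B, Q and E are assumptions made for illustration.

```python
def shift_add_multiply(multiplicand, multiplier, n):
    """Multiply two n-bit magnitudes using registers A, B, Q and flip-flop E (sketch)."""
    mask = (1 << n) - 1
    B, Q = multiplicand & mask, multiplier & mask
    A, E = 0, 0
    for _ in range(n):                 # the sequence counter SC runs from n down to 0
        if Q & 1:                      # multiplier LSB is 1: A <- A + B, carry into E
            total = A + B
            A, E = total & mask, total >> n
        # Shift E A Q right by one position.
        Q = (Q >> 1) | ((A & 1) << (n - 1))
        A = (A >> 1) | (E << (n - 1))
        E = 0
    return (A << n) | Q                # the product sits in A (high half) and Q (low half)
```

For the operands above, shift_add_multiply(0b10111, 0b10011, 5) models 23 × 19 and leaves the 10-bit product 437, i.e. 0110110101, in A and Q.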

BOOTH ALGORITHM

Booth algorithm gives a procedure for multiplying binary integers in signed 2’s complement
representation in an efficient way, i.e. with fewer additions/subtractions. It relies
on the fact that strings of 0’s in the multiplier require no addition, only shifting, and that a
string of 1’s in the multiplier from bit weight 2^k down to weight 2^m can be treated as
2^(k+1) − 2^m. For example, the multiplier 001110 (+14) has a string of 1’s from 2^3 to 2^1, so it
can be treated as 2^4 − 2^1 = 16 − 2 = 14.

As in all multiplication schemes, the Booth algorithm requires examination of the multiplier
bits and shifting of the partial product. Prior to the shifting, the multiplicand may be added to the
partial product, subtracted from the partial product, or left unchanged according to the following
rules:

1. The multiplicand is subtracted from the partial product upon encountering the first least
significant 1 in a string of 1’s in the multiplier.

2. The multiplicand is added to the partial product upon encountering the first 0 (provided that
there was a previous ‘1’) in a string of 0’s in the multiplier.

3. The partial product does not change when the multiplier bit is identical to the previous
multiplier bit.
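The three rules can be sketched in Python as follows. This is a minimal model, not a hardware description: the register names A and Q, the extra bit q_prev and the function name are assumptions. With an n-bit register A the sketch is only reliable when the multiplicand is not the most negative n-bit value, since negating that value overflows n bits.

```python
def booth_multiply(multiplicand, multiplier, n):
    """Booth multiplication of two n-bit 2's complement integers (illustrative sketch).

    Assumes -(2**(n-1)) < multiplicand < 2**(n-1).
    """
    mask = (1 << n) - 1
    M = multiplicand & mask
    A, Q, q_prev = 0, multiplier & mask, 0
    for _ in range(n):
        pair = (Q & 1, q_prev)
        if pair == (1, 0):             # first 1 of a string of 1's: subtract
            A = (A - M) & mask
        elif pair == (0, 1):           # first 0 after a string of 1's: add
            A = (A + M) & mask
        # Identical bits: leave the partial product unchanged.
        # Arithmetic shift right of A, Q and q_prev as one unit.
        q_prev = Q & 1
        Q = (Q >> 1) | ((A & 1) << (n - 1))
        A = (A >> 1) | (A & (1 << (n - 1)))   # keep the sign bit of A
    product = (A << n) | Q
    if product & (1 << (2 * n - 1)):   # interpret the 2n-bit result as signed
        product -= 1 << (2 * n)
    return product
```

For instance, booth_multiply(3, -3, 4) works through one subtraction, one addition and another subtraction and returns -9.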
FLOATING POINT ARITHMETIC

ADDITION
To understand floating point addition, we first look at the addition of real numbers in decimal,
since the same logic applies in both cases.
For example, suppose we have to add 1.1 × 10^3 and 50.
We cannot add these numbers directly. First we need to align the exponents, and then we can add
the significands.
After aligning the exponents, we get 50 = 0.05 × 10^3
Now adding the significands, 0.05 + 1.1 = 1.15
So, finally we get (1.1 × 10^3 + 50) = 1.15 × 10^3
Notice that we shifted 50 and made it 0.05 in order to add these numbers.
We follow these steps to add two numbers:

1. Align the significands
2. Add the significands
3. Normalize the result
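The three steps can be tried out with a small Python sketch. It uses exact fractions rather than a fixed number of significand bits, so it shows only the align/add/normalize logic and not rounding. The function name fp_add and the (significand, exponent) tuple layout are assumptions made for this illustration.

```python
from fractions import Fraction


def fp_add(x, y, base=10):
    """Add two (significand, exponent) pairs: align, add, normalize (sketch)."""
    (sx, ex), (sy, ey) = x, y
    # 1. Align: rewrite both operands with the larger exponent.
    e = max(ex, ey)
    sx = Fraction(sx) / base ** (e - ex)
    sy = Fraction(sy) / base ** (e - ey)
    # 2. Add the significands.
    s = sx + sy
    # 3. Normalize so that 1 <= |s| < base (zero is left as it is).
    while s != 0 and abs(s) >= base:
        s /= base
        e += 1
    while s != 0 and abs(s) < 1:
        s *= base
        e -= 1
    return s, e
```

With the example above, fp_add(("1.1", 3), ("50", 0)) aligns 50 to 0.05 × 10^3 and returns the significand 23/20 (that is, 1.15) with exponent 3. Passing base=2 applies the same steps to binary significands.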

SUBTRACTION
Binary subtraction is one of the four binary operations, in which one binary number (made up
of only the two digits 0 and 1) is subtracted from another. The operation is
similar to the basic arithmetic subtraction performed on decimal numbers. When
we subtract 1 from 0, we need to borrow 1 from the next higher-order digit, which reduces that
digit by 1, and the difference left in the current position is 1.

Binary Subtraction

• 0 - 0 = 0
• 1 - 0 = 1
• 1 - 1 = 0
• 0 - 1 = 1 (Borrow 1)

Note: For fractional binary numbers, the same rules apply for subtraction, and the binary point
should be placed appropriately.
In decimal subtraction, when 1 is subtracted from 0, we borrow 1 from the next higher-order
digit to make it 10, and the subtraction gives 9, i.e. 10 - 1 = 9. In binary subtraction, the
borrow makes it (10)2, and the subtraction gives 1, i.e. (10)2 - 1 = 1.

Binary Subtraction Examples


Some examples of binary subtraction are as follows:
Example 1: 0011010 - 0001100
Solution:
      1 1            Borrow
  0 0 1 1 0 1 0
- 0 0 0 1 1 0 0
---------------
  0 0 0 1 1 1 0
Decimal equivalent:
0 0 1 1 0 1 0 = (26)10
0 0 0 1 1 0 0 = (12)10
Therefore, 26 - 12 = 14
The binary result 0 0 0 1 1 1 0 is equivalent to (14)10.
Example 2: 0100010 - 0001010
Solution:
    1 1              Borrow
  0 1 0 0 0 1 0 = (34)10
- 0 0 0 1 0 1 0 = (10)10
---------------
  0 0 1 1 0 0 0 = (24)10

Floating point subtraction


Subtracting 1010101.10 from 1111011.11, with the binary points aligned, we get:
  1111011.11
- 1010101.10
------------
   100110.01

Binary Subtraction Using 1’s Complement

In the sign position:

• The digit 0 represents the positive sign
• The digit 1 represents the negative sign

Procedure for Binary Subtraction by 1’s Complement

• Write the 1’s complement of the subtrahend
• Add the 1’s complement of the subtrahend to the minuend
• If the result has a carry out, add that carry to the least significant bit (end-around carry);
the result is positive
• If there is no carry out, take the 1’s complement of the result; the answer is negative

Binary Subtraction Questions Using 1’s Complement


Question 1:
(110101)2 - (100101)2
Solution:
(1 1 0 1 0 1)2 = (53)10
(1 0 0 1 0 1)2 = (37)10 (subtrahend)
Now take the 1’s complement of the subtrahend and add it to the minuend.
    1 1 0 1 0 1
(+) 0 1 1 0 1 0      1’s complement of subtrahend
---------------
  1 0 0 1 1 1 1      carry out of 1
(+)           1      end-around carry
---------------
    0 1 0 0 0 0
Therefore, the solution is 010000.
(010000)2 = (16)10
Question 2:
(101011)2 - (111001)2
Solution:
Take the 1’s complement of the subtrahend and add it to the minuend.
        1 1 1        carry
    1 0 1 0 1 1
(+) 0 0 0 1 1 0      1’s complement of subtrahend
---------------
    1 1 0 0 0 1
Since there is no carry out, take the 1’s complement of the result.
The result becomes 0 0 1 1 1 0.
Now attach the negative sign to the result.
Therefore, the solution is -(001110)2.
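The procedure and both questions above can be replayed with a short Python sketch. The function name and the choice of passing operands as equal-length bit strings of magnitudes are assumptions made for this illustration.

```python
def subtract_ones_complement(minuend, subtrahend):
    """Subtract equal-length magnitude bit strings via 1's complement (sketch).

    Returns (sign, magnitude_bits).
    """
    n = len(minuend)
    mask = (1 << n) - 1
    complement = ~int(subtrahend, 2) & mask      # 1's complement of the subtrahend
    total = int(minuend, 2) + complement
    if total >> n:                               # carry out: end-around carry, positive
        result = ((total & mask) + 1) & mask
        return "+", format(result, f"0{n}b")
    # No carry out: complement the sum, the answer is negative.
    return "-", format(~total & mask, f"0{n}b")
```

Question 1 becomes subtract_ones_complement("110101", "100101"), which gives ("+", "010000"), and Question 2 becomes subtract_ones_complement("101011", "111001"), which gives ("-", "001110"). With equal operands the sketch reports the 1's complement "negative zero", ("-", "000000").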

Subtraction by 2’s Complement


With the 2’s complement method we can easily subtract two binary
numbers.

The operation is carried out by means of the following steps:

(i) First, the 2’s complement of the subtrahend is found.

(ii) Then it is added to the minuend.

(iii) If the final carry over of the sum is 1, it is dropped and the result is positive.

(iv) If there is no carry over, the 2’s complement of the sum is the result, and it is negative.

Evaluate:

(i) 110110 - 10110

Solution:

The number of bits in the subtrahend is 5 while that of the minuend is 6. We make the number of
bits in the subtrahend equal to that of the minuend by placing a ‘0’ in the sixth place of the
subtrahend.

Now, the 2’s complement of 010110 is (101001 + 1), i.e. 101010. Adding this to the minuend:

              1 1 0 1 1 0     Minuend

              1 0 1 0 1 0     2’s complement of subtrahend

Carry over 1  1 0 0 0 0 0     Result of addition

After dropping the carry over we get the result of subtraction to be 100000.

(ii) 10110 - 11010

Solution:

The 2’s complement of 11010 is (00101 + 1), i.e. 00110. Hence

Minuend                         1 0 1 1 0

2’s complement of subtrahend    0 0 1 1 0

Result of addition              1 1 1 0 0

As there is no carry over, the result of subtraction is negative and is obtained by writing the 2’s
complement of 11100, i.e. (00011 + 1) or 00100.

Hence the difference is -100.

(iii) 1010.11 - 1001.01

Solution:

The 2’s complement of 1001.01 is 0110.11. Hence

Minuend                         1010.11

2’s complement of subtrahend    0110.11

Result of addition            1 0001.10     (carry over 1)

After dropping the carry over we get the result of subtraction as 1.10.

(iv) 10100.01 - 11011.10

Solution:

The 2’s complement of 11011.10 is 00100.10. Hence

Minuend                         10100.01

2’s complement of subtrahend    00100.10

Result of addition              11000.11

As there is no carry over, the result of subtraction is negative and is obtained by writing the 2’s
complement of 11000.11, i.e. 00111.01.

Hence the required result is -00111.01.
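The four steps of the 2’s complement method can be sketched in Python for integer operands such as (i) and (ii). The function name and the bit-string interface are assumptions made here; fractional operands like (iii) and (iv) would first need their binary points aligned and removed.

```python
def subtract_twos_complement(minuend, subtrahend):
    """Subtract magnitude bit strings via 2's complement (sketch).

    Returns (sign, magnitude_bits).
    """
    n = max(len(minuend), len(subtrahend))       # pad the shorter operand with 0's
    mask = (1 << n) - 1
    complement = (~int(subtrahend, 2) & mask) + 1    # 1's complement plus one
    total = int(minuend, 2) + complement
    if total >> n:                               # carry over: drop it, result is positive
        return "+", format(total & mask, f"0{n}b")
    # No carry over: the 2's complement of the sum is the magnitude, result is negative.
    return "-", format(-total & mask, f"0{n}b")
```

Example (i) corresponds to subtract_twos_complement("110110", "10110"), giving ("+", "100000"), and example (ii) to subtract_twos_complement("10110", "11010"), giving ("-", "00100").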

ARRAY MULTIPLIER

An array multiplier is a digital combinational circuit used for multiplying two binary numbers
by employing an array of full adders and half adders. This array is used for the nearly
simultaneous addition of the various product terms involved. To form the various product terms,
an array of AND gates is used before the Adder array.
Checking the bits of the multiplier one at a time and forming partial products is a sequential
operation that requires a sequence of add and shift micro-operations. The multiplication of two
binary numbers can be done with one micro-operation by means of a combinational circuit that
forms the product bits all at once. This is a fast way of multiplying two numbers since all it takes
is the time for the signals to propagate through the gates that form the multiplication array.
However, an array multiplier requires a large number of gates, and for this reason it was not
economical until the development of integrated circuits.
To implement an array multiplier with a combinational circuit, consider the multiplication
of two 2-bit numbers as shown in the figure. The multiplicand bits are b1 and b0, the multiplier bits
are a1 and a0, and the product is c3c2c1c0.
Assuming A = a1a0 and B = b1b0, the bits of the final product P can be written as follows,
where + denotes binary addition and each P(i) takes the sum bit of that addition:
1. P(0) = a0b0
2. P(1) = a1b0 + b1a0
3. P(2) = a1b1 + c1, where c1 is the carry generated during the addition for the P(1) term.
4. P(3) = c2, where c2 is the carry generated during the addition for the P(2) term.
Here the carries c1 and c2 are intermediate signals, distinct from the product bits named c3c2c1c0 above.

For the above multiplication, an array of four AND gates is required to form the various product
terms like a0b0 etc. and then an adder array is required to calculate the sums involving the
various product terms and carry combinations mentioned in the above equations in order to get
the final Product bits.
1. The first partial product is formed by multiplying a0 by b1, b0. The multiplication of two bits
such as a0 and b0 produces a 1 if both bits are 1; otherwise, it produces 0. This is identical to an
AND operation and can be implemented with an AND gate.
2. The first partial product is formed by means of two AND gates.
3. The second partial product is formed by multiplying a1 by b1b0 and is shifted one position to the
left.
4. The above two partial products are added with two half-adder(HA) circuits. Usually there are
more bits in the partial products and it will be necessary to use full-adders to produce the sum.
5. Note that the least significant bit of the product does not have to go through an adder since it is
formed by the output of the first AND gate.
A combinational circuit binary multiplier with more bits can be constructed in similar fashion. A
bit of the multiplier is ANDed with each bit of the multiplicand in as many levels as there are
bits in the multiplier. The binary output in each level of AND gates is added in parallel with the
partial product of the previous level to form a new partial product. The last level produces the
product. For j multiplier bits and k multiplicand bits, we need j*k AND gates and (j-1) k-bit
adders to produce a product of j+k bits.
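For the 2-bit by 2-bit case described above, the gate structure can be mimicked in Python with four AND operations and two half adders. The function names are assumptions for this sketch, and the wiring follows the P(0) to P(3) equations rather than any particular figure.

```python
def half_adder(x, y):
    """Return (sum, carry) of two bits."""
    return x ^ y, x & y


def array_multiply_2x2(a, b):
    """2-bit by 2-bit array multiplier built from AND gates and half adders (sketch)."""
    a0, a1 = a & 1, (a >> 1) & 1
    b0, b1 = b & 1, (b >> 1) & 1
    # Four AND gates form the product terms.
    a0b0, a1b0, a0b1, a1b1 = a0 & b0, a1 & b0, a0 & b1, a1 & b1
    p0 = a0b0                              # LSB needs no adder
    p1, carry1 = half_adder(a1b0, a0b1)    # P(1) and its carry
    p2, carry2 = half_adder(a1b1, carry1)  # P(2) and its carry
    p3 = carry2                            # P(3) is the final carry
    return (p3 << 3) | (p2 << 2) | (p1 << 1) | p0
```

For example, array_multiply_2x2(3, 3) produces 1001, i.e. 9. All signals here settle in one pass through the gates, which is the point of the array design: no clocked add-and-shift sequence is involved.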

Binary division
Binary division is an important but often overlooked part of binary arithmetic.
Binary addition, binary subtraction, binary multiplication and binary division are the four
arithmetic operations of binary arithmetic.
Though binary division is not too difficult, it can initially be a bit harder to understand than the
other binary operations. Those operations share similarities: each has four basic rules,
which makes them easy to follow.

Binary division has no such table of rules to follow, but the process is quite similar to decimal
long division.
Let us take A = 11010 and B = 101, where we want to divide A by B.

The structure of binary division is similar to that of decimal division. We now go through the
operation step by step.
In the first step, the leftmost digits of the dividend A are considered. The first three digits,
110, are not smaller than the divisor, so 1 is written in the quotient and the divisor multiplied
by 1 is written below them. Since 1 × 1 = 1, 1 × 0 = 0 and 1 × 1 = 1, the product written is 101.

In the next step, 101 is subtracted from 110 using the binary subtraction method, which leaves 1.

Following the rules of division, the next bit of the dividend (1) comes down, giving 11. We try to
multiply the divisor B by 1, but the result is bigger than the minuend, so this step cannot be
completed and we go to the next step.

A 0 is inserted into the quotient and the next bit (0) comes down, giving 110. Now we can proceed.

The divisor is again multiplied by 1 and the result is written below; it is the same as in the
first step because the numbers are the same. A 1 is written in the quotient.
In the final step, the binary subtraction 110 - 101 is done and we get the remainder. The
division is complete, with the following result.
Quotient = 101 and remainder = 1.
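The long-division walk-through above amounts to bringing down one dividend bit at a time and subtracting the divisor whenever it fits. A minimal Python sketch, with an assumed function name and bit-string interface:

```python
def binary_long_division(dividend, divisor):
    """Divide two binary strings by long division; returns (quotient, remainder) strings."""
    d = int(divisor, 2)
    if d == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    remainder, quotient = 0, ""
    for bit in dividend:
        remainder = (remainder << 1) | int(bit)   # bring down the next bit
        if remainder >= d:                        # divisor fits: subtract, quotient bit 1
            remainder -= d
            quotient += "1"
        else:                                     # divisor does not fit: quotient bit 0
            quotient += "0"
    return quotient.lstrip("0") or "0", format(remainder, "b")
```

For the example above, binary_long_division("11010", "101") returns ("101", "1").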

Division restoring algorithm


A division algorithm provides a quotient and a remainder when we divide two numbers. Division
algorithms are generally of two types: slow algorithms and fast algorithms. Slow division
algorithms include restoring, non-restoring, non-performing restoring and SRT; fast algorithms
include Newton–Raphson and Goldschmidt.
Here we perform the restoring algorithm for unsigned integers. The term "restoring" comes from the
fact that the value of register A is restored whenever a trial subtraction fails.

Register Q holds the quotient and register A holds the remainder. The n-bit dividend is loaded
in Q and the divisor is loaded in M. Register A is initially set to 0; this is the register whose
value is restored during iteration, which is why the algorithm is named restoring.
The steps involved are:
• Step-1: First the registers are initialized with the corresponding values (Q = Dividend, M =
Divisor, A = 0, n = number of bits in dividend)
• Step-2: Then the content of registers A and Q is shifted left as if they were a single unit
• Step-3: Then the content of register M is subtracted from A and the result is stored in A
• Step-4: Then the most significant bit of A is checked. If it is 0, the least significant bit
of Q is set to 1. Otherwise, if it is 1, the least significant bit of Q is set to 0 and the value of
register A is restored, i.e. set back to the value of A before the subtraction of M
• Step-5: The value of counter n is decremented
• Step-6: If the value of n becomes zero we exit the loop, otherwise we repeat from Step-2
• Step-7: Finally, register Q contains the quotient and A contains the remainder
Examples:
Perform Division Restoring Algorithm
Dividend = 11
Divisor = 3

N    M        A        Q       OPERATION
4    00011    00000    1011    initialize
     00011    00001    011_    shift left AQ
     00011    11110    011_    A = A - M
     00011    00001    0110    Q[0] = 0 and restore A
3    00011    00010    110_    shift left AQ
     00011    11111    110_    A = A - M
     00011    00010    1100    Q[0] = 0 and restore A
2    00011    00101    100_    shift left AQ
     00011    00010    100_    A = A - M
     00011    00010    1001    Q[0] = 1
1    00011    00101    001_    shift left AQ
     00011    00010    001_    A = A - M
     00011    00010    0011    Q[0] = 1

Remember to restore the value of A whenever the most significant bit of A is 1 after the
subtraction. At the end, register Q contains the quotient, i.e. 3, and register A contains the
remainder, 2.

ARITHMETIC AND LOGIC UNIT DESIGN


Inside a computer, there is an Arithmetic Logic Unit (ALU), which is capable of performing
logical operations (e.g. AND, OR, Ex-OR, Invert etc.) in addition to arithmetic operations
(e.g. addition, subtraction etc.). The control unit supplies the data required by the ALU from
memory, or from input devices, and directs the ALU to perform a specific operation based on the
instruction fetched from memory. The ALU is the “calculator” portion of the computer.

An arithmetic logic unit (ALU) is a major component of the central processing unit of a
computer system. It does all processes related to arithmetic and logic operations that need to be
done on instruction words. In some microprocessor architectures, the ALU is divided into the
arithmetic unit (AU) and the logic unit (LU).
An ALU can be designed by engineers to calculate many different operations. As the
operations become more complex, the ALU also becomes more
expensive, takes up more space in the CPU and dissipates more heat. That is why
engineers make the ALU powerful enough to ensure that the CPU is also powerful and fast, but
not so complex as to become prohibitive in terms of cost and other disadvantages.
The ALU is also known as an Integer Unit (IU). The arithmetic logic unit is that part of the CPU
that handles all the calculations the CPU may need. Most of these operations are logical in nature.
Depending on how the ALU is designed, it can make the CPU more powerful, but it also
consumes more energy and creates more heat. Therefore, there must be a balance between how
powerful and complex the ALU is and how expensive the whole unit becomes. This is why faster
CPUs are more expensive, consume more power and dissipate more heat.
The operations carried out by the ALU can be categorized as follows:
• Logical operations − These include operations like AND, OR, NOT, XOR, NOR,
NAND, etc.
• Bit-Shifting Operations − This pertains to shifting the positions of the bits by a certain
number of places either towards the right or left, which corresponds to a division or
multiplication by a power of two.
• Arithmetic operations − This refers to bit addition and subtraction. Although
multiplication and division are sometimes used, these operations are more expensive to
make. Multiplication and division can also be done by repetitive additions and
subtractions respectively.
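As a toy illustration of these three categories, the sketch below selects an operation by name and returns a result truncated to a fixed word width. The operation names, the 8-bit default width and the function name alu are all assumptions made here; a real ALU is selected by control lines and also produces status flags, which are left out.

```python
def alu(op, a, b=0, bits=8):
    """Tiny ALU model: logical, shift and arithmetic operations on fixed-width words."""
    mask = (1 << bits) - 1
    a, b = a & mask, b & mask
    operations = {
        # Logical operations
        "AND": lambda: a & b,
        "OR": lambda: a | b,
        "XOR": lambda: a ^ b,
        "NOT": lambda: ~a,
        # Bit-shifting operations (one place left or right)
        "SHL": lambda: a << 1,
        "SHR": lambda: a >> 1,
        # Arithmetic operations; SUB adds the 2's complement of b
        "ADD": lambda: a + b,
        "SUB": lambda: a + (~b & mask) + 1,
    }
    return operations[op]() & mask
```

For example, alu("SUB", 5, 7) gives 254, the 8-bit 2’s complement pattern of -2, and alu("SHL", 3) gives 6, i.e. a multiplication by two.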

Normalization:
In scientific notation, a number has a single digit to the left of the decimal point.
A number in scientific notation with no leading 0s is called a normalized number, like 1.0 × 10^-8.

0.1 × 10^-7 or 10.0 × 10^-9 are not in normalized form.

We can also represent binary numbers in scientific notation, like 1.0 × 2^-3. Computer
arithmetic that supports such numbers is called floating point.
The form of such a number is 1.xxxxxx… × 2^yyyy…, where .xxxxxx… is called the mantissa and
yyyy… is called the exponent.

Representation of Floating-Point numbers


A floating-point number has the form (-1)^S × 1.M × 2^E, where S is the sign bit, M the mantissa
and E the exponent.
For example, we represent 3.625 in the 32-bit format.
Changing 3 to binary = 11; changing .625 to binary = .101
Writing in binary exponent form
3.625 = 11.101 × 2^0

On normalizing
11.101 × 2^0 = 1.1101 × 2^1
On biasing the exponent, E’ = 127 + 1 = 128
(128)10 = (10000000)2
So the 8-bit exponent is (10000000)2
The 23-bit mantissa (significand) is (11010000000000000000000)2

Sign bit is 0
The number is represented as
0 10000000 11010000000000000000000
For example, we represent -3.625 in the 64-bit format.
Again we follow the same procedure up to normalization. After that, we add 1023 to bias the
exponent.

Changing 3 to binary = 11
Changing .625 to binary = .101
Writing in binary exponent form
3.625 = 11.101 × 2^0
On normalizing
11.101 × 2^0 = 1.1101 × 2^1
On biasing the exponent, 1023 + 1 = 1024
(1024)10 = (10000000000)2
So the 11-bit exponent = 10000000000
52-bit significand = 110100000000 …, padded with 0’s to a total of 52 bits

Setting sign bit = 1 (number is negative)

So, the final representation is

1 10000000000 110100000000 …, padded with further 0’s to a total of 52 bits
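Both bit patterns can be checked against the machine’s own IEEE 754 encoding using Python’s standard struct module. The helper name float_bits is an assumption for this sketch; it simply splits the packed bits into sign, exponent and fraction fields.

```python
import struct


def float_bits(value, double=False):
    """Return (sign, exponent, fraction) bit strings of a float in IEEE 754 format."""
    if double:
        raw = struct.unpack(">Q", struct.pack(">d", value))[0]
        bits = format(raw, "064b")
        return bits[0], bits[1:12], bits[12:]    # 1 + 11 + 52 bits
    raw = struct.unpack(">I", struct.pack(">f", value))[0]
    bits = format(raw, "032b")
    return bits[0], bits[1:9], bits[9:]          # 1 + 8 + 23 bits
```

float_bits(3.625) gives sign 0, exponent 10000000 and a fraction of 1101 followed by nineteen 0’s; float_bits(-3.625, double=True) gives sign 1, exponent 10000000000 and a fraction of 1101 followed by forty-eight 0’s.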

Converting floating point number into decimal


Let’s convert a floating-point number into decimal:
1 01111100 11000000000000000000000
The decimal value of an IEEE number is given by the formula:

(1 - 2s) × (1 + f) × 2^(e - bias)
where

• the s, f and e fields are taken as decimal values here
• (1 - 2s) is 1 or -1, depending on whether the sign bit is 0 or 1
• an implicit 1 is added to the significand (fraction field f), as in the formula
• the bias is either 127 or 1023, for single or double precision respectively

First convert each individual field to decimal.

The sign bit s is 1

The e field contains 01111100 = (124)10

The mantissa is 0.11000… = (0.75)10
Putting these values in the formula
(1 - 2) × (1 + 0.75) × 2^(124 - 127) = -1.75 × 2^-3 = -0.21875
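The formula can be applied mechanically with a few lines of Python. This sketch covers only normalized numbers (it ignores zero, subnormals, infinities and NaN), and the function name and parameters are assumptions chosen for illustration.

```python
def decode_ieee(bits, exp_bits=8, bias=127):
    """Evaluate (1 - 2s) * (1 + f) * 2^(e - bias) for a normalized IEEE bit string."""
    bits = bits.replace(" ", "")
    s = int(bits[0])                              # sign field
    e = int(bits[1:1 + exp_bits], 2)              # biased exponent field
    fraction = bits[1 + exp_bits:]
    f = int(fraction, 2) / (1 << len(fraction))   # fraction field as a value in [0, 1)
    return (1 - 2 * s) * (1 + f) * 2.0 ** (e - bias)
```

decode_ieee("1 01111100 11000000000000000000000") evaluates to -0.21875, matching the hand calculation. For double precision, the same function can be called with exp_bits=11 and bias=1023.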
