Field Programmable Gate Array Prototyping of End-Around Carry Parallel Prefix Tree Architectures
Published in IET Computers & Digital Techniques Received on 27th March 2009 Revised on 27th September 2009 doi: 10.1049/iet-cdt.2009.0036
ISSN 1751-8601
Field programmable gate array prototyping of end-around carry parallel prefix tree architectures
F. Liu1, Q. Tan1, G. Chen2, X. Song3, O. Ait Mohamed4, M. Gu5
1 National Lab of Parallel Distributed Processing, Hunan, China
2 Lingcore Lab, Portland, OR, USA
3 ECE Department, Portland State University, Portland, OR, USA
4 ECE Department, Concordia University, Montreal, Quebec, Canada
5 School of Software, TsingHua University, Beijing, China
E-mail: [email protected]
Abstract: As an important part of many processors' floating-point units, the fused multiply-add unit performs a multiplication followed immediately by an addition. In the fused multiply-add unit of the IBM POWER6 microprocessor, a fast 128-bit floating-point end-around-carry (EAC) adder was proposed. Very few algorithmic details about this adder exist in today's literature. In this study, a complete EAC adder design that can work independently as a regular adder is proposed, and the arithmetic algorithms of the proposed EAC adder are described in detail. In IBM's original EAC adder, the Kogge–Stone tree was chosen for its high performance on ASIC technology. In this study, the authors present a comparative study of different parallel prefix trees used in the design of the new EAC adder targeting field programmable gate array (FPGA) technology. The study highlights the main performance differences among 14 different architecture configurations, focusing on the area requirements and the critical path delay. The experimental results show that one architecture configuration combines a near-minimum area requirement with the lowest critical path delay.
Introduction
The fused multiply-add unit plays an important role in modern microprocessors. It performs a floating-point multiplication followed immediately by an addition of the product with a third floating-point operand. In 2007, a seven-cycle fused multiply-add pipeline unit was proposed [1] as part of the floating-point unit in IBM's POWER6 microprocessor. In this fused multiply-add dataflow, the product must be aligned before it is added to the addend. Because the magnitude of the product is unknown in the early stages prior to the combination with the addend, it is difficult to determine a priori which operand is bigger [2]. Even if it were determined early that the product was bigger, there would still be the problem of conditionally complementing two intermediate operands, the carry and sum outputs of the counter tree. Thus, an adder needs to be designed that always outputs a positive magnitude result and preferably needs to conditionally complement only one operand [2]. Therefore a new 128-bit end-around carry (EAC) adder was designed and fabricated in IBM's fused multiply-add unit [3]. The intention was not to produce the adder with the best stand-alone performance but the one with the best overall floating-point performance [3].

IBM implemented its EAC adder in a 65 nm SOI technology [4], and some sub-components are implemented using the Kogge–Stone tree [5]. In fact, the Ladner–Fischer tree [6] was used in IBM's first-pass test chip. Compared to the Ladner–Fischer design, the Kogge–Stone design is about 0.5 FO4 faster with only 6% area overhead and 5% power increase [3]. Therefore the Kogge–Stone tree was chosen in the final design. Besides the Kogge–Stone and Ladner–Fischer trees, many other variations of parallel prefix trees are known [7].

© The Institution of Engineering and Technology 2010. IET Comput. Digit. Tech., 2010, Vol. 4, Iss. 4, pp. 306–316, doi: 10.1049/iet-cdt.2009.0036, www.ietdl.org

The motivation for our work was to find the best EAC adder for use in a fused multiply-add unit. We also note that field programmable gate array (FPGA) technology has recently enjoyed rapidly increasing popularity. In the nanotechnology era, the logic density of FPGAs has increased dramatically. Because the fixed structure and the large variety of resources of an FPGA can significantly affect implementation results, it is interesting to check whether the EAC adder works well on FPGA technology and to study the performance differences among different architecture configurations, focusing on the area requirements and the critical path delay.

Since it would be difficult to evaluate the full floating-point performance, in this paper we propose a complete EAC adder design that can work independently, without being a part of the fused multiply-add unit. Very few descriptions of the EAC adder's formulations exist in today's literature; therefore the arithmetic algorithms of the proposed EAC adder are explained in detail. Because the algorithms of our EAC adder mainly follow the arithmetic algorithms of the IBM EAC adder and can be read without knowledge of the whole floating-point unit, we believe that our description will help readers gain a better understanding of the nature of the EAC design. To make our EAC adder work as a regular adder, we design some new logic units, such as the input logic unit and the sign logic unit. This design makes it easier to implement and test other design choices. Moreover, because the additional logic units do not affect the EAC adder's key behaviours, evaluations of the different designs of our EAC adder remain relevant to fused multiply-add unit design. We study the performance of the EAC adder with different parallel prefix trees on FPGA technology. The experimental results show that one architecture configuration combines a near-minimum area requirement with the lowest critical path delay. The paper is organised as follows.
In Section 2, related work is reviewed. In Section 3, some preliminaries and the algorithms of the 32-bit adder block are presented. Section 4 describes the architecture of our proposed 128-bit EAC adder and its arithmetic algorithms. Section 5 explains the implementation of different parallel prefix trees in our EAC adder and reports the simulation results. Section 6 concludes this paper.

Related work
In the past few years, several adders used in the fused multiply-add operation have been proposed [8–10]. These adder schemes are based on the delay profile of the multiply compression tree. As a result, they are power efficient only when the final addition is performed right after the compression tree and when the EAC computation is not needed [3]. For higher floating-point performance, the EAC adder is used in recent processors. Although the EAC adder has become common hardware design practice, the technique has not been well documented. Shedletsky [11] analysed some behaviours of the EAC adder using examples of real circuits. Yu et al. [3] proposed a fast 128-bit EAC adder which is fabricated as part of the IBM POWER6 microprocessor. They described the adder's architecture and analysed its performance and power dissipation. Zhang et al. [12] presented a 108-bit EAC adder which is also used by a fused multiply-add unit. Structure-aware layout techniques were used to optimise their adder's structure. All the works above focused on the architecture design of the EAC adder, while the details of its arithmetic algorithms were not explained. Schwarz [2] discussed some aspects of the EAC adder's algorithms, but some details were still not included.

On the other hand, the parallel prefix tree has recently been used as a sub-component of the EAC adder. Many classic parallel prefix adders have been proposed, including Sklansky [13], Kogge–Stone and Brent–Kung [14]. These prefix networks achieve three extreme goals: minimal logic levels and wire tracks, minimal max-fanout and logic levels, and minimal wire tracks and max-fanout, respectively. In addition, Ladner–Fischer, Han–Carlson [15] and Knowles [7] implemented the trade-off between each pair of the extreme cases. The structure of the prefix network determines the type of the prefix adder. Ziegler et al. [16] considered sparsity, fanout and radix as three dimensions in the design space of regular parallel prefix adders and presented a unified formalism to describe such structures. Liu et al. [17] studied how to find optimal prefix structures for specific applications and proposed an integer linear programming method to build minimal-power prefix adders within given timing and area constraints. For the EAC adder of the IBM POWER6, chip tests showed that the Kogge–Stone tree was a better choice than the Ladner–Fischer tree.

The works discussed above are based on ASIC technology. Vitoroulis et al. [18] investigated the performance of parallel prefix adders implemented with FPGA technology and reported the area requirements and critical path delay for a variety of classical parallel prefix adder structures. However, the parallel prefix trees were implemented as single adders, not as parts of bigger designs. In our work, we try to answer these questions: What are the arithmetic algorithms of an EAC adder with a parallel prefix tree? If we use different parallel prefix trees in the EAC adder on FPGA technology, which one is better? How do parallel prefix trees affect the other parts of the EAC adder? As a part of the EAC adder, should the implementation of the parallel prefix tree itself be changed?
$\wedge$ denotes the Boolean AND; $\vee$ (or $+$) denotes the Boolean OR; $\oplus$ denotes the Boolean Exclusive OR. A binary number of length $n$ ($n \ge 1$) is an ordered sequence of binary bits where each bit can assume one of the values 0 or 1. For a traditional integer adder, we use $x = (x_{n-1} x_{n-2}, \ldots, x_1 x_0)$ and $y = (y_{n-1} y_{n-2}, \ldots, y_1 y_0)$ to denote the two $n$-bit addends and $s = (s_{n-1} s_{n-2}, \ldots, s_0)$ to denote the corresponding sum ($n \ge 1$); $x_i$, $y_i$, $s_i$ denote the binary bits of $x$, $y$, $s$ at position $i$, where $0 \le i \le n-1$. Let $c = \{c_n, c_{n-1}, \ldots, c_0\}$ be the corresponding set of carries, where $c_0$ is the initial incoming carry, $c_i$ denotes the carry from bit position $i-1$ and $c_n$ is the outgoing carry.

To explain the adder's algorithm, some standard notions such as propagated carry, generated carry, group-propagated carry and group-generated carry should be introduced. These notions are related to parallel prefix trees and their definitions can be found in Koren [19]. In this paper, we use $P_i = x_i \oplus y_i$ and $G_i = x_i \wedge y_i$ (for simplicity, $G_i = x_i y_i$) to denote the propagated carry and the generated carry at bit position $i$, respectively. We use $P_{i:j}$, $G_{i:j}$ to denote the group-propagated carry and the group-generated carry for the bit positions $i, i-1, \ldots, j$, respectively.

The notation of the carry select adder is also important. For the group that consists of $k$ bit positions starting with bit position $j$ and ending with bit position $i$, where $i = j + k - 1$, the outputs of the carry select adder are the sum bits $s_i, s_{i-1}, \ldots, s_j$ and the outgoing carry $c_{i+1}$. These outputs are selected by the incoming carry $c_j$ into this group as follows

$$
\begin{aligned}
c_{i+1} &= [c^0_{i+1} \wedge \overline{c_j}] \vee [c^1_{i+1} \wedge c_j] \\
s_m &= [s^0_m \wedge \overline{c_j}] \vee [s^1_m \wedge c_j] \qquad (m = j, j+1, \ldots, i)
\end{aligned}
\tag{1}
$$

Here $\overline{c_j}$ is the Boolean complement of $c_j$; $s^0_m$ is the sum bit at bit position $m$ under the condition that the incoming carry is 0 and $c^0_{i+1}$ is the corresponding outgoing carry; $s^1_m$, $c^1_{i+1}$ are the sum bit at position $m$ and the outgoing carry under the condition that the incoming carry into the group is 1. Other useful notions and formulations about parallel prefix trees and the carry select adder can be found in Koren [19].

Since the carry signal is on the critical path, a 128-bit EAC adder was designed in the IBM POWER6 microprocessor to obtain a high-performance floating-point unit. Fig. 1 shows its block diagram. This adder is divided into three sub-blocks: the 32-bit adder block, the EAC logic block and the final sum selection block [3]. Each 32-bit adder block is also partitioned into three sub-components. The first sub-component is an 8-bit prefix-2 Kogge–Stone tree with sparseness of 2 that generates 8-bit propagates as well as conditional sums that are needed later for sum selection. The second sub-component is a prefix-2 Kogge–Stone tree with sparseness of 8 that generates 32-bit propagated terms as well as 32-bit conditional sums. The last sub-component is a sum selection block [3].

Figure 1 Block diagram of the 128-bit binary adder [3]

From Fig. 1 we know that the 128-bit EAC adder is composed of four 32-bit adder blocks. The architecture of each 32-bit adder block is shown in Fig. 2. Each 32-bit adder block is actually a carry select adder consisting of four 8-bit adder blocks. Each 8-bit adder block has the structure depicted in Fig. 3. In each 8-bit adder block, there are two 8-bit adders which are implemented using a parallel prefix tree. In IBM's design, an 8-bit Kogge–Stone tree is used. The real structure of each 8-bit adder block is a conditional sum adder. In fact, there are two levels of parallel prefix tree in the 32-bit adder block, as Fig. 2 shows. The first level is the 8-bit parallel prefix tree with sparseness of 2 that generates 8-bit carry signals, propagate terms as well as conditional sums. The second level is the parallel prefix tree with sparseness of 8 that generates 32-bit carry signals, propagate terms and conditional sums. In Fig. 2, we show just one implementation of the parallel prefix tree for the second level.

Figure 2 Block diagram of 32-bit adder block

Figure 3 Eight-bit adder block

Although the EAC adder has been implemented in several microprocessors, very few details of its formulations and arithmetic algorithms can be found in today's literature. Schwarz [2] gave good explanations of some aspects of the EAC adder's algorithms, but some details were not included. In this section, we try to describe the details of the EAC adder's algorithms clearly. We propose a completely designed EAC adder and describe its architecture. The new architecture makes our EAC adder independent of the fused multiply-add unit. Our new design mainly follows the algorithms of the EAC adder implemented in the IBM POWER6 microprocessor. The additional logic units of our EAC adder ensure that the whole adder can work independently; they do not affect the key algorithms. Therefore we take our EAC design as the example to explain the EAC adder's arithmetic algorithms, which makes the description clearer and easier to read. Readers can understand it without knowing other details of the IBM POWER6 floating-point unit. Another advantage is that our new design is easy to implement and test, which makes it possible to implement different architecture configurations and compare their properties, such as performance.

Fig. 4 shows the architecture of the proposed EAC adder. In this adder, the inputs are two 129-bit binary addends $x = (x.s, x_{127}, x_{126}, \ldots, x_0)$ and $y = (y.s, y_{127}, y_{126}, \ldots, y_0)$, and the output is the sum $s = (s.s, s_{127}, s_{126}, \ldots, s_0)$. They are all in sign-magnitude format. x.x, y.y, s.s are the magnitudes of x, y, s and x.s, y.s, s.s are the corresponding sign bits. The magnitudes of the operands are used to produce the positive magnitude of the sum, and the sign bits of the operands are used to produce the sign of the sum. The adder in Fig. 4 can implement four operations: $x.x + y.y$, $x.x - y.y$, $(-x.x) + y.y$ and $(-x.x) + (-y.y)$.

Figure 4 Architecture of modified EAC adder

For an effective subtraction with $x.x < y.y$, the positive magnitude of the result can be rewritten as follows

$$
\begin{aligned}
s.s &= y.y - x.x = -(x.x - y.y) = -(x.x + \overline{y.y} + 1) \\
&= -(x.x + \overline{y.y}) - 1 = (\overline{x.x + \overline{y.y} + 0} + 1) - 1 = \overline{x.x + \overline{y.y} + 0}
\end{aligned}
\tag{2}
$$
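The complement identity used for the case x.x < y.y can be checked numerically. The following is a minimal sketch, assuming an illustrative width of 8 bits instead of the 128 bits of the adder; the names `N`, `inv` and `diff_when_x_smaller` are hypothetical helpers introduced only for this check.

```python
N = 8                      # illustrative width (assumption); the paper's adder is 128 bits
MASK = (1 << N) - 1

def inv(v):
    """Bitwise complement restricted to N bits."""
    return ~v & MASK

def diff_when_x_smaller(x, y):
    """For x < y, return y - x as the complement of (x + ~y + 0)."""
    return inv((x + inv(y)) & MASK)

# Exhaustive check over all N-bit pairs with x < y.
for x in range(1 << N):
    for y in range(x + 1, 1 << N):
        assert diff_when_x_smaller(x, y) == y - x
```

Because x < y implies that x + ~y never overflows N bits, a single complement of the raw sum is enough and no second addition is required.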
With the above equation we obtain the following property: when $x.x < y.y$, the output of the EAC adder is defined by the following equation

$$
s.s = \overline{x.x + \overline{y.y} + c_{out}} \tag{3}
$$

3. Finally, the outgoing carry $c_{out}$ is used to select the correct s.s. When $x.x \ge y.y$, the output of the EAC adder should be $s.s = x.x + \overline{y.y} + c_{out}$; when $x.x < y.y$, the output of the EAC adder should be $s.s = \overline{x.x + \overline{y.y} + c_{out}}$.

Figure 5 Subtraction dataflow of EAC adder

After discussing how to implement the effective subtraction of operands x.x and y.y, we focus on their addition. It is easy to implement $x.x + y.y$ on its own; however, the addition must be combined with the subtraction in one single adder. Fig. 6 shows how to integrate them. In Fig. 6, the Add/sub logic unit takes x.s, y.s as the inputs and os as the output. The output os is defined by

$$
os = x.s \oplus y.s \tag{4}
$$

The input logic unit takes os, y.y as the inputs and $y^t$ as the output. The output $y^t$ is defined by

$$
y^t = \begin{cases} y.y, & os = 0 \\ \overline{y.y}, & os = 1 \end{cases} \tag{5}
$$

The sign logic unit takes x.s, y.s, $c_{out}$ as the inputs and s.s as the output. The output s.s is calculated by

$$
s.s = (x.s \wedge c_{out}) \vee (y.s \wedge \overline{c_{out}}) \tag{6}
$$

Figure 6 Integration of addition and subtraction
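The interplay of the Add/sub logic, input logic and sign logic units can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: it uses plain Python integers instead of gates, takes the incoming carry equal to os, and the function name `eac_sign_magnitude_add` and the parameter `n` are hypothetical. Overflow of a true addition beyond n bits is simply truncated here.

```python
def eac_sign_magnitude_add(xs, xm, ys, ym, n=128):
    """Behavioural sketch: (sign, magnitude) of x + y in sign-magnitude form.

    xs, ys are sign bits (0 or 1); xm, ym are n-bit magnitudes.
    """
    mask = (1 << n) - 1
    os = xs ^ ys                            # Add/sub logic unit, eq. (4)
    yt = (~ym & mask) if os else ym         # input logic unit, eq. (5)
    first = xm + yt + os                    # incoming carry is os (assumption)
    cout = first >> n                       # outgoing carry
    if os == 0:
        mag = first & mask                  # plain addition
    else:
        second = (xm + yt + cout) & mask    # end-around: cout re-enters as carry-in
        mag = second if cout else (~second & mask)
    sign = (xs & cout) | (ys & (1 - cout))  # sign logic unit, eq. (6)
    return sign, mag

# Example: (-5) + 12 gives +7, i.e. sign bit 0 and magnitude 7.
print(eac_sign_magnitude_add(1, 5, 0, 12, n=8))
```

Note that when both magnitudes are equal in an effective subtraction, the model returns magnitude 0 with the sign of x, so a negative zero is possible.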
In Fig. 6, when os = 0, we can use the EAC adder to do the addition x + y; when os = 1, we use the EAC adder to perform the subtraction as shown in Fig. 5. The inputs of the EAC adder are $y^t$, x.x, os; the outputs are $c_{out}$, s.s. When os = 0, because $y^t = y.y$, the inputs are x.x, y.y and the incoming carry 0; the outputs should be the sum $s.s = x.x + y.y$ and the outgoing carry $c_{out}$. When os = 1, the inputs are $y^t = \overline{y.y}$, x.x and the incoming carry 1; the outputs should be the correct result computed by the algorithm in Fig. 5. In this way, we perform both the addition and the subtraction using a single adder. We can use another logic unit, named the EAC logic unit, to implement this method. It uses a modified group-propagate signal $P^t_{31:0}$ defined by

$$
P^t_{31:0} = \begin{cases} P_{31:0}, & os = 1 \\ 0, & os = 0 \end{cases} \tag{7}
$$

The EAC logic unit takes the signals $G_{127:96}, P_{127:96}, \ldots, G_{31:0}$ together with $P^t_{31:0}$ as the inputs to calculate the incoming carries into each group $c_0, c_1, c_2, c_3$. With the help of the above logic units, the algorithm of the EAC adder is as follows:

1. When $x.s \ne y.s$, we have os = 1, $y^t = \overline{y.y}$, $P^t_{31:0} = P_{31:0}$. From the formulation of the carry lookahead adder, we can obtain

$$
\begin{aligned}
c_1 &= G_{31:0} + P_{31:0} c_{in} \\
c_2 &= G_{63:32} + P_{63:32} G_{31:0} + P_{63:32} P_{31:0} c_{in} \\
c_3 &= G_{95:64} + P_{95:64} G_{63:32} + P_{95:64} P_{63:32} G_{31:0} + P_{95:64} P_{63:32} P_{31:0} c_{in} \\
c_0 &= G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0} \\
&\quad + P_{127:96} P_{95:64} P_{63:32} P_{31:0} c_{in}
\end{aligned}
\tag{8}
$$

Following the first step of Fig. 5, we know that $x.x + \overline{y.y} + 1$ should be computed and the outgoing carry $c_{out}$ should be used to decide whether x.x is bigger than y.y or not. Thus, by the above equations, assuming $c_{in} = 1$, $c_{out}$ can be computed as

$$
\begin{aligned}
c_{out} = c_0 &= G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0} \\
&\quad + P_{127:96} P_{95:64} P_{63:32} P_{31:0}
\end{aligned}
\tag{9}
$$

Then, for the second addition, which means $x.x + \overline{y.y} + c_{out}$, we take $c_{out} = c_0$ as the incoming carry. Using the formulations of the carry lookahead adder again, we can obtain the group carry signals as

$$
\begin{aligned}
c'_1 &= G_{31:0} + P_{31:0} c_{out} \\
&= G_{31:0} + P_{31:0} \{ G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0} + P_{127:96} P_{95:64} P_{63:32} P_{31:0} \} \\
&= G_{31:0} + P_{31:0} G_{127:96} + P_{127:96} P_{31:0} G_{95:64} + P_{127:96} P_{95:64} P_{31:0} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} P_{31:0} \\
c'_2 &= G_{63:32} + P_{63:32} G_{31:0} + P_{63:32} P_{31:0} c_{out} \\
&= G_{63:32} + P_{63:32} G_{31:0} + P_{63:32} P_{31:0} G_{127:96} + P_{127:96} P_{63:32} P_{31:0} G_{95:64} + P_{127:96} P_{95:64} P_{63:32} P_{31:0} \\
c'_3 &= G_{95:64} + P_{95:64} G_{63:32} + P_{95:64} P_{63:32} G_{31:0} + P_{95:64} P_{63:32} P_{31:0} c_{out} \\
&= G_{95:64} + P_{95:64} G_{63:32} + P_{95:64} P_{63:32} G_{31:0} + P_{95:64} P_{63:32} P_{31:0} G_{127:96} + P_{127:96} P_{95:64} P_{63:32} P_{31:0} \\
c'_0 &= G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0} + P_{127:96} P_{95:64} P_{63:32} P_{31:0} c_{out} \\
&= G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0} + P_{127:96} P_{95:64} P_{63:32} P_{31:0}
\end{aligned}
\tag{10}
$$

$c'_0, c'_1, c'_2, c'_3$ can be used to select the correct sum $x.x + \overline{y.y} + c_{out} = sum_{127:0}$. In the following, we show how the EAC logic unit completes the task mentioned above.

Definition 4.1 (EAC logic unit): The EAC logic unit takes the signals $G_{127:96}, P_{127:96}, \ldots, G_{31:0}, P^t_{31:0}$ as the inputs and $c_0, c_1, c_2, c_3$ as the outputs. The outputs are defined as follows

$$
\begin{aligned}
c_1 &= G_{31:0} + P^t_{31:0} G_{127:96} + P^t_{31:0} P_{127:96} G_{95:64} + P^t_{31:0} P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} P^t_{31:0} \\
c_2 &= G_{63:32} + P_{63:32} G_{31:0} + P^t_{31:0} P_{63:32} G_{127:96} + P^t_{31:0} P_{127:96} P_{63:32} G_{95:64} + P_{127:96} P_{95:64} P_{63:32} P^t_{31:0} \\
c_3 &= G_{95:64} + P_{95:64} G_{63:32} + P_{95:64} P_{63:32} G_{31:0} + P_{95:64} P_{63:32} P^t_{31:0} G_{127:96} + P_{127:96} P_{95:64} P_{63:32} P^t_{31:0} \\
c_0 &= G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0} + P_{127:96} P_{95:64} P_{63:32} P^t_{31:0}
\end{aligned}
\tag{11}
$$

As we know, when $x.s \ne y.s$, we have os = 1 and $P^t_{31:0} = P_{31:0}$. So, for the EAC logic unit, the equations for calculating $c_0, c_1, c_2, c_3$ can be rewritten as

$$
\begin{aligned}
c_1 &= G_{31:0} + P_{31:0} G_{127:96} + P_{31:0} P_{127:96} G_{95:64} + P_{31:0} P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} P_{31:0} \\
c_2 &= G_{63:32} + P_{63:32} G_{31:0} + P_{31:0} P_{63:32} G_{127:96} + P_{31:0} P_{127:96} P_{63:32} G_{95:64} + P_{127:96} P_{95:64} P_{63:32} P_{31:0} \\
c_3 &= G_{95:64} + P_{95:64} G_{63:32} + P_{95:64} P_{63:32} G_{31:0} + P_{95:64} P_{63:32} P_{31:0} G_{127:96} + P_{127:96} P_{95:64} P_{63:32} P_{31:0} \\
c_0 &= G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0} + P_{127:96} P_{95:64} P_{63:32} P_{31:0}
\end{aligned}
\tag{12}
$$

In this case, it is easy to see that the equations for calculating $c_0, c_1, c_2, c_3$ are equivalent to the formulations for computing $c'_0, c'_1, c'_2, c'_3$ above. Therefore the EAC logic unit can be used to implement the subtraction dataflow shown in Fig. 5 with only one addition. Furthermore, os and $c_0$ can be used to select the correct sum s.s from $sum_{127:0}$ and $\overline{sum_{127:0}}$ according to the following rules:

When $x.x \ge y.y$, we have os = 1 and $c_{out} = c_0 = 1$. As a result, $sum_{127:0} = x.x + \overline{y.y} + 1$, and the sum is selected as $s.s = sum_{127:0} = x.x + \overline{y.y} + 1$.

When $x.x < y.y$, we have os = 1 and $c_0 = c_{out} = 0$. Then, the sum is $s.s = \overline{sum_{127:0}} = \overline{x.x + \overline{y.y} + 0}$.

2. When $x.s = y.s$, we should do the addition $x.x + y.y$. Taking the formulations of the carry lookahead adder and assuming the incoming carry $c_{in} = 0$, the group carries can be calculated as follows

$$
\begin{aligned}
c''_1 &= G_{31:0} + P_{31:0} c_{in} = G_{31:0} \\
c''_2 &= G_{63:32} + P_{63:32} G_{31:0} + P_{63:32} P_{31:0} c_{in} = G_{63:32} + P_{63:32} G_{31:0} \\
c''_3 &= G_{95:64} + P_{95:64} G_{63:32} + P_{95:64} P_{63:32} G_{31:0} + P_{95:64} P_{63:32} P_{31:0} c_{in} \\
&= G_{95:64} + P_{95:64} G_{63:32} + P_{95:64} P_{63:32} G_{31:0} \\
c''_0 &= G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0} + P_{127:96} P_{95:64} P_{63:32} P_{31:0} c_{in} \\
&= G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0}
\end{aligned}
\tag{13}
$$

With $c_{in} = 0$, $c''_1$, $c''_2$, $c''_3$, we can select the correct sum $x.x + y.y$ from the outputs $s^0_{127:0}$ and $s^1_{127:0}$, and $c''_0$ is the outgoing carry.

On the other hand, because $x.s = y.s$, we have os = 0, $y^t = y.y$, $P^t_{31:0} = 0$, and the formulations of the EAC logic unit can be rewritten as follows

$$
\begin{aligned}
c_1 &= G_{31:0} + P^t_{31:0} G_{127:96} + P^t_{31:0} P_{127:96} G_{95:64} + P^t_{31:0} P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} P^t_{31:0} \\
&= G_{31:0} + 0 \cdot G_{127:96} + 0 \cdot P_{127:96} G_{95:64} + 0 \cdot P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} \cdot 0 = G_{31:0} \\
c_2 &= G_{63:32} + P_{63:32} G_{31:0} + P^t_{31:0} P_{63:32} G_{127:96} + P^t_{31:0} P_{127:96} P_{63:32} G_{95:64} + P_{127:96} P_{95:64} P_{63:32} P^t_{31:0} \\
&= G_{63:32} + P_{63:32} G_{31:0} \\
c_3 &= G_{95:64} + P_{95:64} G_{63:32} + P_{95:64} P_{63:32} G_{31:0} + P_{95:64} P_{63:32} P^t_{31:0} G_{127:96} + P_{127:96} P_{95:64} P_{63:32} P^t_{31:0} \\
&= G_{95:64} + P_{95:64} G_{63:32} + P_{95:64} P_{63:32} G_{31:0} \\
c_0 &= G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0} + P_{127:96} P_{95:64} P_{63:32} P^t_{31:0} \\
&= G_{127:96} + P_{127:96} G_{95:64} + P_{127:96} P_{95:64} G_{63:32} + P_{127:96} P_{95:64} P_{63:32} G_{31:0}
\end{aligned}
\tag{14}
$$

We can see that the equations calculating $c_0, c_1, c_2, c_3$ are the same as the equations calculating $c''_0, c''_1, c''_2, c''_3$. Therefore $c_1, c_2, c_3$ can be used to select the correct sum. Here, the first group is a special case: $sum_{31:0}$ is controlled not only by $c_0$ but also by os

$$
sum_{31:0} = \begin{cases} s^1_{31:0}, & c_0 \wedge os = 1 \\ s^0_{31:0}, & \text{otherwise} \end{cases} \tag{15}
$$

In this way, when $x.s = y.s$, whatever the value of $c_0$ is, we always have $sum_{127:0} = x.x + y.y$. The EAC logic unit can implement the simple addition $x.x + y.y$ correctly. Furthermore, os and $c_0$ can also be used to select the correct sum in the subtraction dataflow discussed above. In this way, the EAC logic unit combines the addition and the subtraction correctly by doing only one addition operation. In [2], the formulation of the EAC adder is similar, but some details of the algorithms were not explained, and the EAC logic unit is introduced as a part of the fused multiply-add unit, which means it cannot do the addition independently. Our design given in Fig. 4 can perform the addition independently, so it is easy to verify the correctness of the adder's algorithms.
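The group-carry equations of the EAC logic unit can be exercised with a short behavioural sketch. It assumes that the group signals are derived directly from integer slices rather than from prefix trees, and the names `group_gp` and `eac_magnitude` are hypothetical; only the carry equations, the modified propagate term and the special handling of the lowest group follow the text.

```python
import random

GROUPS, W = 4, 32                      # four 32-bit adder blocks
GMASK = (1 << W) - 1
MASK = (1 << (GROUPS * W)) - 1

def group_gp(a, b):
    """Group generate and group propagate of two W-bit slices."""
    g = (a + b) >> W                   # carry out with incoming carry 0
    p = int((a ^ b) == GMASK)          # every bit position propagates
    return g, p

def eac_magnitude(x, y, os):
    """Magnitude path of the EAC adder; os = 1 selects effective subtraction."""
    yt = (~y & MASK) if os else y      # input logic unit
    xs = [(x >> (W * k)) & GMASK for k in range(GROUPS)]
    ys = [(yt >> (W * k)) & GMASK for k in range(GROUPS)]
    (g0, p0), (g1, p1), (g2, p2), (g3, p3) = [group_gp(a, b) for a, b in zip(xs, ys)]
    pt = p0 & os                       # modified propagate of the lowest group
    wrap = p3 & p2 & p1 & pt           # all groups propagate
    # Outputs of the EAC logic unit (Definition 4.1)
    c1 = g0 | (pt & g3) | (pt & p3 & g2) | (pt & p3 & p2 & g1) | wrap
    c2 = g1 | (p1 & g0) | (pt & p1 & g3) | (pt & p3 & p1 & g2) | wrap
    c3 = g2 | (p2 & g1) | (p2 & p1 & g0) | (p2 & p1 & pt & g3) | wrap
    c0 = g3 | (p3 & g2) | (p3 & p2 & g1) | (p3 & p2 & p1 & g0) | wrap
    carries = [c0 & os, c1, c2, c3]    # lowest group also depends on os
    total = 0
    for k in range(GROUPS):            # per-group sum selection
        total |= ((xs[k] + ys[k] + carries[k]) & GMASK) << (W * k)
    if os and not c0:                  # x < y: complement the result
        total = ~total & MASK
    return total, c0

random.seed(0)
for _ in range(1000):
    x, y = random.getrandbits(128), random.getrandbits(128)
    assert eac_magnitude(x, y, 1) == (abs(x - y), int(x >= y))
    assert eac_magnitude(x, y, 0) == ((x + y) & MASK, (x + y) >> 128)
```

The random check compares the model against ordinary integer arithmetic for both the effective subtraction and the addition, which is one way to gain confidence that a single pass with the end-around terms reproduces the two-pass subtraction dataflow.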
From the arithmetic algorithms discussed above, we know that for the 32-bit adder block in IBM's EAC adder design, both the first-level and the second-level parallel prefix trees are Kogge–Stone trees. Compared to the Ladner–Fischer tree, the Kogge–Stone tree is the better choice on ASIC technology. Here, we try to find the best choice on FPGA technology. In this paper, our proposed EAC adder follows all the key algorithms of IBM's design; the additional logic units are mainly used to ensure that the EAC adder can work independently. Thus, the design is useful not only for implementing and testing the EAC adder easily, but also as a reference for finding a better design for the EAC adder used in a fused multiply-add unit. We implement different architecture configurations of parallel prefix trees in our EAC adder and report the simulation results.

Figure 7 Basic cells in parallel prefix tree

Knowles [7] has presented complete classes of regular fanout prefix adders which are bounded at the extremes by the Kogge–Stone tree and the Ladner–Fischer tree. For our study, using FPGA technology, we choose the regular parallel prefix trees of Knowles's adder family and other basic parallel prefix trees to implement the first-level 8-bit parallel prefix tree depicted in Fig. 3. The chosen parallel prefix trees are Kogge–Stone; Ladner–Fischer; Brent–Kung; Han–Carlson; Knowles [1, 1, 4]; Knowles [1, 2, 2]; Knowles [1, 1, 2]. For the second-level parallel prefix trees in Fig. 2, we choose Knowles [1, 1] and Knowles [1, 2] from Knowles's adder family, respectively. These adders were selected because they span the design limits and intermediate cases in terms of area, depth of prefix network, fan-out and interconnect count. The notions introduced in Section 3 are helpful for understanding how these parallel prefix trees work. However, we must change their regular implementation to ensure that they work correctly in the EAC adder.

For a regular stand-alone parallel prefix adder, we always assume that the incoming carry into the adder is $c_0 = 0$. For two $n$-bit binary addends $x = (x_{n-1} x_{n-2}, \ldots, x_0)$ and $y = (y_{n-1} y_{n-2}, \ldots, y_0)$, the formulations for computing the carry and the sum at bit position $i$ in a parallel prefix tree are $c_i = G_{i-1:0} \vee (P_{i-1:0} \wedge c_0)$ and $s_i = P_i \oplus c_i$, where $0 \le i \le n-1$. Because $c_0 = 0$, we have $c_i = G_{i-1:0} \vee (P_{i-1:0} \wedge c_0) = G_{i-1:0}$. That is why two different basic cells in Fig. 7 can be used to build the regular Brent–Kung tree in Fig. 8: where only the signal $G_{i-1:0}$ is needed, the simpler triangular cell can be used to reduce the complexity.

Vitoroulis [18] compared the performance and area of regular parallel prefix trees implemented on FPGA technology. But when the parallel prefix trees are implemented as components of our EAC adder in Fig. 4, they cannot be designed in the regular way shown in Fig. 8. Both $G_{i:0}$ and $P_{i:0}$ must be kept as outputs for reuse in the next stage. For example, if we want to use the Brent–Kung tree as the component in the EAC adder, which means the parallel prefix tree in Fig. 3 is implemented using a Brent–Kung tree, we can only use the square cell to calculate the signals in the intermediate stages, so the regular design of the Brent–Kung tree shown in Fig. 8 must be changed. Fig. 9 shows the rough architecture of the modified Brent–Kung tree adopted. Therefore on FPGA technology, the properties of the different parallel prefix trees, such as area and performance, will differ from the results listed in Vitoroulis's report. As a result, if we implement different parallel prefix trees in our EAC adder, we should first change the implementation of the parallel prefix tree itself; we should then also take into account the relationship between the parallel prefix trees and the other parts of the EAC adder.

The EAC adder was implemented with the various tree structures which are already discussed in this paper.
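The need to keep both group signals can be illustrated with a small sketch. It assumes a Kogge–Stone style 8-bit prefix network built only from full cells that produce both outputs, and it derives the two conditional sums from those outputs; the function names `kogge_stone_gp` and `conditional_sums` are hypothetical and the model is behavioural, not the authors' VHDL.

```python
def kogge_stone_gp(x, y, n=8):
    """Return lists g, p with g[i] = G_{i:0} and p[i] = P_{i:0}."""
    g = [(x >> i) & (y >> i) & 1 for i in range(n)]      # bit generate
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]    # bit propagate
    d = 1
    while d < n:
        ng, np_ = g[:], p[:]
        for i in range(d, n):
            ng[i] = g[i] | (p[i] & g[i - d])   # full cell keeps the generate ...
            np_[i] = p[i] & p[i - d]           # ... and the propagate output
        g, p = ng, np_
        d *= 2
    return g, p

def conditional_sums(x, y, n=8):
    """Sums for incoming carry 0 and 1, plus the group signals G_{n-1:0}, P_{n-1:0}."""
    g, p = kogge_stone_gp(x, y, n)
    s0 = s1 = 0
    for i in range(n):
        bit_p = ((x >> i) ^ (y >> i)) & 1
        c_if0 = g[i - 1] if i else 0                  # carry into bit i when c_0 = 0
        c_if1 = (g[i - 1] | p[i - 1]) if i else 1     # carry into bit i when c_0 = 1
        s0 |= (bit_p ^ c_if0) << i
        s1 |= (bit_p ^ c_if1) << i
    return s0, s1, g[n - 1], p[n - 1]

print(conditional_sums(0b10110101, 0b01001010))
```

The sum for an incoming carry of 1 depends on P_{i-1:0}, and the next prefix level needs the group propagate of the whole block, which is why a stand-alone tree that discards the propagate outputs cannot be reused unchanged.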
They are first coded in VHDL at two different levels, and then all 14 architecture configurations are modelled in the Aldec Active-HDL simulation environment. The adder functionality was successfully verified using 100 000 random test vectors. After functional verification, all 14 adder architectures were implemented on a high-performance Xilinx Virtex-II Pro FPGA (XC2VP100) using the Xilinx ISE synthesis environment. We measured the area of an implemented design as the number of FPGA slices it occupies, and its speed as the delay (ns) of its longest signal path, that is, its critical path. The area and speed results are compared in Figs. 10–13.
Figure 9 Eight-bit Brent-Kung tree in EAC adder
Figure 10 Critical path delay, logic delay and route delay (ns)
Figure 11 Logic delay (ns)
Figure 13 Number of slices
These results show that the minimum area is achieved by the 32-bit Knowles [1, 1] tree and 8-bit Ladner-Fischer tree configuration, and the maximum area, which is 18% larger, by the 32-bit Knowles [1, 1] tree and 8-bit Knowles [1, 1, 2] tree configuration (Fig. 13). Comparing the critical path delays of the various EAC adders in Fig. 10, we find that the 32-bit Knowles [1, 2] tree and 8-bit Han-Carlson tree configuration has the lowest delay; the 32-bit Knowles [1, 1] tree and 8-bit Ladner-Fischer tree configuration has the maximum delay, which is about 22.5% larger.
The critical path delay has two main components, the logic delay and the routing delay. Fig. 10 shows that the routing delay exceeds the logic delay for all adders, with very little variation. Here, the routing is chosen automatically by the synthesis tools. It can sometimes be reduced in the final phase of a design, manually or by other methods, but this optimisation is often very hard. Although the logic delay is always related to the routing delay, we still compare them separately in Figs. 11 and 12. The results show that the 32-bit Knowles [1, 2] tree and 8-bit Han-Carlson tree configuration also has the lowest logic delay, but not the lowest routing delay; the 32-bit Knowles [1, 1] tree and 8-bit Ladner-Fischer tree configuration seems to have both the maximum logic delay and the maximum routing delay. Overall, the 32-bit Knowles [1, 2] and 8-bit Han-Carlson configuration seems to be the best compromise between area and speed. Although its area is about 3% larger than the minimum, this is more than compensated by a significant gain in speed.
It is also known that FPGAs have built-in carry logic based on fast-carry computation, which outperforms parallel prefix adders in both area and delay [18]. This is mainly because the built-in carry logic can use a high-speed bus to propagate the carry. In our experiments, the logic delay of the built-in carry logic is about 13.7 ns (Fig. 11), longer than that of the parallel prefix adders, which is about 10 ns. However, the routing delay of the built-in carry logic is only 3.4 ns. In contrast, the routing delay of the Brent-Kung adder, which is almost the minimum among all parallel prefix adders, is 11.8 ns. In summary, the total delay of the built-in carry logic is 17.1 ns (13.7 ns of logic delay plus 3.4 ns of routing delay), less than the 21.9 ns of the Brent-Kung adder. This result confirms that built-in carry logic is a better choice than a parallel prefix adder for a regular adder on an FPGA. However, an EAC adder uses not only the sum signals from the adder but also the group-propagated and group-generated carries, which can only be obtained from parallel prefix adders. That is to say, porting the EAC adder to an FPGA still requires a parallel prefix tree. Experiments over different parallel prefix trees therefore help to find the best implementation.
For the power consumption, the Xilinx power estimation tool, XPower, gives only very rough estimates. For all implementations of the EAC adder, the estimated power dissipation was approximately 572 mW. We note that Vitoroulis also did not list the power consumption of regular parallel prefix trees [18]. We will therefore keep looking for tools that report power dissipation more precisely, and will consider power consumption as a metric in future work. For now, based on the results we have, we may say that the Kogge-Stone tree is not the good choice on FPGAs that it is in ASIC technology. Compared with the other parallel prefix trees, the Kogge-Stone implementation has a longer delay, a larger area and similar power consumption.
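The dependence of the EAC adder on group-generated and group-propagated carries, and the random-vector style of functional check described above, can be illustrated with a small software model. The following Python sketch is only an illustration under stated assumptions: it models an 8-bit integer end-around-carry addition rather than the 128-bit floating-point adder, it uses a Kogge-Stone prefix network because it is the shortest to write down, and it assumes the common formulation in which the carry-in is the group generate of the whole word. The function names (`kogge_stone_prefix`, `eac_add`, `eac_reference`, `count_mismatches`) are hypothetical and are not taken from the VHDL design.

```python
import random


def kogge_stone_prefix(g, p):
    """Return lists (G, P) where G[i], P[i] are the group generate and
    group propagate signals covering bit positions 0..i."""
    n = len(g)
    G, P = list(g), list(p)
    d = 1
    while d < n:
        # Each level combines a node with the node d positions to its right.
        G_next, P_next = list(G), list(P)
        for i in range(d, n):
            G_next[i] = G[i] | (P[i] & G[i - d])
            P_next[i] = P[i] & P[i - d]
        G, P = G_next, P_next
        d *= 2
    return G, P


def eac_add(a, b, n=8):
    """End-around-carry addition of two n-bit values.

    Returns (sum, group_generate, group_propagate) for the whole word.
    Assumption: the carry-in is the word-level group generate, i.e. the
    carry-out that would be produced with a carry-in of zero."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    g = [x & y for x, y in zip(a_bits, b_bits)]  # bit generate
    p = [x ^ y for x, y in zip(a_bits, b_bits)]  # bit propagate (XOR form)
    G, P = kogge_stone_prefix(g, p)
    cin = G[n - 1]  # end-around carry fed back to bit 0
    # Carry into bit i+1 is G[i] OR (P[i] AND cin); carry into bit 0 is cin.
    carries = [cin] + [G[i] | (P[i] & cin) for i in range(n - 1)]
    total = 0
    for i in range(n):
        total |= (p[i] ^ carries[i]) << i
    return total, G[n - 1], P[n - 1]


def eac_reference(a, b, n=8):
    """Reference model: add, then wrap any carry-out back into bit 0."""
    mask = (1 << n) - 1
    t = a + b
    if t > mask:
        t = (t & mask) + 1
    return t


def count_mismatches(trials=10000, n=8, seed=0):
    """Compare the prefix-tree model with the reference on random vectors."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(trials):
        a = rng.randrange(1 << n)
        b = rng.randrange(1 << n)
        if eac_add(a, b, n)[0] != eac_reference(a, b, n):
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("mismatches:", count_mismatches())
```

A built-in carry chain would deliver only the sum and the final carry-out, whereas this model reads the word-level `G` and `P` directly from the prefix network, which is the property the EAC adder relies on. Swapping in another prefix network (Brent-Kung, Han-Carlson, Ladner-Fischer or Knowles) changes only `kogge_stone_prefix`, provided it returns the same `G` and `P` lists.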
Conclusion
In this paper, we proposed a complete design of a binary floating-point EAC adder and explained the details of its arithmetic algorithms. Our EAC adder's algorithms mainly follow the 128-bit binary floating-point adder implemented in the IBM POWER6 microprocessor. Compared with IBM's design, our EAC adder can work independently, which makes it easy to implement and test. Because few details of the EAC adder's arithmetic algorithms exist in today's literature, our paper can help designers to understand this arithmetic unit well. We then studied the performance of parallel prefix trees implemented in our EAC adder with FPGA technology. After analysing the relationships between the parallel prefix trees and the other parts of the EAC adder, we modified the implementation of regular parallel prefix trees so that they can be used correctly within the EAC adder. By comparing the area and performance of 14 different parallel prefix tree architecture configurations, we found that the 32-bit Knowles [1, 1] and 8-bit Ladner-Fischer configuration has the minimum area, while the 32-bit Knowles [1, 2] and 8-bit Han-Carlson configuration has the minimum critical path delay. Although its area is about 3% larger than the minimum, the 32-bit Knowles [1, 2] and 8-bit Han-Carlson configuration seems to be the best compromise between area and speed for the FPGA implementation.
References
[1] CURRAN B., MCCREDIE B., SIGAL L., ET AL.: 4GHz+ low-latency fixed-point and binary floating-point execution units for the POWER6 processor. Digest of 2006 IEEE Int. Solid-State Circuits Conf., 2006, pp. 1728–1734
[2] SCHWARZ E.M.: Binary floating-point unit design, in U.S.S. (ED.): High performance energy efficient microprocessor design (Springer, 2006), pp. 189–208
[3] YU X.Y., FLEISCHER B., CHAN Y.H., ET AL.: A 5 GHz+ 128-bit binary floating-point adder for the POWER6 processor. Proc. Int. Conf. 32nd European Solid-State Circuits, 2006, pp. 166–169
[4] LEOBANDUNG D.M.E., NAYAKAMA H., ET AL.: High performance 65 nm SOI technology with dual stress liner and low capacitance SRAM cell. Digest of 2005 Symp. on VLSI Technology, 2005
[5] KOGGE P.M., STONE H.S.: A parallel algorithm for the efficient solution of a general class of recurrence equations, IEEE Trans. Comput., 1973, 22, (8), pp. 786–793
[6] LADNER R., FISCHER M.: Parallel prefix computation, J. ACM, 1980, 27, (4), pp. 831–838
[7] KNOWLES S.: A family of adders. Proc. 15th IEEE Symp. on Computer Arithmetic, 2001, pp. 277–281
[8] OKLOBDZIJA V.G., VILLEGER D.: Improving multiplier design by using improved column compression tree and optimized final adder in CMOS technology, IEEE Trans. VLSI Syst., 1995, 3, (2)
[9] STELLING P., OKLOBDZIJA V.G.: Design strategies for optimal hybrid final adders in a parallel multiplier, J. VLSI Signal Process. (Special issue on VLSI Arithmetic), 1996, 14, (3)
[10] ZEYDEL B.R., OKLOBDZIJA V.G., MATHEW S., KRISHNAMURTHY R.K., BORKAR S.: A 90 nm 1 GHz 22 mW 16 × 16-bit 2's complement multiplier for wireless baseband. Proc. 2003 Symp. on VLSI Circuits, 2003
[11] SHEDLETSKY J.J.: Comment on the sequential and indeterminate behavior of an end-around-carry adder, IEEE Trans. Comput., 1977, pp. 271–271
[12] ZHANG X.Y., CHAN Y.H., MONTOYE R., ET AL.: A 270 ps 20 mW 108-bit end-around carry adder for multiply-add fused floating point unit, J. Signal Process. Syst., 2009
[13] SKLANSKY J.: Conditional-sum addition logic, IRE Trans. Electronic Comput., 1960, EC-9, pp. 226–231
[14] BRENT R.P., KUNG H.T.: A regular layout for parallel adders, IEEE Trans. Comput., 1982, C-31, pp. 260–264
[15] HAN T., CARLSON D.: Fast area-efficient VLSI adders. Proc. Eighth Symp. Comp, 1987, pp. 49–56
[16] ZIEGLER M.M., STAN M.R.: A unified design space for regular parallel prefix adders. Proc. Design, Automation and Test in Europe Conf. and Exhibition (DATE'04), 2004, pp. 1386–1387
[17] LIU J.H., ZHU Y., ZHU H.K., ET AL.: Optimum prefix adders in a comprehensive area, timing and power design space. Proc. 12th Conf. on Asia South Pacific Design Automation (ASP-DAC'07), 2007, pp. 609–615
[18] VITOROULIS K., AL-KHALILI A.J.: Performance of parallel prefix adders implemented with FPGA technology. IEEE Northeast Workshop on Circuits and Systems, 2007, pp. 498–501
[19] KOREN I.: Computer arithmetic algorithms (A.K. Peters, Natick, MA, 2002)
IET Comput. Digit. Tech., 2010, Vol. 4, Iss. 4, pp. 306–316, doi: 10.1049/iet-cdt.2009.0036