Fast Architectures For FPGA-Based Implementation of RSA Encryption Algorithm
Omar Nibouche1, Mokhtar Nibouche2, Ahmed Bouridane3, and Ammar Belatreche1
1 Faculty of Engineering, University of Ulster at Magee, Northland Rd, Derry, BT48 7JL, UK
[email protected], [email protected]
2 Faculty of Computing, Engineering and Mathematical Sciences, University of the West of England, Bristol BS16 1QY, UK
[email protected]
3 School of Computer Science, Queen's University of Belfast, 18 Malone Rd, Belfast BT7 1NN, UK
[email protected]
Abstract
In this paper, new structures that implement the RSA cryptographic algorithm are presented. These structures are built upon a modified Montgomery modular multiplier, in which the operations of multiplication and modular reduction are carried out in parallel rather than interleaved as in the traditional Montgomery multiplier. The global broadcast of data lines is avoided by interleaving two or more encryption/decryption operations onto the same structure, thus making the implementation systolic and scalable. The digit approach has been adopted in this paper. This methodology is based on varying the digit size and the level of pipelining of the structures. This parameterised approach gives the designer an efficient way of choosing the architecture that best suits his/her requirements in terms of speed and area usage, an issue of critical importance for resource-limited FPGA chips. The FPGA implementation results show that the proposed RSA structures outperform structures built around the traditional Montgomery multiplier in terms of speed, thanks to the avoidance of global line broadcast.
I. Introduction: In recent years, the use of software tools and hardware devices for security functions has increased dramatically [8, 10-12, 16, 18-22]. Security issues play a crucial role in the widespread use of many computer and communication systems, such as the Internet, which more and more people are using to transmit sensitive information such as credit card numbers. Cryptographic algorithms are a central tool for achieving system security. Privacy and fraud concerns can be addressed through the use of various security primitives such as data encryption, which can be used with the appropriate protocols to construct secure and trusted networks [10-12, 16-22]. Cryptographic algorithms require immense processing power, which may cause a bottleneck in high-speed networks. In the past, system developers have had to use software-based techniques in order to achieve the agility necessary to maintain compatibility in the presence of different algorithms and protocols [8, 16, 18-21]. However, software solutions are particularly inefficient for the computationally demanding asymmetric algorithms (also called public-key cryptographic algorithms). To achieve hardware agility, the use of reconfigurable logic is an appropriate solution, especially since the recent development of FPGA technology means that these devices are now large enough to implement a reasonable application in a single FPGA device [11, 16, 18-22]. The unlimited reconfigurability of an FPGA permits a continuous sequence of custom circuits to be employed, each optimized for the current task. This has led to a relatively new concept for computing, where such a device can be used as a co-processor coupled with a host machine in order to speed up the computation of a specific task. Furthermore, in terms of security, another benefit of using dedicated hardware is that security attacks become more difficult, as the secrets can be contained within the co-processor using non-volatile memory, which is externally inaccessible [16, 17].

In 1976, Diffie and Hellman introduced the idea of public-key cryptography [6], which is now widely used to provide confidentiality, authentication, data integrity and non-repudiation. Since then, numerous public-key cryptosystems have been proposed. All these systems base their security on mathematical one-way functions. RSA [7] is the most widely used public-key cryptosystem. An RSA operation is a modular exponentiation operation, which requires repeated modular multiplications. For security reasons, RSA operand sizes need to be 1024 bits or greater [8, 10-12, 16, 18-22]. In 1985, Montgomery introduced a new method for modular multiplication [9]. This method has proved to be very efficient and is the basis of many implementations of modular multiplication, both in software and hardware [9]. It replaces trial division by the modulus with a series of additions and divisions by a power of 2, which are very easy to implement. For this reason, this algorithm has formed the basis of many of the RSA hardware architectures reported to date [3-5, 8, 10-12, 14, 16, 18-22]. This paper proposes new and efficient FPGA-based hardware implementations of the RSA algorithm based on the modified Montgomery modular multiplication presented in [3].
A systolic approach for the implementation strategy has been adopted in this paper in order to achieve a high clock frequency. Many systolic architectures of the RSA algorithm have been proposed in the literature [18-22]. In this paper, the focus is on avoiding global line broadcast and using only nearest neighbour communication wherever possible. For FPGA implementations, serial and digit architectures that interleave more than one encryption operation and balance area usage and speed requirements are presented. The paper is organized as follows: Section II reviews Montgomery's modular multiplication algorithm and the authors' modified version of the algorithm. This modified algorithm uses two conventional multipliers (with one multiplier being slightly different) to carry out the operation of modular multiplication. These two multipliers operate in parallel. A brief discussion about the bound of the result of the
0-7803-8652-3/04/$20.00 © 2004 IEEE
ICFPT 2004
algorithm is presented. As a matter of fact [19-22], the bound of the result of Montgomery's algorithm is changed in such a way that, if multiple modular multiplication operations are to be carried out iteratively, with the result of one iteration used in the next one, no subtraction operation is required. Section III reviews Montgomery architectures and discusses the adopted implementation approach. To design fully systolic architectures, more than one operation is interleaved onto the same structure. Furthermore, the digit approach presents the designer with a trade-off for finding the structure that best matches the design needs in terms of area usage, speed requirements, and the number of encryption operations interleaved onto the same structure [1, 3-5]. The implementation results of the proposed architectures using FPGA are analyzed in Section IV, from which the conclusions are derived in Section V.
II. Montgomery modular multiplication

The RSA algorithm consists of a modular exponentiation operation, which is, in turn, broken into two modular multiplication operations [7-8, 10-12, 18-22]. There are two classes of modular multiplication algorithms. A clear distinction between these two classes can be based upon the data format. The Most Significant Bit First (MSBF) class of modular algorithms are either comparison/subtraction based algorithms or Look Up Table (LUT) based algorithms [2, 13, 15]. On the other hand, Montgomery's algorithm [9] makes the Least Significant Bit First (LSBF) approach useful when performing numerous successive modular multiplication operations, such as modular exponentiation. This algorithm avoids long chains of carry propagation, and therefore speeds up both the modular multiplication and squaring operations required during the exponentiation process [9]. Rather than using subtraction/division operations, this LSBF method computes the modular product of two numbers multiplied by a scaling factor, which is relatively prime to the modulus. This allows the algorithm to perform divisions by a power of two, which is a shift operation, thus making the design of modulo multipliers easier [9].

Large size operands are used in security applications [3-5, 8, 10-12, 16, 18-22]. Therefore, when implementing Montgomery's algorithm, avoiding the global broadcast of data is very important. Such a global broadcast can lower the clock frequency [3-5]. As shown in [3-5], a way to avoid such global lines is to use a digit implementation and to interleave multiple instances onto the same structure. Furthermore, the digit approach can be the appropriate solution to avoid the large hardware usage of the parallel based implementations and the slow speed that characterises the serial architectures [1, 3-5].

Let the modulus M be an integer within the range $[2^{n-1}, 2^n[$ and let R be $2^n$. Montgomery's multiplication algorithm requires R and M to be relatively prime, i.e., $\gcd(R, M) = \gcd(2^n, M) = 1$, which is satisfied if M is odd, as required by the algorithm. By exploiting this property, Montgomery's algorithm introduces an efficient multiplication scheme, which computes the modular product, P, of two given integers, A and B, as follows [9]:

$$P = (A B R^{-1})_M \qquad (1)$$

where $R^{-1}$ is the inverse of R modulo M. In order to describe his algorithm, Montgomery introduced the quantity $M'$, which is the inverse of $-M$ modulo R, i.e.:

$$M' = (-M^{-1})_R \qquad (2)$$

The computation of the Montgomery multiplication is carried out using Algorithm 1 as follows:

Algorithm 1
  $T = AB$
  $P = (T + (TM')_R\, M)/R$
  if $P \ge M$ then $P = P - M$

The algorithm uses the multiplication modulo R and the division by R, which are faster and simpler than the computation of AB mod M, which involves division by M. The algorithm is only efficient when multiple operations are carried out, such as in the modular exponentiation operation when it is broken into modular multiplication operations [12]. For a hardware implementation, a systolic array can be derived from the bit-wise version of Montgomery multiplication as shown in Algorithm 2 [9]:

Algorithm 2
  for i = 0 to n-1
    $q_i = (P_{i-1} + a_i B)_2$
    $P_i = (P_{i-1} + a_i B + q_i M)/2$
  if $P_{n-1} \ge M$ then $P = P_{n-1} - M$

Algorithm 2 interleaves the multiplication steps with the reduction steps. One bit of the partial result is used to select the reduction value. As shown in the algorithm, the LSB of the partial result of the previous iteration, $P_{i-1}$, together with the bit-product $a_i b_0$, directly selects the modular reduction value, which is either 0 or M. At each iteration, the LSB of the sum $P_{i-1} + a_i B + q_i M$ equals 0, so the sum is shifted one position to the right to give $P_i$. After n iterations of the algorithm, the scaling factor is equal to $2^{-n}$. Therefore, the final result is $AB2^{-n}$, as shown in equation (1). The partial results fall in the range $[0, 2M[$ and, as such, an operation of comparison/subtraction is necessary at the end of the algorithm [2]. The critical delay of Algorithm 2 occurs during the calculation of the P values given by the three-input addition, $P_i = (P_{i-1} + a_i B + q_i M)/2$. The main contributing factor to this delay is the carry propagation [10-11] or, in the case of serial multipliers, the global broadcast line resulting from using very large operands [18-22].

As shown in Algorithm 2, Montgomery's modular multiplication algorithm is equivalent to two interleaved conventional multiplication operations, where the second one is a reduction operation which is required to keep the partial results within the range $[0, 2M[$.

The modified algorithm shown in [3] uses two conventional multipliers to build a Montgomery structure. In this way, one multiplier carries out the operation of multiplication and the other multiplier carries out the modular reduction operation. These two operations are carried out in parallel. Let T be the product of A×B, i.e.:

$$T = AB = T_0 + 2T_1R \qquad (3)$$
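As an illustration, the two formulations reviewed above can be sketched in Python: the word-level form of Algorithm 1 and the bit-wise form of Algorithm 2. This is a minimal software sketch, not the hardware described in this paper. The function names and the toy operands are hypothetical, and the modular inverse is obtained with Python's built-in `pow(M, -1, R)` (Python 3.8 or later), which is a convenience assumed here.

```python
def montgomery_word(a, b, m, n):
    """Algorithm 1 style: P = (T + (T*M' mod R)*M) / R, with R = 2**n."""
    r = 1 << n
    m_prime = (-pow(m, -1, r)) % r            # M' = -M^(-1) mod R (needs odd M)
    t = a * b
    p = (t + ((t * m_prime) % r) * m) >> n    # the sum is a multiple of R
    return p - m if p >= m else p


def montgomery_bitwise(a, b, m, n):
    """Algorithm 2 style: one bit of A per iteration, LSB first."""
    p = 0
    for i in range(n):
        a_i = (a >> i) & 1
        q_i = (p + a_i * b) & 1               # LSB selects 0 or M
        p = (p + a_i * b + q_i * m) >> 1      # sum is even, so the shift is exact
    return p - m if p >= m else p


# Toy operands (assumed values): M is odd and lies in [2**(n-1), 2**n[.
n, m = 8, 239
a, b = 100, 200
expected = (a * b * pow(2, -n, m)) % m        # A*B*R^(-1) mod M
print(montgomery_word(a, b, m, n), montgomery_bitwise(a, b, m, n), expected)
```

Both functions should print the same value as the reference expression, since each computes $ABR^{-1}$ modulo M followed by the final comparison/subtraction.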
Algorithm 3:
  Inputs: $0 \le A = \sum_i a_i 2^i < 2M$, $0 \le B = \sum_j b_j 2^j < 2M$, with $a_{n+1} = b_{n+1} = 0$
  Initialisation: $P'_{-1} = 0$, $P''_{-1} = 0$
  For i = 0 to n+1
    Step 1: $P'_i = P'_{i-1} + 2^i a_i B$
    Step 2: serial two's complement of $T_0$, with carry $c_i$, giving bit i of $\bar{T}_0$
    Step 3: selection of the bit $q_i$ from $\bar{T}_0$
    Step 4: $P''_i = P''_{i-1} + 2^i q_i M$
    Step 5: $P_i = (P'_i + P''_i)/2^{i+1}$

As shown in Algorithm 3, the modular multiplication is broken into two concurrent multiplication operations, and Montgomery's multiplication is computed by using the two's complement of $T_0$ to select the reduction value. The results from the two multipliers are then added and a division by 4R (or $2^{n+2}$) follows. The algorithm starts by computing the product AB as in (4). It then uses $\bar{T}_0$, the two's complement of $T_0$, to compute the reduction value.

However, in the original algorithm (see Algorithm 2), after each multiplication a reduction operation is required (the last step in Algorithm 2). The input has the restriction A, B < M and the output P is bounded by P < 2M. As a consequence, if P ≥ M, M must be subtracted so that the output can be used as input for the next multiplication. To avoid this subtraction [9, 19-22], a new bound for the inputs is A, B < 2M; thus, the output is also bounded by P < 2M. In this way, the result P is equal to:

$$P = (T + (TM')_{4R}\, M)/4R = A \times B \times 2^{-n-2} < 2M \qquad (5)$$

At the (i+1)th iteration of Algorithm 3, since each $a_i B$ is below 2M and each $q_i M$ is at most M, the sum $(P'_i + P''_i)/2^{i+1}$ is bounded by:

$$(P'_i + P''_i)/2^{i+1} \le 3M(1 - 1/2^{i+1}) \qquad (6)$$

Therefore, after the (n+1)th iteration of the algorithm:

$$(P'_n + P''_n)/2^{n+1} \le 3M(1 - 1/2^{n+1}) \qquad (7)$$

At the (n+2)th (i.e., the last) iteration, and since $a_{n+1} = b_{n+1} = 0$, the result $P = (P'_{n+1} + P''_{n+1})/2^{n+2}$ satisfies:

$$P \le 2M - 3M/2^{n+2} < 2M \qquad (8)$$

Thus, the subtraction operation is avoided. In practice, the result of each modular multiplication operation can immediately be used as an input for the next operation, as required by the RSA algorithm.

III. Implementation Approach: The serial approach processes the data serially, where at every clock cycle a single data bit is fed to the RSA processor to be processed. In contrast, the parallel approach processes the data bits in a parallel fashion in just one clock cycle. Many RSA implementations are bit-serial based structures [10-11, 16, 18-22]. This is essentially due to their design simplicity and low hardware area-usage requirements. However, when high sample rates are required, the family of bit-serial architectures leads to a slow system speed [1]. To avoid this problem, and thus reach a higher system speed, it is clear that a move towards higher bit-width operations is necessary. The bit-parallel approach presents a solution that overcomes the speed problems of many systems [1]. Unfortunately, this solution comes with its own drawback, as it requires a large area usage and pin-out, which may not be available in resource-limited FPGA chips [1]. For applications for which the bit-serial approach is slow and the bit-parallel one is faster than required and occupies a large area, a trade-off must be found [3-5]. The concept of digit-serial arithmetic has been adopted in this paper as a compromise between bit-serial and bit-parallel arithmetic. Systems based on this arithmetic process more than one bit in one clock cycle. The number of bits processed in one clock cycle, which varies from a single bit to the word-length, is referred to as the digit size. Since the digit size is variable, the digit approach provides the designer with a flexible and efficient area-time method to find the speed and the area that match the design needs through an appropriate choice of the digit size. Pipelining the digit structures is of capital importance [1, 3-5]. Traditionally, digit arithmetic structures, such as Montgomery multipliers, cannot be pipelined beyond the digit level due to the presence of feedback loops. However, based on the use of the feedback loop pipelining technique [1-5], the aim of designing digit structures that are pipelined at the bit level with the minimum number of operations interleaved onto the same structure can be achieved [3-5]. This technique basically pipelines the feedback loops by adding Flip Flops (FFs)/latches to the feedback loops.

Algorithm 3 is implemented using two bit-serial multipliers, a serial adder to perform the addition of step 5, and a serial two's complement circuit that is used to two's complement the product AB, as shown in Figure 1. The LSB cell of the second serial multiplier is different from the remaining cells, which are gated Full Adders [3]. During the n first cycles, the LSB cell generates the bit $q_i$ to select the reduction value. The clock path of this bit-serial modular multiplier is equivalent to that of a gated FA (which is a FA to which an AND gate is connected) and a latch. However, the data lines, such as the multiplier input $a_i$, are broadcast to all the cells. Such a global distribution lowers the clock frequency, especially for large operands, which are usually used in cryptography.
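To make the bound of equations (5) to (8) concrete, the following Python sketch mimics the two concurrent accumulations of Algorithm 3 in software. It is an illustrative reconstruction under stated assumptions, not the authors' circuit: the names are hypothetical, the two's complement of $T_0$ is taken from the full product up front instead of serially, and $q_i$ is assumed to be the bit that makes bit i of the reduction accumulator equal bit i of that two's complement.

```python
def modified_montgomery(a, b, m, n):
    """Return A*B*2**-(n+2) mod M as a value below 2M, for inputs below 2M."""
    steps = n + 2
    t0_bar = (-(a * b)) % (1 << steps)   # two's complement of the low n+2 bits of T
    p_mul = 0                            # P'  : multiplication accumulator
    p_red = 0                            # P'' : reduction accumulator
    for i in range(steps):
        a_i = (a >> i) & 1
        p_mul += (a_i * b) << i                    # P'_i = P'_{i-1} + 2^i a_i B
        q_i = ((t0_bar >> i) ^ (p_red >> i)) & 1   # assumed selection rule for q_i
        p_red += (q_i * m) << i                    # P''_i = P''_{i-1} + 2^i q_i M
    return (p_mul + p_red) >> steps      # exact: the low n+2 bits cancel


n, m = 8, 239                            # toy odd modulus in [2**(n-1), 2**n[
worst = max(modified_montgomery(a, b, m, n)
            for a in range(2 * m) for b in range(2 * m))
print(worst, 2 * m)                      # the largest result stays below 2M
```

Because the output stays below 2M for all inputs below 2M, it can be fed straight back as an operand, which is the property the relaxed bound is meant to provide.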
Figure 1. A serial Montgomery Multiplier derived from Algorithm 3 [3]

Another bit-serial structure that implements Algorithm 3 is shown in Figure 2. The same basic cells used in the multiplier of Figure 1 are used to build this systolic multiplier structure, except that the overall number of latches has increased. It is worth noting that the multiplier uses nearest neighbour communications only and that there are two latches in the feedback loops. Thus this structure is fully systolic and the global distribution lines are avoided. The design of this architecture has been carried out by interleaving two modular multiplication operations onto the same structure and by pipelining the feedback loops, a technique that has been used in many works [3-5]. In general, the number of operations interleaved onto the same structure depends on the number of latches added to the feedback loops. The two latches in the loops of the serial multiplier and adder of Figure 2 cope with feeding two different sets of operands to the multiplier. The suggestion made by [22] is that the two interleaved modular multiplication operations are the two operations involved in the modular exponentiation. In this way, the serial multiplier can be used as a bit serial-parallel modular exponentiation structure.
As shown in [3], this structure was derived from Algorithm 3 by using, in the general case, the scheduling vector [J, 2J] on the multiplication Dependency Graph (DG) [3]. Therefore, 2J latches are required in the carry feedback loop, and 2J operations have to be interleaved. In Figure 2 the scheduling vector [1, 2] was used. This structure was later unfolded to get digit modular multiplier structures pipelined at the bit level. Montgomery's digit modular multiplier, as presented in [3], is shown in Figure 3. It uses two digit multipliers, a digit adder, and a digit complement circuit.

[Figure caption fragment: "... global line broadcast though the critical logic path is one FA"]

Figure 3. The Digit Montgomery Multiplier described in [3] (blocks: digit multiplier 1, digit multiplier 2; output: result)

As described in [3], the starting point of the design is the 4-bit digit multiplier of Figure 4 (i.e., an unfolding factor J of 4). The second step is to add 2J-1 latches (or 7 latches) to the loops of the multiplier. In the next step, the retiming of the structure is carried out using horizontal and then vertical cutset lines, which leads to the digit multiplier of Figure 5 [3]. The notation in Figure 5 is as follows: $x_i^j$ is the jth bit of the ith set of data x. However, although the level of pipelining achieved is such that there are latches at every output of the cells, the connections from the last row of the digit multiplier of Figure 5 to the top row are not local connections. To circumvent this problem, latches were added to the feedback loops. For an unfolding factor of J, J-1 latches (3 latches in Figure 6) are added to each loop so that the upward lines are no longer broadcast to the top cells and, as such, the resulting structure has its connections fully localised. The behaviour of the digit structure changes from one step to another. In Figure 4, the digit multiplier carries out a single operation. In Figure 5, the multiplier interleaves 2J (8 in Figure 5) multiplication operations, while in Figure 6, 3J-1 (11 in Figure 6) operations are time multiplexed onto the digit structure [3].

IV. Modular Exponentiation and RSA Cryptography

In 1977, Rivest, Shamir and Adleman [7] introduced the cryptographic algorithm that has proved to be the most attractive for implementation in hardware or software [8, 11, 16, 18-22]. It is based on the computation of modular exponentiation using prime numbers. As a matter of fact, the security of the algorithm is based on our inability to efficiently factorise large numbers [7]. For the purpose of keeping the data as secure as possible, current implementations of RSA are suggested with keys of 1 Kb and up to 2 Kb word-length [8, 18].

As suggested by [7], the modulus M of the RSA algorithm is the product of two suitably generated secret prime numbers P and Q: M = P × Q. The public exponent (also known as the encryption key) E is randomly chosen such that it is relatively prime to (P-1)×(Q-1). The secret exponent (also known as the decryption key) D is computed using the extended Euclidean algorithm such that:

$$(E \times D)_{(P-1)(Q-1)} = (1)_{(P-1)(Q-1)} \qquad (9)$$
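The key relations of this section, M = P × Q and equation (9), can be illustrated with a short Python sketch. It is only a toy example with hypothetical function names and small textbook primes chosen here for readability; real keys use primes of 512 bits or more and a vetted library.

```python
from math import gcd


def extended_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    return g, y, x - (a // b) * y


def make_keys(p, q, e):
    """Build (M, E, D) from two primes P, Q and a public exponent E."""
    m = p * q
    phi = (p - 1) * (q - 1)
    if gcd(e, phi) != 1:
        raise ValueError("E must be relatively prime to (P-1)(Q-1)")
    _, d, _ = extended_gcd(e, phi)       # E*d + phi*y == 1
    return m, e, d % phi                 # D reduced to the range [0, phi)


m, e, d = make_keys(61, 53, 17)          # toy primes, assumed for illustration
print(m, e, d, (e * d) % ((61 - 1) * (53 - 1)))
```

The last printed value is the left-hand side of equation (9) and should equal 1.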
P and Q must never be revealed. Once used, the two prime numbers P and Q are no longer required and must be discarded. A message is divided into words of the same word-length as the product M. If M is encoded on k bits, an n = km bits message is divided into m blocks of k-bit integers each. A k-bit word A is encrypted into a word Enc(A) defined by $Enc(A) = (A^E)_M$. An encrypted word B is decrypted into a word Dec(B) defined by $Dec(B) = (B^D)_M$.

These two operations, the RSA encryption and decryption, are mutually inverse: Enc(Dec(A)) = Dec(Enc(A)) = A with 0 ≤ A < M. This indicates that the original data can be recovered through the RSA encrypt/decrypt process. The operation of modular exponentiation is carried out iteratively by repeating a modular multiplication and a modular squaring operation, as described in Algorithm 4 for the case of an LSBF method. The algorithm computes B, which is the remainder of the division of $A^E$ by the modulus M. During every iteration, two modular multiplications are carried out, which can be done in parallel or pipelined using the same multiplier structure [7, 8, 22].
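The iteration described above can be sketched in Python as a generic LSB-first (right-to-left) square-and-multiply loop. This is a standard textbook formulation with a hypothetical function name, offered as an assumption of what such a loop looks like rather than as the paper's own listing. The toy key (M = 3233, E = 17, D = 2753) is likewise an assumed example.

```python
def modexp_lsb_first(a, e, m):
    """Compute A**E mod M by scanning the exponent from its LSB."""
    result = 1
    square = a % m
    while e:
        if e & 1:
            result = (result * square) % m   # modular multiplication
        square = (square * square) % m       # modular squaring; it reads only the
        e >>= 1                              # old 'square', so both can run in parallel
    return result


m, e, d = 3233, 17, 2753                     # toy RSA key, assumed for illustration
word = 65
cipher = modexp_lsb_first(word, e, m)        # Enc(A) = (A**E) mod M
plain = modexp_lsb_first(cipher, d, m)       # Dec(B) = (B**D) mod M
print(cipher, plain)
```

Within one iteration the multiplication and the squaring do not depend on each other, which is why they can be mapped onto two multipliers or interleaved onto one.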
Algorithm 4

The modular multiplication and squaring operations are based on Montgomery's algorithm; therefore, the scaling factor $r^2$ is used to neutralize the $2^{-n-2}$ factor introduced by this algorithm. The range of the result of each iteration is that specified in Algorithm 3. Thus no subtraction operation is required at the end of each iteration. At the end of the encryption operation, a multiplication by 1 is required to take away the effect of the $2^{-n-2}$ factor introduced by Algorithm 4. Finally, for the result to fall in the range $[0, M[$, a comparison/subtraction operation is required.

Two classes of RSA structures that implement Algorithm 4 are presented. The first class requires two Montgomery multipliers: the first one is used for modular squaring and the second is used for modular multiplication. The general skeleton of this class of structure is depicted in Figure 7. Registers are used to store the result from the modular squarer in a parallel and a serial way. The parallel register associated with the squarer is used to feed the parallel operand to the squarer, while the serial register is used to feed the serial input to both the squarer and the multiplier. The parallel register associated with the multiplier in Figure 7 is used to store the result from the multiplier, which is used as the parallel input to this multiplier. On the other hand, the second class of RSA structures uses the general skeleton depicted in Figure 8. This class of structures requires only one Montgomery multiplier. However, multiplexers are required to select the right data to be fed to the multiplier, either for the modular multiplication operation or for the modular squaring operation [22].

Figure 7. An RSA Structure that uses two Montgomery Multipliers

Figure 8. An RSA Structure that uses one Montgomery Multiplier (blocks: parallel registers, serial register, multiplier)

The Montgomery multiplier architectures used with these two classes of RSA structures are: a conventional Montgomery multiplier based on Algorithm 1; the Montgomery multiplier of Figure 1; a Montgomery multiplier derived from the structure of Figure 1 by applying the retiming technique (see Figure 9); the Montgomery multiplier of Figure 2; a 2-bit digit Montgomery multiplier derived from the structure of Figure 1 via unfolding (see Figure 10); another 2-bit digit Montgomery multiplier derived from the structure of Figure 1 via unfolding, but with its feedback loops pipelined so that 2 operations are interleaved onto this structure; a 2-bit digit Montgomery multiplier that interleaves two operations and that has been retimed to avoid global line broadcast (see Figure 11.b); a 4-bit digit Montgomery multiplier derived from the structure of Figure 1 via unfolding (with an unfolding factor of 4); and finally another 4-bit digit Montgomery multiplier in which two operations may be interleaved.
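The scaling steps described for Algorithm 4 (pre-scaling by $r^2$, no subtraction between iterations, a final multiplication by 1 and a single comparison/subtraction) can be sketched in Python as follows. This is a hedged illustration and not the paper's Algorithm 4: it assumes $r = 2^{n+2}$, uses hypothetical function names, and models the multiplier arithmetically with Python's `pow(M, -1, r)` (Python 3.8 or later) instead of bit-serially.

```python
def mont(a, b, m, n):
    """A*B*2**-(n+2) mod M, below 2M for inputs below 2M (no final subtraction)."""
    r = 1 << (n + 2)
    t = a * b
    q = ((-t) * pow(m, -1, r)) % r       # makes t + q*m a multiple of r
    return (t + q * m) >> (n + 2)


def mont_exp(a, e, m, n):
    """A**E mod M using only mont() plus one final comparison/subtraction."""
    r2 = pow(1 << (n + 2), 2, m)         # scaling factor r^2 mod M, r = 2**(n+2)
    square = mont(a, r2, m, n)           # A in the Montgomery domain: A*r
    result = mont(1, r2, m, n)           # 1 in the Montgomery domain: r
    while e:
        if e & 1:
            result = mont(result, square, m, n)
        square = mont(square, square, m, n)
        e >>= 1
    result = mont(result, 1, m, n)       # multiplication by 1 removes the factor r
    return result - m if result >= m else result


# Toy modulus (assumed): 3233 is odd and lies in [2**11, 2**12[, so n = 12.
print(mont_exp(65, 17, 3233, 12), pow(65, 17, 3233))
```

The two printed values should agree; the intermediate results are allowed to reach 2M, and only the last line brings the output into the range [0, M[.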
Figure 9. A modified version of the Montgomery Multiplier of Figure 1 after using the retiming technique

The retimed Montgomery multiplier of Figure 1 is depicted in Figure 9. In this structure, global line broadcast is avoided by using n/2 latches/FFs on the serial input line for an n-bit multiplier. The retiming cutset lines are vertical to the serial input of Figure 1 and, as stated by the retiming algorithm [23], for each FF added to the link from a processor A to a processor B, a FF is removed from the link from processor B to processor A. Therefore, the longest path in the multiplier is equivalent to 2 gated FAs (gFAs) and a latch/FF. Applying the unfolding technique with an unfolding factor of 2 to the structure of Figure 1 leads to the Montgomery multiplier depicted in Figure 10. This multiplier is capable of processing 2 bits of the serial input per clock cycle. The clock path of this structure is that of two gFAs and a FF.
Figure 10. A 2-bit Digit Montgomery Multiplier that Implements Algorithm 3

It is clear from Figure 10 that the global input line will have a major drawback on the clock frequency of this structure. Although the logic path is reduced in the multiplier of Figure 10 to 1 gFA and 1 FF by interleaving two operations into this structure, the drawback of the global lines is not resolved. A solution is shown in Figure 11. By retiming the structure of Figure 11.a, the global lines are once again avoided, but at the expense of a longer logic path that consists of 2 gFAs and a latch/FF, as shown in Figure 11.b. By avoiding global line broadcast, the placement and routing tools used to implement the RSA algorithm in FPGA can dedicate more effort to routing the remaining nets of the structures.

Figure 11. (a) A 2-bit Digit Montgomery Multiplier that Interleaves two operations. (b) The same structure after retiming

Among these Montgomery multipliers, those that interleave two modular multiplication operations are used with the RSA structure of Figure 8. The remaining multipliers, which carry out only one multiplication operation, are used with the RSA structure of Figure 7. As will be explained further down in this section, all these architectures may be divided into two groups, depending on whether or not the global line broadcast is avoided in the Montgomery multiplier structures. For the purpose of comparison and analysis, the following proposed structures have been implemented in a Xilinx XC2V8000 device:

Struct 1: RSA architecture of Figure 7 with a conventional Montgomery multiplier as described in Algorithm 2 [3].
Struct 2: RSA structure of Figure 7 which uses the Montgomery multiplier depicted in Figure 1.
Struct 3: RSA structure of Figure 8 which uses the Montgomery multiplier of Figure 2.
Struct 4: RSA architecture of Figure 7 with the multiplier depicted in Figure 9.
Struct 5: RSA architecture of Figure 7 which uses the 2-bit digit Montgomery multiplier of Figure 10.
Struct 6: RSA structure of Figure 8 with the multiplier of Figure 11.b, which interleaves 2 multiplication operations onto the same multiplier.
Struct 7: RSA structure of Figure 7 which uses the 4-bit digit Montgomery multiplier obtained by unfolding the multiplier of Figure 1 (with an unfolding factor of 4).
Struct 8: RSA structure of Figure 8 which uses a 4-bit digit Montgomery multiplier in which two multiplication operations are interleaved.

The implementation results of these 8 structures were obtained using the Xilinx ISE 6.2 package. However, these architectures were not verified with an actual implementation on a chip. The obtained results are compared against similar work in the literature [11, 20]. In [20], an algorithm combining Montgomery's technique and the carry save representation of numbers was proposed. A highly modular bit-slice based architecture has been designed for executing the algorithm in FPGA. The serial RSA architecture in [20] used carry save adders to avoid carry propagation during the addition stages of Montgomery's algorithm. The architectures in [11] are based on the use of carry propagate adders, implemented using fast carry logic resources available in the FPGA chip, with the multiples of A and M' pre-computed and stored in a RAM. A radix-2
and a radix-16 architecture were proposed in [11]. As mentioned previously, the concept of digit-serial arithmetic was proposed as a compromise between bit-serial and bit-parallel arithmetic. By an appropriate selection of the digit size, it provides the designer with the best area and speed that match the needs of the system. Therefore, the aim of the analysis presented in this section is to provide the designer with the necessary knowledge and information for finding the best compromise between the cost of the RSA architecture and its performance. The analysis is based on the effect of changing the digit size and the level of pipelining for a full 1024-bit modular exponentiation (i.e. both the key and the public exponent are 1024 bits long). The evaluation parameters are the frequency, the time required to carry out the modular exponentiation operation, and the area usage in terms of the number of FPGA slices. The proposed structures carry out a modular exponentiation operation in $(2n+4)\,k\,l/d$ clock cycles, where n, k, l, and d are the number of bits in the modulus, the number of bits in the exponent, the level of pipelining, and the digit size, respectively. Therefore, for a full n-bit modular exponentiation operation, $(2n+4)\,n\,l/d$ clock cycles are required. The implementation results of the proposed architectures are shown in Table 1 (the implementation does not include the pre-mapping and post-mapping operations that convert data to and from the residue system). A clear distinction can be made between two classes of structures: on one side Struct 3, Struct 4, and Struct 6, and on the other side the remaining structures. The distinction is based on the working frequencies of the first class of structures, which are much higher than those of the architectures of the latter class. This underlines the benefits of using nearest neighbour communications only, as opposed to the global line distribution that characterises the architectures of the second class. The effect that pipelining has on the critical path of the different structures can clearly be seen if the frequencies of Struct 1, Struct 5, and Struct 7 are compared with those of Struct 2, Struct 6, and Struct 8, respectively. The critical logic path of Struct 1 is 2 FAs and it operates at a frequency of 145 MHz, while the critical logic path of Struct 2 is one FA, and it thus operates at a higher frequency of 151 MHz. The same observation can be made for Struct 5 and Struct 7. Struct 5, whose critical logic path is 2 FAs, operates at a frequency of 112 MHz, whereas Struct 6, which exhibits a critical path of one FA, has a clock frequency of 255 MHz. Struct 7 has a critical path of 4 FAs and operates at a frequency of 78 MHz, whereas Struct 8, whose critical path consists of 2 FAs, has an operating frequency of 84 MHz. In these three examples, the improvement in the clock frequency is very clear in the case of Struct 5 and Struct 6. This is due mainly to the fact that Struct 6 uses nearest neighbour communications only.
The small improvement in the remaining two cases can be explained by the effect of the inherent routing delay, which is more significant than the delay of the logic path. Another point worth mentioning is the effect of using the architectures of Figure 7 and Figure 8. For example, Struct 3 operates at a frequency of 291 MHz while Struct 4 operates at a frequency of 278 MHz. However, Struct 4 carries out a 1024-bit modular exponentiation in 7.55 ms and is thus faster than Struct 3, which requires 14.46 ms to perform the same operation. This is explained by the fact that Struct 3 interleaves two modular multiplication operations into the same Montgomery multiplier. The same can be said about Struct 7 and Struct 8.
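The relation between cycle count, clock frequency, and processing time can be checked with a short calculation. The sketch below assumes the cycle-count expression reads (2n + 4)kl/d and that Struct 3 corresponds to l = 2, d = 1, Struct 4 to l = 1, d = 1, and Struct 7 to l = 1, d = 4. These parameter assignments are inferred from the prose rather than taken from Table 1, and the function names are hypothetical.

```python
def modexp_clock_cycles(n_bits, k_bits, pipelining_level, digit_size):
    """Cycle count (2n + 4) * k * l / d, rounded up to a whole cycle."""
    total = (2 * n_bits + 4) * k_bits * pipelining_level
    return (total + digit_size - 1) // digit_size


def modexp_time_ms(cycles, frequency_mhz):
    """Convert a cycle count to milliseconds at the given clock frequency."""
    return cycles / (frequency_mhz * 1e6) * 1e3


# (name, l, d, frequency in MHz, time in ms quoted in the text).
# The l and d values are assumptions inferred from the prose, not table data.
EXAMPLES = [
    ("Struct 3", 2, 1, 291.0, 14.46),
    ("Struct 4", 1, 1, 278.0, 7.55),
    ("Struct 7", 1, 4, 78.0, 6.78),
]

for name, level, digit, freq_mhz, quoted_ms in EXAMPLES:
    cycles = modexp_clock_cycles(1024, 1024, level, digit)
    estimate = modexp_time_ms(cycles, freq_mhz)
    print(f"{name}: {cycles} cycles, {estimate:.2f} ms (text quotes {quoted_ms} ms)")
```

Under these assumptions the estimates fall within about 1% of the quoted times. The remaining gap is plausibly due to the frequencies in the text being rounded to the nearest MHz.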
An important decision that can be made from reviewing Table 1 is the choice of the appropriate architecture for the application at hand. For instance, Struct 4 requires 7.55 ms to carry out the full modular exponentiation operation, while Struct 7 (which uses 4-bit digit Montgomery multipliers) carries out the same operation in 6.78 ms. Nevertheless, the area usage of Struct 7 is almost twice that of Struct 4. By changing the digit size and the level of pipelining, the designer can build a database of architectures from which to choose the one that best matches the required frequency and area usage.
[Table 1 rows: Frequency, Period, Time, Area]
Table 1. Performances of the proposed architectures
Table 2 compares Struct 4, which is a serial architecture, with two serial architectures described in [11] and [20]. The table also compares Struct 7, which is a 4-bit digit architecture, with a radix-16 architecture proposed in [11]. This comparison is carried out for illustration purposes only, as the work in [11] was implemented in an XC40150XV-8 FPGA while the work in [20] was implemented in a Virtex 2000E-8 chip. Using a more recent FPGA chip would probably lead to better performance. Clearly, the architectures presented in this paper are superior to those in the literature in terms of processing time. Struct 4 shows an 82% and a 73% improvement over the work in [11] and [20], respectively. Struct 7 shows a 43% improvement over the radix-16 architecture in [11].
[Table 2 columns: Struct 4, Struct 7; surviving entries: 11.95, 27.88]
Table 2. A comparison with similar work in the literature
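The quoted improvements can be reproduced as relative reductions in processing time. The sketch below assumes that the two entries that survive from Table 2, 11.95 and 27.88, are processing times in milliseconds for the radix-16 design of [11] and the serial design of [20], respectively. That mapping is an inference from the percentages in the text, not something the text states, and the function name is hypothetical.

```python
def improvement_percent(reference_ms, proposed_ms):
    """Relative reduction in processing time, as a percentage of the reference."""
    return (reference_ms - proposed_ms) / reference_ms * 100.0


# Proposed times (7.55 ms, 6.78 ms) are quoted in the text.
# The reference times are ASSUMED to be the surviving Table 2 entries.
struct4_vs_ref20 = improvement_percent(27.88, 7.55)
struct7_vs_ref11_radix16 = improvement_percent(11.95, 6.78)

print(f"Struct 4 vs [20]: {struct4_vs_ref20:.1f}%")
print(f"Struct 7 vs radix-16 in [11]: {struct7_vs_ref11_radix16:.1f}%")
```

Rounded to whole percentages, these values agree with the 73% and 43% figures quoted above, which supports the assumed mapping without confirming it.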
V. Conclusion
The modular exponentiation operation is the core of the RSA algorithm, which has become the most widely used public key computer security algorithm. New architectures for the implementation of Montgomery modular exponentiation for RSA have been proposed. The new architectures use a modified Montgomery algorithm in which the operations of modular multiplication and modular reduction are carried out separately but in parallel. To investigate the best trade-off between area usage and time, the digit approach was adopted. The problem of having global lines in the architectures has been circumvented by interleaving more than one instance into the same digit multiplier, which was achieved by pipelining the feedback loops and retiming the whole structures. The implementation results have shown that by pipelining and increasing the digit size of the structures, the global line broadcast is avoided and more bits are processed at each clock cycle, which in turn has led to better performance in terms of processing time. However, this was achieved at the price of extra area usage. This provides designers with the best trade-off between speed, area usage, and throughput rate.
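As a point of reference for the operation summarised above, the following is a minimal software sketch of Montgomery modular exponentiation. It uses the textbook bit-serial (radix-2) Montgomery product with interleaved reduction and a final conditional subtraction, not the parallel multiplication and reduction scheme of the proposed architectures. The function names and the choice R = 2^n are assumptions made for illustration only.

```python
def montgomery_multiply(a, b, modulus, n_bits):
    """Textbook radix-2 Montgomery product: a * b * 2^(-n_bits) mod modulus."""
    acc = 0
    for i in range(n_bits):
        acc += ((a >> i) & 1) * b   # add the partial product for bit i of a
        if acc & 1:                 # make acc even by adding the odd modulus
            acc += modulus
        acc >>= 1                   # exact division by 2
    if acc >= modulus:              # final conditional subtraction
        acc -= modulus
    return acc


def montgomery_modexp(base, exponent, modulus):
    """Left-to-right square-and-multiply built on the Montgomery product."""
    if modulus < 3 or modulus % 2 == 0:
        raise ValueError("modulus must be odd and at least 3")
    n_bits = modulus.bit_length()
    r = 1 << n_bits
    r_squared = (r * r) % modulus
    # Pre-mapping: convert the base into the residue system (base * R mod M).
    x = montgomery_multiply(base % modulus, r_squared, modulus, n_bits)
    result = r % modulus            # residue-system representation of 1
    for i in reversed(range(exponent.bit_length())):
        result = montgomery_multiply(result, result, modulus, n_bits)
        if (exponent >> i) & 1:
            result = montgomery_multiply(result, x, modulus, n_bits)
    # Post-mapping: multiply by 1 to leave the residue system.
    return montgomery_multiply(result, 1, modulus, n_bits)


# Cross-check against Python's built-in modular exponentiation.
print(montgomery_modexp(4, 13, 497) == pow(4, 13, 497))
```

The pre-mapping and post-mapping steps in this sketch correspond to the conversions to and from the residue system that the reported implementation results exclude.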
VI. References
[1] O. Nibouche and M. Nibouche, "On Designing Digit Multipliers", Proceedings of the 9th International IEEE Conference on Electronics, Circuits, and Systems (ICECS), Dubrovnik, 2002.
[2] O. Nibouche, A. Bouridane, and M. Nibouche, "New Iterative Algorithms and Architectures of Modular Multiplication for Cryptography", Proceedings of the 8th International IEEE Conference on Electronics, Circuits, and Systems (ICECS), Malta, 2001.
[3] O. Nibouche, A. Bouridane, and M. Nibouche, "Architectures for Montgomery's multiplication", IEE Proceedings, Computers and Digital Techniques, Vol. 150, Issue 06, pp. 361-368, November 2003.
[4] W. L. Freking and K. K. Parhi, "Performance-scalable array architectures for modular multiplication", Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors, pp. 149-160, IEEE, 2000.
[5] W. L. Freking and K. K. Parhi, "Ring-Planarized Cylindrical Arrays with Application to Modular Multiplication", IEEE Workshop on Signal Processing Systems Design & Implementation (SiPS 2000), Lafayette, Louisiana, USA, pp. 497-506.
[6] W. Diffie and M. E. Hellman, "New directions in cryptography", IEEE Transactions on Information Theory, 22:644-654, 1976.
[7] R. L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems", Communications of the ACM, 21(2):120-126, 1978.
[8] M. Shand and J. Vuillemin, "Fast implementation of RSA cryptography", Proceedings of the 11th IEEE Symposium on Computer Arithmetic, 1993.
[9] P. L. Montgomery, "Modular multiplication without trial division", Math. Comp., vol. 44, no. 170, pp. 519-521, 1985.
[10] T. Blum and C. Paar, "Montgomery modular exponentiation on reconfigurable hardware", Proceedings of the 14th IEEE Symposium on Computer Arithmetic, pp. 70-77, Adelaide, Australia, April 14-16, 1999.
[11] T. Blum and C. Paar, "High-radix Montgomery modular exponentiation on reconfigurable hardware", IEEE Transactions on Computers, 50(7):759-764, July 2001.
[12] A. F. Tenca and Ç. K. Koç, "A scalable architecture for Montgomery multiplication", in Ç. K. Koç and C. Paar, editors, Proceedings of Cryptographic Hardware and Embedded Systems (CHES 1999), number 1717 in Lecture Notes in Computer Science, pp. 94-108, Springer-Verlag, 1999.
[13] H. Orup, "Simplifying quotient determination in high-radix modular multiplication", Proceedings of the 12th Symposium on Computer Arithmetic, pp. 193-199, IEEE, 1995.
[14] S. E. Eldridge and C. D. Walter, "Hardware implementation of Montgomery's modular multiplication algorithm", IEEE Transactions on Computers, 42:693-699, 1993.
[15] P. Kornerup, "A systolic, linear-array multiplier for a class of right-shift algorithms", IEEE Transactions on Computers, 43(8):892-898, August 1994.
[16] J. Goodman and A. P. Chandrakasan, "An Energy-Efficient Reconfigurable Public Key Cryptography Coprocessor", IEEE Journal of Solid-State Circuits, vol. 36, no. 11, pp. 1808-1820, November 2001.
[17] R. Anderson and M. Kuhn, "Tamper Resistance - a Cautionary Note", The Second USENIX Workshop on Electronic Commerce Proceedings, Oakland, California, November 18-21, 1996, pp. 1-11.
[18] C. McIvor, M. McLoone, J. McCanny, A. Daly, and W. Marnane, "Fast Montgomery Modular Multiplication and RSA Cryptographic Processor Architectures", 37th Asilomar Conference on Signals, Systems, and Computers, Nov 2003.
[19] A. Mazzeo, L. Romano, G. P. Saggese, and N. Mazzocca, "FPGA Based Implementation of a Serial RSA Processor", Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE'03), March 03-07, 2003, Munich, Germany, pp. 10582-10589.
[20] A. Cilardo, A. Mazzeo, L. Romano, and G. P. Saggese, "Carry-Save Montgomery Modular Exponentiation on Reconfigurable Hardware", Proceedings of the Design, Automation and Test in Europe Conference and Exhibition Designers' Forum (DATE'04).
[21] S. B. Örs, L. Batina, B. Preneel, and J. Vandewalle, "Hardware Implementation of a Montgomery Modular Multiplier in a Systolic Array", Proceedings of the 10th Reconfigurable Architectures Workshop (RAW 2003), Nice, France, April 2003.
[22] J. Sheu, M. Shieh, C. Wu, and M. Sheu, "A Pipelined architecture of fast modular multiplication for RSA cryptography", Proc. IEEE ISCAS'98, vol. 2, pp. 117-180, 1998.
[23] C. E. Leiserson and J. B. Saxe, "Retiming synchronous circuitry", Algorithmica, vol. 6, pp. 5-35, 1991.