Ahmad 2005
www.elsevier.com/locate/compeleceng
Received 2 May 2005; received in revised form 1 June 2005; accepted 14 July 2005
Available online 18 October 2005
Abstract
Hash functions are common and important cryptographic primitives that are critical for the data
integrity assurance and data origin authentication security services. Field programmable gate arrays
(FPGAs), being reconfigurable, flexible and physically secure, are a natural choice for implementing
hash functions in a broad range of applications with different area-performance requirements. In this
paper, we explore alternative architectures for implementing the hash algorithms of the secure hash
standards SHA-256 and SHA-512 on FPGAs and study their area-performance trade-offs. As several 64-bit
adders are needed in SHA-512 hash value computation, the new architectures proposed in this paper
implement modulo-64 addition as modulo-32, modulo-16 and modulo-8 additions with a view to reducing
the chip area. The SHA-512 hash function is implemented in different FPGA families of ALTERA to compare
performance metrics such as area, memory, latency, clocking frequency and throughput, and to guide a
designer in selecting the most suitable FPGA for an application. In addition, a common architecture is
designed for implementing the SHA-256 and SHA-512 algorithms.
© 2005 Elsevier Ltd. All rights reserved.
This research is funded by Kuwait University Grant EO 04/03.
Corresponding author. Tel.: +965 4985849; fax: +965 4839461.
E-mail address: [email protected] (I. Ahmad).
0045-7906/$ - see front matter © 2005 Elsevier Ltd. All rights reserved.
doi:10.1016/j.compeleceng.2005.07.001
346 I. Ahmad, A. Shoba Das / Computers and Electrical Engineering 31 (2005) 345–360
1. Introduction
Data integrity assurance and data origin authentication are essential security services in financial
transactions, electronic commerce, electronic mail, software distribution, data storage and so on.
Cryptographic hash functions are used to achieve these security services. The purpose of a hash
function is to produce a "fingerprint" of a file, message, or other block of data. A hash value h is
generated by a function H of the form h = H(M), where M is a variable-length message and H(M) is the
fixed-length hash value. In a cryptographic hash function, a message of arbitrary length is padded,
broken into blocks and input sequentially to a compression function, which converts a fixed-length
input (the current message block) to a fixed-length output (the hash value). The hash values of
individual blocks are used iteratively by the compression function to find the final hash value,
referred to as the message digest. A hash function provides a relationship between the input message
and the hash value that is unique for practical purposes and hence represents a longer message in a
concise way. Therefore, computing a digital signature for a large document (message) can be replaced
by applying cryptographic processing to the document's hash value, which is much smaller than the
document [1]. Other popular applications of hash functions include digital signature schemes in
public-key cryptosystems recommended in RFC 2437 [19], password storage and verification specified in
RFC 2289 [20], and pseudo-random number generation. The hash function is also the building block of
secret-key message authentication codes (MACs) [2] used in two popular security protocols, namely,
secure sockets layer (SSL) provided in RFC 2246 [21] and IPSecurity mentioned in RFC 2404 [3].
A cryptographic hash function should be hard to invert, i.e., given a hash value h, it should be
computationally infeasible to find some input M such that H(M) = h, and collision-free, i.e., it
should be computationally infeasible to find two distinct messages M1 and M2 such that
H(M1) = H(M2). Most of the secure hash functions in use today have an iterative structure. The
motivation for the iterative structure stems from the fact that the compression function, which
generates a hash value for the current message block using the hash value of the preceding block,
actually combines two or more inputs to produce an output where each output bit is a different
complex non-linear function of all the input bits. This makes the resultant hash function collision
resistant. High performance cryptographic hardware systems typically require an extra module for
hash function calculation to reduce the workload of the main microprocessor.
Hash functions which have a "dedicated" design are fast and have a considerable advantage over
other algorithms which are based on block ciphers. Dedicated hash functions suitable for both
software and hardware implementation have been proposed and are now widely used in real world
applications. Some of the most widely used dedicated hash functions in real applications are the
message-digest algorithm MD5 [4] and the secure hash algorithm SHA-1 [5]. The complexity of the best
attack on SHA-1 is $2^{80}$ and it no longer matches the security guaranteed by the new secret
key encryption standard, AES (Advanced Encryption Standard), which uses one key for both
encryption and decryption with key sizes of 128, 192 and 256 bits [6]. Therefore, three new hash
functions (SHA-256, SHA-384 and SHA-512), referred to as SHA-2, with security matching that of AES
and complexity of the best attack of $2^{128}$, $2^{192}$ and $2^{256}$, respectively, have
been announced by the National Institute of Standards and Technology (NIST) [7]. More recently,
a new standard, SHA-224, has been announced by NIST [8]. The functional characteristics of the SHA
functions are different and are presented in Table 1.
Table 1
Comparison of functional characteristics of hash functions

Hash function                  SHA-1    SHA-224  SHA-256  SHA-384  SHA-512
Size of hash value (bits)      160      224      256      384      512
Complexity of the best attack  2^80     2^112    2^128    2^192    2^256
Message size (bits)            <2^64    <2^64    <2^64    <2^128   <2^128
Message block size (bits)      512      512      512      1024     1024
Word size (bits)               32       32       32       64       64
Number of words                5        8        8        8        8
Number of digest rounds        80       64       64       80       80
Number of constants            4        64       64       80       80
Round-dependent operations     f_t      None     None     None     None
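
The digest and block sizes of Table 1 can be cross-checked in software. The sketch below queries Python's standard hashlib module; it is only a sanity check and not part of the hardware designs discussed in this paper, and the names TABLE1 and sizes_in_bits are illustrative assumptions.

```python
import hashlib

# (digest size, message block size) in bits, as listed in Table 1
TABLE1 = {
    "sha1": (160, 512),
    "sha224": (224, 512),
    "sha256": (256, 512),
    "sha384": (384, 1024),
    "sha512": (512, 1024),
}

def sizes_in_bits(name):
    """Return (digest size, block size) in bits as reported by hashlib."""
    h = hashlib.new(name)
    return h.digest_size * 8, h.block_size * 8

for name, expected in TABLE1.items():
    print(name, sizes_in_bits(name), "Table 1:", expected)
```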
A hardware implementation of a cryptographic hash function is inherently more physically secure,
as it is physically separate from the main processor, and it has higher performance than a software
implementation. Moreover, reconfigurable hardware devices such as field programmable gate arrays
(FPGAs) are well suited for implementation of cryptographic hash functions as they are flexible and
easily upgradeable. In the implementation of a hash function on FPGAs, area and performance are two
of the most important design criteria. Hash functions have a broad range of applications and hence
their area-performance requirements may differ from one application to another. In some
applications such as smart cards, area is of concern, whereas in storage area networks (SANs) and
virtual private networks (VPNs) performance is critical. Some other applications, such as digital
video recorders, require an optimization of the performance/area ratio. Therefore, different
architectures can be used for SHA function implementation and it is necessary to evaluate
alternative architectures on the basis of their area-performance characteristics.
In cryptographic hash functions a common sequence of operations is called a digest round, and
the compression function produces a hash value by subjecting a block of the message to several digest
rounds. The number of digest rounds differs among the SHA functions, as shown in Table 1. In
many applications an improvement in the performance of these basic cryptographic primitives is
directly reflected in an overall improvement of the system performance. Among the operations of a
digest round in SHA-2 functions, the addition of several operands occupies a major part of the chip
area when implemented in FPGAs. Multi-operand addition also dictates the critical path delay in the
computation of the hash value [17]. Hence, we focused on the design of adders while proposing new
architectures. This paper deals with three issues, namely, proposing different architectures for
implementation of a hash function on an FPGA, comparing the performance metrics of different FPGAs
that implement a SHA-2 function, and single chip implementation of SHA-2 family hash functions. As
the performance metrics of FPGAs of different families, even from the same manufacturer, are not
identical, an evaluation of FPGAs on the basis of performance metrics helps in selecting an
appropriate FPGA to suit an application. Moreover, since the hash functions of the SHA-2 family have
identical operations in a digest round, we were motivated to design a common architecture for these
functions.
The remainder of the paper is organized as follows. An overview of previous work is given in
Section 2. The prelude in Section 3 discusses the SHA-256 and SHA-512 algorithms. In Section 4, the
philosophy behind the reduced word length implementation of the SHA-512 function and the single chip
implementation of SHA-256 and SHA-512 is explained, and in Section 5 the implementation is detailed.
Results are discussed in Section 6 and the paper is concluded in Section 7.
2. Previous work
Many studies have been done on the implementation of cryptographic hash functions [9–17].
Bosselaers et al. [9,10] reported software performance evaluations of the message-digest algorithm
MD5 and the secure hash standard algorithm SHA-1 on a Pentium processor. Nakajima and Matsui [11]
reported a software performance analysis of the newly proposed hash function SHA-512 on a Pentium III
processor. Hash function applications demand hardware implementations to meet the performance
requirements of high-speed networks. Dominikus [12] reported an FPGA implementation of the MD5 hash
algorithm. McLoone and McCanny proposed a single-chip FPGA solution for SHA-384 and SHA-512 [13].
Kang et al. [14] reported the implementation of MD5 and SHA-1 on an Altera FPGA. Grembowski et
al. [15] reported a comparative analysis of the hardware implementation of SHA-1 and the newly
proposed hash function SHA-512 on a Xilinx Virtex FPGA. A common architecture for implementation of
the SHA-2 family is reported in [16]. An elegant application specific integrated circuit (ASIC)
implementation of SHA-512 that makes use of delay balancing and pipelining was recently reported by
Dadda and Macchetti [17].
To the best of our knowledge, a study of area-performance metrics in the FPGA implementation
of SHA-2 functions to suit different applications has not been done so far. In a digest round
of SHA-512, several 64-bit operands are added and logic operations are performed on them. This
paper explores alternative architectures for SHA-512 implementation on an FPGA using 8-, 16- and
32-bit adder/logic circuits and compares their area-performance trade-offs. Performance metrics
such as area, memory, latency, clocking frequency and throughput of FPGAs of different families
of ALTERA are evaluated for the implementation of SHA-512, and finally a single chip
implementation of SHA-256 and SHA-512 with 32-bit adders is presented.
3. Prelude
In this section, the SHA-256 and SHA-512 algorithms are discussed in detail. When a message of
any length $<2^{64}$ bits (for SHA-256) or $<2^{128}$ bits (for SHA-512) is input, the hash
functions SHA-256 and SHA-512 compute a condensed representation of the message, referred to as the
message digest. The message digests generated by SHA-256 and SHA-512 are 256 and 512 bits long,
respectively. The algorithm for generating the message digest is identical for SHA-256 and SHA-512;
only the constants and functions used differ. Hence, in this section SHA-256 and SHA-512 are
discussed together. The procedure consists of two stages, namely, preprocessing and hash
computation. In the preprocessing stage, the message is padded and parsed into m-bit blocks, and the
initialization values to be used in the hash computation are set. A Message Scheduler (MS) divides
the m-bit block into 16 words and prepares a message schedule by passing one word at a time. A
series of hash values is generated iteratively from functions, constants, and word operations, and
the final hash value is the message digest. The message digest generation technique is shown in
Fig. 1.

[Fig. 1. Message digest generation. Blocks: Message Padder, Message Scheduler (output W_t), ROM
(output K_t), Iterative Processing Unit (a_in–h_in, a_out–h_out), Hash constants (H_0^i–H_7^i),
Modulo Adder (a_i–h_i, H_0^{i+1}–H_7^{i+1}), Message Digest.]

The operations performed in the two stages are listed below:
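Preprocessing:
• The message is padded and parsed into N blocks of m bits (m = 512 for SHA-256 and m = 1024 for
SHA-512).
• The initial hash values $H_0^0$ to $H_7^0$ to be used in the hash computation are set.
Hash computation:
• The message blocks $B^i$ are processed in order. A word (32 bits or 64 bits wide) of a message
block $B^i$ is referred to as $B_t^i$, and there are 16 such words in a block.
• For each message block i in the range 1 to N, starting from the message schedule $W_t$, the
following steps (1–4) are repeated to compute the hash values $H_0^i$ to $H_7^i$ for the ith block.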
Step 1: $W_t$ is computed by an identical procedure for SHA-256 and SHA-512; only the logic
functions $\sigma_0$ and $\sigma_1$ are different.

SHA-256:
\[
W_t =
\begin{cases}
B_t^i, & 0 \le t \le 15 \\
\sigma_1^{256}(W_{t-2}) + W_{t-7} + \sigma_0^{256}(W_{t-15}) + W_{t-16}, & 16 \le t \le 63
\end{cases}
\]
where
\[
\sigma_1^{256}(x) = ROTR^{17}(x) \oplus ROTR^{19}(x) \oplus SHR^{10}(x)
\]
\[
\sigma_0^{256}(x) = ROTR^{7}(x) \oplus ROTR^{18}(x) \oplus SHR^{3}(x)
\]
SHA-512:
\[
W_t =
\begin{cases}
B_t^i, & 0 \le t \le 15 \\
\sigma_1^{512}(W_{t-2}) + W_{t-7} + \sigma_0^{512}(W_{t-15}) + W_{t-16}, & 16 \le t \le 79
\end{cases}
\]
where
\[
\sigma_1^{512}(x) = ROTR^{19}(x) \oplus ROTR^{61}(x) \oplus SHR^{6}(x)
\]
\[
\sigma_0^{512}(x) = ROTR^{1}(x) \oplus ROTR^{8}(x) \oplus SHR^{7}(x)
\]
$ROTR^n(x)$ is a circular rotation of a variable x by n positions to the right and $SHR^n(x)$ is a
shift of a variable x by n positions to the right.
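
The message schedule of Step 1 can be sketched in software as follows. This is a minimal Python model under the definitions above, not the hardware message scheduler of this paper; the function and table names (rotr, small_sigma, message_schedule, SIGMA0, SIGMA1, ROUNDS) are illustrative assumptions, and the word size w selects SHA-256 (w = 32) or SHA-512 (w = 64).

```python
def rotr(x, n, w):
    """Rotate the w-bit word x right by n positions."""
    mask = (1 << w) - 1
    return ((x >> n) | (x << (w - n))) & mask

# (rotation, rotation, shift) amounts of sigma0 and sigma1, keyed by word size
SIGMA0 = {32: (7, 18, 3), 64: (1, 8, 7)}
SIGMA1 = {32: (17, 19, 10), 64: (19, 61, 6)}
ROUNDS = {32: 64, 64: 80}

def small_sigma(x, amounts, w):
    """ROTR^r1(x) xor ROTR^r2(x) xor SHR^s(x)."""
    r1, r2, s = amounts
    return rotr(x, r1, w) ^ rotr(x, r2, w) ^ (x >> s)

def message_schedule(block_words, w):
    """Expand the 16 words of one block into 64 (w=32) or 80 (w=64) words W_t."""
    mask = (1 << w) - 1
    W = list(block_words)
    for t in range(16, ROUNDS[w]):
        W.append((small_sigma(W[t - 2], SIGMA1[w], w) + W[t - 7]
                  + small_sigma(W[t - 15], SIGMA0[w], w) + W[t - 16]) & mask)
    return W

print(len(message_schedule([0] * 16, 32)), len(message_schedule([0] * 16, 64)))
```

The additions are reduced with a mask, which models the modulo-32 or modulo-64 adders; each new word needs a four-operand addition, which is the cost examined in the later sections.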
The block diagram of SHA-256/SHA-512 algorithm is shown in Fig. 2.
Step 2: The hash values $H_0^{i-1}$ to $H_7^{i-1}$ are assigned to the variables a, b, c, d, e, f,
g, h. The eight initial hash values, which are 32 or 64 bits wide, are shown in Table 2.
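[Fig. 2. Block diagram of the SHA-256/SHA-512 algorithm: the padded message feeds a 16-word
message scheduler (registers 0–15, with σ0 and σ1) that produces W_t; the processing unit combines
W_t, K_t, Σ0, Σ1, Maj(a, b, c) and Ch(e, f, g) to form T1 and T1 + T2, which update the registers
a–h.]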
Table 2
Initial hash values of SHA-256 and SHA-512

Hash value    SHA-256     SHA-512
H_0^0 → a     6a09e667    6a09e667 f3bcc908
H_1^0 → b     bb67ae85    bb67ae85 84caa73b
H_2^0 → c     3c6ef372    3c6ef372 fe94f82b
H_3^0 → d     a54ff53a    a54ff53a 5f1d36f1
H_4^0 → e     510e527f    510e527f ade682d1
H_5^0 → f     9b05688c    9b05688c 2b3e6c1f
H_6^0 → g     1f83d9ab    1f83d9ab fb41bd6b
H_7^0 → h     5be0cd19    5be0cd19 137e2179
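\[
Ch(x, y, z) = (x \wedge y) \oplus (\neg x \wedge z)
\]
\[
Maj(x, y, z) = (x \wedge y) \oplus (x \wedge z) \oplus (y \wedge z)
\]
SHA-256:
\[
\Sigma_0(x) = ROTR^{2}(x) \oplus ROTR^{13}(x) \oplus ROTR^{22}(x)
\]
\[
\Sigma_1(x) = ROTR^{6}(x) \oplus ROTR^{11}(x) \oplus ROTR^{25}(x)
\]
SHA-512:
\[
\Sigma_0(x) = ROTR^{28}(x) \oplus ROTR^{34}(x) \oplus ROTR^{39}(x)
\]
\[
\Sigma_1(x) = ROTR^{14}(x) \oplus ROTR^{18}(x) \oplus ROTR^{41}(x)
\]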
Step 3: The processing unit performs this step 64 or 80 times on a 512- or 1024-bit block.
\[
\begin{aligned}
T_1 &= h + \Sigma_1(e) + Ch(e, f, g) + K_t + W_t \\
T_2 &= \Sigma_0(a) + Maj(a, b, c) \\
h &= g \\
g &= f \\
f &= e \\
e &= d + T_1 \\
d &= c \\
c &= b \\
b &= a \\
a &= T_1 + T_2
\end{aligned}
\]
Variables used in the above equations refer to respective values for SHA-256 and SHA-512.
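
One digest round of Step 3 can be modelled as below. This is a hedged Python sketch of the equations above rather than the iterative processing unit itself; the names digest_round, big_sigma, BIG_SIGMA0 and BIG_SIGMA1 are illustrative assumptions, and w is the word size (32 for SHA-256, 64 for SHA-512).

```python
def rotr(x, n, w):
    """Rotate the w-bit word x right by n positions."""
    mask = (1 << w) - 1
    return ((x >> n) | (x << (w - n))) & mask

# Rotation amounts of Sigma0 and Sigma1, keyed by word size
BIG_SIGMA0 = {32: (2, 13, 22), 64: (28, 34, 39)}
BIG_SIGMA1 = {32: (6, 11, 25), 64: (14, 18, 41)}

def ch(x, y, z, w):
    """Ch(x, y, z) = (x and y) xor (not x and z), on w-bit words."""
    return ((x & y) ^ (~x & z)) & ((1 << w) - 1)

def maj(x, y, z):
    """Maj(x, y, z) = (x and y) xor (x and z) xor (y and z)."""
    return (x & y) ^ (x & z) ^ (y & z)

def big_sigma(x, amounts, w):
    r1, r2, r3 = amounts
    return rotr(x, r1, w) ^ rotr(x, r2, w) ^ rotr(x, r3, w)

def digest_round(state, k_t, w_t, w):
    """Apply one digest round to the working variables (a, ..., h)."""
    a, b, c, d, e, f, g, h = state
    mask = (1 << w) - 1
    t1 = (h + big_sigma(e, BIG_SIGMA1[w], w) + ch(e, f, g, w) + k_t + w_t) & mask
    t2 = (big_sigma(a, BIG_SIGMA0[w], w) + maj(a, b, c)) & mask
    # New values of (a, b, c, d, e, f, g, h)
    return ((t1 + t2) & mask, a, b, c, (d + t1) & mask, e, f, g)

print(digest_round((1, 2, 3, 4, 5, 6, 7, 8), 0, 0, 32)[1:4])
```

Only a and e receive newly computed values in a round; the other six variables are shifted copies, which is why the adders that form T1 and T1 + T2 dominate the critical path.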
Step 4: The ith intermediate hash values $H_0^i$ to $H_7^i$ are computed by modulo-32 or modulo-64
bit adders after the iterations.
\[
H_0^i = a + H_0^{i-1}, \quad H_1^i = b + H_1^{i-1}, \quad H_2^i = c + H_2^{i-1}, \quad H_3^i = d + H_3^{i-1}
\]
\[
H_4^i = e + H_4^{i-1}, \quad H_5^i = f + H_5^{i-1}, \quad H_6^i = g + H_6^{i-1}, \quad H_7^i = h + H_7^{i-1}
\]
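
Steps 1 to 4 can be put together in a short software model and compared with a reference implementation. The sketch below is an assumption-laden illustration, not the design of this paper: it follows the padding rule and the definition of the constants given in the secure hash standard [7] (the constants are the leading bits of the fractional parts of the square and cube roots of the first primes), and the names sha2, icbrt, primes and PARAMS are hypothetical. The same routine serves SHA-256 (w = 32) and SHA-512 (w = 64), which reflects the observation that only the constants and functions differ.

```python
import hashlib
from math import isqrt

def icbrt(n):
    """Floor of the cube root of a non-negative integer, by binary search."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

def primes(count):
    """First `count` primes by trial division."""
    found, n = [], 2
    while len(found) < count:
        if all(n % p for p in found):
            found.append(n)
        n += 1
    return found

def rotr(x, n, w):
    return ((x >> n) | (x << (w - n))) & ((1 << w) - 1)

# word size -> (rounds, Sigma0, Sigma1, sigma0, sigma1); third sigma entry is a shift
PARAMS = {
    32: (64, (2, 13, 22), (6, 11, 25), (7, 18, 3), (17, 19, 10)),
    64: (80, (28, 34, 39), (14, 18, 41), (1, 8, 7), (19, 61, 6)),
}

def sha2(message, w):
    """SHA-256 (w=32) or SHA-512 (w=64) digest of a bytes object."""
    rounds, S0, S1, s0, s1 = PARAMS[w]
    mask = (1 << w) - 1
    H = [isqrt(p << (2 * w)) & mask for p in primes(8)]        # initial hash values
    K = [icbrt(p << (3 * w)) & mask for p in primes(rounds)]   # round constants
    # Preprocessing: append bit '1', zero fill, then the length in a 2-word field
    block_bytes, word_bytes = 16 * w // 8, w // 8
    padded = message + b"\x80"
    padded += b"\x00" * (-(len(padded) + 2 * word_bytes) % block_bytes)
    padded += (8 * len(message)).to_bytes(2 * word_bytes, "big")
    for i in range(0, len(padded), block_bytes):
        # Step 1: message schedule
        W = [int.from_bytes(padded[i + j:i + j + word_bytes], "big")
             for j in range(0, block_bytes, word_bytes)]
        for t in range(16, rounds):
            x, y = W[t - 15], W[t - 2]
            sig0 = rotr(x, s0[0], w) ^ rotr(x, s0[1], w) ^ (x >> s0[2])
            sig1 = rotr(y, s1[0], w) ^ rotr(y, s1[1], w) ^ (y >> s1[2])
            W.append((sig1 + W[t - 7] + sig0 + W[t - 16]) & mask)
        # Step 2: load the working variables
        a, b, c, d, e, f, g, h = H
        # Step 3: digest rounds
        for t in range(rounds):
            big1 = rotr(e, S1[0], w) ^ rotr(e, S1[1], w) ^ rotr(e, S1[2], w)
            big0 = rotr(a, S0[0], w) ^ rotr(a, S0[1], w) ^ rotr(a, S0[2], w)
            t1 = (h + big1 + ((e & f) ^ (~e & g)) + K[t] + W[t]) & mask
            t2 = (big0 + ((a & b) ^ (a & c) ^ (b & c))) & mask
            a, b, c, d, e, f, g, h = (t1 + t2) & mask, a, b, c, (d + t1) & mask, e, f, g
        # Step 4: intermediate hash value for this block
        H = [(v + prev) & mask for v, prev in zip((a, b, c, d, e, f, g, h), H)]
    return b"".join(v.to_bytes(word_bytes, "big") for v in H)

print(sha2(b"abc", 32) == hashlib.sha256(b"abc").digest())
print(sha2(b"abc", 64) == hashlib.sha512(b"abc").digest())
```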
FPGAs are well suited for implementation of cryptographic hash functions as they meet the
speed requirements and are reconfigurable. It is clear from Section 3 that the message schedule
$W_t = \sigma_1^{256}(W_{t-2}) + W_{t-7} + \sigma_0^{256}(W_{t-15}) + W_{t-16}$ requires four
operand addition and the intermediate value $a = T_1 + T_2$, where
$T_1 = h + \Sigma_1(e) + Ch(e, f, g) + K_t + W_t$ and $T_2 = \Sigma_0(a) + Maj(a, b, c)$,
requires six operand addition. The ith intermediate hash values $H_0^i$ to $H_7^i$ are computed by
modulo-64 bit adders after the iterations and hence an additional eight modulo-64 adders are
required to find the final hash value for the block. Multi-operand addition is the most problematic
part in the implementation of hash functions. Hence, reducing the size of these adders will reduce
the number of logic elements required in the FPGA, thereby reducing the overall area of the final
circuit.
The implementation of multi-operand 64-bit adders on FPGAs demands the selection of a proper
scheme for performing the addition, as both speed and area are of concern. It is well known that
carry look-ahead adders (CLAs) are faster than conventional carry propagate adders, but carry save
adders (CSAs), referred to as redundant adders, are faster and have a smaller area than a CLA. For
an n-bit adder with each module handling m bits, the delay is proportional to $\log_m n$ for a
one-level CLA, whereas redundant adders have a constant delay [18]. Assuming a complexity of $k_m$
for implementing a 1-bit module, the area of a CLA has a complexity proportional to $k_m n$ and a
redundant adder has an area proportional to n. In a CSA, an array of full adders (FAA) is used to
add three binary vectors without propagating the carries, and two binary vectors, the pseudo-sum
and the carry, are generated. As the carry output of the ith full adder has a weight i + 1, the
carry of bit 0 is $vc_0 = 0$. Hence, the carry-in ($c_{in}$) can be included in the place of $vc_0$
as shown in Fig. 3.
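
The behaviour of a carry save adder can be illustrated with a few bitwise operations. The following Python sketch is a behavioural model only, written under the description above; the function name carry_save_add and its return convention are assumptions made for illustration.

```python
def carry_save_add(x, y, z, n, cin=0):
    """3:2 compression of three n-bit vectors without carry propagation.

    Returns (vs, vc, cout): the pseudo-sum, the pseudo-carry with cin placed
    in bit 0, and the carry leaving the most significant full adder.
    """
    mask = (1 << n) - 1
    vs = (x ^ y ^ z) & mask                  # sum output of each full adder
    carries = (x & y) | (x & z) | (y & z)    # carry output of each full adder
    vc = ((carries << 1) | cin) & mask       # carry of bit i has weight i + 1
    cout = (carries >> (n - 1)) & 1
    return vs, vc, cout

vs, vc, cout = carry_save_add(0b0010, 0b1001, 0b1010, 4)
print(format(vs, "04b"), format(vc, "04b"), cout)
```

Every output bit depends on only three input bits, so the delay does not grow with n; a carry propagating adder (a CLA in this paper) is needed only once, to add the final pseudo-sum and carry.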
[Fig. 3. Carry save adder: an n-bit CSA with inputs X, Y, Z and carry-in c_in producing the vectors
vc and vs and the carry-out c_out, and the bit-0 full adder (FA) with inputs x_0, y_0, z_0 producing
vc_1 and vs_0, with c_in taking the place of vc_0.]

Several schemes exist for the implementation of multi-operand addition. Using a network of
full adders, p operands each n bits wide can be added using an array of [p:2] adders, referred
to as reduction by rows, or using an array of (p:q] counters, referred to as reduction by columns.
The arrays can be linear or tree arrays and the same number of adders is used by both schemes.
The reduction by rows technique involves an array of full adders and a CLA in the last stage to
add the final pseudo-sum and pseudo-carry. For large n, the number of groups in a one-level CLA
will be large, resulting in slow operation. If multiple levels are used, the maximum number of
levels is $L = \log_m n$, the number of modules $N_{max}$ with the maximum number of levels is
$(n - 1)/(m - 1)$ and the delay is proportional to $2 \log_m n$ [18]. It can be seen that the
selection of the group size m is an important factor, which in turn affects the delay and the
number of modules. An optimum size for m is 4, which gives rise to 21 modules for the 3-level
implementation of a 64-bit adder.
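
As a worked check of the figures quoted for the 64-bit adder, the two expressions can be evaluated for n = 64 and m = 4. The 16-bit case is added only for comparison and assumes the same group size m = 4.

```latex
\[
L = \log_m n = \log_4 64 = 3,
\qquad
N_{max} = \frac{n - 1}{m - 1} = \frac{63}{3} = 21 .
\]
\[
\text{For } n = 16:\quad
L = \log_4 16 = 2,
\qquad
N_{max} = \frac{15}{3} = 5 .
\]
```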
Reducing the word length of the operands input to the adders from 64 bits to smaller widths,
namely, 32, 16 and 8 bits, reduces the value of n, which in turn reduces L, $N_{max}$ and
the delay. The size of the adders and the overall size of the hash function circuit are thereby
reduced. If a 32-bit adder is used instead of a 64-bit adder, the area of the CSA is
proportional to n = 32 rather than n = 64. The addition of three 8-bit operands, P, Q and R, using a
4-bit CSA is illustrated in Fig. 4. Initially the lower nibbles of the operands are added and the
same adder is then used to add the higher nibbles.
The carry bit $vc_4 = 1$ generated while adding bits (3-0) of the operands in a 4-bit CSA is
stored in a flip-flop and $vc_0$ is assigned the value '0'. The pseudo-sum $vs_{3-0}$ and the
pseudo-carry $vc_{3-0}$ are added in a 4-bit CLA to get the final sum $VS_{3-0} = 0101$. The higher
nibbles of the operands are added by the same CSA; the carry produced, $vc_4 = 0$, is ignored and
$vc_0$ is assigned the value '1', which is the carry from the lower nibble addition saved in the
flip-flop. In the same 4-bit CLA, the pseudo-sum $vs_{3-0}$ and the pseudo-carry $vc_{3-0}$ are
added to get the final sum $VS_{7-4} = 0110$. The same logic is used for the carry generated from
the lower nibble addition in the CLA, and the modulo-8 sum is 01100101.
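
The slice-by-slice addition described above can be modelled in software as follows. This Python sketch is a behavioural illustration under the stated scheme (one narrow CSA followed by one narrow CLA, with the carries of each slice held for the next one); the function name sliced_add3 and its parameters are assumptions, and the integer addition stands in for the CLA.

```python
def sliced_add3(p, q, r, total_bits, slice_bits):
    """Add three operands modulo 2**total_bits with one slice_bits-wide CSA and CLA.

    The carries leaving each slice are held, as in a flip-flop, and fed into
    the next, more significant slice.
    """
    mask = (1 << slice_bits) - 1
    csa_carry = 0   # carry saved from the CSA of the previous slice (becomes vc0)
    cla_carry = 0   # carry saved from the CLA of the previous slice
    result = 0
    for shift in range(0, total_bits, slice_bits):
        x, y, z = ((v >> shift) & mask for v in (p, q, r))
        vs = x ^ y ^ z                              # pseudo-sum
        carries = (x & y) | (x & z) | (y & z)
        vc = ((carries << 1) | csa_carry) & mask    # pseudo-carry with vc0 filled in
        csa_carry = carries >> (slice_bits - 1)
        total = vs + vc + cla_carry                 # the CLA stage
        cla_carry = total >> slice_bits
        result |= (total & mask) << shift
    return result   # carries out of the most significant slice are ignored

print(format(sliced_add3(0x22, 0x79, 0xCA, 8, 4), "08b"))
```

With the operands of Fig. 4 the function returns 01100101, the modulo-8 sum obtained above. The same routine with total_bits = 64 and slice_bits = 32, 16 or 8 corresponds to the reduced word length adders considered for SHA-512, at the cost of 2, 4 or 8 passes through the narrow adder.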
Moreover, the logic functions Ch(e, f, g) and Maj(a, b, c) are such that these can be performed
on reduced size operands. Design of SHA-512 with 64/32/16/8 bit adders and logic circuits will
[Fig. 4. Addition of three 8-bit operands using a 4-bit CSA. Let P = 22, Q = 79, R = CA
(hexadecimal).

            Bits (3-0)    Bits (7-4)
X3-0        0010          0010
Y3-0        1001          0111
Z3-0        1010          1100
vs3-0       0001          1001
vc4-1       1010          0110
vc4         1             ignored]
5. Implementation of hash functions with scaled down adders and logic circuits
In the block diagram of SHA(64)-512, which is shown in Fig. 2 in Section 3, the message
scheduler is implemented with sixteen 64-bit registers, and in the Iterative Processing Unit (IPU)
the variables a–h are 64-bit registers. The additions of operands in the message scheduler and the
processing unit are implemented using a network of 64-bit CSAs and the reduction is done by rows.
The final pseudo-sum and carry vectors are added using a 64-bit CLA.
The message scheduler (MS) and the iterative processing unit (IPU) implemented with 32-bit CSAs
are shown in Figs. 5 and 6, respectively. The sixteen registers in the MS and the eight registers
in the IPU are 64-bit registers, but each is split and used as two 32-bit registers. The suffixes U
and L refer, respectively, to the bit vectors (63-32) and (31-0). A selector unit SEL is used to
select the U or L register and DSEL performs the opposite function of SEL. The SEL unit consists of
two tristate buffers which are driven by the select lines S0 and S1. The select signal S1 is
generated by inverting S0. A positive edge triggered T flip-flop generates S0. The data is always
shifted from one half of a register to the corresponding half of the next register in the path. The
transfer of data between the lower halves of registers takes place when S0 is asserted and between
the upper halves when S1 = 1. The shift logic functions $\sigma_0$ and $\sigma_1$ in the MS need
the full 64-bit operands; therefore, the two halves of the operands are latched into these blocks
at the negative edge of S1 and the shift logic is performed by combinational circuits. The same
technique is used for $\Sigma_0$ and $\Sigma_1$ in the IPU.
The logic functions Ch(e, f, g) and Maj(a, b, c) are bitwise, so they can be performed on the two
halves of the operands independently; hence, only 32-bit circuits are used for these functions. This
in turn reduces the size of the overall circuit.
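
The difference between the bitwise functions and the rotations can be checked with a small experiment. The Python sketch below is illustrative only; the helper names (halves, join, ch_by_halves, maj_by_halves) are assumptions, and the operands are the SHA-512 initial hash values of Table 2.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def ch(x, y, z, mask):
    return ((x & y) ^ (~x & z)) & mask

def maj(x, y, z):
    return (x & y) ^ (x & z) ^ (y & z)

def halves(v):
    """Return the (U, L) halves of a 64-bit word: bits (63-32) and (31-0)."""
    return v >> 32, v & MASK32

def join(upper, lower):
    return (upper << 32) | lower

def ch_by_halves(e, f, g):
    (eu, el), (fu, fl), (gu, gl) = halves(e), halves(f), halves(g)
    return join(ch(eu, fu, gu, MASK32), ch(el, fl, gl, MASK32))

def maj_by_halves(a, b, c):
    (au, al), (bu, bl), (cu, cl) = halves(a), halves(b), halves(c)
    return join(maj(au, bu, cu), maj(al, bl, cl))

def rotr(x, n, w):
    return ((x >> n) | (x << (w - n))) & ((1 << w) - 1)

# Working variables loaded with the SHA-512 initial hash values of Table 2
a, b, c = 0x6a09e667f3bcc908, 0xbb67ae8584caa73b, 0x3c6ef372fe94f82b
e, f, g = 0x510e527fade682d1, 0x9b05688c2b3e6c1f, 0x1f83d9abfb41bd6b

print(ch_by_halves(e, f, g) == ch(e, f, g, MASK64))    # True
print(maj_by_halves(a, b, c) == maj(a, b, c))          # True
# A rotation moves bits across the half boundary, so it cannot be done per half
au, al = halves(a)
print(join(rotr(au, 1, 32), rotr(al, 1, 32)) == rotr(a, 1, 64))   # False
```

This is consistent with the design choice described above: Ch and Maj use 32-bit circuits, while the rotation and shift functions receive both halves of the operand before they are evaluated.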
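[Fig. 5. Message scheduler implemented with 32-bit CSAs: padded message input, split 64-bit
registers with Sel and DSel units, σ0 and σ1 blocks with 64-bit inputs, 32-bit FAAs and a CLA
producing W_t.]

[Fig. 6. Iterative processing unit implemented with 32-bit CSAs: inputs W_t and K_t, 32-bit FAAs
and CLAs, Ch(e, f, g) and Maj(a, b, c) blocks, Σ0 and Σ1 blocks with 64-bit inputs and Sel units,
and the split registers aU–hU and aL–hL.]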
The addition of three 64-bit operands using a 32-bit FAA is shown in Fig. 7. The SEL circuits
are included to demonstrate the selection of one half of the operands, and the output is two 32-bit
vectors, namely, Sum and Carry. The bit $vc_{32}$, which is shown as $c_{out}$ in Fig. 7, is stored
in a D flip-flop when S0 is asserted, as this is the carry generated from the addition of bits
(31-0) of the operands. Bit $vc_0 = 0$ when S0 is asserted and $vc_0 = c_{out}$ of the lower half
addition when S1 is asserted. The carry from the higher half addition is ignored, as hash functions
require modulo-64 addition. The 64-bit constant $K_t$ is input 32 bits at a time from a ROM. In
order to store the eighty constants, a (160 × 32) ROM is used. The message digest computation is
done by 32-bit CLAs, which add the hash values of the preceding iteration to the contents of
registers a to h; hence eight CLAs are used to perform this addition. The carries from the lower
order words are stored and used as $c_{in}$ while the higher order words are added, as shown in
Fig. 7 for the FAAs.
[Fig. 7. Addition of three 64-bit operands using a 32-bit adder (FAA): the halves X31-0/X63-32,
Y31-0/Y63-32 and Z31-0/Z63-32 are selected by S0 and S1; the outputs are Sum (vs31-0) and Carry
(vc31-1, vc0); c_out is stored in a D flip-flop clocked by Clk and selected as vc0 by S1, while '0'
is selected by S0.]

The design methodology of the SHA-512 function using 16- or 8-bit adders and logic circuits is
identical to that of the 32-bit version. The 64-bit registers are split to suit the respective word
lengths. The selection is done by a counter-decoder (2 × 4) circuit for the 16-bit version and a
(3 × 8) decoder is used for the 8-bit version. Accordingly, the sizes of the SEL and DSEL circuits,
FAAs, Ch and Maj functions and ROM are also changed to operate on 16-bit or 8-bit operands.
Implementation of SHA-512 using 32-bit adders and logic circuits facilitates implementation of
SHA-256 on the same chip, as SHA-256 performs operations on 32-bit operands. The algorithm
is identical for the SHA-512 and SHA-256 functions and the user can select the algorithm by
asserting an input line. One initial hash value and one $K_t$ value are shown in Table 3. It is
clear from the table that the initial hash values and constants $K_t$ of SHA-256 are exactly the
upper 32-bit halves of those of SHA-512. Therefore, the ROM is organized as two banks of 80 words,
with each bank handling one half of the constants. Combinational logic selects the appropriate bank
depending on the algorithm.
The ROM banks and their associated logic are shown in Fig. 8. An 80 × 32 ROM bank (KH32)
stores the $K_t$ constants of SHA-256, which are also the higher words of the SHA-512 constants, and
another ROM bank (KL32) of the same capacity stores the lower halves of the constants. The contents
of KH32 have to be selected at every clock pulse when they are used as the $K_t$ constants of
SHA-256, whereas they have to be selected at alternate clock pulses when they are passed on to a
32-bit adder as the higher words of SHA-512. The associated logic shown in Fig. 8 selects the
contents of KH32 either at every clock when it is computing SHA-256 or at alternate clocks for
SHA-512.
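
The bank selection can be sketched behaviourally as follows. This Python model is an illustration under explicit assumptions, not a description of the logic of Fig. 8: only the first constant of each bank is filled in (with the values of Table 3), the names rom_read and s512 are hypothetical, and the order in which the two words of a SHA-512 constant are read (lower word first) is assumed from the order in which the lower and upper halves are added.

```python
# Only entry 0 is taken from Table 3; the other 79 entries of each bank are
# left as zero placeholders in this sketch.
KH32 = [0x428a2f98] + [0] * 79   # SHA-256 constants = upper words of SHA-512 constants
KL32 = [0xd728ae22] + [0] * 79   # lower words of SHA-512 constants

def rom_read(clock, s512):
    """Return the 32-bit word presented to the adder at a given clock count."""
    if not s512:
        return KH32[clock]                  # SHA-256: KH32 selected at every clock
    t, upper = divmod(clock, 2)             # SHA-512: two clocks per constant
    return KH32[t] if upper else KL32[t]    # assumed order: lower word first

k0_sha256 = rom_read(0, s512=False)
k0_sha512 = (rom_read(1, s512=True) << 32) | rom_read(0, s512=True)
print(hex(k0_sha256), hex(k0_sha512))
```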
The logic functions $\sigma_0$, $\sigma_1$, $\Sigma_0$ and $\Sigma_1$ involve rotation and shifting
and differ between the two functions; separate logic circuits are therefore designed for SHA-256 and
SHA-512 to handle the respective functions.
Table 3
Relationship between constants of SHA-256 and SHA-512

Constant                  SHA-256     SHA-512
H_0 (initial) → a         6a09e667    6a09e667 f3bcc908
K_0                       428a2f98    428a2f98 d728ae22
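[Fig. 8. ROM banks and their associated logic: 80 × 32 ROM bank with a 32-bit output, address
inputs Adr6-0 and Adr7-1, and the select signals Adr0 and S512.]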
6. Experimental results
The SHA-512 and SHA-256 algorithms were designed and tested using a comprehensive design
software, Altera Quartus II, version 4.0. Altera reports that it is the programmable logic
performance leader across all platforms
(http://www.altera.com/products/devices/performance/per-index.html) and provides a complete
multi-platform design environment to suit specific design needs. The designs were analyzed and
synthesized using Verilog HDL and VHDL, and placed and routed in Altera devices of the APEX II,
Stratix, and Mercury FPGA families. Five performance metrics, namely the area (α), memory (μ),
latency (λ), clocking frequency (f) and throughput (δ), were computed.
APEX II FPGAs have up to 67,200 logic elements (LEs) and 1.1 Mbits of embedded RAM, and
these devices offer abundant logic resources and high I/O performance. High speed
compute-intensive data path functions can be easily implemented with one or multiple APEX II
devices. Mercury family FPGAs typically have up to 14,400 LEs with a maximum of 114,688 RAM bits.
The FPGAs of the Stratix family contain 10,570 to 79,040 LEs and up to 7,427,520 RAM
bits (928,440 bytes) without reducing logic resources. High-speed differential I/O support on up
to 116 channels, with up to 80 channels optimized for 840 megabits per second (Mbps), is provided
by these FPGAs. The resources used, in terms of the number of logic elements for the implementation
of the algorithm, are referred to as the area. A memory segment consists of a bit-slice of a memory
that is implemented in a single embedded cell. Each embedded cell implements one output of the
memory, and multiple memory segments may be needed to create a single memory block. Latency is
defined as the number of rounds in a loop, and the clock period as the minimum operating clock
period. The throughput (δ) is computed as δ = message block size/(clock period × latency).
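
The throughput expression can be evaluated directly. The Python sketch below is a small calculator for it; the function name throughput_mbps and the blocks parameter (the number of message blocks processed in parallel on one chip) are illustrative assumptions. The example uses the clock frequencies listed in Table 4, a latency of 160 for SHA(32)-512 as stated in this section, and an assumed latency of 80 for SHA(64)-512 (one clock per digest round).

```python
def throughput_mbps(block_bits, clock_mhz, latency, blocks=1):
    """Throughput in Mbits/s: block size / (clock period * latency), per chip."""
    clock_period_us = 1.0 / clock_mhz          # clock period in microseconds
    return blocks * block_bits / (clock_period_us * latency)

# SHA(64)-512 on Stratix: 47.9 MHz, two blocks on the chip (assumed latency 80)
print(round(throughput_mbps(1024, 47.9, 80, blocks=2), 1))
# SHA(32)-512 on Stratix: 46.5 MHz, three blocks on the chip, latency 160
print(round(throughput_mbps(1024, 46.5, 160, blocks=3), 1))
```

Under these assumptions the two values agree with the throughput column of Table 4 (1226.2 and 892.8 Mbits/s). Halving the adder width doubles the latency, so the throughput per block roughly halves at a similar clock frequency.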
The designs were simulated for a padded message block of 1024 bits. SHA(64)-512 and SHA(32)-512
were designed and placed on the FPGA EP1S10F484C5 of the Stratix family, and their performance
metrics are presented in Table 4. The SHA(32)-512 design occupies 2800 logic elements whereas
SHA(64)-512 occupies 4229 logic elements. As SHA(32)-512 occupies only 26% of the chip area to
handle one block of 1024 bits of padded message, three blocks of the message can be authenticated
at a time by the chip if the blocks are pipelined. This in turn will occupy only 78% of the chip
area, and the rest can be used for implementing encryption logic. SHA(64)-512 can handle only two
blocks at a time and leaves only 20% of the chip area for other purposes. Moreover, the area used by
one block of the SHA(32)-512 design is only 66% of that of SHA(64)-512, while its throughput for the
maximum number of blocks is about 73% of that of SHA(64)-512.
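
The figures in this paragraph follow from simple arithmetic on the entries of Table 4, as the following Python sketch shows. The helper name max_blocks is an assumption made for illustration, and the calculation ignores any routing or I/O limits that may apply when several copies are placed on one device.

```python
def max_blocks(area_percent):
    """Number of identical block processors that fit within 100% of the chip area."""
    return 100 // area_percent

# (design, logic elements, chip area used in %, maximum throughput in Mbits/s)
designs = (("SHA(64)-512", 4229, 40, 1226.2), ("SHA(32)-512", 2800, 26, 892.8))

for name, les, pct, mbps in designs:
    n = max_blocks(pct)
    print(name, "blocks:", n, "area used (%):", n * pct, "area left (%):", 100 - n * pct)

area_ratio = 100 * designs[1][1] / designs[0][1]
throughput_ratio = 100 * designs[1][3] / designs[0][3]
print(round(area_ratio), round(throughput_ratio, 1))
```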
In Table 5, a similar SHA(64)-512 design implemented on a Xilinx Virtex-E XCV600E-8 [13] is
compared with our design implemented on the Mercury family FPGA EPM120F484C5. The
operating frequency reported in [13] was 38 MHz, whereas our design has an operating frequency
of 43.7 MHz. The throughput of our design (560 Mbits/s) is also higher than that of [13]
(479 Mbits/s).
Table 4
Synthesis results of SHA(64)-512 and SHA(32)-512 on Stratix

Design        Area (LEs)  Chip area  Memory (bits)  On-chip memory  Clock (MHz)  Throughput (Mbits/s)
              (α)         used (%)   (μ)            used (%)        (f)          (max block)
SHA(64)-512   4229        40         9216           <1              47.9         1226.2
SHA(32)-512   2800        26         8448           <1              46.5         892.8
Table 5
Comparison of our SHA(64)-512 design with design of [13]

Design           Clock (MHz) (f)   Throughput (Mbits/s)
Design of [13]   38                479
SHA(64)-512      43.7              560
The full and lower word length versions, namely, SHA(64)-512, SHA(32)-512, SHA(16)-512 and
SHA(8)-512, were synthesized on a Mercury family FPGA, EPM120F484C5, and their areas were compared
on the basis of logic element count. The designs were optimized for area and their comparison chart
is shown in Fig. 9. Taking SHA(64) as the base, SHA(8) occupies 27.1% less area, followed by
SHA(16) with 24.3% less and SHA(32) with 16.5% less. It is clear from Fig. 9 that for
applications where area is of concern, smaller word length implementations will be suitable.
In order to evaluate devices belonging to different families of Altera, the SHA(32)-512 design
was synthesized on devices belonging to three different families, and the performance metrics area,
memory, throughput, and operating frequency are listed in Table 6. It can be seen that the hash
algorithm synthesized on the Stratix device occupies less area than on the other two FPGAs listed in
Table 6. Moreover, since one block occupies only 26% of the chip area, three blocks of 1024 bits of
padded message can be handled by the Stratix device, whereas only two blocks can be handled by the
Apex II device and one block by the Mercury device. The maximum possible throughput is listed in the
last column.
Finally, synthesis results of SHA(32)-512 and SHA-256 on a single chip are given in Table 7.
Both the algorithms use the same area and memory, but the throughput is different since the block
size and latency are 512 and 64, respectively, for SHA-256, whereas for SHA(32)-512, block size is
1024 and latency is 160.
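[Fig. 9. Logic element count of the SHA(8), SHA(16), SHA(32) and SHA(64) implementations of
SHA-512 (vertical axis: logic elements, 0 to 4000; surviving data labels: 3711 and 3101).]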
Table 6
Synthesis results of SHA(32)-512 on different FPGAs

FPGA                       Area (LEs)  Chip area  Memory (bits)  Memory    Clock (MHz)  Throughput (Mbits/s)  Throughput (Mbits/s)
                           (α)         used (%)   (μ)            used (%)  (f)          (1 block)             (max block)
Stratix EP1S10F484C5       2794        26         8448           <1        45.8         292.8                 878.4
Apex II EP20K200EFC484-1   2867        34         15,360         14        24.86        159.1                 318.2
Mercury EPM120F484C5       3775        78         15,360         31        48.7         311.9                 311.9
Table 7
Single chip implementation of SHA(32)-512 and SHA-256

Design        Area      Memory    Clock (MHz)  Throughput (Mbits/s)  Throughput (Mbits/s)
              used (%)  used (%)  (f)          (1 block)             (max block)
SHA-256       32        1         41.97        335.9                 1007.7
SHA(32)-512   32        1         41.97        268.7                 806.1
7. Conclusions
Secure hash algorithms SHA-256 and SHA-512 are versatile algorithms deployed in a broad
range of applications with different area-performance requirements. Implementing a SHA-512
hash function on an FPGA requires several 64-bit adders, which take up the bulk of the chip area.
In this paper, we explored alternative adder architectures for implementing SHA-512 on FPGAs
with reduced-size operands and studied their area-performance trade-offs. Our results showed
that the chip area decreased as the operand size was reduced, but the throughput suffered due
to increased latency. The architectures were synthesized in different FPGA families of ALTERA
and their performance metrics such as area, memory, latency, clocking frequency and throughput
were compared. SHA-256 and SHA-512 were also implemented using a common architecture.
The performance metrics shed light on the possibility of synthesizing multiple blocks on a
single chip, which in turn would increase the throughput. The Stratix family of FPGAs offered the
best performance metrics, with the smallest area and the highest maximum throughput among the
devices compared. Future work will be directed towards error analysis and error detection
procedures for the hardware implementation of hash functions.
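The area-latency trade-off of reduced-size operands can be illustrated with a small software model. The sketch below is not the datapath synthesized in this work; it is a generic word-serial addition, assumed here for illustration, in which a single `width`-bit adder processes a 64-bit addition chunk by chunk and forwards the carry between steps. The function name `add_mod64_chunked` is hypothetical.

```python
from typing import Tuple

def add_mod64_chunked(x: int, y: int, width: int) -> Tuple[int, int]:
    """Add two 64-bit words modulo 2**64 with a `width`-bit adder, one chunk per step.

    Returns (sum, steps), where steps is the number of adder passes used.
    """
    assert width in (8, 16, 32, 64)
    mask = (1 << width) - 1
    result, carry, steps = 0, 0, 0
    for shift in range(0, 64, width):
        a = (x >> shift) & mask        # current chunk of the first operand
        b = (y >> shift) & mask        # current chunk of the second operand
        s = a + b + carry              # narrow addition with carry-in
        result |= (s & mask) << shift  # keep the low `width` bits of the partial sum
        carry = s >> width             # carry-out feeds the next chunk
        steps += 1
    return result, steps               # the final carry is dropped: reduction modulo 2**64

x, y = 0x0123456789ABCDEF, 0xFFFFFFFF00000001
for width in (64, 32, 16, 8):
    total, steps = add_mod64_chunked(x, y, width)
    print(width, steps, hex(total))
```

Every width produces the same sum, but the narrower adder needs more passes (1, 2, 4 and 8 for 64, 32, 16 and 8 bits). The 160-cycle latency reported for SHA(32)-512 is consistent with two such passes for each of the 80 rounds of SHA-512, although that reading is an inference rather than a statement from the text.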
References
[1] Kaufman C, Perlman R, Speciner M. Network security: private communication in a public world. 2nd
ed. Prentice-Hall; 2002.
[2] FIPS Publication 198. The Keyed-hash message authentication code (HMAC). US Doc/NIST, March 6, 2002.
[3] Madson C, Glenn R. The use of HMAC-SHA-1-96 within ESP and AH. RFC 2404, November 1998.
[4] Rivest RL. The MD5 message digest algorithm. RFC 1321, April 1992.
[5] FIPS Publication 180-1. Secure hash standard (SHS). US Doc/NIST, April 17, 1995.
[6] FIPS Publication 197. Advanced encryption standard (AES). US Doc/NIST, November 26, 2001.
[7] FIPS Publication 180-2. Secure hash standard (SHS). US Doc/NIST, May 30, 2001.
[8] FIPS Publication 180-2. Secure hash standard (SHS) change notice 1. US Doc/NIST, February 2004.
[9] Bosselaers A, Govaerts R, Vandewalle J. Fast hashing on Pentium. Proceedings of Crypto'96, LNCS
1109. Springer-Verlag; 1996. p. 298–312.
[10] Bosselaers A, Govaerts R, Vandewalle J. SHA: a design for parallel architectures? Proceedings of the
EUROCRYPT'97, LNCS 1233. Springer-Verlag; 1997. p. 348–62.
[11] Nakajima J, Matsui M. Performance analysis and parallel implementation of dedicated hash functions on Pentium
III. IEICE Transactions on Fundamentals 2003;E86-A(1):54–63.
[12] Dominikus S. A hardware implementation of MD5-family hash algorithm. In: Proceedings of the international
conference on electronics circuits and systems, Dubrovnik, Croatia, September 15–18, 2002. p. 1143–6.
[13] McLoone M, McCanny JV. Efficient single-chip implementation of SHA-384 and SHA-512. In: Proceedings of the
IEEE international conference on field-programmable technology (FPT), Hong Kong, July 2002. p. 311–4.
[14] Kang YK, Kim DW, Kwon TW, Choi JR. An efficient implementation of hash function processor for IPSEC. In:
Proceedings of the Asia–Pacific conference on ASICs, August 2002. p. 93–6.
[15] Grembowski T, Lien R, Gaj K, Nguyen N, Bellows P, Flidr J, et al. Comparative analysis of the hardware
implementation of hash functions SHA-1 and SHA-512. Proceedings of the 5th international conference on
information security (ISC'2002), LNCS 2433. Springer-Verlag; 2002. p. 75–89.
[16] Sklavos N, Koufopavlou O. On the hardware implementation of the SHA-2 (256, 384, 512) hash functions. In:
Proceedings of the IEEE international symposium on circuits and systems, vol. 5, May 2003. p. 153–6.
[17] Dadda L, Macchetti M. The design of a high speed ASIC unit for the hash functions SHA-256 (384, 512). In:
Proceedings of the design, automation and test in Europe conference (DATE'04), February 16–20, 2004.
[18] Ercegovac MD, Lang T. Digital arithmetic. Morgan Kaufmann Publishers; 2004.
[19] Kaliski B, Staddon J. RSA cryptography specifications—Version 2.0. RFC 2437, October 1998.
[20] Haller N, Metz C, Nesser P, Straw M. A one-time password system. RFC 2289, February 1998.
[21] Dierks T, Allen C. The TLS protocol—Version 1.0. RFC 2246, January 1999.
Imtiaz Ahmad received his B.Sc. in Electrical Engineering from University of Engineering and
Technology, Lahore, Pakistan, an M.Sc. in Electrical Engineering from King Fahd University of
Petroleum and Minerals, Dhahran, Saudi Arabia, and a Ph.D. in Computer Engineering from
Syracuse University, Syracuse, New York, in 1984, 1988 and 1992, respectively. Since September
1992, he has been with the Department of Computer Engineering at Kuwait University, Kuwait,
where he is currently a professor. His research interests include design automation of digital
systems, high-level synthesis, and parallel and distributed computing.
A. Shoba Das received the B.E. degree from Guindy College of Engineering, Madras University,
India and the M.E. degree from PSG College of Technology, Madras University, India. She has
held various teaching assignments in India since 1982 and is presently working as a scientific
assistant at Kuwait University. Her research interests include optimal design of sequential
machines and testing of communication systems.