INTEGRATION, The VLSI Journal: H.E. Michail, G.S. Athanasiou, G. Theodoridis, C.E. Goutis

INTEGRATION, the VLSI journal ∎ (∎∎∎∎) ∎∎∎–∎∎∎

Contents lists available at ScienceDirect

INTEGRATION, the VLSI journal


journal homepage: www.elsevier.com/locate/vlsi

On the development of high-throughput and area-efficient multi-mode cryptographic hash designs in FPGAs
H.E. Michail a,⁎, G.S. Athanasiou b, G. Theodoridis c, C.E. Goutis c

a Electrical Engineering, Computer Engineering and Informatics Department, Cyprus University of Technology, 3036 Lemesos, Cyprus
b Antcor – Advanced Network Technologies S.A., Sorou Str. 12, 15125 Marousi, Greece
c VLSI Design Laboratory, Electrical and Computer Engineering Department, University of Patras, 26500 Patras, Greece

Article info

Article history: Received 23 December 2012; Received in revised form 4 February 2014; Accepted 7 February 2014

Keywords: Hash; Authentication; Multi-mode; FPGA

Abstract

In this paper, area-efficient and high-throughput multi-mode architectures for the SHA-1 and SHA-2 hash families are proposed and implemented in several FPGA technologies. Additionally, a systematic flow for designing multi-mode architectures (implementing more than one function) of these families is introduced. Compared to the corresponding architectures that are produced by a commercial synthesis tool, the proposed ones are better in terms of both area (at least 40%) and throughput/area (from 32% up to 175%). Finally, the proposed architectures outperform similar existing ones in terms of throughput and throughput/area, from 4.2× up to 279.4× and from 1.2× up to 5.5×, respectively.

© 2014 Elsevier B.V. All rights reserved.

⁎ Corresponding author.
E-mail addresses: [email protected] (H.E. Michail), [email protected] (G.S. Athanasiou), [email protected] (G. Theodoridis), [email protected] (C.E. Goutis).
URL: http://www.antcor.com (G.S. Athanasiou).

http://dx.doi.org/10.1016/j.vlsi.2014.02.004
0167-9260 © 2014 Elsevier B.V. All rights reserved.

Please cite this article as: H.E. Michail, et al., On the development of high-throughput and area-efficient multi-mode cryptographic hash designs in FPGAs, INTEGRATION, the VLSI journal (2014), http://dx.doi.org/10.1016/j.vlsi.2014.02.004

1. Introduction

Due to the dramatic increase of electronic communications and transactions worldwide, security has become an indispensable feature of all systems and applications. A vital feature of the security schemes that are used nowadays is authentication, which is achieved using cryptographic hash functions. Hash functions are used as single modules or they are included in hash-based authentication mechanisms such as the Hashed Message Authentication Code (HMAC) [1].

Furthermore, hash functions are used in the Public Key Infrastructure (PKI) [2], Secure Electronic Transactions (SET) [3], and digital signature algorithms like DSA [4], which are used to provide authentication services in commercial applications such as data interchange, electronic mail, and fund transfer. Additionally, hash functions are used in Web protocols such as the Secure Sockets Layer (SSL) and Transport Layer Security (TLS) [5].

The importance of hash functions has further increased in recent years due to their inclusion in the Internet Protocol Security (IPSec) [6]. IPSec is a compulsory feature of the forthcoming Internet Protocol version 6 (IPv6) [6] and includes encryption and authentication schemes. Encryption is achieved through the block cipher algorithm AES [7], whereas authentication is provided through HMAC, which is built on top of a hash function (e.g. MD-5 [8] or SHA-1 [9]).

However, security problems have been discovered in the MD-5 and SHA-1 functions. Specifically, the MD-5 class of hash functions has been totally broken [10]. Concerning SHA-1, on the other hand, although its collision resistance has been reduced [11], the security problems are non-critical. For these reasons, in addition to SHA-1, the SHA-2 hash functions are expected to be adopted as a secure solution in security schemes (e.g. IPSec/IPv6) in the coming years. Nevertheless, the US National Institute of Standards and Technology (NIST) has established a competition for developing the new hash function standard (SHA-3), which was finalized in November 2012 [12]. As the transition to a new standard does not happen immediately, the SHA-1 and SHA-2 functions are expected to continue being used in near- and medium-future applications. In fact, NIST itself reports that SHA-3 is not meant to replace SHA-1 and SHA-2, but to co-exist with them, not to mention that many administrators have not yet made the jump from SHA-1 to SHA-2 [13,14].

A crucial issue arising from the above transition from SHA-1 to SHA-2 is that different systems/applications have different needs in terms of the type of authentication. Hence, systems have to be flexible enough to support more than one hash function. Although this requirement can be met by developing systems where each hash function is implemented by a separate design, a more area-efficient solution is the development of a multi-mode architecture that supports more than one hash function with a single design. Such architectures significantly increase the flexibility of the whole security module, allowing its use in a wide range of applications spreading from

servers (which have to support a set of different hash-type interactions) to end-users who communicate through many communication channels, each of which may employ a different hash function for authentication purposes.

In this paper, two multi-mode architectures, namely SHA-256/512 and SHA-1/256/512, are introduced. They achieve high throughput rates, outperforming all the existing similar ones in terms of the throughput/area cost factor. At the same time, they are area-efficient. Specifically, they occupy less area than the corresponding architectures that are derived by simply designing the sole hash cores together (two/three separate cores designed as one module having the same inputs and a multiplexer for selecting the preferred output) and feeding them to a commercial FPGA synthesis/P&R/mapping tool. The introduced designs are able to perform hashing on one message using the SHA-512 algorithm or on two different messages when the SHA-256 or SHA-1 hash function is executed. Several older (Xilinx Virtex, Virtex-II, Virtex-II Pro, and Virtex-4) and modern (Xilinx Virtex-5, Virtex-6, and Virtex-7) FPGA families are used for implementing the above architectures.

Moreover, a systematic design flow for producing multi-mode architectures that implement more than one hash function is proposed. The inputs of the flow are the separate designs and the algorithmic descriptions of the targeted hash functions; following a set of well-defined steps, the individual designs are properly merged to produce the final multi-mode architecture. To keep the area overhead low, extensive resource sharing is performed in the applied steps. Also, exploiting special features of the hash functions, novel techniques are introduced to keep the delay increase low.

Because any architecture for the above hash families is composed of similar functional blocks (adders, non-linear functions, rotations, and logic modules), the proposed flow can be applied to any RTL architecture of the SHA-1 and SHA-2 families. Thus, we apply the proposed flow to both base and optimized hash function designs, collect the corresponding results, and also perform the related comparisons. However, it must be stressed that the goal is not to introduce general techniques for developing optimized designs in terms of area, delay, or throughput. It is assumed that the individual designs of each hash function have already been optimized; they are then properly merged by the proposed flow. Furthermore, the flow exploits specific features of the SHA-1 and SHA-2 families and for that reason it is tailored to produce optimized multi-mode architectures for them. It is not a general design flow that could be applied to general-purpose designs.

To the best of our knowledge, this is the first time that (a) an FPGA implementation of a multi-mode design that includes the SHA-1 and SHA-2 hash families is presented, and (b) a systematic design flow for producing multi-mode architectures that implement a set of hash functions is presented.

The paper is organized as follows. In Section 2 the related work is briefly stated. The basic background concerning the SHA-1 and SHA-2 hash families is presented in Section 3, along with a short description of their base and optimized designs. In Section 4 the proposed design flow along with the production of the above-mentioned base multi-mode architectures is presented in detail. Then, in Section 5, the corresponding optimized multi-mode architectures (constructed with the introduced flow) are presented. FPGA implementation results, along with the corresponding comparisons with existing implementations, are provided and discussed in Section 6. Finally, Section 7 concludes the paper.

2. Related work

Currently, there are many hardware implementations of hash functions targeting high throughput rates, so as to meet the strict timing constraints of modern applications. In the literature there are numerous hardware implementations for the SHA-1 and SHA-2 families, such as those presented in [15,16–18,31]. In these works, novel architectures have been introduced to design each hash function on its own. However, there are significantly fewer works dealing with the development of multi-mode architectures for the above hash families.

Specifically, in [19–23,29,30,32] multi-mode architectures have been proposed for implementing the SHA-1 and/or SHA-2 family in FPGA and ASIC technology. Besides the above works that focus on the SHA-1/SHA-2 hash families, multi-mode architectures that target different hash functions have also been proposed. In [24] a multi-mode architecture for implementing the MD-5, SHA-1, and RIPEMD160 hash functions was proposed, while in [25] a unified reconfigurable HMAC unit that supports the MD-4, MD-5, SHA-1, and RIPEMD160 hash functions has been introduced. In [26] an HMAC unit that integrates MD-5 and SHA-1 is presented and implemented in FPGA and ASIC technologies. Finally, in [27,28], ASIC implementations of multi-mode architectures that include the SHA-1 and SHA-2 hash functions are reported.

Although novel multi-mode architectures have been introduced in the above works, they suffer from the following problems. First, they include algorithms such as MD-4 or MD-5 that have been totally broken [10,11], or they include the RIPEMD160 hash algorithm, whose commercial use is very limited and which is not proposed by NIST. Additionally, to the best of the authors' knowledge, there are no FPGA implementations of multi-mode architectures that support both the SHA-1 and SHA-2 hash families, which are widely used by current applications and are also expected to be used by future ones. Such designs are presented only in two works [27,28], which perform only ASIC implementations. Moreover, no systematic approach has been proposed for producing a multi-mode architecture that is efficient in terms of area or delay. Instead, all the introduced architectures are produced in an ad-hoc manner, making it difficult for another designer, who has already developed sole hash cores, to follow a similar procedure and create multi-mode designs having the sole architectures as a starting point.

3. SHA-1/SHA-256 hash function background

A hash function, H(M), operates on an arbitrary-length message, M, and returns a fixed-length output, h, which is called the hash value or message digest of M. The aim of a hash function is to provide a practically unique "signature" of M. Given M, it is easy to compute h when H is known. However, given h, it is hard to compute an M such that H(M) = h, even when H is known. Hash functions are iterative in nature and include two stages: pre-processing and computation [9]. The main characteristics of the targeted hash families, namely SHA-1 and SHA-2 (SHA-224/256/384/512), are shown in Table 1. They include simple processing elements, such as modulo additions, bitwise non-linear functions, and bit-level shifts/rotations [9].

As specified in the standard [9], the SHA-224 and SHA-384 hash functions are computed exactly as SHA-256 and SHA-512, respectively, except that they use different initial hash values and their outputs are truncated. Hence, they are omitted from the paper's analysis. However, they are considered both in the design flow and in the final designs, where a simple truncation of the output is performed for these two functions. Additionally, throughout the paper the terms hash function and hash algorithm are used interchangeably.

3.1. Arithmetic and logical functions

A number of non-linear functions are applied on the w-bit words that are represented as x, y, and z in the following. The


result of the logical computations is also a w-bit word. Specifically, the following non-linear functions are included in the SHA-1, SHA-256, and SHA-512 algorithms:

$$\mathrm{Ch}(x,y,z) = xy \oplus \bar{x}z \tag{1}$$

$$\mathrm{Parity}(x,y,z) = x \oplus y \oplus z \tag{2}$$

$$\mathrm{Maj}(x,y,z) = xy \oplus xz \oplus yz \tag{3}$$

$$\Sigma_0^{j=2,3}(x) = \mathrm{ROTR}^{\alpha_{1,j}}(x) \oplus \mathrm{ROTR}^{\alpha_{2,j}}(x) \oplus \mathrm{ROTR}^{\alpha_{3,j}}(x), \quad (\alpha_{1,j},\alpha_{2,j},\alpha_{3,j}) = (2,13,22)_{j=2},\ (28,34,39)_{j=3} \tag{4}$$

$$\Sigma_1^{j=2,3}(x) = \mathrm{ROTR}^{\alpha_{4,j}}(x) \oplus \mathrm{ROTR}^{\alpha_{5,j}}(x) \oplus \mathrm{ROTR}^{\alpha_{6,j}}(x), \quad (\alpha_{4,j},\alpha_{5,j},\alpha_{6,j}) = (6,11,25)_{j=2},\ (14,18,41)_{j=3} \tag{5}$$

$$\sigma_0^{j=2,3}(x) = \mathrm{ROTR}^{\beta_{1,j}}(x) \oplus \mathrm{ROTR}^{\beta_{2,j}}(x) \oplus \mathrm{SHR}^{\beta_{3,j}}(x), \quad (\beta_{1,j},\beta_{2,j},\beta_{3,j}) = (7,18,3)_{j=2},\ (1,8,7)_{j=3} \tag{6}$$

$$\sigma_1^{j=2,3}(x) = \mathrm{ROTR}^{\beta_{4,j}}(x) \oplus \mathrm{ROTR}^{\beta_{5,j}}(x) \oplus \mathrm{SHR}^{\beta_{6,j}}(x), \quad (\beta_{4,j},\beta_{5,j},\beta_{6,j}) = (17,19,10)_{j=2},\ (19,61,6)_{j=3} \tag{7}$$

where $\oplus$ corresponds to the logical XOR operation, $xy$ denotes the bitwise AND of x and y, and $\bar{x}$ is the bitwise complement of x, whereas $\mathrm{ROTR}^l(x)$ and $\mathrm{SHR}^l(x)$ denote the l-bit right circular rotation and the l-bit right shift of word x, respectively ($\mathrm{ROTL}^l(x)$, used below, is the corresponding l-bit left circular rotation). Concerning the arithmetic operations, modulo $2^w$ addition between two w-bit words is used in all the above algorithms.

Table 1
SHA-1 and SHA-2 characteristics.

Function   j   Message block length (k bits)   Word length (w bits)   Hash value length (n bits)   Iterations (t_max,j)
SHA-1      1   512                             32                     160                          80
SHA-256    2   512                             32                     256                          64
SHA-224    –   512                             32                     224                          64
SHA-512    3   1024                            64                     512                          80
SHA-384    –   1024                            64                     384                          80

3.2. Constants $K_t^j$ and initial values $H^{(0),j}$

Constant values $K_t^j$ ($0 \le t \le t_{\max,j} - 1$) are used in each transformation round. Thus, the SHA-1 algorithm uses 80 32-bit constants, $K_0^1, K_1^1, \ldots, K_{79}^1$, which take only four distinct values (each one used for 20 consecutive iterations), derived from the square roots of 2, 3, 5, and 10. The SHA-256 function uses 64 32-bit constants, $K_0^2, K_1^2, \ldots, K_{63}^2$, which are the first 32 bits of the fractional parts of the cube roots of the first 64 primes, whereas the SHA-512 function employs 80 64-bit constants, $K_0^3, K_1^3, \ldots, K_{79}^3$, which are the first 64 bits of the fractional parts of the cube roots of the first 80 primes.

Concerning the initial values, the SHA-1 algorithm employs five 32-bit initial values, $H_0^{(0),1}, H_1^{(0),1}, \ldots, H_4^{(0),1}$, which are used in the first iteration (t = 0) of the transformation round. On the other hand, SHA-256 and SHA-512 use eight w-bit initial values, $H_0^{(0),j}, H_1^{(0),j}, \ldots, H_7^{(0),j}$ (j = 2 for SHA-256 and j = 3 for SHA-512), with w = 32 for SHA-256 and w = 64 for SHA-512. For all algorithms, the constant and initial values are provided by the standard [9].

3.3. Pre-processing stage

The pre-processing stage includes the padding and parsing of the initial message M. Padding is a procedure in which additional bits are appended to M so that its size in bits becomes a multiple of 512 for SHA-1 and SHA-256, or of 1024 for SHA-512. Since padding is a simple procedure, it is usually implemented in software without affecting the security level of the implementation. For more information about padding, the reader is referred to the standard [9].

During parsing, the padded message is separated into N k-bit blocks, denoted as $M^{(1)}, M^{(2)}, \ldots, M^{(N)}$, with k taken from Table 1. Concerning the SHA-1 and SHA-256 algorithms, since the 512-bit block can be expressed as 16 32-bit words, the first 32 bits of message block i are denoted as $M_0^{(i)}$, the next 32 bits as $M_1^{(i)}$, and so on up to $M_{15}^{(i)}$. A similar procedure is followed for SHA-512, with the exception that each $M_t^{(i)}$ is a 64-bit word.

3.4. Computation stage

The computation stage includes the computations of the message schedule and those of the transformation round. These are accomplished by the following procedure:

For i = 1 to N do {

Step 1. Message schedule preparation.

$$W_{t,j} = \begin{cases} M_t^{(i)}, & 0 \le t \le 15 \quad (j = 1, 2, 3) \\ \mathrm{ROTL}^1(W_{t-3} \oplus W_{t-8} \oplus W_{t-14} \oplus W_{t-16}), & 16 \le t \le t_{\max,j} - 1 \quad (j = 1) \\ \sigma_1^j(W_{t-2}) + \sigma_0^j(W_{t-15}) + W_{t-7} + W_{t-16}, & 16 \le t \le t_{\max,j} - 1 \quad (j = 2, 3) \end{cases} \tag{8}$$

Step 2. Initialization of the working variables.
SHA-1 uses five working variables, a, b, c, d, and e, which, before the first iteration (t = 0) of the transformation round, are set to $(a, b, \ldots, e) = (H_0^{(i-1),1}, H_1^{(i-1),1}, \ldots, H_4^{(i-1),1})$. Similarly, the SHA-256 and SHA-512 algorithms use eight working variables, a, b, c, d, e, f, g, and h, which are initialized as $(a, b, \ldots, h) = (H_0^{(i-1),j}, H_1^{(i-1),j}, \ldots, H_7^{(i-1),j})$ with j = 2, 3. For the first block (i = 1) these are the initial values $H^{(0),j}$ of Section 3.2.

Step 3. Application of the transformation round.
The computations of the SHA-1 transformation round are as follows:

For t = 0 to 79 {

$$T_1 = \mathrm{ROTL}^5(a) + f_t(b,c,d) + e + W_t + K_t \tag{9}$$

$$[a\|b\|c\|d\|e] = [T_1\|a\|\mathrm{ROTL}^{30}(b)\|c\|d] \tag{10}$$

}

where $\|$ denotes the concatenation operation, and $W_t = W_{t,1}$ and $K_t = K_t^1$. Depending on the iteration number t, the function $f_t(b,c,d)$ equals Ch(b, c, d) when 0 ≤ t ≤ 19, Parity(b, c, d) when 20 ≤ t ≤ 39 and 60 ≤ t ≤ 79, and Maj(b, c, d) when 40 ≤ t ≤ 59.

The computations of SHA-256 and SHA-512 (given below) are the same for both algorithms, with the exception that SHA-256 performs 32-bit data processing, whereas SHA-512 operates on 64-bit words.

For t = 0 to $t_{\max,j} - 1$ (j = 2, 3) {

$$T_1 = h + \Sigma_1^j(e) + \mathrm{Ch}(e,f,g) + K_t^j + W_{t,j} \tag{11}$$

$$T_2 = \Sigma_0^j(a) + \mathrm{Maj}(a,b,c) \tag{12}$$

$$[a\|b\|c\|d\|e\|f\|g\|h] = [T_1 + T_2\|a\|b\|c\|d + T_1\|e\|f\|g] \tag{13}$$

}

Step 4. Computation of the i-th intermediate hash value $H^{(i)}$.

$$H^{(i),j} = (a + H_0^{(i-1),j}) \,\|\, (b + H_1^{(i-1),j}) \,\|\, \ldots \,\|\, (e + H_4^{(i-1),j}) \quad (j = 1) \tag{14}$$


$$H^{(i),j} = (a + H_0^{(i-1),j}) \,\|\, (b + H_1^{(i-1),j}) \,\|\, \ldots \,\|\, (h + H_7^{(i-1),j}) \quad (j = 2, 3) \tag{15}$$

}.
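As a software cross-check of Steps 1–4 for SHA-1 (j = 1), the procedure can be sketched in Python as follows. This is a minimal behavioural sketch, not the hardware design discussed in this paper: the helper names (`rotl`, `f_t`, `pad`, `sha1`) are our own, padding follows the standard's rule (one 1 bit, zeros, and a 64-bit length field), and the four distinct $K_t^1$ values are derived as $\lfloor 2^{30}\sqrt{n} \rfloor$ for n = 2, 3, 5, 10 instead of being copied from the standard.

```python
import hashlib
import math

MASK = 0xFFFFFFFF  # modulo 2^32 arithmetic

def rotl(x: int, l: int) -> int:
    """ROTL^l(x) on a 32-bit word."""
    return ((x << l) | (x >> (32 - l))) & MASK

# K^1_t takes four distinct values: floor(2^30 * sqrt(n)) for n = 2, 3, 5, 10.
K1 = [math.isqrt(n << 60) for n in (2, 3, 5, 10)]
# Initial values H^(0),1 as given in the standard.
H0 = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]

def f_t(t: int, b: int, c: int, d: int) -> int:
    if t <= 19:
        return (b & c) ^ (~b & d)              # Ch, Eq. (1)
    if 40 <= t <= 59:
        return (b & c) ^ (b & d) ^ (c & d)     # Maj, Eq. (3)
    return b ^ c ^ d                           # Parity, Eq. (2)

def pad(message: bytes) -> bytes:
    """Pre-processing: append a 1 bit, zeros, and the 64-bit message length."""
    length_bits = 8 * len(message)
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    return padded + length_bits.to_bytes(8, "big")

def sha1(message: bytes) -> bytes:
    H = list(H0)
    padded = pad(message)
    for start in range(0, len(padded), 64):            # one 512-bit block M^(i)
        block = padded[start:start + 64]
        # Step 1: message schedule, Eq. (8) with j = 1
        W = [int.from_bytes(block[4 * k:4 * k + 4], "big") for k in range(16)]
        for t in range(16, 80):
            W.append(rotl(W[t - 3] ^ W[t - 8] ^ W[t - 14] ^ W[t - 16], 1))
        # Step 2: working variables start from H^(i-1)
        a, b, c, d, e = H
        # Step 3: transformation round, Eqs. (9)-(10)
        for t in range(80):
            t1 = (rotl(a, 5) + f_t(t, b, c, d) + e + W[t] + K1[t // 20]) & MASK
            a, b, c, d, e = t1, a, rotl(b, 30), c, d
        # Step 4: intermediate hash value, Eq. (14)
        H = [(x + y) & MASK for x, y in zip((a, b, c, d, e), H)]
    return b"".join(x.to_bytes(4, "big") for x in H)

if __name__ == "__main__":
    # Usage: compare against Python's hashlib reference implementation.
    print(sha1(b"abc") == hashlib.sha1(b"abc").digest())
```

Comparing the result with `hashlib.sha1` is a quick way to confirm that Eqs. (8)–(10) and (14) have been transcribed consistently.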

After repeating these steps N times (i.e. after processing message block $M^{(N)}$), the computed hash value $H^{(N)}$ is the message digest h of message M.
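Because SHA-256 and SHA-512 differ only in word length, iteration count, constants, and rotation/shift amounts, a single parameterized routine covers both (j = 2, 3). The Python sketch below is illustrative only: the `PARAMS` table layout and the helper names are assumptions of this example, and the constants $K_t^j$ and initial values $H^{(0),j}$ are derived with exact integer arithmetic from the cube and square roots of the first primes rather than copied from the standard.

```python
import hashlib
import math

# Per-mode parameters: w, t_max,j, then the amounts of Eqs. (4)-(7)
# for Sigma0, Sigma1, sigma0, sigma1 (the last sigma entry is the SHR amount).
PARAMS = {
    2: (32, 64, (2, 13, 22), (6, 11, 25), (7, 18, 3), (17, 19, 10)),   # SHA-256
    3: (64, 80, (28, 34, 39), (14, 18, 41), (1, 8, 7), (19, 61, 6)),   # SHA-512
}

def first_primes(count: int) -> list:
    primes, n = [], 2
    while len(primes) < count:
        if all(n % p for p in primes if p * p <= n):
            primes.append(n)
        n += 1
    return primes

def icbrt(n: int) -> int:
    """Integer cube root floor(n ** (1/3)) by Newton's method."""
    x = 1 << ((n.bit_length() + 2) // 3)       # initial guess >= true root
    while True:
        y = (2 * x + n // (x * x)) // 3
        if y >= x:
            return x
        x = y

def rotr(x: int, l: int, w: int) -> int:
    return ((x >> l) | (x << (w - l))) & ((1 << w) - 1)

def sha2(message: bytes, j: int) -> bytes:
    w, t_max, S0, S1, s0, s1 = PARAMS[j]
    mask = (1 << w) - 1
    primes = first_primes(t_max)
    # K^j_t: first w bits of the fractional parts of cube roots of the primes.
    K = [icbrt(p << (3 * w)) & mask for p in primes]
    # H^(0),j: first w bits of the fractional parts of square roots of 8 primes.
    H = [math.isqrt(p << (2 * w)) & mask for p in primes[:8]]
    wb, kb = w // 8, 2 * w                     # bytes per word, bytes per k-bit block
    # Pre-processing: a 1 bit, zeros, then the message length in a 2w-bit field.
    padded = message + b"\x80"
    padded += b"\x00" * ((kb - 2 * wb - len(padded)) % kb)
    padded += (8 * len(message)).to_bytes(2 * wb, "big")
    for start in range(0, len(padded), kb):
        block = padded[start:start + kb]
        # Step 1: message schedule, Eq. (8) with j = 2, 3
        W = [int.from_bytes(block[wb * k:wb * (k + 1)], "big") for k in range(16)]
        for t in range(16, t_max):
            x, y = W[t - 15], W[t - 2]
            sig0 = rotr(x, s0[0], w) ^ rotr(x, s0[1], w) ^ (x >> s0[2])
            sig1 = rotr(y, s1[0], w) ^ rotr(y, s1[1], w) ^ (y >> s1[2])
            W.append((sig1 + sig0 + W[t - 7] + W[t - 16]) & mask)
        # Step 2: working variables start from H^(i-1)
        a, b, c, d, e, f, g, h = H
        # Step 3: transformation round, Eqs. (11)-(13)
        for t in range(t_max):
            big1 = rotr(e, S1[0], w) ^ rotr(e, S1[1], w) ^ rotr(e, S1[2], w)
            big0 = rotr(a, S0[0], w) ^ rotr(a, S0[1], w) ^ rotr(a, S0[2], w)
            t1 = (h + big1 + ((e & f) ^ (~e & g)) + K[t] + W[t]) & mask
            t2 = (big0 + ((a & b) ^ (a & c) ^ (b & c))) & mask
            a, b, c, d, e, f, g, h = (t1 + t2) & mask, a, b, c, (d + t1) & mask, e, f, g
        # Step 4: intermediate hash value, Eq. (15)
        H = [(x + y) & mask for x, y in zip((a, b, c, d, e, f, g, h), H)]
    return b"".join(x.to_bytes(wb, "big") for x in H)

if __name__ == "__main__":
    # Usage: both modes agree with Python's hashlib reference implementation.
    print(sha2(b"abc", 2) == hashlib.sha256(b"abc").digest())
    print(sha2(b"abc", 3) == hashlib.sha512(b"abc").digest())
```

The single code path mirrors the observation that one datapath, widened to 64 bits and fed with mode-dependent constants and rotation amounts, can serve both functions.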

3.5. Base and optimized designs

Regarding the above hash functions, simple hardware designs can be easily developed. In those architectures, named base architectures, the transformation round is implemented exactly as described by the standard. Examples of such transformation rounds are the SHA-1 and SHA-256 ones shown in Fig. 1, where the term $f_t(x,y,z)$ denotes the non-linear functions (Ch(x,y,z), Maj(x,y,z), Parity(x,y,z)) used by the SHA-1 algorithm, whereas the shaded boxes correspond to the circuit modules of the critical paths.

The pipeline choice most widely adopted by hash function designers is a 4-stage pipelined architecture. Such a general architecture for the base designs is shown in Fig. 2. The inputs to the architecture are the initial values H and the Input Message Block, as described in the previous sub-section. Each pipeline stage contains a round unit to execute the iterations of the transformation rounds assigned to this stage, a Wt unit, which is responsible for computing the required Wt values, and local registers for storing the constant values Kt.

Since four pipeline stages are used, each of them executes 20, 16, and 20 iterations of the transformation rounds of the SHA-1, SHA-256, and SHA-512 algorithms, respectively. As each round unit executes a number of iterations of the transformation rounds, multiplexers are placed in front of it to pass either the result of an iteration executed by the current round unit or the result of the previous pipeline stage.

Fig. 2. General structure of a 4-stage pipelined, base architecture.
The control logic consists of four counters, count_i (i = 1, 2, 3, 4), that are used for activating the next pipeline stage and addressing the local memories. Specifically, the output of each counter is used for addressing the local memory of each stage, whereas the next pipeline stage is activated via the signals tround_i and tcwicnt_i, which are generated when the previous pipeline stage finishes its computations.
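The stage-by-stage behaviour described above can be modelled at cycle level. The Python sketch below is a hypothetical behavioural model, not the RTL of the paper: it assumes that the four counters are synchronized, that the $W_t$ values of each block are already expanded (standing in for the per-stage Wt units), and that every block entering the pipeline is independent (e.g. it belongs to a different message), since consecutive blocks of one message depend on each other through $H^{(i-1)}$. The names `run_pipeline`, `run_sequential`, and `toy_round` are illustrative; `toy_round` is a placeholder, not a SHA round.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

State = Tuple[int, ...]
RoundFn = Callable[[State, int, Sequence[int]], State]
Slot = Optional[Tuple[int, State, Sequence[int]]]

def run_pipeline(jobs: List[Tuple[State, Sequence[int]]], round_fn: RoundFn,
                 t_max: int, stages: int = 4) -> Tuple[List[State], int]:
    """Cycle-level sketch of a pipelined, iterative hash core.

    Stage s owns iterations s*P .. s*P + P - 1 with P = t_max // stages
    (20, 16 and 20 for SHA-1, SHA-256 and SHA-512 with four stages).
    When the counters wrap, the input multiplexers load the previous
    stage's result; otherwise each round unit feeds back its own result.
    """
    assert t_max % stages == 0, "each stage must own the same number of iterations"
    per_stage = t_max // stages
    slots: List[Slot] = [None] * stages
    queue = [(i, state, w) for i, (state, w) in enumerate(jobs)]
    results: Dict[int, State] = {}
    cycles = 0
    while True:
        # Counter wrap: the last stage emits its block, all others advance.
        if slots[-1] is not None:
            results[slots[-1][0]] = slots[-1][1]
        slots = [queue.pop(0) if queue else None] + slots[:-1]
        if all(slot is None for slot in slots):
            break
        for count in range(per_stage):               # current count_i value
            for s, slot in enumerate(slots):
                if slot is not None:
                    job_id, state, w = slot
                    t = s * per_stage + count         # global iteration index
                    slots[s] = (job_id, round_fn(state, t, w), w)
            cycles += 1
    return [results[i] for i in range(len(jobs))], cycles

def run_sequential(state: State, w: Sequence[int], round_fn: RoundFn, t_max: int) -> State:
    """Reference: all t_max iterations on a single, non-pipelined round unit."""
    for t in range(t_max):
        state = round_fn(state, t, w)
    return state

def toy_round(state: State, t: int, w: Sequence[int]) -> State:
    """Hypothetical stand-in for a transformation round (not SHA-1/SHA-2)."""
    (x,) = state
    return (((x * 31) ^ w[t] ^ t) & 0xFFFFFFFF,)

JOBS = [((i,), list(range(i, i + 64))) for i in range(5)]   # five independent blocks

if __name__ == "__main__":
    outputs, cycles = run_pipeline(JOBS, toy_round, t_max=64)
    print(outputs == [run_sequential(s, w, toy_round, 64) for s, w in JOBS], cycles)
```

With five independent blocks and 16 iterations per stage, the model needs (5 + 4 − 1) × 16 = 128 cycles instead of 5 × 64 = 320 on a single round unit, which shows where the throughput gain of the 4-stage organization comes from when enough independent blocks are available.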
Beyond the base architectures, as reported in Section 2, there are numerous throughput-optimized hardware implementations for the SHA-1 and SHA-2 families in the literature. Among the best published designs in terms of throughput and throughput/area are those reported in [17,18]. There, the application of advanced optimization methodologies, which employ techniques such as loop unrolling, spatial pre-computation and resource reordering, retiming, temporal pre-computation, and circuit-level optimization, led to optimized SHA-1 and SHA-256 designs. The optimized transformation rounds proposed in those works are shown in Fig. 3. Due to the above techniques, they include two stages of computation (pre- and post-) separated by a pipeline register. Additionally, due to loop unrolling by 2, they need half as many iterations as the rounds of Fig. 1.

We chose to use the above optimized designs as inputs to the introduced design flow to produce our optimized multi-mode architectures.
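To make two of the listed techniques concrete, the Python sketch below compares the base SHA-256 iteration of Eqs. (11)–(13) with a version unrolled by 2 (32 passes instead of 64) and with one simple form of temporal pre-computation, in which $h + K_t + W_t$ for the next iteration is formed one iteration early because $h_{t+1} = g_t$. It is a behavioural illustration under our own naming (`base_round`, `run_unrolled2`, `run_precomputed`, `check`), with random words standing in for real constants and message schedules; it is not the optimized datapath of [17,18], whose pre-/post-computation split and retiming are not modelled.

```python
import random

MASK = 0xFFFFFFFF

def rotr(x, l):
    return ((x >> l) | (x << (32 - l))) & MASK

def big_sigma0(a):
    return rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)

def big_sigma1(e):
    return rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)

def ch(e, f, g):
    return (e & f) ^ (~e & g)

def maj(a, b, c):
    return (a & b) ^ (a & c) ^ (b & c)

def base_round(s, k, w):
    """One iteration of Eqs. (11)-(13)."""
    a, b, c, d, e, f, g, h = s
    t1 = (h + big_sigma1(e) + ch(e, f, g) + k + w) & MASK
    t2 = (big_sigma0(a) + maj(a, b, c)) & MASK
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)

def run_base(s, K, W):
    for t in range(len(K)):                  # one iteration per pass
        s = base_round(s, K[t], W[t])
    return s

def run_unrolled2(s, K, W):
    for t in range(0, len(K), 2):            # two iterations per pass
        s = base_round(base_round(s, K[t], W[t]), K[t + 1], W[t + 1])
    return s

def run_precomputed(s, K, W):
    n = len(K)
    p = (s[7] + K[0] + W[0]) & MASK          # h + K_0 + W_0 formed in advance
    for t in range(n):
        a, b, c, d, e, f, g, _ = s           # h itself is no longer read
        t1 = (p + big_sigma1(e) + ch(e, f, g)) & MASK
        t2 = (big_sigma0(a) + maj(a, b, c)) & MASK
        if t + 1 < n:
            # The next iteration's h is the current g, so its sum is ready now.
            p = (g + K[t + 1] + W[t + 1]) & MASK
        s = ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)
    return s

def check(seed: int) -> bool:
    """Compare the three variants on random state, K and W words."""
    rng = random.Random(seed)
    state = tuple(rng.getrandbits(32) for _ in range(8))
    K = [rng.getrandbits(32) for _ in range(64)]
    W = [rng.getrandbits(32) for _ in range(64)]
    base = run_base(state, K, W)
    return base == run_unrolled2(state, K, W) and base == run_precomputed(state, K, W)

if __name__ == "__main__":
    print(all(check(seed) for seed in range(5)))
```

In hardware, the unrolled pass trades a longer combinational path per clock for half the number of clock cycles, while the pre-computed sum removes the additions of h, $K_t$, and $W_t$ from the critical path of $T_1$.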
Fig. 1. Base transformation rounds: SHA-1 (a) and SHA-256 (b). [Figure not reproduced. Recoverable labels: (a) inputs $a_{t-1}, \ldots, e_{t-1}$, $W_{t-1}$, $K_{t-1}$; modules $f_t(x,y,z)$, ROTL5, ROTL30, and adders; outputs $a_t, \ldots, e_t$. (b) inputs $a_{t-1}, \ldots, h_{t-1}$, $W_{t-1}$, $K_{t-1}$; modules Ch(e, f, g), Maj(a, b, c), Σ0 (256), Σ1 (256), and adders; outputs $a_t, \ldots, h_t$.]

4. Proposed flow and base multi-mode designs

In this section, the proposed design flow along with the base multi-mode architectures is presented in detail. We start with a discussion of the features and properties of the hash functions on which the flow is based. Then, the design flow and the resulting architectures are presented.

In the following sub-sections, the above are presented along with a running example using the SHA-1 and SHA-256/512 base designs, so that the flow is easily understandable. The similarities between the SHA-256 and SHA-512 functions are many, so the creation of their multi-mode design is


not that complex. Hence, we choose to demonstrate the introduced flow from the point where the base SHA-256/512 design is merged with the base SHA-1 design. The construction of the corresponding optimized designs is accomplished similarly, following the introduced flow. The resulting optimized architectures are presented in Section 5.

Fig. 3. Optimized transformation rounds: SHA-1 (a) and SHA-256 (b).

4.1. Special features of SHA-1, SHA-256, and SHA-512 hash algorithms

Taking into account Eqs. (8)–(13), which describe the transformation rounds and message scheduling procedures, it is easily derived that the major features of the targeted hash functions are the similarity and the relative simplicity of the performed computations. The similarity of the computations is a general feature that holds for both families, while the simplicity of the computations results in non-complex designs, which reduces the overall complexity of the flow, as will be explained in the following.

Regarding the transformation rounds, it is easily derived that only two working variables are modified during each iteration, whereas the remaining ones are passed on unchanged. Particularly, in the SHA-1 algorithm only the variables a and c are modified (Eq. (10)), and in the SHA-256 and SHA-512 algorithms only the variables a and e are computed (Eq. (13)). This feature is important because it considerably reduces the complexity of the flow and the area of the final multi-mode architectures. Hence, the effort is mainly focused on merging a few sub-circuits, in particular those that correspond to the modified variables.

Additionally, the operations of the transformation rounds include simple modulo $2^w$ additions, simple bitwise logical computations in the non-linear functions, and rotation/shift operations with a fixed number of rotated/shifted bits. Hence, exploiting the associativity and commutativity laws of addition and the low delay


of the non-linear functions and shift/rotations, the merging of more than one function becomes more flexible. In other words, the above features allow the reuse of additions and/or non-linear functions to compute the modified variables of the other hash functions. In that way, the delay overhead of the multi-mode design is reduced and resource sharing is easily applied. The above facts are also valid for the message schedule procedures of the targeted algorithms, because they have features similar to those of the transformation rounds.

4.2. Overall structure of the systematic design flow

In Fig. 4 the overall structure of the proposed flow is illustrated. The inputs are the targeted hash algorithms and their individual designs in the form of dependency graphs (block diagrams), where each node denotes a hardware resource, whereas an edge corresponds to an interconnection among nodes or between primary I/Os and nodes. The output is a multi-mode architecture that implements a set of hash functions.

It must be mentioned that the goal of the introduced flow is not to propose techniques for developing an optimized design for each one of the targeted hash functions. We assume that the design of each hash function has already been developed, and we use it as input to the flow.

The flow consists of four main stages. In the first stage, an initialization takes place. Specifically, the number of pipeline stages of the input designs is acknowledged, so that the next stages are repeated the same number of times. Additionally, the input designs are compared, and the bigger graph in terms of the employed I/Os and computational resources is selected to be used as the base design.

In the second stage, the multi-mode transformation round of each pipeline stage is developed. To achieve this, an iterative procedure is followed considering two functions at a time. In particular, we use the design of the most complex transformation round as the base design and we try to merge the design of the second function with it. After the development of the multi-mode transformation round for the first two algorithms, the latter is used as base, and the round of a third function is chosen for merging. This procedure is repeated for all remaining functions.

Next, in the third stage, the multi-mode message schedule unit is developed, following an approach similar to that applied for producing the multi-mode transformation rounds.

Finally, in the fourth stage, the control unit is developed. Based on the developed multi-mode round and message schedule units, the required control signals are specified. Thus, the FSMs of each design are properly modified to develop the FSM of the multi-mode architecture.
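The first two stages can be illustrated with a deliberately coarse resource model. In the hypothetical Python sketch below, a round design is reduced to a count of operator types read off Eq. (9) (five operands, hence 4 adders, plus the $f_t$ block) and Eqs. (11)–(13) (4 adders for $T_1$, 1 for $T_2$, 1 for a, 1 for e). The model ignores word lengths, steering multiplexers, and the fact that $f_t$ itself contains Ch and Maj, so it only shows the logic of choosing the bigger design as base and folding the others into it; it says nothing about the area results reported in this paper.

```python
from collections import Counter
from typing import Iterable

# Operator counts read off the round equations (coarse, hypothetical abstraction).
SHA1_ROUND = Counter({"adder": 4, "f_t": 1})
SHA2_ROUND = Counter({"adder": 7, "Ch": 1, "Maj": 1, "Sigma0": 1, "Sigma1": 1})

def size(design: Counter) -> int:
    """Total number of operator instances in a design."""
    return sum(design.values())

def separate_cores(designs: Iterable[Counter]) -> Counter:
    """Baseline: one independent core per function, so resources add up."""
    total = Counter()
    for d in designs:
        total += d
    return total

def merge_pair(base: Counter, other: Counter) -> Counter:
    """Idealised sharing: only one mode is active at a time, so each operator
    type needs max(count in base, count in other) instances."""
    return base | other                      # Counter union = element-wise max

def multi_mode(designs: Iterable[Counter]) -> Counter:
    # Stage 1: the biggest design becomes the base.
    ordered = sorted(designs, key=size, reverse=True)
    merged = ordered[0]
    # Stage 2: merge the remaining designs into it, two at a time.
    for d in ordered[1:]:
        merged = merge_pair(merged, d)
    return merged

if __name__ == "__main__":
    pair = [SHA1_ROUND, SHA2_ROUND]
    print("separate:", size(separate_cores(pair)), "merged:", size(multi_mode(pair)))
```

In a real flow the shared adders would additionally need input multiplexers (the steering logic discussed below), which is why the merged count here is only a lower bound.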
Fig. 4. Overall structure of the systematic design flow. [Figure not reproduced. Stage 1: Initialization; Stage 2: Multi-mode transformation round (construction of the multi-mode paths); Stage 3: Multi-mode Wt unit (construction of the multi-mode paths); Stage 4: Multi-mode control unit.]

In the following sections, the above stages are presented in detail by a running example using the SHA-1 and SHA-256/512 algorithms. The similarities between the SHA-256 and SHA-512 functions are many, so the creation of their multi-mode design is not that complex. Hence, we choose to demonstrate the introduced flow from the point where the SHA-256/512 design is merged with the SHA-1 design.

4.3. Stage 1: initialization

The inputs of the flow are the SHA-1 and SHA-256/512 designs with 4 pipeline stages. As mentioned above, in order to clearly describe the flow, the individual designs (base designs) are derived straightforwardly from the description of the corresponding algorithms. Based on Eqs. (9)–(14) and Section 4.1, the designs of the transformation rounds can be easily derived. The corresponding graph for SHA-1 is shown in Fig. 1(a). The round's graph of SHA-256/512 is almost the same as the one shown in Fig. 1(b). As mentioned above, SHA-256 and SHA-512 are extremely similar. Thus, the multi-mode transformation round of SHA-256/512 is almost identical to that of SHA-256, but with a double-sized (64-bit) data path. Additionally, the Σ0 and Σ1 boxes include the non-linear functions of both functions. More details are given in the following sub-sections.

As shown in Fig. 4, the initialization stage includes three steps. Firstly, the input graphs are compared, and the bigger one in terms of the employed I/Os and computational resources is selected to be used as base. Obviously, the SHA-256/512 design is the more complex one; hence, it is considered as the base design. Then, the designs' number of pipeline stages is acknowledged. Hence, the whole procedure of the design flow will be repeated 4 times, once for each pipeline stage. However, it must be stressed that the introduced flow can be applied to any architecture, independently of the number of pipeline stages.
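The shared Σ boxes of the SHA-256/512 round can be sketched behaviourally as follows. This Python model is an assumption about how such a box may be organized: both XOR networks are present, SHA-256 data occupies the low 32 bits of the 64-bit path, and a 2-to-1 output multiplexer driven by the mode j selects the result. Since rotations are fixed wirings in hardware, the main extra cost of supporting the second mode in this organization would be the additional XOR gates and the multiplexer. The names `multimode_sigma`, `SIGMA0`, and `SIGMA1` are illustrative.

```python
from typing import Dict, Tuple

MASK32 = (1 << 32) - 1
Amounts = Tuple[int, int, int]

# Rotation amounts of Eqs. (4) and (5) for j = 2 (SHA-256) and j = 3 (SHA-512).
SIGMA0: Dict[int, Amounts] = {2: (2, 13, 22), 3: (28, 34, 39)}
SIGMA1: Dict[int, Amounts] = {2: (6, 11, 25), 3: (14, 18, 41)}

def rotr(x: int, l: int, w: int) -> int:
    """ROTR^l(x) on a w-bit word."""
    return ((x >> l) | (x << (w - l))) & ((1 << w) - 1)

def xor3_rotr(x: int, amounts: Amounts, w: int) -> int:
    r1, r2, r3 = amounts
    return rotr(x, r1, w) ^ rotr(x, r2, w) ^ rotr(x, r3, w)

def multimode_sigma(x: int, j: int, table: Dict[int, Amounts]) -> int:
    """Shared Sigma box on a 64-bit datapath (behavioural sketch)."""
    out_256 = xor3_rotr(x & MASK32, table[2], 32)   # SHA-256 wiring, low 32 bits
    out_512 = xor3_rotr(x, table[3], 64)            # SHA-512 wiring, full 64 bits
    return out_256 if j == 2 else out_512           # mode-driven output multiplexer

if __name__ == "__main__":
    # In SHA-256 mode the upper half of the 64-bit word is ignored.
    print(hex(multimode_sigma((0xDEADBEEF << 32) | 1, 2, SIGMA0)))  # 0x40080400
```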

4.4. Stage 2: multi-mode transformation round

After the initialization, the next stage of the flow concerns the development of the multi-mode transformation round of each


pipeline stage. This procedure is divided in three steps. First, the ignoring the assignment of primary I/Os. Then, the assignment of
development of the sub-circuits (circuit paths) required to realize the primary I/Os is performed, and finally the possible different
the computations of the transformation rounds takes place, word lengths of the involved algorithms are handled (Fig. 5).
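The word-length step rests on two facts detailed in Sub-section 4.4.3: bitwise functions such as Ch(x, y, z) treat every bit independently, and a 64-bit addition can be split into two 32-bit additions by gating the carry between the two halves with an AND gate driven by the split signal. A bit-level Python sketch of both, assuming 64-bit unsigned operands held in Python integers (the function names are illustrative, not taken from the paper):

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def split_add64(x: int, y: int, split: bool) -> int:
    """Two 32-bit adders whose carry link is gated by an AND with (not split).

    split=False: one 64-bit addition mod 2**64 (SHA-512 mode).
    split=True:  two independent 32-bit additions mod 2**32, one per
                 half-word lane (two SHA-1 or SHA-256 messages).
    """
    lo = (x & MASK32) + (y & MASK32)      # lower 32-bit adder
    carry = (lo >> 32) & int(not split)   # AND gate on the carry link
    hi = (x >> 32) + (y >> 32) + carry    # upper 32-bit adder
    return ((hi & MASK32) << 32) | (lo & MASK32)


def ch(x: int, y: int, z: int, mask: int = MASK64) -> int:
    """Ch(x, y, z) is bitwise, so one 64-bit unit acts as two 32-bit units."""
    return ((x & y) ^ (~x & z)) & mask


x = MASK64                                   # all 64 bits set
print(hex(split_add64(x, 1, split=False)))   # 0x0: carry ripples across halves
print(hex(split_add64(x, 1, split=True)))    # 0xffffffff00000000: lanes isolated
```

The only difference between the two modes is the single gated carry bit, which is why the split adds so little area to the 64-bit data-path.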

4.4.1. Circuit paths development

We define a circuit path (or simply a path) as a walk through circuit modules that starts at one primary input and terminates at one primary output. If the circuit path starting from one input uses exactly the same modules as the path starting from another input, the two paths are treated as one path.
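A minimal Python sketch of this definition, assuming the round graph is given as a directed acyclic adjacency list whose nodes are primary inputs, module instances, and primary outputs. The node names and the toy SHA-1-like graph (with an arbitrary adder order) are hypothetical. A depth-first walk from each primary input collects the modules it visits, and walks through exactly the same modules are kept once.

```python
from typing import Dict, List, Tuple

Graph = Dict[str, List[str]]  # node -> successor nodes; the round graph is a DAG


def circuit_paths(graph: Graph, inputs: List[str], outputs: List[str]) -> List[Tuple[str, ...]]:
    """Module sequences of every primary-input -> primary-output walk.

    Walks from different inputs that traverse exactly the same modules
    are kept once, mirroring the definition of a circuit path.
    """
    outs = set(outputs)
    found: List[Tuple[str, ...]] = []

    def walk(node: str, modules: Tuple[str, ...]) -> None:
        if node in outs:
            if modules not in found:
                found.append(modules)
            return
        for nxt in graph.get(node, []):
            walk(nxt, modules if nxt in outs else modules + (nxt,))

    for pi in inputs:
        walk(pi, ())
    return found


# Toy SHA-1-like round: a_next = ((((W + K) + e) + Ch(b,c,d)) + ROTL5(a))
g = {
    "W": ["add1"], "K": ["add1"], "e": ["add2"],
    "b": ["ch"], "c": ["ch"], "d": ["ch"], "a": ["rotl5"],
    "add1": ["add2"], "add2": ["add3"], "ch": ["add3"],
    "add3": ["add4"], "rotl5": ["add4"], "add4": ["a_next"],
}
for p in circuit_paths(g, ["W", "K", "e", "b", "c", "d", "a"], ["a_next"]):
    print(p)
# ('add1', 'add2', 'add3', 'add4')
# ('add2', 'add3', 'add4')
# ('ch', 'add3', 'add4')
# ('rotl5', 'add4')
```

Inputs W and K, and b, c, and d, share their walks, so seven primary inputs yield only four circuit paths.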

4.4.1.1. Multi-mode paths development. Initially, the circuit paths of the smaller graph (i.e., SHA-1) are ranked according to their contribution to delay and occupied area. Then, the first path of the list is selected and we try to merge it with the resources existing in the base design, which in our case is that of SHA-256/512. The procedure is then repeated with the next path of the list, until all paths of the smaller graph have been included in the base design.
The merging of the circuit paths is performed by examining whether an identical or similar path exists in the base graph and adding the required steering logic (e.g., multiplexers). If there is no similar path, a new circuit path is inserted in the base design together with the additional required circuitry. Two circuit paths are considered similar when there is a one-to-one correspondence between their circuit modules and interconnections; in other words, the graphs of the two paths are isomorphic. To avoid an overlong presentation, only the development of the first multi-mode transformation round is used as a running example.
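The ranking and merge decisions described above (identical path, similar path, or new path) can be sketched as follows. This is a hypothetical Python model, not the authors' tool: each path is reduced to a chain of module types, the cost values are assumed delay/area scores, and the distinction between "identical" and "similar" (isomorphic) paths is our interpretation, stated in the code.

```python
from typing import List, Sequence, Tuple

Path = Tuple[str, ...]  # module types along one chain-shaped circuit path


def classify(path: Path, base_paths: Sequence[Path]) -> Tuple[str, List[int]]:
    """Return ('identical' | 'similar' | 'new', indices of matching base paths).

    Assumption: 'identical' = same module types in the same order;
    'similar' = same chain topology (equal length) with differing module
    types, i.e. isomorphic once module labels are ignored.
    """
    identical = [i for i, bp in enumerate(base_paths) if bp == path]
    if identical:
        return "identical", identical
    similar = [i for i, bp in enumerate(base_paths) if len(bp) == len(path)]
    if similar:
        return "similar", similar
    return "new", []


def merge_paths(small: Sequence[Tuple[Path, float]], base_paths: List[Path]):
    """Rank the smaller graph's paths by cost (delay/area), then merge each."""
    plan = []
    for path, _cost in sorted(small, key=lambda pc: pc[1], reverse=True):
        kind, matches = classify(path, base_paths)
        if kind == "new":                  # no match: insert a new path
            base_paths.append(path)
            matches = [len(base_paths) - 1]
        # each match is one candidate multi-path; its steering logic is
        # chosen later, during resource sharing and I/Os assignment
        plan.append((path, kind, matches))
    return plan


base = [("ch", "add", "add", "add"), ("ch", "add", "add", "add"), ("sigma0", "add", "add")]
small = [(("rotl5", "add"), 1.0), (("ch", "add", "add", "add"), 4.0), (("par", "add", "add"), 3.0)]
for step in merge_paths(small, list(base)):
    print(step)
# (('ch', 'add', 'add', 'add'), 'identical', [0, 1])
# (('par', 'add', 'add'), 'similar', [2])
# (('rotl5', 'add'), 'new', [3])
```

In this toy run the highest-cost path matches two identical base chains and therefore yields two candidate multi-paths, the same situation as the first SHA-1 path in the running example.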
Studying the two input transformation rounds, it is easily derived that the first selected path of SHA-1 is the one shown in Fig. 6(a), whereas the graph of SHA-256/512 (base graph) contains two identical paths, shown in Fig. 6(b) and (c); the split signal is omitted for clarity. Thus, there are two candidate multi-paths for merging the first selected path of SHA-1 with the current base design. Because the assignment of the inputs and outputs of the candidate paths is ignored during the first sub-stage of the flow, the two candidate multi-paths are shown in Fig. 7 using general names for the I/Os. Note that in Fig. 6(a) Ch(d, c, b) is used as the non-linear function, because it is the function executed in the first 20 iterations, which are assigned to the first pipeline stage.
Then, the second path of SHA-1 (Fig. 8(a)) is selected from the list. Studying the graph of SHA-256/512, we again find an identical path (Fig. 8(b)); combining it with the currently selected SHA-1 path produces a third candidate multi-path, shown in Fig. 8(c). Although this procedure is repeated for all paths of the SHA-1 graph, the above paths suffice to present the next steps of the flow. The next step is to examine the generated multi-paths against the paths existing in the base design, so as to perform resource sharing while keeping the delay penalty low.

4.4.1.2. Resource sharing with the base design. The two selected paths of the SHA-1 graph (Figs. 6(a) and 8(a)) are used together to compute the variable a_SHA-1, as shown in Fig. 1. In addition, they are used to produce the three candidate multi-paths shown in Figs. 7 and 8(c). Specifically, the first path of the SHA-1 graph (Fig. 6(a)) is included in the first two multi-paths (Fig. 7), whereas the second path (Fig. 8(a)) is included in the third candidate multi-path (Fig. 8(c)).

Therefore, the computation of a_SHA-1 can be implemented by combining either the first and third multi-paths or the second and third multi-paths. To keep the delay and area overheads low, the selected pair of multi-paths has to be properly merged with the resources existing in the base design (the SHA-256/512 transformation round).

Fig. 5. Stage 2: multi-mode transformation round (flowchart of the three sub-stages: circuit paths development, I/Os assignment, and word-length handling).
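Because the SHA-1 update of a (Eq. (9)) is a sum of five 32-bit terms, any chain of four serially connected 2-input modular adders computes it, whatever the order in which the terms enter the chain. The Python sketch below checks this for the first SHA-1 round of the standard one-block message "abc"; the helper names are ours, and the chain models a generic five-input, one-output adder structure rather than the exact SHA-256/512 netlist.

```python
from itertools import permutations

MASK32 = 0xFFFFFFFF


def rotl(x: int, n: int) -> int:
    return ((x << n) | (x >> (32 - n))) & MASK32


def ch(x: int, y: int, z: int) -> int:
    return ((x & y) ^ (~x & z)) & MASK32


def adder_chain(operands) -> int:
    """Four serially connected 2-input adders: ((((o0+o1)+o2)+o3)+o4) mod 2**32."""
    acc = operands[0]
    for op in operands[1:]:
        acc = (acc + op) & MASK32
    return acc


# First SHA-1 round for the one-block message "abc" (FIPS 180 example values).
a, b, c, d, e = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0
w0, k0 = 0x61626380, 0x5A827999
terms = [rotl(a, 5), ch(b, c, d), e, w0, k0]

print(hex(adder_chain(terms)))  # 0x116fc33
# Every assignment of the five terms to the chain's operand slots agrees,
# because modular addition is associative and commutative.
print(len({adder_chain(list(p)) for p in permutations(terms)}))  # 1
```

All 120 operand orders give the same word, which is what allows the SHA-1 terms to be routed into whichever adder inputs of the base design are most convenient.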

Fig. 6. The first selected path of the SHA-1 graph (a), and the identical paths on the SHA-256/512 graph (b) and (c).

Fig. 7. The first (a) and the second (b) candidate multi-paths.

Based on Eq. (9), the variable a_SHA-1 is computed by adding five terms, namely ROTL5(a), Ch(b, c, d), e, W, and K. Studying the graph of the SHA-256/512 transformation round, it is easily seen that it contains two circuit paths consisting of four serially interconnected adders, which constitute its two equivalent critical paths. A circuit structure of four serially interconnected adders has five inputs and one output. Thus, either of these circuit structures of the SHA-256/512 transformation round can be used to implement the computation of a_SHA-1.

In effect, we exploit the associativity and commutativity of addition to change the order of the performed additions, and the variables used in each addition, without altering the computation of a_SHA-1. In this way, the resources existing in the SHA-256/512 graph are reused and the area overhead is kept low. This is one of the features of the hash functions mentioned in Sub-section 4.1 that is exploited in our design flow.

To avoid increasing the critical path of the base design, the circuit paths of four interconnected adders of the SHA-256/512 round should remain intact. To achieve this, the pairs of multi-paths discussed above have to be properly combined. This is accomplished by connecting the output O^3_1 to the input I^1_1 or to the input I^2_1. Moreover, the adder encircled in Fig. 7 has to be bypassed when the SHA-1 function is executed. This is easily accomplished with a multiplexer, as shown in Fig. 9, which illustrates the candidate multi-paths after resource sharing with the base design.

Although the above procedure seems complex in the general case, this does not hold for hash functions. As mentioned in Sub-section 4.1, a basic feature of these functions is that only a small number of working variables (e.g., two variables for SHA-1, SHA-256, and SHA-512) are computed in the transformation rounds, using simple computational structures. Thus, the graphs of the transformation rounds are not complex and only a few paths have to be examined. Both properties reduce the complexity and make the above procedure practical.

4.4.2. I/Os assignment

The goal of this stage is to assign (place) the variables of the second algorithm in the positions of the input register, avoiding extra steering logic and the corresponding area and delay overheads.

Since each multi-path is used to execute a number of transformation rounds of the targeted hash functions, the output values produced after the execution of one round are fed back to the inputs of the multi-mode round unit through registers (Fig. 2). As the base design is that of the SHA-256/512 function, which uses eight 64-bit variables, an 8 × 64-bit register is used for this purpose. Following the design of the SHA-256/512 transformation round (see Fig. 1(b)), the variable a is placed in the rightmost position of the input register, variable b in the next position to its left, and so on.

Initially, the produced multi-paths (Fig. 9) are examined to find whether there are inputs or outputs that can be placed in the same positions of the input register as those of the base design. If so, these I/Os are placed in the corresponding positions. Studying the first two candidate multi-paths (see Fig. 9), it is easily seen that the output O^2_1 corresponds to the computed variables a_SHA-1 and a_SHA-256/512; thus, these variables are stored in the same position of the register, namely the rightmost one.

Then, it is examined whether there are variables that can be placed in other positions of the register and remain intact throughout the execution of the targeted algorithms. In other words, since the multi-mode round unit executes one hash algorithm at a time, we try to perform variable renaming by placing its variables in proper

positions of the input register. Considering the first candidate multi-path (Fig. 9(a)), we can assign the variables g, f, and e to the inputs I^1_2, I^1_3, and I^1_4 when the SHA-512 algorithm is executed, and place these variables in the corresponding register positions. We also assign the variables d_SHA-1, c_SHA-1, and b_SHA-1 to the inputs I^1_2, I^1_3, and I^1_4 when SHA-1 is executed. Thus, the same registers store either the variables g_SHA-256/512, f_SHA-256/512, and e_SHA-256/512 or the variables d_SHA-1, c_SHA-1, and b_SHA-1, with no conflict. Based on the above, the new multi-paths with I/Os are shown in Fig. 10.

Adopting the above assignment of the I/Os, a de-multiplexer is needed for the output O^1_1 of the first multi-path, because this output produces either the variable e_SHA-256/512 or the variable a_SHA-1, depending on the executed algorithm. However, the register position used for storing e_SHA-256/512 is also used for storing b_SHA-1. Thus, a de-multiplexer is required to resolve this conflict and store a_SHA-1 in the proper register position, as shown in Fig. 10. However, the de-multiplexer lies on the critical path, which increases the delay of the produced design, and it also increases the total area. Thus, of the two multi-paths, the second one is selected and used in the final design.

The above procedure is repeated for the third multi-path (Fig. 8(c)), and its I/Os assignment is performed. Combining the produced path with that of Fig. 10(b) yields the implementation of the first two selected paths of SHA-1, shown in Fig. 11.

Fig. 8. The second selected path of the SHA-1 graph (a), the identical path on the SHA-256/512 graph (b), and the third candidate multi-path (c).

Fig. 9. The first (a) and the second (b) candidate multi-paths after performing resource sharing with the base design.

4.4.3. Word-length handling

Finally, the last step of the development of the multi-mode transformation round is the handling of the different word lengths used by each algorithm. Specifically, the SHA-256/512 function uses 64-bit words, whereas SHA-1 uses 32-bit words. To handle this while reusing the base design (SHA-256/512), a ratio r is specified as the quotient of the two word lengths, and r instances of the resources of the algorithm with the smaller word length are employed.

In our case the ratio r equals 64/32 = 2; therefore, the Ch(x, y, z) node of Fig. 13 includes two 32-bit units that perform the Ch(x, y, z) non-linear function either on two 32-bit data words or once on a 64-bit data word. This is possible because the corresponding computations are bitwise. The addition node likewise contains two 32-bit modular adders, where the carry-out of the first adder is connected to the carry-in of the second through simple logic (an AND gate) driven by a control signal. In this way, either two 32-bit additions (required by SHA-1 and SHA-256) or one 64-bit addition (required by SHA-512) can be executed. This selection is controlled by the split signal. As a result, the produced design can perform SHA-512 hashing on one message, or SHA-1 (or SHA-256) hashing on two different messages.

4.4.4. Produced multi-mode transformation rounds

Up to now, the merging of two paths of the SHA-1 transformation round with the design of the SHA-256/512 transformation round has been described. The same procedure is repeated for the remaining paths of the SHA-1 transformation round. After all paths have been merged, the multi-mode transformation round used in

the first pipeline stage of the architecture of Fig. 2 is produced; it is depicted in Fig. 12. Repeating the above procedure, the multi-mode transformation rounds for the second, third, and fourth pipeline stages are produced; they are illustrated in Fig. 13.

Fig. 10. The first (a) and second (b) multi-paths after I/Os assignment.

Fig. 11. Merged SHA-1/256/512 multi-paths with resource sharing and I/Os assignment.

Fig. 12. Multi-mode transformation round of the first pipeline stage.

It should be mentioned that the multi-mode transformation rounds of the second and fourth pipeline stages are identical. This is because, in these pipeline stages, the assigned transformation rounds of the SHA-1 function are identical. Additionally, since the Parity(x,y,z) non-linear function is used in these

rounds, a Par module is included in the corresponding multi-mode round, and its outputs are selected only when the SHA-1 function is chosen (circles in Fig. 13(a)). Finally, in the third multi-mode round, both the SHA-1 and SHA-256/512 designs use the Maj(x,y,z) non-linear function. However, its inputs are different in each design. To overcome this, a 6-to-3 multiplexer is added before the existing Maj of SHA-256/512 so as to feed it with the appropriate values, depending on the function (circle in Fig. 13(b)).

Fig. 13. Multi-mode transformation rounds: (a) second and fourth pipeline stages, and (b) third pipeline stage.

4.5. Stage 3: multi-mode message scheduling units

After the production of the multi-mode transformation round unit of each pipeline stage, the development of the multi-mode Wt units takes place. Specifically, each pipeline stage of the architecture of Fig. 2 has a Wt unit that feeds the required Wt values to the multi-mode transformation round. Again, a procedure similar to the above is followed.

The corresponding units of the sole SHA-1, SHA-256, and SHA-512 designs are derived easily. Each of them consists of a shift register, a 2-to-1 multiplexer, and a logic unit that computes the Wt values according to Eq. (8) (circle in Fig. 14). Concerning the operation of the SHA-512 Wt unit, when a 1024-bit message block is inserted, the first 16 Wt values are produced instantly by a simple split in the Block Split Unit (see Fig. 2) and fed into the shift register of the first Wt unit. During the first 16 iterations of the SHA-512 transformation round, the values W16–W31 are computed (one per clock cycle) and stored in the shift register via its serial input. Since each pipeline stage executes 20 iterations of the transformation round, the values W16–W19 are consumed in the next 4 iterations, during which the values W32–W35 are computed and inserted into the shift register. Thus, when the first pipeline stage finishes its computations, the shift register contains the values W20–W35, which are transferred (through a parallel load) to the shift register of the second Wt unit. The same procedure takes place at the second and third pipeline stages. The load/shift control signal is the signal tcwicnt_i produced by the counter unit (see Fig. 2).

The development of the multi-mode scheduling unit for the SHA-256/512 architecture is not complex. The computations of the Wt values (Wnext logic) differ slightly (regarding the non-linear functions), whereas the shift register and the multiplexer exist in both designs, differing only in word length. Thus, the shift register and the multiplexer can be reused in the multi-mode design, whereas the Wnext logic module of the multi-mode implementation includes the separate circuits of the two Wnext logic modules. The same holds when the resulting SHA-256/512 unit is compared with the corresponding unit of the SHA-1 architecture. Hence, since the SHA-256/512 multi-mode Wt unit is the more complex, it is used as the base (initial) design in the flow and is merged with the design of SHA-1. The resulting Wt unit is shown in Fig. 14.

As shown in this figure, the different word lengths of the SHA-1, SHA-256, and SHA-512 algorithms are handled with an approach similar to that followed in the development of the multi-mode transformation round. Specifically, the units of the function with the smallest word length are duplicated r times (r = 2) in the final design. To support multi-mode operation, steering logic is also included in the final design. The 1024-bit input is treated as one 1024-bit message block when the SHA-512 algorithm is executed, or as two 512-bit message blocks b1 and b2 when the SHA-1 or SHA-256 function is executed. Thus, two message blocks can be hashed simultaneously with the SHA-1 or SHA-256 function.

The role of the split signal is fully shown in Fig. 14. When SHA-256 is selected, it splits each 64-bit adder in half so as to perform two 32-bit additions. This is accomplished by gating the carry-out of the 32nd bit with an AND gate, the same idea applied earlier to the addition stages of the transformation rounds.

To perform two hash computations on two different message blocks at the same time, the two 512-bit blocks are not

inserted sequentially (i.e., first the message block of the first message and then that of the second one), because this would make concurrent hash computation impossible. Instead, the 16 message-dependent W values of the two blocks are interleaved. Specifically, each of the 16 64-bit positions of the shift register stores two 32-bit W values: the W value of the first message block in the 32 most significant bits, and the corresponding W value of the second message block in the remaining (least significant) 32 bits. Hence, at each clock cycle, two W values are fed into the transformation round, one from each message block. Consequently, as can be observed in Fig. 14, the remaining W values are calculated in pairs by the Wnext Logic block (one for the first hash computation and one for the second), so that they are stored in the same manner as described above.

Fig. 14. Multi-mode Wt computation unit of the SHA-1/256/512 architecture.

4.6. Stage 4: multi-mode control unit

As reported above, the controllers of the initial hash designs are simple Finite State Machines (one FSM per design), implemented by counters. Specifically, the control logic consists of four counters, count_i (i = 1, 2, 3, 4), one for each pipeline stage. The output of each counter is used for addressing the local memory of its stage, whereas the next pipeline stage is activated via the signals tround_i and tcwicnt_i, which are generated when the previous pipeline stage finishes its computations.

The control logic of the final multi-mode hash design is constructed by taking the control unit with the greatest number of states and modifying it so that it can control all the incorporated hash function algorithms. Beyond that, the resulting FSM, along with the 4 counters, is designed to provide on-the-fly functionality. Specifically, each of the 4 pipeline stages (including its transformation round, its W computation unit, and the corresponding counter module) can realize a different hash computation, and the stages operate concurrently. For example, the first stage can perform a SHA-512 computation while, at the same time, the second one performs 2 concurrent SHA-1 or SHA-256 computations. In this way, the pipeline structure, together with the multi-mode principle, is fully exploited. This is achieved by specific flag bits that indicate the current, previous, and forthcoming hash functions. By exploiting these flag bits, the resulting FSM handles the cases in which a stage has to stall its hash computation until the next stage, which needs more iterations, finalizes its computation.

5. Proposed optimized multi-mode designs

Following the above design flow, and taking as inputs the optimized designs of the hash functions, we constructed the optimized multi-mode architectures. We constructed two optimized multi-mode architectures, namely the SHA-256/512 and the SHA-1/256/512 ones. However, to avoid an

overly long text, in this section we present only the more complex SHA-1/256/512 architecture.

5.1. Multi-mode transformation rounds

As the design flow imposes, we first constructed the multi-mode transformation rounds. The input designs were pipelined in 4 stages; hence, the flow was repeated 4 times, once for each pipeline stage. The resulting multi-mode rounds are depicted in Figs. 15–17. Similarly to the sole optimized architectures, each of the multi-mode rounds includes a Pre-Computation and a Post-Computation stage, separated by a register. It must be stressed that in all the above figures the split signal (responsible for choosing between 2 × 32-bit and 64-bit operations) is omitted from the detailed presentation of the rounds for clarity. However, it exists, and its role is the same as described in Section 3.

Fig. 15. First multi-mode round of the optimized SHA-1/256/512 architecture.

Fig. 16. Third multi-mode round of the optimized SHA-1/256/512 architecture.

A key point of the resulting optimized multi-mode transformation rounds is the splitting of the Carry-Save Adders (CSAs) of the second and fourth rounds (Fig. 17). This happens during the "resource-sharing" sub-stage of the flow and is accomplished by

splitting the carry-save tree from the final addition of the CSA, so that the carry-save tree can be used as the Par non-linear function when the SHA-1 function is selected. This choice slightly affects the overall round delay. However, it increases the area gains, because (a) no extra logic is added for the non-linear function, and (b) otherwise, the output of that path would have to be assigned to a different output register (depending on the selected function) during the "I/O assignment" step of the flow, leading to more steering logic, which increases not only the area but the delay as well. For the same reasons, an extra Maj box is included in the third multi-mode round (Fig. 16).

Fig. 17. Second and fourth multi-mode rounds of the optimized SHA-1/256/512 architecture.

5.2. Multi-mode message scheduling units

Similarly to the procedure followed in the construction of the multi-mode base designs, the multi-mode Wt units are constructed after the multi-mode transformation rounds. Due to the parametric loop unrolling exploited in the initial optimized designs, two W values are fed per clock cycle into a transformation round. Thus, the initial optimized Wt units produce two W values at the same time, and the corresponding Wnext logic is doubled.

This fact does not affect the application of the introduced design flow. Following the stages reported in Section 4, we constructed a Wt unit able to process either two independent 512-bit message blocks (when SHA-1 or SHA-256 is selected) or one 1024-bit message block (when SHA-512 is selected). In any case, the unit (Fig. 18) produces two 64-bit words per clock cycle. If the SHA-512 function is selected, each 64-bit word is one Wt value. If the SHA-1 or the SHA-256 function is selected, each 64-bit word contains two 32-bit Wt values, one for each of the two independent 512-bit message blocks.

5.3. Multi-mode control units

Finally, the control logic for the final multi-mode hash design is constructed. Similarly to the base designs, the control unit with the greatest number of states is considered and modified so that it can control all the incorporated hash function algorithms.

6. Experimental results

The introduced SHA-256/512 and SHA-1/256/512 multi-mode architectures were captured in VHDL and implemented in a wide range of Xilinx FPGAs. Specifically, six Xilinx families are considered: three older ones, Virtex (xcv1000-6FG680), Virtex-II (xc2v6000-6FF1517), and Virtex-4 (xc4vlx100-12FF1148), and three modern ones, Virtex-5 (xc5vlx155t-3FF1136), Virtex-6 (xc6vlx240t-3FF784), and Virtex-7 (xc7v855t-3FFG1157). The Virtex and Virtex-II technologies were selected only for comparison with existing similar works.

The XST synthesis tool of the Xilinx ISE Design Suite (version 13.1) was used for mapping the designs to the FPGA devices. The functionality of the implementations was initially verified via post-place-and-route simulations using Mentor Graphics' ModelSim simulator, with a large set of test vectors in addition to those provided by the standard. Additional functional and timing verification was performed by downloading the designs to development boards.

The quality of the introduced architectures was measured in terms of frequency, throughput, area, and the throughput/area cost factor. The throughput of each hash function is calculated as

$$\mathrm{Throughput} = \frac{\#\mathrm{bits} \times f}{\#\mathrm{cycles}} \qquad (16)$$

where #bits is the number of bits of the incoming message block (512 bits for SHA-1 and SHA-256, and 1024 bits for SHA-512), f is the operating frequency, and #cycles is the number of clock cycles spent in each case (20/10 for SHA-1 and SHA-512, and 16/8 for SHA-256, for the base/optimized architectures, respectively).

As mentioned in Section 4, there is a word-length mismatch between the SHA-1, SHA-256, and SHA-512 functions. Specifically, the SHA-512 function operates on 64-bit data, while the SHA-1

and SHA-256 ones operate on 32 bits. This mismatch is turned to the advantage of the proposed multi-mode architectures by merging two SHA-1 (or SHA-256, depending on the selection) operations into the data buses of SHA-512. This is possible because the majority of the operations included in the hash functions (Section 4.1) are bitwise; where this does not hold, appropriate modifications are made (see Sub-sections 4.4 and 4.5). As a result, two independent message blocks can be processed when either the SHA-1 or the SHA-256 function is selected. Hence, for the proposed multi-mode architectures, the throughput of the SHA-1 and SHA-256 functions is calculated by doubling the value given by Eq. (16).

Below, the experimental results are presented in detail. First, we present and discuss the results for the proposed architectures in terms of throughput and area; then, a comparison between the designs produced by the proposed flow and the designs produced by the ISE design suite is presented; finally, comparisons with existing similar works are offered.

Fig. 18. Multi-mode Wt computation unit of the optimized SHA-1/256/512 architecture.

Table 2
Frequency of the base and optimized architectures.

Tech.      SHA-1   SHA-256   SHA-512   Proposed SHA-256/512   Proposed SHA-1/256/512

Base architectures – frequency (MHz)
Virtex-4   120.7   108.7     93.2      90.1                   77.9
Virtex-5   154.6   138.7     118.9     115.8                  98.3
Virtex-6   161.6   143.6     123       118.7                  99.8
Virtex-7   194     169       144.9     139.3                  115.6

Optimized architectures – frequency (MHz)
Virtex-4   157.2   130.1     119.6     103.1                  90
Virtex-5   207.1   169.1     151.9     132.2                  116.3
Virtex-6   217.4   172       165.3     141.6                  124.7
Virtex-7   259.3   204       189.7     159                    139.6

6.1. Evaluation of the proposed architectures

Table 2 presents the frequency of the individual and multi-mode designs in representative Xilinx technologies. The second, third, and fourth columns correspond to the separate designs of the SHA-1, SHA-256, and SHA-512 functions, respectively, whereas the last two columns correspond to the multi-mode SHA-256/512 and SHA-1/256/512 architectures. As expected, the following observations can be made:

• The frequency of the multi-mode architectures is lower than that of the separate designs, due to the additional logic inserted in the multi-mode architectures.

Fig. 19. Area comparisons: (a) base designs and (b) optimized designs.

 The optimized multi-mode designs outperform the corresponding Figs. 20 and 21 illustrate this comparison for the proposed
base ones. This means that although the separate optimized SHA-256/512 and SHA-1/256/512 multi-mode architectures.
designs, which are used as input in the proposed flow, are more Concerning the base SHA-256/512 multi-mode design (Fig. 20(a))
complex than the base ones, the flow works efficiently without in Virtex-6 technology, the throughput of the SHA-256 function
inserting large logic that could make the optimized multi-mode (7.6 Gbps) is improved by 65.3% over the individual implementation
designs worst compared to the base ones. of this hash function (4.6 Gbps). This happens because in the multi-
 The frequency is improved substantially when modern tech- mode architecture two SHA-256 message blocks are processed
nologies are used. concurrently. Regarding the throughput of SHA-512 function in the
base SHA-256/512 multi-mode design, it equals to 6.1 Gbps, while
However, to evaluate the quality of the produced multi-mode the SHA-512 one achieves 6.3 Gbps, which corresponds to a 3.5%
architectures the area and throughput factors must be studied. reduction. However, as it was mentioned above, the area of the
Fig. 19 depicts the area of the base and optimized designs. base multi-mode SHA-256/512 design is improved by 34.9%. Con-
For each FPGA family, the first three bars corresponds to the area sequently, the proposed multi-mode SHA-256/512 architecture
of the individual (single-mode) designs, the fourth and fifth bars improves the area factor by 34.9%, while the throughput of
correspond to the sum of the area of the considered sole designs, SHA-256 function is improved by 65.3% and the throughput of the
whereas the last two bars correspond to the area of the proposed SHA-512 function is slightly reduced by 3.5%.
SHA-256/512 and SHA-1/256/512 multi-mode architectures. For the case of the optimized SHA-256/512 multi-mode design
Studying the Virtex-6 implementations (Fig. 19), it is derived in Virtex-6 technology, the area is improved by 36.9%, as men-
that the area of the base SHA-1/256/512 architecture (3787 slices) tioned above. Also, the throughput of the SHA-256 function is
is reduced by 28.6% compared to the total area of the three improved by 64.7%, whereas the throughput of the SHA-512
separate designs (5302 slices). Concerning the optimized designs, function is reduced by 14.3%. The larger throughput reduction of
the area of the SHA-1/256/512 multi-mode architecture is SHA-512 compared to the throughput reduction of the base
4129 slices, whereas the total area of the separate designs is designs happens because the optimized designs are more complex
6177 slices, which corresponds to 33.2% area reduction. Also, the than the base ones. Thus, more extra logics are required to develop
area of the base SHA-256/512 multi-mode architecture (3179 the multi-mode architecture, which results in more frequency
slices) is reduced by 34.9% compared to the total area of the SHA- degradation. Mentioning that the SHA-512 design was used as the
256 and SHA-512 designs (4290 slices), while in the optimized initial design by the proposed flow. Studying the implementation
SHA-256/512 design the area reduction is 36.9%. Similar conclu- results of SHA-256/512 multi-mode architecture in the other FPGA
sions are derived regarding the implementations in the other families, similar outcomes are also derived.
families. Concluding, the major outcome is that in most of the cases the
Thus, the first outcome is that a remarkable resource sharing is proposed flow and multi-mode implementations achieve an area
achieved and the area of the produced multi-mode architectures is reduction of more than 30% compared to the total area of the
reduced by 30% about compared to the total area of the individual individual design. Also, the throughput of the SHA-1 function is
designs. improved up to 29% (base SHA-1/256/512 in Virtex-4), where for
However, as it is shown in Table 2 the frequency of the multi- the SHA-256 function the throughput improvement is up to 67%
mode architectures is lower than the frequency of the individual (base SHA256/512 in Virtex-5) and up to 45% (optimized SHA-1/
designs, thus a comparison in terms of throughput is required. 256/512 in Virtex-6). Finally, the throughput of the SHA-512

Please cite this article as: H.E. Michail, et al., On the development of high-throughput and area-efficient multi-mode cryptographic hash
designs in FPGAs, INTEGRATION, the VLSI journal (2014), http://dx.doi.org/10.1016/j.vlsi.2014.02.004i
H.E. Michail et al. / INTEGRATION, the VLSI journal ∎ (∎∎∎∎) ∎∎∎–∎∎∎ 17

Fig. 20. Multi-mode SHA-256/512 architecture – throughput comparisons: (a) base designs, and (b) optimized designs.

Fig. 21. Multi-mode SHA-1/256/512 architecture – throughput comparisons: (a) base designs, and (b) optimized designs.

function is reduced from 3% (base SHA-256/512 in Virtex-4) up to 6.2. Comparison with commercial synthesis tool (Xilinx ISE)
27%(optimized SHA-1/256/512 in Virtex-7). Though, there is a
throughput reduction for the SHA-512 function, the final SHA-1/ A major question that arises regarding the efficiency of the
256/512 is capable to support both the three hash functions with proposed flow and the produced multi-mode designs is whether it
one low-area design. is possible to produce better multi-mode designs if the individual

Please cite this article as: H.E. Michail, et al., On the development of high-throughput and area-efficient multi-mode cryptographic hash
designs in FPGAs, INTEGRATION, the VLSI journal (2014), http://dx.doi.org/10.1016/j.vlsi.2014.02.004i
18 H.E. Michail et al. / INTEGRATION, the VLSI journal ∎ (∎∎∎∎) ∎∎∎–∎∎∎

designs are fed to a commercial synthesis tool, allowing it to perform resource sharing and delay optimization to develop a multi-mode architecture.

For that reason, a top-level design was developed in VHDL that includes the hardware modules of the SHA-1, SHA-256, and SHA-512 functions as separate modules, with the proper output selected via a multiplexer. This top-level design was fed to the ISE synthesis tool with the area and delay optimization efforts set to high and normal, respectively. Tables 3–6 present the corresponding results and comparisons in terms of frequency and throughput for the base and optimized multi-mode designs. These tables also provide the frequency and throughput values of the proposed architectures.

Table 6
Proposed and ISE optimized SHA-1/256/512 designs.

Techn.     Metric     Proposed SHA-1/256/512        ISE SHA-1/256/512
                      SHA-1   SHA-256   SHA-512     SHA-1   SHA-256   SHA-512
Virtex-4   F (MHz)    90                            117.9
           T (Gbps)   9.2     11.5      9.2         6.      7.6       12.0
Virtex-5   F (MHz)    116.3                         119.8
           T (Gbps)   11.9    14.9      11.9        6.1     7.7       12.3
Virtex-6   F (MHz)    124.7                         128.1
           T (Gbps)   12.8    15.9      12.9        6.69    8.2       13.1
Virtex-7   F (MHz)    139.6                         147.6
           T (Gbps)   14.3    17.9      14.3        7.6     9.5       15.1
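The throughput rows of Table 6 can be cross-checked with a short calculation. The Python sketch below makes several assumptions:
• Throughput is modeled as (blocks processed side by side) × (block size in bits) × F / (clock cycles per block).
• The cycle count is taken as the FIPS 180 round count divided by a reduction factor. The factor 8 (parameter `speedup`) is our own back-calculation that reproduces the proposed rows of Table 6; it is not stated in the text, and the exact form of Eq. (16) may differ.
• The two lanes for SHA-1 and SHA-256 model the doubling described earlier in this section.
The names `HashMode` and `throughput_gbps` are hypothetical.

```python
from dataclasses import dataclass

# Assumed throughput model (ours; the exact form of Eq. (16) may differ):
#   T = lanes * block_bits * F / cycles_per_block,  cycles_per_block = rounds / speedup


@dataclass(frozen=True)
class HashMode:
    name: str
    block_bits: int  # message block size (FIPS 180)
    rounds: int      # iterations of the compression function (FIPS 180)
    lanes: int       # blocks processed side by side in the multi-mode datapath


def throughput_gbps(mode: HashMode, f_mhz: float, speedup: int) -> float:
    """Throughput in Gbit/s at clock frequency f_mhz.

    speedup divides the round count to give the effective cycles per block;
    8 is a back-calculated value that reproduces the proposed rows of Table 6.
    """
    cycles_per_block = mode.rounds / speedup
    bits_per_second = mode.lanes * mode.block_bits * f_mhz * 1e6 / cycles_per_block
    return bits_per_second / 1e9


# SHA-1 and SHA-256 use two 32-bit lanes inside the 64-bit SHA-512 datapath.
MODES = (
    HashMode("SHA-1", 512, 80, 2),
    HashMode("SHA-256", 512, 64, 2),
    HashMode("SHA-512", 1024, 80, 1),
)

if __name__ == "__main__":
    # Proposed optimized SHA-1/256/512 design on Virtex-7, F = 139.6 MHz (Table 6)
    for m in MODES:
        print(f"{m.name}: {throughput_gbps(m, 139.6, speedup=8):.1f} Gbps")
```

Run as a script, it prints 14.3, 17.9, and 14.3 Gbps, which is the proposed Virtex-7 row of Table 6. Other rows may differ from the table in the last digit because of rounding. Setting `lanes=1` for SHA-1 and SHA-256 halves their values, which corresponds to processing a single message at a time.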
As shown, the frequencies of the designs produced by ISE are higher than those of the proposed architectures. This gives ISE better SHA-512 throughput in both the SHA-256/512 and SHA-1/256/512 multi-mode architectures. However, the proposed architectures outperform the corresponding ISE ones for the SHA-1 and SHA-256 functions. This is because the proposed designs can process two input messages concurrently, while the ISE architecture operates on one message.

Table 3
Proposed and ISE base SHA-256/512 designs.

Techn.     Metric     Proposed SHA-256/512     ISE SHA-256/512
                      SHA-256   SHA-512        SHA-256   SHA-512
Virtex-4   F (MHz)    90.1                     92.8
           T (Gbps)   5.8       4.6            3.0       4.7
Virtex-5   F (MHz)    115.8                    118.1
           T (Gbps)   7.4       5.9            3.8       6.1
Virtex-6   F (MHz)    118.7                    122.2
           T (Gbps)   7.6       6.1            3.9       6.3
Virtex-7   F (MHz)    139.3                    144
           T (Gbps)   8.9       7.2            4.6       7.4

Table 4
Proposed and ISE base SHA-1/256/512 designs.

Techn.     Metric     Proposed SHA-1/256/512        ISE SHA-1/256/512
                      SHA-1   SHA-256   SHA-512     SHA-1   SHA-256   SHA-512
Virtex-4   F (MHz)    77.9                          92.4
           T (Gbps)   4.0     5.0       4.0         2.4     2.9       4.7
Virtex-5   F (MHz)    98.3                          117.6
           T (Gbps)   5.0     6.3       5.1         3.0     3.8       6.0
Virtex-6   F (MHz)    99.8                          121.7
           T (Gbps)   5.1     6.4       5.2         3.1     3.9       6.2
Virtex-7   F (MHz)    115.6                         143.6
           T (Gbps)   5.9     7.4       6.1         3.7     4.6       7.3

For a fairer comparison, the occupied area and the throughput/area factor must be studied. Fig. 22 depicts the area comparisons between the proposed multi-mode implementations and those produced by the ISE tool. As illustrated, the proposed implementations occupy less area than the ISE ones. Specifically, in Virtex-6 technology, the introduced base SHA-1/256/512 module requires 3787 slices, whereas the corresponding multi-mode module derived by ISE requires 5576 slices, which corresponds to a 47.2% area reduction. For the corresponding optimized designs in the same technology, the area reduction achieved by the proposed architectures is 45%. On average, compared to the base SHA-256/512 and SHA-1/256/512 implementations produced by the ISE tool, the area of the corresponding proposed architectures is lower by 40% and 46%, respectively. For the optimized SHA-256/512 and SHA-1/256/512 architectures, the area savings are 40% and 47% on average, respectively.

Thus, the designs of the ISE tool outperform the proposed ones in terms of frequency, but they demand larger area. Therefore, for a more in-depth and fair comparison, the throughput/area factor must be studied.

Figs. 23 and 24 depict the throughput/area values of the proposed multi-mode designs and of those produced by the ISE tool. In these figures, the throughput/area values are presented per FPGA family for each function incorporated in the multi-mode architecture, in groups of two (ISE – proposed).

Consider the base SHA-1/256/512 multi-mode architectures (Fig. 23(b)). The throughput/area value of the SHA-1 function in the proposed design (1.35 Mbps/slice) is 141.5% higher than that of the ISE design (0.56 Mbps/slice). The same improvement is achieved for the SHA-256 function in this case. The throughput/area value of the SHA-512 function in the proposed architecture is also 1.35 Mbps/slice, while the corresponding value for the ISE design is 1.12 Mbps/slice, a 20.7% improvement. On average over all the considered FPGA technologies, the throughput/area value of the SHA-1 and SHA-256 functions is improved by 140%, and that of the SHA-512 function by 20%. Similar results hold for the base SHA-256/512 multi-mode architecture, where the throughput/area factor for the SHA-256 and SHA-512 functions improves by 175% and 37% on average, respectively.
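The throughput/area figures quoted above for Virtex-6 can be approximated from the frequencies of Table 4 and the slice counts given earlier in this sub-section: 3787 slices for the proposed base SHA-1/256/512 design and 5576 for the ISE one. The Python sketch below reuses an assumed throughput model: the round count is divided by a factor of 4 for the base designs, a value we back-calculated from Table 4 rather than taken from the text. The helper names are hypothetical.

```python
# Throughput/area for the base SHA-1/256/512 designs on Virtex-6.
# Assumed model (back-calculated from Table 4, not quoted from the paper):
#   throughput = lanes * block_bits * F / (rounds / 4)


def throughput_mbps(block_bits: int, rounds: int, f_mhz: float,
                    lanes: int = 1, speedup: int = 4) -> float:
    """Bits per cycle times MHz gives Mbit/s."""
    cycles_per_block = rounds / speedup
    return lanes * block_bits * f_mhz / cycles_per_block


def per_slice(tp_mbps: float, slices: int) -> float:
    """Throughput/area in Mbps per slice."""
    return tp_mbps / slices


def improvement_pct(new: float, old: float) -> float:
    """Relative improvement of new over old, in percent."""
    return (new / old - 1.0) * 100.0


# Frequency in MHz (Table 4) and area in slices (text), Virtex-6
PROP_F, PROP_SLICES = 99.8, 3787   # proposed base SHA-1/256/512
ISE_F, ISE_SLICES = 121.7, 5576    # ISE-generated base SHA-1/256/512

# The proposed design processes two SHA-1 blocks at once (lanes=2); ISE processes one.
prop_sha1 = per_slice(throughput_mbps(512, 80, PROP_F, lanes=2), PROP_SLICES)
ise_sha1 = per_slice(throughput_mbps(512, 80, ISE_F), ISE_SLICES)
prop_sha512 = per_slice(throughput_mbps(1024, 80, PROP_F), PROP_SLICES)
ise_sha512 = per_slice(throughput_mbps(1024, 80, ISE_F), ISE_SLICES)

if __name__ == "__main__":
    print(f"SHA-1:   {prop_sha1:.2f} vs {ise_sha1:.2f} Mbps/slice "
          f"(+{improvement_pct(prop_sha1, ise_sha1):.1f}%)")
    print(f"SHA-512: {prop_sha512:.2f} vs {ise_sha512:.2f} Mbps/slice "
          f"(+{improvement_pct(prop_sha512, ise_sha512):.1f}%)")
```

Under these assumptions, the script prints 1.35 vs 0.56 Mbps/slice (+141.5%) for SHA-1 and 1.35 vs 1.12 Mbps/slice (+20.7%) for SHA-512. Two packed 512-bit SHA-1 blocks carry as many bits per 80 rounds as one 1024-bit SHA-512 block, which is why the two proposed values coincide in this model.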
The fact that the SHA-1 and SHA-256 functions exhibit the same throughput/area values in the proposed architectures is explained as follows. For both functions, the operating frequency equals the frequency of the whole multi-mode architecture, so, according to Eq. (16), the throughput is the same. Moreover, for both functions, the area equals the area of the whole multi-mode architecture.

Table 5
Proposed and ISE optimized SHA-256/512 designs.

Techn.     Metric     Proposed SHA-256/512     ISE SHA-256/512
                      SHA-256   SHA-512        SHA-256   SHA-512
Virtex-4   F (MHz)    103.1                    118.7
           T (Gbps)   13.2      10.5           7.6       12.1
Virtex-5   F (MHz)    132.2                    120.4
           T (Gbps)   17.0      13.6           7.7       12.3
Virtex-6   F (MHz)    141.6                    129.3
           T (Gbps)   18.2      15.5           8.3       13.2
Virtex-7   F (MHz)    159                      148.5
           T (Gbps)   20.1      16.3           9.5       15.2

6.3. Comparison with existing multi-mode architectures

In this sub-section, the proposed multi-mode designs are compared with similar existing ones. To the best of our knowledge, there are few similar FPGA designs. These designs

concern SHA-1/256, SHA-1/256/512, SHA-256/512, and SHA-384/512 multi-mode architectures implemented in FPGA technology. These designs belong to the category of optimized designs, since techniques such as pipelining and loop unrolling have been applied to improve performance. Thus, the proposed optimized multi-mode architectures are used for the comparisons. Tables 7–10 present the comparisons in terms of frequency, area, throughput, and throughput/area. As can be seen, the proposed architecture outperforms all the existing ones in both throughput and throughput/area.

Fig. 22. Area comparisons: (a) base SHA-256/512, (b) base SHA-1/256/512, (c) optimized SHA-256/512, and (d) optimized SHA-1/256/512.

Fig. 23. Throughput/area comparisons – base designs: (a) SHA-256/512 and (b) SHA-1/256/512.

Indicatively, for SHA-256/512, the throughput improvements range from 4.2× (SHA-512 – Virtex-II) to 23.8× (SHA-256 – Virtex-II), while the throughput/area improvements range from 1.2× (SHA-512 – Virtex-II) to 5.5× (SHA-256 – Virtex).

The introduced architectures achieve significantly higher throughput than the others for two reasons. Firstly, the proposed multi-mode architectures have their transformation round unrolled by 2 and, in addition, they are 4-stage pipelined. Hence, the denominator of Eq. (16) is divided by 4, compared to the initial number of the function's
iterations. Secondly, regarding SHA-256, as mentioned before, two independent input blocks can be processed concurrently, which doubles the throughput calculated by Eq. (16).

Finally, Fig. 25 provides the throughput/area comparisons between the proposed optimized SHA-256/512 architecture and similar existing ones, illustrating that the proposed designs outperform the competition.

Fig. 24. Throughput/area comparisons – optimized designs: (a) SHA-256/512 and (b) SHA-1/256/512.

Table 7
Comparison of proposed optimized SHA-1/256 architecture to similar existing ones.

Techn.     Ref.    Freq. (MHz)   Area (slices)   Throughput (Gbps)
                                                 SHA-1    SHA-256
Virtex-5   [29]    227           371             1.4      1.8
           Prop.   159.4         2104            8.2      10.2
Virtex-6   [29]    266           369             1.7      2.1
           [30]    258           148             0.064    0.060
           Prop.   164.7         2087            8.4      10.5

Table 8
Comparison of proposed optimized SHA-1/256/512 architecture to similar existing ones.

Techn.     Ref.    Freq. (MHz)   Area (slices)   Throughput (Gbps)
                                                 SHA-1    SHA-256   SHA-512
Virtex-6   [30]    271           251             0.067    0.064     0.046
           Prop.   124.7         4129            12.8     15.9      12.9

Table 9
Comparison of proposed optimized SHA-256/512 architecture to similar existing ones.

Techn.     Ref.    Freq. (MHz)   Area (slices)   Throughput (Mbps)
                                                 SHA-256   SHA-512
Virtex     [20]    50            2951            400       320
           [23]    53            2530            848 (same for both)
           Prop.   40.3          6911            5158      4127
Virtex-II  [19]    74            2384            291       467
           [21]    81            1938            1296 (same for both)
           Prop.   54.1          6968            6925      5540

Table 10
Comparison of proposed optimized SHA-384/512 architecture to similar existing ones.

Techn.     Ref.    Freq. (MHz)   Area (slices)   Throughput (Gbps)
Virtex-4   [32]    182.3         1731            2.3
           Prop.   119.6         7224            12.2

Fig. 25. Throughput/area comparisons between proposed optimized SHA-256/512 architecture and similar existing ones. [Bar chart: throughput/area (Mbps/slice) for SHA-256 and SHA-512; [20], [23], and the proposed design on Virtex; [19], [21], and the proposed design on Virtex-II.]
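The doubling of SHA-1/SHA-256 throughput rests on an observation made at the beginning of this section: most hash operations are bit-wise, so they can act on two 32-bit words packed into a 64-bit SHA-512 bus. The Python sketch below illustrates that idea under our own assumptions. The lane-splitting adder (`add_lanes`) and the per-lane rotator (`rotr_lanes`) are hypothetical stand-ins for the modifications of Sub-sections 4.4 and 4.5, not the authors' circuits. Some of the test words are SHA-256 initial hash values, used only as arbitrary inputs. The low lanes of `x` and `y` are chosen to force a carry out of bit 31.

```python
# Illustrative sketch (not the authors' circuit): two independent 32-bit lanes,
# e.g. words from two SHA-256 message blocks, packed into one 64-bit word.
MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF
LO = MASK32            # bits 31..0  -> low lane
HI = MASK32 << 32      # bits 63..32 -> high lane


def pack(hi: int, lo: int) -> int:
    """Place two 32-bit words side by side in one 64-bit word."""
    return ((hi & MASK32) << 32) | (lo & MASK32)


def unpack(w: int) -> tuple[int, int]:
    """Split a 64-bit word back into its (hi, lo) 32-bit lanes."""
    return (w >> 32) & MASK32, w & MASK32


def ch(x: int, y: int, z: int, mask: int) -> int:
    """SHA Ch function; purely bit-wise, so it works on packed words unchanged."""
    return ((x & y) ^ (~x & z)) & mask


def add_lanes(a: int, b: int) -> int:
    """Two independent mod-2^32 additions: the carry out of bit 31 is dropped
    instead of propagating into bit 32 as a plain 64-bit adder would."""
    lo = ((a & LO) + (b & LO)) & LO
    hi = ((a & HI) + (b & HI)) & HI
    return hi | lo


def rotr32(x: int, n: int) -> int:
    return ((x >> n) | (x << (32 - n))) & MASK32


def rotr64(x: int, n: int) -> int:
    return ((x >> n) | (x << (64 - n))) & MASK64


def rotr_lanes(w: int, n: int) -> int:
    """Rotate each 32-bit lane on its own; a 64-bit rotate moves bits across lanes."""
    hi, lo = unpack(w)
    return pack(rotr32(hi, n), rotr32(lo, n))


if __name__ == "__main__":
    x = pack(0x6A09E667, 0xFFFFFFFF)
    y = pack(0xBB67AE85, 0x00000001)
    z = pack(0x3C6EF372, 0xA54FF53A)

    # Bit-wise operation: the packed result equals the two per-lane results.
    print(unpack(ch(x, y, z, MASK64)) ==
          (ch(0x6A09E667, 0xBB67AE85, 0x3C6EF372, MASK32),
           ch(0xFFFFFFFF, 0x00000001, 0xA54FF53A, MASK32)))  # True

    # Addition: a plain 64-bit add leaks the low lane's carry into the high lane.
    print(hex((x + y) & MASK64), hex(add_lanes(x, y)))
    # Rotation: a per-lane rotate differs from a full-width rotate.
    print(rotr_lanes(x, 7) == rotr64(x, 7))  # False
```

Only the adder and the rotator need lane awareness in this sketch; `ch` runs unchanged on the packed word. In hardware, a comparable effect could be obtained by gating the carry between bit 31 and bit 32 according to the selected mode, but that is our assumption and not a description of the proposed design.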

7. Conclusions

In this paper, area-efficient and high-throughput multi-mode architectures for the SHA-1 and SHA-2 families were proposed and implemented in several FPGA technologies. These architectures realize more than one function while keeping low the frequency and throughput degradation caused by merging separate designs. Compared to the corresponding architectures produced by a commercial synthesis tool (Xilinx ISE), the proposed ones are significantly more area-efficient and, at the same time, significantly better in terms of throughput/area. Additionally, a systematic design flow for producing multi-mode architectures of the above two families was introduced. Finally, the proposed multi-mode architectures significantly outperform previously proposed ones in terms of throughput and throughput/area.

References

[1] NIST, FIPS 198, The Keyed-Hash Message Authentication Code (HMAC), Federal Information Processing Standard, NIST Publication, US Dept. of Commerce, 2002.
[2] NIST, SP 800-32, Introduction to Public Key Technology and the Federal PKI Infrastructure, US Dept. of Commerce, 2001.
[3] L. Loeb, Secure Electronic Transactions: Introduction and Technical Reference, Artech House Publishers, 1998.
[4] NIST, FIPS 186-3, The Digital Signature Standard (DSS), Federal Information Processing Standard, NIST Publication, US Dept. of Commerce, 2009.
[5] S. Thomas, SSL & TLS Essentials: Securing the Web, John Wiley and Sons Publications, New York, USA, 2000.
[6] P. Loshin, IPv6: Theory, Protocol and Practice, Elsevier, San Francisco, USA, 2004.
[7] NIST, FIPS 197, Advanced Encryption Standard (AES), NIST Publication, US Dept. of Commerce, 2001.
[8] R. Rivest, RFC 1321: The MD5 Message-Digest Algorithm, MIT Laboratory for Computer Science and RSA Data Security Inc. Available at: 〈http://www.faqs.org/rfcs/rfc1321.html〉 (accessed December 2012).
[9] NIST, FIPS 180-3, Secure Hash Standard (SHS), NIST Publication, US Dept. of Commerce, 2008.
[10] H. Dobbertin, The status of MD5 after a recent attack, RSA Labs' CryptoBytes, 1996.
[11] X. Wang, Y.L. Yin, H. Yu, Finding collisions in the full SHA-1, Springer Lect. Notes Comput. Sci. 3621 (2005) 17–36.
[12] NIST, Cryptographic hash algorithm competition – SHA-3, 〈http://csrc.nist.gov/groups/ST/hash/sha-3/index.html〉, 2012 (accessed December 2012).
[13] B. Preneel, Cryptographic hash functions and the SHA-3 competition, talk at Asiacrypt 2010. Available at 〈https://www.cosic.esat.kuleuven.be/publications/talk-198.pdf〉 (accessed May 2012).
[14] M. Ermer, Doubts over necessity of SHA-3 cryptography standard, 〈http://www.h-online.com/security/news/item/Doubts-over-necessity-of-SHA-3-cryptography-standard-1498071.html〉 (accessed December 2012).
[15] R. Chaves, G.K. Kuzmanov, L. Sousa, S. Vassiliadis, Cost-efficient SHA hardware accelerators, IEEE Trans. Very Large Scale Integr. Syst. 16 (8) (2008) 999–1008.
[16] M. Macchetti, L. Dadda, Quasi-pipelined hash circuits, in: Proceedings of the 17th Symposium on Computer Arithmetic, ARITH-17, 2005, pp. 222–229.
[17] H.E. Michail, A.P. Kakarountas, A.S. Milidonis, C.E. Goutis, A top-down design methodology for implementing ultra high-speed hashing cores, IEEE Trans. Dependable Secure Comput. 6 (4) (2009) 255–268.
[18] H.E. Michail, G.S. Athanasiou, V. Kelefouras, G. Theodoridis, C.E. Goutis, On the exploitation of a high-throughput SHA-256 FPGA design for HMAC, ACM Trans. Reconfigurable Technol. Syst. 5 (1) (2012) 2:1–2:28.
[19] N. Sklavos, O. Koufopavlou, Implementation of the SHA-2 hash family standard using FPGAs, J. Supercomput. 31 (2005) 227–248.
[20] R. Glabb, L. Imbert, G. Jullien, A. Tisserand, N.V. Charvillon, Multi-mode operator for SHA-2 hash functions, J. Syst. Archit. 53 (2–3) (2007) 127–138.
[21] M. Zeghid, B. Bouallegue, A. Baganne, M. Machhout, R. Tourki, A reconfigurable implementation of the new secure hash algorithm, in: Proceedings of the Second International Conference on Availability, Reliability and Security (ARES '07), IEEE Computer Society, Washington, DC, USA, 2007, pp. 281–285.
[22] S. Wanhong, G. Hongspeng, H. Huilei, Z. Dai, Design and optimized implementation of the SHA-2 (256, 384, 512) hash algorithms, in: Proceedings of the 7th International Conference on ASIC, ASICON '07, 2007, pp. 858–861.
[23] M. Zeghid, B. Bouallegue, A. Baganne, M. Machhout, R. Tourki, Architectural design features of a programmable high throughput reconfigurable SHA-2 processor, J. Inf. Assur. Secur. 2 (2008) 147–158.
[24] T.S. Ganesh, M.T. Frederick, T.S.B. Sudarshan, A.K. Somani, Hashchip: a shared-resource multi-hash function processor architecture on FPGA, Integr., VLSI J. 40 (2007) 11–19.
[25] E. Khan, M.W. El-Kharashi, F. Gebali, M. Abd-El-Barr, Design and performance analysis of a unified reconfigurable HMAC-hash unit, IEEE Trans. Circuits Syst. I 54 (12) (2007) 2683–2695.
[26] M.-Y. Wang, C.-P. Su, C.-T. Huang, C.-W. Wu, An HMAC processor with integrated SHA-1 and MD5 algorithms, in: Proceedings of the Asia and South Pacific Design Automation Conference, ASP-DAC '04, 2004, pp. 456–458.
[27] R. Ramanarayanan, S. Mathew, F. Sheikh, S. Srinivasan, A. Agarwal, S. Hsu, H. Kaul, H. Anders, V.M.M. Erraguntla, R. Krishanurthy, 18 Gbps, 50 mW reconfigurable multi-mode SHA hashing accelerator in 45 nm CMOS, in: Proceedings of the ESSCIRC, 2010, pp. 210–213.
[28] J. Docherty, A. Koelmans, A flexible hardware implementation of SHA-1 and SHA-2 hash functions, in: Proceedings of the 2011 IEEE International Symposium on Circuits and Systems, 2011, pp. 1932–1935.
[29] Helion Tech., Fast hash core family for Xilinx FPGA. Data sheet available from: 〈http://www.heliontech.com/hash.htm〉 (accessed December 2013).
[30] Helion Tech., Tiny hash core family for Xilinx FPGA. Data sheet available from: 〈http://www.heliontech.com/hash.htm〉 (accessed December 2013).
[31] A.T. Hoang, K. Yamazaki, S. Oyanagi, Three-stage pipeline implementation for SHA2 using data forwarding, in: Proceedings of the 2008 International Conference on Field Programmable Logic and Applications, FPL 2008, 8–10 September 2008, pp. 29–34.
[32] A.T. Hoang, K. Yamazaki, S. Oyanagi, Pipelining a multi-mode SHA-384/512 core with high area performance rate, IEICE Trans. Inf. Syst. E92-D (10) (2009) 2034–2042.

Harris E. Michail received his Dipl. Eng. and Ph.D. from the Department of Electrical & Computer Engineering, University of Patras, Greece, in 2009. From 2009 to 2011, he was an Adjunct Assistant Professor in the Computer Engineering and Informatics Department, University of Patras. In 2011, he joined the Electrical Engineering, Computer Engineering, and Informatics Department, Cyprus University of Technology. He has authored and co-authored more than 60 papers in international journals and conferences, and his work has more than 170 cross-references. His main research interests include Cryptography, Computer Security, and Embedded Systems.

George S. Athanasiou received his 5-year Dipl. Eng. in Electronic and Computer Engineering from the Technical University of Crete, Greece, in 2008, with a graduation grade of Excellent (GPA 8.51/10) with distinction. In May 2013, he received his Ph.D. from the Electrical and Computer Engineering Department, University of Patras, Greece, where he was then a Post-Doctoral researcher until August 2013. In September 2013, he joined Antcor – Advanced Network Technologies S.A., Athens, Greece. To date, he has more than 25 publications in international journals and conferences. His research interests include Cryptography, VLSI Design, and Design for Testability.

George Theodoridis received his Dipl. Eng. in Electrical Engineering and his Ph.D. in Electrical and Computer Engineering from the University of Patras, Greece, in 1994 and 2001, respectively. In 2001, he was a co-founder of ALMA Technologies S.A., Athens, Greece. From June 2003 to July 2009, he was a Lecturer in the Department of Physics of Aristotle University of Thessaloniki, Greece. In July 2009, he joined the Department of Electrical and Computer Engineering of the University of Patras as an Assistant Professor. His research interests include power estimation, low-power VLSI design, and embedded systems design.

Costas E. Goutis received his B.Sc. in Physics from the University of Athens, Greece, in 1966, his M.Sc. in Digital Telecommunications from Heriot-Watt University in 1974, and his Ph.D. in Digital Image Processing from Southampton University, UK, in 1978. Since 1985, he has been Associate, Full, and Emeritus Professor in the ECE Department of the University of Patras, Greece. His research interests include VLSI Design, High Level Synthesis, and Low Power design. He has published more than 300 articles (104 in journals and 194 in conferences) and holds 2 best paper awards.
