On the development of high-throughput and area-efficient multi-mode cryptographic hash designs in FPGAs
H.E. Michail a,*, G.S. Athanasiou b, G. Theodoridis c, C.E. Goutis c
a Electrical Engineering, Computer Engineering and Informatics Department, Cyprus University of Technology, 3036 Lemesos, Cyprus
b Antcor Advanced Network Technologies S.A., Sorou Str. 12, 15125 Marousi, Greece
c VLSI Design Laboratory, Electrical and Computer Engineering Department, University of Patras, 26500 Patras, Greece
Article info
Article history: Received 23 December 2012; Received in revised form 4 February 2014; Accepted 7 February 2014; Available online 2 March 2014
Keywords: Hash; Authentication; Multi-mode; FPGA
Abstract
In this paper, area-efficient and high-throughput multi-mode architectures for the SHA-1 and SHA-2 hash families are proposed and implemented in several FPGA technologies. Additionally, a systematic flow for designing multi-mode architectures (implementing more than one function) of these families is introduced. Compared to the corresponding architectures produced by a commercial synthesis tool, the proposed ones are better in terms of both area (at least 40%) and throughput/area (from 32% up to 175%). Finally, the proposed architectures outperform similar existing ones in terms of throughput and throughput/area, by factors from 4.2 up to 279.4 and from 1.2 up to 5.5, respectively.
© 2014 Elsevier B.V. All rights reserved.
1. Introduction
Due to the dramatic increase of electronic communications and transactions worldwide, security has become an indispensable feature of all systems and applications. A vital feature of the security schemes used nowadays is authentication, which is achieved using cryptographic hash functions. Hash functions are used either as standalone modules or inside hash-based authentication mechanisms such as the Hashed Message Authentication Code (HMAC) [1].
Furthermore, hash functions are used in the Public Key Infrastructure (PKI) [2], Secure Electronic Transactions (SET) [3], and digital signature algorithms such as DSA [4], which provide authentication services in commercial applications such as data interchange, electronic mail, and fund transfer. Additionally, hash functions are used in Web protocols such as the Secure Sockets Layer (SSL) and Transport Layer Security (TLS) [5].
The importance of hash functions has increased further in recent years due to their inclusion in Internet Protocol Security (IPSec) [6]. IPSec is a compulsory feature of the forthcoming Internet Protocol version 6 (IPv6) [6] and includes both encryption and authentication schemes. Encryption is achieved through the block cipher AES [7], whereas authentication is provided through HMAC built on top of a hash function (e.g. MD-5 [8] or SHA-1 [9]).
However, security problems have been discovered in the MD-5 and SHA-1 functions. Specifically, the MD-5 class of hash functions has been totally broken [10]. Concerning SHA-1, on the other hand, although its collision resistance has been reduced [11], the security problems are not yet critical. For these reasons, in addition to SHA-1, SHA-2 is expected to be adopted as a secure solution in security schemes (e.g. IPSec/IPv6) in the coming years. Meanwhile, the US National Institute of Standards and Technology (NIST) established a competition to develop a new hash function standard (SHA-3), which was finalized in November 2012 [12]. As the transition to a new standard does not happen immediately, the SHA-1 and SHA-2 functions are expected to remain in use in near- and medium-term applications. In fact, NIST itself reports that SHA-3 is not meant to replace SHA-1 and SHA-2 but to coexist with them. Moreover, many administrators have not yet made the jump from SHA-1 to SHA-2 [13,14].
A crucial issue raised by this transition from SHA-1 to SHA-2 is that different systems and applications have different needs in terms of the type of authentication. Hence, systems have to be flexible enough to support more than one hash function. This requirement can be met by implementing each hash function as a separate design. A more area-efficient solution, however, is a multi-mode architecture that supports more than one hash function within a single design. Such architectures significantly increase the flexibility of the whole security module. They allow its use in a wide range of applications, from servers (which have to support a set of different hash-type interactions) to end-users who communicate through many channels, each of which may employ a different hash function for authentication.

Contents lists available at ScienceDirect
Journal homepage: www.elsevier.com/locate/vlsi
INTEGRATION, the VLSI journal 47 (2014) 387–407
http://dx.doi.org/10.1016/j.vlsi.2014.02.004
0167-9260 © 2014 Elsevier B.V. All rights reserved.
* Corresponding author.
URL: http://www.antcor.com (G.S. Athanasiou).
In this paper, two multi-mode architectures, namely SHA-256/512 and SHA-1/256/512, are introduced. They achieve high throughput rates and outperform all existing similar architectures in terms of the throughput/area cost factor. At the same time, they are area-efficient. Specifically, they occupy less area than the corresponding architectures derived by simply combining the individual hash cores and feeding them to a commercial FPGA synthesis/P&R/mapping tool; in that baseline, two or three separate cores are designed as one module with shared inputs and a multiplexer that selects the desired output. The introduced designs can hash one message using the SHA-512 algorithm, or two different messages when the SHA-256 or SHA-1 hash function is executed. Several older (Xilinx Virtex, Virtex-II, Virtex-II Pro, and Virtex-4) and modern (Xilinx Virtex-5, Virtex-6, and Virtex-7) FPGA families are used to implement these architectures.
Moreover, a systematic design flow for producing multi-mode architectures that implement more than one hash function is proposed. The inputs of the flow are the separate designs and the algorithmic descriptions of the targeted hash functions. Following a set of well-defined steps, the flow merges the individual designs to produce the final multi-mode architecture. To keep the area overhead low, extensive resource sharing is performed in the applied steps. Also, novel techniques that exploit special features of the hash functions are introduced to keep the delay increase low.
Because any architecture for these hash families is composed of similar functional blocks (adders, non-linear functions, rotations, and logic modules), the proposed flow can be applied to any RTL architecture of the SHA-1 and SHA-2 families. Thus, we apply the proposed flow to both base and optimized hash function designs, collect the corresponding results, and perform the related comparisons. However, it must be stressed that the goal is not to introduce general techniques for developing designs optimized in terms of area, delay, or throughput. The individual designs of each hash function are assumed to be already optimized, and the proposed flow then merges them. Furthermore, the flow exploits specific features of the SHA-1 and SHA-2 families and is therefore tailored to produce optimized multi-mode architectures for them. It is not a general design flow for general-purpose designs.
To the best of our knowledge, this is the first time that (a) an FPGA implementation of a multi-mode design including both the SHA-1 and SHA-2 hash families is presented, and (b) a systematic design flow for producing multi-mode architectures that implement a set of hash functions is proposed.
The paper is organized as follows. Section 2 briefly reviews the related work. The basic background on the SHA-1 and SHA-2 hash families is presented in Section 3, along with a short description of their base and optimized designs. Section 4 presents in detail the proposed design flow and the production of the base multi-mode architectures mentioned above. Next, Section 5 presents the corresponding optimized multi-mode architectures constructed with the introduced flow. FPGA implementation results and comparisons with existing implementations are provided and discussed in Section 6. Finally, Section 7 concludes the paper.
2. Related work
Currently, there are many hardware implementations of hash functions that target high throughput rates to meet the strict timing constraints of modern applications. The literature contains numerous hardware implementations of the SHA-1 and SHA-2 families, such as those presented in [15,16–18,31]. In these works, novel architectures have been introduced for each individual hash function. However, there are significantly fewer works dealing with multi-mode architectures for these hash families.
Specifically, multi-mode architectures that implement the SHA-1 and/or SHA-2 family in FPGA and ASIC technology have been proposed in [19–23,29,30,32]. Besides these works, which focus on the SHA-1/SHA-2 families, multi-mode architectures that target other hash functions have also been proposed. In [24] a multi-mode architecture implementing the MD-5, SHA-1, and RIPEMD160 hash functions was proposed, while in [25] a unified reconfigurable HMAC unit supporting the MD-4, MD-5, SHA-1, and RIPEMD160 hash functions was introduced. In [26] an HMAC unit integrating MD-5 and SHA-1 is presented and implemented in FPGA and ASIC technologies. Finally, [27,28] report ASIC implementations of multi-mode architectures that include the SHA-1 and SHA-2 hash functions.
Although these works introduce novel multi-mode architectures, they suffer from the following problems. First, they include algorithms such as MD-4 or MD-5, which have been totally broken [10,11], or the RIPEMD160 hash algorithm, whose commercial use is very limited and which is not a NIST standard. Additionally, to the best of the authors' knowledge, there are no FPGA implementations of multi-mode architectures that support both the SHA-1 and SHA-2 hash families. These families are widely used by current applications and are expected to be used by future ones. Such designs are presented in only two works [27,28], and both report only ASIC implementations. Moreover, no systematic approach has been proposed for producing multi-mode architectures that are efficient in terms of area or delay. Instead, all the introduced architectures are produced in an ad-hoc manner. This makes it difficult for another designer, who has already developed individual hash cores, to follow a similar procedure and create multi-mode designs with those cores as a starting point.
3. SHA-1/SHA-256 hash function background
A hash function H(M) operates on an arbitrary-length message M and returns a fixed-length output h, called the hash value or message digest of M. The aim of a hash function is to provide a compact signature of M that is, for practical purposes, unique. Given M, it is easy to compute h = H(M). Given h, however, it is computationally hard to find any M such that H(M) = h, even though H is known. Hash functions are iterative and comprise two stages: pre-processing and computation [9].
The main characteristics of the targeted hash families, namely SHA-1 and SHA-2 (SHA-224/256/384/512), are shown in Table 1. They consist of simple processing elements: modulo $2^w$ additions, bitwise non-linear functions, and bit-level shifts and rotations [9].
As reported in the standard [9], the SHA-224 and SHA-384 hash functions are identical to SHA-256 and SHA-512, respectively, except that they start from different initial values and their outputs are truncated. Hence, they are omitted from the paper's analysis. They are, however, supported in both the design flow and the final designs. For these two functions, apart from loading the corresponding initial values, only a truncation of the output is required. Throughout the paper, the terms hash function and hash algorithm are used interchangeably.
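These relations can be checked quickly with Python's standard `hashlib` module. The sketch below confirms the digest lengths n listed in Table 1. It also shows that a SHA-224 digest is not simply the first 224 bits of a SHA-256 digest, and that a SHA-384 digest is not simply the first 384 bits of a SHA-512 digest, because the standard assigns the two shorter variants their own initial values. The helper names are illustrative only.

```python
import hashlib

# Digest length n (bits) per function, as listed in Table 1.
TABLE_1_DIGEST_BITS = {"sha1": 160, "sha256": 256, "sha224": 224,
                       "sha512": 512, "sha384": 384}


def digest_bits(name: str, message: bytes) -> int:
    """Length in bits of the digest hashlib returns for `message`."""
    return 8 * len(hashlib.new(name, message).digest())


def is_plain_truncation(short: str, full: str, message: bytes) -> bool:
    """True if the `short` digest equals the leading bytes of the `full` digest."""
    short_digest = hashlib.new(short, message).digest()
    full_digest = hashlib.new(full, message).digest()
    return short_digest == full_digest[:len(short_digest)]


if __name__ == "__main__":
    for name, bits in TABLE_1_DIGEST_BITS.items():
        # The output length is fixed, whatever the message length.
        print(name, digest_bits(name, b"abc"), digest_bits(name, bytes(10_000)), bits)
    print(is_plain_truncation("sha224", "sha256", b"abc"))  # False
    print(is_plain_truncation("sha384", "sha512", b"abc"))  # False
```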
3.1. Arithmetic and logical functions
A number of non-linear functions are applied to w-bit words, represented as x, y, and z in the following. The result of these logical computations is also a w-bit word. Specifically, the following non-linear functions are included in the SHA-1, SHA-256, and SHA-512 algorithms:
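$$\mathrm{Ch}(x,y,z) = (x \wedge y) \oplus (\neg x \wedge z) \tag{1}$$

$$\mathrm{Parity}(x,y,z) = x \oplus y \oplus z \tag{2}$$

$$\mathrm{Maj}(x,y,z) = (x \wedge y) \oplus (x \wedge z) \oplus (y \wedge z) \tag{3}$$

$$\Sigma_0^{j}(x) = \mathrm{ROTR}^{r_{1,j}}(x) \oplus \mathrm{ROTR}^{r_{2,j}}(x) \oplus \mathrm{ROTR}^{r_{3,j}}(x), \quad (r_{1,j}, r_{2,j}, r_{3,j}) = \begin{cases} (2, 13, 22), & j = 2 \\ (28, 34, 39), & j = 3 \end{cases} \tag{4}$$

$$\Sigma_1^{j}(x) = \mathrm{ROTR}^{r_{4,j}}(x) \oplus \mathrm{ROTR}^{r_{5,j}}(x) \oplus \mathrm{ROTR}^{r_{6,j}}(x), \quad (r_{4,j}, r_{5,j}, r_{6,j}) = \begin{cases} (6, 11, 25), & j = 2 \\ (14, 18, 41), & j = 3 \end{cases} \tag{5}$$

$$\sigma_0^{j}(x) = \mathrm{ROTR}^{s_{1,j}}(x) \oplus \mathrm{ROTR}^{s_{2,j}}(x) \oplus \mathrm{SHR}^{s_{3,j}}(x), \quad (s_{1,j}, s_{2,j}, s_{3,j}) = \begin{cases} (7, 18, 3), & j = 2 \\ (1, 8, 7), & j = 3 \end{cases} \tag{6}$$

$$\sigma_1^{j}(x) = \mathrm{ROTR}^{s_{4,j}}(x) \oplus \mathrm{ROTR}^{s_{5,j}}(x) \oplus \mathrm{SHR}^{s_{6,j}}(x), \quad (s_{4,j}, s_{5,j}, s_{6,j}) = \begin{cases} (17, 19, 10), & j = 2 \\ (19, 61, 6), & j = 3 \end{cases} \tag{7}$$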
where $\oplus$ denotes the bitwise XOR operation, and $\wedge$ and $\neg$ denote bitwise AND and complement. $\mathrm{ROTR}^{l}(x)$ and $\mathrm{SHR}^{l}(x)$ denote the right circular rotation and the right shift of word x by l bits, respectively; $\mathrm{ROTL}^{l}(x)$, used by SHA-1, is the corresponding left rotation. Concerning the arithmetic operations, addition modulo $2^w$ of two w-bit words is used in all the above algorithms.
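The following minimal Python sketch makes Eqs. (1)–(7) concrete. It assumes that w-bit words are held as non-negative Python integers. The dictionary `PARAMS` and all function names are illustrative and are not taken from the standard. Python integers are unbounded, so each rotation masks its result back to w bits.

```python
# Rotation/shift amounts of Eqs. (4)-(7), keyed by word length w
# (w = 32 for SHA-256, j = 2; w = 64 for SHA-512, j = 3).
PARAMS = {
    32: {"Sigma0": (2, 13, 22), "Sigma1": (6, 11, 25),
         "sigma0": (7, 18, 3), "sigma1": (17, 19, 10)},
    64: {"Sigma0": (28, 34, 39), "Sigma1": (14, 18, 41),
         "sigma0": (1, 8, 7), "sigma1": (19, 61, 6)},
}


def mask(w: int) -> int:
    return (1 << w) - 1


def rotr(x: int, n: int, w: int) -> int:
    """ROTR^n(x): right circular rotation of a w-bit word by n bits."""
    return ((x >> n) | (x << (w - n))) & mask(w)


def shr(x: int, n: int) -> int:
    """SHR^n(x): logical right shift by n bits."""
    return x >> n


def ch(x: int, y: int, z: int) -> int:       # Eq. (1); ~x & z stays non-negative
    return (x & y) ^ (~x & z)


def parity(x: int, y: int, z: int) -> int:   # Eq. (2)
    return x ^ y ^ z


def maj(x: int, y: int, z: int) -> int:      # Eq. (3)
    return (x & y) ^ (x & z) ^ (y & z)


def big_sigma(x: int, name: str, w: int) -> int:
    """Sigma0 / Sigma1 of Eqs. (4)-(5): XOR of three rotations."""
    r1, r2, r3 = PARAMS[w][name]
    return rotr(x, r1, w) ^ rotr(x, r2, w) ^ rotr(x, r3, w)


def small_sigma(x: int, name: str, w: int) -> int:
    """sigma0 / sigma1 of Eqs. (6)-(7): two rotations and one shift."""
    s1, s2, s3 = PARAMS[w][name]
    return rotr(x, s1, w) ^ rotr(x, s2, w) ^ shr(x, s3)


if __name__ == "__main__":
    print(hex(big_sigma(1, "Sigma0", 32)))       # 0x40080400
    print(hex(small_sigma(0x18, "sigma1", 32)))  # 0xf0000
```

In hardware, these fixed-amount rotations and shifts are pure wiring, so they add essentially no delay to the datapath.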
3.2. Constants $K_t^{j}$ and initial values $H^{(0),j}$
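Constant values $K_t^{j}$ ($0 \le t \le t_{max}^{j} - 1$) are used in the transformation round, one per iteration. The SHA-1 algorithm uses eighty 32-bit constants, $K_0^{1}, K_1^{1}, \ldots, K_{79}^{1}$. These take only four distinct values, each repeated for 20 consecutive iterations. SHA-256 uses sixty-four 32-bit constants, $K_0^{2}, K_1^{2}, \ldots, K_{63}^{2}$, which are the first 32 bits of the fractional parts of the cube roots of the first 64 primes. SHA-512 employs eighty 64-bit constants, $K_0^{3}, K_1^{3}, \ldots, K_{79}^{3}$, which are the first 64 bits of the fractional parts of the cube roots of the first 80 primes.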
Concerning the initial values, the SHA-1 algorithm employs five 32-bit initial values, $H_0^{(0),1}, H_1^{(0),1}, \ldots, H_4^{(0),1}$. These initialize the working variables before the first iteration ($t = 0$) of the transformation round for the first message block. SHA-256 and SHA-512, on the other hand, use eight w-bit initial values, $H_0^{(0),j}, H_1^{(0),j}, \ldots, H_7^{(0),j}$, with $j = 2$ and $w = 32$ for SHA-256 and $j = 3$ and $w = 64$ for SHA-512. For all algorithms, the constants and initial values are provided by the standard [9].
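The SHA-256 and SHA-512 constants can be regenerated with exact integer arithmetic. This is a handy cross-check when the constant tables are stored in the local memories of a hardware design. The Python sketch below uses only the standard library and needs Python 3.8+ for `math.isqrt`; the helper names are illustrative. It derives $K_t^{2}$ and $K_t^{3}$ from the cube roots of primes, and the SHA-256/SHA-512 initial values from the square roots of the first eight primes, as the standard describes. It also shows that the four distinct SHA-1 constants equal $\lfloor 2^{30}\sqrt{n} \rfloor$ for n = 2, 3, 5, 10. That last point is a well-known property of the published values, not a definition given in [9].

```python
import math


def first_primes(count: int) -> list:
    """First `count` primes by trial division (fine for count <= 80)."""
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def icbrt(n: int) -> int:
    """Largest integer r with r**3 <= n (exact integer cube root, binary search)."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def round_constants(count: int, w: int) -> list:
    """First w bits of the fractional parts of the cube roots of the first `count` primes.
    floor(cbrt(p) * 2**w) == icbrt(p * 2**(3w)); the mask keeps only the fractional bits."""
    return [icbrt(p << (3 * w)) & ((1 << w) - 1) for p in first_primes(count)]


def initial_values(w: int) -> list:
    """First w bits of the fractional parts of the square roots of the first 8 primes
    (the SHA-256 initial values for w = 32, the SHA-512 ones for w = 64)."""
    return [math.isqrt(p << (2 * w)) & ((1 << w) - 1) for p in first_primes(8)]


def sha1_constants() -> list:
    """The four distinct SHA-1 constants, floor(2**30 * sqrt(n)) for n = 2, 3, 5, 10."""
    return [math.isqrt(n << 60) for n in (2, 3, 5, 10)]


if __name__ == "__main__":
    k256, k512 = round_constants(64, 32), round_constants(80, 64)
    print(hex(k256[0]), hex(k512[0]))   # 0x428a2f98 0x428a2f98d728ae22
    print([hex(v) for v in sha1_constants()])
```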
3.3. Pre-processing stage
The pre-processing stage includes the padding and parsing of the initial message M. Padding appends additional bits to M so that its length in bits becomes a multiple of 512 for SHA-1 and SHA-256, or of 1024 for SHA-512. Specifically, a single '1' bit is appended, followed by zeros and by the original message length in bits, encoded as a 64-bit (SHA-1/SHA-256) or 128-bit (SHA-512) integer. Since padding is a simple procedure, it is usually implemented in software without affecting the security level of the implementation. For more information about padding, the reader is referred to the standard [9].
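Below is a minimal Python sketch of the padding rule for byte-aligned messages. The standard also defines padding for messages whose bit length is not a multiple of 8; this illustrative helper `pad` does not handle that case.

```python
def pad(message: bytes, block_bits: int) -> bytes:
    """Pad a byte-aligned message to a multiple of block_bits (512 or 1024).

    Appends one '1' bit (byte 0x80, i.e. the 1 followed by seven 0s), then zero
    bytes, then the message length in bits as a big-endian integer occupying the
    last 64 bits (512-bit blocks) or 128 bits (1024-bit blocks).
    """
    block_bytes = block_bits // 8
    length_bytes = block_bytes // 8                 # 8 or 16 bytes
    zeros = -(len(message) + 1 + length_bytes) % block_bytes
    return (message + b"\x80" + bytes(zeros)
            + (8 * len(message)).to_bytes(length_bytes, "big"))


if __name__ == "__main__":
    for size in (3, 55, 56):
        print(size, len(pad(b"a" * size, 512)))    # padded lengths 64, 64, 128 bytes
    print(len(pad(b"abc", 1024)))                  # 128
```

A 56-byte message already overflows one 512-bit block: the 0x80 byte and the 8-byte length field no longer fit, so a second block is needed.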
During parsing, the padded message is divided into N k-bit blocks, denoted as $M^{(1)}, M^{(2)}, \ldots, M^{(N)}$, with k given in Table 1.
For the SHA-1 and SHA-256 algorithms, each 512-bit block can be expressed as sixteen 32-bit words. The first 32 bits of message block i are denoted as $M_0^{(i)}$, the next 32 bits as $M_1^{(i)}$, and so on up to $M_{15}^{(i)}$. A similar procedure is followed for SHA-512, except that each $M_t^{(i)}$ is a 64-bit word.
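Parsing can be sketched in a few lines. The hypothetical helper `parse` below splits an already padded message into k-bit blocks, and each block into sixteen big-endian w-bit words $M_t^{(i)}$.

```python
def parse(padded: bytes, block_bits: int, word_bits: int) -> list:
    """Split a padded message into N blocks M^(i), each a list of 16 w-bit words
    (big-endian, as in the standard): M_0^(i) holds the first w bits of block i."""
    block_bytes, word_bytes = block_bits // 8, word_bits // 8
    if len(padded) % block_bytes:
        raise ValueError("input is not padded to a whole number of blocks")
    blocks = []
    for start in range(0, len(padded), block_bytes):
        block = padded[start:start + block_bytes]
        blocks.append([int.from_bytes(block[k:k + word_bytes], "big")
                       for k in range(0, block_bytes, word_bytes)])
    return blocks


if __name__ == "__main__":
    # "abc" already padded to one 512-bit block (0x80 marker, zeros, 64-bit length 24).
    padded = b"abc\x80" + bytes(52) + (24).to_bytes(8, "big")
    words = parse(padded, 512, 32)[0]
    print(hex(words[0]), hex(words[15]))   # 0x61626380 0x18
```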
3.4. Computation stage
The computation stage includes the message schedule computations and those of the transformation round. These are carried out by the following procedure:
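For i = 1 to N do {
Step 1. Message schedule preparation.

$$W_t^{j} = \begin{cases} M_t^{(i)}, & 0 \le t \le 15, \; j = 1, 2, 3 \\ \mathrm{ROTL}^{1}\left(W_{t-3}^{j} \oplus W_{t-8}^{j} \oplus W_{t-14}^{j} \oplus W_{t-16}^{j}\right), & 16 \le t \le t_{max}^{j} - 1, \; j = 1 \\ \sigma_1^{j}\left(W_{t-2}^{j}\right) + W_{t-7}^{j} + \sigma_0^{j}\left(W_{t-15}^{j}\right) + W_{t-16}^{j}, & 16 \le t \le t_{max}^{j} - 1, \; j = 2, 3 \end{cases} \tag{8}$$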
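Eq. (8) translates directly into software. The sketch below builds $W_0, \ldots, W_{t_{max}^{j}-1}$ for SHA-1 and for SHA-256/SHA-512, with words held as Python integers and illustrative names. The expected values in the comments are for the single padded block of the message "abc".

```python
MASK32 = 0xFFFFFFFF

# (s1, s2, s3) of sigma_0 and (s4, s5, s6) of sigma_1, Eqs. (6)-(7), keyed by w.
SMALL_SIGMA = {32: ((7, 18, 3), (17, 19, 10)), 64: ((1, 8, 7), (19, 61, 6))}


def rotl32(x: int, n: int) -> int:
    return ((x << n) | (x >> (32 - n))) & MASK32


def rotr(x: int, n: int, w: int) -> int:
    return ((x >> n) | (x << (w - n))) & ((1 << w) - 1)


def small_sigma(x: int, amounts, w: int) -> int:
    r1, r2, sh = amounts
    return rotr(x, r1, w) ^ rotr(x, r2, w) ^ (x >> sh)


def schedule_sha1(block: list) -> list:
    """W_0..W_79 for SHA-1 (j = 1): the 16 block words, then ROTL^1 of four XORed words."""
    W = list(block)
    for t in range(16, 80):
        W.append(rotl32(W[t - 3] ^ W[t - 8] ^ W[t - 14] ^ W[t - 16], 1))
    return W


def schedule_sha2(block: list, w: int, t_max: int) -> list:
    """W_0..W_{t_max-1} for SHA-256 (w=32, t_max=64) or SHA-512 (w=64, t_max=80)."""
    sig0, sig1 = SMALL_SIGMA[w]
    mask = (1 << w) - 1
    W = list(block)
    for t in range(16, t_max):
        W.append((small_sigma(W[t - 2], sig1, w) + W[t - 7]
                  + small_sigma(W[t - 15], sig0, w) + W[t - 16]) & mask)
    return W


if __name__ == "__main__":
    abc = [0x61626380] + [0] * 14 + [0x18]      # parsed words of the padded "abc"
    print(hex(schedule_sha1(abc)[16]))          # 0xc2c4c700
    print(hex(schedule_sha2(abc, 32, 64)[17]))  # 0xf0000
```

In the base architecture, this recurrence is what each stage's $W_t$ unit evaluates for the iterations assigned to that stage.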
Step 2. Initialization of the working variables.
SHA-1 uses five working variables, a, b, c, d, and e. Before the first iteration (t = 0) of the transformation round, they are set to the (i−1)-th intermediate hash value, $(a, b, c, d, e) = (H_0^{(i-1),1}, H_1^{(i-1),1}, \ldots, H_4^{(i-1),1})$; for the first block (i = 1), these are the initial values $H^{(0),1}$. Similarly, SHA-256 and SHA-512 use eight working variables, a, b, c, d, e, f, g, and h, initialized as $(a, b, \ldots, h) = (H_0^{(i-1),j}, H_1^{(i-1),j}, \ldots, H_7^{(i-1),j})$ with j = 2, 3.
Step 3. Application of the transformation round.
The computations of the SHA-1 transformation round are as follows:
For t = 0 to 79:

$$T_1 = \mathrm{ROTL}^{5}(a) + f_t(b, c, d) + e + W_t^{1} + K_t^{1} \tag{9}$$

$$a \| b \| c \| d \| e = T_1 \| a \| \mathrm{ROTL}^{30}(b) \| c \| d \tag{10}$$

where $\|$ denotes concatenation. Depending on the iteration number t, the function $f_t(b, c, d)$ equals Ch(b, c, d) for $0 \le t \le 19$, Parity(b, c, d) for $20 \le t \le 39$ and $60 \le t \le 79$, and Maj(b, c, d) for $40 \le t \le 59$.
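Putting Steps 1–4 together for SHA-1 gives the following self-contained Python model. It assumes byte-aligned input, and its names are illustrative. It selects $f_t$ by iteration number exactly as described above and is checked against `hashlib.sha1`. The initial values and the four distinct constant values are those published in the standard [9]. This is a reference model for generating test vectors, not an optimized implementation.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF
H0 = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)  # H^(0),1
K = (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)               # one per 20 iterations


def rotl(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK


def f(t, b, c, d):
    if t <= 19:
        return (b & c) ^ (~b & d)            # Ch
    if 40 <= t <= 59:
        return (b & c) ^ (b & d) ^ (c & d)   # Maj
    return b ^ c ^ d                         # Parity (20-39 and 60-79)


def sha1(message: bytes) -> bytes:
    # Pre-processing: pad to a multiple of 512 bits (byte-aligned input assumed).
    zeros = (55 - len(message)) % 64
    message = message + b"\x80" + bytes(zeros) + (8 * len(message)).to_bytes(8, "big")
    H = list(H0)
    for start in range(0, len(message), 64):
        W = list(struct.unpack(">16I", message[start:start + 64]))   # parsing, M_t^(i)
        for t in range(16, 80):                                      # Eq. (8), j = 1
            W.append(rotl(W[t - 3] ^ W[t - 8] ^ W[t - 14] ^ W[t - 16], 1))
        a, b, c, d, e = H                                            # Step 2
        for t in range(80):                                          # Eqs. (9)-(10)
            T1 = (rotl(a, 5) + f(t, b, c, d) + e + W[t] + K[t // 20]) & MASK
            a, b, c, d, e = T1, a, rotl(b, 30), c, d
        H = [(x + y) & MASK for x, y in zip((a, b, c, d, e), H)]    # Eq. (14)
    return b"".join(v.to_bytes(4, "big") for v in H)


if __name__ == "__main__":
    print(sha1(b"abc") == hashlib.sha1(b"abc").digest())   # True
```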
The computations of SHA-256 and SHA-512 (given below) are the same for both algorithms, except that SHA-256 processes 32-bit words whereas SHA-512 operates on 64-bit words.
For t = 0 to $t_{max}^{j} - 1$ (j = 2, 3):

$$T_1 = h + \Sigma_1^{j}(e) + \mathrm{Ch}(e, f, g) + K_t^{j} + W_t^{j} \tag{11}$$

$$T_2 = \Sigma_0^{j}(a) + \mathrm{Maj}(a, b, c) \tag{12}$$

$$a \| b \| c \| d \| e \| f \| g \| h = (T_1 + T_2) \| a \| b \| c \| (d + T_1) \| e \| f \| g \tag{13}$$
Step 4. Computation of the i-th intermediate hash value $H^{(i)}$.

$$H^{(i),j} = \left(a + H_0^{(i-1),j}\right) \| \left(b + H_1^{(i-1),j}\right) \| \cdots \| \left(e + H_4^{(i-1),j}\right), \quad j = 1 \tag{14}$$

$$H^{(i),j} = \left(a + H_0^{(i-1),j}\right) \| \left(b + H_1^{(i-1),j}\right) \| \cdots \| \left(h + H_7^{(i-1),j}\right), \quad j = 2, 3 \tag{15}$$

}.
After repeating these steps N times (i.e. after processing message block $M^{(N)}$), the computed hash value $H^{(N)}$ is the message digest h of message M.

Table 1
SHA-1 and SHA-2 characteristics.
Function   j   Message block k (bits)   Word length w (bits)   Hash value n (bits)   Iterations t_max^j
SHA-1      1   512                      32                     160                   80
SHA-256    2   512                      32                     256                   64
SHA-224    –   512                      32                     224                   64
SHA-512    3   1024                     64                     512                   80
SHA-384    –   1024                     64                     384                   80
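SHA-256 and SHA-512 differ only in word length, iteration count, constants, and the rotation/shift amounts of Eqs. (4)–(7). A single parameterized software model can therefore cover both, much as the multi-mode hardware shares one datapath. The Python sketch below is such a reference model, using only the standard library and illustrative names. It derives K and the initial values from primes as the standard describes, assumes byte-aligned input, and is intended for checking test vectors, not for production use.

```python
import hashlib
import math


def first_primes(count):
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def icbrt(n):
    """Largest integer r with r**3 <= n."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        lo, hi = (mid, hi) if mid ** 3 <= n else (lo, mid - 1)
    return lo


# Word length w, iterations t_max, block size in bytes, and the rotation/shift
# amounts of Eqs. (4)-(7) in the order (Sigma0, Sigma1, sigma0, sigma1).
VARIANTS = {
    "sha256": (32, 64, 64, ((2, 13, 22), (6, 11, 25), (7, 18, 3), (17, 19, 10))),
    "sha512": (64, 80, 128, ((28, 34, 39), (14, 18, 41), (1, 8, 7), (19, 61, 6))),
}


def sha2(message: bytes, variant: str) -> bytes:
    w, t_max, block, (S0, S1, s0, s1) = VARIANTS[variant]
    mask = (1 << w) - 1
    primes = first_primes(t_max)
    K = [icbrt(p << (3 * w)) & mask for p in primes]            # K_t^j
    H = [math.isqrt(p << (2 * w)) & mask for p in primes[:8]]   # H^(0),j

    def rotr(x, n):
        return ((x >> n) | (x << (w - n))) & mask

    def big(x, r):                  # Sigma_0 / Sigma_1, Eqs. (4)-(5)
        return rotr(x, r[0]) ^ rotr(x, r[1]) ^ rotr(x, r[2])

    def small(x, r):                # sigma_0 / sigma_1, Eqs. (6)-(7)
        return rotr(x, r[0]) ^ rotr(x, r[1]) ^ (x >> r[2])

    # Padding: length field of 64 bits (SHA-256) or 128 bits (SHA-512).
    len_bytes = block // 8
    zeros = -(len(message) + 1 + len_bytes) % block
    message = message + b"\x80" + bytes(zeros) + (8 * len(message)).to_bytes(len_bytes, "big")
    wb = w // 8
    for start in range(0, len(message), block):                        # block M^(i)
        chunk = message[start:start + block]
        W = [int.from_bytes(chunk[k:k + wb], "big") for k in range(0, block, wb)]
        for t in range(16, t_max):                                      # Eq. (8)
            W.append((small(W[t - 2], s1) + W[t - 7]
                      + small(W[t - 15], s0) + W[t - 16]) & mask)
        a, b, c, d, e, f, g, h = H                                       # Step 2
        for t in range(t_max):                                           # Eqs. (11)-(13)
            T1 = (h + big(e, S1) + ((e & f) ^ (~e & g)) + K[t] + W[t]) & mask
            T2 = (big(a, S0) + ((a & b) ^ (a & c) ^ (b & c))) & mask
            a, b, c, d, e, f, g, h = (T1 + T2) & mask, a, b, c, (d + T1) & mask, e, f, g
        H = [(x + y) & mask for x, y in zip((a, b, c, d, e, f, g, h), H)]   # Eq. (15)
    return b"".join(x.to_bytes(wb, "big") for x in H)


if __name__ == "__main__":
    for name in VARIANTS:
        print(name, sha2(b"abc", name) == hashlib.new(name, b"abc").digest())   # True
```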
3.5. Base and optimized designs
Simple hardware designs for the above hash functions can easily be developed. In these architectures, called base architectures, the transformation round is implemented exactly as described by the standard. Examples of such transformation rounds are the SHA-1 and SHA-256 rounds shown in Fig. 1. There, the term $f_t(x, y, z)$ denotes the non-linear functions (Ch(x, y, z), Maj(x, y, z), Parity(x, y, z)) used by the SHA-1 algorithm, and the shaded boxes correspond to the circuit modules on the critical paths.
The pipeline choice most widely adopted by hash function designers is a 4-stage pipelined architecture. Such a general architecture for the base designs is shown in Fig. 2. The inputs to the architecture are the initial values H and the input message block, as described in the previous sub-section. Each pipeline stage contains three parts. The first is a round unit that executes the iterations of the transformation round assigned to this stage. The second is a $W_t$ unit, which computes the required $W_t$ values. The third is a set of local registers that store the constant values $K_t$.
Since four pipeline stages are used, each stage executes 20, 16, and 20 iterations of the transformation round of the SHA-1, SHA-256, and SHA-512 algorithms, respectively. Because each round unit executes several iterations, multiplexers in front of it select either the result of an iteration executed by the current round unit or the result of the previous pipeline stage.
The control logic consists of four counters, count_i (i = 1, 2, 3, 4), that activate the next pipeline stage and address the local memories. Specifically, the output of each counter addresses the local memory of its stage. The next pipeline stage is activated via the signals tround_i and tcwicnt_i, which are generated when the previous pipeline stage finishes its computations.
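The partitioning of iterations among the four stages can be modeled in a few lines. The Python sketch below makes two simplifying assumptions for illustration, neither taken from Fig. 2: the iterations are split evenly and in order, and a stage's counter value directly addresses its local K memory. The model also shows a convenient consequence for SHA-1. With 20 iterations per stage, every stage uses a single non-linear function and a single distinct constant value.

```python
# Iteration counts t_max^j from Table 1 and the pipeline depth of Fig. 2.
T_MAX = {"SHA-1": 80, "SHA-256": 64, "SHA-512": 80}
STAGES = 4


def stage_ranges(t_max: int, stages: int = STAGES):
    """Iterations executed by each stage's round unit (even, in-order split assumed)."""
    per_stage, rest = divmod(t_max, stages)
    if rest:
        raise ValueError("iterations do not divide evenly among the stages")
    return [range(s * per_stage, (s + 1) * per_stage) for s in range(stages)]


def locate(t: int, t_max: int, stages: int = STAGES):
    """(stage, local address) of iteration t: the stage whose round unit runs it and
    the counter value 0..per_stage-1 that would address K_t in that stage's local
    memory (hypothetical addressing scheme)."""
    return divmod(t, t_max // stages)


def sha1_function(t: int) -> str:
    """Name of the f_t used by SHA-1 at iteration t (Section 3.4)."""
    if t <= 19:
        return "Ch"
    if 40 <= t <= 59:
        return "Maj"
    return "Parity"


if __name__ == "__main__":
    for name, t_max in T_MAX.items():
        print(name, [len(r) for r in stage_ranges(t_max)])
    # With 20 iterations per stage, each SHA-1 stage needs a single f_t.
    print([{sha1_function(t) for t in r} for r in stage_ranges(80)])
```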
Beyond the base architectures, the literature contains numerous throughput-optimized hardware implementations of the SHA-1 and SHA-2 families, as reported in Section 2. Among the best published designs in terms of throughput and throughput/area are those reported in [17,18]. There, advanced optimization methodologies led to optimized SHA-1 and SHA-256 designs. These methodologies employ techniques such as loop unrolling, spatial pre-computation and resource reordering, retiming, temporal pre-computation, and circuit-level optimization. The optimized transformation rounds proposed in those works are shown in Fig. 3. Owing to these techniques, they include two computation stages (pre- and post-computation) separated by a pipeline register. Additionally, because the loop is unrolled by 2, they iterate half as many times as the rounds of Fig. 1.
We chose to use the above optimized designs as inputs to the introduced design flow to produce our optimized multi-mode architectures.
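Two of the techniques listed above, loop unrolling and pre-computation, can be illustrated in software on the base SHA-256 round. The sketch below is a generic illustration under our own assumptions; it does not reproduce the optimized rounds of Fig. 3. By Eq. (13), $h_{t+1} = g_t$. The sum $h + K_{t+1} + W_{t+1}$ needed by the second of two unrolled iterations is therefore available at the start and can be computed in parallel with the first iteration. All names are illustrative.

```python
import random

MASK = 0xFFFFFFFF


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def big_sigma0(x):
    return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22)


def big_sigma1(x):
    return rotr(x, 6) ^ rotr(x, 11) ^ rotr(x, 25)


def ch(x, y, z):
    return (x & y) ^ (~x & z)


def maj(x, y, z):
    return (x & y) ^ (x & z) ^ (y & z)


def round_256(state, k, w):
    """One base SHA-256 iteration, Eqs. (11)-(13)."""
    a, b, c, d, e, f, g, h = state
    t1 = (h + big_sigma1(e) + ch(e, f, g) + k + w) & MASK
    t2 = (big_sigma0(a) + maj(a, b, c)) & MASK
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)


def double_round_256(state, k0, w0, k1, w1):
    """Two iterations per call (unrolling by 2). Since h_{t+1} = g_t, the second
    iteration's h + K + W does not depend on the first one and is precomputed."""
    a, b, c, d, e, f, g, h = state
    p1 = (g + k1 + w1) & MASK                        # precomputed second-iteration term
    t1 = (h + big_sigma1(e) + ch(e, f, g) + k0 + w0) & MASK
    a1 = (t1 + big_sigma0(a) + maj(a, b, c)) & MASK
    e1 = (d + t1) & MASK
    # Second iteration, operating on (a1, a, b, c, e1, e, f, g):
    t1b = (p1 + big_sigma1(e1) + ch(e1, e, f)) & MASK
    a2 = (t1b + big_sigma0(a1) + maj(a1, a, b)) & MASK
    return (a2, a1, a, b, (c + t1b) & MASK, e1, e, f)


if __name__ == "__main__":
    rng = random.Random(2014)
    state = tuple(rng.getrandbits(32) for _ in range(8))
    k0, w0, k1, w1 = (rng.getrandbits(32) for _ in range(4))
    unrolled = double_round_256(state, k0, w0, k1, w1)
    print(unrolled == round_256(round_256(state, k0, w0), k1, w1))   # True
```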
4. Proposed flow and base multi-mode designs
In this section, the proposed design flow and the base multi-mode architectures are presented in detail. We start with a discussion of the features and properties of the hash functions on which the flow is based. Then, the design flow and the resulting architectures are presented.
In the following sub-sections, the above are presented along with a running example that uses the SHA-1 and SHA-256/512 base designs, so that the flow is easy to follow. SHA-256 and SHA-512 share many similarities, so creating their multi-mode design is not complex. Hence, we demonstrate the introduced flow from the point where the base SHA-256/512 design is merged with the base SHA-1 design. The corresponding optimized designs are constructed in the same way, following the introduced flow. The resulting optimized architectures are presented in Section 5.

[Fig. 1. Base transformation rounds: SHA-1 (a) and SHA-256 (b).]
[Fig. 2. General structure of a 4-stage pipelined, base architecture.]
4.1. Special features of SHA-1, SHA-256, and SHA-512 hash algorithms
Eqs. (8)–(13) describe the message scheduling and transformation round procedures. From them it follows that the major features of the targeted hash functions are the similarity and the relative simplicity of the performed computations. The similarity of the computations is a general feature that holds for both families. The simplicity of the computations yields simple designs, which reduces the overall complexity of the flow, as explained in the following.
Regarding the transformation rounds, only two working variables are computed during each iteration; the remaining ones are simply copied from a neighboring variable. In particular, only the variables a and c are modified in the SHA-1 algorithm, and only the variables a and e are computed in the SHA-256 and SHA-512 algorithms. This feature is important because it considerably reduces the complexity of the flow and the area of the final multi-mode architectures. Hence, the effort focuses mainly on merging a small number of sub-circuits, particularly those that correspond to the modified variables.
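The claim about modified variables can be checked empirically. The Python sketch below runs one SHA-1 iteration and one SHA-256 iteration on random register contents. It then reports which outputs are not a plain copy of the neighboring input register. The check is randomized, not a proof, and all names are illustrative.

```python
import random

MASK = 0xFFFFFFFF


def rotl(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def sha1_round(s, k=0x5A827999, w=0x01234567):
    """One SHA-1 iteration, Eqs. (9)-(10), with f_t = Ch (0 <= t <= 19)."""
    a, b, c, d, e = s
    t1 = (rotl(a, 5) + ((b & c) ^ (~b & d)) + e + w + k) & MASK
    return (t1, a, rotl(b, 30), c, d)


def sha256_round(s, k=0x428A2F98, w=0x01234567):
    """One SHA-256 iteration, Eqs. (11)-(13)."""
    a, b, c, d, e, f, g, h = s
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    t1 = (h + s1 + ((e & f) ^ (~e & g)) + k + w) & MASK
    t2 = (s0 + ((a & b) ^ (a & c) ^ (b & c))) & MASK
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)


def modified_variables(round_fn, names, trials=16, seed=0):
    """Randomized check: a variable counts as 'modified' if its new value is not
    simply the previous value of its left neighbor (a is always computed)."""
    rng = random.Random(seed)
    modified = {names[0]}
    for _ in range(trials):
        s = tuple(rng.getrandbits(32) for _ in names)
        out = round_fn(s)
        for i in range(1, len(names)):
            if out[i] != s[i - 1]:
                modified.add(names[i])
    return modified


if __name__ == "__main__":
    print(sorted(modified_variables(sha1_round, "abcde")))       # ['a', 'c']
    print(sorted(modified_variables(sha256_round, "abcdefgh")))  # ['a', 'e']
```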
Additionally, the operations of the transformation rounds consist of simple modulo $2^w$ additions, simple bitwise logical computations in the non-linear functions, and rotations/shifts by a fixed number of bits. Hence, the merging of more than one function becomes more flexible. This flexibility comes from exploiting the associativity and commutativity of addition and the low delay of the non-linear functions and shifts/rotations. In other words, these features allow adders and/or non-linear functions to be reused to compute the modified variables of the other hash functions. In that way, the delay overhead of the multi-mode design is reduced and resource sharing is easily applied. The same holds for the message schedule procedures of the targeted algorithms, because they have features similar to those of the transformation rounds.

[Fig. 3. Optimized transformation rounds: SHA-1 (a) and SHA-256 (b).]
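The role of associativity and commutativity can be seen on the five-operand sum $T_1$ of Eq. (9). In the Python sketch below, a chain of two-input adders and a balanced adder tree produce the same result modulo $2^{32}$, with the operands in any order. A designer is therefore free to regroup the additions or to route operands of another hash function into the same adders. For example, the additions can be arranged as a tree with $\lceil \log_2 5 \rceil = 3$ adder levels instead of a chain of 4. The function names and the random operands are illustrative.

```python
import itertools
import random

MASK = 0xFFFFFFFF


def add_chain(operands):
    """Models a chain of two-input modulo-2^32 adders: ((x0 + x1) + x2) + ..."""
    total = 0
    for x in operands:
        total = (total + x) & MASK
    return total


def add_tree(operands):
    """Models a balanced tree of two-input modulo-2^32 adders (ceil(log2 n) levels)."""
    level = list(operands)
    if not level:
        return 0
    while len(level) > 1:
        nxt = [(level[i] + level[i + 1]) & MASK for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])      # an odd operand passes on to the next level
        level = nxt
    return level[0]


if __name__ == "__main__":
    rng = random.Random(7)
    # Stand-ins for the five SHA-1 T1 operands of Eq. (9): ROTL5(a), f_t, e, W_t, K_t.
    ops = [rng.getrandbits(32) for _ in range(5)]
    reference = sum(ops) % 2**32
    print(all(add_chain(p) == reference and add_tree(p) == reference
              for p in itertools.permutations(ops)))              # True
```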
4.2. Overall structure of the systematic design flow
The overall structure of the proposed flow is illustrated in Fig. 4. The inputs are the targeted hash algorithms and their individual designs in the form of dependency graphs (block diagrams). In these graphs, each node denotes a hardware resource, and each edge corresponds to an interconnection between nodes or between primary I/Os and nodes. The output is a multi-mode architecture that implements a set of hash functions.
It must be mentioned that the goal of the introduced flow is not to propose techniques for developing an optimized design for each of the targeted hash functions. We assume that the design of each hash function has already been developed, and we use it as input to the flow.
The flow consists of four main stages. The first stage is an initialization. Specifically, the number of pipeline stages of the input designs is determined, so that the subsequent stages of the flow are repeated that many times. Additionally, the input designs are compared, and the bigger graph, in terms of employed I/Os and computational resources, is selected as the base design.
In the second stage, the multi-mode transformation round of each pipeline stage is developed. To achieve this, an iterative procedure is followed that considers two functions at a time. In particular, the design of the most complex transformation round is used as the base design, and the design of the second function is merged into it. Once the multi-mode transformation round for the first two algorithms has been developed, it becomes the base, and the round of a third function is chosen for merging. This procedure is repeated for all remaining functions.
Next, in the third stage, the multi-mode message schedule unit is developed, following an approach similar to that used for the multi-mode transformation rounds. Finally, in the fourth stage, the control unit is developed. The required control signals are specified based on the developed multi-mode round and message schedule units. The FSMs of the individual designs are then modified accordingly to form the FSM of the multi-mode architecture.
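Stages 1 and 2 can be caricatured with a toy model in which each transformation round is reduced to a count of resources per type. The Python sketch below is our own simplification, not the paper's algorithm. It assumes ideal sharing, so a merged round needs the per-type maximum of its inputs plus one 2:1 multiplexer per shared instance. It ignores interconnect, I/O assignment, and delay. The resource counts in the example are hypothetical.

```python
from collections import Counter


def merge_rounds(base: Counter, other: Counter):
    """Idealized merge of two rounds: every resource type is shared, so the merged
    round needs the per-type maximum, and each instance used by both modes is fed
    through one 2:1 multiplexer (toy assumption)."""
    merged = Counter({r: max(base[r], other[r]) for r in base.keys() | other.keys()})
    muxes = sum(min(base[r], other[r]) for r in merged)
    return merged, muxes


def design_flow(rounds: dict):
    """Stage 1: pick the largest round as the base. Stage 2: merge the others into it
    one at a time; the merged round becomes the new base after each step."""
    order = sorted(rounds, key=lambda name: sum(rounds[name].values()), reverse=True)
    base, total_muxes = Counter(rounds[order[0]]), 0
    for name in order[1:]:
        base, muxes = merge_rounds(base, Counter(rounds[name]))
        total_muxes += muxes
    return order, base, total_muxes


if __name__ == "__main__":
    # Hypothetical per-round inventories (first pipeline stage), for illustration only.
    rounds = {
        "SHA-1": Counter(adder=4, ch=1),
        "SHA-256/512": Counter(adder=7, sigma=2, ch=1, maj=1),
    }
    order, merged, muxes = design_flow(rounds)
    separate = sum(rounds.values(), Counter())
    print(order, dict(merged), muxes, dict(separate))
```

Under these assumptions, the merged round needs 11 resource instances instead of the 16 of two separate rounds, at the cost of 5 multiplexers. The real trade-off depends on the multiplexer and routing cost in the target FPGA.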
In the following sections, these stages are presented in detail through the running example in which the SHA-256/512 design is merged with the SHA-1 design.
4.3. Stage 1: initialization
The inputs of the flow are the SHA-1 and SHA-256/512 designs with 4 pipeline stages. As mentioned above, to describe the flow clearly, the individual (base) designs are derived directly from the descriptions of the corresponding algorithms. The designs of the transformation rounds follow easily from Eqs. (9)–(14) and Sub-section 4.1. The corresponding graph for SHA-1 is shown in Fig. 1(a). The round graph of SHA-256/512 is almost the same as the one shown in Fig. 1(b). Because SHA-256 and SHA-512 are extremely similar, the multi-mode transformation round of SHA-256/512 is almost identical to that of SHA-256, but with a double-sized (64-bit) datapath. Additionally, the $\Sigma_0$ and $\Sigma_1$ boxes include the functions of both algorithms. More details are given in the following sub-sections.
As shown in Fig. 4, the initialization stage includes three steps. First, the input graphs are compared, and the bigger one, in terms of employed I/Os and computational resources, is selected as the base. The SHA-256/512 graph is clearly the more complex one, so it is taken as the base design. Then, the number of pipeline stages of the designs is determined. Consequently, the whole procedure of the design flow is repeated 4 times, once per pipeline stage. However, it must be stressed that the introduced flow can be applied to any architecture, independently of its number of pipeline stages.
4.4. Stage 2: multi-mode transformation round
After the initialization, the next stage of the flow concerns the development of the multi-mode transformation round of each pipeline stage. This procedure is divided into three steps (Fig. 5). First, the sub-circuits (circuit paths) required to realize the computations of the transformation rounds are developed, ignoring the assignment of primary I/Os. Then, the primary I/Os are assigned. Finally, any differences in word length among the involved algorithms are handled.

[Fig. 4. Overall structure of the systematic design flow: Stage 1, initialization; Stage 2, multi-mode transformation round (construction of the multi-mode paths); Stage 3, multi-mode W_t unit (construction of the multi-mode paths); Stage 4, multi-mode control unit.]
4.4.1. Circuit paths development
We define a circuit path, or simply path, as a walk through circuit modules that starts at a primary input and terminates at a primary output. If the circuit path that starts from one input uses exactly the same modules as the path that starts from another input, the two are treated as one path.
4.4.1.1. Multi-mode paths development. Initially, the circuit paths of the smaller graph (i.e. SHA-1) are ranked by their contribution to the delay and the occupied area. Then, the first path of the list is selected and merged with the resources of the base design, which in our case is that of SHA-256/512. The procedure is then repeated for the next path of the list, until all paths of the smaller graph have been included in the base design.
The circuit paths are merged by checking whether the base graph contains an identical or similar path and adding the required steering logic (e.g. multiplexers). If there is no similar path, a new circuit path is inserted in the base design together with the required additional circuitry. Two circuit paths are similar if there is a one-to-one correspondence between their circuit modules and interconnections; in other words, the graphs of the two paths are isomorphic. To avoid an overlong presentation, only the development of the first multi-mode transformation round is used as a running example.
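The classification at the heart of Fig. 5 (identical path, similar path, or new path) can be sketched with paths encoded as ordered tuples of modules. In the Python sketch below, the encoding and all names are hypothetical. It relies on the fact that, for simple linear walks, isomorphism reduces to an order-preserving, one-to-one match of module types. Treating "identical" as also requiring equal module configurations is our interpretation.

```python
from typing import Callable, List, Tuple

# Hypothetical encoding: a module is (type, configuration), e.g. ("ADD", "") or
# ("NLF", "Ch"); a circuit path is the ordered tuple of modules it walks through.
Module = Tuple[str, str]
Path = Tuple[Module, ...]


def rank_paths(paths: List[str], cost: Callable[[str], Tuple[float, float]]) -> List[str]:
    """Order the smaller graph's paths by (delay, area) contribution, largest first."""
    return sorted(paths, key=cost, reverse=True)


def identical(p: Path, q: Path) -> bool:
    """Same modules with the same configuration, in the same order."""
    return p == q


def similar(p: Path, q: Path) -> bool:
    """For linear walks, isomorphism means a one-to-one, order-preserving match of
    module types; configurations (e.g. which non-linear function) may differ."""
    return len(p) == len(q) and all(m[0] == n[0] for m, n in zip(p, q))


def merge_decision(path: Path, base_paths: List[Path]):
    """Classify a path of the smaller graph against the paths of the base graph."""
    same = [q for q in base_paths if identical(path, q)]
    if same:
        return "identical", same        # candidate multi-paths, reused via steering logic
    alike = [q for q in base_paths if similar(path, q)]
    if alike:
        return "similar", alike         # candidate multi-paths, reused via steering logic
    return "new path", []               # extra circuitry inserted in the base design


if __name__ == "__main__":
    add = ("ADD", "")
    base = [(("NLF", "Ch"), add, add, add), (("NLF", "Maj"), add, add, add)]
    print(merge_decision((("NLF", "Ch"), add, add, add), base)[0])      # identical
    print(merge_decision((("NLF", "Parity"), add, add, add), base)[0])  # similar
    print(merge_decision((("ROTL", "5"), add), base)[0])                # new path
```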
Studying the two input transformation rounds shows that the first selected path of SHA-1 is the one in Fig. 6(a). The graph of SHA-256/512 (base graph) contains two identical paths, shown in Fig. 6(b) and (c). The split signal is omitted for clarity. Thus, there are two candidate multi-paths for merging the first selected path of SHA-1 with the current base design. Because the assignment of the inputs and outputs of the candidate paths is ignored during the first sub-stage of the flow, Fig. 7 shows the two candidate multi-paths with general names for the I/Os. Note that Fig. 6(a) uses the Ch(d, c, b) function as the non-linear function. This is because Ch is executed in the first 20 iterations, which are assigned to the first pipeline stage.
Then, the second path of SHA-1 (Fig. 8(a)) is selected from the list. The graph of SHA-256/512 also contains an identical path (Fig. 8(b)). Combining it with the currently selected SHA-1 path produces a third candidate multi-path, shown in Fig. 8(c). This procedure is repeated for all paths of the SHA-1 graph, but we use the above paths to present the next steps of the flow. The next step compares the generated multi-paths with the paths of the base design, in order to perform resource sharing and keep the delay penalty low.
4.4.1.2. Resource sharing with the base design. The two selected paths of the SHA-1 graph (Figs. 6(a) and 8(a)) are used together for the computation of the $a_{\text{SHA-1}}$ variable, as shown in Fig. 1. In addition, they are used for the production of the three candidate multi-paths shown in Figs. 7 and 8(c). Specifically, the first path of the SHA-1 graph (Fig. 6(a)) is included in the first two multi-paths (Fig. 7), whereas the second path of the SHA-1 graph (Fig. 8(a)) is included in the third candidate multi-path (Fig. 8(c)).
Therefore, the computation of the $a_{\text{SHA-1}}$ variable can be implemented by combining either the first and third multi-paths or the second and third multi-paths. To keep the delay and area overheads low, the selected pair of multi-paths has to be properly merged with the resources existing in the base design (SHA-256/512 transformation round).
[Fig. 5 flowchart: Circuit paths development, I/Os assignment, Word-length handling]
Fig. 5. Stage 2: multi-mode transformation round.
H.E. Michail et al. / INTEGRATION, the VLSI journal 47 (2014) 387407 393
Based on Eq. (9), the computation of the $a_{\text{SHA-1}}$ variable is performed by adding five terms, namely $\mathrm{ROTL}^{5}(a)$, Ch(b, c, d), e, W, and K. Studying the graph of the SHA-256/512 transformation round, it can be easily derived that it contains two circuit paths consisting of four serially interconnected adders, which constitute its two equivalent critical paths. Since a chain of four serially interconnected adders has five inputs and one output, either of these circuit structures of the SHA-256/512 transformation round can be used to implement the computation of the $a_{\text{SHA-1}}$ variable.
In effect, we exploit the associativity and commutativity of modular addition and change the order of the performed additions, as well as the variables used in each addition, without altering the computation of the variable $a_{\text{SHA-1}}$. In that way, the resources existing in the SHA-256/512 graph are reused, keeping the area overhead low. This is one of the features of hash functions mentioned in Sub-section 4.1 that are exploited in our design flow.
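A minimal Python sketch of this reordering argument (not part of the original flow): it evaluates the five SHA-1 addends of Eq. (9) for the first round of the FIPS 180-4 test message "abc" and feeds them to a model of four serial 32-bit adders in all 120 possible input orders. The helper names `adder_chain` and `sha1_a_terms` are illustrative assumptions. The point is only that, with modular addition, the SHA-1 sum can occupy the SHA-256/512 adder chain in whichever order suits the existing wiring.

```python
from functools import reduce
from itertools import permutations

MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """32-bit left rotation, ROTL^n(x)."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def ch(x, y, z):
    """Bitwise Ch: take y where x is 1 and z where x is 0."""
    return ((x & y) ^ (~x & z)) & MASK32

def adder_chain(terms):
    """Four serially connected 32-bit modular adders: five inputs, one output."""
    assert len(terms) == 5
    return reduce(lambda acc, t: (acc + t) & MASK32, terms)

def sha1_a_terms(a, b, c, d, e, w, k):
    """The five addends of the new SHA-1 'a' value (rounds 0-19 use Ch)."""
    return [rotl32(a, 5), ch(b, c, d), e, w, k]

# Initial SHA-1 state, first message word of padded "abc", and K for rounds 0-19.
terms = sha1_a_terms(0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476,
                     0xC3D2E1F0, 0x61626380, 0x5A827999)
results = {adder_chain(list(order)) for order in permutations(terms)}
print(len(results))               # 1: every input order gives the same sum
print(hex(next(iter(results))))   # 0x116fc33
```

The value 0x0116FC33 is the SHA-1 working variable a after round 0 for the message "abc".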
In order to avoid increasing the critical path of the base design, the circuit paths of the four interconnected adders of the SHA-256/512 round should remain intact. To achieve this, the pairs of multi-paths discussed above have to be properly combined. This is accomplished by connecting the output $O^{3}_{1}$ with the input $I^{1}_{1}$ or with the input $I^{2}_{1}$. Moreover, the adder encircled in Fig. 7 has to be bypassed when the SHA-1 function is executed. This is easily accomplished by using a multiplexer, as shown in Fig. 9, which illustrates the modification of the candidate multi-paths after resource sharing with the base design.
Although the above procedure seems complex in the general case, this is not so for hash functions. As mentioned in Sub-section 4.1, a basic feature of these functions is that only a small number of working variables (e.g., two variables for SHA-1, SHA-256, and SHA-512) are computed in the transformation rounds by simple computational structures. Thus, the graphs of the transformation rounds are not complex and the paths that have to be examined are few. Both these properties reduce the complexity and allow the application of the above procedure.
4.4.2. I/Os assignment
The goal of this stage is to assign (place) the variables of the second algorithm in the positions of the input register, while trying to avoid the use of extra steering logic and the corresponding area and delay overheads.
Since each multi-path is used to execute a number of transformation rounds of the targeted hash functions, the output values produced after the execution of one round are fed back to the inputs of the multi-mode round unit through registers (Fig. 2). As the base design is that of the SHA-256/512 function, which uses eight 64-bit variables, an 8 × 64-bit (512-bit) register is used for this purpose. Taking into consideration the design of the SHA-256/512 transformation round (see Fig. 1(b)), the variable a is placed in the rightmost position of the input register, variable b in the next position to its left, and so on.
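The register layout just described can be modeled in a few lines of Python. This is a sketch under one stated assumption: "rightmost position" is taken to mean the least significant 64 bits of a 512-bit integer. The function names are hypothetical.

```python
WORD = 64
MASK = (1 << WORD) - 1
NAMES = "abcdefgh"   # slot 0 (assumed rightmost = least significant) holds a, slot 1 holds b, ...

def pack_state(state):
    """Pack the eight 64-bit working variables into one 512-bit register value."""
    reg = 0
    for slot, name in enumerate(NAMES):
        reg |= (state[name] & MASK) << (WORD * slot)
    return reg

def unpack_state(reg):
    """Inverse of pack_state: read each 64-bit slot back into its variable."""
    return {name: (reg >> (WORD * slot)) & MASK for slot, name in enumerate(NAMES)}

state = {name: slot + 1 for slot, name in enumerate(NAMES)}
assert unpack_state(pack_state(state)) == state
```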
Initially, the produced multi-paths (Fig. 9) are examined to find whether there are inputs or outputs that could be placed in the same positions of the input register as those of the base design. If so, these I/Os are placed in the corresponding positions. Studying the first two candidate multi-paths (see Fig. 9), it is easily derived that the output $O^{2}_{1}$ corresponds to the computed variables $a_{\text{SHA-1}}$ and $a_{\text{SHA-256/512}}$; thus, these variables are stored in the same position of the register, namely the rightmost one.
Fig. 6. The first selected path of the SHA-1 graph (a) and identical paths on the SHA-256/512 graph (b) and (c).
Fig. 7. The first (a) and the second (b) candidate multi-paths.
Then, it is examined whether there are variables that can be placed in other positions of the register and remain intact throughout the execution of the targeted algorithms. In other words, since the multi-mode round unit executes one hash algorithm at a time, we try to perform variable renaming by placing its variables in proper
positions of the input register. Considering the first candidate multi-path (Fig. 9(a)), we can assign the variables g, f, and e to the inputs $I^{1}_{2}$, $I^{1}_{3}$, and $I^{1}_{4}$ when the SHA-512 algorithm is executed and place these variables in the corresponding register positions. We also assign the variables $d_{\text{SHA-1}}$, $c_{\text{SHA-1}}$, and $b_{\text{SHA-1}}$ to the inputs $I^{1}_{2}$, $I^{1}_{3}$, and $I^{1}_{4}$ when SHA-1 is executed. Thus, the same registers store either the variables $g_{\text{SHA-256/512}}$, $f_{\text{SHA-256/512}}$, and $e_{\text{SHA-256/512}}$ or the variables $d_{\text{SHA-1}}$, $c_{\text{SHA-1}}$, and $b_{\text{SHA-1}}$, with no conflict. Based on the above, the new multi-paths with assigned I/Os are shown in Fig. 10.
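The variable renaming above can be illustrated with a small Python lookup model. It is a sketch only: it models just the register slots named in the text (slot 0 = rightmost), and the port names `I2`, `I3`, `I4` are hypothetical stand-ins for $I^{1}_{2}$, $I^{1}_{3}$, $I^{1}_{4}$. Because the ports are hard-wired to fixed slots, each mode sees its own variables on the same wires without any multiplexer.

```python
# Register slot (0 = rightmost) holding each variable, per executed algorithm.
# Only the slots discussed in the text are modeled (assumption).
SLOT = {
    "SHA-512": {"a": 0, "e": 4, "f": 5, "g": 6},
    "SHA-1":   {"a": 0, "b": 4, "c": 5, "d": 6},
}
# Fixed wiring of multi-path 1 inputs to register slots (hypothetical port names).
PORT_SLOT = {"I2": 6, "I3": 5, "I4": 4}

def variable_at(mode, slot):
    """Name of the variable stored in a register slot under the given mode."""
    for name, s in SLOT[mode].items():
        if s == slot:
            return name
    return None

def port_view(mode):
    """Which variable each hard-wired input port sees in a given mode."""
    return {port: variable_at(mode, slot) for port, slot in PORT_SLOT.items()}

print(port_view("SHA-512"))  # {'I2': 'g', 'I3': 'f', 'I4': 'e'}
print(port_view("SHA-1"))    # {'I2': 'd', 'I3': 'c', 'I4': 'b'}
```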
Adopting the above assignment of the I/Os, a de-multiplexer is needed for the $O^{1}_{1}$ output of the first multi-path, because this output produces either the variable $e_{\text{SHA-256/512}}$ or the variable $a_{\text{SHA-1}}$, depending on the executed algorithm. However, the position used for storing the $e_{\text{SHA-256/512}}$ variable is also used for storing the variable $b_{\text{SHA-1}}$. Thus, a de-multiplexer is required to resolve this conflict and store the variable $a_{\text{SHA-1}}$ in the proper register position, as shown in Fig. 10. However, the de-multiplexer lies in the critical path, which increases the delay of the produced design as well as its total area. Thus, of the two multi-paths, the second one is selected and used in the final design.
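The selection criterion can be expressed as a small Python check: an output needs a de-multiplexer when the register slot it must write depends on the executed algorithm. This sketch models only the two slots discussed in the text; the function `needs_demux` and the candidate labels are hypothetical.

```python
# Register slot (0 = rightmost) of each variable per mode (only the slots named in the text).
SLOT = {
    "SHA-512": {"a": 0, "e": 4},
    "SHA-1":   {"a": 0, "b": 4},
}

def needs_demux(produced):
    """produced maps mode -> variable that an output computes in that mode.
    A de-multiplexer is needed when the destination slot depends on the mode."""
    destinations = {SLOT[mode][var] for mode, var in produced.items()}
    return len(destinations) > 1

candidates = {
    "multi-path 1": {"SHA-512": "e", "SHA-1": "a"},  # O^1_1: slot 4 vs slot 0
    "multi-path 2": {"SHA-512": "a", "SHA-1": "a"},  # O^2_1: slot 0 in both modes
}
chosen = min(candidates, key=lambda c: needs_demux(candidates[c]))
print(chosen)  # multi-path 2
```

A fuller cost model would also weigh area and delay of every steering element; here the single Boolean is enough to reproduce the choice made in the text.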
The above procedure is repeated for the third multi-path (Fig. 8(c)), and its I/O assignment is performed. Combining the produced path with that of Fig. 10(b) yields the implementation of the first two selected paths of SHA-1, which is shown in Fig. 11.
4.4.3. Word-length handling
Finally, the last step in the development of the multi-mode transformation round is the handling of the different word lengths used by each algorithm. Specifically, the SHA-256/512 design uses 64-bit words, whereas SHA-1 uses 32-bit words. To handle this while reusing the base design (SHA-256/512), a ratio r is specified as the quotient of the two word lengths, and r instances of the resources of the algorithm with the smaller word length are employed.
Specifically, in our case this ratio equals 2 (64/32), and therefore the Ch(x, y, z) node of Fig. 13 includes two 32-bit units that compute the Ch(x, y, z) non-linear function either on two 32-bit data words or once on 64-bit data. This is possible because the corresponding computations are bitwise. Concerning the addition node, it also contains two 32-bit modulo adders, where the carry-out signal of the first adder is connected to the carry-in signal of the second through simple logic (an AND gate) driven by a control signal. In this way, either two 32-bit additions, as required by SHA-1 and SHA-256, or one 64-bit addition, as required by SHA-512, can be executed. This selection is made by the split signal. As a result, the produced design is able to perform SHA-512 hashing on one message or SHA-1 hashing on two different messages.
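Both mechanisms can be checked with a short behavioral model in Python. It is a sketch, not the VHDL: it assumes that `split=True` selects the 2 × 32-bit mode (as for SHA-1/SHA-256) and that the AND gate passes the lower carry only when `split` is False; the function names are illustrative. The loop verifies that the gated adder behaves as one 64-bit adder or as two independent 32-bit adders, and that a 64-bit Ch equals two 32-bit Ch units side by side.

```python
import random

MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def ch(x, y, z, mask):
    """Bitwise Ch(x, y, z) = (x AND y) XOR (NOT x AND z), truncated to the word width."""
    return ((x & y) ^ (~x & z)) & mask

def split_add(x, y, split):
    """64-bit adder built from two 32-bit adders.
    split=True : two independent 32-bit modular additions (assumed SHA-1/SHA-256 mode).
    split=False: one 64-bit modular addition (SHA-512 mode)."""
    lo = (x & MASK32) + (y & MASK32)            # lower 32-bit adder (33-bit result)
    carry_lo = lo >> 32                          # carry-out of the lower adder
    carry_hi_in = carry_lo & int(not split)      # AND gate controlled by the mode
    hi = (x >> 32) + (y >> 32) + carry_hi_in     # upper 32-bit adder
    return ((hi & MASK32) << 32) | (lo & MASK32)

def pair(hi, lo):
    """Place two 32-bit words side by side on one 64-bit bus."""
    return (hi << 32) | lo

rng = random.Random(0)
for _ in range(1000):
    x1, x2, y1, y2, z1, z2 = (rng.getrandbits(32) for _ in range(6))
    x, y, z = pair(x1, x2), pair(y1, y2), pair(z1, z2)
    assert split_add(x, y, split=False) == (x + y) & MASK64
    assert split_add(x, y, split=True) == pair((x1 + y1) & MASK32, (x2 + y2) & MASK32)
    assert ch(x, y, z, MASK64) == pair(ch(x1, y1, z1, MASK32), ch(x2, y2, z2, MASK32))
print("ok")
```

Only the adders need the gate: bitwise functions such as Ch split for free because no bit depends on its neighbours.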
4.4.4. Produced multi-mode transformation rounds
Up to now, the merging of two paths of the SHA-1 transformation round with the design of the SHA-256/512 transformation round has been described. The same procedure is repeated for the remaining paths of the SHA-1 transformation round. After all paths have been merged, the multi-mode transformation round used in the first pipeline stage of the architecture of Fig. 2 is produced; it is depicted in Fig. 12.
Fig. 8. The second selected path of the SHA-1 graph (a), the identical path on the SHA-256/512 graph (b), and the third candidate multi-path (c).
Fig. 9. The first (a) and the second (b) candidate multi-paths after performing resource sharing with the base design.
Repeating the above procedure, the multi-mode transformation rounds for the second, third, and fourth pipeline stages are produced; they are illustrated in Fig. 13.
It should be mentioned that the multi-mode transformation rounds of the second and fourth pipeline stages are identical. This happens because the SHA-1 transformation rounds assigned to these pipeline stages are identical. Additionally, since the Parity(x,y,z) non-linear function is used in these rounds, a Par module is included in the corresponding multi-mode round, and its outputs are chosen only when the SHA-1 function is selected (circled in Fig. 13(a)). Finally, in the third multi-mode round, the Maj(x,y,z) non-linear function is used in both the SHA-1 and SHA-256/512 designs. However, its inputs are different in each design. To overcome this, a 6-to-3 multiplexer is added before the existing Maj of SHA-256/512 so as to feed it with the appropriate values, depending on the function (circled in Fig. 13(b)).
Fig. 10. First (a) and second (b) multi-paths after I/Os assignment.
Fig. 11. Merged SHA-1/256/512 multi-paths with resource sharing and I/Os assignment.
Fig. 12. Multi-mode transformation round of the first pipeline stage.
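The non-linear functions and the 6-to-3 multiplexer can be sketched in Python as follows. The function definitions follow FIPS 180-4 (SHA-1 uses Parity in rounds 20 to 39 and 60 to 79, and Maj(b, c, d) in rounds 40 to 59; SHA-256/512 uses Maj(a, b, c)). Modeling the multiplexer as three 2-to-1 selections keyed by a `mode` string is an illustrative assumption, as are the function names and test values.

```python
def parity(x, y, z):
    """SHA-1 Parity(x, y, z) = x XOR y XOR z (bitwise)."""
    return x ^ y ^ z

def maj(x, y, z):
    """Maj(x, y, z) = (x AND y) XOR (x AND z) XOR (y AND z) (bitwise)."""
    return (x & y) ^ (x & z) ^ (y & z)

def maj_inputs(mode, a, b, c, d):
    """6-to-3 multiplexer in front of the shared Maj unit (three 2-to-1 muxes).
    SHA-1 feeds (b, c, d); SHA-256/512 feeds (a, b, c)."""
    return (b, c, d) if mode == "SHA-1" else (a, b, c)

a, b, c, d = 0xF0F0, 0xFF00, 0x0FF0, 0x00FF   # arbitrary test words
print(hex(maj(*maj_inputs("SHA-1", a, b, c, d))))    # 0xff0
print(hex(maj(*maj_inputs("SHA-256", a, b, c, d))))  # 0xfff0
```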
4.5. Stage 3: multi-mode message scheduling units
After the production of the multi-mode transformation round unit of each pipeline stage, the development of the multi-mode $W_t$ units takes place. Specifically, for each pipeline stage of the architecture of Fig. 2(a), a $W_t$ unit exists to feed the required $W_t$ values to the multi-mode transformation round. Again, a procedure similar to the above is followed.
The corresponding units of the sole SHA-1, SHA-256, and SHA-512 designs are derived easily. Each of them consists of a shift register, a 2-to-1 multiplexer, and a logic unit that computes the $W_t$ values according to Eq. (8) (circled in Fig. 14). Concerning the operation of the SHA-512 $W_t$ unit, when a 1024-bit message block is inserted, the first 16 $W_t$ values are produced instantly by performing a simple split in the Block Split Unit (see Fig. 2) and fed into the shift register of the first $W_t$ unit. During the first 16 iterations of the SHA-512 transformation round, the values $W_{16}$ to $W_{31}$ are computed (one per clock cycle) and stored in the shift register via the serial input. Since each pipeline stage executes 20 iterations of the transformation round, the values $W_{16}$ to $W_{19}$ are consumed in the next 4 iterations, while during these iterations the values $W_{32}$ to $W_{35}$ are computed and inserted into the shift register. Thus, when the first pipeline stage finishes its computations, the shift register contains the values $W_{20}$ to $W_{35}$, which are transferred (through a parallel load) to the shift register of the second $W_t$ unit. The same procedure takes place at the second and third pipeline stages. The load/shift control signal is the tcwicnt_i signal produced by the counter unit (see Fig. 2).
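The shift-register behavior of the SHA-512 $W_t$ unit can be reproduced with a short Python model. The recurrence $W_t = \sigma_1(W_{t-2}) + W_{t-7} + \sigma_0(W_{t-15}) + W_{t-16} \bmod 2^{64}$ and the $\sigma$ rotation amounts are taken from FIPS 180-4; the 16-entry `deque` standing in for the hardware shift register and the helper names are assumptions of this sketch. The assertions confirm that after one 20-iteration stage the register holds $W_{20}$ to $W_{35}$, ready for the parallel load.

```python
from collections import deque

MASK64 = (1 << 64) - 1

def rotr(x, n):
    return ((x >> n) | (x << (64 - n))) & MASK64

def sigma0(x):  # SHA-512 message-schedule sigma_0
    return rotr(x, 1) ^ rotr(x, 8) ^ (x >> 7)

def sigma1(x):  # SHA-512 message-schedule sigma_1
    return rotr(x, 19) ^ rotr(x, 61) ^ (x >> 6)

def w_next(win):
    """win[k] holds W_{t-16+k} for k = 0..15; returns W_t."""
    return (sigma1(win[14]) + win[9] + sigma0(win[1]) + win[0]) & MASK64

def reference_schedule(block, n=80):
    """Direct evaluation of the recurrence, used only to check the model."""
    w = list(block)
    while len(w) < n:
        w.append(w_next(w[-16:]))
    return w

def run_stage(reg, iterations=20):
    """One pipeline stage: each cycle consume the head value and shift in a new W."""
    consumed = []
    for _ in range(iterations):
        consumed.append(reg[0])      # W_t fed to the transformation round
        new = w_next(list(reg))      # W_{t+16} from the 16-entry window
        reg.popleft()
        reg.append(new)              # serial input of the shift register
    return consumed

block = [(0x0123456789ABCDEF * (i + 1)) & MASK64 for i in range(16)]  # arbitrary words
ref = reference_schedule(block)
reg1 = deque(block)
assert run_stage(reg1) == ref[0:20]
assert list(reg1) == ref[20:36]      # W_20 ... W_35 remain for the next stage
reg2 = deque(reg1)                   # parallel load into the second W_t unit
assert run_stage(reg2) == ref[20:40]
```

In the fourth stage the model keeps computing values beyond $W_{79}$; they are simply never consumed.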
The development of the multi-mode scheduling unit for the SHA-256/512 architecture is not complex. The computations of the $W_t$ values ($W_{next}$ logic) differ slightly (regarding the non-linear functions), whereas the shift register and the multiplexer exist in both designs, differing only in word length. Thus, the shift register and the multiplexer can be reused in the multi-mode design, whereas the $W_{next}$ logic module of the multi-mode implementation includes the separate circuits of the two $W_{next}$ logic modules. The same holds for the resulting SHA-256/512 unit compared to the corresponding one of the SHA-1 architecture. Hence, since the SHA-256/512 multi-mode $W_t$ unit is the more complex one, it is used as the base (initial) design in the flow, and we merge it with the design of SHA-1. The resulting $W_t$ unit is shown in Fig. 14.
As shown in this figure, to handle the different word lengths of the SHA-1, SHA-256, and SHA-512 algorithms, an approach similar to that of the multi-mode transformation round is followed. Specifically, the units of the function with the smallest word length are duplicated r times (r = 2) in the final design. To support the multi-mode operation, steering logic is also included in the final design. The 1024-bit input is treated as one 1024-bit message block when the SHA-512 algorithm is executed, or as two 512-bit message blocks $b_1$ and $b_2$ when the SHA-1 or SHA-256 function is executed. Thus, two message blocks can be hashed simultaneously with the SHA-1 or SHA-256 function.
The role of the split signal is fully shown in Fig. 14. When SHA-256 is selected, it splits the 64-bit adders in half so as to perform two 32-bit additions. This is accomplished by gating the carry-out of the 32nd bit with an AND gate. The same idea was followed earlier for the addition stages of the transformation rounds.
Fig. 13. Multi-mode transformation rounds: (a) second and fourth pipeline stages, and (b) third pipeline stage.
To perform two hash computations on two different message blocks at the same time, the two 512-bit blocks are not inserted sequentially (i.e., first the block of the first message and then the block of the second one), since this would make concurrent hash computation impossible. Instead, the 16 message-dependent W values of the two blocks are interleaved. Specifically, each of the 16 64-bit positions of the shift register stores two 32-bit W values: the W value of the first message block in the 32 most significant bits of the position and the corresponding W value of the second message block in the remaining 32 (least significant) bits. Hence, at each clock cycle, two W values are fed to the transformation round, one from each message block. Consequently, as can be observed in Fig. 14, the remaining W values are calculated by the $W_{next}$ Logic block in pairs (one for the first hash computation and one for the second), so that they are stored in the same manner as described above.
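The interleaved storage format can be sketched in Python as follows, assuming big-endian 32-bit words as in the SHA standards; the function names are hypothetical. Each 64-bit shift-register slot carries the W value of block 1 in its upper half and that of block 2 in its lower half.

```python
MASK32 = (1 << 32) - 1

def block_to_words(block: bytes):
    """Split a 64-byte (512-bit) block into sixteen big-endian 32-bit words."""
    return [int.from_bytes(block[i:i + 4], "big") for i in range(0, 64, 4)]

def interleave(block1_words, block2_words):
    """Pack the 16 W values of two blocks into 16 64-bit slots:
    block 1 in the 32 MSBs, block 2 in the 32 LSBs of each slot."""
    return [((w1 & MASK32) << 32) | (w2 & MASK32)
            for w1, w2 in zip(block1_words, block2_words)]

def deinterleave(slot):
    """Split one 64-bit slot back into (W of block 1, W of block 2)."""
    return slot >> 32, slot & MASK32

b1 = bytes(range(64))          # arbitrary first message block
b2 = bytes(range(64, 128))     # arbitrary second message block
slots = interleave(block_to_words(b1), block_to_words(b2))
print([hex(w) for w in deinterleave(slots[0])])  # ['0x10203', '0x40414243']
```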
4.6. Stage 4: multi-mode control unit
As reported above, the controller of each initial hash design is a simple Finite State Machine (one FSM per design) implemented by counters. Specifically, the control logic consists of four counters, count_i (i = 1, 2, 3, 4), one for each pipeline stage. The output of each counter is used for addressing the local memory of each stage, whereas the next pipeline stage is activated via the signals tround_i and tcwicnt_i, which are generated when the previous pipeline stage finishes its computations.
The control logic for the final multi-mode hash design is constructed by considering the control unit with the greatest number of states and modifying it so that it can control all the incorporated hash function algorithms. Beyond that, the resulting FSM, along with the four counters, is designed to provide on-the-fly functionality. Specifically, each of the four pipeline stages (including its transformation round, its W computation unit, and the corresponding counter module) can realize a different hash computation, and all stages operate concurrently. For example, the first stage can perform a SHA-512 computation while, at the same time, the second one performs two concurrent SHA-1 or SHA-256 computations. In this way, the pipeline structure, together with the multi-mode principle, is fully exploited. This is achieved by specific flag bits that indicate the current, previous, and forthcoming hash functions. By exploiting this information in the resulting FSM, the problem of stalling the computation of one stage so that the next stage can finalize its computation (when the latter needs more iterations) is overcome.
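The text does not detail the FSM, so the following Python model captures only one aspect: each stage counter terminates after a number of iterations selected by its own mode flag, which lets stages run different functions at the same time. The iteration counts are the base-architecture values (20 for SHA-1 and SHA-512, 16 for SHA-256) used in Section 6; the class `StageCounter` and its methods are hypothetical and do not model the flag-based stall handling.

```python
# Iterations per pipeline stage in the base architecture (values used in Section 6).
ITERATIONS_PER_STAGE = {"SHA-1": 20, "SHA-256": 16, "SHA-512": 20}

class StageCounter:
    """Hypothetical behavioral model of one per-stage counter count_i with a mode flag."""

    def __init__(self):
        self.mode = None
        self.count = 0

    def load(self, mode):
        """Start a new computation; 'mode' plays the role of the current-function flag."""
        self.mode = mode
        self.count = 0

    def tick(self):
        """Advance one clock cycle; return True on the stage's last iteration
        (the point where a signal such as tround_i would be raised)."""
        self.count += 1
        return self.count == ITERATIONS_PER_STAGE[self.mode]

def cycles_until_done(counter):
    """Count clock cycles until the counter reports its last iteration."""
    cycles = 1
    while not counter.tick():
        cycles += 1
    return cycles

# Two stages running different functions at the same time (on-the-fly operation).
s1, s2 = StageCounter(), StageCounter()
s1.load("SHA-512")
s2.load("SHA-256")
print(cycles_until_done(s1), cycles_until_done(s2))  # 20 16
```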
5. Proposed optimized multi-mode designs
Following the above design flow, and having as inputs the optimized designs of the hash functions, we proceeded with the construction of the optimized multi-mode architectures. We constructed two optimized multi-mode architectures, namely the SHA-256/512 and the SHA-1/256/512 ones. However, to avoid an overly long text, in this section we present only the more complex SHA-1/256/512 architecture.
Fig. 14. Multi-mode $W_t$ computation unit of the SHA-1/256/512 architecture.
5.1. Multi-mode transformation rounds
As the design flow imposes, we first constructed the multi-mode transformation rounds. The input designs were four-stage pipelined; hence, the flow was repeated four times, once for each pipeline stage. The resulting multi-mode rounds are depicted in Figs. 15–17. Similarly to the sole optimized architectures, each of the multi-mode rounds includes a Pre- and a Post-Computation stage, separated by a register. It must be stressed that in all the above figures the split signal (responsible for choosing between 2 × 32-bit and 64-bit operations) is omitted from the detailed presentation of the rounds for clarity. However, it exists, and its role is the same as described in Section 3.
A key point of the resulting optimized multi-mode transformation rounds is the splitting of the Carry-Save Adders (CSAs) of the second and fourth rounds (Fig. 17). This happens during the resource-sharing sub-stage of the flow and is accomplished by splitting the carry-save tree from the final addition of the CSA, so that the former can be used as the Par non-linear function when the SHA-1 function is selected. This choice slightly affects the overall round's delay. However, it increases the area gains because (a) no extra logic is added for the non-linear function, and (b) otherwise, during the I/O assignment sub-stage of the flow, the output of this path would have to be assigned to a different output register depending on the selected function, leading to more steering logic, which increases not only the area but the delay as well. For the same reasons, an extra Maj box is included in the third multi-mode round (Fig. 16).
Fig. 15. First multi-mode round of the optimized SHA-1/256/512 architecture.
Fig. 16. Third multi-mode round of the optimized SHA-1/256/512 architecture.
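Why a carry-save tree can double as the Par function follows from the structure of a 3:2 compressor: its per-bit sum output is exactly the XOR of the three operands, and its per-bit carry output is their majority. A minimal Python check of these identities, assuming a single 3:2 stage on 32-bit words (the tree in Fig. 17 may contain more levels; `csa` and `final_add` are illustrative names):

```python
import random

MASK32 = (1 << 32) - 1

def csa(x, y, z):
    """3:2 carry-save adder on 32-bit words: returns (sum vector, carry vector)."""
    s = x ^ y ^ z                        # per-bit sum   = Parity(x, y, z)
    c = (x & y) | (x & z) | (y & z)      # per-bit carry = Maj(x, y, z)
    return s, c

def final_add(s, c):
    """Carry-propagate stage that completes the CSA addition modulo 2^32."""
    return (s + (c << 1)) & MASK32

rng = random.Random(1)
for _ in range(1000):
    x, y, z = (rng.getrandbits(32) for _ in range(3))
    s, c = csa(x, y, z)
    assert final_add(s, c) == (x + y + z) & MASK32   # complete CSA = 3-operand addition
    assert s == x ^ y ^ z                             # tapped before the final adder: Parity
    assert c == (x & y) ^ (x & z) ^ (y & z)           # carry vector equals Maj
print("ok")
```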
5.2. Multi-mode message scheduling units
Similarly to the procedure followed in the construction of the multi-mode base designs, the multi-mode $W_t$ units are constructed after the multi-mode transformation rounds. Due to the parametric loop unrolling exploited in the initial optimized designs, two W values are fed simultaneously per clock cycle into a transformation round. Thus, in the initial optimized $W_t$ units, two W values are produced at the same time, and the corresponding $W_{next}$ logic is therefore doubled.
The above fact does not affect the application of the introduced design flow. Following the stages reported in Section 4, we constructed a $W_t$ unit able to support the processing of either two independent 512-bit message blocks (when SHA-1 or SHA-256 is selected) or one 1024-bit message block (when SHA-512 is selected). In any case, the unit (Fig. 18) produces two 64-bit $W_t$ words per clock cycle. If the SHA-512 function is selected, each 64-bit word represents one 64-bit $W_t$ value. If the SHA-1 or the SHA-256 function is selected, each 64-bit word contains two 32-bit $W_t$ values, one for each of the two independent 512-bit message blocks.
5.3. Multi-mode control units
Finally, the control logic for the final multi-mode hash design is constructed. Similarly to the base designs, the control unit with the greatest number of states is considered and modified so that it can control all the incorporated hash function algorithms.
6. Experimental results
The introduced SHA-256/512 and SHA-1/256/512 multi-mode architectures were captured in VHDL and implemented in a wide range of Xilinx FPGAs. Specifically, six Xilinx families are considered: three older ones, Virtex (xcv1000-6FG680), Virtex-II (xc2v6000-6FF1517), and Virtex-4 (xc4vlx100-12FF1148), and three modern ones, Virtex-5 (xc5vlx155t-3FF1136), Virtex-6 (xc6vlx240t-3FF784), and Virtex-7 (xc7v855t-3FFG1157). It should be mentioned that the Virtex and Virtex-II technologies were selected only for comparison with existing similar works.
The XST synthesis tool of the Xilinx ISE Design Suite (version 13.1) was used for mapping the designs to the FPGA devices. The functionality of the implementations was initially verified via Post-Place-and-Route simulations using Mentor Graphics' ModelSim simulator. A large set of test vectors, in addition to those provided by the standard, was used. Additional functional and timing verification was performed by downloading the designs to development boards.
The quality of the introduced architectures was measured in terms of frequency, throughput, area, and the throughput/area cost factor. The throughput of each hash function is calculated as

$$\text{Throughput} = \frac{\#\text{bits} \cdot f}{\#\text{cycles}} \qquad (16)$$

where #bits is the number of bits of the incoming message block (512 bits for SHA-1 and SHA-256, and 1024 bits for SHA-512), f is the operating frequency, and #cycles is the number of clock cycles spent in each case (20/10 for SHA-1 and SHA-512, and 16/8 for SHA-256, for the base/optimized architectures, respectively).
As mentioned in Section 4, there is a word-length mismatch between the SHA-1, SHA-256, and SHA-512 functions. Specifically, the SHA-512 function operates on 64-bit data, while the SHA-1 and SHA-256 ones operate on 32 bits. This is turned to the advantage of the proposed multi-mode architectures by merging two SHA-1 (or SHA-256, depending on the selection) operations into the data buses of SHA-512. This is possible because the majority of the operations included in the hash functions (Section 4.1) are bitwise. Where this does not hold, appropriate modifications are made (see Sub-sections 4.4 and 4.5). As a result, two independent message blocks can be processed when either the SHA-1 or the SHA-256 function is selected. Hence, the throughput of the SHA-1 and SHA-256 functions in the proposed multi-mode architectures is calculated by doubling the value given by Eq. (16).
Fig. 17. Second and fourth multi-mode rounds of the optimized SHA-1/256/512 architecture.
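Eq. (16), including the doubling for two concurrent 32-bit message blocks, can be applied directly to the Virtex-6 base frequencies of Table 2. The short Python sketch below (the helper name and its `blocks_in_parallel` parameter are illustrative) reproduces the 4.6, 7.6, 6.3, and 6.1 Gbps values and the 65.3% and 3.5% differences discussed in Section 6.1.

```python
def throughput_mbps(bits, f_mhz, cycles, blocks_in_parallel=1):
    """Eq. (16): #bits * f / #cycles, multiplied by the number of message
    blocks processed concurrently on the shared datapath."""
    return blocks_in_parallel * bits * f_mhz / cycles

# Virtex-6, base architectures (frequencies in MHz from Table 2)
sha256_single = throughput_mbps(512, 143.6, 16)                        # separate SHA-256
sha256_multi = throughput_mbps(512, 118.7, 16, blocks_in_parallel=2)   # SHA-256/512, SHA-256 mode
sha512_single = throughput_mbps(1024, 123.0, 20)                       # separate SHA-512
sha512_multi = throughput_mbps(1024, 118.7, 20)                        # SHA-256/512, SHA-512 mode

print(round(sha256_single), round(sha256_multi))   # 4595 7597
print(round(sha512_single), round(sha512_multi))   # 6298 6077
print(round(100 * (sha256_multi / sha256_single - 1), 1))   # 65.3
print(round(100 * (1 - sha512_multi / sha512_single), 1))   # 3.5
```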
Below, the experimental results are presented in detail. First, we present and discuss the results for the proposed architectures in terms of throughput and area; then, the designs produced by the proposed flow are compared with those produced by the ISE design suite; and finally, comparisons with existing similar works are offered.
6.1. Evaluation of the proposed architectures
Table 2 presents the frequency of the individual and multi-mode designs in representative Xilinx technologies. The second, third, and fourth columns correspond to the separate designs of the SHA-1, SHA-256, and SHA-512 functions, respectively, whereas the last two columns correspond to the multi-mode SHA-256/512 and SHA-1/256/512 architectures. As expected, the following can be put forth:
• The frequency of the multi-mode architectures is lower than that of the separate designs, due to the additional logic inserted in the multi-mode architectures.
• The optimized multi-mode designs achieve higher frequencies than the corresponding base ones. This means that although the separate optimized designs used as input to the proposed flow are more complex than the base ones, the flow works efficiently, without inserting large logic that could make the optimized multi-mode designs worse than the base ones.
• The frequency improves substantially when modern technologies are used.
Fig. 18. Multi-mode $W_t$ computation unit of the SHA-1/256/512 optimized architecture.
Table 2
Frequency of the base and optimized architectures.
Tech.    | SHA-1 | SHA-256 | SHA-512 | Proposed SHA-256/512 | Proposed SHA-1/256/512
Base architectures frequency (MHz)
Virtex-4 | 120.7 | 108.7 | 93.2  | 90.1  | 77.9
Virtex-5 | 154.6 | 138.7 | 118.9 | 115.8 | 98.3
Virtex-6 | 161.6 | 143.6 | 123   | 118.7 | 99.8
Virtex-7 | 194   | 169   | 144.9 | 139.3 | 115.6
Optimized architectures frequency (MHz)
Virtex-4 | 157.2 | 130.1 | 119.6 | 103.1 | 90
Virtex-5 | 207.1 | 169.1 | 151.9 | 132.2 | 116.3
Virtex-6 | 217.4 | 172   | 165.3 | 141.6 | 124.7
Virtex-7 | 259.3 | 204   | 189.7 | 159   | 139.6
However, to evaluate the quality of the produced multi-mode architectures, the area and throughput factors must be studied. Fig. 19 depicts the area of the base and optimized designs. For each FPGA family, the first three bars correspond to the area of the individual (single-mode) designs, the fourth and fifth bars correspond to the sum of the areas of the considered sole designs, whereas the last two bars correspond to the area of the proposed SHA-256/512 and SHA-1/256/512 multi-mode architectures.
Studying the Virtex-6 implementations (Fig. 19), it is derived that the area of the base SHA-1/256/512 architecture (3787 slices) is 28.6% lower than the total area of the three separate designs (5302 slices). Concerning the optimized designs, the area of the SHA-1/256/512 multi-mode architecture is 4129 slices, whereas the total area of the separate designs is 6177 slices, which corresponds to a 33.2% area reduction. Also, the total area of the SHA-256 and SHA-512 designs (4290 slices) is 34.9% larger than that of the base SHA-256/512 multi-mode architecture (3179 slices), while for the optimized SHA-256/512 design the corresponding figure is 36.9%. Similar conclusions are derived for the implementations in the other families.
Thus, the first outcome is that remarkable resource sharing is achieved, and the area of the produced multi-mode architectures is about 30% lower than the total area of the individual designs.
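Area differences can be quoted relative to either design, and the two conventions give different percentages for the same pair of slice counts. The Python sketch below (helper names are illustrative) evaluates the Virtex-6 numbers quoted in this section: the 28.6% and 33.2% figures are reductions relative to the sum of the separate designs, whereas the 34.9% figure for the SHA-256/512 pair and the 47.2% figure of Section 6.2 correspond to how much larger the reference design is than the proposed one.

```python
def pct_smaller(proposed, reference):
    """Area reduction of 'proposed' relative to 'reference', in percent."""
    return 100 * (1 - proposed / reference)

def pct_larger(reference, proposed):
    """How much larger 'reference' is than 'proposed', in percent."""
    return 100 * (reference / proposed - 1)

# Virtex-6 slice counts quoted in the text
print(round(pct_smaller(3787, 5302), 1))  # 28.6  base SHA-1/256/512 vs sum of three designs
print(round(pct_smaller(4129, 6177), 1))  # 33.2  optimized SHA-1/256/512 vs sum
print(round(pct_larger(4290, 3179), 1))   # 34.9  sum of SHA-256 + SHA-512 vs base SHA-256/512
print(round(pct_smaller(3179, 4290), 1))  # 25.9  same pair expressed as a reduction
print(round(pct_larger(5576, 3787), 1))   # 47.2  ISE-derived module vs proposed (Section 6.2)
```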
However, as shown in Table 2, the frequency of the multi-mode architectures is lower than that of the individual designs; thus, a comparison in terms of throughput is required. Figs. 20 and 21 illustrate this comparison for the proposed SHA-256/512 and SHA-1/256/512 multi-mode architectures.
Concerning the base SHA-256/512 multi-mode design (Fig. 20(a)) in Virtex-6 technology, the throughput of the SHA-256 function (7.6 Gbps) is improved by 65.3% over the individual implementation of this hash function (4.6 Gbps). This happens because the multi-mode architecture processes two SHA-256 message blocks concurrently. The throughput of the SHA-512 function in the base SHA-256/512 multi-mode design equals 6.1 Gbps, while the individual SHA-512 design achieves 6.3 Gbps, which corresponds to a 3.5% reduction. Consequently, the proposed multi-mode SHA-256/512 architecture improves the area factor by 34.9% (as mentioned above), while the throughput of the SHA-256 function is improved by 65.3% and that of the SHA-512 function is only slightly reduced, by 3.5%.
For the optimized SHA-256/512 multi-mode design in Virtex-6 technology, the area is improved by 36.9%, as mentioned above. Also, the throughput of the SHA-256 function is improved by 64.7%, whereas the throughput of the SHA-512 function is reduced by 14.3%. The SHA-512 throughput reduction is larger than in the base designs because the optimized designs are more complex than the base ones. Thus, more extra logic is required to develop the multi-mode architecture, which results in greater frequency degradation. Note that the SHA-512 design was used as the initial design by the proposed flow. Similar outcomes are derived from the implementation results of the SHA-256/512 multi-mode architecture in the other FPGA families.
In conclusion, the major outcome is that in most cases the proposed flow and multi-mode implementations achieve an area reduction of more than 30% compared to the total area of the individual designs. Also, the throughput of the SHA-1 function is improved by up to 29% (base SHA-1/256/512 in Virtex-4), while for the SHA-256 function the throughput improvement reaches 67% (base SHA-256/512 in Virtex-5) and 45% (optimized SHA-1/256/512 in Virtex-6). Finally, the throughput of the SHA-512 function is reduced by between 3% (base SHA-256/512 in Virtex-4) and 27% (optimized SHA-1/256/512 in Virtex-7). Although there is a throughput reduction for the SHA-512 function, the final SHA-1/256/512 architecture supports all three hash functions with a single low-area design.
Fig. 19. Area comparisons: (a) base designs and (b) optimized designs.
6.2. Comparison with commercial synthesis tool (Xilinx ISE)
A major question that arises regarding the efficiency of the proposed flow and the produced multi-mode designs is whether better multi-mode designs could be produced by feeding the individual designs to a commercial synthesis tool and allowing it to perform resource sharing and delay optimization to develop a multi-mode architecture.
Fig. 20. Multi-mode SHA-256/512 architecture throughput comparisons: (a) base designs and (b) optimized designs.
Fig. 21. Multi-mode SHA-1/256/512 architecture throughput comparisons: (a) base designs and (b) optimized designs.
For that reason, a top-level design that includes the hardware modules of the SHA-1, SHA-256, and SHA-512 functions as separate modules was developed in VHDL, and the proper output is selected via a multiplexer. This top-level design was fed to the ISE synthesis tool with the area and delay optimization efforts set to high and normal, respectively. Tables 3–6 present the corresponding results and comparisons in terms of frequency and throughput for the base and optimized multi-mode designs; the frequency and throughput values of the proposed architectures are also provided in these tables.
As shown, in most cases the frequency values of the designs produced by ISE are higher than those of the proposed architectures (the exceptions are the optimized SHA-256/512 designs in Virtex-5, Virtex-6, and Virtex-7; see Table 5). Where this holds, the ISE designs achieve better throughput values for the SHA-512 function, in both the SHA-256/512 and SHA-1/256/512 multi-mode architectures. However, the proposed architectures outperform the corresponding ones produced by ISE for the SHA-1 and SHA-256 functions. This happens because the proposed designs are capable of processing two input messages concurrently, while the ISE architecture operates on one message at a time.
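The effect of concurrent message processing can be illustrated with a small throughput model. This is a minimal sketch, assuming Eq. (16) has the common form T = N_b · f / N_c (N_b: block size in bits, f: frequency, N_c: clock cycles per block). Block sizes and iteration counts are those of the SHA standard; the 4-stage pipeline (N_c = iterations / 4) and the factor of two concurrent messages for SHA-1 and SHA-256 are assumptions, chosen because they reproduce the Virtex-4 row of Table 4 to within about 0.1 Gbps. The name `throughput_gbps` is hypothetical.

```python
# Minimal throughput model, assuming Eq. (16) is T = N_b * f / N_c.
# N_c is taken as iterations / pipeline stages (4 stages assumed for the base
# designs); `concurrent_msgs` models processing several messages at once.

BLOCK_BITS = {"SHA-1": 512, "SHA-256": 512, "SHA-512": 1024}
ROUNDS = {"SHA-1": 80, "SHA-256": 64, "SHA-512": 80}


def throughput_gbps(func: str, f_mhz: float, stages: int = 4,
                    concurrent_msgs: int = 1) -> float:
    """Throughput in Gbps of one hash function of a pipelined core."""
    cycles_per_block = ROUNDS[func] / stages  # effective cycles per output block
    mbps = BLOCK_BITS[func] * f_mhz / cycles_per_block * concurrent_msgs
    return mbps / 1000.0


# Virtex-4 base SHA-1/256/512 (Table 4): proposed 77.9 MHz, ISE 92.4 MHz.
for func in ("SHA-1", "SHA-256", "SHA-512"):
    # Assumed: the proposed core processes two SHA-1/SHA-256 messages concurrently.
    msgs = 1 if func == "SHA-512" else 2
    prop = throughput_gbps(func, 77.9, concurrent_msgs=msgs)
    ise = throughput_gbps(func, 92.4)
    print(f"{func}: proposed {prop:.1f} Gbps, ISE {ise:.1f} Gbps")
```

Despite its lower frequency, the modelled proposed core delivers about 5.0 Gbps for SHA-256 versus roughly 3 Gbps for the ISE design, while ISE keeps the lead for SHA-512 (about 4.7 versus 4.0 Gbps), in line with the discussion above.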
For a fairer comparison, the occupied area and the throughput/area factor must be studied. Fig. 22 depicts the area comparisons between the proposed multi-mode implementations and those produced by the ISE tool. As illustrated, the area of the proposed implementations is lower than that of the implementations produced by the ISE tool. Specifically, in Virtex-6 technology, the introduced base SHA-1/256/512 module requires 3787 slices, whereas the corresponding multi-mode module derived by ISE requires 5576 slices, achieving a 47.2% area reduction. For the corresponding optimized designs in the same technology, the area reduction achieved by the proposed architectures is 45%. On average, compared to the base SHA-256/512 and SHA-1/256/512 implementations produced by the ISE tool, the area of the corresponding proposed architectures is lower by 40% and 46%, respectively. For the optimized SHA-256/512 and SHA-1/256/512 architectures, the area savings are 40% and 47% on average, respectively.
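The area percentages can be checked directly from the slice counts. In this minimal sketch (the helper names are hypothetical), the quoted 47.2% matches the extra area of the ISE module measured relative to the proposed module, (5576 - 3787)/3787; measured relative to the ISE module, the same slice counts correspond to a saving of about 32.1%.

```python
def ise_excess_pct(ise_slices: int, prop_slices: int) -> float:
    """Extra slices of the ISE design, as a percentage of the proposed design."""
    return 100.0 * (ise_slices - prop_slices) / prop_slices


def prop_saving_pct(ise_slices: int, prop_slices: int) -> float:
    """Slices saved by the proposed design, as a percentage of the ISE design."""
    return 100.0 * (ise_slices - prop_slices) / ise_slices


# Virtex-6, base SHA-1/256/512: 3787 slices (proposed) vs 5576 slices (ISE).
print(f"{ise_excess_pct(5576, 3787):.1f}")   # 47.2
print(f"{prop_saving_pct(5576, 3787):.1f}")  # 32.1
```

Presumably the other percentages in this paragraph follow the same convention as the 47.2% figure, but that cannot be verified from the slice counts given here.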
Thus, the designs produced by the ISE tool outperform the proposed ones in terms of frequency in most cases, but they demand a larger area. Therefore, for a more in-depth and fair comparison, the throughput/area factor must be studied.
Figs. 23 and 24 depict the throughput/area values of the proposed multi-mode designs and of those produced by the ISE tool. In these figures, the throughput/area values are presented per FPGA family for each function incorporated in the multi-mode architecture, in groups of two (ISE, proposed).
Concerning the base SHA-1/256/512 multi-mode architectures (Fig. 23(b)), the throughput/area value of the SHA-1 function in the proposed design (1.35 Mbps/slice) is 141.5% higher than that of the ISE design (0.56 Mbps/slice). The same improvement is also achieved for the SHA-256 function in this case. Also, the throughput/area value of the SHA-512 function in the proposed architecture equals 1.35 Mbps/slice, while the corresponding value for the ISE design is 1.12 Mbps/slice, a 20.7% improvement. On average over all the considered FPGA technologies, the throughput/area value of the SHA-1 and SHA-256 functions is improved by 140%, while for the SHA-512 function the improvement is 20%. Similar results also hold for the base SHA-256/512 multi-mode architecture, where the throughput/area improvements for the SHA-256 and SHA-512 functions are 175% and 37% on average, respectively.
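The quoted throughput/area values can be reproduced with a short calculation. This sketch assumes Eq. (16) has the form T = N_b · f / N_c with N_c = iterations / 4 (4-stage pipeline), and that the proposed core processes two SHA-1 or SHA-256 messages concurrently. It uses the Virtex-6 frequencies of Table 4 and the Virtex-6 slice counts quoted above. That these are the values plotted in Fig. 23(b) is an inference: they reproduce the 1.35, 0.56, and 1.12 Mbps/slice values and the 141.5% and 20.7% gains exactly. All names are illustrative.

```python
BLOCK_BITS = {"SHA-1": 512, "SHA-256": 512, "SHA-512": 1024}
ROUNDS = {"SHA-1": 80, "SHA-256": 64, "SHA-512": 80}
STAGES = 4  # assumed pipeline depth of the base designs


def tput_mbps(func: str, f_mhz: float, msgs: int = 1) -> float:
    """Assumed Eq. (16): block bits * f / (iterations / pipeline stages)."""
    return BLOCK_BITS[func] * f_mhz * STAGES * msgs / ROUNDS[func]


def tpa(func: str, f_mhz: float, slices: int, msgs: int = 1) -> float:
    """Throughput/area in Mbps/slice."""
    return tput_mbps(func, f_mhz, msgs) / slices


# Virtex-6 base SHA-1/256/512: frequencies from Table 4, slices from the text.
PROP_F, PROP_SLICES = 99.8, 3787
ISE_F, ISE_SLICES = 121.7, 5576

for func in ("SHA-1", "SHA-256", "SHA-512"):
    msgs = 1 if func == "SHA-512" else 2  # proposed core: two concurrent messages
    p = tpa(func, PROP_F, PROP_SLICES, msgs)
    i = tpa(func, ISE_F, ISE_SLICES)
    print(f"{func}: proposed {p:.2f}, ISE {i:.2f}, gain {100 * (p / i - 1):.1f}%")
```

This prints 1.35 vs 0.56 (gain 141.5%) for SHA-1, 1.69 vs 0.70 (gain 141.5%) for SHA-256, and 1.35 vs 1.12 (gain 20.7%) for SHA-512. SHA-1 and SHA-256 have identical gains because, within each design, they share the same frequency and area, so their throughput ratio (80/64) cancels out of the comparison.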
The SHA-1 and SHA-512 functions exhibit the same throughput/area values in the proposed architectures because the operating frequency equals the frequency of the whole multi-mode architecture for both functions and, since two SHA-1 messages are processed concurrently, both functions process 1024 bits in 80 iterations. Thus, according to Eq. (16), the throughput is the same. Moreover, for both functions the area equals the area of the whole multi-mode architecture.
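The equality can be checked with exact arithmetic. Assume Eq. (16) is T = N_b · f / N_c and that, as stated earlier, two SHA-1 or SHA-256 messages are processed concurrently. Frequency, cycle-reduction factor, and area are then shared by all functions of one architecture, so throughput/area is proportional to the bits processed per iteration. The helper name below is hypothetical.

```python
from fractions import Fraction


def bits_per_iteration(block_bits: int, iterations: int, concurrent_msgs: int) -> Fraction:
    """Bits processed per iteration; proportional to throughput/area within one design."""
    return Fraction(block_bits * concurrent_msgs, iterations)


sha1 = bits_per_iteration(512, 80, 2)     # two SHA-1 messages processed concurrently
sha256 = bits_per_iteration(512, 64, 2)   # two SHA-256 messages processed concurrently
sha512 = bits_per_iteration(1024, 80, 1)  # one SHA-512 message

print(sha1, sha256, sha512)  # 64/5 16 64/5
print(sha256 / sha1)         # 5/4
```

The 5/4 ratio matches the SHA-256 and SHA-1 entries of the proposed designs in Table 6 (e.g. 11.5 versus 9.2 Gbps in Virtex-4).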
6.3. Comparison with existing multi-mode architectures
In this sub-section, the proposed multi-mode designs are compared with similar existing ones. To the best of our knowledge, only a few similar FPGA designs exist. These designs concern SHA-1/256, SHA-1/256/512, SHA-256/512, and SHA-384/512 multi-mode architectures implemented in FPGA technology. However, these designs belong to the category of optimized designs, since techniques such as pipelining and loop unrolling have been applied to improve performance. Thus, the proposed optimized multi-mode architectures are used for the comparisons.
Table 3
Proposed and ISE base SHA-256/512 designs.
Techn.      Metric      Proposed SHA-256/512     ISE SHA-256/512
                        SHA-256    SHA-512       SHA-256    SHA-512
Virtex-4    F (MHz)     90.1                     92.8
            T (Gbps)    5.8        4.6           3.0        4.7
Virtex-5    F (MHz)     115.8                    118.1
            T (Gbps)    7.4        5.9           3.8        6.1
Virtex-6    F (MHz)     118.7                    122.2
            T (Gbps)    7.6        6.1           3.9        6.3
Virtex-7    F (MHz)     139.3                    144
            T (Gbps)    8.9        7.2           4.6        7.4
Table 4
Proposed and ISE base SHA-1/256/512 designs.
Techn.      Metric      Proposed SHA-1/256/512           ISE SHA-1/256/512
                        SHA-1    SHA-256   SHA-512       SHA-1    SHA-256   SHA-512
Virtex-4    F (MHz)     77.9                             92.4
            T (Gbps)    4.0      5.0       4.0           2.4      2.9       4.7
Virtex-5    F (MHz)     98.3                             117.6
            T (Gbps)    5.0      6.3       5.1           3.0      3.8       6.0
Virtex-6    F (MHz)     99.8                             121.7
            T (Gbps)    5.1      6.4       5.2           3.1      3.9       6.2
Virtex-7    F (MHz)     115.6                            143.6
            T (Gbps)    5.9      7.4       6.1           3.7      4.6       7.3
Table 5
Proposed and ISE optimized SHA-256/512 designs.
Techn.      Metric      Proposed SHA-256/512     ISE SHA-256/512
                        SHA-256    SHA-512       SHA-256    SHA-512
Virtex-4    F (MHz)     103.1                    118.7
            T (Gbps)    13.2       10.5          7.6        12.1
Virtex-5    F (MHz)     132.2                    120.4
            T (Gbps)    17.0       13.6          7.7        12.3
Virtex-6    F (MHz)     141.6                    129.3
            T (Gbps)    18.2       15.5          8.3        13.2
Virtex-7    F (MHz)     159                      148.5
            T (Gbps)    20.1       16.3          9.5        15.2
Table 6
Proposed and ISE optimized SHA-1/256/512 designs.
Techn.      Metric      Proposed SHA-1/256/512           ISE SHA-1/256/512
                        SHA-1    SHA-256   SHA-512       SHA-1    SHA-256   SHA-512
Virtex-4    F (MHz)     90                               117.9
            T (Gbps)    9.2      11.5      9.2           6.0      7.6       12.0
Virtex-5    F (MHz)     116.3                            119.8
            T (Gbps)    11.9     14.9      11.9          6.1      7.7       12.3
Virtex-6    F (MHz)     124.7                            128.1
            T (Gbps)    12.8     15.9      12.9          6.69     8.2       13.1
Virtex-7    F (MHz)     139.6                            147.6
            T (Gbps)    14.3     17.9      14.3          7.6      9.5       15.1
Tables 7–10 present the comparisons in terms of frequency, area, throughput, and throughput/area. As can be seen, the proposed architectures outperform all the existing ones in both throughput and throughput/area. Indicatively, for SHA-256/512, the throughput improvement factors range from 4.2 (SHA-512, Virtex-II) to 23.8 (SHA-256, Virtex-II), while the throughput/area improvement factors range from 1.2 (SHA-512, Virtex-II) to 5.5 (SHA-256, Virtex).
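The improvement factors quoted above follow from the Table 9 entries. This sketch computes them as ratios of the proposed value to the compared design's value. For [21], Table 9 lists a single throughput value (1296 Mbps), which the SHA-512 comparison is assumed to use. The variable names are illustrative.

```python
# Table 9 values: area in slices, throughput in Mbps.
PROP_VIRTEX = {"area": 6911, "sha256": 5158, "sha512": 4127}
REF20_VIRTEX = {"area": 2951, "sha256": 400, "sha512": 320}
PROP_VIRTEX2 = {"area": 6968, "sha256": 6925, "sha512": 5540}
REF19_VIRTEX2 = {"area": 2384, "sha256": 291, "sha512": 467}
REF21_VIRTEX2 = {"area": 1938, "mbps": 1296}  # single throughput value in Table 9


def tpa(area: int, mbps: float) -> float:
    """Throughput/area in Mbps/slice."""
    return mbps / area


# Throughput improvement factors (proposed / existing).
t256 = PROP_VIRTEX2["sha256"] / REF19_VIRTEX2["sha256"]  # SHA-256, Virtex-II, vs [19]
t512 = PROP_VIRTEX2["sha512"] / REF21_VIRTEX2["mbps"]    # SHA-512, Virtex-II, vs [21]

# Throughput/area improvement factors (proposed / existing).
a256 = tpa(PROP_VIRTEX["area"], PROP_VIRTEX["sha256"]) / tpa(REF20_VIRTEX["area"], REF20_VIRTEX["sha256"])
a512 = tpa(PROP_VIRTEX2["area"], PROP_VIRTEX2["sha512"]) / tpa(REF21_VIRTEX2["area"], REF21_VIRTEX2["mbps"])

print(f"{t256:.2f} {t512:.2f} {a256:.2f} {a512:.2f}")  # 23.80 4.27 5.51 1.19
```

The SHA-512 throughput factor evaluates to about 4.27, which the text reports as 4.2.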
The introduced architectures achieve significantly better throughput than the others for two reasons. Firstly, the proposed multi-mode architectures have their transformation round unrolled by 2 and, in addition, they are 4-stage pipelined. Hence, the denominator of Eq. (16) is divided by 8 (a factor of 2 from unrolling and a factor of 4 from pipelining) compared to the initial number of the function's iterations. Secondly, regarding SHA-256, as mentioned before, two independent input blocks can be processed concurrently, which doubles the throughput calculated by Eq. (16).
Fig. 22. Area comparisons: (a) base SHA-256/512, (b) base SHA-1/256/512, (c) optimized SHA-256/512, and (d) optimized SHA-1/256/512.
Fig. 23. Throughput/area comparisons, base designs: (a) SHA-256/512 and (b) SHA-1/256/512.
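Both factors can be combined in a small model of the optimized SHA-1/256/512 and SHA-256/512 architectures. This is a minimal sketch, assuming Eq. (16) is T = N_b · f / N_c with N_c = iterations / (2 × 4) for unrolling by 2 and 4 pipeline stages, and that two SHA-1 or SHA-256 messages are processed concurrently. Under these assumptions it reproduces the Virtex-4 row of Table 6. The function name is hypothetical.

```python
BLOCK_BITS = {"SHA-1": 512, "SHA-256": 512, "SHA-512": 1024}
ROUNDS = {"SHA-1": 80, "SHA-256": 64, "SHA-512": 80}


def throughput_gbps(func: str, f_mhz: float, unroll: int = 2, stages: int = 4) -> float:
    """Assumed Eq. (16): N_b * f / N_c with N_c = iterations / (unroll * stages)."""
    cycles_per_block = ROUNDS[func] / (unroll * stages)  # e.g. 80 / 8 = 10 cycles
    # Assumed: two SHA-1/SHA-256 messages are processed concurrently.
    msgs = 1 if func == "SHA-512" else 2
    return BLOCK_BITS[func] * f_mhz / cycles_per_block * msgs / 1000.0


# Optimized SHA-1/256/512 on Virtex-4 runs at 90 MHz (Table 6).
print([round(throughput_gbps(f, 90.0), 1) for f in ("SHA-1", "SHA-256", "SHA-512")])
```

The printed list is [9.2, 11.5, 9.2] Gbps, matching Table 6. Without the factor of 8, SHA-512 at 90 MHz would reach only 1024 × 90 / 80 ≈ 1.15 Gbps.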
Finally, Fig. 25 presents the throughput/area comparisons between the proposed optimized SHA-256/512 architecture and similar existing ones, illustrating that the proposed designs outperform the existing ones.
Fig. 24. Throughput/area comparisons, optimized designs: (a) SHA-256/512 and (b) SHA-1/256/512.
Table 7
Comparison of proposed optimized SHA-1/256 architecture to similar existing ones.
Techn.      Ref.     Freq. (MHz)   Area (slices)   Throughput (Gbps)
                                                   SHA-1    SHA-256
Virtex-5    [29]     227           371             1.4      1.8
            Prop.    159.4         2104            8.2      10.2
Virtex-6    [29]     266           369             1.7      2.1
            [30]     258           148             0.064    0.060
            Prop.    164.7         2087            8.4      10.5
Table 8
Comparison of proposed optimized SHA-1/256/512 architecture to similar existing ones.
Techn.      Ref.     Freq. (MHz)   Area (slices)   Throughput (Gbps)
                                                   SHA-1    SHA-256   SHA-512
Virtex-6    [30]     271           251             0.067    0.064     0.046
            Prop.    124.7         4129            12.8     15.9      12.9
Table 9
Comparison of proposed optimized SHA-256/512 architecture to similar existing ones.
Techn.      Ref.     Freq. (MHz)   Area (slices)   Throughput (Mbps)
                                                   SHA-256   SHA-512
Virtex      [20]     50            2951            400       320
            [23]     53            2530            848 (single value)
            Prop.    40.3          6911            5158      4127
Virtex-II   [19]     74            2384            291       467
            [21]     81            1938            1296 (single value)
            Prop.    54.1          6968            6925      5540
Table 10
Comparison of proposed optimized SHA-384/512 architecture to similar existing ones.
Techn.      Ref.     Freq. (MHz)   Area (slices)   Throughput (Gbps)
Virtex-4    [32]     182.3         1731            2.3
            Prop.    119.6         7224            12.2
[Bar chart: throughput/area (Mbps/slice) for SHA-256 and SHA-512; Virtex: [20], [23], Proposed; Virtex-II: [19], [21], Proposed.]
Fig. 25. Throughput/area comparisons between proposed optimized SHA-256/512
architecture and similar existing ones.
7. Conclusions
In this paper, area-efficient and high-throughput multi-mode architectures for the SHA-1 and SHA-2 families were proposed and implemented in several FPGA technologies. These architectures are able to realize more than one function, while their frequency and throughput degradation (caused by merging separate designs) is kept significantly low. Compared to the corresponding architectures produced by a commercial synthesis tool (Xilinx ISE), the proposed ones are significantly more area-efficient and, at the same time, significantly better in terms of throughput/area. Additionally, a systematic design flow for producing multi-mode architectures of the above two families was introduced. Finally, the proposed multi-mode architectures significantly outperform previously proposed ones in terms of throughput and throughput/area.
References
[1] NIST: FIPS 198, The Keyed-Hash Message Authentication Code (HMAC), Federal Information Processing Standard, NIST Publication, US Dept. of Commerce, 2002.
[2] SP 800-32, Introduction to public key technology and the federal PKI infrastructure, NIST, US Dept. of Commerce, 2001.
[3] L. Loeb, Secure Electronic Transactions: Introduction and Technical Reference, Artech House Publishers, 1998.
[4] NIST: FIPS 186-3, The Digital Signature Standard (DSS), Federal Information Processing Standard, NIST Publication, US Dept. of Commerce, 2009.
[5] S. Thomas, SSL & TLS Essentials: Securing the Web, John Wiley and Sons Publications, New York, USA, 2000.
[6] P. Loshin, IPv6: Theory, Protocol and Practice, Elsevier, San Francisco, USA, 2004.
[7] NIST: FIPS 197, Advanced Encryption Standard (AES), NIST Publication, US Dept. of Commerce, 2001.
[8] R. Rivest, RFC 1321: The MD5 message-digest algorithm, Publications of MIT Laboratory for Computer Science and RSA Data Security Inc. Available at: http://www.faqs.org/rfcs/rfc1321.html (accessed December 2012).
[9] NIST: FIPS 180-3, Secure Hash Standard (SHS), NIST Publication, US Dept. of Commerce, 2008.
[10] H. Dobbertin, The status of MD5 after a recent attack, RSA Labs' CryptoBytes, 1996.
[11] X. Wang, Y.L. Yin, H. Yu, Finding collisions in the full SHA-1, Springer Lect. Notes Comput. Sci. 3621 (2005) 17–36.
[12] NIST, Cryptographic hash algorithm competition SHA-3, NIST, 2012. http://csrc.nist.gov/groups/ST/hash/sha-3/index.html (accessed December 2012).
[13] B. Preneel, Cryptographic hash functions and the SHA-3 competition, talk at Asiacrypt 2010. Available at: https://www.cosic.esat.kuleuven.be/publications/talk-198.pdf (accessed May 2012).
[14] M. Ermer, Doubts over necessity of SHA-3 cryptography standard. http://www.h-online.com/security/news/item/Doubts-over-necessity-of-SHA-3-cryptography-standard-1498071.html (accessed December 2012).
[15] R. Chaves, G.K. Kuzmanov, L. Sousa, S. Vassiliadis, Cost-efficient SHA hardware accelerators, IEEE Trans. Very Large Scale Integr. Syst. 16 (8) (2008) 999–1008.
[16] M. Macchetti, L. Dadda, Quasi-pipelined hash circuits, in: Proceedings of the 17th Symposium on Computer Arithmetic, ARITH-17, 2005, pp. 222–229.
[17] H.E. Michail, A.P. Kakarountas, A.S. Milidonis, C.E. Goutis, A top-down design methodology for implementing ultra high-speed hashing cores, IEEE Trans. Dependable Secure Comput. 6 (4) (2009) 255–268.
[18] H.E. Michail, G.S. Athanasiou, V. Kelefouras, G. Theodoridis, C.E. Goutis, On the exploitation of a high-throughput SHA-256 FPGA design for HMAC, ACM Trans. Reconfigurable Technol. Syst. 5 (1) (2012) 2:1–2:28.
[19] N. Sklavos, O. Koufopavlou, Implementation of the SHA-2 hash family standard using FPGAs, J. Supercomput. 31 (2005) 227–248.
[20] R. Glabb, L. Imbert, G. Jullien, A. Tisserand, N. Veyrat-Charvillon, Multi-mode operator for SHA-2 hash functions, J. Syst. Archit. 53 (2–3) (2007) 127–138.
[21] M. Zeghid, B. Bouallegue, A. Baganne, M. Machhout, R. Tourki, A reconfigurable implementation of the new secure hash algorithm, in: Proceedings of the Second International Conference on Availability, Reliability and Security (ARES '07), IEEE Computer Society, Washington, DC, USA, 2007, pp. 281–285.
[22] S. Wanhong, G. Hongspeng, H. Huilei, Z. Dai, Design and optimized implementation of the SHA-2(256, 384, 512) hash algorithms, in: Proceedings of the 7th International Conference on ASIC, ASICON '07, 2007, pp. 858–861.
[23] M. Zeghid, B. Bouallegue, A. Baganne, M. Machhout, R. Tourki, Architectural design features of a programmable high throughput reconfigurable SHA-2 processor, J. Inf. Assur. Secur. 2 (2008) 147–158.
[24] T.S. Ganesh, M.T. Frederick, T.S.B. Sudarshan, A.K. Somani, Hashchip: a shared-resource multi-hash function processor architecture on FPGA, Integr., VLSI J. 40 (2007) 11–19.
[25] E. Khan, M.W. El-Kharashi, F. Gebali, M. Abd-El-Barr, Design and performance analysis of a unified, reconfigurable HMAC-hash unit, IEEE Trans. Circuits Syst. I 54 (12) (2007) 2683–2695.
[26] M.-Y. Wang, C.-P. Su, C.-T. Huang, C.-W. Wu, An HMAC processor with integrated SHA-1 and MD5 algorithms, in: Proceedings of Asia and South Pacific Design Automation Conference, ASP-DAC '04, 2004, pp. 456–458.
[27] R. Ramanarayanan, S. Mathew, F. Sheikh, S. Srinivasan, A. Agarwal, S. Hsu, H. Kaul, H. Anders, V.M.M. Erraguntla, R. Krishnamurthy, 18 Gbps, 50 mW reconfigurable multi-mode SHA hashing accelerator in 45 nm CMOS, in: Proceedings of the ESSCIRC, 2010, pp. 210–213.
[28] J. Docherty, A. Koelmans, A flexible hardware implementation of SHA-1 and SHA-2 hash functions, in: Proceedings of 2011 IEEE International Symposium on Circuits and Systems, 2011, pp. 1932–1935.
[29] Helion Tech., Fast hash core family for Xilinx FPGA. Data sheet available from: http://www.heliontech.com/hash.htm (accessed December 2013).
[30] Helion Tech., Tiny hash core family for Xilinx FPGA. Data sheet available from: http://www.heliontech.com/hash.htm (accessed December 2013).
[31] A.T. Hoang, K. Yamazaki, S. Oyanagi, Three-stage pipeline implementation for SHA2 using data forwarding, in: Proceedings of 2008 International Conference on Field Programmable Logic and Applications, FPL 2008, 8–10 September 2008, pp. 29–34.
[32] A.T. Hoang, K. Yamazaki, S. Oyanagi, Pipelining a multi-mode SHA-384/512 core with high area performance rate, IEICE Trans. Inf. Syst. E92.D (10) (2009) 2034–2042.
Harris E. Michail received his Dipl. Eng. and Ph.D. from the Department of Electrical & Computer Engineering, University of Patras, Greece, in 2009. From 2009 to 2011, he was an Adjunct Assistant Professor in the Computer Engineering and Informatics Department, University of Patras. In 2011, he joined the Electrical Engineering, Computer Engineering, and Informatics Department, Cyprus University of Technology. He has authored or co-authored more than 60 papers in international journals and conferences, and his work has more than 170 cross-references. His main research interests include Cryptography, Computer Security, and Embedded Systems.
George S. Athanasiou received his 5-year Dipl. Eng. in Electronic and Computer Engineering from the Technical University of Crete, Greece, in 2008, graduating with a grade of Excellent (GPA 8.51/10) with distinction. In May 2013 he received his Ph.D. from the Electrical and Computer Engineering Department, University of Patras, Greece. He then worked as a post-doctoral researcher in the same department until August 2013. In September 2013 he joined Antcor Advanced Network Technologies S.A., Athens, Greece. He has more than 25 publications in international journals and conferences. His research interests include Cryptography, VLSI Design, and Design for Testability.
George Theodoridis received his Dipl. Eng. in Electrical Engineering and Ph.D. in Electrical and Computer Engineering from the University of Patras, Greece, in 1994 and 2001, respectively. In 2001 he co-founded ALMA Technologies S.A., Athens, Greece. From June 2003 to July 2009 he was a Lecturer in the Department of Physics of Aristotle University of Thessaloniki, Greece. In July 2009 he joined the Department of Electrical and Computer Engineering of the University of Patras as an assistant professor. His research interests include power estimation, low-power VLSI design, and embedded systems design.
Costas E. Goutis received his B.Sc. in Physics from the University of Athens, Greece, in 1966. In 1974 he received his M.Sc. in Digital Telecommunications from Heriot-Watt University and in 1978 he received his Ph.D. in Digital Image Processing from the University of Southampton, UK. Since 1985 he has been an associate, full, and emeritus professor in the ECE Department of the University of Patras, Greece. His research interests include VLSI Design, High Level Synthesis, and Low Power design. He has published more than 300 articles (104 in journals and 194 in conferences) and holds 2 best paper awards.