
Received February 16, 2021, accepted February 28, 2021, date of publication March 2, 2021, date of current version March 16, 2021.


Digital Object Identifier 10.1109/ACCESS.2021.3063485

A High-Performance Multimem SHA-256 Accelerator for Society 5.0
THI HONG TRAN, (Member, IEEE), HOAI LUAN PHAM, AND YASUHIKO NAKASHIMA, (Senior Member, IEEE)
Graduate School of Information Science, Nara Institute of Science and Technology (NAIST), Nara 630-0192, Japan
Corresponding author: Thi Hong Tran ([email protected])
This work was supported by the Japan Science and Technology Agency (JST) under the Strategic Basic Research Program named
Precursory Research for Embryonic Science and Technology (PRESTO) Grant number 2020A031.

ABSTRACT The development of a low-cost high-performance secure hash algorithm (SHA)-256 accelerator has recently received extensive interest because SHA-256 is important in widespread applications, such as cryptocurrencies, data security, data integrity, and digital signatures. Unfortunately, most current research has focused on the performance of the SHA-256 accelerator itself rather than on the system level, at which the data transfer between the external memory and the accelerator occupies a large fraction of the processing time. In this paper, we solve this state-of-the-art problem with a novel SHA-256 architecture named the multimem SHA-256 accelerator that achieves high performance at the system on chip (SoC) level. Notably, our accelerator employs three novel techniques, the pipelined arithmetic logic unit (ALU), multimem processing element (PE), and shift buffer in shift buffer out (SBi-SBo), to reduce the critical path delay and significantly increase the processing rate. Experiments on a field-programmable gate array (FPGA) and an application-specific integrated circuit (ASIC) show that the proposed accelerator achieves a significantly better processing rate and hardware efficiency than previous works. The accelerator accuracy is verified on a real hardware platform (FPGA ZCU102). The accelerator is synthesized and laid out with 180 nm complementary metal oxide semiconductor (CMOS) technology with a chip size of 8.5 mm × 8.5 mm, consumes 1.86 W, and provides a maximum processing rate of 40.96 Gbps at 80 MHz and 1.8 V. With FPGA Xilinx 16 nm FinFET technology, the accelerator processing rate is as high as 284 Gbps.

INDEX TERMS SHA-256, blockchain, society 5.0, cryptographic hash function, SHA-256 accelerator, FPGA, ASIC.

I. INTRODUCTION
Cryptography secure hash algorithm (SHA)-2 plays an important role in developing super smart society 5.0 because it is required for data security and data integrity purposes in many applications, such as the digital signature algorithm (DSA) [1], hash-based message authentication code (HMAC) [2], pseudorandom number generation (PRNG) [3], and message authentication in internet protocol security (IPSec) [4]. Recently, the representative SHA-2 hash family named SHA-256 has become the key component for securing decentralized blockchain networks such as Bitcoin and Ethereum [5]. To secure the network, the double SHA-256 must be repeatedly computed a large number of times until a valid hash value that is smaller than a pre-defined target is found [5]. This results in the well-known disadvantage of the Bitcoin blockchain network of extremely high energy consumption. It was reported that the power consumption of the Bitcoin network reached 64 TWh in 2020, which is even higher than the total energy consumption of several nations such as Switzerland and the Czech Republic. In the field of the Internet of Things (IoT), SHA-256 should be implemented in power-constrained sensors or edge nodes to secure data and/or check the integrity of data. For these reasons, developing a high-performance SHA-256 accelerator has become a research trend in recent years.

Several techniques have been widely used in published works. For instance, unrolled architectures [6], [7] were implemented to improve the processing rate in trade-offs with large areas and high power consumption. Pipelined architectures have been developed in many works, such as [8], [9], and [10], to increase the processing rate of the SHA-256 core. Furthermore, [8] employed the parallel counter technique to reduce the circuit area and critical path delay. Recently, [11] introduced the rescheduling of the SHA-256 round computation method to reduce the critical path delay and improve the processing rate of the SHA-256 accelerator.

The associate editor coordinating the review of this manuscript and approving it for publication was Wei Huang.

Generally, these techniques merely rearranged the operators required for SHA-256 round computation, such as adders, logic gates, and rotations, in an appropriate and efficient way to achieve high performance in terms of the processing rate, circuit area, or critical path delay. In another approach, the work in [5] introduced the idea of a compact message expander (CME) in which the input data characteristic of double SHA-256 was utilized to significantly optimize the hardware cost and power consumption of the circuit for the case of Bitcoin mining. However, the proposed work aimed to support the double SHA-256 only in Bitcoin mining.

The reality is that an SHA-256 circuit is just a component of an embedded system under the control of a microprocessor. The processing time of the SHA-256 circuit includes not only the time for calculating the hash value but also the time for transferring data between the external memory and the SHA-256 circuit. Unfortunately, the aforementioned conventional works focused merely on the calculation time of the SHA-256 circuit. The data transfer time was ignored. An efficient SHA-256 accelerator must provide a high computation rate and be very compatible with embedded systems so that the data transfer can be performed efficiently and quickly.

In this study, we propose a high-performance hardware architecture for SHA-256 by considering both the hash calculation time and the data transfer time. We named the architecture the multimem SHA-256 accelerator because multiple local memories are implemented in each processing element (PE) to temporarily store the input and output data of SHA-256. These memories work appropriately and efficiently to reduce the data transfer time between the accelerator and external memory. Several ideas of interest, such as pipelined arithmetic logic units (ALUs), multimem PEs, and shift buffer in shift buffer out (SBi-SBo), are introduced in this paper.

The remainder of this paper is organized as follows. Section II provides the background of the research. Section III describes our proposed multimem SHA-256 architecture. Section IV reports our evaluation in terms of application-specific integrated circuit (ASIC) and field-programmable gate array (FPGA) experiments. Finally, Section V concludes the paper.

II. BACKGROUND
A. SHA-256 ALGORITHM
The SHA-2 family is a set of cryptographic hash functions designed and published by the US National Institute of Standards and Technology (NIST) in 2002 [12]. The SHA-2 family includes six hash functions named SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224, and SHA-512/256. They are actually the same algorithm but with different word lengths, constant parameters, and initialization values. SHA-256 is known as a representative of the SHA-2 family and is currently applied to secure data in many applications such as decentralized blockchain, DSA, and HMAC. SHA-256 calculates a 256-bit hash value for an input message of 512 bits. A real application may need to calculate the hash value for a very long message. In such cases, the message is divided into many 512-bit data blocks. If the last block is smaller than 512 bits, padding is added. The hash calculation for a long message is shown in Fig. 1. The SHA-256 algorithm computes intermediate hash values for the data blocks one by one, in which the hash result of the current block becomes the input initial hash for hash computing of the next data block. The result of the final data block is considered to be the hash value of the entire message.

FIGURE 1. The generation of a SHA-256 hash value for a long message.

FIGURE 2. The overview operation of the SHA-256 algorithm.

Fig. 2 illustrates the overview operation of the SHA-256 algorithm. It includes two processes named the message expander (ME) and the message compressor (MC). The ME process expands the 512-bit input message into 64 chunks of 32-bit data Wj (0 ≤ j ≤ 63). In the first 16 rounds, the ME parses the 512-bit message into 16 32-bit data chunks (denoted as Wj, j = 0 to 15, where j is the round index). In the final 48 rounds, the ME calculates 48 chunks of 32-bit data Wj (16 ≤ j ≤ 63) following eq. (1). Three 32-bit adders and two logical functions σ0(x) and σ1(x) are needed to compute


Wj (16 ≤ j ≤ 63). σ0(x) and σ1(x) are computed as eq. (2) and eq. (3). Note that S^n(x) and R^n(x) denote the right rotation and the right shift of data x by n bits, respectively.

Wj = σ1(Wj−2) + Wj−7 + σ0(Wj−15) + Wj−16   (1)
σ0(x) = S^7(x) ⊕ S^18(x) ⊕ R^3(x)   (2)
σ1(x) = S^17(x) ⊕ S^19(x) ⊕ R^10(x)   (3)

The MC process computes the 256-bit hash value from the outputs of the ME process (64 chunks of Wj (0 ≤ j ≤ 63)). The process involves two main steps: loops and hash updates. In the loop step, eight loop hash values (denoted a, b, c, d, e, f, g, h) are initialized by the initial hash values H0, H1, ..., H7. The loop hash values a, b, c, d, e, f, g, h are then computed and updated through 64 loops. In each loop (loop j, 0 ≤ j ≤ 63), eqs. (4) to (8) are applied.

T1 = h + Σ1(e) + Ch(e, f, g) + Kj + Wj   (4)
T2 = Σ0(a) + Maj(a, b, c)   (5)
a = T1 + T2   (6)
e = d + T1   (7)
b = a; c = b; d = c; f = e; g = f; h = g   (8)

where the logical functions Σ0(x), Σ1(x), Ch(x, y, z), and Maj(x, y, z) are computed using the following equations.

Σ0(x) = S^2(x) ⊕ S^13(x) ⊕ S^22(x)   (9)
Σ1(x) = S^6(x) ⊕ S^11(x) ⊕ S^25(x)   (10)
Ch(x, y, z) = (x ∧ y) ⊕ (¬x ∧ z)   (11)
Maj(x, y, z) = (x ∧ y) ⊕ (x ∧ z) ⊕ (y ∧ z)   (12)

In the hash update step, the final 256-bit hash value, which is divided into 8 chunks of 32-bit data HO0, HO1, ..., HO7, is computed by adding the initial hashes H0, H1, ..., H7 to the loop hashes a, b, c, d, e, f, g, h, as illustrated in eq. (13).

HO0 = H0 + a; ...; HO7 = H7 + h   (13)
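For readers who prefer pseudocode, the following is a minimal C sketch of eqs. (1)–(13) for a single 512-bit block. It is an illustration rather than the accelerator's implementation; the caller is assumed to supply the 64 standard round constants Kj defined in the SHA-256 specification [12], which are not reproduced here.

```c
/* Minimal sketch of one 512-bit block computation following eqs. (1)-(13). */
#include <stdint.h>

static uint32_t rotr(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }   /* S^n(x) */
static uint32_t shr (uint32_t x, unsigned n) { return  x >> n; }                      /* R^n(x) */

static uint32_t sigma0(uint32_t x) { return rotr(x, 7)  ^ rotr(x, 18) ^ shr(x, 3);  } /* eq. (2)  */
static uint32_t sigma1(uint32_t x) { return rotr(x, 17) ^ rotr(x, 19) ^ shr(x, 10); } /* eq. (3)  */
static uint32_t Sigma0(uint32_t x) { return rotr(x, 2)  ^ rotr(x, 13) ^ rotr(x, 22); }/* eq. (9)  */
static uint32_t Sigma1(uint32_t x) { return rotr(x, 6)  ^ rotr(x, 11) ^ rotr(x, 25); }/* eq. (10) */
static uint32_t Ch (uint32_t x, uint32_t y, uint32_t z) { return (x & y) ^ (~x & z); }          /* eq. (11) */
static uint32_t Maj(uint32_t x, uint32_t y, uint32_t z) { return (x & y) ^ (x & z) ^ (y & z); } /* eq. (12) */

/* H[8]: initial hash H0..H7 on entry, updated hash HO0..HO7 on return.
   M[16]: one 512-bit message block as 16 32-bit words.
   K[64]: standard SHA-256 round constants supplied by the caller. */
void sha256_block(uint32_t H[8], const uint32_t M[16], const uint32_t K[64])
{
    uint32_t W[64];
    for (int j = 0; j < 16; j++)            /* first 16 rounds: parse the block (ME pass-through) */
        W[j] = M[j];
    for (int j = 16; j < 64; j++)           /* last 48 rounds: message expansion, eq. (1) */
        W[j] = sigma1(W[j - 2]) + W[j - 7] + sigma0(W[j - 15]) + W[j - 16];

    uint32_t a = H[0], b = H[1], c = H[2], d = H[3];
    uint32_t e = H[4], f = H[5], g = H[6], h = H[7];

    for (int j = 0; j < 64; j++) {          /* 64 loops of the MC process, eqs. (4)-(8) */
        uint32_t T1 = h + Sigma1(e) + Ch(e, f, g) + K[j] + W[j];
        uint32_t T2 = Sigma0(a) + Maj(a, b, c);
        h = g; g = f; f = e; e = d + T1;
        d = c; c = b; b = a; a = T1 + T2;
    }

    /* hash update step, eq. (13) */
    H[0] += a; H[1] += b; H[2] += c; H[3] += d;
    H[4] += e; H[5] += f; H[6] += g; H[7] += h;
}
```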
B. SHA-256 CHARACTERISTICS AND PRELIMINARY IDEA FOR THE HIGH PERFORMANCE ACCELERATOR
There are three characteristic points of SHA-256 that should be noted. First, the SHA-256 algorithm needs only low-cost arithmetic logic operators such as adders, shifts, rotations, and XORs. No complex operators such as multipliers and dividers are required. Second, the number of operators (NOTs, XORs, adders, shifts, rotations, etc.) per loop calculation is significantly large (approximately 50 operators per loop). Third, data dependence among the loops is present. This means that the current loop needs the result of the previous loops for calculation. In addition, some common data are required in multiple loops. Because of these characteristics, the current high-performance CPU (Intel multicore) and GPU (Tesla V100, GTX 1080, etc.) platforms do not perform well in calculating the SHA-256 algorithm. Although these general purpose platforms are rich in hardware resources, they still need a large number of clock cycles to compute a single loop of SHA-256, which results in a low processing rate. One of the main reasons is that memory blocks such as double data random access memory (DDRAM) and the L1 and L2 caches of these platforms are placed far from the calculation unit, and the data transfer time between the memories and calculation units thus constitutes a large amount of the total processing time. Furthermore, most of the existing hardware resources of CPUs and GPUs, such as multipliers and exponentials, are useless for SHA-256 but still consume energy. Despite these disadvantages, multicore CPUs and GPUs are currently considered to be the most applicable hardware platforms for calculating SHA-256 in Bitcoin mining and other blockchain networks.

In another approach, the systolic array-based accelerator named EMAXVR in [13] and its improved version in [14] were applied to reduce the data access time by implementing local memory near the ALU. Although this platform achieves high performance on image processing and AI learning [13], [14], its performance for computing SHA-256 is very poor [15]. The key reason for the low processing rate is the data-dependent characteristic of the SHA-256 algorithm, as mentioned above. The processing element (PE) architecture of EMAXVR is not yet suitable for SHA-256 computation; therefore, a large number of PEs are required to compute a single hash loop. Furthermore, the data dependence among the loops requires 1) multiple copies of data to be stored in different PEs, which results in low hardware efficiency, and 2) external memory data transfer after the completion of a loop, which results in a low processing rate.

Studying the weaknesses of the existing hardware platforms for SHA-256 and deeply understanding the characteristics of SHA-256, we propose a novel multimem SHA-256 accelerator that achieves high performance in both hardware efficiency and processing rate. We propose multiple local memory structures near the ALU for reducing the data access time and improving hardware efficiency. Furthermore, we design a processing element (PE) architecture that is able to compute all loops of the hash function without requiring data from the nearby PEs. Thus, the PEs can work independently and in parallel, and no wire connections are required among the PEs. As a result, the accelerator does not need to frequently request data from external memories, and the data access time is significantly shortened. Furthermore, the hardware efficiency and processing rate of the SHA-256 accelerator are expected to be remarkably enhanced.

III. THE PROPOSED MULTIMEM SHA-256 ACCELERATOR
A. OVERVIEW OF THE ARCHITECTURE
Fig. 3 shows the overview architecture of our proposed accelerator. It can be employed in any embedded system via an advanced extensible interface (AXI) bus. The CPU is the main microprocessor controlling the operation of the entire embedded system. Basically, the CPU is busy with many tasks, including the control of the hash calculation inside the SHA-256 accelerator. In the task of controlling the accelerator, the CPU must send commands to transfer data from DDR


memory to the SHA-256 accelerator via data buses (such as advanced microcontroller bus architecture (AMBA) buses and AXI buses), reconfigure the accelerator, and send commands to read results from the accelerator to the DDR memory. The data transfer between the accelerator and the DDR memory is basically much slower than the data transfer inside the accelerator. In addition, the data buses are not dedicated to only the operation of the SHA-256 accelerator. They may not immediately respond to a request from the accelerator and may make the accelerator wait for data. Therefore, we implement multiple memory blocks inside the accelerator to reduce the number of external DDR memory requests and to increase the accelerator processing rate.

FIGURE 3. Overview of the architecture of the proposed multimem SHA-256 accelerator.

The main component of the accelerator is an array of PEs responsible for computing the SHA-256 hash value. In addition, it has an AXI bus interface (AXI IF), a global memory (GRAM) for storing global parameters such as the accelerator configuration parameters, and a controller (CTRL) providing control signals for the PEs, AXI IF, and GRAM.

FIGURE 4. Block diagram of a processing element (PE).

Fig. 4 shows the internal architecture of a PE. It includes a pipelined ALU, multiple memory structures named WRAM and HRAM, and shift buffer in (SBi1, SBi2) shift buffer out (SBo1, SBo2) blocks. The pipelined ALU includes the arithmetic logic operators required for computing all loops of the hash function. The WRAM and HRAM blocks include local memories storing the input and output data of the PE so that the PE can compute hash values of a large number of SHA-256 functions rapidly without requesting data from the external DDR memory. The SBi1, SBi2, SBo1, and SBo2 temporarily store and rearrange the data flow between the pipelined ALU and the WRAM/HRAM blocks. All these blocks cooperate to achieve the following two things: 1) a PE can compute all the loops, and 2) the hardware efficiency of the ALU is 100%.

In general, a PE of our accelerator computes all 64 loops of SHA-256 within 64 clock cycles on average at a latency of 64 × 4 = 256 clock cycles. In other words, the accelerator including 64 PEs can compute a hash value (64 loops) per clock cycle. Furthermore, the memory resources of WRAM and HRAM are efficiently allocated so that the 64-PE accelerator can uninterruptedly compute an amount of 64 × 4 × L functions without further requesting data from external memory, in which L is a pre-defined positive integer parameter.
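As a quick cross-check (not stated as a formula in the paper, but implied by the one-block-per-64-cycles and one-block-per-cycle figures above), the processing rates scale with the clock frequency as

```latex
% Rates implied by one 512-bit block per 64 cycles (1 PE) or per clock cycle (64 PEs)
R_{\mathrm{1PE}}  = \frac{512~\text{bits}}{64~\text{cycles}} \cdot f_{\mathrm{clk}} = 8\, f_{\mathrm{clk}},
\qquad
R_{\mathrm{64PE}} = 512~\text{bits} \cdot f_{\mathrm{clk}}.
```

For example, R_64PE = 512 bits × 80 MHz = 40.96 Gbps and 512 bits × 272 MHz ≈ 139 Gbps, matching the ASIC figures reported in Section IV; the 284 Gbps FPGA figure quoted in the abstract corresponds to a clock of roughly 555 MHz.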
The AXI bus and the PEs are designed to work in a pipelined and parallel manner, as shown in Fig. 5. When the AXI bus is reading/writing data from/to a PE, the other PEs of the accelerator are executing the hash calculation. As a result, the data transfer time between the accelerator and the external DDR memory does not affect the entire system processing rate (if the embedded system is equipped with a fast enough AXI bus).

FIGURE 5. Timing chart of the entire multimem SHA-256 accelerator.

B. PIPELINED ALU ARCHITECTURE
As shown in Fig. 4, the pipelined ALU includes two blocks named ME and MC that compute the ME and MC processes of the SHA-256 function, respectively. Fig. 6 shows the circuit inside the pipelined ALU. To achieve a high maximum frequency and high processing throughput, the ME block is designed to compute the ME process in 2 pipelined clock cycles named ME-1 and ME-2, while the MC block computes the MC process in 4 pipelined clock cycles named MC-1, MC-2, MC-3, and MC-4. The two stages of the ME process run in parallel with the first two stages of the MC (named MC-1 and MC-2). The last two stages of the MC receive the output data from the ME block as well as the outputs of the previous stages (MC-1 and MC-2) to complete the loop calculation.

With this architecture, the ALU can compute a hash function loop in 4 pipelined clock cycles. It then feeds back the result of the current loop to recursively compute the next loop.


FIGURE 6. The pipelined arithmetic logic unit (ALU) architecture.

Therefore, all 64 loops can be completed in a single ALU. To efficiently use the hardware resources of the ALU (100% hardware efficiency), four data flows from the local memories WRAM and HRAM should be input to the 4 pipelined ALU stages. The operation of the ME block is as follows. In the first 16 loops (loop j, 0 ≤ j ≤ 15), the ME block merely passes the input data Wj to its output port. In the last 48 loops (loop j, 16 ≤ j ≤ 63), the ME block computes Wj from the four input data Wj−16, Wj−15, Wj−7, Wj−2 following eq. (1) to eq. (3). The result Wj is then 1) passed to the MC-3 stage of the MC block for further computing and 2) written into the WRAM for future use. The operation of the MC block is as follows. In the first loop (loop j = 0), the loop hash values a, b, . . . , and h are assigned to the initial hash values H0, H1, . . . , and H7 read from the HRAM. The values of a, b, . . . , and h are then computed and updated in the four pipelined stages MC-1, MC-2, MC-3, and MC-4 following eq. (4) to eq. (12). Because four data flows share the same hardware resources of the ALU, all the ALU stages are always busy. This means that we achieve 100% hardware efficiency for the ALU resources. By the time the MC-4 stage is computing for data flow 1, MC-3 computes for data flow 2, MC-2 for data flow 3, and MC-1 for data flow 4.

FIGURE 7. The timing chart of the ME and MC blocks inside the pipelined ALU when 4 data flows are processed.

Fig. 7 illustrates the timing chart of the ME and MC blocks when loop 0 and loop 1 are computed for the four hash functions F1, F2, F3, and F4. The updated values of a, b, . . . , and h after the MC-4 stage are then fed back to become the inputs of the MC-1 stage. The calculation for the next loop (loop j = 1) is then started. This operation is repeated until all 64 loops of the hash function are completed. The values of a, b, . . . , and h after the completion of 64 loops are transferred to SBo2 to be written into the HRAM.
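The round-robin interleaving described above can be pictured with a tiny sketch. The stage and flow numbering below is assumed from the text (ME-1/ME-2 are omitted); the point is simply that in steady state every MC stage processes a different flow each cycle, so the ALU never idles.

```c
/* Sketch of the 4-flow round-robin schedule: which flow occupies which MC stage per cycle. */
#include <stdio.h>

int main(void)
{
    const char *stage[4] = { "MC-1", "MC-2", "MC-3", "MC-4" };
    for (int cycle = 4; cycle < 12; cycle++) {   /* print a few steady-state cycles */
        printf("cycle %2d:", cycle);
        for (int s = 0; s < 4; s++) {
            int flow = (cycle - s) % 4;          /* flow occupying stage s this cycle */
            printf("  %s:F%d", stage[s], flow + 1);
        }
        printf("\n");
    }
    return 0;
}
```

At cycles where MC-4 holds F1, the printed line shows MC-3 holding F2, MC-2 holding F3, and MC-1 holding F4, which matches the example given in the text.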
C. MULTIPLE LOCAL MEMORY STRUCTURES WRAM AND HRAM
The local memories WRAM and HRAM store the input and output data of the ALU. While the WRAM stores the message words Wj (0 ≤ j ≤ 63), the HRAM stores the hash values H0, . . . , and H7. Because the ALU requires 4 data flows to efficiently use the hardware resources, we implement 4 identical RAM blocks in both the WRAM and the HRAM, named WM1, WM2, WM3, WM4, and HM1, HM2, HM3, HM4 (see Fig. 4). Two-port RAMs are used because the RAMs may need to simultaneously read and write data.

FIGURE 8. The memory map of a RAM block inside a) the WRAM and b) the HRAM.

Fig. 8a and b show the size of each RAM block on the WRAM and HRAM sides, respectively. Each block of the WRAM, such as WM1 and WM2, is sized L × 64 × 32 bits and can simultaneously store the 64 words Wj (0 ≤ j ≤ 63) of L hash functions, where L × 16 × 32 bits, equivalent to the 16 words Wj (0 ≤ j ≤ 15) of L hash functions, are written into each RAM before starting the hash calculation. The remaining L × 48 × 32 bits, equivalent to the 48 words Wj (16 ≤ j ≤ 63) of L hash functions, are written during the hash calculation by the ALU. Note that L is a parameter that can be changed before fabricating the chip.
HRAM Each block of HRAM, such as HM1 and HM2, is sized
The local memories WRAM and HRAM store the input and L × 8 × 32 bits and can store 8 chunks of 32-bit hash values
output data of the ALU. While WRAM stores the message H0 , . . . , and H7 of L hash functions. Our accelerator can
words Wj (0 ≤ j ≤ 63), HRAM stores the hash values H0 , work in two different modes. The first mode computes the
. . . , and H7 of the message. Because the ALU requires 4 data hash values of independent hash functions. In this mode,


the initial hash value of the L hash functions is written into all blocks of the HRAM before starting the hash calculation. The content of the HRAM blocks is then overwritten by the final hash values calculated by the ALU. These values are read out of the accelerator after the hash calculation of all L hash functions is completed. The second mode computes the hash value of a long message divided into L 512-bit data blocks, as illustrated in Fig. 1. In this mode, only the initial hash value of the first data block is written into the 4 RAM blocks of the HRAM before the accelerator is activated. During the hash calculation, the output hash value of a data block is written into the HRAM to be used as the initial hash value of the next adjacent data block. The output hash value of the final data block is the hash value of the entire message, which is then read out of the accelerator.
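The second mode is the block chaining already shown in Fig. 1; in software terms it is simply the loop below, a sketch reusing the hypothetical sha256_block() from Section II rather than the accelerator's firmware.

```c
/* Sketch of the long-message mode: the output hash of block i becomes the
   initial hash of block i+1, as described above. */
#include <stdint.h>

void sha256_block(uint32_t H[8], const uint32_t M[16], const uint32_t K[64]);

/* blocks[L][16]: the padded message split into L 512-bit blocks.
   H[8] holds the standard initial hash on entry and the final digest on return. */
void sha256_long_message(uint32_t H[8], const uint32_t blocks[][16],
                         unsigned L, const uint32_t K[64])
{
    for (unsigned i = 0; i < L; i++)
        sha256_block(H, blocks[i], K);   /* intermediate hash chains into the next block */
}
```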

D. SHIFT BUFFER IN SHIFT BUFFER OUT (SBi-SBo)
The ALU requires multiple input data, such as Wj−16, Wj−15, Wj−7, Wj−2, and a, . . . , h, at the same time. However, each RAM block is able to read out only one data point per clock cycle. Similarly, the ALU simultaneously outputs many data values, such as the updated versions of a, . . . , and h, but only one data point can be written into a block of the HRAM per clock cycle. We thus cannot directly connect the data flows between the WRAM/HRAM and the ALU. The SBi and SBo are required to collect and schedule the timing of the data flows.

FIGURE 9. SBi1 and SBi2 architectures.

Fig. 9a and b show the SBi1 and SBi2 architectures that connect the outputs of the WRAM and the HRAM to the ALU inputs, respectively. SBi1 includes 4 shift buffers named SBi1-1, SBi1-2, SBi1-3, and SBi1-4 receiving data flows from WM1, WM2, WM3, and WM4 of the WRAM, respectively. Each shift buffer has 4 32-bit shift registers that receive data from the WRAM and then shift the data to the right once per clock cycle. The timings of the 4 data flows from the 4 RAM blocks WM1, . . . , WM4 are interleaved, as shown in Fig. 10. The multiplexer MUX-1 then selects the appropriate dataset from the 4 data flows, one per clock cycle, to input to the ALU.

FIGURE 10. Timing chart of SBi1.

Similarly, SBi2, shown in Fig. 9b, has 4 shift buffers named SBi2-1, SBi2-2, SBi2-3, and SBi2-4 receiving data flows from HM1, HM2, HM3, and HM4 of the HRAM, respectively. Each shift buffer has 8 32-bit shift registers that receive the data a, . . . , h from the HRAM and then shift the data to the right once per clock cycle. The multiplexer MUX-2 then selects the appropriate dataset from one of the four data flows to give to the ALU.

SBo1 delivers the result Wj of the ME block to the corresponding RAMs of the WRAM. Because only one value Wj is generated per clock cycle, shift buffers are not required, and only a 1-to-4 multiplexer is needed inside SBo1.

FIGURE 11. SBo2 architecture.

SBo2 has two main functions. First, it collects and schedules the loop hash words a, . . . , h output from the ALU after 64 loops are complete. Second, it calculates the final hash values HO0, . . . , HO7 by adding the loop hash words a, . . . , h and the initial hash values H0, . . . , H7 following eq. (13). Fig. 11 shows the inside of SBo2. It includes four shift buffers on the left named SBo2-l1, SBo2-l2, SBo2-l3, and SBo2-l4 and four shift buffers on the right named SBo2-r1, SBo2-r2, SBo2-r3, and SBo2-r4. Each shift buffer has 8 32-bit shift registers. The buffers on the left (SBo2-l1, SBo2-l2, SBo2-l3, SBo2-l4) store the ALU outputs a, . . . , h, while the buffers on the right (SBo2-r1, SBo2-r2, SBo2-r3, SBo2-r4) receive the initial hash values H0, . . . , H7 from SBi2. Each left-right set of buffers aims to process one data flow. The content of the left-right set of buffers is shifted to the left (direction of the most significant register) once per


clock cycle. The values from the most significant registers are read out and added. The addition results become the final hash words HO0, . . . , HO7 that are written into the corresponding RAM blocks (HM1, . . . , HM4) of the HRAM.

FIGURE 12. Timing chart of SBo2.

Fig. 12 illustrates the timing chart of SBo2. First, the ALU output data, such as a, . . . , and h, of the 4 data flows are stored in the 8 registers of the shift buffers SBo2-l1 to SBo2-l4 in 4 interleaved clock cycles. The values of the most significant register (for example SBo2-l1[7]), which are a, b, c, d, e, f, g, and h at clock cycles #1, #2, #3, #4, #5, #6, #7, and #8, respectively, are read out to be added to the initial hash values H0, . . . , H7 before writing into the HRAM (for example HM1). It needs 8 clock cycles to complete the writing of the 8 32-bit hash words for one data flow and 8 + 3 = 11 clock cycles to complete the writing for the 4 data flows in pipelined mode.

IV. EVALUATION
In this section, we evaluate the performance of our proposed multimem SHA-256 accelerator from both ASIC and FPGA experimental results.

TABLE 1. Comparison based on ASIC results.

A. ASIC EXPERIMENT
The proposed multimem SHA-256 circuits for the cases of 1 PE and 64 PEs were coded in Verilog and synthesized for an ASIC using the Synopsys Design Compiler with the Rohm 0.18 µm CMOS standard cell library [16]. Table 1 shows the synthesized area of the SHA-256 circuit in comparison with those of other works. To ensure a fair comparison, we provide the processing throughput in terms of both the number of clocks per hash function and Mbps. We also normalize the processing throughput (Mbps) and hardware efficiency of other works to the same 180 nm technology. The proposed architecture in the cases of both 1 PE and 64 PEs exhibits significantly better processing throughput and hardware efficiency. Compared with the works in [17] and [18], which reported only the ALU circuit results, our ALU generates a processing rate as high as 2,930 Mbps, which is 1.7 times and 1.9 times faster than the works in [17] and [18], respectively. Our ALU hardware efficiency is 123 Kbps/gate, which is 1.1 times and 1.4 times better than those of [17] and [18], respectively.

The proposed architecture with only 1 PE (No RAM PE) provides a processing rate as high as 2,431 Mbps, which is 12.3 times faster than that of the work in [22]. Its hardware efficiency is 43 Kbps/gate, which is almost 2 times better than that of [22]. Notably, our architecture with 64 PEs achieves the best processing rate. It provides up to 139 Gbps at a frequency of 272 MHz, which is 70 times faster than [22]. Note that the term "No RAM" refers to the synthesis result excluding the local memory resources inside the HRAM and WRAM.

Furthermore, we successfully laid out the proposed SHA-256 circuit in the case of 64 PEs in ASIC technology with the Rohm 180 nm CMOS standard cell library. The size of the layout chip is 8.5 mm × 8.5 mm, and the chip processing rate is 40.96 Gbps when a global operating voltage of 1.8 V and an operating frequency of 80 MHz are applied.


TABLE 2. A comparison of FPGA synthesis results.

B. FPGA EXPERIMENT
1) FPGA SYNTHESIS RESULTS
To ensure a fair comparison with other existing SHA-256 architectures such as [11], [19], and [20], we synthesized the proposed SHA-256 circuits on several Xilinx FPGA boards, including Virtex 2, E, 4, 5, 6, 7, and Kintex UltraScale. The comparison factors include the circuit area, processing rate, and hardware efficiency.

The results are shown in Table 2. Note that among the previous works, some reported only the results of the ALU part, some reported the results of full accelerators including the ALU, controller, memories, etc., and some reported the full accelerator except the memory that temporarily stores the input/output data. To ensure a fair comparison with other works, we synthesized two versions (1-PE and 64-PEs) of our accelerator with many options such as "only the ALU", "no RAM", and "full". In general, both versions 1-PE and 64-PEs of our proposed SHA-256 accelerator outperform the existing SHA-256 architectures in terms of processing rate and hardware efficiency. Some typical numerical results are as follows.

On the Virtex 2 device, the ALU circuit of our accelerator version 1-PE generates 1 hash value every 64 clock cycles, which is equivalent to a processing rate of 1,989 Mbps at a maximum frequency of 249 MHz. The processing rate of our accelerator is 30.6 times, 2.4 times, 2 times, and 3.7 times


better than the architectures in [6], [19], [20], and [21], respectively. The hardware efficiency of our accelerator version 1-PE is 1,864 Kbps/LUT, which is 12.4 times, 2.5 times, 2.5 times, and 1.76 times higher than that of the architectures in [6], [19], [20], and [21], respectively.

On the Virtex E device, the processing rate and hardware efficiency of the full circuit of our accelerator version 1-PE are 7.6 times (662 vs. 87 Mbps) and 2.7 times (183 vs. 69 Kbps/LUT) higher than those of the architecture in [22], respectively.

On the Virtex 4 device, compared to that of the architectures in [10], [11], [19], and [21], the performance of our accelerator version 1-PE is 34 times, 1.6 times, 2.3 times, and 3.5 times better in terms of processing rate and 13.5 times, 1.4 times, 1.3 times, and 1.6 times better in terms of hardware efficiency, respectively.

On the Virtex 5 device, the processing rate of our architecture version 1-PE is 28.5 times and 3.1 times better than those of the architectures in [19] and [21], respectively. Our circuit hardware efficiency is improved by 5.1 times and 1.1 times compared to that of the architectures in [19] and [21], respectively.

On the Virtex 6 device, the processing rate and hardware efficiency of the ALU part of our accelerator version 1-PE are 2 times and 2.5 times higher than those of the architecture in [25], respectively.

On the Virtex 7 device, the processing rate and hardware efficiency of our accelerator version 1-PE are 1.9 times and 5 times higher than those of the architecture in [26], respectively.

On the Kintex UltraScale+ device, our accelerator version 1-PE achieves almost the same performance as that of the work in [27]. However, our accelerator version 64-PE obtains a significantly better processing rate, 49 times that of the aforementioned study (284,448 vs. 5,829 Mbps).

Overall, our proposed SHA-256 accelerator obtains a significantly better processing rate and hardware efficiency and has the ability to scale the trade-off between the processing rate and hardware cost. This means that our accelerator processing rate can be changed according to the requirements of the real system.

2) FUNCTIONAL VERIFICATION ON A REAL HARDWARE PLATFORM
To prove that the circuit operates correctly not only in the software simulation tool but also on real hardware, we built a system on chip (SoC) platform to execute the proposed SHA-256 circuit including 64 PEs. The SoC platform overview is shown in Fig. 13, and the real image of the platform is captured in Fig. 14.

FIGURE 13. SoC-based multimem SHA-256 diagram.

FIGURE 14. SoC-based multimem SHA-256 experiment with real devices.

The platform consists of two main devices: a host PC and a Zynq UltraScale+ ZCU102 evaluation board. The two


devices connect and exchange data via JTAG and UART cables. The ZCU102 board comprises an ARMv8 Cortex-A53 microprocessor, programmable logic (PL), and a clock generator. Our multimem SHA-256 is developed in the PL part and controlled by the ARMv8 microprocessor. The clock generator generates a clock speed of 100 MHz to operate the ARMv8 microprocessor and our accelerator embedded in the PL.

The host PC runs two Xilinx tools: Vivado and a Software Development Kit (SDK). We use Vivado to design and load the SHA-256 circuit onto the Zynq UltraScale+ ZCU102 board. To control the operation of the SHA-256 accelerator, we use the SDK to embed C code onto the ARMv8 microprocessor. The C code runs tasks such as transferring data between the external memory and the SHA-256 accelerator, activating the operation of the accelerator, and verifying the accelerator results.
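The control flow of that C code can be pictured with the following sketch. It is illustrative only: the base address, register offsets, and names below are hypothetical placeholders, not the accelerator's documented programming interface.

```c
/* Bare-metal control-flow sketch: load input words, start the accelerator,
   wait for completion, read the hash back. All addresses/offsets are placeholders. */
#include <stdint.h>

#define ACC_BASE    0xA0000000UL                 /* placeholder AXI base address      */
#define REG(off)    (*(volatile uint32_t *)(ACC_BASE + (off)))
#define REG_CTRL    0x00                         /* placeholder: bit 0 = start        */
#define REG_STATUS  0x04                         /* placeholder: bit 0 = done         */
#define WRAM_OFF    0x1000                       /* placeholder: input word window    */
#define HRAM_OFF    0x2000                       /* placeholder: output hash window   */

void run_hash_job(const uint32_t *w_in, unsigned n_words, uint32_t hash_out[8])
{
    for (unsigned i = 0; i < n_words; i++)       /* 1) transfer message words over AXI  */
        REG(WRAM_OFF + 4 * i) = w_in[i];

    REG(REG_CTRL) = 1;                           /* 2) activate the accelerator         */
    while ((REG(REG_STATUS) & 1) == 0)           /* 3) wait until the hash job finishes */
        ;

    for (unsigned i = 0; i < 8; i++)             /* 4) read back the 256-bit hash       */
        hash_out[i] = REG(HRAM_OFF + 4 * i);
}
```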
In our experiment, we develop two testing cases named individual PE and parallel 64-PE, as shown in Fig. 13. In the individual PE testing case, all 64 PEs of the accelerator are programmed to perform the same task. They are all programmed to compute the hash values for 100,000 independent data blocks. The results from each PE are then compared against the correct 100,000 hash values computed by an Intel Xeon Gold 6144 CPU. Our implementation results have shown that all 64 PEs work correctly and generate hash values matching those computed by the Xeon CPU. In other words, the correction ratio of our accelerator is 100%.

In the parallel 64-PE testing case, the 64 PEs of the accelerator are programmed to perform different tasks in parallel. The input data of different data blocks are loaded into different PEs. The results are then compared against the corresponding hash outputs computed by the Intel Xeon Gold 6144 CPU. Again, the implementation results show that all 64 PEs calculate correctly, with a correction ratio of 100%.

V. CONCLUSION
The cryptographic hash function SHA-256 is an important process in keeping decentralized blockchain networks secure and preserving data security and integrity in the smart systems of society 5.0. Developing a high-performance SHA-256 circuit compatible with an SoC is thus a requirement. Unfortunately, state-of-the-art SHA-256 architectures have focused only on improving the performance of the SHA-256 core as a stand-alone circuit. The compatibility of the SHA-256 circuit with the SoC as well as the data transfer between the SHA-256 core and the other parts of the SoC have not yet been studied. In this paper, we have solved the problem by developing an SoC-compatible SHA-256 accelerator. Several new techniques such as multiple local memory structures (multimem), a pipelined ALU, and SBi-SBo have been introduced to achieve this purpose. Our experimental results on both FPGA and ASIC have shown that the proposed accelerator is not only compatible with the SoC via an AXI bus and provides an efficient data transfer mechanism but also achieves a significantly better processing rate and hardware efficiency than previous works. The functional behavior of our accelerator has been proven to run correctly on a real SoC FPGA platform, the Zynq UltraScale+ ZCU102. The maximum processing rate of our accelerator in theory is 284 Gbps. Version 64-PE of our accelerator has been successfully laid out with 180 nm CMOS technology with a chip size of 8.5 mm × 8.5 mm, consumes a power of 1.86 W, and provides a maximum processing rate of 40.96 Gbps at an 80 MHz frequency and a 1.8 V voltage.

Obviously, the processing rate of an entire SoC embedding an SHA-256 accelerator is much more important than that of the SHA-256 accelerator alone. This research has proposed an SoC-compatible SHA-256 accelerator and significantly improved the accelerator processing rate. However, the potential of the accelerator may not be sufficiently utilized due to the bottleneck data transfer rate provided by the SoC platform itself. Therefore, we believe that developing new architectures and techniques to overcome the data transfer rate bottleneck of SoC platforms will be an important research trend in this field in the near future.

REFERENCES
[1] U.S. Department of Commerce. (Jul. 2013). Digital Signature Standard (DSS). [Online]. Available: https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.186-4.pdf
[2] National Institute of Standards and Technology, U.S. Department of Commerce. (Jun. 2015). The Keyed-Hash Message Authentication Code (HMAC). [Online]. Available: https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.198-1.pdf
[3] Recommendation for Random Number Generation Using Deterministic Random Bit Generators, 800-90a Revision 1, Nat. Inst. Standards Technol., U.S. Dept. Commerce, Washington, DC, USA, Jun. 2015, doi: 10.6028/NIST.SP.800-90Ar1.
[4] R. S. Kent, IP Authentication Header, document RFC 2404, 1998. [Online]. Available: https://www.ietf.org/rfc/rfc2402.txt
[5] H. L. Pham, T. H. Tran, T. D. Phan, V. T. D. Le, D. K. Lam, and Y. Nakashima, "Double SHA-256 hardware architecture with compact message expander for bitcoin mining," IEEE Access, vol. 8, pp. 139634–139646, 2020, doi: 10.1109/ACCESS.2020.3012581.
[6] R. P. McEvoy, F. M. Crowe, C. C. Murphy, and W. P. Marnane, "Optimisation of the SHA-2 family of hash functions on FPGAs," in Proc. IEEE Comput. Soc. Annu. Symp. Emerg. VLSI Technol. Archit. (ISVLSI), Mar. 2006, pp. 317–322.
[7] H. E. Michail, G. S. Athanasiou, V. Kelefouras, G. Theodoridis, and C. E. Goutis, "On the exploitation of a high-throughput SHA-256 FPGA design for HMAC," ACM Trans. Reconfigurable Technol. Syst., vol. 5, no. 1, pp. 1–28, Mar. 2012.
[8] L. Dadda, M. Macchetti, and J. Owen, "The design of a high speed ASIC unit for the hash function SHA-256 (384, 512)," in Proc. Design, Automat. Test Eur. Conf. Exhib., 2004, pp. 70–75.
[9] R. Chaves, G. Kuzmanov, L. Sousa, and S. Vassiliadis, "Cost-efficient SHA hardware accelerators," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 8, pp. 999–1008, Aug. 2008.
[10] M. Padhi and R. Chaudhari, "An optimized pipelined architecture of SHA-256 hash function," in Proc. 7th Int. Symp. Embedded Comput. Syst. Design (ISED), Dec. 2017, pp. 1–4.
[11] Y. Chen and S. Li, "A high-throughput hardware implementation of SHA-256 algorithm," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Oct. 2020, pp. 1–4.
[12] National Technical Information Service, U.S. Department of Commerce/NIST, Springfield, VA, USA. (2002). FIPS 180-2—Secure Hash Standard. [Online]. Available: http://csrc.nist.gov/publications/fips/fips180-2/fips180-2.pdf


[13] T. Ichikura, R. Yamano, Y. Kikutani, R. Zhang, and Y. Nakashima, "EMAXVR: A programmable accelerator employing near ALU utilization to DSA," in Proc. IEEE Symp. Low-Power High-Speed Chips (COOL CHIPS), Apr. 2018, pp. 1–3.
[14] J. Iwamoto, Y. Kikutani, R. Zhang, and Y. Nakashima, "Daisy-chained systolic array and reconfigurable memory space for narrow memory bandwidth," IEICE Trans. Inf. Syst., vol. E103.D, no. 3, pp. 578–589, Mar. 2020, doi: 10.1587/transinf.2019EDP7144.
[15] D. Phan, T. H. Tran, and Y. Nakashima, "SHA-256 implementation on coarse-grained reconfigurable architecture," in Proc. IEEE Symp. Low-Power High-Speed Chips, Japan, Apr. 2020.
[16] T. H. Tran, S. Kanagawa, D. P. Nguyen, and Y. Nakashima, "ASIC design of MUL-RED radix-2 pipeline FFT circuit for 802.11ah system," in Proc. IEEE Symp. Low-Power High-Speed Chips (COOL CHIPS XIX), Yokohama, Japan, Apr. 2016, pp. 1–3.
[17] Helion Technology. Helion IP Core Products. Accessed: Jan. 10, 2021. [Online]. Available: http://heliontech.com/core.htm
[18] A. Satoh and T. Inoue, "ASIC hardware focused comparison for hash functions MD5, RIPEMD-160, and SHS," in Proc. Int. Conf. Inf. Technol., Coding Comput. (ITCC), Las Vegas, NV, USA, vol. 1, 2005, pp. 532–537, doi: 10.1109/ITCC.2005.92.
[19] R. García, I. Algredo-Badillo, M. Morales-Sandoval, C. Feregrino-Uribe, and R. Cumplido, "A compact FPGA-based processor for the secure hash algorithm SHA-256," Comput. Electr. Eng., vol. 40, no. 1, pp. 194–202, Jan. 2014.
[20] I. Algredo-Badillo, C. Feregrino-Uribe, R. Cumplido, and M. Morales-Sandoval, "Novel hardware architecture for implementing the inner loop of the SHA-2 algorithms," in Proc. 14th Euromicro Conf. Digit. Syst. Design, Oulu, Finland, Aug. 2011, pp. 543–549, doi: 10.1109/DSD.2011.75.
[21] M. M. Wong, V. Pudi, and A. Chattopadhyay, "Lightweight and high performance SHA-256 using architectural folding and 4-2 adder compressor," in Proc. IFIP/IEEE Int. Conf. Very Large Scale Integr. (VLSI-SoC), Verona, Italy, Oct. 2018, pp. 95–100, doi: 10.1109/VLSI-SoC.2018.8644825.
[22] M. Kim, J. Ryou, and S. Jun, "Efficient hardware architecture of SHA-256 algorithm for trusted mobile computing," in Proc. 4th Int. Conf. Inf. Secur. Cryptol., China, Jun. 2009, pp. 240–252, doi: 10.1007/978-3-642-01440-6_19.
[23] K. K. Ting, S. C. L. Yuen, K. H. Lee, and P. H. W. Leong, "An FPGA based SHA-256 processor," in Proc. Int. Conf. Field Program. Log. Appl., 2002, pp. 577–585.
[24] C. Jeong and Y. Kim, "Implementation of efficient SHA-256 hash algorithm for secure vehicle communication using FPGA," in Proc. Int. SoC Design Conf. (ISOCC), Jeju, South Korea, Nov. 2014, pp. 224–225, doi: 10.1109/ISOCC.2014.7087617.
[25] M. D. Rote, N. Vijendran, and D. Selvakumar, "High performance SHA-2 core using the round pipelined technique," in Proc. IEEE Int. Conf. Electron., Comput. Commun. Technol. (CONECCT), Bangalore, India, Jul. 2015, pp. 1–6, doi: 10.1109/CONECCT.2015.7383912.
[26] M. Kammoun, M. Elleuchi, M. Abid, and M. S. BenSaleh, "FPGA-based implementation of the SHA-256 hash algorithm," in Proc. IEEE Int. Conf. Design Test Integr. Micro Nano-Syst. (DTS), Jun. 2020, pp. 1–6.
[27] R. Martino and A. Cilardo, "A flexible framework for exploring, evaluating, and comparing SHA-2 designs," IEEE Access, vol. 7, pp. 72443–72456, 2019.

THI HONG TRAN (Member, IEEE) received the bachelor's degree in physics and the master's degree in microelectronics from the Vietnam National University Ho Chi Minh City University of Science (VNU-HCMUS), Vietnam, in 2008 and 2012, respectively, and the Ph.D. degree in information science from the Kyushu Institute of Technology, Japan, in 2014. Since 2015, she has been working with the Nara Institute of Science and Technology (NAIST), Japan, as a full-time Assistant Professor. Her research interests include digital hardware circuit design and algorithms related to wireless communications, communication security, blockchain technologies, SHA-2, SHA-3, and cryptography. She is currently a Regular Member of IEICE and REV-JEC.

HOAI LUAN PHAM received the bachelor's degree in computer engineering from the University of Information Technology, Vietnam National University-Ho Chi Minh (VNU-HCM), Vietnam, in 2018, and the master's degree in information science from the Nara Institute of Science and Technology (NAIST), Japan, in 2020, where he is currently pursuing the Ph.D. degree. His research interests include blockchain technology and cryptography.

YASUHIKO NAKASHIMA (Senior Member, IEEE) received the B.E., M.E., and Ph.D. degrees in computer engineering from Kyoto University, in 1986, 1988, and 1998, respectively. From 1988 to 1999, he was a Computer Architect with the Computer and System Architecture Department, FUJITSU Ltd. From 1999 to 2005, he was an Associate Professor with the Graduate School of Economics, Kyoto University. Since 2006, he has been a Professor with the Graduate School of Information Science, Nara Institute of Science and Technology. His research interests include computer architecture, emulation, circuit design, and accelerators. He is currently a fellow of IEICE, a Senior Member of IPSJ, and a member of IEEE CS and ACM.
