
Received February 16, 2021, accepted February 28, 2021, date of publication March 2, 2021, date of current version March 16, 2021.


Digital Object Identifier 10.1109/ACCESS.2021.3063485

A High-Performance Multimem SHA-256 Accelerator for Society 5.0
THI HONG TRAN, (Member, IEEE), HOAI LUAN PHAM, AND YASUHIKO NAKASHIMA, (Senior Member, IEEE)
Graduate School of Information Science, Nara Institute of Science and Technology (NAIST), Nara 630-0192, Japan
Corresponding author: Thi Hong Tran ([email protected])
This work was supported by the Japan Science and Technology Agency (JST) under the Strategic Basic Research Program named
Precursory Research for Embryonic Science and Technology (PRESTO) Grant number 2020A031.

ABSTRACT The development of a low-cost high-performance secure hash algorithm (SHA)-256 accelerator has recently received extensive interest because SHA-256 is important in widespread applications, such as cryptocurrencies, data security, data integrity, and digital signatures. Unfortunately, most current research has focused on the performance of the SHA-256 accelerator itself rather than on the system level, at which the data transfer between the external memory and the accelerator occupies a large fraction of the processing time. In this paper, we solve this state-of-the-art problem with a novel SHA-256 architecture named the multimem SHA-256 accelerator that achieves high performance at the system on chip (SoC) level. Notably, our accelerator employs three novel techniques, the pipelined arithmetic logic unit (ALU), multimem processing element (PE), and shift buffer in shift buffer out (SBi-SBo), to reduce the critical path delay and significantly increase the processing rate. Experiments on a field-programmable gate array (FPGA) and an application-specific integrated circuit (ASIC) show that the proposed accelerator achieves a significantly better processing rate and hardware efficiency than previous works. The accelerator accuracy is verified on a real hardware platform (FPGA ZCU102). The accelerator is synthesized and laid out with 180 nm complementary metal oxide semiconductor (CMOS) technology with a chip size of 8.5 mm × 8.5 mm, consumes 1.86 W, and provides a maximum processing rate of 40.96 Gbps at 80 MHz and 1.8 V. With FPGA Xilinx 16 nm FinFET technology, the accelerator processing rate is as high as 284 Gbps.

INDEX TERMS SHA-256, blockchain, society 5.0, cryptographic hash function, SHA-256 accelerator, FPGA, ASIC.

I. INTRODUCTION
Cryptography secure hash algorithm (SHA)-2 plays an important role in developing super smart society 5.0 because it is required for data security and data integrity purposes in many applications, such as the digital signature algorithm (DSA) [1], hash-based message authentication code (HMAC) [2], pseudorandom number generation (PRNG) [3], and message authentication in internet protocol security (IPSec) [4]. Recently, the representative SHA-2 hash family named SHA-256 has become the key component for securing decentralized blockchain networks such as Bitcoin and Ethereum [5]. To secure the network, the double SHA-256 must be repeatedly computed a large number of times until a valid hash value that is smaller than a pre-defined target is found [5]. This results in the well-known disadvantage of the Bitcoin blockchain network of extremely high energy consumption. It was reported that the power consumption of the Bitcoin network reached 64 TWh in 2020, which is even higher than the total energy consumption of several nations such as Switzerland and the Czech Republic. In the field of the Internet of Things (IoT), SHA-256 should be implemented in power-constrained sensors or edge nodes to secure data and/or check the integrity of data. For these reasons, developing a high-performance SHA-256 accelerator has become a research trend in recent years.

Several techniques have been widely used in published works. For instance, unrolled architectures [6], [7] were implemented to improve the processing rate in trade-offs with large areas and high power consumption. Pipelined architectures have been developed in many works, such as [8], [9], and [10], to increase the processing rate of the SHA-256 core. Furthermore, [8] employed the parallel counter technique to reduce the circuit area and critical path delay. Recently, [11] introduced the rescheduling of the SHA-256 round computation method to reduce the critical path delay and improve the processing rate of the SHA-256 accelerator.

The associate editor coordinating the review of this manuscript and approving it for publication was Wei Huang.

Generally, these techniques merely rearranged the operators required for SHA-256 round computation, such as adders, logic gates, and rotations, in an appropriate and efficient way to achieve high performance in terms of the processing rate, circuit area, or critical path delay. In another approach, the work in [5] introduced the idea of a compact message expander (CME) in which the input data characteristic of double SHA-256 was utilized to significantly optimize the hardware cost and power consumption of the circuit for the case of Bitcoin mining. However, the proposed work aimed to support the double SHA-256 only in Bitcoin mining.

The reality is that an SHA-256 circuit is just a component of an embedded system under the control of a microprocessor. The processing time of the SHA-256 circuit includes not only the time for calculating the hash value but also the time for transferring data between the external memory and the SHA-256 circuit. Unfortunately, the aforementioned conventional works focused merely on the calculation time of the SHA-256 circuit. The data transfer time was ignored. An efficient SHA-256 accelerator must provide a high computation rate and be very compatible with embedded systems so that the data transfer can be performed efficiently and quickly.

In this study, we propose a high-performance hardware architecture for SHA-256 by considering both the hash calculation time and the data transfer time. We named the architecture the multimem SHA-256 accelerator because multiple local memories are implemented in each processing element (PE) to temporarily store the input and output data of SHA-256. These memories work appropriately and efficiently to reduce the data transfer time between the accelerator and external memory. Several ideas of interest, such as pipelined arithmetic logic units (ALUs), multimem PEs, and shift buffer in shift buffer out (SBi-SBo), are introduced in this paper.

The remainder of this paper is organized as follows. Section II provides the background of the research. Section III describes our proposed multimem SHA-256 architecture. Section IV reports our evaluation in terms of application-specific integrated circuit (ASIC) and field-programmable gate array (FPGA) experiments. Finally, Section V concludes the paper.

II. BACKGROUND
A. SHA-256 ALGORITHM
The SHA-2 family is a set of cryptographic hash functions designed and published by the US National Institute of Standards and Technology (NIST) in 2002 [12]. The SHA-2 family includes six hash functions named SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224, and SHA-512/256. They are actually the same algorithm but with different word lengths, constant parameters, and initialization values. SHA-256 is known as a representative of the SHA-2 family and is currently applied to secure data in many applications such as decentralized blockchain, DSA, and HMAC. SHA-256 calculates a 256-bit hash value for an input message of 512 bits. A real application may need to calculate the hash value for a very long message. In such cases, the message is divided into many 512-bit data blocks. If the last block is smaller than 512 bits, padding is added. The hash calculation for a long message is shown in Fig. 1. The SHA-256 algorithm computes intermediate hash values for the data blocks one by one, in which the hash result of the current block becomes the input initial hash for hash computing of the next data block. The result of the final data block is considered to be the hash value of the entire message.

FIGURE 1. The generation of a SHA-256 hash value for a long message.

FIGURE 2. The overview operation of the SHA-256 algorithm.

Fig. 2 illustrates the overview operation of the SHA-256 algorithm. It includes two processes named the message expander (ME) and the message compressor (MC). The ME process expands the 512-bit input message into 64 chunks of 32-bit data Wj (0 ≤ j ≤ 63). In the first 16 rounds, the ME parses the 512-bit message into 16 32-bit data chunks (denoted as Wj, j = 0 to 15, where j is the round index). In the final 48 rounds, the ME calculates 48 chunks of 32-bit data Wj (16 ≤ j ≤ 63) following eq. (1). Three 32-bit adders and two logical functions σ0(x) and σ1(x) are needed to compute


Wj (16 ≤ j ≤ 63). σ0(x) and σ1(x) are computed as eq. (2) and eq. (3). Note that S^n(x) and R^n(x) denote the right rotation and the right shift of data x by n bits, respectively.

Wj = σ1(Wj−2) + Wj−7 + σ0(Wj−15) + Wj−16   (1)
σ0(x) = S^7(x) ⊕ S^18(x) ⊕ R^3(x)   (2)
σ1(x) = S^17(x) ⊕ S^19(x) ⊕ R^10(x)   (3)

The MC process computes the 256-bit hash value from the outputs of the ME process (64 chunks of Wj (0 ≤ j ≤ 63)). The process involves two main steps: loops and hash updates. In the loop step, eight loop hash values (denoted a, b, c, d, e, f, g, h) are initialized by the initial hash values H0, H1, ..., H7. The loop hash values a, b, c, d, e, f, g, h are then computed and updated through 64 loops. In each loop (loop j, 0 ≤ j ≤ 63), eqs. (4) to (8) are applied.

T1 = h + Σ1(e) + Ch(e, f, g) + Kj + Wj   (4)
T2 = Σ0(a) + Maj(a, b, c)   (5)
a = T1 + T2   (6)
e = d + T1   (7)
b = a; c = b; d = c; f = e; g = f; h = g   (8)

where the logical functions Σ0(x), Σ1(x), Ch(x, y, z), and Maj(x, y, z) are computed using the following equations.

Σ0(x) = S^2(x) ⊕ S^13(x) ⊕ S^22(x)   (9)
Σ1(x) = S^6(x) ⊕ S^11(x) ⊕ S^25(x)   (10)
Ch(x, y, z) = (x ∧ y) ⊕ (¬x ∧ z)   (11)
Maj(x, y, z) = (x ∧ y) ⊕ (x ∧ z) ⊕ (y ∧ z)   (12)

In the hash update step, the final 256-bit hash value, which is divided into 8 chunks of 32-bit data HO0, HO1, ..., HO7, is computed by adding the initial hashes H0, H1, ..., H7 to the loop hashes a, b, c, d, e, f, g, h, as illustrated in eq. (13).

HO0 = H0 + a; ...; HO7 = H7 + h   (13)
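For readers who prefer pseudocode, the following is a minimal C sketch of eqs. (1)–(13) for a single 512-bit block. It is an illustration rather than the accelerator's implementation; the caller is assumed to supply the 64 standard round constants Kj defined in the SHA-256 specification [12], which are not reproduced here.

```c
/* Minimal sketch of one 512-bit block computation following eqs. (1)-(13). */
#include <stdint.h>

static uint32_t rotr(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }   /* S^n(x) */
static uint32_t shr (uint32_t x, unsigned n) { return  x >> n; }                      /* R^n(x) */

static uint32_t sigma0(uint32_t x) { return rotr(x, 7)  ^ rotr(x, 18) ^ shr(x, 3);  } /* eq. (2)  */
static uint32_t sigma1(uint32_t x) { return rotr(x, 17) ^ rotr(x, 19) ^ shr(x, 10); } /* eq. (3)  */
static uint32_t Sigma0(uint32_t x) { return rotr(x, 2)  ^ rotr(x, 13) ^ rotr(x, 22); }/* eq. (9)  */
static uint32_t Sigma1(uint32_t x) { return rotr(x, 6)  ^ rotr(x, 11) ^ rotr(x, 25); }/* eq. (10) */
static uint32_t Ch (uint32_t x, uint32_t y, uint32_t z) { return (x & y) ^ (~x & z); }          /* eq. (11) */
static uint32_t Maj(uint32_t x, uint32_t y, uint32_t z) { return (x & y) ^ (x & z) ^ (y & z); } /* eq. (12) */

/* H[8]: initial hash H0..H7 on entry, updated hash HO0..HO7 on return.
   M[16]: one 512-bit message block as 16 32-bit words.
   K[64]: standard SHA-256 round constants supplied by the caller. */
void sha256_block(uint32_t H[8], const uint32_t M[16], const uint32_t K[64])
{
    uint32_t W[64];
    for (int j = 0; j < 16; j++)            /* first 16 rounds: parse the block (ME pass-through) */
        W[j] = M[j];
    for (int j = 16; j < 64; j++)           /* last 48 rounds: message expansion, eq. (1) */
        W[j] = sigma1(W[j - 2]) + W[j - 7] + sigma0(W[j - 15]) + W[j - 16];

    uint32_t a = H[0], b = H[1], c = H[2], d = H[3];
    uint32_t e = H[4], f = H[5], g = H[6], h = H[7];

    for (int j = 0; j < 64; j++) {          /* 64 loops of the MC process, eqs. (4)-(8) */
        uint32_t T1 = h + Sigma1(e) + Ch(e, f, g) + K[j] + W[j];
        uint32_t T2 = Sigma0(a) + Maj(a, b, c);
        h = g; g = f; f = e; e = d + T1;
        d = c; c = b; b = a; a = T1 + T2;
    }

    /* hash update step, eq. (13) */
    H[0] += a; H[1] += b; H[2] += c; H[3] += d;
    H[4] += e; H[5] += f; H[6] += g; H[7] += h;
}
```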
B. SHA-256 CHARACTERISTICS AND PRELIMINARY IDEA FOR THE HIGH PERFORMANCE ACCELERATOR
There are three characteristic points of SHA-256 that should be noted. First, the SHA-256 algorithm needs only low-cost arithmetic logic operators such as adders, shifts, rotations, and XORs. No complex operators such as multipliers and dividers are required. Second, the number of operators (NOTs, XORs, adders, shifts, rotations, etc.) per loop calculation is significantly large (approximately 50 operators per loop). Third, data dependence among the loops is present. This means that the current loop needs the result of the previous loops for calculation. In addition, some common data are required in multiple loops. Because of these characteristics, the current high-performance CPU (Intel multicore) and GPU (Tesla V100, GTX 1080, etc.) platforms do not perform well in calculating the SHA-256 algorithm. Although these general purpose platforms are rich in hardware resources, they still need a large number of clock cycles to compute a single loop of SHA-256, which results in a low processing rate. One of the main reasons is that memory blocks such as double data random access memory (DDRAM) and the L1 and L2 caches of these platforms are placed far from the calculation unit, and the data transfer time between the memories and calculation units thus constitutes a large amount of the total processing time. Furthermore, most of the existing hardware resources of CPUs and GPUs, such as multipliers and exponentials, are useless for SHA-256 but still consume energy. Despite these disadvantages, multicore CPUs and GPUs are currently considered to be the most applicable hardware platforms for calculating SHA-256 in Bitcoin mining and other blockchain networks.

In another approach, the systolic array-based accelerator named EMAXVR in [13] and its improved version in [14] were applied to reduce the data access time by implementing local memory near the ALU. Although this platform achieves high performance on image processing and AI learning [13], [14], its performance for computing SHA-256 is very poor [15]. The key reason for the low processing rate is the data-dependent characteristic of the SHA-256 algorithm, as mentioned above. The processing element (PE) architecture of EMAXVR is not yet suitable for SHA-256 computation; therefore, a large number of PEs are required to compute a single hash loop. Furthermore, the data dependence among the loops requires 1) multiple copies of data to be stored in different PEs, which results in low hardware efficiency, and 2) external memory data transfer after the completion of a loop, which results in a low processing rate.

Studying the weaknesses of the existing hardware platforms for SHA-256 and deeply understanding the characteristics of SHA-256, we propose a novel multimem SHA-256 accelerator that achieves high performance in both hardware efficiency and processing rate. We propose multiple local memory structures near the ALU for reducing the data access time and improving hardware efficiency. Furthermore, we design a processing element (PE) architecture that is able to compute all loops of the hash function without requiring data from the nearby PEs. Thus, the PEs can work independently and in parallel, and no wire connections are required among the PEs. As a result, the accelerator does not need to frequently request data from external memories, and the data access time is significantly shortened. Furthermore, the hardware efficiency and processing rate of the SHA-256 accelerator are expected to be remarkably enhanced.

III. THE PROPOSED MULTIMEM SHA-256 ACCELERATOR
A. OVERVIEW OF THE ARCHITECTURE
Fig. 3 shows the overview architecture of our proposed accelerator. It can be employed in any embedded system via an advanced extensible interface (AXI) bus. The CPU is the main microprocessor controlling the operation of the entire embedded system. Basically, the CPU is busy with many tasks, including the control of the hash calculation inside the SHA-256 accelerator. In the task of controlling the accelerator, the CPU must send commands to transfer data from DDR


memory to the SHA-256 accelerator via data buses (such as advanced microcontroller bus architecture (AMBA) buses and AXI buses), reconfigure the accelerator, and send commands to read results from the accelerator to the DDR memory. The data transfer between the accelerator and the DDR memory is basically much slower than the data transfer inside the accelerator. In addition, the data buses are not dedicated to only the operation of the SHA-256 accelerator. They may not immediately respond to a request from the accelerator and may make the accelerator wait for data. Therefore, we implement multiple memory blocks inside the accelerator to reduce the number of external DDR memory requests and to increase the accelerator processing rate.

FIGURE 3. Overview of the architecture of the proposed multimem SHA-256 accelerator.

The main component of the accelerator is an array of PEs responsible for computing the SHA-256 hash value. In addition, it has an AXI bus interface (AXI IF), a global memory (GRAM) for storing global parameters such as the accelerator configuration parameters, and a controller (CTRL) providing control signals for the PEs, AXI IF, and GRAM.

FIGURE 4. Block diagram of a processing element (PE).

Fig. 4 shows the internal architecture of a PE. It includes a pipelined ALU, multiple memory structures named WRAM and HRAM, and shift buffer in (SBi1, SBi2) shift buffer out (SBo1, SBo2) blocks. The pipelined ALU includes the arithmetic logic operators required for computing all loops of the hash function. The WRAM and HRAM blocks include local memories storing the input and output data of the PE so that the PE can compute hash values of a large number of SHA-256 functions rapidly without requesting data from the external DDR memory. The SBi1, SBi2, SBo1, and SBo2 temporarily store and rearrange the data flow between the pipelined ALU and the WRAM/HRAM blocks. All these blocks cooperate to achieve the following two things: 1) a PE can compute all the loops, and 2) the hardware efficiency of the ALU is 100%.

In general, a PE of our accelerator computes all 64 loops of SHA-256 within 64 clock cycles on average at a latency of 64 × 4 = 256 clock cycles. In other words, the accelerator including 64 PEs can compute a hash value (64 loops) per clock cycle. Furthermore, the memory resources of WRAM and HRAM are efficiently allocated so that the 64-PE accelerator can uninterruptedly compute an amount of 64 × 4 × L functions without further requesting data from external memory, in which L is a pre-defined positive integer parameter.
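As a quick cross-check (not stated as a formula in the paper, but implied by the one-block-per-64-cycles and one-block-per-cycle figures above), the processing rates scale with the clock frequency as

```latex
% Rates implied by one 512-bit block per 64 cycles (1 PE) or per clock cycle (64 PEs)
R_{\mathrm{1PE}}  = \frac{512~\text{bits}}{64~\text{cycles}} \cdot f_{\mathrm{clk}} = 8\, f_{\mathrm{clk}},
\qquad
R_{\mathrm{64PE}} = 512~\text{bits} \cdot f_{\mathrm{clk}}.
```

For example, R_64PE = 512 bits × 80 MHz = 40.96 Gbps and 512 bits × 272 MHz ≈ 139 Gbps, matching the ASIC figures reported in Section IV; the 284 Gbps FPGA figure quoted in the abstract corresponds to a clock of roughly 555 MHz.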
The AXI bus and the PEs are designed to work in a pipelined and parallel manner, as shown in Fig. 5. When the AXI bus is reading/writing data from/to a PE, the other PEs of the accelerator are executing the hash calculation. As a result, the data transfer time between the accelerator and the external DDR memory does not affect the entire system processing rate (if the embedded system is equipped with a fast enough AXI bus).

FIGURE 5. Timing chart of the entire multimem SHA-256 accelerator.

B. PIPELINED ALU ARCHITECTURE
As shown in Fig. 4, the pipelined ALU includes two blocks named ME and MC that compute the ME and MC processes of the SHA-256 function, respectively. Fig. 6 shows the circuit inside the pipelined ALU. To achieve a high maximum frequency and high processing throughput, the ME block is designed to compute the ME process in 2 pipelined clock cycles named ME-1 and ME-2, while the MC block computes the MC process in 4 pipelined clock cycles named MC-1, MC-2, MC-3, and MC-4. The two stages of the ME process run in parallel with the first two stages of the MC (named MC-1 and MC-2). The last two stages of the MC receive the output data from the ME block as well as the outputs of the previous stages (MC-1 and MC-2) to complete the loop calculation.

With this architecture, the ALU can compute a hash function loop in 4 pipelined clock cycles. It then feeds back the result of the current loop to recursively compute the next loop.


FIGURE 6. The pipelined arithmetic logic unit (ALU) architecture.

Therefore, all 64 loops can be completed in a single ALU. To efficiently use the hardware resources of the ALU (100% hardware efficiency), four data flows from the local memories WRAM and HRAM should be input to the 4 pipelined ALU stages. The operation of the ME block is as follows. In the first 16 loops (loop j, 0 ≤ j ≤ 15), the ME block merely passes the input data Wj to its output port. In the last 48 loops (loop j, 16 ≤ j ≤ 63), the ME block computes Wj from the four input data Wj−16, Wj−15, Wj−7, Wj−2 following eq. (1) to eq. (3). The result Wj is then 1) passed to the MC-3 stage of the MC block for further computing and 2) written into the WRAM for future use. The operation of the MC block is as follows. In the first loop (loop j = 0), the loop hash values a, b, . . . , and h are assigned to the initial hash values H0, H1, . . . , and H7 read from the HRAM. The values of a, b, . . . , and h are then computed and updated in the four pipelined stages MC-1, MC-2, MC-3, and MC-4 following eq. (4) to eq. (12). Because four data flows share the same hardware resources of the ALU, all the ALU stages are always busy. This means that we achieve 100% hardware efficiency for the ALU resources. By the time the MC-4 stage is computing for data flow 1, MC-3 computes for data flow 2, MC-2 for data flow 3, and MC-1 for data flow 4.

FIGURE 7. The timing chart of the ME and MC blocks inside the pipelined ALU when 4 data flows are processed.

Fig. 7 illustrates the timing chart of the ME and MC blocks when loop 0 and loop 1 are computed for the four hash functions F1, F2, F3, and F4. The updated values of a, b, . . . , and h after the MC-4 stage are then fed back to become the inputs of the MC-1 stage. The calculation for the next loop (loop j = 1) is then started. This operation is repeated until all 64 loops of the hash function are completed. The values of a, b, . . . , and h after the completion of 64 loops are transferred to SBo2 to be written into the HRAM.
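The round-robin interleaving described above can be pictured with a tiny sketch. The stage and flow numbering below is assumed from the text (ME-1/ME-2 are omitted); the point is simply that in steady state every MC stage processes a different flow each cycle, so the ALU never idles.

```c
/* Sketch of the 4-flow round-robin schedule: which flow occupies which MC stage per cycle. */
#include <stdio.h>

int main(void)
{
    const char *stage[4] = { "MC-1", "MC-2", "MC-3", "MC-4" };
    for (int cycle = 4; cycle < 12; cycle++) {   /* print a few steady-state cycles */
        printf("cycle %2d:", cycle);
        for (int s = 0; s < 4; s++) {
            int flow = (cycle - s) % 4;          /* flow occupying stage s this cycle */
            printf("  %s:F%d", stage[s], flow + 1);
        }
        printf("\n");
    }
    return 0;
}
```

At cycles where MC-4 holds F1, the printed line shows MC-3 holding F2, MC-2 holding F3, and MC-1 holding F4, which matches the example given in the text.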
C. MULTIPLE LOCAL MEMORY STRUCTURES WRAM AND HRAM
The local memories WRAM and HRAM store the input and output data of the ALU. While the WRAM stores the message words Wj (0 ≤ j ≤ 63), the HRAM stores the hash values H0, . . . , and H7. Because the ALU requires 4 data flows to efficiently use the hardware resources, we implement 4 identical RAM blocks in both the WRAM and the HRAM, named WM1, WM2, WM3, WM4, and HM1, HM2, HM3, HM4 (see Fig. 4). Two-port RAMs are used because the RAMs may need to simultaneously read and write data.

FIGURE 8. The memory map of a RAM block inside a) the WRAM and b) the HRAM.

Fig. 8a and b show the size of each RAM block on the WRAM and HRAM sides, respectively. Each block of the WRAM, such as WM1 and WM2, is sized L × 64 × 32 bits and can simultaneously store the 64 words Wj (0 ≤ j ≤ 63) of L hash functions, where L × 16 × 32 bits, equivalent to the 16 words Wj (0 ≤ j ≤ 15) of L hash functions, are written into each RAM before starting the hash calculation. The remaining L × 48 × 32 bits, equivalent to the 48 words Wj (16 ≤ j ≤ 63) of L hash functions, are written during the hash calculation by the ALU. Note that L is a parameter that can be changed before fabricating the chip.
HRAM Each block of HRAM, such as HM1 and HM2, is sized
The local memories WRAM and HRAM store the input and L × 8 × 32 bits and can store 8 chunks of 32-bit hash values
output data of the ALU. While WRAM stores the message H0 , . . . , and H7 of L hash functions. Our accelerator can
words Wj (0 ≤ j ≤ 63), HRAM stores the hash values H0 , work in two different modes. The first mode computes the
. . . , and H7 of the message. Because the ALU requires 4 data hash values of independent hash functions. In this mode,


the initial hash value of the L hash functions is written into all blocks of the HRAM before starting the hash calculation. The content of the HRAM blocks is then overwritten by the final hash values calculated by the ALU. These values are read out of the accelerator after the hash calculation of all L hash functions is completed. The second mode computes the hash value of a long message divided into L 512-bit data blocks, as illustrated in Fig. 1. In this mode, only the initial hash value of the first data block is written into the 4 RAM blocks of the HRAM before the accelerator is activated. During the hash calculation, the output hash value of a data block is written into the HRAM to be used as the initial hash value of the next adjacent data block. The output hash value of the final data block is the hash value of the entire message, which is then read out of the accelerator.
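The second mode is the block chaining already shown in Fig. 1; in software terms it is simply the loop below, a sketch reusing the hypothetical sha256_block() from Section II rather than the accelerator's firmware.

```c
/* Sketch of the long-message mode: the output hash of block i becomes the
   initial hash of block i+1, as described above. */
#include <stdint.h>

void sha256_block(uint32_t H[8], const uint32_t M[16], const uint32_t K[64]);

/* blocks[L][16]: the padded message split into L 512-bit blocks.
   H[8] holds the standard initial hash on entry and the final digest on return. */
void sha256_long_message(uint32_t H[8], const uint32_t blocks[][16],
                         unsigned L, const uint32_t K[64])
{
    for (unsigned i = 0; i < L; i++)
        sha256_block(H, blocks[i], K);   /* intermediate hash chains into the next block */
}
```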

D. SHIFT BUFFER IN SHIFT BUFFER OUT (SBi-SBo)
The ALU requires multiple input data, such as Wj−16, Wj−15, Wj−7, Wj−2, and a, . . . , h, at the same time. However, each RAM block is able to read out only one data point per clock cycle. Similarly, the ALU simultaneously outputs many data values, such as the updated versions of a, . . . , and h, but only one data point can be written into a block of the HRAM per clock cycle. We thus cannot directly connect the data flows between the WRAM/HRAM and the ALU. The SBi and SBo are required to collect and schedule the timing of the data flows.

FIGURE 9. SBi1 and SBi2 architectures.

Fig. 9a and b show the SBi1 and SBi2 architectures that connect the outputs of the WRAM and the HRAM to the ALU inputs, respectively. SBi1 includes 4 shift buffers named SBi1-1, SBi1-2, SBi1-3, and SBi1-4 receiving data flows from WM1, WM2, WM3, and WM4 of the WRAM, respectively. Each shift buffer has 4 32-bit shift registers that receive data from the WRAM and then shift the data to the right once per clock cycle. The timings of the 4 data flows from the 4 RAM blocks WM1, . . . , WM4 are interleaved, as shown in Fig. 10. The multiplexer MUX-1 then selects the appropriate dataset from the 4 data flows, one per clock cycle, to input to the ALU.

FIGURE 10. Timing chart of SBi1.

Similarly, SBi2, shown in Fig. 9b, has 4 shift buffers named SBi2-1, SBi2-2, SBi2-3, and SBi2-4 receiving data flows from HM1, HM2, HM3, and HM4 of the HRAM, respectively. Each shift buffer has 8 32-bit shift registers that receive the data a, . . . , h from the HRAM and then shift the data to the right once per clock cycle. The multiplexer MUX-2 then selects the appropriate dataset from one of the four data flows to give to the ALU.

SBo1 delivers the result Wj of the ME block to the corresponding RAMs of the WRAM. Because only one value Wj is generated per clock cycle, shift buffers are not required, and only a 1-to-4 multiplexer is needed inside SBo1.

FIGURE 11. SBo2 architecture.

SBo2 has two main functions. First, it collects and schedules the loop hash words a, . . . , h output from the ALU after 64 loops are complete. Second, it calculates the final hash values HO0, . . . , HO7 by adding the loop hash words a, . . . , h and the initial hash values H0, . . . , H7 following eq. (13). Fig. 11 shows the inside of SBo2. It includes four shift buffers on the left named SBo2-l1, SBo2-l2, SBo2-l3, and SBo2-l4 and four shift buffers on the right named SBo2-r1, SBo2-r2, SBo2-r3, and SBo2-r4. Each shift buffer has 8 32-bit shift registers. The buffers on the left (SBo2-l1, SBo2-l2, SBo2-l3, SBo2-l4) store the ALU outputs a, . . . , h, while the buffers on the right (SBo2-r1, SBo2-r2, SBo2-r3, SBo2-r4) receive the initial hash values H0, . . . , H7 from SBi2. Each left-right set of buffers aims to process one data flow. The content of the left-right set of buffers is shifted to the left (direction of the most significant register) once per


clock cycle. The values from the most significant registers are read out and added. The addition results become the final hash words HO0, . . . , HO7 that are written into the corresponding RAM blocks (HM1, . . . , HM4) of the HRAM.

FIGURE 12. Timing chart of SBo2.

Fig. 12 illustrates the timing chart of SBo2. First, the ALU output data, such as a, . . . , and h, of the 4 data flows are stored in the 8 registers of the shift buffers SBo2-l1 to SBo2-l4 in 4 interleaved clock cycles. The values of the most significant register (for example SBo2-l1[7]), which are a, b, c, d, e, f, g, and h at clock cycles #1, #2, #3, #4, #5, #6, #7, and #8, respectively, are read out to be added to the initial hash values H0, . . . , H7 before writing into the HRAM (for example HM1). It needs 8 clock cycles to complete the writing of the 8 32-bit hash words for one data flow and 8 + 3 = 11 clock cycles to complete the writing for the 4 data flows in pipelined mode.

IV. EVALUATION
In this section, we evaluate the performance of our proposed multimem SHA-256 accelerator from both ASIC and FPGA experimental results.

TABLE 1. Comparison based on ASIC results.

A. ASIC EXPERIMENT
The proposed multimem SHA-256 circuits for the cases of 1 PE and 64 PEs were coded in Verilog and synthesized for an ASIC using the Synopsys Design Compiler with the Rohm 0.18 µm CMOS standard cell library [16]. Table 1 shows the synthesized area of the SHA-256 circuit in comparison with those of other works. To ensure a fair comparison, we provide the processing throughput in terms of both the number of clocks per hash function and Mbps. We also normalize the processing throughput (Mbps) and hardware efficiency of other works to the same 180 nm technology. The proposed architecture in the cases of both 1 PE and 64 PEs exhibits significantly better processing throughput and hardware efficiency. Compared with the works in [17] and [18], which reported only the ALU circuit results, our ALU generates a processing rate as high as 2,930 Mbps, which is 1.7 times and 1.9 times faster than the works in [17] and [18], respectively. Our ALU hardware efficiency is 123 Kbps/gate, which is 1.1 times and 1.4 times better than those of [17] and [18], respectively.

The proposed architecture with only 1 PE (No RAM PE) provides a processing rate as high as 2,431 Mbps, which is 12.3 times faster than that of the work in [22]. Its hardware efficiency is 43 Kbps/gate, which is almost 2 times better than that of [22]. Notably, our architecture with 64 PEs achieves the best processing rate. It provides up to 139 Gbps at a frequency of 272 MHz, which is 70 times faster than [22]. Note that the term "No RAM" refers to the synthesis result excluding the local memory resources inside the HRAM and WRAM.

Furthermore, we successfully laid out the proposed SHA-256 circuit in the case of 64 PEs in ASIC technology with the Rohm 180 nm CMOS standard cell library. The size of the layout chip is 8.5 mm × 8.5 mm, and the chip processing rate is 40.96 Gbps when a global operating voltage of 1.8 V and an operating frequency of 80 MHz are applied.


TABLE 2. A comparison of FPGA synthesis results.

B. FPGA EXPERIMENT
1) FPGA SYNTHESIS RESULTS
To ensure a fair comparison with other existing SHA-256 architectures such as [11], [19], and [20], we synthesized the proposed SHA-256 circuits on several Xilinx FPGA boards, including Virtex 2, E, 4, 5, 6, 7, and Kintex UltraScale. The comparison factors include the circuit area, processing rate, and hardware efficiency.

The results are shown in Table 2. Note that among the previous works, some reported only the results of the ALU part, some reported the results of full accelerators including the ALU, controller, memories, etc., and some reported the full accelerator except the memory that temporarily stores the input/output data. To ensure a fair comparison with other works, we synthesized two versions (1-PE and 64-PEs) of our accelerator with many options such as "only the ALU", "no RAM", and "full". In general, both versions 1-PE and 64-PEs of our proposed SHA-256 accelerator outperform the existing SHA-256 architectures in terms of processing rate and hardware efficiency. Some typical numerical results are as follows.

On the Virtex 2 device, the ALU circuit of our accelerator version 1-PE generates 1 hash value every 64 clock cycles, which is equivalent to a processing rate of 1,989 Mbps at a maximum frequency of 249 MHz. The processing rate of our accelerator is 30.6 times, 2.4 times, 2 times, and 3.7 times


better than the architectures in [6], [19], [20], and [21], respectively. The hardware efficiency of our accelerator version 1-PE is 1,864 Kbps/LUT, which is 12.4 times, 2.5 times, 2.5 times, and 1.76 times higher than that of the architectures in [6], [19], [20], and [21], respectively.

On the Virtex E device, the processing rate and hardware efficiency of the full circuit of our accelerator version 1-PE are 7.6 times (662 vs. 87 Mbps) and 2.7 times (183 vs. 69 Kbps/LUT) higher than those of the architecture in [22], respectively.

On the Virtex 4 device, compared to that of the architectures in [10], [11], [19], and [21], the performance of our accelerator version 1-PE is 34 times, 1.6 times, 2.3 times, and 3.5 times better in terms of processing rate and 13.5 times, 1.4 times, 1.3 times, and 1.6 times better in terms of hardware efficiency, respectively.

On the Virtex 5 device, the processing rate of our architecture version 1-PE is 28.5 times and 3.1 times better than those of the architectures in [19] and [21], respectively. Our circuit hardware efficiency is improved by 5.1 times and 1.1 times compared to that of the architectures in [19] and [21], respectively.

On the Virtex 6 device, the processing rate and hardware efficiency of the ALU part of our accelerator version 1-PE are 2 times and 2.5 times higher than those of the architecture in [25], respectively.

On the Virtex 7 device, the processing rate and hardware efficiency of our accelerator version 1-PE are 1.9 times and 5 times higher than those of the architecture in [26], respectively.

On the Kintex UltraScale+ device, our accelerator version 1-PE achieves almost the same performance as that of the work in [27]. However, our accelerator version 64-PE obtains a significantly better processing rate, 49 times that of the aforementioned study (284,448 vs. 5,829 Mbps).

Overall, our proposed SHA-256 accelerator obtains a significantly better processing rate and hardware efficiency and has the ability to scale the trade-off between the processing rate and hardware cost. This means that our accelerator processing rate can be changed according to the requirements of the real system.

2) FUNCTIONAL VERIFICATION ON A REAL HARDWARE PLATFORM
To prove that the circuit operates correctly not only in the software simulation tool but also on real hardware, we built a system on chip (SoC) platform to execute the proposed SHA-256 circuit including 64 PEs. The SoC platform overview is shown in Fig. 13, and the real image of the platform is captured in Fig. 14.

FIGURE 13. SoC-based multimem SHA-256 diagram.

FIGURE 14. SoC-based multimem SHA-256 experiment with real devices.

The platform consists of two main devices: a host PC and a Zynq UltraScale+ ZCU102 evaluation board. The two


devices connect and exchange data via JTAG and UART cables. The ZCU102 board comprises an ARMv8 Cortex-A53 microprocessor, programmable logic (PL), and a clock generator. Our multimem SHA-256 is developed in the PL part and controlled by the ARMv8 microprocessor. The clock generator generates a clock speed of 100 MHz to operate the ARMv8 microprocessor and our accelerator embedded in the PL.

The host PC runs two Xilinx tools: Vivado and a Software Development Kit (SDK). We use Vivado to design and load the SHA-256 circuit onto the Zynq UltraScale+ ZCU102 board. To control the operation of the SHA-256 accelerator, we use the SDK to embed C code onto the ARMv8 microprocessor. The C code runs tasks such as transferring data between the external memory and the SHA-256 accelerator, activating the operation of the accelerator, and verifying the accelerator results.
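The control flow of that C code can be pictured with the following sketch. It is illustrative only: the base address, register offsets, and names below are hypothetical placeholders, not the accelerator's documented programming interface.

```c
/* Bare-metal control-flow sketch: load input words, start the accelerator,
   wait for completion, read the hash back. All addresses/offsets are placeholders. */
#include <stdint.h>

#define ACC_BASE    0xA0000000UL                 /* placeholder AXI base address      */
#define REG(off)    (*(volatile uint32_t *)(ACC_BASE + (off)))
#define REG_CTRL    0x00                         /* placeholder: bit 0 = start        */
#define REG_STATUS  0x04                         /* placeholder: bit 0 = done         */
#define WRAM_OFF    0x1000                       /* placeholder: input word window    */
#define HRAM_OFF    0x2000                       /* placeholder: output hash window   */

void run_hash_job(const uint32_t *w_in, unsigned n_words, uint32_t hash_out[8])
{
    for (unsigned i = 0; i < n_words; i++)       /* 1) transfer message words over AXI  */
        REG(WRAM_OFF + 4 * i) = w_in[i];

    REG(REG_CTRL) = 1;                           /* 2) activate the accelerator         */
    while ((REG(REG_STATUS) & 1) == 0)           /* 3) wait until the hash job finishes */
        ;

    for (unsigned i = 0; i < 8; i++)             /* 4) read back the 256-bit hash       */
        hash_out[i] = REG(HRAM_OFF + 4 * i);
}
```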
In our experiment, we develop two testing cases named individual PE and parallel 64-PE, as shown in Fig. 13. In the individual PE testing case, all 64 PEs of the accelerator are programmed to perform the same task. They are all programmed to compute the hash values for 100,000 independent data blocks. The results from each PE are then compared against the correct 100,000 hash values computed by an Intel Xeon Gold 6144 CPU. Our implementation results have shown that all 64 PEs work correctly and generate hash values matching those computed by the Xeon CPU. In other words, the correction ratio of our accelerator is 100%.

In the parallel 64-PE testing case, the 64 PEs of the accelerator are programmed to perform different tasks in parallel. The input data of different data blocks are loaded into different PEs. The results are then compared against the corresponding hash outputs computed by the Intel Xeon Gold 6144 CPU. Again, the implementation results show that all 64 PEs calculate correctly, with a correction ratio of 100%.

V. CONCLUSION
The cryptographic hash function SHA-256 is an important process in keeping decentralized blockchain networks secure and preserving data security and integrity in the smart systems of society 5.0. Developing a high-performance SHA-256 circuit compatible with an SoC is thus a requirement. Unfortunately, state-of-the-art SHA-256 architectures have focused only on improving the performance of the SHA-256 core as a stand-alone circuit. The compatibility of the SHA-256 circuit with the SoC as well as the data transfer between the SHA-256 core and the other parts of the SoC have not yet been studied. In this paper, we have solved the problem by developing an SoC-compatible SHA-256 accelerator. Several new techniques such as multiple local memory structures (multimem), a pipelined ALU, and SBi-SBo have been introduced to achieve this purpose. Our experimental results on both FPGA and ASIC have shown that the proposed accelerator is not only compatible with the SoC via an AXI bus and provides an efficient data transfer mechanism but also achieves a significantly better processing rate and hardware efficiency than previous works. The functional behavior of our accelerator has been proven to run correctly on a real SoC FPGA platform, the Zynq UltraScale+ ZCU102. The maximum processing rate of our accelerator in theory is 284 Gbps. Version 64-PE of our accelerator has been successfully laid out with 180 nm CMOS technology with a chip size of 8.5 mm × 8.5 mm, consumes a power of 1.86 W, and provides a maximum processing rate of 40.96 Gbps at an 80 MHz frequency and a 1.8 V voltage.

Obviously, the processing rate of an entire SoC embedding an SHA-256 accelerator is much more important than that of the SHA-256 accelerator alone. This research has proposed an SoC-compatible SHA-256 accelerator and significantly improved the accelerator processing rate. However, the potential of the accelerator may not be sufficiently utilized due to the bottleneck data transfer rate provided by the SoC platform itself. Therefore, we believe that developing new architectures and techniques to overcome the data transfer rate bottleneck of SoC platforms will be an important research trend in this field in the near future.

REFERENCES
[1] U.S. Department of Commerce. (Jul. 2013). Digital Signature Standard (DSS). [Online]. Available: https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.186-4.pdf
[2] National Institute of Standards and Technology, U.S. Department of Commerce. (Jun. 2015). The Keyed-Hash Message Authentication Code (HMAC). [Online]. Available: https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.198-1.pdf
[3] Recommendation for Random Number Generation Using Deterministic Random Bit Generators, 800-90a Revision 1, Nat. Inst. Standards Technol., U.S. Dept. Commerce, Washington, DC, USA, Jun. 2015, doi: 10.6028/NIST.SP.800-90Ar1.
[4] R. S. Kent, IP Authentication Header, document RFC 2404, 1998. [Online]. Available: https://www.ietf.org/rfc/rfc2402.txt
[5] H. L. Pham, T. H. Tran, T. D. Phan, V. T. D. Le, D. K. Lam, and Y. Nakashima, "Double SHA-256 hardware architecture with compact message expander for bitcoin mining," IEEE Access, vol. 8, pp. 139634–139646, 2020, doi: 10.1109/ACCESS.2020.3012581.
[6] R. P. McEvoy, F. M. Crowe, C. C. Murphy, and W. P. Marnane, "Optimisation of the SHA-2 family of hash functions on FPGAs," in Proc. IEEE Comput. Soc. Annu. Symp. Emerg. VLSI Technol. Archit. (ISVLSI), Mar. 2006, pp. 317–322.
[7] H. E. Michail, G. S. Athanasiou, V. Kelefouras, G. Theodoridis, and C. E. Goutis, "On the exploitation of a high-throughput SHA-256 FPGA design for HMAC," ACM Trans. Reconfigurable Technol. Syst., vol. 5, no. 1, pp. 1–28, Mar. 2012.
[8] L. Dadda, M. Macchetti, and J. Owen, "The design of a high speed ASIC unit for the hash function SHA-256 (384, 512)," in Proc. Design, Automat. Test Eur. Conf. Exhib., 2004, pp. 70–75.
[9] R. Chaves, G. Kuzmanov, L. Sousa, and S. Vassiliadis, "Cost-efficient SHA hardware accelerators," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 8, pp. 999–1008, Aug. 2008.
[10] M. Padhi and R. Chaudhari, "An optimized pipelined architecture of SHA-256 hash function," in Proc. 7th Int. Symp. Embedded Comput. Syst. Design (ISED), Dec. 2017, pp. 1–4.
[11] Y. Chen and S. Li, "A high-throughput hardware implementation of SHA-256 algorithm," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Oct. 2020, pp. 1–4.
[12] National Technical Information Service, U.S. Department of Commerce/NIST, Springfield, VA, USA. (2002). FIPS 180-2—Secure Hash Standard. [Online]. Available: http://csrc.nist.gov/publications/fips/fips180-2/fips180-2.pdf


[13] T. Ichikura, R. Yamano, Y. Kikutani, R. Zhang, and Y. Nakashima, "EMAXVR: A programmable accelerator employing near ALU utilization to DSA," in Proc. IEEE Symp. Low-Power High-Speed Chips (COOL CHIPS), Apr. 2018, pp. 1–3.
[14] J. Iwamoto, Y. Kikutani, R. Zhang, and Y. Nakashima, "Daisy-chained systolic array and reconfigurable memory space for narrow memory bandwidth," IEICE Trans. Inf. Syst., vol. E103.D, no. 3, pp. 578–589, Mar. 2020, doi: 10.1587/transinf.2019EDP7144.
[15] D. Phan, T. H. Tran, and Y. Nakashima, "SHA-256 implementation on coarse-grained reconfigurable architecture," in Proc. IEEE Symp. Low-Power High-Speed Chips, Japan, Apr. 2020.
[16] T. H. Tran, S. Kanagawa, D. P. Nguyen, and Y. Nakashima, "ASIC design of MUL-RED radix-2 pipeline FFT circuit for 802.11ah system," in Proc. IEEE Symp. Low-Power High-Speed Chips (COOL CHIPS XIX), Yokohama, Japan, Apr. 2016, pp. 1–3.
[17] Helion Technology. Helion IP Core Products. Accessed: Jan. 10, 2021. [Online]. Available: http://heliontech.com/core.htm
[18] A. Satoh and T. Inoue, "ASIC hardware focused comparison for hash functions MD5, RIPEMD-160, and SHS," in Proc. Int. Conf. Inf. Technol., Coding Comput. (ITCC), Las Vegas, NV, USA, vol. 1, 2005, pp. 532–537, doi: 10.1109/ITCC.2005.92.
[19] R. García, I. Algredo-Badillo, M. Morales-Sandoval, C. Feregrino-Uribe, and R. Cumplido, "A compact FPGA-based processor for the secure hash algorithm SHA-256," Comput. Electr. Eng., vol. 40, no. 1, pp. 194–202, Jan. 2014.
[20] I. Algredo-Badillo, C. Feregrino-Uribe, R. Cumplido, and M. Morales-Sandoval, "Novel hardware architecture for implementing the inner loop of the SHA-2 algorithms," in Proc. 14th Euromicro Conf. Digit. Syst. Design, Oulu, Finland, Aug. 2011, pp. 543–549, doi: 10.1109/DSD.2011.75.
[21] M. M. Wong, V. Pudi, and A. Chattopadhyay, "Lightweight and high performance SHA-256 using architectural folding and 4-2 adder compressor," in Proc. IFIP/IEEE Int. Conf. Very Large Scale Integr. (VLSI-SoC), Verona, Italy, Oct. 2018, pp. 95–100, doi: 10.1109/VLSI-SoC.2018.8644825.
[22] M. Kim, J. Ryou, and S. Jun, "Efficient hardware architecture of SHA-256 algorithm for trusted mobile computing," in Proc. 4th Int. Conf. Inf. Secur. Cryptol., China, Jun. 2009, pp. 240–252, doi: 10.1007/978-3-642-01440-6_19.
[23] K. K. Ting, S. C. L. Yuen, K. H. Lee, and P. H. W. Leong, "An FPGA based SHA-256 processor," in Proc. Int. Conf. Field Program. Log. Appl., 2002, pp. 577–585.
[24] C. Jeong and Y. Kim, "Implementation of efficient SHA-256 hash algorithm for secure vehicle communication using FPGA," in Proc. Int. SoC Design Conf. (ISOCC), Jeju, South Korea, Nov. 2014, pp. 224–225, doi: 10.1109/ISOCC.2014.7087617.
[25] M. D. Rote, N. Vijendran, and D. Selvakumar, "High performance SHA-2 core using the round pipelined technique," in Proc. IEEE Int. Conf. Electron., Comput. Commun. Technol. (CONECCT), Bangalore, India, Jul. 2015, pp. 1–6, doi: 10.1109/CONECCT.2015.7383912.
[26] M. Kammoun, M. Elleuchi, M. Abid, and M. S. BenSaleh, "FPGA-based implementation of the SHA-256 hash algorithm," in Proc. IEEE Int. Conf. Design Test Integr. Micro Nano-Syst. (DTS), Jun. 2020, pp. 1–6.
[27] R. Martino and A. Cilardo, "A flexible framework for exploring, evaluating, and comparing SHA-2 designs," IEEE Access, vol. 7, pp. 72443–72456, 2019.

THI HONG TRAN (Member, IEEE) received the bachelor's degree in physics and the master's degree in microelectronics from the Vietnam National University Ho Chi Minh City University of Science (VNU-HCMUS), Vietnam, in 2008 and 2012, respectively, and the Ph.D. degree in information science from the Kyushu Institute of Technology, Japan, in 2014. Since 2015, she has been working with the Nara Institute of Science and Technology (NAIST), Japan, as a full-time Assistant Professor. Her research interests include digital hardware circuit design and algorithms related to wireless communications, communication security, blockchain technologies, SHA-2, SHA-3, and cryptography. She is currently a Regular Member of IEICE and REV-JEC.

HOAI LUAN PHAM received the bachelor's degree in computer engineering from the University of Information Technology, Vietnam National University-Ho Chi Minh (VNU-HCM), Vietnam, in 2018, and the master's degree in information science from the Nara Institute of Science and Technology (NAIST), Japan, in 2020, where he is currently pursuing the Ph.D. degree. His research interests include blockchain technology and cryptography.

YASUHIKO NAKASHIMA (Senior Member, IEEE) received the B.E., M.E., and Ph.D. degrees in computer engineering from Kyoto University, in 1986, 1988, and 1998, respectively. From 1988 to 1999, he was a Computer Architect with the Computer and System Architecture Department, FUJITSU Ltd. From 1999 to 2005, he was an Associate Professor with the Graduate School of Economics, Kyoto University. Since 2006, he has been a Professor with the Graduate School of Information Science, Nara Institute of Science and Technology. His research interests include computer architecture, emulation, circuit design, and accelerators. He is currently a fellow of IEICE, a Senior Member of IPSJ, and a member of IEEE CS and ACM.
