A High-Performance Multimem SHA-256 Accelerator For Society 5.0
ABSTRACT The development of a low-cost high-performance secure hash algorithm (SHA)-256 accelerator
has recently received extensive interest because SHA-256 is important in widespread applications, such as
cryptocurrencies, data security, data integrity, and digital signatures. Unfortunately, most current research
has focused on the performance of the SHA-256 accelerator itself rather than on the system level, where
data transfer between the external memory and the accelerator occupies a large fraction of the time. In this paper,
we address this problem with a novel SHA-256 architecture named the multimem SHA-256
accelerator that achieves high performance at the system on chip (SoC) level. Notably, our accelerator
employs three novel techniques, the pipelined arithmetic logic unit (ALU), multimem processing element
(PE), and shift buffer in shift buffer out (SBi-SBo), to reduce the critical path delay and significantly increase
the processing rate. Experiments on a field-programmable gate array (FPGA) and an application-specific
integrated circuit (ASIC) show that the proposed accelerator achieves significantly better processing rate and
hardware efficiency than previous works. The functional correctness of the accelerator is verified on a real hardware platform
(FPGA ZCU102). The accelerator is synthesized and laid out in 180 nm complementary metal oxide
semiconductor (CMOS) technology with a chip size of 8.5 mm × 8.5 mm; it consumes 1.86 W and provides a
maximum processing rate of 40.96 Gbps at 80 MHz and 1.8 V. With FPGA Xilinx 16 nm FinFET technology,
the accelerator processing rate is as high as 284 Gbps.
INDEX TERMS SHA-256, blockchain, society 5.0, cryptography hash function, SHA-256 accelerator,
FPGA, ASIC.
Wj (16 ≤ j ≤ 63). σ0(x) and σ1(x) are computed as eq. (2) and eq. (3). Note that S^n(x) and R^n(x) denote the right rotation and the right shift of data x by n bits, respectively.
$$W_j = \sigma_1(W_{j-2}) + W_{j-7} + \sigma_0(W_{j-15}) + W_{j-16} \tag{1}$$
$$\sigma_0(x) = S^{7}(x) \oplus S^{18}(x) \oplus R^{3}(x) \tag{2}$$
$$\sigma_1(x) = S^{17}(x) \oplus S^{19}(x) \oplus R^{10}(x) \tag{3}$$
The MC process computes the 256-bit hash value from the outputs of the ME process (64 chunks of Wj (0 ≤ j ≤ 63)). The process involves two main steps: loops and hash updates. In the loop step, eight loop hash values (denoted a, b, c, d, e, f, g, h) are initialized by the initial hash values H0, H1, ..., H7. The loop hash values a, b, c, d, e, f, g, h are then computed and updated through 64 loops. In each loop (loop j, 0 ≤ j ≤ 63), eqs. (4) to (8) are applied.
$$T_1 = h + \Sigma_1(e) + \mathrm{Ch}(e, f, g) + K_j + W_j \tag{4}$$
$$T_2 = \Sigma_0(a) + \mathrm{Maj}(a, b, c) \tag{5}$$
$$a = T_1 + T_2 \tag{6}$$
$$e = d + T_1 \tag{7}$$
$$b = a;\ c = b;\ d = c;\ f = e;\ g = f;\ h = g \tag{8}$$
where logical functions such as Σ0(x), Σ1(x), Ch(x, y, z), and Maj(x, y, z) are computed using the following equations.
$$\Sigma_0(x) = S^{2}(x) \oplus S^{13}(x) \oplus S^{22}(x) \tag{9}$$
$$\Sigma_1(x) = S^{6}(x) \oplus S^{11}(x) \oplus S^{25}(x) \tag{10}$$
$$\mathrm{Ch}(x, y, z) = (x \wedge y) \oplus (\neg x \wedge z) \tag{11}$$
$$\mathrm{Maj}(x, y, z) = (x \wedge y) \oplus (x \wedge z) \oplus (y \wedge z) \tag{12}$$
In the hash update step, the final 256-bit hash value, which is divided into 8 chunks of 32-bit data HO0, HO1, ..., HO7, is computed by adding the initial hashes H0, H1, ..., H7 to the loop hashes a, b, c, d, e, f, g, h, as illustrated in eq. (13).
$$HO_0 = H_0 + a;\ \ldots;\ HO_7 = H_7 + h \tag{13}$$
B. SHA-256 CHARACTERISTICS AND PRELIMINARY IDEA FOR THE HIGH PERFORMANCE ACCELERATOR
There are three characteristic points of SHA-256 that should be noted. First, the SHA-256 algorithm needs only low-cost arithmetic logic operators such as adders, shifts, rotations, and XORs. No complex operators such as multipliers and dividers are required. Second, the number of operators (nots, XORs, adders, shifts, rotations, etc.) per loop calculation is large (approximately 50 operators per loop). Third, data dependence among the loops is present. This means that the current loop needs the result of the previous loops for calculation. In addition, some common data are required in multiple loops. Because of these characteristics, the current high-performance CPU (Intel multicores) and GPU (Tesla V100, GTX 1080, etc.) platforms do not perform well in calculating the SHA-256 algorithm. Although these general purpose platforms are rich in hardware resources, they still need a large number of clock cycles to compute a single loop of SHA-256, which results in a low processing rate. One of the main reasons is that memory blocks such as double data random access memory (DDRAM) and the L1 and L2 caches of these platforms are placed far from the calculation unit, and the data transfer time between the memories and calculation units thus constitutes a large amount of the total processing time. Furthermore, most of the existing hardware resources of CPUs and GPUs, such as multipliers and exponentials, are useless for SHA-256 but still consume energy. Despite these disadvantages, multicore CPUs and GPUs are currently considered to be the most applicable hardware platforms for calculating SHA-256 in Bitcoin mining and other blockchain networks.
In another approach, the systolic array-based accelerator named EMAXVR in [13] and its improved version in [14] were applied to reduce the data access time by implementing local memory near the ALU. Although this platform achieves high performance on image processing and AI learning [13], [14], its performance for computing SHA-256 is very poor [15]. The key reason for the low processing rate is the data-dependent characteristic of the SHA-256 algorithm, as mentioned above. The processing element (PE) architecture of EMAXVR is not yet suitable for SHA-256 computation; therefore, a large number of PEs are required to compute a single hash loop. Furthermore, the data dependence among the loops requires 1) multiple copies of data to be stored in different PEs, which results in low hardware efficiency, and 2) external memory data transfer after the completion of a loop, which results in a low processing rate.
Studying the weaknesses of the existing hardware platforms for SHA-256 and deeply understanding the characteristics of SHA-256, we propose a novel multimem SHA-256 accelerator that achieves high performance in both hardware efficiency and processing rate. We propose multiple local memory structures near the ALU for reducing data access time and improving hardware efficiency. Furthermore, we design a processing element (PE) architecture that is able to compute all loops of the hash function without requiring data from the nearby PEs. Thus, the PEs can work independently and in parallel, and no wire connections are required among the PEs. As a result, the accelerator does not need to frequently request data from external memories, and the data access time is significantly shortened. Furthermore, the hardware efficiency and processing rate of the SHA-256 accelerator are expected to be remarkably enhanced.
III. THE PROPOSED MULTIMEM SHA-256 ACCELERATOR
A. OVERVIEW OF THE ARCHITECTURE
Fig. 3 shows the overall architecture of our proposed accelerator. It can be employed in any embedded system via an advanced extensible interface (AXI) bus. The CPU is the main microprocessor controlling operation of the entire embedded system. Basically, the CPU is busy with many tasks, including the control of the hash calculation inside the SHA-256 accelerator. In the task of controlling the accelerator, the CPU must send commands to transfer data from DDR
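As a software reference for the message expansion (ME) of eqs. (1) to (3) and the message compression (MC) of eqs. (4) to (13) in Section II-A, the following is a minimal Python sketch, not the accelerator's hardware datapath. The function names (`rotr`, `shr`, `expand`, `compress`, `sha256`) are illustrative choices. The message padding and the derivation of the constants Kj and H0, ..., H7 follow the Secure Hash Standard and are not described in this section, so they are assumptions here. Note that the right-hand sides of eqs. (6) to (8) use the values from before the update, which the tuple assignment below makes explicit.

```python
import math
import struct

MASK = 0xFFFFFFFF  # all arithmetic is modulo 2^32


def _icbrt(n):
    """Integer floor cube root by binary search (avoids float rounding)."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def _first_primes(count):
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


# Assumption (from the standard, not this section): K_j and H_i are the first
# 32 fractional bits of the cube / square roots of the first 64 / 8 primes.
_PRIMES = _first_primes(64)
K = [_icbrt(p << 96) & MASK for p in _PRIMES]
H_INIT = [math.isqrt(p << 64) & MASK for p in _PRIMES[:8]]


def rotr(x, n):  # S^n(x): right rotation of a 32-bit word
    return ((x >> n) | (x << (32 - n))) & MASK


def shr(x, n):  # R^n(x): right shift
    return x >> n


def expand(block):
    """ME process: 16 input words -> 64 words W_j, eqs. (1)-(3)."""
    w = list(struct.unpack(">16I", block))
    for j in range(16, 64):
        s0 = rotr(w[j - 15], 7) ^ rotr(w[j - 15], 18) ^ shr(w[j - 15], 3)
        s1 = rotr(w[j - 2], 17) ^ rotr(w[j - 2], 19) ^ shr(w[j - 2], 10)
        w.append((s1 + w[j - 7] + s0 + w[j - 16]) & MASK)
    return w


def compress(state, block):
    """MC process: 64 loops (eqs. (4)-(12)) plus hash update (eq. (13))."""
    w = expand(block)
    a, b, c, d, e, f, g, h = state
    for j in range(64):
        big_s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & MASK & g)
        t1 = (h + big_s1 + ch + K[j] + w[j]) & MASK
        big_s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + maj) & MASK
        # Simultaneous update: every right-hand side uses the old values.
        a, b, c, d, e, f, g, h = (
            (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)
    return [(s + v) & MASK for s, v in zip(state, (a, b, c, d, e, f, g, h))]


def sha256(message: bytes) -> bytes:
    """Hash a whole message; padding follows the standard (assumption)."""
    padded = (message + b"\x80" + b"\x00" * ((55 - len(message)) % 64)
              + struct.pack(">Q", 8 * len(message)))
    state = list(H_INIT)
    for i in range(0, len(padded), 64):
        state = compress(state, padded[i:i + 64])
    return struct.pack(">8I", *state)
```

Each call to `compress` depends on the state produced by the previous call, and each of the 64 loops depends on the previous loop, which is the data dependence discussed in Section II-B.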
Fig. 9a and Fig. 9b show the SBi1 and SBi2 architectures that connect the outputs of WRAM and HRAM to the ALU inputs, respectively. SBi1 includes 4 shift buffers named SBi1-1, SBi1-2, SBi1-3, and SBi1-4 receiving data flows from WM1, WM2, WM3, and WM4 of WRAM, respectively. Each shift buffer has 4 32-bit shift registers that receive data from the WRAM and then shift the data to the right once per clock cycle. The timings of 4 data flows from 4 RAM blocks WM1, ..., WM4 are interleaved, as shown in Fig. 10. The multiplexer MUX-1 then selects the appropriate dataset from 4 data flows, one per clock cycle, to input to the ALU.
Similarly, SBi2 shown in Fig. 9b has 4 shift buffers named SBi2-1, SBi2-2, SBi2-3, and SBi2-4 receiving data flows from HM1, HM2, HM3, and HM4 of HRAM, respectively. Each shift buffer has 8 32-bit shift registers that receive data a, ..., h from the HRAM and then shift the data to the right
SBo2 has two main functions. First, it collects and schedules the loop hash words a, ..., h output from the ALU after 64 loops are complete. Second, it calculates the final hash values HO0, ..., HO7 by adding the loop hash words a, ..., h and the initial hash values H0, ..., H7 following eq. (13). Fig. 11 shows the inside of SBo2. It includes four shift buffers on the left named SBo2-l1, SBo2-l2, SBo2-l3, and SBo2-l4 and four shift buffers on the right named SBo2-r1, SBo2-r2, SBo2-r3, and SBo2-r4. Each shift buffer has 8 32-bit shift registers. The buffers on the left (SBo2-l1, SBo2-l2, SBo2-l3, SBo2-l4) store the ALU outputs a, ..., h, while the buffers on the right (SBo2-r1, SBo2-r2, SBo2-r3, SBo2-r4) receive the initial hash values H0, ..., H7 from SBi2. Each left-right set of buffers aims to process one data flow. The content of the left-right set of buffers is shifted to the left (direction of the most significant register) once per clock cycle. The values from the most significant registers are read out and added. The addition results become the final hash words HO0, ..., HO7 that are written into the corresponding RAM blocks (HM1, ..., HM4) of the HRAM.
FIGURE 12. Timing chart of SBo2.
Fig. 12 illustrates the timing chart of the SBo2. First, the ALU output data, such as a, ..., and h, of the 4 data flows are stored in the 8 registers of the shift buffers SBo2-l1 to SBo2-l4 in 4 interleaved clock cycles. The values of the most significant register (for example SBo2-l1[7]), which are a, b, c, d, e, f, g, and h at clock cycles #1, #2, #3, #4, #5, #6, #7, respectively, are read out to add to the initial hash values H0, ..., H7 before writing into the HRAM (for example HM1). It needs 8 clock cycles to complete the writing of 8 32-bit hash words for one data flow and 8 + 3 = 11 clock cycles to complete the writing for 4 data flows in pipelined mode.
IV. EVALUATION
In this section, we evaluate the performance of our proposed multimem SHA-256 accelerator from both ASIC and FPGA experimental results.
A. ASIC EXPERIMENT
The proposed multimem SHA-256 circuits for cases of 1-PE and 64-PEs were coded in Verilog and synthesized in an ASIC using a Synopsys Design Compiler with the Rohm 0.18 µm CMOS standard cell library [16]. Table 1 shows the synthesized area of the SHA-256 circuit in comparison with those of other works. To ensure a fair comparison, we provide the processing throughput in terms of both the number of clocks per hash function and Mbps. We also normalize the processing throughput (Mbps) and hardware efficiency of other works to the same 180 nm technology. The proposed architecture in cases of both 1 PE and 64 PEs exhibits significantly better processing throughput and hardware efficiency. Compared with the works in [17] and [18], which reported only the ALU circuit results, our ALU generates a processing rate as high as 2,930 Mbps, which is 1.7 times and 1.9 times faster than the works in [17] and [18], respectively. Our ALU hardware efficiency is 123 Kbps/gate, which is 1.1 times and 1.4 times better than those of [17] and [18], respectively.
The proposed architecture with only 1 PE (No RAM PE) provides a processing rate as high as 2,431 Mbps, which is 12.3 times faster than that of the work in [22]. Its hardware efficiency is 43 Kbps/gate, which is almost 2 times better than that of [22]. Notably, our architecture with 64 PEs achieves the best processing rate. It provides up to 139 Gbps at a frequency of 272 MHz, which is 70 times faster than [22]. Note that the term "No RAM" refers to the synthesis result excluding the local memory resources inside the HRAM and WRAM.
Furthermore, we successfully laid out the proposed SHA-256 circuit in the case of 64 PEs in ASIC technology with a Rohm 180 nm CMOS standard cell library. The size of the layout chip is 8.5 mm × 8.5 mm, and the chip processing rate is 40.96 Gbps when a global operating voltage of 1.8 V and an operating frequency of 80 MHz are applied.
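The two 64-PE processing rates reported for the ASIC (139 Gbps at 272 MHz and 40.96 Gbps at 80 MHz) can be cross-checked with a simple rate model. The sketch below assumes that each PE consumes one 512-bit message block every 64 clock cycles, i.e., one hash loop per cycle. This section does not state that figure explicitly; it is an assumption that happens to reproduce both reported numbers. The names `BLOCK_BITS`, `CYCLES_PER_BLOCK`, and `processing_rate_gbps` are illustrative.

```python
BLOCK_BITS = 512        # SHA-256 message block size in bits
CYCLES_PER_BLOCK = 64   # assumption: one hash loop per clock cycle per PE


def processing_rate_gbps(num_pes: int, freq_hz: float) -> float:
    """Aggregate processing rate of num_pes independent PEs, in Gbps."""
    bits_per_cycle = num_pes * BLOCK_BITS / CYCLES_PER_BLOCK
    return bits_per_cycle * freq_hz / 1e9


if __name__ == "__main__":
    for label, freq_hz in (("layout, 80 MHz", 80e6), ("synthesis, 272 MHz", 272e6)):
        print(f"64 PEs ({label}): {processing_rate_gbps(64, freq_hz):.2f} Gbps")
```

Under this assumption the rate scales linearly with both the number of PEs and the clock frequency, which is consistent with the trade-off between processing rate and hardware cost described for the accelerator.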
Overall, our proposed SHA-256 accelerator obtains significantly better processing rate and hardware efficiency and has the ability to scale the trade-off between the processing rate and hardware cost. This means that our accelerator processing rate can be changed according to the requirements of the real system.
better than the architectures in [6], [19], [20], and [21], respectively. The hardware efficiency of our accelerator version 1-PE is 1,864 Kbps/LUT, which is 12.4 times, 2.5 times, 2.5 times, and 1.76 times higher than the architectures in [6], [19], [20], and [21], respectively.
On the Virtex E device, the processing rate and hardware efficiency of the full circuit of our accelerator version 1-PE are 7.6 times (662 vs. 87 Mbps) and 2.7 times (183 vs. 69 Kbps/LUT) higher than those of the architecture in [22], respectively.
On the Virtex 4 device, compared to that of the architec-
tures in [10], [11], [19], and [21], the performance of our
accelerator version 1-PE is 34 times, 1.6 times, 2.3 times, and
3.5 times better in terms of processing rate and 13.5 times,
1.4 times, 1.3 times, and 1.6 times better in terms of hardware
efficiency, respectively.
On the Virtex 5 device, the processing rate of our archi-
tecture version 1-PE is 28.5 times and 3.1 times better than
those of the architectures in [19] and [21], respectively.
Our circuit hardware efficiency is improved by 5.1 times
and 1.1 times compared to that of the architectures in [19]
and [21], respectively.
On the Virtex 6 device, the processing rate and hardware
FIGURE 14. SoC-based multimem SHA-256 experiment with real devices.
devices connect and exchange data via JTAG and UART cables. The ZCU102 board comprises an ARMv8 Cortex-A53 microprocessor, a programmable logic (PL), and a clock generator. Our multimem SHA-256 is developed in the PL part and controlled by the ARMv8 microprocessor. The clock generator generates a clock speed of 100 MHz to operate the ARMv8 microprocessor and our accelerator embedded in the PL.
The host PC runs two Xilinx tools: Vivado and the Software Development Kit (SDK). We use Vivado to design and load the SHA-256 circuit onto the Zynq UltraScale+ ZCU102 board. To control the operation of the SHA-256 accelerator, we use the SDK to embed C code onto the ARMv8 microprocessor. The C code runs tasks such as transferring data between the external memory and the SHA-256 accelerator, activating the operation of the accelerator, and verifying the accelerator results.
In our experiment, we develop two testing cases named individual PE and parallel 64-PE, as shown in Fig. 13. In the individual PE testing case, all 64 PEs of the accelerator are programmed to perform the same task. They are all programmed to compute the hash values for 100,000 independent data blocks. The results from each PE are then compared against the correct 100,000 hash values computed by the Intel Xeon Gold 6144 CPU. Our implementation results have shown that all 64 PEs work correctly and generate hash values that match those computed by the Xeon CPU. In other words, the correctness ratio of our accelerator is 100%.
In the parallel 64-PE testing case, the 64 PEs of the accelerator are programmed to perform different tasks in parallel. The input data of different data blocks are loaded into different PEs. The results are then compared against the corresponding hash outputs computed by the Intel Xeon Gold 6144 CPU. Again, the implementation results show that all 64 PEs compute correctly, with a correctness ratio of 100%.
V. CONCLUSION
The cryptographic hash function SHA-256 is an important process in keeping the decentralized blockchain network secure and preserving data security and integrity in the smart systems of society 5.0. Developing a high-performance SHA-256 circuit compatible with SoC is thus a requirement. Unfortunately, state-of-the-art SHA-256 architectures have focused only on improving the performance of the SHA-256 core as a stand-alone circuit. The compatibility of the SHA-256 circuit with the SoC as well as the data transfer between the SHA-256 core and other parts of the SoC have not yet been studied. In this paper, we have solved the problem by developing an SoC-compatible SHA-256 accelerator. Several new techniques such as multiple local memory structures (multimem), pipelined ALU, and SBi-SBo have been introduced to achieve this purpose. Our experimental results on both FPGA and ASIC have shown that the proposed accelerator is not only compatible with the SoC via an AXI bus and provides an efficient data transfer mechanism but also achieves significantly better processing rate and hardware efficiency than previous works. The functional behavior of our accelerator has been proven to run correctly on a real SoC FPGA platform Zynq UltraScale+ ZCU102. The maximum processing rate of our accelerator in theory is 284 Gbps. Version 64-PE of our accelerator has been successfully laid out with 180 nm CMOS technology with a chip size of 8.5 mm × 8.5 mm, consumes a power of 1.86 W, and provides a maximum processing rate of 40.96 Gbps at 80 MHz frequency and 1.8 V voltage.
Obviously, the processing rate of an entire SoC embedding an SHA-256 accelerator is much more important than that of the SHA-256 accelerator alone. This research has proposed an SoC-compatible SHA-256 accelerator and significantly improved the accelerator processing rate. However, the potential of the accelerator may not be sufficiently utilized due to the bottleneck data transfer rate provided by the SoC platform itself. Therefore, we believe that developing a new architecture and technique to overcome the data transfer rate bottleneck of SoC platforms will be an important research trend of the field in the near future.
REFERENCES
[1] U.S. Department of Commerce. (Jul. 2013). Digital Signature Standard (DSS). [Online]. Available: https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.186-4.pdf
[2] National Institute of Standards and Technologies, U.S. Department of Commerce. (Jun. 2015). The Keyed-Hash Message Authentication Code (HMAC). [Online]. Available: https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.198-1.pdf
[3] Recommendation for Random Number Generation Using Deterministic Random Bit Generators, 800-90a Revision 1, Nat. Inst. Standards Technol., U.S. Dept. Commerce, Washington, DC, USA, Jun. 2015, doi: 10.6028/NIST.SP.800-90Ar1.
[4] R. S. Kent, IP Authentication Header, document RFC 2404, 1998. [Online]. Available: https://www.ietf.org/rfc/rfc2402.txt
[5] H. L. Pham, T. H. Tran, T. D. Phan, V. T. D. Le, D. K. Lam, and Y. Nakashima, "Double SHA-256 hardware architecture with compact message expander for bitcoin mining," IEEE Access, vol. 8, pp. 139634–139646, 2020, doi: 10.1109/ACCESS.2020.3012581.
[6] R. P. McEvoy, F. M. Crowe, C. C. Murphy, and W. P. Marnane, "Optimisation of the SHA-2 family of hash functions on FPGAs," in Proc. IEEE Comput. Soc. Annu. Symp. Emerg. VLSI Technol. Archit. (ISVLSI), Mar. 2006, pp. 317–322.
[7] H. E. Michail, G. S. Athanasiou, V. Kelefouras, G. Theodoridis, and C. E. Goutis, "On the exploitation of a high-throughput SHA-256 FPGA design for HMAC," ACM Trans. Reconfigurable Technol. Syst., vol. 5, no. 1, pp. 1–28, Mar. 2012.
[8] L. Dadda, M. Macchetti, and J. Owen, "The design of a high speed ASIC unit for the hash function SHA-256 (384, 512)," in Proc. Design, Automat. Test Eur. Conf. Exhib., 2004, pp. 70–75.
[9] R. Chaves, G. Kuzmanov, L. Sousa, and S. Vassiliadis, "Cost-efficient SHA hardware accelerators," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 8, pp. 999–1008, Aug. 2008.
[10] M. Padhi and R. Chaudhari, "An optimized pipelined architecture of SHA-256 hash function," in Proc. 7th Int. Symp. Embedded Comput. Syst. Design (ISED), Dec. 2017, pp. 1–4.
[11] Y. Chen and S. Li, "A high-throughput hardware implementation of SHA-256 algorithm," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Oct. 2020, pp. 1–4.
[12] National Technical Information Service, U.S. Department of Commerce/NIST, Springfield, VA, USA. (2002). FIPS 180-2: Secure Hash Standard. [Online]. Available: http://csrc.nist.gov/publications/fips/fips180-2/fips180-2.pdf
[13] T. Ichikura, R. Yamano, Y. Kikutani, R. Zhang, and Y. Nakashima, "EMAXVR: A programmable accelerator employing near ALU utilization to DSA," in Proc. IEEE Symp. Low-Power High-Speed Chips (COOL CHIPS), Apr. 2018, pp. 1–3.
[14] J. Iwamoto, Y. Kikutani, R. Zhang, and Y. Nakashima, "Daisy-chained systolic array and reconfigurable memory space for narrow memory bandwidth," IEICE Trans. Inf. Syst., vol. E103.D, no. 3, pp. 578–589, Mar. 2020, doi: 10.1587/transinf.2019EDP7144.
[15] D. Phan, T. H. Tran, and Y. Nakashima, "SHA-256 implementation on coarse-grained reconfigurable architecture," in Proc. IEEE Symp. Low-Power High-Speed Chips, Japan, Apr. 2020.
[16] T. H. Tran, S. Kanagawa, D. P. Nguyen, and Y. Nakashima, "ASIC design of MUL-RED radix-2 pipeline FFT circuit for 802.11ah system," in Proc. IEEE Symp. Low-Power High-Speed Chips (COOL CHIPS XIX), Yokohama, Japan, Apr. 2016, pp. 1–3.
[17] Helion Technology. Helion IP Core Products. Accessed: Jan. 10, 2021. [Online]. Available: http://heliontech.com/core.htm
[18] A. Satoh and T. Inoue, "ASIC hardware focused comparison for hash functions MD5, RIPEMD-160, and SHS," in Proc. Int. Conf. Inf. Technol., Coding Comput. (ITCC), Las Vegas, NV, USA, vol. 1, 2005, pp. 532–537, doi: 10.1109/ITCC.2005.92.
[19] R. García, I. Algredo-Badillo, M. Morales-Sandoval, C. Feregrino-Uribe, and R. Cumplido, "A compact FPGA-based processor for the secure hash algorithm SHA-256," Comput. Electr. Eng., vol. 40, no. 1, pp. 194–202, Jan. 2014.
[20] I. Algredo-Badillo, C. Feregrino-Uribe, R. Cumplido, and M. Morales-Sandoval, "Novel hardware architecture for implementing the inner loop of the SHA-2 algorithms," in Proc. 14th Euromicro Conf. Digit. Syst. Design, Oulu, Finland, Aug. 2011, pp. 543–549, doi: 10.1109/DSD.2011.75.
[21] M. M. Wong, V. Pudi, and A. Chattopadhyay, "Lightweight and high performance SHA-256 using architectural folding and 4-2 adder compressor," in Proc. IFIP/IEEE Int. Conf. Very Large Scale Integr. (VLSI-SoC), Verona, Italy, Oct. 2018, pp. 95–100, doi: 10.1109/VLSI-SoC.2018.8644825.
[22] M. Kim, J. Ryou, and S. Jun, "Efficient hardware architecture of SHA-256 algorithm for trusted mobile computing," in Proc. 4th Int. Conf. Inf. Secur. Cryptol., China, Jun. 2009, pp. 240–252, doi: 10.1007/978-3-642-01440-6_19.
[23] K. K. Ting, S. C. L. Yuen, K. H. Lee, and P. H. W. Leong, "An FPGA based SHA-256 processor," in Proc. Int. Conf. Field Program. Log. Appl., 2002, pp. 577–585.
[24] C. Jeong and Y. Kim, "Implementation of efficient SHA-256 hash algorithm for secure vehicle communication using FPGA," in Proc. Int. SoC Design Conf. (ISOCC), Jeju, South Korea, Nov. 2014, pp. 224–225, doi: 10.1109/ISOCC.2014.7087617.
[25] M. D. Rote, N. Vijendran, and D. Selvakumar, "High performance SHA-2 core using the round pipelined technique," in Proc. IEEE Int. Conf. Electron., Comput. Commun. Technol. (CONECCT), Bangalore, India, Jul. 2015, pp. 1–6, doi: 10.1109/CONECCT.2015.7383912.
[26] M. Kammoun, M. Elleuchi, M. Abid, and M. S. BenSaleh, "FPGA-based implementation of the SHA-256 hash algorithm," in Proc. IEEE Int. Conf. Design Test Integr. Micro Nano-Syst. (DTS), Jun. 2020, pp. 1–6.
[27] R. Martino and A. Cilardo, "A flexible framework for exploring, evaluating, and comparing SHA-2 designs," IEEE Access, vol. 7, pp. 72443–72456, 2019.
THI HONG TRAN (Member, IEEE) received the bachelor's degree in physics and the master's degree in microelectronics from the Vietnam National University Ho Chi Minh City University of Science (VNU-HCMUS), Vietnam, in 2008 and 2012, respectively, and the Ph.D. degree in information science from the Kyushu Institute of Technology, Japan, in 2014. Since 2015, she has been working with the Nara Institute of Science and Technology (NAIST), Japan, as a full-time Assistant Professor. Her research interests include digital hardware circuit design and algorithm related to wireless communications, communication security, blockchain technologies, SHA-2, SHA-3, and cryptography. She is currently a Regular Member of IEICE and REV-JEC.
HOAI LUAN PHAM received the bachelor's degree in computer engineering from the University of Information Technology Vietnam National University-Ho Chi Minh (VNU-HCM), Vietnam, in 2018, and the master's degree in information science from the Nara Institute of Science and Technology (NAIST), Japan, in 2020, where he is currently pursuing the Ph.D. degree. His research interests include blockchain technology and cryptography.
YASUHIKO NAKASHIMA (Senior Member, IEEE) received the B.E., M.E., and Ph.D. degrees in computer engineering from Kyoto University, in 1986, 1988, and 1998, respectively. From 1988 to 1999, he was a Computer Architect with the Computer and System Architecture Department, FUJITSU Ltd. From 1999 to 2005, he was an Associate Professor with the Graduate School of Economics, Kyoto University. Since 2006, he has been a Professor with the Graduate School of Information Science, Nara Institute of Science and Technology. His research interests include computer architecture, emulation, circuit design, and accelerators. He is currently a fellow of IEICE, a Senior Member of IPSJ, and a member of IEEE CS and ACM.