An Embedded Memory-Centric Reconfigurable Hardware Accelerator For Security Applications

The document presents HASK, a novel reconfigurable hardware accelerator designed specifically for security applications, which enhances energy efficiency and performance compared to general-purpose processors and FPGAs. HASK features a coarse-grained datapath and supports parallel computations through an array of lightweight processing elements, achieving significant improvements in energy-delay product and throughput for various cryptographic algorithms. The architecture is tailored to meet the specific needs of security tasks, providing a scalable solution that balances spatial and temporal computing demands.

3196 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 10, OCTOBER 2016

An Embedded Memory-Centric Reconfigurable Hardware Accelerator for Security Applications

Christopher Babecki, Student Member, IEEE, Wenchao Qian, Student Member, IEEE, Somnath Paul, Member, IEEE, Robert Karam, Student Member, IEEE, and Swarup Bhunia, Senior Member, IEEE

Abstract—Security has emerged as a critical need in today’s computer applications. Unfortunately, most security algorithms are computationally expensive and often do not map efficiently to general purpose processors. Fixed-function accelerators offer significant improvement in energy-efficiency, but they do not allow more than one application to reuse hardware resources. Mapping applications to generic reconfigurable fabrics can achieve the desired flexibility, but at the cost of area and energy efficiency. This paper presents a novel reconfigurable framework, referred to as hardware accelerator for security kernel (HASK), for accelerating a wide array of security applications. This framework incorporates a coarse-grained datapath, support for lookup functions, and flexible interconnect optimizations, which enable on-demand pipelining and parallel computations in multiple ultralight-weight processing elements. These features are highly effective for energy-efficient operation in a diverse set of security applications. Through simulations, we have compared the performance of HASK to software and field programmable gate array (FPGA) platforms. Simulation results for a set of six common security applications show comparable latency between HASK and FPGA with 2.5X improvement in energy-delay product and 4X improvement in iso-area throughput. HASK also shows 5X improvement in iso-area throughput and 45X improvement in energy-delay product compared to optimized software implementations.

Index Terms—Security applications, domain-specific hardware accelerator, energy-efficiency, reconfigurable computing

1 INTRODUCTION

SECURITY is becoming an important design metric, specifically in the domain of embedded systems [1]. Due to the ever-growing importance of data security in these systems, the inclusion of hardware cryptographic modules is rapidly becoming a requirement. In particular, hardware modules for diverse security tasks, ranging from encryption to hashing, are more effective than software realizations in terms of meeting energy-efficiency and/or real-time performance demands [2], [3], [4]. These security modules are implemented either inside embedded processors or outside them as co-processors or hardware accelerators [2]. Typically, these modules are realized in one of two ways: (a) as a specialized custom hardware module, or (b) as a reconfigurable accelerator using a field programmable gate array (FPGA) or similar platform. While the first option provides optimal performance and energy-efficiency, the second one offers the flexibility to map a variety of security tasks satisfying diverse compute and communication requirements.

The Advanced Encryption Standard (AES) is widely used in embedded applications to achieve data security. Unfortunately, most encryption algorithms like AES are slow on general purpose processors (GPPs), with a single block encryption typically taking hundreds of cycles on an embedded processor without hardware support [5]. This strains real-time applications where a software-only approach may not meet throughput requirements. The energy efficiency of security algorithms on GPPs is also poor, which can be unattractive in energy-constrained systems.

One approach to alleviate this energy and performance bottleneck is to develop dedicated hardware for specific application kernels like AES, such as the AES-NI instruction set extension for the x86 platform [6]. This yields excellent performance and energy results, but only functions for a specific algorithm and requires substantial development and verification effort to implement for each kernel. Therefore, this approach does not scale well to diverse security kernels [3]. Similarly, FPGAs can be used to improve flexibility, and while this may enhance energy efficiency over software-only implementations, the highly reconfigurable interconnect architecture can impose significant penalties to power, area, and latency [17]. Instead, domain-specific reconfigurable architectures have been introduced which demonstrate improved performance and power results over general purpose reconfigurable platforms [2], [7], [8]. Notable architectures include asynchronous array of simple processors (AsAP) [7], [15] and MorphoSys [2], while the Totem system [8] aims to automate the process of domain specific accelerator design. However, none of these platforms has been designed specifically for the domain of security acceleration.

To address this important need, in this paper, we present a hardware accelerator for security kernels (HASK), a novel and highly energy-efficient reconfigurable framework for the acceleration of cryptographic kernels, based on an analysis of security applications such as AES, Blowfish, IDEA, SHA-1, MD5, and CAST-128. HASK is an array of parallel, coarse-grained nano-processors with hardware support for common operations in security tasks. It aims to balance spatial and temporal computing by time-multiplexing hardware resources while having a large number of interconnected processing elements to distribute a task. The overall system serves as a loosely coupled accelerator on a shared bus accessible by both a host processor and by peripherals. Fig. 1 illustrates a conceptual diagram of a modern system on a chip (SoC) which incorporates an acceleration engine for security algorithms.

HASK’s primary distinctions from existing reconfigurable architectures, such as AsAP and MorphoSys, are as follows: (a) a novel, fully distributed memory architecture with multiple instruction, multiple data (MIMD) support; (b) a processing and interconnect architecture tailored to security applications; (c) fully routerless communication, which improves data transfer energy and latency; and (d) hardware support for lookup table operations with varying input and output width, and fused logical operations to implement complex functions as atomics. In particular, the paper makes the following key contributions:

1) It proposes HASK, a scalable and energy-efficient reconfigurable architecture suitable for the domain of security applications. HASK supports spatio-temporal computing with an array of light-weight processing elements. The datapath and interconnect structure for HASK are optimized to commonly occurring operations and communication patterns for target applications.
2) It explains the micro-architecture level optimizations for HASK and the corresponding design trade-offs. It also describes the design space exploration that trades off between spatial and temporal computing to optimize energy efficiency.
3) It evaluates the performance and energy efficiency of HASK for six common applications and then compares the results against a commercial FPGA device, a GPP, and the alternative reconfigurable accelerators MorphoSys [2] and AsAP [15].

C. Babecki, W. Qian, R. Karam, and S. Bhunia are with the Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106. E-mail: {christopher.babecki, wenchao.qian, robert.karam, swarup.bhunia}@case.edu.
S. Paul is with Intel Labs, Intel Corporation, Hillsboro, OR 97124. E-mail: [email protected].
Manuscript received 14 Nov. 2014; revised 13 Nov. 2015; accepted 9 Dec. 2015. Date of publication 24 Dec. 2015; date of current version 14 Sept. 2016.
Recommended for acceptance by W.W. Ro.
For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TC.2015.2512858
Authorized licensed use limited to: VTU Consortium. Downloaded on April 03,2025 at 14:08:46 UTC from IEEE Xplore. Restrictions apply.
0018-9340 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Fig. 1. Block diagram of a SoC (or a processor) using HASK as an external accelerator for security applications.

The rest of the paper is organized as follows: Section 2 explains the system architecture for HASK, with details on both datapath and interconnect structure. Section 3 presents the performance results for candidate security applications and compares with FPGA and other reconfigurable platforms. Section 4 concludes the paper and provides future directions.

2 SYSTEM ARCHITECTURE

A single HASK processing element, termed a security nano-processor (SNP) is shown in Fig. 2. Each SNP operates independent of the others as a MIMD machine and has its own local data and instruction memory. The interconnect fabric (both local and global as shown in Fig. 1) does not contain any memory elements. SNPs are organized into two levels of hierarchy: Clusters, which contain four SNPs, and Tiles, which contain four Clusters. Each Tile has a central controller responsible for writing to the instruction memory of each SNP and transferring data between the main system memory or a peripheral and local SNP memory. In this section, we analyze the application requirements and describe in detail the μ-architecture of each SNP and the interconnects between them.

Fig. 2. Block diagram of security nano-processor.

2.1 Analysis of Security Applications

We have considered six security applications to map into the HASK framework, including AES [9], Blowfish [10], CAST-128 [11], IDEA [12], MD5 [13], and SHA-1 [14]. Inputs to each kernel are assumed to be stored in the local memory, so data reads need to be performed before execution. These applications make use of addition, bitwise logical operations (AND, OR, XOR), circular and logical shifts, and non-linear functions (Substitution Boxes, or S-Boxes). The S-Box functions these applications use take 8 bits of input and have either 8 or 32 bits of output. The bitwise logical operations are 8, 16, or 32 bits wide. Logic operations are bit-sliceable, for example, a 32-bit AND operation can be sliced into four 8-bit wide AND operations, which can be performed in parallel. Complex logical operations are also commonly used in these applications. For example, cryptographic hashing algorithms such as MD5 and SHA-1 contain complex logical operations such as (A ∧ B) ∨ (A ∧ C). These types of operations typically use the same inputs but perform different combinations of logic functions over time. Increasing computing efficiency of these operations helps to improve performance and energy efficiency.

The deterministic nature of the security kernels is amenable to pipelining and parallelism. Due to their iterative nature, each application has a certain number of computing rounds and each computing round requires almost the same set of operations using different keys as inputs. Therefore, the latency for each round is almost the same, and pipelining can be easily applied to improve the throughput. Latency can also be improved by applying instruction-level and data-level parallelism within each computing round.

2.2 SNP Hardware Architecture

The SNP is modeled after a standard RISC-style processing element, containing an 8 kB SRAM memory array, a lightweight custom datapath, a 32-byte register file, and a program counter. The register file is designed to have a large number of read ports (8) and write ports (4) to support the wide execution engine of the SNP. Unlike a processor which typically would fetch instructions from memory, each SNP has a dedicated schedule table that holds the 80-bit wide instructions which are preloaded when the SNP is configured. The proposed SNP design holds 128 instruction words. Support is provided for several standard operations including add, shift, simple logical operations, and load/store.¹

1. Due to space limitation, a detailed description of the instruction set architecture has been moved to the supplementary material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TC.2015.2512858

Although each SNP behaves like a processor, it implements many features that attempt to mimic common hardware components necessary to not just accelerate security applications, but also maximize energy efficiency. These domain-specific optimizations are as follows:

• a two-way execution engine;
• hardware support for variable width vectorized lookup table (LUT) operations;
• a fused logical unit for arbitrary three-input functions;
• a byte-addressable register file;
• support for SIMD-style datapath operations.

Instruction-level and data-level parallelism can be realized through statically mapped VLIW and vector architectures. The SNP exploits this by allowing two operations to be statically scheduled to execute in a given cycle. Each SNP operation can be encoded in 40 bits; thus, the full 80 bit instruction can hold any two independent operations, even simultaneous memory accesses. Security applications also commonly exploit highly nonlinear functions to “mask” the data being processed (e.g. S-boxes in ciphering algorithms like AES, Blowfish, and CAST-128). Each SNP supports 8-bit input lookup operations with variable output sizes, including 8, 16, and 32 bits to map these functions efficiently. The first 4 kB of memory in each SNP is reserved for LUTs and uses an asymmetric memory design to achieve a 40 percent reduction in read energy

[4], as well as fine-grained wordline segmentation allowing efficient access to the variable width LUTs. The remaining memory is used as a byte addressable scratchpad-memory that stores inputs and resultant data.

For complex logical operations, such as (A ∧ B) ∨ (A ∧ C), it could take up to 4 cycles by using atomic logic operations. Though HASK supports bit-sliced logic operations, mapping this complex operation requires substantially more energy than dedicated logic. In addition, a LUT-based approach would require a substantial amount of memory space to hold the responses. Instead, these complex logical operations are mapped to a novel reconfigurable logic datapath inside each SNP capable of implementing arbitrary logical functions of up to three inputs. This is realized and encoded using a Reed-Muller expansion, which results in substantially fewer required transistors than a canonical representation for an arbitrary function of three inputs [16]. For example, the fused datapath can be configured to perform (A ∧ B) ∨ (A C) in a single instruction as 1 A C AB AC.

Most security kernels are dominated by byte level operations (e.g. S-boxes in AES, Blowfish, and CAST-128 and Galois Field multiplications in AES). To mitigate this requirement, all data operations in HASK support variable input/output size down to a single byte for increased energy efficiency. The register file is also designed accordingly to be byte addressable, enabling more compact information storage.

Finally, many security tasks are amenable to additional parallelism beyond what a VLIW-2 architecture allows. For example, consider the MixColumns step of the AES algorithm, which requires 12 XOR operations comprising four independent expressions. The SNP’s on-demand SIMD instruction exploits this data parallelism by simultaneously reading four sets of input operands from the register file. The inputs are then run through four separate datapaths and written back through separate write ports. This improves the overall execution time of the AES algorithm by about 10 percent. Addition operations can be accelerated in this manner as well.

2.3 Interconnect Structure

HASK employs a sparse hierarchical interconnect that exploits data locality. SNPs within the same Cluster have a fully connected shared bus structure (Fig. 3a). Each 16-bit connection allows for the transmission of a single 8 or 16-bit value per cycle; at 1.25 GHz (see Section 2.4), this leads to a rate of 20 Gbps. Similarly, a 16-bit shared bus is used for inter-cluster communication (Fig. 3b); however, unlike the fully-connected intra-cluster bus, the inter-cluster bus can only be reached from one SNP per cluster, termed Gateway SNP, or gSNP, through which all communications must be routed. Ideally, applications can be mapped such that only the gSNP needs to communicate with other clusters, maximizing the inter-cluster bus throughput. The lower bound of the bandwidth per SNP is 1/4 of the intra-cluster bus, or 5 Gbps. At higher levels, gSNPs can communicate with each of the 16 gSNPs in adjacent Tiles through a 16-bit wide 2D bi-directional mesh interconnect structure. It enables the architecture to easily add a large number of tiles (and hence, SNPs). A gSNP can broadcast to multiple inter-Tile buses in the same cycle, allowing the architecture to scale to an arbitrary size while maintaining limited connectivity between any two nodes. Data transmission between tiles requires two cycles without any stall. Thus, the cycle time is independent of communication latency between distant nodes. In the worst case, this communication is limited to 625 Mbps, when all 16 gSNPs need to communicate as often as possible.

Fig. 3. Hierarchical bus-based interconnect structure of: (a) a single Cluster of 4 SNPs, (b) a single Tile of four Clusters. The tiles are connected in a mesh topology.

Fig. 5. Two modes of inter-SNP communications: (a) Explicit data transfer through a MOVE instruction and (b) implicit transfer using intra-cluster bus as virtual register.

These communication buses form a time-multiplexed programmable interconnect. Because the communication requirements for
the security applications are both constant and known a priori, they
can be scheduled at compile time. Routing information is stored in
the instructions as an immediate value and decoded at runtime to
control communication buses. If a buffer is enabled on a given cycle,
output data from the SNP’s operation is written to the appropriate
bus. Subsequently, other SNPs can read the data into their local reg-
ister files. A detailed view of the bus output structure is provided in
Fig. 4. As the communications are statically scheduled, routers are
not required in the fabric which eliminates their associated power,
latency, and area overhead.
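
The bandwidth figures quoted for the three bus levels follow from simple arithmetic on the 16-bit bus width and the 1.25 GHz clock. The sketch below reproduces them. Treating the worst-case inter-Tile figure as one two-cycle transfer shared evenly among 16 gSNPs is an assumed reading of the text, and the function and constant names are illustrative only.

```python
# Bus parameters as stated in Sections 2.3 and 2.4.
BUS_WIDTH_BITS = 16        # width of the intra-cluster, inter-cluster and inter-Tile links
CLOCK_HZ = 1.25e9          # maximum SNP clock frequency
SNPS_PER_CLUSTER = 4       # SNPs sharing one intra-cluster bus
GSNPS_SHARING_MESH = 16    # worst case: all 16 gSNPs transmit as often as possible
INTER_TILE_CYCLES = 2      # an inter-Tile transfer takes two cycles


def bus_rate_bps(width_bits, clock_hz, cycles_per_transfer=1, sharers=1):
    """Bits per second available to each of `sharers` equal users of one bus."""
    return width_bits * clock_hz / (cycles_per_transfer * sharers)


# Peak rate of one 16-bit bus moving one value per cycle.
intra_cluster_bps = bus_rate_bps(BUS_WIDTH_BITS, CLOCK_HZ)

# Lower bound per SNP when all four SNPs of a Cluster share the bus.
per_snp_floor_bps = bus_rate_bps(BUS_WIDTH_BITS, CLOCK_HZ, sharers=SNPS_PER_CLUSTER)

# Assumed worst case on the inter-Tile mesh: two cycles per transfer, 16 sharers.
inter_tile_worst_bps = bus_rate_bps(
    BUS_WIDTH_BITS,
    CLOCK_HZ,
    cycles_per_transfer=INTER_TILE_CYCLES,
    sharers=GSNPS_SHARING_MESH,
)

print(f"intra-cluster bus:     {intra_cluster_bps / 1e9:.2f} Gbps")
print(f"per-SNP lower bound:   {per_snp_floor_bps / 1e9:.2f} Gbps")
print(f"inter-Tile worst case: {inter_tile_worst_bps / 1e6:.0f} Mbps")
```

Under these assumptions the three values come out at 20 Gbps, 5 Gbps, and 625 Mbps, matching the numbers given in the text.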
Fig. 4. Detailed view of the SNP bus interface.

Communication at any level of the hierarchy can be handled either implicitly as part of an instruction, or explicitly as a separate MOVE instruction. Fig. 5a shows an example instruction sequence that performs this type of data transfer. If data Z needs to be transferred from the register of SNP_i to the register of SNP_j, SNP_i needs a MOVE to send data onto the bus, and in the next cycle, SNP_j needs another MOVE to get the data from the bus and store it locally. Additionally, some instructions contain a single bit field that, when enabled, writes the result to the local bus as well as local memory. Every instruction is capable of an implicit read through

TABLE 1
SNP Energy Breakdown for Different Operations

Operation               Sched. Table &   Reg. File   Execution   Total Energy   Implicit MOVE (a)   Total Energy with
                        Decoder (pJ)     (pJ)        (pJ)        (pJ)           (pJ)                Implicit MOVE (pJ)
Load/Store (64b)        1.358            0.32        2.234       3.913          0.17                4.082
8x8 LUT Op              1.358            0.32        0.283       1.962          0.17                2.131
8x16 LUT Op             1.358            0.32        0.563       2.242          0.17                2.411
8x32 LUT Op             1.358            0.32        1.123       2.802          0.17                2.971
Add/Simple Logical Op   1.358            0.32        0.017       1.696          0.17                1.866
Shift                   1.358            0.32        0.035       1.714          0.17                1.883
Fused Datapath Op       1.358            0.32        0.018       1.697          0.17                1.866
Intra-cluster Move      1.358            0.32        0.169       1.848          0.17                2.018
Inter-cluster Move      1.358            0.32        0.511       2.189          0.17                2.359
Inter-tile Move         1.358            0.32        0.678       2.356          0.17                2.526

Leakage Power (mW/SNP): 3.024

(a) Energy for an implicit move is the same as the execution energy for an intra-cluster move.
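
As a consistency check on Table 1, the sketch below rebuilds each total as the shared schedule-table/decoder and register-file energy plus the per-operation execution energy, and then estimates a kernel's energy as dynamic plus static energy in the way Section 3 describes. The dictionary keys, function names, and the operation counts in the usage lines are illustrative assumptions and are not taken from the paper; only the energy values, the 800 ps clock period, and the 83-cycle AES figure come from the text.

```python
# Energies from Table 1, in picojoules.
SCHED_DECODE_PJ = 1.358        # schedule table and decoder, paid by every operation
REG_FILE_PJ = 0.32             # register file, paid by every operation
LEAKAGE_MW_PER_SNP = 3.024     # static leakage power of one SNP
CLOCK_PERIOD_NS = 0.8          # 800 ps clock period (1.25 GHz)

EXECUTION_PJ = {
    "load_store_64b": 2.234,
    "lut_8x8": 0.283,
    "lut_8x16": 0.563,
    "lut_8x32": 1.123,
    "add_or_logic": 0.017,
    "shift": 0.035,
    "fused": 0.018,
    "move_intra_cluster": 0.169,
    "move_inter_cluster": 0.511,
    "move_inter_tile": 0.678,
}


def op_energy_pj(op, implicit_move=False):
    """Total energy of one operation; an implicit MOVE adds one intra-cluster move."""
    energy = SCHED_DECODE_PJ + REG_FILE_PJ + EXECUTION_PJ[op]
    if implicit_move:
        energy += EXECUTION_PJ["move_intra_cluster"]
    return energy


def latency_ns(cycles):
    """Latency as cycle count times the clock period."""
    return cycles * CLOCK_PERIOD_NS


def kernel_energy_pj(op_counts, n_snps, cycles):
    """Dynamic energy (sum over operation counts) plus static leakage energy."""
    dynamic = sum(count * op_energy_pj(op) for op, count in op_counts.items())
    static = LEAKAGE_MW_PER_SNP * n_snps * latency_ns(cycles)  # mW * ns = pJ
    return dynamic + static


# The totals agree with Table 1 to within rounding of the last digit.
print(round(op_energy_pj("add_or_logic"), 3))
print(round(op_energy_pj("load_store_64b", implicit_move=True), 3))

# Hypothetical operation mix on 4 SNPs over 83 cycles (counts are made up).
example_counts = {"lut_8x32": 10, "add_or_logic": 5}
print(round(kernel_energy_pj(example_counts, n_snps=4, cycles=83), 1))
```

With the 83 cycles quoted for one AES encryption, `latency_ns(83)` gives the 66.4 ns listed for HASK in Table 3, which suggests the table's latency column was derived the same way.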

the use of virtual register ports, which map directly into the register files of other SNPs within a cluster (Fig. 4). This method of communication is only available on the intra-cluster bus (Fig. 5b). The output Z of the operation in SNP_i can be directly sent onto the bus, and in the case that Z is used immediately in SNP_j as an input, it can be directly read from the bus because the bus is treated as a virtual register. Implicit moves are useful in many security applications, including AES, where four SNPs within one Cluster can operate on data cyclically and in parallel, taking only 83 cycles for one encrypt operation. This structure allows for very high data availability at the lowest level where it is needed, while reducing interconnect complexity, routing delays, and communication energy between the upper levels.

2.4 Modeling Approach

We developed a register-transfer level (RTL) model of the HASK framework to obtain performance characterization of the architecture. Key components (e.g. datapath, register file, etc.) were modeled in RTL then synthesized to a 32 nm technology library provided by Synopsys using DesignCompiler. The area, delay, and power consumption (considering a 12.5 percent activity factor) for each of these components were then used to model a full SNP. Based on the area estimates for a single SNP, the estimated communication delay and energy requirements were computed using an RC wire-loading model assuming a standard Fan-Out of 4 load on each bus line. The RTL model for a single SNP has been extensively validated to confirm that each operation functions correctly in simulations using Synopsys VCS. Table 2 provides a summary of the key model parameters.

TABLE 2
SNP Key Parameters

SNP Area               70,000 μm²
Critical Path Delay    790 ps
Max Clock Frequency    1.25 GHz
Registers              32 x 8 b
Schedule Table         128 x 80 b
LUT Memory             4 kB
Data Memory            4 kB

Memory elements (schedule table, LUT and data memory) were modeled using the CACTI toolset to determine area, energy, and delay estimates for SRAM arrays of the appropriate sizes. For the schedule table and 2 kB of the 8 kB main memory, the SRAM design was assumed to be read-skewed and a 40 percent reduction in energy use for read operations was accordingly considered [4]. CACTI models a memory bank assuming that it is part of a larger cache, and therefore includes latency and energy components not relevant to this SRAM model. These components, including the H-tree delay and sub-array output driver energy, are therefore removed. Wordline segmentation using AND-gates is added as an overhead to the CACTI model, and LUT accesses for 8, 16, and 32 bits are similarly reduced to the appropriate proportion of the data access energy reported by CACTI.

The delays for each component were used to compute the critical path through a single SNP. The critical path lies on the datapath of a memory read operation, and the total path delay is calculated to be approximately 800 ps, yielding a maximum clock speed of 1.25 GHz. The estimated area for a single SNP is approximately 70,000 μm², about 60 percent of which is the 8 kB SRAM array, 18 percent the datapath elements, and the remaining 22 percent the schedule table and register file. The energy requirement for each major operation type is presented in Table 1 along with the estimated static leakage power consumption of a single SNP. The majority of the leakage energy comes from the SRAM arrays (schedule table, LUT memory, and data memory). Unfortunately, CACTI cannot effectively model advanced manufacturing processes such as High-K Metal Gates or strained channel devices [20]. Hence, the actual leakage power of a HASK system should be substantially lower than the reported result.

2.5 Design Space Exploration

The spatio-temporal computing model of HASK provides the opportunity to explore the right balance of spatial and temporal computing during application mapping to achieve optimal energy efficiency. We select SHA-1 as an example kernel to study this. We vary the number of SNPs from 8 to 1,024 to map this kernel, and observe varying performance and energy profiles. To map SHA-1, we first created its control and data flow graph and then translated the operations into a set of instructions for the SNP architecture. By increasing the number of SNPs, the total 80 rounds required for SHA-1 can be distributed into multiple pipelining stages. In general, higher SNP usage resulted in fewer cycles between hashed blocks; however, increasing pipelining stages resulted in diminishing returns, as each stage begins to be dominated by the transfer time rather than actual computation. If area is not a constraint, one can solve the presented optimization problem to find the point of minimal energy delay product (EDP) as shown in Fig. 6. When data transfer is not the dominant contributor to total delay (in the case of 8-128 SNPs), performance improves as the number of SNPs increases. We observe that using 128 SNPs is very close to optimal in terms of EDP, which combines the impact from both delay and energy, while using substantially less area than alternative implementations with 256 SNPs or more.

3 PERFORMANCE ANALYSIS

Our application suite consists of AES, Blowfish, IDEA, SHA-1, MD5, and CAST-128. It includes a combination of both symmetric key

ciphers and cryptographically secure hashes that are representative of the security domain. Each benchmark was implemented in the HASK framework, on an Altera Stratix V FPGA (E series, 28-nm CMOS process, 0.9 V supply), and in software on a quad-core Intel Q8200 processor (45 nm, 0.85-1.35 V). For each case, the power, latency, EDP, and area were collected for a single instance of a given kernel working on a single block of data. Additionally, maximum throughput values for a single instance of the kernel are presented. For HASK and FPGA, we also present the size of the configuration bitstream required to program the device. The results are shown in Table 3. The HASK, FPGA, and CPU platforms differ greatly, and to facilitate a fair comparison, all devices are scaled to the same process node (32 nm) and voltage (0.95 V), and the same 12.5 percent switching activity is used for power analysis.

Fig. 6. Results for design space exploration in case of SHA-1 by trading off spatial versus temporal computing: (a) Delay, (b) Energy, and (c) EDP.

Each application was hand-mapped to the HASK framework using the approach outlined in Section 2.5. We considered a soft area constraint of 4 SNPs with an exception for IDEA, where nine SNPs were used to allow for pre-loading the round keys, greatly reducing the energy cost from load operations. Latency values were computed assuming an 800 ps clock period (1.25 GHz), based on the critical path delay, and multiplying by the total number of cycles required. To obtain the dynamic energy values, the number of each operation type for a given application was counted and then multiplied by the corresponding energy consumption, as presented in Table 1. Static power was computed by taking the total leakage power per SNP (Table 1), then multiplying by the number of required SNPs and the latency of the application. Energy values presented are the sum of the static and dynamic energy.

FPGA mapping was accomplished by describing an application in Verilog and then compiling it to the Stratix V using Quartus II. To match the HASK memory, access to an 8 kB memory array is assumed to load the initial values. Latency estimates were obtained using TimeQuest, and energy estimates were obtained by multiplying the number of effective cycles, cycle time, and power estimates reported in PowerPlay. To estimate the area, we multiply the ALM tile area, which includes estimated routing area [19], by the number of utilized ALMs reported by Quartus II.

The software implementations use the crypto++ C++ library, compiled and optimized for the target architecture, and timestamps are measured using the C++ chrono library. The programs run 3E+5 input vectors 200 times each and return the average energy and delay values per application to mitigate measurement noise. The target Q8200 processor does not support the AES-NI instruction set extension, but can exploit SSE (Streaming SIMD Extensions) instructions to enhance performance. This is chosen to illustrate performance on a modern superscalar processor without any form of hardware acceleration. The CPU power consumption is assumed to be half the rated TDP (95 W) [7] divided by 4, since only one of the four cores was active. We estimate that a single core occupies approximately one quarter of the die shown [18], or roughly 25 mm² at 45 nm, and use this approximation when calculating the throughput per unit area. Results for the Q8200 are presented in the “GPP” columns of Table 3.

The area results presented for both HASK and FPGA include the area of a single scaled GPP core. This is more representative of a real system where a GPP core would be present, and the security tasks would be offloaded to an accelerator. The table shows the average ratio (GPP over HASK and FPGA over HASK) for each parameter, which is computed by averaging the individual ratios for all benchmarks. This is done to accommodate wide variations in latency/energy across the benchmarks. Compared to GPP, FPGA and HASK improve latency about 40 and 10 percent, respectively. As a result, both platforms improve iso-area throughput substantially. Similarly, both accelerators see an order of magnitude improvement in energy efficiency, specifically, average EDP improvements of 34X (FPGA) and 45X (HASK). These improvements for both FPGA and HASK relative to GPP are due to the following common factors:

• HASK and FPGA perform their computations in light-weight processing elements.
• The GPP’s inclusion of hardware structures not beneficial to the target domain leads to poor resource utilization.
• The highly parallel nature of HASK and FPGA lends itself better to pipelining of processing steps.

TABLE 3
Comparison of Latency, Throughput, Energy and EDP among HASK, FPGA and GPP

                                       Latency (ns)       Throughput (bps/mm²)        Energy/bit (pJ/bit)                EDP (nJ-ns)
Kernel                       HASK     FPGA      GPP     HASK     FPGA      GPP     HASK     FPGA      GPP     HASK     FPGA      GPP
AES                          66.4     33.0   128.20   152.48   306.82    78.98    14.22    38.81   470.60  1.21E+2  1.64E+2  7.72E+3
Blowfish                     93.6     93.5    82.06    54.09    54.14    61.70     9.96    49.01   385.58  5.97E+1  2.93E+2  2.02E+3
CAST-128                    167.2    128.0    88.97    90.83    39.55    56.90    35.30    52.07   453.35  3.78E+2  4.27E+2  2.58E+3
IDEA                         70.4    126.0   155.62   647.19    40.18    32.53    37.09    14.43  1386.87  1.67E+2  1.16E+2  1.38E+4
MD5                         566.4    396.8   323.91    35.75    12.76    15.63    17.41    51.47   751.05  5.05E+3  1.05E+4  1.25E+5
SHA-1                       714.4    584.0   495.15    14.17     4.33     5.11    21.58   130.37  1755.04  7.89E+3  3.90E+4  4.45E+5
Avg. Ratio (GPP / HASK)     1.14X    1.39X        -     5.1X     1.5X        -      41X      25X        -      45X      34X        -
Avg. Ratio (FPGA / HASK)    0.93X        -        -     4.3X        -        -     3.1X        -        -     2.5X        -        -
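
The EDP column and the Avg. Ratio rows of Table 3 can be cross-checked from the latency and energy/bit columns. The Python sketch below assumes that EDP is the energy per processed block multiplied by latency, and that one block is the standard input-block size of each algorithm (128 bits for AES, 64 bits for Blowfish, CAST-128 and IDEA, 512 bits for MD5 and SHA-1). These block sizes and the helper names are assumptions made here; the text does not state them.

```python
# Latency (ns) and energy/bit (pJ/bit) transcribed from Table 3, ordered (HASK, FPGA, GPP).
# "bits" is the ASSUMED number of bits processed per kernel invocation.
HASK, FPGA, GPP = 0, 1, 2

TABLE3 = {
    "AES":      {"bits": 128, "latency": (66.4, 33.0, 128.20),   "energy_per_bit": (14.22, 38.81, 470.60)},
    "Blowfish": {"bits": 64,  "latency": (93.6, 93.5, 82.06),    "energy_per_bit": (9.96, 49.01, 385.58)},
    "CAST-128": {"bits": 64,  "latency": (167.2, 128.0, 88.97),  "energy_per_bit": (35.30, 52.07, 453.35)},
    "IDEA":     {"bits": 64,  "latency": (70.4, 126.0, 155.62),  "energy_per_bit": (37.09, 14.43, 1386.87)},
    "MD5":      {"bits": 512, "latency": (566.4, 396.8, 323.91), "energy_per_bit": (17.41, 51.47, 751.05)},
    "SHA-1":    {"bits": 512, "latency": (714.4, 584.0, 495.15), "energy_per_bit": (21.58, 130.37, 1755.04)},
}


def edp_nj_ns(kernel: str, platform: int) -> float:
    """Energy-delay product in nJ-ns: (energy/bit x bits per block) x latency."""
    row = TABLE3[kernel]
    energy_nj = row["energy_per_bit"][platform] * row["bits"] / 1000.0  # pJ -> nJ
    return energy_nj * row["latency"][platform]


def average_ratio(metric: str, numerator: int, denominator: int) -> float:
    """Arithmetic mean over all kernels of the per-kernel ratio numerator / denominator."""
    ratios = [row[metric][numerator] / row[metric][denominator] for row in TABLE3.values()]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    print(f"AES EDP on HASK: {edp_nj_ns('AES', HASK):.1f} nJ-ns")
    print(f"Latency, GPP / HASK:  {average_ratio('latency', GPP, HASK):.2f}X")
    print(f"Latency, GPP / FPGA:  {average_ratio('latency', GPP, FPGA):.2f}X")
    print(f"Latency, FPGA / HASK: {average_ratio('latency', FPGA, HASK):.2f}X")
    print(f"Energy/bit, FPGA / HASK: {average_ratio('energy_per_bit', FPGA, HASK):.1f}X")
```

Averaging the per-kernel ratios, rather than taking the ratio of column averages, keeps the long-latency hash kernels (MD5, SHA-1) from dominating the result, which is consistent with the stated motivation of handling wide variations across benchmarks.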

Fig. 7. The effect of scaling operating frequency (with associated scaling of voltage) on (a) energy, (b) throughput, and (c) EDP improvement (higher is better) in HASK
compared to FPGA.

• Complex and fused functions are performed in dedicated hardware structures and/or LUTs.

For the majority of the benchmark applications, HASK latency is slightly worse than that of FPGA. However, since HASK implementations generally use less die area than FPGA, the iso-area throughput is 4.3X better than that of FPGA, and HASK uses 3.1X less energy than FPGA on average. Moreover, a 2.5X improvement in EDP is observed. Note that in practice, we anticipate the energy improvement of HASK over FPGA to be even higher. As stated in Section 2.4, the HASK leakage energy is overestimated, since CACTI cannot model leakage reduction techniques used for nanoscale processes. Conversely, the FPGA leakage is underestimated, since Quartus does not report contributions from programmable interconnects and embedded memory blocks. The primary reasons behind the improvement in energy efficiency for HASK over FPGA are as follows: (1) HASK supports LUT operations of different bit-widths and has dedicated hardware for fused logic operations, which reduces the total number of operations and hence the energy; (2) the spatio-temporal mapping greatly reduces the interconnect complexity and energy; and (3) the highly customizable memory structure of FPGAs results in energy-inefficient memory accesses [19]. Additionally, the HASK configuration bitstream is on average 2.3 percent smaller than that of the equivalent FPGA implementation. It is worth noting that HASK's energy performance on the IDEA cipher is poor compared to FPGA. The major reason behind this is that IDEA uses multiplications, which must be implemented as LUT operations in the SNP. These memory accesses are substantially less energy efficient than the embedded multiplier blocks of the Stratix V, so HASK's energy efficiency suffers.

We also consider a comparison to three other prominent alternative hardware acceleration platforms: (a) AsAP, (b) Morphosys, and (c) graphics processing units (GPUs). GPUs are suitable for floating-point-intensive kernels, but cannot provide a fine-grained spatio-temporal mapping like HASK, AsAP, and Morphosys, resulting in sub-optimal performance. In contrast to these frameworks, HASK has four key distinguishing factors: (1) HASK requires no shared memory on-die, (2) it provides hardware support for variable input and output LUT operations, (3) SNPs contain local fixed-function optimization for security, and (4) HASK implements a hierarchical interconnect appropriate for security applications. We present Singh's [2] and Liu's [7] findings scaled to a 32 nm process node operating at 0.95 V and compare them with HASK results in Table 4. We derive iso-area throughput and energy per bit as points of comparison between these disparate computing fabrics. Since no energy results were provided for Morphosys, only throughput is compared.

TABLE 4
Comparison of HASK with CGRA/GPU with Respect to Throughput and Energy

Benchmark   Fabric      Scaled Throughput (Gbps/mm²)   Energy/Bit (pJ/bit)
AES         HASK                                7.44                 13.20
            AsAP                                1.29                 742.9
            GPU                                 0.39                2147.2
IDEA        HASK                               14.03                     -
            Morphosys                          10.34                     -

A more conservative operating frequency of 500 MHz has also been considered for HASK. In this case, the operating voltage can be reduced to 0.8 V. Energy and latency values are scaled accordingly for all the components of the HASK model, and the improvement versus FPGA for all benchmark applications is compared in Fig. 7. Throughput improvement scales down linearly to 1.7X versus FPGA; however, total energy increases due to a greater leakage energy contribution. As a result, the EDP improvement compared to an FPGA at its maximum frequency goes down to 0.64X. However, the energy required per bit for HASK still remains 2X better than FPGA.

4 CONCLUSIONS
We have presented HASK, a novel reconfigurable framework for accelerating security applications. The proposed architecture offers comparable latency and die area to FPGA with an average of 3X improvement in energy efficiency. This is achieved using high-density SRAM for lookup operations, a sparse interconnect, application-level pipelining, and a custom datapath tailored to the computing needs of the security application domain. We presented simulation results which show the improvement compared to implementations in FPGAs and GPPs. The interconnect fabric can accommodate a large number of SNPs and hence is scalable, enabling the mapping of larger kernels or even parallel instances of a kernel. The performance of HASK is also compared with alternative CGRAs such as AsAP and Morphosys, as well as GPUs. The memory-dominated architecture of HASK is very amenable to technology scaling. Hence, emerging high-density nanoscale memory technologies can significantly improve its performance and area. Future work will include the development of an application mapping tool and adding support for dynamic instruction scheduling.

ACKNOWLEDGMENTS
This work was supported in part by Semiconductor Research Corporation (SRC) under Grant 2015-EP-2650.

REFERENCES
[1] P. Kocher, R. Lee, G. McGraw, A. Raghunathan, and S. Ravi, “Security as a new dimension in embedded system design,” in Proc. Des. Autom. Conf., 2004, pp. 753–760.
[2] H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. Chaves Filho, “MorphoSys: An integrated reconfigurable system for data-parallel and computation-intensive applications,” IEEE Trans. Comput., vol. 49, no. 5, pp. 465–481, May 2000.
[3] K. Eguro, “RaPiD-AES: Developing an encryption-specific FPGA architecture,” M.S. thesis, Univ. Washington, Seattle, WA, USA, 2002.

[4] S. Paul, S. Chatterjee, S. Mukhopadhyay, and S. Bhunia, “Energy-efficient reconfigurable computing using a circuit-architecture-software co-design approach,” IEEE J. Emerging Sel. Topics Circuits Syst., vol. 1, no. 3, pp. 369–380, Sep. 2011.
[5] D. A. Osvik, J. W. Bos, D. Stefan, and D. Canright, Fast Software AES Encryption. Berlin, Germany: Springer, 2010, pp. 75–93.
[6] S. Gueron, “Intel Advanced Encryption Standard (AES) new instructions set,” Intel Corp., Santa Clara, CA, USA, Tech. Rep. 323641-001, 2012.
[7] B. Liu and B. Baas, “Parallel AES encryption engines for many-core processor arrays,” IEEE Trans. Comput., vol. 62, no. 3, pp. 536–547, Mar. 2013.
[8] K. Compton and S. Hauck, “Totem: Custom reconfigurable array generation,” in Proc. 9th Annu. IEEE Symp. Field-Programmable Custom Comput. Mach., 2001, pp. 111–119.
[9] J. Daemen and V. Rijmen, “AES proposal: Rijndael,” in Proc. 1st Adv. Encryption Standard Candidate Conf., Mar. 1999.
[10] B. Schneier, “Description of a new variable-length key, 64-bit block cipher (Blowfish),” in Fast Software Encryption. Berlin, Germany: Springer, 1994.
[11] C. Adams, “The CAST-128 encryption algorithm,” Entrust Technol., Addison, TX, USA, RFC 2144, May 1997.
[12] X. Lai, On the Design and Security of Block Ciphers (ETH Series in Information Processing). Konstanz, Germany: Hartung-Gorre Verlag, 1992.
[13] R. L. Rivest, The MD5 Message-Digest Algorithm, MIT Lab. Comput. Sci. RSA Data Security, Cambridge, MA, USA, 1992.
[14] D. Eastlake and P. Jones, “US secure hash algorithm 1 (SHA1),” Motorola and Cisco Systems, San Jose, CA, USA, RFC 3174, Sep. 2001.
[15] D. N. Truong, W. H. Cheng, T. Mohsenin, Y. Zhiyi, A. T. Jacobson, G. Landge, M. J. Meeuwsen, C. Watnik, A. T. Tran, X. Zhibin, E. W. Work, J. W. Webb, P. V. Mejia, and B. M. Baas, “A 167-processor computational platform in 65 nm CMOS,” IEEE J. Solid-State Circuits, vol. 44, no. 4, pp. 1130–1144, Apr. 2009.
[16] X. Wu, X. Chen, and S. L. Hurst, “Mapping of Reed-Muller coefficients and the minimization of exclusive OR-switching functions,” IEE Comput. Digital Tech., vol. 129, no. 1, pp. 15–20, 1982.
[17] A. Rahman, S. Das, A. P. Chandrakasan, and R. Reif, “Wiring requirement and three-dimensional integration technology for field programmable gate arrays,” IEEE Trans. Very Large Scale Integration (VLSI) Syst., vol. 11, no. 1, pp. 44–54, Feb. 2003.
[18] V. George, S. Jahagirdar, C. Tong, K. Smits, S. Damaraju, S. Siers, V. Naydenov, T. Khondker, S. Sarkar, and P. Singh, “Penryn: 45-nm next generation Intel Core™ 2 processor,” in Proc. Asian Solid-State Circuits Conf., 2007, pp. 14–17.
[19] H. Wong, V. Betz, and J. Rose, “Comparing FPGA vs. custom CMOS and the impact on processor microarchitecture,” in Proc. Int. Symp. Field Program. Gate Arrays, 2011, pp. 5–14.
[20] S. Li, K. Chen, J. H. Ahn, J. B. Brockman, and N. P. Jouppi, “CACTI-P: Architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques,” in Proc. Int. Conf. Comput.-Aided Design, 2011, pp. 684–701.
