An Embedded Memory-Centric Reconfigurable Hardware Accelerator for Security Applications

Christopher Babecki, Student Member, IEEE, Wenchao Qian, Student Member, IEEE, Somnath Paul, Member, IEEE, Robert Karam, Student Member, IEEE, and Swarup Bhunia, Senior Member, IEEE

Abstract—Security has emerged as a critical need in today's computer applications. Unfortunately, most security algorithms are computationally expensive and often do not map efficiently to general purpose processors. Fixed-function accelerators offer significant improvement in energy-efficiency, but they do not allow more than one application to reuse hardware resources. Mapping applications to generic reconfigurable fabrics can achieve the desired flexibility, but only at the cost of area and energy efficiency. This paper presents a novel reconfigurable framework, referred to as hardware accelerator for security kernels (HASK), for accelerating a wide array of security applications. This framework incorporates a coarse-grained datapath, support for lookup functions, and flexible interconnect optimizations, which enable on-demand pipelining and parallel computations in multiple ultralight-weight processing elements. These features are highly effective for energy-efficient operation in a diverse set of security applications. Through simulations, we have compared the performance of HASK to software and field programmable gate array (FPGA) platforms. Simulation results for a set of six common security applications show comparable latency between HASK and FPGA, with 2.5X improvement in energy-delay product and 4X improvement in iso-area throughput. HASK also shows 5X improvement in iso-area throughput and 45X improvement in energy-delay product compared to optimized software implementations.

Index Terms—Security applications, domain-specific hardware accelerator, energy-efficiency, reconfigurable computing

1 INTRODUCTION

Security is becoming an important design metric, specifically in the domain of embedded systems [1]. Due to the ever-growing importance of data security in these systems, the inclusion of hardware cryptographic modules is rapidly becoming a requirement. In particular, hardware modules for diverse security tasks, ranging from encryption to hashing, are more effective than software realizations in terms of meeting energy-efficiency and/or real-time performance demands [2], [3], [4]. These security modules are implemented either inside embedded processors or outside them as co-processors or hardware accelerators [2]. Typically, these modules are realized in one of two ways: (a) as a specialized custom hardware module, or (b) as a reconfigurable accelerator using a field programmable gate array (FPGA) or similar platform. While the first option provides optimal performance and energy-efficiency, the second offers the flexibility to map a variety of security tasks satisfying diverse compute and communication requirements.

The Advanced Encryption Standard (AES) is widely used in embedded applications to achieve data security. Unfortunately, most encryption algorithms like AES are slow on general purpose processors (GPPs), with a single block encryption typically taking hundreds of cycles on an embedded processor without hardware support [5]. This strains real-time applications, where a software-only approach may not meet throughput requirements. The energy efficiency of security algorithms on GPPs is also poor, which can be unattractive in energy-constrained systems.

One approach to alleviating this energy and performance bottleneck is to develop dedicated hardware for specific application kernels like AES, such as the AES-NI instruction set extension for the x86 platform [6]. This yields excellent performance and energy results, but only functions for a specific algorithm and requires substantial development and verification effort for each kernel. Therefore, this approach does not scale well to diverse security kernels [3]. Similarly, FPGAs can be used to improve flexibility, and while this may enhance energy efficiency over software-only implementations, the highly reconfigurable interconnect architecture can impose significant penalties in power, area, and latency [17]. Instead, domain-specific reconfigurable architectures have been introduced which demonstrate improved performance and power results over general purpose reconfigurable platforms [2], [7], [8]. Notable architectures include the asynchronous array of simple processors (AsAP) [7], [15] and MorphoSys [2], while the Totem system [8] aims to automate the process of domain-specific accelerator design. However, none of these platforms has been designed specifically for the domain of security acceleration.

To address this important need, in this paper we present a hardware accelerator for security kernels (HASK), a novel and highly energy-efficient reconfigurable framework for the acceleration of cryptographic kernels, based on an analysis of security applications such as AES, Blowfish, IDEA, SHA-1, MD5, and CAST-128. HASK is an array of parallel, coarse-grained nano-processors with hardware support for common operations in security tasks. It aims to balance spatial and temporal computing by time-multiplexing hardware resources while providing a large number of interconnected processing elements across which a task can be distributed. The overall system serves as a loosely coupled accelerator on a shared bus accessible by both a host processor and peripherals. Fig. 1 illustrates a conceptual diagram of a modern system on a chip (SoC) which incorporates an acceleration engine for security algorithms.

HASK's primary distinctions from existing reconfigurable architectures, such as AsAP and MorphoSys, are as follows: (a) a novel, fully distributed memory architecture with multiple instruction, multiple data (MIMD) support; (b) a processing and interconnect architecture tailored to security applications; (c) fully routerless communication, which improves data transfer energy and latency; and (d) hardware support for lookup table operations with varying input and output widths, and fused logical operations to implement complex functions as atomics. In particular, the paper makes the following key contributions:

1) It proposes HASK, a scalable and energy-efficient reconfigurable architecture suitable for the domain of security applications. HASK supports spatio-temporal computing with an array of light-weight processing elements. The datapath and interconnect structure of HASK are optimized for commonly occurring operations and communication patterns in the target applications.
2) It explains the micro-architecture level optimizations for HASK and the corresponding design trade-offs. It also describes the design space exploration that trades off spatial against temporal computing to optimize energy efficiency.
3) It evaluates the performance and energy efficiency of HASK for six common applications and compares the results against a commercial FPGA device, a GPP, and the alternative reconfigurable accelerators MorphoSys [2] and AsAP [15].

C. Babecki, W. Qian, R. Karam, and S. Bhunia are with the Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106. E-mail: {christopher.babecki, wenchao.qian, robert.karam, swarup.bhunia}@case.edu.
S. Paul is with Intel Labs, Intel Corporation, Hillsboro, OR 97124. E-mail: [email protected].
Manuscript received 14 Nov. 2014; revised 13 Nov. 2015; accepted 9 Dec. 2015. Date of publication 24 Dec. 2015; date of current version 14 Sept. 2016.
Recommended for acceptance by W. W. Ro.
Digital Object Identifier no. 10.1109/TC.2015.2512858
0018-9340 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 10, OCTOBER 2016 3197
Fig. 5. Two modes of inter-SNP communications: (a) Explicit data transfer through
a MOVE instruction and (b) implicit transfer using intra-cluster bus as virtual
register.
Fig. 3. Hierarchical bus-based interconnect structure of: (a) a single Cluster of 4
SNPs, (b) a single Tile of four Clusters. The tiles are connected in a mesh topology.
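The two inter-SNP transfer modes of Fig. 5 can be made concrete with a small behavioral sketch. This is an illustrative Python toy, not the authors' RTL or instruction set: the names (`Bus`, `SNP`, `move_out`, `xor_and_broadcast`) and the one-value-per-cycle bus are modeling assumptions.

```python
# Behavioral sketch (not the HASK RTL) of Fig. 5's two transfer modes,
# assuming a shared intra-cluster bus that holds one value per cycle.

class Bus:
    """Shared intra-cluster bus; carries at most one value per cycle."""
    def __init__(self):
        self.value = None

class SNP:
    """Minimal stand-in for a nano-processor with a local register file."""
    def __init__(self, bus):
        self.bus = bus
        self.regs = {}

    # Mode (a): explicit transfer. The sender issues a MOVE onto the bus;
    # the receiver issues a second MOVE on the next cycle to capture it.
    def move_out(self, reg):
        self.bus.value = self.regs[reg]

    def move_in(self, reg):
        self.regs[reg] = self.bus.value

    # Mode (b): implicit transfer. An instruction's result is written to
    # the bus as a side effect (the single-bit field described in the
    # text), and another SNP reads the bus as a "virtual register".
    def xor_and_broadcast(self, ra, rb, rd):
        self.regs[rd] = self.regs[ra] ^ self.regs[rb]
        self.bus.value = self.regs[rd]      # broadcast bit enabled

    def add_from_bus(self, rb, rd):
        self.regs[rd] = (self.bus.value + self.regs[rb]) & 0xFF

bus = Bus()
snp_i, snp_j = SNP(bus), SNP(bus)
snp_i.regs.update({"r0": 0x5A, "r1": 0x0F})
snp_j.regs.update({"r2": 0x01})

# Explicit: a MOVE pair, costing two instructions over two cycles.
snp_i.move_out("r0")
snp_j.move_in("r3")          # r3 now holds 0x5A

# Implicit: compute, broadcast, and consume via the virtual register.
snp_i.xor_and_broadcast("r0", "r1", "r4")   # 0x5A ^ 0x0F = 0x55
snp_j.add_from_bus("r2", "r5")              # 0x55 + 0x01 = 0x56
```

The sketch shows why implicit moves matter: the explicit path spends a dedicated MOVE pair, while the implicit path folds the broadcast into the producing instruction and lets the consumer treat the bus as an extra register port.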
[4], as well as fine-grained wordline segmentation allowing efficient access to the variable width LUTs. The remaining memory is used as a byte-addressable scratchpad memory that stores inputs and resultant data.

For complex logical operations, such as (A ∧ B) ∨ (A ∧ C), it could take up to 4 cycles using atomic logic operations. Though HASK supports bit-sliced logic operations, mapping this complex operation requires substantially more energy than dedicated logic. In addition, a LUT-based approach would require a substantial amount of memory space to hold the responses. Instead, these complex logical operations are mapped to a novel reconfigurable logic datapath inside each SNP capable of implementing arbitrary logical functions of up to three inputs. This is realized and encoded using a Reed-Muller expansion, which results in substantially fewer required transistors than a canonical representation for an arbitrary function of three inputs [16]. For example, the fused datapath can be configured to perform (A ∧ B) ∨ ¬(A ∨ C) in a single instruction as 1 ⊕ A ⊕ C ⊕ AB ⊕ AC.

Most security kernels are dominated by byte-level operations (e.g. S-boxes in AES, Blowfish, and CAST-128, and Galois Field multiplications in AES). To mitigate this requirement, all data operations in HASK support variable input/output sizes down to a single byte for increased energy efficiency. The register file is also designed accordingly to be byte addressable, enabling more compact information storage.

Finally, many security tasks are amenable to additional parallelism beyond what a VLIW-2 architecture allows. For example, consider the MixColumns step of the AES algorithm, which requires 12 XOR operations comprising four independent expressions. The SNP's on-demand SIMD instruction exploits this data parallelism by simultaneously reading four sets of input operands from the register file. The inputs are then run through four separate datapaths and written back through separate write ports. This improves the overall execution time of the AES algorithm by about 10 percent. Addition operations can be accelerated in this manner as well.

2.3 Interconnect Structure
HASK employs a sparse hierarchical interconnect that exploits data locality. SNPs within the same Cluster have a fully connected shared bus structure (Fig. 3a). Each 16-bit connection allows for the transmission of a single 8- or 16-bit value per cycle; at 1.25 GHz (see Section 2.4), this leads to a rate of 20 Gbps. Similarly, a 16-bit shared bus is used for inter-cluster communication (Fig. 3b); however, unlike the fully connected intra-cluster bus, the inter-cluster bus can only be reached from one SNP per cluster, termed the Gateway SNP, or gSNP, through which all communications must be routed. Ideally, applications can be mapped such that only the gSNP needs to communicate with other clusters, maximizing the inter-cluster bus throughput. The lower bound of the bandwidth per SNP is 1/4 of the intra-cluster bus, or 5 Gbps. At higher levels, gSNPs can communicate with each of the 16 gSNPs in adjacent Tiles through a 16-bit wide 2D bi-directional mesh interconnect structure. This enables the architecture to easily add a large number of tiles (and hence, SNPs). A gSNP can broadcast to multiple inter-Tile buses in the same cycle, allowing the architecture to scale to an arbitrary size while maintaining limited connectivity between any two nodes. Data transmission between tiles requires two cycles without any stall. Thus, the cycle time is independent of communication latency between distant nodes. In the worst case, when all 16 gSNPs need to communicate as often as possible, this communication is limited to 625 Mbps.

These communication buses form a time-multiplexed programmable interconnect. Because the communication requirements for the security applications are both constant and known a priori, they can be scheduled at compile time. Routing information is stored in the instructions as an immediate value and decoded at runtime to control the communication buses. If a buffer is enabled on a given cycle, output data from the SNP's operation is written to the appropriate bus. Subsequently, other SNPs can read the data into their local register files. A detailed view of the bus output structure is provided in Fig. 4. As the communications are statically scheduled, routers are not required in the fabric, which eliminates their associated power, latency, and area overhead.

Fig. 4. Detailed view of the SNP bus interface.

Communication at any level of the hierarchy can be handled either implicitly as part of an instruction, or explicitly as a separate MOVE instruction. Fig. 5a shows an example instruction sequence that performs this type of data transfer. If data Z needs to be transferred from the register of SNPi to the register of SNPj, SNPi needs a MOVE to send the data onto the bus, and in the next cycle, SNPj needs another MOVE to get the data from the bus and store it locally. Additionally, some instructions contain a single-bit field that, when enabled, writes the result to the local bus as well as local memory. Every instruction is capable of an implicit read through
the use of virtual register ports, which map directly into the register files of other SNPs within a cluster (Fig. 4). This method of communication is only available on the intra-cluster bus (Fig. 5b). The output Z of the operation in SNPi can be directly sent onto the bus, and in the case that Z is used immediately in SNPj as an input, it can be directly read from the bus, because the bus is treated as a virtual register. Implicit moves are useful in many security applications, including AES, where four SNPs within one Cluster can operate on data cyclically and in parallel, taking only 83 cycles for one encrypt operation. This structure allows for very high data availability at the lowest level, where it is needed, while reducing interconnect complexity, routing delays, and communication energy between the upper levels.

2.4 Modeling Approach
We developed a register-transfer level (RTL) model of the HASK framework to obtain a performance characterization of the architecture. Key components (e.g. datapath, register file) were modeled in RTL, then synthesized to a 32 nm technology library provided by Synopsys using Design Compiler. The area, delay, and power consumption (considering a 12.5 percent activity factor) for each of these components were then used to model a full SNP. Based on the area estimates for a single SNP, the estimated communication delay and energy requirements were computed using an RC wire-loading model assuming a standard fan-out-of-4 load on each bus line. The RTL model for a single SNP has been extensively validated in simulations using Synopsys VCS to confirm that each operation functions correctly. Table 2 provides a summary of the key model parameters.

TABLE 2
SNP Key Parameters

  SNP Area              70,000 μm²
  Critical Path Delay   790 ps
  Max Clock Frequency   1.25 GHz
  Registers             32 x 8 b
  Schedule Table        128 x 80 b
  LUT Memory            4 kB
  Data Memory           4 kB

Memory elements (schedule table, LUT memory, and data memory) were modeled using the CACTI toolset to determine area, energy, and delay estimates for SRAM arrays of the appropriate sizes. For the schedule table and 2 kB of the 8 kB main memory, the SRAM design was assumed to be read-skewed, and a 40 percent reduction in energy use for read operations was accordingly considered [4]. CACTI models a memory bank assuming that it is part of a larger cache, and therefore includes latency and energy components not relevant to this SRAM model. These components, including the H-tree delay and sub-array output driver energy, are therefore removed. Wordline segmentation using AND gates is added as an overhead to the CACTI model, and LUT accesses for 8, 16, and 32 bits are similarly reduced to the appropriate proportion of the data access energy reported by CACTI.

The delays for each component were used to compute the critical path through a single SNP. The critical path lies on the datapath of a memory read operation, and the total path delay is calculated to be approximately 800 ps, yielding a maximum clock speed of 1.25 GHz. The estimated area for a single SNP is approximately 70,000 μm², about 60 percent of which is the 8 kB SRAM array, 18 percent the datapath elements, and the remaining 22 percent the schedule table and register file. The energy requirement for each major operation type is presented in Table 1, along with the estimated static leakage power consumption of a single SNP. The majority of the leakage energy comes from the SRAM arrays (schedule table, LUT memory, and data memory). Unfortunately, CACTI cannot effectively model advanced manufacturing processes such as high-k metal gates or strained-channel devices [20]. Hence, the actual leakage power of a HASK system should be substantially lower than the reported result.

TABLE 1
SNP Energy Breakdown for Different Operations

  Operation               Sched. Table &   Reg. File   Execution   Total Energy   Implicit       Total Energy with
                          Decoder (pJ)     (pJ)        (pJ)        (pJ)           MOVE^a (pJ)    Implicit MOVE (pJ)
  Load/Store (64b)        1.358            0.32        2.234       3.913          0.17           4.082
  8x8 LUT Op              1.358            0.32        0.283       1.962          0.17           2.131
  8x16 LUT Op             1.358            0.32        0.563       2.242          0.17           2.411
  8x32 LUT Op             1.358            0.32        1.123       2.802          0.17           2.971
  Add/Simple Logical Op   1.358            0.32        0.017       1.696          0.17           1.866
  Shift                   1.358            0.32        0.035       1.714          0.17           1.883
  Fused Datapath Op       1.358            0.32        0.018       1.697          0.17           1.866
  Intra-cluster Move      1.358            0.32        0.169       1.848          0.17           2.018
  Inter-cluster Move      1.358            0.32        0.511       2.189          0.17           2.359
  Inter-tile Move         1.358            0.32        0.678       2.356          0.17           2.526
  Leakage Power (mW/SNP)  3.024

  ^a Energy for an implicit move is the same as the execution energy for an intra-cluster move.

2.5 Design Space Exploration
The spatio-temporal computing model of HASK provides the opportunity to explore the right balance of spatial and temporal computing during application mapping to achieve optimal energy efficiency. We select SHA-1 as an example kernel to study this. We vary the number of SNPs used to map this kernel from 8 to 1,024 and observe the resulting performance and energy profiles. To map SHA-1, we first created its control and data flow graph and then translated the operations into a set of instructions for the SNP architecture. By increasing the number of SNPs, the 80 rounds required for SHA-1 can be distributed across multiple pipeline stages. In general, higher SNP usage resulted in fewer cycles between hashed blocks; however, adding pipeline stages yields diminishing returns, as each stage becomes dominated by transfer time rather than actual computation. If area is not a constraint, one can solve the resulting optimization problem to find the point of minimal energy-delay product (EDP), as shown in Fig. 6. When data transfer is not the dominant contributor to total delay (in the case of 8-128 SNPs), performance improves as the number of SNPs increases. We observe that using 128 SNPs is very close to optimal in terms of EDP, which combines the impact of both delay and energy, while using substantially less area than alternative implementations with 256 SNPs or more.

3 PERFORMANCE ANALYSIS
Our application suite consists of AES, Blowfish, IDEA, SHA-1, MD5, and CAST-128. It includes a combination of both symmetric key
Fig. 6. Results for design space exploration in case of SHA-1 by trading off spatial versus temporal computing: (a) Delay, (b) Energy, and (c) EDP.
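The spatial-versus-temporal trade-off summarized in Fig. 6 can be sketched with a toy analytical model. Every constant below is an illustrative assumption (the leakage-per-cycle figure is loosely derived from Table 1's 3.024 mW at an 800 ps cycle; the cycles per round, SNPs per stage, and transfer costs are invented), so only the qualitative shape matters: delay falls as rounds are spread across more SNPs, leakage energy grows with the SNP count, and EDP therefore has an interior minimum.

```python
# Toy model of the Section 2.5 sweep: an 80-round hash kernel folded
# across a varying number of pipeline stages. All constants are
# illustrative assumptions, not the paper's measured values.

def sha1_profile(num_snps, rounds=80, cyc_per_round=60, xfer_cyc=2,
                 e_op_pj=1.9, e_move_pj=2.4, leak_pj_per_snp_cyc=2.4):
    """Return (delay_cycles, energy_pj, edp) for one hashed block."""
    stages = max(1, num_snps // 8)           # assume 8 SNPs form one stage
    rounds_per_stage = -(-rounds // stages)  # ceil division
    # Folding rounds in space shortens per-block delay, but each extra
    # stage boundary adds an inter-stage transfer.
    delay = rounds_per_stage * cyc_per_round + (stages - 1) * xfer_cyc
    dynamic = rounds * cyc_per_round * e_op_pj + (stages - 1) * e_move_pj
    static = leak_pj_per_snp_cyc * num_snps * delay  # more SNPs leak more
    energy = dynamic + static
    return delay, energy, delay * energy

for n in (8, 32, 128, 512, 1024):
    d, e, p = sha1_profile(n)
    print(f"{n:4d} SNPs: delay={d:5d} cyc, energy={e / 1e3:8.1f} nJ, EDP={p:.3g}")
```

With these made-up constants the sweep happens to bottom out in the low hundreds of SNPs, echoing the diminishing-returns behavior the text describes, but no numeric agreement with Fig. 6 should be read into it.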
ciphers and cryptographically secure hashes that are representative of the security domain. Each benchmark was implemented in the HASK framework, on an Altera Stratix V FPGA (E series, 28 nm CMOS process, 0.9 V supply), and in software on a quad-core Intel Q8200 processor (45 nm, 0.85-1.35 V). For each case, the power, latency, EDP, and area were collected for a single instance of a given kernel working on a single block of data. Additionally, maximum throughput values for a single instance of the kernel are presented. For HASK and FPGA, we also present the size of the configuration bitstream required to program the device. The results are shown in Table 3. The HASK, FPGA, and CPU platforms differ greatly; to facilitate a fair comparison, all devices are scaled to the same process node (32 nm) and voltage (0.95 V), and the same 12.5 percent switching activity is used for power analysis.

Each application was hand-mapped to the HASK framework using the approach outlined in Section 2.5. We considered a soft area constraint of 4 SNPs, with an exception for IDEA, where nine SNPs were used to allow pre-loading of the round keys, greatly reducing the energy cost of load operations. Latency values were computed assuming an 800 ps clock period (1.25 GHz), based on the critical path delay, multiplied by the total number of cycles required. To obtain the dynamic energy values, the number of operations of each type for a given application was counted and then multiplied by the corresponding energy consumption presented in Table 1. Static energy was computed by taking the total leakage power per SNP (Table 1) and multiplying by the number of required SNPs and the latency of the application. The energy values presented are the sum of the static and dynamic energy.

FPGA mapping was accomplished by describing an application in Verilog and then compiling it to the Stratix V using Quartus II. To match the HASK memory, access to an 8 kB memory array is assumed to load the initial values. Latency estimates were obtained using TimeQuest, and energy estimates were obtained by multiplying the number of effective cycles, the cycle time, and the power estimates reported in PowerPlay. To estimate the area, we multiply the ALM tile area, which includes estimated routing area [19], by the number of utilized ALMs reported by Quartus II.

The software implementations use the Crypto++ library, compiled and optimized for the target architecture, and timestamps are measured using the C++ chrono library. The programs run 3E+5 input vectors 200 times each and return the average energy and delay values per application to mitigate measurement noise. The target Q8200 processor does not support the AES-NI instruction set extension, but can exploit SSE (Streaming SIMD Extensions) instructions to enhance performance. It was chosen to illustrate performance on a modern superscalar processor without any form of hardware acceleration. The CPU power consumption is assumed to be half the rated TDP (95 W) [7] divided by 4, since only one of the four cores was active. We estimate that a single core occupies approximately one quarter of the die [18], or roughly 25 mm² at 45 nm, and use this approximation when calculating the throughput per unit area. Results for the Q8200 are presented in the "GPP" columns of Table 3.

The area results presented for both HASK and FPGA include the area of a single scaled GPP core. This is more representative of a real system, where a GPP core would be present and the security tasks would be offloaded to an accelerator. The table shows the average ratio (GPP over HASK and FPGA over HASK) for each parameter, computed by averaging the individual ratios over all benchmarks. This accommodates the wide variations in latency/energy across the benchmarks. Compared to GPP, FPGA and HASK improve latency by about 40 and 10 percent, respectively. As a result, both platforms improve iso-area throughput substantially. Similarly, both accelerators see an order of magnitude improvement in energy efficiency, with average EDP improvements of 34X (FPGA) and 45X (HASK). These improvements for both FPGA and HASK relative to GPP are due to the following common factors:

- HASK and FPGA perform their computations in light-weight processing elements.
- The GPP's inclusion of hardware structures not beneficial to the target domain leads to poor resource utilization.
- The highly parallel nature of HASK and FPGA lends itself better to pipelining of processing steps.
TABLE 3
Comparison of Latency, Throughput, Energy and EDP among HASK, FPGA and GPP
Fig. 7. The effect of scaling operating frequency (with associated scaling of voltage) on (a) energy, (b) throughput, and (c) EDP improvement (higher is better) in HASK
compared to FPGA.
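The energy accounting described above (dynamic energy from per-operation counts times Table 1's energies, plus leakage over the kernel's latency) can be sketched directly in code. The per-operation energies and the 3.024 mW leakage figure come from Table 1; the operation mix below is hypothetical, since the per-benchmark instruction counts are not listed in the paper.

```python
# Sketch of the Section 3 energy model. Per-op energies and leakage are
# from Table 1; the operation counts are hypothetical placeholders.

ENERGY_PJ = {                 # "Total Energy" column of Table 1
    "load_store": 3.913,
    "lut_8x8":    1.962,
    "add_logic":  1.696,
    "shift":      1.714,
    "fused":      1.697,
    "move_intra": 1.848,
}
LEAKAGE_W_PER_SNP = 3.024e-3  # 3.024 mW static power per SNP (Table 1)
CLOCK_PERIOD_S = 0.8e-9       # 800 ps clock at 1.25 GHz

def kernel_energy_pj(op_counts, num_snps, total_cycles):
    """Total energy (pJ): sum of (count * per-op energy) plus leakage
    integrated over the kernel latency across all active SNPs."""
    dynamic = sum(ENERGY_PJ[op] * n for op, n in op_counts.items())
    latency_s = total_cycles * CLOCK_PERIOD_S
    static = LEAKAGE_W_PER_SNP * num_snps * latency_s * 1e12  # J -> pJ
    return dynamic + static

# Hypothetical operation mix for one block on the 4-SNP, 83-cycle AES
# mapping mentioned in Section 2.3:
counts = {"lut_8x8": 160, "add_logic": 96, "fused": 32, "move_intra": 48}
print(round(kernel_energy_pj(counts, num_snps=4, total_cycles=83), 1))  # -> 1422.9
```

Even in this toy mix, the CACTI-derived leakage term (about 803 pJ here) exceeds the dynamic energy at these short latencies, which is exactly why Section 2.4 flags the reported leakage as an overestimate.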
- Complex and fused functions are performed in dedicated hardware structures and/or LUTs.

For the majority of the benchmark applications, HASK latency is slightly worse than that of FPGA. However, since HASK implementations generally use less die area than FPGA, the iso-area throughput is 4.3X better than that of FPGA, and HASK uses 3.1X less energy than FPGA on average. Moreover, a 2.5X improvement in EDP is observed. Note that in practice, we anticipate the energy improvement of HASK over FPGA to be even higher. As stated in Section 2.4, the HASK leakage energy is overestimated, since CACTI cannot model the leakage reduction techniques used in nanoscale processes. Conversely, the FPGA leakage is underestimated, since Quartus does not report contributions from the programmable interconnect and embedded memory blocks. The primary reasons for the improvement in energy efficiency of HASK over FPGA are as follows: (1) HASK supports LUT operations of different bit-widths and has dedicated hardware for fused logic operations, which reduces the total number of operations and hence energy; (2) the spatio-temporal mapping greatly reduces interconnect complexity and energy; and (3) the highly customizable memory structure of FPGAs results in energy-inefficient memory accesses [19]. Additionally, the HASK configuration bitstream is on average 2.3 percent smaller than the equivalent FPGA implementation. It is worth noting that HASK's energy performance on the IDEA cipher is poor compared to FPGA. The major reason is that IDEA uses multiplications, which must be implemented as LUT operations in the SNP. These memory accesses are substantially less energy efficient than the embedded multiplier blocks of the Stratix V, so HASK's energy efficiency suffers.

We also consider a comparison to three other prominent alternative hardware acceleration platforms: (a) AsAP, (b) MorphoSys, and (c) graphics processing units (GPUs). GPUs are suitable for floating-point-intensive kernels, but cannot provide a fine-grained spatio-temporal mapping like HASK, AsAP, and MorphoSys, resulting in sub-optimal performance. In contrast to these frameworks, HASK has four key distinguishing factors: (1) HASK requires no shared memory on-die; (2) it provides hardware support for variable input and output LUT operations; (3) SNPs contain local fixed-function optimizations for security; and (4) HASK implements a hierarchical interconnect appropriate for security applications. We present Singh's [2] and Liu's [7] findings scaled to a 32 nm process node operating at 0.95 V and compare them with HASK results in Table 4. We derive iso-area throughput and energy per bit as points of comparison between these disparate computing fabrics. Since no energy results were provided for MorphoSys, only throughput is compared.

TABLE 4
Comparison of HASK with CGRA/GPU with Respect to Throughput and Energy

  Benchmark   Fabric      Scaled Throughput (Gbps/mm²)   Energy/Bit (pJ/bit)
  AES         HASK        7.44                           13.20
              AsAP        1.29                           742.9
              GPU         0.39                           2147.2
  IDEA        HASK        14.03                          —
              MorphoSys   10.34                          —

A more conservative operating frequency of 500 MHz has also been considered for HASK. In this case, the operating voltage can be reduced to 0.8 V. Energy and latency values are scaled accordingly for all components of the HASK model, and the improvement versus FPGA for all benchmark applications is compared in Fig. 7. The throughput improvement scales down linearly to 1.7X versus FPGA; however, total energy increases due to a greater leakage energy contribution. As a result, the EDP improvement compared to an FPGA at its maximum frequency goes down to 0.64X. However, the energy required per bit for HASK still remains 2X better than FPGA.

4 CONCLUSIONS
We have presented HASK, a novel reconfigurable framework for accelerating security applications. The proposed architecture offers comparable latency and die area to FPGA with an average 3X improvement in energy efficiency. This is achieved using high-density SRAM for lookup operations, a sparse interconnect, application-level pipelining, and a custom datapath tailored to the computing needs of the security application domain. We presented simulation results which show the improvement compared to implementations on FPGAs and GPPs. The interconnect fabric can accommodate a large number of SNPs and hence is scalable, enabling the mapping of larger kernels or even parallel instances of a kernel. The performance of HASK was also compared with alternative CGRAs such as AsAP and MorphoSys, as well as GPUs. The memory-dominated architecture of HASK is very amenable to technology scaling; hence, emerging high-density nanoscale memory technologies can significantly improve its performance and area. Future work will include the development of an application mapping tool and support for dynamic instruction scheduling.

ACKNOWLEDGMENTS
This work was supported in part by the Semiconductor Research Corporation (SRC) under Grant 2015-EP-2650.

REFERENCES
[1] P. Kocher, R. Lee, G. McGraw, A. Raghunathan, and S. Ravi, "Security as a new dimension in embedded system design," in Proc. Des. Autom. Conf., 2004, pp. 753-760.
[2] H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. Chaves Filho, "MorphoSys: An integrated reconfigurable system for data-parallel and computation-intensive applications," IEEE Trans. Comput., vol. 49, no. 5, pp. 465-481, May 2000.
[3] K. Eguro, "RaPiD-AES: Developing an encryption-specific FPGA architecture," M.S. thesis, Univ. Washington, Seattle, WA, USA, 2002.