2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)
iMACE: In-Memory Acceleration of Classic
McEliece Encoder
(Invited Paper)
Karthikeyan Nagarajan, Sina Sayyah Ensan, Swagata Mandal*
Swaroop Ghosh, and Anupam Chattopadhyay**
School of EECS, Pennsylvania State University, University Park, USA
School of ECE, Jalpaiguri Government Engineering College, India*
School of CSE, Nanyang Technological University, Singapore**
{kxn287, sxs2541, szg212}@[Link] [Link]@[Link]* anupam@[Link]**
Abstract: Asymmetric code-based crypto-systems have been developed in the last decade due to the rapid evolution of quantum computing, which can potentially compromise RSA and ECC based crypto-systems. The McEliece crypto-system, based on the general decoding problem, is one of the front-runner candidates for post-quantum cryptography, but its energy efficiency is limited by the heavy data traffic between the processing elements and the memory. In-memory computing (IMC) architectures can remove the energy-efficiency barriers that Von-Neumann computing incurs by moving data between the processor and the memory. Emerging non-volatile memories (NVM) such as Resistive RAM (ReRAM) implemented in a crossbar array are promising substrates to realize IMC due to their excellent High Resistance State (HRS) to Low Resistance State (LRS) ratios and high densities. Therefore, McEliece can benefit substantially from in-memory acceleration. We propose iMACE, a high-performance and area-efficient hardware implementation of the core encoding function of McEliece that exploits ReRAM-based IMC. Simulation results show 18.8X-94X better throughput and 46%-97% reduction in energy consumption compared to the FPGA-based implementation.

I. INTRODUCTION

The security of traditional security primitives such as Rivest Shamir Adleman (RSA) [1], Digital Signature Algorithm (DSA) [2], Elliptic Curve Digital Signature Algorithm (ECDSA) [3], and other Elliptic Curve Cryptography (ECC)-based crypto-systems relies on the assumed difficulty of various problems in number theory, such as the Integer Prime Factorization Problem or the Discrete Logarithm Problem. Recent developments have shown that quantum computers can break public-key crypto-systems based on such hard number theory problems. Therefore, it is essential to explore alternate quantum-resistant crypto-systems. The Classic McEliece code-based crypto-system is a candidate for the Post-Quantum Cryptography Standardization project by NIST [4]. Conventional Von-Neumann computing separates processing and memory elements, resulting in energy and performance bottlenecks when the Classic McEliece algorithm is executed in the processor. In-Memory Computing (IMC) architectures have gained significant attention due to features such as a high degree of parallelism, elimination of external data movement, and the flexibility of partitioning the memory resources between computation and storage. Many works have explored various IMC architectures, such as native realization of neural network architectures [5], programmable architectures [6], matrix multiplications [7], neuromorphic computing [8][9], and approximate computing [10]. These techniques have shown improvement in performance and power efficiency compared to conventional computing.

Hardware acceleration of the McEliece crypto-system is proposed in [11] using alternate codes (QC-MDPC) to achieve a smaller resource footprint while ensuring reasonable performance for various applications. However, this implementation deviates from the traditional usage of Goppa codes [12] for McEliece. A smart-card implementation of McEliece has been proposed [13] on an Infineon SLE76 chip. However, this implementation poses the problem of transmission and storage of the large public keys required for McEliece encryption. There has been little work on exploiting IMC for security-oriented applications. In [14], a hardware-based hash function is proposed using crossbar memristive technology. It exploits the write disturb phenomenon and process variations in the crossbar array for implicit key embedding. A ReRAM-based in-memory implementation of the SHA-3 hash algorithm is proposed in [15].

In this work, we propose iMACE: In-memory acceleration of the classic McEliece encoder using ReRAM-based in-memory computing. iMACE employs a Dynamic Computing in Memory (DCIM) architecture [16] that is energy-efficient compared to other contemporary IMC architectures. The following contributions are made in this paper. We (a) propose an in-memory implementation of the Classic McEliece crypto-system's encryption algorithm, including its complex mathematical operations, on DCIM arrays; (b) propose two versions of pipelining the array operations to create a trade-off surface between throughput and energy; (c) perform process variation analysis on iMACE designs; and (d) provide comparative analysis against an FPGA implementation.

The paper is organized as follows: Section II describes the basics of Classic McEliece and the DCIM architecture. Section III presents the in-memory design of the encryption algorithm including two pipelined versions: iMACE-I and iMACE-II. Section IV presents process variation analysis and comparison with FPGA implementations. Finally, Section V draws the conclusion.

II. BACKGROUND ON MCELIECE AND IMC

A. Classic McEliece Encoder

The McEliece crypto-system [17] is a public key crypto-system that uses a public and private key pair to encrypt and decrypt a message.

Public key crypto-system: Assuming that the message receiver is A and the sender is B, A publishes its public key to everyone. B uses A's public key to encrypt the message and transmit it to A. Finally, A uses its private key to decrypt
978-1-7281-3391-1/19/$31.00 ©2019 IEEE
DOI 10.1109/ISVLSI.2019.00098
Fig. 1: Representative illustration of calculation of G and plaintext encryption in the McEliece crypto-system with n = 12
and k = 8.
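The flow illustrated in Fig. 1 can be sketched in a few lines of Python. This is only an illustrative sketch with the same toy sizes (n = 12, k = 8): the generator matrix below is a random systematic placeholder rather than a real Goppa code, S is restricted to unit lower-triangular form so that it is guaranteed invertible, and t = 1 and all helper names (`gf2_matmul`, `unit_lower_triangular`, `permutation_matrix`, `encrypt`) are assumptions made for the example.

```python
import random

def gf2_matmul(A, B):
    """Multiply binary matrices over GF(2): AND forms products, XOR (sum mod 2) adds them."""
    return [[sum(a & b for a, b in zip(row, col)) % 2 for col in zip(*B)]
            for row in A]

def unit_lower_triangular(k, rng):
    """Random unit lower-triangular matrix, which is always invertible over GF(2)."""
    return [[1 if i == j else (rng.randint(0, 1) if j < i else 0)
             for j in range(k)] for i in range(k)]

def permutation_matrix(n, rng):
    """Random n x n permutation matrix (exactly one 1 per row and per column)."""
    perm = list(range(n))
    rng.shuffle(perm)
    return [[1 if perm[i] == j else 0 for j in range(n)] for i in range(n)]

def encrypt(m, G_pub, e):
    """Ciphertext c = m G' + e, with all arithmetic over GF(2)."""
    mG = gf2_matmul([m], G_pub)[0]
    return [x ^ y for x, y in zip(mG, e)]

rng = random.Random(0)
n, k, t = 12, 8, 1  # toy sizes from Fig. 1; t = 1 is an assumption

# Placeholder generator matrix [I_k | A]: rank k, but NOT a real Goppa code.
identity = [[1 if i == j else 0 for j in range(k)] for i in range(k)]
G = [identity[i] + [rng.randint(0, 1) for _ in range(n - k)] for i in range(k)]

S = unit_lower_triangular(k, rng)
P = permutation_matrix(n, rng)
G_pub = gf2_matmul(gf2_matmul(S, G), P)  # public key matrix G' = S G P

m = [rng.randint(0, 1) for _ in range(k)]          # plaintext of k bits
error_positions = set(rng.sample(range(n), t))     # weight-t error vector
e = [1 if i in error_positions else 0 for i in range(n)]
c = encrypt(m, G_pub, e)                           # ciphertext of n bits
```

Only AND and XOR appear in the arithmetic, which is why the encoder maps naturally onto logic arrays.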
Fig. 2: (a) XOR implementation using DCIM in ReRAM crossbar array; (b) Timing diagram of logical XOR operation. [16]
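At the logic level, the XOR of Fig. 2 is a sum of two product terms, and the same function can be obtained from two NAND stages. The sketch below models only this Boolean behavior (no precharge, sensing, or timing); the function names are hypothetical and chosen for the example.

```python
def and_array(literals):
    """Dynamic AND bitline: reads as 1 only if every connected literal is 1."""
    return int(all(literals))

def nand_array(literals):
    """NAND bitline: the complemented output of the same array."""
    return 1 - and_array(literals)

def xor_and_or(in0, in1):
    """XOR as a sum of products: (in0 AND NOT in1) OR (NOT in0 AND in1)."""
    n0, n1 = 1 - in0, 1 - in1
    bl0 = and_array([in0, n1])   # first product term
    bl1 = and_array([n0, in1])   # second product term
    return bl0 | bl1             # OR array combines the product terms

def xor_nand_nand(in0, in1):
    """Same function built from two NAND stages (De Morgan's law)."""
    n0, n1 = 1 - in0, 1 - in1
    return nand_array([nand_array([in0, n1]), nand_array([n0, in1])])

# Truth table check for both realizations
table = [(a, b, xor_and_or(a, b), xor_nand_nand(a, b))
         for a in (0, 1) for b in (0, 1)]
```

Both realizations agree with `a ^ b` on all four input combinations.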
the received message. An adversarial interceptor, C, would be unable to decode the transmitted message with just the public key, because it is computationally infeasible to compute the private key from the public key.

Classical Goppa Codes: Goppa codes [12], a class of linear error-correcting codes, are used in the McEliece crypto-system. Specifically, it uses a class of irreducible binary Goppa codes. The code has a length of n and a dimension of k, and can correct up to t errors. A binary Goppa code uses a polynomial $g(x)$ over $GF(2^m)$ of degree t, chosen such that the Goppa polynomial is irreducible. Note that an irreducible Goppa code $\Gamma(L, g(x))$ has a minimum distance $d \geq 2t + 1$. Similar to other block codes, Goppa codes use a parity check matrix (H) such that $Hc^T = 0$ for all code vectors c in $GF(2^m)$. This satisfies the Goppa code requirement. Encoding requires the plaintext message (m) to be multiplied by a generator matrix of the Goppa code. This generator matrix is defined to be a $k \times n$ matrix G, where the rows of G form the basis of the Goppa code. Any $k \times n$ matrix G with rank k such that $GH^T = 0$ is a generator matrix.

McEliece Key Generation: In McEliece, A constructs the public key by selecting a Goppa polynomial $g(z)$ of degree t and computing a generator matrix G of the Goppa code. A then chooses a random invertible matrix S and a random permutation matrix P. These matrices (S, G, and P) are then used to compute $G' = SGP$. The public key of A consists of $(G', t)$ and the private key consists of $(S, G, P)$.

McEliece Encryption: In order to encrypt a plaintext $m \in \{0, 1\}^k$, where k refers to the dimensionality, the sender (B) has to create a random binary vector (e) of length n with a Hamming weight of $wt(e) \leq t$. The ciphertext, c, is then calculated as $c = mG' + e$. An overview of the calculation of the $G'$ matrix and plaintext encryption is shown in Fig. 1.

McEliece Decryption: Once A receives the encrypted signal c, it computes $\hat{c} = cP^{-1}$. This is followed by computing a value known as the syndrome ($S_z = zH^T$) and applying an error correction algorithm that uses $\hat{c}$ and $S_z$ as its inputs. Once the error vector, e, is calculated, the message is recovered as the first k bits of $z \oplus e$.

In this paper, we focus on the implementation of McEliece encryption using the ReRAM-based in-memory computing architecture described in Section II-B.

B. Logic operations using In-Memory Computing

Various IMC architectures have been proposed in the literature. iMACE uses DCIM [16], a ReRAM crossbar based architecture shown in Fig. 2(a). Each bitcell in DCIM is composed of a bidirectional diode in series with a ReRAM. We have used the ASU ReRAM Verilog-A model [18] along with PTM 65nm technology for the analysis. The ReRAM is a bipolar HfOx-based resistive switching memory [18]. Oxide thickness = 5nm, gap = 0.1nm/1.7nm, atomic distance for oxide = 0.25nm, and atomic energy for vacancy generation/recombination = 1.501eV/1.5eV are used here.

By implementing the functions in the Sum-of-Product (SoP) form, DCIM executes in-memory computation. DCIM dedicates different arrays for AND and OR. Wordlines (WL) and bitlines (BL) serve as inputs and outputs to
Fig. 3: Division of the S and G matrices in sub-matrices. Note that n = 12 and k = 8 and the numbers shown are only
examples.
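The block division of Fig. 3 can be checked with a small sketch: S is cut into two row blocks and G into three column blocks, each block pair yields its own group of AND products, and every SG element is an 8-bit XOR evaluated as two 4-bit XORs followed by a 2-bit XOR. The matrices here are random placeholders and the helper names are hypothetical; the point is only that the partitioned AND/XOR flow reproduces the direct GF(2) product.

```python
import random

def xor_reduce(bits):
    """XOR of a list of bits."""
    out = 0
    for b in bits:
        out ^= b
    return out

def xor8_as_4_then_2(bits):
    """8-bit XOR split into two 4-bit XORs followed by one 2-bit XOR."""
    return xor_reduce([xor_reduce(bits[:4]), xor_reduce(bits[4:])])

def and_stage(S_blk, G_blk):
    """All 2-bit AND products for one (S block, G block) pair."""
    return [[[S_blk[i][x] & G_blk[x][j] for x in range(len(G_blk))]
             for j in range(len(G_blk[0]))] for i in range(len(S_blk))]

rng = random.Random(1)
S = [[rng.randint(0, 1) for _ in range(8)] for _ in range(8)]    # 8 x 8
G = [[rng.randint(0, 1) for _ in range(12)] for _ in range(8)]   # 8 x 12

S_blocks = [S[0:4], S[4:8]]                                   # two 4 x 8 row blocks
G_blocks = [[row[c:c + 4] for row in G] for c in (0, 4, 8)]   # three 8 x 4 column blocks

SG = [[0] * 12 for _ in range(8)]
and_count = 0
pair_count = 0
for bi, S_blk in enumerate(S_blocks):
    for bj, G_blk in enumerate(G_blocks):
        pair_count += 1
        products = and_stage(S_blk, G_blk)   # 4 x 4 x 8 = 128 AND outputs per pair
        for i in range(4):
            for j in range(4):
                and_count += len(products[i][j])
                SG[4 * bi + i][4 * bj + j] = xor8_as_4_then_2(products[i][j])

# Reference: direct matrix product over GF(2)
SG_direct = [[sum(S[i][x] & G[x][j] for x in range(8)) % 2 for j in range(12)]
             for i in range(8)]
```

Here `and_count` accumulates the total number of 2-bit AND operations, and `SG` matches `SG_direct` element by element.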
Fig. 4: Division of the SG and P matrices to sub-matrices. Note that n = 12 and k = 8 and the numbers shown are only
examples.
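The array counts quoted for the first XOR stages follow from simple bookkeeping on wordlines and bitlines. The sketch below assumes, as a simplification not stated in the paper, that the number of 128-line arrays is set by whichever of the wordline or bitline demand is larger; an f-input XOR is taken to need 2f wordlines (each input and its complement) and 2^(f-1) bitlines (one per odd-parity minterm). The function name and the constant `ARRAY_LINES` are hypothetical.

```python
from math import ceil

ARRAY_LINES = 128  # assumed maximum number of WLs and BLs per array

def xor_first_stage(num_inputs, fan_in):
    """Resource estimate for the first (minterm) stage of fan_in-input XOR gates."""
    gates = num_inputs // fan_in
    wordlines = gates * 2 * fan_in            # each input plus its complement
    bitlines = gates * 2 ** (fan_in - 1)      # one BL per odd-parity minterm
    arrays = max(ceil(wordlines / ARRAY_LINES), ceil(bitlines / ARRAY_LINES))
    return gates, wordlines, bitlines, arrays

sg_and_outputs = 8 * 8 * 12     # AND products for S x G
sgp_and_outputs = 8 * 12 * 12   # AND products for SG x P

sg_xor4 = xor_first_stage(sg_and_outputs, 4)    # 4-bit XOR stage of S x G
sg_xor2 = xor_first_stage(sg_xor4[0], 2)        # 2-bit XOR stage of S x G
sgp_xor4 = xor_first_stage(sgp_and_outputs, 4)  # 4-bit XOR stage of SG x P
sgp_xor3 = xor_first_stage(sgp_xor4[0], 3)      # 3-bit XOR stage of SG x P
```

Under these assumptions the estimates come out to 12 and 3 arrays for the S x G XOR stages and 18 and 5 arrays for the SG x P XOR stages.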
the AND and OR functions, respectively. Arrays are pre-programmed to perform the desired functions. For instance, in order to implement $in_0 \cdot \overline{in_1}$, the bitcells connected to $in_0$ and $\overline{in_1}$ are programmed to Low Resistance State (LRS) while the bitcells connected to $\overline{in_0}$ and $in_1$ are programmed to High Resistance State (HRS) (Fig. 2(a)). All other ReRAMs are programmed to HRS (e.g., the ReRAMs connected to inputs $in_n$ and $\overline{in_n}$).

An XOR function implemented in DCIM is shown in Fig. 2. At first, the $Pre$ signal that precharges the AND array's BLs to $V_{DD}$ is asserted. Then, the inputs ($in_0$ and $in_1$) are applied by asserting the $EN_{AND}$ signal. As shown in Fig. 2(b), when $in_0 = in_1 = 1$ both $BL_0$ and $BL_1$ drop below the reference voltage ($V_{Ref-AND}$). As a result, the outputs of the Sense Amplifiers (SA) that determine the results of the $in_0 \cdot \overline{in_1}$ and $\overline{in_0} \cdot in_1$ functions are pulled down to 0 at the edge of $SE_{AND}$. Next, the AND array SA outputs are provided as inputs to the OR array. Since the inputs of the OR array are '0', the BL ($BL0_{OR}$) remains discharged and results in $in_0 \oplus in_1 = 0$. If $in_0 = 0$ and $in_1 = 1$, $BL_0$ discharges while $BL_1$ remains precharged, which results in $in_0 \cdot \overline{in_1} = 0$ and $\overline{in_0} \cdot in_1 = 1$. Therefore, $BL0_{OR}$ starts charging at the edge of $EN_{OR}$. Finally, the voltage of $BL0_{OR}$ is compared against $V_{Ref-OR}$ at the edge of $SE_{OR}$ and produces '1' at the output of the SA. Note that the OR array sense enable ($SE_{OR}$) is an active low signal. The overall XOR operation shown in Fig. 2 also depicts the AND and OR operations that are required for iMACE's implementation.

From simulation results it is determined that the XOR operation takes 1.96ns to 3ns (the delay increases when the ReRAM HRS/LRS resistances change with time). iMACE uses a simpler version of DCIM called FPCAS [19] that uses NAND-NAND memory arrays instead of AND-OR arrays. NAND arrays are the same as AND arrays, but the $\overline{out}$ node of the SA is taken as the final result instead of the SA's $out$ node. NAND is a functionally complete operation, and any logical function can be implemented using two stages of NAND functions (e.g. $a\overline{b} + \overline{a}b = \overline{\overline{a\overline{b} + \overline{a}b}} = \overline{\overline{a\overline{b}} \cdot \overline{\overline{a}b}}$).

III. IMACE ARCHITECTURE

A. Preliminaries

The key parameters considered for the in-memory implementation of McEliece are the code length (n) and the dimensionality (k). In this paper we have chosen a representative example of n = 12 and k = 8. For a fixed n and k, the dimensions of the matrices S, G, and P are fixed as (8 × 8), (8 × 12), and (12 × 12) respectively. Note that iMACE allows other values of n and k as well (Section IV-C). iMACE performs the computation of the public key $G'$ (calculated as $G' = SGP$). This operation is split into generation of the S × G matrix and the SG × P matrix. This is followed by the multiplication of the plaintext message by $G'$, addition of an error vector (e), and subsequently the generation of the encrypted message c.

B. S × G Computation

The first step involves the multiplication of the (8 × 8) S matrix and the (8 × 12) G matrix. The computation of each element in the SG matrix requires eight 2-bit ANDs and one 8-bit XOR. Each of the 8 elements of a row ($i \in \{1, \ldots, 8\}$) of S is AND'ed with its corresponding element for all columns ($j \in \{1, \ldots, 12\}$) in G. The 8 AND'ed outputs for a particular i and j are then XOR'ed with each other to compute the SG element at location {i, j}.

AND operation: The total number of AND operations required is 8 × 8 × 12 = 768. We perform the calculation after splitting the matrices S and G into smaller matrices as shown in Fig. 3. Note that this division of matrices is just a representative example and is done to ensure sufficient sense margin in all DCIM arrays. S is split into two matrices by rows as S1 (4 × 8) and S2 (4 × 8). G is split into 3 matrices by columns as G1 (8 × 4), G2 (8 × 4), and G3 (8 × 4). Both S1 and S2 are multiplied by each of the G1, G2, and G3 matrices. This operation requires a total of 6 ReRAM DCIM arrays (one per pair of sub-matrices, 6 × 128 = 768 outputs in total), with each array having 64 inputs (8 × 4 + 8 × 4) and 128 outputs (4 × 8 × 4).

XOR operation: In order to perform the 8-bit XOR for each SG element, the 8 bits are divided into 2 sets of 4 bits each. The 4 bits in each set are XOR'ed with each other and
Fig. 5: (a) iMACE-I pipeline architecture with a latency of 2.5ns and a throughput of 3.2Gbps; (b) Size and number of
arrays required for each operation for both iMACE-I & -II.
Fig. 6: (a) iMACE-II pipeline architecture with a latency of 8.5ns and a throughput of 16Gbps; (b) Area, aggregate power,
and delay of each of the McEliece encryption operations.
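The throughput figures in the captions of Fig. 5 and Fig. 6 follow from the block size and the interval between successive encrypted outputs. The short sketch below reproduces that arithmetic; the per-step operation counts in `OPS` are an assumption (one 500ps slot for each AND or XOR stage listed in Section III), used only to show that 17 such slots add up to the 8.5ns latency quoted for iMACE-II.

```python
def throughput_gbps(bits_per_block, interval_ns):
    """Bits delivered per nanosecond, which is numerically equal to Gbps."""
    return bits_per_block / interval_ns

BLOCK_BITS = 8                 # plaintext bits encrypted per output
clock_period_ns = 1 / 2.0      # 2 GHz clock -> 0.5 ns per cycle

imace1_tput = throughput_gbps(BLOCK_BITS, 2.5)              # one output every 2.5 ns
imace2_tput = throughput_gbps(BLOCK_BITS, clock_period_ns)  # one output every 500 ps

# Assumed number of single-cycle array operations in each encryption step.
OPS = {"S x G": 5, "SG x P": 5, "m x G'": 5, "mG' + e": 2}
imace2_latency_ns = sum(OPS.values()) * clock_period_ns
```

With these inputs the sketch gives 3.2 Gbps for iMACE-I, 16 Gbps for iMACE-II, and 8.5 ns for the fully pipelined path.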
then the resultant 2 bits are XOR'ed again to obtain the SG element.

4-bit XOR: The 4-bit XOR is divided into 2 stages. The first stage has a total of 8 × 8 × 12 = 768 inputs. This translates to a total of 192 4-bit sets. Each 4-input XOR gate has 8 intermediate minterms and therefore the XOR array requires 8 BLs per gate. Since each input requires 2 WLs (the input and its complement), a 4-input XOR gate requires 8 WLs. Therefore, the operation requires a total of 12 ReRAM DCIM arrays, with each array taking 128 inputs and resulting in 128 outputs (128 × 128). In the second stage of the 4-bit XOR operation, we have a total of 12 × 128 = 1536 inputs as generated by the first stage. Each set of 8 input bits requires 1 output BL. Therefore, this stage requires 24 arrays, each taking 128 inputs and resulting in 8 outputs (8 × 128). A total of 24 × 8 = 192 outputs are generated.

2-bit XOR: The 2-bit XOR is divided into 2 stages as well. The first stage has a total of 192 inputs as generated by the 4-bit XOR's final output. Each set of 2 bits requires 2 BLs for its 2 minterm outputs and 4 WLs for its inputs and their complements. A total of 3 arrays are required, each taking 128 inputs and resulting in 64 outputs (64 × 128). 3 × 64 = 192 outputs are generated. The second stage takes in the 192 intermediate minterm inputs. A set of 2 minterm inputs requires 1 BL for its output and 4 WLs for its inputs and their complements. Three arrays, each taking 128 inputs and resulting in 32 outputs (32 × 128), are required. A total of 3 × 32 = 96 outputs are generated. These 96 outputs form the 8 × 12 SG matrix. Fig. 5b gives an overview of all the designed array sizes and numbers.

Simulation Results: The overall area consumed by the DCIM arrays for this step is 25.94 KGEs, where 1 GE (0.25 μm^2, i.e. 60F^2) is the area of a minimum-sized NAND gate in a 65nm process [20]. The aggregate power consumed by the arrays is 15.28mW. Note that the energy consumed will depend on the configuration of the pipeline as discussed in Sections III-F and III-G. The sum of the individual delays (each < 500ps) of all operations is 2.33ns. Fig. 6b shows the simulation results for each operation.

C. SG × P Computation

This step involves the multiplication of the (8 × 12) SG matrix with the (12 × 12) P matrix to generate an (8 × 12) SGP (also $G'$) matrix. The SG matrix is split into 4 (2 × 12) matrices by rows as SG1, SG2, SG3, and SG4 (Fig. 4). The P matrix is split into 4 (12 × 3) matrices by columns as P1, P2, P3, and P4 (Fig. 4). Similar to the previous SG computation, the operation is divided into two stages: AND and XOR. The XOR is further divided into 4-input XORs and 3-input XORs. The number of inputs, outputs, BLs and WLs required for the SGP computation is calculated using the rules of DCIM as shown in Section III-B. For the sake of brevity, we only report the choices on the number and size of arrays required for the above mentioned operations.

AND operation: AND requires a total of 16 DCIM arrays, each taking 128 inputs and resulting in 72 outputs. In order to ensure that the array sizes are a power of 2, we round
up the number of outputs to 128 (with 56 unused output BLs). The 16 (128 × 128) AND arrays generate a total of 16 × 72 = 1152 outputs.

XOR operation: In order to perform the 12-bit XOR for each SGP element, the 12 bits are divided into 3 sets of 4 bits each. The 4 bits in each set are XOR'ed and the resultant 3 bits are XOR'ed again to obtain the SGP element.

4-bit XOR: The first stage requires 18 arrays, each taking 128 inputs and resulting in 128 outputs (128 × 128). The second stage requires 36 arrays, each taking 128 inputs and resulting in 8 outputs (8 × 128).

3-bit XOR: The first stage requires 5 arrays, each taking 128 inputs and resulting in 128 outputs (128 × 128). Note that some of the BLs and WLs are not used in the arrays since we are rounding up the array sizes to the nearest power of 2. This stage generates 384 outputs. The second stage requires 6 arrays, each taking 128 inputs and resulting in 16 outputs (16 × 128). It generates a total of 6 × 16 = 96 outputs that form the 8 × 12 SGP matrix.

Simulation Results: Fig. 6b shows the simulation results.

D. Plaintext encryption

The plaintext message (m), which is represented as a (1 × 8) matrix, is multiplied with the (8 × 12) $G'$ matrix (also SGP). This intermediate 1 × 12 matrix ($c'$) is computed similar to the previously discussed matrix multiplications. It consists of 2-input AND, 4-input XOR, and 2-input XOR gates implemented using DCIM arrays. The size and number of arrays are listed as follows.

AND: AND requires 2 arrays, each taking 128 inputs and resulting in 64 outputs (64 × 128).

XOR Operation: For each $c'$ element, the 8 AND outputs are divided into 2 sets of 4 bits each. The 4 bits in each set are XOR'ed and the resultant 2 bits are XOR'ed again to obtain the $c'$ element.

4-bit XOR: The first stage requires 1 (128 × 128) array and 1 (64 × 64) array. The second stage requires 3 (8 × 128) arrays.

2-bit XOR: The first stage requires 1 (32 × 64) array and the second stage requires 1 (16 × 64) array.

Simulation Results: Fig. 6b shows the simulation results.

E. Error vector addition

The final step of the encryption process is a bit-wise XOR of an error vector of length 12 with the $c'$ generated in the previous step. It only requires 12 2-input XOR gates.

XOR Operation: The first stage of the XOR operation requires 1 (32 × 64) array and the second stage requires 1 (16 × 64) array.

Simulation Results: Fig. 6b shows the simulation results.

F. iMACE-I Pipelining

The above discussed operations to encrypt the plaintext message can be pipelined to maximize the throughput. We have proposed two different configurations. The first configuration, iMACE-I, has each of the stages (i.e. $S \times G$, $SG \times P$, $m \times G'$, and $mG' + e$) pipelined as shown in Fig. 5a. Since the delay of all mathematical DCIM array operations for McEliece is < 500ps, the clock frequency for this operation is chosen to be 2GHz. The configuration concurrently works on 4 plaintext messages based on the array sizes and numbers shown in Fig. 5b. The pipeline allows the encryption of 8 bits of plaintext message every 2.5ns. Therefore, iMACE-I has a throughput of 3.2Gbps and a power consumption of 6.28mW.

G. iMACE-II Pipelining

iMACE-II is an aggressively pipelined configuration where each single mathematical operation (e.g. 4-bit XOR, AND, etc.) is pipelined as shown in Fig. 6a. This ensures that no array lies dormant in any cycle. iMACE-II concurrently works on 10 plaintext messages as compared to 4 in iMACE-I. This configuration produces an encrypted output every 500ps and has a throughput of 16Gbps. The power consumed is 31.4mW. It is geared towards applications that are throughput sensitive.

IV. RESULTS, ANALYSIS, AND DISCUSSION

A. Process Variation Analysis

Process variations can lead to worst-case scenarios where the bitline voltage change may not provide a sufficiently high sense margin (SM) for the SA to detect the change in bitline voltage. We conduct a 1000-point Monte-Carlo analysis at -10°C, 25°C, and 90°C on the DCIM NAND arrays. The process variation is modeled by changing the key metrics of the crossbar design including the ReRAM low resistance gap (GIL), the high resistance gap (GIH), the oxide thickness of the MOSFETs, and the channel length of all the transistors. GIL and GIH are modeled as a 3σ variation of 7% of their nominal values of 0.1nm and 1.7nm respectively. The oxide thickness and channel lengths of the MOSFETs are modeled as a 3σ variation of 10% of their nominal values of 1.2nm and 65nm respectively. In DCIM arrays it is observed that the SM is inversely proportional to the number of inputs per output. Our implementation utilizes 2, 4, and 8-input NAND gates. The distributions of SM observed for each of these cases under different temperatures are shown in Fig. 7. The lowest SM is 17mV (at 90°C). Note that the number of sense amplifiers in iMACE designs is 8344, which corresponds to 3.38 sigma. The sense amplifiers should be upsized to keep the sense amplifier offset to about 5mV per sigma.

B. Comparative Analysis of iMACE

We have compared iMACE with an FPGA-based implementation. Since this paper focuses on the encryption portion of the McEliece encoder, we have implemented the encryption portion alone (Section III) on an FPGA device (xc7k325t-2ffg900 was used as the device during synthesis). The FPGA implementation has the following characteristics: No. of slice registers = 64, No. of slice LUTs = 401, No. of LUT-FF pairs = 48, No. of bonded IOBs = 112, and No. of BUFG/BUFGCTRL/BUFHCEs = 2. The IMC design of the McEliece encryption requires a total area of 74.88 KGE. It is not possible to compare the resource utilization of an FPGA with the area requirements of a CMOS-based implementation. Therefore, our comparison uses energy and throughput for each implementation as shown in Table I.

The FPGA implementation mimics the Von-Neumann computing model with separate processing and memory elements. Therefore, the throughput and energy requirements for the encryption will worsen due to the delay and energy consumption of transmitting the plaintext bits to the processor. The FPGA implementation would require >0.21nJ (assuming 40pJ/bit for LPDDR2 memory) [21] and 42ns latency (RAS timing) just to move 1 block of message (8 bits of plaintext message) from the memory to the processor. It is seen that iMACE-I and iMACE-II have 1.15X and 5.76X better throughput than the FPGA implementation when the data transmission is not considered. The throughput
Fig. 7: Process variation analysis of DCIM arrays for 2-input, 3-input, 4-input and 8-input NAND gates at temperatures of (a) -10°C; (b) 25°C; and (c) 90°C.
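The throughput ratios and energy reductions quoted in the comparison can be recomputed from the Table I entries. The sketch below does this with the throughput (Mbps) and energy (nJ) values copied from the table; the dictionary layout and function names are assumptions for the example. Because the table entries are rounded, the recomputed percentages can differ from the quoted ones by about one percentage point.

```python
# (throughput in Mbps, energy in nJ) copied from Table I
TABLE_I = {
    "iMACE-I": (3200.0, 0.02),
    "iMACE-II": (16000.0, 0.34),
    "FPGA": (2780.0, 0.41),
    "FPGA Total": (170.0, 0.62),   # FPGA plus data transfer
}

def speedup(design, baseline):
    """Throughput of `design` divided by throughput of `baseline`."""
    return TABLE_I[design][0] / TABLE_I[baseline][0]

def energy_reduction_pct(design, baseline):
    """Percentage of the baseline energy saved by `design`."""
    return 100.0 * (1.0 - TABLE_I[design][1] / TABLE_I[baseline][1])

results = {
    (d, b): (round(speedup(d, b), 2), round(energy_reduction_pct(d, b), 1))
    for d in ("iMACE-I", "iMACE-II")
    for b in ("FPGA", "FPGA Total")
}
```

Each entry of `results` holds a (speedup, energy reduction in percent) pair for one design against one baseline.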
TABLE I: Comparison of iMACE implementations

Work         Tech.  f (MHz)  Lat. (ns)  Tput. (Mbps)  Energy (nJ)
iMACE-I      DCIM   2K       8.5        3.2K          0.02
iMACE-II     DCIM   2K       8.5        16K           0.34
FPGA         -      1K       2.88       2.78K         0.41
Data Trans.  -      1K       42         0.19K         0.21
FPGA Total   -      1K       44.88      0.17K         0.62

performance increases to 18.8X and 94X when the data transmission delay is included. iMACE-I and iMACE-II reduce energy consumption by 95% and 18% (without FPGA data transfer) and by 97% and 46% (with data transfer) respectively. In summary, the comparison shows superior throughput and energy consumption in all scenarios for both flavors of iMACE against a traditional FPGA-based implementation of the same encryption algorithm.

C. Scalability

The proposed iMACE design can be scaled to higher values of n and k by increasing the number of DCIM arrays to a feasible quantity based on the area and power constraints of the application. For example, an alternative selection of McEliece parameters such as n = 16 and k = 12 would increase the array area by 5.9X and the energy consumption by 2.6X. Similarly, higher values of n and k required for post-quantum cryptography can be accommodated by iMACE. Under resource constraints, it is also possible to implement resource sharing to accommodate larger key sizes, but with a relatively lower throughput.

V. CONCLUSIONS

We proposed 2 flavors of McEliece encryption implementation using dynamic computing in memory (DCIM). The salient features include (i) elimination of large amounts of data movement between the processor and the memory, saving orders of magnitude in energy consumption, (ii) a re-configurable design that allows optimization between energy and throughput, and (iii) a low memory footprint.

Acknowledgements: This work is supported by SRC (2847.001), NSF (CNS-1814710, CNS-1722557, CCF-1718474, DGE-1723687 and DGE-1821766) and DARPA Young Faculty Award (D15AP00089).

REFERENCES

[1] R. Rivest et al., “A Method for Obtaining Digital Signatures and Public-key Cryptosystems,” Commun. ACM, vol. 21, pp. 120–126, Feb. 1978.
[2] D. W. Kravitz, “Digital Signature Algorithm,” US Patent 5,231,668, July 27, 1993.
[3] D. Johnson et al., “The elliptic curve digital signature algorithm (ECDSA),” International Journal of Information Security, vol. 1, no. 1, pp. 36–63, 2001.
[4] National Institute of Standards and Technology (NIST), “Post-Quantum Cryptography Standardization.” Online: [Link] quantum-cryptography-standardization.
[5] M. Hu et al., “Dot-product Engine for Neuromorphic Computing: Programming 1T1M Crossbar to Accelerate Matrix-vector Multiplication,” in Proceedings of the 53rd Annual Design Automation Conference, DAC ’16, (New York, NY, USA), pp. 19:1–19:6, ACM, 2016.
[6] Y. Zha et al., “Reconfigurable In-memory Computing with Resistive Memory Crossbar,” in Proceedings of the 35th International Conference on Computer-Aided Design, ICCAD ’16, pp. 120:1–120:8, 2016.
[7] L. Ni et al., “An energy-efficient matrix multiplication accelerator by distributed in-memory computing on binary RRAM crossbar,” in 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 280–285, Jan 2016.
[8] G. W. Burr et al., “Experimental Demonstration and Tolerancing of a Large-Scale Neural Network (165000 Synapses) Using Phase-Change Memory as the Synaptic Weight Element,” IEEE Transactions on Electron Devices, vol. 62, pp. 3498–3507, Nov 2015.
[9] S. Yu et al., “A neuromorphic visual system using RRAM synaptic devices with Sub-pJ energy and tolerance to variability: Experimental characterization and large-scale modeling,” in 2012 International Electron Devices Meeting, pp. 10.4.1–10.4.4, Dec 2012.
[10] B. Li et al., “Memristor-based Approximated Computation,” in Proceedings of the 2013 International Symposium on Low Power Electronics and Design, ISLPED ’13, pp. 242–247, 2013.
[11] I. von Maurich and T. Güneysu, “Lightweight code-based cryptography: QC-MDPC McEliece encryption on reconfigurable devices,” in 2014 Design, Automation Test in Europe Conference Exhibition (DATE), pp. 1–6, March 2014.
[12] E. Berlekamp, “Goppa codes,” IEEE Transactions on Information Theory, vol. 19, no. 5, pp. 590–592, 1973.
[13] F. Strenzke, “A smart card implementation of the McEliece PKC,” in Information Security Theory and Practices. Security and Privacy of Pervasive Systems and Smart Devices (P. Samarati, M. Tunstall, J. Posegga, K. Markantonakis, and D. Sauveron, eds.), (Berlin, Heidelberg), pp. 47–59, Springer Berlin Heidelberg, 2010.
[14] L. Azriel et al., “Towards a Memristive Hardware Secure Hash Function (MemHash),” in 2017 IEEE International Symposium on Hardware Oriented Security and Trust (HOST), pp. 51–55, May 2017.
[15] K. Nagarajan et al., “SHINE: A Novel SHA-3 Implementation Using ReRAM-based In-Memory Computing,” in ISLPED, 2019.
[16] H. Motaman et al., “Dynamic Computing in Memory (DCIM) in Resistive Crossbar Arrays,” ICCD, Oct 2019.
[17] R. J. McEliece, “A public-key cryptosystem based on algebraic coding theory,” NASA technical report, DSN Progress Report 42-44, pp. 114–116, 1978.
[18] P. Y. Chen et al., “Compact Modeling of RRAM Devices and Its Applications in 1T1R and 1S1R Array Design,” IEEE Transactions on Electron Devices, vol. 62, pp. 4022–4028, Dec 2015.
[19] Sayyah Ensan et al., “FPCAS: In-memory Floating Point Computations for Autonomous Systems,” The International Joint Conference on Neural Network (IJCNN), Jun 2020.
[20] P. Pessl et al., “Pushing the Limits of SHA-3 Hardware Implementations to Fit on RFID,” in Cryptographic Hardware and Embedded Systems - CHES 2013, pp. 126–141, Springer Berlin Heidelberg, 2013.
[21] K. T. Malladi et al., “Towards Energy-Proportional Datacenter Memory with Mobile DRAM,” in 2012 39th Annual International Symposium on Computer Architecture (ISCA), pp. 37–48, June 2012.