RRAM-Based In-Memory Computing for
Embedded Deep Neural Networks
D. Bankman, J. Messner, A. Gural, and B. Murmann
Stanford University, Stanford, CA, USA
Email: [email protected]

Abstract—Deploying state-of-the-art deep neural networks (DNNs) on embedded devices has created a major implementation challenge, largely due to the energy cost of memory access. RRAM-based in-memory processing units (IPUs) enable fully layerwise-pipelined architectures, minimizing the required SRAM memory capacity for storing feature maps and amortizing its access over hundreds to thousands of arithmetic operations. This paper presents an RRAM-based IPU featuring dynamic voltage-mode multiply-accumulate and a single-slope A/D readout scheme with RRAM-embedded ramp generator, which together eliminate power-hungry current-mode circuitry without sacrificing linearity. SPICE simulations suggest that this RRAM-based IPU architecture can achieve an array-level energy efficiency up to 1.2 2b-POps/s/W and an area efficiency exceeding 45 2b-TOps/s/mm^2.
I. INTRODUCTION
Today, there exists significant interest in deploying deep neural networks (DNNs) on embedded devices in order to minimize latency and preserve user privacy in domains such as computer vision and speech recognition. However, the large scale of DNN computations combined with the resource constraints of embedded devices leads to a major implementation challenge. For example, deploying a DNN that performs 100 billion arithmetic operations per inference at 10 inferences per second within a 1 mW power and 1 mm^2 area budget would require an energy efficiency of 1 POps/s/W and an area efficiency of 1 TOps/s/mm^2. Existing DNN accelerators fall short of satisfying these constraints [1].
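To make the budget arithmetic above concrete, the short Python check below reproduces the required efficiency figures from the stated workload; the numbers are exactly those of the example, not new data.

```python
# Efficiency targets implied by the example workload above.
ops_per_inference = 100e9          # 100 billion arithmetic operations per inference
inferences_per_s = 10
power_budget_w = 1e-3              # 1 mW
area_budget_mm2 = 1.0              # 1 mm^2

throughput = ops_per_inference * inferences_per_s      # 1e12 Ops/s
energy_efficiency = throughput / power_budget_w        # 1e15 Ops/s/W   = 1 POps/s/W
area_efficiency = throughput / area_budget_mm2         # 1e12 Ops/s/mm^2 = 1 TOps/s/mm^2
print(f"{energy_efficiency:.1e} Ops/s/W, {area_efficiency:.1e} Ops/s/mm^2")
```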
While digital DNN accelerators achieve high area efficiency, their energy efficiency is limited by a relatively low number of arithmetic operations performed per byte accessed in memory (low arithmetic intensity) [2]. Memory-like architectures achieve high energy efficiency by amortizing the energy cost of memory access over many arithmetic operations [3]. However, their throughput per area is limited by the need to broadcast data across many processing elements, resulting in low clock speed. By cramming the functionality of a multiply-accumulate unit into the area of several 1T1R RRAM cells (theoretically 12 F^2 each), RRAM-based in-memory architectures potentially offer both high energy efficiency and high area efficiency.
Fig. 1 illustrates the RRAM-based in-memory processing unit (IPU) described in this paper. Enabled by the density of RRAM, one IPU can be allocated to every layer in a convolutional neural network (CNN). In such a layerwise-pipelined architecture, one IPU feeds activation data to the next through a low capacity SRAM line buffer [4]. For example, with 3×3 convolutional filters, a line buffer storing 4 feature map lines suffices for IPU n − 1 to write and IPU n to read simultaneously. In contrast, a weight-stationary digital architecture is too bulky to allow full layerwise-pipelining, and must instead be time-multiplexed. This requires activation memory capacity equal to the size of the largest feature map in the network.

Fig. 1. Layerwise-pipelining of IPUs reduces required activation memory to one low capacity line buffer per CNN layer.
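The following Python sketch illustrates the line-buffer bookkeeping described above for 3×3 filters. The class, its dimensions, and its interface are illustrative assumptions, not the authors' implementation.

```python
from collections import deque

class LineBuffer:
    """Minimal sketch of the per-layer SRAM line buffer between two IPUs.

    For 3x3 convolution, holding the 4 most recent feature map rows lets
    IPU n-1 write a new row while IPU n reads a 3x3 window from the three
    rows already completed. Sizes here are illustrative.
    """

    def __init__(self, width=32, channels=128, depth=4):
        self.width = width
        self.channels = channels
        self.rows = deque(maxlen=depth)        # oldest row is evicted automatically

    def write_row(self, row):                  # producer side: IPU n-1
        assert len(row) == self.width
        self.rows.append(row)

    def read_window(self, col):                # consumer side: IPU n
        if len(self.rows) < self.rows.maxlen:
            return None                        # pipeline still filling
        completed = list(self.rows)[:3]        # three fully written rows
        cols = [max(0, col - 1), col, min(self.width - 1, col + 1)]
        return [[r[c] for c in cols] for r in completed]
```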
In the described RRAM-based IPU, input activation bits are applied as wordline pulses, and output activation bits are read out through voltage comparators. Filter weights for one complete CNN layer are stored in the RRAM cell conductances. One differential pair of RRAM columns performs the function of an internally analog, externally digital dedicated hardware neuron. We refer to this building block as the RRAM column neuron.

Past approaches to RRAM-based in-memory computing focused on current-mode multiply-accumulate [5]. However, such designs are energy-limited by static bias currents, resulting in energy efficiency below 100 TOps/s/W. The RRAM-based IPU presented in this paper instead uses dynamic voltage-mode multiply-accumulate, improving energy efficiency by an order of magnitude relative to [5].

Although dynamic voltage-mode operation is energy efficient, it comes with the drawback of inherently nonlinear transfer characteristics due to exponentially decaying bitline transients. Even though the neural network can in principle absorb the first-order terms of this nonlinearity, we found that the actual dynamics in the dense RRAM array are too complex to model during training. To address this problem, our design uses a single-slope A/D readout scheme with an RRAM-embedded ramp generator. As long as the nonlinearity does not flip the polarity of the differential bitline voltage, it effectively disappears from the system.
This paper is organized as follows. Section II presents the circuit architecture enabling dynamic voltage-mode multiply-accumulate. Section III describes the quantized neural network architecture mapped onto the RRAM-based IPU pipeline. Section IV presents behavioral Monte Carlo and SPICE simulation results, and Section V concludes the paper.
II. CIRCUIT ARCHITECTURE
The RRAM column neuron performs integer addition in the
conductance domain by connecting/disconnecting RRAM cells
to/from the bitline. Because conductances are strictly positive,
a differential system is needed to implement subtraction.
Integer multiplication is implemented by repeated addition.
Fig. 2 illustrates the RRAM column neuron, which contains
three wordline groups: weight-activation dot product, bias
addition, and ramp generation. The differential conductance
Gm − Gp contains the result of the neuron computation.
Fig. 2. The RRAM column neuron (two shown here) contains wordline groups for weight-activation dot product, bias addition, and ramp generation.

A. Weight-Activation Dot Product and Bias Addition

Each integer weight is represented by a differential pair of RRAM cells. The conductance of an RRAM cell is programmed to the mid-scale conductance G0 plus/minus an integer number of conductance steps Gu, equal to the integer weight value w. A conductance of G0 − wGu is connected to the plus bitline, and a conductance of G0 + wGu is connected to the minus bitline. This design uses 5 weight levels (2.3 bits), a mid-scale conductance of 30 µS, and a conductance step size of 10 µS. Hence, integer weight values [−2, −1, 0, 1, 2] correspond to conductances [10, 20, 30, 40, 50] µS (resistances [100, 50, 33, 25, 20] kΩ).

Non-negative integer activations x are multiplied by signed-integer weights by switching binary-weighted RRAM cell arrays, as shown in Fig. 3. For B-bit activations, each differential pair of RRAM cells representing one weight value is replicated 2^B − 1 times. This design uses 2-bit activations x ∈ {0, 1, 2, 3}. The least-significant bit (LSB) gates the switching of one RRAM cell pair in every clock cycle, and the most-significant bit (MSB) gates the switching of two RRAM cell pairs. In Fig. 3, activation x1 = 1 connects one RRAM cell pair to the bitlines. With w1 = −1, the resulting contribution to the differential conductance Gm − Gp is −2Gu. The differential conductance contributed by the complete weight-activation dot product group is

    G_{wx} = 2 G_u \sum_i w_i x_i                                   (1)

Fig. 3. Weight-activation dot product wordline group. Gk denotes the conductance G0 + kGu. A binary-coded activation xi switches a binary-weighted RRAM cell array where all cells are programmed to represent weight wi. This contributes 2Gu wi xi to the differential conductance.
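As a behavioral cross-check of Eq. (1), the sketch below maps integer weights to differential conductance pairs using the 30 µS mid-scale and 10 µS step from this design, counts the cell-pair replicas connected by a 2-bit activation, and verifies that the differential conductance equals 2Gu·Σ wi xi. It is a software model of the wordline group, not a circuit simulation.

```python
import numpy as np

G0, GU = 30e-6, 10e-6       # mid-scale conductance and conductance step (Section II-A)

def weight_cell_pair(w):
    """Differential RRAM pair for integer weight w in [-2, 2]: (plus bitline, minus bitline)."""
    return G0 - w * GU, G0 + w * GU

def dot_product_conductance(weights, activations):
    """Behavioral model of the weight-activation dot product wordline group.

    Each weight's cell pair is replicated 2^B - 1 = 3 times; the activation LSB
    gates one replica and the MSB gates two, so a 2-bit activation x connects
    x replicas in total. Returns the differential conductance Gm - Gp.
    """
    g_plus_total = g_minus_total = 0.0
    for w, x in zip(weights, activations):
        g_plus, g_minus = weight_cell_pair(w)
        replicas = (x & 1) + 2 * ((x >> 1) & 1)      # equals x for x in 0..3
        g_plus_total += replicas * g_plus
        g_minus_total += replicas * g_minus
    return g_minus_total - g_plus_total

w = np.array([-2, -1, 0, 1, 2])
x = np.array([3, 1, 2, 0, 3])
assert np.isclose(dot_product_conductance(w, x), 2 * GU * np.dot(w, x))   # Eq. (1)
```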
Across the RRAM array, a fixed group of wordlines is allocated for bias addition. Within each RRAM column neuron, a subset of RRAM cells is programmed to contribute differential conductance proportional to the particular neuron's bias value. Remaining bias addition cells in this column are programmed to the mid-scale conductance G0. In every clock cycle, pulses are applied to all bias addition wordlines. The bias addition wordline group contributes differential conductance Gb = 2Gu b.

B. Ramp Generation

The result of the neuron computation is digitized in a manner similar to a single-slope ADC, shown in Fig. 4. In each clock cycle, the conductance representing the result Gwx + Gb is compared against a reference conductance Gr. Digitizing a B-bit output requires comparison against 2^B − 1 reference levels. For 2-bit activations, three clock cycles per neuron computation are required.

Fig. 5 shows the ramp generation wordline group, which subtracts the reference conductance Gr = 2Gu r from the result Gwx + Gb. The differential conductance Gwx + Gb − Gr is converted to a differential voltage vOD = VBL,P − VBL,M by precharging the bitlines to a reference voltage VREF and discharging them through the connected RRAM cells. One voltage comparator decision is carried out in each clock cycle. Bitline voltage waveforms are illustrated in Fig. 6.
Fig. 4. In each clock cycle, the conductance representing the result of a neuron computation is compared against a reference conductance, similar to a single-slope ADC.

Fig. 6. The differential conductance Gwx + Gb − Gr is converted to a differential voltage vOD by precharging bitlines to VREF and discharging through the connected RRAM cells.

The reference conductances Gr are set according to the transition levels of a unipolar quantizer, allowing A/D conversion to simultaneously serve the purpose of the quantized rectified linear unit (ReLU) activation function. A 2-bit unipolar quantizer with step size ∆ has transition levels at ∆/2, 3∆/2, and 5∆/2. Fig. 5 shows an example where ∆ = 4, subtracting r1 = 2 in the first cycle, r2 = 6 in the second cycle, and r3 = 10 in the third. Similar to bias addition, a fixed group of wordlines is allocated for ramp generation, and a subset of RRAM cells is programmed to realize the 2^B − 1 quantizer transition levels r for a particular CNN layer. Unlike bias values, the quantizer transition levels for all RRAM column neurons in the array are identical. In Section IV, it is shown that a relatively small number of wordlines (less than 3% of the total) is allocated for ramp generation. This is possible because the result of the neuron computation has low swing relative to its maximum possible excursion, leading to a relatively small step size ∆. Considering all three wordline groups, we define the normalized differential conductance y as

    y = \frac{G_m - G_p}{2 G_u} = \frac{G_{wx} + G_b - G_r}{2 G_u} = \sum_i w_i x_i + b - r        (2)

Fig. 5. Ramp generation wordline group. Here, the quantizer step size ∆ = 4. One quantizer transition level is subtracted in each clock cycle.
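The behavioral sketch below summarizes the readout just described: the 2^B − 1 transition levels of a unipolar quantizer with step size ∆, and the per-cycle sign test of the normalized result y from Eq. (2). It reproduces the ∆ = 4 example of Fig. 5 and is a functional model only.

```python
def transition_levels(delta, bits=2):
    """The 2^bits - 1 unipolar quantizer transition levels delta/2, 3*delta/2, ... (delta even)."""
    return [(2 * k + 1) * delta // 2 for k in range(2 ** bits - 1)]

def single_slope_readout(wx_plus_b, delta, bits=2):
    """Behavioral single-slope readout: count the cycles in which v_OD is positive.

    In cycle k the ramp group subtracts r_k, so the comparator effectively tests
    the sign of y = sum(w*x) + b - r_k (Eq. 2). The count of positive decisions
    is the quantized ReLU output code.
    """
    return sum(1 for r in transition_levels(delta, bits) if wx_plus_b - r > 0)

assert transition_levels(4) == [2, 6, 10]                        # Fig. 5 example, delta = 4
assert [single_slope_readout(v, 4) for v in (-3, 3, 7, 40)] == [0, 1, 2, 3]
```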
C. Sparsity-Aware Bitline Precharge

The RRAM-embedded ramp generator illustrated in Fig. 5 strictly subtracts from the differential conductance, in accordance with the transition levels of a unipolar quantizer. Once the voltage comparator senses a negative vOD, it is guaranteed that subsequent vOD values are also negative. Hence, carrying out subsequent decisions is unnecessary. In the VGG-like CNN described in Section III, activations are sparse, implying that most often vOD is negative on the first decision. On the average, fewer than 1.2 out of 3 decisions are actually needed for a particular RRAM column neuron computation. Skipping the unnecessary decisions leads to 2.5× lower energy consumed in bitline precharge.
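A small sketch of the early-termination policy described above: comparisons stop after the first negative decision, because all later reference levels are larger. The activation statistics used here are an illustrative sparse distribution, not the measured VGG-6 statistics.

```python
import numpy as np

def decisions_needed(y_values, levels=(2, 6, 10)):
    """Average number of compare cycles per neuron with early termination."""
    counts = []
    for y in y_values:
        n = 0
        for r in levels:
            n += 1
            if y - r <= 0:          # first non-positive v_OD: remaining cycles are skipped
                break
        counts.append(n)
    return float(np.mean(counts))

rng = np.random.default_rng(0)
# Illustrative sparse result distribution: most neuron results fall below the first level.
y = np.where(rng.random(10_000) < 0.85,
             rng.integers(-40, 2, 10_000),
             rng.integers(2, 30, 10_000))
print(f"average decisions per neuron: {decisions_needed(y):.2f} of 3")
```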
III. NEURAL ARCHITECTURE
We evaluated the RRAM-based IPU using a VGG-like CNN which we refer to as VGG-6, with topology denoted by 2×(128-C3) + MP2 + 2×(256-C3) + MP2 + 2×(512-C3) + MP2 + Softmax. This network is identical to VGG-7 from [6], except that the 1024-neuron fully-connected layer is omitted. Omitting this layer reduces the parameter count from 13.0 M to 4.7 M at minimal accuracy loss, an appropriate design choice for the embedded applications targeted by this work. Activations are quantized to 2 bits using the PArameterized Clipping acTivation function (PACT), and weights are quantized to 5 levels using Statistics-Aware Weight Binning (SAWB) [7].
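The parameter counts quoted above can be reproduced directly from the topology. The sketch below assumes 32×32 CIFAR-10 inputs and 3×3 kernels, and ignores biases and batch normalization parameters.

```python
def conv_weights(c_in, c_out, k=3):
    return c_in * c_out * k * k

# VGG-6: 2x(128-C3) + MP2 + 2x(256-C3) + MP2 + 2x(512-C3) + MP2 + Softmax
convs = [(3, 128), (128, 128), (128, 256), (256, 256), (256, 512), (512, 512)]
conv_total = sum(conv_weights(ci, co) for ci, co in convs)

fc_softmax = 4 * 4 * 512 * 10                         # three 2x2 max-pools: 32 -> 16 -> 8 -> 4
vgg6 = conv_total + fc_softmax                        # ~4.7 M parameters

# VGG-7 from [6] adds a 1024-neuron fully-connected layer before the softmax.
vgg7 = conv_total + 4 * 4 * 512 * 1024 + 1024 * 10    # ~13.0 M parameters
print(f"VGG-6: {vgg6 / 1e6:.1f} M, VGG-7: {vgg7 / 1e6:.1f} M")
```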
Although PACT and SAWB permit storage of weights and activations as low precision digital codes, they do not completely eliminate floating-point parameters from the network. This is due to the requirement for batch normalization, as well as the floating-point step sizes employed by these quantization schemes. Here, we transform the original PACT/SAWB neuron transfer function into the form

    z = \mathrm{round}\!\left(\frac{\sum_i w_i x_i + b}{\Delta}\right)                              (3)

where the integer neuron bias b and integer step size ∆ (from Section II) are given by
    b = \mathrm{round}\!\left(\frac{\hat{b} - \mu + \beta\sigma/\gamma}{\Delta_w \Delta_{x,i}}\right)          (4)

    \Delta = 2 \cdot \mathrm{round}\!\left(\frac{\Delta_{x,o}\,\sigma/\gamma}{2\,\Delta_w \Delta_{x,i}}\right)   (5)
In Eqs. 4 and 5, the batch normalization parameters are denoted by µ, σ, γ, and β; the neuron bias before transformation is b̂; the input and output activation step sizes are ∆x,i and ∆x,o; and the weight step size is ∆w. The integer step size ∆ is rounded to even, because the ramp generation section must be able to realize the integer value ∆/2. For each neuron, we clip the bias value such that |b| ≤ 200, which limits the bias addition group to 100 wordlines. Because all neurons in one CNN layer must share the same step size ∆ but the batch normalization parameters σ and γ vary by neuron, we set ∆ to the mode of its distribution in a particular layer. These transformations allow the PACT/SAWB neuron computation to be carried out using strictly integer math on an IPU. In the absence of circuit nonidealities, our 2-bit activation, 5-level weight VGG-6 CNN achieves 93.2% accuracy on the CIFAR-10 image classification dataset.
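Below is a sketch of the offline folding and the resulting integer-only neuron, following Eqs. (3)-(5); the function names and the explicit clipping of the output code to the 2-bit range (implied by the quantized ReLU readout) are our own illustrative choices.

```python
import numpy as np

def integer_bias_and_step(b_hat, mu, sigma, gamma, beta, dw, dx_in, dx_out):
    """Fold batch normalization and quantizer step sizes into integers (Eqs. 4 and 5)."""
    b = int(round((b_hat - mu + beta * sigma / gamma) / (dw * dx_in)))
    b = int(np.clip(b, -200, 200))                                    # |b| <= 200 (Section III)
    delta = 2 * round((dx_out * sigma / gamma) / (dw * dx_in) / 2)    # rounded to even
    return b, delta

def layer_step(per_neuron_deltas):
    """One step size per layer: the mode of the per-neuron delta distribution."""
    values, counts = np.unique(np.asarray(per_neuron_deltas), return_counts=True)
    return int(values[np.argmax(counts)])

def ipu_neuron(w, x, b, delta, bits=2):
    """Integer-only neuron transfer function of Eq. (3), clipped to the 2-bit output range."""
    z = round((int(np.dot(w, x)) + b) / delta)
    return int(np.clip(z, 0, 2 ** bits - 1))
```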
IV. SIMULATION RESULTS

A. Implementation Details

Table I shows the number of wordlines allocated to each group for each CNN layer. As discussed below, a number of dummy wordlines equal to 10% of the dot product, bias, and ramp subtotal is added to the array to improve the signal integrity. In our system architecture, CNN layers 2 through 6 are processed on RRAM-based IPUs, and CNN layer 1 and softmax are carried out with floating-point precision.

TABLE I: WORDLINE ALLOCATION

Layer      Dot Product   Bias   Ramp   Dummy    Total
Conv. 2          3,456    100    100     366    4,022
Conv. 3          3,456    100    100     366    4,022
Conv. 4          6,912    100    100     711    7,823
Conv. 5          6,912    100    100     711    7,823
Conv. 6         13,824    100    100   1,402   15,426
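The dummy-wordline column of Table I can be checked from the 10% rule above; the dot-product counts themselves follow from 3×3 kernels, the layer's input channel count, and the 2^B − 1 = 3 cell-pair replicas per weight.

```python
# Consistency check on Table I: dummy = round(10% of dot product + bias + ramp).
dot_product = {"Conv. 2": 3456, "Conv. 3": 3456,       # 9 * 128 input channels * 3 replicas
               "Conv. 4": 6912, "Conv. 5": 6912,       # 9 * 256 * 3
               "Conv. 6": 13824}                       # 9 * 512 * 3
for layer, dp in dot_product.items():
    subtotal = dp + 100 + 100                          # bias and ramp wordline groups
    dummy = round(0.10 * subtotal)
    print(f"{layer}: dummy = {dummy}, total = {subtotal + dummy}")
```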
Examination of the RRAM cell parasitics reveals two effects that can potentially flip the polarity of vOD in a given clock cycle, resulting in a neuron output code error. First, in the presence of parasitic bitline resistance, RRAM cell conductances do not add perfectly linearly as assumed by Eq. 2. Second, due to parasitic capacitance at the drain of the 1T1R cell transistor, RRAM cell conductances contribute to bitline dynamics even when the transistor is off. We present two circuit techniques to mitigate these parasitic effects:

1) Interleaved Wordline Groups: Rather than physically separating wordline groups, we interleave them such that the weight-activation dot product, bias addition, ramp generation, and dummy wordlines are uniformly distributed throughout the array height. This mitigates the impact of parasitic bitline resistance.

2) Balanced Off-Cells: To mitigate the effect of switched-off 1T1R RRAM cells on bitline dynamics, we program the dummy cells such that the differential conductance is as close as possible to zero when all cells are connected. During normal operation, dummy cells are always off, and function to balance the total off-conductance hanging on the plus and minus bitlines.
B. Transient Simulation

We simulated each RRAM column neuron from Table I in SPICE, using one differential column pair under test surrounded by four functioning column pairs. To prevent parasitic capacitive coupling between adjacent bitlines, we utilize only half of the columns at a time, grounding the bitlines of every other column. Our RRAM cell SPICE model is based on post-layout parasitic extraction from a foundry-provided 40 nm RRAM array. For all CNN layers, we used VREF = 0.9 V and a 50 MHz clock. During each clock cycle, a wordline is driven with a 100 ps pulse, gated according to Figs. 3 and 5.

Table II lists the fraction of neuron output code errors obtained in transient simulation in the presence and absence of interleaved wordline groups and balanced off-cells. Input patterns were selected randomly from the CIFAR-10 test data, and then down-selected for cases where the magnitude of the normalized differential conductance |y| ≤ 20 (Eq. 2), to exercise the RRAM column neuron in a regime where parasitics are most likely to flip the polarity of vOD. Using both interleaved wordline groups and balanced off-cells, fewer than 5% of these low-swing input patterns cause errors.

TABLE II: SYSTEMATIC ERRORS IN TRANSIENT SIMULATION

Interleaved WL Groups      Yes        No        No       Yes
Balanced Off-Cells         Yes        No       Yes        No
Conv. 2                   3/30      9/30      6/30      6/30
Conv. 3                   1/30     23/30     18/30      4/30
Conv. 4                   2/30      5/30      8/30      2/30
Conv. 5                   0/30     12/30     13/30      2/30
Conv. 6                   1/30     11/30     11/30      2/30
Total Errors             7/150    60/150    56/150    16/150

Fig. 7 shows an energy breakdown for each CNN layer. We estimate that comparator energy accounts for less than 3% of the total in all CNN layers, hence it is omitted from the breakdown. A comparator with the input-referred noise specified in Section IV-C (roughly 400 µV RMS) consumes on the order of 100 fJ per decision in a scaled CMOS technology [3], which is amortized across thousands of arithmetic operations. To quantify voltage swing, we define the voltage step size ∆V according to the small-signal approximation vOD ≈ y · ∆V. Table III lists energy efficiency, area efficiency, and voltage swing for each CNN layer. Layer 6 achieves the highest energy efficiency at 1.2 2b-POps/s/W, owing to activation sparsity (few wordlines are switched per cycle) and low bitline swing.

Fig. 7. Energy per 2-bit arithmetic operation breakdown.

TABLE III: EFFICIENCY AND SWING IN TRANSIENT SIMULATION

           Energy Efficiency     Area Efficiency        Voltage Step Size
           [2b-TOps/s/W]         [2b-TOps/s/mm^2]       ∆V [µV]
Conv. 2          297                  45.7                  151
Conv. 3          340                  45.7                  190
Conv. 4          835                  47.0                  141
Conv. 5          624                  47.0                  124
Conv. 6        1,208                  47.7                  75.2
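To illustrate why vOD ≈ y·∆V holds at low swing, the sketch below uses a single lumped bitline capacitance and a fixed evaluation time, both assumed values rather than extracted ones, and compares the exponential-discharge result against the small-signal product y·∆V.

```python
import numpy as np

VREF, G0, GU = 0.9, 30e-6, 10e-6
C_BL, T_EVAL = 200e-15, 100e-12        # assumed lumped bitline capacitance and evaluation time

def v_od(g_minus, g_plus):
    """Both bitlines precharge to VREF and discharge exponentially through their cells."""
    v_p = VREF * np.exp(-g_plus * T_EVAL / C_BL)
    v_m = VREF * np.exp(-g_minus * T_EVAL / C_BL)
    return v_p - v_m                   # v_OD = V_BL,P - V_BL,M

n_cells, y = 200, 5                    # assumed connected cells per bitline; y from Eq. (2)
g_cm = n_cells * G0                    # common-mode conductance on each bitline
swing = v_od(g_cm + y * GU, g_cm - y * GU)
dV = 2 * GU * VREF * (T_EVAL / C_BL) * np.exp(-g_cm * T_EVAL / C_BL)
print(f"v_OD = {swing * 1e6:.0f} uV vs. y*dV = {y * dV * 1e6:.0f} uV")   # nearly equal
```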
C. Behavioral Monte Carlo

We consider two sources of random error in the RRAM column neuron that may affect CNN classification accuracy: RRAM cell conductance variation and comparator noise. We parameterize these error sources in terms of the RRAM cell conductance standard deviation normalized by the conductance step size, σG/Gu, and the comparator input-referred noise voltage standard deviation normalized by the voltage step size, σv/∆V. At each point in a 2D sweep over these parameters, we ran 100 behavioral Monte Carlo iterations over the complete 10,000-image CIFAR-10 test set. Each Monte Carlo iteration begins with random generation of RRAM cell conductances for CNN layers 2 to 6, modeling static write errors. Comparator noise is introduced during convolution over the 10,000 test images.

Fig. 8 shows a Shmoo plot indicating the region of accuracy degradation less than 1% from the 93.2% baseline achieved by VGG-6. We select a preferred design point at σG/Gu = 0.25 and σv/∆V = 5, where classification accuracy is 92.5%. At this design point, a comparator input-referred noise of 376 µV RMS and an RRAM cell conductance standard deviation of 2.5 µS can be tolerated.

Fig. 8. Shmoo plot showing region of accuracy degradation less than 1% in behavioral Monte Carlo simulation of the RRAM-based VGG-6 CNN. The conductance step size Gu = 10 µS, and the voltage step size ∆V ranges from 75.2 µV to 190 µV across CNN layers (Table III).
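The sketch below shows how one behavioral Monte Carlo trial can be set up for a single RRAM column neuron at the preferred design point: a static conductance write error drawn once per connected cell, and fresh comparator noise at every decision. The error-referral arithmetic and the toy inputs are simplifications of the full network-level simulation.

```python
import numpy as np

GU = 10e-6
SIGMA_G_REL, SIGMA_V_REL = 0.25, 5.0   # preferred design point: sigma_G/Gu and sigma_v/dV
rng = np.random.default_rng(0)

def mc_neuron_code(w, x, b, transitions=(2, 6, 10)):
    """One Monte Carlo realization of an RRAM column neuron output code.

    Simplified model: every connected cell carries an independent write error
    ~ N(0, sigma_G), referred to the normalized result y of Eq. (2); every
    comparator decision adds input-referred noise ~ N(0, sigma_v), expressed
    here in units of the voltage step size dV.
    """
    n_cells = 2 * (int(np.sum(x)) + abs(b) + 1)                  # rough count of connected cells
    y = float(np.dot(w, x)) + b + rng.normal(0.0, SIGMA_G_REL * GU, n_cells).sum() / (2 * GU)
    return sum(1 for r in transitions if (y - r) + rng.normal(0.0, SIGMA_V_REL) > 0)

w, x, b = np.array([2, -1, 1]), np.array([3, 1, 2]), 1           # toy neuron inputs
codes = [mc_neuron_code(w, x, b) for _ in range(1000)]
print(np.bincount(codes, minlength=4) / 1000)                    # distribution of output codes
```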
V. CONCLUSION

We have presented an RRAM-based IPU that 1) reduces activation memory to one low capacity line buffer per CNN layer, 2) uses dynamic voltage-mode operation to eliminate power-hungry static bias currents, and 3) employs a single-slope A/D readout scheme with RRAM-embedded ramp generator to minimize nonlinearity due to bitline transients. SPICE simulation of the RRAM column neuron indicates array-level energy efficiency up to 1.2 2b-POps/s/W and area efficiency exceeding 45 2b-TOps/s/mm^2. Behavioral Monte Carlo simulation of the complete VGG-6 CNN indicates that 92.5% accuracy on the CIFAR-10 image classification dataset is maintained in the presence of realistic RRAM cell conductance variation and comparator noise.

ACKNOWLEDGMENTS

This work was supported by the Stanford SystemX Alliance with contributions from Facebook, Analog Devices, and Robert Bosch RTC. Albert Gural was supported in part by a National Science Foundation Graduate Research Fellowship. We thank Mentor Graphics for access to the Analog FastSPICE (AFS) platform.

REFERENCES

[1] K. Guo, W. Li, K. Zhong, Z. Zhu, S. Zeng, S. Han, Y. Xie, P. Debacker, M. Verhelst, and Y. Wang, "Neural network accelerator comparison," 2019. [Online]. Available: https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/
[2] B. Zimmer, R. Venkatesan, Y. S. Shao, J. Clemons, M. Fojtik, N. Jiang, B. Keller, A. Klinefelter, N. Pinckney, P. Raina, S. G. Tell, Y. Zhang, W. J. Dally, J. S. Emer, C. T. Gray, S. W. Keckler, and B. Khailany, "A 0.11 pJ/op, 0.32–128 TOPS, scalable multi-chip-module-based deep neural network accelerator with ground-reference signaling in 16nm," in Proc. Symp. VLSI Circuits, June 2019, pp. C300–C301.
[3] D. Bankman, L. Yang, B. Moons, M. Verhelst, and B. Murmann, "An always-on 3.8 µJ/86% CIFAR-10 mixed-signal binary CNN processor with all memory on chip in 28-nm CMOS," IEEE Journal of Solid-State Circuits, vol. 54, no. 1, pp. 158–172, Jan. 2019.
[4] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Proc. 43rd Annu. Int. Symp. Comput. Archit. (ISCA), June 2016, pp. 14–26.
[5] C. Xue, W. Chen, J. Liu, J. Li, W. Lin, W. Lin, J. Wang, W. Wei, T. Chang, T. Chang, T. Huang, H. Kao, S. Wei, Y. Chiu, C. Lee, C. Lo, Y. King, C. Lin, R. Liu, C. Hsieh, K. Tang, and M. Chang, "A 1Mb multibit ReRAM computing-in-memory macro with 14.6ns parallel MAC computing time for CNN-based AI edge processors," in IEEE ISSCC Dig. Tech. Papers, Feb. 2019, pp. 388–390.
[6] F. Li and B. Liu, "Ternary weight networks," CoRR, vol. abs/1605.04711, 2016. [Online]. Available: http://arxiv.org/abs/1605.04711
[7] J. Choi, S. Venkataramani, V. Srinivasan, K. Gopalakrishnan, Z. Wang, and P. Chuang, "Accurate and efficient 2-bit quantized neural networks," 2019. [Online]. Available: https://www.sysml.cc/doc/2019/168.pdf