
The Next Generation of Deep Learning Hardware: Analog Computing

This paper explores the current state of neuromorphic deep learning architectures in silicon CMOS technology.

By WILFRIED HAENSCH, Fellow IEEE, TAYFUN GOKMEN, AND RUCHIR PURI, Fellow IEEE

Manuscript received March 1, 2018; revised June 18, 2018; accepted September 1, 2018. Date of publication October 12, 2018; date of current version December 21, 2018. (Corresponding author: Wilfried Haensch.)
W. Haensch and T. Gokmen are with IBM Research, Yorktown Heights, NY 10598 USA (e-mail: [email protected]).
R. Puri is with IBM Watson AI, Yorktown Heights, NY 10598 USA.
Digital Object Identifier 10.1109/JPROC.2018.2871057

ABSTRACT | Initially developed for gaming and 3-D rendering, graphics processing units (GPUs) were recognized to be a good fit to accelerate deep learning training. Its simple mathematical structure can easily be parallelized and can therefore take advantage of GPUs in a natural way. Further progress in compute efficiency for deep learning training can be made by exploiting the more random and approximate nature of deep learning workflows. In the digital space, that means trading off numerical precision for accuracy at the benefit of compute efficiency. It also opens the possibility to revisit analog computing, which is intrinsically noisy, to execute the matrix operations for deep learning in constant time on arrays of nonvolatile memories. To take full advantage of this in-memory compute paradigm, current nonvolatile memory materials are of limited use. A detailed analysis and design guidelines for how these materials need to be reengineered for optimal performance in the deep learning space show a strong deviation from the materials used in memory applications.

KEYWORDS | Analog computing; deep learning; neural network; neuromorphic computing; nonvolatile memory; synapse

I. INTRODUCTION

Recent hardware developments for deep learning show a migration from general-purpose designs to more specialized hardware to improve compute efficiency, which can be measured in operations per second per watt (ops/s/W). The limited number of mathematical operations needed, and the recurring nature of these operations in the underlying algorithms, was first successfully exploited using graphics processing units (GPUs) for gaming and 3-D rendering [1]. GPUs allow a high degree of parallelism for such workloads and, therefore, significantly enhance the throughput compared to a conventional central processing unit (CPU). Since deep learning algorithms also use a limited set of mathematical operations that are repeated, they too benefit from parallel execution using a GPU [2]–[5]. To drive further improvement of compute efficiency, features of deep learning algorithms can be exploited that are unique to that space [6], [7]. For example, deep learning algorithms are resilient to noise and uncertainty and allow, in part, a tradeoff between algorithmic accuracy and numerical precision [8], [9]. This tradeoff is not present in the traditional application space for GPUs, and this key difference is driving a new generation of ASIC [10]–[12] chip designs for deep learning. This also opens the opportunity to revisit the use of analog computing. The analog approach to deep learning hardware we consider here is an extension of the in-memory compute [13] concept, in which data movement is reduced by performing calculations directly in memory. Arrays of nonvolatile memory (NVM) can be used to execute the matrix operations used in deep learning in constant time, rather than as a sequence of individual multiplication and summation operations. For instance, in an n × m array it is possible to execute n × m multiply-and-accumulate operations in parallel by exploiting Kirchhoff's law [14]. However, to effectively use analog computing based on NVM in deep learning applications, implementation details and material properties need to be aligned with the requirements of the algorithms.

Deep learning can be divided into two distinct operation modes: training and inference.


The training phase is an optimization problem in a multidimensional parameter space to build a model that can be used to provide a wider generalization in the inference process. A model usually consists of a multilayer network with many free parameters (weights) whose values are set during the training process. The optimal form and structure of these networks is an intense area of current research in deep learning with neural networks. The optimal network structure depends on the task to be solved and on the computer hardware that is available. In the training phase, the backpropagation algorithm (BP) [15] is used to implement stochastic gradient descent (SGD) to solve the weight optimization problem. Backpropagation consists of three components: 1) the forward path—the presented data propagate forward through the network until the end, where an error is computed; 2) the backward path—the error propagates backward through the network to compute the gradients of the parameters; and 3) the parameter (weight) update.
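As a concrete illustration of these three steps, the following minimal NumPy sketch (our addition; the layer size, input data, and learning rate are arbitrary) runs one forward pass, one backward pass, and one weight update for a single fully connected layer trained with SGD on a squared error.

# Minimal sketch (not from the paper): the three BP steps for one
# fully connected layer trained with SGD on a squared error.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, lr = 8, 4, 0.1            # illustrative sizes and learning rate
W = 0.1 * rng.standard_normal((n_out, n_in))

x = rng.standard_normal(n_in)          # presented data
target = rng.standard_normal(n_out)

# 1) Forward path: propagate the input and compute an error at the end.
y = np.tanh(W @ x)
error = y - target

# 2) Backward path: propagate the error back to get the weight gradients.
delta = error * (1.0 - y**2)           # error scaled by the activation derivative
grad_W = np.outer(delta, x)            # one gradient entry for every weight w_ij

# 3) Parameter (weight) update.
W -= lr * grad_W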
During the training process, a large body of labeled data is repeatedly presented to the network to determine the best values for the parameters. Due to the volume of data that is processed repeatedly through the many layers of the neural network, the training process tends to be time consuming and can take weeks for a realistic data set, which can require models with hundreds of millions of weight parameters. To find an optimal solution, fine tuning of the hyperparameters, e.g., parameters related to the structure of the network and the training algorithm, is also usually required. Optimizing the hyperparameters requires several complete learning cycles. Therefore, the development of customized hardware to reduce training time is desirable to speed up model development. For inference, the optimized (trained) network is only operated in the forward-path mode, and the computational requirements can vary depending on the specific application. Inference in mobile applications will stress low power, while in data center applications speed (latency) may be more important. Therefore, optimal solutions for training and inference can be quite different. From a software point of view, both training a model and executing it for inference do not depend on the underlying hardware. This can, however, lead to suboptimal performance, since hardware optimized for training and hardware optimized for inference-only will tend to be different. As a practical matter, and with the advent of specialized hardware, it is advantageous to run training and inference on the same hardware platform for a seamless transition from training to inference. For example, if training and inference are performed on different hardware platforms, retraining or tuning of the model to accommodate the difference may be needed.

The remainder of this paper is organized as follows. In Section II, a short discussion of the basic types of networks is given, and in Section III, we briefly discuss the current state of various hardware implementations. This discussion is by no means complete but describes some of the tradeoffs that need to be addressed to optimize training performance. In Section IV, we give a detailed analysis of the design and material requirements for analog computing elements. Finally, in Section V, we briefly summarize some of the interesting recent material developments.

II. DEEP LEARNING NETWORKS

As data scientists seek to increase the accuracy and speed of deep learning, the complexity of network architectures has exploded [16], [17]. We can group the most commonly used deep learning neural networks (DNNs) into three principal classes: fully connected networks (FCNs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs). The latter two are designed to take advantage of spatial or temporal (or sequential) correlations in the data, for instance, in image, text, and speech processing. The building block of a multilayer FCN is a linkage that connects every element of an input layer with every element of an output layer. It can easily be represented as an n × m matrix with n input channels (rows) and m output channels (columns). The matrix elements w_ij determine the strength of the connection from input x_i (i = 1, ..., n) to output y_j (j = 1, ..., m). The multilayer network is then built up as a sequence of these connected layers, with the output of each layer serving as the input to the next. One complication is that a nonlinear activation function is applied to the output of each layer before it is passed to the next layer. Without this nonlinear element between the layers, the network would be equivalent to a simple two-layer linear regression model, and the advantages of the BP algorithm in capturing complex data structures would vanish.
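The role of the nonlinearity can be made concrete with a short NumPy sketch (our illustration; the layer sizes are arbitrary): without an activation between them, two stacked layers collapse exactly to the single matrix W2·W1, whereas inserting the nonlinearity breaks that equivalence.

# Illustration (not from the paper): stacking FCN layers without a
# nonlinearity collapses to a single linear map, W2 @ W1.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(16)
W1 = rng.standard_normal((32, 16))     # layer 1: 16 inputs -> 32 outputs
W2 = rng.standard_normal((10, 32))     # layer 2: 32 inputs -> 10 outputs

linear_stack = W2 @ (W1 @ x)           # no activation between layers
single_layer = (W2 @ W1) @ x           # one equivalent matrix
assert np.allclose(linear_stack, single_layer)

nonlinear_stack = W2 @ np.tanh(W1 @ x) # the activation breaks the equivalence
print(np.allclose(nonlinear_stack, single_layer))  # False (in general)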
A CNN [2] consists of convolution and pooling layers. A convolution layer [Fig. 1(a)] consists of a set of filters, or kernels, of size k × k and d input channels. The input channels d can be considered as the decomposition of the input into appropriate components. For the first layer, that could be, for instance, the decomposition into red, green, and blue components of a color picture; for successive layers, it is the number of independent kernels from the previous layer. The kernels are moved across the input data with a given stride s (the number of pixels moved) to scan the whole picture. There can be a number M of kernels in one convolution layer. The filters contain the weight elements w^l_ij (i = 1, ..., k, j = 1, ..., k, l = 1, ..., d) and are moved across the entire range of the input data. The convolution process associated with one filter bank can be interpreted as a scan for a feature in the data. The convolution process is repeated for several distinct filter banks. Each of these filter banks will produce a feature map, which is then reduced in dimensionality by the pooling process. One common approach is max pooling, in which a window is defined in the feature map and the maximum element in this window is retained. In the next convolution layer, these pooled feature maps serve as the input data set and the process is repeated. Unlike the processing that occurs in FCNs, convolution and pooling cannot be represented as simple standard matrix multiplication.


Fig. 1. (a) Convolution layer with n × n input at d channels and M convolution kernels. (b) Mapping a convolution layer into standard matrix multiplication. The weight matrix has M rows, one for each of the M different kernels, and each row contains the k × k filter pixels for d channels for each kernel. The input matrix maps the input data of dimension n × n into a matrix of dimension [(n − k)/s + 1]² × k²d.

For example, since the same weights are used across the entire input for one filter, the weights are reused many times on the data. The input data stream into the convolution layer can, however, be transformed so that the convolution process can be cast into a standard matrix multiplication form, with the columns of the weight matrix containing the weight coefficients of the different filter banks [Fig. 1(b)]. The reuse of the weights is now shifted to a multiple use of data points in the input matrix. This transformation has the effect that the simple data input vector of dimension n²d is replaced by an input matrix of dimension [(n − k)/s + 1]² × k²d, with n²d the number of pixels in the input data, feeding into the convolution matrix. The weight matrices are relatively small, of dimension M × (k²d + 1). For instance, for the first convolution layer of AlexNet, the weight matrix would have the dimension 96 × [11 × 11 × 3 + 1] = 96 × 364 [the number of kernels × (kernel dimension squared × input channels + bias)] [5].
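The following NumPy sketch (our illustration, with arbitrary example sizes rather than the AlexNet values) shows this mapping explicitly: each k × k × d input patch becomes one row of the input matrix (plus a constant 1 for the bias), and each kernel becomes one row of the M × (k²d + 1) weight matrix.

# Sketch (ours, not the paper's code): casting a convolution into a
# matrix-matrix multiplication by unrolling k x k x d input patches.
import numpy as np

def conv_as_matmul(image, kernels, bias, stride):
    """image: (d, n, n); kernels: (M, d, k, k); bias: (M,)."""
    d, n, _ = image.shape
    M, _, k, _ = kernels.shape

    # Input matrix: one row of k*k*d pixels (plus a 1 for the bias)
    # per output position -> shape (((n-k)/s + 1)^2, k*k*d + 1).
    rows = []
    for i in range(0, n - k + 1, stride):
        for j in range(0, n - k + 1, stride):
            patch = image[:, i:i + k, j:j + k].reshape(-1)
            rows.append(np.append(patch, 1.0))
    inp = np.array(rows)

    # Weight matrix: M rows, each holding one kernel's k*k*d weights + bias.
    W = np.hstack([kernels.reshape(M, -1), bias[:, None]])   # (M, k*k*d + 1)
    return inp @ W.T                                          # one matrix-matrix multiply

image = np.random.default_rng(2).standard_normal((3, 16, 16))     # d = 3, n = 16
kernels = np.random.default_rng(3).standard_normal((8, 3, 5, 5))  # M = 8, k = 5
maps = conv_as_matmul(image, kernels, np.zeros(8), stride=1)
print(maps.shape)   # (144, 8): ((n-k)/s + 1)^2 output positions, M feature maps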
The most popular implementation of an RNN is the long short-term memory (LSTM) network [18]. The structure of this network enables the emphasis or deemphasis of portions of the data based on the data history, or its sequential (temporal) order, by feeding the output of the network back into itself. The complete input sequence is represented by an input vector that combines the training data with the feedback from the previous step. This vector is fed into a weight matrix that has components that either emphasize or deemphasize the processed data according to the data history (sequence). This combination of feedback and ingestion of new data can be repeated many times before the backward path and update process is executed. While the data handling is quite different in LSTMs than in FCNs and CNNs, the weight matrix represents a fully connected network such as the FCN and is therefore also a computational bottleneck. However, there is significant digital postprocessing of the matrix output required to accommodate the data history, and intermediate results need to be stored in memory.

Matrix multiplications are at the core of deep learning networks. For fully connected networks, data travel through the network in the form of vector–matrix multiplications. For better utilization of compute resources, several data points are usually batched together (mini-batch). Reducing the convolution process to a simple matrix multiplication requires the input to become a matrix to begin with, and the convolution operation is then a matrix–matrix multiplication. We will see later that this has an impact on the performance of an analog implementation.

III. DEEP LEARNING HARDWARE

The most popular acceleration scheme for deep learning training today involves the use of GPUs. Since GPUs are more efficient at handling matrix–matrix multiplications compared to vector–matrix multiplications, the input data are usually grouped into so-called mini-batches, combining M_b data input vectors into an input matrix that travels through the network together. The weight update is performed as an average across the mini-batch. In addition, the structure of the network is often designed to avoid costly memory access. These requirements drive, for example, smaller filter sizes and deeper networks. To further utilize parallel processing, n_g GPUs on a single node running on N nodes are used (Fig. 2), with a total number of N_GPU = n_g N. The efficiency or speedup S of these distributed learning solutions depends strongly on balancing several system properties: GPU utilization, node and GPU I/O, and communication between the components, as shown in

S = N_{GPU} \frac{t_{io+GPU}^{M_b} + t_{GPU}^{M_b}}{t_{io+GPU}^{M_b,n_g} + t_{GPU}^{M_b} + t_{comm}^{N}}.   (1)

The quantity S (S ≤ N_GPU) is a measure of how effective the distributed learning solution is. The optimal balance will depend on the exact details of the workload (network architecture and algorithm details) [19]–[21]. Since the mini-batch size is a critical parameter for individual GPU utilization, distributed learning systems usually work with a very large mini-batch size that is then distributed across the GPUs for optimal results. Typically, mini-batch sizes for optimal use of a single GPU are in the range 32–512.


Fig. 2. Distributed learning system. A number N of nodes with n_g GPUs and a number of CPUs (here two, as an example) are connected through a network. Optimal operation requires balancing the utilization of components and communication. CPUs serve as node control units and can also be used for additional computation.

For a system with 256 GPUs in parallel and an individual GPU mini-batch M_b of, for example, 128, that would result in a mini-batch size of 32 768 (for ImageNet, ∼5 GB). These data need to be read from the disk and distributed across the GPUs without creating a bottleneck. Finding the right balance between compute and communication is a key challenge in hardware optimization [22]. Here t_{io+GPU}^{M_b} and t_{GPU}^{M_b} are the time to load the data for mini-batch size M_b onto one GPU and the time to execute that mini-batch, respectively. In the denominator, t_{io+GPU}^{M_b,n_g} and t_{comm}^{N} are the time to load the data onto the node with n_g GPUs and the communication time between the N nodes, respectively. A perfectly balanced system would hide the communication between nodes and would gate the data input/output (I/O) for a node by the GPU I/O. That would result in S = N_GPU.
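A small helper function (our addition, with purely illustrative timings) makes the balance expressed in (1) easy to explore; in the perfectly balanced case the speedup equals N_GPU, while exposed node I/O or communication reduces it.

# Sketch (illustrative numbers, not measurements from the paper): the
# speedup S of (1) for a distributed training system.
def speedup(n_gpu_total, t_io_gpu_mb, t_gpu_mb, t_io_node_mb_ng, t_comm_n):
    """All times in seconds; returns S <= n_gpu_total."""
    single = t_io_gpu_mb + t_gpu_mb                      # one GPU, one mini-batch
    distributed = t_io_node_mb_ng + t_gpu_mb + t_comm_n  # per step in the cluster
    return n_gpu_total * single / distributed

# Balanced case: node I/O gated by GPU I/O, communication fully hidden.
print(speedup(256, t_io_gpu_mb=2e-3, t_gpu_mb=10e-3,
              t_io_node_mb_ng=2e-3, t_comm_n=0.0))       # -> 256.0
# Unbalanced case: exposed node I/O and communication cut the speedup.
print(speedup(256, 2e-3, 10e-3, t_io_node_mb_ng=4e-3, t_comm_n=6e-3))  # -> 153.6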
Accelerating deep learning on a distributed system of GPUs is easily generalizable to any accelerator solution. The solution to the balancing problem will, however, depend on the accelerator architecture and the details of the network.

To accelerate deep learning with conventional digital hardware, the following strategies have been employed: 1) exploit the possibility of operating with reduced precision to increase compute efficiency; and 2) use data compression [23] to reduce the amount of data that is moved between components. Reduced precision provides a very effective mechanism to improve compute efficiency, since it scales quadratically with the bit width [7], [8]. The use of reduced precision requires, however, a careful analysis of the complete algorithmic workflow to understand where, and to what extent, precision can be reduced without impacting classification accuracy [24]. The choice of fixed-point or floating-point arithmetic is another lever that can be exploited [8]. Again, it is important to understand the impact of these choices on the performance and versatility of a given network architecture. Since there is no fundamental theory of deep learning at present, most of these tradeoffs must be studied empirically, and a certain "safety margin" must be provided in order to apply these ideas to networks that have not been explicitly tested.

The appropriate metric for digital computation is the number of operations performed per second per unit power (ops/s/W), or per chip area (ops/s/mm²). This measure is a reasonable metric if communication can be ignored and if networks are confined to a single chip. Unfortunately, this is typically not the case. However, if we consider the individual components of a larger system, these metrics are still a valid basis for comparison. Ultimately, operation and system design will determine how much of this "raw performance" can be realized.

IV. ANALOG COMPUTE FOR DEEP LEARNING

We now turn to a more detailed discussion of the core component of a deep learning system. At the heart of the BP algorithm are three distinct operations: matrix multiplication, weight update, and the application of activation functions. For purely digital computation, these operations can be reduced to floating-point or fixed-point operations with an appropriate accuracy requirement. Alternatively, analog computing elements can be used to perform the matrix operations. Analog computation for matrix operations exploits the fact that a 2-D matrix can be mapped into a physical array (Fig. 3) with the same number of rows and columns as the abstract mathematical object. At the intersection of each row and column there is an element with conductance G that represents the strength of the connection between that row and column (i.e., the weight). If we now apply a voltage difference V across a given row and column, there will be a current flow j

j = GV.   (2)

We can easily generalize this concept to an n × m array (Fig. 3). At the n rows, we apply the components of a voltage vector V_i (i = 1, ..., n) and collect the current at the m columns j_j (j = 1, ..., m). Simple network analysis applying Ohm's and Kirchhoff's laws relates the current vector to the voltage vector

j_j = \sum_{i=1}^{n} g_{j,i} V_i.   (3)

The above equation is exactly equivalent to the conventional result of a matrix multiplication if we identify the physical array, with its connections g_ij, with the abstract mathematical construct. The use of analog arrays allows us to replace two n × m floating-point operations (n × m multiplications and n × m additions) associated with vector–matrix multiplication by one single (parallel) operation. If we now further assume that the connection strengths g_ij (i = 1, ..., n, j = 1, ..., m) can be changed simultaneously, the weight update operation can also be mapped into a single operation (in time). We would again replace two n × m floating-point operations by a single operation. The benefit is twofold: 1) we avoid moving the weight elements from memory to the chip for processing; and 2) we can replace two n × m floating-point operations by one single operation.
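The read operation of (2) and (3) can be emulated in a few lines of NumPy (our sketch; the conductance and voltage values are arbitrary): each column current is the Kirchhoff sum of conductance times row voltage, i.e., one vector–matrix product, and the backward pass simply uses the transposed array.

# Sketch (ours): the crossbar read of (2)-(3). Each column current is the
# sum over rows of conductance x voltage, i.e., one vector-matrix product.
import numpy as np

rng = np.random.default_rng(4)
n, m = 6, 4
g = rng.uniform(1e-6, 1e-5, size=(m, n))     # conductances g_ji in siemens
v = rng.uniform(0.0, 1.0, size=n)            # row voltages V_i

j_cols = g @ v                               # Kirchhoff sum: j_j = sum_i g_ji * V_i
assert np.allclose(j_cols, [np.dot(g[jj], v) for jj in range(m)])

# Backward pass: the same physical array read with rows and columns swapped.
delta = rng.uniform(0.0, 1.0, size=m)        # backpropagated error applied to columns
back = g.T @ delta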


Both compute efficiency and communication are dramatically improved simultaneously [25]. For backpropagation, the matrix multiplication is done with the transposed matrix, simply swapping the rows and columns, including the functionality of the peripheral circuits.

Of course, this begs the question: If it is that simple, why are we not doing it already? To understand the answer, we will now examine this analog process in more detail.

The use of arrays of conductive elements for matrix multiplication is not new; it was proposed many years ago [14]. With renewed interest in deep learning, it gained attention again as a possible solution to accelerate the required computations [26]–[28]. To maintain the benefits noted above, this would mean that the weight data are stored in a physical array, and that all operations are performed locally with the weights in place (i.e., not moved in and out of memory). The natural choice for such arrays comes from memory technologies. We seek a memory solution that 1) can store and retain weights; 2) has a nondestructive readout mechanism; and 3) has the ability to read and write the entire memory array in one single operation. While 1) and 2) are conceivable, 3) is diametrically opposed to conventional memory operation, which in its extreme implementations is optimized for random sequential access or at least limits the accessible address space. This means we might be able to use conventional memory elements, but we must create an array architecture that is different from the architecture of conventional memory.

Fig. 3. Analog memory array—read operation. A time-encoded signal is applied to the rows and the current is integrated at each column. The ADC decodes the signal from analog to digital for further processing. If data flow between layers is time encoded, the ADC can be eliminated.

The basic array architecture that accomplishes vector–matrix multiplication as a single operation is a cross-point array with n rows and m columns, as shown in Fig. 3. A digital-to-analog converter (DAC) converts the components of a digital input vector of length n into a time- or voltage-encoded signal that is applied to the rows. The resulting column current is integrated at each column by charging a capacitor, which feeds into an integrator/amplifier circuit that creates an output voltage V_out that is appropriate for further processing. The next step at the output would be to compute the activation function. This can be done either directly in the analog space, e.g., integrated into the amplifier function [29], or alternatively the output of the operational amplifier can be fed into an analog-to-digital converter (ADC) to calculate the activation digitally [we address the performance requirements of the input (DAC) and output (ADC) elements below]. The benefit of retaining a time-encoded signal at the output is the elimination of ADCs (power and real estate) at the cost, however, of the flexibility that a digital solution might offer in additional processing capabilities (choice of activation function, data renormalization, network configurability, etc.). A critical quantity in this scheme is the integration time t_int needed to accurately determine the integrated column charge. The integration time depends on the tolerated signal-to-noise ratio (SNR) and is influenced by the array size n, the dynamic range of the cross-point element β = g_max/g_min, the operating voltage V_in, and the cross-point device resistance R_dev = 1/g_dev. If we assume an SNR of 10% at the integrator/amplifier output, the design tradeoffs are captured by [30]

\frac{V_{in}}{10} \frac{\beta - 1}{\beta + 1} \sqrt{\frac{t_{int}}{n R_{dev} k_B T}} > 1   (4)

where k_B is the Boltzmann constant and T is the chip temperature. For the estimation of the feedback capacitance C_int (see also Fig. 3), we have

C_{int} = 40 \frac{V_{in}}{V_{max}} \frac{\beta - 1}{\beta + 1} \frac{t_{int}}{R_{dev}}.   (5)

The last constraint comes from the fact that the voltage drop across the metal lines, i.e., within the rows and columns, should be no more than 10% of the voltage drop across the device itself

\frac{n^2 R_{cell}}{R_{dev}} < 0.1.   (6)

Here R_cell is the resistance of the metal line in a unit cell of the array. Equations (4)–(6) constrain the design of a suitable analog array for deep learning. They implicitly assume that no select device is needed. The introduction of a select device might be necessary, as discussed below, to compensate for nonideal behavior in the memory elements. The design constraints presented in (4)–(6) provide the densest array solution for optimal power and performance at the array level. An n × m array configuration can perform the equivalent of two n × m floating-point operations in constant time.


With an integration time t_int, this gives the equivalent performance of 2 n × m / t_int ops/s. For example, parallel processing of a square 1000 × 1000 array with t_int = 1 μs corresponds to 2 Tops/s. If the integration time is reduced to 80 ns and the array size is increased to 4000 × 4000, the throughput is equivalent to 400 Tops/s. Basic restrictions on noise, voltage drop in the metallic cross-bar lines, and integrator capacitance size will limit the practical array size. For example, the tradeoff between cross-point element resistance and array size is shown in Fig. 4 at constant performance, i.e., constant integration time. On the one hand, a larger device resistance enables a larger array at a given metallization technology, because the voltage drop in the metal lines remains relatively small compared to R_dev. On the other hand, a larger resistance increases the thermal noise. Array size and device resistance will be optimal near the cross point of these two factors. At the low-resistance, small-array end of the tradeoff, the integrator capacitor will (at constant performance scaling) be the limiting factor, since the integrator circuit needs to fit into the array pitch, which is difficult if the required capacitance is too large. Adjusting performance (smaller integration time for smaller arrays) will remedy this situation.

Fig. 4. Array sizing tradeoff at constant performance: resistive noise limit (blue solid line), metal line limit (brown dashed–dotted line), integrator capacitance limit (gray dashed–dashed line). Parameters used are: t_int = 80 ns, R_cell = 144 mΩ, V_in and V_max = 1 V, and a fixed β. The shaded area indicates the optimal device operation range of 5 to 25 MΩ.
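The constraints (4)–(6) and the 2nm/t_int throughput estimate can be checked with a short script (our addition). The parameter values below are illustrative and only loosely based on the numbers quoted in the text and in the Fig. 4 caption (80-ns integration, 144-mΩ line resistance per cell, device resistance in the tens of megaohms); the dynamic range β = 10 and T = 300 K are our assumptions.

# Sketch (ours): checking the array-level constraints (4)-(6) and the
# equivalent throughput 2*n*m/t_int for illustrative parameter choices.
import math

k_B = 1.380649e-23        # Boltzmann constant, J/K

def check_array(n, m, t_int, R_dev, R_cell, beta, V_in, V_max, T=300.0):
    ratio = (beta - 1.0) / (beta + 1.0)
    snr_ok   = (V_in / 10.0) * ratio * math.sqrt(t_int / (n * R_dev * k_B * T)) > 1.0   # (4)
    C_int    = 40.0 * (V_in / V_max) * ratio * t_int / R_dev                            # (5), farads
    wires_ok = (n ** 2) * R_cell / R_dev < 0.1                                          # (6)
    tops     = 2.0 * n * m / t_int / 1e12     # equivalent Tops/s for one array
    return snr_ok, wires_ok, C_int, tops

# 4000 x 4000 array, 80 ns integration, 24 MOhm devices, 144 mOhm of wire per cell:
# passes (4) and (6), needs ~0.1 pF of integrator capacitance, and delivers ~400 Tops/s.
print(check_array(4000, 4000, 80e-9, 24e6, 0.144, beta=10.0, V_in=1.0, V_max=1.0))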
The vector–matrix multiply function is rather straightforward to implement. In the backpropagation algorithm, vector–matrix multiplication occurs in both the forward and backward paths. The third component of the BP algorithm is the weight update. During training, the weights (or conductances) are updated according to

w_{i,j} = w_{i,j} + \varepsilon x_i \delta_j   (7)

with x_i the forward-path input vector into layer L, δ_j the backpropagated error vector coming from layer L + 1, and ε the learning rate, which is a hyperparameter that is adjusted for optimal performance (accuracy and speed). In matrix multiplication, the conductances in the array are considered fixed, and are merely sampled by applying a voltage. The weight update process is considerably more complex. Here the weights must be changed in response to x_i and δ_j. The challenge is to execute the weight change locally—at each individual cross-point element—for all array elements at the same time. This requires a physical mechanism in which the resistance of the cross-point material changes in response to a stimulus. One class of material that has this property is nonvolatile memory (NVM). Nonvolatility means that the conductance of the element, i.e., the weight value, persists over a considerable time. Typical NVM elements are used to store a small number of bits (e.g., one or two) in such a way that the stored information can be recovered individually for each element. For deep learning applications, many more states per device must be accessible to enable an incremental (or analog) weight change during training. We will see below that the required properties of NVM elements for deep learning are qualitatively different from those of materials optimized for conventional memory applications. For example, the required changes in the conductance level are more gradual, and it is, in general, not necessary to recover the conductance of a single element.

There are many potential schemes for updating the conductances of the cross-point elements. For example, a selector device could be used so that each device in the array is independently updated in a serial fashion. Closed-loop iterative methods can be implemented to precisely adjust each conductance. However, these methods increase circuit complexity and are too slow for practical applications. Furthermore, as noted above, a parallel weight update would result in significantly higher throughput. One scheme that enables an open-loop, parallel weight update without a selector device is shown in Fig. 5. The approach exploits coincidence between voltage signals on the rows and columns. A sequence of voltage pulses representing the input vector x is applied to the rows. Similarly, a sequence of voltage pulses representing the backpropagating error vector δ is applied to the columns. The components of these vectors are time-encoded representations [29], i.e., series of pulses of constant voltage and length, for both the input vector x and the error vector δ. The encoding scheme discretizes the analog input signal into a fixed bit stream of K bits. Each bit consists of a fixed-length voltage pulse with an amplitude of 0 or V, and the number of nonzero pulses k_i for each input component is proportional to the input vector component x_i. The finite quantization requires rounding to fit the grid. A stochastic rounding procedure turns out to be beneficial for the robustness of the algorithm. An alternative approach is to encode the row and column signals as stochastic bit streams of length K [30]. The number of nonzero pulses k_i in such a bit stream would then, on average, be proportional to the input signal x_i. In both cases, K can be used as an adjustable parameter that modulates the learning rate.


Encoding the signals into bit streams with constant-voltage pulse sequences simplifies the peripheral circuits.

Fig. 5. Analog array—weight update. Rows (x) and columns (δ) receive a bit stream BL of K constant-voltage pulses. Weight update occurs where row and column pulses coincide.

At the core of this scheme is the response of the material to coinciding pulses at the cross points of the rows and columns. Ideally, the conductance at the cross point will change by a maximum amount Δg_ij when there is a coincidence between pulses on the row and column and the voltage across the device is 2V. A "half-select" condition, e.g., a pulse on the row but not the column, can also cause a response in the material. For materials with a switching threshold, the signal voltage V can be chosen to be lower than the threshold to avoid the half-select problem. This is not always possible, and the half-select problem can degrade performance. However, with stochastic bit stream encoding, the half-select problem can be mitigated due to random averaging of the half-select signals. Coincidence events permit a certain tolerance, which results in a fluctuation of the learning rate. The goal is to compute all weight updates in parallel. Since both the row and column inputs can carry a sign, the array update requires four cycles (++, −+, +−, −−) with the correct pulse polarity. The time needed for this is determined by the pulse length t_pulse and the length K_BL of the stimulus bit stream

t_{update} = 4 K_{BL} t_{pulse}.   (8)

Both are hyperparameters that need to be adjusted for optimal performance. Their values will depend on the materials used for the cross point and on the network architecture [31].

To estimate the cycle time per layer of the network, we add the times for the forward and backward passes to the weight update time

t_{cycle} = 2 t_{int} + t_{update}.   (9)
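A toy NumPy sketch of this stochastic, coincidence-based update (our construction, not the authors' implementation) is given below: rows and columns fire independent random pulse trains of length K, a weight moves by a fixed minimum increment wherever a row pulse and a column pulse coincide, and the expected update reproduces the outer product of (7) with an effective learning rate set by the pulse probabilities, the step size, and K.

# Sketch (ours): the open-loop parallel update of (7) approximated by
# coincidences of stochastic bit streams of length K on rows and columns.
import numpy as np

rng = np.random.default_rng(5)
n, m, K = 4, 3, 64
x = rng.uniform(0.0, 1.0, n)          # forward-path inputs (scaled to [0, 1])
delta = rng.uniform(0.0, 1.0, m)      # backpropagated errors (scaled to [0, 1])
dw_min = 0.001                        # conductance change per coincidence

# Each row/column fires a pulse in a given slot with probability x_i / delta_j.
row_pulses = rng.random((n, K)) < x[:, None]        # shape (n, K)
col_pulses = rng.random((m, K)) < delta[:, None]    # shape (m, K)

# A weight changes only where a row pulse and a column pulse coincide;
# all n*m elements see the same K pulse slots, so the update is parallel in time.
coincidences = row_pulses.astype(int) @ col_pulses.T.astype(int)   # (n, m) counts
dW_analog = dw_min * coincidences

# The expected value reproduces the outer-product update of (7) with an
# effective learning rate eps = dw_min * K.
dW_ideal = dw_min * K * np.outer(x, delta)
print(np.abs(dW_analog - dW_ideal).max())   # small for large K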
To this we must add the overhead that comes from the DAC, ADC, other digital computations, and the communication between digital and analog components. With a proper choice of architecture, some of these can be hidden through proper pipelining, while others represent a genuine constraint. For instance, while the ADCs are sampling the output voltage at the opamp, the array can already process the next vector–matrix multiplication and data can be encoded at the DAC. The number of analog arrays, or tiles, that can be operated in parallel will depend on the available on-chip data rate and the on-chip memory. From a network-on-chip (NoC) perspective, an optimal solution requires a balance between the optimal number of analog tiles, the digital backbone, on-chip memory, and on-chip communication bandwidth. The tradeoffs are like those discussed above [see (1)], however without the weight movement, which reduces the data amount significantly. In addition, these must be balanced with off-chip I/O. We will not discuss the system design aspects further, but will focus on the operation of a single analog tile.

There are two key considerations that need to be addressed to understand performance quantitatively: 1) the digital/analog interface requirements; and 2) the material properties of the cross-point elements. To study the impact of the A/D interface and the impact of cross-point material properties, we constructed a simulation tool [30] that captures the impact of the peripheral circuits, e.g., DAC and ADC accuracy and SNR sensitivity of the integrator/amplifier, as well as the switching behavior of the cross-point element, including device-to-device variation, stochasticity, and cycle-to-cycle variations in a single device. We do not consider a detailed circuit model for the periphery [32], but maintain an abstraction level that captures the interaction of the A/D and D/A conversion with the properties of the analog elements on the algorithmic performance, for easy scalability to larger, more complex networks. The typical switching behavior of NVM materials depends very strongly on the switching mechanism. The weight elements can be positive and negative and will, during training, move up and down, often changing sign. Since physical conductivities are always positive, signed weights can only be realized using a differential signal. The differential signal can be obtained either by associating two conductivities G^+ and G^− with one cross point

G = G^+ - G^-   (10)

or by comparing a local cross-point conductance G_local to a global reference G_global for all elements

G = G_{local} - G_{global}.   (11)

Changes in the conductivity need to be small to avoid convergence problems in the algorithm. In deep learning software solutions, the change in weight is controlled by the learning rate ε [see (7)].


Weight changes that increase the weight value are called potentiation, and weight changes that decrease the weight value are called depression. Material properties of NVM do not map in a natural way onto the requirements for the weight movements, mainly because the response to a given stimulus depends both on the value of the weight and on the sign of the stimulus [27]. Ideally, the response would be linear (independent of the weight value) and symmetric (independent of the sign of the stimulus). For many NVM materials, a conductance increase is associated with the set operation and a conductance decrease with the reset operation, e.g., phase change memory (PCM) and resistive random-access memory (RRAM). In general, for existing NVM materials, these two processes are very different due to the underlying physical mechanisms. NVM materials can be distinguished by the detailed switching properties of these two branches. There are several materials available that fall under the following categories (Fig. 6).

1) One-sided or unipolar switching. These are materials that show a gradual change of conductivity in one branch, usually the set branch, and are abrupt in the other [Fig. 6(a)]. In practice, the conductivity can only be gradually changed in one direction.
2) Two-sided or bipolar switching. These materials show gradual change in both the set and reset branches [Fig. 6(b)]. Gradual conductivity changes can be made in either direction.

Fig. 6. (a) One-sided switching: Conductivity can only change gradually in one direction. The conductance level will eventually saturate and the differential signal will not change anymore. (b) Two-sided switching: Conductivity can gradually switch up and down. The differential signal is measured against a fixed reference (global or local).

The switching behavior can be further classified as either linear or nonlinear, depending on the number of stimulus pulses and the conductance state. As indicated in Fig. 6(a), the nonlinearity can lead to saturation, since the conductance change is only in one direction. For a two-sided switching device, symmetric switching means that under the same number of potentiation and successive depression pulses the device returns to its original conductance level. This last requirement is more general than linear switching, since symmetric switching does not require linearity. In operation, one-sided switching devices must be paired, with one element carrying the potentiation and the other the depression signal, as shown in Fig. 6(a). The net conductance is the difference between the conductances of the individual elements. To achieve a symmetric weight update, the pair needs to match in their linearity. For the two-sided device, it is possible to use a global reference (a column for forward or a row for backward propagation), since the conductivity in the cross point can move locally up and down [Fig. 6(b)]. The reference element, however, must be the same for all rows and columns, because the weights need to be the same for forward and backward propagation. A local fixed reference would eliminate this restriction at the cost of array size or process complexity (stacked arrays). The requirements for the cross-point materials for use in deep learning training can be explored using a fully analog deep learning model. The subtleties in the material properties are incorporated into our device switching model. It incorporates spatial variations (device-to-device) and temporal stochastic behavior (coincidence-to-coincidence), as shown in Fig. 7.
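As an illustration of the kind of nonidealities such a model has to capture, the toy device below (our sketch, not the simulator of [30]; all numbers are arbitrary) gives each cross-point element its own step size (device-to-device variation), adds noise to every pulse (coincidence-to-coincidence variation), makes the down step slightly larger than the up step, and bounds the conductance range. Equal numbers of potentiation and depression pulses then leave a residual weight error, which is why the up/down asymmetry must be kept within a few percent.

# Toy device-switching model (ours): each pulse moves the weight by a step
# that has device-to-device spread (spatial), pulse-to-pulse noise (temporal),
# up/down asymmetry, and hard bounds on the weight range.
import numpy as np

class NoisyDevice:
    def __init__(self, rng, step=1e-3, spread=0.3, noise=0.3,
                 asym=0.02, w_min=-1.0, w_max=1.0):
        # Fixed per-device step sizes (spatial, device-to-device variation);
        # 'asym' makes the depression step differ from the potentiation step.
        self.rng = rng
        self.step_up = step * (1.0 + spread * rng.standard_normal())
        self.step_down = self.step_up * (1.0 + asym)
        self.w_min, self.w_max = w_min, w_max
        self.noise = noise
        self.w = 0.0

    def pulse(self, direction):
        # Temporal (coincidence-to-coincidence) stochasticity on every pulse.
        base = self.step_up if direction > 0 else -self.step_down
        self.w += base * (1.0 + self.noise * self.rng.standard_normal())
        self.w = min(max(self.w, self.w_min), self.w_max)  # bounded weight range
        return self.w

rng = np.random.default_rng(6)
dev = NoisyDevice(rng)
for _ in range(500):
    dev.pulse(+1)       # potentiation
for _ in range(500):
    dev.pulse(-1)       # depression
print(dev.w)            # nonzero residue: asymmetry leaves a net weight error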
To understand the interdependence of the digital interface, material properties, and the BP algorithm, we implemented a three-layer FCN (784, 256, 128, 10) for the handwritten character data set MNIST and investigated the training performance, the sensitivity to device parameters, and the A/D interface specifications [30]. We find that the performance of the FCN is robust against stochastic behavior for certain properties and less tolerant to others, as summarized in Fig. 7. Regarding the material properties, the most important attributes are: 1) the weight movement for potentiation and depression must be symmetric within 2%; and 2) the granularity of the weight update requires 1000 steps (∼10 b), on average, between the minimum and maximum conductance values in the case that all variations are considered simultaneously. Relaxed requirements are observed for individual components, which is a nonphysical situation. A 5-b DAC and a 9-b ADC are required, and a noise level of 6% can be tolerated at the integrator/amplifier output.


Fig. 7. Important device variability components: the minimum weight step from device to device (spatial) and from coincidence to coincidence (temporal), the device-to-device weight range, the overall up (+) and down (−) change, and the up (+) to down (−) ratio for a single device. The table shows individual sensitivity and combined sensitivity at a 0.3% penalty from floating-point results.

The MNIST data set is very small, with only about 60 000 images, and the FCN that is used for training has 235 000 weight parameters. For comparison, the recent networks [6] used for training on the ImageNet data set, with 1.2 million pictures, have tens of millions of parameters and many layers [AlexNet: 61 million weights, five convolution layers, and three fully connected layers; ResNet-50: 25.5 million weights, 53 convolution layers, and one fully connected layer]. It is not clear, at present, if the sensitivity analysis presented here holds true when the networks are massively scaled. In a more recent analysis, we have shown that convolution layers can be implemented with the same device parameters if noise and bound management is introduced at the digital peripheries [31] and that stochastic rounding is beneficial for larger LSTM networks [33]. The sensitivity analysis provides a set of target material properties for useful cross-point elements to implement the BP algorithm for deep learning networks. Given the material parameters, we can estimate the performance of a given tile. In Fig. 8, we show an example of the performance as a function of array size for one layer in a fully connected network (FCN, LSTM). For a convolutional layer, power and performance will degrade significantly due to the mapping discussed above (Fig. 1). In contrast to the trend in GPU-based convolutional networks, analog arrays would favor larger kernels. The operating conditions for the example are: integration time 80 ns, sharing 16 rows (columns) per ADC [34] with a sampling rate of 200 Msamples/s at an energy of 23 pJ per sample, and R_dev = 24 MΩ. An on-chip data rate of 90 GB/s is required for the largest tile. In modern NoCs, TB/s of on-chip bandwidth are available, which gives enough headroom for the communication between tiles. Latency due to the additional digital operations (activation, data renormalization, etc.) can be mostly hidden by a pipelined design. Building a NoC with multiple analog tiles as a primary building block can therefore provide a performance potential in the thousands of Tops/s/W on chip.

Fig. 8. Performance, power, and data rate for a single tile with an input vector of size 500, 1000, and 4000.

V. MATERIALS

To capitalize on this performance potential, further innovation is required. Either more-ideal material systems must be identified, or circuit solutions must be developed that can accommodate imperfect materials. The latter will, however, come at a cost of performance, power, and area, since it will need an overhead that can be avoided if suitable materials are identified. A third solution is innovations or modifications of existing learning algorithms that can function with imperfect devices and take advantage of the array architecture. Ultimately, the classification accuracy of analog-based solutions needs to be on par with that of digital solutions, however with the benefit of better power and performance. Conventional NVM materials are optimized for memory applications that require a large SNR [Fig. 9(a)] to securely recover the stored information.


Fig. 9. (a) Memory elements have well-separated resistive states that allow individual readout. (b) Deep learning requires gradual, symmetric changes. Individual states will not be read out, except in the case that an occasional reset is required (one-sided switch). (a) Binary change. (b) Incremental change.

Therefore, only a few, or even just two, conductance levels are supported in conventional NVM. Since the cross-point elements in a deep learning array are never accessed in a sequential fashion, the individual state is never captured. What matters is the accumulated column (row) signal (Fig. 3), which is an averaged quantity. Our sensitivity analysis shows that, unlike for a conventional memory, a certain degree of variation is tolerable in a single element. The deep learning training process is self-correcting, with self-consistent weight updates. However, small and symmetric weight changes, as shown in Fig. 9(b), are strict requirements. For one-sided switching elements, the requirement for symmetric switching will require locally matched linear switching behavior for G^+ and G^− to realize accurate differential operation. Nonlinearities will impact the network performance significantly [35]. For two-sided switching devices, symmetric switching in both the set and reset branches is required, while linearity is less important. There are several NVM materials that have been explored for deep learning. No winner has emerged as a competitive solution for deep learning training. For inference-only, the material constraints are relaxed: no symmetric switching is required, and the granularity (number of states) can be significantly reduced. The weight transfer from a trained floating-point model to an analog array can be done sequentially (node by node, row by row, or column by column) with closed-loop feedback to guarantee accuracy. We find that a replication of the floating-point weights within a 5% error is sufficient to replicate the classification error of the original model. However, inference-only will stress cross-point array yield, since the training process implicitly assumes all cells are functional.

Popular materials that are being investigated for use in analog arrays for deep learning networks are as follows.

A. Phase Change Memory (PCM)

In PCM, different conductance levels are created by changing the morphology of chalcogenide layers from amorphous to crystalline [36]. This transition is thermally activated and therefore requires a heater. This heating element can be the PCM material itself (i.e., Joule heating) if fed with a critical current density. Typically, the memory element (chalcogenide layers) is in series with a low-resistive contact material of small diameter to supply the high current density that provides the energy for the phase transition. The amorphous material gradually crystallizes in a moving front, lowering the resistance. This mechanism eventually saturates, and no further conductance increase is possible. To return the element to the high-resistive state, a high and fast current pulse will melt the material and return it to the amorphous (high-resistance) state. The conductance change in this process is very abrupt. PCM is therefore a one-sided switch with a gradual set (amorphous → crystalline) and an abrupt reset (crystalline → amorphous). Since the switching process is related to a change of crystal structure, PCM switching shows stochasticity and relaxation phenomena during the set process, which can influence the training process due to unacceptable (and unintended) weight changes between updates [37], [38]. Despite this, PCM materials have been successfully used for deep learning [29], [39]. The workaround for the saturation of the conductivity is to periodically reset both the G^+ and G^− conductances while maintaining the difference, which is proportional to the weight [35]. This operation requires a select device, because the reset operation needs to be executed on a small number of devices and not on the whole array. Recently, significant progress has been reported by combining the PCM element with a capacitor and separating leading and trailing digits of the weight [40]. Frequent weight updates during training are performed by modulating the charge on a capacitor, and the weight information is periodically transferred to the PCM element when a critical charge state is reached. This separation ensures a symmetric update for frequently changing weight increments and provides nonvolatility for the most significant digits of the weight, thus circumventing the limitations of the unipolar switching PCM material.

B. Resistive Random-Access Memory (RRAM)

The basic architecture of an RRAM device is a metal–oxide film sandwiched between appropriate metal contacts [41], [42]. The top contact controls the infusion of oxygen vacancies to form a conductive filament that consists of substoichiometric oxide. The conductivity of the device is determined by the proximity of the filament to the bottom contact. The conductivity is controlled by growing (set) or shrinking (reset) the size of the filament. The filament formation (set) and dissolution (reset) are reversible, which enables conduction changes in both directions. Although both branches can show a gradual change in conductivity, the observed changes are not symmetric. Most RRAM devices require a formation process that establishes the conducting filament. This formation process will determine the base resistance of the RRAM device, and it is known that proper current control during the formation process can serve to modulate the operating resistance range of the device.


As with PCM, the change of the conductance depends on structural changes at the atomic level, and is therefore intrinsically stochastic. Controlling the filament formation and dissolution, as well as engineering symmetric set and reset behavior, are the key challenges for RRAM. A convincing hardware demonstration of deep learning training with RRAM is still lacking. However, simulations that incorporate RRAM devices with improved device characteristics [43] are encouraging. The use of a select device for RRAM can be avoided if a material combination is found that has a controlled filament formation that does not require the active current limiting needed for current devices.

C. Conductive Bridge Random Access Memory (CBRAM)

In contrast to RRAM, in a CBRAM a conductive path is formed by mobile metal ions that move through an electrolyte or dielectric [44], [45]. The typical stack of a CBRAM consists of an inert electrode, a solid electrolyte or dielectric, and an electrochemically active electrode. The motion of the metal ions is controlled by applying a voltage to the stack, and it is reversible. There are a variety of material combinations discussed in the literature for this type of device, mostly addressing their use in memory applications. More recently, they have also been considered as cross-point elements for analog arrays for deep learning. A simple stack of Cu/SiO2/W showed interesting gradual switching properties for the set and reset branches [46], however with the caveat that variable voltage pulses were used, which is impractical for an array implementation. That work suggests that a two-layer diffusive model can explain the linear switching of this stack and could possibly reduce the stochasticity of the switching process.

D. Ferroelectric Devices

A ferroelectric device is a stack of a thin dielectric and a ferroelectric material (FE) located between a suitable metal electrode and a substrate. While initial material stacks were based on ferroelectric perovskites, HfO2-based stacks are easier to integrate with conventional CMOS [47]. Ferroelectric materials respond to an external field by changing their electric polarization, either by moving domain wall boundaries or by directly flipping the polarization of a small crystalline domain. In FE devices, the polarization modulates the interface barriers in the stack (Fig. 10), which can either be used to tune the threshold voltage of a field-effect transistor (FET) or to modulate the current through a tunable tunnel junction. Both applications have been proposed for a synaptic device. While the FET solution is a three-terminal device, the tunnel junction is a two-terminal device that provides higher density for a cross-point array due to its smaller size. Proposals to use the adjustable channel conductance of ferroelectric (FE) FETs as a synaptic weight date back to the early 1990s.

Fig. 10. Metal/ferroelectric/dielectric/metal tunnel junction. Polarization domains are modulated by an applied voltage, changing the interface barriers with respect to the Fermi energy EF to modulate the tunnel current [51].

Significant progress has been made using perovskite ferroelectrics such as Pb(Zr,Ti)O3. Yet, implementation on a conventional silicon CMOS platform remains challenging, due to incompatibilities with CMOS processing. The recent discovery of a previously unknown FE phase of HfO2 (FE–HfO2) [48], [49] has the potential to remove the integration challenges of the traditional perovskite-based FE materials. Ferroelectric two-terminal (capacitor or resistor) and three-terminal (transistor) devices can thus be built from the conventional high-k/metal gate materials [50], [51] used in commercial CMOS logic FETs, albeit with different processing and doping to achieve ferroelectric behavior. This opens the possibility of implementing a variety of tunable solutions on a CMOS platform, e.g., FeFETs, FE capacitors controlling conventional FET gates, or metal–FE–metal (MFM) ferroelectric tunnel junctions (FTJs) that might be useful as synaptic devices. The outstanding issues are the demonstration of gradual symmetric switching under constant-voltage pulse stimulation, switching distributions that meet the requirements outlined above, operating conditions that allow energy-efficient operation, and dimensional scalability.

E. Electrochemical Device

Electrochemical devices are a newcomer in the field of contenders for an analog array element for deep learning. The device idea, however, has been around for a long time and is related to the basic principle of a battery [52]. Compared to the previously discussed switches, which only require two terminals, this switch requires three terminals. The device structure, shown in Fig. 11, is a stack of an insulator that forms the channel between two contacts (source and drain), an electrolyte, and a top electrode (reference electrode). A proper bias between the reference electrode and the channel contacts will drive a chemical reaction at the host/electrolyte interface in which positive ions in the electrolyte react with the host, effectively doping the host material. Charge neutrality requires the free carriers to enter the channel through the

channel contacts. If the connection between the reference electrode and the channel contacts is terminated after the write step, the channel will maintain its state of increased conductivity. The read process is simply the current flow between the two channel contacts, source and drain, with the reference electrode floating. It has been shown [53] that almost symmetric switching can be achieved if the reference electrode is controlled with a current source. Regarding the switching requirements, that group obtained criteria [54] similar to those shown above. Voltage control of the reference electrode leads to strongly asymmetric behavior due to the buildup of an open-circuit voltage (VCO) that can depend on the charge state of the host. If the reference electrode voltage compensates for VCO, almost symmetric switching can be achieved as well. For a voltage-controlled analog array for deep learning, this device is not suited, since every cell would require an individual compensation depending on its conductivity. Possible solutions are low-VCO material stacks.

Fig. 11. Three-terminal electrochemical device. Applying a bias between the reference (top) and S/D electrodes moves ions into and out of the channel (host matrix). To maintain the charge state of the channel, the reference electrode must be disconnected after the write.

With the tunable resistive elements at different states of maturity, the question is as follows: Can we implement analog arrays for deep learning with existing CMOS technology options? Charge is the natural agent in the CMOS world to represent the weights. Charges can be stored either on the floating gate of an EPROM device or in a capacitor. Both possibilities have been explored, and for both the main switching properties discussed above hold: symmetry in potentiation and depression and sufficient granularity in the change.

F. Floating Gate Devices

Floating gate devices for use in analog arrays for DNNs were proposed in the early 1990s [55], [56], coinciding with the emerging Flash memory technology. The weights are represented by charges stored in the floating gate of the cross-point cell device. The analysis of the required switching properties matches our results with respect to symmetry and weight granularity. To meet these requirements, a modified cell design was proposed, albeit with a very large cell size. There are two additional concerns for using floating gate devices in deep learning analog arrays: write speed and durability. During the write process, the charge injection into the floating gate is accommodated either by hot-electron effects or by tunneling. Both processes are relatively slow and require high voltages. They also tend to damage the gate dielectric and lead to a limit on durability of about 10^5–10^6 write cycles. For comparison, the number of weight updates for training on the 1.2 million ImageNet samples with a minibatch size of 256 for 50 epochs is 2.3 × 10^5. Recently, proposals have emerged to take advantage of the 3-D stacked architecture for NAND flash or solid-state drive (SSD) configurations [57] for deep learning applications. Due to the limitations on endurance and the high-voltage operation, it is questionable whether floating gate devices are competitive for deep learning training. They might, however, be useful for inference, as only the read operation is required.
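As a quick check of this endurance argument, the sketch below redoes the update count quoted in the text (1.2 million ImageNet samples, minibatch size 256, 50 epochs) and compares it with the assumed 10^5–10^6 write-cycle window; it is only the arithmetic already stated above, written out.

```python
samples = 1_200_000   # ImageNet training images
batch_size = 256
epochs = 50

# Every minibatch produces one update of each weight stored in the array.
updates = (samples // batch_size) * epochs
print(f"weight updates per training run: {updates:.2e}")   # ~2.3e5

# Floating-gate endurance window quoted in the text: 1e5 to 1e6 write cycles.
for endurance in (1e5, 1e6):
    status = "exceeded" if updates > endurance else "within limit"
    print(f"endurance {endurance:.0e}: {status} (margin x{endurance / updates:.1f})")
```

A single training run already sits inside the endurance window, so repeated runs, hyperparameter sweeps, or larger data sets quickly exhaust the lower end of it, which is the concern raised above.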
G. Analog CMOS

In a DRAM, stored charge is used to represent a single bit. Conceptually, the amount of stored charge could be used to represent an analog weight. However, the translation from DRAM to an analog array for deep learning is unfortunately not as straightforward as one might imagine [58]. In DRAM, the charge state of the capacitor is destroyed during the read operation. However, the read mechanism remembers the state and writes it right back. Unfortunately, the writeback operation can only restore the signal to the rails, high or low, which is sufficient for DRAM but not for a deep learning array, where the weights are at an arbitrary level between the high and low states. An additional concern is that charge leaks out of the capacitor, and to compensate for this, the entire DRAM array is periodically refreshed. The typical period for the refresh operation is on the order of 32–64 ms, while the characteristic leakage time, the retention time τret, in which 50% of the cells fail to give the correct signal, is in the several-second regime. For deep learning, we need to avoid destructive reads and compensate for charge leakage. If we assume a time Δ between weight updates of the order of 200 ns and a retention time of the order of seconds, the updated weight will decay according to

    w ← w (1 − Δ/τret).    (12)

This has the same form as the weight decay produced by L2 regularization [59], which is a method to avoid overfitting. Typical values for the decay in the software world are about 10^−5–10^−6. By comparison, we find a required retention of the order of several seconds. A word of caution is in order, since we assumed a cycle time of 200 ns between updates. For convolutional networks as discussed above, the input is a matrix with [(n − k)/s + 1]^2 columns. The time between updates will then be approximately [(n − k)/s + 1]^2 × 200 ns, which would increase the required retention time τret by [(n − k)/s + 1]^2, which is difficult to
achieve in hardware. The input matrix size, for instance, for the first convolution layer of the small CNN LeNet (MNIST) is 576 and for AlexNet (ImageNet) it is 3025, which would require retention times τret in excess of several minutes or more, which is unrealistic to achieve.
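The following sketch reproduces this retention estimate. The first-layer geometries (28×28 input, k = 5, s = 1 for LeNet; 227×227 input, k = 11, s = 4 for AlexNet) are assumptions chosen because they reproduce the 576 and 3025 column counts quoted above, and the 1-s baseline stands in for the "order of seconds" retention required at a 200-ns update time.

```python
def conv_columns(n, k, s):
    """Number of input columns [(n - k)/s + 1]^2 for an n x n image, k x k kernel, stride s."""
    return ((n - k) // s + 1) ** 2

base_retention_s = 1.0   # assumed "order of seconds" requirement for a 200-ns update time

# Assumed first-convolution-layer geometries; they reproduce the quoted column counts.
layers = {"LeNet (MNIST)": (28, 5, 1), "AlexNet (ImageNet)": (227, 11, 4)}

for name, (n, k, s) in layers.items():
    cols = conv_columns(n, k, s)                 # 576 and 3025
    required = cols * base_retention_s           # retention requirement grows with the column count
    print(f"{name}: {cols} columns -> tau_ret ~ {required:.0f} s (~{required / 60:.0f} min)")
```

The roughly ten minutes to nearly an hour of retention implied even for these comparatively small networks is what makes a bare storage capacitor impractical, as stated above.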
A controlled symmetrical change of charge requires the use of current sources: one for injecting charges, the other for extracting charges. These are usually realized with simple FETs that operate in saturation. In the dormant state, they are turned off and leak. This leakage will discharge (or charge) the capacitor that is needed to hold the charges at a certain level. With the definition of the retention time, we find

    C ≈ Ilkg τret / Vcap.    (13)

With the retention time in the 1-s range and Vcap in the range of 1 V, a very low-leakage CMOS technology is required. With an area capacitance of 470 fF/μm^2, which can be achieved in an eDRAM technology [60], the real estate for the capacitor would scale as 2 (Ilkg/pA) × (τret/s) μm^2. Therefore, an ultralow-leakage CMOS technology is required for a competitive array size. In addition to these simple scaling considerations, the effect of device variations on the scaling behavior of the cell needs to be explored in more detail for the implementation of a robust learning algorithm.
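As a sanity check on (13), the sketch below evaluates the required capacitance and its footprint for two illustrative leakage currents, using the 1-V cell voltage and the 470 fF/μm^2 eDRAM area capacitance quoted above; the leakage values themselves are assumptions.

```python
def capacitor_requirements(i_leak_a, tau_ret_s, v_cap_v=1.0, c_area_f_per_um2=470e-15):
    """Capacitance from (13), C = Ilkg * tau_ret / Vcap, and its area for a given area capacitance."""
    c = i_leak_a * tau_ret_s / v_cap_v        # farads
    return c, c / c_area_f_per_um2            # farads, square micrometers

for i_leak in (1e-12, 10e-12):                # 1 pA and 10 pA leakage, illustrative values
    c, area = capacitor_requirements(i_leak, tau_ret_s=1.0)
    print(f"Ilkg = {i_leak * 1e12:.0f} pA: C = {c * 1e15:.0f} fF, area = {area:.1f} um^2")
```

The ~2 μm^2 obtained for 1 pA and 1 s recovers the 2 (Ilkg/pA) × (τret/s) μm^2 scaling quoted above and makes clear why picoampere-level leakage is needed for a competitive cell area.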
VI. CONCLUSION

Artificial intelligence (AI) is now synonymous with deep learning. The desire to apply deep learning to all facets of life is reminiscent of the pervasive use of microelectronics enabled by traditional scaling. We are far away from a similar scaling law for deep learning; in fact, we do not even have a fundamental theory that can guide us. Progress is being made by brute force: we develop more complex neural networks with tens of millions of parameters, collect and curate huge labeled data sets, and find the hardware to run the algorithms. For a pervasive use of deep learning, cost is a major issue. Cost means the time to build models and the computational resources that are needed to train and execute them. The realization that GPUs are a good fit for these tasks was a critical enabling step. However, it is now clear that specialized hardware that is customized for deep learning can do better than conventional GPUs. We are already seeing the emergence of a new generation of deep learning accelerator hardware, trading general use for compute efficiency, which ultimately means cost. Unfortunately, the sheer complexity of building and training models forces us to look at the solution at the system level, where several deep learning accelerators work together to solve the problem. We have only briefly touched on the system aspects of deep learning machines, but these issues will ultimately determine the viability of new AI hardware accelerators. Our discussion focused on the basic design and material properties issues that need to be addressed for analog accelerators. Only if we can show, in a convincing manner, that these are solvable do questions concerning system-level integration become relevant. We do not expect that analog computing for deep learning will drive a fundamentally new ecosystem; rather, it will augment the existing digital one. We will see a continued push to improve neural networks and to push digital hardware solutions to the limit of what is possible. The analog solutions, if successful, should be ready to fit seamlessly into this evolution.

REFERENCES
[1] J. Nickolls and W. J. Dally, “The GPU computing Training deep neural networks with weights and [18] S. Hochreiter and J. Schmidhuber, “Long
era,” IEEE Micro, vol. 30, no. 2, pp. 56–69, activations constrained to +1 or -1.” [Online]. short-term memory,” Neural Comput., vol. 9, no. 8,
Mar./Apr. 2010. Available: https://arxiv.org/abs/1602.02830 pp. 1735–1780, 1997.
[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, [10] P. A. Merolla et al., “A million spiking-neuron [19] J. Dean et al., “Large scale distributed deep
“Gradient-based learning applied to document integrated circuit with a scalable communication networks,” in Proc. Adv. Neural Inf. Process. Syst.
recognition,” Proc. IEEE, vol. 86, no. 11, network and interface,” Science, vol. 345, no. 6197, (NIPS), 2012, pp. 1223–1231.
pp. 2278–2324, Nov. 1998. pp. 668–673, 2014. [20] P. Goyal et al. (2017). “Accurate, large minibatch
[3] J. Schmidhuber, “Deep learning in neural networks: [11] N. P. Jouppi et al., “In-datacenter performance SGD: Training ImageNet in 1 hour.” [Online].
An overview,” Neural Netw., vol. 61, pp. 85–117, analysis of a tensor processing unit,” in Proc. Int. Available: https://arxiv.org/abs/1706.02677
Jan. 2015. Symp. Comput. Archit. (ISCA), Toronto, ON, [21] M. Cho, U. Finkler, S. Kumar, D. Kung, V. Saxena,
[4] Y. LeCun, Y. Bengio, and G. Hinton, “Deep Canada, Jun. 2017, pp. 1–12. and D. Sreedhar (2017). “PowerAI DDL.” [Online].
learning,” Nature, vol. 521, pp. 436–444, [12] M. Davies et al., “Loihi: A neuromorphic manycore Available: https://arxiv.org/abs/1708.02188
May 2015. processor with on-chip learning,” IEEE Micro, [22] S. Shi, Q. Wang, and X. Chu (2017). “Performance
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, vol. 38, no. 1, pp. 82–99, Jan./Feb. 2018. modeling and evaluation of distributed deep
“ImageNet classification with deep convolutional [13] M. Di Ventra and F. L. Traversa, “Memcomputing: learning frameworks on GPUs.” [Online].
neural networks,” in Proc. Adv. Neural Inf. Process. Leveraging memory and physics to compute Available: https://arxiv.org/abs/1711.05979
Syst. (NIPS), 2012, pp. 1097–1105. efficiently,” J. Appl. Phys., vol. 123, no. 18, [23] C.-Y. Chen, J. Choi, D. Brand, A. Agrawal,
[6] V. Sze, Y.-H. Chen, T.-J. Yang, and J. Emer (2017). p. 180901, 2018. W. Zhang, and K. Gopalakrishnan (2017).
“Efficient processing of deep neural networks: A [14] K. Steinbuch, “Die lernmatrix,” Kybernetik, vol. 1, “AdaComp: Adaptive residual gradient compression
tutorial and survey.” [Online]. Available: no. 1, pp. 36–45, 1961. for data-parallel distributed training.” [Online].
https://arxiv.org/abs/1703.09039 [15] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Available: https://arxiv.org/abs/1712.02679
[7] A. Agrawal et al., “Approximate computing: “Learning representations by back-propagating [24] B. Fleischer et al., “A scalable multi-TeraOPS deep
Challenges and opportunities,” in Proc. IEEE Int. errors,” Nature, vol. 323, pp. 533–536, Oct. 1986. learning processor core for AI training and
Conf. Rebooting Comput. (ICRC), San Diego, CA, [16] A. Canziani, E. Culurciello, and A. Paszke (2016). inference,” in Proc. Symp. VLSI Circuits, Honolulu,
USA, 2016, pp. 1–8. “An analysis of deep neural network models for HI, USA, 2018.
[8] S. Gupta, A. Agrawal, K. Gopalakrishnan, and practical applications.” [Online]. Available: [25] C. Lehmann, M. Viredaz, and F. A. Blayo, “A generic
P. Narayanan, “Deep learning with limited https://arxiv.org/abs/1605.07678 systolic array building block for neural networks
numerical precision,” in Proc. Int. Conf. Mach. [17] L. Lazebnik. (2017). Convolutional Neural Network with on-chip learning,” IEEE Trans. Neural Netw.,
Learn., Lille, France, 2015, pp. 1–10. Architectures: From LeNet to ResNet. [Online]. vol. 4, no. 3, pp. 400–407, May 1993.
[9] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Available: http://slazebni.cs.illinois.edu/spring17/ [26] J. J. Yang, D. B. Strukov, and D. R. Stewart,
and Y. Bengio, “Binarized neural networks: lec01_cnn_architectures.pdf “Memristive devices for computing,” Nature


Nanotechnol., vol. 8, pp. 13–24, Dec. 2013. metallic surfactant layer as a resistance drift thin films,” Appl. Phys. Lett., vol. 99, no. 10,
[27] G. W. Burr et al., “Neuromorphic computing using stabilizer,” in IEDM Tech. Dig., Washington, DC, p. 102903, 2011.
non-volatile memory,” Adv. Phys. X, vol. 2, no. 1, USA, Dec. 2013, pp. 30.7.1–30.7.4. [49] M. H. Park et al., “Ferroelectricity and
pp. 89–124, 2017. [38] W. W. Koelmans, A. Sebastian, V. P. Jonnalagadda, antiferroelectricity of doped thin HfO2-based
[28] S. Yu, “Neuro-inspired computing with emerging D. Krebs, L. Dellmann, and E. Eleftheriou, films,” Adv. Mater., vol. 27, no. 11, pp. 1811–1831,
nonvolatile memorys,” Proc. IEEE, vol. 106, no. 2, “Projected phase-change memory devices,” Nature 2015.
pp. 260–285, Feb. 2018. Commun., vol. 6, Sep. 2015, Art. no. 8181. [50] M. Jerry et al., “Ferroelectric FET analog synapse
[29] G. W. Burr et al., “Experimental demonstration and [39] S. R. Nandakumar, M. Le Gallo, I. Boybat, for acceleration of deep neural network training,”
tolerancing of a large-scale neural network B. Rajendran, A. Sebastian, and E. Eleftheriou in IEDM Tech. Dig., San Francisco, CA, USA,
(165,000 synapses), using phase-change memory (2017). “Mixed-precision training of deep neural Dec. 2017, pp. 6.2.1–6.2.4.
as the synaptic weight element,” in IEDM Tech. Dig., networks using computational memory.” [Online]. [51] M. M. Frank et al., “Analog resistance tuning in
San Francisco, CA, USA, Dec. 2014, Available: https://arxiv.org/abs/1712.01192 TiN/HfO2 /TiN ferrolelectric tunnel junctions,” in
pp. 29.5.1–29.5.4. [40] S. Ambrogio et al., “Equivalent-accuracy Proc. IEEE SISC, San Diego, CA, USA, Dec. 2018
[30] T. Gokmen and Y. Vlasov, “Acceleration of deep accelerated neural-network training using analogue [52] J. Janek and W. G. Zeier, “A solid future for battery
neural network training with resistive cross-point memory,” Nature, vol. 558, pp. 60–67, Jun. 2018. development,” Nature Energy, vol. 1, Sep. 2016,
devices: Design considerations,” Frontiers Neurosci., [41] R. Waser, R. Dittmann, G. Staikov, and K. Szot, Art. no. 16141.
vol. 10, p. 333, Jul. 2016. “Redox-based resistive switching [53] E. J. Fuller et al., “Li-ion synaptic transistor for low
[31] T. Gokmen, M. Onen, and W. Haensch, “Training memories—Nanoionic mechanisms, prospects, and power analog computing,” Adv. Mater., vol. 29,
deep convolutional neural networks with resistive challenges,” Adv. Mater., vol. 21, nos. 25–26, no. 4, p. 1604310, 2017.
cross-point devices,” Frontiers Neurosci., vol. 11, pp. 2632–2663, 2009. [54] S. Agarwal et al., “Resistive memory device
p. 538, Oct. 2017. [42] G. Bersuker et al., “Toward reliable RRAM requirements for a neural algorithm accelerator,” in
[32] P.-Y. Chen, X. Peng, and S. Yu, “NeuroSim: An performance: Macro- and micro-analysis of Proc. Int. Joint Conf. Neural Netw. (IJCNN),
integrated device-to-algorithm framework for operation processes,” J. Comput. Electron., vol. 16, Jul. 2016, pp. 929–938.
benchmarking synaptic devices and array no. 4, pp. 1085–1094, 2017. [55] O. Fujita and Y. Amemiya, “A floating-gate analog
architectures,” in IEDM Tech. Dig., Dec. 2017, [43] J. Woo et al., “Improved synaptic behavior under memory device for neural networks,” IEEE Trans.
pp. 6.1.1–6.1.4. identical pulses using AlOx/HfO2 bilayer RRAM Electron Devices, vol. 40, no. 11, pp. 2029–2035,
[33] T. Gokmen, M. Rasch, and W. Haensch (2018). array for neuromorphic systems,” IEEE Electron Nov. 1993.
“Training LSTM networks with resistive cross-point Device Lett., vol. 37, no. 8, pp. 994–997, Aug. 2016. [56] T. Morie and Y. Amemiya, “An all-analog
devices.” [Online]. Available: https:// [44] D. Jana et al., “Conductive-bridging random access expandable neural network LSI with on-chip
arxiv.org/abs/1806.00166 memory: Challenges and opportunity for 3D backpropagation learning,” IEEE Trans. Electron
[34] L. Kull et al., “A 10 b 1.5 GS/s pipelined-SAR ADC architecture,” Nanoscale Res. Lett., vol. 10, Devices, vol. 29, no. 9, pp. 1086–1093,
with background second-stage common-mode Dec. 2015, Art. no. 188. Sep. 1994.
regulation and offset calibration in 14 nm CMOS [45] J. R. Jameson, P. Blanchard, J. Dinh, and E. al, [57] H. Choe, S. Lee, H. Nam, S. Park, S. Kim, and
FinFET,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2017, “Conductive bridging RAM (CBRAM): Then, now, E. Y. Chung, “Near-data processing for machine
pp. 474–475. and tomorrow,” ECS Trans., vol. 75, no. 5, learning,” in Proc. ICLR, 2016,
[35] G. W. Burr et al., “Large-scale neural networks pp. 41–54, 2016. pp. 1–12.
implemented with non-volatile memory as the [46] W. Chen et al., “A CMOS-compatible electronic [58] S. Kim, T. Gokmen, H.-M. Lee, and W. Haensch
synaptic weight element: Comparative performance synapse device based on Cu/SiO2/W (2017). “Analog CMOS-based resistive processing
analysis (accuracy, speed, and power),” in IEDM programmable metallization cells,” Nanotechnology, unit for deep neural network training.” [Online].
Tech. Dig., Washington, DC, USA, Dec. 2015, vol. 27, no. 25, pp. 255202-1–2552029, 2016. Available: https://arxiv.org/abs/1706.06620
pp. 4.4.1–4.4.4. [47] Z. Fan, J. Chen, and J. Wang, “Ferroelectric [59] S. Raschka, Python Machine Learning. Birmingham,
[36] G. W. Burr et al., “Recent progress in phase-change HfO2-based materials for next-generation U.K.: Packt Publishing, 2015.
memory technology,” IEEE J. Emerg. Sel. Topics ferroelectric memories,” J. Adv. Dielectrics, vol. 6, [60] G. Freeman et al., “Performance-optimized
Power Electron., vol. 6, no. 2, pp. 146–162, no. 2, p. 1630003, 2016. gate-first 22-nm SOI technology with embedded
Jun. 2016. [48] T. S. Böscke, J. Müller, D. Bräuhaus, U. Schröder, DRAM,” IBM J. Res. Develop., vol. 59, no. 1,
[37] S. Kim et al., “A phase change memory cell with and U. Böttger, “Ferroelectricity in hafnium oxide pp. 5:1–5:14, Jan./Feb. 2015.

ABOUT THE AUTHORS

Wilfried Haensch (Fellow, IEEE) received the Ph.D. degree in the field of theoretical solid state physics from the Technical University of Berlin, Berlin, Germany, in 1981.
He started his career in Si technology in 1984 at SIEMENS Corporate research, Munich, Germany. There he worked on high field transport in MOSFETs and later in DRAM development and manufacturing. In 2001, he joined the IBM T.J. Watson Research Center, Yorktown Heights, NY, USA, to lead a group for novel devices and applications. He was responsible for the exploration of device concepts for future technology nodes and new concepts for memory and logic circuits, including 3-D integration, early FinFET work, and the exploration of carbon nanotubes for VLSI circuits. He also was active in CMOS integrated silicon photonics to provide high-bandwidth low-cost links for future compute systems. He is currently responsible for novel technologies for neuromorphic computation with emphasis on exploring memristive elements (such as PCM, RRAM, FeRAM, etc.) in neural network arrays. He is the author of a text book on transport physics and author/coauthor of more than 150 publications.
Dr. Haensch was awarded the Otto Hahn Medal for Outstanding Research in 1983.

Tayfun Gokmen received B.S. degrees (double major) from the Department of Physics and the Department of Electrical and Electronics Engineering, Middle East Technical University, Ankara, Turkey, in 2004. As a Francis Robbins Upton Fellow, he received the Ph.D. degree from the Department of Electrical Engineering, Princeton University, Princeton, NJ, USA, in 2010, studying 2-D electron systems in the quantum Hall regime.
After working as a Software Developer at Bloomberg L.P. for a year, in 2011, he joined IBM T.J. Watson Research Center, Yorktown Heights, NY, USA, as a Postdoctoral Researcher in the photovoltaics program. IBM group demonstrated the world's first copper, zinc, tin, sulfur, and selenium (CZTSSe)-based solar cell that can perform above 10% efficiency in 2011. In 2013, after being appointed as a Research Staff Member at IBM and then IBM Research AI, he began working on projects focused on exploring new hardware solutions for machine learning and artificial intelligence. He proposed the concept of resistive processing unit (RPU) devices that can accelerate machine learning, specifically deep neural network training algorithms, by many orders of magnitude compared to conventional digital approaches. He has 20 pending/issued patents, coauthored a book chapter, and published over 50 papers in various fields ranging from condensed matter physics, solar cells, to machine learning.


Ruchir Puri (Fellow, IEEE) is CTO and Chief Architect, IBM Watson AI, Yorktown Heights, NY, USA and an IBM Fellow, who has held various technical, research, and engineering leadership roles across IBM's AI and Research businesses. He has been an Adjunct Professor at Columbia University, New York, NY, USA and a visiting scientist at Stanford University, Stanford, CA, USA. He was honored with the John von Neumann Chair at the Institute of Discrete Mathematics, Bonn University, Bonn, Germany. He is an inventor of over 50 U.S. patents and has authored over 100 scientific publications on software, hardware automation methods, and optimization algorithms.
Dr. Puri has been an ACM Distinguished Speaker, an IEEE Distinguished Lecturer, and was awarded 2014 Asian American Engineer of the Year.

