The Next Generation of Deep Learning Hardware: Analog Computing
By Wilfried Haensch, Tayfun Gokmen, and Ruchir Puri
ABSTRACT | Initially developed for gaming and 3-D rendering, graphics processing units (GPUs) were recognized to be a good fit to accelerate deep learning training. The simple mathematical structure of deep learning can easily be parallelized and can therefore take advantage of GPUs in a natural way. Further progress in compute efficiency for deep learning training can be made by exploiting the more random and approximate nature of deep learning workflows. In the digital space, that means trading off numerical precision for accuracy at the benefit of compute efficiency. It also opens the possibility to revisit analog computing, which is intrinsically noisy, to execute the matrix operations for deep learning in constant time on arrays of nonvolatile memories. To take full advantage of this in-memory compute paradigm, current nonvolatile memory materials are of limited use. A detailed analysis and design guidelines for how these materials need to be reengineered for optimal performance in the deep learning space show a strong deviation from the materials used in memory applications.

KEYWORDS | Analog computing; deep learning; neural network; neuromorphic computing; nonvolatile memory; synapse

Manuscript received March 1, 2018; revised June 18, 2018; accepted September 1, 2018. Date of publication October 12, 2018; date of current version December 21, 2018. (Corresponding author: Wilfried Haensch.) W. Haensch and T. Gokmen are with IBM Research, Yorktown Heights, NY 10598 USA (e-mail: [email protected]). R. Puri is with IBM Watson AI, Yorktown Heights, NY 10598 USA. Digital Object Identifier 10.1109/JPROC.2018.2871057

I. INTRODUCTION
Recent hardware developments for deep learning show a migration from a general-purpose design to more specialized hardware to improve compute efficiency, which can be measured in operations per second per watt (ops/s/W). The limited number of mathematical operations needed, and the recurring nature of these operations in the underlying algorithms, was first successfully exploited using graphics processing units (GPUs) for gaming and 3-D rendering [1]. GPUs allow a high degree of parallelism for such workloads and, therefore, significantly enhance the throughput compared to a conventional central processing unit (CPU). Since deep learning algorithms also use a limited set of mathematical operations that are repeated, they also benefit from parallel execution on a GPU [2]–[5]. To drive further improvement of compute efficiency, features of deep learning algorithms can be exploited that are unique to that space [6], [7]. For example, deep learning algorithms are resilient to noise and uncertainty and allow, in part, a tradeoff between algorithmic accuracy and numerical precision [8], [9]. This tradeoff is not present in the traditional application space for GPUs, and this key difference is driving a new generation of ASIC [10]–[12] chip designs for deep learning. This also opens the opportunity to revisit the use of analog computing. The analog approach to deep learning hardware we consider here is an extension of the in-memory compute [13] concept, in which data movement is reduced by performing calculations directly in memory. Arrays of nonvolatile memory (NVM) can be used to execute the matrix operations used in deep learning in constant time, rather than as a sequence of individual multiplication and summation operations. For instance, in an n × m array it is possible to execute n × m multiply-and-accumulate operations in parallel by exploiting Kirchhoff's law [14]. However, to effectively use analog computing based on NVM in deep learning applications, implementation details and material properties need to be aligned with the requirements of the algorithms.

Deep learning can be divided into two distinct operation modes: training and inference.
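To make the two modes concrete, consider a minimal NumPy sketch of a tiny fully connected model (the layer sizes, sigmoid activation, and learning rate are illustrative assumptions): inference runs only the forward path, while one training step adds the backward path and the weight update described in the following paragraphs.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(784, 256))   # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(256, 10))    # hidden -> output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """Inference: forward path only."""
    h = sigmoid(x @ W1)          # hidden activations
    y = sigmoid(h @ W2)          # network output
    return h, y

def train_step(x, target, lr=0.1):
    """One SGD step: forward path, backward path, weight update."""
    global W1, W2
    h, y = forward(x)                          # 1) forward path
    err_out = (y - target) * y * (1 - y)       # 2) backward path: error at the output
    err_hid = (err_out @ W2.T) * h * (1 - h)   #    error propagated back to the hidden layer
    W2 -= lr * np.outer(h, err_out)            # 3) weight update (outer products)
    W1 -= lr * np.outer(x, err_hid)

x = rng.random(784)                  # one example, e.g., a flattened 28x28 image
target = np.zeros(10); target[3] = 1.0
train_step(x, target)
_, prediction = forward(x)
```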
The training phase is an optimization problem in a multidimensional parameter space to build a model that can be used to provide a wider generalization in the inference process. A model usually consists of a multilayer network with many free parameters (weights) whose values are set during the training process. The optimal form and structure of these networks is an intense area of current research in deep learning with neural networks. The optimal network structure depends on the task to be solved and on the computer hardware that is available. In the training phase, the backpropagation (BP) algorithm [15] is used to implement stochastic gradient descent (SGD) to solve the weight optimization problem. Backpropagation consists of three components: 1) the forward path—the presented data propagate forward through the network until the end, where an error is computed; 2) the backward path—the error propagates backward through the network to compute the gradients in the parameters; and 3) the parameter (weight) update. During the training process, a large body of labeled data is repeatedly presented to the network to determine the best values for the parameters. Due to the volume of data that is processed repeatedly through the many layers of the neural network, the training process tends to be time consuming and can take weeks for a realistic data set, which can require models with hundreds of millions of weight parameters. To find an optimal solution, fine tuning of the hyperparameters, e.g., parameters related to the structure of the network and the training algorithm, is also usually required. Optimizing the hyperparameters requires several complete learning cycles. Therefore, the development of customized hardware to reduce training time is desirable to speed up model development. For inference, the optimized (trained) network is only operated in the forward path mode, and the computational requirements can vary depending on the specific application. Inference in mobile applications will stress low power, while in data center applications speed (latency) may be more important. Therefore, optimal solutions for training and inference can be quite different. From a software point of view, both training a model and executing it for inference do not depend on the underlying hardware. This can, however, lead to suboptimal performance since hardware optimized for training and hardware optimized for inference only will tend to be different. As a practical matter, and with the advent of specialized hardware, it is advantageous to run training and inference on the same hardware platform for a seamless transition from training to inference. For example, if training and inference are performed on different hardware platforms, retraining or tuning of the model to accommodate the difference may be needed.

The remainder of this paper is organized as follows. In Section II, a short discussion of the basic types of networks is given, and in Section III, we briefly discuss the current state of various hardware implementations. This discussion is by no means complete but describes some of the tradeoffs that need to be addressed to optimize training performance. In Section IV, we give a detailed analysis of the design and material requirements for analog computing elements. Finally, in Section V, we briefly summarize some of the interesting recent material developments.

II. DEEP LEARNING NETWORKS
As data scientists seek to increase the accuracy and speed of deep learning, the complexity of network architectures has exploded [16], [17]. We can group the most commonly used deep learning neural networks (DNNs) into three principal classes: fully connected networks (FCNs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs). The latter two are designed to take advantage of spatial or temporal (or sequential) correlations in the data, for instance, in image, text, and speech processing. The building block of a multilayer FCN is a linkage that connects every element of an input layer with every element of an output layer. It can easily be represented as an n × m matrix with n input channels (rows) and m output channels (columns). The matrix elements w_ij determine the strength of the connection from input x_i (i = 1, . . . , n) to output y_j (j = 1, . . . , m). The multilayer network is then built up as a sequence of these connected layers, with the output of each layer serving as the input to the next. One complication is that a nonlinear activation function is applied to the output of each layer before it is passed to the next layer. Without this nonlinear element between the layers, the network would be equivalent to a simple two-layer linear regression model, and the advantages of the BP algorithm to capture complex data structures would vanish. A CNN [2] consists of convolution and pooling layers. A convolution layer [Fig. 1(a)] consists of a set of filters or kernels of size k × k and d input channels. The input channels d can be considered as the decomposition of the input into appropriate components. For the first layer, that could be, for instance, the decomposition into red, green, and blue components for a color picture, and for successive layers, it is the number of independent kernels from the previous layer. The kernels are moved across the input data with a given stride s (the number of pixels moved) to scan the whole picture. There can be a number of kernels M in one convolution layer. The filters contain the weight elements w^l_ij (i = 1, . . . , k; j = 1, . . . , k; l = 1, . . . , d) and are moved across the entire range of the input data. The convolution process associated with one filter bank can be interpreted as a scan for a feature in the data. The convolution process is repeated for several distinct filter banks. Each of these filter banks will produce a feature map, which is then reduced in dimensionality by the pooling process. One common approach is max pooling, in which a window is defined in the feature map and the maximum element in this window is retained. In the next convolution layer, these pooled feature maps serve as the input data set and the process is repeated. Unlike the processing that occurs in FCNs, convolution and pooling cannot be represented as simple standard matrix multiplication.
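One common way to still use matrix hardware for convolutions is the im2col rearrangement sketched below (an illustrative example with assumed layer sizes): patches of the input are unrolled into columns so that all kernel positions are evaluated with one matrix product, at the cost of replicating input elements and thus increasing data movement.

```python
import numpy as np

def im2col(x, k, s):
    """Unroll k x k patches of a d x n x n input (stride s) into columns."""
    d, n, _ = x.shape
    out = (n - k) // s + 1                     # output size per side
    cols = np.empty((d * k * k, out * out))
    col = 0
    for i in range(0, n - k + 1, s):
        for j in range(0, n - k + 1, s):
            cols[:, col] = x[:, i:i + k, j:j + k].ravel()
            col += 1
    return cols, out

d, n, k, s, M = 3, 32, 5, 1, 16                 # illustrative sizes only
x = np.random.rand(d, n, n)                     # d input channels
kernels = np.random.rand(M, d * k * k)          # M filter banks, flattened

cols, out = im2col(x, k, s)                     # shape (d*k*k, out*out)
feature_maps = (kernels @ cols).reshape(M, out, out)
```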
For example, since the same weights are

Fig. 2. Distributed learning system. A number of N nodes with n_g GPUs and a number of CPUs (here two as an example) are connected through a network. Optimal operation requires balancing the utilization of components and communication. CPUs serve as node control units and can also be used for additional computation.

an individual GPU mini-batch M_b of, for example, 128, that would result in a mini-batch size of 32 768 (for ImageNet, ∼5 GB). These data need to be read from the disk and distributed across the GPUs without creating a bottleneck. Finding the right balance between compute and communication is a key challenge in hardware optimization [22]. Here t_io+GPU^Mb and t_GPU^Mb are the time to load the data for minibatch size M_b onto one GPU and the time to execute that minibatch, respectively. In the denominator, t_io+GPU^(Mb,ng) and t_comm^N are the time to load the data onto the node with n_g GPUs and the communication between the N nodes, respectively. A perfectly balanced system would hide the communication between nodes and would gate the data input/output (I/O) for a node by the GPU I/O. That would result in S = N_GPU. Accelerating deep learning on a distributed system of GPUs is easily generalizable for any accelerator solution. The solution to the balancing problem will, however, depend on the accelerator architecture and the details of the network.

To accelerate deep learning with conventional digital hardware, the following strategies have been employed: 1) exploit the possibility of operating with reduced precision to increase compute efficiency; and 2) use data compression [23] to reduce the amount of data that is moved between components. Reduced precision provides a very effective mechanism to improve compute efficiency since it scales quadratically with the bit width [7], [8]. The use of reduced precision requires, however, a careful analysis of the complete algorithmic workflow to understand where, and to what extent, precision can be reduced without impacting classification accuracy [24]. The choice of fixed-point or floating-point arithmetic is another lever that can be exploited [8]. Again, it is important to understand the impact of these choices on the performance and versatility of a given network architecture. Since there is no fundamental theory of deep learning at present, most of these tradeoffs must be studied empirically, and a certain "safety margin" must be provided in order to apply these ideas to networks that have not been explicitly tested.

The appropriate metric for digital computation is the number of operations performed per second per unit

IV. ANALOG COMPUTE FOR DEEP LEARNING
We now turn to a more detailed discussion of the core component of a deep learning system. At the heart of the BP algorithm are three distinct operations: matrix multiplication, weight update, and the application of activation functions. For purely digital computation, these operations can be reduced to floating-point or fixed-point operations with an appropriate accuracy requirement. Alternatively, analog computing elements can be used to perform the matrix operations. Analog computation for matrix operations exploits the fact that a 2-D matrix can be mapped into a physical array (Fig. 3) with the same number of rows and columns as the abstract mathematical object. At the intersection of each row and column there will be an element with conductance G that represents the strength of the connection between that row and column (i.e., the weight). If we now apply a voltage difference V across a given row and column, there will be a current flow j

j = GV. (2)

We can easily generalize this concept to an n × m array (Fig. 3). At the n rows, we apply the components of a voltage vector v_i (i = 1, . . . , n) and collect the current at the m columns, j_j (j = 1, . . . , m). Simple network analysis applying Ohm's and Kirchhoff's laws relates the current vector to the voltage vector

j_j = Σ_{i=1}^{n} g_{j,i} V_i. (3)

The above equation is exactly equivalent to the conventional result of a matrix multiplication if we identify the physical array with its connections g_ij with the abstract mathematical construct. The use of analog arrays allows us to replace two n × m floating-point operations (n × m multiplications and n × m additions) associated with vector–matrix multiplication by one single (parallel) operation. If we now further assume that the connection strengths g_ij (i = 1, . . . , n; j = 1, . . . , m) can be simultaneously changed, the weight update operation can also be mapped into a single operation (in time). We would again replace two n × m floating-point operations by a single operation.
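A small NumPy sketch of these two parallel operations (the conductance range, voltages, noise level, and learning rate are assumed values, and the noise model is only illustrative): the vector–matrix product of (3) is modeled as currents summed along the columns of a conductance array, and the weight update as a single outer-product change of all conductances.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 512, 256
G = rng.uniform(1e-6, 1e-5, size=(n, m))       # cross-point conductances (S), assumed range

def analog_mvm(G, v, noise=0.02):
    """Column currents j_j = sum_i G_ij * v_i, executed in one shot on the array.
    A multiplicative noise term mimics the approximate nature of the analog read."""
    j = v @ G                                   # ideal Kirchhoff summation
    return j * (1.0 + noise * rng.standard_normal(j.shape))

def analog_update(G, v, delta, lr=1e-3):
    """Parallel weight (conductance) update: all n*m elements change at once,
    proportional to the outer product of the forward and backward vectors."""
    return G + lr * np.outer(v, delta)

v = rng.uniform(0.0, 1.0, n)                    # input voltages (arbitrary units)
currents = analog_mvm(G, v)                     # one parallel read replaces n*m MACs
delta = 1e-3 * rng.standard_normal(m)           # error signal from the backward path
G = analog_update(G, v, delta)                  # one parallel write replaces n*m updates
```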
The benefit is twofold: 1) we avoid moving the weight elements from memory to the chip for processing; and 2) we can replace two n × m floating-

t_cycle = 2 t_int + t_update. (9)

To this we must add the overhead that comes from the DAC, ADC, other digital computations, and the

Fig. 6. (a) One-sided switching: conductivity can only change gradually in one direction. The conductance level will eventually saturate, and the differential signal will no longer change. (b) Two-sided switching: conductivity can gradually switch up and down. The differential signal is measured against a fixed reference (global or local).

Changes in the conductivity need to be small to avoid convergence problems in the algorithm. In deep learning software solutions, the change in weight is controlled by the learning rate ε [see (7)].
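As a back-of-the-envelope use of (9), the short sketch below estimates the cycle time and the equivalent digital throughput of a single array; the timing values and the convention of counting 2·n·m operations each for the forward pass, backward pass, and update are assumptions, since the surrounding derivation is not reproduced here.

```python
n, m = 4096, 4096                  # assumed tile dimensions
t_int = 80e-9                      # integration time per pass (assumed, cf. the example later in this section)
t_update = 80e-9                   # update time (assumed)

t_cycle = 2 * t_int + t_update     # Eq. (9): forward + backward integration, then update
ops_per_cycle = 3 * 2 * n * m      # 2*n*m ops each for forward, backward, and update (assumption)
throughput = ops_per_cycle / t_cycle

print(f"cycle time: {t_cycle * 1e9:.0f} ns, ~{throughput / 1e12:.0f} Tops/s per tile")
```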
Weight changes that increase the weight value are called potentiation, and weight changes that decrease the weight value are called depression. Material properties for NVM do not map in a natural way into the requirements for the weight movements, mainly because the response to a given stimulus depends both on the value of the weight and the sign of the stimulus [27]. Ideally, the response would be linear (independent of the weight value) and symmetric (independent of the sign of the stimulus). For many NVM materials, a conductance increase is associated with the set operation and a conductance decrease with the reset operation, e.g., phase change memory (PCM) and resistive random-access memory (RRAM). In general, for existing NVM materials, these two processes are very different due to the underlying physical mechanisms. NVM materials can be distinguished by the detailed switching properties of these two branches. There are several materials available that fall under the following categories (Fig. 6).

1) One-sided or unipolar switching. These are materials that show a gradual change of conductivity in one branch, usually the set branch, and are abrupt in the other [Fig. 6(a)]. In practice, the conductivity can only be gradually changed in one direction.
2) Two-sided or bipolar switching. These materials show gradual change in both the set and reset branches [Fig. 6(b)]. Gradual conductivity changes can be made in either direction.

The switching behavior can be further classified as either linear or nonlinear depending on the number of stimulus pulses and the conductance state. As indicated in Fig. 6(a), the nonlinearity can lead to saturation since the conductance change is only in one direction. For a two-sided switching device, symmetric switching means that under the same number of potentiation and successive depression pulses the device returns to its original conductance level. This last requirement is more general than linear switching since symmetric switching does not require linearity. In operation, one-sided switching devices must be paired, with one element carrying the potentiation and the other the depression signal, as shown in Fig. 6(a). The net conductance is the difference between the conductances of the individual elements. To achieve a symmetric weight update, the pair needs to match in their linearity. For the two-sided device, it is possible to use a global reference (a column for forward or a row for backward propagation) since the conductivity in the cross point can move locally up and down [Fig. 6(b)]. The reference element, however, must be the same for all rows and columns because the weights need to be the same for forward and backward propagation. A local fixed reference would eliminate this restriction at the cost of array size or process complexity (stacked arrays). The requirements for the cross-point materials for use in deep learning training can be explored using a fully analog deep learning model. The subtleties in the material properties are incorporated into our device switching model. It incorporates spatial variations (device-to-device) and temporal stochastic behavior (coincidence-to-coincidence) shown in Fig. 7.

Fig. 7. Important device variability components: the minimum update step Δ from device to device (spatial) and from coincidence to coincidence (temporal), the device-to-device weight range, the overall up (+) and down (−) change, and the up (+)/down (−) ratio for a single device. The table shows individual sensitivity and combined sensitivity at a 0.3% penalty from floating-point results.

To understand the interdependence of the digital interface, material properties, and the BP algorithm, we implemented a three-layer FCN (784, 256, 128, 10) for the handwritten character data set MNIST and investigated the training performance, sensitivity to device parameters, and the A/D interface specifications [30]. We find that the performance of the FCN is robust against stochastic behavior for certain properties and less tolerant to others, as summarized in Fig. 7. Regarding the material properties, the most important attributes are: 1) weight movement for potentiation and depression must be symmetric within 2%; and 2) the granularity of the weight update requires 1000 steps (∼10 b), on average, between the minimum and maximum conductance values in the case that all variations are considered simultaneously. Relaxed requirements are observed for individual components, which is a nonphysical situation.
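The following simplified sketch illustrates the kind of device model used for such sensitivity studies (the parameterization is our own illustrative one, not the exact model of [30]): each cross point has a bounded weight, roughly 1000 steps between its minimum and maximum values, a small up/down asymmetry to capture potentiation/depression mismatch, and both device-to-device and coincidence-to-coincidence randomness.

```python
import numpy as np

rng = np.random.default_rng(2)

class CrossPointArray:
    """Bounded, possibly asymmetric analog weights with spatial and temporal variation."""
    def __init__(self, n, m, n_steps=1000, asymmetry=0.02, dev_var=0.3, cycle_var=0.3):
        self.w_min, self.w_max = -1.0, 1.0
        base = (self.w_max - self.w_min) / n_steps            # target granularity (~10 b)
        # device-to-device (spatial) variation of the step size
        self.dw_up = base * (1 + dev_var * rng.standard_normal((n, m)))
        # asymmetry: depression (down) steps differ slightly from potentiation (up) steps
        self.dw_down = self.dw_up * (1 + asymmetry)
        self.cycle_var = cycle_var                            # pulse-to-pulse (temporal) variation
        self.W = np.zeros((n, m))

    def pulse(self, mask_up, mask_down):
        """Apply one update pulse; conductance changes are stochastic and bounded."""
        noise = 1 + self.cycle_var * rng.standard_normal(self.W.shape)
        self.W += mask_up * self.dw_up * noise
        self.W -= mask_down * self.dw_down * noise
        np.clip(self.W, self.w_min, self.w_max, out=self.W)

arr = CrossPointArray(256, 128)
up = rng.random(arr.W.shape) < 0.1            # example coincidence pattern for potentiation
down = rng.random(arr.W.shape) < 0.1          # and for depression
arr.pulse(up, down)
```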
A 5-b DAC and a 9-b ADC are required, and a noise level of 6% can be tolerated at the integrator/amplifier output. The MNIST data set is very small, with only about 60 000 images, and the FCN that is used for training has 235 000 weight parameters. For comparison, the recent networks [6] used for training on the ImageNet data set, with 1.2 million pictures, have tens of millions of parameters and many layers (AlexNet: 61 million weights, five convolution layers, and three fully connected layers; ResNet-50: 25.5 million weights, 53 convolution layers, and one fully connected layer). It is not clear, at present, if the sensitivity analysis presented here holds true when the networks are massively scaled. In a more recent analysis, we have shown that convolution layers can be implemented with the same device parameters if noise and bound management is introduced at the digital peripheries [31] and that stochastic rounding is beneficial for larger LSTM networks [33]. The sensitivity analysis provides a set of target material properties for useful cross-point elements to implement the BP algorithm for deep learning networks. Given the material parameters, we can estimate the performance of a given tile. In Fig. 8, we show an example of the performance as a function of array size for one layer in a fully connected network (FCN, LSTM). For a convolutional layer, power and performance will degrade significantly due to the mapping discussed above (Fig. 1). In contrast to the trend in GPU-based convolutional networks, analog arrays would favor larger kernels. The operating conditions for the example are: an integration time of 80 ns, 16 rows (columns) sharing one ADC [34] with a sampling rate of 200 Msamples/s at an energy of 23 pJ per sample, and R_dev = 24 MΩ. An on-chip data rate of 90 GB/s is required for the largest tile. In a modern network-on-chip (NoC), TB/s of on-chip bandwidth is available, which gives enough headroom for the communication between tiles. Latency due to the additional digital operations (activation, data renormalization, etc.) can be mostly hidden by a pipelined design. Building a NoC with multiple analog tiles as a primary building block can therefore provide a performance potential in the thousands of Tops/s/W on chip.

Fig. 8. Performance, power, and data rate for a single tile with an input vector of size 500, 1000, and 4000.

V. MATERIALS
To capitalize on this performance potential, further innovation is required. Either more-ideal material systems must be identified or circuit solutions must be developed that can accommodate imperfect materials. The latter will, however, come at a cost of performance, power, and area since it will need an overhead that can be avoided if suitable materials are identified. A third solution is innovation or modification of existing learning algorithms that can function with imperfect devices and take advantage of the array architecture. Ultimately, the classification accuracy of analog-based solutions needs to be on par with that of digital solutions, however with the benefit of better power and performance. Conventional NVM materials are optimized for memory applications that require a large SNR [Fig. 9(a)] to securely recover the
F. Floating Gate Devices
Floating gate devices for use in analog arrays for DNNs were proposed in the early 1990s [55], [56], coinciding with the emerging Flash memory technology. The weights are represented by charges stored in the floating gate of the cross-point cell device. The analysis of the required switching properties matches our results with respect to symmetry and weight granularity. To meet these requirements, a modified cell design was proposed, albeit with a very large cell size. There are two additional concerns for using floating gate devices for deep learning analog arrays: write

This has the same form as the weight decay produced by L2 regularization [59], which is a method to avoid overfitting. Typical values for the decay in the software world are about 10⁻⁵–10⁻⁶. By comparison, we find the required retention to be of the order of several seconds. A word of caution is in order since we assumed a cycle time of 200 ns between updates. For convolutional networks as discussed above, the input is a matrix with [(n − k)/s + 1]² columns.
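A short worked example of this column count and the resulting update interval (the layer geometries are assumed, chosen to reproduce the 576 and 3025 columns quoted in the text, and the 200-ns cycle time is the value assumed above):

```python
def conv_columns(n, k, s):
    """Number of kernel positions for an n x n input, k x k kernel, stride s."""
    return ((n - k) // s + 1) ** 2

t_cycle = 200e-9   # assumed time between updates for the fully connected case
layers = {"LeNet-like (MNIST)": (28, 5, 1), "AlexNet-like (ImageNet)": (227, 11, 4)}
for name, (n, k, s) in layers.items():
    cols = conv_columns(n, k, s)                 # 576 and 3025, matching the text
    print(f"{name}: {cols} columns -> ~{cols * t_cycle * 1e6:.0f} us between weight updates")
```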
The time between updates will then be approximately [(n − k)/s + 1]² × 200 ns, which would increase the required retention time τ_ret by [(n − k)/s + 1]², which is difficult to achieve in hardware. The input matrix size, for instance, for the first convolution layer of the small CNN LeNet (MNIST) is 576 and for AlexNet (ImageNet) is 3025, which would require retention times τ_ret in excess of several minutes or more, which are unrealistic to achieve.

A controlled symmetrical change of charge requires the use of current sources: one for injecting charges, the other for extracting charges. These are usually implemented with simple FETs that operate in saturation. In the dormant state, they are turned off and leak. This leakage will discharge (charge) the capacitor that is needed to hold the charges at a certain level. With the definition of the retention time, we find

C ≈ I_lkg τ_ret / V_cap. (13)

With the retention time in the 1-s range and V_cap in the range of 1 V, a very low leakage CMOS technology is required. With an area capacitance of 470 fF/μm², which can be achieved with an eDRAM technology [60], the real estate for the capacitor would scale as 2(I_lkg/pA) × (τ_ret/s) μm². Therefore, an ultralow leakage CMOS technology is required for a competitive array size. In addition to these simple scaling considerations, the effect of the device variations on the scaling behavior of the cell needs to be explored in more detail for the implementation of a robust learning algorithm.

VI. CONCLUSION
Artificial intelligence (AI) is now synonymous with deep learning. The desire to apply deep learning to all facets of life is reminiscent of the pervasive use of microelectronics enabled by traditional scaling. We are far away from a similar scaling law for deep learning; in fact, we do not even have a fundamental theory that can guide us. Progress is being made by brute force: we develop more complex neural networks with tens of millions of parameters, collect and curate huge labeled data sets, and find the hardware to run the algorithms. For a pervasive use of deep learning, cost is a major issue. Cost means the time to build models and the computational resources that are needed to train and execute them. The realization that GPUs are a good fit for these tasks was a critical enabling step. However, it is now clear that specialized hardware that is customized for deep learning can do better than conventional GPUs. We are already seeing the emergence of a new generation of deep learning accelerator hardware: trading general use for compute efficiency, which ultimately means cost. Unfortunately, the sheer complexity of building and training models forces us to look at the solution at the system level, where several deep learning accelerators work together to solve the problem. We have only briefly touched on the system aspects of deep learning machines, but these issues will ultimately determine the viability of new AI hardware accelerators. Our discussion focused on the basic design and material properties issues that need to be addressed for analog accelerators. Only if we can show, in a convincing manner, that these are solvable do questions concerning system-level integration become relevant. We do not expect that analog computing for deep learning will drive a fundamentally new ecosystem but rather that it will augment the existing, digital one. We will see a continued push to improve neural networks and to push digital hardware solutions to the limit of what is possible. The analog solutions, if successful, should be ready to fit seamlessly into this evolution.
REFERENCES
[1] J. Nickolls and W. J. Dally, "The GPU computing era," IEEE Micro, vol. 30, no. 2, pp. 56–69, Mar./Apr. 2010.
[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[3] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Netw., vol. 61, pp. 85–117, Jan. 2015.
[4] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436–444, May 2015.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097–1105.
[6] V. Sze, Y.-H. Chen, T.-J. Yang, and J. Emer (2017). "Efficient processing of deep neural networks: A tutorial and survey." [Online]. Available: https://arxiv.org/abs/1703.09039
[7] A. Agrawal et al., "Approximate computing: Challenges and opportunities," in Proc. IEEE Int. Conf. Rebooting Comput. (ICRC), San Diego, CA, USA, 2016, pp. 1–8.
[8] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in Proc. Int. Conf. Mach. Learn., Lille, France, 2015, pp. 1–10.
[9] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1." [Online]. Available: https://arxiv.org/abs/1602.02830
[10] P. A. Merolla et al., "A million spiking-neuron integrated circuit with a scalable communication network and interface," Science, vol. 345, no. 6197, pp. 668–673, 2014.
[11] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. Int. Symp. Comput. Archit. (ISCA), Toronto, ON, Canada, Jun. 2017, pp. 1–12.
[12] M. Davies et al., "Loihi: A neuromorphic manycore processor with on-chip learning," IEEE Micro, vol. 38, no. 1, pp. 82–99, Jan./Feb. 2018.
[13] M. Di Ventra and F. L. Traversa, "Memcomputing: Leveraging memory and physics to compute efficiently," J. Appl. Phys., vol. 123, no. 18, p. 180901, 2018.
[14] K. Steinbuch, "Die Lernmatrix," Kybernetik, vol. 1, no. 1, pp. 36–45, 1961.
[15] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, Oct. 1986.
[16] A. Canziani, E. Culurciello, and A. Paszke (2016). "An analysis of deep neural network models for practical applications." [Online]. Available: https://arxiv.org/abs/1605.07678
[17] L. Lazebnik. (2017). Convolutional Neural Network Architectures: From LeNet to ResNet. [Online]. Available: http://slazebni.cs.illinois.edu/spring17/lec01_cnn_architectures.pdf
[18] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[19] J. Dean et al., "Large scale distributed deep networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1223–1231.
[20] P. Goyal et al. (2017). "Accurate, large minibatch SGD: Training ImageNet in 1 hour." [Online]. Available: https://arxiv.org/abs/1706.02677
[21] M. Cho, U. Finkler, S. Kumar, D. Kung, V. Saxena, and D. Sreedhar (2017). "PowerAI DDL." [Online]. Available: https://arxiv.org/abs/1708.02188
[22] S. Shi, Q. Wang, and X. Chu (2017). "Performance modeling and evaluation of distributed deep learning frameworks on GPUs." [Online]. Available: https://arxiv.org/abs/1711.05979
[23] C.-Y. Chen, J. Choi, D. Brand, A. Agrawal, W. Zhang, and K. Gopalakrishnan (2017). "AdaComp: Adaptive residual gradient compression for data-parallel distributed training." [Online]. Available: https://arxiv.org/abs/1712.02679
[24] B. Fleischer et al., "A scalable multi-TeraOPS deep learning processor core for AI training and inference," in Proc. Symp. VLSI Circuits, Honolulu, HI, USA, 2018.
[25] C. Lehmann, M. Viredaz, and F. A. Blayo, "A generic systolic array building block for neural networks with on-chip learning," IEEE Trans. Neural Netw., vol. 4, no. 3, pp. 400–407, May 1993.
[26] J. J. Yang, D. B. Strukov, and D. R. Stewart, "Memristive devices for computing," Nature Nanotechnol., vol. 8, pp. 13–24, Dec. 2013.
[27] G. W. Burr et al., "Neuromorphic computing using non-volatile memory," Adv. Phys. X, vol. 2, no. 1, pp. 89–124, 2017.
[28] S. Yu, "Neuro-inspired computing with emerging nonvolatile memorys," Proc. IEEE, vol. 106, no. 2, pp. 260–285, Feb. 2018.
[29] G. W. Burr et al., "Experimental demonstration and tolerancing of a large-scale neural network (165,000 synapses), using phase-change memory as the synaptic weight element," in IEDM Tech. Dig., San Francisco, CA, USA, Dec. 2014, pp. 29.5.1–29.5.4.
[30] T. Gokmen and Y. Vlasov, "Acceleration of deep neural network training with resistive cross-point devices: Design considerations," Frontiers Neurosci., vol. 10, p. 333, Jul. 2016.
[31] T. Gokmen, M. Onen, and W. Haensch, "Training deep convolutional neural networks with resistive cross-point devices," Frontiers Neurosci., vol. 11, p. 538, Oct. 2017.
[32] P.-Y. Chen, X. Peng, and S. Yu, "NeuroSim: An integrated device-to-algorithm framework for benchmarking synaptic devices and array architectures," in IEDM Tech. Dig., Dec. 2017, pp. 6.1.1–6.1.4.
[33] T. Gokmen, M. Rasch, and W. Haensch (2018). "Training LSTM networks with resistive cross-point devices." [Online]. Available: https://arxiv.org/abs/1806.00166
[34] L. Kull et al., "A 10 b 1.5 GS/s pipelined-SAR ADC with background second-stage common-mode regulation and offset calibration in 14 nm CMOS FinFET," in IEEE ISSCC Dig. Tech. Papers, Feb. 2017, pp. 474–475.
[35] G. W. Burr et al., "Large-scale neural networks implemented with non-volatile memory as the synaptic weight element: Comparative performance analysis (accuracy, speed, and power)," in IEDM Tech. Dig., Washington, DC, USA, Dec. 2015, pp. 4.4.1–4.4.4.
[36] G. W. Burr et al., "Recent progress in phase-change memory technology," IEEE J. Emerg. Sel. Topics Power Electron., vol. 6, no. 2, pp. 146–162, Jun. 2016.
[37] S. Kim et al., "A phase change memory cell with metallic surfactant layer as a resistance drift stabilizer," in IEDM Tech. Dig., Washington, DC, USA, Dec. 2013, pp. 30.7.1–30.7.4.
[38] W. W. Koelmans, A. Sebastian, V. P. Jonnalagadda, D. Krebs, L. Dellmann, and E. Eleftheriou, "Projected phase-change memory devices," Nature Commun., vol. 6, Sep. 2015, Art. no. 8181.
[39] S. R. Nandakumar, M. Le Gallo, I. Boybat, B. Rajendran, A. Sebastian, and E. Eleftheriou (2017). "Mixed-precision training of deep neural networks using computational memory." [Online]. Available: https://arxiv.org/abs/1712.01192
[40] S. Ambrogio et al., "Equivalent-accuracy accelerated neural-network training using analogue memory," Nature, vol. 558, pp. 60–67, Jun. 2018.
[41] R. Waser, R. Dittmann, G. Staikov, and K. Szot, "Redox-based resistive switching memories—Nanoionic mechanisms, prospects, and challenges," Adv. Mater., vol. 21, nos. 25–26, pp. 2632–2663, 2009.
[42] G. Bersuker et al., "Toward reliable RRAM performance: Macro- and micro-analysis of operation processes," J. Comput. Electron., vol. 16, no. 4, pp. 1085–1094, 2017.
[43] J. Woo et al., "Improved synaptic behavior under identical pulses using AlOx/HfO2 bilayer RRAM array for neuromorphic systems," IEEE Electron Device Lett., vol. 37, no. 8, pp. 994–997, Aug. 2016.
[44] D. Jana et al., "Conductive-bridging random access memory: Challenges and opportunity for 3D architecture," Nanoscale Res. Lett., vol. 10, Dec. 2015, Art. no. 188.
[45] J. R. Jameson, P. Blanchard, J. Dinh, et al., "Conductive bridging RAM (CBRAM): Then, now, and tomorrow," ECS Trans., vol. 75, no. 5, pp. 41–54, 2016.
[46] W. Chen et al., "A CMOS-compatible electronic synapse device based on Cu/SiO2/W programmable metallization cells," Nanotechnology, vol. 27, no. 25, pp. 255202-1–255202-9, 2016.
[47] Z. Fan, J. Chen, and J. Wang, "Ferroelectric HfO2-based materials for next-generation ferroelectric memories," J. Adv. Dielectrics, vol. 6, no. 2, p. 1630003, 2016.
[48] T. S. Böscke, J. Müller, D. Bräuhaus, U. Schröder, and U. Böttger, "Ferroelectricity in hafnium oxide thin films," Appl. Phys. Lett., vol. 99, no. 10, p. 102903, 2011.
[49] M. H. Park et al., "Ferroelectricity and antiferroelectricity of doped thin HfO2-based films," Adv. Mater., vol. 27, no. 11, pp. 1811–1831, 2015.
[50] M. Jerry et al., "Ferroelectric FET analog synapse for acceleration of deep neural network training," in IEDM Tech. Dig., San Francisco, CA, USA, Dec. 2017, pp. 6.2.1–6.2.4.
[51] M. M. Frank et al., "Analog resistance tuning in TiN/HfO2/TiN ferroelectric tunnel junctions," in Proc. IEEE SISC, San Diego, CA, USA, Dec. 2018.
[52] J. Janek and W. G. Zeier, "A solid future for battery development," Nature Energy, vol. 1, Sep. 2016, Art. no. 16141.
[53] E. J. Fuller et al., "Li-ion synaptic transistor for low power analog computing," Adv. Mater., vol. 29, no. 4, p. 1604310, 2017.
[54] S. Agarwal et al., "Resistive memory device requirements for a neural algorithm accelerator," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2016, pp. 929–938.
[55] O. Fujita and Y. Amemiya, "A floating-gate analog memory device for neural networks," IEEE Trans. Electron Devices, vol. 40, no. 11, pp. 2029–2035, Nov. 1993.
[56] T. Morie and Y. Amemiya, "An all-analog expandable neural network LSI with on-chip backpropagation learning," IEEE Trans. Electron Devices, vol. 29, no. 9, pp. 1086–1093, Sep. 1994.
[57] H. Choe, S. Lee, H. Nam, S. Park, S. Kim, and E. Y. Chung, "Near-data processing for machine learning," in Proc. ICLR, 2016, pp. 1–12.
[58] S. Kim, T. Gokmen, H.-M. Lee, and W. Haensch (2017). "Analog CMOS-based resistive processing unit for deep neural network training." [Online]. Available: https://arxiv.org/abs/1706.06620
[59] S. Raschka, Python Machine Learning. Birmingham, U.K.: Packt Publishing, 2015.
[60] G. Freeman et al., "Performance-optimized gate-first 22-nm SOI technology with embedded DRAM," IBM J. Res. Develop., vol. 59, no. 1, pp. 5:1–5:14, Jan./Feb. 2015.
Wilfried Haensch (Fellow, IEEE) received the Ph.D. degree in the field of theoretical solid state physics from the Technical University of Berlin, Berlin, Germany, in 1981.
He started his career in Si technology in 1984 at SIEMENS Corporate Research, Munich, Germany, where he worked on high-field transport in MOSFETs and later in DRAM development and manufacturing. In 2001, he joined the IBM T.J. Watson Research Center, Yorktown Heights, NY, USA, to lead a group for novel devices and applications. He was responsible for the exploration of device concepts for future technology nodes and new concepts for memory and logic circuits, including 3-D integration, early FinFET work, and the exploration of carbon nanotubes for VLSI circuits. He was also active in CMOS-integrated silicon photonics to provide high-bandwidth, low-cost links for future compute systems. He is currently responsible for novel technologies for neuromorphic computation with emphasis on exploring memristive elements (such as PCM, RRAM, FeRAM, etc.) in neural network arrays. He is the author of a textbook on transport physics and author/coauthor of more than 150 publications.
Dr. Haensch was awarded the Otto Hahn Medal for Outstanding Research in 1983.

Tayfun Gokmen received B.S. degrees (double major) from the Department of Physics and the Department of Electrical and Electronics Engineering, Middle East Technical University, Ankara, Turkey, in 2004. As a Francis Robbins Upton Fellow, he received the Ph.D. degree from the Department of Electrical Engineering, Princeton University, Princeton, NJ, USA, in 2010, studying 2-D electron systems in the quantum Hall regime.
After working as a Software Developer at Bloomberg L.P. for a year, he joined the IBM T.J. Watson Research Center, Yorktown Heights, NY, USA, in 2011 as a Postdoctoral Researcher in the photovoltaics program. In 2011, the IBM group demonstrated the world's first copper, zinc, tin, sulfur, and selenium (CZTSSe)-based solar cell performing above 10% efficiency. In 2013, after being appointed as a Research Staff Member at IBM and then IBM Research AI, he began working on projects focused on exploring new hardware solutions for machine learning and artificial intelligence. He proposed the concept of resistive processing unit (RPU) devices that can accelerate machine learning, specifically deep neural network training algorithms, by many orders of magnitude compared to conventional digital approaches. He has 20 pending/issued patents, has coauthored a book chapter, and has published over 50 papers in fields ranging from condensed matter physics and solar cells to machine learning.
Ruchir Puri (Fellow, IEEE) is CTO and Chief Architect, IBM Watson AI, Yorktown Heights, NY, USA, and an IBM Fellow, who has held various technical, research, and engineering leadership roles across IBM's AI and Research businesses. He has been an Adjunct Professor at Columbia University, New York, NY, USA, and a visiting scientist at Stanford University, Stanford, CA, USA. He was honored with the John von Neumann Chair at the Institute of Discrete Mathematics, Bonn University, Bonn, Germany. He is an inventor of over 50 U.S. patents and has authored over 100 scientific publications on software, hardware automation methods, and optimization algorithms.
Dr. Puri has been an ACM Distinguished Speaker and an IEEE Distinguished Lecturer, and was awarded 2014 Asian American Engineer of the Year.