TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro,
Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow,
Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser,
Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray,
Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar,
Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals,
Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng
Google Research
Abstract
TensorFlow [1] is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be
executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones
and tablets up to large-scale distributed systems of hundreds
of machines and thousands of computational devices such as
GPU cards. The system is flexible and can be used to express
a wide variety of algorithms, including training and inference
algorithms for deep neural network models, and it has been
used for conducting research and for deploying machine learning systems into production across more than a dozen areas of
computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural
language processing, geographic information extraction, and
computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that
we have built at Google. The TensorFlow API and a reference
implementation were released as an open-source package under
the Apache 2.0 license in November, 2015 and are available at
www.tensorflow.org.
1 Introduction
DistBelief, our first-generation scalable distributed training and inference system, has been used within Google for research applications including sequence prediction [47], move selection for Go [34], pedestrian detection [2], reinforcement learning [38], and other areas [17, 5]. In addition, often in close collaboration with the Google Brain team, more than 50 teams at Google and other Alphabet companies have deployed deep neural networks using DistBelief in a wide variety of products, including Google Search [11], our advertising products, our speech recognition systems [50, 6, 46], Google Photos [43], Google Maps and StreetView [19], Google Translate [18], YouTube, and many others.
graph, with many different computational devices all collaborating to update a set of shared parameters or other
state. Modest changes in the description of the computation allow a wide variety of different approaches
to parallelism to be achieved and tried with low effort
[14, 29, 42]. Some TensorFlow uses allow some flexibility in terms of the consistency of parameter updates, and
we can easily express and take advantage of these relaxed
synchronization requirements in some of our larger deployments. Compared to DistBelief, TensorFlow's programming model is more flexible, its performance is significantly better, and it supports training and using a
broader range of models on a wider variety of heterogeneous hardware platforms.
Dozens of our internal clients of DistBelief have already switched to TensorFlow. These clients rely on
TensorFlow for research and production, with tasks ranging from running inference for computer vision models on mobile phones to large-scale training of deep
neural networks with hundreds of billions of parameters on hundreds of billions of example records using
many hundreds of machines [11, 47, 48, 18, 53, 41].
Although these applications have concentrated on machine learning and deep neural networks in particular,
we expect that TensorFlow's abstractions will be useful
in a variety of other domains, including other kinds of
machine learning algorithms, and possibly other kinds
of numerical computations. We open-sourced the TensorFlow API and a reference implementation under the Apache 2.0 license in November 2015; they are available at www.tensorflow.org.
The rest of this paper describes TensorFlow in more
detail. Section 2 describes the programming model and
basic concepts of the TensorFlow interface, and Section 3
describes both our single machine and distributed implementations. Section 4 describes several extensions to
the basic programming model, and Section 5 describes
several optimizations to the basic implementations. Section 6 describes some of our experiences in using TensorFlow, Section 7 describes several programming idioms we have found helpful when using TensorFlow, and
Section 9 describes several auxiliary tools we have built
around the core TensorFlow system. Sections 10 and 11
discuss future and related work, respectively, and Section 12 offers concluding thoughts.
2 Programming Model and Basic Concepts

Sessions
import tensorflow as tf
b = tf.Variable(tf.zeros([100]))                      # 100-d vector, init to zeroes
W = tf.Variable(tf.random_uniform([784,100],-1,1))    # 784x100 matrix w/rnd vals
x = tf.placeholder(name="x")                          # Placeholder for input
relu = tf.nn.relu(tf.matmul(W, x) + b)                # Relu(Wx+b)
C = [...]                                             # Cost computed as a function of Relu
s = tf.Session()
for step in xrange(0, 10):
    input = ...construct 100-D input array...         # Create 100-d vector for input
    result = s.run(C, feed_dict={x: input})           # Fetch cost, feeding x=input
    print step, result

Figure 1: Example TensorFlow code fragment
Figure 2: Corresponding computation graph for Figure 1
Table 1: Example TensorFlow operation types

Element-wise mathematical operations: Add, Sub, Mul, Div, Exp, Log, Greater, Less, Equal, ...
Array operations: Concat, Slice, Split, Constant, Rank, Shape, Shuffle, ...
Matrix operations: MatMul, MatrixInverse, MatrixDeterminant, ...
Stateful operations: Variable, Assign, AssignAdd, ...
Neural-net building blocks: SoftMax, Sigmoid, ReLU, Convolution2D, MaxPool, ...
Checkpointing operations: Save, Restore
Queue and synchronization operations: Enqueue, Dequeue, MutexAcquire, MutexRelease, ...
Control flow operations: Merge, Switch, Enter, Leave, NextIteration
Variables
Tensors
A tensor in our implementation is a typed, multidimensional array. We support a variety of tensor element types, including signed and unsigned integers ranging in size from 8 bits to 64 bits, IEEE float and double
types, a complex number type, and a string type (an arbitrary byte array). Backing store of the appropriate size
is managed by an allocator that is specific to the device
on which the tensor resides. Tensor backing store buffers
are reference counted and are deallocated when no references remain.
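A rough Python sketch of this reference-counting scheme (ours, purely illustrative; the real implementation is C++ code inside the runtime, and the class and method names here are invented):

class DeviceAllocator(object):
    # One allocator per device; a real allocator would hand out device memory.
    def __init__(self, device_name):
        self.device_name = device_name
    def allocate(self, num_bytes):
        return bytearray(num_bytes)

class TensorBuffer(object):
    # Backing store shared by tensors; freed when the last reference is dropped.
    def __init__(self, allocator, num_bytes):
        self.data = allocator.allocate(num_bytes)
        self.refcount = 1
    def ref(self):
        self.refcount += 1
    def unref(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.data = None   # deallocated when no references remain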
3 Implementation

3.1 Single-Device Execution
Let's first consider the simplest execution scenario: a single worker process with a single device. The nodes of the
graph are executed in an order that respects the dependencies between nodes. In particular, we keep track of
a count per node of the number of dependencies of that
node that have not yet been executed. Once this count
drops to zero, the node is eligible for execution and is
added to a ready queue. The ready queue is processed in
some unspecified order, delegating execution of the kernel for a node to the device object. When a node has
finished executing, the counts of all nodes that depend
on the completed node are decremented.
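The following sketch (ours, not the runtime code; run_kernel stands in for handing a node's kernel to its device object) captures this scheme:

from collections import deque

def execute_graph(nodes, deps, run_kernel):
    # deps[n] is the set of nodes that node n depends on.
    pending = {n: len(deps[n]) for n in nodes}
    consumers = {n: [] for n in nodes}
    for n in nodes:
        for d in deps[n]:
            consumers[d].append(n)
    ready = deque(n for n in nodes if pending[n] == 0)
    while ready:
        n = ready.popleft()        # processed in some unspecified order
        run_kernel(n)              # delegate kernel execution to the device
        for c in consumers[n]:     # decrement counts of dependent nodes
            pending[c] -= 1
            if pending[c] == 0:
                ready.append(c)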
3.2 Multi-Device Execution
Devices
Devices are the computational heart of TensorFlow. Each
worker is responsible for one or more devices, and
each device has a device type and a name. Device names are composed of pieces that identify the device's type, the device's index within the worker, and, in our distributed setting, an identification of the job and task of the worker (or localhost for the case where the devices are local to the process). Example device names are "/job:localhost/device:cpu:0" or "/job:worker/task:17/device:gpu:3". We have implementations of our Device interface for CPUs and GPUs, and new device implementations for other device types can be provided via a registration mechanism.
3.2.1 Node Placement
Given a computation graph, one of the main responsibilities of the TensorFlow implementation is to map the
computation onto the set of available devices. A simplified version of this algorithm is presented here. See
Section 4.3 for extensions supported by this algorithm.
One input to the placement algorithm is a cost model,
which contains estimates of the sizes (in bytes) of the
input and output tensors for each graph node, along with estimates of the computation time required for each node when presented with its input tensors. This cost model is either statically estimated based on heuristics associated with different operation types, or is measured based on an actual set of placement decisions for earlier executions of the graph.

Figure 3: Single machine and distributed system structure
The placement algorithm first runs a simulated execution of the graph. The simulation is described below and ends up picking a device for each node in the graph using greedy heuristics. The node to device placement generated by this simulation is also used as the placement for the real execution.
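A simplified sketch of such a greedy pass (ours, for illustration only; the real algorithm also honors the constraints described in Section 4.3) assigns each node, in an order compatible with the graph dependencies, to the feasible device where it would complete earliest under the cost model:

def place_nodes(nodes, feasible_devices, compute_cost, transfer_cost, inputs):
    # nodes: graph nodes in a dependency-compatible order
    # compute_cost(node, dev): estimated execution time of node on dev
    # transfer_cost(src_dev, dst_dev, nbytes): estimated cross-device copy time
    # inputs[node]: list of (producer_node, nbytes) pairs
    placement = {}
    for node in nodes:
        best_dev, best_time = None, None
        for dev in feasible_devices(node):
            t = compute_cost(node, dev)
            for producer, nbytes in inputs[node]:
                t += transfer_cost(placement[producer], dev, nbytes)
            if best_time is None or t < best_time:
                best_dev, best_time = dev, t
        placement[node] = best_dev
    return placement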
3.2.2 Cross-Device Communication

Once the node placement has been computed, the graph is partitioned into a set of subgraphs, one per device, and any cross-device edge is replaced by a pair of special Send and Receive nodes that communicate the tensor across the device boundary (see Figure 4).

Figure 4: Before & after insertion of Send/Receive nodes
3.3 Distributed Execution
Distributed execution of a graph is very similar to multidevice execution. After device placement, a subgraph is
created per device. Send/Receive node pairs that communicate across worker processes use remote communication mechanisms such as TCP or RDMA to move data
across machine boundaries.
Fault Tolerance
4 Extensions

In this section we describe several more advanced features of the basic programming model that was introduced in Section 2.

4.1 Gradient Computation

Many optimization algorithms, including common machine learning training algorithms like stochastic gradient descent [45], compute the gradient of a cost function with respect to a set of inputs. Because this is such a common need, TensorFlow has built-in support for automatic gradient computation. If a tensor C in a TensorFlow graph depends, perhaps through a complex subgraph of operations, on some set of tensors {Xk}, then there is a built-in function that will return the tensors {dC/dXk}. Gradient tensors are computed, like other tensors, by extending the TensorFlow graph, using the following procedure.

When TensorFlow needs to compute the gradient of a tensor C with respect to some tensor I on which C depends, it first finds the path in the computation graph from I to C. Then it backtracks from C to I, and for each operation on the backward path it adds a node to the TensorFlow graph, composing the partial gradients along the backwards path using the chain rule. The newly added node computes the gradient function for the corresponding operation in the forward path. A gradient function may be registered by any operation. This function takes as input not only the partial gradients computed already along the backward path, but also, optionally, the inputs and outputs of the forward operation. Figure 5 shows gradients for a cost computed from the example of Figure 2. Grey arrows show potential inputs to gradient functions that are not used for the particular operations shown. The addition needed to Figure 1 to compute these gradients is:

[db, dW, dx] = tf.gradients(C, [b, W, x])

Figure 5: Gradients computed for graph in Figure 2

Automatic gradient computation complicates optimization, particularly of memory usage. When executing forward computation subgraphs, i.e., those that are explicitly constructed by the user, a sensible heuristic breaks ties when deciding which node to execute next by observing the order in which the graph was constructed. This generally means that temporary outputs are consumed soon after being constructed, so their memory can be reused quickly. When the heuristic is ineffective, the user can change the order of graph construction, or add control dependencies as described in Section 5. When gradient nodes are automatically added to the graph, the user has less control, and the heuristics may break down. In particular, because gradients reverse the forward computation order, tensors that are used early in a graph's execution are frequently needed again near the end of a gradient computation. Such tensors can hold on to a lot of scarce GPU memory and unnecessarily limit the size of computations. We are actively working on improvements to memory management to deal better with such cases. Options include using more sophisticated heuristics to determine the order of graph execution, recomputing tensors instead of retaining them in memory, and swapping out long-lived tensors from GPU memory to more plentiful host CPU memory.
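As a concrete illustration of the backward composition performed by the added dReLU, dAdd, and dMatMul nodes in Figure 5, here is the same chain-rule computation written out in NumPy (a sketch of ours, with shapes chosen to be consistent, for a cost C = sum(ReLU(Wx + b))):

import numpy as np

# Forward path (compare Figures 1 and 2): MatMul -> Add -> ReLU -> cost C
W = np.random.uniform(-1.0, 1.0, size=(100, 784))
x = np.random.uniform(size=784)
b = np.zeros(100)
z = W.dot(x) + b
r = np.maximum(z, 0.0)
C = r.sum()

# Backward path: compose partial gradients using the chain rule
dC_dr = np.ones_like(r)        # gradient of the sum
dC_dz = dC_dr * (z > 0)        # dReLU also needs the forward value z
dC_db = dC_dz                  # dAdd passes the gradient through
dC_dW = np.outer(dC_dz, x)     # dMatMul also needs the forward input x
dC_dx = W.T.dot(dC_dz)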
4.2 Partial Execution

Figure 6: Before and after graph transformation for partial execution

The graph is transformed based on the values of inputs and outputs. Each node:port specified in inputs is replaced with a feed node, which will pick up the provided input tensor from specially-initialized entries in a Rendezvous object used for the Run call. Similarly, each output name with a port is connected to a special fetch node that arranges to save the output tensor and return it to the client when the Run call is complete. Finally, once the graph has been rewritten with the insertion of these special feed and fetch nodes, the set of nodes to execute can be determined by starting at each of the nodes named by any output and working backwards using the graph dependencies to find the full set of nodes that must run in the rewritten graph to compute the outputs.
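For example (a small sketch of ours; Session.run allows most tensors in the graph to be fed, not just placeholders), a client can fetch one intermediate tensor while feeding another, so that only the required subgraph is executed:

import tensorflow as tf

a = tf.constant(2.0)
b = tf.constant(3.0)
c = a * b
d = c + 1.0
e = d * d

sess = tf.Session()
full = sess.run(e)                          # executes a, b, c, d, and e
partial = sess.run(e, feed_dict={d: 10.0})  # feeds d; a, b, and c are not run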
4.3 Device Constraints

4.4 Control Flow

TensorFlow uses a distributed coordination mechanism to execute graphs with control flow. In general, a loop can contain nodes that are assigned to many different devices. Therefore, managing the state of a loop becomes a problem of distributed termination detection. TensorFlow's solution is based on graph rewriting. During the graph partitioning, we automatically add control nodes to each partition. These nodes implement a small state machine that orchestrates the start and termination of each iteration, and decides the termination of the loop. For each iteration, the device that owns the loop termination predicate sends a tiny control message to every participating device.

As explained above, we often train machine learning models by gradient descent, and represent gradient computations as part of dataflow graphs. When a model includes control-flow operations, we must account for them in the corresponding gradient computation. For example, the gradient computation for a model with an if-conditional will need to know which branch of the conditional was taken, then apply the gradient logic to this branch. Similarly, the gradient computation for a model with a while-loop will need to know how many iterations were taken, and will also rely on the intermediate values computed during those iterations. The basic technique is to rewrite the graph so as to memorize the values needed for the gradient computation. We omit the somewhat intricate details of this encoding.

4.5 Input Operations

4.6 Queues

Queues are a useful feature that we have added to TensorFlow. They allow different portions of the graph to execute asynchronously, possibly at different cadences, and to hand off data through Enqueue and Dequeue operations. Enqueue operations can block until space becomes available in the queue, and Dequeue operations can block until a desired minimum number of elements are available in the queue. One use of queues is to allow input data to be prefetched from disk files while a previous batch of data is still being processed by the computational portion of a machine learning model. They can also be used for other kinds of grouping, including accumulating many gradients in order to compute some more complex combination of gradients over a larger batch, or to group different input sentences for recurrent language models into bins of sentences that are approximately the same length, which can then be processed more efficiently.

In addition to normal FIFO queues, we have also implemented a shuffling queue, which randomly shuffles its elements within a large in-memory buffer. This shuffling functionality is useful for machine learning algorithms that want to randomize the order in which they process examples.
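A sketch of the prefetching pattern using the queue operations described above (ours; the capacities, shapes, and the surrounding threading code are illustrative and omitted):

import tensorflow as tf

queue = tf.FIFOQueue(capacity=100, dtypes=[tf.float32], shapes=[[784]])

example = tf.placeholder(tf.float32, shape=[784])
enqueue_op = queue.enqueue([example])   # blocks when the queue is full
batch = queue.dequeue_many(32)          # blocks until 32 elements are present

# One thread repeatedly runs enqueue_op with freshly read examples, while the
# training loop runs the nodes that consume `batch`.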
4.7 Containers

A container is the mechanism within TensorFlow for managing longer-lived mutable state; for example, the backing store for a Variable lives in a container. A container can be reset by clearing it of its contents entirely. Using containers, it is possible to share state even across completely disjoint computation graphs associated with different Sessions.
5 Optimizations

5.1 Common Subexpression Elimination

5.2 Controlling Data Communication and Memory Usage

While there are many opportunities for scheduling optimizations, here we focus on one that we found particularly necessary and effective. It concerns the scheduling of Receive nodes for reading remote values. If no precautions are taken, these nodes may start much earlier than necessary, possibly all at once when execution starts. By performing an as-soon-as-possible/as-late-as-possible (ASAP/ALAP) calculation, of the kind common in operations research, we analyze the critical paths of graphs, in order to estimate when to start the Receive nodes. We then insert control edges with the aim of delaying the start of these nodes until just before their results are needed.

5.3 Asynchronous Kernels

5.4 Optimized Libraries for Kernel Implementations

We often make use of pre-existing highly-optimized numerical libraries to implement kernels for some operations. For example, there are a number of optimized libraries for performing matrix multiplies on different devices, including BLAS [15] and cuBLAS [39], or GPU libraries for convolutional kernels for deep neural nets such as cuda-convnet [28] and cuDNN [9]. Many of our kernel implementations are relatively thin wrappers around such optimized libraries.

We make fairly extensive use of the open-source Eigen linear algebra library [25] for many of the kernel implementations in the system. As one part of the development of TensorFlow, our team (primarily Benoit Steiner) has extended the open source Eigen library with support for arbitrary dimensionality tensor operations.

5.5 Lossy Compression

Some machine learning algorithms, including those typically used for training neural networks, are tolerant of noise and reduced precision arithmetic. In a manner similar to the DistBelief system [14], we often use lossy compression of higher precision internal representations when sending data between devices (sometimes within the same machine but especially across machine boundaries). For example, we often insert special conversion nodes that convert 32-bit floating point representations into a 16-bit floating point representation (not the proposed IEEE 16-bit floating point standard, but rather just a 32-bit IEEE 754 float format with 16 bits less precision in the mantissa), and then convert back to a 32-bit representation on the other side of the communication channel (by just filling in zeroes for the lost portion of the mantissa, since that is less computationally expensive than doing the mathematically correct probabilistic rounding for this 32 → 16 → 32-bit conversion).
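The conversion amounts to keeping only the upper half of each 32-bit IEEE 754 bit pattern. A NumPy sketch of ours of the same truncate-and-zero-fill idea (not the actual conversion kernels):

import numpy as np

def compress_f32_to_16bit(x):
    # Keep the sign, exponent, and top mantissa bits of each 32-bit float.
    bits = x.astype(np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)

def decompress_16bit_to_f32(h):
    # Fill the lost lower 16 mantissa bits with zeroes.
    return (h.astype(np.uint32) << 16).view(np.float32)

x = np.array([3.14159, -0.001, 1e6], dtype=np.float32)
approx = decompress_16bit_to_f32(compress_f32_to_16bit(x))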
6 Status and Experience
The TensorFlow interface and a reference implementation have been open sourced under an Apache 2.0
license, and the system is available for download at
www.tensorflow.org. The system includes detailed documentation, a number of tutorials, and a number of examples demonstrating how to use the system for a variety
of different machine learning tasks. The examples include models for classifying hand-written digits from the
MNIST dataset (the "hello world" of machine learning
algorithms) [32], classifying images from the CIFAR10 dataset [30], doing language modeling using a recurrent LSTM [22] network, training word embedding vectors [35] and more.
The system includes front-ends for specifying TensorFlow computations in Python and C++, and we expect
other front-ends to be added over time in response to
the desires of both internal Google users and the broader
open-source community.
We have quite a few machine learning models in our
previous DistBelief system [14] that we have migrated
over to TensorFlow. The rest of this section discusses
some lessons we have learned that are generalizable for
any such migration of machine learning models from one
system to another, and therefore may be valuable to others.
In particular, we focus on our lessons from porting a
state-of-the-art convolutional neural network for image
recognition termed Inception [23]. This image recognition system classifies 224 × 224 pixel images into one of 1000 labels (e.g., "cheetah", "garbage truck", etc.).
Such a model comprises 13.6 million learnable parameters and 36,000 operations when expressed as a TensorFlow graph. Running inference on a single image requires 2 billion multiply-add operations.
After building all necessary mathematical operations
in TensorFlow, assembling and debugging all 36,000 operations into the correct graph structure proved challenging. Validating correctness is a difficult enterprise because the system is inherently stochastic and only intended to behave in a certain way in expectation, potentially after hours of computation. Given these circumstances, we found the following strategies critical for
porting the Inception model to TensorFlow:
1. Build tools to gain insight into the exact number of parameters in a given model. Such tools demonstrated subtle flaws in a complex network architecture specification. In particular we were able to identify operations and variables instantiated incorrectly due to automatic broadcasting in a mathematical operation across a dimension.
Figure 7: Synchronous and asynchronous data parallel training

7 Common Programming Idioms

Data Parallel Training
One simple technique for speeding up SGD is to parallelize the computation of the gradient for a mini-batch
across mini-batch elements. For example, if we are using a mini-batch size of 1000 elements, we can use 10
replicas of the model to each compute the gradient for
100 elements, and then combine the gradients and apply
updates to the parameters synchronously, in order to behave exactly as if we were running the sequential SGD
algorithm with a batch size of 1000 elements. In this
case, the TensorFlow graph simply has many replicas of
the portion of the graph that does the bulk of the model
computation, and a single client thread drives the entire
training loop for this large graph. This is illustrated in
the top portion of Figure 7.
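A sketch of the synchronous variant from the Python front end (ours; build_model is a hypothetical helper that builds one replica of the model over a shard of the mini-batch, reusing a single shared set of parameter Variables, and returns the replica's loss along with those parameters):

import tensorflow as tf

def synchronous_training_op(batch_xs, build_model, num_replicas=10):
    shards = tf.split(0, num_replicas, batch_xs)   # shard the mini-batch
    grads, params = [], None
    for i in range(num_replicas):
        with tf.device("/gpu:%d" % i):
            loss, params = build_model(shards[i])  # one replica per device
            grads.append(tf.gradients(loss, params))
    # Average per-replica gradients and apply a single update, so the result
    # matches sequential SGD with the full mini-batch size.
    avg = [tf.add_n(list(g)) / float(num_replicas) for g in zip(*grads)]
    updates = [p.assign_sub(0.01 * g) for p, g in zip(params, avg)]
    return tf.group(*updates)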
Figure 8: Model parallel training

Figure 9: Concurrent steps

8 Performance

A future version of this white paper will have a comprehensive performance evaluation section covering both the single machine and distributed implementations.

9 Tools

9.1 TensorBoard: Visualization of graph structures and summary statistics
Many of the computation graphs for deep neural networks can be quite complex. For example, the computation graph for training a model similar to Google's Inception model [48], a deep convolutional neural net that had
the best classification performance in the ImageNet 2014
contest, has over 36,000 nodes in its TensorFlow computation graph, and some deep recurrent LSTM models for
language modeling have more than 15,000 nodes.
Due to the size and topology of these graphs, naive visualization techniques often produce cluttered and overwhelming diagrams. To help users see the underlying
organization of the graphs, the algorithms in TensorBoard collapse nodes into high-level blocks, highlighting
groups with identical structures. The system also separates out high-degree nodes, which often serve bookkeeping functions, into a separate area of the screen. Doing so reduces visual clutter and focuses attention on the
core sections of the computation graph.
The entire visualization is interactive: users can pan,
zoom, and expand grouped nodes to drill down for details. An example of the visualization for the graph of a
deep convolutional image model is shown in Figure 10.
Figure 11: TensorBoard graphical display of model summary statistics time series data
In addition to computation graph visualization, TensorBoard can display summary data collected during training. TensorFlow supports a collection of Summary operations that can be inserted into the computation graph, including scalar summaries (e.g., for examining overall
properties of the model, such as the value of the loss
function averaged across a collection of examples, or the
time taken to execute the computation graph), histogrambased summaries (e.g., the distribution of weight values
in a neural network layer), or image-based summaries
(e.g., a visualization of the filter weights learned in a
convolutional neural network). Typically computation
graphs are set up so that Summary nodes are included
to monitor various interesting values, and every so often
during execution of the training graph, the set of summary nodes are also executed, in addition to the normal
set of nodes that are executed, and the client driver program writes the summary data to a log file associated
with the model training. The TensorBoard program is
then configured to watch this log file for new summary records, and can display this summary information and how it changes over time.
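A sketch of this pattern using the summary API names from the initial open-source release (e.g. tf.scalar_summary and tf.train.SummaryWriter; the model itself is a trivial placeholder chosen just so the example is self-contained):

import numpy as np
import tensorflow as tf

# A trivial model, just to have something to summarize.
x = tf.placeholder(tf.float32, shape=[None, 784])
w = tf.Variable(tf.zeros([784, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

tf.scalar_summary("loss", loss)                 # Summary node monitoring the loss
merged = tf.merge_all_summaries()
writer = tf.train.SummaryWriter("/tmp/train_logs")

sess = tf.Session()
sess.run(tf.initialize_all_variables())
for step in xrange(100):
    batch = np.random.rand(32, 784).astype("float32")
    _, summary_str = sess.run([train_op, merged], feed_dict={x: batch})
    if step % 10 == 0:
        writer.add_summary(summary_str, step)   # log file watched by TensorBoard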
9.2 Performance Tracing
10 Future Work

We will continue to use TensorFlow to develop new and interesting machine learning models for artificial intelligence, and in the course of doing this, we may discover
ways in which we will need to extend the basic TensorFlow system. The open source community may also
come up with new and interesting directions for the TensorFlow implementation.
One extension to the basic programming model that
we are considering is a function mechanism, whereby
a user can specify an entire subgraph of a TensorFlow
computation to be a reusable component. In the implementation we have designed, these functions can become
reusable components even across different front-end languages for TensorFlow, so that a user could define a function using the Python front end, but then use that function as a basic building block from within the C++ frontend. We are hopeful that this cross-language reusability
will bootstrap a vibrant community of machine learning
researchers publishing not just whole examples of their
research, but also small reusable components from their
work that can be reused in other contexts.
We also have a number of concrete directions to improve the performance of TensorFlow. One such direction is our initial work on a just-in-time compiler that
can take a subgraph of a TensorFlow execution, perhaps
with some runtime profiling information about the typical sizes and shapes of tensors, and can generate an optimized routine for this subgraph. This compiler will understand the semantics of the graph and will perform a number of optimizations such as loop fusion, blocking and tiling for locality,
specialization for particular shapes and sizes, etc.
We also imagine that a significant area for future work
will be in improving the placement and node scheduling
algorithms used to decide where different nodes will execute, and when they should start executing. We have currently implemented a number of heuristics in these subsystems, and we'd like to have the system instead learn
to make good placement decisions (perhaps using a deep
neural network, combined with a reinforcement learning
objective function).
Figure 12: EEG visualization of multi-threaded CPU operations (x-axis is time in µs).

Figure 13: EEG visualization of Inception training showing CPU and GPU activity.

11 Related Work
The TensorFlow system shares some design characteristics with its predecessor system, DistBelief [14], and with later systems with similar designs like Project Adam [10] and the Parameter Server project [33]. Like DistBelief and Project Adam, TensorFlow allows computations to be spread out across many computational devices across many machines, and allows users to specify machine learning models using relatively high-level descriptions. Unlike DistBelief and Project Adam, though, the general-purpose dataflow graph model in TensorFlow is more flexible and more amenable to expressing a wider variety of machine learning models and optimization algorithms. It also permits a significant simplification by allowing the expression of stateful parameter nodes as variables, and variable update operations that are just additional nodes in the graph; in contrast, DistBelief, Project Adam and the Parameter Server systems all have whole separate parameter server subsystems devoted to communicating and updating parameter values. TensorFlow also supports the use of trained models in a wide variety of production settings, including memory- and computation-constrained environments such as mobile devices.
12 Conclusions
We have described TensorFlow, a flexible dataflow-based programming model, as well as single machine
and distributed implementations of this programming
model. The system is borne from real-world experience
in conducting research and deploying more than one hundred machine learning projects throughout a wide range
of Google products and services. We have open sourced
a version of TensorFlow, and hope that a vibrant shared
community develops around the use of TensorFlow. We
are excited to see how others outside of Google make use
of TensorFlow in their own work.
Acknowledgements
The development of TensorFlow has benefitted enormously from the large and broad machine learning community at Google, and in particular from the suggestions
and contributions from the rest of the Google Brain team
and also from the hundreds of DistBelief and TensorFlow
users within Google. Without a doubt, the usability and
functionality of TensorFlow has been greatly expanded
by listening to their feedback.
Many individuals have contributed to TensorFlow
and to its open source release, including John Giannandrea (for creating a supportive research environment), Irina Kofman and Phing Turner (project management), Bill Gruber and David Westbrook (technical writing), Dave Andersen, Anelia Angelova, Yaroslav Bulatov, Jianmin Chen, Jerjou Cheng, George Dahl, Andrew Dai, Lucy Gao, mig Gerard, Stephan Gouws,
Naveen Kumar, Geoffrey Hinton, Mrinal Kalarishnan,
Anjuli Kannan, Yutaka Leon-Suematsu, Frank Li, Peter Liu, Xiaobing Lu, Nishant Patil, Pierre Sermanet,
Noam Shazeer, Jascha Sohl-dickstein, Philip Tucker,
Yonghui Wu, Ke Yang, and Cliff Young (general contributions), Doug Fritz, Patrick Hurst, Dilip Krishnan, Daniel Smilkov, James Wexler, Jimbo Wilson,
Kanit Ham Wongsuphasawat, Cassandra Xia, and the
Big Picture team (graph visualization), Chris Leary,
Robert Springer and the Stream Executor team,
Kayur Patel, Michael Piatek, and the coLab team, and
the many others who have contributed to the TensorFlow
design and code base.
References
[2] Anelia Angelova, Alex Krizhevsky, and Vincent Vanhoucke. Pedestrian detection with a large-field-of-view
deep network. In Robotics and Automation (ICRA), 2015
IEEE International Conference on, pages 704–711. IEEE,
2015. CalTech PDF.
[3] Arvind and David E. Culler. Dataflow architectures. Annual Review of Computer Science, vol. 1, pages 225–253, 1986. www.dtic.mil/cgi-bin/GetTRDoc?Location=U2&doc=GetTRDoc.pdf&AD=ADA166235.
[15] Jack J Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain S Duff. A set of level 3 basic linear algebra subprograms.
ACM Transactions on
Mathematical Software (TOMS), 16(1):1–17, 1990.
www.maths.manchester.ac.uk/sven/pubs/Level3BLAS1-TOMS16-90.pdf.
[17] Javier Gonzalez-Dominguez, Ignacio Lopez-Moreno, Pedro J Moreno, and Joaquin Gonzalez-Rodriguez. Frameby-frame language identification in short utterances using
deep neural networks. Neural Networks, 64:49–58, 2015.
[28] Alex Krizhevsky. Cuda-convnet, 2014. code.google.com/p/cuda-convnet/.
[18] Otavio Good. How Google Translate squeezes deep learning onto a phone, 2015. googleresearch.blogspot.com/2015/07/how-googletranslate-squeezes-deep.html.
[22] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. ftp.idsia.ch/pub/juergen/lstm.pdf.
[23] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing
internal covariate shift. CoRR, abs/1502.03167, 2015.
arxiv.org/abs/1502.03167.
[50] Vincent Vanhoucke. Speech recognition and deep learning, 2015. googleresearch.blogspot.com/2012/08/speechrecognition-and-deep-learning.html.
[51] Abhishek Verma, Luis Pedrosa, Madhukar Korupolu,
David Oppenheimer, Eric Tune, and John Wilkes.
Large-scale cluster management at Google with Borg.
In Proceedings of the Tenth European Conference
on Computer Systems, page 18. ACM, 2015.
research.google.com/pubs/archive/43438.pdf.