Theano: A CPU and GPU Math Compiler in Python
Abstract— Theano is a compiler for mathematical expressions in Python that combines the convenience of NumPy's syntax with the speed of optimized native machine language. The user composes mathematical expressions in a high-level description that mimics NumPy's syntax and semantics, while being statically typed and functional (as opposed to imperative). These expressions allow Theano to provide symbolic differentiation. Before performing a computation, Theano optimizes the choice of expressions, translates them into C++ (or CUDA for GPU), compiles them into dynamically loaded Python modules, all automatically. Common machine learning algorithms implemented with Theano are from 1.6× to 7.5× faster than competitive alternatives (including those implemented with C/C++, NumPy/SciPy and MATLAB) when compiled for the CPU and between 6.5× and 44× faster when compiled for the GPU. This paper illustrates how to use Theano, outlines the scope of the compiler, provides benchmarks on both CPU and GPU processors, and explains its overall design.

The corresponding author is with Université de Montréal, e-mail: [email protected].

Introduction

Python is a powerful and flexible language for describing large-scale mathematical calculations, but the Python interpreter is in many cases a poor engine for executing them. One reason is that Python uses full-fledged Python objects on the heap to represent simple numeric scalars. To reduce the overhead in numeric calculations, it is important to use array types such as NumPy's ndarray so that single Python objects on the heap can stand for multidimensional arrays of numeric scalars, each stored efficiently in the host processor's native format.

[NumPy] provides an N-dimensional array data type, and many functions for indexing, reshaping, and performing elementary computations (exp, log, sin, etc.) on entire arrays at once. These functions are implemented in C for use within Python programs. However, the composition of many such NumPy functions can be unnecessarily slow when each call is dominated by the cost of transferring memory rather than the cost of performing calculations [Alted]. [numexpr] goes one step further by providing a loop fusion optimization that can glue several element-wise computations together. Unfortunately, numexpr requires an unusual syntax (the expression must be encoded as a string within the code), and at the time of this writing, numexpr is limited to optimizing element-wise computations. [Cython] and [scipy.weave] address Python's performance issue by offering a simple way to hand-write crucial segments of code in C (or a dialect of Python which can be easily compiled to C, in Cython's case). While this approach can yield significant speed gains, it is labor-intensive: if the bottleneck of a program is a large mathematical expression comprising hundreds of elementary operations, manual program optimization can be time-consuming and error-prone, making an automated approach to performance optimization highly desirable.
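To make numexpr's string-based interface concrete, here is a minimal sketch (ours, with arbitrary array sizes) of the same element-wise formula written for NumPy and for numexpr, whose evaluate() takes the expression as a string:

    import numpy
    import numexpr

    a = numpy.random.rand(1000000)
    b = numpy.random.rand(1000000)

    r1 = 2*a + b**10                       # NumPy: each operation allocates a temporary array
    r2 = numexpr.evaluate("2*a + b**10")   # numexpr: the formula is passed as a string and
                                           # evaluated element-wise in a single fused loop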
Theano, on the other hand, works on a symbolic representation of mathematical expressions, provided by the user in a NumPy-like syntax. Access to the full computational graph of an expression opens the door to advanced features such as symbolic differentiation of complex expressions, but more importantly allows Theano to perform local graph transformations that can correct many unnecessary, slow or numerically unstable expression patterns. Once optimized, the same graph can be used to generate CPU as well as GPU implementations (the latter using CUDA) without requiring changes to user code.

Theano is similar to [SymPy], in that both libraries manipulate symbolic mathematical graphs, but the two projects have a distinctly different focus. While SymPy implements a richer set of mathematical operations of the kind expected in a modern computer algebra system, Theano focuses on fast, efficient evaluation of primarily array-valued expressions.

Theano is free open source software, licensed under the New (3-clause) BSD license. It depends upon NumPy, and can optionally use SciPy. Theano includes many custom C and CUDA code generators which are able to specialize for particular types, sizes, and shapes of inputs; leveraging these code generators requires gcc (CPU) and nvcc (GPU) compilers, respectively. Theano can be extended with custom graph expressions, which can leverage scipy.weave, PyCUDA, Cython, and other numerical libraries and compilation technologies at the user's discretion. Theano has been actively and continuously developed and used since January 2008. It has been used in the preparation of numerous scientific papers and as a teaching platform for machine learning in graduate courses at l'Université de Montréal. Documentation and installation instructions can be found on Theano's website [theano]. All Theano users should subscribe to the announce mailing list (low traffic, http://groups.google.com/group/theano-announce). There are medium traffic mailing lists for developer discussion (http://groups.google.com/group/theano-dev) and user support (http://groups.google.com/group/theano-users).

This paper is divided as follows: Case Study: Logistic Regression shows how Theano can be used to solve a simple problem in statistical prediction. Benchmarking Results presents some results of performance benchmarking on problems related to machine learning and expression evaluation. What's in Theano gives an overview of the design of Theano. Limitations and Future Work outlines current limitations of our implementation and currently planned additions to Theano.
Case Study: Logistic Regression

To get a sense of how Theano feels from a user's perspective, we will look at how to solve a binary logistic regression problem. Binary logistic regression is a classification model parameterized by a weight matrix W and bias vector b. The model estimates the probability P(Y = 1|x) (which we will denote with shorthand p) that the input x belongs to class y = 1 as:

    P(Y = 1 | x^{(i)}) = p^{(i)} = \frac{e^{W x^{(i)} + b}}{1 + e^{W x^{(i)} + b}}    (1)

The goal is to optimize the log probability of N training examples, D = {(x^{(i)}, y^{(i)}), 0 < i ≤ N}, with respect to W and b. To maximize the log likelihood we will instead minimize the (average) negative log likelihood (taking the mean in this fashion decouples the choice of the regularization coefficient and the stochastic gradient step size from the number of training examples):

    \ell(W, b) = -\frac{1}{N} \sum_i \left[ y^{(i)} \log p^{(i)} + (1 - y^{(i)}) \log(1 - p^{(i)}) \right]    (2)

To make it a bit more interesting, we can also include an \ell_2 penalty on W, giving a cost function E(W, b) defined as:

    E(W, b) = \ell(W, b) + 0.01 \sum_i \sum_j w_{ij}^2    (3)

In this example, tuning parameters W and b will be done through stochastic gradient descent (SGD) on E(W, b). Stochastic gradient descent is a method for minimizing a differentiable loss function which is the expectation of some per-example loss over a set of training examples. SGD estimates this expectation with an average over one or several examples and performs a step in the approximate direction of steepest descent. Though more sophisticated algorithms for numerical optimization exist, in particular for smooth convex functions such as E(W, b), stochastic gradient descent remains the method of choice when the number of training examples is too large to fit in memory, or in the setting where training examples arrive in a continuous stream. Even with relatively manageable dataset sizes, SGD can be particularly advantageous for non-convex loss functions (such as those explored in Benchmarking Results), where the stochasticity can allow the optimizer to escape shallow local minima [Bottou].

According to the SGD algorithm, the update on W is

    W \leftarrow W - \mu \frac{1}{N'} \sum_i \left. \frac{\partial E(W, b, x, y)}{\partial W} \right|_{x = x^{(i)},\, y = y^{(i)}}    (4)

where \mu = 0.1 is the step size and N' is the number of examples with which we will approximate the gradient (i.e. the number of rows of x). The update on b is likewise

    b \leftarrow b - \mu \frac{1}{N'} \sum_i \left. \frac{\partial E(W, b, x, y)}{\partial b} \right|_{x = x^{(i)},\, y = y^{(i)}}    (5)

Implementing this minimization procedure in Theano involves the following four conceptual steps: (1) declaring symbolic variables, (2) using these variables to build a symbolic expression graph, (3) compiling Theano functions, and (4) calling said functions to perform numerical computations. The code listings in Figures 1 - 4 illustrate these steps with a working program that fits a logistic regression model to random data.

    1: import numpy
    2: import theano.tensor as T
    3: from theano import shared, function
    4:
    5: x = T.matrix()
    6: y = T.lvector()
    7: w = shared(numpy.random.randn(100))
    8: b = shared(numpy.zeros(()))
    9: print "Initial model:"
    10: print w.get_value(), b.get_value()

Figure 1: Logistic regression, part 1: declaring variables.

The code in Figure 1 declares four symbolic variables x, y, w, and b to represent the data and parameters of the model. Each tensor variable is strictly typed to include its data type, its number of dimensions, and the dimensions along which it may broadcast (like NumPy's broadcasting) in element-wise expressions. The variable x is a matrix of the default data type (float64), and y is a vector of type long (or int64). Each row of x will store an example x^{(i)}, and each element of y will store the corresponding label y^{(i)}. The number of examples to use at once represents a tradeoff between computational and statistical efficiency.

The shared() function creates shared variables for W and b and assigns them initial values. Shared variables behave much like other Theano variables, with the exception that they also have a persistent value. A shared variable's value is maintained throughout the execution of the program and can be accessed with .get_value() and .set_value(), as shown in line 10.
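As a small usage illustration (ours, not one of the paper's listings), a shared variable's persistent value can be read and replaced from ordinary Python code:

    # Illustration only: w is the 100-element shared vector created in Figure 1.
    current_w = w.get_value()        # returns a copy of the stored NumPy array
    w.set_value(numpy.zeros(100))    # overwrites the persistent value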
    11: p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b))
    12: xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1)
    13: cost = xent.mean() + 0.01*(w**2).sum()
    14: gw,gb = T.grad(cost, [w,b])
    15: prediction = p_1 > 0.5

Figure 2: Logistic regression, part 2: the computation graph.

The code in Figure 2 specifies the computational graph required to perform stochastic gradient descent on the parameters of our cost function. Since Theano's interface shares much in common with that of NumPy, lines 11-15 should be self-explanatory for anyone familiar with that module. On line 11, we start by defining P(Y = 1|x^{(i)}) as the symbolic variable p_1. Notice that the matrix multiplication and element-wise exponential functions are simply called via the T.dot and T.exp functions, analogous to numpy.dot and numpy.exp. xent defines the cross-entropy loss function, which is then combined with the \ell_2 penalty on line 13 to form the cost function of Eq (3), denoted by cost.

Line 14 is crucial to our implementation of SGD, as it performs symbolic differentiation of the scalar-valued cost variable with respect to variables w and b. T.grad operates by iterating backwards over the expression graph, applying the chain rule of differentiation and building symbolic expressions for the gradients on w and b. As such, gw and gb are also symbolic Theano variables.
Figure 6: Fitting a convolutional network using different software. The benchmark stresses convolutions of medium-sized (256 by 256) images with small (7 by 7) filters.

Figure 7: Speed comparison between NumPy, numexpr, and Theano for different sizes of input on four element-wise formulae (panel formulae include a+1 and 2*a + b**10; each panel plots speed up vs NumPy against the dimension of vectors a and b). In each subplot, the solid blue line represents Theano, the dashed red line represents numexpr, and performance is plotted with respect to NumPy.
normal form so that it is easier to recognize expression patterns in subsequent compilation stages.

Stabilization: The stabilization transformation improves the numerical stability of the computations implied by the expression graph. For instance, consider the function log(1 + exp(x)), which tends toward zero as x → −∞, and toward x as x → +∞. Due to limitations in the representation of double precision numbers, the computation as written yields infinity for x > 709. The stabilization phase replaces patterns like this one with an expression that simply returns x when x is sufficiently large (using doubles, this is accurate beyond the least significant digit). It should be noted that this phase cannot guarantee the stability of computations. It helps in some cases, but the user is still advised to be wary of numerically problematic computations.
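A minimal numerical illustration (ours, in plain NumPy rather than Theano) of the problem and of the rewrite just described:

    import numpy

    x = 800.0
    naive = numpy.log(1.0 + numpy.exp(x))                   # exp(800) overflows to inf, so the result is inf
    stable = x if x > 30.0 else numpy.log1p(numpy.exp(x))   # log(1 + e^x) is essentially x for large x
                                                            # (the 30.0 cutoff is an arbitrary illustrative choice)
    # naive == inf, while stable == 800.0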
Specialization: The specialization transformation replaces expressions with faster ones. Expressions like pow(x,2) become sqr(x). Theano also performs more elaborate specializations: for example, expressions involving scalar-multiplied matrix additions and multiplications may become BLAS General matrix multiply (GEMM) nodes, and reshape, transpose, and subtensor expressions (which create copies by default) are replaced by constant-time versions that work by aliasing memory. Expression subgraphs involving element-wise operations are fused together (as in numexpr) in order to avoid the creation and use of unnecessary temporary variables. For instance, if we denote the a + b operation on tensors as map(+, a, b), then an expression such as map(+, map(*, a, b), c) would become map(lambda ai,bi,ci: ai*bi+ci, a, b, c). If the user desires to use the GPU, expressions with corresponding GPU implementations are substituted in, and transfer expressions are introduced where needed. Specialization also introduces expressions that treat inputs as workspace buffers. Such expressions use less memory and make better use of hierarchical memory, but they must be used with care because they effectively destroy intermediate results. Many expressions (e.g. GEMM and all element-wise ones) have such equivalents. Reusing memory this way allows more computation to take place on GPUs, where memory is at a premium.
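One way to observe the pow-to-sqr specialization (a sketch on our part; the exact printed output depends on the Theano version) is to compile a small expression and print its optimized graph, which should contain a sqr element-wise node rather than a generic pow:

    import theano
    import theano.tensor as T

    x = T.vector()
    f = theano.function([x], x ** 2)   # graph optimizations run during compilation
    theano.printing.debugprint(f)      # expect a sqr node in place of pow(x, 2)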
Moving Computation to the GPU: Each expression in Theano is associated with an implementation that runs on either the host (a host expression) or a GPU device (a GPU expression). The GPU-transfer transformation replaces host expressions with GPU expressions. The majority of host expression types have GPU equivalents and the proportion is always growing.

The heuristic that guides GPU allocation is simple: if any input or output of an expression resides on the GPU and the expression has a GPU equivalent, then the GPU equivalent is substituted in. Shared variables storing float32 tensors default to GPU storage, and the expressions derived from them consequently default to using GPU implementations. It is possible to explicitly force any float32 variable to reside on the GPU, so one can start the chain reaction of optimizations and use the GPU even in graphs with no shared variables. It is possible (though awkward, and discouraged) to specify exactly which computations to perform on the GPU by disabling the default GPU optimizations.
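As an illustrative sketch (ours; it assumes GPU use has been enabled through Theano's configuration, e.g. the THEANO_FLAGS environment variable), a float32 shared variable is enough to trigger the transfer heuristic for expressions built from it:

    # Sketch: run with something like THEANO_FLAGS=device=gpu,floatX=float32
    import numpy
    import theano
    import theano.tensor as T

    W = theano.shared(numpy.random.randn(784, 500).astype('float32'))  # float32 -> defaults to GPU storage
    x = T.fmatrix()                      # float32 symbolic input
    y = T.tanh(T.dot(x, W))              # derived expressions pick up GPU implementations
    f = theano.function([x], y)          # GPU equivalents substituted during compilation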
Tensors stored on the GPU use a special internal data type with an interface similar to the ndarray. This datatype fully supports strided tensors, and arbitrary numbers of dimensions. The support for strides means that several operations such as the transpose and simple slice indexing can be performed in constant time.

Code Generation: The code generation phase of the compilation process produces and loads dynamically-compiled Python modules with specialized implementations for the expressions in the computation graph. Not all expressions have C (technically C++) implementations, but many (roughly 80%) of Theano's expressions generate and compile C or CUDA code during theano.function. The majority of expressions that generate C code specialize the code based on the dtype, broadcasting pattern, and number of dimensions of their arguments. A few expressions, such as the small-filter convolution (conv2d), further specialize code based on the size the arguments will have.

Why is it so important to specialize C code in this way? Modern x86 architectures are relatively forgiving of code that does not make good use of techniques such as loop unrolling and prefetching contiguous blocks of memory, and only the conv2d expression goes to any great length to generate many special-case implementations for the CPU. By comparison, GPU architectures are much less forgiving of code that is not carefully specialized for the size and physical layout of function arguments. Consequently, the code generators for GPU expressions like GpuSum, GpuElementwise, and GpuConv2d generate a wider variety of implementations than their respective host expressions. With the current generation of graphics cards, the difference in speed between a naïve implementation and an optimal implementation of an expression as simple as matrix row summation can be an order of magnitude or more. The fact that Theano's GPU ndarray-like type supports strided tensors makes it even more important for the GPU code generators to support a variety of memory layouts. These compile-time specialized CUDA kernels are integral to Theano's GPU performance.

Limitations and Future Work

While most of the development effort has been directed at making Theano produce fast code, not as much attention has been paid to the optimization of the compilation process itself. At present, the compilation time tends to grow super-linearly with the size of the expression graph. Theano can deal with graphs up to a few thousand nodes, with compilation times typically on the order of seconds. Beyond that, it can be impractically slow, unless some of the more expensive optimizations are disabled, or pieces of the graph are compiled separately.

A Theano function call also requires more overhead (on the order of microseconds) than a native Python function call. For this reason, Theano is suited to applications where functions correspond to expressions that are not too small (see Figure 5).

The set of types and operations that Theano provides continues to grow, but it does not cover all the function-