Matrix Template Library for HPC
APPARENT TO US THAT GENERIC PROGRAMMING, THE FUNDAMENTAL paradigm underlying the STL, was an important and powerful new software development methodology—and this has been borne out by the tremendous success of the STL for general-purpose programming. Not so obvious then, however, was how (or even if) generic programming could apply to other problem domains. To investigate the merit of this approach for scientific computing, we embarked on a research project to apply generic programming to high-performance numerical linear algebra.

That effort brought forth the Matrix Template Library, a generic component library for scientific computing.2 Although MTL consists of a relatively small number of software components, its power and utility arise from the composability of the components and the generic nature of the algorithms. That is, the components can be composed arbitrarily to produce an extremely wide variety of matrix formats. Similarly, each algorithm can operate on any matrix type defined in this fashion. The resulting linear algebra functionality far exceeds that of libraries that are not based on component technology, while requiring orders of magnitude fewer lines of code.

Composability and genericity are all well and good, but to many scientific computing users the advantages of an elegant programming interface are secondary to issues of performance. Given MTL's heavy use (and many layers) of abstraction, we might naturally assume that there is a corresponding performance penalty. It turns out that generic programming is a powerful tool with respect to performance as well—in two regards:

• Properly designed abstractions (in conjunction with modern compilers) incur no performance penalty per se. That is, generic components are as efficient as their handwritten counterparts.
• High-level performance-tuning mechanisms (such as cache- or register-blocking schemes) can be described in a generic fashion, giving users vendor-optimized performance levels in a portable and easy-to-tune fashion. This has been borne out in our own experiments—MTL can match the performance of vendor-tuned libraries on a number of platforms.

The generic-programming process consists of four steps:

1. Identify useful and efficient algorithms.
2. Find their generic representation (parameterize each algorithm such that it makes the fewest possible requirements of the data on which it operates).
3. Derive a set of (minimal) requirements that allow these algorithms to run and to run efficiently.
4. Construct a framework based on classifications of requirements.

In applying this process to numerical linear algebra, the first step can be directly motivated by the mathematical definition of linear algebra. That is, we need algorithms that implement the basic axiomatic operations of linear algebras: multiplying a vector by a scalar, adding two vectors, or applying a linear transformation to a vector. In addition to these operations, we can include operations that give our linear space extra structure, such as inner products and norms, as well as algorithms for working with dual spaces (that is, a transpose operation). Thus, our set of useful and efficient algorithms consists of just these six (respectively performing the desired operations described above): scale(), sum(), mult(), dot(), norm(), and transpose().
NOVEMBER/DECEMBER 1999 71
SCIENTIFIC PROGRAMMING
As with everything these days, to find out more about MTL you can refer to the official MTL Web site: http://lsc.nd.edu/research/mtl. This site contains complete documentation of the MTL as well as the freely available source code. In addition, links to some of the other generic programming efforts mentioned in this article (such as the Standard Template Library, Iterative Template Library, and Generic Graph Component Library) are available there.

Let's look at the most interesting of these operations (mult()) and consider how we can make it generic. To be generic, we would like to realize the operation, say, z ← Ax for any concrete representation of vectors x and z as well as for any matrix (linear operator) A. We can implement type independence (at least syntactically) in C++ by using templates. Thus, we could prototype a generic mult() as follows:

template <class Matrix, class VecX, class VecZ>
void mult (const Matrix& A, const VecX& x, VecZ z);

The next step is to define the body of mult() such that it can work with arbitrary types (having a suitably defined interface). At first, this might seem impossible. After all, there are myriad matrix types: sparse, dense, rectangular, banded, column oriented, row oriented, and so on.

Interestingly, the mathematical description of a matrix-vector product points us in the right direction. We can write down what we mean by matrix-vector product in an N-dimensional space this way:

z_i = ∑_j a_ij x_j

One textual interpretation of this mathematical statement is the following. Let the matrix A be such that each element in A has a corresponding row and column index, i and j, respectively. For each element in A, sum the product of that element with the jth element of x into the ith element of z. This leads to the next step of our generic-programming process: deriving a set of minimal requirements for the algorithm.

In this case, our textual description of the algorithm provides those requirements. To realize a generic mult(), we must be able to

• visit each element of A (optionally skipping zeroes),
• access the value of (de-reference) each element of A,
• access the row and column indices of each element of A, and
• randomly access values (that is, given some integer k, access the kth value) of the vectors x and z.

Note what is not required. We need not visit the elements of A in any particular order. Nor must A be of any particular shape, nor must zero values be stored, nor must we visit every element of x and z (or visit them in any particular order).

We can implement the requirements that we do face using an interface similar to STL's. As with the STL, iterators form the principal interface to data types (using operator++() and operator*() for traversal and access). We extend the STL notion of iterator with respect to a matrix in two ways. First, MTL has a two-level hierarchy of iterators to traverse matrices (to reflect the objects' 2D nature). Second, iterators over matrix elements can provide row and column index information as well as be de-referenced for the matrix element value. The body of the mult() algorithm is then:

{
  typename Matrix::const_iterator i;
  typename Matrix::OneD::const_iterator j;

  for (i = A.begin(); i != A.end(); ++i)
    for (j = (*i).begin(); j != (*i).end(); ++j)
      z[row(j)] += *j * x[column(j)];
}

For those who might be unfamiliar with C++, or with the particulars of STL-style C++ generic programs, this algorithm simply embodies the textual description of the mult() algorithm. The two nested loops serve to iterate over all the elements of A in the order they are stored in memory—whether that be by row, column, or diagonal. Thus, for row-oriented matrices, the inner loop performs a dot-product operation on rows of A with vector x; for column-oriented matrices, the inner loop takes linear combinations of columns of A scaled by elements of x (in an axpy()-like manner).

With this basic set of requirements in place, we can move to the last step of the generic-programming process—framework construction.

Data types in MTL
Although our tour thus far through MTL's generic-programming process has described what the interfaces to matrices and vectors should look like, it leaves open the issue of the underlying implementation of those objects—allowing, in some sense, almost infinite flexibility, provided the interface conditions are satisfied. Within MTL, we sought to provide the largest possible variety of concrete matrix types (particularly those that are commonly used elsewhere) while also requiring only a small number of actual components. To accomplish these simultaneous goals, we

This specification is open-ended. A new component type can be used with any component group simply by meeting the (deliberately minimal) interface specification for that group. For example, we could use an interval class for EltType. A combinatorial number of matrices can be constructed from the basic components already defined in MTL. We can use the current collection of just 21 MTL matrix components with the standard numerical types, for instance, to construct literally thousands of different matrix types!

The following code examples show how to specify MTL matrices corresponding to some of the more commonly used matrix types.

// Create an empty 1000 x 1000 matrix
FortranMatrix A(1000, 1000);

// Create a matrix from a file
matrix_market_stream<double> mms (filename);
CompressedRow B(mms);

// Create a matrix from existing data
typedef matrix<double, rectangle<>,
    compressed<int,external>, column_major>::type ExtCompRow;
ExtCompRow C(m, n, nnz, values, indices, row_ptrs);
For the most part, MTL users do not need to access individual elements of a matrix once it is constructed because the MTL algorithms provide most operations. However, users may want to construct their own matrix algorithms. With this in mind, there are several ways to access the elements of an MTL matrix. The main access method is through iterators, as we showed in the previous matrix-vector multiply example. Users can also access an MTL matrix in more traditional ways. For example, a single matrix element can be accessed with the operator(i,j), 1D slices can be accessed with the operator[i], and a submatrix can be obtained with the sub_matrix(i,j,m,n) method. The type of a 1D slice depends on the matrix type (the slice might be a row, a column, or a diagonal). The rows() and columns() helper functions provide an interface for creating different views of the same matrix regardless of the underlying matrix storage layout.

// Get an element
FortranMatrix::value_type w = A(4,5);
// Get a slice, in this case a column
FortranMatrix::Column c = A[4];
// Get a row explicitly
FortranMatrix::Row r = rows(A)[3];

MTL matrices (and vectors) are reference counted, so the user need not be concerned with memory management.

MTL vectors. MTL provides both dense-vector and several sparse-vector types, implemented using standard STL components (the MTL layer adds reference counting and handle-based semantics). The MTL vectors export the same interface as STL containers, including the begin() and end() iterator accessors and the usual operator[i]. In addition, MTL vectors provide a convenient way to access subvector views.

// subrange vector s refers to elements [10, 30)
dense1D<double>::subrange_type s = x(10,30);

MTL algorithms and adaptors
The MTL provides basic (abstract) linear algebra functionality and also provides a number of utility functions. MTL thus provides functionality basically equivalent to that available with the BLAS Levels 1, 2, and 3.4–6 However, in contrast to the BLAS, MTL algorithms work with a larger number of matrix types (any matrix type that can be constructed within MTL), such as sparse matrices, and also with any element type, not just single, double, and complex.

Algorithms. Table 1 lists the principal algorithms included in MTL. In the table, alpha and s are scalars; x, y, and z are 1D containers; A, B, C, and E are matrices; and T is a triangular matrix. MTL does not define different operations for each permutation of transpose, scaling, and striding, as is typically necessary in traditional libraries. Instead, only one algorithm is provided, but it can be combined with the use of strided and scaled vector adaptors, or the trans() modifier, to create the permutations as described later.

Adaptors. One novel aspect of the MTL algorithm interface is the way we use adaptors to provide algorithm flexibility at a small constant implementation cost and with little or no extra runtime cost. Algorithm flexibility improves performance by allowing a single function to carry out entire families of operations. For example, you might wish to scale a vector while adding it to another (y ← α × x + y). This is the operation carried out by the daxpy() BLAS function. By using adaptors, the MTL add() operation can handle any combination of scaling or striding without loss of performance. The adaptor modifies the behavior of the vector inside of the algorithm. In the case of scaling, this causes the elements to be multiplied as they are accessed. The call to scaled() does not perform the multiplications before the call to add(), as this would hurt performance; instead, the multiplications happen during the add(). With a good optimizing C++ compiler, there is no extra overhead induced by the adaptors.

// y ← αx + y
add (scaled(x, alpha), y);

// equivalent operation using BLAS
daxpy (n, alpha, xptr, 1, yptr, 1);

The transpose of a matrix can be used in an algorithm with the trans() adaptor. This adaptor performs a type conversion (swapping matrix orientations) at compile time—there is zero runtime cost. The next example shows how the matrix-vector multiply algorithm (generically written to compute z ← A × x + y) can also compute y ← AT × (αx) + βy.

// y ← AT × (αx) + βy
mult (trans(A), scaled(x, alpha), scaled(y, beta), y);

// equivalent operation using BLAS
dgemv ('T', M, N, alpha, A_ptr, A_ld, x_ptr, 1, beta, y_ptr, 1);
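The mechanics behind scaled() can be sketched in a handful of lines. The following is a deliberately simplified stand-in of ours, not MTL's actual adaptor (which wraps iterators); it shows the essential point that the multiplication happens lazily, at the moment each element is read inside add():

```cpp
#include <cstddef>
#include <vector>

// Toy version of a scaling adaptor: no copy of the underlying vector
// is made, and alpha is applied only when an element is accessed.
template <class Vec>
struct scaled_view {
    const Vec& v;
    double alpha;
    std::size_t size() const { return v.size(); }
    double operator[](std::size_t i) const { return alpha * v[i]; }
};

template <class Vec>
scaled_view<Vec> scaled(const Vec& v, double alpha) {
    return scaled_view<Vec>{v, alpha};
}

// One generic add() then covers the whole daxpy() family: y <- x + y,
// where x may be a plain vector or any view of one.
template <class VecX, class VecY>
void add(const VecX& x, VecY& y) {
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] += x[i];
}
```

With these pieces, add(scaled(x, alpha), y) computes y ← αx + y in a single pass with no temporary vector; a good inliner reduces the adaptor to exactly the handwritten loop.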
Figure 1. The Iterative Template Library (ITL) implementation of the preconditioned GMRES(m) algorithm. This algorithm computes an approximate solution to Ax = b preconditioned with M. The restart value is specified by the parameter m.

template <class Matrix, class Vector, class VectorB,
          class Preconditioner, class Iteration>
int gmres (const Matrix &A, Vector &x, const VectorB &b,
           const Preconditioner &M, int m, Iteration& outer)
{
  using namespace mtl;
  typedef typename Matrix::value_type T;
  typedef typename matrix<T, rectangle<>, dense<>,
                          column_major>::type InternalMatrix;
  InternalMatrix H(m+1, m), V(x.size(), m+1);
  Vector s(m+1), w(x.size()), r(x.size()), u(x.size());
  std::vector< givens_rotation<T> > rotations(m+1);

  mult(A, scaled(x, -1.0), b, w);
  solve(M, w, r);                            // r0 = b - A x0
  typename Iteration::real beta = abs(two_norm(r));

  while (! outer.finished(beta)) {           // outer iteration
    copy(scaled(r, 1./beta), V[0]);          // v1 = r0 / ||r0||
    set(s, 0.0);
    s[0] = beta;
    int j = 0;
    Iteration inner (outer.normb(), m, outer.tol());

    do {                                     // inner iteration
      mult(A, V[j], u);
      solve(M, u, w);
      for (int i = 0; i <= j; i++) {
        H(i,j) = dot_conj(w, V[i]);          // hij = <A vj, vi>
        add(w, scaled(V[i], -H(i,j)), w);    // v^(j+1) = A vj - Σ hij vi
      }
      H(j+1, j) = two_norm(w);               // h(j+1,j) = ||v^(j+1)||
      copy(scaled(w, 1./H(j+1, j)), V[j+1]); // v(j+1) = v^(j+1) / h(j+1,j)

      // QR triangularization of H
      for (int i = 0; i < j; i++)
        rotations[i].apply(H(i,j), H(i+1,j));
      rotations[j] = givens_rotation<T>(H(j,j), H(j+1,j));
      rotations[j].apply(H(j,j), H(j+1,j));
      rotations[j].apply(s[j], s[j+1]);

      ++inner, ++outer, ++j;
    } while (! inner.finished(abs(s[j])));

    // Form the approximate solution
    tri_solve(tri_view<upper>()(H.sub_matrix(0, j, 0, j)), s);
    mult(V.sub_matrix(0, x.size(), 0, j), s, x, x);

    // Restart
    mult(A, scaled(x, -1.0), b, w);
    solve(M, w, r);
    beta = abs(two_norm(r));
  }

  return outer.error_code();
}

Figure 2a shows the performance of dense matrix-matrix product for MTL, Fortran BLAS, and the Sun Performance Library, all obtained on a Sun Ultra 30. The experiment shows that the MTL can compete with vendor-tuned libraries (on an algorithm that tends to get extra attention due to benchmarking). We compiled the MTL executables using Kuck and Associates C++, in conjunction with v.5.0 of the Solaris C compiler. We compiled the Fortran BLAS (obtained from Netlib) with v.5.0 of the Solaris Fortran 77 compiler. We used all possible compiler-optimization flags in all cases and cleared the cache between each trial. To demonstrate portability across different architectures and compilers, Figure 2b compares the performance of MTL with the Engineering and Scientific Subroutine Library (ESSL) on an IBM RS/6000 590. In this case, we compiled the MTL executable with the KCC and IBM xlc compilers.

Dense and sparse matrix-vector multiplication. Figure 3 shows performance results obtained using the matrix-vector multiplication algorithm for dense and for sparse matrices, and compares the performance to that obtained with nongeneric libraries. Figure 3a compares MTL's dense matrix-vector performance to the Netlib BLAS (Fortran) and the Sun Performance Library. Figure 3b compares MTL's sparse matrix-vector performance to Sparskit9 (Fortran) and the NIST Sparse BLAS (C).10 We ran the experiments on a Sun Ultra 30, used sparse matrices from the MatrixMarket collection, and did not clear the cache between each matrix-vector timing trial. This experiment focused on the algorithm's pipeline behavior. If we had cleared the cache, the bottleneck would have become memory bandwidth, and we could not have seen differences in pipeline behavior. Blocking for cache is not as important for matrix-vector multiplication because there is no reuse of matrix data.

Figure 2. Performance comparison of the MTL dense matrix-matrix product with other libraries on (a) Sun Ultra 30 and (b) IBM RS/6000.

Figure 3. Performance of the MTL matrix-vector product applied to (a) column-oriented dense and (b) row-oriented sparse data structures compared with other libraries on Sun Ultra 30.

The future of MTL
Although MTL's core functionality is complete, in many ways our work has only begun. Using MTL as a foundation, we plan to develop several libraries with higher levels of functionality (similar to how Lapack11 uses the BLAS). As mentioned earlier, we have already developed the first such library, a collection of iterative solvers called the Iterative Template Library. Current work focuses on sparse and dense linear solvers, eigenproblem routines, and SVD computations. In the meantime, we provide wrappers to give users a convenient MTL-style interface to Lapack. The growing MTL user group is actively building on top of MTL and contributing algorithms.

We are often asked about using overloaded operators for MTL. Although an operator-based syntax can be very handy for rapid prototyping, it is in some sense "syntactic sugar" that we felt to be orthogonal to MTL's original goals. We don't feel the present MTL syntax to be a significant drawback in terms of its original goals—to apply generic programming to the domain of numerical linear algebra. Similarly, in terms of software-engineering practice (for example, for library development) the existing syntax is perfectly suitable. Nevertheless, such a syntax can have value, so we are investigating an operator-based interface based on the expression template technology found in Blitz++12 and PETE.13

Using an interpretive front end offers an alternative approach to rapid prototyping (and an operator-based syntax) with MTL. We have recently developed one such system (having a Matlab-like syntax) and are also investigating the use of other interpreted scripting languages (such as Python).

We continue to refine MTL and work on porting it to new compilers. Our (perhaps immodest) hope is that it will ultimately become suitable as a standard. We are currently working closely with the SGI STL team to better integrate MTL with the STL.

Beyond MTL, we are also investigating the application of
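The expression-template technique referred to above can be illustrated with a toy fragment of our own (Blitz++ and PETE are far more general). The key move is that operator+ builds a lightweight proxy instead of a result, so the loop is fused into the assignment and no temporary vector is ever created:

```cpp
#include <cstddef>
#include <vector>

// Proxy that records "a + b" without evaluating it.
struct AddExpr {
    const std::vector<double>& a;
    const std::vector<double>& b;
    double operator[](std::size_t i) const { return a[i] + b[i]; }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n) : data(n, 0.0) {}
    // Assignment runs the single fused loop; the right-hand side was
    // never materialized as a vector of its own.
    Vec& operator=(const AddExpr& e) {
        for (std::size_t i = 0; i < data.size(); ++i)
            data[i] = e[i];
        return *this;
    }
};

inline AddExpr operator+(const Vec& a, const Vec& b) {
    return AddExpr{a.data, b.data};
}
```

With this in place, z = x + y reads like mathematics yet compiles to the same loop one would write by hand; a full system like Blitz++ extends the idea to arbitrary operator trees.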