High Performance Pattern Recognition On GPU
Abstract—The pattern recognition (PR) process uses a large number of labelled patterns and compute intensive algorithms. Several components of a PR process are compute and data intensive. Some algorithms compute the parameters required for classification directly for each test pattern using a large training set. Most algorithms have a training step, the results of which are used by a computationally cheap classification step. In this paper, we present high-performance pattern recognition algorithms using a commodity Graphics Processing Unit (GPU). Our algorithms exploit the high-performance SIMD architecture of the GPU. We specifically study the Parzen windows scheme for density estimation and the Artificial Neural Network (ANN) scheme for training and classification. We present fast implementations of these on an NVIDIA 8800 GTX GPU. Our implementation of Parzen windows can simultaneously estimate probability values for 1K test patterns in about 14 ms using an input data set of 16K patterns. Our ANN can run an epoch of batch training on the NIST data set, with 56K 484-dimensional patterns and 10 output categories, in less than 200 milliseconds. The speedup over CPU implementations is more than 300 times for Parzen windows and 100 times for ANN, on a commodity GPU that costs about $400.

I. INTRODUCTION

Pattern recognition is concerned with the design of systems that detect trends and classify patterns. Important application areas are optical character recognition, speech recognition, fingerprint identification, DNA sequence identification, and many more.

Most pattern recognition systems have two components: training and classification [5]. The system is trained using a large number of labelled patterns with relevant features. The training could be iterative, as with ANN, Support Vector Machines (SVM), etc., and proceeds till the error on a small set of testing patterns is sufficiently low. The trained system can then classify unknown patterns using a computationally cheap process. The training process is compute intensive and time consuming, especially when a large number of training patterns is involved. The classification accuracy often depends on the effort spent on training, and most applications settle for a suitable tradeoff. Using new training data to improve the performance is uncommon due to the large computational effort. Other types of pattern recognition techniques use the whole training set to directly evaluate the parameters for classifying each unknown pattern. These incur a heavy computational cost for classifying each pattern, as the computation involves all the input patterns; Parzen windows, k-Nearest Neighbour, etc., are examples of the same.

The rapid increase in the performance of graphics hardware has made the GPU a strong candidate for performing many compute intensive tasks. GPUs now include fully programmable processing units that follow a stream programming model and support vectorized floating-point operations. High level languages have emerged to support the new programmability. NVIDIA's 8-series GPUs with the CUDA computing environment provide a standard C-like language interface to the programmable processors, which eliminates the overhead of learning an inadequate API [13]. GPUs provide tremendous memory bandwidth and computational horsepower. For example, the NVIDIA GeForce 8800 GTX can achieve a sustained memory bandwidth of 86.4 GB/s and a theoretical maximum of 346 GFLOPS [13].

Several GPU algorithms have been developed for sorting [7], geometric computations, matrix multiplication, FFT [11] and graph algorithms. Larsen and McAllister [10] initially proposed an approach for computing matrix products using simple blending and texture mapping functionalities on GPUs. Hall et al. [8] and Moravanszky [12] described improved algorithms that perform implicit blocking. Fatahalian et al. [6] proposed another approach based on blocking for computing matrix products using fragment shaders. NVIDIA's CUBLAS library [14], which comes with the CUDA software pack, is an implementation of simple BLAS (Basic Linear Algebra Subprograms) on the GPU that allows optimized matrix and vector operations. Davis [4] presented a simulation of ANNs on GPUs using Brook [2]. Reiter et al. [16] described an HMM search implementation to compute the Viterbi probability for biological protein sequences. Cao et al. [3] presented an algorithm for scalable clustering on GPUs. However, little work has been done to exploit the computational capability of GPUs for highly compute intensive aspects of pattern recognition, such as ANN training and Parzen windows. GPU implementations of these can be useful, for instance, for retraining the network with new training patterns added on the fly.

The Parzen-window approach is a method of estimating a non-parametric density from observed patterns. In classifiers based on Parzen windows, the densities are estimated for each category and the test pattern is assigned to the category with the maximum posterior. We describe a parallel implementation of Parzen windows using the CUDA API that can be used to classify a large number of test patterns in parallel. We can estimate the probability densities for a test pattern in 14 µs using 16K input patterns. This makes Parzen-window based classifiers practical. The training of an ANN changes its parameters based on the signals that flow through it. It is an iterative process, where each iteration has high computational complexity. We describe the batch learning of the network as a set of matrix operations, which are well suited to GPUs because of their highly parallel computational requirements and regular data access patterns. We implemented the backpropagation ANN training algorithm using fragment shaders and the CUBLAS library. Our ANN batch training can run an epoch on a data set with 56K 484-dimensional training patterns in less than 200 ms.
II. PRELIMINARIES
We review the concepts related to the GPU and the PR
algorithms we address in this section.
A. GPU Architecture

GPUs have a parallel architecture with massively parallel processors. The graphics pipeline is well suited to the rendering process because it allows the GPU to function as a stream processor. Recent GPUs with Shader Model 4 [1] allow users to write vertex, fragment and geometry shader programs, as shown in Figure 1. The programmable parts of the graphics pipeline operate on a large number of vertices and fragments, spawning a thread for each to keep the parallel processors occupied. General-Purpose computation on Graphics Processing Units (GPGPU) uses the GPU for non-graphics computations by posing them as graphics rendering problems. Most GPGPU algorithms use the programmable fragment processors, as they form the most parallelizable component of the pipeline, mapping each input pixel to an output pixel of the framebuffer. Since GPU memory layout is optimized for graphics rendering, an optimal data structure may not be available for GPGPU solutions. Creating efficient data structures using the GPU memory model is a challenging problem in itself. The memory size and the memory operations (gather and scatter) of the GPU are other restricting factors.

Fig. 1. The graphics pipeline with the programmable stages shown shaded

Fig. 2. A set of SIMD multiprocessors with on-chip shared memory

At the hardware level, the CUDA-capable GPU is a set of SIMD multiprocessors with on-chip shared memory (Figure 2). Each multiprocessor also has a read-only constant cache and texture cache that is shared by all the processors, and a set of local 32-bit registers is available per processor. The multiprocessors communicate through the global or device memory. At the software level, the CUDA model is a collection of threads running in parallel. A thread block is a batch of SIMD-parallel threads that runs on a multiprocessor at a given time; its threads can communicate through shared memory and can be synchronized. The computations are organized as a grid of thread blocks, as shown in Figure 3. Each thread executes a single instruction set called the kernel. Thus, the CUDA model allows programmers to better exploit the parallel power of the GPU for general-purpose computing.
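To illustrate the thread, block and grid model described above, here is a minimal CUDA sketch of our own; the kernel name, array size and launch configuration are illustrative, not from the paper. Each thread handles one array element.

    #include <cuda_runtime.h>

    // Each thread of the grid processes one array element; threads are
    // grouped into blocks that run on a single multiprocessor and can
    // share on-chip memory.
    __global__ void scaleKernel(float *data, float alpha, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (idx < n)
            data[idx] *= alpha;                           // one element per thread
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *dData;
        cudaMalloc((void **)&dData, n * sizeof(float));

        int threadsPerBlock = 256;                                 // one thread block
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // the grid
        scaleKernel<<<blocks, threadsPerBlock>>>(dData, 2.0f, n);
        cudaDeviceSynchronize();

        cudaFree(dData);
        return 0;
    }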
B. Parzen Windows
The Parzen-window method [15] approximates the unknown density function p(x) from the N patterns x_1, x_2, ..., x_N drawn from a category. The N-th density estimate for p(x) is given as:

    p_N(x) = (1 / (N h_N)) Σ_{i=1}^{N} φ((x − x_i) / h_N),   (1)

where φ(·) is the window function and h_N is the window width. The estimates for a set of unknown patterns can be computed together as a matrix C = g(A), where g(A) denotes that e^((a_ij − 1)/σ²) is computed for every element of the matrix A. Each element C_ij of C denotes the probability estimate of the i-th unknown pattern belonging to the j-th category. In our implementation, g is applied by a simple element-wise kernel that overwrites each element in place:

Kernel:
    C[i,j] = e^((C[i,j] − 1)/σ²)
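As an illustration, a minimal CUDA sketch of this element-wise kernel, assuming a row-major product matrix; the kernel name, layout and launch configuration are our assumptions, not the paper's code.

    #include <cuda_runtime.h>

    // Applies g in place: each element a of the matrix becomes
    // exp((a - 1) / sigma^2), turning dot products into window responses.
    __global__ void parzenKernel(float *C, int rows, int cols, float sigma)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;  // column (category) index
        int i = blockIdx.y * blockDim.y + threadIdx.y;  // row (test pattern) index
        if (i < rows && j < cols) {
            float a = C[i * cols + j];                  // row-major element a_ij
            C[i * cols + j] = expf((a - 1.0f) / (sigma * sigma));
        }
    }

    // Example launch over a rows x cols matrix:
    // dim3 block(16, 16);
    // dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    // parzenKernel<<<grid, block>>>(dC, rows, cols, sigma);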
C. Artificial Neural Networks

For a three-layer network with inputs x_i, hidden unit activations y_j, outputs z_k and targets t_k, the per-pattern backpropagation weight updates are:

    ∆w_kj = η (t_k − z_k) f'(net_k) y_j,   (6)

    ∆w_ji = η [ Σ_{k=1}^{c} w_kj (t_k − z_k) f'(net_k) ] f'(net_j) x_i,   (7)

where η is the learning rate and n_H is the number of hidden units. Hence, Equations 4-7 as matrix operations become:

    Y = f(net_J) = f(X * W_JI^T),   (8)

    Z = f(net_K) = f(Y * W_KJ^T),   (9)

    ∆W_JI = η [{((T − Z) . f'(net_K)) * W_KJ} . f'(net_J)]^T * X.   (10)

B. Backpropagation on GPU

The backpropagation training algorithm using matrix operations is given in Algorithm 2. We follow the same notation for matrices as discussed in Section II-C. The network parameters (initial synaptic weights, number of hidden units, sigmoid function and learning rate) are set according to the techniques for improving backpropagation given by Haykin [9].

Algorithm 2: ANN batch training. Each step lists the operation used in the CUDA implementation (CUMAT or a kernel) and the fragment shader pass that implements it.

    for i = 0 to epochs do
        net_J ← X * W_JI^T            (CUMAT;    Shader 1)
        Y ← f(net_J)                  (Kernel 1; Shader 1)
        net_K ← Y * W_KJ^T            (CUMAT;    Shader 1)
        Z ← f(net_K)                  (Kernel 1; Shader 1)
        δ_K ← η(T − Z) . f'(net_K)    (Kernel 2; Shader 2)
        ∆W_KJ ← δ_K^T * Y             (CUMAT;    Shader 3)
        I ← δ_K * W_KJ                (CUMAT;    Shader 4)
        δ_J ← η I . f'(net_J)         (Kernel 3; Shader 5)
        ∆W_JI ← δ_J^T * X             (CUMAT;    Shader 3)
    end for
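To make the mapping concrete, the following is a minimal sketch of the forward pass of one iteration of Algorithm 2, using the legacy CUBLAS API for the CUMAT steps and an element-wise sigmoid kernel for Kernel 1. The names, the column-major layout expected by CUBLAS, and the logistic sigmoid are our assumptions, not the paper's actual code.

    #include <cublas.h>

    // Element-wise sigmoid (Kernel 1): out = f(in) = 1 / (1 + exp(-in)).
    __global__ void sigmoidKernel(const float *in, float *out, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            out[idx] = 1.0f / (1.0f + expf(-in[idx]));
    }

    // Forward pass of Algorithm 2: the CUMAT steps become cublasSgemm calls.
    // N: batch size, d: input units, nH: hidden units, c: output units.
    // All matrices are column-major device pointers; cublasInit() is assumed done.
    void forwardPass(const float *dX, const float *dWJI, const float *dWKJ,
                     float *dNetJ, float *dY, float *dNetK, float *dZ,
                     int N, int d, int nH, int c)
    {
        const int threads = 256;  // 256 threads per block, as in the paper

        // netJ = X * WJI^T (N x nH)                                 -- CUMAT
        cublasSgemm('N', 'T', N, nH, d, 1.0f, dX, N, dWJI, nH, 0.0f, dNetJ, N);
        // Y = f(netJ)                                               -- Kernel 1
        sigmoidKernel<<<(N * nH + threads - 1) / threads, threads>>>(dNetJ, dY, N * nH);

        // netK = Y * WKJ^T (N x c)                                  -- CUMAT
        cublasSgemm('N', 'T', N, c, nH, 1.0f, dY, N, dWKJ, c, 0.0f, dNetK, N);
        // Z = f(netK)                                               -- Kernel 1
        sigmoidKernel<<<(N * c + threads - 1) / threads, threads>>>(dNetK, dZ, N * c);
    }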
1) Using Fragment Shaders: The shader passes used in Algorithm 2 compute the following, written as the operation performed for each element of the output:

Shader 2:
    C[i,j].xyzw = η(A[i,j].xyzw − B[i,j].xyzw) ∗ f'(D[i,j].xyzw)

Shader 3:
    for m = 0 ... b − 1 do
        C[i,j].r = dot(A[m+k,i].xyzw ∗ B[m+k,j].xyzw) + C[i,j].xyzw

Shader 4:
    for m = 0 ... b − 1 do
        C[i,j].xyzw = A[i,m+k].xyzw ∗ B[m+k,j].xxxx + C[i,j].xyzw

Shader 5:
    C[i,j].xyzw = η A[i,j].xyzw ∗ f'(B[i,j].xyzw)

Shader 6:
    C[i,j].x = A[i,j].x + B[i,j].x

2) Using CUDA API: In Algorithm 2, matrix-matrix multiplications are done using the cublasSgemm() function, referred to as CUMAT, and the weight matrices are updated using the cublasSaxpy() function, referred to as CUADD. We write kernel functions for element-wise multiplication and for computing the sigmoid f(.) of net_J and net_K. We used 256 threads per block.

Kernel 1:
    C[i,j] = f(A[i,j])

Kernel 2:
    C[i,j] = η(A[i,j] − B[i,j]) ∗ f'(D[i,j])

Kernel 3:
    C[i,j] = η(A[i,j]) ∗ f'(B[i,j])
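As an example of these element-wise kernels, here is a hedged CUDA sketch of Kernel 2, assuming f is the logistic sigmoid so that f'(x) = f(x)(1 − f(x)); all names are illustrative rather than the paper's code.

    #include <cuda_runtime.h>

    // Kernel 2: C = eta * (A - B) .* f'(D), one matrix element per thread.
    // With a logistic sigmoid f, the derivative is f'(x) = f(x) * (1 - f(x)).
    __global__ void kernel2(const float *A, const float *B, const float *D,
                            float *C, float eta, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) {
            float fd = 1.0f / (1.0f + expf(-D[idx]));        // f(D[i,j])
            C[idx] = eta * (A[idx] - B[idx]) * fd * (1.0f - fd);
        }
    }

    // In Algorithm 2 this computes deltaK from T (targets), Z and netK:
    // kernel2<<<(n + 255) / 256, 256>>>(dT, dZ, dNetK, dDeltaK, eta, n);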
IV. RESULTS

We tested our algorithms on a 3 GHz Pentium IV CPU and an NVIDIA GeForce 8800 GTX graphics processor. To generate fragment programs, we use NVIDIA's Cg compiler. We benchmarked our GPU algorithms for ANN training against the optimized CPU-based ANN implementations provided by FANN (Fast Artificial Neural Network) and MATLAB.

Our Parzen-window implementation estimates the densities for a large number of test patterns in parallel. On average, it achieves a 325 times speedup over a MATLAB implementation. For 64 or more test patterns, the time taken per pattern on CUDA is 14.1 µs.

We used another dataset from http://www.nist.gov, consisting of 22 × 22 binary images of handwritten numbers from 0-9. For this data, we trained a 3-layer ANN with 484 (all the image coordinates), 128 and 10 units at the input, hidden and output layers respectively. The ANN is trained with 8K to 56K input images. As shown in Table II and Figure 5, the CUDA implementation achieves a 90-110 times speedup over MATLAB and a 120-140 times speedup over FANN. ANN training with 56K patterns for an epoch on CUDA requires 190 ms, as compared to 28.27 s for FANN. The shader implementation achieves a 40-50 times speedup over MATLAB and a 55-70 times speedup over FANN. For this large dataset, MATLAB runs into memory problems when trained with more than 32K patterns.

    N     CUDA t1 (s)   Shader t2 (s)   MATLAB t3 (s)   FANN t4 (s)
    8K        1.41           3.23          128.78          141.83
    16K       2.84           6.24          278.74          344.71
    24K       4.30           8.80          423.02          524.16
    32K       5.66          11.59          600.61          789.19
    40K       6.96          14.37            -              961.67
    48K       9.03          17.14            -             1179.26
    56K       9.77          20.00            -             1413.66

TABLE II: ANN TRAINING TIME FOR N PATTERNS FOR 50 EPOCHS

Table III shows the classification time for different numbers of test patterns. We observe that for more than 1K test patterns, the average classification time per pattern is 1.45 µs.

    N     CUDA t1 (ms)   CUDA t1/N (ms)   FANN t2 (ms)   FANN t2/N (ms)
    128        0.83          6.48e-03         16.59          0.129
    512        0.89          1.71e-03         68.27          0.133
    1K         1.57          1.54e-03        134.37          0.131
    2K         3.06          1.49e-03        266.60          0.129
    8K        11.86          1.44e-03       1076.51          0.131
    16K       24.04          1.46e-03       2139.71          0.130
    32K       48.08          1.46e-03       4283.02          0.130
    56K       83.49          1.45e-03       7574.14          0.132

TABLE III: CLASSIFICATION TIME FOR N PATTERNS ON ANN
V. CONCLUSIONS
In this paper, we presented the implementation of two data intensive pattern recognition operations on commodity GPUs. The algorithms exploit the high computing power of GPUs to deliver fast performance. Our implementation can compute the conditional or posterior probabilities in about 14 microseconds per pattern when done in parallel using the Parzen window method. This brings Parzen-window algorithms into the realm of real-time pattern recognition techniques. Similarly, the GPU ANN can perform an epoch of batch training with 56K samples in about 200 milliseconds. This makes frequent retraining possible during an application. The computational ease of such algorithms on inexpensive hardware makes it possible to acquire labelled samples live and use them to adjust the behaviour of the system.
REFERENCES
[1] Blythe D. The Direct3D 10 System. Microsoft Corporation, 2006.
[2] BrookGPU http://graphics.stanford.edu/projects/brookgpu/
[3] Cao F., Tung A. K. H. and Zhou A. Scalable Clustering Using Graphics
Processors. Lecture Notes in Computer Science, vol. 4016/2006, pp.
372-384, 2006.
[4] Davis C. E. Graphics Processing Unit Computation of Neural Networks.
University of New Mexico, 2001.
[5] Duda R. O., Hart P. E. and Stork D. G. Pattern Classification. 2nd
edition, John Wiley and Sons, 2001.
[6] Fatahalian K., Sugerman J., and Hanrahan P. Understanding the efficiency
of GPU algorithms for matrix-matrix multiplication. Proc of ACM
SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, Eu-
rographics Association, 2004.
[7] Govindaraju N. K., Manocha D., Raghuvanshi N., Tuft D. Gpusort:
High performance sorting using graphics processors. http://gamma.cs.
unc.edu/GPUSORT
[8] Hall J. D., Carr N., and Hart J. Cache and bandwidth aware matrix
multiplication on the GPU. Technical Report UIUCDCS-R-2003-2328,
University of Illinois at Urbana-Champaign.
[9] Haykin S. Neural Networks: A Comprehensive Foundation, 2nd edition,
1998.
[10] Larsen E. S., and McAllister D. Fast matrix multiplies using graphics
hardware. Proc. of ACM-IEEE conference on Supercomputing, 55-55.
2001.
[11] Moreland K. and Angel E. The FFT on a GPU. Proc. of SIGGRAPH/Eurographics Workshop on Graphics Hardware, pp. 112-119, July 2003.
[12] Moravanszky A. Dense matrix algebra on the GPU. 2003.
[13] NVIDIA: NVIDIA CUDA Compute Unified Device Architecture Programming Guide, 2007. http://developer.download.nvidia.com/compute/cuda/1_0/NVIDIA_CUDA_Programming_Guide_1.0.pdf.
[14] NVIDIA: NVIDIA CUBLAS Library, 2007. http://developer.download.nvidia.com/compute/cuda/1_0/CUBLAS_Library_1.0.pdf.
[15] Parzen E. On Estimation of a Probability Density Function and Mode. Annals of Mathematical Statistics, Vol. 33, pp. 1065-1076, 1962.
[16] Reiter D. H., Houston M. and Hanrahan P. ClawHMMER: A Streaming HMMer-Search Implementation. Proc. of ACM/IEEE Conference on Supercomputing, 2005.