
High Performance Pattern Recognition on GPU

Sheetal Lahabar, Pinky Agrawal, P. J. Narayanan


Center for Visual Information Technology
International Institute of Information Technology
Hyderabad, 500032 INDIA
{sheetal@students., pinky@students., pjn@}iiit.ac.in

Abstract—The pattern recognition (PR) process uses a large number of labelled patterns and compute-intensive algorithms. Several components of a PR process are compute and data intensive. Some algorithms compute the parameters required for classification directly for each test pattern using a large training set. Most algorithms have a training step, the results of which are used by a computationally cheap classification step. In this paper, we present high-performance pattern recognition algorithms using a commodity Graphics Processing Unit (GPU). Our algorithms exploit the high-performance SIMD architecture of the GPU. We specifically study the Parzen windows scheme for density estimation and the Artificial Neural Network (ANN) scheme for training and classification. We present fast implementations of these on an NVIDIA 8800 GTX GPU. Our implementation of Parzen windows can simultaneously estimate probability values for 1K test patterns in about 14ms based on an input data set of 16K patterns. Our ANN can run an epoch of batch training on the NIST data set, with 56K 484-dimensional patterns and 10 output categories, in less than 200 milliseconds. The speedup over CPU implementations is more than 300 times for Parzen windows and 100 times for ANN, using a commodity GPU that costs about $400.

I. INTRODUCTION

Pattern recognition is concerned with the design of systems that detect trends and classify patterns. Important application areas are optical character recognition, speech recognition, fingerprint identification, DNA sequence identification, and many more.

Most pattern recognition systems have two components: training and classification [5]. The system is trained using a large number of labelled patterns with relevant features. The training could be iterative, as with ANNs and Support Vector Machines (SVMs), and proceeds till the error on a small set of testing patterns is sufficiently low. The trained system can then classify unknown patterns using a computationally cheap process. The training process is compute intensive and time consuming, especially when a large number of training patterns is involved. The classification accuracy often depends on the effort spent on training, and most applications settle for a suitable tradeoff. Using new training data to improve performance is uncommon due to the large computational effort. There are other types of pattern recognition techniques that use the whole training set to directly evaluate parameters for the classification of each unknown pattern; Parzen windows, k-Nearest Neighbour, etc., are examples. These incur a heavy computational cost for classifying each pattern, as the computation involves all the input patterns.

The rapid increase in the performance of graphics hardware has made the GPU a strong candidate for performing many compute-intensive tasks. GPUs now include fully programmable processing units that follow a stream programming model and support vectorized floating-point operations. High-level languages have emerged to support the new programmability. NVIDIA's 8-series GPUs with the CUDA computing environment provide a standard C-like language interface to the programmable processors, which eliminates the overhead of learning an inadequate API [13]. GPUs provide tremendous memory bandwidth and computational horsepower. For example, the NVIDIA GeForce 8800 GTX can achieve a sustained memory bandwidth of 86.4 GB/s and a theoretical maximum of 346 GFLOPS [13].

Several GPU algorithms have been developed for sorting [7], geometric computations, matrix multiplication, FFT [11] and graph algorithms. Larsen and McAllister [10] initially proposed an approach for computing matrix products using simple blending and texture mapping functionalities on GPUs. Hall et al. [8] and Moravanszky [12] described improved algorithms that perform implicit blocking. Fatahalian et al. [6] proposed another blocking-based approach for computing matrix products using fragment shaders. NVIDIA's CUBLAS library [14], which comes with the CUDA software pack, is an implementation of simple BLAS (Basic Linear Algebra Subprograms) routines on the GPU that allows optimized matrix and vector operations. Davis [4] presented a simulation of ANNs on GPUs using Brook [2]. Reiter et al. [16] described an HMM search implementation to compute the Viterbi probability for biological protein sequences. Cao et al. [3] presented an algorithm for scalable clustering on GPUs. However, little work has been done to exploit the computational capability of GPUs for highly compute-intensive aspects of pattern recognition, such as ANN training and Parzen windows. GPU implementations of these can be useful, for instance, for retraining a network with new training patterns added on the fly.

The Parzen-window approach is a method of estimating a non-parametric density from observed patterns. In classifiers based on Parzen windows, the densities are estimated for each category and the test pattern is assigned to the category with the maximum posterior. We describe a parallel implementation of Parzen windows using the CUDA API that can classify a large number of test patterns in parallel. We can estimate the probability densities for a test pattern in 14µs using 16K input patterns. This makes Parzen-window based classifiers practical. The training of an ANN changes its parameters based on the signals that flow through it. It is an iterative process, where each iteration has high computational complexity. We describe the batch learning of the network as a set of matrix operations, which are well suited to GPUs because of their highly parallel computational requirements and regular data access patterns. We implemented the backpropagation ANN training algorithm using fragment shaders and the CUBLAS library. Our ANN batch training can run an epoch on a data set with 56K 484-dimensional training patterns in less than 200ms.

II. PRELIMINARIES
We review the concepts related to the GPU and the PR
algorithms we address in this section.

A. GPU Architecture
GPUs have a parallel architecture with massively parallel processors. The graphics pipeline is well suited to the rendering process because it allows the GPU to function as a stream processor. Recent GPUs with Shader Model 4 [1] allow users to write vertex, fragment and geometry shader programs, as shown in Figure 1. The programmable parts of the graphics pipeline operate on a large number of vertices and fragments, spawning a thread for each, to keep the parallel processors occupied. General-Purpose computation on Graphics Processing Units (GPGPU) uses the GPU for non-graphics computations by posing them as graphics rendering problems. Most GPGPU algorithms use the programmable fragment processors, as they are the most parallelizable component of the pipeline, mapping each input pixel to an output pixel of the framebuffer. Since GPU memory layout is optimized for graphics rendering, an optimal data structure may not be available for GPGPU solutions. Creating efficient data structures using the GPU memory model is a challenging problem in itself. The memory size and the memory operations (gather and scatter) of the GPU are other restricting factors.

Fig. 1. The graphics pipeline with the programmable stages shown shaded

NVIDIA's GeForce 8-series GPUs with the CUDA programming model provide an adequate API for non-graphics applications. The CPU sees a CUDA device as a multi-core co-processor. CUDA does not have the memory restrictions of GPGPU. It increases programming flexibility by providing both scatter and gather memory operations, i.e., the ability to read and write at any location in memory.

At the hardware level, NVIDIA's 8-series GPU is a set of SIMD multiprocessors with eight processors each. Each multiprocessor contains a parallel data cache or shared memory, which is shared by all its processors, as shown in Figure 2. It also has a read-only constant cache and a texture cache that are shared by all the processors. A set of local 32-bit registers is available per processor. The multiprocessors communicate through the global or device memory. At the software level, the CUDA model is a collection of threads running in parallel. A thread block is a batch of SIMD-parallel threads that runs on a multiprocessor at a given time; threads within a block can communicate through shared memory and can be synchronized. The computations are organized as a grid of thread blocks, as shown in Figure 3. Each thread executes a single instruction set called the kernel. Thus, the CUDA model allows programmers to better exploit the parallel power of the GPU for general-purpose computing.

Fig. 2. A set of SIMD multiprocessors with on-chip shared memory

Fig. 3. CUDA programming model
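To make the thread, block and grid organization concrete, below is a minimal CUDA sketch. It is our illustration, not code from the paper; the kernel name, sizes and scaling operation are hypothetical.

#include <cuda_runtime.h>

// Each thread handles one element of a rows x cols matrix.
__global__ void scaleKernel(float *A, int rows, int cols, float alpha)
{
    // Global (row, col) coordinates of this thread within the grid.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows && col < cols)
        A[row * cols + col] *= alpha;
}

int main()
{
    const int rows = 1024, cols = 1024;
    float *dA;
    cudaMalloc((void **)&dA, rows * cols * sizeof(float));

    // 16 x 16 threads per block; enough blocks in the grid to cover the matrix.
    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    scaleKernel<<<grid, block>>>(dA, rows, cols, 2.0f);
    cudaDeviceSynchronize();  // wait for the whole grid of blocks to finish

    cudaFree(dA);
    return 0;
}

The hardware schedules the blocks of the grid onto the available multiprocessors, which is what lets the same code scale across GPUs with different numbers of multiprocessors.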

B. Parzen Windows

The Parzen-window method [15] approximates an unknown density function p(x) from N patterns x_1, x_2, ..., x_N drawn from a category. The Nth density estimate for p(x) is given as:

$$p_N(x) = \frac{1}{N h_N} \sum_{i=1}^{N} \varphi\left(\frac{x - x_i}{h_N}\right), \qquad (1)$$

where φ(x) is a kernel density function and h_N is the window width. p_N(x) converges to the original p(x) if h_N → 0 as N → ∞.

The Parzen-window method can be implemented in a parallel fashion for computing the probability estimate based on N normalized d-dimensional patterns, randomly sampled from c classes, since the probability estimate for each class is independent, as given by Equation 1. For this, two tables are required. The first table has the N d-dimensional input patterns and is used to compute the N kernel density values for a given test pattern. The Gaussian kernel density function is given by $e^{(\mathbf{x}_i^t \mathbf{x} - 1)/\sigma^2}$, where σ determines the width of the effective Gaussian window. The second table has N c-dimensional entries which give the category of each input pattern. The probability that x belongs to a category is computed by summing the kernel density values over all the input patterns belonging to that category.

The parallel implementation of Parzen windows can be achieved by a series of matrix operations. Given the input patterns, we form an N × d matrix T1 for the first table, where each row contains an input pattern. An N × c matrix T2 is formed for the second table, where each row has a 1 in the column corresponding to its category and 0 elsewhere. For classifying m unknown patterns we form an m × d matrix K, and the classification process is expressed as:

$$I = g(K * T_1^T), \qquad (2)$$
$$C = I * T_2, \qquad (3)$$

where g(A) denotes that $e^{(a_{ij}-1)/\sigma^2}$ is computed for every element of the matrix A. Each element C_ij of C denotes the probability estimate of the ith unknown pattern belonging to the jth category.

C. ANN: Training and Classification

Neural networks have two primary modes of operation: feedforward and training. Figure 4 shows a simple three-layer ANN. During the feedforward operation, a d-dimensional input pattern x is presented to the input layer; each input unit then emits its corresponding component x_i. Each of the n_H hidden units computes its net activation net_j as the inner product of the input layer's signals with the weights w_ji at the hidden unit, and emits the intermediate value y_j as in Equation 4. Each of the c output units functions in the same manner, computing net_k as the inner product of the hidden unit signals with the weights at the output unit, as in Equation 5. The final signals z_k emitted by the network are used as discriminant functions for classification.

Fig. 4. A d-n_H-c three-layer ANN

$$y_j = f(net_j) = f\Big(\sum_{i=1}^{d} x_i w_{ji}\Big) = f(\mathbf{w}_j^t \mathbf{x}), \quad 1 \le j \le n_H, \qquad (4)$$

$$z_k = f(net_k) = f\Big(\sum_{j=1}^{n_H} y_j w_{kj}\Big) = f(\mathbf{w}_k^t \mathbf{y}), \quad 1 \le k \le c, \qquad (5)$$

where f(·) is a non-linear activation function such as the sigmoid.

Backpropagation is a general method for the supervised training of multilayer neural networks based on a gradient descent procedure. During network training, the output signals are compared with a target vector t, and any difference (the training error) is used to train the weights throughout the network. This process is repeated until the error falls below a threshold. The weights are changed in the direction that reduces the error. The hidden-to-output and input-to-hidden weights are updated using a learning rate η as:

$$\Delta w_{kj} = \eta\,(t_k - z_k)\,f'(net_k)\,y_j, \qquad (6)$$

$$\Delta w_{ji} = \eta \left[\sum_{k=1}^{c} w_{kj}\,(t_k - z_k)\,f'(net_k)\right] f'(net_j)\,x_i. \qquad (7)$$

In the on-line version of backpropagation, the weights are updated after presenting each input pattern, while in batch training all the training patterns are presented first and their corresponding weight updates are summed; only then are the actual weight vectors updated. Batch training has better convergence properties.

The batch training of an ANN can be reformulated as a series of matrix operations, which are inherently parallel. For N d-dimensional input patterns, each belonging to one of c categories, we form an N × d input matrix X by stacking the input vectors row-wise; similarly, an N × c target matrix T is formed by stacking the c-dimensional target vectors. An n_H × d input-to-hidden weight matrix W_JI and a c × n_H hidden-to-output weight matrix W_KJ are initialized with random values, where n_H is the number of hidden units. Hence, Equations 4-7 as matrix operations become:

$$Y = f(net_J) = f(X * W_{JI}^T), \qquad (8)$$

$$Z = f(net_K) = f(Y * W_{KJ}^T), \qquad (9)$$

$$\Delta W_{JI} = \eta \left[\{((T - Z)\,.\,f'(net_K)) * W_{KJ}\}\,.\,f'(net_J)\right]^T * X, \qquad (10)$$

$$\Delta W_{KJ} = \eta \left[(T - Z)\,.\,f'(net_K)\right]^T * Y, \qquad (11)$$

where (.) denotes element-wise multiplication and f(M) denotes the sigmoid of every element of the matrix M.
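As a quick dimension check on Equations 8-11 (our addition, not in the original text), the shapes chain consistently:

$$
\begin{aligned}
net_J = X\,W_{JI}^{T} &: \;(N \times d)(d \times n_H) = N \times n_H, \quad Y \in \mathbb{R}^{N \times n_H},\\
net_K = Y\,W_{KJ}^{T} &: \;(N \times n_H)(n_H \times c) = N \times c, \quad Z,\,T \in \mathbb{R}^{N \times c},\\
\Delta W_{KJ} = \eta\,[(T-Z)\,.\,f'(net_K)]^{T} * Y &: \;(c \times N)(N \times n_H) = c \times n_H,\\
\Delta W_{JI} = \eta\,[\{((T-Z)\,.\,f'(net_K)) * W_{KJ}\}\,.\,f'(net_J)]^{T} * X &: \;(n_H \times N)(N \times d) = n_H \times d,
\end{aligned}
$$

so each update has exactly the shape of the weight matrix it modifies.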
III. IMPLEMENTATION ON THE GPU

In this section, we present the implementation details of the above algorithms.

A. Parzen Windows on GPU

The density estimation algorithm for m unknown patterns using Parzen windows is given in Algorithm 1. We use the same notation as in Section II-B.

Algorithm 1 Parzen windows using matrix math
    M ← K * T1^T    } CUMAT
    I ← g(M)        } Kernel
    C ← I * T2      } CUMAT

CUMAT refers to the cublasSgemm() function for matrix multiplication, provided by the CUBLAS library [14]. Similarly, Kernel refers to a kernel function executed on the GPU. We divide the m × N matrix I into m/16 × N/16 thread blocks, with 16 × 16 threads per block. The hardware maps thread blocks to the parallel multiprocessors on the GPU. The Kernel function is executed for every thread.

Kernel:
    C[i, j] = e^((C[i, j] − 1)/σ²)

B. Backpropagation on GPU

The backpropagation training algorithm using matrix operations is given in Algorithm 2. We follow the same notation for matrices as in Section II-C. The network parameters, initial synaptic weights, number of hidden units, sigmoid function and learning rate are set according to the techniques for improving backpropagation given by Haykin [9].

Algorithm 2 ANN training using CUDA and using shaders
    for i = 0 to epochs do
        netJ ← X * WJI^T            } CUMAT    }
        Y ← f(netJ)                 } Kernel 1 } Shader 1
        netK ← Y * WKJ^T            } CUMAT    }
        Z ← f(netK)                 } Kernel 1 } Shader 1
        δK ← η(T − Z) . f′(netK)    } Kernel 2 } Shader 2
        ΔWKJ ← δK^T * Y             } CUMAT    } Shader 3
        I ← δK * WKJ                } CUMAT    } Shader 4
        δJ ← I . f′(netJ)           } Kernel 3 } Shader 5
        ΔWJI ← δJ^T * X             } CUMAT    } Shader 3
        WJI ← WJI + ΔWJI            } CUADD    } Shader 6
        WKJ ← WKJ + ΔWKJ            } CUADD    } Shader 6
    end for

Since δK already carries the learning rate η, δJ needs no further factor of η; this keeps the updates consistent with Equations 10 and 11.

1) Using fragment shaders: This algorithm implements each epoch as a multipass method (N/(4×s) passes), requiring multiple renderings to the framebuffer. The input is stored in N/(4×s) four-channel textures, s being the maximum allowable texture size. Our implementation packs 4 consecutive elements of a matrix column into a texel for the input, target and other intermediate matrices. The 4 elements of a texel p can be accessed simultaneously, in order, as p.xyzw. The elements can also be accessed in arbitrary order (e.g. p.yxwz) and an element can be referenced multiple times (e.g. p.xxxx). For example, r.xyzw = p.xxyy assigns the x element of p to the x and y elements of texel r, and similarly for the other elements. Single-channel textures are used for the weight matrices. This packing of data allows efficient shaders.

The multiplication of a P × Q matrix by a Q × R matrix is a multipass algorithm, requiring Q/b passes, where b is a scalar representing the block size that gives optimal performance [4]. Each pass accepts i, j, and k arguments via interpolated texture coordinates, multiplies b elements and accumulates the partial result of C[i, j] using GL_BLEND, till all the Q elements are multiplied. After each pass, k is incremented by b.

Shader 1:
    for m = 0 ... b − 1 do
        C[i, j].xyzw = A[i, m + k].xyzw ∗ B[j, m + k].xxxx + C[i, j].xyzw
    if last pass
        C[i, j].xyzw = f(C[i, j].xyzw)

Shader 2:
    C[i, j].xyzw = η(A[i, j].xyzw − B[i, j].xyzw) ∗ f′(D[i, j].xyzw)

Shader 3:
    for m = 0 ... b − 1 do
        C[i, j].r = dot(A[m + k, i].xyzw ∗ B[m + k, j].xyzw) + C[i, j].r

Shader 4:
    for m = 0 ... b − 1 do
        C[i, j].xyzw = A[i, m + k].xyzw ∗ B[m + k, j].xxxx + C[i, j].xyzw

Shader 5:
    C[i, j].xyzw = A[i, j].xyzw ∗ f′(B[i, j].xyzw)

Shader 6:
    C[i, j].x = A[i, j].x + B[i, j].x

2) Using CUDA API: In Algorithm 2, the matrix-matrix multiplications are done using the cublasSgemm() function, referred to as CUMAT, and the weight matrices are updated using the cublasSaxpy() function, referred to as CUADD. We write kernel functions for the element-wise multiplications and for computing the sigmoid f(.) of netJ and netK. We used 256 threads per block.

Kernel 1:
    C[i, j] = f(A[i, j])

Kernel 2:
    C[i, j] = η(A[i, j] − B[i, j]) ∗ f′(D[i, j])

Kernel 3:
    C[i, j] = A[i, j] ∗ f′(B[i, j])
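To make these element-wise kernels concrete, below is a minimal CUDA sketch of Algorithm 1's Kernel and of Kernel 2. This is our illustration, not code from the paper: the function and variable names are hypothetical, the matrices are flattened to 1-D with 256-thread blocks (the paper uses 16 × 16 blocks for the Parzen kernel), and we assume the logistic sigmoid so that f′(netK) = Z(1 − Z), letting the kernel reuse Z instead of recomputing f.

#include <cuda_runtime.h>

// Algorithm 1's Kernel: C[i, j] = exp((C[i, j] - 1) / sigma^2), element-wise
// over the m x N matrix I (flattened to n = m * N elements here).
__global__ void parzenKernel(float *C, int n, float sigma2)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (idx < n)
        C[idx] = expf((C[idx] - 1.0f) / sigma2);
}

// Kernel 2: deltaK = eta * (T - Z) .* f'(netK). With a logistic sigmoid,
// f'(netK) = Z * (1 - Z), so Z = f(netK) is reused directly.
__global__ void kernel2(float *deltaK, const float *T, const float *Z,
                        int n, float eta)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        deltaK[idx] = eta * (T[idx] - Z[idx]) * Z[idx] * (1.0f - Z[idx]);
}

// Hypothetical launches, with 256 threads per block as stated in the text:
//   parzenKernel<<<(m * N + 255) / 256, 256>>>(dC, m * N, sigma * sigma);
//   kernel2<<<(N * c + 255) / 256, 256>>>(dDeltaK, dT, dZ, N * c, eta);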
IV. RESULTS

We tested our algorithms on a 3 GHz Pentium IV CPU and an NVIDIA GeForce 8800 GTX graphics processor. To generate fragment programs, we use NVIDIA's Cg compiler. We benchmarked our GPU algorithms for ANN training against the optimized CPU-based ANN implementations provided by FANN (Fast Artificial Neural Network) and by MATLAB.

We used two datasets for our experiments. A dataset from http://archive.ics.uci.edu/beta/datasets/Letter+Recognition, consisting of 16K 16-dimensional feature vectors, where each input belongs to one of the 26 capital letters of the English alphabet, is used to estimate the probability density for each of the 26 categories for the test patterns. We compute the density estimates for m test patterns in parallel. Table I shows the performance of estimating densities for different numbers of test patterns. We observe that the time taken to estimate the probability densities for a test pattern decreases when the probabilities are estimated for a large number of test patterns in parallel. On average, our implementation achieves a 325 times speedup over an implementation in MATLAB. For 64 or more test patterns, the time taken per pattern on CUDA is 14.1 µs.

    N      t1 (ms)    t1/N (ms)   t2 (ms)     t2/N (ms)
           CUDA       CUDA        MATLAB      MATLAB
    1      0.319      0.319       10.0        10.00
    16     0.434      0.027       70.0        4.37
    128    1.870      0.014       520.7       4.06
    256    3.647      0.014       1039.4      4.06
    512    7.207      0.014       2083.0      4.06
    1K     14.736     0.014       4166.0      4.06
    2K     28.980     0.014       8360.1      4.08
    4K     57.810     0.014       19450.0     4.74

    TABLE I. Parzen-window results for N patterns

We used another dataset, from http://www.nist.gov, consisting of 22 × 22 binary images of handwritten numbers from 0-9. For this data, we trained a 3-layer ANN with 484 (all the image coordinates), 128 and 10 units at the input, hidden and output layers respectively. The ANN is trained with 8K to 56K input images. As shown in Table II and Figure 5, the CUDA implementation achieves a 90-110 times speedup over MATLAB and a 120-140 times speedup over FANN. ANN training with 56K patterns for an epoch on CUDA requires 190ms, compared to 28.27s for FANN. The shader implementation achieves a 40-50 times speedup over MATLAB and a 55-70 times speedup over FANN. For this large dataset, MATLAB runs into memory problems when training with more than 32K patterns.

    N      t1 (s)    t2 (s)    t3 (s)     t4 (s)
           CUDA      Shader    MATLAB     FANN
    8K     1.41      3.23      128.78     141.83
    16K    2.84      6.24      278.74     344.71
    24K    4.30      8.80      423.02     524.16
    32K    5.66      11.59     600.61     789.19
    40K    6.96      14.37     -          961.67
    48K    9.03      17.14     -          1179.26
    56K    9.77      20.00     -          1413.66

    TABLE II. ANN results for N patterns for 50 epochs

Fig. 5. Results for training ANN for 50 epochs

We tested the trained network with 2K unknown patterns. Training with 32K input patterns for 200 epochs takes 3472s and gives 84% accuracy with FANN, while the GPU takes 23s and gives 82% accuracy. When trained for 1000 epochs, the GPU takes 113s and gives 89% accuracy. Table III shows the classification time for different numbers of test patterns. We observe that for more than 1K test patterns, the average classification time per pattern is 1.45µs.

    N      t1 (ms)   t1/N (ms)   t2 (ms)    t2/N (ms)
           CUDA      CUDA        FANN       FANN
    128    0.83      6.48e-03    16.59      0.129
    512    0.89      1.71e-03    68.27      0.133
    1K     1.57      1.54e-03    134.37     0.131
    2K     3.06      1.49e-03    266.60     0.129
    8K     11.86     1.44e-03    1076.51    0.131
    16K    24.04     1.46e-03    2139.71    0.130
    32K    48.08     1.46e-03    4283.02    0.130
    56K    83.49     1.45e-03    7574.14    0.132

    TABLE III. Classification time for N patterns on ANN
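The paper does not state how the GPU timings were collected. A typical approach, shown here only as a sketch under that assumption, is to bracket the measured work with CUDA event timers:

cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// ... launch the kernels / cublasSgemm calls for one epoch here ...
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);              // wait until the recorded work finishes
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);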

V. CONCLUSIONS

In this paper, we presented the implementation of two data-intensive pattern recognition operations on commodity GPUs. The algorithms exploit the high computing power of GPUs and deliver fast performance. Our implementation can compute the conditional or posterior probabilities in about 14 microseconds per pattern when done in parallel using the Parzen-window method. This brings Parzen-window algorithms into the realm of real-time pattern recognition techniques. Similarly, the GPU ANN can perform an epoch of batch training with 56K samples in about 200 milliseconds. This makes frequent retraining possible during an application. The computational ease of such algorithms on inexpensive hardware makes it possible to acquire labelled samples live and use them to adjust the behaviour of the system.

REFERENCES
[1] Blythe D. The Direct3D 10 System. Microsoft Corporation, 2006.
[2] BrookGPU. http://graphics.stanford.edu/projects/brookgpu/
[3] Cao F., Tung A. K. H. and Zhou A. Scalable Clustering Using Graphics Processors. Lecture Notes in Computer Science, vol. 4016/2006, pp. 372-384, 2006.
[4] Davis C. E. Graphics Processing Unit Computation of Neural Networks. University of New Mexico, 2001.
[5] Duda R. O., Hart P. E. and Stork D. G. Pattern Classification. 2nd edition, John Wiley and Sons, 2001.
[6] Fatahalian K., Sugerman J. and Hanrahan P. Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication. Proc. of ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, Eurographics Association, 2004.
[7] Govindaraju N. K., Manocha D., Raghuvanshi N. and Tuft D. GPUSort: High Performance Sorting using Graphics Processors. http://gamma.cs.unc.edu/GPUSORT
[8] Hall J. D., Carr N. and Hart J. Cache and Bandwidth Aware Matrix Multiplication on the GPU. Technical Report UIUCDCS-R-2003-2328, University of Illinois at Urbana-Champaign, 2003.
[9] Haykin S. Neural Networks: A Comprehensive Foundation. 2nd edition, 1998.
[10] Larsen E. S. and McAllister D. Fast Matrix Multiplies using Graphics Hardware. Proc. of ACM/IEEE Conference on Supercomputing, 2001.
[11] Moreland K. and Angel E. The FFT on a GPU. Proc. of SIGGRAPH/Eurographics Workshop on Graphics Hardware, pp. 112-119, July 2003.
[12] Moravanszky A. Dense Matrix Algebra on the GPU. 2003.
[13] NVIDIA: NVIDIA CUDA Compute Unified Device Architecture Programming Guide, 2007. http://developer.download.nvidia.com/compute/cuda/1_0/NVIDIA_CUDA_Programming_Guide_1.0.pdf
[14] NVIDIA: NVIDIA CUBLAS Library, 2007. http://developer.download.nvidia.com/compute/cuda/1_0/CUBLAS_Library_1.0.pdf
[15] Parzen E. On Estimation of a Probability Density Function and Mode. Annals of Mathematical Statistics, vol. 33, pp. 1065-1076, 1962.
[16] Reiter D. H., Houston M. and Hanrahan P. ClawHMMER: A Streaming HMMer-Search Implementation. Proc. of ACM/IEEE Conference on Supercomputing, 2005.
