NVIDIA OpenCL
Best Practices Guide
Version 2.3
Preface
    What Is This Document?
    Who Should Read This Guide?
    Recommendations and Best Practices
    Contents Summary
Chapter 1. Heterogeneous Computing with OpenCL
    1.1 Differences Between Host and Device
    1.2 What Runs on an OpenCL-Enabled Device?
    1.3 Maximum Performance Benefit
Chapter 2. Performance Metrics
    2.1 Timing
        2.1.1 Using CPU Timers
        2.1.2 Using OpenCL GPU Timers
    2.2 Bandwidth
        2.2.1 Theoretical Bandwidth Calculation
        2.2.2 Effective Bandwidth Calculation
        2.2.3 Throughput Reported by the OpenCL Visual Profiler
Chapter 3. Memory Optimizations
    3.1 Data Transfer Between Host and Device
        3.1.1 Pinned Memory
    3.2 Device Memory Spaces
        3.2.1 Coalesced Access to Global Memory
            3.2.1.1 A Simple Access Pattern
            3.2.1.2 A Sequential but Misaligned Access Pattern
            3.2.1.3 Effects of Misaligned Accesses
            3.2.1.4 Strided Accesses
        3.2.2 Shared Memory
            3.2.2.1 Shared Memory and Memory Banks
            3.2.2.2 Shared Memory in Matrix Multiplication (C = AB)
            3.2.2.3 Shared Memory in Matrix Multiplication (C = AAᵀ)
            3.2.2.4 Shared Memory Use by Kernel Arguments
        3.2.3 Local Memory
Actions that present substantial improvements for most OpenCL applications have the highest priority, while small optimizations that affect only very specific situations are given a lower priority.
Before implementing lower priority recommendations, it is good practice to make
sure all higher priority recommendations that are relevant have already been applied.
This approach will tend to provide the best results for the time invested and will
avoid the trap of premature optimization.
The criteria of benefit and scope for establishing priority will vary depending on the
nature of the program. In this guide, they represent a typical case. Your code might
reflect different priority factors. Regardless of this possibility, it is good practice to
verify that no higher priority recommendations have been overlooked before
undertaking lower priority items.
Appendix A of this document lists all the recommendations and best practices,
grouping them by priority and adding some additional helpful observations.
Contents Summary
The remainder of this guide is divided into the following sections:
- Introduction to Parallel Computing with OpenCL: Important aspects of the parallel programming architecture.
- Performance Metrics: How should performance be measured in OpenCL applications and what are the factors that most influence performance?
- Memory Optimizations: Correct memory management is one of the most effective means of improving performance. This chapter explores the different kinds of memory available to OpenCL applications, and it explains in detail how memory is handled behind the scenes.
- NDRanges Optimizations: How to make sure your OpenCL application is exploiting all the available resources on the GPU.
- Instruction Optimizations: Certain operations run faster than others. Using faster operations and avoiding slower ones often confers remarkable benefits.
- Control Flow: Carelessly designed control flow can force parallel code into serial execution; whereas thoughtfully designed control flow can help the hardware perform the maximum amount of work per clock cycle.
- Getting the Right Answer: How to debug code and how to handle differences in how the CPU and GPU represent floating-point values.
On the device, RAM is divided into different types, each of which has a special purpose and fulfills different needs. The types of device RAM are explained in the NVIDIA OpenCL Programming Guide and in Chapter 3 of this document.
These are the primary hardware differences between CPU hosts and GPU devices
with respect to parallel programming. Other differences are discussed as they arise
elsewhere in this document.
High Priority: To get the maximum benefit from OpenCL, focus first on finding ways
to parallelize sequential code.
2.1 Timing
OpenCL calls and kernel executions can be timed using either CPU or GPU timers.
This section examines the functionality, advantages, and pitfalls of both approaches.
For example, given a cl_event associated with a command, its start and end times can be queried as follows:
cl_ulong start, end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &end, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &start, NULL);
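These timestamps are reported in nanoseconds and are available only if profiling was enabled when the command queue was created, as in the following sketch (the context and device variable names are assumptions):
cl_int err;
cl_command_queue cqCommandQue = clCreateCommandQueue(cxGPUContext, cdDevice,
                                                     CL_QUEUE_PROFILING_ENABLE, &err);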
Note that the timings are measured on the GPU clock, and so are operating system–
independent. The resolution of the GPU timer is approximately half a microsecond.
2.2 Bandwidth
Bandwidth is one of the most important gating factors for performance. Almost all
changes to code should be made in the context of how they affect bandwidth. As
described in Chapter 3 of this guide, bandwidth can be dramatically affected by the
choice of memory in which data is stored, how the data is stored and accessed, as
well as other factors.
To measure performance accurately, it is useful to calculate theoretical and effective
bandwidth. When the latter is much lower than the former, design or
implementation details are likely to reduce bandwidth, and it should be the primary
goal of subsequent optimization efforts to increase it.
High Priority: Use the effective bandwidth of your computation as a metric when
measuring performance and optimization benefits.
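As a reminder of how this metric is computed, here is a minimal sketch (Br and Bw stand for the total bytes read and written by the kernel, and time is the measured execution time in seconds; all three are placeholders):
/* Effective bandwidth in GBps: total bytes moved, divided by 10^9, divided by seconds */
double Br = 0.0, Bw = 0.0;   /* bytes read and written by the kernel (placeholders) */
double time = 1.0;           /* measured kernel execution time in seconds (placeholder) */
double effectiveGBps = ((Br + Bw) / 1.0e9) / time;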
Theoretical bandwidth can be calculated from the hardware specifications quoted in product literature. In this calculation, the memory clock rate is converted to Hz, multiplied by the interface width (divided by 8, to convert bits to bytes), and multiplied by 2 due to the double data rate. Finally, this product is divided by 10⁹ to convert the result to GB/sec (GBps).
Note that some calculations use 1,024³ instead of 10⁹ for the final division. In such a case, the bandwidth would be 131.9 GBps. It is important to use the same divisor when calculating theoretical and effective bandwidth, so that the comparison is valid.
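The worked example that this passage refers to is not reproduced in this excerpt. As a sketch, assuming the figures commonly quoted for the GeForce GTX 280 (a 1,107 MHz double-data-rate memory clock and a 512-bit memory interface), the calculation looks like this:
/* theoretical bandwidth = memory clock (Hz) x interface width (bytes) x 2 (DDR) */
double clockHz  = 1107.0e6;                                 /* 1,107 MHz memory clock  */
double busBytes = 512.0 / 8.0;                              /* 512-bit bus = 64 bytes  */
double gbps     = clockHz * busBytes * 2.0 / 1.0e9;         /* ~141.7 GBps             */
double gbpsAlt  = clockHz * busBytes * 2.0 / 1073741824.0;  /* ~131.9 GBps with 1024^3 */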
Memory optimizations are the most important area for performance. The goal is to
maximize the use of the hardware by maximizing bandwidth. Bandwidth is best
served by using as much fast memory and as little slow-access memory as possible.
This chapter discusses the various kinds of memory on the host and device and how
best to set up data items to use the memory effectively.
High Priority: Minimize data transfer between the host and the device, even if it
means running some kernels on the device that do not show performance gains when
compared with running them on the host CPU.
OpenCL applications do not have direct control over whether memory objects are
allocated in pinned memory or not, but they can create objects using the
CL_MEM_ALLOC_HOST_PTR flag and such objects are likely to be allocated in
pinned memory by the driver for best performance. The oclBandwidthTest program in
the NVIDIA GPU Computing SDK shows how to use these functions as well as
how to measure memory transfer performance. Additional examples of pinned
memory usage are provided in the oclSobelFilter and oclMedianFilter program samples
in the NVIDIA GPU Computing SDK.
Pinned memory should not be overused. Excessive use can reduce overall system
performance because pinned memory is a scarce resource. How much is too much
is difficult to tell in advance, so as with all optimizations, test the applications and
the systems they run on for optimal performance parameters.
The steps normally needed to use pinned memory are briefly summarized in the
following example.
1) Declare cl_mem buffer objects for the pinned host memory and the GPU
device GMEM, respectively, and standard pointers to reference pinned host
memory.
cl_context cxGPUContext;
cl_mem cmPinnedBufIn = NULL;
cl_mem cmPinnedBufOut = NULL;
cl_mem cmDevBufIn = NULL;
cl_mem cmDevBufOut = NULL;
unsigned char* cDataIn = NULL;
unsigned char* cDataOut = NULL;
2) Allocate cl_mem buffer objects for the pinned host memory and the GPU
device GMEM, respectively. Because these are time-consuming operations, and
because many applications don’t need to change the size of these buffers within
time-critical code paths, these functions are commonly executed in an
application initialization function or an event-driven function (not in any program
loop to be executed quickly and frequently).
cmPinnedBufIn = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY |
CL_MEM_ALLOC_HOST_PTR, memSize,
NULL, NULL);
cmPinnedBufOut = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY |
CL_MEM_ALLOC_HOST_PTR, memSize,
NULL, NULL);
3) Map the pinned host memory input and output buffers to obtain standard
pointers that reference them.
cDataIn = (unsigned char*)clEnqueueMapBuffer(cqCommandQue,
cmPinnedBufIn, CL_TRUE,
CL_MAP_WRITE, 0, memSize, 0,
NULL, NULL, NULL);
cDataOut = (unsigned char*)clEnqueueMapBuffer(cqCommandQue,
cmPinnedBufOut, CL_TRUE,
CL_MAP_READ, 0, memSize, 0,
NULL, NULL, NULL);
4) Initialize or update the pinned memory content, using the standard host pointer
and standard host code. This might be done during a program initialization
function or at any appropriate time by such means as an asynchronous data
acquisition function.
for(unsigned int i = 0; i < memSize; i++)
{
cDataIn[i] = (unsigned char)(i & 0xff);
}
5) Write data from pinned host memory to the GPU device GMEM any time in
the application that “fresh” data has been written to the pinned host memory.
This step #5, along with steps #6 and #7, commonly constitutes a core
sequence (copy input data to GPU, compute on GPU, copy results back to
CPU) in an application with a recurring main loop, such as an application with a
GLUT display callback loop.
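The write call for this step is not reproduced in this excerpt; a minimal sketch using the buffers declared above (a non-blocking write from the mapped pinned pointer to the device buffer) might look like:
clEnqueueWriteBuffer(cqCommandQue, cmDevBufIn, CL_FALSE, 0, memSize,
                     cDataIn, 0, NULL, NULL);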
6) Run the computation kernel on the GPU device.
clEnqueueNDRangeKernel(cqCommandQue, …);
7) Read data from GPU device GMEM to pinned host memory. Note that this
example uses a blocking read to assure the read is complete (which would make
sense if the next step in the application was to display or otherwise use the
processed data from the host). Also note that this read would be unnecessary in
applications using CL-GL interop if the destination for the computed data is
only a graphics window, not main CPU memory.
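A read call matching this description (a blocking read into the mapped pinned host pointer) might look like the following sketch:
clEnqueueReadBuffer(cqCommandQue, cmDevBufOut, CL_TRUE, 0, memSize,
                    cDataOut, 0, NULL, NULL);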
Of these different memory spaces, global and texture memory are the most
plentiful. There is a 16 KB per-thread limit on local memory, a total of 64 KB of
constant memory, a limit of 16 KB of shared memory, and either 8,192 or
16,384 32-bit registers per multiprocessor. Global, local, and texture memory have
the greatest access latency (although texture is cached), followed by constant
memory, registers, and shared memory.
The various principal traits of the memory types are shown in Table 3.1.
Table 3.1 Salient features of device memory
The access requirements for coalescing depend on the compute capability of the
device:
On devices of compute capability 1.0 or 1.1, the k-th thread in a half warp must
access the k-th word in a segment aligned to 16 times the size of the elements
being accessed; however, not all threads need to participate.
On devices of compute capability 1.2 or higher, coalescing is achieved for any
pattern of accesses that fits into a segment size of 32 bytes for 8-bit words,
64 bytes for 16-bit words, or 128 bytes for 32- and 64-bit words. Smaller
transactions may be issued to avoid wasting bandwidth. More precisely, the
following protocol is used to issue a memory transaction for a half warp:
- Find the memory segment that contains the address requested by the lowest
  numbered active thread. Segment size is 32 bytes for 8-bit data, 64 bytes for
  16-bit data, and 128 bytes for 32-, 64-, and 128-bit data.
- Find all other active threads whose requested address lies in the same
  segment, and reduce the transaction size if possible:
    - If the transaction is 128 bytes and only the lower or upper half is used,
      reduce the transaction size to 64 bytes.
    - If the transaction is 64 bytes and only the lower or upper half is used,
      reduce the transaction size to 32 bytes.
- Carry out the transaction and mark the serviced threads as inactive.
- Repeat until all threads in the half warp are serviced.
These concepts are illustrated in the following simple examples.
Figure 3.4 Coalesced access in which all threads but one access the
corresponding word in a segment
This access pattern results in a single 64-byte transaction, indicated by the red
rectangle. Note that even though one word is not requested, all data in the segment
are fetched. If accesses by threads were permuted within this segment, still one 64-
byte transaction would be performed by a device with compute capability 1.2 or
higher, but 16 serialized transactions would be performed by a device with compute
capability 1.1 or lower.
Figure 3.5 Unaligned sequential addresses that fit within a single 128-
byte segment
If a half warp accesses memory that is sequential but split across two 128-byte
segments, then two transactions are performed. In the following case, illustrated in
Figure 3.6, one 64-byte transaction and one 32-byte transaction result.
Figure 3.6 Misaligned sequential addresses that fall within two 128-byte
segments
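Listing 3.5 itself is not reproduced in this excerpt; a minimal sketch consistent with the description below (the kernel and parameter names are assumptions) is:
__kernel void offsetCopy(__global float* odata,
                         __global float* idata,
                         int offset)
{
    int xid = get_global_id(0) + offset;
    odata[xid] = idata[xid];
}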
In Listing 3.5, data is copied from the input array idata to the output array, both of
which exist in global memory. The kernel is executed within a loop in host code that
varies the parameter offset from 1 to 32. (Figures 3.5 and 3.6 correspond to
offsets of 1 and 17, respectively.) The effective bandwidth for the copy with various
offsets on an NVIDIA GeForce GTX 280 (with compute capability 1.3) and an
NVIDIA GeForce GTX 8800 (compute capability 1.0) are shown in Figure 3.7.
For the NVIDIA GeForce GTX 8800 device, global memory accesses with no
offset or with offsets that are multiples of 16 result in a single transaction per half
warp and an effective bandwidth of approximately 74 GBps. Otherwise, 16
transactions are issued per half warp resulting in an effective bandwidth of
approximately 7 GBps. This roughly 8x performance degradation is due to the fact
that 32 bytes, the minimum transaction size, are fetched for each thread. However,
only 4 bytes of data are used for each 32 bytes fetched—resulting in the 4/32=1/8
performance relative to the fully coalesced case. The two numbers also reflect the
different data represented by effective bandwidth (4 bytes) versus actual bandwidth
(32 bytes).
Because of this possible performance degradation, memory coalescing is the most
critical aspect of performance optimization of device memory. For the NVIDIA
GeForce GTX 280 device, the situation is less dire for misaligned accesses because,
in all cases, access by a half warp of threads in this kernel results in either one or
two transactions. As such, the effective bandwidth is between 120 GBps for a single
transaction and 70 GBps for two transactions per half warp. The number of
transactions issued for a half warp of threads depends on the offset and whether the
warp is even- or odd-numbered. For offsets of 0 or 16, each half warp results in a
single 64-byte transaction (Figure 3.4). For offsets of 1 through 7 or 9 through 15,
even-numbered warps result in a single 128-byte transaction (Figure 3.5) and odd-
numbered warps result in two transactions: one 64-byte and one 32-byte (Figure
3.6). For offsets of 8, even-numbered warps result in one 128-byte transaction and
odd-numbered warps result in two 32-byte transactions. The two 32-byte
transactions, rather than a 64- and a 32-byte transaction, are responsible for the blip
at the offset of 8 in Figure 3.7.
Figure 3.8 illustrates a situation that can be created using the code in Listing 3.6;
namely, threads within a half warp access memory with a stride of 2. This action is
coalesced into a single 128-byte transaction on an NVIDIA GeForce GTX 280
(compute capability 1.3).
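Listing 3.6 is likewise not reproduced here; a sketch of a strided copy kernel matching that description (names are assumptions):
__kernel void strideCopy(__global float* odata,
                         __global float* idata,
                         int stride)
{
    int xid = get_global_id(0) * stride;
    odata[xid] = idata[xid];
}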
Although a stride of 2 results in a single transaction, note that half the elements in
the transaction are not used and represent wasted bandwidth. As the stride
increases, the effective bandwidth decreases until the point where 16 transactions
are issued for the 16 threads in a half warp, as indicated in Figure 3.9.
Note, however, that on the NVIDIA GTX 8800 device (compute capability 1.0),
any non-unit stride results in 16 separate transactions per half warp.
As illustrated in Figure 3.9, non-unit stride global memory accesses should be
avoided whenever possible. One method for doing so utilizes shared memory,
which is discussed in the next section.
Shared memory banks are organized such that successive 32-bit words are assigned
to successive banks, and each bank has a bandwidth of 32 bits per clock cycle.
For devices of compute capability 1.x, the warp size is 32 threads and the number of
banks is 16. A shared memory request for a warp is split into one request for the
first half of the warp and one request for the second half of the warp. Note that no
bank conflict occurs if only one memory location per bank is accessed by a half
warp of threads. Refer to the NVIDIA OpenCL Programming Guide for more
information on how accesses and banks can be matched to avoid conflicts.
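As a small illustration (a sketch, not taken from the guide), compare unit-stride and stride-2 indexing of a __local array inside a kernel:
__local float buf[64];
int lid = get_local_id(0);
/* Unit stride: the 16 threads of a half warp hit 16 different banks (no conflict). */
float a = buf[lid];
/* Stride 2: pairs of threads hit the same bank, a 2-way conflict serviced in two steps. */
float b = buf[2 * lid];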
As a starting point, the simpleMultiply kernel (Listing 3.7) calculates the output
elements of a tile of matrix C.
__kernel void simpleMultiply(__global float* a,
__global float* b,
__global float* c,
int N)
{
int row = get_global_id(1);
int col = get_global_id(0);
float sum = 0.0f;
for (int i = 0; i < TILE_DIM; i++) {
sum += a[row*TILE_DIM+i] * b[i*N+col];
}
c[row*N+col] = sum;
}
In Listing 3.7, a, b, and c are pointers to global memory for the matrices A, B, and
C, respectively; the work-group dimensions in x and y (get_local_size(0) and
get_local_size(1)) and TILE_DIM are all 16. Each thread in the 16x16 work-group
calculates one element in a tile of C. row and col are the row and column of the
element in C being calculated by a particular thread. The for loop over i multiplies
a row of A by a column of B, which is then written to C.
The effective bandwidth of this kernel is only 8.7 GBps on an NVIDIA GeForce
GTX 280 and 0.7 GBps on an NVIDIA GeForce GTX 8800. To analyze
performance, it is necessary to consider how half warps of threads access global
memory in the for loop. Each half warp of threads calculates one row of a tile of C,
which depends on a single row of A and an entire tile of B as illustrated in Figure
3.11.
Figure 3.11 Computing a row (half warp) of a tile in C using one row of A
and an entire tile of B
For each iteration i of the for loop, all threads in a half warp read the same value
from global memory (the index row*TILE_DIM+i is constant within a half warp),
resulting in 16 transactions for compute capability 1.1 or lower, and 1 transaction
for compute capability 1.2 or higher. Even though the operation requires only 1
transaction for compute capability 1.2 or higher, there is wasted bandwidth in the
transaction because only 4 bytes out of a 32-byte transaction are used. For each
iteration, the 16 threads in a half warp read a row of the B tile, which is a sequential
and coalesced access for all compute capabilities.
The performance on a device of any compute capability can be improved by reading
a tile of A into shared memory as shown in Listing 3.8.
__kernel void coalescedMultiply(__global float* a,
__global float* b,
__global float* c,
int N,
__local float aTile[TILE_DIM][TILE_DIM])
{
int row = get_global_id(1);
int col = get_global_id(0);
float sum = 0.0f;
int x = get_local_id(0);
int y = get_local_id(1);
aTile[y][x] = a[row*TILE_DIM+x];
for (int i = 0; i < TILE_DIM; i++) {
sum += aTile[y][i]* b[i*N+col];
}
c[row*N+col] = sum;
}
Listing 3.8 Using shared memory to improve the global memory load
efficiency in matrix multiplication
In Listing 3.8, each element in a tile of A is read from global memory only once, in a
fully coalesced fashion (with no wasted bandwidth), to shared memory. Within each
iteration of the for loop, a value in shared memory is broadcast to all threads in a
half warp.
In Listing 3.8, a synchronization barrier call is not needed after reading the tile of A
into shared memory because only threads within the half warp that write the data
into shared memory read the data. This kernel has an effective bandwidth of 14.3
GBps on an NVIDIA GeForce GTX 280, and 8.2 GBps on an NVIDIA GeForce
GTX 8800.
A further improvement can be made to how Listing 3.8 deals with matrix B. In
calculating a tile’s row of matrix C, the entire tile of B is read. The repeated reading
of the B tile can be eliminated by reading it into shared memory once (Listing 3.9).
__kernel void sharedABMultiply(__global float* a,
__global float* b,
__global float* c,
int N,
__local float aTile[TILE_DIM][TILE_DIM],
__local float bTile[TILE_DIM][TILE_DIM])
{
int row = get_global_id(1);
int col = get_global_id(0);
float sum = 0.0f;
int x = get_local_id(0);
int y = get_local_id(1);
aTile[y][x] = a[row*TILE_DIM+x];
bTile[y][x] = b[y*N+col];
barrier(CLK_LOCAL_MEM_FENCE);
for (int i = 0; i < TILE_DIM; i++) {
sum += aTile[y][i]* bTile[i][x];
}
c[row*N+col] = sum;
}
Note that in Listing 3.9, a barrier() call is required after reading the B tile because
a warp reads data from shared memory that were written to shared memory by
different warps. The effective bandwidth of this routine is 29.7 GBps on an
NVIDIA GeForce GTX 280 and 15.7 GBps on an NVIDIA GeForce GTX 8800.
Note that the performance improvement is not due to improved coalescing in either
case, but to avoiding redundant transfers from global memory.
The results of the various optimizations are summarized in Table 3.2.
Table 3.2 Performance improvements optimizing C = AB matrix multiply
Medium Priority: Use shared memory to avoid redundant transfers from global
memory.
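The guide next considers the product C = AAᵀ, computed by the simple kernel of Listing 3.10, which is not reproduced in this excerpt. A sketch consistent with the discussion that follows (a points to the M×16 matrix A, c to the M×M output C; names are assumptions):
__kernel void simpleMultiply(__global float* a,
                             __global float* c,
                             int M)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < TILE_DIM; i++) {
        sum += a[row*TILE_DIM+i] * a[col*TILE_DIM+i];
    }
    c[row*M+col] = sum;
}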
In Listing 3.10, the row-th, col-th element of C is obtained by taking the dot product
of the row-th and col-th rows of A. The effective bandwidth for this kernel is
1.1 GBps on an NVIDIA GeForce GTX 280 and 0.5 GBps on an NVIDIA
GeForce GTX 8800. These results are substantially lower than the corresponding
measurements for the C = AB kernel. The difference is in how threads in a half
warp access elements of A in the second term, a[col*TILE_DIM+i], for each
iteration i. For a half warp of threads, col represents sequential columns of the
transpose of A, and therefore col*TILE_DIM represents a strided access of global
memory with a stride of 16. This results in uncoalesced memory accesses on devices
with compute capability 1.1 or lower and plenty of wasted bandwidth on devices
with compute capability 1.2 or higher. The way to avoid strided access is to use
shared memory as before, except in this case a half warp reads a row of A into a
column of a shared memory tile, as shown in Listing 3.11.
__kernel void coalescedMultiply(__global float *a,
__global float *c,
int M,
__local float aTile[TILE_DIM][TILE_DIM],
__local float transposedTile[TILE_DIM][TILE_DIM])
{
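    /* The body of Listing 3.11 is truncated in this excerpt. The following is a
       sketch consistent with the surrounding description, not the original listing:
       load a tile of A into aTile with coalesced reads, load the same tile
       transposed into transposedTile, synchronize, then accumulate the dot product. */
    int row = get_global_id(1);
    int col = get_global_id(0);
    int x = get_local_id(0);
    int y = get_local_id(1);
    float sum = 0.0f;
    aTile[y][x] = a[row*TILE_DIM+x];
    transposedTile[x][y] = a[(get_group_id(0)*get_local_size(0) + y)*TILE_DIM + x];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[y][i] * transposedTile[i][x];
    }
    c[row*M+col] = sum;
}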
Listing 3.11 uses the shared transposedTile to avoid uncoalesced accesses in the
second term in the dot product, and the shared aTile technique from the previous
example to avoid uncoalesced accesses in the first term. The effective bandwidth of
this kernel is 24.9 GBps on an NVIDIA GeForce GTX 280 and 13.2 GBps on an
NVIDIA GeForce GTX 8800. These results are slightly lower than those obtained
by the final kernel for C = AB. The cause of the difference is shared memory bank
conflicts.
The reads of elements in transposedTile within the for loop are free of conflicts,
because threads of each half warp read across rows of the tile, resulting in unit stride
across the banks. However, bank conflicts occur when copying the tile from global
memory into shared memory. To enable the loads from global memory to be
coalesced, data are read from global memory sequentially. However, this requires
writing to shared memory in columns, and because of the use of 16x16 tiles in
shared memory, this results in a stride between threads of 16 banks. These 16-way
bank conflicts are very expensive. The simple remedy is to pad the shared memory
array so that it has an extra column, as in the following line of code.
__local float transposedTile[TILE_DIM][TILE_DIM+1];
This padding eliminates the conflicts entirely, because now the stride between
threads is 17 banks, which, due to modular arithmetic used to compute bank
indices, is equivalent to a unit stride. After this change, the effective bandwidth is
30.4 GBps on an NVIDIA GeForce GTX 280 and 15.6 GBps on an NVIDIA
GeForce GTX 8800, which is comparable to the results from the last C = AB
kernel.
The results of these optimizations are summarized in Table 3.3.
Table 3.3 Performance improvements optimizing C = AAᵀ matrix
multiplication
These results should be compared with those in Table 3.2. As can be seen from
these tables, judicious use of shared memory can dramatically improve performance.
The examples in this section have illustrated three ways to use shared memory:
- To enable coalesced accesses to global memory, especially to avoid large strides
  (for general matrices, strides are much larger than 16)
- To eliminate (or reduce) redundant loads from global memory
- To avoid wasted bandwidth
Low Priority: For kernels with long argument lists, place some arguments into
constant memory to save shared memory.
¹The automatic handling of boundary cases in the bottom row of Table 3.4 refers to how a texture coordinate is
resolved when it falls outside the valid addressing range. There are two options: clamp and repeat. If x is the
coordinate and N is the number of texels for a one-dimensional texture, then with clamp, x is replaced by 0 if x < 0
and by 1-1/N if 1 ≤ x. With repeat, x is replaced by frac(x), where frac(x) = x – floor(x) and floor(x) returns the largest
integer less than or equal to x. So, in clamp mode where N = 1, an x of 1.3 is clamped to 1.0; whereas in repeat
mode, it is converted to 0.3.
3.2.6 Registers
Generally, accessing a register consumes zero extra clock cycles per instruction, but
delays may occur due to register read-after-write dependencies and register memory
bank conflicts.
The latency of read-after-write dependencies is approximately 24 cycles, but this
latency is completely hidden on multiprocessors that have at least 192 active threads
(that is, 6 warps).
The compiler and hardware thread scheduler will schedule instructions as optimally
as possible to avoid register memory bank conflicts. They achieve the best results
when the number of threads per block is a multiple of 64. Other than following this
rule, an application has no direct control over these bank conflicts. In particular,
there is no register-related reason to pack data into float4 or int4 types.
One of the keys to good performance is to keep the multiprocessors on the device
as busy as possible. A device in which work is poorly balanced across the
multiprocessors will deliver suboptimal performance. Hence, it’s important to
design your application to use threads and blocks in a way that maximizes hardware
utilization and to limit practices that impede the free distribution of work. A key
concept in this effort is occupancy, which is explained in the following sections.
Another important concept is the management of system resources allocated for a
particular task. How to manage this resource utilization is discussed in the final
sections of this chapter.
4.1 Occupancy
Thread instructions are executed sequentially in CUDA, and, as a result, executing
other warps when one warp is paused or stalled is the only way to hide latencies and
keep the hardware busy. Some metric related to the number of active warps on a
multiprocessor is therefore important in determining how effectively the hardware is
kept busy. This metric is occupancy.
Occupancy is the ratio of the number of active warps per multiprocessor to the
maximum number of possible active warps. (To determine the latter number, see
the oclDeviceQuery program in the NVIDIA GPU Computing SDK or refer to
Appendix A in the NVIDIA OpenCL Programming Guide.) Another way to view
occupancy is the percentage of the hardware’s ability to process warps that are
actively in use.
Higher occupancy does not always equate to higher performance—there is a point
above which additional occupancy does not improve performance. However, low
occupancy always interferes with the ability to hide memory latency, resulting in
performance degradation.
Figure 4.1 Use the CUDA GPU Occupancy Calculator to project occupancy
Medium Priority: To hide latency arising from register dependencies, maintain at least
25 percent occupancy on devices with compute capability 1.1 and lower, and 18.75
percent occupancy on later devices.
Medium Priority: The number of threads per block should be a multiple of 32 threads,
because this provides optimal computing efficiency and facilitates coalescing.
The dimension and size of blocks per grid and the dimension and size of threads
per block are both important factors. The multidimensional aspect of these
parameters allows easier mapping of multidimensional problems to OpenCL and
does not play a role in performance. As a result, this section discusses size but not
dimension.
Latency hiding and occupancy depend on the number of active warps per
multiprocessor, which is implicitly determined by the execution parameters along
with resource (register and shared memory) constraints. Choosing execution
parameters is a matter of striking a balance between latency hiding (occupancy) and
resource utilization.
Choosing the NDRange parameters should be done in tandem; however, there are
certain heuristics that apply to each parameter individually. When choosing the
number of blocks per grid or grid size (i.e. number of work groups in OpenCL
terminology), the primary concern is keeping the entire GPU busy. The number of
blocks in a grid should be larger than the number of multiprocessors so that all
multiprocessors have at least one block to execute. Furthermore, there should be
multiple active blocks per multiprocessor so that blocks that aren’t waiting at a
barrier can keep the hardware busy. Using a single thread to process multiple
elements of a shared memory array can be beneficial even if limits such as threads
per block are not an issue. This is because some common operations can be
performed by a thread once and the cost amortized over the number of shared
memory elements processed by the thread.
A useful technique to determine the sensitivity of performance to occupancy is
through experimentation with the amount of dynamically allocated shared memory.
In OpenCL, the size of any __local pointer argument is specified outside the kernel
using clSetKernelArg(). By simply increasing this amount, it is possible to effectively
reduce the occupancy of the kernel and measure its effect on performance.
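For example, a sketch (the kernel object, argument index, and size are assumptions):
/* Reserve 4 KB of dynamically allocated __local memory for kernel argument 3.
   For __local arguments, only the size is meaningful; the value pointer is NULL. */
size_t localBytes = 4096;
clSetKernelArg(kernel, 3, localBytes, NULL);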
As mentioned in the previous section, once an occupancy of more than 50 percent
has been reached, it generally does not pay to optimize parameters to obtain higher
occupancy ratios. The previous technique can be used to determine whether such a
plateau has been reached.
Integer division and modulo operations are particularly costly and should be avoided
or replaced with bitwise operations whenever possible: if n is a power of 2, (i/n) is
equivalent to (i >> log2(n)) and (i % n) is equivalent to (i & (n-1)).
The compiler will perform these conversions if n is a literal constant. (For further information,
refer to Chapter 3 of the NVIDIA OpenCL Programming Guide).
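A small sketch of the equivalence (the variable i is a placeholder):
unsigned int i = get_global_id(0);
const unsigned int n = 32;        /* a power of 2 */
unsigned int q = i >> 5;          /* same result as i / n, since log2(32) == 5 */
unsigned int r = i & (n - 1);     /* same result as i % n */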
High Priority: Minimize the use of global memory. Prefer shared memory access
where possible.
Memory instructions include any instruction that reads from or writes to shared,
local, or global memory. The throughput of memory operations is 8 operations
per clock cycle. When accessing local or global memory, there are, in addition, 400
to 600 clock cycles of memory latency.
As an example, the throughput for the assignment operator in the following sample
code
__local float shared[32];
__global float* device;
shared[get_local_id(0)] = device[get_local_id(0)];
is 8 operations per clock cycle to issue a read from global memory, 8 operations per
clock cycle to issue a write to shared memory, but, crucially, there is a latency of 400
to 600 clock cycles to read data from global memory.
Much of this global memory latency can be hidden by the thread scheduler if there
are sufficient independent arithmetic instructions that can be issued while waiting
for the global memory access to complete. However, it is best to avoid accessing
global memory whenever possible.
High Priority: Avoid different execution paths within the same warp.
Any flow control instruction (if, switch, do, for, while) can significantly affect
the instruction throughput by causing threads of the same warp to diverge; that is,
to follow different execution paths. If this happens, the different execution paths
must be serialized, increasing the total number of instructions executed for this
warp. When all the different execution paths have completed, the threads converge
back to the same execution path.
To obtain best performance in cases where the control flow depends on the thread
ID, the controlling condition should be written so as to minimize the number of
divergent warps.
This is possible because the distribution of the warps across the block is
deterministic as mentioned in section 2.1.1 of the NVIDIA OpenCL Programming
Guide. A trivial example is when the controlling condition depends only on
(get_local_id(0) / WSIZE), where WSIZE is the warp size.
In this case, no warp diverges because the controlling condition is perfectly aligned
with the warps.
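A sketch of the idea (WSIZE and the branch bodies are placeholders):
int tid = get_local_id(0);
/* Divergent: threads within the same warp take different paths. */
if (tid % 2 == 0) { /* ... */ } else { /* ... */ }
/* Non-divergent: the condition is constant across each warp, so whole
   warps take one path or the other. */
if ((tid / WSIZE) % 2 == 0) { /* ... */ } else { /* ... */ }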
Low Priority: Make it easy for the compiler to use branch predication in lieu of loops
or control statements.
Sometimes, the compiler may unroll loops or optimize out if or switch statements
by using branch predication instead. In these cases, no warp can ever diverge. The
programmer can also control loop unrolling using
#pragma unroll
For more information on this pragma, refer to the NVIDIA OpenCL Programming
Guide.
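For instance, a sketch reusing the loop from Listing 3.9:
/* Ask the compiler to unroll this fixed-trip-count loop. */
#pragma unroll
for (int i = 0; i < TILE_DIM; i++) {
    sum += aTile[y][i] * bTile[i][x];
}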
When using branch predication, none of the instructions whose execution depends
on the controlling condition is skipped. Instead, each such instruction is associated
with a per-thread condition code or predicate that is set to true or false according to
the controlling condition. Although each of these instructions is scheduled for
execution, only the instructions with a true predicate are actually executed.
Instructions with a false predicate do not write results, and they also do not evaluate
addresses or read operands.
The compiler replaces a branch instruction with predicated instructions only if the
number of instructions controlled by the branch condition is less than or equal to a
certain threshold: If the compiler determines that the condition is likely to produce
many divergent warps, this threshold is 7; otherwise it is 4.
This appendix contains a list of all the recommendations for optimization and the
list of best practices that are explained in this document.
Trademarks
NVIDIA, the NVIDIA logo, CUDA, GeForce, NVIDIA Quadro, and Tesla are trademarks or registered
trademarks of NVIDIA Corporation. Other company and product names may be trademarks of the respective
companies with which they are associated.
Copyright
© 2009 NVIDIA Corporation. All rights reserved.