OpenCL Programming Guide for the CUDA Architecture
Version 3.2
8/16/2010
Table of Contents
Figure 1-1. Floating-Point Operations per Second and Memory Bandwidth for the CPU and GPU
Figure 1-2. The GPU Devotes More Transistors to Data Processing
Figure 1-3. CUDA is Designed to Support Various Languages and Application Programming Interfaces
Figure 1-4. Automatic Scalability
Figure 2-1. Grid of Thread Blocks
Figure 2-2. Matrix Multiplication without Shared Memory
Figure 2-3. Matrix Multiplication with Shared Memory
The reason behind the discrepancy in floating-point capability between the CPU and
the GPU is that the GPU is specialized for compute-intensive, highly parallel
computation – exactly what graphics rendering is about – and therefore designed
such that more transistors are devoted to data processing rather than data caching
and flow control, as schematically illustrated by Figure 1-2.
Figure 1-2. The GPU Devotes More Transistors to Data Processing (schematic CPU vs. GPU layout of ALUs, cache, and DRAM)
More specifically, the GPU is especially well-suited to address problems that can be
expressed as data-parallel computations – the same program is executed on many
data elements in parallel – with high arithmetic intensity – the ratio of arithmetic
operations to memory operations. Because the same program is executed for each
data element, there is a lower requirement for sophisticated flow control; and
because it is executed on many data elements and has high arithmetic intensity, the
memory access latency can be hidden with calculations instead of big data caches.
Data-parallel processing maps data elements to parallel processing threads. Many
applications that process large data sets can use a data-parallel programming model
to speed up the computations. In 3D rendering, large sets of pixels and vertices are
mapped to parallel threads. Similarly, image and media processing applications such
as post-processing of rendered images, video encoding and decoding, image scaling,
stereo vision, and pattern recognition can map image blocks and pixels to parallel
processing threads. In fact, many algorithms outside the field of image rendering
and processing are accelerated by data-parallel processing, from general signal
processing or physics simulation to computational finance or computational biology.
This scalable programming model allows the CUDA architecture to span a wide
market range by simply scaling the number of processors and memory partitions:
from the high-performance enthusiast GeForce GTX 280 GPU and professional
Quadro and Tesla computing products to a variety of inexpensive, mainstream
GeForce GPUs (see Appendix A for a list of all CUDA-enabled GPUs).
Figure 1-4. Automatic Scalability. A multithreaded program is partitioned into blocks of threads that execute independently from each other, so that a GPU with more cores will automatically execute the program in less time than a GPU with fewer cores.
Figure 2-1. Grid of Thread Blocks
A thread is also given a unique thread ID within its block. The local ID of a thread
and its thread ID relate to each other in a straightforward way: For a one-
dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy),
the thread ID of a thread of index (x, y) is (x + y Dx); for a three-dimensional block
of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is
(x + y Dx + z Dx Dy).
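For example, the flattened thread ID described above can be computed inside a kernel from the standard work-item built-ins. The snippet below is a sketch, not one of the guide's listings; the kernel name, output buffer, and single-work-group assumption are illustrative.
// Sketch: computing the flattened thread ID (x + y Dx + z Dx Dy) of the
// current work-item within its block, using standard OpenCL built-ins.
__kernel void flatThreadId(__global uint* out) {
    size_t x  = get_local_id(0);
    size_t y  = get_local_id(1);
    size_t z  = get_local_id(2);
    size_t Dx = get_local_size(0);
    size_t Dy = get_local_size(1);
    size_t threadId = x + y * Dx + z * Dx * Dy;
    // For a single work-group launch, each work-item writes its own flattened ID.
    out[threadId] = (uint)threadId;
}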
When an OpenCL program on the host invokes a kernel, the work-groups are
enumerated and distributed as thread blocks to the multiprocessors with available
execution capacity. The threads of a thread block execute concurrently on one
multiprocessor. As thread blocks terminate, new blocks are launched on the vacated
multiprocessors.
A multiprocessor is designed to execute hundreds of threads concurrently. To
manage such a large number of threads, it employs a unique architecture called
SIMT (Single-Instruction, Multiple-Thread) that is described in Section 2.1.1.
If a non-atomic instruction executed by a warp writes to the same location in
global or shared memory for more than one of the threads of the warp, the number
of serialized writes that occur to that location varies depending on the compute
capability of the device (see Sections C.3.2, C.3.3, C.4.2, and C.4.3) and which thread
performs the final write is undefined.
If an atomic instruction executed by a warp reads, modifies, and writes to the same
location in global memory for more than one of the threads of the warp, each read,
modify, write to that location occurs and they are all serialized, but the order in
which they occur is undefined.
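As a small illustration of this serialization (the kernel name and counter argument are ours, not the guide's), every work-item below atomically increments the same global counter; all read-modify-write sequences of a warp are performed, one after another, in an undefined order.
// Requires the cl_khr_global_int32_base_atomics extension.
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel void countThreads(__global int* counter) {
    // Every work-item targets the same address; the hardware serializes
    // the read-modify-write sequences within a warp (order undefined).
    atom_add(counter, 1);
}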
The total amount of shared memory Sblock in bytes allocated for a block is as follows:
Sblock = ceil(Sk, GS)
where Sk is the amount of shared memory used by the kernel in bytes, GS is the shared memory allocation granularity, and ceil(x, y) denotes x rounded up to the nearest multiple of y.
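For example, assuming an allocation granularity GS of 512 bytes, a kernel that uses Sk = 1000 bytes of shared memory is allocated Sblock = 1024 bytes per block.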
2.2 Compilation
2.2.1 PTX
Kernels written in OpenCL C are compiled into PTX, which is CUDA’s instruction
set architecture and is described in a separate document.
Currently, the PTX intermediate representation can be obtained by calling
clGetProgramInfo() with CL_PROGRAM_BINARIES. It can be passed to
clCreateProgramWithBinary() to create a program object only if it is
produced and consumed by the same driver. This will likely not be supported in
future versions.
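A minimal host-side sketch of this query, assuming a program object that was built for a single device (the helper name dumpPTX is ours):
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

// Retrieve and print the PTX generated for a program built for one device.
void dumpPTX(cl_program program) {
    size_t binarySize;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                     sizeof(binarySize), &binarySize, NULL);
    unsigned char* ptx = (unsigned char*)malloc(binarySize);
    // CL_PROGRAM_BINARIES expects an array of pointers, one per device.
    clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                     sizeof(ptx), &ptx, NULL);
    fwrite(ptx, 1, binarySize, stdout);
    free(ptx);
}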
2.2.2 Volatile
Only after the execution of barrier(), mem_fence(), read_mem_fence(), or
write_mem_fence() are prior writes to global or shared memory of a given
thread guaranteed to be visible by other threads. As long as this requirement is met,
the compiler is free to optimize reads and writes to global or shared memory. For
example, in the code sample below, the first reference to myArray[tid] compiles
into a global or shared memory read instruction, but the second reference does not
as the compiler simply reuses the result of the first read.
// myArray is an array of non-zero integers
// located in global or shared memory
__kernel void myKernel(__global int* result) {
int tid = get_local_id(0);
int ref1 = myArray[tid] * 1;
myArray[tid + 1] = 2;
int ref2 = myArray[tid] * 1;
result[tid] = ref1 * ref2;
}
Therefore, ref2 cannot possibly be equal to 2 in thread tid as a result of thread
tid-1 overwriting myArray[tid] by 2.
This behavior can be changed using the volatile keyword: If a variable located in
global or shared memory is declared as volatile, the compiler assumes that its value
can be changed at any time by another thread and therefore any reference to this
variable compiles to an actual memory read instruction.
Note that even if myArray is declared as volatile in the code sample above, there is
no guarantee, in general, that ref2 will be equal to 2 in thread tid since thread
tid might read myArray[tid] into ref2 before thread tid-1 overwrites its
value by 2. Synchronization is required.
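As a sketch of the volatile variant (here myArray is made an explicit kernel argument so the sample is self-contained), declaring the pointer volatile forces both reads to be issued; a barrier is still needed to order them against the write from thread tid-1.
// myArray is an array of non-zero integers located in global memory
__kernel void myKernel(__global volatile int* myArray,
                       __global int* result) {
    int tid = get_local_id(0);
    int ref1 = myArray[tid] * 1;   // compiles to an actual memory read
    myArray[tid + 1] = 2;
    int ref2 = myArray[tid] * 1;   // compiles to a second memory read
    result[tid] = ref1 * ref2;
}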
// Host code
// Invoke kernel
cl_uint i = 0;
clSetKernelArg(matMulKernel, i++,
sizeof(d_A.width), (void*)&d_A.width);
clSetKernelArg(matMulKernel, i++,
sizeof(d_A.height), (void*)&d_A.height);
clSetKernelArg(matMulKernel, i++,
sizeof(d_A.elements), (void*)&d_A.elements);
clSetKernelArg(matMulKernel, i++,
sizeof(d_B.width), (void*)&d_B.width);
clSetKernelArg(matMulKernel, i++,
sizeof(d_B.height), (void*)&d_B.height);
clSetKernelArg(matMulKernel, i++,
sizeof(d_B.elements), (void*)&d_B.elements);
clSetKernelArg(matMulKernel, i++,
sizeof(d_C.width), (void*)&d_C.width);
clSetKernelArg(matMulKernel, i++,
sizeof(d_C.height), (void*)&d_C.height);
clSetKernelArg(matMulKernel, i++,
sizeof(d_C.elements), (void*)&d_C.elements);
size_t localWorkSize[] = { BLOCK_SIZE, BLOCK_SIZE };
// One work-item per element of C (dimensions assumed to be multiples of BLOCK_SIZE)
size_t globalWorkSize[] = { B.width, A.height };
clEnqueueNDRangeKernel(queue, matMulKernel, 2, 0,
globalWorkSize, localWorkSize,
0, 0, 0);
// Kernel code
Figure 2-2. Matrix Multiplication without Shared Memory
The following code sample is an implementation of matrix multiplication that does
take advantage of shared memory. In this implementation, each thread block is
responsible for computing one square sub-matrix Csub of C and each thread within
the block is responsible for computing one element of Csub. Csub is equal to the
product of two rectangular matrices: the sub-matrix of A of dimension
(A.width, block_size) that has the same row indices as Csub, and the sub-matrix of B
of dimension (block_size, A.width) that has the same column indices as Csub. In
order to fit into the device's resources, these two rectangular matrices are
divided into as many square matrices of dimension block_size as necessary and Csub is
computed as the sum of the products of these square matrices. Each of these
products is performed by first loading the two corresponding square matrices from
global memory to shared memory with one thread loading one element of each
matrix, and then by having each thread compute one element of the product. Each
thread accumulates the result of each of these products into a register and once
done writes the result to global memory.
By blocking the computation this way, we take advantage of fast shared memory
and save a lot of global memory bandwidth since A is only read (B.width / block_size)
times from global memory and B is read (A.height / block_size) times.
The Matrix type from the previous code sample is augmented with a stride field, so
that sub-matrices can be efficiently represented with the same type.
// Host code
d_C.elements = clCreateBuffer(context,
CL_MEM_WRITE_ONLY, size, 0, 0);
// Invoke kernel
cl_uint i = 0;
clSetKernelArg(matMulKernel, i++,
sizeof(d_A.width), (void*)&d_A.width);
clSetKernelArg(matMulKernel, i++,
sizeof(d_A.height), (void*)&d_A.height);
clSetKernelArg(matMulKernel, i++,
sizeof(d_A.stride), (void*)&d_A.stride);
clSetKernelArg(matMulKernel, i++,
sizeof(d_A.elements), (void*)&d_A.elements);
clSetKernelArg(matMulKernel, i++,
sizeof(d_B.width), (void*)&d_B.width);
clSetKernelArg(matMulKernel, i++,
sizeof(d_B.height), (void*)&d_B.height);
clSetKernelArg(matMulKernel, i++,
sizeof(d_B.stride), (void*)&d_B.stride);
clSetKernelArg(matMulKernel, i++,
sizeof(d_B.elements), (void*)&d_B.elements);
clSetKernelArg(matMulKernel, i++,
sizeof(d_C.width), (void*)&d_C.width);
clSetKernelArg(matMulKernel, i++,
sizeof(d_C.height), (void*)&d_C.height);
clSetKernelArg(matMulKernel, i++,
sizeof(d_C.stride), (void*)&d_C.stride);
clSetKernelArg(matMulKernel, i++,
sizeof(d_C.elements), (void*)&d_C.elements);
size_t localWorkSize[] = { BLOCK_SIZE, BLOCK_SIZE };
// One work-item per element of C (dimensions assumed to be multiples of BLOCK_SIZE)
size_t globalWorkSize[] = { B.width, A.height };
clEnqueueNDRangeKernel(queue, matMulKernel, 2, 0,
globalWorkSize, localWorkSize,
0, 0, 0);
// Kernel code
Figure 2-3. Matrix Multiplication with Shared Memory
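The guide's full kernel listing is not reproduced here. A minimal sketch of a tiled kernel consistent with the description above might look as follows; it assumes row-major storage, matrix dimensions that are multiples of BLOCK_SIZE, and the same argument order as the host code (width, height, stride, elements for A, B, then C).
#define BLOCK_SIZE 16

__kernel void matMulKernel(int Awidth, int Aheight, int Astride,
                           __global float* Aelements,
                           int Bwidth, int Bheight, int Bstride,
                           __global float* Belements,
                           int Cwidth, int Cheight, int Cstride,
                           __global float* Celements)
{
    // Shared (local) memory tiles for the sub-matrices of A and B.
    __local float As[BLOCK_SIZE][BLOCK_SIZE];
    __local float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Each work-group computes one BLOCK_SIZE x BLOCK_SIZE sub-matrix Csub of C.
    int blockRow = get_group_id(1);
    int blockCol = get_group_id(0);
    // Each work-item computes one element of Csub.
    int row = get_local_id(1);
    int col = get_local_id(0);

    float Cvalue = 0.0f;

    // Loop over the sub-matrices of A and B required to compute Csub.
    for (int m = 0; m < Awidth / BLOCK_SIZE; ++m) {
        // Each work-item loads one element of each tile from global memory.
        As[row][col] = Aelements[(blockRow * BLOCK_SIZE + row) * Astride
                                 + m * BLOCK_SIZE + col];
        Bs[row][col] = Belements[(m * BLOCK_SIZE + row) * Bstride
                                 + blockCol * BLOCK_SIZE + col];

        // Wait until both tiles are fully loaded.
        barrier(CLK_LOCAL_MEM_FENCE);

        // Multiply the two tiles together, accumulating into a register.
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];

        // Wait until the tiles are no longer needed before overwriting them.
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Each work-item writes its accumulated result to global memory.
    Celements[(blockRow * BLOCK_SIZE + row) * Cstride
              + blockCol * BLOCK_SIZE + col] = Cvalue;
}
Note the two barrier() calls: the first ensures both tiles are loaded before any work-item starts the inner product, and the second prevents a work-item from overwriting a tile that another work-item is still reading.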
same kernel invocation, or they belong to different blocks, in which case they must
share data through global memory using two separate kernel invocations, one for
writing to and one for reading from global memory. The second case is much less
optimal since it adds the overhead of extra kernel invocations and global memory
traffic. Its occurrence should therefore be minimized by mapping the algorithm to
the OpenCL programming model in such a way that the computations that require
inter-thread communication are performed within a single thread block as much as
possible.
execution has not completed yet. In the case of a back-to-back register dependency
(i.e., some input operand is written by the previous instruction), the latency is equal
to the execution time of the previous instruction and the warp scheduler must
schedule instructions for different warps during that time. Execution time varies
depending on the instruction, but it is typically about 22 clock cycles, which
translates to 6 warps for devices of compute capability 1.x and 11 warps for devices
of compute capability 2.0.
If some input operand resides in off-chip memory, the latency is much higher: 400
to 800 clock cycles. The number of warps required to keep the warp scheduler busy
during such high latency periods depends on the kernel code; in general, more warps
are required if the ratio of the number of instructions with no off-chip memory
operands (i.e., arithmetic instructions most of the time) to the number of
instructions with off-chip memory operands is low (this ratio is commonly called
the arithmetic intensity of the program). If this ratio is 10, for example, then to hide
latencies of about 600 clock cycles, about 15 warps are required for devices of
compute capability 1.x and about 30 for devices of compute capability 2.0.
Another reason a warp is not ready to execute its next instruction is that it is waiting
at some memory fence or synchronization point. A synchronization point can force
the multiprocessor to idle as more and more warps wait for other warps in the same
block to complete execution of instructions prior to the synchronization point.
Having multiple resident blocks per multiprocessor can help reduce idling in this
case, as warps from different blocks do not need to wait for each other at
synchronization points.
The number of blocks and warps residing on each multiprocessor for a given kernel
call depends on the NDRange of the call, the memory resources of the
multiprocessor, and the resource requirements of the kernel as described in Section
2.1.2. To assist programmers in choosing thread block size based on register and
shared memory requirements, the CUDA Software Development Kit provides a
spreadsheet, called the CUDA Occupancy Calculator, where occupancy is defined as
the ratio of the number of resident warps to the maximum number of resident
warps (given in Appendix C for various compute capabilities).
Register, local, shared, and constant memory usages are reported by the compiler
when compiling with the -cl-nv-verbose build option (see
cl_nv_compiler_options extension).
The total amount of shared memory required for a block is equal to the sum of the
amount of statically allocated shared memory, the amount of dynamically allocated
shared memory, and for devices of compute capability 1.x, the amount of shared
memory used to pass the kernel’s arguments.
The number of registers used by a kernel can have a significant impact on the
number of resident warps. For example, for devices of compute capability 1.2, if a
kernel uses 16 registers and each block has 512 threads and requires very little
shared memory, then two blocks (i.e., 32 warps) can reside on the multiprocessor
since they require 2x512x16 registers, which exactly matches the number of registers
available on the multiprocessor. But as soon as the kernel uses one more register,
only one block (i.e., 16 warps) can be resident since two blocks would require
2x512x17 registers, which is more registers than are available on the multiprocessor.
Therefore, the compiler attempts to minimize register usage while keeping register
spilling (see Section 3.3.2.2) and the number of instructions to a minimum. Register
usage can be controlled using the -cl-nv-maxrregcount build option.
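As a hedged host-side example (the option string follows the cl_nv_compiler_options extension; the value 32 is arbitrary), both build options are simply passed to clBuildProgram():
// Report resource usage and cap register usage at 32 registers per work-item.
clBuildProgram(program, 1, &device,
               "-cl-nv-verbose -cl-nv-maxrregcount=32", NULL, NULL);
// The verbose resource report appears in the build log, retrieved with
// clGetProgramBuildInfo() and CL_PROGRAM_BUILD_LOG.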
Each double variable (on devices that support native double precision, i.e. devices
of compute capability 1.3 and higher) and each long long variable uses two
registers. However, devices of compute capability 1.2 and higher have at least twice
as many registers per multiprocessor as devices with lower compute capability.
The effect of NDRange on performance for a given kernel call generally depends on
the kernel code. Experimentation is therefore recommended. Applications should set
the work-group size explicitly rather than relying on the OpenCL implementation to
determine it (which is what happens when local_work_size is set to NULL in
clEnqueueNDRangeKernel()). Applications can also parameterize NDRanges
based on register file size and shared memory size, which depends on the compute
capability of the device, as well as on the number of multiprocessors and memory
bandwidth of the device, all of which can be queried using the runtime or driver
API (see reference manual).
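A minimal sketch of such queries, assuming existing device and kernel handles named device and kernel:
cl_uint computeUnits;          // number of multiprocessors
cl_ulong localMemSize;         // shared (local) memory per multiprocessor
size_t maxWorkGroupSize;       // device-wide work-group size limit
size_t kernelWorkGroupSize;    // limit for this particular kernel

clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(computeUnits), &computeUnits, NULL);
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(localMemSize), &localMemSize, NULL);
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(kernelWorkGroupSize), &kernelWorkGroupSize, NULL);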
The number of threads per block should be chosen as a multiple of the warp size to
avoid wasting computing resources with under-populated warps as much as
possible.
For devices of compute capability 2.0, the same on-chip memory is used for both L1
and shared memory, and how much of it is dedicated to L1 versus shared memory is
configurable for each kernel call.
The throughput of memory accesses by a kernel can vary by an order of magnitude
depending on access pattern for each type of memory. The next step in maximizing
memory throughput is therefore to organize memory accesses as optimally as
possible based on the optimal memory access patterns described in Sections 3.3.2.1,
3.3.2.3, 3.3.2.4, and 3.3.2.5. This optimization is especially important for global
memory accesses as global memory bandwidth is low, so non-optimal global
memory accesses have a higher impact on performance.
Global memory instructions support reading or writing words of size equal to 1, 2, 4,
8, or 16 bytes. Any access to data residing in global memory compiles to a single
global memory instruction if and only if the size of the data type is 1, 2, 4, 8, or
16 bytes and the data is naturally aligned (i.e. its address is a multiple of that size).
If this size and alignment requirement is not fulfilled, the access compiles to
multiple instructions with interleaved access patterns that prevent these instructions
from fully coalescing. It is therefore recommended to use types that meet this
requirement for data that resides in global memory.
The alignment requirement is automatically fulfilled for built-in types.
For structures, the size and alignment requirements can be enforced by the compiler
using the alignment specifiers __attribute__ ((aligned(8))) or
__attribute__ ((aligned(16))), such as
struct {
float a;
float b;
} __attribute__ ((aligned(8)));
or
struct {
float a;
float b;
float c;
} __attribute__ ((aligned(16)));
Any address of a variable residing in global memory or returned by one of the
memory allocation routines from the driver or runtime API is always aligned to at
least 256 bytes.
Reading non-naturally aligned 8-byte or 16-byte words produces incorrect results
(off by a few words), so special care must be taken to maintain alignment of the
starting address of any value or array of values of these types. A typical case where
this might be easily overlooked is when using some custom global memory
allocation scheme, whereby the allocations of multiple arrays (with multiple calls to
cudaMalloc() or cuMemAlloc()) is replaced by the allocation of a single large
block of memory partitioned into multiple arrays, in which case the starting address
of each array is offset from the block’s starting address.
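A sketch of such a scheme that preserves alignment (the 256-byte bound matches the minimum alignment quoted above; the element counts and the context handle are placeholders):
// Round an offset up to the next multiple of a power-of-two alignment.
#define ALIGN_UP(offset, alignment) \
    (((offset) + (alignment) - 1) & ~((size_t)(alignment) - 1))

size_t bytesA  = numA * sizeof(cl_float4);
size_t offsetB = ALIGN_UP(bytesA, 256);            // second array starts aligned
size_t total   = offsetB + numB * sizeof(cl_float4);

cl_mem block = clCreateBuffer(context, CL_MEM_READ_WRITE, total, NULL, NULL);
// Pass offsetB to the kernel so that accesses to the second array start at a
// properly aligned address within the block.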
Automatic variables that the compiler is likely to place in local memory are:
Arrays for which it cannot determine that they are indexed with constant
quantities,
Large structures or arrays that would consume too much register space,
Any variable if the kernel uses more registers than available (this is also known
as register spilling).
Note that some mathematical functions have implementation paths that might
access local memory.
The local memory space resides in device memory, so local memory accesses have the
same high latency and low bandwidth as global memory accesses and are subject to
the same requirements for memory coalescing as described in Section 3.3.2.1. Local
memory is however organized such that consecutive 32-bit words are accessed by
consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads
in a warp access the same relative address (e.g. same index in an array variable, same
member in a structure variable).
On devices of compute capability 2.0, local memory accesses are always cached in
L1 and L2 in the same way as global memory accesses (see Section C.4.2).
The resulting requests are then serviced at the throughput of the constant cache in
case of a cache hit, or at the throughput of device memory otherwise.
All throughputs are for one multiprocessor. They must be multiplied by the number
of multiprocessors in the device to get throughput for the whole device.
Throughput of native arithmetic instructions (operations per clock cycle per multiprocessor):

                                                      Compute           Compute
                                                      Capability 1.x    Capability 2.0
32-bit floating-point add, multiply, multiply-add     8                 32
64-bit floating-point add, multiply, multiply-add     1                 16
32-bit integer add, logical operation, shift,
  compare                                             8                 32
24-bit integer multiply (mul24(x,y))                  8                 Multiple instructions
32-bit integer multiply, multiply-add, sum of
  absolute difference                                 Multiple          32
                                                      instructions
32-bit floating-point reciprocal, reciprocal
  square root, base-2 logarithm (native_log),
  base-2 exponential (native_exp),
  sine (native_sin), cosine (native_cos)              2                 4
Type conversions                                      8                 32
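For example, a compute capability 2.0 device with 14 multiprocessors (the count is only illustrative) can sustain up to 14 x 32 = 448 32-bit floating-point adds per clock cycle across the whole device.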
Other instructions and functions are implemented on top of the native instructions.
The implementation may be different for devices of compute capability 1.x and
devices of compute capability 2.0, and the number of native instructions after
compilation may fluctuate with every compiler version.
Single-Precision Floating-Point Division
native_divide(x, y) provides faster single-precision floating-point division
than the division operator.
Single-Precision Floating-Point Reciprocal Square Root
To preserve IEEE-754 semantics the compiler cannot optimize 1.0/sqrt() into
rsqrt(). It is therefore recommended to invoke native_rsqrt() directly
where desired.
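A small sketch contrasting the precise operations with their native_ counterparts (the built-ins used below are standard OpenCL C; the accuracy of the native_ versions is implementation-defined):
__kernel void normalize2(__global float2* v, int n) {
    int i = get_global_id(0);
    if (i >= n) return;
    float2 p = v[i];
    float len2 = p.x * p.x + p.y * p.y;
    // Precise but slower alternative:  float inv = 1.0f / sqrt(len2);
    // Faster, reduced accuracy:
    float inv = native_rsqrt(len2);
    v[i] = p * inv;
    // Similarly, native_divide(a, b) trades accuracy for speed versus a / b.
}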
Single-Precision Floating-Point Square Root
Single-precision floating-point square root is implemented as a reciprocal square
root followed by a reciprocal instead of a reciprocal square root followed by a
multiplication, so that it gives correct results for 0 and infinity.
Because a warp executes one common instruction at a time, threads within a warp
are implicitly synchronized and this can sometimes be used to omit barrier() for
better performance.
In the following code sample, for example, both calls to barrier() are required to
get the expected result (i.e. result[i] = 2 * myArray[i] for i > 0).
Without synchronization, each of the two references to myArray[tid] could
return either 2 or the value initially stored in myArray, depending on whether the
memory read occurs before or after the memory write from
myArray[tid + 1] = 2.
// myArray is an array of integers located in global or shared
// memory
__kernel void myKernel(__global int* result) {
int tid = get_local_id(0);
...
int ref1 = myArray[tid] * 1;
barrier(CLK_LOCAL_MEM_FENCE|CLK_GLOBAL_MEM_FENCE);
myArray[tid + 1] = 2;
barrier(CLK_LOCAL_MEM_FENCE|CLK_GLOBAL_MEM_FENCE);
int ref2 = myArray[tid] * 1;
result[tid] = ref1 * ref2;
...
}
However, in the following slightly modified code sample, threads are guaranteed to
belong to the same warp, so that there is no need for any barrier().
// myArray is an array of integers located in global or shared
// memory
__kernel void myKernel(__global int* result) {
int tid = get_local_id(0);
...
const int warpSize = 32;  // warp size of current NVIDIA GPUs (not an OpenCL built-in)
if (tid < warpSize) {
int ref1 = myArray[tid] * 1;
myArray[tid + 1] = 2;
int ref2 = myArray[tid] * 1;
result[tid] = ref1 * ref2;
}
...
}
Simply removing the barrier() is not enough however; myArray must also be
declared as volatile as described in Section 2.2.2.
Table C-1 lists all CUDA-enabled devices with their compute capability, number of
multiprocessors, and number of CUDA cores.
These, as well as the clock frequency and the total amount of device memory, can
be queried using the runtime or driver API (see reference manual).
The general specifications and features of a compute device depend on its compute
capability (see Section 2.3).
Section C.1 gives the features and technical specifications associated to each
compute capability.
Section C.2 reviews the compliance with the IEEE floating-point standard.
Sections C.3 and C.4 give more details on the architecture of devices of compute
capability 1.x and 2.0, respectively.
The following extensions are supported on all CUDA-enabled devices:
cl_khr_byte_addressable_store
cl_khr_icd
cl_nv_compiler_options
cl_nv_device_attribute_query
cl_nv_pragma_unroll
cl_khr_gl_sharing
cl_nv_d3d9_sharing
cl_nv_d3d10_sharing
cl_khr_d3d10_sharing
cl_nv_d3d11_sharing
The following extensions are supported only on devices of sufficiently high compute capability:
cl_khr_global_int32_base_atomics (compute capability 1.1 and higher)
cl_khr_global_int32_extended_atomics (compute capability 1.1 and higher)
cl_khr_local_int32_base_atomics (compute capability 1.2 and higher)
cl_khr_local_int32_extended_atomics (compute capability 1.2 and higher)
cl_khr_fp64 (compute capability 1.3 and higher)
Appendix C. Compute Capabilities
C.3 Compute Capability 1.x
C.3.1 Architecture
For devices of compute capability 1.x, a multiprocessor consists of:
8 CUDA cores for integer and single-precision floating-point arithmetic
operations,
1 double-precision floating-point unit for double-precision floating-point
arithmetic operations,
2 special function units for single-precision floating-point transcendental
functions (these units can also handle single-precision floating-point
multiplications),
1 warp scheduler.
To execute an instruction for all threads of a warp, the warp scheduler must
therefore issue the instruction over:
4 clock cycles for an integer or single-precision floating-point arithmetic
instruction,
32 clock cycles for a double-precision floating-point arithmetic instruction,
16 clock cycles for a single-precision floating-point transcendental instruction.
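These counts follow directly from the number of execution units: a warp's 32 single-precision operations issued to 8 CUDA cores take 32 / 8 = 4 clock cycles, its 32 double-precision operations issued to 1 unit take 32 clock cycles, and its 32 transcendental operations issued to 2 special function units take 32 / 2 = 16 clock cycles.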
A multiprocessor also has a read-only constant cache that is shared by all functional
units and speeds up reads from the constant memory space, which resides in device
memory.
Multiprocessors are grouped into Texture Processor Clusters (TPCs). The number of
multiprocessors per TPC is 2 for devices of compute capabilities 1.0 and 1.1, and
3 for devices of compute capabilities 1.2 and 1.3.
To determine the memory transactions necessary to service all the threads of a
half-warp, the following protocol is used:
Find the memory segment that contains the address requested by the lowest
numbered active thread. The segment size depends on the size of the words
accessed by the threads:
32 bytes for 1-byte words,
64 bytes for 2-byte words,
128 bytes for 4-, 8- and 16-byte words.
Find all other active threads whose requested address lies in the same segment.
Reduce the transaction size, if possible:
If the transaction size is 128 bytes and only the lower or upper half is used,
reduce the transaction size to 64 bytes;
If the transaction size is 64 bytes (originally or after reduction from 128
bytes) and only the lower or upper half is used, reduce the transaction size
to 32 bytes.
Carry out the transaction and mark the serviced threads as inactive.
Repeat until all threads in the half-warp are serviced.
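For example, if the 16 threads of a half-warp read 16 consecutive 4-byte floats starting at an address that is a multiple of 128 bytes, all requested addresses fall within a single 128-byte segment, only its lower half is used, and the request is therefore served by a single 64-byte transaction.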
Shared memory features a broadcast mechanism whereby a 32-bit word can be read
and broadcast to several threads simultaneously when servicing one memory read
request. This reduces the number of bank conflicts when several threads read from
an address within the same 32-bit word. More precisely, a memory read request
made of several addresses is serviced in several steps over time by servicing one
conflict-free subset of these addresses per step until all addresses have been
serviced; at each step, the subset is built from the remaining addresses that have yet
to be serviced using the following procedure:
Select one of the words pointed to by the remaining addresses as the broadcast
word;
Include in the subset:
All addresses that are within the broadcast word,
One address for each bank (other than the broadcasting bank) pointed to
by the remaining addresses.
Which word is selected as the broadcast word and which address is picked up for
each bank at each cycle are unspecified.
A common conflict-free case is when all threads of a half-warp read from an address
within the same 32-bit word.
Figure C-3 shows some examples of memory read accesses that involve the
broadcast mechanism. The same examples apply for devices of compute capability
1.x, but with 16 banks instead of 32.
// Split each double into two 32-bit halves to avoid the bank conflict:
__local int shared_lo[32], shared_hi[32];
double dataIn;
int2 tmp = as_int2(dataIn);
shared_lo[BaseIndex + tid] = tmp.x;
shared_hi[BaseIndex + tid] = tmp.y;
C.4.1 Architecture
For devices of compute capability 2.0, a multiprocessor consists of:
32 CUDA cores for integer and floating-point arithmetic operations,
4 special function units for single-precision floating-point transcendental
functions,
2 warp schedulers.
At every instruction issue time, each scheduler issues an instruction for some warp
that is ready to execute, if any. The first scheduler is in charge of the warps with an
odd ID and the second scheduler is in charge of the warps with an even ID. Note
that when a scheduler issues a double-precision floating-point instruction, the other
scheduler cannot issue any instruction.
A warp scheduler can issue an instruction to only half of the CUDA cores. To
execute an instruction for all threads of a warp, a warp scheduler must therefore
issue the instruction over:
2 clock cycles for an integer or floating-point arithmetic instruction,
2 clock cycles for a double-precision floating-point arithmetic instruction,
8 clock cycles for a single-precision floating-point transcendental instruction.
A multiprocessor also has a read-only uniform cache that is shared by all functional
units and speeds up reads from the constant memory space, which resides in device
memory.
There is an L1 cache for each multiprocessor and an L2 cache shared by all
multiprocessors, both of which are used to cache accesses to local or global
memory, including temporary register spills.
Multiprocessors are grouped into Graphics Processor Clusters (GPCs). A GPC includes
four multiprocessors.
Each multiprocessor has a read-only texture cache to speed up reads from the
texture memory space, which resides in device memory. It accesses the texture cache
via a texture unit that implements the various addressing modes and data filtering.
Unlike for devices of compute capability 1.x, there are no bank conflicts for arrays
of doubles accessed as follows, for example:
__local double shared[32];
double data = shared[BaseIndex + tid];
128-Bit Accesses
The majority of 128-bit accesses will cause 2-way bank conflicts, even if no two
threads in a quarter-warp access different addresses belonging to the same bank.
Therefore, to determine the ways of bank conflicts, one must add 1 to the
maximum number of threads in a quarter-warp that access different addresses
belonging to the same bank.
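For reference, a 128-bit shared memory access is simply an access to a 16-byte element such as a float4 (the array below is a sketch in the style of the earlier samples):
__local float4 shared[32];
// Each work-item reads one float4, i.e. one 128-bit shared memory access,
// which on these devices typically incurs a 2-way bank conflict.
float4 data = shared[BaseIndex + tid];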
Left: Linear addressing with a stride of one 32-bit word (no bank conflict).
Middle: Linear addressing with a stride of two 32-bit words (2-way bank conflicts).
Right: Linear addressing with a stride of three 32-bit words (no bank conflict).
NVIDIA Corporation
2701 San Tomas Expressway
Santa Clara, CA 95050
www.nvidia.com