
TUNING CUDA APPLICATIONS FOR KEPLER

DA-06288-001_v7.0 | August 2014

Application Note
TABLE OF CONTENTS

Chapter 1. Kepler Tuning Guide
  1.1. NVIDIA Kepler Compute Architecture
  1.2. CUDA Best Practices
  1.3. Application Compatibility
  1.4. Kepler Tuning
    1.4.1. Device Utilization and Occupancy
    1.4.2. Managing Coarse-Grained Parallelism
      1.4.2.1. Concurrent Kernels
      1.4.2.2. Hyper-Q
      1.4.2.3. Dynamic Parallelism
    1.4.3. Shared Memory and Warp Shuffle
      1.4.3.1. Shared Memory Bandwidth
      1.4.3.2. Shared Memory Capacity
      1.4.3.3. Warp Shuffle
    1.4.4. Memory Throughput
      1.4.4.1. Increased Addressable Registers Per Thread
      1.4.4.2. L1 Cache
      1.4.4.3. Read-Only Data Cache
      1.4.4.4. Fast Global Memory Atomics
      1.4.4.5. Global Memory Bandwidth and GPU Boost
      1.4.4.6. 2D Memory Copies
    1.4.5. Instruction Throughput
      1.4.5.1. Single-precision vs. Double-precision
    1.4.6. GPU Boost
    1.4.7. Multi-GPU
    1.4.8. PCIe 3.0
    1.4.9. Warp-synchronous Programming
Appendix A. References
Appendix B. Revision History

Chapter 1.
KEPLER TUNING GUIDE

1.1. NVIDIA Kepler Compute Architecture


Kepler is NVIDIA's 3rd-generation architecture for CUDA compute applications.
Kepler retains and extends the same CUDA programming model as in earlier NVIDIA
architectures such as Fermi, and applications that follow the best practices for the Fermi
architecture should typically see speedups on the Kepler architecture without any code
changes. This guide summarizes the ways that an application can be fine-tuned to gain
additional speedups by leveraging Kepler architectural features.¹
The Kepler architecture comprises two major variants: GK104 and GK110. A detailed
overview of the major improvements in GK104² and GK110³ over the earlier Fermi
architecture is described in a pair of whitepapers [1][2] entitled NVIDIA GeForce
GTX 680: The fastest, most efficient GPU ever built for GK104 and NVIDIA's Next
Generation CUDA Compute Architecture: Kepler GK110 for GK110.
For details on the programming features discussed in this guide, please refer to the
CUDA C Programming Guide. Details on the architectural features are covered in the
architecture whitepapers referenced above. Some of the Kepler features described in
this guide are specific to GK110, as noted; if not specified, Kepler features refer
to both GK104 and GK110.

¹ Throughout this guide, Fermi refers to devices of compute capability 2.x and
Kepler refers to devices of compute capability 3.x. GK104 has compute capability
3.0; GK110 has compute capability 3.5.
² The features of GK107 and of GK20X are similar to those of GK104.
³ The features of GK110B and of GK210 are similar to those of GK110 except where
noted.

1.2. CUDA Best Practices


The performance guidelines and best practices described in the CUDA C Programming
Guide [3] and the CUDA C Best Practices Guide [4] apply to all CUDA-capable GPU
architectures. Programmers must primarily focus on following those recommendations
to achieve the best performance.
The high-priority recommendations from those guides are as follows:


‣ Find ways to parallelize sequential code,
‣ Minimize data transfers between the host and the device,
‣ Adjust kernel launch configuration to maximize device utilization,
‣ Ensure global memory accesses are coalesced,
‣ Minimize redundant accesses to global memory whenever possible,
‣ Avoid different execution paths within the same warp.

1.3. Application Compatibility


Before addressing the specific performance tuning issues covered in this guide, refer to
the Kepler Compatibility Guide for CUDA Applications to ensure that your application is
being compiled in a way that will be compatible with Kepler.
Note that many of the GK110 architectural features described in this document require
the device code in the application to be compiled for its native compute capability 3.5
target architecture (sm_35).

1.4. Kepler Tuning


1.4.1. Device Utilization and Occupancy
Kepler's new Streaming Multiprocessor, called SMX, has significantly more CUDA
Cores than the SM of Fermi GPUs, yielding a throughput improvement of 2-3x per
clock.⁴ Furthermore, GK110 has increased memory bandwidth over Fermi and GK104.
To match these throughput increases, we need roughly twice as much parallelism per
multiprocessor on Kepler GPUs, via an increased number of active warps of threads,
increased instruction-level parallelism (ILP), or some combination thereof.
Balancing this is the fact that GK104 ships with only 8 multiprocessors, half the
multiprocessor count of Fermi GF110, meaning that GK104 needs roughly the same
total amount of parallelism as Fermi GF110, though it needs more parallelism per
multiprocessor to achieve this. Since GK110 can have up to 15 multiprocessors,
similar to the multiprocessor count of Fermi GF110, GK110 typically needs a larger
amount of parallelism than Fermi or GK104.

⁴ Note, however, that Kepler clocks are generally lower than Fermi clocks for
improved power efficiency.

To enable the increased per-multiprocessor warp occupancy beneficial to both GK104
and GK110, several important multiprocessor resources have been significantly
increased in SMX:
‣ Kepler increases the size of the register file over Fermi by 2x per multiprocessor;
GK210 increases this by a further 2x. On Fermi, the number of registers available
was the primary limiting factor of occupancy for many kernels. On Kepler, these
kernels can automatically fit more thread blocks per multiprocessor. For example,
a kernel using 63 registers per thread and 256 threads per block can fit at most 16
concurrent warps per multiprocessor on Fermi (out of a maximum of 48, i.e., 33%
theoretical occupancy). The same configuration can fit 32 warps per multiprocessor
on GK104 and GK110 (out of a maximum of 64, i.e., 50% theoretical occupancy), or
64 warps (100% theoretical occupancy) on GK210.
‣ Kepler has increased the maximum number of simultaneous blocks per
multiprocessor from 8 to 16. As a result, kernels having their occupancy limited
due to reaching the maximum number of thread blocks per multiprocessor will see
increased theoretical occupancy in Kepler.
‣ GK210 more than doubles the shared memory capacity versus Fermi and earlier
Kepler GPUs.
Note that these automatic occupancy improvements require kernel launches with
sufficient total thread blocks to fill Kepler. For this reason, it remains a best practice to
launch kernels with significantly more thread blocks than necessary to fill current GPUs,
allowing this kind of scaling to occur naturally without modifications to the application.
The CUDA Occupancy Calculator [5] spreadsheet is a valuable tool in visualizing the
achievable occupancy for various kernel launch configurations.
Also note that Kepler GPUs can utilize ILP in place of thread/warp-level parallelism
(TLP) more readily than Fermi GPUs can. Furthermore, some degree of ILP in
conjunction with TLP is required by Kepler GPUs in order to approach peak single-
precision performance, since SMX's warp scheduler issues one or two independent
instructions from each of four warps per clock. ILP can be increased by means of, for
example, processing several data items concurrently per thread or unrolling loops in
the device code, though note that either of these approaches may also increase register
pressure.
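
Processing several data items per thread can be written directly; a minimal sketch
(the kernel name and the unroll factor of four are illustrative, not from this
guide):

    // Each thread handles four independent elements, giving the SMX
    // scheduler independent instructions whose latencies can overlap
    // (ILP) in addition to the parallelism across warps (TLP).
    __global__ void scale4(float *out, const float *in, float alpha, int n)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
        if (i + 3 < n) {   // tail handling omitted for brevity
            float a = in[i + 0], b = in[i + 1];
            float c = in[i + 2], d = in[i + 3];
            out[i + 0] = alpha * a;
            out[i + 1] = alpha * b;
            out[i + 2] = alpha * c;
            out[i + 3] = alpha * d;
        }
    }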

1.4.2. Managing Coarse-Grained Parallelism


Since GK110 requires more concurrently active threads than either GK104 or Fermi,
GK110 introduces several features that can assist applications having more limited
parallelism, where the expanded multiprocessor resources described in Device
Utilization and Occupancy are difficult to leverage from any single kernel launch. These
improvements allow the application to more readily use several concurrent kernel grids
to fill GK110:

1.4.2.1. Concurrent Kernels


Since the introduction of Fermi, applications have had the ability to launch several
kernels concurrently. This provides a mechanism by which applications can fill the
device with several smaller kernel launches simultaneously as opposed to a single larger
one. On Fermi and on GK104, at most 16 kernels can execute concurrently; GK110 allows
up to 32 concurrent kernels to execute, which can provide a speedup for applications
with necessarily small (but independent) kernel launches.
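
A minimal sketch of this pattern (the kernel, buffer array, and stream count are
illustrative):

    #include <cuda_runtime.h>

    __global__ void smallKernel(float *p, int n)   // illustrative workload
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] += 1.0f;
    }

    void launchConcurrently(float *d_bufs[], int numStreams, int n)
    {
        cudaStream_t *streams = new cudaStream_t[numStreams];
        for (int i = 0; i < numStreams; ++i)
            cudaStreamCreate(&streams[i]);
        // Launches in distinct streams are independent and therefore
        // eligible to run concurrently (up to 32 grids on GK110).
        for (int i = 0; i < numStreams; ++i)
            smallKernel<<<(n + 255) / 256, 256, 0, streams[i]>>>(d_bufs[i], n);
        for (int i = 0; i < numStreams; ++i) {
            cudaStreamSynchronize(streams[i]);
            cudaStreamDestroy(streams[i]);
        }
        delete[] streams;
    }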

1.4.2.2. Hyper-Q
GK110 further improves this with the addition of Hyper-Q, which removes the false
dependencies that can be introduced among CUDA streams in cases of suboptimal
kernel launch or memory copy ordering across streams in Fermi or GK104. Hyper-Q
allows GK110 to handle the concurrent kernels and/or memory transfers in separate
CUDA streams truly independently, rather than serializing the several streams into a
single work queue at the hardware level. This allows applications to enqueue work into
separate CUDA streams without considering the relative order of insertion of otherwise
independent work, making concurrency of multiple kernels as well as overlapping of
memory copies with computation much more readily achievable on GK110.
CUDA streams are automatically mapped onto Hyper-Q's multiple hardware work
queues via connections to the hardware allocated by the CUDA Driver. While it is
possible to allocate more CUDA streams than there are connections, this simply implies
that the driver will alias several streams onto some or all of those connections. The
CUDA_DEVICE_MAX_CONNECTIONS environment variable can be used to specify the
preferred number of connections to be allocated to the driver. The default is 8 (or fewer if
CUDA Multi-Process Service is in use); the architectural maximum for GK110 is 32.
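
Because the variable is read when the CUDA context is created, it must be set
before the application's first CUDA call. A minimal sketch (the value 32 assumes a
GK110-class device; more commonly the variable is simply set in the shell before
launching the application):

    #include <stdlib.h>
    #include <cuda_runtime.h>

    int main()
    {
        // Request the GK110 architectural maximum; setenv() is POSIX and
        // must run before the CUDA context is created.
        setenv("CUDA_DEVICE_MAX_CONNECTIONS", "32", 1);
        cudaFree(0);   // common idiom to force context creation here
        // ... create streams and enqueue work as usual ...
        return 0;
    }
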
CUDA Multi-Process Service (MPS) presents another means by which applications can
take advantage of Hyper-Q, wherein several host processes (typically MPI processes)
share access to and submit work to the same GPU concurrently, each process receiving
some subset of the available connections to that GPU. Using CUDA MPS, processes
can achieve overlap of their respective memory transfers and computations with
or without the use of CUDA streams, although at the cost of some added latency of
work submission and a few other caveats. For more information see the CUDA MPS
Overview[6].

1.4.2.3. Dynamic Parallelism


GK110 also introduces a new architectural feature called Dynamic Parallelism, which
allows the GPU to create additional work for itself. A programming model enhancement
that leverages this architectural feature was introduced in CUDA 5.0 to enable kernels
running on GK110 to launch additional kernels onto the same GPU. Nested kernel
launches are done via the same <<<>>> triple-angle bracket notation used from the host
and can make use of the familiar CUDA streams interface to specify whether or not
the kernels launched are independent of one another. More than one GPU thread can
simultaneously launch kernel grids (of the same or different kernels), further increasing
the application's flexibility in keeping the GPU filled with parallel work.
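
A minimal sketch of a nested launch (kernel names are illustrative; device-side
launches require compiling with nvcc -arch=sm_35 -rdc=true -lcudadevrt):

    #include <cstdio>

    __global__ void childKernel(int parentBlock)
    {
        if (threadIdx.x == 0)
            printf("child launched by block %d\n", parentBlock);
    }

    __global__ void parentKernel()
    {
        // One thread per block launches a nested grid using the same
        // <<<...>>> syntax as a host-side launch.
        if (threadIdx.x == 0)
            childKernel<<<1, 32>>>(blockIdx.x);
        // Device-side synchronization waits for the child grid to finish.
        cudaDeviceSynchronize();
    }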

1.4.3. Shared Memory and Warp Shuffle


1.4.3.1. Shared Memory Bandwidth
In balance with the increased computational throughput in Kepler's SMX described
in Device Utilization and Occupancy, shared memory bandwidth in SMX is twice
that of Fermi's SM. This bandwidth increase is exposed to the application through a
configurable new 8-byte shared memory bank mode. When this mode is enabled, 64-
bit (8-byte) shared memory accesses (such as loading a double-precision floating point
number from shared memory) achieve twice the effective bandwidth of 32-bit (4-byte)
accesses. Applications that are sensitive to shared memory bandwidth can benefit from
enabling this mode as long as their kernels' accesses to shared memory are for 8-byte
entities wherever possible.
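
The bank mode is selected through the runtime API; a minimal sketch (the kernel is
a hypothetical one whose shared memory traffic is predominantly 64-bit):

    #include <cuda_runtime.h>

    __global__ void reverseTile(double *data)     // hypothetical kernel
    {
        __shared__ double tile[256];              // assumes 256-thread blocks
        tile[threadIdx.x] = data[threadIdx.x];    // 8-byte shared accesses
        __syncthreads();
        data[threadIdx.x] = tile[255 - threadIdx.x];
    }

    void enableEightByteBanks()
    {
        // Device-wide default for subsequent launches:
        cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
        // Or per kernel, overriding the device-wide setting:
        cudaFuncSetSharedMemConfig(reverseTile, cudaSharedMemBankSizeEightByte);
    }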


1.4.3.2. Shared Memory Capacity


Fermi introduced an L1 cache in addition to the shared memory available since the
earliest CUDA-capable GPUs. In Fermi, the shared memory and the L1 cache share the
same physical on-chip storage, and a split of 48 KB shared memory / 16 KB L1 cache or
vice versa can be selected per application or per kernel launch. Kepler continues this
pattern and introduces an additional setting of 32 KB shared memory / 32 KB L1 cache,
the use of which may benefit L1 hit rate in kernels that need more than 16 KB but less
than 48 KB of shared memory per multiprocessor.
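
The additional split is selected with the existing cache-configuration API; a
minimal sketch (the kernel name is illustrative):

    #include <cuda_runtime.h>

    __global__ void myKernel() { /* uses between 16 KB and 32 KB of shared */ }

    void preferEqualSplit()
    {
        // Device-wide preference for 32 KB shared memory / 32 KB L1:
        cudaDeviceSetCacheConfig(cudaFuncCachePreferEqual);
        // Or per kernel:
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferEqual);
    }
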
Since the maximum shared memory capacity per multiprocessor remains 48 KB on
GK104 and GK110, however, applications that depend on shared memory capacity
either at a per-block level for data exchange or at a per-thread level for additional
thread-private storage may require some rebalancing on these GPUs to improve
achievable occupancy. Two features present possible alternatives for this
rebalancing: the Warp Shuffle operation can replace data-exchange uses of shared
memory, and the Increased Addressable Registers Per Thread can substitute for
thread-private uses of shared memory.
GK210 improves on this by increasing the shared memory capacity per multiprocessor
for each of the configurations described above by a further 64 KB (i.e., the application
can select 112 KB, 96 KB, or 80 KB of shared memory), which provides automatic
occupancy improvements for kernels limited by shared memory capacity.

The maximum shared memory per thread block for all Kepler GPUs, including GK210,
remains 48 KB.

1.4.3.3. Warp Shuffle


Kepler introduces a new warp-level intrinsic called the shuffle operation. This feature
allows the threads of a warp to exchange data with each other directly without going
through shared (or global) memory. The shuffle instruction also has lower latency than
shared memory access and does not consume shared memory space for data exchange,
so this can present an attractive way for applications to rapidly interchange data among
threads.
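
For example, a warp-wide sum can be kept entirely in registers. The sketch below
uses the Kepler-era __shfl_down() intrinsic (CUDA 9.0 and later rename it to
__shfl_down_sync() with an added mask argument) and assumes Kepler's 32-thread
warp width:

    // Warp-level reduction via shuffle: no shared memory, no barrier.
    __device__ float warpReduceSum(float val)
    {
        // Each step folds the upper half of the lanes into the lower
        // half; after five steps lane 0 holds the sum for the warp.
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down(val, offset);
        return val;
    }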

1.4.4. Memory Throughput


1.4.4.1. Increased Addressable Registers Per Thread
GK110 increases the maximum number of registers addressable per thread from 63 to
255. This can improve performance of bandwidth-limited kernels that have significant
register spilling on Fermi or GK104. Experimentation should be used to determine the
optimum balance of spilling vs. occupancy, however, as significant increases in the
number of registers used per thread naturally decreases the warp occupancy that can be
achieved, which trades off latency due to memory traffic for arithmetic latency due to
fewer concurrent warps.
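
One way to run such experiments is to bound register usage per kernel with
__launch_bounds__ (or per compilation unit with the -maxrregcount flag to nvcc);
the bounds in this sketch are illustrative:

    // At most 256 threads per block, and a request for at least 4
    // resident blocks per multiprocessor, which limits how many
    // registers each thread may use before the compiler spills.
    __global__ void __launch_bounds__(256, 4)
    boundedKernel(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;   // placeholder computation
    }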


1.4.4.2. L1 Cache
L1 caching in Kepler GPUs is reserved only for local memory accesses, such as register
spills and stack data. Global loads are cached in L2 only (or in the Read-Only Data
Cache).
GK110B-based products such as the Tesla K40 GPU Accelerator and GK210 retain this
behavior by default but also allow applications to opt-in to the Fermi-style behavior
of caching both global and local loads in L1. To select this mode, pass the
-Xptxas -dlcm=ca flag to nvcc at compile time.

1.4.4.3. Read-Only Data Cache


GK110 adds the ability for read-only data in global memory to be loaded through the
same cache used by the texture pipeline via a standard pointer without the need to bind
a texture beforehand and without the sizing limitations of standard textures. Since this
is a separate cache with a separate memory pipe and with relaxed memory coalescing
rules, use of this feature can benefit the performance of bandwidth-limited kernels.
This feature will be automatically enabled and utilized where possible by the compiler
when compiling for GK110 as long as certain conditions are met. Foremost among these
requirements is that the data must be guaranteed read-only for the duration of the kernel,
as the read-only data cache is incoherent with respect to writes. In order to allow the
compiler to detect that this condition is satisfied, a necessary (but not always sufficient)
condition is that pointers used for loading such data should be marked with both the
const and __restrict__ qualifiers. Note that adding these qualifiers where applicable
can improve code generation quality via other mechanisms on earlier GPUs as well.
In cases where more explicit control over the read-only data cache mechanism is desired
than the const __restrict__ qualifiers provide, or where the code is sufficiently
complex that the compiler is unable to detect that the read-only data cache is safe to use,
the __ldg() intrinsic can be used in place of a normal pointer dereference to force the
load to go through the read-only data cache.
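
A sketch of both forms (an illustrative SAXPY-style kernel; __ldg() requires
compute capability 3.5):

    __global__ void axpy(float * __restrict__ y,
                         const float * __restrict__ x,   // read-only hint
                         float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Explicit form; with the const __restrict__ qualifiers above,
            // a plain x[i] may be routed through the same cache automatically.
            y[i] = a * __ldg(&x[i]) + y[i];
        }
    }
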
Note that the read-only data cache accessed via const __restrict__ is separate and
distinct from the constant cache accessed via the __constant__ qualifier. Data loaded
through the constant cache must be relatively small and must be accessed uniformly
for good performance (i.e., all threads of a warp should access the same location at
any given time), whereas data loaded through the read-only data cache can be much
larger and can be accessed in a non-uniform pattern. These two data paths can be used
simultaneously for different data if desired.

1.4.4.4. Fast Global Memory Atomics


Global memory atomic operations have dramatically higher throughput on Kepler than
on Fermi. Algorithms requiring multiple threads to update the same location in memory
concurrently have at times on earlier GPUs resorted to complex data rearrangements in
order to minimize the number of atomics required. Given the improvements in global
memory atomic performance, many atomics can be performed on Kepler nearly as
quickly as memory loads. This may simplify implementations requiring atomicity or
enable algorithms previously deemed impractical.
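
As an illustration, a simple global-memory histogram (bin count and input layout
are illustrative) becomes practical on Kepler without the per-block privatization
such kernels often needed on earlier GPUs:

    // Many threads may update the same bin concurrently.
    __global__ void histogram256(unsigned int *bins,
                                 const unsigned char *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[data[i]], 1u);
    }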


1.4.4.5. Global Memory Bandwidth and GPU Boost


GK110B provides higher memory clocks (and, by extension, higher peak global memory
bandwidth) than GK110. For the GK110B-based Tesla K40 GPU Accelerator, while all of
the GPU Boost clock settings use the same 3 GHz memory clock, the effective memory
bandwidth utilization can typically be increased by using the highest boost setting for
SM core clocks as well.

1.4.4.6. 2D Memory Copies


The effective bandwidth of cudaMemcpy2D() operations is best when avoiding the use
of small device pitches together with large host pitches (>64 KB).
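
For reference, a minimal sketch of such a copy (parameter names are illustrative;
the pitch guidance above applies to dpitch and spitch):

    #include <cuda_runtime.h>

    // widthBytes is the width of the copied region in bytes; dpitch and
    // spitch are the full row strides of destination and source.
    void copyRows(void *dDst, size_t dpitch, const void *hSrc, size_t spitch,
                  size_t widthBytes, size_t height)
    {
        cudaMemcpy2D(dDst, dpitch, hSrc, spitch, widthBytes, height,
                     cudaMemcpyHostToDevice);
    }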

1.4.5. Instruction Throughput


While the maximum instructions per clock (IPC) of both floating-point and integer
operations has been either increased or maintained in Kepler as compared to Fermi, the
relative ratios of maximum IPC for various specific instructions have changed somewhat.
Refer to the CUDA C Programming Guide for details.

1.4.5.1. Single-precision vs. Double-precision


As one example of these instruction throughput ratios, an important difference between
GK104 and GK110 is the ratio of peak single-precision to peak double-precision
floating point performance. Whereas GK104 focuses primarily on high single-precision
throughput, GK110 significantly improves the peak double-precision throughput over
Fermi as well. Applications that depend heavily on high double-precision performance
will generally perform best with GK110.

1.4.6. GPU Boost


NVIDIA GPU Boost is a feature available on the GK110B-based Tesla K40 GPU
Accelerator that makes use of power headroom to run the SM core clock to a higher
frequency. While the default clock for K40 is set to the base clock, which is necessary
for some applications that are demanding on power (e.g., DGEMM), many application
workloads are less demanding on power and can take advantage of a higher boost clock
setting for added performance.
GK210 further improves GPU Boost functionality. For example, the Tesla K80 GPU
Accelerator provides many more possible boost clock settings than Tesla K40, and
Tesla K80 defaults to a dynamic boost mode, wherein power headroom is leveraged by
increasing clock speeds automatically.
GPU Boost clocks (and optional disabling of dynamic boost for GK210) can be selected
through nvidia-smi or NVML.
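
A hedged sketch of the NVML route (link against the NVML library; the clock pair
should come from nvmlDeviceGetSupportedMemoryClocks() and
nvmlDeviceGetSupportedGraphicsClocks() rather than being hard-coded, and setting
application clocks may require administrative privileges):

    #include <nvml.h>

    // Select an application (boost) clock pair on device 0.
    void setBoostClocks(unsigned int memMHz, unsigned int smMHz)
    {
        nvmlDevice_t dev;
        nvmlInit();
        nvmlDeviceGetHandleByIndex(0, &dev);
        nvmlDeviceSetApplicationsClocks(dev, memMHz, smMHz);
        nvmlShutdown();
    }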

1.4.7. Multi-GPU
NVIDIA's Tesla K10 GPU Accelerator is a dual-GK104 board; the Tesla K80 GPU
Accelerator is a dual-GK210 board. As with other dual-GPU NVIDIA boards, the two
GPUs on these boards will appear as two separate CUDA devices; they have separate
memories and operate independently. As such, applications that will target the Tesla K10
or K80 GPU Accelerator but that are not yet multi-GPU aware should begin preparing
for the multi-GPU paradigm. Since dual-GPU boards appear to the host application
exactly the same as two separate single-GPU boards, enabling applications for multi-
GPU can benefit application performance on a wide range of systems where more than
one CUDA-capable GPU can be installed.
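
A minimal sketch of the enumeration pattern (work partitioning is illustrative);
each GPU of a dual-GPU board appears as an ordinary device index:

    #include <cuda_runtime.h>

    void forEachDevice()
    {
        int count = 0;
        cudaGetDeviceCount(&count);   // a Tesla K80 contributes two devices
        for (int d = 0; d < count; ++d) {
            cudaSetDevice(d);         // subsequent calls target device d
            // allocate this device's share of the data and launch kernels
        }
    }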

1.4.8. PCIe 3.0


Kepler's interconnection to the host system has been enhanced to support PCIe 3.0. For
applications where host-to-device, device-to-host, or device-to-device transfer time
is significant and not easily overlapped with computation, the additional bandwidth
provided by PCIe 3.0, given the requisite host system support, over the earlier PCIe 2.0
specification supported by Fermi GPUs should boost application performance without
modifications to the application. Note that best PCIe transfer speeds to or from system
memory with either PCIe generation are achieved when using pinned system memory.
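
A minimal sketch of staging a transfer through pinned memory (sizes are
illustrative):

    #include <cuda_runtime.h>

    void pinnedUpload(float *d_buf, const float *src, size_t n)
    {
        float *h_buf = NULL;
        // Page-locked allocation enables full-speed DMA over PCIe.
        cudaMallocHost((void **)&h_buf, n * sizeof(float));
        for (size_t i = 0; i < n; ++i) h_buf[i] = src[i];   // stage the data
        cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaFreeHost(h_buf);
    }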

In the Tesla K10 and K80 GPU Accelerator products, the two GPUs sharing a board
are connected via an on-board PCIe 3.0 switch. Since these GPUs are also capable of
GPUDirect Peer-to-Peer transfers, the inter-device memory transfers between GPUs
on the same board can run at PCIe 3.0 speeds even if the host system supports only
PCIe 2.0 or earlier.

While the Kepler architecture is compliant with the PCIe 3.0 specification, not all
Kepler-based products support PCIe 3.0 speeds. For example, while Tesla K10, K40,
and K80 support PCIe 3.0, Tesla K20 and K20X do not.

PCIe 3.0 throughputs may be improved in some circumstances by using the highest-
available GPU Boost clock.

1.4.9. Warp-synchronous Programming


As a means of mitigating the cost of repeated block-level synchronizations, particularly
in parallel primitives such as reduction and prefix sum, some programmers exploit
the knowledge that threads in a warp execute in lock-step with each other to omit
__syncthreads() in some places where it is semantically necessary for correctness in
the CUDA programming model.
The absence of an explicit synchronization in a program where different threads
communicate via memory constitutes a data race condition or synchronization
error. Warp-synchronous programs are unsafe and easily broken by evolutionary
improvements to the optimization strategies used by the CUDA compiler toolchain,
which generally has no visibility into cross-thread interactions of this variety in the
absence of barriers, or by changes to the hardware memory subsystem's behavior.
Such programs also tend to assume that the warp size is 32 threads, which may not
necessarily be the case for all future CUDA-capable architectures.

Therefore, programmers should avoid warp-synchronous programming to ensure
future-proof correctness in CUDA applications. When threads in a block must
communicate or synchronize with each other, regardless of whether those threads
are expected to be in the same warp or not, the appropriate barrier primitives should
be used. Note that the Warp Shuffle primitive presents a future-proof, supported
mechanism for intra-warp communication that can safely be used as an alternative in
many cases.
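
For instance, a block-level reduction should retain its barrier even once only a
single warp's worth of elements remains, rather than switching to an implicitly
warp-synchronous tail; a sketch (a 256-thread block is assumed):

    __global__ void blockReduceSum(float *out, const float *in)
    {
        __shared__ float s[256];
        int t = threadIdx.x;
        s[t] = in[blockIdx.x * blockDim.x + t];
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (t < stride)
                s[t] += s[t + stride];
            __syncthreads();   // keep this barrier even when stride < 32
        }
        if (t == 0)
            out[blockIdx.x] = s[0];
    }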

Appendix A.
REFERENCES

[1] NVIDIA GeForce GTX 680: The fastest, most efficient GPU ever built.
http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf
[2] NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110.
http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
[3] CUDA C Programming Guide.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/
[4] CUDA C Best Practices Guide.
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
[5] CUDA Occupancy Calculator spreadsheet.
http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
[6] Sharing A GPU Between MPI Processes: Multi-Process Service (MPS) Overview.
http://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf

Appendix B.
REVISION HISTORY

Version 0.9
‣ CUDA 5.0 Preview Release

Version 1.0
‣ Added discussion of ILP vs TLP (see Device Utilization and Occupancy).
‣ Expanded discussion of cache behaviors (see Memory Throughput).
‣ Added section regarding Warp-synchronous Programming.
‣ Added section regarding 2D Memory Copies.
‣ Minor corrections and clarifications.

Version 1.1
‣ Clarified const __restrict__ discussion and mentioned __ldg() intrinsic in
Read-Only Data Cache.

Version 1.2
‣ Add references to GK110B, which allows an opt-in to the caching of global loads in
the L1 Cache and enables higher clock speeds via GPU Boost.
‣ Expand discussion of ILP in Device Utilization and Occupancy.
‣ Expand discussion of Hyper-Q, adding mention of
CUDA_DEVICE_MAX_CONNECTIONS and CUDA Multi-Process Service (MPS).
‣ Clarification of PCIe 3.0 support.
‣ Add hyperlinks to all endnote references.

Version 1.3
‣ Add references to GK210, which increases register file and shared memory
capacities and enables additional GPU Boost modes versus GK110B.

Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS,
DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY,
"MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES,
EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE
MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF
NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR
PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA
Corporation assumes no responsibility for the consequences of use of such
information or for any infringement of patents or other rights of third parties
that may result from its use. No license is granted by implication or otherwise
under any patent rights of NVIDIA Corporation. Specifications mentioned in this
publication are subject to change without notice. This publication supersedes and
replaces all other information previously supplied. NVIDIA Corporation products
are not authorized as critical components in life support devices or systems
without express written approval of NVIDIA Corporation.

Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA
Corporation in the U.S. and other countries. Other company and product names
may be trademarks of the respective companies with which they are associated.

Copyright
© 2012-2014 NVIDIA Corporation. All rights reserved.
