CUDA Programming for High-Performance Computing

The document provides an overview of the CUDA programming model, developed by NVIDIA, which allows developers to utilize the parallel processing capabilities of GPUs for high-performance computing tasks. It explains the architecture, memory management, and execution flow of CUDA, highlighting its applications in various fields and the benefits and limitations of using CUDA. Key concepts such as threads, blocks, and data exchange between CPU and GPU are also discussed to illustrate how CUDA operates effectively.


MODULE 5: HIGH PERFORMANCE COMPUTING WITH CUDA

CUDA PROGRAMMING MODEL

The CUDA programming model is a powerful parallel computing platform and application programming interface (API) model developed by NVIDIA. It enables developers to leverage the massive parallel processing power of NVIDIA GPUs for High-Performance Computing (HPC) tasks. CUDA stands for Compute Unified Device Architecture.

CUDA is an extension of C/C++ programming. It is a parallel computing platform and API (Application Programming Interface) model, developed by NVIDIA, that uses the Graphics Processing Unit (GPU). It allows computations to be performed in parallel, providing substantial speed-ups. Using CUDA, one can harness the power of an NVIDIA GPU for general-purpose computing tasks, such as processing matrices and other linear algebra operations, rather than only performing graphics calculations.

Why do we need CUDA?

 GPUs are designed to perform high-speed parallel computations to display graphics such as games.
 CUDA lets applications use this widely available hardware: more than 100 million CUDA-capable GPUs are already deployed.
 It provides a 30-100x speed-up over conventional microprocessors for some applications.
 GPUs contain many small Arithmetic Logic Units (ALUs), in contrast to the few large ALUs of a CPU. This allows many parallel calculations, such as computing the colour of each pixel on the screen.

Architecture of CUDA

 A typical diagram of the G80 architecture shows 16 Streaming Multiprocessors (SMs).
 Each Streaming Multiprocessor has 8 Streaming Processors (SPs), i.e., a total of 128 Streaming Processors (SPs).
 Each Streaming Processor has a MAD unit (Multiply-Add unit) and an additional MU (multiplication unit).
 The GT200 has 30 Streaming Multiprocessors (SMs), each with 8 Streaming Processors (SPs), i.e., a total of 240 Streaming Processors (SPs), and more than 1 TFLOP of processing power.
 Each Streaming Processor is hardware multithreaded and can run thousands of threads per application.
 The G80 card has 16 Streaming Multiprocessors (SMs), each with 8 Streaming Processors (SPs), i.e., a total of 128 SPs, and it supports 768 threads per Streaming Multiprocessor (note: not per SP).
 Since each Streaming Multiprocessor has 8 SPs, each SP supports a maximum of 768/8 = 96 threads. The total number of threads that can run on 128 SPs is 128 * 96 = 12,288 threads.
 These processors are therefore called massively parallel.
 The G80 chips have a memory bandwidth of 86.4 GB/s.
 It also has an 8 GB/s communication channel with the CPU (4 GB/s for uploading to the CPU RAM, and 4 GB/s for downloading from the CPU RAM).

How does CUDA work?

 GPUs run one kernel (a group of tasks) at a time.
 Each kernel consists of blocks, which are independent groups of threads.
 Each block contains threads, which are the basic units of computation.
 The threads in each block typically work together to calculate a value.
 Threads in the same block can share memory.
 In CUDA, sending data between the CPU and the GPU is often the most expensive part of the computation.
 For each thread, registers are the fastest memory, followed by shared memory; global, constant, and texture memory are the slowest.
Typical CUDA Program flow
1. Load data into CPU memory
2. Copy data from CPU to GPU memory, e.g., cudaMemcpy(..., cudaMemcpyHostToDevice)
3. Call the GPU kernel using device variables, e.g., kernel<<<blocks, threads>>>(gpuVar)
4. Copy results from GPU to CPU memory, e.g., cudaMemcpy(..., cudaMemcpyDeviceToHost)
5. Use results on CPU
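The five steps above can be sketched as one minimal program. This is an illustrative sketch, not taken from the module: the kernel name square and the array size N are made up for the example.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Illustrative kernel: each thread squares one array element.
__global__ void square(int *data) {
    int i = threadIdx.x;
    data[i] = data[i] * data[i];
}

int main(void) {
    const int N = 8;
    size_t size = N * sizeof(int);

    // 1. Load data into CPU memory
    int h_data[N];
    for (int i = 0; i < N; i++) h_data[i] = i;

    // 2. Copy data from CPU to GPU memory
    int *d_data;
    cudaMalloc(&d_data, size);
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);

    // 3. Call the GPU kernel (one block of N threads)
    square<<<1, N>>>(d_data);

    // 4. Copy results from GPU to CPU memory
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);

    // 5. Use results on CPU
    for (int i = 0; i < N; i++) printf("%d ", h_data[i]);
    cudaFree(d_data);
    return 0;
}
```

Compiled with nvcc, this walks through exactly the five stages of the flow: host load, host-to-device copy, kernel launch, device-to-host copy, and host-side use of the result.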

How is work distributed?

 Each thread "knows" the x and y coordinates of the block it is in, and its coordinates within that block.
 These positions can be used to calculate a unique thread ID for each thread.
 The computational work done will depend on the value of the thread ID.
 For example, the thread ID could correspond to a group of matrix elements.
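As a sketch of this idea, a kernel can combine its block position and its position within the block into a unique ID, and use that ID to pick which element to work on. The kernel name scale and the doubling operation are illustrative, not from the module:

```cuda
__global__ void scale(float *data, int n) {
    // Position of this block in the grid (blockIdx) and of this thread
    // within the block (threadIdx) combine into a unique global thread ID.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // The work each thread does depends on its ID:
    // here, thread tid handles element tid of the array.
    if (tid < n)
        data[tid] *= 2.0f;
}
```

The bounds check (tid < n) matters because the grid is usually rounded up to a whole number of blocks, so the last block may contain spare threads with IDs past the end of the data.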

CUDA Applications

CUDA applications must run parallel operations on a lot of data, and be processing-intensive.
1. Computational finance
2. Climate, weather, and ocean modelling
3. Data science and analytics
4. Deep learning and machine learning
5. Defence and intelligence
6. Manufacturing/AEC
7. Media and entertainment
8. Medical imaging
9. Oil and gas
10. Research
11. Safety and security
12. Tools and management

Benefits of CUDA
There are several advantages that give CUDA an edge
over traditional general-purpose graphics processor
(GPU) computers with graphics APIs:
 Unified memory (CUDA 6.0 or later) and unified virtual memory (CUDA 4.0 or later).
 Shared memory provides a fast region of memory that CUDA threads in a block can share. It can be used as a caching mechanism and provides more bandwidth than texture lookups.
 Scattered reads: code can read from arbitrary addresses in memory.
 Faster downloads and readbacks, both to and from the GPU.
 CUDA has full support for bitwise and integer operations.

Limitations of CUDA

 CUDA source code on the host machine is written according to C++ syntax rules. Older versions of CUDA used C syntax rules, which means that up-to-date CUDA source code may or may not work as required with them.
 CUDA has one-way interoperability (the ability of computer systems or software to exchange and make use of information) with rendering languages like OpenGL: OpenGL can access CUDA-registered memory, but CUDA cannot access OpenGL memory.
 Later versions of CUDA do not provide emulators or fallback support for older versions.
 CUDA supports only NVIDIA hardware.

BASIC PRINCIPLES OF CUDA PROGRAMMING

1. ✅ Heterogeneous Computing
 CUDA uses both the CPU (host) and GPU (device).
 CPU runs the main program, and the GPU handles parallel
computation tasks (called kernels).

2. ✅ Kernels and Threads

 A kernel is a function that runs on the GPU.
 When a kernel is launched, it runs many threads in parallel.

__global__ void add(int *a, int *b, int *c) {
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}
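Assuming three device arrays d_a, d_b, and d_c of N elements have already been allocated and filled (those names and the size are illustrative), this kernel could be launched with one thread per element:

```cuda
int N = 256;
// One block of N threads: thread i computes c[i] = a[i] + b[i].
add<<<1, N>>>(d_a, d_b, d_c);
cudaDeviceSynchronize();  // wait for the kernel to finish before using results
```

Kernel launches are asynchronous with respect to the host, which is why a synchronization call (or a blocking cudaMemcpy) is needed before the results are read.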

3. ✅ Thread Hierarchy

Threads are grouped in a hierarchy:

 Threads are organized into blocks
 Blocks are organized into a grid

Each thread has built-in IDs:

 threadIdx, blockIdx, blockDim, gridDim

This allows scalable parallelism.

4. ✅ Memory Hierarchy

CUDA has multiple memory types:

Memory Type   Scope         Speed     Use
Registers     Per-thread    Fastest   Private variables
Shared        Per-block     Fast      Block-level data sharing
Global        All threads   Slow      Large data
Constant      All threads   Cached    Read-only constant values
👉 Efficient memory use = better performance
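As a sketch of how the per-block shared level is used (the kernel and the fixed block size of 256 are illustrative, not from the module), each block can stage data in fast shared memory before operating on it:

```cuda
// Assumes it is launched with blockDim.x == 256.
__global__ void reverseBlock(int *data) {
    // Per-block shared memory: visible to every thread in this block only.
    __shared__ int tile[256];

    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    // Each thread copies one element from slow global memory.
    tile[t] = data[base + t];

    // Wait until every thread in the block has written its element.
    __syncthreads();

    // Read back in reverse order from fast shared memory.
    data[base + t] = tile[blockDim.x - 1 - t];
}
```

The __syncthreads() barrier is essential here: without it, a thread could read a tile slot that its neighbour has not yet written.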

5. ✅ SIMT (Single Instruction, Multiple Threads)

 The GPU executes instructions in groups of 32 threads, called a warp
 All threads in a warp execute the same instruction, but on different data
 Like vectorization, but more flexible

6. ✅ Host-Device Memory Management

You must manually allocate and copy memory between host and
device:
cudaMalloc(&d_array, size);
cudaMemcpy(d_array, h_array, size, cudaMemcpyHostToDevice);

7. ✅ Synchronization and Communication

 Threads in the same block can synchronize using __syncthreads()
 Threads across blocks cannot directly communicate

 8. ✅ Parallel Execution

 Each thread can execute independently


 Ideal for data-parallel problems (e.g., vector addition, matrix
multiplication)

Example: CUDA Program Structure


// 1. Allocate memory
cudaMalloc(&d_A, size);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

// 2. Launch kernel
myKernel<<<numBlocks, threadsPerBlock>>>(d_A);

// 3. Copy results back
cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);

CONCEPTS OF THREAD AND BLOCKS


In CUDA (Compute Unified Device Architecture), which is
NVIDIA's parallel computing platform and API, the concepts of
threads and blocks are fundamental for writing GPU-accelerated
programs. Here's a clear breakdown:
1. Threads

 A thread is the smallest unit of execution in CUDA.
 Each thread runs the same kernel (a GPU function), but operates on different data.
 Threads have unique IDs, accessible via threadIdx, to distinguish them from each other.

int i = threadIdx.x;

This line gives each thread its own index within a block.

2. Blocks (Thread Blocks)


 A block is a group of threads that execute the same kernel and
can:
o Share data via shared memory
o Synchronize using __syncthreads()
 Threads in a block are organized in 1D, 2D, or 3D:

dim3 blockDim(16, 16); // 2D block of 16x16 threads

Each thread in a block is identified by:

threadIdx.x, threadIdx.y, threadIdx.z
3. Grids

 A grid is a collection of blocks.
 You launch a kernel over a grid of blocks.

kernel<<<gridDim, blockDim>>>(...);

 Blocks in a grid also have IDs:

blockIdx.x, blockIdx.y, blockIdx.z


Total thread index in the grid:

int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;

Thread Hierarchy Recap

Level       Identifier   Scope
Thread      threadIdx    Inside a block
Block       blockIdx     Inside a grid
Block size  blockDim     Threads per block
Grid size   gridDim      Blocks per grid
Key Advantages

 Massive Parallelism: Thousands of threads can run concurrently.
 Scalability: Same code scales with data and hardware.
 Memory Sharing: Threads in a block can share fast shared
memory.

Example: Vector Addition

__global__ void vectorAdd(int *a, int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
GPU AND CPU DATA EXCHANGE IN CUDA

In CUDA, the CPU (Host) and GPU (Device) have separate memory spaces, so data must be explicitly transferred between them. Here's how data exchange works:

1. Memory Spaces

 Host (CPU): Uses standard RAM.
 Device (GPU): Uses its own global memory.

2. Data Transfer Steps

✅ Step 1: Allocate Memory on GPU

cudaMalloc((void**)&d_A, size);

✅ Step 2: Copy Data from CPU to GPU

cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);


✅ Step 3: Run the GPU Kernel

kernel<<<gridDim, blockDim>>>(d_A, d_B);

✅ Step 4: Copy Result from GPU to CPU

cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

✅ Step 5: Free GPU Memory

cudaFree(d_A);
Transfer Direction Types
Direction Constant
CPU → GPU cudaMemcpyHostToDevice
GPU → CPU cudaMemcpyDeviceToHost
GPU ↔ GPU cudaMemcpyDeviceToDevice
Example Summary

int *h_A, *h_C; // Host pointers
int *d_A, *d_C; // Device pointers
size_t size = N * sizeof(int);

// 1. Allocate on GPU
cudaMalloc(&d_A, size);
cudaMalloc(&d_C, size);

// 2. Copy from Host to Device
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

// 3. Kernel Execution
myKernel<<<blocks, threads>>>(d_A, d_C);

// 4. Copy from Device to Host
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

// 5. Free GPU memory
cudaFree(d_A); cudaFree(d_C);
🔄 CPU–GPU Data Exchange in CUDA (Briefly)

CUDA requires explicit data transfer between the CPU (host) and
GPU (device) because they have separate memory.

Key Steps

1. Allocate GPU Memory
cudaMalloc(&d_ptr, size);
2. Copy Data: CPU → GPU
cudaMemcpy(d_ptr, h_ptr, size, cudaMemcpyHostToDevice);
3. Execute Kernel on GPU
kernel<<<blocks, threads>>>(d_ptr);
4. Copy Result: GPU → CPU
cudaMemcpy(h_ptr, d_ptr, size, cudaMemcpyDeviceToHost);
5. Free GPU Memory
cudaFree(d_ptr);

Transfer Types

Direction         CUDA Constant
Host → Device     cudaMemcpyHostToDevice
Device → Host     cudaMemcpyDeviceToHost
Device → Device   cudaMemcpyDeviceToDevice
