MODULE 5: HIGH PERFORMANCE COMPUTING WITH CUDA
CUDA PROGRAMMING MODEL
The CUDA programming model is a powerful parallel computing
platform and application programming interface (API) model
developed by NVIDIA. It enables developers to leverage the massive
parallel processing power of NVIDIA GPUs for High-Performance
Computing (HPC) tasks. CUDA stands for Compute Unified Device
Architecture.
CUDA is an extension of C/C++ programming that uses the
Graphics Processing Unit (GPU). It is a parallel computing
platform and an API (Application Programming Interface)
model developed by NVIDIA. It allows computations to be
performed in parallel, providing substantial speed-ups.
Using CUDA, one can harness the power of an NVIDIA GPU to
perform general computing tasks, such as processing
matrices and other linear algebra operations, rather than
only graphics calculations.
Why do we need CUDA?
GPUs are designed to perform high-speed
parallel computations to display graphics such as
games.
It uses widely available hardware: more than 100
million CUDA-capable GPUs are already deployed.
It provides 30-100x speed-up over other
microprocessors for some applications.
GPUs contain many small Arithmetic Logic Units
(ALUs), in contrast to CPUs, which have a few
larger ones. This allows for many parallel
calculations, such as calculating the color of
each pixel on the screen.
Architecture of CUDA
The G80 architecture contains 16 Streaming
Multiprocessors (SMs).
Each Streaming Multiprocessor has 8 Streaming
Processors (SPs), i.e., a total of 128
Streaming Processors (SPs).
Now, each Streaming processor has a MAD unit
(Multiplication and Addition Unit) and an
additional MU (multiplication unit).
The GT200 has 30 Streaming Multiprocessors
(SMs), and each Streaming Multiprocessor (SM)
has 8 Streaming Processors (SPs), i.e., a total
of 240 Streaming Processors (SPs), and more
than 1 TFLOP of processing power.
Each Streaming Processor is hardware
multithreaded and can run thousands of threads
per application.
The G80 card has 16 Streaming Multiprocessors
(SMs) and each SM has 8 Streaming Processors
(SPs), i.e., a total of 128 SPs and it supports 768
threads per Streaming Multiprocessor (note: not
per SP).
Since each Streaming Multiprocessor has 8 SPs,
each SP supports a maximum of 768/8 = 96
threads. The total number of threads that can
run on 128 SPs is 128 * 96 = 12,288 threads.
Therefore these processors are
called massively parallel.
The G80 chips have a memory bandwidth of
86.4GB/s.
It also has an 8GB/s communication channel with
the CPU (4GB/s for uploading to the CPU RAM,
and 4GB/s for downloading from the CPU RAM).
How does CUDA work?
GPUs run one kernel (a group of tasks) at a time.
Each kernel consists of blocks, which are
independent groups of threads.
Each block contains threads, which are levels of
computation.
The threads in each block typically work together
to calculate a value.
Threads in the same block can share memory.
In CUDA, sending data from the CPU to the GPU
is often the most costly part of the
computation.
For each thread, local memory is the fastest,
followed by shared memory; global, constant, and
texture memory are the slowest.
Typical CUDA Program flow
1. Load data into CPU memory
2. Copy data from CPU to GPU memory - e.g.,
cudaMemcpy(..., cudaMemcpyHostToDevice)
3. Call GPU kernel using device variables - e.g.,
kernel<<<grid, block>>>(gpuVar)
4. Copy results from GPU to CPU memory - e.g.,
cudaMemcpy(..., cudaMemcpyDeviceToHost)
5. Use results on CPU
How is work distributed?
Each thread "knows" the x and y coordinates of
the block it is in, and the coordinates where it is
in the block.
These positions can be used to calculate a
unique thread ID for each thread.
The computational work done will depend on the
value of the thread ID.
For example, a thread ID may correspond to a
particular group of matrix elements.
CUDA Applications
CUDA applications must run parallel operations on a lot
of data, and be processing-intensive.
1. Computational finance
2. Climate, weather, and ocean modelling
3. Data science and analytics
4. Deep learning and machine learning
5. Defence and intelligence
6. Manufacturing/AEC
7. Media and entertainment
8. Medical imaging
9. Oil and gas
10. Research
11. Safety and security
12. Tools and management
Benefits of CUDA
There are several advantages that give CUDA an edge
over traditional general-purpose GPU (GPGPU)
computing with graphics APIs:
Unified memory (CUDA 6.0 or later) and
unified virtual addressing (CUDA 4.0 or later).
Shared memory provides a fast area of shared
memory for CUDA threads. It can be used as a
caching mechanism and provides more
bandwidth than texture lookup.
Scattered reads - code can read from arbitrary
addresses in memory.
Improved performance on downloads and reads,
both to and from the GPU.
CUDA has full support for bitwise and integer
operations.
Limitations of CUDA
CUDA source code is processed on the host machine
according to C++ syntax rules. Older versions of
CUDA used C syntax rules, so up-to-date CUDA
source code may or may not work with them as
required.
CUDA has one-way interoperability (the ability of
computer systems or software to exchange and
make use of information) with rendering APIs
such as OpenGL. OpenGL can access
CUDA-registered memory, but CUDA cannot
access OpenGL memory.
Later versions of CUDA do not provide
emulators or fallback support for older versions.
CUDA supports only NVIDIA hardware.
BASIC PRINCIPLES OF CUDA PROGRAMMING
1. ✅ Heterogeneous Computing
CUDA uses both the CPU (host) and GPU (device).
CPU runs the main program, and the GPU handles parallel
computation tasks (called kernels).
2. ✅ Kernels and Threads
A kernel is a function that runs on the GPU.
When a kernel is launched, it runs many threads in parallel.
__global__ void add(int *a, int *b, int *c) {
int i = threadIdx.x;
c[i] = a[i] + b[i];
}
3. ✅ Thread Hierarchy
Threads are grouped in a hierarchy:
Threads are organized into blocks
Blocks are organized into a grid
Each thread has built-in IDs:
threadIdx, blockIdx, blockDim, gridDim
This allows scalable parallelism.
4. ✅ Memory Hierarchy
CUDA has multiple memory types:
Memory Type Scope Speed Use
Registers Per-thread Fastest Private variables
Shared Per-block Fast Block-level data sharing
Global All threads Slow Large data
Constant All threads (read-only) Cached Constant values
👉 Efficient memory use = better performance
5. ✅ SIMT (Single Instruction, Multiple Threads)
GPU executes instructions in groups of 32 threads, called a
warp
All threads in a warp execute the same instruction, but on
different data
Like vectorization, but flexible
6. ✅ Host-Device Memory Management
You must manually allocate and copy memory between host and
device:
cudaMalloc(&d_array, size);
cudaMemcpy(d_array, h_array, size, cudaMemcpyHostToDevice);
7. ✅ Synchronization and Communication
Threads in the same block can synchronize using
__syncthreads()
Threads across blocks cannot directly communicate
8. ✅ Parallel Execution
Each thread can execute independently
Ideal for data-parallel problems (e.g., vector addition, matrix
multiplication)
Example: CUDA Program Structure
// 1. Allocate memory
cudaMalloc(&d_A, size);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
// 2. Launch kernel
myKernel<<<numBlocks, threadsPerBlock>>>(d_A);
// 3. Copy results back
cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);
CONCEPTS OF THREAD AND BLOCKS
In CUDA (Compute Unified Device Architecture), which is
NVIDIA's parallel computing platform and API, the concepts of
threads and blocks are fundamental for writing GPU-accelerated
programs. Here's a clear breakdown:
1. Threads
A thread is the smallest unit of execution in CUDA.
Each thread runs the same kernel (a GPU function), but operates
on different data.
Threads have unique IDs to distinguish them from each other,
accessible via threadIdx.
int i = threadIdx.x;
This line gives each thread its own index within a block.
2. Blocks (Thread Blocks)
A block is a group of threads that execute the same kernel and
can:
o Share data via shared memory
o Synchronize using __syncthreads()
Threads in a block are organized in 1D, 2D, or 3D:
dim3 blockDim(16, 16); // 2D block of 16x16 threads
Each thread in a block is identified by:
threadIdx.x, threadIdx.y, threadIdx.z
3. Grids
A grid is a collection of blocks.
You launch a kernel over a grid of blocks.
kernel<<<gridDim, blockDim>>>(...);
Blocks in a grid also have IDs:
blockIdx.x, blockIdx.y, blockIdx.z
Total thread index in the grid:
int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
Thread Hierarchy Recap
Level Identifier Scope
Thread threadIdx Inside a block
Block blockIdx Inside a grid
Block size blockDim Threads per block
Grid size gridDim Blocks per grid
Key Advantages
Massive Parallelism: Thousands of threads can run
concurrently.
Scalability: Same code scales with data and hardware.
Memory Sharing: Threads in a block can share fast shared
memory.
Example: Vector Addition
__global__ void vectorAdd(int *a, int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
GPU AND CPU DATA EXCHANGE IN CUDA
In CUDA, CPU (Host) and GPU (Device) have separate memory
spaces, so data must be explicitly transferred between them. Here's
how data exchange works:
1. Memory Spaces
Host (CPU): Uses standard RAM.
Device (GPU): Uses its own global memory.
2. Data Transfer Steps
✅ Step 1: Allocate Memory on GPU
cudaMalloc((void**)&d_A, size);
✅ Step 2: Copy Data from CPU to GPU
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
✅ Step 3: Run the GPU Kernel
kernel<<<gridDim, blockDim>>>(d_A, d_B);
✅ Step 4: Copy Result from GPU to CPU
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
✅ Step 5: Free GPU Memory
cudaFree(d_A);
Transfer Direction Types
Direction Constant
CPU → GPU cudaMemcpyHostToDevice
GPU → CPU cudaMemcpyDeviceToHost
GPU ↔ GPU cudaMemcpyDeviceToDevice
Example Summary
int *h_A, *h_C; // Host pointers (allocate with malloc before use)
int *d_A, *d_C; // Device pointers
size_t size = N * sizeof(int);
// 1. Allocate on GPU
cudaMalloc(&d_A, size);
cudaMalloc(&d_C, size);
// 2. Copy from Host to Device
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
// 3. Kernel Execution
myKernel<<<blocks, threads>>>(d_A, d_C);
// 4. Copy from Device to Host
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
// 5. Free GPU memory
cudaFree(d_A); cudaFree(d_C);