
Introduction to CUDA and Computational Patterns
6 Hours
Topics covered:
▪ Introduction

▪ Data Parallelism

▪ CUDA Program Structure

▪ A Vector Addition Kernel

▪ Device Global Memory and Data Transfer

▪ Error handling in CUDA

▪ Kernel Functions and Threading

Topics covered:
▪ CUDA Thread Organization

▪ Mapping Threads to Multidimensional Data

▪ Matrix-Matrix Multiplication—A More Complex Kernel

▪ Calculations of global threadID

▪ Synchronization and Transparent Scalability

▪ Assigning Resources to Blocks

▪ Querying Device Properties

Topics covered:
▪ 1D Sequential Convolution

▪ 1D Parallel Convolution – A Basic Algorithm

▪ Atomic and Arithmetic Functions

▪ A Simple Parallel Scan Algorithm

▪ Sequential Sparse-Matrix Vector Multiplication (SpVM)

▪ Parallel SpVM using CSR

Topics covered:
▪ Importance of Memory Access Efficiency

▪ GPU Device Memory Types

▪ A Strategy for Reducing Global Memory Traffic

▪ A Tiled Matrix-Matrix Multiplication Kernel

▪ Constant Memory and Caching

▪ Tiled 1D Convolution with Halo Elements

Introduction
• CUDA stands for Compute Unified Device Architecture

• CUDA C is an extension to the popular C programming language used for writing massively parallel
programs in a heterogeneous computing system.

• To a CUDA programmer, the computing system consists of a host that is a traditional CPU, and one or
more devices (GPUs) that are processors with a massive number of arithmetic units.

• Software applications often have sections that exhibit a rich amount of data parallelism, a
phenomenon that allows arithmetic operations to be safely performed on different parts of the data
structures in parallel.

• CUDA devices accelerate the execution of software applications by applying their massive number of
arithmetic units to the data-parallel program sections.

Data Parallelism
• Modern software applications often process a large amount of data and incur long execution time on
sequential computers.
• Parallel programming uses both Task parallelism and Data parallelism.

Task Parallelism vs. Data Parallelism


• Task parallelism exists if the two tasks can be done independently. For example, a simple application
may need to do a vector addition and a matrix-vector multiplication. Each of these would be a task.
• Data parallelism refers to the program property whereby many arithmetic operations can be safely
performed on the data structures in a simultaneous manner. For example, in vector addition, we use
data parallelism.

CUDA Program Structure
• The structure of a CUDA program reflects the coexistence of a host (CPU) and one or more devices
(GPUs) in the computer.

• Each CUDA source file can have a mixture of both host and device code.

• By default, any traditional C program is a CUDA program that contains only host code. One can add
device functions and data declarations into any C source file by marking them with special CUDA
keywords.

• The NVIDIA C Compiler (NVCC) separates the host code and the device code during compilation
process.

❑ The host code is compiled with the host’s standard C compilers and runs as an ordinary CPU
process.
❑ The device code (kernels) is compiled by the NVCC and executed on a GPU device.
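For example, a source file that mixes host and device code can be compiled in a single step; the file and executable names below are illustrative:

nvcc vecAdd.cu -o vecAdd

NVCC forwards the host portions to the host's standard C/C++ compiler, compiles the kernels for the GPU, and links everything into one executable.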

CUDA Program Structure

An Overview of the compilation process of a CUDA program

CUDA Program Structure
• The execution of a CUDA program starts with host (CPU) execution.

• When a kernel function is called (launched), it is executed by a large number of threads on a device.

• All the threads that are generated by a kernel launch are collectively called a grid.

• When all threads of a kernel complete their execution, the corresponding grid terminates, and the
execution continues on the host until another kernel is launched.

CUDA threads take very few cycles to generate and schedule due to efficient hardware support. CPU threads typically require thousands of clock cycles to generate and schedule.
A Vector Addition Kernel
• A traditional vector addition C code example:

// Compute vector sum h_C = h_A+h_B


void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    for (int i = 0; i < n; i++)
        h_C[i] = h_A[i] + h_B[i];
}

int main()
{
// Memory allocation for h_A, h_B, and h_C
// I/O to read h_A and h_B, N elements each …
vecAdd(h_A, h_B, h_C, N);
}

A Vector Addition Kernel
• A modified vecAdd() function for execution on a CUDA device:

#include <cuda.h>

void vecAdd(float* A, float* B, float* C, int n)
{
    float *d_A, *d_B, *d_C;
    int size = n * sizeof(float);

    // Part-1: Allocate device memory for A, B, and C
    //         Copy A and B to device memory

    // Part-2: Kernel launch code – to have the device
    //         perform the actual vector addition

    // Part-3: Copy C from the device memory
    //         Free device vectors
}

Device Global Memory and Data transfer
• In CUDA, the host and devices have separate memory spaces. Devices are typically hardware cards that come with their own Dynamic Random Access Memory (DRAM), which is also called global memory.

• The CUDA runtime system provides Application Programming Interface (API) functions to perform
the following activities on behalf of the programmer:

❑ To execute a kernel on a device, the programmer needs to allocate global memory on the device and
transfer pertinent data from the host memory to the allocated device memory (Part-1).

❑ Similarly, after device execution, the programmer needs to transfer result data from the device
memory back to the host memory and free up the device memory that is no longer needed (Part-3).

CUDA host memory and device memory model for programmers


Device Global Memory and Data transfer
• The CUDA runtime system provides API functions for managing data in the device memory.

• Function cudaMalloc() can be called from the host code to allocate a piece of device global memory for
an object. It takes two parameters:

1. The first parameter to the cudaMalloc() function is the address of a pointer variable that will be set to point to the allocated object. The address of the pointer variable should be cast to (void**) because the function expects a generic pointer.

2. The second parameter to the cudaMalloc() function gives the size of the data to be allocated, in
terms of bytes.

• Function cudaFree() is called to free the storage space allocated for an object from the device global
memory.

Device Global Memory and Data transfer
#include <cuda.h>

void vecAdd(float* A, float* B, float* C, int n)
{
    float *d_A, *d_B, *d_C;
    int size = n * sizeof(float);

    // Part-1
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);
    // Copy A and B to device memory

    // Part-2: Kernel launch code – to have the device
    //         perform the actual vector addition

    // Part-3: Copy C from the device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

The addresses in d_A, d_B, and d_C are addresses in the device memory. These addresses should not be dereferenced in the host code. They should mostly be used in calling API functions and kernel functions. Dereferencing a device memory pointer in the host code can cause exceptions or other types of runtime errors.
Device Global Memory and Data transfer
• Once the host code has allocated device memory for the data objects, it calls cudaMemcpy() to transfer
the data from host to device.

• The cudaMemcpy() function takes four parameters:

1. The first parameter is a pointer to the destination location for the data object to be copied.

2. The second parameter points to the source location.

3. The third parameter specifies the number of bytes to be copied.

4. The fourth parameter indicates the types of memory involved in the copy:
❑ from host memory to host memory
❑ from host memory to device memory
❑ from device memory to host memory
❑ from device memory to device memory

• cudaMemcpy() cannot be used to copy between different GPUs in multi-GPU systems
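In code, the copy direction is selected with one of the predefined constants cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, passed as the fourth argument. For example, with d_A, A, and size set up as in the earlier vecAdd() example:

cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);   // copy host array A to device array d_A
cudaMemcpy(A, d_A, size, cudaMemcpyDeviceToHost);   // copy device array d_A back to host array A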

Device Global Memory and Data transfer
#include <cuda.h>

void vecAdd(float* A, float* B, float* C, int n)
{
    float *d_A, *d_B, *d_C;
    int size = n * sizeof(float);

    // Part-1
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);

    // Part-2: Kernel launch code – to have the device
    //         perform the actual vector addition

    // Part-3
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
Error Handling in CUDA
• CUDA API functions return flags that indicate whether an error has occurred when they served the
request.

• Most errors are due to inappropriate argument values used in the API call.

• In practice, we should surround the API call with code that tests for error conditions and prints out error
messages so that the user can be aware of the fact that an error has occurred.

// if the system is out of device memory, the user will be informed about the situation.

cudaError_t err = cudaMalloc((void**)&d_A, size);

if (err != cudaSuccess)
{
    printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
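Because this pattern is repeated around every API call, it is often wrapped in a small helper macro. The macro below is a common convenience idiom, not part of the CUDA API:

#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            printf("%s in %s at line %d\n",                               \
                   cudaGetErrorString(err_), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc((void**)&d_A, size));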

Kernel Functions and Threading

• In CUDA, a kernel function specifies the code to be executed by all threads during a parallel phase.

• Since all the threads execute the same code, CUDA programming is an instance of the well-known
SPMD (single program, multiple data) parallel programming style.

SPMD Vs SIMD
➢ SPMD is not the same as SIMD (single instruction, multiple data).
➢ In an SPMD system, the parallel processing units execute the same program on multiple parts of the
data.
➢ In a SIMD system, all processing units are executing the same instruction at any instant.

Kernel Functions and Threading
• When a host code launches a kernel, the CUDA runtime system generates a grid of threads that are
organized in a two-level hierarchy.

• Each grid is organized into an array of thread blocks.

• All blocks of a grid are of the same size; each block can contain up to 1,024 threads.

• The number of threads in each thread block is specified by the host code when a kernel is launched.

• The same kernel can be launched with different numbers of threads at different parts of the host
code.

• For a given grid of threads, the number of threads in a block is available in the blockDim variable.

The value of the blockDim.x variable here is 256. In general, the dimensions of thread blocks should be multiples of 32 due to hardware efficiency reasons.
Kernel Functions and Threading
• Each thread in a block has a unique threadIdx value (0, 1, . . 255).

• This allows each thread to combine its threadIdx and blockIdx values to create a unique global index for itself within the entire grid.

• A data index i is calculated as:


i = blockIdx.x * blockDim.x + threadIdx.x.

• By launching the kernel with a larger number of blocks, one can process larger vectors. By launching a
kernel with n or more threads, one can process vectors of length n.

Since blockDim.x is 256 in our example, the i values of threads in block 0 range from 0 to 255, the i values of threads in block 1 range from 256 to 511, and the i values of threads in block 2 range from 512 to 767. Since each thread uses i to access d_A, d_B, and d_C, these threads cover the first 768 vector elements for the addition.
Kernel Functions and Threading
Kernel function for a vector addition
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAddKernel(float* A, float* B, float* C, int n)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;

if(i < n)
C[i] = A[i] + B[i];
}

• The keyword __global__ indicates that the function is a kernel and that it can be called from a host
function to generate a grid of threads on a device.

• The condition if(i < n) ensures that only the required threads execute the addition and prevents the execution of unnecessary threads.

For example, if the vector length is 100, the smallest efficient thread block dimension is 32, so we need to launch four thread blocks to process all 100 vector elements, which creates 128 threads. By setting the condition if(i < n), we disable the last 28 threads in thread block 3 from doing work.
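As a sketch, the corresponding launch for this example (using the vecAddKernel defined above) would be:

// ceil(100/32.0) = 4 blocks of 32 threads = 128 threads in total;
// the if(i < n) test disables the extra 28 threads in the last block.
vecAddKernel<<< ceil(100/32.0), 32 >>>(d_A, d_B, d_C, 100);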
Kernel Functions and Threading
• CUDA extends C language with three qualifier keywords in function declarations:

➢ The __global__ keyword indicates that the function being declared is a CUDA kernel function. A
__global__ function is to be executed on the device and can only be called from the host code.

➢ The __device__ keyword indicates that the function being declared is a CUDA device function. A device
function executes on a CUDA device and can only be called from a kernel function or another device
function. Device functions can have neither recursive function calls nor indirect function calls
through pointers in them.

➢ The __host__ keyword indicates that the function being declared is a CUDA host function. A host
function is simply a traditional C function that executes on the host and can only be called from another
host function.
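As a small illustration of the three qualifiers (the deviceAdd and hostHelper names below are made up for the example; vecAddKernel is the kernel shown earlier):

// Kernel function: runs on the device, launched from host code.
__global__ void vecAddKernel(float* A, float* B, float* C, int n);

// Device function: runs on the device, callable only from a kernel
// or another device function.
__device__ float deviceAdd(float a, float b) { return a + b; }

// Host function: an ordinary C function (this is also the default
// when no qualifier is given).
__host__ void hostHelper(void) { }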
Kernel Functions and Threading
• By default, all functions in a CUDA program are host functions if they do not have any of the CUDA
keywords in their declaration.

• One can use both __host__ and __device__ in a function declaration. This combination tells the
compilation system to generate two versions of object files for the same function:
➢ One is executed on the host and can only be called from a host function.
➢ The other is executed on the device and can only be called from a device or kernel function.

• When the host code launches a kernel, it sets the grid and thread block dimensions via execution
configuration parameters:
o The configuration parameters are given between the <<< and >>> before the traditional C function
arguments.
o The first configuration parameter gives the number of thread blocks in the grid.
o The second configuration parameter specifies the number of threads in each thread block.

vecAddKernel<<< ceil(n/256.0), 256 >>> (d_A, d_B, d_C, n);


Complete CUDA program for vecAdd()
#include <cuda.h>

// Kernel for computing vector sum C = A+B
__global__ void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    if (i < n)
        C[i] = A[i] + B[i];
}

void vecAdd(float* A, float* B, float* C, int n)
{
    float *d_A, *d_B, *d_C;
    int size = n * sizeof(float);

    // Part-1
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);

    // Part-2
    vecAddKernel<<< ceil(n/256.0), 256 >>>(d_A, d_B, d_C, n);

    // Part-3
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
CUDA Thread Organization
• All CUDA threads in a grid execute the same kernel function and they rely on coordinates to
distinguish themselves from each other and to identify the appropriate portion of the data to process.

• CUDA threads are organized into a two-level hierarchy:


1) A grid consists of one or more blocks
2) Each block in turn consists of one or more threads.

• All threads in a block share the same block index, which can be accessed as the blockIdx variable in a
kernel.

• Each thread also has a thread index, which can be accessed as the threadIdx variable in a kernel.

• When a thread executes a kernel function, references to the blockIdx and threadIdx variables return
the coordinates of the thread.

• The execution configuration parameters in a kernel launch statement specify the dimensions of the
grid and the dimensions of each block:

vecAddKernel<<< ceil(n/256.0), 256 >>>(d_A, d_B, d_C, n);


CUDA Thread Organization
• In general, a grid is a 3D array of blocks, and each block is a 3D array of threads.

• The programmer can choose to use fewer dimensions by setting the unused dimensions to 1.

• The exact organization of a grid is determined by the execution configuration parameters (within <<<
and >>> ) of the kernel launch statement. The first execution configuration parameter specifies the
dimensions of the grid in number of blocks. The second specifies the dimensions of each block in
number of threads.

• Each such parameter is of dim3 type, which is a C structure with three unsigned integer fields, x, y,
and z. These three fields correspond to the three dimensions:

dim3 dimGrid(128, 1, 1);


dim3 dimBlock(32, 1, 1);
vecAddKernel <<< dimGrid, dimBlock >>> (...);

CUDA Thread Organization
• The grid and block dimensions can also be calculated from other variables. In the following example, the
value of variable n at kernel launch time will determine the dimension of the grid:
dim3 dimGrid(ceil(n/256.0), 1, 1);
dim3 dimBlock(256, 1, 1);
vecAddKernel <<< dimGrid, dimBlock >>> (...); // n is the number of data elements

• For convenience, CUDA C provides a special shortcut for launching a kernel with 1D grids and blocks. Instead of using dim3 variables, one can use arithmetic expressions to specify the configuration of 1D grids and blocks. In this case, the CUDA C compiler simply takes the arithmetic expression as the x dimension and assumes that the y and z dimensions are 1.
vecAddKernel <<< ceil(n/256.0), 256 >>> (...); // n is the number of data elements

• For a given grid of threads, the dimension of grid is available in gridDim variable, and the dimension
of each block is available in the blockDim variable.

CUDA Thread Organization
• In CUDA C, the allowed values of gridDim.x, gridDim.y, and gridDim.z range from 1 to 65,536.

• All threads in a block share the same blockIdx.x, blockIdx.y, and blockIdx.z values.

• Among all blocks, the blockIdx.x value ranges between 0 and gridDim.x-1, the blockIdx.y value
between 0 and gridDim.y-1, and the blockIdx.z value between 0 and gridDim.z-1.

• All blocks in a grid have the same dimensions.

• The total size of a block is limited to 1,024 threads, with flexibility in distributing these elements into the
three dimensions as long as the total number of threads does not exceed 1,024.
➢ For example, (512, 1, 1), (8, 16, 4), and (32, 16, 2) are all allowable blockDim values, but (32, 32, 2)
is not allowable since the total number of threads would exceed 1,024.
dim3 dimGrid(128, 1, 1); // gridDim.x = 128, gridDim.y = 1, gridDim.z = 1
dim3 dimBlock(32, 1, 1); // blockDim.x = 32, blockDim.y = 1, blockDim.z = 1
vecAddKernel <<< dimGrid, dimBlock >>> (...);

CUDA Thread Organization
• Example of a 2D (2, 2, 1) grid that consists of 3D (4, 2, 2) blocks. The grid can be generated with the
following host code:

dim3 dimGrid(2, 2, 1);


dim3 dimBlock(4, 2, 2);
vecAddKernel <<< dimGrid, dimBlock >>> (...);

▪ Each block in the figure is labeled with (blockIdx.y, blockIdx.x). For example, block(1,0) has blockIdx.y=1 and blockIdx.x=0.

▪ Note that the ordering of the labels is such that the highest dimension comes first. This is the reverse of the ordering used in the configuration parameters, where the lowest dimension comes first.

▪ This reversed ordering for labeling threads works better for mapping thread coordinates into data indexes when accessing multidimensional arrays.
CUDA Thread Organization
• Each threadIdx also consists of three fields: the x coordinate threadIdx.x, the y coordinate threadIdx.y, and the z coordinate threadIdx.z.

• In the following example, each block is organized into 4 x 2 x 2 arrays of threads. The Figure expands
block(1,1) to show its 16 threads. For example, thread(1,0,2) has threadIdx.z=1, threadIdx.y=0,
and threadIdx.x=2. In this example, we have four blocks of 16 threads each, with a grand total
of 64 threads in the grid.

Calculations of global threadID

Note: (x,y) notation is followed in the figures on the following slides.
Calculations of global threadID – 1D Grid/1D Block

<<< 4, 3 >>>

(x,y) notation used
Calculations of global threadID – 1D Grid/2D Block

<<< 4, (3,2) >>>

Gtid = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;
Calculations of global threadID – 2D Grid/1D Block

<<< (2,2), 3 >>>

Calculations of global threadID – 2D Grid/2D Block
<<< (2,2), (2,3) >>>

blockId = (gridDim.x * blockIdx.y) + blockIdx.x

threadId = (blockId * (blockDim.x * blockDim.y)) + (threadIdx.y * blockDim.x) + threadIdx.x
Calculations of global threadID
1D grid of 1D blocks:
__device__ int getGlobalID_1D_1D() {
    return blockIdx.x * blockDim.x + threadIdx.x;
}

1D grid of 2D blocks:
__device__ int getGlobalID_1D_2D() {
    return blockIdx.x * blockDim.x * blockDim.y
           + threadIdx.y * blockDim.x + threadIdx.x;
}

1D grid of 3D blocks:
__device__ int getGlobalID_1D_3D() {
    return blockIdx.x * blockDim.x * blockDim.y * blockDim.z
           + threadIdx.z * blockDim.y * blockDim.x
           + threadIdx.y * blockDim.x + threadIdx.x;
}
Calculations of global threadID
2D grid of 1D blocks:
__device__ int getGlobalID_2D_1D() {
    int blockId  = blockIdx.y * gridDim.x + blockIdx.x;
    int threadId = blockId * blockDim.x + threadIdx.x;
    return threadId;
}

2D grid of 2D blocks:
__device__ int getGlobalID_2D_2D() {
    int blockId  = blockIdx.x + blockIdx.y * gridDim.x;
    int threadId = blockId * (blockDim.x * blockDim.y)
                 + (threadIdx.y * blockDim.x) + threadIdx.x;
    return threadId;
}

2D grid of 3D blocks:
__device__ int getGlobalID_2D_3D() {
    int blockId  = blockIdx.x + blockIdx.y * gridDim.x;
    int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z)
                 + (threadIdx.z * (blockDim.x * blockDim.y))
                 + (threadIdx.y * blockDim.x) + threadIdx.x;
    return threadId;
}
Calculations of global threadID
3D grid of 1D blocks:
__device__ int getGlobalID_3D_1D() {
    int blockId  = blockIdx.x + blockIdx.y * gridDim.x
                 + gridDim.x * gridDim.y * blockIdx.z;
    int threadId = blockId * blockDim.x + threadIdx.x;
    return threadId;
}

3D grid of 2D blocks:
__device__ int getGlobalID_3D_2D() {
    int blockId  = blockIdx.x + blockIdx.y * gridDim.x
                 + gridDim.x * gridDim.y * blockIdx.z;
    int threadId = blockId * (blockDim.x * blockDim.y)
                 + (threadIdx.y * blockDim.x) + threadIdx.x;
    return threadId;
}

3D grid of 3D blocks:
__device__ int getGlobalID_3D_3D() {
    int blockId  = blockIdx.x + blockIdx.y * gridDim.x
                 + gridDim.x * gridDim.y * blockIdx.z;
    int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z)
                 + (threadIdx.z * (blockDim.x * blockDim.y))
                 + (threadIdx.y * blockDim.x) + threadIdx.x;
    return threadId;
}
1D Sequential Convolution

• Convolution is a popular array operation that is used in various forms in signal processing, digital recording,
image processing, video processing, and computer vision.

• Convolution is often performed as a filter that transforms signals and pixels into more desirable values. For
example, Gaussian filters are convolution filters that can be used to sharpen boundaries and edges of objects
in images.

• Mathematically, convolution is an array operation where each output data element is a weighted sum of a
collection of neighboring input elements.

• The weights used in the weighted sum calculation are defined by an input mask array, commonly referred to as the convolution mask or the convolution kernel.

• The same convolution mask is typically used for all elements of the array.

1D Sequential Convolution
• The following example shows a convolution example for 1D data where a five-element convolution mask array
M is applied to a seven-element input array N.

• The fact that we use a five-element mask M means that each P element is generated by a weighted sum of the
corresponding N element, up to two elements to the left and up to two elements to the right.

• Each weight value is multiplied by the corresponding N element value before the products are summed together.

• In general, the size of the mask tends to be an odd number, which makes the weighted sum calculation
symmetric around the element being calculated.
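A sequential C implementation consistent with this description is sketched below (the function name is made up; N is the input array, M the mask, and P the output, following the naming on these slides):

void convolution_1D_sequential(float *N, float *M, float *P,
                               int Mask_Width, int Width)
{
    for (int i = 0; i < Width; i++) {
        float Pvalue = 0;
        int N_start_point = i - (Mask_Width / 2);
        for (int j = 0; j < Mask_Width; j++) {
            // Missing elements outside the array boundaries contribute 0.
            if (N_start_point + j >= 0 && N_start_point + j < Width)
                Pvalue += N[N_start_point + j] * M[j];
        }
        P[i] = Pvalue;
    }
}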
1D Sequential Convolution
• Because convolution is defined in terms of neighboring elements, boundary conditions naturally exist for output
elements that are close to the ends of an array.

• For example, when we calculate P[1], there is only one N element to the left of N[1]. That is, there are not
enough N elements to calculate P[1] according to our definition of convolution. A typical approach to handling
such a boundary condition is to define a default value to these missing N elements. For most applications, the
default value is 0.

• These missing elements are typically referred to as ghost elements in literature.


1D Parallel Convolution – A Basic Algorithm
• The calculation of all output (P) elements can be done in parallel in a 1D convolution.

• The first step is to define the major input parameters for the kernel. We assume that the 1D convolution kernel receives
five arguments: pointer to input array N, pointer to input mask M, pointer to output array P, size of the mask Mask_Width, and
size of the input and output arrays Width. Thus, we have the following set up:

• The second step is to determine and implement the mapping of threads to output elements. Since the output array is
one dimensional, a simple and good approach is to organize the threads into a 1D grid and have each thread in the grid
calculate one output element.
▪ We assume that Mask_Width is an odd number and the convolution is symmetric.

▪ The for loop accumulates all the contributions from the neighboring elements to the output P element.

▪ The if statement in the loop tests whether any of the input N elements used are ghost elements, either on the left side or the right side of the N array.
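A basic kernel consistent with these points (the slide's code figure is not reproduced here, so this is a reconstruction under the stated assumptions) is:

__global__ void convolution_1D_basic_kernel(float *N, float *M, float *P,
                                            int Mask_Width, int Width)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one output element per thread

    float Pvalue = 0;
    int N_start_point = i - (Mask_Width / 2);
    for (int j = 0; j < Mask_Width; j++) {
        // Ghost elements on the left or right of the N array contribute 0.
        if (N_start_point + j >= 0 && N_start_point + j < Width)
            Pvalue += N[N_start_point + j] * M[j];
    }
    if (i < Width)           // guard against extra threads in the last block
        P[i] = Pvalue;
}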
2D Sequential Convolution
• For image processing and computer vision, input data is usually in 2D form, with pixels in an x-y space. Image
convolutions are performed using a 2D convolution mask M.

• The x and y dimensions of mask M determine the range of neighbors to be included in the weighted sum
calculation.

• In general, the mask does not have to be a square array. To generate an output element, we take the subarray of
which the center is at the corresponding location in the input array N. We then perform pairwise multiplication
between elements of the input array and those of the mask array.
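A sequential sketch of this computation, assuming a Height x Width input N, a Mask_Height x Mask_Width mask M (both odd), row-major storage, and hypothetical parameter names, is:

void convolution_2D_sequential(float *N, float *M, float *P,
                               int Mask_Height, int Mask_Width,
                               int Height, int Width)
{
    for (int row = 0; row < Height; row++) {
        for (int col = 0; col < Width; col++) {
            float Pvalue = 0;
            int row_start = row - (Mask_Height / 2);
            int col_start = col - (Mask_Width / 2);
            for (int j = 0; j < Mask_Height; j++) {
                for (int k = 0; k < Mask_Width; k++) {
                    int r = row_start + j;
                    int c = col_start + k;
                    // Elements outside the input boundaries contribute 0.
                    if (r >= 0 && r < Height && c >= 0 && c < Width)
                        Pvalue += N[r * Width + c] * M[j * Mask_Width + k];
                }
            }
            P[row * Width + col] = Pvalue;
        }
    }
}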

2D Sequential Convolution
• Like 1D convolution, 2D convolution must also deal with boundary conditions.

• With boundaries in both the x and y dimensions, there are more complex boundary conditions: the calculation of
an output element may involve boundary conditions along a horizontal boundary, a vertical boundary, or both.

• In the following example, the calculation of P(1,0) involves two missing columns and one missing row in the subarray of N.

Atomic and Arithmetic functions
In a multithreaded scenario, if multiple threads try to modify a single shared memory variable, the issue of data inconsistency will arise. To overcome this, atomic functions need to be used:

atomicAdd()

• Reads the word old from the address located in global or shared memory, computes (old + val), and stores the
result back to memory at the same address. These three operations are performed in one atomic transaction.
The function returns old.
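A minimal usage sketch (the kernel and variable names here are made up for illustration; the int overload of atomicAdd() shown is one of several overloads CUDA provides):

__global__ void countPositives(int *data, int n, int *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0)
        atomicAdd(counter, 1);   // returns the old value of *counter; ignored here
}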

Atomic and Arithmetic functions
atomicSub()

• Reads the word old located at the address in global or shared memory, computes (old - val), and stores the
result back to memory at the same address. These three operations are performed in one atomic transaction.
The function returns old.

Atomic and Arithmetic functions
atomicExch()

• Reads the word old located at the address in global or shared memory and stores val in memory at the same address. These two operations are performed in one atomic transaction. The function returns old.
Atomic and Arithmetic functions
atomicMin()

• Reads the word old located at the address in global or shared memory, computes the minimum of old and val,
and stores the result back to memory at the same address. These three operations are performed in one atomic
transaction. The function returns old.

atomicMax()

• Reads the word old located at the address in global or shared memory, computes the maximum of old and
val, and stores the result back to memory at the same address. These three operations are performed in one
atomic transaction. The function returns old.

Atomic and Arithmetic functions
atomicInc()

• Reads the word old located at the address in global or shared memory, computes ((old >= val) ? 0 : (old+1)),
and stores the result back to memory at the same address. These three operations are performed in one atomic
transaction. The function returns old.

atomicDec()

• Reads the word old located at the address in global or shared memory, computes ((old == 0) || (old > val)) ? val : (old-1), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

A CUDA program to read a string and determine the number of occurrences of the character 'a' in the string using the atomicAdd() function.
#include "cuda_runtime.h“
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <conio.h>
#define N 1024

__global__ void CUDACount(char* A, unsigned int *d_count)


{
int i = threadIdx.x;
if(A[i]==’a’)
atomicAdd(d_count,1);
}

int main() {
    char A[N]; char *d_A;
    unsigned int count = 0, result = 0, *d_count;

    printf("Enter a string: ");
    gets(A);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    cudaMalloc((void**)&d_A, strlen(A)*sizeof(char));
    cudaMalloc((void**)&d_count, sizeof(unsigned int));
    cudaMemcpy(d_A, A, strlen(A)*sizeof(char), cudaMemcpyHostToDevice);
    cudaMemcpy(d_count, &count, sizeof(unsigned int), cudaMemcpyHostToDevice);

    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess) {
        printf("CUDA Error1: %s\n", cudaGetErrorString(error));
    }

    CUDACount<<<1, strlen(A)>>>(d_A, d_count);

    error = cudaGetLastError();
    if (error != cudaSuccess) {
        printf("CUDA Error2: %s\n", cudaGetErrorString(error));
    }

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);

    cudaMemcpy(&result, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    printf("Total occurrences of a = %u\n", result);
    printf("Time Taken = %f\n", elapsedTime);

    cudaFree(d_A);
    cudaFree(d_count);
    printf("\n");
    getch();
    return 0;
}
