GPU Programming Basics - Slides
Prof. Marco Bertini

CUDA: atomic operations, privatization, algorithms
Atomic operations
• The basic atomic operation in hardware is a read-modify-write operation performed by a single hardware instruction on a memory location address
• Read the old value, calculate a new value, and write the new value to the location
atomicAdd
• int atomicAdd(int* address, int val);
• reads the 32-bit word old from the location pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.
• unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val);
• the same operation on a 64-bit word
atomicCAS
• int atomicCAS(int* address, int compare, int val);
• reads the 32-bit or 64-bit word old located at the address address in global or shared memory, computes (old == compare ? val : old), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old (Compare And Swap).
• Note that any atomic operation can be implemented based on atomicCAS() (Compare And Swap). For example, atomicAdd() for double-precision floating-point numbers is not available on devices with compute capability lower than 6.0, but it can be implemented as follows:
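A sketch of that implementation, following the CUDA C Programming Guide (renamed atomicAddDouble here to avoid clashing with the built-in atomicAdd on newer devices):

```cuda
// double-precision atomicAdd built on atomicCAS, as in the
// CUDA C Programming Guide (for compute capability < 6.0)
__device__ double atomicAddDouble(double* address, double val) {
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Swap in the new sum only if the value has not changed since we read it
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);   // retry if another thread intervened
    return __longlong_as_double(old);
}
```

The loop re-reads and retries whenever another thread updated the location between the read and the CAS, which is the general recipe for building any atomic operation out of atomicCAS().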
// Histogram: each thread strides through the input, accumulating
// into the block's privatized bins (bins_s) in shared memory
for (unsigned int i = tid; i < num_elements; i += blockDim.x * gridDim.x) {
    atomicAdd(&(bins_s[(unsigned int)input[i]]), 1);
}
__syncthreads();
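For context, a minimal complete privatized histogram kernel built around this loop might look as follows (a sketch assuming 256 bins, one per unsigned char value; the kernel and variable names are illustrative):

```cuda
#include <cuda_runtime.h>

#define NUM_BINS 256   // one bin per possible unsigned char value (assumption)

// Privatized histogram: each block accumulates into its own shared-memory
// copy of the histogram, then merges that copy into the global histogram.
__global__ void histo_privatized_kernel(unsigned char *input,
                                        unsigned int num_elements,
                                        unsigned int *bins) {
    __shared__ unsigned int bins_s[NUM_BINS];
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Initialize the private bins cooperatively.
    for (unsigned int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        bins_s[b] = 0;
    __syncthreads();

    // Interleaved (strided) accumulation into the private copy.
    for (unsigned int i = tid; i < num_elements; i += blockDim.x * gridDim.x)
        atomicAdd(&(bins_s[(unsigned int)input[i]]), 1);
    __syncthreads();

    // Merge the private copy into the global histogram.
    for (unsigned int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&(bins[b]), bins_s[b]);
}
```

Privatization trades one global atomic per input element for much cheaper shared-memory atomics, plus NUM_BINS global atomics per block at the end.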
Interleaved Partitioning of Input
• For coalescing and better memory access
performance
A stride algorithm

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // stride is the total number of threads launched
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd(&(histo[buffer[i]]), 1);
        i += stride;
    }
}
A stride algorithm

Calculates a stride value, which is the total number of threads launched during kernel invocation (blockDim.x*gridDim.x). In the first iteration of the while loop, each thread indexes the input buffer using its global thread index: Thread 0 accesses element 0, Thread 1 accesses element 1, etc. Thus, all threads jointly process the first blockDim.x*gridDim.x elements of the input buffer.
The while loop controls the iterations for each thread. When the index of a thread exceeds the
valid range of the input buffer (i is greater than or equal to size), the thread has completed
processing its partition and will exit the loop.
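Because the stride loop covers any remaining elements, the grid can be sized independently of the input length on the host side. A sketch of such a launch (the block and grid counts below are illustrative assumptions):

```cuda
#include <cuda_runtime.h>

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo);

// Illustrative launch: fewer total threads than input elements is fine,
// since each thread strides over several elements.
void run_histogram(unsigned char *d_buffer, long size, unsigned int *d_histo) {
    int block_size = 256;   // threads per block (assumption)
    int num_blocks = 120;   // e.g. a small multiple of the SM count (assumption)
    histo_kernel<<<num_blocks, block_size>>>(d_buffer, size, d_histo);
    cudaDeviceSynchronize();
}
```

Keeping the grid size bounded this way limits contention on the global histogram bins while still saturating the GPU.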
CUDA: parallel patterns - convolution
Convolution (stencil)
• An array operation where each output data element
is a weighted sum of a collection of neighboring
input elements
• Often performed as a filter that transforms signal or pixel values into more desirable values
• The value pattern of the mask array elements defines the type of filtering done
1D Convolution Example

• Mask size is usually an odd number of elements for symmetry (5 in this example)
• Input N = [1, 2, 3, 4, 5, 6, 7], mask M = [3, 4, 5, 4, 3]
• P[2] = 1*3 + 2*4 + 3*5 + 4*4 + 5*3 = 3 + 8 + 15 + 16 + 15 = 57
• Near the boundary, missing input elements ("ghost" cells) are filled in as 0: P[1] = 0*3 + 1*4 + 2*5 + 3*4 + 4*3 = 0 + 4 + 10 + 12 + 12 = 38
A 1D Convolution Kernel with Boundary Condition Handling

__global__ void convolution_1D_basic_kernel(float *N, float *M,
                                            float *P, int Mask_Width, int Width) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    float Pvalue = 0;
    int N_start_point = i - (Mask_Width/2);
    for (int j = 0; j < Mask_Width; j++) {
        if (N_start_point + j >= 0 && N_start_point + j < Width) {
            Pvalue += N[N_start_point + j] * M[j];
        }
    }
    P[i] = Pvalue;
}

• This kernel forces all elements outside the valid input range to 0
2D Convolution
2D Convolution – Ghost Cells

__global__
void convolution_2D_basic_kernel(unsigned char * in, unsigned char * mask, unsigned char * out,
                                 int maskwidth, int w, int h) {
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    if (Col < w && Row < h) {
        int pixVal = 0;
        int N_start_col = Col - (maskwidth/2);
        int N_start_row = Row - (maskwidth/2);
        // Accumulate the weighted sum, skipping ghost cells outside the image
        for (int j = 0; j < maskwidth; ++j) {
            for (int k = 0; k < maskwidth; ++k) {
                int curRow = N_start_row + j;
                int curCol = N_start_col + k;
                if (curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
                    pixVal += in[curRow * w + curCol] * mask[j * maskwidth + k];
                }
            }
        }
        out[Row * w + Col] = (unsigned char)(pixVal);
    }
}
#define MAX_MASK_WIDTH 10
__constant__ float M[MAX_MASK_WIDTH];   // global variable, in constant memory
Convolution with constant memory

__global__ void convolution_1D_basic_kernel(float *N, float *P,
                                            int Mask_Width, int Width) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    float Pvalue = 0;
    int N_start_point = i - (Mask_Width/2);
    for (int j = 0; j < Mask_Width; j++) {
        if (N_start_point + j >= 0 && N_start_point + j < Width) {
            Pvalue += N[N_start_point + j] * M[j];  // M is read from constant memory
        }
    }
    P[i] = Pvalue;
}

• 2 floating-point operations per global memory access (N)
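On the host side, the mask contents are copied into the constant memory symbol M with cudaMemcpyToSymbol before the kernel launch (a sketch; h_M is an illustrative name):

```cuda
#include <cuda_runtime.h>

#define MAX_MASK_WIDTH 10
__constant__ float M[MAX_MASK_WIDTH];   // as declared in the slides

// Copy the host-side mask h_M into the constant memory symbol M.
void setup_mask(const float *h_M, int Mask_Width) {
    cudaMemcpyToSymbol(M, h_M, Mask_Width * sizeof(float));
}
```

Constant memory is cached and read-only from kernels; since all threads in a warp read the same M[j] at the same time, the accesses are served from the constant cache rather than global memory.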
CUDA: parallel patterns - convolution & tiling
Tiling & convolution
• Calculation of adjacent output elements involves shared input elements
• E.g., N[2] is used in the calculation of P[0], P[1], P[2], P[3], and P[4], assuming a 1D convolution Mask_Width of width 5
Definition – output tile
• An output tile is the group of O_TILE_WIDTH adjacent elements of P computed by one thread block
• The corresponding input tile is the set of N elements needed to compute it: O_TILE_WIDTH + Mask_Width - 1 elements, loaded into shared memory as Ns (N[2] through N[9] in the figure)
• Design 1: The size of each thread block matches the size of an output tile
• Some threads need to load more than one input element into the shared memory
• Design 2: The size of each thread block matches the size of an input tile
• Each thread loads one input element into the shared memory
Thread to Input and Output Data Mapping
• index_i = index_o - n
• where n is Mask_Width/2
• n is 2 in this example
Loading input tiles

All threads participate:

if ((index_i >= 0) && (index_i < Width)) {
    Ns[tx] = N[index_i];
} else {
    Ns[tx] = 0.0f;  // ghost cell
}
Calculating output

• Some threads do not participate: only Threads 0 through O_TILE_WIDTH-1 participate in the calculation of output.

if (threadIdx.x < O_TILE_WIDTH) {
    float output = 0.0f;
    for (int j = 0; j < MASK_WIDTH; j++) {
        output += M[j] * Ns[j + threadIdx.x];
    }
    P[index_o] = output;
}
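Putting the loading and calculation steps together, a sketch of the complete tiled 1D convolution kernel under Design 2 (M in constant memory; the kernel name is illustrative):

```cuda
#include <cuda_runtime.h>

#define MASK_WIDTH 5
#define O_TILE_WIDTH 1020

__constant__ float M[MASK_WIDTH];

// Tiled 1D convolution: each block computes O_TILE_WIDTH outputs after
// loading O_TILE_WIDTH + MASK_WIDTH - 1 inputs into shared memory.
// Launch with block size O_TILE_WIDTH + MASK_WIDTH - 1.
__global__ void convolution_1D_tiled_kernel(float *N, float *P, int Width) {
    __shared__ float Ns[O_TILE_WIDTH + MASK_WIDTH - 1];
    int tx = threadIdx.x;
    int index_o = blockIdx.x * O_TILE_WIDTH + tx;   // output index
    int index_i = index_o - MASK_WIDTH / 2;         // shifted input index

    // All threads load one input element each (ghost cells become 0).
    if ((index_i >= 0) && (index_i < Width))
        Ns[tx] = N[index_i];
    else
        Ns[tx] = 0.0f;
    __syncthreads();

    // Only the first O_TILE_WIDTH threads compute outputs.
    if (tx < O_TILE_WIDTH && index_o < Width) {
        float output = 0.0f;
        for (int j = 0; j < MASK_WIDTH; j++)
            output += M[j] * Ns[tx + j];
        P[index_o] = output;
    }
}
```

Each input element is now read from global memory once per tile instead of up to MASK_WIDTH times, which is the source of the bandwidth reduction quantified below.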
Setting Block Size

#define O_TILE_WIDTH 1020
#define BLOCK_WIDTH (O_TILE_WIDTH + MASK_WIDTH - 1)

dim3 dimBlock(BLOCK_WIDTH, 1, 1);
dim3 dimGrid((Width-1)/O_TILE_WIDTH+1, 1, 1);

• The block size equals the input tile width: 1020 + 5 - 1 = 1024 threads for MASK_WIDTH = 5
Bandwidth Reduction for 1D
• Without tiling, each output element reads MASK_WIDTH input elements from global memory, for a total of MASK_WIDTH * O_TILE_WIDTH accesses per tile region
• E.g., counting how many outputs read each input for O_TILE_WIDTH = 8 and MASK_WIDTH = 5 (boundary elements are read fewer times): 1+2+3+4 + 5*(8-5+1) + 4+3+2+1 = 10+20+10 = 40
• With tiling, each input element is loaded into shared memory once: O_TILE_WIDTH + MASK_WIDTH - 1 global accesses per tile
• Reduction ratio: MASK_WIDTH * O_TILE_WIDTH / (O_TILE_WIDTH + MASK_WIDTH - 1)
Bandwidth Reduction for 2D
• The reduction ratio is:
  O_TILE_WIDTH^2 * MASK_WIDTH^2 / (O_TILE_WIDTH + MASK_WIDTH - 1)^2

O_TILE_WIDTH       8      16     32     64
MASK_WIDTH = 5    11.1    16    19.7   22.1