Intro to Parallel Computing and GPGPU for Real-Time Visualization
Áron Samuel Kovács
Structure
Introduction to parallel computing
Basic overview of CUDA
Execution Model, Memory Model
CUDA kernel functions
Performance
Memory Access Patterns
Control Flow Divergence
Introduction to Parallel Computing
Serial computation
A problem is broken into several steps that follow one after another
Only one instruction is executed at any moment in time
Introduction to Parallel Computing
Parallel computation
A problem is broken into several steps, and some of them can be run at the same time
Multiple instructions can be executed at the same time
Why Parallel?
We can manufacture processing units and each can do X flops
We need more, so what now?
We can try making better processing units
Or use more of them at once
Amdahl’s law
S = 1 / ((1 - p) + p / s)
S is the theoretical speedup of the whole program
p is the proportion of execution time that can be parallelized
s is the speedup of the part that can be parallelized
For p = 0.5, s = 4 – see the worked example below
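Worked example (filling in the numbers above): with p = 0.5 and s = 4, S = 1 / ((1 - 0.5) + 0.5 / 4) = 1 / 0.625 = 1.6, so a fourfold speedup of half the program makes the whole program only 1.6 times faster.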
Parallelism
Different types of parallelism
Task parallelism
Decomposition into tasks
Data parallelism
The same (or nearly the same) operation applied to different data elements
Hardware for Parallel Computing
Multi-core CPUs
GPUs
Specialized hardware
Distributed computing
Cluster computing
GPU performance
GPU bandwidth
Example: Particle Simulation
Example: Molecular Simulation
Example: Machine Learning / Deep Learning
Perfect fit for massively parallel computation
Example: Ray Tracing
Execution Model
Threads (block)
Warps – 32 threads (thread)
Blocks – programmable size
(block size)
Blocks
Grid – programmable size
(grid size)
18
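A small sketch (illustrative, not from the slides) of how the hierarchy is used in practice: the launch configuration picks the block and grid sizes, and each thread derives its global coordinates from the built-in index variables (the buffer out_gpu and the image dimensions are assumed to exist).

// Hypothetical kernel: one thread per element of a width x height image
__global__ void fill(float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row index
    if (x < width && y < height) {
        out[y * width + x] = 1.0f;
    }
}

// 16 x 16 = 256 threads per block (8 warps); enough blocks to cover the image
dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
fill<<<grid, block>>>(out_gpu, width, height);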
Memory Model
Diagram: each block has its own shared memory, shared by the threads of that block
Functions
__global__ – a kernel: runs on the GPU, launched from host code
must return void
call with kernelName<<<grid, block[, shared_mem, stream]>>>(params)
__host__ – runs on the CPU, callable from host code (the default)
__device__ – runs on the GPU, callable only from device code
threadIdx – index of a thread within its block
blockIdx – index of a block within the grid
blockDim – size of a block
gridDim – size of the grid
blockIdx.x * blockDim.x + threadIdx.x – global index of a thread
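A minimal sketch (assumed example, not from the slides) showing the three qualifiers together:

// __device__: runs on the GPU, callable only from GPU code
__device__ float square(float x) { return x * x; }

// __global__: a kernel; runs on the GPU, launched from the host, must return void
__global__ void squareAll(float* data, int size)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (id < size) {
        data[id] = square(data[id]);
    }
}

// __host__: runs on the CPU (the default, usually omitted)
__host__ void runSquareAll(float* data_gpu, int size)
{
    int blockSize = 256;
    int gridSize = (size + blockSize - 1) / blockSize;
    squareAll<<<gridSize, blockSize>>>(data_gpu, size);
}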
Workflow
Allocate buffers on GPU
Copy data from CPU to GPU
Run kernel
Copy data from GPU to CPU
Example: Adding Vectors
size_t size = 1024;
float* a = getData(size);
float* b = getData(size);
float* c = (float*)malloc(size * sizeof(float));
for (size_t i = 0; i < size; i++) {
    c[i] = a[i] + b[i];
}
Example: Adding Vectors
int size = 1024;
int nbytes = size * sizeof(float);
float* a = getData(size);
float* b = getData(size);
float* c = (float*)malloc(nbytes);

float* a_gpu;
float* b_gpu;
float* c_gpu;
cudaMalloc(&a_gpu, nbytes);
cudaMalloc(&b_gpu, nbytes);
cudaMalloc(&c_gpu, nbytes);

cudaMemcpy(a_gpu, a, nbytes, cudaMemcpyHostToDevice);
cudaMemcpy(b_gpu, b, nbytes, cudaMemcpyHostToDevice);
Example: Adding Vectors
__global__ void vecAdd(float* a, float* b, float* c, int size)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < size) {
        c[id] = a[id] + b[id];
    }
}

int blockSize = 32;
int gridSize = (size + blockSize - 1) / blockSize;
vecAdd<<<gridSize, blockSize>>>(a_gpu, b_gpu, c_gpu, size);

cudaMemcpy(c, c_gpu, nbytes, cudaMemcpyDeviceToHost);

cudaFree(a_gpu);
cudaFree(b_gpu);
cudaFree(c_gpu);
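The calls above silently ignore their return values; a common pattern (a sketch using the standard runtime API) is to check them:

cudaError_t err = cudaMemcpy(c, c_gpu, nbytes, cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
}

// Kernel launches are asynchronous and do not return an error code directly
vecAdd<<<gridSize, blockSize>>>(a_gpu, b_gpu, c_gpu, size);
err = cudaGetLastError();           // errors detected at launch time
if (err == cudaSuccess) {
    err = cudaDeviceSynchronize();  // errors detected during execution
}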
Example: Reversing a short array
__global__ void reverse(int* arr, int size)
{
    __shared__ int s[128];   // one block, at most 128 elements
    int i = threadIdx.x;
    int ir = size - i - 1;   // mirrored index
    s[i] = arr[i];
    __syncthreads();         // wait until the whole array is staged in shared memory
    arr[i] = s[ir];
}
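The kernel assumes a single block covering the whole array (size ≤ 128, one thread per element), so a launch would look like this (a sketch; arr and arr_gpu are hypothetical host and device buffers holding the array):

// only valid for size <= 128: one block, one thread per element
reverse<<<1, size>>>(arr_gpu, size);
cudaMemcpy(arr, arr_gpu, size * sizeof(int), cudaMemcpyDeviceToHost);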
Atomic operations
An atomic operation performs a read-modify-write operation on one 32-bit or 64-bit word residing in global or shared memory
Atomic: it is guaranteed to complete without interference from other threads; however, it is much slower than a regular memory access
Examples:
atomicAdd(T* addr, T val)
atomicSub(T* addr, T val)
atomicExch(T* addr, T val)
…
Example: Histogram
__global__ void histogram(const float* a, int* histogram_bins,
                          const int num_elements, const int num_bins)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < num_elements)
    {
        // which_bin (defined elsewhere) maps a value to its bin index
        int bin = which_bin(a[i], num_bins);
        // many threads may hit the same bin, so the increment must be atomic
        atomicAdd(&histogram_bins[bin], 1);
    }
}
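Because the kernel only ever increments counters, the bins must be zeroed before the launch; a sketch (with hypothetical buffer names) using cudaMemset:

// zero the bin counters on the GPU before accumulating into them
cudaMemset(histogram_bins_gpu, 0, num_bins * sizeof(int));
histogram<<<gridSize, blockSize>>>(a_gpu, histogram_bins_gpu, num_elements, num_bins);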
Performance: Shared Memory Access
Access request from a warp to shared memory
Shared memory is divided into banks; requests that fall into different banks are served in parallel
Performance: Bank Conflict
Access request from a warp to shared memory
A bank conflict: multiple threads of the warp access different addresses in the same bank, and the accesses are serialized
Performance: Broadcast
Access request from a warp to shared memory
Broadcast: all threads of the warp read the same address, so the value is delivered to every thread in a single transaction
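A classic illustration of the cases above is a 32 x 32 shared-memory tile read by column, e.g. in a transpose (a sketch, not from the slides; launched with a 32 x 32 thread block): without padding, every thread of a warp hits the same bank, while one element of padding per row spreads the accesses over all banks.

__global__ void transposeTile(const float* in, float* out)
{
    // Without the +1 padding, reading tile[x][y] below makes the 32 threads of a
    // warp access addresses exactly 32 floats apart, i.e. the same bank
    // (a 32-way bank conflict). The padding shifts each row by one bank.
    __shared__ float tile[32][32 + 1];

    int x = threadIdx.x;
    int y = threadIdx.y;

    tile[y][x] = in[y * 32 + x];    // row-wise store: conflict-free
    __syncthreads();
    out[y * 32 + x] = tile[x][y];   // column-wise read: conflict-free only with padding
}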
Performance: Warp Divergence
if (condition){
instruction;
instruction;
} else {
instruction;
}
Warp divergence can significantly reduce instruction throughput
Different execution paths within the same warp should be avoided
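A small sketch (assumed example) of when a branch actually diverges: a condition that varies within a warp splits the warp into serialized passes, while a condition that is uniform across the warp does not.

__global__ void divergenceExample(float* data)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    // Diverges: even and odd lanes of the same warp take different paths,
    // so the warp executes both branches one after the other
    if (threadIdx.x % 2 == 0) {
        data[id] *= 2.0f;
    } else {
        data[id] += 1.0f;
    }

    // Does not diverge: the condition is identical for every thread of a warp,
    // since all threads of a warp belong to the same block
    if (blockIdx.x % 2 == 0) {
        data[id] -= 0.5f;
    }
}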
Performance: Streams, Concurrent Execution
Several operations can execute concurrently
Host and device computations
Memory transfers (host-to-device / device-to-host)
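A sketch of overlapping transfers and kernels with two streams (kernel and buffer names are hypothetical; asynchronous copies additionally need page-locked host memory, allocated here with cudaMallocHost):

cudaStream_t stream0, stream1;
cudaStreamCreate(&stream0);
cudaStreamCreate(&stream1);

// page-locked (pinned) host buffers so cudaMemcpyAsync can run asynchronously
float *h_a, *h_b;
cudaMallocHost((void**)&h_a, nbytes);
cudaMallocHost((void**)&h_b, nbytes);
// ... fill h_a and h_b ...

// work queued in different streams may overlap:
// a copy in one stream can run while a kernel runs in the other
cudaMemcpyAsync(a_gpu, h_a, nbytes, cudaMemcpyHostToDevice, stream0);
cudaMemcpyAsync(b_gpu, h_b, nbytes, cudaMemcpyHostToDevice, stream1);
processA<<<gridSize, blockSize, 0, stream0>>>(a_gpu, size);
processB<<<gridSize, blockSize, 0, stream1>>>(b_gpu, size);

// wait for everything queued in both streams to finish
cudaStreamSynchronize(stream0);
cudaStreamSynchronize(stream1);
cudaStreamDestroy(stream0);
cudaStreamDestroy(stream1);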
CUDA in Python with Numba
__global__ void vecAdd(float* a, float* b, float* c, int size)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < size) {
        c[id] = a[id] + b[id];
    }
}

@cuda.jit
def vec_add(a, b, c):
    i = cuda.grid(1)   # global thread index, like blockIdx.x * blockDim.x + threadIdx.x
    if i < a.size:
        c[i] = a[i] + b[i]
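A sketch of how the Numba kernel above would be driven from the host (assuming NumPy input arrays; the explicit-transfer helpers shown are part of numba.cuda):

import numpy as np
from numba import cuda

size = 1024
a = np.random.rand(size).astype(np.float32)
b = np.random.rand(size).astype(np.float32)

# explicit transfers (Numba would otherwise copy implicitly on each call)
a_gpu = cuda.to_device(a)
b_gpu = cuda.to_device(b)
c_gpu = cuda.device_array(size, dtype=np.float32)

block_size = 32
grid_size = (size + block_size - 1) // block_size
vec_add[grid_size, block_size](a_gpu, b_gpu, c_gpu)

c = c_gpu.copy_to_host()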
References
CUDA Toolkit Documentation
Programming Guide
Best Practices Guide
CUDA examples
Usually they come with documentation
CUDA by Example: An Introduction to General-Purpose GPU Programming, 2010 (book)