
Intro to Parallel Computing and GPGPU for Real-Time Visualization

Áron Samuel Kovács


Structure
Introduction to parallel computing
Basic overview of CUDA
Execution Model, Memory Model
CUDA kernel functions
Performance
Memory Access Patterns
Control Flow Divergence

Introduction to Parallel Computing
Serial computation
A problem is broken into several steps that follow one after another
Only one instruction is executed at any moment in time

Introduction to Parallel Computing
Parallel computation
A problem is broken into several steps and some of them can be run at the same time
Multiple instructions can be executed at the same time

Why Parallel?
We can manufacture processing units and each can do X flops
We need more, so what now?
We can try making better processing units
Or use more of them at once

Amdahl’s law

S = 1 / ((1 - p) + p / s)

S is the theoretical speedup of the whole program
p is the proportion of execution time that can be parallelized
s is the speedup of the part that can be parallelized

For p = 0.5 and s = 4: S = 1 / (0.5 + 0.5/4) = 1.6

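As a quick check of the formula, a minimal host-side sketch (plain C, hypothetical function name):

// Theoretical overall speedup according to Amdahl's law.
// p: fraction of the runtime that can be parallelized, s: speedup of that fraction.
float amdahl_speedup(float p, float s)
{
    return 1.0f / ((1.0f - p) + p / s);
}

// amdahl_speedup(0.5f, 4.0f) evaluates to 1.6: a 4x faster parallel part
// only gives 1.6x overall when half of the program stays serial.
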
Parallelism
Different types of parallelism
Task parallelism
Decomposition into tasks
Data parallelism
Usually almost the same operation on different data

Hardware for Parallel Computing
Multi-core CPUs
GPUs
Specialized hardware
Distributed computing
Cluster computing

GPU performance

GPU bandwidth

Example: Particle Simulation

Example: Molecular Simulation

Example: Machine Learning / Deep Learning
Perfect fit for massively parallel computation

Example: Ray Tracing

Execution Model
Threads
Warps – groups of 32 threads
Blocks – programmable size (block size)
Grid – programmable size (grid size)

Memory Model
[Diagram: each block has its own shared memory]

Functions
__global__
must return void
call with kernelName<<<grid, block[, shared_mem, stream]>>>(params)
__host__
__device__

threadIdx – index of a thread within its block
blockIdx – index of a block within the grid
blockDim – size of a block
gridDim – size of the grid
blockIdx.x * blockDim.x + threadIdx.x – global index of a thread

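A minimal sketch (hypothetical kernel and function names) showing how the qualifiers and built-in variables fit together:

// __device__: callable only from GPU code
__device__ float square(float x) { return x * x; }

// __global__: a kernel; launched from the host and must return void
__global__ void squareAll(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) {
        out[i] = square(in[i]);
    }
}

// __host__: an ordinary CPU function (the default when no qualifier is given)
__host__ void launchSquareAll(const float* in_gpu, float* out_gpu, int n)
{
    int block = 128;
    int grid = (n + block - 1) / block;
    squareAll<<<grid, block>>>(in_gpu, out_gpu, n);
}
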
Workflow
Allocate buffers on GPU
Copy data from CPU to GPU
Run kernel
Copy data from GPU to CPU
Free GPU buffers

Example: Adding Vectors
size_t size = 1024;

float* a = getData(size);
float* b = getData(size);
float* c = (float*)malloc(size * sizeof(float));

for (size_t i = 0; i < size; i++) {
    c[i] = a[i] + b[i];
}

Example: Adding Vectors
int size = 1024;
int nbytes = size * sizeof(float);

float* a = getData(size);
float* b = getData(size);
float* c = (float*)malloc(nbytes);

float* a_gpu;
float* b_gpu;
float* c_gpu;

cudaMalloc(&a_gpu, nbytes);
cudaMalloc(&b_gpu, nbytes);
cudaMalloc(&c_gpu, nbytes);

cudaMemcpy(a_gpu, a, nbytes, cudaMemcpyHostToDevice);
cudaMemcpy(b_gpu, b, nbytes, cudaMemcpyHostToDevice);

Example: Adding Vectors
__global__ void vecAdd(float* a, float* b, float* c, int size)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < size) {
        c[id] = a[id] + b[id];
    }
}

int blockSize = 32;
int gridSize = (size + blockSize - 1) / blockSize;  // round up so every element gets a thread
vecAdd<<<gridSize, blockSize>>>(a_gpu, b_gpu, c_gpu, size);

cudaMemcpy(c, c_gpu, nbytes, cudaMemcpyDeviceToHost);

cudaFree(a_gpu);
cudaFree(b_gpu);
cudaFree(c_gpu);

Example: Reversing a short array
__global__ void reverse(int* arr, int size)
{
    __shared__ int s[128];   // one element per thread; requires size <= 128
    int i = threadIdx.x;
    int ir = size - i - 1;   // mirrored index
    s[i] = arr[i];           // stage the array in shared memory
    __syncthreads();         // wait until the whole block has finished writing
    arr[i] = s[ir];          // write back in reversed order
}

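A launch sketch for the kernel above, assuming arr_gpu already holds the data on the device and size is at most 128 (the size of the shared buffer):

reverse<<<1, size>>>(arr_gpu, size);   // one block, one thread per element
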
Atomic operations
An atomic operation performs a read-modify-write on one 32-bit or 64-bit word residing in global or shared memory
Atomic: it is guaranteed to complete without interference from other threads, but it is much slower than an ordinary access
Examples:
atomicAdd(T* addr, T val)
atomicSub(T* addr, T val)
atomicExch(T* addr, T val)

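A minimal sketch (hypothetical kernel) of where an atomic is needed: many threads update the same counter in global memory.

__global__ void countPositive(const float* a, int n, int* counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && a[i] > 0.0f) {
        // counter[0]++ would be a data race (read-modify-write from many threads)
        atomicAdd(counter, 1);   // safe, but conflicting updates are serialized
    }
}
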
Example: Histogram

__global__ void
histogram(const float* a, int* histogram_bins, const int num_elements, const int num_bins)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < num_elements)
    {
        int bin = which_bin(a[i], num_bins);
        atomicAdd(&histogram_bins[bin], 1);
    }
}

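which_bin is not defined on the slide; a minimal sketch, assuming the input values lie in [0, 1), could look like this (declared before the kernel):

__device__ int which_bin(float x, int num_bins)
{
    int bin = (int)(x * num_bins);           // map [0, 1) onto 0 .. num_bins-1
    return min(max(bin, 0), num_bins - 1);   // clamp values that fall outside the range
}
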
Performance: Shared Memory Access
[Diagram: access request from a warp to the shared memory banks]

Performance: Bank Conflict
[Diagram: access request from a warp in which several threads hit the same bank]

Performance: Broadcast
[Diagram: access request from a warp in which all threads read the same word (broadcast)]

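A minimal sketch (hypothetical kernel, assuming 32 banks of 4-byte words and a single warp of 32 threads) contrasting the three access patterns above:

__global__ void bankPatterns(float* out)
{
    __shared__ float s[32 * 32];
    int t = threadIdx.x;                 // assumes blockDim.x == 32

    // Consecutive threads write consecutive words: different banks, no conflict.
    for (int k = t; k < 32 * 32; k += 32) {
        s[k] = (float)k;
    }
    __syncthreads();

    float conflict = s[t * 32];          // stride 32: all threads hit bank 0, 32-way conflict
    float broadcast = s[0];              // all threads read the same word: broadcast, no conflict

    out[t] = conflict + broadcast;
}
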
Performance: Warp Divergence
if (condition) {
    instruction;
    instruction;
} else {
    instruction;
}

Warp divergence can significantly affect the instruction throughput
Different execution paths within the same warp should be avoided

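A minimal sketch (hypothetical kernels) of the difference: branching on a value that varies inside a warp diverges, branching on a per-warp value does not.

__global__ void diverges(float* a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) {          // even and odd threads sit in the same warp
        a[i] *= 2.0f;          // the warp executes both branches, one after the other
    } else {
        a[i] += 1.0f;
    }
}

__global__ void staysConverged(float* a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0) {   // the condition is identical for all 32 threads of a warp
        a[i] *= 2.0f;
    } else {
        a[i] += 1.0f;
    }
}
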
Performance: Streams, Concurrent Execution
Several operations can run concurrently
Host and device computations
Memory transfers (host-to-device / device-to-host)

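A minimal sketch of putting work into two streams so that a copy and a kernel can overlap (hypothetical kernel and buffer names; the host buffer should be allocated with cudaMallocHost so the asynchronous copy can actually overlap):

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Operations issued to different streams may execute concurrently.
cudaMemcpyAsync(b_gpu, b, nbytes, cudaMemcpyHostToDevice, s1);  // transfer one buffer ...
someKernel<<<gridSize, blockSize, 0, s2>>>(a_gpu);              // ... while a kernel works on another

cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
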
CUDA in Python with Numba
__global__ void vecAdd(float* a, float* b, float* c, int size)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < size) {
        c[id] = a[id] + b[id];
    }
}

from numba import cuda

@cuda.jit
def vec_add(a, b, c):
    i = cuda.grid(1)
    if i < a.size:
        c[i] = a[i] + b[i]

References
CUDA Toolkit Documentation
Programming Guide
Best Practices Guide
CUDA examples
Usually they come with documentation
CUDA by Example: An Introduction to General-Purpose GPU Programming, 2010 (book)
