Intro to Parallel Computing and GPGPU for Real-Time Visualization
Áron Samuel Kovács
Structure
Introduction to parallel computing
Basic overview of CUDA
Execution Model, Memory Model
CUDA kernel functions
Performance
Memory Access Patterns
Control Flow Divergence
Introduction to Parallel Computing
Serial computation
A problem is broken into several steps that follow one after another
Only one instruction is executed at any moment in time
Introduction to Parallel Computing
Parallel computation
A problem is broken into several steps, and some of them can be run at the same time
Multiple instructions can be executed at the same time
Why Parallel?
We can manufacture processing units and each can do X flops
We need more, so what now?
We can try making better processing units
Or use more of them at once
Amdahl’s law
S = 1 / ((1 - p) + p / s)
S is the theoretical speedup of the whole program
p is the proportion of execution time that can be parallelized
s is the speedup of the part that can be parallelized
For p = 0.5, s = 4 – see the worked example below
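Worked example (filling in the numbers above): with p = 0.5 and s = 4, S = 1 / ((1 - 0.5) + 0.5 / 4) = 1 / 0.625 = 1.6, so a fourfold speedup of half the program makes the whole program only 1.6 times faster.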
Parallelism
Different types of parallelism
Task parallelism
Decomposition into tasks
Data parallelism
The same (or nearly the same) operation applied to different data elements
Hardware for Parallel Computing
Multi-core CPUs
GPUs
Specialized hardware
Distributed computing
Cluster computing
GPU performance
GPU bandwidth
Example: Particle Simulation
Example: Molecular Simulation
Example: Machine Learning / Deep Learning
Perfect fit for massively parallel computation
Example: Ray Tracing
Execution Model
Threads (block)
Warps – 32 threads (thread)
Blocks – programmable size
(block size)
Blocks
Grid – programmable size
(grid size)
18
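A small sketch (illustrative, not from the slides) of how the hierarchy is used in practice: the launch configuration picks the block and grid sizes, and each thread derives its global coordinates from the built-in index variables (the buffer out_gpu and the image dimensions are assumed to exist).

// Hypothetical kernel: one thread per element of a width x height image
__global__ void fill(float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row index
    if (x < width && y < height) {
        out[y * width + x] = 1.0f;
    }
}

// 16 x 16 = 256 threads per block (8 warps); enough blocks to cover the image
dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
fill<<<grid, block>>>(out_gpu, width, height);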
Memory Model
Diagram: each block has its own shared memory, shared by the threads of that block
Functions
__global__ – a kernel: runs on the GPU, launched from host code
must return void
call with kernelName<<<grid, block[, shared_mem, stream]>>>(params)
__host__ – runs on the CPU, callable from host code (the default)
__device__ – runs on the GPU, callable only from device code
threadIdx – index of a thread within its block
blockIdx – index of a block within the grid
blockDim – size of a block
gridDim – size of the grid
blockIdx.x * blockDim.x + threadIdx.x – global index of a thread
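A minimal sketch (assumed example, not from the slides) showing the three qualifiers together:

// __device__: runs on the GPU, callable only from GPU code
__device__ float square(float x) { return x * x; }

// __global__: a kernel; runs on the GPU, launched from the host, must return void
__global__ void squareAll(float* data, int size)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (id < size) {
        data[id] = square(data[id]);
    }
}

// __host__: runs on the CPU (the default, usually omitted)
__host__ void runSquareAll(float* data_gpu, int size)
{
    int blockSize = 256;
    int gridSize = (size + blockSize - 1) / blockSize;
    squareAll<<<gridSize, blockSize>>>(data_gpu, size);
}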
Workflow
Allocate buffers on GPU
Copy data from CPU to GPU
Run kernel
Copy data from GPU to CPU
Example: Adding Vectors
size_t size = 1024;
float* a = getData(size);
float* b = getData(size);
float* c = (float*)malloc(size * sizeof(float));
for (size_t i = 0; i < size; i++) {
    c[i] = a[i] + b[i];
}
Example: Adding Vectors
int size = 1024;
int nbytes = size * sizeof(float);
float* a = getData(size);
float* b = getData(size);
float* c = (float*)malloc(nbytes);

float* a_gpu;
float* b_gpu;
float* c_gpu;
cudaMalloc(&a_gpu, nbytes);
cudaMalloc(&b_gpu, nbytes);
cudaMalloc(&c_gpu, nbytes);

cudaMemcpy(a_gpu, a, nbytes, cudaMemcpyHostToDevice);
cudaMemcpy(b_gpu, b, nbytes, cudaMemcpyHostToDevice);
Example: Adding Vectors
__global__ void vecAdd(float* a, float* b, float* c, int size)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < size) {
        c[id] = a[id] + b[id];
    }
}

int blockSize = 32;
int gridSize = (size + blockSize - 1) / blockSize;
vecAdd<<<gridSize, blockSize>>>(a_gpu, b_gpu, c_gpu, size);

cudaMemcpy(c, c_gpu, nbytes, cudaMemcpyDeviceToHost);

cudaFree(a_gpu);
cudaFree(b_gpu);
cudaFree(c_gpu);
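The calls above silently ignore their return values; a common pattern (a sketch using the standard runtime API) is to check them:

cudaError_t err = cudaMemcpy(c, c_gpu, nbytes, cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
}

// Kernel launches are asynchronous and do not return an error code directly
vecAdd<<<gridSize, blockSize>>>(a_gpu, b_gpu, c_gpu, size);
err = cudaGetLastError();           // errors detected at launch time
if (err == cudaSuccess) {
    err = cudaDeviceSynchronize();  // errors detected during execution
}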
Example: Reversing a short array
__global__ void reverse(int* arr, int size)
{
    __shared__ int s[128];   // one block, at most 128 elements
    int i = threadIdx.x;
    int ir = size - i - 1;   // mirrored index
    s[i] = arr[i];
    __syncthreads();         // wait until the whole array is staged in shared memory
    arr[i] = s[ir];
}
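The kernel assumes a single block covering the whole array (size ≤ 128, one thread per element), so a launch would look like this (a sketch; arr and arr_gpu are hypothetical host and device buffers holding the array):

// only valid for size <= 128: one block, one thread per element
reverse<<<1, size>>>(arr_gpu, size);
cudaMemcpy(arr, arr_gpu, size * sizeof(int), cudaMemcpyDeviceToHost);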
Atomic operations
An atomic operation performs a read-modify-write operation on one 32-bit or 64-bit word residing in global or shared memory
Atomic: it is guaranteed to complete without interference from other threads; however, it is much slower than a regular memory access
Examples:
atomicAdd(T* addr, T val)
atomicSub(T* addr, T val)
atomicExch(T* addr, T val)
…
Example: Histogram
__global__ void histogram(const float* a, int* histogram_bins,
                          const int num_elements, const int num_bins)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < num_elements)
    {
        // which_bin (defined elsewhere) maps a value to its bin index
        int bin = which_bin(a[i], num_bins);
        // many threads may hit the same bin, so the increment must be atomic
        atomicAdd(&histogram_bins[bin], 1);
    }
}
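Because the kernel only ever increments counters, the bins must be zeroed before the launch; a sketch (with hypothetical buffer names) using cudaMemset:

// zero the bin counters on the GPU before accumulating into them
cudaMemset(histogram_bins_gpu, 0, num_bins * sizeof(int));
histogram<<<gridSize, blockSize>>>(a_gpu, histogram_bins_gpu, num_elements, num_bins);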
Performance: Shared Memory Access
Access request from a warp to shared memory
Shared memory is divided into banks; requests that fall into different banks are served in parallel
Performance: Bank Conflict
Access request from a warp to shared memory
A bank conflict: multiple threads of the warp access different addresses in the same bank, and the accesses are serialized
Performance: Broadcast
Access request from a warp to shared memory
Broadcast: all threads of the warp read the same address, so the value is delivered to every thread in a single transaction
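A classic illustration of the cases above is a 32 x 32 shared-memory tile read by column, e.g. in a transpose (a sketch, not from the slides; launched with a 32 x 32 thread block): without padding, every thread of a warp hits the same bank, while one element of padding per row spreads the accesses over all banks.

__global__ void transposeTile(const float* in, float* out)
{
    // Without the +1 padding, reading tile[x][y] below makes the 32 threads of a
    // warp access addresses exactly 32 floats apart, i.e. the same bank
    // (a 32-way bank conflict). The padding shifts each row by one bank.
    __shared__ float tile[32][32 + 1];

    int x = threadIdx.x;
    int y = threadIdx.y;

    tile[y][x] = in[y * 32 + x];    // row-wise store: conflict-free
    __syncthreads();
    out[y * 32 + x] = tile[x][y];   // column-wise read: conflict-free only with padding
}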
Performance: Warp Divergence
if (condition){
instruction;
instruction;
} else {
instruction;
}
Warp divergence can significantly reduce instruction throughput
Different execution paths within the same warp should be avoided
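A small sketch (assumed example) of when a branch actually diverges: a condition that varies within a warp splits the warp into serialized passes, while a condition that is uniform across the warp does not.

__global__ void divergenceExample(float* data)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    // Diverges: even and odd lanes of the same warp take different paths,
    // so the warp executes both branches one after the other
    if (threadIdx.x % 2 == 0) {
        data[id] *= 2.0f;
    } else {
        data[id] += 1.0f;
    }

    // Does not diverge: the condition is identical for every thread of a warp,
    // since all threads of a warp belong to the same block
    if (blockIdx.x % 2 == 0) {
        data[id] -= 0.5f;
    }
}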
Performance: Streams, Concurrent Execution
Several operations can execute concurrently
Host and device computations
Memory transfers (host-to-device / device-to-host)
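A sketch of overlapping transfers and kernels with two streams (kernel and buffer names are hypothetical; asynchronous copies additionally need page-locked host memory, allocated here with cudaMallocHost):

cudaStream_t stream0, stream1;
cudaStreamCreate(&stream0);
cudaStreamCreate(&stream1);

// page-locked (pinned) host buffers so cudaMemcpyAsync can run asynchronously
float *h_a, *h_b;
cudaMallocHost((void**)&h_a, nbytes);
cudaMallocHost((void**)&h_b, nbytes);
// ... fill h_a and h_b ...

// work queued in different streams may overlap:
// a copy in one stream can run while a kernel runs in the other
cudaMemcpyAsync(a_gpu, h_a, nbytes, cudaMemcpyHostToDevice, stream0);
cudaMemcpyAsync(b_gpu, h_b, nbytes, cudaMemcpyHostToDevice, stream1);
processA<<<gridSize, blockSize, 0, stream0>>>(a_gpu, size);
processB<<<gridSize, blockSize, 0, stream1>>>(b_gpu, size);

// wait for everything queued in both streams to finish
cudaStreamSynchronize(stream0);
cudaStreamSynchronize(stream1);
cudaStreamDestroy(stream0);
cudaStreamDestroy(stream1);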
CUDA in Python with Numba
__global__ void vecAdd(float* a, float* b, float* c, int size)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < size) {
        c[id] = a[id] + b[id];
    }
}

@cuda.jit
def vec_add(a, b, c):
    i = cuda.grid(1)   # global thread index, like blockIdx.x * blockDim.x + threadIdx.x
    if i < a.size:
        c[i] = a[i] + b[i]
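A sketch of how the Numba kernel above would be driven from the host (assuming NumPy input arrays; the explicit-transfer helpers shown are part of numba.cuda):

import numpy as np
from numba import cuda

size = 1024
a = np.random.rand(size).astype(np.float32)
b = np.random.rand(size).astype(np.float32)

# explicit transfers (Numba would otherwise copy implicitly on each call)
a_gpu = cuda.to_device(a)
b_gpu = cuda.to_device(b)
c_gpu = cuda.device_array(size, dtype=np.float32)

block_size = 32
grid_size = (size + block_size - 1) // block_size
vec_add[grid_size, block_size](a_gpu, b_gpu, c_gpu)

c = c_gpu.copy_to_host()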
References
CUDA Toolkit Documentation
Programming Guide
Best Practices Guide
CUDA examples
Usually they come with documentation
CUDA by Example: An Introduction to General-Purpose GPU Programming, 2010 (book)