21.L18 Intro To GPU and CUDA C
CUDA C/C++
Slides adapted by Dr Sparsh Mittal
• CUDA C/C++
– Based on industry-standard C/C++
– Small set of extensions to enable heterogeneous
programming
– Straightforward APIs to manage devices, memory etc.
[Figure: the host (CPU) and the device (GPU).]
GPU vs. CPU: "The Tradeoff"
• CPU: optimizes for LATENCY
• GPU: optimizes for THROUGHPUT
[Figure: evolution from a CPU to a GPU — a single core with memory, pre-fetcher, data cache and one execute unit; then the same core with several execution units (EXE); then a GPU-style design in which multiple thread groups, each pairing register files (RF) with EXE units, share the pre-fetcher, data cache and memory.]
#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

// parallel fn: the stencil kernel is elided here (a sketch follows after this listing);
// inside it, threads synchronize with
//     // Synchronize (ensure all the data is available)
//     __syncthreads();

int main(void) {
    int *in, *out;       // host copies
    int *d_in, *d_out;   // device copies
    int size = (N + 2*RADIUS) * sizeof(int);

    // serial code: allocate and initialize the host arrays (elided on the slide)

    // Alloc space for device copies
    cudaMalloc((void **)&d_in, size);
    cudaMalloc((void **)&d_out, size);

    // Copy to device
    cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // parallel code: launch the stencil kernel on the GPU (elided on the slide)

    // Copy result back to host
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    // Cleanup (serial code)
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
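For reference, a sketch of what the elided stencil kernel ("parallel fn") typically looks like, assuming the launch offsets the pointers by RADIUS (e.g. stencil_1d<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS)). This is an assumption based on the constants above, not code taken from the slide:

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS]     = in[gindex - RADIUS];      // left halo
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];  // right halo
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}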
Simple Processing Flow
[Figure, repeated over three steps: data moves between CPU and GPU memory across the PCI bus — copy input from the CPU to the GPU, run the GPU program, copy the results back to the CPU.]
#include <stdio.h>

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

Output:
$ nvcc hello.cu
$ a.out
Hello World!
$
Addition on the Device
• A simple kernel to add two integers
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
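Host code to invoke this kernel might look like the following (a sketch, assuming single integers copied to and from the device; not the slides' exact listing):

#include <stdio.h>

int main(void) {
    int a = 2, b = 7, c;     // host copies
    int *d_a, *d_b, *d_c;    // device copies
    int size = sizeof(int);

    // Allocate space for device copies and copy the inputs to the device
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() on the GPU with one block of one thread
    add<<<1, 1>>>(d_a, d_b, d_c);

    // Copy the result back to the host and clean up
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
    printf("%d + %d = %d\n", a, b, c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}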
Blocks and threads are 3D
• Thread and block indices can have up to three dimensions (x, y, z)
• So far we showed only one block
• Built-in variables:
  – threadIdx
  – blockIdx
  – gridDim
[Figure: a grid of blocks — Block (0,1,0), Block (1,1,0), Block (2,1,0) along the x dimension; inside Block (1,1,0), threads Thread (0,0,0) through Thread (4,2,0) laid out in x and y.]
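A small illustration of these built-in variables in a 2D kernel (a sketch; blockDim, which gives the block's dimensions, is also a built-in even though it is not listed above, and the kernel name and parameters are illustrative):

// Each thread computes its own (x, y) position in the grid and
// touches exactly one element of a width x height array.
__global__ void scale2d(float *data, int width, int height, float factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)
        data[y * width + x] *= factor;
}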
Parallel computing using
BLOCKS
Moving from Scalar to Parallel
• GPU computing is about massive parallelism
  – So how do we run code in parallel on the device?
• Instead of executing add() once:
    add<<< 1, 1 >>>();
  execute it N times in parallel, once per block:
    add<<< N, 1 >>>();
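With N blocks of one thread each, every block can handle one array element by indexing with blockIdx.x. A sketch of add() rewritten for this launch (hedged; the deck's exact listing is not reproduced on these lines):

// Each of the N blocks (one thread per block) adds one pair of elements
__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

// launched from the host as: add<<<N, 1>>>(d_a, d_b, d_c);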
[Figure: indexing arrays with blocks and threads — with M = 8 threads per block, each block covers 8 consecutive elements (indices 0–7); the highlighted thread has threadIdx.x = 5 in the block with blockIdx.x = 2, so its global index is threadIdx.x + blockIdx.x * M = 5 + 2*8 = 21.]
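In code, the figure's index calculation, using the built-in blockDim.x as the threads-per-block count M (a sketch):

__global__ void add(int *a, int *b, int *c) {
    // blockDim.x = number of threads per block (M = 8 in the figure)
    int index = threadIdx.x + blockIdx.x * blockDim.x;   // 5 + 2*8 = 21 for the highlighted thread
    c[index] = a[index] + b[index];
}

// launched as: add<<<N / M, M>>>(d_a, d_b, d_c);   // assuming N is a multiple of M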
Local (Private) Address Space
Each thread has its own “local memory”
• cudaMalloc allocates global memory
  – Slow
Let's take the example of Matrix Transpose
Input:      Transpose:
  1 2         1 3
  3 4         2 4
Matrix Transpose
__global__ void transpose(float *odata, float *idata, int width, int height) {
    // TILE_DIM: tile/block width (e.g. 16); each thread moves one element
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in  = xIndex + width * yIndex;    // read position (row-major)
    int index_out = yIndex + height * xIndex;   // write position in the transposed matrix
    odata[index_out] = idata[index_in];
}
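A host-side launch for this kernel might look like the following (a sketch; TILE_DIM, the matrix size, and the grid/block configuration are assumptions, not taken from the slides):

#define TILE_DIM 16

int main(void) {
    const int width = 64, height = 64;          // assumed multiples of TILE_DIM
    size_t bytes = width * height * sizeof(float);

    float *d_idata, *d_odata;
    cudaMalloc((void **)&d_idata, bytes);
    cudaMalloc((void **)&d_odata, bytes);
    // ... copy the input matrix into d_idata with cudaMemcpy ...

    // one TILE_DIM x TILE_DIM block of threads per tile of the matrix
    dim3 block(TILE_DIM, TILE_DIM);
    dim3 grid(width / TILE_DIM, height / TILE_DIM);
    transpose<<<grid, block>>>(d_odata, d_idata, width, height);
    cudaDeviceSynchronize();   // wait for the kernel before using d_odata

    // ... copy d_odata back to the host, then cudaFree both buffers ...
    return 0;
}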
Analogy: Institute and Dept. Library
[Figure: CSE students fetching books directly from the Institute Library — long latency. If the needed books are first brought into the department library, the students access them with short latency.]
Similarly: Global and Shared Memory
[Figure: threads reading directly from global memory — long latency. Data staged from global memory into shared memory is then accessed by the threads with short latency.]
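In code, the same staging idea applied to the transpose example from earlier (a sketch assuming TILE_DIM = 16 and matrix dimensions that are multiples of TILE_DIM; not necessarily the version used on the slides):

#define TILE_DIM 16

__global__ void transposeShared(float *odata, float *idata, int width, int height) {
    __shared__ float tile[TILE_DIM][TILE_DIM];

    // Stage one tile from global memory into shared memory
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = idata[y * width + x];

    __syncthreads();   // wait until the whole tile is loaded

    // Write the tile out transposed: swap the block coordinates
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)
        odata[y * height + x] = tile[threadIdx.x][threadIdx.y];
}

Because each thread reads and writes consecutive global addresses, both the load and the transposed store stay coalesced; the reshuffling happens in the short-latency shared tile.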
CUDA Variable Type Qualifiers
Variable declaration         Memory     Scope    Lifetime       Latency
int LocalVar;                register   thread   thread         1x
int localArray[10];          local      thread   thread         100x
__shared__ int SharedVar;    shared     block    block          1x
__device__ int GlobalVar;    global     grid     application    100x
__constant__ int ConstVar;   constant   grid     application    1x
[Figure (analogy): over time, two students cover the same courses (Networks, Algorithms, Compilers) but in different orders, so they rarely need the same books at the same time.]
Same with Blocking/Tiling
[Figure: Thread 1, Thread 2, … accessing their data over time.]
• Bad – when threads have very different timing
Barrier Synchronization
• A function call in CUDA: __syncthreads()
• Every thread in a block waits at the barrier until all threads of that block have reached it
[Figure: Thread 1, Thread 2, …, Thread N-1 arriving at the barrier at different times, then proceeding together.]
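A minimal sketch of the usual pattern, and the key restriction: __syncthreads() must be reached by every thread of the block (the kernel name and block size here are illustrative, not from the slides):

__global__ void reverse_in_block(float *data) {
    __shared__ float buf[256];              // assumes blocks of 256 threads
    int i = threadIdx.x;

    buf[i] = data[blockIdx.x * blockDim.x + i];
    __syncthreads();                        // OK: all threads of the block reach this barrier

    // NOT OK: a barrier inside a branch taken by only some threads of the block
    // if (i < 128) __syncthreads();        // can deadlock / undefined behaviour

    // Safe to read another thread's value only after the barrier
    data[blockIdx.x * blockDim.x + i] = buf[blockDim.x - 1 - i];
}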
• GPU architecture
– “A Survey of CPU-GPU heterogeneous computing”, S. Mittal
et al., CSUR 2015
– https://cvw.cac.cornell.edu/gpu/coalesced
– https://medium.com/@smallfishbigsea/basic-concepts-in-gpu-computing-3388710e9239
Thanks!