
An Introduction to GPU and

CUDA C/C++
Slides adapted by Dr. Sparsh Mittal
Slide credits: NVIDIA, Mutlu/Kirk/Hwu, Michael Boyer, and others
What is CUDA?
• CUDA Architecture
– Expose GPU parallelism for general-purpose computing
– Retain performance

• CUDA C/C++
– Based on industry-standard C/C++
– Small set of extensions to enable heterogeneous programming
– Straightforward APIs to manage devices, memory, etc.

• We discuss CUDA C/C++ and (briefly) GPU architecture
Introduction to CUDA C/C++
• What will you learn in this session?
– Start from “Hello World!”
– Write and launch CUDA C/C++ kernels
– Manage GPU memory
– Manage communication and synchronization
Architectural parameters of recent NVIDIA GPUs
(CC = compute capability, RF = register file)

Chip  | Arch.   | #Tran. | Node | CC  | Per-SM L1 (KB) | Shared L2 (KB) | Per-SM RF (KB) | # of SMs | Total RF (KB)
G80   | Tesla   | 0.68B  | 90nm | 1.0 | None           | None           | 32             | 16       | 512
GT200 | Tesla   | 1.4B   | 65nm | 1.3 | None           | None           | 64             | 30       | 1920
GF100 | Fermi   | 3B     | 40nm | 2.0 | 48             | 768            | 128            | 16       | 2048
GK110 | Kepler  | ~7B    | 28nm | 3.5 | 48             | 1536           | 256            | 15       | 3840
GK210 | Kepler  | ~7B    | 28nm | 3.7 | 48             | 1536           | 512            | 15       | 7680
GM204 | Maxwell | 8B     | 28nm | 5.2 | 48             | 2048           | 256            | 16       | 4096
GP100 | Pascal  | 15.3B  | 16nm | 6.0 | 48             | 4096           | 256            | 56       | 14336
GV100 | Volta   | 21B    | 12nm | 7.0 | 128            | 6144           | 256            | 80       | 20480
TU102 | Turing  | 18.6B  | 12nm | 7.5 | 64             | 6144           | 256            | 72       | 18432

This PPT applies to devices with capability >=2.0


S. Mittal, "A Survey of Techniques for Architecting and Managing GPU Register File", IEEE TPDS 2017
Cano, “A survey on graphic processing unit computing for large-scale data mining”, WIDM 2017
Power and Performance of GPUs
HETEROGENEOUS
COMPUTING
Heterogeneous Computing
• Terminology:
– Host: the CPU and its memory (host memory)
– Device: the GPU and its memory (device memory)
GPU vs. CPU
“The Tradeoff”

• CPU: optimizes for LATENCY
• GPU: optimizes for THROUGHPUT

CPU vs. GPU: Architectural Difference 1


(Figure: a CPU core contains fetch/decode, a branch predictor, a register file, out-of-order (OOO) logic, a memory pre-fetcher, execute units, and a data cache; the GPU core drops most of these structures.)

GPU design principle: avoid structures that only improve single-thread performance

CPU vs. GPU: Architectural Difference 2


(Figure: the GPU pairs one fetch/decode unit with a thread group whose lanes each have their own register file (RF) and execution unit (EXE).)

GPU design principle: amortize the overhead of control logic across multiple execution units (SIMD processing)

CPU vs. GPU: Architectural Difference 3


(Figure: the GPU core keeps several thread groups (1-4) resident, each with its own register files, sharing the execution units.)

GPU design principle: use multiple groups of threads to keep execution units busy and hide memory latency

CPU vs. GPU: Architectural Difference 4


(Figure: the CPU replicates a few large cores (CPU cores 1-4); the GPU replicates many simple cores (GPU cores 1-30), each with its own register files and execution units.)

GPU design principle: replicate cores to leverage more parallelism

© 2010 Michael Boyer

CPU vs. GPU: Architectural Differences

• Summary: take advantage of abundant parallelism
– Lots of threads, so focus on aggregate performance
– Parallelism in space:
• SIMD processing in each core
• Many independent SIMD cores across the chip
– Parallelism in time:
• Multiple SIMD groups in each core
Heterogeneous Computing
#include <iostream>
#include <algorithm>

using namespace std;

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

// parallel function (runs on the device)
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

void fill_ints(int *x, int n) {
    fill_n(x, n, 1);
}

int main(void) {
    int *in, *out;        // host copies
    int *d_in, *d_out;    // device copies
    int size = (N + 2 * RADIUS) * sizeof(int);

    // serial code: alloc space for host copies and setup values
    in  = (int *)malloc(size); fill_ints(in,  N + 2 * RADIUS);
    out = (int *)malloc(size); fill_ints(out, N + 2 * RADIUS);

    // Alloc space for device copies
    cudaMalloc((void **)&d_in,  size);
    cudaMalloc((void **)&d_out, size);

    // Copy to device
    cudaMemcpy(d_in,  in,  size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // parallel code: launch stencil_1d() kernel on GPU
    stencil_1d<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    // Copy result back to host
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    // serial code: cleanup
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
Simple Processing Flow

(Data moves between CPU memory and GPU memory over the PCI bus.)

1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
CUDA extension to declare functions
__global__ : called only from host, executes only on device
__device__ : called only from device, executes only on device
__host__   : called only from host, executes only on host
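
A minimal sketch (not from the slides; the function names are made up for illustration) showing the three qualifiers together:

#include <stdio.h>

// __device__: callable only from device code
__device__ int square(int x) { return x * x; }

// __host__: callable only from host code (also the default for
// functions with no qualifier)
__host__ int cube(int x) { return x * x * x; }

// __global__: a kernel, launched from the host, runs on the device
__global__ void kernel(void) {
    printf("square(3) on device = %d\n", square(3));
}

int main(void) {
    kernel<<<1, 1>>>();          // launch the kernel
    cudaDeviceSynchronize();     // wait for the device printf to finish
    printf("cube(3) on host = %d\n", cube(3));
    return 0;
}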
Hello World!
#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}
This is standard C that runs on the host. The NVIDIA compiler (nvcc) can be used to compile programs with no device code.

Output:
$ nvcc hello_world.cu
$ a.out
Hello World!
$
Hello World! with Device Code
#include <stdio.h>

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

 Two new syntactic elements…


Hello World! with Device Code
__global__ void mykernel(void) {
}

• CUDA keyword __global__ indicates a function that:


– Runs on the device
– Is called from host code
• nvcc separates source code into host and device
components
– Device functions (e.g. mykernel()) processed by
NVIDIA compiler
– Host functions (e.g. main()) processed by standard
host compiler
• gcc, cl.exe
Hello World! with Device Code
mykernel<<<1,1>>>();

• Triple angle brackets mark a call from host code to device code
– Also called a “kernel launch”
– We’ll return to the parameters (1,1) in a moment

• That’s all that is required to execute a function on the GPU!
Hello World! with Device Code
#include <stdio.h>

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

Output:
$ nvcc hello.cu
$ a.out
Hello World!
$

• In this example, mykernel() does nothing
Printing Hello World from Device
// filename: helloPrintFromDevice.cu
#include <stdio.h>

__device__ const char *STR = "HELLO WORLD!";
const char STR_LENGTH = 12;

__global__ void hello()
{
    printf("%d %c\n", threadIdx.x, STR[threadIdx.x % STR_LENGTH]);
}

int main(void) {
    int num_threads = STR_LENGTH;
    int num_blocks = 1;
    hello<<<num_blocks, num_threads>>>();
    cudaDeviceSynchronize();
    return 0;
}
Output (each thread prints one character):
$ nvcc helloPrintFromDevice.cu
$ ./a.out
0 H
1 E
2 L
3 L
4 O
5
6 W
7 O
8 R
9 L
10 D
11 !
$
Parallel Programming in CUDA C/C++

• GPU computing is about massive parallelism!
• We will discuss a more interesting example…
• We’ll start by adding two integers and build up to vector addition

(Figure: vectors a and b added element-wise to produce c)
Addition on the Device
• A simple kernel to add two integers
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}

• As before, __global__ is a CUDA C/C++ keyword meaning:
– add() will execute on the device
– add() will be called from the host
Addition on the Device
• Note that we use pointers for the variables

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

• add() runs on the device, so a, b and c must point to device memory
• We need to allocate memory on the GPU


Memory Management
• Host and device memory are separate entities
– Device pointers point to GPU memory
  • May be passed to/from host code
  • May not be dereferenced in host code
– Host pointers point to CPU memory
  • May be passed to/from device code
  • May not be dereferenced in device code

• Simple CUDA API for handling device memory (a sketch follows below)
– cudaMalloc(), cudaFree(), cudaMemcpy()
– Similar to the C equivalents malloc(), free(), memcpy()
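
A minimal round-trip sketch (not from the slides; the buffer names are made up) using just these three calls: allocate a device buffer, copy host data to it, copy it back, and free it.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int n = 16;
    int bytes = n * sizeof(int);

    // Host buffers
    int *h_src = (int *)malloc(bytes);
    int *h_dst = (int *)malloc(bytes);
    for (int i = 0; i < n; i++) h_src[i] = i;

    // Device buffer (a device pointer: do not dereference it on the host)
    int *d_buf;
    cudaMalloc((void **)&d_buf, bytes);

    // Host -> device, then device -> host
    cudaMemcpy(d_buf, h_src, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_dst, d_buf, bytes, cudaMemcpyDeviceToHost);

    printf("h_dst[10] = %d\n", h_dst[10]);   // prints 10

    cudaFree(d_buf);
    free(h_src); free(h_dst);
    return 0;
}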
Addition on the Device: add()
• Returning to our add() kernel
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}

• Let’s take a look at main()…


Addition on the Device: main()
int main(void) {
int a, b, c; // host copies of a, b, c
int *d_a, *d_b, *d_c;// device copies
int size = sizeof(int);
// Allocate space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);
// Setup input values
a = 2;
b = 7;
Addition on the Device: main()
// Copy inputs to device
cudaMemcpy(d_a,&a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b,&b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU
add<<<1,1>>>(d_a, d_b, d_c);
// Copy result back to host
cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
// Cleanup
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
UNDERSTANDING
THREAD ORGANIZATION
Understanding thread organization using the example of student groups

(Analogy: All Students of the Institute → departments (CSE, EE, …) → programmes (BTech, MTech, PhD, …) → years (1st yr, 2nd yr, 3rd yr, …))

Similarly, threads are organized:
– A kernel is launched on the device as a grid of blocks of threads
– blockIdx and threadIdx are 3D; we showed only one dimension (x)
• Built-in variables:
– threadIdx
– blockIdx
– blockDim
– gridDim

(Figure: the device runs Grid 1, made of blocks (0,0,0), (1,0,0), (2,0,0), (0,1,0), (1,1,0), (2,1,0); block (1,1,0) is expanded into threads (0,0,0) through (4,2,0). A launch sketch follows below.)
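
A minimal sketch (not from the slides; the kernel name is made up) that launches a 2D grid of 2D blocks matching the figure above and reads the built-in variables:

#include <stdio.h>

__global__ void whoAmI(void) {
    // Global (x, y) position of this thread in the whole grid
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x == 0 && y == 0)
        printf("grid is %d x %d blocks of %d x %d threads\n",
               gridDim.x, gridDim.y, blockDim.x, blockDim.y);
}

int main(void) {
    dim3 blocks(3, 2);     // gridDim  = (3, 2, 1)
    dim3 threads(5, 3);    // blockDim = (5, 3, 1)
    whoAmI<<<blocks, threads>>>();
    cudaDeviceSynchronize();
    return 0;
}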
Parallel computing using
BLOCKS
Moving from Scalar to Parallel
• GPU computing is about massive
parallelism
– So how do we run code in parallel on the
device?

add<<< 1, 1 >>>();

add<<< N, 1 >>>();

• Instead of executing add() once, execute it N times in parallel
Vector Addition on the Device
• With add() running in parallel we can do vector addition
• Terminology: each parallel invocation of add() is referred
to as a block
– The set of blocks is referred to as a grid
– Each invocation can refer to its block index using
blockIdx.x

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• By using blockIdx.x to index into the array, each block handles a different index
Vector Addition on the Device
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• On the device, each block can execute in parallel:

Block 0: c[0]=a[0]+b[0];   Block 1: c[1]=a[1]+b[1];   Block 2: c[2]=a[2]+b[2];   Block 3: c[3]=a[3]+b[3];
Vector Addition on the Device: add()

• Returning to our parallelized add() kernel

__global__ void add(int *a, int *b, int *c)
{
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• Let’s take a look at main()…


Vector Addition on the Device: main()
#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; //device copies
int size = N * sizeof(int);
// Alloc space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);
// Alloc space for host copies and initialize
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);
Vector Addition on the Device: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU with N blocks
add<<<N,1>>>(d_a, d_b, d_c);
// Copy result back to host
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}
Parallel computing using
THREADS
CUDA Threads
• Terminology: a block can be split into parallel threads
• Let’s change add() to use parallel threads instead of
parallel blocks

__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

• We use threadIdx.x instead of blockIdx.x
• Need to make one change in main()…
Vector Addition Using Threads: main()
#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);
// Alloc space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);
// Alloc space for host copies and initialize
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);
Vector Addition Using Threads: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU with N threads
add<<<1,N>>>(d_a, d_b, d_c);
// Copy result back to host
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}
COMBINING BLOCKS AND
THREADS
Combining Blocks and Threads
• We’ve seen parallel vector addition using:
– Many blocks with one thread each
– One block with many threads

• Let’s adapt vector addition to use both blocks and threads
• Why? We’ll come to that…
• First let’s discuss data indexing…


Indexing using Blocks & Threads
• No longer as simple as using blockIdx.x and
threadIdx.x
– Consider indexing an array with one element per
thread (8 threads/block)

threadIdx.x threadIdx.x threadIdx.x threadIdx.x

0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

blockIdx.x = 0 blockIdx.x = 1 blockIdx.x = 2 blockIdx.x = 3

• With M threads/block, a unique index for each thread is given by:
int index = threadIdx.x + blockIdx.x * M;
Indexing Arrays: Example
• Which thread will operate on the red
element?
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

M = 8 threadIdx.x = 5

0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

blockIdx.x = 2

int index = threadIdx.x + blockIdx.x * M
          = 5 + 2 * 8
          = 21
Vector Addition with Blocks and Threads

• Use the built-in variable blockDim.x for threads per block:
int index = threadIdx.x + blockIdx.x * blockDim.x;

• Combined version of add() to use parallel threads and parallel blocks:

__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}
Addition with Blocks and Threads
#define N (2048*2048)
#define THREADS_PER_BLOCK 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies
int size = N * sizeof(int);
// Alloc space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);
Addition with Blocks and Threads
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
// Launch add() kernel on GPU
add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);
// Copy result back to host
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}
Handling Arbitrary Vector Sizes
• Typical problem: vector sizes that are not multiples of blockDim.x
• Avoid accessing beyond the end of the arrays:

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

• Update the kernel launch (M = threads per block):
add<<<(N + M - 1) / M, M>>>(d_a, d_b, d_c, N);
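
A host-side sketch (not from the slides) of a full program around this launch, assuming the bounds-checked add() kernel above and the random_ints() helper used in the earlier examples; N and M are chosen here only to illustrate a non-multiple size:

#define N 1000                 // deliberately not a multiple of M
#define M 256                  // threads per block

int main(void) {
    int *a, *b, *c;            // host copies
    int *d_a, *d_b, *d_c;      // device copies
    int size = N * sizeof(int);

    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // (N + M - 1) / M = 4 blocks; the last block's extra threads
    // fail the (index < n) test inside add() and do nothing
    add<<<(N + M - 1) / M, M>>>(d_a, d_b, d_c, N);

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}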
Why Bother with Threads?
• Threads seem unnecessary
– They add a level of complexity
– What do we gain?
• Unlike parallel blocks, threads have
mechanisms to:
– Communicate
– Synchronize
Example device query output:
Device number: 0, Device name: Quadro P1000, Compute capability: 6.1
Clock Rate: 1480500 kHz, Total SMs: 5
Shared Memory Per SM: 98304 bytes
Registers Per SM: 65536 32-bit
Max threads per SM: 2048
L2 Cache Size: 1048576 bytes
Total Global Memory: 4227858432 bytes
Memory Clock Rate: 2505000 kHz

Max threads per block: 1024


Max threads in X-dimension of block: 1024
Max threads in Y-dimension of block: 1024
Max threads in Z-dimension of block: 64

Max blocks in X-dimension of grid: 2147483647


Max blocks in Y-dimension of grid: 65535
Max blocks in Z-dimension of grid: 65535

Shared Memory Per Block: 49152 bytes


Registers Per Block: 65536 32-bit
Warp size: 32
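
A sketch (an assumption, not necessarily the exact program used for the listing above) of how such values can be queried with cudaGetDeviceProperties():

#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Device name: %s, Compute capability: %d.%d\n",
           prop.name, prop.major, prop.minor);
    printf("Clock Rate: %d kHz, Total SMs: %d\n",
           prop.clockRate, prop.multiProcessorCount);
    printf("Shared Memory Per SM: %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Registers Per SM: %d 32-bit\n", prop.regsPerMultiprocessor);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("L2 Cache Size: %d bytes\n", prop.l2CacheSize);
    printf("Total Global Memory: %zu bytes\n", prop.totalGlobalMem);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Shared Memory Per Block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers Per Block: %d, Warp size: %d\n",
           prop.regsPerBlock, prop.warpSize);
    return 0;
}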
Hardware limits on GPU (for an old GPU)

• Grid and block dimension restrictions


– Grid: 64k x 64k
– Block: 512x512x64
– Max threads/block = 512
• A block maps onto an SM
– Up to 8 blocks per SM
• Every thread uses registers
– Up to 16K registers in an SM
– There is also limit on max register per thread
• Every block uses shared memory
– Up to 16KB shared memory
Example

Assume blocks of 16x16 threads using 20


registers each
– Each block uses 4K of shared memory
– Find the limit on the maximum

• 5120 registers / block  3.2 blocks/SM


• 4K shared memory/block  4 blocks/SM
Let's first discuss
GPU MEMORY ADDRESS SPACES
GPU Memory Address Spaces
1. Local
2. Shared
3. Global
(listed in order of increasing visibility of data between threads)

• In addition, there are two more (read-only) address spaces:
1. Constant
2. Texture
Local (Private) Address Space
Each thread has its own “local memory”.

Note: the location at address 100 for thread 0 is different from the location at address 100 for thread 1.

Local memory contains local variables private to a thread.
Global Address Spaces
• Each thread, in any thread block (even from different kernels), can access “global memory”
• cudaMalloc allocates global memory
• Threads write their own portion of global memory
• Slow (off-chip)
Let's take the example of matrix transpose:

[ 1 2 ]   transposed to   [ 1 3 ]
[ 3 4 ]                   [ 2 4 ]
Matrix Transpose
__global__ void transpose(float *odata, float *idata, int width, int height) {
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;

    int index_in  = xIndex + width  * yIndex;
    int index_out = yIndex + height * xIndex;
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
        odata[index_out + i] = idata[index_in + i * width];
    }
}

• “xIndex”, “yIndex”, “index_in”, “index_out”, and “i” are in local memory
(local variables are register allocated; the stack is allocated in local memory)
• “odata” and “idata” are pointers to global memory
(both allocated using calls to cudaMalloc -- not shown above)
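
TILE_DIM and BLOCK_ROWS are assumed to be compile-time constants. A typical launch configuration for this kernel (a sketch; the values below are assumptions, not from the slides):

#define TILE_DIM   32          // each block transposes a 32x32 tile
#define BLOCK_ROWS  8          // using 32x8 threads, so each thread copies 4 rows

// width and height assumed to be multiples of TILE_DIM
dim3 grid(width / TILE_DIM, height / TILE_DIM);
dim3 threads(TILE_DIM, BLOCK_ROWS);
transpose<<<grid, threads>>>(d_odata, d_idata, width, height);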
Shared Address Space

• Each thread in the same block can access a memory region called “shared memory”
• Limited size (16 to 48 KB)
• Used as a software-managed “cache” to avoid off-chip memory accesses
• Synchronize threads in a thread block using __syncthreads()
Analogy: Institute and Dept. Library

• CSE students 1, 2, … fetching books from the Institute Library: long latency
• CSE students 1, 2, … fetching books from the CSE Department Library: short latency
Similarly: Global and Shared Memory

• Threads 1, 2, … accessing data in global memory: long latency
• Threads 1, 2, … accessing data in shared memory: short latency
CUDA Variable Type Qualifiers
Variable declaration        | Memory   | Scope  | Lifetime    | Latency
int LocalVar;               | register | thread | thread      | 1x
int localArray[10];         | local    | thread | thread      | 100x
__shared__ int SharedVar;   | shared   | block  | block       | 1x
__device__ int GlobalVar;   | global   | grid   | application | 100x
__constant__ int ConstVar;  | constant | grid   | application | 1x

• Automatic variables without any qualifier reside in a register
– Except per-thread arrays, which reside in local memory
– Or if there are not enough registers
(A declaration sketch follows below.)
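
A minimal sketch (names are illustrative, not from the slides) declaring one variable of each kind from the table above:

__constant__ int ConstVar = 7;         // constant memory, read-only for the grid
__device__   int GlobalVar;            // global memory, lifetime of the application

__global__ void qualifiers_demo(int *out) {
    int LocalVar = threadIdx.x;        // automatic variable: lives in a register
    int localArray[10];                // per-thread array: placed in local memory
    __shared__ int SharedVar;          // shared memory: one copy per block

    if (threadIdx.x == 0) SharedVar = ConstVar;
    __syncthreads();                   // make SharedVar visible to the whole block

    localArray[0] = LocalVar + SharedVar;
    if (threadIdx.x == 0 && blockIdx.x == 0)
        GlobalVar = localArray[0];     // a single thread updates the global variable
    out[blockIdx.x * blockDim.x + threadIdx.x] = localArray[0];
}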
Programming scenario 1
Task:
1. Load data from global memory
2. Do thread-local computations
3. Store results to global memory

Solution:
• Load data from global memory into a register
float a = d_ptr[blockIdx.x * blockDim.x + threadIdx.x];
• Do computation with registers
float res = func(a);
• Store result
d_ptr[blockIdx.x * blockDim.x + threadIdx.x] = res;
Programming scenario 2
Task:
1. Load data from global memory
2. Do block-local computations
3. Store results to global memory

Solution (see the kernel sketch below):
• Load data from global memory to shared memory
__shared__ float a_sh[BLOCK_SIZE];
int idx = blockIdx.x * blockDim.x + threadIdx.x;
a_sh[threadIdx.x] = d_ptr[idx];
__syncthreads();
• Do computation
float res = func(a_sh[threadIdx.x]);
• Store result
d_ptr[idx] = res;
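
A complete kernel sketch following this pattern (func() and the neighbour-averaging computation are placeholders, not from the slides):

#define BLOCK_SIZE 256

// Placeholder for the block-local computation: average a value with its
// left neighbour inside the block
__device__ float func(float self, float left) {
    return 0.5f * (self + left);
}

__global__ void blockLocal(float *d_ptr) {
    __shared__ float a_sh[BLOCK_SIZE];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // 1. Load data from global memory to shared memory
    a_sh[threadIdx.x] = d_ptr[idx];
    __syncthreads();                       // whole tile loaded before anyone reads it

    // 2. Block-local computation: read a neighbour's element from shared memory
    int left = (threadIdx.x == 0) ? 0 : threadIdx.x - 1;
    float res = func(a_sh[threadIdx.x], a_sh[left]);

    // 3. Store result back to global memory
    d_ptr[idx] = res;
}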
Because it’s tricky, let’s discuss in more detail:
SHARED MEMORY
The department library needs synchronization

• Good – when students have similar choices
(Over time, Student A reads Networks, Algorithms, Compilers while Student B reads Networks, Algorithms, Compilers)

• Bad – when students have different choices
(Over time, Student A reads Data Mining, Algorithms, Compilers while Student B reads Algorithms, Compilers, Networks)
Same with Blocking/Tiling

• Good – when threads have similar access timing
• Bad – when threads have very different timing
(Figure: timelines of Thread 1 and Thread 2 accessing the same tile; in the good case their accesses line up in time, in the bad case they do not)
Barrier Synchronization
• A function call in CUDA
– __syncthreads()

• All threads in the same block must reach the __syncthreads() before any can move on

• Best used to coordinate tiled algorithms
– To ensure that all elements of a tile are loaded
– To ensure that all elements of a tile are consumed
(Figure: an example execution timing of barrier synchronization. Threads 0 through N-1 reach the barrier at different times; none proceeds until all have arrived.)


Consider matrix multiplication C = AB

for (i = 0; i < A.height; i++) {
    for (j = 0; j < B.width; j++) {
        c[i][j] = 0;
        for (k = 0; k < A.width; k++)
            c[i][j] += a[i][k] * b[k][j];
    }
}

• # times each element of A is accessed: B.width
• # times each element of B is accessed: A.height
• # times each element of C is accessed: A.width (once per iteration of the k loop)
Consider an element c[row][col]. There are B.width elements in a row of C and A.height elements in a column of C.

To compute each of these elements, we access a row of A and a column of B.

=> We access each row of A B.width times and each column of B A.height times.
We’ll see the code using both global and shared memory

• Assume our matrices are square (N x N) and are stored using linear arrays
• Access to the (i, j) element is facilitated via the macro:
#define IDX(i,j,n) ((i)*(n)+(j))
MatMul using global memory
__global__ void matmulGlobal(float *c, float *a, float *b, int N)
{
    // compute row and column for our matrix element
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < N && row < N)
    {
        float sum = 0.0;
        for (int k = 0; k < N; k++)
        {
            sum += a[IDX(row,k,N)] * b[IDX(k,col,N)];
        }
        c[IDX(row,col,N)] = sum;
    }
}
To use shared memory, we use the idea of blocking/tiling.
MatMul using shared memory (1/3)
__global__ void matmulShared(float *c, float *a, float *b, int N)
{
    // compute row and column for our matrix element
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // compute the number of blocks (tiles) we need
    int M = (N + BlockSize - 1) / BlockSize;
    float sum = 0.0;
MatMul using shared memory (2/3)
    // Go through each block (tile)
    for (int m = 0; m < M; m++) {
        // all threads in the block copy their element from
        // matrix a and matrix b to shared memory
        __shared__ float a_s[BlockSize][BlockSize];
        __shared__ float b_s[BlockSize][BlockSize];

        // column into a and row into b for this tile
        int aCol = m * BlockSize + threadIdx.x;
        int bRow = m * BlockSize + threadIdx.y;

        a_s[threadIdx.y][threadIdx.x] = a[IDX(row,aCol,N)];
        b_s[threadIdx.y][threadIdx.x] = b[IDX(bRow,col,N)];

        // make sure all threads are finished & the tile is loaded
        __syncthreads();
MatMul using shared memory (3/3)
        // compute partial sum using the shared memory tile
        // K is the block size except at the right or bottom edge, since we
        // may not have a full block of data there
        int K = (m == M - 1 ? N - m * BlockSize : BlockSize);
        for (int k = 0; k < K; k++)
        {
            sum += a_s[threadIdx.y][k] * b_s[k][threadIdx.x];
        }
        // Synchronize to make sure the computation is done before
        // loading two new sub-matrices of A and B in the next iteration
        __syncthreads();
    }
    if (col < N && row < N) c[IDX(row,col,N)] = sum;
}
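
A host-side launch sketch for these kernels (a sketch under the assumption that BlockSize is a compile-time constant such as 16, and that d_a, d_b, d_c are device arrays of N*N floats allocated with cudaMalloc):

#define BlockSize 16

void launchMatmul(float *d_c, float *d_a, float *d_b, int N)
{
    // 2D blocks of BlockSize x BlockSize threads, enough blocks to cover N x N
    dim3 threads(BlockSize, BlockSize);
    dim3 blocks((N + BlockSize - 1) / BlockSize,
                (N + BlockSize - 1) / BlockSize);

    // matmulGlobal takes exactly the same launch configuration
    matmulShared<<<blocks, threads>>>(d_c, d_a, d_b, N);
    cudaDeviceSynchronize();
}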
References
• CUDA language:
– CUDA by Example, by Jason Sanders and Edward Kandrot, NVIDIA
– “Programming Massively Parallel Processors: A Hands-on Approach” by David B. Kirk and Wen-mei W. Hwu

• GPU architecture:
– “A Survey of CPU-GPU heterogeneous computing”, S. Mittal et al., CSUR 2015
– https://cvw.cac.cornell.edu/gpu/coalesced
– https://medium.com/@smallfishbigsea/basic-concepts-in-gpu-computing-3388710e9239
Thanks!

Sparsh Mittal [email protected]
