Lecture 4

1. CUDA threads are organized into a grid of blocks, with each block containing a set of threads. Each thread is assigned a unique block ID and thread ID to identify its data.
2. Block and thread IDs simplify addressing multidimensional data by allowing threads to determine which elements to work on based on their position. This is useful for tasks like image processing and solving partial differential equations on volumes.
3. Matrix multiplication is used as an example to illustrate how data is transferred to the GPU, how threads are organized to perform the calculation in parallel, and how results are transferred back to the CPU. Blocks of threads each compute a tile of the output matrix.


CUDA Threads

Block IDs and Thread IDs

• Each thread uses IDs to decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing (see the indexing sketch below)
  – Solving PDEs on volumes
  – …

Figure 3.2: An example of CUDA thread organization. The host launches Kernel 1 on Grid 1 (a 2x2 arrangement of blocks) and Kernel 2 on Grid 2; Block (1, 1) is expanded to show its threads, indexed in 3D. Courtesy: NVIDIA.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE498AL, University of Illinois, Urbana-Champaign
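As a hedged illustration of the indexing described above, the sketch below uses block and thread IDs to compute a 2D pixel coordinate; the kernel name, image layout, and bounds check are assumptions for this example, not taken from the lecture.

// Minimal sketch: each thread brightens one pixel of a Width x Height
// grayscale image stored in row-major order.
__global__ void BrightenKernel(float* img, int Width, int Height, float delta)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < Width && y < Height)                     // guard against partial blocks
        img[y * Width + x] += delta;
}

// Host-side launch covering the whole image with 16x16-thread blocks:
// dim3 dimBlock(16, 16);
// dim3 dimGrid((Width + 15) / 16, (Height + 15) / 16);
// BrightenKernel<<<dimGrid, dimBlock>>>(d_img, Width, Height, 0.1f);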
Step 1: Matrix Multiplication
A Simple Host Version in C

// Matrix multiplication on the (CPU) host
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            float sum = 0;
            for (int k = 0; k < Width; ++k) {
                float a = M[i * Width + k];
                float b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}

Figure: element (i, j) of P is the dot product of row i of M and column j of N; M, N, and P are all WIDTH x WIDTH.
Step 2: Input Matrix Data Transfer
(Host-side Code)

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate and load M, N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);
Step 3: Output Matrix Data Transfer
(Host-side Code)

    // 2. Kernel invocation code – to be shown later

    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
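The host code above omits error handling for brevity. A minimal hedged sketch of how the same calls could be checked is shown below; the CUDA_CHECK macro is an assumption added for this example, not part of the lecture code.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage inside MatrixMulOnDevice, for example:
// CUDA_CHECK(cudaMalloc((void**)&Md, size));
// CUDA_CHECK(cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice));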
Step 4: Kernel Function

// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;
Step 4: Kernel Function (cont.)

    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y*Width + k];
        float Nelement = Nd[k*Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }

    Pd[threadIdx.y*Width + threadIdx.x] = Pvalue;
}

Figure: thread (tx, ty) reads row ty of Md and column tx of Nd to compute element (tx, ty) of Pd; all matrices are WIDTH x WIDTH.
Step 5: Kernel Invocation
(Host-side Code)

// Setup the execution configuration
dim3 dimGrid(1, 1);
dim3 dimBlock(Width, Width);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
Step 6: Handling Arbitrary Sized Square
Matrices

• Have each 2D thread block compute a (TILE_WIDTH)^2 sub-matrix (tile) of the result matrix
  – Each block has (TILE_WIDTH)^2 threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks (a sketch of the grid setup follows below)
• You still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than the max grid size (64K)!

Figure: Pd is partitioned into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by) computes one tile, and thread (tx, ty) computes one element of that tile.
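A minimal sketch of the grid setup suggested above, assuming TILE_WIDTH = 16 and a boundary check inside the kernel for widths that are not a multiple of TILE_WIDTH (both are assumptions for this example):

#define TILE_WIDTH 16

// Round the grid size up so partial tiles at the matrix edges are covered.
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid((Width + TILE_WIDTH - 1) / TILE_WIDTH,
             (Width + TILE_WIDTH - 1) / TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

// Inside the kernel, each thread would then guard its work:
// if (Row < Width && Col < Width) Pd[Row*Width + Col] = Pvalue;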
Matrix Multiplication Using
Multiple Blocks

• Break up Pd into tiles
• Each block calculates one tile
  – Each thread calculates one element
  – Block size equals tile size

Figure: Pd divided into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by) produces the sub-matrix Pdsub, with thread (tx, ty) producing one of its elements.
A Small Example

TILE_WIDTH = 2: the 4x4 result matrix P is covered by four 2x2 blocks.

Block(0,0): P0,0 P1,0 / P0,1 P1,1     Block(1,0): P2,0 P3,0 / P2,1 P3,1
Block(0,1): P0,2 P1,2 / P0,3 P1,3     Block(1,1): P2,2 P3,2 / P2,3 P3,3
A Small Example: Multiplication

Figure: computing the Block(0,0) tile of the 4x4 Pd uses the top rows of Md (Md0,0 … Md3,1) and the left columns of Nd (Nd0,0 … Nd1,3).
Revised Matrix Multiplication
Kernel using Multiple Blocks

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width + k] * Nd[k*Width + Col];

    Pd[Row*Width + Col] = Pvalue;
}
CUDA Thread Block

• All threads in a block execute the same kernel program (SPMD)
• Programmer declares block:
  – Block size: 1 to 512 concurrent threads
  – Block shape: 1D, 2D, or 3D
  – Block dimensions in threads
• Threads have thread id numbers within the block
  – Thread program uses thread id to select work and address shared data
• Threads in the same block share data and synchronize while doing their share of the work
• Threads in different blocks cannot cooperate
  – Each block can execute in any order relative to other blocks!

Figure: a CUDA thread block with thread ids 0, 1, 2, … m, all running the same thread program. Courtesy: John Nickolls, NVIDIA.
Transparent Scalability

• Hardware is free to assign blocks to any processor at any time
  – A kernel scales across any number of parallel processors

Figure: the same kernel grid of Blocks 0–7 runs two blocks at a time on a two-SM device and four at a time on a four-SM device. Each block can execute in any order relative to other blocks.
G80 Example: Executing Thread Blocks

• Threads are assigned to Streaming Multiprocessors (SMs) in block granularity
  – Up to 8 blocks to each SM as resource allows
  – An SM in G80 can take up to 768 threads
    • Could be 256 (threads/block) * 3 blocks
    • Or 128 (threads/block) * 6 blocks, etc.
• Threads run concurrently
  – SM maintains thread/block id #s
  – SM manages/schedules thread execution

Figure: blocks assigned to SM 0 and SM 1; each SM has an MT issue unit, SPs, and shared memory, and runs threads t0, t1, …, tm.
G80 Example: Thread Scheduling

• Each block is executed as 32-thread warps
  – An implementation decision, not part of the CUDA programming model
  – Warps are scheduling units in the SM
• If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  – Each block is divided into 256/32 = 8 warps
  – There are 8 * 3 = 24 warps

Figure: warps of Block 1 and Block 2 (t0 … t31 each) scheduled onto a Streaming Multiprocessor containing an instruction L1 cache, instruction fetch/dispatch, shared memory, SPs, and SFUs.
G80 Example: Thread Scheduling
(Cont.)

• SM implements zero-overhead warp scheduling
  – At any time, only one of the warps is executed by the SM
  – Warps whose next instruction has its operands ready for consumption are eligible for execution
  – Eligible warps are selected for execution on a prioritized scheduling policy
  – All threads in a warp execute the same instruction when selected

Figure: a scheduling timeline in which warps from thread blocks TB1, TB2, and TB3 are interleaved; whenever a warp stalls (e.g. TB1 W1), the SM switches to another eligible warp. TB = Thread Block, W = Warp.
G80 Block Granularity Considerations

• For matrix multiplication using multiple blocks, should I use 8x8, 16x16, or 32x32 blocks?

  – For 8x8, we have 64 threads per block. Since each SM can take up to 768 threads, that would be 12 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!

  – For 16x16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule.

  – For 32x32, we have 1024 threads per block. Not even one can fit into an SM! (See the calculation sketch below.)
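A small hedged calculation sketch of the reasoning above; the limits (768 threads and 8 blocks per SM, 512 threads per block) are the G80 figures from this lecture, and the program itself is an illustration.

#include <stdio.h>

int main(void)
{
    const int maxThreadsPerSM = 768, maxBlocksPerSM = 8, maxThreadsPerBlock = 512;
    const int dims[] = {8, 16, 32};
    for (int i = 0; i < 3; ++i) {
        int threadsPerBlock = dims[i] * dims[i];
        if (threadsPerBlock > maxThreadsPerBlock) {
            printf("%2dx%2d: %4d threads/block - does not fit in a single block\n",
                   dims[i], dims[i], threadsPerBlock);
            continue;
        }
        int blocks = maxThreadsPerSM / threadsPerBlock;       // limited by thread count
        if (blocks > maxBlocksPerSM) blocks = maxBlocksPerSM; // and by block count
        printf("%2dx%2d: %d blocks/SM, %d threads/SM\n",
               dims[i], dims[i], blocks, blocks * threadsPerBlock);
    }
    return 0;
}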
CUDA Memories

G80 Implementation of CUDA Memories

• Each thread can:
  – Read/write per-thread registers
  – Read/write per-thread local memory
  – Read/write per-block shared memory
  – Read/write per-grid global memory
  – Read-only per-grid constant memory

Figure: a grid with two blocks; each block has its own shared memory and per-thread registers, while the host and all threads access the per-grid global and constant memories.
CUDA Variable Type Qualifiers

Variable declaration                         Memory     Scope    Lifetime
__device__ __local__    int LocalVar;        local      thread   thread
__device__ __shared__   int SharedVar;       shared     block    block
__device__              int GlobalVar;       global     grid     application
__device__ __constant__ int ConstantVar;     constant   grid     application

• __device__ is optional when used with __local__, __shared__, or __constant__
• Automatic variables without any qualifier reside in a register
  – Except arrays that reside in local memory
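A brief hedged sketch of where each qualifier may appear; all names are made up for this illustration and are not part of the lecture.

__constant__ float coeffs[16];          // per-grid constant memory (read-only in kernels)
__device__   float lastBlockValue;      // per-grid global memory

__global__ void QualifierDemo(float* out)
{
    __shared__ float tileBuf[256];      // per-block shared memory
    float localVal = coeffs[threadIdx.x % 16];  // automatic scalar -> register
    float scratch[4] = {0, 0, 0, 0};    // automatic array -> local memory

    tileBuf[threadIdx.x] = localVal + scratch[0];
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tileBuf[threadIdx.x];
    if (threadIdx.x == 0)
        lastBlockValue = tileBuf[0];    // one thread writes the global variable
}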
Where to Declare Variables?

Can the host access it?
• Yes → global or constant: declare outside of any function
• No → register (automatic), shared, or local: declare in the kernel
Variable Type Restrictions
• Pointers can only point to memory allocated or
declared in global memory:
– Allocated in the host and passed to the kernel:
__global__ void KernelFunc(float* ptr)
– Obtained as the address of a global variable:
float* ptr = &GlobalVar;

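A brief hedged sketch combining the two allowed pointer patterns; GlobalVar is a __device__ global here and ScaleKernel is an illustrative name, not lecture code.

__device__ float GlobalVar;

__global__ void ScaleKernel(float* ptr, int n)    // ptr was cudaMalloc'ed on the host
{
    float* gv = &GlobalVar;                       // address of a global-scope variable
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        ptr[i] *= *gv;                            // both pointers refer to global memory
}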
A Common Programming Strategy
• Global memory resides in device memory (DRAM)
- much slower access than shared memory
• So, a profitable way of performing computation on
the device is to tile data to take advantage of fast
shared memory:
– Partition data into subsets that fit into shared memory
– Handle each data subset with one thread block by:
• Loading the subset from global memory to shared memory,
using multiple threads to exploit memory-level parallelism
• Performing the computation on the subset from shared
memory; each thread can efficiently multi-pass over any data
element
• Copying results from shared memory to global memory
A Common Programming Strategy
(Cont.)

• Constant memory also resides in device memory (DRAM) - much slower access than shared memory
  – But… cached!
  – Highly efficient access for read-only data
• Carefully divide data according to access patterns
  – Read-only → constant memory (very fast if in cache)
  – Read/write, shared within block → shared memory (very fast)
  – Read/write within each thread → registers (very fast)
  – Read/write inputs/results → global memory (very slow)

For texture memory usage, see the NVIDIA documentation.
GPU Atomic Integer Operations

• Atomic operations on integers in global memory:
  – Associative operations on signed/unsigned ints
  – add, sub, min, max, ...
  – and, or, xor
  – increment, decrement
  – exchange, compare-and-swap
• Requires hardware with compute capability 1.1 and above.
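A short hedged sketch of an atomic operation on global memory: a byte histogram where many threads may increment the same bin. The kernel and buffer names are assumptions for this example (bins is assumed to be zeroed beforehand), and it requires compute capability 1.1 or above, as noted.

__global__ void HistogramKernel(const unsigned char* data, int n, unsigned int* bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // safe even when threads collide on a bin
}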
Matrix Multiplication using
Shared Memory

Review: Matrix Multiplication
Kernel using Multiple Blocks

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width + k] * Nd[k*Width + Col];

    Pd[Row*Width + Col] = Pvalue;
}
Idea: Use Shared Memory to Reuse
Global Memory Data

• Each input element is read by Width threads.
• Load each element into shared memory and have several threads use the local version to reduce the memory bandwidth
  – Tiled algorithms

Figure: row ty of M and column tx of N are each read by many threads computing elements of P.
Tiled Multiply

• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd

Figure: block (bx, by) computes the TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub; in each phase it works on one TILE_WIDTH-wide tile of Md and one of Nd.
A Small Example

Figure (repeated from earlier): the 4x4 Pd together with the rows of Md (Md0,0 … Md3,1) and columns of Nd (Nd0,0 … Nd1,3) used to compute the Block(0,0) tile.
Every Md and Nd Element is Used
Exactly Twice in Generating a 2x2 Tile of P

Access order (top to bottom), one column per thread:

P0,0 (thread0,0)   P1,0 (thread1,0)   P0,1 (thread0,1)   P1,1 (thread1,1)
M0,0 * N0,0        M0,0 * N1,0        M0,1 * N0,0        M0,1 * N1,0
M1,0 * N0,1        M1,0 * N1,1        M1,1 * N0,1        M1,1 * N1,1
M2,0 * N0,2        M2,0 * N1,2        M2,1 * N0,2        M2,1 * N1,2
M3,0 * N0,3        M3,0 * N1,3        M3,1 * N0,3        M3,1 * N1,3
Breaking Md and Nd into Tiles

Figure: the 4x4 Md, Nd, and Pd from the small example, each partitioned into 2x2 tiles; computing one Pd tile consumes the matching row of Md tiles and column of Nd tiles, one tile pair per phase.
Each Phase of a Thread Block Uses One
Tile from Md and One from Nd

         Phase 1                                       Phase 2
T0,0:  Md0,0 → Mds0,0,  Nd0,0 → Nds0,0,              Md2,0 → Mds0,0,  Nd0,2 → Nds0,0,
       Pvalue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1    Pvalue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
T1,0:  Md1,0 → Mds1,0,  Nd1,0 → Nds1,0,              Md3,0 → Mds1,0,  Nd1,2 → Nds1,0,
       Pvalue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1    Pvalue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
T0,1:  Md0,1 → Mds0,1,  Nd0,1 → Nds0,1,              Md2,1 → Mds0,1,  Nd0,3 → Nds0,1,
       Pvalue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1    Pvalue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
T1,1:  Md1,1 → Mds1,1,  Nd1,1 → Nds1,1,              Md3,1 → Mds1,1,  Nd1,3 → Nds1,1,
       Pvalue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1    Pvalue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1

Time runs left to right: in each phase, every thread loads one Md element and one Nd element into shared memory, then accumulates its partial product from the shared tiles.
First-order Size Considerations in G80

• Each thread block should have many threads
  – TILE_WIDTH of 16 gives 16*16 = 256 threads
• There should be many thread blocks
  – A 1024*1024 Pd gives 64*64 = 4096 thread blocks
• Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations.
  – Memory bandwidth is no longer a limiting factor
CUDA Code – Kernel Execution
Configuration
// Setup the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH,
Width / TILE_WIDTH);

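For completeness, a hedged sketch of the matching launch line (not shown on this slide, but consistent with the kernel and configuration above):

// Launch the tiled kernel with the configuration above
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);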
Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;
    float Pvalue = 0;

    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row*Width + Col] = Pvalue;
}
Tiled Multiply

• Each block computes one square sub-matrix Pdsub of size TILE_WIDTH
• Each thread computes one element of Pdsub

Figure: in phase m, block (bx, by) loads the m-th TILE_WIDTH x TILE_WIDTH tile of Md (from block row by) and of Nd (from block column bx), then accumulates over k within the tile.
G80 Shared Memory and Threading

• Each SM in G80 has 16KB shared memory
  – SM size is implementation dependent!
  – For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory.
  – Can potentially have up to 8 thread blocks actively executing
    • This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
  – The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8KB of shared memory usage per thread block, allowing only up to two thread blocks to be active at the same time
• Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16
  – The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
Tiling Size Effects

Figure: measured GFLOPS (0-100 scale) for the matrix multiplication kernel when not tiled and with 4x4, 8x8, 12x12, and 16x16 tiles, each shown as a "tiled only" and a "tiled & unrolled" variant.
Summary: Typical Structure of a
CUDA Program

• Global variables declaration
  – __host__
  – __device__… __global__, __constant__, __texture__
• Function prototypes
  – __global__ void kernelOne(…)
  – float handyFunction(…)
• Main()
  – allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes)
  – transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…)
  – execution configuration setup
  – kernel call – kernelOne<<<execution configuration>>>(args…);
  – transfer results from device to host – cudaMemcpy(h_GlblVarPtr, …)
    (the kernel call / result transfer pair repeats as needed)
  – optional: compare against golden (host-computed) solution
• Kernel – void kernelOne(type args, …)
  – variables declaration – __local__, __shared__
    • automatic variables transparently assigned to registers or local memory
  – __syncthreads()…
• Other functions
  – float handyFunction(int inVar…);
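A minimal self-contained sketch that follows this structure end to end, using a simple vector-add kernel as a stand-in for kernelOne; the names and sizes are illustrative, not lecture code.

#include <stdio.h>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void vecAddKernel(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_a[n], h_b[n], h_c[n];
    for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;                                 // allocate device memory
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);    // host -> device
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(256);                                     // execution configuration
    dim3 dimGrid((n + 255) / 256);
    vecAddKernel<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, n);  // kernel call

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);    // device -> host

    // Optional: compare against a golden host-computed solution
    int ok = 1;
    for (int i = 0; i < n; ++i) if (h_c[i] != h_a[i] + h_b[i]) ok = 0;
    printf("%s\n", ok ? "PASSED" : "FAILED");

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}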
Some Additional API Features

Application Programming Interface
• The API is an extension to the C programming
language
• It consists of:
– Language extensions
• To target portions of the code for execution on the device
– A runtime library split into:
• A common component providing built-in vector types and a
subset of the C runtime library in both host and device
codes
• A host component to control and access one or more
devices from the host
• A device component providing device-specific functions
Language Extensions:
Built-in Variables

• dim3 gridDim;
– Dimensions of the grid in blocks (gridDim.z
unused)
• dim3 blockDim;
– Dimensions of the block in threads
• dim3 blockIdx;
– Block index within the grid
• dim3 threadIdx;
– Thread index within the block
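A short hedged sketch that uses all four built-in variables together; the grid-stride loop is an illustrative pattern added here, not something from the slide.

// Each thread processes elements i, i + total, i + 2*total, ...
// so the kernel works for any array length n.
__global__ void scaleAll(float* data, int n, float s)
{
    int total = gridDim.x * blockDim.x;              // threads in the whole grid
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's global index
    for (; i < n; i += total)
        data[i] *= s;
}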
Common Runtime Component:
Mathematical Functions
• pow, sqrt, cbrt, hypot
• exp, exp2, expm1
• log, log2, log10, log1p
• sin, cos, tan, asin, acos, atan, atan2
• sinh, cosh, tanh, asinh, acosh, atanh
• ceil, floor, trunc, round
• Etc.
– When executed on the host, a given function uses
the C runtime implementation if available
– These functions are only supported for scalar types,
not vector types
Device Runtime Component:
Mathematical Functions
• Some mathematical functions (e.g. sinf(x)) have a less accurate, but faster, device-only version (e.g. __sinf(x)); these intrinsics are single precision
  – __powf
  – __logf, __log2f, __log10f
  – __expf
  – __sinf, __cosf, __tanf
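A hedged sketch contrasting the standard and fast device versions; the kernel is illustrative only. (Compiling with nvcc's --use_fast_math option maps the standard calls onto the fast intrinsics automatically.)

__global__ void sineTable(float* accurate, float* fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = i * 0.001f;
        accurate[i] = sinf(x);     // standard device math function
        fast[i]     = __sinf(x);   // faster intrinsic, reduced accuracy
    }
}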
Device Runtime Component:
Synchronization Function
• void __syncthreads();
• Synchronizes all threads in a block
• Once all threads have reached this point,
execution resumes normally
• Used to avoid RAW / WAR / WAW hazards
when accessing shared or global memory
• Allowed in conditional constructs only if the
conditional is uniform across the entire thread
block
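A hedged sketch of the barrier in use: a block-wide sum in shared memory, where every __syncthreads() sits in code reached uniformly by all threads of the block. The kernel is an illustration, not lecture code; blockDim.x is assumed to be a power of two no larger than 256.

__global__ void blockSum(const float* in, float* blockSums)
{
    __shared__ float buf[256];                    // one slot per thread
    int t = threadIdx.x;
    buf[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                              // all loads done before anyone reads buf

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride)                           // the divergent branch is fine...
            buf[t] += buf[t + stride];
        __syncthreads();                          // ...because the barrier is outside it,
                                                  // reached uniformly by the whole block
    }
    if (t == 0) blockSums[blockIdx.x] = buf[0];
}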
