Lecture 4

1. CUDA threads are organized into a grid of blocks, with each block containing a set of threads. Each thread is assigned a unique block ID and thread ID to identify its data.
2. Block and thread IDs simplify addressing multidimensional data by allowing threads to determine which elements to work on based on their position. This is useful for tasks like image processing and solving partial differential equations on volumes.
3. Matrix multiplication is used as an example to illustrate how data is transferred to the GPU, how threads are organized to perform the calculation in parallel, and how results are transferred back to the CPU. Blocks of threads each compute a tile of the output matrix.


CUDA Threads

Block IDs and Thread IDs

• Each thread uses IDs to decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing (see the indexing sketch below)
  – Solving PDEs on volumes
  – …

Figure 3.2: An example of CUDA thread organization. The host launches Kernel 1 on Grid 1 (a 2x2 arrangement of blocks) and Kernel 2 on Grid 2; Block (1, 1) is expanded to show its threads, indexed in 3D. Courtesy: NVIDIA.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE498AL, University of Illinois, Urbana-Champaign
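As a hedged illustration of the indexing described above, the sketch below uses block and thread IDs to compute a 2D pixel coordinate; the kernel name, image layout, and bounds check are assumptions for this example, not taken from the lecture.

// Minimal sketch: each thread brightens one pixel of a Width x Height
// grayscale image stored in row-major order.
__global__ void BrightenKernel(float* img, int Width, int Height, float delta)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < Width && y < Height)                     // guard against partial blocks
        img[y * Width + x] += delta;
}

// Host-side launch covering the whole image with 16x16-thread blocks:
// dim3 dimBlock(16, 16);
// dim3 dimGrid((Width + 15) / 16, (Height + 15) / 16);
// BrightenKernel<<<dimGrid, dimBlock>>>(d_img, Width, Height, 0.1f);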
Step 1: Matrix Multiplication
A Simple Host Version in C

// Matrix multiplication on the (CPU) host
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            float sum = 0;
            for (int k = 0; k < Width; ++k) {
                float a = M[i * Width + k];
                float b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}

Figure: element (i, j) of P is the dot product of row i of M and column j of N; M, N, and P are all WIDTH x WIDTH.
Step 2: Input Matrix Data Transfer
(Host-side Code)

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate and load M, N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);
Step 3: Output Matrix Data Transfer
(Host-side Code)

    // 2. Kernel invocation code – to be shown later

    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
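The host code above omits error handling for brevity. A minimal hedged sketch of how the same calls could be checked is shown below; the CUDA_CHECK macro is an assumption added for this example, not part of the lecture code.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage inside MatrixMulOnDevice, for example:
// CUDA_CHECK(cudaMalloc((void**)&Md, size));
// CUDA_CHECK(cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice));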
Step 4: Kernel Function

// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;
Step 4: Kernel Function (cont.)

    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y*Width + k];
        float Nelement = Nd[k*Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }

    Pd[threadIdx.y*Width + threadIdx.x] = Pvalue;
}

Figure: thread (tx, ty) reads row ty of Md and column tx of Nd to compute element (tx, ty) of Pd; all matrices are WIDTH x WIDTH.
Step 5: Kernel Invocation
(Host-side Code)

// Setup the execution configuration
dim3 dimGrid(1, 1);
dim3 dimBlock(Width, Width);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
Step 6: Handling Arbitrary Sized Square
Matrices

• Have each 2D thread block compute a (TILE_WIDTH)^2 sub-matrix (tile) of the result matrix
  – Each block has (TILE_WIDTH)^2 threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks (a sketch of the grid setup follows below)
• You still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than the max grid size (64K)!

Figure: Pd is partitioned into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by) computes one tile, and thread (tx, ty) computes one element of that tile.
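A minimal sketch of the grid setup suggested above, assuming TILE_WIDTH = 16 and a boundary check inside the kernel for widths that are not a multiple of TILE_WIDTH (both are assumptions for this example):

#define TILE_WIDTH 16

// Round the grid size up so partial tiles at the matrix edges are covered.
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid((Width + TILE_WIDTH - 1) / TILE_WIDTH,
             (Width + TILE_WIDTH - 1) / TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

// Inside the kernel, each thread would then guard its work:
// if (Row < Width && Col < Width) Pd[Row*Width + Col] = Pvalue;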
Matrix Multiplication Using
Multiple Blocks

• Break up Pd into tiles
• Each block calculates one tile
  – Each thread calculates one element
  – Block size equals tile size

Figure: Pd divided into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by) produces the sub-matrix Pdsub, with thread (tx, ty) producing one of its elements.
A Small Example

TILE_WIDTH = 2: the 4x4 result matrix P is covered by four 2x2 blocks.

Block(0,0): P0,0 P1,0 / P0,1 P1,1     Block(1,0): P2,0 P3,0 / P2,1 P3,1
Block(0,1): P0,2 P1,2 / P0,3 P1,3     Block(1,1): P2,2 P3,2 / P2,3 P3,3
A Small Example: Multiplication

Figure: computing the Block(0,0) tile of the 4x4 Pd uses the top rows of Md (Md0,0 … Md3,1) and the left columns of Nd (Nd0,0 … Nd1,3).
Revised Matrix Multiplication
Kernel using Multiple Blocks

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width + k] * Nd[k*Width + Col];

    Pd[Row*Width + Col] = Pvalue;
}
CUDA Thread Block

• All threads in a block execute the same kernel program (SPMD)
• Programmer declares block:
  – Block size: 1 to 512 concurrent threads
  – Block shape: 1D, 2D, or 3D
  – Block dimensions in threads
• Threads have thread id numbers within the block
  – Thread program uses thread id to select work and address shared data
• Threads in the same block share data and synchronize while doing their share of the work
• Threads in different blocks cannot cooperate
  – Each block can execute in any order relative to other blocks!

Figure: a CUDA thread block with thread ids 0, 1, 2, … m, all running the same thread program. Courtesy: John Nickolls, NVIDIA.
Transparent Scalability

• Hardware is free to assign blocks to any processor at any time
  – A kernel scales across any number of parallel processors

Figure: the same kernel grid of Blocks 0–7 runs two blocks at a time on a two-SM device and four at a time on a four-SM device. Each block can execute in any order relative to other blocks.
G80 Example: Executing Thread Blocks

• Threads are assigned to Streaming Multiprocessors (SMs) in block granularity
  – Up to 8 blocks to each SM as resource allows
  – An SM in G80 can take up to 768 threads
    • Could be 256 (threads/block) * 3 blocks
    • Or 128 (threads/block) * 6 blocks, etc.
• Threads run concurrently
  – SM maintains thread/block id #s
  – SM manages/schedules thread execution

Figure: blocks assigned to SM 0 and SM 1; each SM has an MT issue unit, SPs, and shared memory, and runs threads t0, t1, …, tm.
G80 Example: Thread Scheduling

• Each block is executed as 32-thread warps
  – An implementation decision, not part of the CUDA programming model
  – Warps are scheduling units in the SM
• If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  – Each block is divided into 256/32 = 8 warps
  – There are 8 * 3 = 24 warps

Figure: warps of Block 1 and Block 2 (t0 … t31 each) scheduled onto a Streaming Multiprocessor containing an instruction L1 cache, instruction fetch/dispatch, shared memory, SPs, and SFUs.
G80 Example: Thread Scheduling
(Cont.)

• SM implements zero-overhead warp scheduling
  – At any time, only one of the warps is executed by the SM
  – Warps whose next instruction has its operands ready for consumption are eligible for execution
  – Eligible warps are selected for execution on a prioritized scheduling policy
  – All threads in a warp execute the same instruction when selected

Figure: a scheduling timeline in which warps from thread blocks TB1, TB2, and TB3 are interleaved; whenever a warp stalls (e.g. TB1 W1), the SM switches to another eligible warp. TB = Thread Block, W = Warp.
G80 Block Granularity Considerations

• For matrix multiplication using multiple blocks, should I use 8x8, 16x16, or 32x32 blocks?

  – For 8x8, we have 64 threads per block. Since each SM can take up to 768 threads, that would be 12 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!

  – For 16x16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule.

  – For 32x32, we have 1024 threads per block. Not even one can fit into an SM! (See the calculation sketch below.)
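A small hedged calculation sketch of the reasoning above; the limits (768 threads and 8 blocks per SM, 512 threads per block) are the G80 figures from this lecture, and the program itself is an illustration.

#include <stdio.h>

int main(void)
{
    const int maxThreadsPerSM = 768, maxBlocksPerSM = 8, maxThreadsPerBlock = 512;
    const int dims[] = {8, 16, 32};
    for (int i = 0; i < 3; ++i) {
        int threadsPerBlock = dims[i] * dims[i];
        if (threadsPerBlock > maxThreadsPerBlock) {
            printf("%2dx%2d: %4d threads/block - does not fit in a single block\n",
                   dims[i], dims[i], threadsPerBlock);
            continue;
        }
        int blocks = maxThreadsPerSM / threadsPerBlock;       // limited by thread count
        if (blocks > maxBlocksPerSM) blocks = maxBlocksPerSM; // and by block count
        printf("%2dx%2d: %d blocks/SM, %d threads/SM\n",
               dims[i], dims[i], blocks, blocks * threadsPerBlock);
    }
    return 0;
}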
CUDA Memories

G80 Implementation of CUDA Memories

• Each thread can:
  – Read/write per-thread registers
  – Read/write per-thread local memory
  – Read/write per-block shared memory
  – Read/write per-grid global memory
  – Read-only per-grid constant memory

Figure: a grid with two blocks; each block has its own shared memory and per-thread registers, while the host and all threads access the per-grid global and constant memories.
CUDA Variable Type Qualifiers

Variable declaration                         Memory     Scope    Lifetime
__device__ __local__    int LocalVar;        local      thread   thread
__device__ __shared__   int SharedVar;       shared     block    block
__device__              int GlobalVar;       global     grid     application
__device__ __constant__ int ConstantVar;     constant   grid     application

• __device__ is optional when used with __local__, __shared__, or __constant__
• Automatic variables without any qualifier reside in a register
  – Except arrays that reside in local memory
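A brief hedged sketch of where each qualifier may appear; all names are made up for this illustration and are not part of the lecture.

__constant__ float coeffs[16];          // per-grid constant memory (read-only in kernels)
__device__   float lastBlockValue;      // per-grid global memory

__global__ void QualifierDemo(float* out)
{
    __shared__ float tileBuf[256];      // per-block shared memory
    float localVal = coeffs[threadIdx.x % 16];  // automatic scalar -> register
    float scratch[4] = {0, 0, 0, 0};    // automatic array -> local memory

    tileBuf[threadIdx.x] = localVal + scratch[0];
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tileBuf[threadIdx.x];
    if (threadIdx.x == 0)
        lastBlockValue = tileBuf[0];    // one thread writes the global variable
}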
Where to Declare Variables?

Can the host access it?
• Yes → global or constant: declare outside of any function
• No → register (automatic), shared, or local: declare in the kernel
Variable Type Restrictions
• Pointers can only point to memory allocated or
declared in global memory:
– Allocated in the host and passed to the kernel:
__global__ void KernelFunc(float* ptr)
– Obtained as the address of a global variable:
float* ptr = &GlobalVar;

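A brief hedged sketch combining the two allowed pointer patterns; GlobalVar is a __device__ global here and ScaleKernel is an illustrative name, not lecture code.

__device__ float GlobalVar;

__global__ void ScaleKernel(float* ptr, int n)    // ptr was cudaMalloc'ed on the host
{
    float* gv = &GlobalVar;                       // address of a global-scope variable
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        ptr[i] *= *gv;                            // both pointers refer to global memory
}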
A Common Programming Strategy
• Global memory resides in device memory (DRAM)
- much slower access than shared memory
• So, a profitable way of performing computation on
the device is to tile data to take advantage of fast
shared memory:
– Partition data into subsets that fit into shared memory
– Handle each data subset with one thread block by:
• Loading the subset from global memory to shared memory,
using multiple threads to exploit memory-level parallelism
• Performing the computation on the subset from shared
memory; each thread can efficiently multi-pass over any data
element
• Copying results from shared memory to global memory
A Common Programming Strategy
(Cont.)

• Constant memory also resides in device memory (DRAM) - much slower access than shared memory
  – But… cached!
  – Highly efficient access for read-only data
• Carefully divide data according to access patterns
  – Read-only → constant memory (very fast if in cache)
  – Read/write, shared within block → shared memory (very fast)
  – Read/write within each thread → registers (very fast)
  – Read/write inputs/results → global memory (very slow)

For texture memory usage, see the NVIDIA documentation.
GPU Atomic Integer Operations

• Atomic operations on integers in global memory:
  – Associative operations on signed/unsigned ints
  – add, sub, min, max, ...
  – and, or, xor
  – increment, decrement
  – exchange, compare-and-swap
• Requires hardware with compute capability 1.1 and above.
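A short hedged sketch of an atomic operation on global memory: a byte histogram where many threads may increment the same bin. The kernel and buffer names are assumptions for this example (bins is assumed to be zeroed beforehand), and it requires compute capability 1.1 or above, as noted.

__global__ void HistogramKernel(const unsigned char* data, int n, unsigned int* bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // safe even when threads collide on a bin
}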
Matrix Multiplication using
Shared Memory

Review: Matrix Multiplication
Kernel using Multiple Blocks

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width + k] * Nd[k*Width + Col];

    Pd[Row*Width + Col] = Pvalue;
}
Idea: Use Shared Memory to Reuse
Global Memory Data

• Each input element is read by Width threads.
• Load each element into shared memory and have several threads use the local version to reduce the memory bandwidth
  – Tiled algorithms

Figure: row ty of M and column tx of N are each read by many threads computing elements of P.
Tiled Multiply

• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd

Figure: block (bx, by) computes the TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub; in each phase it works on one TILE_WIDTH-wide tile of Md and one of Nd.
A Small Example

Figure (repeated from earlier): the 4x4 Pd together with the rows of Md (Md0,0 … Md3,1) and columns of Nd (Nd0,0 … Nd1,3) used to compute the Block(0,0) tile.
Every Md and Nd Element is Used
Exactly Twice in Generating a 2x2 Tile of P

Access order (top to bottom), one column per thread:

P0,0 (thread0,0)   P1,0 (thread1,0)   P0,1 (thread0,1)   P1,1 (thread1,1)
M0,0 * N0,0        M0,0 * N1,0        M0,1 * N0,0        M0,1 * N1,0
M1,0 * N0,1        M1,0 * N1,1        M1,1 * N0,1        M1,1 * N1,1
M2,0 * N0,2        M2,0 * N1,2        M2,1 * N0,2        M2,1 * N1,2
M3,0 * N0,3        M3,0 * N1,3        M3,1 * N0,3        M3,1 * N1,3
Breaking Md and Nd into Tiles

Figure: the 4x4 Md, Nd, and Pd from the small example, each partitioned into 2x2 tiles; computing one Pd tile consumes the matching row of Md tiles and column of Nd tiles, one tile pair per phase.
Each Phase of a Thread Block Uses One
Tile from Md and One from Nd

         Phase 1                                       Phase 2
T0,0:  Md0,0 → Mds0,0,  Nd0,0 → Nds0,0,              Md2,0 → Mds0,0,  Nd0,2 → Nds0,0,
       Pvalue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1    Pvalue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
T1,0:  Md1,0 → Mds1,0,  Nd1,0 → Nds1,0,              Md3,0 → Mds1,0,  Nd1,2 → Nds1,0,
       Pvalue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1    Pvalue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
T0,1:  Md0,1 → Mds0,1,  Nd0,1 → Nds0,1,              Md2,1 → Mds0,1,  Nd0,3 → Nds0,1,
       Pvalue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1    Pvalue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
T1,1:  Md1,1 → Mds1,1,  Nd1,1 → Nds1,1,              Md3,1 → Mds1,1,  Nd1,3 → Nds1,1,
       Pvalue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1    Pvalue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1

Time runs left to right: in each phase, every thread loads one Md element and one Nd element into shared memory, then accumulates its partial product from the shared tiles.
First-order Size Considerations in G80

• Each thread block should have many threads
  – TILE_WIDTH of 16 gives 16*16 = 256 threads
• There should be many thread blocks
  – A 1024*1024 Pd gives 64*64 = 4096 thread blocks
• Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations.
  – Memory bandwidth is no longer a limiting factor
CUDA Code – Kernel Execution
Configuration
// Setup the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH,
Width / TILE_WIDTH);

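For completeness, a hedged sketch of the matching launch line (not shown on this slide, but consistent with the kernel and configuration above):

// Launch the tiled kernel with the configuration above
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);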
Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;
    float Pvalue = 0;

    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row*Width + Col] = Pvalue;
}
Tiled Multiply

• Each block computes one square sub-matrix Pdsub of size TILE_WIDTH
• Each thread computes one element of Pdsub

Figure: in phase m, block (bx, by) loads the m-th TILE_WIDTH x TILE_WIDTH tile of Md (from block row by) and of Nd (from block column bx), then accumulates over k within the tile.
G80 Shared Memory and Threading

• Each SM in G80 has 16KB shared memory
  – SM size is implementation dependent!
  – For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory.
  – Can potentially have up to 8 thread blocks actively executing
    • This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
  – The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8KB of shared memory usage per thread block, allowing only up to two thread blocks to be active at the same time
• Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16
  – The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
Tiling Size Effects

Figure: measured GFLOPS (0-100 scale) for the matrix multiplication kernel when not tiled and with 4x4, 8x8, 12x12, and 16x16 tiles, each shown as a "tiled only" and a "tiled & unrolled" variant.
Summary: Typical Structure of a
CUDA Program

• Global variables declaration
  – __host__
  – __device__… __global__, __constant__, __texture__
• Function prototypes
  – __global__ void kernelOne(…)
  – float handyFunction(…)
• Main()
  – allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes)
  – transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…)
  – execution configuration setup
  – kernel call – kernelOne<<<execution configuration>>>(args…);
  – transfer results from device to host – cudaMemcpy(h_GlblVarPtr, …)
    (the kernel call / result transfer pair repeats as needed)
  – optional: compare against golden (host-computed) solution
• Kernel – void kernelOne(type args, …)
  – variables declaration – __local__, __shared__
    • automatic variables transparently assigned to registers or local memory
  – __syncthreads()…
• Other functions
  – float handyFunction(int inVar…);
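A minimal self-contained sketch that follows this structure end to end, using a simple vector-add kernel as a stand-in for kernelOne; the names and sizes are illustrative, not lecture code.

#include <stdio.h>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void vecAddKernel(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_a[n], h_b[n], h_c[n];
    for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;                                 // allocate device memory
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);    // host -> device
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(256);                                     // execution configuration
    dim3 dimGrid((n + 255) / 256);
    vecAddKernel<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, n);  // kernel call

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);    // device -> host

    // Optional: compare against a golden host-computed solution
    int ok = 1;
    for (int i = 0; i < n; ++i) if (h_c[i] != h_a[i] + h_b[i]) ok = 0;
    printf("%s\n", ok ? "PASSED" : "FAILED");

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}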
Some Additional API Features

Application Programming Interface
• The API is an extension to the C programming
language
• It consists of:
– Language extensions
• To target portions of the code for execution on the device
– A runtime library split into:
• A common component providing built-in vector types and a
subset of the C runtime library in both host and device
codes
• A host component to control and access one or more
devices from the host
• A device component providing device-specific functions
Language Extensions:
Built-in Variables

• dim3 gridDim;
– Dimensions of the grid in blocks (gridDim.z
unused)
• dim3 blockDim;
– Dimensions of the block in threads
• dim3 blockIdx;
– Block index within the grid
• dim3 threadIdx;
– Thread index within the block
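A short hedged sketch that uses all four built-in variables together; the grid-stride loop is an illustrative pattern added here, not something from the slide.

// Each thread processes elements i, i + total, i + 2*total, ...
// so the kernel works for any array length n.
__global__ void scaleAll(float* data, int n, float s)
{
    int total = gridDim.x * blockDim.x;              // threads in the whole grid
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's global index
    for (; i < n; i += total)
        data[i] *= s;
}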
Common Runtime Component:
Mathematical Functions
• pow, sqrt, cbrt, hypot
• exp, exp2, expm1
• log, log2, log10, log1p
• sin, cos, tan, asin, acos, atan, atan2
• sinh, cosh, tanh, asinh, acosh, atanh
• ceil, floor, trunc, round
• Etc.
– When executed on the host, a given function uses
the C runtime implementation if available
– These functions are only supported for scalar types,
not vector types
Device Runtime Component:
Mathematical Functions
• Some mathematical functions (e.g. sinf(x)) have a less accurate, but faster, device-only version (e.g. __sinf(x)); these intrinsics are single precision
  – __powf
  – __logf, __log2f, __log10f
  – __expf
  – __sinf, __cosf, __tanf
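A hedged sketch contrasting the standard and fast device versions; the kernel is illustrative only. (Compiling with nvcc's --use_fast_math option maps the standard calls onto the fast intrinsics automatically.)

__global__ void sineTable(float* accurate, float* fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = i * 0.001f;
        accurate[i] = sinf(x);     // standard device math function
        fast[i]     = __sinf(x);   // faster intrinsic, reduced accuracy
    }
}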
Device Runtime Component:
Synchronization Function
• void __syncthreads();
• Synchronizes all threads in a block
• Once all threads have reached this point,
execution resumes normally
• Used to avoid RAW / WAR / WAW hazards
when accessing shared or global memory
• Allowed in conditional constructs only if the
conditional is uniform across the entire thread
block
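A hedged sketch of the barrier in use: a block-wide sum in shared memory, where every __syncthreads() sits in code reached uniformly by all threads of the block. The kernel is an illustration, not lecture code; blockDim.x is assumed to be a power of two no larger than 256.

__global__ void blockSum(const float* in, float* blockSums)
{
    __shared__ float buf[256];                    // one slot per thread
    int t = threadIdx.x;
    buf[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                              // all loads done before anyone reads buf

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride)                           // the divergent branch is fine...
            buf[t] += buf[t + stride];
        __syncthreads();                          // ...because the barrier is outside it,
                                                  // reached uniformly by the whole block
    }
    if (t == 0) blockSums[blockIdx.x] = buf[0];
}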
