Lecture 4
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE498AL, University of Illinois, Urbana-Champaign
Block IDs and Thread IDs
• Each thread uses IDs to decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
[Figure 3.2: An example of CUDA thread organization. The host launches a kernel on a device grid of blocks, Block (0,0) through Block (1,1); each block contains a 3D arrangement of threads, Thread (0,0,0) through Thread (3,0,1). Courtesy: NVIDIA]
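As a concrete illustration, here is a minimal sketch (not from the original slides; the kernel name, image layout, and scaling operation are assumptions) of how a kernel combines block and thread IDs into global 2D coordinates:

// Hypothetical kernel: block and thread IDs jointly locate one pixel.
__global__ void ScaleImage(float* img, int Width, int Height, float s)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // global x coordinate
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // global y coordinate
    if (row < Height && col < Width)                  // guard partial blocks
        img[row * Width + col] *= s;
}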
Step 1: Matrix Multiplication
A Simple Host Version in C

// Matrix multiplication on the (CPU) host
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            float sum = 0;
            for (int k = 0; k < Width; ++k) {
                float a = M[i * Width + k];
                float b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}

[Figure: M, N, and P are Width x Width matrices; row i of M and column j of N combine over index k to produce P[i][j].]
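A quick, hedged usage sketch (the values here are chosen purely for illustration): with every element of M set to 1 and every element of N set to 2, each dot product sums Width terms of 1*2.

// Hypothetical driver for the host version.
int Width = 4;
float M[16], N[16], P[16];
for (int i = 0; i < 16; ++i) { M[i] = 1.0f; N[i] = 2.0f; }
MatrixMulOnHost(M, N, P, Width);   // every P element becomes 4 * (1*2) = 8.0f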
Step 2: Input Matrix Data Transfer (Host-side Code)

cudaMalloc(&Nd, size);
cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

Step 4: Kernel Function (a single block computes all of Pd)

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // each thread computes one element of Pd from its 2D thread index
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[threadIdx.y * Width + k] * Nd[k * Width + threadIdx.x];
    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}

[Figure: thread (tx, ty) walks row ty of Md and column tx of Nd to produce one element of Pd.]
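For completeness, a hedged sketch of the surrounding host-side flow (the MatrixMulOnDevice name and the single-block launch are assumptions consistent with this kernel's use of threadIdx only): allocate device memory, copy inputs over, launch, copy the result back, and free.

// Hypothetical host driver: one Width x Width block computes all of Pd.
void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;
    cudaMalloc(&Md, size);  cudaMalloc(&Nd, size);  cudaMalloc(&Pd, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    dim3 dimGrid(1, 1);              // a single block
    dim3 dimBlock(Width, Width);     // one thread per Pd element
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md);  cudaFree(Nd);  cudaFree(Pd);
}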
Handling Arbitrary-Sized Square Matrices
• Have each 2D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix
  – Each has (TILE_WIDTH)² threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks
• You still need a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than the max grid size (64K)!
Matrix Multiplication Using Multiple Blocks
• Break up Pd into tiles
• Each block calculates one tile
  – Each thread calculates one element
  – Block size equals tile size
[Figure: block (bx, by) computes the TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub of Pd; thread (tx, ty) computes one element of it.]
Revised Matrix Multiplication Kernel Using Multiple Blocks

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Row and column of the Pd element computed by this thread
    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];
    Pd[Row * Width + Col] = Pvalue;
}
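The matching launch configuration follows directly from the tiling scheme above; a hedged sketch (the TILE_WIDTH value of 16 is an assumption, though it matches the value used later in this lecture):

// Hypothetical launch: (WIDTH/TILE_WIDTH)^2 blocks of (TILE_WIDTH)^2 threads.
#define TILE_WIDTH 16
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);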
CUDA Thread Block
• All threads in a block execute the same kernel program (SPMD)
• Programmer declares block:
  – Block size: 1 to 512 concurrent threads
  – Block shape: 1D, 2D, or 3D
  – Block dimensions in threads
• Threads have thread id numbers within the block
  – Thread program uses thread id to select work and address shared data
[Figure: a thread block with thread ids 0, 1, 2, 3, … m, all running the same thread program.]

Transparent Scalability
• Each block can execute in any order relative to other blocks.
[Figure: Block 0 through Block 7 distributed across SMs.]
Thread Scheduling (G80)
• Each block is executed as 32-thread Warps
  – An implementation decision, not part of the CUDA programming model
  – Warps are scheduling units in the SM
• If 3 blocks are assigned to an SM and each block has 256 threads, how many Warps are there in the SM?
  – Each block is divided into 256/32 = 8 Warps
  – There are 8 * 3 = 24 Warps
[Figure: a Streaming Multiprocessor (Instruction L1, MT issue units, SPs) interleaving warps from thread blocks TB1, TB2, TB3 as individual warps stall.]
G80 Block Granularity Considerations
• For matrix multiplication using multiple blocks, should I use 8X8, 16X16, or 32X32 blocks?
  – For 8X8, we have 64 threads per block. Since each SM can take up to 768 threads, that would be 12 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!
  – For 16X16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule.
  – For 32X32, we have 1024 threads per block. Not even one can fit into an SM!
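The arithmetic behind these three cases can be captured in a small hedged helper (not from the slides; the 768-thread and 8-block limits are the G80 values quoted above):

// Hypothetical helper: how many blocks of a given square size fit on a
// G80 SM, which takes at most 768 threads and 8 blocks.
int BlocksPerSM(int blockEdge)
{
    int threadsPerBlock = blockEdge * blockEdge;
    if (threadsPerBlock > 768) return 0;      // 32x32 = 1024 threads: none fit
    int byThreads = 768 / threadsPerBlock;    // limit from the thread budget
    return byThreads < 8 ? byThreads : 8;     // hard limit of 8 blocks per SM
}
// BlocksPerSM(8)  == 8   (8 * 64  = 512 threads: SM left under-filled)
// BlocksPerSM(16) == 3   (3 * 256 = 768 threads: full capacity)
// BlocksPerSM(32) == 0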
G80 Implementation of CUDA Memories
• Each thread can:
  – Read/write per-thread registers
  – Read/write per-thread local memory
  – Read/write per-block shared memory
  – Read/write per-grid global memory
  – Read-only per-grid constant memory
[Figure: a grid of blocks, each with its own shared memory and per-thread registers; the host reads/writes global memory and reads constant memory.]

Where to Declare Variables?
• Global and constant variables, which the host can access: declare outside of any function
• Register and shared variables, which the host cannot access: declare in the kernel
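A hedged sketch of what that placement looks like in source (all names are assumptions; a 16-thread 1D block is assumed):

// Hypothetical declarations placed according to the rule above.
__constant__ float coeff[16];      // per-grid constant memory: outside any function

__global__ void Scale16(float* out)
{
    __shared__ float tile[16];            // per-block shared memory: in the kernel
    float r = 2.0f * coeff[threadIdx.x];  // automatic scalar: per-thread register
    tile[threadIdx.x] = r;
    out[threadIdx.x] = tile[threadIdx.x];
}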
Idea: Use Shared Memory to Reuse Global Memory Data
• Each input element is read by WIDTH threads
• Load each element into shared memory and have several threads use the local version to reduce the memory bandwidth
  – Tiled algorithms
[Figure: M, N, and P; every thread computing column tx of P reads the same column of N.]
Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd
[Figure: block (bx, by) computes the TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub of Pd; each phase loads one TILE_WIDTH-wide strip of Md and Nd.]
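The tiled kernel's source does not survive in this extract, so the following is a hedged reconstruction of the standard shared-memory tiling pattern (the Mds/Nds names are assumptions; Width is assumed to be a multiple of TILE_WIDTH, and the 2*TILE_WIDTH² shared-memory footprint matches the accounting later in this lecture):

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // One tile of Md and one tile of Nd live in shared memory per phase.
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Phase m: each thread loads one element of each tile.
        Mds[ty][tx] = Md[Row * Width + m * TILE_WIDTH + tx];
        Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
        __syncthreads();                       // wait until the tile is loaded
        for (int k = 0; k < TILE_WIDTH; ++k)   // compute from the shared copies
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();                       // done before the tile is overwritten
    }
    Pd[Row * Width + Col] = Pvalue;
}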
A Small Example
[Figure: the order, over time, in which the elements Nd0,0 through Nd1,3 are accessed by the threads computing a 2x2 tile of Pd; each Nd element is used twice.]
First-order Size Considerations in G80
• Each block computes one square sub-matrix Pdsub of size TILE_WIDTH x TILE_WIDTH
• Each thread computes one element of Pdsub
[Figure: tiled multiply layout; block (bx, by) and thread (tx, ty) locate Pdsub within Pd, with phase index m selecting the current TILE_WIDTH strip of Md and Nd.]
G80 Shared Memory and Threading
• Each SM in G80 has 16KB shared memory
  – SM size is implementation dependent!
  – For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory, so shared memory can potentially support up to 8 thread blocks actively executing
  – This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
  – The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8KB of shared memory usage per thread block, allowing only up to two thread blocks to be active at the same time
• Using 16x16 tiling, we reduce accesses to global memory by a factor of 16
  – The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
[Figure: measured GFLOPS, on a 0 to 100 scale, for the matrix multiplication kernel: not tiled, then 4x4, 8x8, 12x12, and 16x16 tiles, each shown tiled only and tiled & unrolled.]
Language Extensions: Built-in Variables
• dim3 gridDim;
  – Dimensions of the grid in blocks (gridDim.z unused)
• dim3 blockDim;
  – Dimensions of the block in threads
• dim3 blockIdx;
  – Block index within the grid
• dim3 threadIdx;
  – Thread index within the block
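A hedged example (not from the slides) that exercises all four built-ins; the grid-stride loop shown is one common way to cover arrays larger than the launched grid:

// Hypothetical kernel: each thread starts at its global index and strides
// by the total number of threads in the grid.
__global__ void AddOne(float* data, int n)
{
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int stride = gridDim.x * blockDim.x;                 // threads in the grid
    for (int i = idx; i < n; i += stride)
        data[i] += 1.0f;
}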
Common Runtime Component: Mathematical Functions
• pow, sqrt, cbrt, hypot
• exp, exp2, expm1
• log, log2, log10, log1p
• sin, cos, tan, asin, acos, atan, atan2
• sinh, cosh, tanh, asinh, acosh, atanh
• ceil, floor, trunc, round
• etc.
  – When executed on the host, a given function uses the C runtime implementation if available
  – These functions are only supported for scalar types, not vector types
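A hedged illustration (function names assumed) of the same math function compiling on both sides:

#include <math.h>

// On the device, sqrtf uses the device implementation.
__global__ void RootKernel(float* v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = sqrtf(v[i]);
}

// On the host, the same call uses the C runtime implementation.
void RootOnHost(float* v, int n)
{
    for (int i = 0; i < n; ++i) v[i] = sqrtf(v[i]);
}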
Device Runtime Component: Mathematical Functions
• Some mathematical functions (e.g. sinf(x)) have a less accurate, but faster, device-only version (e.g. __sinf(x))
  – __powf
  – __logf, __log2f, __log10f
  – __expf
  – __sinf, __cosf, __tanf
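A hedged sketch of the trade-off (the kernel name is assumed); note that nvcc's -use_fast_math option maps sinf to __sinf automatically:

// Hypothetical kernel: the intrinsic is faster but less accurate.
__global__ void Phase(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = __sinf(x[i]);   // vs. the more accurate sinf(x[i])
}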