VSCSE-Lecture3-cuda-memory-model-2012
Lecture 3:
Memory Model and Locality
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
Programmer View of CUDA Memories
• Each thread can:
  – Read/write per-thread registers (~1 cycle)
  – Read/write per-block shared memory
  – Read/write per-grid global memory
  – Read per-grid constant memory
[Figure: a grid of thread blocks, e.g. Block (0, 0) and Block (1, 0); each block has its own registers and shared memory, and all blocks access the same global and constant memory]
CUDA Variable Type Qualifiers
Variable declaration                        Memory    Scope   Lifetime
int LocalVar;                               register  thread  thread
__device__ __shared__ int SharedVar;        shared    block   block
__device__ int GlobalVar;                   global    grid    application
__device__ __constant__ int ConstantVar;    constant  grid    application

Automatic variables such as LocalVar are declared in the kernel; __device__ and __constant__ variables are declared outside of any function.
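A minimal sketch of how these qualifiers appear in code (hypothetical kernel and variable uses, not from the lecture):

// Declared outside of any function: one copy per application, visible to all threads
__device__ int GlobalVar;
__device__ __constant__ int ConstantVar;

__global__ void QualifierDemoKernel(int* d_out)
{
  int LocalVar = threadIdx.x;        // automatic variable: held in a register, one per thread
  __shared__ int SharedVar;          // shared memory: one copy per thread block
  if (threadIdx.x == 0)
    SharedVar = GlobalVar + ConstantVar;
  __syncthreads();                   // make the shared value visible to every thread in the block
  d_out[blockIdx.x * blockDim.x + threadIdx.x] = LocalVar + SharedVar;
}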
A Common Programming Strategy
• Global memory resides in device memory (DRAM)
  – slow access
• So, a profitable way of performing computation on the device is to tile input data to take advantage of fast shared memory:
  – Partition data into subsets that fit into shared memory
  – Handle each data subset with one thread block by:
    • Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    • Performing the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element
    • Copying results from shared memory back to global memory
Matrix-Matrix Multiplication using
Shared Memory
Base Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
  // Calculate the row index of the d_P element (and of d_M)
  int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
  // Calculate the column index of d_P (and of d_N)
  int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

  float Pvalue = 0;
  // Each thread computes one element of d_P
  for (int k = 0; k < Width; ++k)
    Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
  d_P[Row*Width+Col] = Pvalue;
}
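A hypothetical host-side launch for this kernel (a sketch assuming Width is a multiple of TILE_WIDTH and that d_M, d_N, and d_P have already been allocated and filled with cudaMalloc/cudaMemcpy):

// One thread per d_P element, organized into TILE_WIDTH x TILE_WIDTH thread blocks
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH, 1);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);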
How about performance on G80?
• All threads access global memory for their input matrix elements
  – Two memory accesses (8 bytes) per floating-point multiply-add
  – 4 bytes of memory bandwidth needed per FLOP
  – 4 × 346.5 = 1386 GB/s of bandwidth would be required to reach the peak FLOP rating
Shared Memory Blocking Basic Idea
[Figure: without blocking, Thread 1, Thread 2, … each read their inputs directly from global memory; with blocking, the inputs are first copied from global memory into on-chip memory, and the threads read them from there]
Basic Concept of Blocking/Tiling
• In a congested traffic system, a significant reduction in the number of vehicles can greatly reduce the delay seen by all vehicles
  – Carpooling for commuters
  – Blocking/Tiling for global memory accesses
    • drivers = threads
    • cars = data
Some computations are more challenging to block/tile than others.
• Some carpools may be easier than others
  – More efficient if neighbors are also classmates or co-workers
  – Some vehicles may be more suitable for carpooling
• Similar variations exist in blocking/tiling
Carpools need synchronization.
• Good – when people have similar schedules
[Figure: daily timelines (sleep, work, dinner) for Worker A and Worker B]
Same with Blocking/Tiling
[Figure: access timelines for Thread 1 and Thread 2; tiling works best when the threads sharing a tile access it at similar times]
Idea: Use Shared Memory to reuse global memory data
• Each input element is read by WIDTH threads.
• Load each element into shared memory and have several threads use the local version to reduce the memory bandwidth
  – Tiled algorithms
[Figure: WIDTH × WIDTH matrices M, N, and P; thread (tx, ty) reads a row of M and a column of N to produce one element of P]
Work for Block (0,0) in a TILE_WIDTH = 2 Configuration
• Col = blockIdx.x * blockDim.x + threadIdx.x = 0 * 2 + threadIdx.x, and Row = blockIdx.y * blockDim.y + threadIdx.y = 0 * 2 + threadIdx.y, so block (0,0) covers Col = 0, 1 and Row = 0, 1
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd
[Figure: Md, Nd, and Pd divided into TILE_WIDTH × TILE_WIDTH tiles; block (0,0) computes the Pdsub tile in the corner of Pd]
Loading a Tile
• All threads in a block participate
  – Each thread loads one Md element and one Nd element in the basic tiled code
Work for Block (0,0)
[Figure: 4 × 4 matrices M and P in a TILE_WIDTH = 2 configuration; in each phase, block (0,0) copies one 2 × 2 tile of M (e.g. M0,0, M0,1, M1,0, M1,1) into shared memory (SM) and uses it to update its P0,0, P0,1, P1,0, P1,1 sub-matrix]
Barrier Synchronization
• An API function call in CUDA
  – __syncthreads()
  – All threads in a block must reach the barrier before any of them can continue, so once the barrier has been passed it is safe for a thread to use tile elements loaded by other threads
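A minimal, self-contained sketch of the barrier's effect (hypothetical kernel, not from the lecture): each thread stages one element in shared memory, and __syncthreads() guarantees that all of those writes are finished before any thread reads an element written by a different thread.

#define BLOCK_SIZE 256   // assumed equal to blockDim.x at launch

// Reverse each block-sized segment of d_data in place using shared memory
__global__ void ReverseBlockKernel(int* d_data)
{
  __shared__ int s[BLOCK_SIZE];
  int t = threadIdx.x;
  int base = blockIdx.x * blockDim.x;
  s[t] = d_data[base + t];                     // each thread writes one element
  __syncthreads();                             // barrier: all writes to s[] complete here
  d_data[base + t] = s[blockDim.x - 1 - t];    // read an element written by another thread
}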
Loading an M Tile
• Upper-left corner of the M tile at step m:
    by * TILE_WIDTH * Width + m * TILE_WIDTH
• Each thread uses ty and tx to load one element:
    upper-left corner + ty * Width + tx
    = by * TILE_WIDTH * Width + m * TILE_WIDTH + ty * Width + tx
• Since Row = by * TILE_WIDTH + ty, the element loaded by thread (tx, ty) is d_M[Row * Width + m * TILE_WIDTH + tx]
[Figure: Md, Nd, and Pd divided into TILE_WIDTH × TILE_WIDTH tiles; at step m, block (by, bx) loads the m-th tile along its row of Md while computing its Pdsub tile]
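In code, with ds_M as the shared-memory tile declared in the kernel below, this load is one statement per thread per phase (a sketch using the index derived above):

ds_M[ty][tx] = d_M[Row*Width + m*TILE_WIDTH + tx];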
Loading an N Tile
• Upper-left corner of the N tile at step m:
    bx * TILE_WIDTH + m * TILE_WIDTH * Width
• Each thread uses ty and tx to load one element:
    upper-left corner + ty * Width + tx
    = bx * TILE_WIDTH + m * TILE_WIDTH * Width + ty * Width + tx
• Since Col = bx * TILE_WIDTH + tx, the element loaded by thread (tx, ty) is d_N[(m * TILE_WIDTH + ty) * Width + Col]

Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
  __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
  __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];
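  // (The slide shows only the shared-memory declarations above; the rest of the
  //  body below is a sketch that follows the M- and N-tile index formulas,
  //  assuming Width is a multiple of TILE_WIDTH.)
  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;
  float Pvalue = 0;

  // Loop over the M and N tiles needed to compute the d_P element
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
    // Collaborative loading: each thread brings in one M element and one N element
    ds_M[ty][tx] = d_M[Row*Width + m*TILE_WIDTH + tx];
    ds_N[ty][tx] = d_N[(m*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();                        // wait until both tiles are fully loaded

    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += ds_M[ty][k] * ds_N[k][tx];  // compute from the shared copies
    __syncthreads();                        // wait before the tiles are overwritten
  }
  d_P[Row*Width + Col] = Pvalue;
}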
First-order Size Considerations
• With tiling, each input element is read from global memory Width / TILE_WIDTH times instead of Width times, so global memory traffic, and hence the bandwidth requirement computed earlier, drops by a factor of TILE_WIDTH.