
VSCSE Summer School

Programming Heterogeneous Parallel Computing Systems

Lecture 3:
Memory Model and Locality

1
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
Programmer View of CUDA Memories
• Each thread can:
  – Read/write per-thread registers (~1 cycle)
  – Read/write per-block shared memory (~5 cycles)
  – Read/write per-grid global memory (~500 cycles)
  – Read-only per-grid constant memory (~5 cycles with caching)

[Figure: CUDA memory hierarchy — a grid of thread blocks, each with its own shared memory and per-thread registers; all blocks and the host access the per-grid global and constant memories]

2
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
CUDA Variable Type Qualifiers
Variable declaration                         Memory     Scope    Lifetime
int LocalVar;                                register   thread   thread
__device__ __shared__ int SharedVar;         shared     block    block
__device__ int GlobalVar;                    global     grid     application
__device__ __constant__ int ConstantVar;     constant   grid     application

• __device__ is optional when used with __shared__ or __constant__
• Automatic variables without any qualifier reside in a register
  – Except per-thread arrays, which reside in global memory
3
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
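A minimal illustrative sketch (not from the original slides) of how these qualifiers might appear in practice; the identifiers coeff, globalAccum, tileBuf, and exampleKernel are hypothetical:

__constant__ float coeff[16];             // per-grid constant memory, read-only in kernels
__device__   float globalAccum;           // per-grid global memory, application lifetime

__global__ void exampleKernel(float *data)
{
    __shared__ float tileBuf[256];        // per-block shared memory
    float localVal = data[threadIdx.x];   // automatic scalar -> register
    float perThread[4];                   // automatic per-thread array -> global memory
    perThread[0] = localVal * coeff[0];
    tileBuf[threadIdx.x] = perThread[0];
    __syncthreads();
    data[threadIdx.x] = tileBuf[threadIdx.x] + globalAccum;
}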
Where to Declare Variables?

Can the host access it?

  Yes → global or constant memory: declare outside of any function
  No  → register (automatic), shared, or local: declare in the kernel
4
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
A Common Programming Strategy
• Global memory resides in device memory (DRAM)
- slow access
• So, a profitable way of performing computation on
the device is to tile input data to take advantage of
fast shared memory:
– Partition data into subsets that fit into shared memory
– Handle each data subset with one thread block by:
• Loading the subset from global memory to shared memory,
using multiple threads to exploit memory-level parallelism
• Performing the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element
• Copying results from shared memory to global memory
5
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
Matrix-Matrix Multiplication using Shared Memory

6
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
Base Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
// Calculate the row index of the d_P element and d_M
int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
// Calculate the column index of d_P and d_N
int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

float Pvalue = 0;
// each thread computes one element of the block sub-matrix
for (int k = 0; k < Width; ++k)
Pvalue += d_M[Row*Width+k]* d_N[k*Width+Col];

d_P[Row*Width+Col] = Pvalue;
}
7
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
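For reference, a hedged sketch of the host-side launch this kernel expects (not shown on the slide): the block dimensions must be TILE_WIDTH x TILE_WIDTH so the Row/Col arithmetic above covers the whole matrix, and d_M, d_N, d_P are assumed to be device allocations of Width*Width floats.

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);              // one thread per d_P element in a tile
dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH);   // assumes Width is a multiple of TILE_WIDTH
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);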
How about performance on G80?
• All threads access global memory for their input matrix elements
  – Two memory accesses (8 bytes) per floating-point multiply-add
  – 4 bytes of memory bandwidth per FLOP
  – 4 * 346.5 = 1,386 GB/s required to achieve the peak FLOP rating
  – 86.4 GB/s limits the code to 21.6 GFLOPS
• The actual code runs at about 15 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS

[Figure: same CUDA memory hierarchy diagram as on slide 2]

8
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
Shared Memory Blocking Basic Idea
[Figure: without blocking, Thread 1, Thread 2, … each fetch the same input data directly from global memory; with blocking, the data is first staged into on-chip memory and the threads read it from there]

9
Basic Concept of Blocking/Tiling
• In a congested traffic system, a significant reduction in the number of vehicles can greatly reduce the delay seen by all vehicles
– Carpooling for commuters
– Blocking/Tiling for global
memory accesses
• drivers = threads,
• cars = data

10
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
Some computations are more challenging to block/tile than others.
• Some carpools may be easier than others
  – More efficient if neighbors are also classmates or co-workers
  – Some vehicles may be more suitable for carpooling
• Similar variations exist in blocking/tiling
11
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
Carpools need synchronization.
• Good – when people have similar schedules

  [Timeline: Worker A and Worker B both sleep, work, and have dinner over the same intervals]

• Bad – when people have very different schedules

  [Timeline: Worker A parties, sleeps, then works, while Worker B sleeps, works, then has dinner]
12
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
Same with Blocking/Tiling

• Good – when threads have similar access timing

• Bad – when threads have very different timing

  [Timeline: with similar timing, Thread 1 and Thread 2 access the same tile during the same interval; with very different timing, their accesses to the tile do not overlap]
13
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
Outline of Technique
• Identify a block/tile of global memory content that is accessed by multiple threads
• Load the block/tile from global memory into on-chip memory
• Have the multiple threads access their data from the on-chip memory
• Move on to the next block/tile (a schematic sketch of this pattern follows below)

14
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
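A hedged, schematic sketch of this four-step pattern inside a kernel (illustrative only; TILE_SIZE, numTiles, and g_data are hypothetical names, not the lecture's code):

__shared__ float buf[TILE_SIZE];
float acc = 0.0f;
for (int tile = 0; tile < numTiles; ++tile) {
    // Steps 1-2: each thread loads one element of the tile into on-chip (shared) memory
    buf[threadIdx.x] = g_data[tile * TILE_SIZE + threadIdx.x];
    __syncthreads();                      // wait until the whole tile is resident
    // Step 3: every thread in the block reads from the on-chip copy
    for (int k = 0; k < TILE_SIZE; ++k)
        acc += buf[k];
    __syncthreads();                      // wait before the next iteration overwrites the tile
}                                         // Step 4: move on to the next tile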
Idea: Use Shared Memory to reuse global memory data
• Each input element is read by WIDTH threads.
• Load each element into shared memory and have several threads use the local version to reduce the memory bandwidth
  – Tiled algorithms

[Figure: matrices M and N feeding the product matrix P; each thread (tx, ty) computes one P element from a row of M and a column of N, all of size WIDTH]

15
Work for Block (0,0)
in a TILE_WIDTH = 2 Configuration

Col = blockIdx.x * blockDim.x + threadIdx.x = 0 * 2 + threadIdx.x
Row = blockIdx.y * blockDim.y + threadIdx.y = 0 * 2 + threadIdx.y

[Figure: 4x4 matrices M, N, and P; block (0,0) computes the 2x2 sub-matrix P0,0, P0,1, P1,0, P1,1 from rows 0-1 of M and columns 0-1 of N]

© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
Tiled Multiply

• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd

[Figure: block indices (bx, by) and thread indices (tx, ty) select one TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub of Pd; in each phase, one TILE_WIDTH x TILE_WIDTH tile of Md and one of Nd are used]

17
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
Loading a Tile
• All threads in a block participate
  – Each thread loads one Md element and one Nd element in the basic tiled code

• Assign the loaded elements to threads such that the accesses within each warp are coalesced (more later).

18
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
Work for Block (0,0)

[Figure: the first 2x2 tiles of M and N are loaded into shared memory (SM); block (0,0) uses them to start accumulating P0,0, P0,1, P1,0, P1,1]

19
Work for Block (0,0)

[Figure: animation frame — block (0,0) computes from the M and N tiles resident in shared memory (SM)]


Work for Block (0,0)

[Figure: animation frame — block (0,0) continues computing from the tiles in shared memory (SM)]


Work for Block (0,0)

[Figure: animation frame — the shared memory (SM) tiles are refilled for the next phase of the computation]

22
22
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
Work for Block (0,0)

[Figure: animation frame — block (0,0) completes the accumulation of P0,0, P0,1, P1,0, P1,1 using the tiles in shared memory (SM)]

23
23
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
Barrier Synchronization
• An API function call in CUDA
  – __syncthreads()

• All threads in the same block must reach the __syncthreads() before any can move on

• Best used to coordinate tiled algorithms
  – To ensure that all elements of a tile are loaded
  – To ensure that all elements of a tile are consumed

24
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
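One hedged caveat implied by the rule above but not spelled out on the slide: because every thread in the block must reach the same __syncthreads(), the call should not be placed inside divergent control flow. A minimal sketch, with hypothetical names tileBuf, g_in, base, and n:

// Safe: the barrier is outside the branch, so every thread in the block reaches it.
if (threadIdx.x < n)
    tileBuf[threadIdx.x] = g_in[base + threadIdx.x];
__syncthreads();

// Unsafe: threads that skip the branch would never reach the barrier.
// if (threadIdx.x < n) { tileBuf[threadIdx.x] = g_in[base + threadIdx.x]; __syncthreads(); }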
Loading an M Tile

Upper left corner of the M tile at step m:
  by * TILE_WIDTH * Width + m * TILE_WIDTH

Each thread uses ty and tx to load an element:
  Upper left corner + ty * Width + tx
  = by * TILE_WIDTH * Width + m * TILE_WIDTH + ty * Width + tx
  = (by * TILE_WIDTH + ty) * Width + m * TILE_WIDTH + tx
  = Row * Width + m * TILE_WIDTH + tx,   where Row = by * TILE_WIDTH + ty

[Figure: the m-th TILE_WIDTH x TILE_WIDTH tile of Md, selected by block index by and phase m; thread (tx, ty) loads one element of the tile]

25
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
  __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
  __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  // Identify the row and column of the d_P element to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;
  float Pvalue = 0;

  // Loop over the d_M and d_N tiles required to compute the d_P element
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
    // Collaborative loading of d_M and d_N tiles into shared memory
    ds_M[ty][tx] = d_M[Row*Width + m*TILE_WIDTH + tx];
    ds_N[ty][tx] = d_N[(m*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();

    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += ds_M[ty][k] * ds_N[k][tx];
    __syncthreads();
  }
  d_P[Row*Width + Col] = Pvalue;
}

26
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
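The kernel above assumes Width is an exact multiple of TILE_WIDTH. A hedged sketch (not from the slides) of how the tile loads and the final store could be guarded for arbitrary Width, padding out-of-range elements with zeros so the inner product is unaffected:

for (int m = 0; m < (Width + TILE_WIDTH - 1) / TILE_WIDTH; ++m) {
    // Load zeros for elements that fall outside the matrices
    ds_M[ty][tx] = (Row < Width && m*TILE_WIDTH + tx < Width)
                   ? d_M[Row*Width + m*TILE_WIDTH + tx] : 0.0f;
    ds_N[ty][tx] = (m*TILE_WIDTH + ty < Width && Col < Width)
                   ? d_N[(m*TILE_WIDTH + ty)*Width + Col] : 0.0f;
    __syncthreads();
    for (int k = 0; k < TILE_WIDTH; ++k)
        Pvalue += ds_M[ty][k] * ds_N[k][tx];
    __syncthreads();
}
if (Row < Width && Col < Width)
    d_P[Row*Width + Col] = Pvalue;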
Loading an N Tile

Upper left corner of the N tile at step m:
  bx * TILE_WIDTH + m * TILE_WIDTH * Width

Each thread uses ty and tx to load an element:
  Upper left corner + ty * Width + tx
  = bx * TILE_WIDTH + m * TILE_WIDTH * Width + ty * Width + tx
  = bx * TILE_WIDTH + tx + (m * TILE_WIDTH + ty) * Width
  = Col + (m * TILE_WIDTH + ty) * Width

[Figure: the m-th TILE_WIDTH x TILE_WIDTH tile of Nd, selected by block index bx and phase m; thread (tx, ty) loads one element of the tile]

27
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
First-order Size Considerations

• Each thread block should have many threads
  – TILE_WIDTH of 16 gives 16*16 = 256 threads
  – TILE_WIDTH of 32 gives 32*32 = 1,024 threads

• For TILE_WIDTH = 16, each block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations (16 operations per load, versus 1 per load in the base kernel).

• For TILE_WIDTH = 32, each block performs 2*1024 = 2,048 float loads from global memory for 1024 * (2*32) = 65,536 mul/add operations (32 operations per load).

28
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
Shared Memory and Threading
• Each SM in Fermi has 16KB or 48KB of shared memory*
  – SM size is implementation dependent!
  – For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory.
  – Can potentially have up to 8 thread blocks actively executing
    • This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
  – The next TILE_WIDTH of 32 would lead to 2*32*32*4B = 8KB of shared memory usage per thread block, allowing only 2 (16KB) or 6 (48KB) thread blocks to be active at the same time
• Using 16x16 tiling, we reduce accesses to global memory by a factor of 16
  – The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!

*Configurable vs. L1, 64KB total
29
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
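Since the Fermi split between L1 and shared memory is configurable, a host-side hint can request the 48KB shared-memory configuration for a kernel. A hedged sketch using the standard CUDA runtime call (not shown in the slides):

// Request the 48KB shared-memory / 16KB L1 split for MatrixMulKernel
cudaFuncSetCacheConfig(MatrixMulKernel, cudaFuncCachePreferShared);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);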
Summary – Typical Structure of a CUDA Program
• Global variables declaration
  – __host__
  – __device__... __global__, __constant__, __texture__
• Function prototypes
  – __global__ void kernelOne(…)
  – float handyFunction(…)
• main()
  – allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes)
  – transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…)
  – execution configuration setup
  – kernel call – kernelOne<<<execution configuration>>>(args…)
  – transfer results from device to host – cudaMemcpy(h_GlblVarPtr, …)
  – optional: compare against golden (host-computed) solution
  (the transfer / kernel call / transfer steps are repeated as needed)
• Kernel – void kernelOne(type args, …)
  – variables declaration – auto, __shared__
    • automatic variables transparently assigned to registers
  – __syncthreads()…
• Other functions
  – float handyFunction(int inVar, …);

(A minimal program skeleton following this structure is sketched below.)

30
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
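A minimal sketch of a complete program following this structure, under the assumptions that TILE_WIDTH divides Width, that error checking is omitted, and that h_M/h_N initialization happens elsewhere; the kernel body is the tiled MatrixMulKernel from slide 26:

#include <cuda_runtime.h>
#include <stdlib.h>
#define TILE_WIDTH 16

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width);  // as on slide 26

int main(void)
{
    int Width = 1024;
    size_t size = Width * Width * sizeof(float);
    float *h_M = (float*)malloc(size), *h_N = (float*)malloc(size), *h_P = (float*)malloc(size);
    // ... initialize h_M and h_N ...

    float *d_M, *d_N, *d_P;
    cudaMalloc((void**)&d_M, size);                      // allocate device global memory
    cudaMalloc((void**)&d_N, size);
    cudaMalloc((void**)&d_P, size);
    cudaMemcpy(d_M, h_M, size, cudaMemcpyHostToDevice);  // host -> device transfers
    cudaMemcpy(d_N, h_N, size, cudaMemcpyHostToDevice);

    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);               // execution configuration setup
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);

    cudaMemcpy(h_P, d_P, size, cudaMemcpyDeviceToHost);  // device -> host transfer
    // optional: compare h_P against a host-computed golden result

    cudaFree(d_M); cudaFree(d_N); cudaFree(d_P);
    free(h_M); free(h_N); free(h_P);
    return 0;
}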
ANY MORE QUESTIONS?

31
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13, 2012
