CUDA Memory Types
CPS343: Parallel and High Performance Computing
Spring 2013
Outline
Device memory
  CUDA memory types and uses
  CUDA Type qualifiers
Programming Scenarios
Matrix multiplication
  Matrix-matrix multiplication
    Global memory version
    Shared memory version
Acknowledgements
CUDA memory types and uses
Local memory
Used for whatever doesn't fit into registers
Part of global memory, so it is slow; uncached on older devices, but cached on more recent ones
Memory limitations
Global memory
  Best if 64 or 128 bytes (16 or 32 single-precision values, 8 or 16 double-precision values) are read at once
  Coalesced reads/writes:
    parallel reads/writes from threads in a block
    to sequential memory locations
    with appropriate alignment
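As an illustration (not from the slides; the kernel names and the stride parameter are made up), the sketch below contrasts a coalesced copy, where consecutive threads touch consecutive addresses, with a strided copy that defeats coalescing:

    // Coalesced: thread t in a block reads element (block offset + t),
    // so a warp touches one contiguous, aligned segment of memory.
    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Strided: consecutive threads read elements 'stride' floats apart,
    // so each warp scatters its accesses across many memory segments.
    __global__ void copy_strided(const float *in, float *out, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n)
            out[i] = in[i];
    }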
CUDA Type qualifiers
Variable declaration

Declaration                     Memory     Scope    Lifetime      Performance penalty
int localVar;                   register   thread   thread        1x
int localArray[10];             local      thread   thread        100x
__shared__   int sharedVar;     shared     block    block         1x
__device__   int globalVar;     global     grid     application   100x
__constant__ int constantVar;   constant   grid     application   1x
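As a small illustration (not from the slides; the names and the block size of 256 are made up), the kernel below touches each of these memory spaces:

    __constant__ float scale;     // constant memory: set from the host with cudaMemcpyToSymbol
    __device__ int launchCount;   // global memory: grid scope, lives for the whole application

    __global__ void qualifier_demo(const float *in, float *out, int n)
    {
        __shared__ float tile[256];   // shared memory: one copy per block, block lifetime
                                      // (assumes blockDim.x <= 256)
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // i lives in a register
        float scratch[10];                               // per-thread array: typically local memory

        if (i < n)
            tile[threadIdx.x] = in[i];
        __syncthreads();              // every thread in the block reaches this barrier

        if (i < n) {
            scratch[0] = tile[threadIdx.x] * scale;
            out[i] = scratch[0];
        }
        if (i == 0)
            atomicAdd(&launchCount, 1);                  // one thread updates the global counter
    }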
Programming Scenarios
Scenario 1
Task:
  Load data from global memory
  Do thread-local computations
  Store result to global memory
Solution:
  Load data from global memory (coalesced)
    float a = d_ptr[blockIdx.x * blockDim.x + threadIdx.x];
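Putting Scenario 1 together as a complete kernel is straightforward. This is an illustrative sketch rather than code from the slides; the device function f() stands in for whatever thread-local computation is needed:

    __device__ float f(float x) { return x * x; }   // placeholder thread-local computation

    __global__ void scenario1(const float *d_in, float *d_out, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) {
            float a = d_in[idx];      // coalesced load from global memory
            float res = f(a);         // thread-local work in registers
            d_out[idx] = res;         // coalesced store to global memory
        }
    }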
Scenario 2
Task:
  Load data from global memory
  Do block-local computations
  Store result to global memory
Solution:
  Load data to shared memory
    __shared__ float a_sh[BLOCK_SIZE];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a_sh[threadIdx.x] = d_ptr[idx];
    __syncthreads();   // important!
  Do computation
    float res = f(a_sh[threadIdx.x]);
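A complete kernel for Scenario 2 might look like the sketch below (illustrative, not from the slides). Here the block-local computation averages a thread's value with its left neighbor in the block, just to give the shared array a reason to exist, and BLOCK_SIZE is assumed to match the launch configuration:

    #define BLOCK_SIZE 256

    __global__ void scenario2(const float *d_in, float *d_out, int n)
    {
        __shared__ float a_sh[BLOCK_SIZE];

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            a_sh[threadIdx.x] = d_in[idx];     // coalesced load into shared memory
        __syncthreads();                       // all loads must finish before any thread reads a_sh

        if (idx < n) {
            int left = (threadIdx.x == 0) ? 0 : threadIdx.x - 1;
            float res = 0.5f * (a_sh[threadIdx.x] + a_sh[left]);   // block-local computation
            d_out[idx] = res;                  // coalesced store back to global memory
        }
    }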
Matrix-matrix multiplication
Consider our familiar matrix product C = AB:

    for ( i = 0; i < A.height; i++ )
    {
        for ( j = 0; j < B.width; j++ )
        {
            c[i][j] = 0;
            for ( k = 0; k < A.width; k++ )
            {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
Matrix-matrix multiplication
Consider an element c[row][col]. There are B.width elements in a row of C and A.height elements in a column of C.
To compute each of these elements, we access a row of A and a column of B.
We therefore access each row of A B.width times and each column of B A.height times.
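To make the cost concrete (a back-of-the-envelope count, not stated on the slide): computing the A.height * B.width elements of C requires reading A.width elements of A and A.width elements of B per element, for a total of

    2 * A.height * B.width * A.width

global-memory reads, roughly 2n^3 for n x n matrices. Reusing the data once it has been loaded is the motivation for the shared memory version later on.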
Global memory version
Kernel development
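The kernel code from this slide did not survive extraction; the following is a reconstruction sketch of a typical global-memory matrix-multiply kernel, assuming square row-major matrices of size width x width and one thread per element of C. The name matmul_global and the parameter names are illustrative; the accumulator is the register variable sum referred to in the note below.

    __global__ void matmul_global(const float *A, const float *B, float *C, int width)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;

        if (row < width && col < width) {
            float sum = 0.0f;                       // accumulator lives in a register
            for (int k = 0; k < width; k++)
                sum += A[row * width + k] * B[k * width + col];
            C[row * width + col] = sum;             // single write to global memory
        }
    }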
Note: sum will be stored in a register, so this kernel makes only one reference to C.
Shared memory version
Matrix-matrix multiplication
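The content of the remaining slides did not survive extraction. As a stand-in, here is a sketch of the usual tiled shared-memory version, in which each block loads TILE x TILE sub-blocks of A and B into shared memory so that every element fetched from global memory is reused TILE times. The names and the tile size are illustrative, and the sketch assumes width is a multiple of TILE and that the block dimensions are TILE x TILE.

    #define TILE 16   // illustrative tile size; must match the block dimensions

    __global__ void matmul_shared(const float *A, const float *B, float *C, int width)
    {
        __shared__ float Asub[TILE][TILE];
        __shared__ float Bsub[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;

        // March tile-by-tile along a row of A and down a column of B.
        for (int t = 0; t < width / TILE; t++) {
            Asub[threadIdx.y][threadIdx.x] = A[row * width + t * TILE + threadIdx.x];
            Bsub[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * width + col];
            __syncthreads();                     // tiles fully loaded before use

            for (int k = 0; k < TILE; k++)
                sum += Asub[threadIdx.y][k] * Bsub[k][threadIdx.x];
            __syncthreads();                     // finish using tiles before overwriting them
        }

        C[row * width + col] = sum;              // one write to global memory per element
    }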