CIS 6930: Chip
Multiprocessor: Parallel
Architecture and Programming
Fall 2009
Jih-Kwon Peir
Computer & Information Science and Engineering
University of Florida
CIS 6930: Chip Multiprocessor:
Parallel Architecture and Programming
Acknowledgement: Slides borrowed from
o Accelerators for Science and Engineering Applications: GPUs and
  Multicores, by David Kirk / NVIDIA and Wen-mei Hwu / University of
  Illinois, 2006-2008
  (http://www.greatlakesconsortium.org/events/GPUMulticore/agenda.html)
o Course material posted on the CUDA Zone
  (http://www.nvidia.com/object/cuda_education.html)
o Intel Software Network (http://software.intel.com/en-us/academic/)
o The Art of Multiprocessor Programming
o Presentation slides from various papers
Course Goals
Learn how to program massively parallel processors and achieve
  o High performance
  o Functionality and maintainability
  o Scalability across future generations
Acquire the technical knowledge required to achieve the above goals
  o Principles and patterns of parallel programming
  o Processor architecture features and constraints
  o Programming APIs, tools, and techniques
Learn new many-core general-purpose and GPU processor architectures
  o Organization and memory systems
  o Parallel programming basics: locking, synchronization, mutual
    exclusion, transactional memory, etc.
Course Outline
Week 1-2: Introduction, GPU architectures, CUDA programming
Week 3-6: CUDA threads, thread blocks, grids, CUDA memory,
synchronization, performance
Week 7: Project selection and discussion
Week 8-9: Intel many-core architectures
Week 10-11: Parallel programming model, synchronization, mutual exclusion,
conditional synchronization, locks, barriers, concurrency and correctness,
sequential programs and consistency
(Add Fermi and Larrabee)
Week 12-13: Discussion of advanced issues in multi-core architecture and
programming
Week 14-16: In-depth discussion of project topics and project presentations
CUDA GPU Programming
Integrated host + device application C program
  Serial or modestly parallel parts in host C code
  Highly parallel parts in device SPMD kernel C code
Serial Code (host)
  Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args);
  ...
Serial Code (host)
  Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);
  ...
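A minimal, hedged sketch of this host + device pattern (not from the course
materials): the kernel body, array contents, and launch sizes below are
illustrative assumptions.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void KernelA(float* d, int n)       // highly parallel part (device)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f;             // each thread handles one element
}

int main()
{
    const int n = 1024;
    float h[n];                                // host data
    for (int i = 0; i < n; ++i) h[i] = (float)i;   // serial host code

    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    int nTid = 256;                            // threads per block
    int nBlk = (n + nTid - 1) / nTid;          // blocks in the grid
    KernelA<<<nBlk, nTid>>>(d, n);             // parallel kernel (device)

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    printf("h[10] = %f\n", h[10]);             // serial host code resumes
    return 0;
}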
5
CUDA Thread Blocks and Threads
Each thread uses its IDs to decide what data to work on
  Block ID: 1D or 2D
  Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data
(see the indexing sketch below)
  Image processing
  Solving PDEs on volumes
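As a hedged illustration (the kernel name, operation, and image layout are
assumptions), a kernel can combine 2D block and thread IDs to address one
element of a 2D image:

__global__ void brighten(unsigned char* img, int width, int height)
{
    // 2D block and thread IDs map naturally onto a 2D image
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width) {
        int idx = row * width + col;           // row-major addressing
        int v = img[idx] + 10;                 // each thread updates one pixel
        img[idx] = (unsigned char)(v > 255 ? 255 : v);
    }
}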
Matrix Multiplication
A Simple Example
// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}
[Figure: M, N, and P are WIDTH x WIDTH matrices; row i of M is combined with
column j of N (index k runs along both) to produce element P[i][j].]
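For contrast with the tiled kernel later in the slides, here is a hedged
sketch of the straightforward device version, in which every thread reads its
operands directly from global memory; the parameter names follow the host
code, but the kernel itself is an illustration, not the course's reference
implementation.

__global__ void MatrixMulSimple(float* Md, float* Nd, float* Pd, int Width)
{
    // One thread computes one element of Pd, reading Md and Nd from global memory
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if (Row < Width && Col < Width) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];
        Pd[Row * Width + Col] = Pvalue;
    }
}

Each multiply-add here performs two 4-byte global loads, which is exactly the
access pattern analyzed on the "How about performance on G80?" slide below.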
G80 Example: Thread Scheduling (cont.)
The SM implements zero-overhead warp scheduling
  At any time, only one warp is executed by an SM
  Warps whose next instruction has its operands ready for consumption are
  eligible for execution
  Eligible warps are selected for execution based on a prioritized
  scheduling policy
  All threads in a warp execute the same instruction when it is selected
Thread Scheduling (cont.)
Each thread block is assigned to one SM; each SM can take up to 8 blocks
Each block has up to 512 threads, divided into 32-thread warps; each warp is
scheduled on the 8 SPs, 4 threads per SP, and executes in SIMT mode
An SP is pipelined (~30 stages); fetch, decode, gather, and write-back act on
whole warps, so they have a throughput of 1 warp per slow clock
Execute acts on groups of 8 threads, or quarter-warps (there are only 8 SPs
per SM), so its throughput is 1 warp per 4 fast clocks, or 1 warp per 2 slow
clocks
The fetch/decode/... stages have a higher throughput to feed both the MAD and
the SFU/MUL units alternately; hence the peak rate of 8 MADs + 8 MULs per
(fast) clock cycle
Need 6 warps (or 192 threads) per SM to hide the read-after-write latencies
G80 Implementation of CUDA
Memories
Each thread can:
  Read/write per-thread registers
  Read/write per-thread local memory
  Read/write per-block shared memory
  Read/write per-grid global memory
  Read-only per-grid constant memory
(a declaration sketch follows the figure)
[Figure: a grid of thread blocks; each block has its own shared memory and
per-thread registers, while global memory and constant memory are shared by
the whole grid and are accessible from the host.]
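As a hedged illustration of how these memory spaces appear in CUDA source
(the variable names, sizes, and kernel are assumptions, not from the slides):

__constant__ float coeff[16];      // per-grid constant memory, read-only in kernels
__device__   float table[1024];    // per-grid global memory

__global__ void memSpaces(float* gOut)   // gOut also lives in global memory
{
    __shared__ float tile[256];    // per-block shared memory (assumes <= 256 threads/block)
    int i = threadIdx.x;           // automatic scalar -> per-thread register
    float big[8];                  // a larger per-thread array may spill to local memory
    for (int k = 0; k < 8; ++k) big[k] = coeff[k % 16];
    tile[i] = table[i] + big[i % 8];
    __syncthreads();
    gOut[blockIdx.x * blockDim.x + i] = tile[i];
}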
How about performance on G80?
All threads access global memory for their input matrix elements
  Two memory accesses (8 bytes) per floating-point multiply-add
  4 bytes of memory bandwidth needed per FLOP
  4 * 346.5 = 1386 GB/s would be required to achieve the peak FLOP rating
  The available 86.4 GB/s limits the code to 21.6 GFLOPS
The actual code runs at about 15 GFLOPS
Need to drastically cut down memory accesses to get closer to the peak
346.5 GFLOPS
[Figure: the same G80 memory hierarchy as before - per-block shared memory
and per-thread registers, plus grid-wide global and constant memory reached
through the host.]
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row*Width + Col] = Pvalue;
}
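A hedged sketch of how the host might configure and launch this kernel; the
TILE_WIDTH value of 16 and the wrapper function name are assumptions for
illustration.

#define TILE_WIDTH 16   // one tile per thread block; 16 x 16 = 256 threads

void MatrixMulOnDevice(float* Md, float* Nd, float* Pd, int Width)
{
    // One thread computes one Pd element; one block computes one tile
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
}

With tiling, each Md and Nd element is read from global memory once per tile
rather than once per thread, cutting global-memory traffic by roughly a
factor of TILE_WIDTH.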
Today's Intel PC Architecture:
Single Core System
FSB connection between the processor and the Northbridge (82925X Memory
Controller Hub)
Northbridge handles the primary PCIe link to the video card/GPU and DRAM
  PCIe x16 bandwidth of 8 GB/s (4 GB/s in each direction)
Southbridge (ICH6RW) handles other peripherals
GeForce-8 Series HW Overview
[Figure: the Streaming Processor Array is built from Texture Processor
Clusters (TPCs); each TPC contains a TEX unit and two Streaming
Multiprocessors (SMs); each SM has an instruction L1, a data L1, instruction
fetch/dispatch logic, shared memory, 8 streaming processors (SPs), and 2
special function units (SFUs).]
SM Warp Scheduling
SM hardware implements zero-overhead warp scheduling
[Figure: the SM's multithreaded warp scheduler interleaves instructions over
time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction
95, ..., warp 8 instruction 12, warp 3 instruction 96.]
  Warps whose next instruction has its operands ready for consumption are
  eligible for execution
  Eligible warps are selected for execution based on a prioritized
  scheduling policy
  All threads in a warp execute the same instruction when it is selected
4 clock cycles are needed to dispatch the same instruction for all threads
in a warp on G80
  If one global memory access is needed for every 4 instructions, a minimum
  of 13 warps is needed to fully tolerate a 200-cycle memory latency
  (each warp supplies 4 instructions x 4 cycles = 16 cycles of independent
  work, and 200 / 16 = 12.5, rounded up to 13)
CUDA Device Memory Space: Review
Each thread can:
  R/W per-thread registers
  R/W per-thread local memory
  R/W per-block shared memory
  R/W per-grid global memory
  Read-only per-grid constant memory
  Read-only per-grid texture memory
The host can R/W the global, constant, and texture memories using the copy
functions (a sketch follows the figure)
[Figure: a (device) grid of thread blocks; each block has shared memory plus
per-thread registers and local memory, while global, constant, and texture
memory are shared by the whole grid and are accessible from the host.]
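As a hedged host-side sketch of those copy functions (the array names, sizes,
and the assumption that d_global came from a prior cudaMalloc are
illustrative):

float h_data[256];                   // host array
__constant__ float c_coeff[16];      // per-grid constant memory

void copyExamples(float* d_global)   // d_global allocated earlier with cudaMalloc
{
    // Host writes global memory
    cudaMemcpy(d_global, h_data, sizeof(h_data), cudaMemcpyHostToDevice);
    // Host reads global memory back
    cudaMemcpy(h_data, d_global, sizeof(h_data), cudaMemcpyDeviceToHost);
    // Host writes constant memory (read-only from kernels)
    float h_coeff[16] = {0};
    cudaMemcpyToSymbol(c_coeff, h_coeff, sizeof(h_coeff));
}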
Memory Layout of a Matrix in C
[Figure: a 4x4 matrix M stored in row-major order in C. The linear memory
layout is M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3
M1,3 M2,3 M3,3. The figure contrasts this layout with the access direction in
the kernel code, showing which elements threads T1-T4 touch during time
period 1 and time period 2.]
Bank Addressing Examples
[Figure, left: 2-way bank conflicts - linear addressing with stride == 2
maps pairs of threads (e.g. threads 0-4 and threads 8-11) onto the same
banks, so every bank that is used serves two threads.
Figure, right: 8-way bank conflicts - linear addressing with stride == 8
maps eight threads onto one bank and the other eight onto another (the x8
labels), across banks 0-15.]
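A hedged sketch of access patterns that would produce these cases on
G80-class shared memory (16 banks of 4-byte words); the kernel, array names,
and block size are illustrative assumptions.

__global__ void bankExamples(float* out)
{
    // Assumes a block of <= 64 threads so all indices stay in range
    __shared__ float s[512];
    int tid = threadIdx.x;
    s[tid] = (float)tid;
    __syncthreads();

    float a = s[tid];          // stride 1: a half-warp of 16 threads hits
                               // 16 different banks - no conflict
    float b = s[tid * 2];      // stride 2: threads t and t+8 share a bank -
                               // 2-way bank conflicts
    float c = s[tid * 8];      // stride 8: eight threads land on one bank -
                               // 8-way bank conflicts
    out[tid] = a + b + c;
}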
Control Flow Instructions
Main performance concern with branching is divergence
  Threads within a single warp take different paths
  Different execution paths are serialized in G80
    The control paths taken by the threads in a warp are traversed one at a
    time until there are no more
A common case: avoid divergence when the branch condition is a function of
the thread ID (see the sketch below)
  Example with divergence:
    if (threadIdx.x > 2) { }
    This creates two different control paths for threads in a block
    Branch granularity < warp size; threads 0, 1, and 2 follow a different
    path than the rest of the threads in the first warp
  Example without divergence:
    if (threadIdx.x / WARP_SIZE > 2) { }
    This also creates two different control paths for threads in a block
    Branch granularity is a whole multiple of warp size; all threads in any
    given warp follow the same path
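A hedged, self-contained version of those two cases (the kernel and the
writes to out are assumptions used only to make the branch granularity
visible):

#define WARP_SIZE 32

__global__ void branchExamples(int* out)
{
    int tid = threadIdx.x;

    // Divergent: within warp 0, threads 0-2 and threads 3-31 take different
    // paths, so the warp executes both paths serially
    if (threadIdx.x > 2) out[tid] = 1; else out[tid] = 0;

    // Non-divergent: the condition changes only at warp boundaries, so every
    // thread of a given warp takes the same path
    if (threadIdx.x / WARP_SIZE > 2) out[tid] += 2;
}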
Vector Reduction with Branch
Divergence
[Figure: in the first iteration, threads 0, 2, 4, 6, 8, 10, ... form the
pairwise sums 0+1, 2+3, 4+5, 6+7, 8+9, 10+11, ... over the array elements;
later iterations combine these into partial sums 0...3, 4...7, 8...11, then
0...7, 8...15, and so on. Because only every other (then every fourth, ...)
thread stays active, threads within the same warp take different paths.]
No Divergence until < 16 sub-sums
[Figure: each thread instead adds the element half the active range away -
thread 0 computes 0+16, ..., thread 15 computes 15+31 - so the active threads
are contiguous and whole warps take the same path until fewer than one warp's
worth of sub-sums remains.]
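A hedged sketch of the two indexing strategies for a shared-memory reduction
(the kernel, block size, and array names are assumptions; the active loop is
the pattern the slide above illustrates, and the divergent variant is shown
commented out for comparison):

__global__ void reduce(float* in, float* out)
{
    __shared__ float partial[256];            // assumes 256 threads per block
    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Divergent version: interleaved pairs, so the active threads are
    // scattered across every warp:
    //   for (int stride = 1; stride < blockDim.x; stride *= 2) {
    //       if (tid % (2 * stride) == 0)
    //           partial[tid] += partial[tid + stride];
    //       __syncthreads();
    //   }

    // Non-divergent version: contiguous threads add the element half the
    // active range away; whole warps stay on the same path until the active
    // count drops below a warp
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = partial[0];
}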
Fundamentals of Parallel
Computing
Parallel computing requires that
The problem can be decomposed into sub-problems
that can be safely solved at the same time
The programmer structures the code and data to solve
these sub-problems concurrently
The goals of parallel computing are
To solve problems in less time, and/or
To solve bigger problems, and/or
To achieve better solutions
The problems must be large enough to justify parallel
computing and to exhibit exploitable concurrency.
Challenges of Parallel
Programming
Finding and exploiting concurrency often requires
looking at the problem from a non-obvious angle
Computational thinking (J. Wing)
Dependences need to be identified and managed
The order of task execution may change the answers
Obvious: One step feeds result to the next steps
Subtle: numeric accuracy may be affected by ordering steps that are
logically parallel with each other
Performance can be drastically reduced by many
factors
Overhead of parallel processing
Load imbalance among processor elements
Inefficient data sharing patterns
Saturation of critical resources such as memory bandwidth
Fermi Implements CUDA
Definitions of memory scope, grid, thread block, and thread are the same as
in Tesla
  Grid: array of thread blocks
  Thread block: up to 1,024 threads communicating through shared memory
  (an SM can hold up to 1,536 concurrent threads)
The GPU has an array of SMs; each executes one or more thread blocks, and
each block is grouped into warps of 32 threads per warp
Other resource constraints are implementation-specific
Fermi GT300 Key Feature
32 cores per SM, 512 cores total
Fully pipelined integer and floating-point units that implement the new
IEEE 754-2008 standard, including fused multiply-add (FMA)
Two warps from different thread blocks (even different kernels) can be
issued and executed concurrently
ECC protection from the registers to DRAM
Linear addressing model with caching at all levels
Large shared memory / L1 cache
Double-precision performance 8x faster than GT200, reaching ~600
double-precision GFLOPS
Fermi GT300 Key Feature
(cont.)
Fermi supports simultaneous execution of multiple kernels from the same
application, each kernel distributed to one or more SMs
The GigaThread hardware thread scheduler manages 1,536 simultaneously active
threads for each SM across 16 kernels
Switching from one application to another is 20x faster on Fermi
Fermi supports OpenCL, Fortran, C++, Java, Matlab, and Python
Each SM has 32 cores, 16 LD/ST units, and 4 SFUs
Fermi supports FMA for both single and double precision
Instruction Schedule Example
A total of 32 instructions from one or two warps can be dispatched in each
cycle to any two of the four execution blocks within a Fermi SM: two blocks
of 16 cores each, one block of four Special Function Units, and one block of
load/store units. (The accompanying figure shows how instructions are issued
to the four execution blocks.)
It takes two cycles for the 32 instructions in each warp to execute on the
cores or load/store units. A warp of 32 special-function instructions is
issued in a single cycle but takes eight cycles to complete on the four SFUs.
Another major improvement in Fermi and PTX 2.0 is a new unified addressing
model. All addresses in the GPU are allocated from a continuous 40-bit (one
terabyte) address space. Global, shared, and local addresses are defined as
ranges within this address space and can be accessed by common load/store
instructions. (The load/store instructions support 64-bit addresses to allow
for future growth.)
Multi-Core Architecture:
Intel Quad Core Technology of Today
Cache Structure
[Figure: four cores (Core 0-3); Cores 0-1 share one 4MB L2 cache and Cores
2-3 share another 4MB L2 cache; both connect through the bus interface to a
1066 MHz / 1333 MHz FSB.]
The L2 cache of today's quad-core processors is not one cache shared by all
4 cores. Instead there are two L2 caches, each shared by two cores.
What Is OpenMP*?
http://www.openmp.org
Current spec is OpenMP 2.5 (250 pages, combined C/C++ and Fortran)
[Figure: a montage of OpenMP directives, API calls, and environment settings,
e.g. C$OMP FLUSH, C$OMP THREADPRIVATE(/ABC/), C$OMP parallel do shared(a, b, c),
#pragma omp critical, CALL OMP_SET_NUM_THREADS(10), call omp_test_lock(jlok),
C$OMP MASTER, call OMP_INIT_LOCK(ilok), C$OMP ATOMIC, C$OMP SINGLE PRIVATE(X),
setenv OMP_SCHEDULE dynamic, C$OMP PARALLEL DO ORDERED PRIVATE(A, B, C),
C$OMP PARALLEL REDUCTION(+: A, B), C$OMP ORDERED, C$OMP SECTIONS,
#pragma omp parallel for private(A, B), C$OMP PARALLEL COPYIN(/blk/),
Nthrds = OMP_GET_NUM_PROCS(), !$OMP BARRIER, C$OMP DO lastprivate(XX),
omp_set_lock(lck).]
Programming with OpenMP*
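As a hedged taste of what these directives look like in use (plain C host
code; the loop, array, and variable names are assumptions, not from the
slides):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int N = 1000000;
    static double a[1000000];
    double sum = 0.0;

    // Work-sharing loop: iterations are divided among the team of threads
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        a[i] = i * 0.5;

    // Reduction: each thread keeps a private partial sum that is combined
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < N; ++i)
        sum += a[i];

    printf("sum = %f, threads available = %d\n", sum, omp_get_max_threads());
    return 0;
}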
More Material
Intel Larrabee Architecture
Herlihy's book (The Art of Multiprocessor Programming)
  Chapter 1: Introduction
  Chapter 2: Mutual Exclusion