ELT3047 Computer Architecture
Lecture 8: Parallelism
Hoang Gia Hung
Faculty of Electronics and Telecommunications
University of Engineering and Technology, VNU Hanoi
Improving Global Predictor Accuracy
Gshare predictor: the GHR is hashed (XORed) with the branch PC to index the PHT
Adds more context information (the branch address) to the global predictor.
[Figure: gshare predictor. An m-bit Global History Register records the outcomes (T = 1, NT = 0) of the last m branches executed; it is XORed with the lower m bits of the branch PC to form an index into a Pattern History Table of 2^m 2-bit counters, with entries ranging from 00…00 to 11…11.]
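A minimal C sketch of the gshare lookup and training just illustrated; the history length, table size, variable names, and exact counter policy are illustrative assumptions, not details taken from the slides.

#include <stdint.h>
#include <stdbool.h>

#define M 12                          /* history length: 2^M PHT entries   */
#define PHT_SIZE (1u << M)

static uint8_t  pht[PHT_SIZE];        /* 2-bit saturating counters, 0..3   */
static uint32_t ghr;                  /* global history register (M bits)  */

/* Index = (lower M bits of the word-aligned branch PC) XOR (M-bit GHR). */
static uint32_t gshare_index(uint32_t pc)
{
    return ((pc >> 2) ^ ghr) & (PHT_SIZE - 1);
}

/* Predict taken when the counter is in one of its two upper states. */
bool gshare_predict(uint32_t pc)
{
    return pht[gshare_index(pc)] >= 2;
}

/* Once the branch resolves: train the counter, then shift the actual
   outcome into the history register (T = 1, NT = 0).                  */
void gshare_update(uint32_t pc, bool taken)
{
    uint32_t i = gshare_index(pc);
    if (taken  && pht[i] < 3) pht[i]++;
    if (!taken && pht[i] > 0) pht[i]--;
    ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (PHT_SIZE - 1);
}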
Gshare speculator
[Figure: gshare speculation hardware. The GHR is XORed with the PC to index the PHT, which supplies the taken/not-taken prediction; in parallel, a BTB with 2^k entries is indexed by the branch PC and tagged with the remaining 32-k PC bits. On a BTB hit with a "taken" prediction, the stored target PC becomes the next fetch address; otherwise the next fetch address is PC+4.]
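A small C sketch of the next-fetch-address selection in the figure; the BTB entry layout and field names are assumptions made for illustration.

#include <stdint.h>
#include <stdbool.h>

#define K 10                              /* 2^K BTB entries              */
#define BTB_SIZE (1u << K)

struct btb_entry {
    bool     valid;
    uint32_t tag;                         /* upper 32-K bits of branch PC */
    uint32_t target;                      /* predicted target PC          */
};

static struct btb_entry btb[BTB_SIZE];

/* Redirect fetch to the BTB target only when the BTB hits (this PC is a
   known branch) AND the direction predictor says taken; otherwise fall
   through to PC + 4.                                                     */
uint32_t next_fetch_address(uint32_t pc, bool predict_taken)
{
    uint32_t idx = pc & (BTB_SIZE - 1);   /* low K bits index the BTB     */
    uint32_t tag = pc >> K;               /* remaining bits form the tag  */
    bool hit = btb[idx].valid && btb[idx].tag == tag;
    return (hit && predict_taken) ? btb[idx].target : pc + 4;
}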
Correlating Predictor
Utilizes a partitioned PHT, where the partitions are selected by the GHR
Access a row in the partitioned PHT with the low-order bits of the branch address
Each partition holds the local history of one branch
The contents of the selected entry give the prediction
General form: (m, n) predictor
m-bit GHR
n-bit indexing of the local history
Tournament Predictor
Combine branch predictors
local, per-branch prediction, accessed by the PC
correlated prediction based on the last m branches, accessed by the GHR
A selector indicates which of the two has been the better predictor for this branch
2-bit saturating counter: incremented when one predictor is correct, decremented when the other is
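A small C sketch of the selector described above; the table size, the convention that high counter values favour the global predictor, and updating only when the two predictions disagree are assumptions for illustration (idx would come from the low-order bits of the branch PC).

#include <stdint.h>
#include <stdbool.h>

static uint8_t selector[4096];        /* per-branch 2-bit choosers, 0..3:
                                         0-1 favour local, 2-3 favour global */

bool tournament_predict(uint32_t idx, bool local_pred, bool global_pred)
{
    return (selector[idx] >= 2) ? global_pred : local_pred;
}

/* Train only when the predictors disagreed: move the counter toward
   whichever predictor turned out to be correct.                       */
void tournament_update(uint32_t idx, bool local_pred, bool global_pred,
                       bool taken)
{
    if (local_pred == global_pred) return;
    if (global_pred == taken && selector[idx] < 3) selector[idx]++;
    if (local_pred  == taken && selector[idx] > 0) selector[idx]--;
}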
Branch Prediction Performance
Introduction to Parallelism
Multiple levels of parallelism
Advanced ILP: Beyond Pipelining
Recap: processor performance
CPU time = instruction count × CPI × clock cycle time
With the program and clock rate fixed, increase performance = reduce CPI.
Advanced techniques to increase ILP
Deeper pipeline (superpipelining), e.g., 10 or 15 stages
Less work per stage → shorter clock cycle (limited by power dissipation).
But more potential for all 3 types of hazards! (more stalling → CPI > 1)
Multiple issue
Execute multiple instructions simultaneously in multiple pipelines.
More hardware, but CPI < 1 (so use Instructions Per Cycle - IPC).
E.g., 4GHz 4-way multiple-issue → peak CPI = 0.25, peak IPC = 4.
But dependencies reduce this in practice.
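A quick worked check with the numbers above (the sustained IPC of 2 is an illustrative assumption): peak throughput is 4 instructions/cycle × 4×10^9 cycles/s = 16 billion instructions per second; if dependencies hold the machine to IPC = 2 (CPI = 0.5), delivered throughput is only 8 billion instructions per second, half of peak.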
Multiple Issue Processors
Static multiple issue (a.k.a VLIW - Very Long Instruction Word)
Compiler groups instructions to be issued together into “issue packets”.
packet = one very long instruction composed of multiple “issue slots”
fixed format: each slot is dedicated to a specific type of operation
Static scheduling: compiler decides which instructions to issue in parallel (without causing hazards) before the program is executed.
Dynamic multiple issue (a.k.a. superscalar processors)
CPU examines instruction stream and schedules execution at runtime by:
deciding whether to issue 0, 1, 2, … instruction(s) each cycle
resolving hazards using advanced techniques
Avoids the need for compiler scheduling, but
compiler can help by reordering instructions
code semantics ensured by the CPU
Multiple Issue: HW Implementation
Very Long Instruction Word
[Figure: one VLIW packet holds six issue slots (Int Op 1, Int Op 2, Mem Op 1, Mem Op 2, FP Op 1, FP Op 2), fetched at a single PC and fed to two integer units (single-cycle latency), two load/store units (three-cycle latency), and two floating-point units (four-cycle latency).]
Fixed format, determined by pipeline resources required
Static scheduling: compiler must remove some/all hazards
Reorder instructions into issue packets: no dependencies within a packet
Pad with nop if necessary
Possibly some dependencies between packets (varies between ISAs)
Multiple Issue: Static Scheduling
Loop: lw x31,0(x20) // x31 = array element
add x31,x31,x21 // add scalar in x21
sw x31,0(x20) // store result
addi x20,x20,-4 // decrement pointer
blt x22,x20,Loop // branch if x22 < x20
Dual-issue: includes ALU/branch & Load/store slots
Load-use hazard: still one cycle use latency, but now two instructions
EX data hazard: can’t use ALU result in load/store in same packet → split
ALU/branch             | Load/store       | cycle
Loop: nop              | lw  x31,0(x20)   |   1
      addi x20,x20,-4  | nop              |   2
      add  x31,x31,x21 | nop              |   3
      blt  x22,x20,Loop| sw  x31,4(x20)   |   4
IPC = 5/4 = 1.25 (c.f. peak IPC = 2)
More Improvement: Loop Unrolling
Loop unrolling: replicate loop body to expose more parallelism
Reduces loop-control overhead
Must use different registers per replication (register renaming)
ALU/branch              | Load/store       | cycle
Loop: addi x20,x20,-16  | lw  x28,0(x20)   |   1
      nop               | lw  x29,12(x20)  |   2
      add  x28,x28,x21  | lw  x30,8(x20)   |   3
      add  x29,x29,x21  | lw  x31,4(x20)   |   4
      add  x30,x30,x21  | sw  x28,16(x20)  |   5
      add  x31,x31,x21  | sw  x29,12(x20)  |   6
      nop               | sw  x30,8(x20)   |   7
      blt  x22,x20,Loop | sw  x31,4(x20)   |   8
IPC = 14/8 = 1.75 (c.f. peak IPC = 2)
IPC closer to 2, but at cost of registers and code size
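The slides give only the assembly; below is a hypothetical C-level view of the same transformation (pointer and variable names are assumed, and the element count is assumed to be a multiple of 4).

/* Original loop: add a scalar to each 32-bit element, walking the
   pointer downward, mirroring the lw/add/sw/addi/blt loop above.    */
void add_scalar(int *p, int *q, int s)            /* requires q < p  */
{
    do {
        *p += s;
        p -= 1;
    } while (q < p);
}

/* Unrolled 4x: one pointer update and one branch per four element
   updates; the four updates touch different elements, so they can be
   assigned different registers (renaming) and spread across issue
   slots, as in the schedule above.                                   */
void add_scalar_unrolled(int *p, int *q, int s)   /* (p - q) % 4 == 0 */
{
    do {
        p -= 4;
        p[4] += s;
        p[3] += s;
        p[2] += s;
        p[1] += s;
    } while (q < p);
}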
Dynamic Scheduling: Concepts
Why not just let the compiler schedule code?
Not all stalls are predictable
Can’t always schedule around branches
Branch outcome is dynamically determined
Different implementations of an ISA have different latencies and hazards
Dynamic scheduling: parallelizable instructions identified by HW
The chosen instructions are fetched & decoded in order as normal
available operands are copied to reservation stations prior to execution
missing operands will be supplied later by execution results
An instruction executes as soon as all of its operands are ready in the reservation station, allowing issued instructions to execute out of (the fetched) order.
the result is sent to any waiting reservation stations where it is a missing operand
otherwise, it'll be sent to the reorder buffer in the commit unit.
The commit unit releases results in (the fetched) order when safe to do so.
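A hypothetical C sketch of the bookkeeping structures implied above; the field names and sizes are assumptions for illustration only.

#include <stdint.h>
#include <stdbool.h>

/* One reservation-station entry: holds the operation plus either the
   operand values (if available at issue) or the tags of the producing
   instructions that will supply them later.                            */
struct rs_entry {
    bool     busy;
    uint8_t  op;                  /* operation to perform                */
    bool     src1_ready, src2_ready;
    uint32_t src1_val,  src2_val; /* copied operands, if available       */
    uint16_t src1_tag,  src2_tag; /* producers to wait for, otherwise    */
    uint16_t dest_rob;            /* reorder-buffer slot for the result  */
};

/* One reorder-buffer entry: results wait here and are committed to the
   register file in fetch order, only when it is safe to do so.          */
struct rob_entry {
    bool     ready;               /* result has been produced            */
    uint8_t  dest_reg;            /* architectural destination register  */
    uint32_t value;               /* result awaiting in-order commit     */
};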
Dynamic Scheduling: Implementation
[Figure: a dynamically scheduled pipeline. The in-order issue unit preserves dependencies; reservation stations hold pending operands; each functional-unit result is also sent to any reservation stations waiting for it; the commit unit's reorder buffer holds results for in-order register writes and can supply operands to newly issued instructions.]
Scheduling Recap: Static vs. Dynamic
[Figure: the same source code scheduled two ways, yielding the same ILP. With dynamic scheduling, a normal compiler produces the object code, and scheduling and operation-independence recognition are implemented by hardware at run time. With static scheduling, the compiler itself performs the scheduling and independence recognition at compile time (padding with NOPs where needed), so the same work is implemented in software before the program runs.]
Data-Level Parallelism (DLP)
Many real-world scenarios involve performing the same operation on multiple sets of data.
E.g., vector addition: the elements are added row-wise → the addition operation is applied to all rows.
Exploiting DLP: Single-Instruction Multiple Data (SIMD)
Perform the operation once on vector registers that hold multiple operands
[Figure: the vector (1, 2, 3, 4) added element-wise to (17, 21, 25, 29), shown once as four separate scalar additions and once as a single addition of two vector registers.]
Speedup arises not from performing multiple math operations concurrently, but from executing large memory loads/stores simultaneously.
SIMD operations cannot be used to process multiple data in different ways.
Many existing ISAs include SIMD operations, e.g., Intel MMX/SSEn/AVX.
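As a concrete illustration of the figure above, the C sketch below performs the four additions with a single SSE2 instruction; an x86 target with SSE2 is assumed, and the values are taken from the figure.

#include <stdio.h>
#include <emmintrin.h>                  /* SSE2: 128-bit integer SIMD */

int main(void)
{
    int a[4] = { 1, 2, 3, 4 };
    int b[4] = { 17, 21, 25, 29 };
    int c[4];

    /* One wide load per operand, one SIMD add, one wide store:
       all four element additions happen in a single instruction. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)c, _mm_add_epi32(va, vb));

    printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);    /* 18 23 28 33 */
    return 0;
}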
DLP: Matrix Multiplication Example
[Figure: the 4x4 matrix product
   |  1  2  3  4 |   | 17 18 19 20 |
   |  5  6  7  8 | x | 21 22 23 24 |
   |  9 10 11 12 |   | 25 26 27 28 |
   | 13 14 15 16 |   | 29 30 31 32 | ]
Each element of the product matrix is the dot product of two arrays: a row of the first matrix and a column of the second.
Data parallelism: the same operation is performed on different operands.
SIMD loads the 2 arrays into 2 vector registers; each row gets reloaded once per dot product it participates in.
If the SIMD processor provides enough vector registers:
Can compute 4 cells using 4 loads.
Does require a tail case for odd matrix sizes.
[Figure: the same 4x4 matrix product, shown again with operand rows/columns of the right-hand matrix loaded into vector registers for the dot products.]
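One way to picture the SIMD work per output element is the sketch below: a 4-element dot product done with SSE (the column of the right-hand matrix is assumed to have been gathered into a contiguous array, and the four lanes are summed in scalar code for clarity).

#include <xmmintrin.h>                  /* SSE: 128-bit float SIMD */

/* One element of C = A x B: the dot product of a row of A with a
   column of B.                                                     */
float dot4(const float row[4], const float col[4])
{
    float tmp[4];
    __m128 r = _mm_loadu_ps(row);
    __m128 c = _mm_loadu_ps(col);
    _mm_storeu_ps(tmp, _mm_mul_ps(r, c));      /* four multiplies at once */
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}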
DLP: Code Vectorization
Enable loop parallelization by utilizing SIMD instructions
for (i=0; i < N; i++)
C[i] = A[i] + B[i];
Scalar Sequential Code vs. Vectorized Code
[Figure: in the scalar version each iteration issues its own two loads, an add, and a store, one after another; in the vectorized version a single vector load/add/store sequence covers several iterations (e.g., Iter. 1 and Iter. 2) at once, so far less time is needed.]
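A sketch of what the vectorized code might look like using SSE intrinsics (float arrays are assumed); the scalar tail loop handles element counts that are not a multiple of the 4-wide vectors.

#include <xmmintrin.h>                  /* SSE: 128-bit float SIMD */

void vec_add(float *C, const float *A, const float *B, int N)
{
    int i = 0;

    /* Vectorized body: two wide loads, one SIMD add, and one wide store
       per iteration, covering four elements of the original loop.       */
    for (; i + 4 <= N; i += 4) {
        __m128 a = _mm_loadu_ps(&A[i]);
        __m128 b = _mm_loadu_ps(&B[i]);
        _mm_storeu_ps(&C[i], _mm_add_ps(a, b));
    }

    /* Scalar tail for the remaining N % 4 elements. */
    for (; i < N; i++)
        C[i] = A[i] + B[i];
}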
Parallel Computers
Flynn’s taxonomy [1972]
SIMD computers are best suited for problems characterized by a high degree of regularity, such as graphics/image processing.
MISD has very few actual examples.
MIMD is now the most common type of parallel computer.
Many MIMD architectures also include SIMD execution sub-components.