Chapter 2
Instruction-Level Parallelism
Introduction
Pipelining became a universal technique by 1985
Overlaps the execution of instructions
Exploits instruction-level parallelism (ILP)
Two main approaches:
Dynamic, hardware-based approaches
Used in server and desktop processors
Used less extensively in personal mobile device (PMD) processors
Static, compiler-based approaches (software-based)
Not as successful outside of scientific applications
Review of basic concepts
Pipelining: each instruction is split into a sequence of steps – different steps can be executed concurrently by different circuitry.
A basic pipeline in a RISC processor
IF – Instruction Fetch
ID – Instruction Decode
EX – Instruction Execution
MEM – Memory Access
WB – Register Write Back
Two techniques:
Superscalar - A superscalar processor executes more than one
instruction during a clock cycle.
VLIW (very long instruction word) – the compiler packs multiple independent operations into one long instruction
Review of basic concepts
IF – Instruction Fetch
Send the PC to memory and fetch the current instruction from memory.
Update the PC to point to the next instruction by adding 4.
ID – Instruction Decode
Decode the instruction
Read the registers corresponding to the register source specifiers.
EX – Instruction Execution
Memory reference: add the base register and the offset to form the effective address.
Register-register ALU instruction: perform the ALU operation.
Conditional branch: determine whether the condition is true.
MEM – Memory Access - load/store to memory
WB – Register Write Back: write the result to the register file (for register-register ALU instructions and loads).
Note: branches require three cycles, stores require four cycles, and all other instructions require five cycles.
Basic superscalar 5-stage pipeline
Superscalar- a processor executes more than one instruction during a
clock cycle by simultaneously dispatching multiple instructions to redundant
functional units on the processor.
The hardware determines (statically or dynamically) which of a block of n instructions will be executed next.
A single-core superscalar processor is SISD; a multi-core superscalar processor is MIMD.
Pipeline registers prevent interference between two different instructions in adjacent stages.
Hazards – situations in which pipelining could lead to incorrect results.
Data dependence ("true dependence") – Read After Write (RAW) hazard. Example:
i:   sub R1, R2, R3   % sub d,s,t: d = s - t
i+1: add R4, R1, R3   % add d,s,t: d = s + t
Instruction (i+1) reads operand (R1) before instruction (i) writes it.
Name dependence ("antidependence") – two instructions use the same register or memory location, but there is no flow of data between them.
Write After Read (WAR) hazard. Example:
i:   sub R4, R5, R3
i+1: add R5, R2, R3
i+2: mul R6, R5, R7
Instruction (i+1) writes operand (R5) before instruction (i) reads it.
Write After Write (WAW) hazard (output dependence). Example:
i:   sub R6, R5, R3
i+1: add R6, R2, R3
i+2: mul R1, R2, R7
Instruction (i+1) writes operand (R6) before instruction (i) writes it.
Hazards
Data hazards => RAW, WAR, WAW.
Structural hazard - occurs when a part of the processor's
hardware is needed by two or more instructions at the same time.
Example: a single memory unit that is accessed both in the fetch
stage where an instruction is retrieved from memory, and the
memory stage where data is written and/or read from memory.
They can often be resolved by separating the component into
orthogonal units (such as separate caches) or bubbling the
pipeline.
Control hazards (branch hazards) => due to branches.
On many instruction pipeline microarchitectures, the processor will
not know the outcome of the branch when it needs to insert a new
instruction into the pipeline (normally the fetch stage).
Instruction-level parallelism
When exploiting instruction-level parallelism (ILP), the goal is to minimize CPI (cycles per instruction):
Pipeline CPI =
Ideal pipeline CPI +
Structural stalls +
Data hazard stalls +
Control stalls
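For example (illustrative numbers, not from the text): with an ideal CPI of 1.0 and average per-instruction stalls of 0.10 structural, 0.25 data hazard, and 0.15 control, Pipeline CPI = 1.0 + 0.10 + 0.25 + 0.15 = 1.50.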
Parallelism within a basic block is limited
Typical size of basic block = 3-6 instructions
Must optimize across branches
For RISC programs, the average dynamic branch
frequency is between 15% and 25%.
The simplest and most common way to increase the ILP
is to exploit parallelism among iterations of a loop.
This type of parallelism is often called loop-level
parallelism.
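A minimal C illustration (hypothetical code; x, y, and s are assumed declared). The first loop's iterations are independent and can be overlapped; the second has a loop-carried dependence:
/* Iterations are independent: they can run overlapped (loop-level parallelism). */
for (int i = 0; i < 1000; i++)
    x[i] = x[i] + s;
/* Loop-carried dependence: iteration i needs the result of iteration i-1. */
for (int i = 1; i < 1000; i++)
    y[i] = y[i-1] + s;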
Data dependence
Loop-Level Parallelism
Unroll loop statically or dynamically
Use SIMD (vector processors and GPUs)
Challenges:
Data dependency
Instruction j is data dependent on instruction i if:
Instruction i produces a result that may be used by instruction j, or
Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (a dependence chain)
Dependent instructions cannot be executed simultaneously
Data dependence
Dependencies are a property of programs
Pipeline organization determines if dependence is
detected and if it causes a “stall.”
Data dependence conveys:
Possibility of a hazard
Order in which results must be calculated
Upper bound on exploitable instruction-level parallelism
Dependencies that flow through memory
locations are difficult to detect
Name dependence
Two instructions use the same name but no flow of
information
Not a true data dependence, but is a problem when
reordering instructions
Antidependence: instruction j writes a
register or memory location that instruction i reads
Initial ordering (i before j) must be preserved
Output dependence: instruction i and
instruction j write the same register or memory
location
Ordering must be preserved
To resolve, use renaming techniques
Control dependence
Every instruction is control dependent on some set of branches
and, in general, the control dependencies must be preserved to
ensure program correctness.
An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch.
An instruction not control dependent on a branch cannot be moved
after the branch so that its execution is controlled by the branch.
Example
if C1 {
S1;
};
if C2 {
S2;
};
S1 is control dependent on C1; S2 is control dependent on C2 but not on C1
Properties essential to program correctness
Preserving exception behavior: any change in instruction order must not change the order in which exceptions are raised. Example:
      DADDU R1, R2, R3
      BEQZ  R1, L1
      LW    R4, 0(R1)
L1:   ...
Can we move the LW before the BEQZ? If the load could raise an exception (R1 may be 0 here), moving it would change the program's exception behavior.
Preserving data flow: the flow of data between instructions that produce results and those that consume them. Example:
      DADDU R1, R2, R3
      BEQZ  R4, L1
      DSUBU R1, R8, R9
L1:   LW    R5, 0(R1)
The value of R1 seen by the LW depends on whether the branch is taken, so the data flow through the branch must be preserved.
Examples
• Example 1: the OR instruction is dependent on both DADDU and DSUBU.
      DADDU R1,R2,R3
      BEQZ  R4,L
      DSUBU R1,R1,R6
L:    ...
      OR    R7,R1,R8
• Example 2: assume R4 is not used after skip. It is then possible to move the DSUBU before the branch.
      DADDU R1,R2,R3
      BEQZ  R12,skip
      DSUBU R4,R5,R6
      DADDU R5,R4,R9
skip: OR    R7,R8,R9
Compiler techniques for exposing ILP
Loop transformation techniques to optimize a program's execution speed:
1. reduce or eliminate instructions that control the loop, e.g., pointer arithmetic and "end of loop" tests on each iteration
2. hide latencies, e.g., the delay in reading data from memory
3. re-write loops as a repeated sequence of similar independent statements (a space-time tradeoff)
4. reduce branch penalties
Methods
1. pipeline scheduling
2. loop unrolling
3. strip mining
1. Pipeline scheduling
Pipeline stall - delay in execution of an instruction in an instruction
pipeline in order to resolve a hazard. The compiler can reorder
instructions to reduce the number of pipeline stalls.
Pipeline scheduling - Separate dependent instruction from the source
instruction by the pipeline latency of the source instruction
Example:
for (i=999; i>=0; i=i-1)
x[i] = x[i] + s;
Pipeline stalls
Assumptions:
1. The scalar s is in F2.
2. R1 initially holds the address of the array element with the highest address.
3. R2 is precomputed so that 8(R2) is the address of the element with the lowest address.
Loop: L.D    F0,0(R1)    % load array element into F0
      stall              % next instruction needs the result in F0
      ADD.D  F4,F0,F2    % add the scalar s to the array element
      stall
      stall              % next instruction needs the result in F4
      S.D    F4,0(R1)    % store array element
      DADDUI R1,R1,#-8   % decrement pointer to array element
      BNE    R1,R2,Loop  % loop if not the last element
Pipeline scheduling
Scheduled code:
Loop: L.D    F0,0(R1)
      DADDUI R1,R1,#-8
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,8(R1)    % offset adjusted: R1 has already been decremented
      BNE    R1,R2,Loop
2. Loop unrolling
Given the same code:
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      BNE    R1,R2,Loop
Assume the number of elements in the array whose starting address is in R1 is divisible by 4.
Unroll by a factor of 4.
Eliminate unnecessary instructions: merge the DADDUI instructions and drop the unnecessary BNE operations.
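At the source level, the unroll-by-4 transformation looks roughly like this (a C sketch; x, s, and i assumed declared as in the earlier loop):
/* Original: for (i = 999; i >= 0; i = i - 1) x[i] = x[i] + s; */
/* Unrolled by a factor of 4; the element count (1000) is divisible by 4. */
for (i = 999; i >= 0; i = i - 4) {
    x[i]     = x[i]     + s;
    x[i - 1] = x[i - 1] + s;
    x[i - 2] = x[i - 2] + s;
    x[i - 3] = x[i - 3] + s;
}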
Unrolled loop
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1) % drop DADDUI & BNE
L.D F6,-8(R1)
ADD.D F8,F6,F2
S.D F8,-8(R1) %drop DADDUI & BNE
L.D F10,-16(R1)
ADD.D F12,F10,F2
S.D F12,-16(R1) % drop DADDUI & BNE
L.D F14,-24(R1)
ADD.D F16,F14,F2
S.D F16,-24(R1)
DADDUI R1,R1,#-32
BNE R1,R2,Loop
Live registers: F0-F1, F2-F3, F4-F5, F6-F7, F8-F9, F10-F11, F12-F13, F14-F15, and F16-F17; also R1 and R2.
In the original code: only F0-F1, F2-F3, and F4-F5; also R1 and R2.
Clock cycles
Without scheduling – 8 clock cycles per element
With scheduling – 7 clock cycles per element
The unrolled loop runs in 26 clock cycles per iteration (4 elements):
– each L.D has 1 stall,
– each ADD.D has 2 stalls, plus
– 14 instructions
– or 6.5 clock cycles for each of the 4 elements.
Unrolled plus scheduled drops to a total of 14 clock cycles, or 3.5 clock cycles per element, compared with 8 cycles before unrolling or scheduling and 6.5 cycles when unrolled but not scheduled.
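Working out the counts: unrolled but unscheduled, 14 instructions + 4 × 1 (L.D stalls) + 4 × 2 (ADD.D stalls) = 26 cycles, and 26 / 4 = 6.5 cycles per element; unrolled and scheduled, 14 instructions with no stalls = 14 cycles, and 14 / 4 = 3.5 cycles per element.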
Pipeline schedule the unrolled loop
Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D    F12,16(R1)
      S.D    F16,8(R1)
      BNE    R1,R2,Loop
Pipeline scheduling reduces the number of stalls:
1. Each L.D requires only one cycle, so by the time the ADD.Ds issue, F0, F6, F10, and F14 are already loaded.
2. Each ADD.D requires only two cycles, so the S.Ds can proceed without stalling.
3. The array pointer is updated after the first two S.Ds, so the loop control can proceed immediately after the last two S.Ds.
Loop unrolling & scheduling summary
Use different registers to avoid unnecessary constraints.
Adjust the loop termination and iteration code.
Determine whether the loop iterations are independent, except for the loop maintenance code; if so, unroll the loop.
Analyze the memory addresses to determine whether loads and stores from different iterations are independent; if so, interchange loads and stores in the unrolled loop.
Schedule the code while ensuring correctness.
Limitations of loop unrolling
Diminishing returns – each additional unrolling amortizes less of the loop overhead
Growth of the code size
Register pressure (shortage of registers) – scheduling to increase ILP increases the number of live values and thus the number of registers needed
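Strip mining (method 3 in the methods list above) handles trip counts that are not known to be divisible by the unroll factor: run n mod k iterations in a scalar cleanup loop, then the rest in the unrolled loop. A C sketch (x, s, n, i, and rem assumed declared):
rem = n % 4;
for (i = 0; i < rem; i++)         /* cleanup loop: n mod 4 iterations */
    x[i] = x[i] + s;
for (i = rem; i < n; i += 4) {    /* main loop: trip count now divisible by 4 */
    x[i]     = x[i]     + s;
    x[i + 1] = x[i + 1] + s;
    x[i + 2] = x[i + 2] + s;
    x[i + 3] = x[i + 3] + s;
}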
Branch prediction
Branch prediction - guess whether a conditional
jump will be taken or not.
Goal - improve the flow in the instruction pipeline.
Speculative execution - the branch that is
guessed to be the most likely is then fetched and
speculatively executed.
Penalty - if it is later detected that the guess was
wrong then the speculatively executed or partially
executed instructions are discarded and the
pipeline starts over with the correct branch,
incurring a delay.
Branch prediction
• As pipelines get deeper and the potential penalty of
branches increases,
– using delayed branches is insufficient.
• Predicting branches
– low-cost static schemes that rely on information
available at compile time (always taken or not)
• use profile information collected from earlier runs
• misprediction rates for such static schemes range from 3% to 24%
– predict branches dynamically based on program
behavior.
Dynamic Branch prediction
• How - the branch predictor keeps records of
whether branches are taken or not taken.
• When it encounters a conditional jump that has
been seen several times before then it can base
the prediction on the history.
• The branch predictor may, for example,
recognize that the conditional jump is taken
more often than not, or that it is taken every
second time
Predictors
Basic 2-bit predictor:
For each branch:
Predict taken or not taken
If the prediction is wrong two consecutive times, change prediction
Correlating predictor:
Multiple 2-bit predictors for each branch
One for each possible combination of outcomes of preceding n
branches
Local predictor:
Multiple 2-bit predictors for each branch
One for each possible combination of outcomes for the last n
occurrences of this branch
Tournament predictor:
Combine correlating predictor with local predictor
The states in a 2-bit prediction scheme
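A minimal C sketch of one 2-bit saturating counter (illustrative, not any specific processor's design): values 0-1 predict not taken, 2-3 predict taken, so a single misprediction does not flip the prediction.
static unsigned char counter = 2;              /* start in "weakly taken" */

int predict_taken(void) {
    return counter >= 2;                       /* states 2 and 3 predict taken */
}

void update(int taken) {
    if (taken && counter < 3)   counter++;     /* saturate at 3 */
    if (!taken && counter > 0)  counter--;     /* saturate at 0 */
}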
Correlating predictor
• Branch predictors that use the behavior of other
branches to make a prediction
• A (1,2) predictor uses the behavior of the last
branch to choose from among a pair of 2-bit
branch predictors in predicting a particular
branch.
• An (m,n) predictor uses the behavior of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch.
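A C sketch of a (2,2) correlating predictor (table size and indexing are assumptions): a 2-bit global history register selects one of the 2^m = 4 two-bit counters in the branch's table entry.
#define TABLE_SIZE 1024                        /* number of branch entries (assumed) */

static unsigned char counters[TABLE_SIZE][4];  /* 2^m = 4 two-bit counters per entry */
static unsigned int history;                   /* outcomes of the last m = 2 branches */

int predict(unsigned int pc) {
    return counters[pc % TABLE_SIZE][history & 3] >= 2;
}

void train(unsigned int pc, int taken) {
    unsigned char *c = &counters[pc % TABLE_SIZE][history & 3];
    if (taken && *c < 3)  (*c)++;
    if (!taken && *c > 0) (*c)--;
    history = ((history << 1) | (taken ? 1u : 0u)) & 3;  /* shift in the outcome */
}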
Correlating predictor
a = 0;
while (a < 1000) {
    B1: if (a % 2 != 0) { ... }
    a++;
    ...
    B2: if (b == 0) { ... }
}
Branch history
B1: TNTNTNTNT
B2: NNNNNNNN
(The slides step through these histories with a 2-bit predictor and then a correlating predictor: B1's alternating pattern defeats a simple 2-bit predictor, while a predictor that uses recent branch history captures it.)
Branch prediction performance
Dynamic scheduling
Rearrange order of instructions to reduce stalls
while maintaining data flow
Advantages:
Compiler doesn’t need to have knowledge of
microarchitecture
Handles cases where dependencies are unknown at
compile time
Disadvantage:
Substantial increase in hardware complexity
Complicates exceptions
Dynamic scheduling
Dynamic scheduling implies:
Out-of-order execution
Out-of-order completion
Creates the possibility for WAR and WAW
hazards
Tomasulo’s Approach
Tracks when operands are available
Introduces register renaming in hardware
Minimizes WAW and WAR hazards
Register renaming
Example:
DIV.D F0,F2,F4
ADD.D F6,F0,F8    % antidependence (WAR) on F8 with SUB.D
S.D   F6,0(R1)
SUB.D F8,F10,F14
MUL.D F6,F10,F8   % output dependence (WAW) on F6 with ADD.D, plus a name dependence on F6 with S.D
Register renaming
Example: add two temporary registers S and T
i:   DIV.D F0,F2,F4
i+1: ADD.D S,F0,F8    (instead of ADD.D F6,F0,F8)
i+2: S.D   S,0(R1)    (instead of S.D F6,0(R1))
i+3: SUB.D T,F10,F14  (instead of SUB.D F8,F10,F14)
i+4: MUL.D F6,F10,T   (instead of MUL.D F6,F10,F8)
Now only the RAW hazards remain, and they can be strictly ordered.
Register renaming
Register renaming is provided by reservation stations (RS)
Contains:
The instruction
Buffered operand values (when available)
Reservation station number of instruction providing the operand
values
RS fetches and buffers an operand as soon as it becomes available (not
necessarily involving register file)
Pending instructions designate the RS to which they will send their
output
Result values broadcast on a result bus, called the common data bus (CDB)
Only the last output updates the register file
As instructions are issued, the register specifiers for pending operands are renamed to the numbers of the reservation stations that will produce the values.
May be more reservation stations than registers
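A C sketch of the per-station state (field names are assumptions, following the classic description):
struct reservation_station {
    int    busy;     /* station currently holds an instruction */
    int    op;       /* operation to perform on the operands */
    double vj, vk;   /* buffered operand values, once available */
    int    qj, qk;   /* numbers of the stations producing the operands (0 = ready) */
};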
Tomasulo’s algorithm
Goal: high performance without special compilers, when the hardware has only a small number of floating-point (FP) registers.
Designed in 1966 for the IBM 360/91, which had only 4 FP registers.
Variants are used in modern processors: Pentium II/III/4, PowerPC 604, Nehalem, ...
Additional hardware needed
Load and store buffers contain data and addresses, and act like reservation stations.
Reservation stations feed data to floating point arithmetic units.
Instruction execution steps
Issue
Get the next instruction from the FIFO queue.
If a reservation station is available, issue the instruction to it, together with any operand values that are already available.
If an operand value is not yet available, record the reservation station that will produce it; if no reservation station is free, stall.
Execute
When an operand becomes available, store it in all reservation stations waiting for it.
When all operands are ready, begin executing the instruction.
Loads and stores are maintained in program order through the effective-address calculation.
No instruction is allowed to initiate execution until all branches that precede it in program order have completed.
Write result
Broadcast the result on the CDB into the reservation stations and store buffers.
(Stores must wait until both the address and the value are received.)
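In code terms, the issue step either copies an operand value or records the tag of the producing station (a simplified sketch, continuing the struct above; the register-status table is an assumed helper):
struct reg_status { int busy; int rs; };  /* which station will write each register */
static struct reg_status reg_stat[32];
static double regfile[32];

/* Issue source register 'src' as the first operand of station 'r'. */
void issue_operand(struct reservation_station *r, int src) {
    if (reg_stat[src].busy) {
        r->qj = reg_stat[src].rs;   /* value pending: wait for this tag on the CDB */
    } else {
        r->vj = regfile[src];       /* value available: buffer it in the station */
        r->qj = 0;                  /* 0 means the operand is ready */
    }
}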
Example of Tomasulo's algorithm
3 load buffers (Load1, Load2, Load3)
5 reservation stations (Add1, Add2, Add3, Mult1, Mult2)
16 pairs of floating-point registers: F0-F1, F2-F3, ..., F30-F31
Latencies: addition 2 clock cycles; multiplication 10; division 40
Clock cycle 1
A load from memory location 34+(R2) is issued; data will be stored
later in the first load buffer (Load1)
Clock cycle 2
The second load, from memory location 45+(R3), is issued; its data will be stored in the second load buffer (Load2).
Multiple loads can be outstanding.
Clock cycle 3
The first load completes; the SUBD instruction is waiting for its data.
MULTD is issued.
Register names are removed/renamed in Reservation Stations
Clock cycle 4
Load2 completes; MULTD waits for its result.
The result of Load1 is available to the SUBD instruction.
Clock cycle 5
The result of Load2 is available to MULTD (executed by Mult1) and SUBD (executed by Add1). Both can now proceed, as each has both of its operands.
Mult2 holds DIVD, which cannot proceed yet: it waits for the result of MULTD from Mult1.
Clock cycle 6
Issue ADDD
Clock cycle 7
The result of the SUBD produced by Add1 will be available in the next cycle.
The ADDD instruction executed by Add2 waits for it.
Clock cycle 8
The result of SUBD is deposited by Add1 into F8-F9.
Clock cycle 9
Clock cycle 10
ADDD executed by Add2 completes; it needed 2 cycles.
There are 5 more cycles to go for MULTD executed by Mult1.
Clock cycle 11
Only the MULTD and DIVD instructions have not completed. DIVD is waiting for the result of MULTD before moving to the execute stage.
Clock cycle 12
Clock cycle 13
Clock cycle 14
Clock cycle 15
The MULTD instruction executed by the Mult1 unit completes execution.
The DIVD instruction executed by the Mult2 unit has been waiting for it.
Clock cycle 16
Clock cycle 55
DIVD will finish execution in cycle 56 and the result will be in F6-F7
in cycle 57.
Hardware-based speculation
Goal: overcome control dependence by speculating on the outcome of branches.
Allow instructions to execute out of order, but force them to commit in order, and prevent any irrevocable action – (i) updating the architectural state or (ii) taking an exception – until the instruction commits.
Instruction commit: allow an instruction to update the register file only when the instruction is no longer speculative.
Key ideas:
1. Dynamic branch prediction.
2. Execute instructions along predicted execution paths, but only
commit the results if prediction was correct.
3. Dynamic scheduling to deal with different combinations of basic blocks.
How speculative execution is done
Need additional hardware to prevent any irrevocable
action until an instruction commits.
Reorder buffer (ROB)
Modify the functional units: operands are sourced from the ROB rather than from the functional units.
Register values and memory values are not written until an
instruction commits
On misprediction:
Speculated entries in ROB are cleared
Exceptions:
An exception is not recognized until the instruction that caused it is ready to commit.
Extended floating-point unit
The FP unit using Tomasulo's algorithm, extended to handle speculation.
The reorder buffer now holds the result of an instruction between completion and commit. Each entry has 4 fields:
Instruction type: branch / store / register operation
Destination field: register number
Value field: the output value
Ready field: has the instruction completed execution?
The operand source is now the reorder buffer instead of the functional unit.
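As a C sketch (field names are assumptions), a ROB entry holding these four fields might look like:
struct rob_entry {
    int    type;    /* instruction type: branch, store, or register operation */
    int    dest;    /* destination field: register number */
    double value;   /* value field: the output value */
    int    ready;   /* ready field: has the instruction completed execution? */
};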
Multiple Issue and Static Scheduling
To achieve CPI < 1, multiple instructions must complete per clock cycle.
Three flavors of multiple issue processors
1. Statically scheduled superscalar processors
a. Issue a varying number of instructions per clock
cycle
b. Use in-order execution
2. VLIW (very long instruction word) processors
a. Issue a fixed number of instructions as one large
instruction
b. Instructions are statically scheduled by the compiler
3. Dynamically scheduled superscalar processors
a. Issue a varying number of instructions per clock
cycle
b. Use out-of-order execution
Multiple issue processors
VLIW Processors
Package multiple operations into one instruction
There must be enough parallelism in the code to fill the available slots.
Disadvantages:
Statically finding parallelism
Code size
No hazard detection hardware
Binary code compatibility
Example
Unroll the loop for x[i] = x[i] + s to eliminate any stalls; ignore delayed branches.
The code we had before (shown below) is packaged into VLIW instructions, each holding:
One integer instruction (or branch)
Two independent floating-point operations
Two independent memory references
Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D    F12,16(R1)
      S.D    F16,8(R1)
      BNE    R1,R2,Loop
Dynamic scheduling, multiple issue, and speculation
Modern microarchitectures:
Dynamic scheduling + multiple issue + speculation
Two approaches:
Assign reservation stations and update the pipeline control table in half a clock cycle
Only supports 2 instructions per clock
Design logic to handle any possible dependencies between the
instructions
Hybrid approaches.
Issue logic can become the bottleneck
Multiple issue processor with speculation
The organization should allow the simultaneous issue and execution, in one clock cycle, of one each of the following operations:
FP multiplication
FP addition
Integer operations
Load/store
Several datapaths must be widened to support multiple issue.
The instruction issue logic will be fairly complex.
Basic strategy for updating the issue logic
Assign a reservation station and a reorder buffer entry for every instruction that may be issued in the next bundle.
To pre-allocate reservation stations, limit the number of instructions of a given class that can be issued in a "bundle",
i.e., one FP, one integer, one load, one store.
Examine all the dependencies among the instructions in the bundle.
If dependencies exist within the bundle, use the assigned ROB numbers to update the reservation table entries for the dependent instructions. Otherwise, use the existing reservation table and ROB entries for the issuing instructions.
Also need multiple completion/commit
Example
Loop: LD R2,0(R1) ;R2=array element
DADDIU R2,R2,#1 ;increment R2
SD R2,0(R1) ;store result
DADDIU R1,R1,#8 ;increment pointer
BNE R2,R3,Loop ;branch if not last element
Dual issue without speculation
Time of issue, execution, and result write for a dual-issue version of the pipeline.
The LD following the BNE (cycles 3, 6) cannot start execution earlier: it must wait until the branch outcome is determined, since there is no speculation.
Dual issue with speculation
Time of issue, execution, result write, and commit.
The LD following the BNE can start execution early, because speculation is supported.