INTRODUCTION TO ADVANCED PIPELINE
Dr Noor Mahammad Sk
4 March 2025
Review: Summary of Pipelining Basics
Hazards limit performance
Structural: need more HW resources
Data: need forwarding, compiler scheduling
Control: early evaluation & PC, delayed branch, prediction
Increasing length of pipe increases impact of hazards; pipelining
helps instruction bandwidth, not latency
Interrupts, instruction set, and FP make pipelining harder
Compilers reduce cost of data and control hazards
Load delay slots
Branch delay slots
Branch prediction
Today: longer pipelines (R4000), more instruction-level parallelism, SW and HW loop unrolling
Case Study: MIPS R4000 (200MHz)
8 Stage Pipeline:
IF–first half of fetching of instruction; PC selection happens here as
well as initiation of instruction cache access.
IS–second half of access to instruction cache.
RF–instruction decode and register fetch, hazard checking and also
instruction cache hit detection.
EX–execution, which includes effective address calculation, ALU
operation, and branch target computation and condition evaluation.
DF–data fetch, first half of access to data cache.
DS–second half of access to data cache.
TC–tag check, determine whether the data cache access hit.
WB–write back for loads and register-register operations.
8 Stages: What is impact on Load delay? Branch delay? Why?
Case Study: MIPS R4000
TWO-cycle load latency:
IF IS RF EX DF DS TC WB
   IF IS RF EX DF DS TC
      IF IS RF EX DF DS
         IF IS RF EX DF
            IF IS RF EX
               IF IS RF
                  IF IS
                     IF
THREE-cycle branch latency (conditions evaluated during EX phase):
IF IS RF EX DF DS TC WB
   IF IS RF EX DF DS TC
      IF IS RF EX DF DS
         IF IS RF EX DF
            IF IS RF EX
               IF IS RF
                  IF IS
                     IF
Delay slot plus two stalls
Branch likely cancels delay slot if not taken
MIPS R4000 Floating Point
FP Adder, FP Multiplier, FP Divider
Last step of FP Multiplier/Divider uses FP Adder HW
8 kinds of stages in FP units:
Stage Functional unit Description
A FP adder Mantissa ADD stage
D FP divider Divide pipeline stage
E FP multiplier Exception test stage
M FP multiplier First stage of multiplier
N FP multiplier Second stage of multiplier
R FP adder Rounding stage
S FP adder Operand shift stage
U Unpack FP numbers
MIPS FP Pipe Stages
FP Instr 1 2 3 4 5 6 7 8 …
Add, Subtract U S+A A+R R+S
Multiply U E+M M M M N N+A R
Divide U A R D^28 … D+A D+R, D+R, D+A, D+R, A, R
Square root U E (A+R)^108 … A R
Negate U S
Absolute value U S
FP compare U A R
Stages:
A Mantissa ADD stage
D Divide pipeline stage
E Exception test stage
M First stage of multiplier
N Second stage of multiplier
R Rounding stage
S Operand shift stage
U Unpack FP numbers
FP Loop: Where are the Hazards?
Loop: LD F0,0(R1) ;F0=vector element
ADDD F4,F0,F2 ;add scalar from F2
SD 0(R1),F4 ;store result
SUBI R1,R1,8 ;decrement pointer 8B (DW)
BNEZ R1,Loop ;branch R1!=zero
NOP ;delayed branch slot
Instruction producing result Instruction using result Latency in clock cycles
FP ALU op Another FP ALU op 3
FP ALU op Store double 2
Load double FP ALU op 1
Load double Store double 0
Integer op Integer op 0
Where are the stalls?
FP Loop Showing Stalls
1 Loop: LD F0, 0(R1) ;F0=vector element
2 Stall
3 ADDD F4, F0, F2 ;add scalar from F2
4 Stall
5 Stall
6 SD 0(R1), F4 ;store result
7 SUBI R1, R1, 8 ; decrement pointer 8B (DW)
8 BNEZ R1, Loop ;branch R1 != zero
9 Stall ;delayed branch slot
Instruction producing result Instruction using result Latency in clock cycles
FP ALU op Another FP ALU op 3
FP ALU op Store double 2
Load double FP ALU op 1
9 clocks: Rewrite code to minimize stalls?
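The stall positions above follow mechanically from the latency table. The following is a minimal sketch, assuming a single-issue in-order pipeline where a consumer must wait `latency` extra clocks after its producer issues. The names `LATENCY`, `schedule`, and the instruction "kind" labels are illustrative, and producer/consumer pairs missing from the table are assumed to need no stall.

```python
# Stall cycles required between a producer and a dependent consumer,
# copied from the slide's table. Pairs not listed are assumed to be 0.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
}

def schedule(instrs):
    """Return the issue clock of each instruction (in-order, single issue)."""
    produced = {}  # register -> (issue clock, kind) of its latest writer
    clocks = []
    clock = 0
    for _name, kind, dest, srcs in instrs:
        earliest = clock + 1
        for reg in srcs:
            if reg in produced:
                p_clock, p_kind = produced[reg]
                wait = LATENCY.get((p_kind, kind), 0)
                earliest = max(earliest, p_clock + 1 + wait)
        clock = earliest
        clocks.append(clock)
        if dest:
            produced[dest] = (clock, kind)
    return clocks

# (text, kind, destination register, source registers)
loop = [
    ("LD F0,0(R1)", "load", "F0", ["R1"]),
    ("ADDD F4,F0,F2", "fp_alu", "F4", ["F0", "F2"]),
    ("SD 0(R1),F4", "store", None, ["R1", "F4"]),
    ("SUBI R1,R1,8", "int", "R1", ["R1"]),
    ("BNEZ R1,Loop", "int", None, ["R1"]),
    ("NOP (delay slot)", "int", None, []),
]
clocks = schedule(loop)  # issue clocks 1, 3, 6, 7, 8, 9
```

The gaps in the returned clocks (after 1 and after 3) are the stall slots shown on the slide; the final clock is the loop's length in cycles.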
Revised FP Loop Minimizing Stalls
1 Loop: LD F0, 0(R1)
2 Stall
3 ADDD F4, F0, F2
4 SUBI R1, R1, 8
5 BNEZ R1, Loop ; delayed branch
6 SD 8(R1), F4 ; offset altered when moved past SUBI
Fill the branch delay slot with SD (instead of a stall) by changing the address of SD
Instruction producing result Instruction using result Latency in clock cycles
FP ALU op Another FP ALU op 3
FP ALU op Store double 2
Load double FP ALU op 1
6 clocks per iteration, down from 9
Unroll Loop Four Times (straightforward way)
1 Loop: LD F0,0(R1)
2 ADDD F4,F0,F2
3 SD 0(R1),F4 ;drop SUBI & BNEZ
4 LD F6,-8(R1)
5 ADDD F8,F6,F2
6 SD -8(R1),F8 ;drop SUBI & BNEZ
7 LD F10,-16(R1)
8 ADDD F12,F10,F2
9 SD -16(R1),F12 ;drop SUBI & BNEZ
10 LD F14,-24(R1)
11 ADDD F16,F14,F2
12 SD -24(R1),F16
13 SUBI R1,R1,#32 ;alter to 4*8
14 BNEZ R1,Loop
15 NOP
15 + 4 x (1+2) = 27 clock cycles, or 6.75 per original iteration
Rewrite loop to minimize stalls?
Unrolled Loop That Minimizes Stalls
1 Loop: LD F0,0(R1)
2 LD F6,-8(R1)
3 LD F10,-16(R1)
4 LD F14,-24(R1)
5 ADDD F4,F0,F2
6 ADDD F8,F6,F2
7 ADDD F12,F10,F2
8 ADDD F16,F14,F2
9 SD 0(R1),F4
10 SD -8(R1),F8
11 SD -16(R1),F12
12 SUBI R1,R1,#32
13 BNEZ R1,Loop
14 SD 8(R1),F16 ; 8-32 = -24
14 clock cycles, or 3.5 per original iteration
What assumptions were made when the code was moved?
OK to move the store past SUBI even though SUBI changes the register
OK to move loads before stores: do we get the right data?
When is it safe for the compiler to move instructions like this?
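The same transformation can be viewed at source level. This is a hedged sketch, not the compiler's actual algorithm: `add_scalar` and `add_scalar_unrolled4` are hypothetical names, and the unrolled version assumes the element count is a multiple of 4, which the unrolled assembly also needs since it drops the intermediate SUBI and BNEZ.

```python
def add_scalar(x, s):
    """Rolled loop: one element per iteration, walking down from the end."""
    i = len(x) - 1
    while i >= 0:
        x[i] += s
        i -= 1
    return x

def add_scalar_unrolled4(x, s):
    """Unrolled by 4 with loads, adds, and stores grouped as in the slide."""
    assert len(x) % 4 == 0, "sketch assumes a trip count divisible by 4"
    i = len(x) - 1
    while i >= 0:
        # Four loads into distinct temporaries (the renamed registers)
        a, b, c, d = x[i], x[i - 1], x[i - 2], x[i - 3]
        # Four independent adds
        a, b, c, d = a + s, b + s, c + s, d + s
        # Four stores to distinct addresses
        x[i], x[i - 1], x[i - 2], x[i - 3] = a, b, c, d
        i -= 4  # one pointer update and one branch per four elements
    return x

same = add_scalar([1.0] * 8, 2.5) == add_scalar_unrolled4([1.0] * 8, 2.5)
```

Grouping the loads ahead of the stores is only legal because the four addresses are distinct and no iteration reads what another writes, which is the question the slide raises.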
Compiler Perspectives on Code Movement
Definitions: the compiler is concerned about dependencies in the
program; whether a dependence causes a HW hazard depends on the
given pipeline
Try to schedule to avoid hazards
(True) Data dependencies (RAW if a hazard for HW)
Instruction i produces a result used by instruction j, or
Instruction j is data dependent on instruction k, and
instruction k is data dependent on instruction i.
If dependent, can’t execute in parallel
Easy to determine for registers (fixed names)
Hard for memory:
Does 100(R4) = 20(R6)?
From different loop iterations, does 20(R6) = 20(R6)?
Where are the data dependencies?
1 Loop: LD F0,0(R1)
2 ADDD F4,F0,F2
3 SUBI R1,R1,8
4 BNEZ R1,Loop ;delayed branch
5 SD 8(R1),F4 ;altered when move past SUBI
Compiler Perspectives on Code Movement
Another kind of dependence called name dependence:
two instructions use same name (register or memory
location) but don’t exchange data
Antidependence (WAR if a hazard for HW)
Instruction j writes a register or memory location that
instruction i reads from and instruction i is executed first
Output dependence (WAW if a hazard for HW)
Instruction i and instruction j write the same register or
memory location; ordering between instructions must be
preserved.
Where are the name dependencies?
1 Loop: LD F0,0(R1)
2 ADDD F4,F0,F2
3 SD 0(R1),F4 ;drop SUBI & BNEZ
4 LD F0,-8(R1)
5 ADDD F4,F0,F2
6 SD -8(R1),F4 ;drop SUBI & BNEZ
7 LD F0,-16(R1)
8 ADDD F4,F0,F2
9 SD -16(R1),F4 ;drop SUBI & BNEZ
10 LD F0,-24(R1)
11 ADDD F4,F0,F2
12 SD -24(R1),F4
13 SUBI R1,R1,#32 ;alter to 4*8
14 BNEZ R1,LOOP
15 NOP
How can we remove them?
Where are the name dependencies?
1 Loop: LD F0,0(R1) 1 Loop: LD F0,0(R1)
2 ADDD F4,F0,F2 2 ADDD F4,F0,F2
3 SD 0(R1),F4 3 SD 0(R1),F4
4 LD F0,-8(R1) 4 LD F6,-8(R1)
5 ADDD F4,F0,F2 5 ADDD F8,F6,F2
6 SD -8(R1),F4 6 SD -8(R1),F8
7 LD F0,-16(R1) 7 LD F10,-16(R1)
8 ADDD F4,F0,F2 8 ADDD F12,F10,F2
9 SD -16(R1),F4 9 SD -16(R1),F12
10 LD F0,-24(R1) 10 LD F14,-24(R1)
11 ADDD F4,F0,F2 11 ADDD F16,F14,F2
12 SD -24(R1),F4 12 SD -24(R1),F16
13 SUBI R1,R1,#32 13 SUBI R1,R1,#32
14 BNEZ R1,LOOP 14 BNEZ R1,LOOP
15 NOP 15 NOP
How can we remove them? By using different registers, as in the right-hand column: this is called “register renaming”
Compiler Perspectives on Code Movement
Again, name dependencies are hard for memory accesses
Does 100(R4) = 20(R6)?
From different loop iterations, does 20(R6) = 20(R6)?
Our example required compiler to know that if R1
doesn’t change then:
0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)
There were no dependencies between some loads
and stores, so they could be moved past each other
Compiler Perspectives on Code Movement
Final kind of dependence called control dependence
Example
if p1 {S1;};
if p2 {S2;};
S1 is control dependent on p1 and S2 is control
dependent on p2 but not on p1.
Compiler Perspectives on Code Movement
Two (obvious) constraints on control dependences:
An instruction that is control dependent on a branch
cannot be moved before the branch so that its execution is
no longer controlled by the branch.
An instruction that is not control dependent on a branch
cannot be moved to after the branch so that its execution
is controlled by the branch.
Control dependencies can be relaxed to get parallelism; the
effect is the same if we preserve the order of exceptions (an
address in a register is checked by the branch before use) and
the data flow (a value in a register depends on the branch)
Where are the control dependencies?
1 Loop: LD F0,0(R1)
2 ADDD F4,F0,F2
3 SD 0(R1),F4
4 SUBI R1,R1,8
5 BEQZ R1,exit
6 LD F0,0(R1)
7 ADDD F4,F0,F2
8 SD 0(R1),F4
9 SUBI R1,R1,8
10 BEQZ R1,exit
11 LD F0,0(R1)
12 ADDD F4,F0,F2
13 SD 0(R1),F4
14 SUBI R1,R1,8
15 BEQZ R1,exit
....
When Safe to Unroll Loop?
Example: Where are data dependencies?
(A,B,C distinct & nonoverlapping)
for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
1. S2 uses the value, A[i+1], computed by S1 in the same
iteration.
2. S1 uses a value computed by S1 in an earlier iteration,
since iteration i computes A[i+1] which is read in iteration i+1.
The same is true of S2 for B[i] and B[i+1].
This is a “loop-carried dependence”: between iterations
Implies that iterations are dependent, and can’t be executed
in parallel
Not the case for our prior example; each iteration was distinct
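One way to see a loop-carried dependence is to run the iterations in a different order and compare results. The sketch below is an illustration only: the array contents, the trip count of 5, and the helper name `run` are assumptions, not part of the slide.

```python
def run(order, n=5):
    """Execute S1 and S2 for each i in the given order; return final A and B."""
    A = [1.0] * (n + 2)
    B = [1.0] * (n + 2)
    C = [2.0] * (n + 2)
    for i in order:
        A[i + 1] = A[i] + C[i]      # S1: reads A[i], written by iteration i-1
        B[i + 1] = B[i] + A[i + 1]  # S2: reads A[i+1] from S1, and B[i] from iteration i-1
    return A, B

forward = run(range(1, 6))       # program order
backward = run(range(5, 0, -1))  # same iterations, reversed order
order_matters = forward != backward
```

Because each iteration consumes a value produced by the previous one, reordering changes the answer, so the iterations cannot simply run in parallel. The earlier vector-plus-scalar loop would give identical results in either order.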
HW Schemes: Instruction Parallelism
Why in HW at run time?
Works when can’t know real dependence at compile time
Compiler simpler
Code for one machine runs well on another
Key idea: Allow instructions behind stall to proceed
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
Enables out-of-order execution => out-of-order completion
ID stage checked both for structural and data hazards
Scoreboard dates to CDC 6600 in 1963
HW Schemes: Instruction Parallelism
Out-of-order execution divides ID stage:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until no data hazards, then read operands
Scoreboards allow instruction to execute whenever 1
& 2 hold, not waiting for prior instructions
CDC 6600: in-order issue, out-of-order execution, out-of-order
commit (also called completion)
Scoreboard Implications
Out-of-order completion => WAR, WAW hazards?
Solutions for WAR
Queue both the operation and copies of its operands
Read registers only during Read Operands stage
For WAW, must detect hazard: stall until other completes
Need to have multiple instructions in execution phase =>
multiple execution units or pipelined execution units
Scoreboard keeps track of dependencies and the state of
operations
Scoreboard replaces ID, EX, WB with 4 stages
Four Stages of Scoreboard Control
1. Issue—decode instructions & check for structural hazards (ID1)
If a functional unit for the instruction is free and no other active
instruction has the same destination register (WAW), the scoreboard
issues the instruction to the functional unit and updates its internal
data structure. If a structural or WAW hazard exists, then the
instruction issue stalls, and no further instructions will issue until these
hazards are cleared.
2. Read operands—wait until no data hazards, then read operands
(ID2)
A source operand is available if no earlier issued active instruction is
going to write it, or if the register containing the operand is being
written by a currently active functional unit. When the source
operands are available, the scoreboard tells the functional unit to
proceed to read the operands from the registers and begin execution.
The scoreboard resolves RAW hazards dynamically in this step, and
instructions may be sent into execution out of order.
Four Stages of Scoreboard Control
3. Execution—operate on operands (EX)
The functional unit begins execution upon receiving operands.
When the result is ready, it notifies the scoreboard that it has
completed execution.
4. Write result—finish execution (WB)
Once the scoreboard is aware that the functional unit has
completed execution, the scoreboard checks for WAR hazards. If
none, it writes results. If WAR, then it stalls the instruction.
Example:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14
CDC 6600 scoreboard would stall SUBD until ADDD reads
operands
Three Parts of the Scoreboard
1. Instruction status—which of 4 steps the instruction is in
2. Functional unit status—Indicates the state of the
functional unit (FU). 9 fields for each functional unit
Busy—Indicates whether the unit is busy or not
Op—Operation to perform in the unit (e.g., + or –)
Fi—Destination register
Fj, Fk—Source-register numbers
Qj, Qk—Functional units producing source registers Fj, Fk
Rj, Rk—Flags indicating when Fj, Fk are ready
3. Register result status—Indicates which functional unit
will write each register, if one exists. Blank when no
pending instructions will write that register
Detailed Scoreboard Pipeline Control
Instruction status: Issue
  Wait until: not Busy(FU) and not Result(D)
  Bookkeeping: Busy(FU)← yes; Op(FU)← op; Fi(FU)← D; Fj(FU)← S1; Fk(FU)← S2; Qj← Result(S1); Qk← Result(S2); Rj← not Qj; Rk← not Qk; Result(D)← FU
Instruction status: Read operands
  Wait until: Rj and Rk
  Bookkeeping: Rj← No; Rk← No
Instruction status: Execution complete
  Wait until: functional unit done
Instruction status: Write result
  Wait until: ∀f((Fj(f)≠Fi(FU) or Rj(f)=No) & (Fk(f)≠Fi(FU) or Rk(f)=No))
  Bookkeeping: ∀f(if Qj(f)=FU then Rj(f)← Yes); ∀f(if Qk(f)=FU then Rk(f)← Yes); Result(Fi(FU))← 0; Busy(FU)← No
Scoreboard Example
FP Add latency = 2 clocks, Multiply = 10, Divide = 40
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
FU
Scoreboard Example Cycle 1
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
1 FU Integer
Scoreboard Example Cycle 2
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU Integer
• Issue 2nd LD?
Scoreboard Example Cycle 3
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU Integer
Scoreboard Example Cycle 4
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU Integer
Scoreboard Example Cycle 5
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU Integer
Scoreboard Example Cycle 6
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6
MULTD F0 F2 F4 6
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU Mult1 Integer
Scoreboard Example Cycle 7
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add Yes Sub F8 F6 F2 Integer Yes No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU Mult1 Integer Add
Scoreboard Example Cycle 8a
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add Yes Sub F8 F6 F2 Integer Yes No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 Integer Add Divide
Scoreboard Example Cycle 8b
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 Add Divide
Scoreboard Example Cycle 9
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
10 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
2 Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
9 FU Mult1 Add Divide
• Read operands for MULT & SUBD? Issue ADDD?
Scoreboard Example Cycle 11
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
8 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
0 Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
11 FU Mult1 Add Divide
Scoreboard Example Cycle 12
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
7 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
12 FU Mult1 Divide
• Read operands for DIVD?
Scoreboard Example Cycle 13
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
6 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
13 FU Mult1 Add Divide
Scoreboard Example Cycle 14
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
5 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
2 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
14 FU Mult1 Add Divide
Scoreboard Example Cycle 15
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
4 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
1 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
15 FU Mult1 Add Divide
Scoreboard Example Cycle 16
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
3 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
0 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
16 FU Mult1 Add Divide
Scoreboard Example Cycle 17
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
2 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
17 FU Mult1 Add Divide
• Write result of ADDD?
Scoreboard Example Cycle 18
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
1 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
18 FU Mult1 Add Divide
Scoreboard Example Cycle 19
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
0 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
19 FU Mult1 Add Divide
Scoreboard Example Cycle 20
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
20 FU Add Divide
Scoreboard Example Cycle 21
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
21 FU Add Divide
Scoreboard Example Cycle 22
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21
ADDD F6 F8 F2 13 14 16 22
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
40 Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
22 FU Divide
Scoreboard Example Cycle 61
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21 61
ADDD F6 F8 F2 13 14 16 22
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
0 Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
61 FU Divide
Scoreboard Example Cycle 62
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21 61 62
ADDD F6 F8 F2 13 14 16 22
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
0 Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
62 FU
Review: Scoreboard Example Cycle 62
In-order issue; out-of-order execute & commit
Instruction status
Instruction j k Issue Read-operands Exec-complete Write-result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21 61 62
ADDD F6 F8 F2 13 14 16 22
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
0 Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
62 FU
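The four timestamps in the instruction status table can be reproduced with a short calculation. This is a simplified sketch, not the CDC 6600 control logic: because every stall condition refers only to earlier instructions, it computes each instruction's clocks in one pass in program order rather than simulating cycle by cycle. The names `LATENCY`, `UNITS`, `PROGRAM`, and `scoreboard` are illustrative, and the 1-clock integer (load) latency is an assumption read off the table.

```python
LATENCY = {"Integer": 1, "Mult": 10, "Add": 2, "Divide": 40}
UNITS = {"Integer": 1, "Mult": 2, "Add": 1, "Divide": 1}

# (text, functional unit type, destination, sources)
PROGRAM = [
    ("LD F6 34+R2", "Integer", "F6", ["R2"]),
    ("LD F2 45+R3", "Integer", "F2", ["R3"]),
    ("MULTD F0 F2 F4", "Mult", "F0", ["F2", "F4"]),
    ("SUBD F8 F6 F2", "Add", "F8", ["F6", "F2"]),
    ("DIVD F10 F0 F6", "Divide", "F10", ["F0", "F6"]),
    ("ADDD F6 F8 F2", "Add", "F6", ["F8", "F2"]),
]

def scoreboard(program):
    # For each unit of each type: clock at which its last user wrote its result
    free_at = {u: [0] * n for u, n in UNITS.items()}
    done = []
    for text, unit, dest, srcs in program:
        slots = free_at[unit]
        slot = min(range(len(slots)), key=lambda k: slots[k])
        issue = slots[slot] + 1                        # structural hazard
        if done:
            issue = max(issue, done[-1]["issue"] + 1)  # in-order issue
        for e in done:
            if e["dest"] == dest:
                issue = max(issue, e["write"] + 1)     # WAW: stall issue
        read = issue + 1
        for e in done:
            if e["dest"] in srcs:
                read = max(read, e["write"] + 1)       # RAW: wait for producer
        exec_done = read + LATENCY[unit]
        write = exec_done + 1
        for e in done:
            if dest in e["srcs"]:
                write = max(write, e["read"] + 1)      # WAR: wait for readers
        slots[slot] = write
        done.append({"text": text, "dest": dest, "srcs": srcs, "issue": issue,
                     "read": read, "exec": exec_done, "write": write})
    return done

for e in scoreboard(PROGRAM):
    print(e["text"], e["issue"], e["read"], e["exec"], e["write"])
```

The WAR line is what holds ADDD's write until clock 22 (DIVD reads F6 at clock 21), and the structural-hazard line is what delays the second LD to clock 5 and ADDD's issue to clock 13.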
CDC 6600 Scoreboard
Limitations of 6600 scoreboard:
No forwarding hardware
Limited to instructions in basic block (small window)
Small number of functional units (structural hazards),
especially integer/load store units
Do not issue on structural hazards
Wait for WAR hazards
Prevent WAW hazards
4 March 2025 Dr Noor Mahammad Sk
Dr Noor Mahammad Sk 57
ANOTHER CASE STUDY
EXAMPLE
4 March 2025
ILP Continues…
Data Hazards
LOAD R1, [R2 + 10] // Loads into R1
ADD R3, R1, R2 //R3 = R1 + R2
This is the “Read After Write (RAW)” Data Hazard for
R1
LD R1, [R2+10]
ADD R3, R1, R12
LD R1, [R2 + 14]
ADD R12, R1, R2
This shows the WAW for R1 and WAR for R12
ILP – Pipelining Advanced
Superscalar: CPI < 1
Pipeline: Fetch + Inc. PC -> Decode Instrn -> Fetch Data -> Execute Unit 1 | Execute Unit 2 | ... | Execute Unit K -> Store Data
Success: different instructions take different cycle times (e.g., four FMULs while one FDIV)
Implies out-of-order execution
Difficulties in Superscalar Construction
Ensuring there are no data hazards among several instructions
executing in different execution units at the same time.
If this is done by the compiler: Static Instruction
Scheduling (VLIW, Itanium)
If done by the hardware: Dynamic Instruction
Scheduling (Tomasulo, MIPS Embedded Processor)
Static Instruction Scheduling
The compiler makes bundles of “K” instructions that can be issued to the
execution units at the same time, such that there are no data dependencies
between them.
Very Long Instruction Word (VLIW) to accommodate “K” instructions at a time
Lots of NOPs if the bundle cannot be filled with relevant instructions
Size of the executable grows
Does not complicate the hardware
Code portability: what if the next-generation processor has K+5 units
(say)?
Solved by having a software/firmware emulator, which hurts
performance.
Dynamic Instruction Scheduling
The data hazards are handled by the hardware
RAW using Operand Forwarding Technique
WAR and WAW using Register Renaming Technique
Processor Overview
[Figure: Processor (ALU/Control, multiple function units, register file) connected over a bus to Memory]
RAW example:
LD [R1+20],R2
ADD R3,R2,R4
Why should the result of LD go to R2 in the register file and then be reloaded to the ALU?
Forward the same on its way to the register file
Register Renaming
1. ADD R1, R2, R3
2. ST R1, [R4+50]
3. ADD R1, R5, R6
4. SUB R7,R1,R8
5. ST R1, [R4 + 54]
6. ADD R1, R9, R10
Dependencies due to Reg R1
RAW: (1,2), (1,4), (1,5), (3,4), (3,5)
WAR: (2,3), (2,6), (4,6), (5,6)
WAW: (1,3), (1,6), (3,6)
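The pair lists above can be enumerated mechanically. The sketch below follows the slide's convention of listing every ordered pair of instructions that touches the name R1, not only the nearest producer and consumer. `PROGRAM` and `pairs_on` are illustrative names, and a store is modelled as reading both its data register and its address register.

```python
# (destination register or None, source registers) for instructions 1..6
PROGRAM = [
    ("R1", ["R2", "R3"]),   # 1. ADD R1, R2, R3
    (None, ["R1", "R4"]),   # 2. ST  R1, [R4+50]
    ("R1", ["R5", "R6"]),   # 3. ADD R1, R5, R6
    ("R7", ["R1", "R8"]),   # 4. SUB R7, R1, R8
    (None, ["R1", "R4"]),   # 5. ST  R1, [R4+54]
    ("R1", ["R9", "R10"]),  # 6. ADD R1, R9, R10
]

def pairs_on(reg, program):
    """Return (RAW, WAR, WAW) pairs (i, j), i before j, that involve reg."""
    raw, war, waw = [], [], []
    for i, (d_i, s_i) in enumerate(program, start=1):
        for j, (d_j, s_j) in enumerate(program[i:], start=i + 1):
            if d_i == reg and reg in s_j:
                raw.append((i, j))  # i writes, later j reads
            if reg in s_i and d_j == reg:
                war.append((i, j))  # i reads, later j writes
            if d_i == reg and d_j == reg:
                waw.append((i, j))  # both write
    return raw, war, waw

raw, war, waw = pairs_on("R1", PROGRAM)
```

Only the RAW pairs carry data; the WAR and WAW pairs exist purely because the name R1 is reused, which is why renaming can remove them.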
Register Renaming: Static Scheduling
1. ADD R1, R2, R3
2. ST R1, [R4+50]
3. ADD R12, R5, R6
4. SUB R7,R12,R8
5. ST R12, [R4 + 54]
6. ADD R1, R9,R10
Rename R1 to R12 from instruction 3 until instruction 6
Dependency only within a window and not the whole program
The only remaining WAW and WAR are (1,6) and (2,6), which are far apart in program order
Increases register pressure for the compiler
Dynamic Scheduling - Tomasulo
[Figure: Instruction Fetch Unit -> Register Status Indicator and Reservation Stations -> Exec 1, Exec 2, Exec 3, Exec 4 -> Common Data Bus (CDB) -> to Reg file/Mem]
Instructions are fetched one by one and decoded to find the type of
operation and the source of operands
The Register Status Indicator records, for each register, whether its latest
value is in the register file or is still being computed by an execution
unit; in the latter case it holds that execution unit's number
If all operands are available, the operation proceeds in the allotted execution
unit; otherwise it waits in that unit's reservation station, monitoring the
CDB for the missing operands
Every execution unit writes its result, tagged with its unit number, onto the CDB,
which forwards it to all reservation stations, the register file and memory
An Example:
Instruction Fetch:
1. ADD R1, R2, R3
2. ST R1, [R4+50]
3. ADD R1, R5, R6
4. SUB R7, R1, R8
5. ST R1, [R4+54]
6. ADD R1, R9, R10
Register Status Indicator:
Reg Number  R1  R2  R3  R4  R5  R6  R7  R8  R9  R10
Status      0   0   0   0   0   0   0   0   0   0
Reservation stations (units 1 to 6): Empty | Empty | Empty | Empty | Empty | Empty
An Example:
Instruction Fetch:
1. ---
2. ST R1, [R4+50]
3. ADD R1, R5, R6
4. SUB R7, R1, R8
5. ST R1, [R4+54]
6. ADD R1, R9, R10
Issued: ADD R1, R2, R3
Register Status Indicator:
Reg Number  R1  R2  R3  R4  R5  R6  R7  R8  R9  R10
Status      1   0   0   0   0   0   0   0   0   0
Reservation stations (units 1 to 6): Ins 1 | Empty | Empty | Empty | Empty | Empty
An Example:
Instruction Fetch:
1. ---
2. ---
3. ADD R1, R5, R6
4. SUB R7, R1, R8
5. ST R1, [R4+54]
6. ADD R1, R9, R10
Issued: ST R1, [R4+50]
Register Status Indicator:
Reg Number  R1  R2  R3  R4  R5  R6  R7  R8  R9  R10
Status      1   0   0   0   0   0   0   0   0   0
Reservation stations (units 1 to 6): I 1, E | I 2, W 1 | Empty | Empty | Empty | Empty
An Example:
Instruction Fetch:
1. ---
2. ---
3. ---
4. SUB R7, R1, R8
5. ST R1, [R4+54]
6. ADD R1, R9, R10
Issued: ADD R1, R5, R6
Register Status Indicator:
Reg Number  R1  R2  R3  R4  R5  R6  R7  R8  R9  R10
Status      3   0   0   0   0   0   0   0   0   0
Reservation stations (units 1 to 6): I 1, E | I 2, W 1 | I 3, E | Empty | Empty | Empty
Note: The reservation station stores the number of the execution unit that will yield the
latest value of a register it is waiting for (for example, “I 2, W 1” means instruction 2 is waiting on unit 1).
An Example:
Instruction Fetch:
1. ---
2. ---
3. ---
4. ---
5. ST R1, [R4+54]
6. ADD R1, R9, R10
Issued: SUB R7, R1, R8
Register Status Indicator:
Reg Number  R1  R2  R3  R4  R5  R6  R7  R8  R9  R10
Status      3   0   0   0   0   0   4   0   0   0
Reservation stations (units 1 to 6): I 1, E | I 2, W 1 | I 3, E | I 4, W 3 | Empty | Empty
An Example:
Instruction Fetch:
1. ---
2. ---
3. ---
4. ---
5. ---
6. ADD R1, R9, R10
Issued: ST R1, [R4+54]
Register Status Indicator:
Reg Number  R1  R2  R3  R4  R5  R6  R7  R8  R9  R10
Status      3   0   0   0   0   0   4   0   0   0
Reservation stations (units 1 to 6): I 1, E | I 2, W 1 | I 3, E | I 4, W 3 | I 5, W 3 | Empty
An Example:
Instruction Fetch:
1. ---
2. ---
3. ---
4. ---
5. ---
6. ---
Issued: ADD R1, R9, R10
Register Status Indicator:
Reg Number  R1  R2  R3  R4  R5  R6  R7  R8  R9  R10
Status      6   0   0   0   0   0   4   0   0   0
Reservation stations (units 1 to 6): I 1, E | I 2, W 1 | I 3, E | I 4, W 3 | I 5, W 3 | I 6, E
An Example:
Instruction Fetch:
1. ADD R1, R2, R3
2. ST U1, [R4+50]
3. ADD R1, R5, R6
4. SUB R7, U3, R8
5. ST U3, [R4+54]
6. ADD R1, R9, R10
Issued: ADD R1, R9, R10
Register Status Indicator:
Reg Number  R1  R2  R3  R4  R5  R6  R7  R8  R9  R10
Status      6   0   0   0   0   0   4   0   0   0
Reservation stations (units 1 to 6): I 1, E | I 2, W 1 | I 3, E | I 4, W 3 | I 5, W 3 | I 6, E
Effectively, three instructions are executing and the others are waiting for the appropriate results. The
whole program is converted as shown above.
Note that operand forwarding and register renaming are done automatically.
On completion, execution unit 6 resets the R1 entry in the Register Status Indicator to 0. Similarly, unit
4 resets the R7 entry to 0.
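The issue sequence walked through in this example can be sketched as follows. It is a simplified model under assumptions taken from the walkthrough: instruction n is allotted to unit n, all six instructions are issued before any unit completes, and a status of 0 (here, an absent entry) means the value is in the register file. The `(op, dest, sources)` encoding and the `issue_all` helper are hypothetical.

```python
# (opcode, destination register, source registers); stores have no destination.
program = [
    ("ADD", "R1", ["R2", "R3"]),
    ("ST",  None, ["R1", "R4"]),    # ST R1, [R4+50]
    ("ADD", "R1", ["R5", "R6"]),
    ("SUB", "R7", ["R1", "R8"]),
    ("ST",  None, ["R1", "R4"]),    # ST R1, [R4+54]
    ("ADD", "R1", ["R9", "R10"]),
]


def issue_all(program):
    """Issue instructions in order, one per unit, and rename their sources."""
    status = {}    # register -> unit producing its latest value
    renamed = []
    for unit, (op, dest, srcs) in enumerate(program, start=1):
        # A source whose latest value is still being computed is replaced by
        # the producing unit's tag (U<n>); otherwise the register is read.
        operands = [f"U{status[r]}" if r in status else r for r in srcs]
        renamed.append((op, dest, operands))
        if dest is not None:
            status[dest] = unit    # this unit now owns the latest value of dest
    return renamed, status


renamed, status = issue_all(program)
for n, (op, dest, operands) in enumerate(renamed, start=1):
    print(n, op, dest, operands)
print(status)
```

The renamed operands match the converted program (U1 for instruction 2, U3 for instructions 4 and 5), and the final status holds 6 for R1 and 4 for R7.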
Dynamic Scheduling
Rearrange order of instructions to reduce stalls while
maintaining data flow
Advantages:
Compiler doesn’t need to have knowledge of
microarchitecture
Handles cases where dependencies are unknown at
compile time
Disadvantage:
Substantial increase in hardware complexity
Complicates exceptions
Dynamic Scheduling
Dynamic scheduling implies:
Out-of-order execution
Out-of-order completion
Creates the possibility for WAR and WAW hazards
Tomasulo’s Approach
Tracks when operands are available
Introduces register renaming in hardware
Minimizes WAW and WAR hazards
Register Renaming
Example:
DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D F6,0(R1)
SUB.D F8,F10,F14
MUL.D F6,F10,F8
Antidependences: ADD.D to SUB.D on F8, and S.D to MUL.D on F6
Plus a name (output) dependence on F6 between ADD.D and MUL.D
Register Renaming
Example:
DIV.D F0,F2,F4
ADD.D S,F0,F8
S.D S,0(R1)
SUB.D T,F10,F14
MUL.D F6,F10,T
Now only RAW hazards remain, which can be strictly
ordered
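A small sketch shows why renaming leaves only RAW hazards. It makes assumptions beyond the slide: every destination gets a fresh temporary (T0, T1, ...), which is more aggressive than the example's S and T, and the `rename` and `name_hazards` helpers and the `(op, dest, sources)` encoding are hypothetical.

```python
from itertools import count

# (opcode, destination, sources); S.D writes memory, so it has no destination.
program = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F6", ["F0", "F8"]),
    ("S.D",   None, ["F6", "R1"]),
    ("SUB.D", "F8", ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "F8"]),
]


def rename(program):
    """Give every destination a fresh name and redirect later readers to it."""
    mapping = {}                            # architectural name -> newest temporary
    fresh = (f"T{i}" for i in count())
    out = []
    for op, dest, srcs in program:
        new_srcs = [mapping.get(r, r) for r in srcs]   # read the newest names
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            mapping[dest] = new_dest
        out.append((op, new_dest, new_srcs))
    return out


def name_hazards(program):
    """Return (WAR pairs, WAW pairs) as 0-based (earlier, later) indices."""
    war, waw = [], []
    for j, (_, dj, _) in enumerate(program):
        if dj is None:
            continue
        for i in range(j):
            _, di, si = program[i]
            if dj in si:
                war.append((i, j))   # later write to a register read earlier
            if dj == di:
                waw.append((i, j))   # two writes to the same register
    return war, waw


renamed = rename(program)
print(name_hazards(program))   # hazards before renaming
print(name_hazards(renamed))   # hazards after renaming
```

Before renaming the check finds the two antidependences and the output dependence on F6; after renaming both lists are empty. A real implementation would also have to map the final temporary back to the architectural register.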
Register Renaming
Register renaming is provided by reservation stations (RS)
Each RS contains:
The instruction
Buffered operand values (when available)
Reservation station number of the instruction providing the
operand values (when not yet available)
An RS fetches and buffers an operand as soon as it becomes available (not
necessarily involving the register file)
Pending instructions designate the RS that will provide their input
Result values are broadcast on a result bus, called the common data bus (CDB)
Only the last output updates the register file
As instructions are issued, the register specifiers are renamed to the
reservation station that will produce the value
There may be more reservation stations than registers
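The reservation station contents and the CDB broadcast described above might be modelled as below. This is a minimal sketch, not the actual hardware organisation: the field names `vj`, `vk`, `qj`, `qk` follow a common textbook convention, and the `ReservationStation` class, the `broadcast` helper, the station tags and the numeric values are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    op: str
    vj: Optional[float] = None   # buffered operand values (when available)
    vk: Optional[float] = None
    qj: Optional[str] = None     # tags of the stations that will produce
    qk: Optional[str] = None     # the operands still missing

    def ready(self):
        return self.qj is None and self.qk is None

    def snoop(self, tag, value):
        """Capture a CDB broadcast if this station is waiting on `tag`."""
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None


def broadcast(stations, reg_status, regs, tag, value):
    """Put (tag, value) on the CDB: waiting stations and the register file listen."""
    for rs in stations:
        rs.snoop(tag, value)
    for reg, producer in list(reg_status.items()):
        if producer == tag:      # only the last writer still owns the register
            regs[reg] = value
            del reg_status[reg]


# Hypothetical state: an ADD.D waits for the result of station "Div1".
stations = [ReservationStation("ADD.D", vk=2.0, qj="Div1")]
reg_status = {"F0": "Div1", "F6": "Add1"}
regs = {"F0": 0.0, "F6": 0.0}

broadcast(stations, reg_status, regs, "Div1", 5.0)
print(stations[0].ready(), stations[0].vj, regs["F0"], reg_status)
```

Because the waiting station takes the value straight from the bus, the operand never has to pass through the register file first.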
Tomasulo’s Algorithm
Load and store buffers
Contain data and addresses,
act like reservation stations
Top-level design:
Tomasulo’s Algorithm
Three Steps:
Issue
Get next instruction from FIFO queue
If an RS is available, issue the instruction to the RS, with operand values if available
If operand values are not available, the instruction waits in the RS and tracks the units that will produce them
Execute
When an operand becomes available, store it in any reservation stations waiting for
it
When all operands are ready, begin executing the instruction
Loads and stores are maintained in program order through effective address calculation
No instruction is allowed to initiate execution until all branches that precede it in
program order have completed
Write result
Write result on CDB into reservation stations and store buffers
(Stores must wait until both address and value are received)
Example
THANK YOU!!