Pipelined Processor Design
COE 233
Logic Design and Computer Organization
Dr. Muhamed Mudawar
Pipeline Hazards
Control Hazards
Pipelined Processor Design COE 233 – Logic Design and Computer Organization © Muhamed Mudawar – slide 2
Laundry Example
Laundry Example: Three Stages
Sequential Laundry
[Figure: timeline from 6 PM to 12 AM; four loads done one after another, each load taking three 30-minute steps]
Pipelined Laundry: Start Load ASAP
[Figure: timeline from 6 PM to 9 PM; four loads overlapped, each load taking three 30-minute steps and starting 30 minutes after the previous one]
Serial versus Pipelined Execution
Consider a task that can be divided into k subtasks
The k subtasks are executed on k different stages
Each subtask requires one time unit
The total execution time of the task is k time units
Pipelining overlaps the execution of successive tasks
The k stages work in parallel on k different tasks
Tasks enter/leave pipeline at the rate of one task per time unit
[Figure: serial execution (each task passes through subtasks 1, 2, …, k before the next task starts) versus pipelined execution (a new task starts every time unit)]
Synchronous Pipeline
Uses clocked registers between stages
Upon arrival of a clock edge …
All registers hold the results of previous stages simultaneously
[Figure: stages S1, S2, …, Sk between Input and Output, separated by registers that share one Clock]
Pipeline Performance
Let ti = time delay in stage Si
Clock cycle t = max(ti) is the maximum stage delay
Clock frequency f = 1/t = 1/max(ti)
A pipeline can process n tasks in k + n – 1 cycles
k cycles are needed to complete the first task
n – 1 cycles are needed to complete the remaining n – 1 tasks
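The cycle count above can be checked with a short sketch. The function names are illustrative, and the speedup formula assumes that the serial version needs k time units per task, as stated earlier for the k subtasks.

```python
def pipeline_cycles(k: int, n: int) -> int:
    """Cycles to process n tasks on a k-stage pipeline.

    k cycles complete the first task; each of the remaining
    n - 1 tasks completes one cycle after the previous one.
    """
    return k + n - 1


def serial_cycles(k: int, n: int) -> int:
    """Cycles without pipelining: every task takes all k time units."""
    return n * k


def speedup(k: int, n: int) -> float:
    """Serial time divided by pipelined time (assumes equal stage delays)."""
    return serial_cycles(k, n) / pipeline_cycles(k, n)


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, pipeline_cycles(5, n), round(speedup(5, n), 2))
```

As n grows, the speedup approaches k, the number of stages.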
MIPS Processor Pipeline
Single-Cycle vs Pipelined Performance
Consider a 5-stage instruction execution in which …
Instruction fetch = ALU operation = Data memory access = 200 ps
Register read = register write = 150 ps
What is the clock cycle of the single-cycle processor?
What is the clock cycle of the pipelined processor?
What is the speedup factor of pipelined execution?
Solution
Single-Cycle Clock = 200+150+200+200+150 = 900 ps
Single-Cycle versus Pipelined – cont’d
Pipelined clock cycle = max(200, 150) = 200 ps
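The arithmetic of this example can be reproduced with the sketch below. The speedup line is an assumption-based estimate: it takes both processors to complete one instruction per clock cycle in steady state, so the speedup is the ratio of the two clock cycles.

```python
# Stage delays in picoseconds, taken from the example.
stage_delays_ps = {
    "instruction fetch": 200,
    "register read": 150,
    "ALU operation": 200,
    "data memory access": 200,
    "register write": 150,
}

# Single-cycle: one long cycle that covers every stage in sequence.
single_cycle_clock_ps = sum(stage_delays_ps.values())

# Pipelined: the clock must fit the slowest stage.
pipelined_clock_ps = max(stage_delays_ps.values())

# Assumption: one instruction completes per cycle in both designs.
speedup = single_cycle_clock_ps / pipelined_clock_ps

print(single_cycle_clock_ps, pipelined_clock_ps, speedup)
```

The speedup is below the stage count of 5 because the stage delays are not balanced.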
Pipeline Performance Summary
Pipelining doesn’t improve latency of a single instruction
Next . . .
Pipeline Hazards
Control Hazards
Single-Cycle Datapath
Shown below is the single-cycle datapath
How to pipeline this single-cycle datapath?
[Figure: single-cycle datapath with the PC, instruction memory, register file (RA, RB, RW, BusA, BusB, BusW), immediate extender (ExtOp), ALU, data memory, and the Next PC and Branch Target Address adders]
Pipelined Datapath
Pipeline registers are shown in green, including the PC
Same clock edge updates all pipeline registers and PC
In addition to updating register file and data memory (for store)
[Figure: pipelined datapath divided into five stages: IF = Instruction Fetch, ID = Instruction Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back. Pipeline registers separate the stages.]
Problem with Register Destination
Instruction in ID stage is different from the one in WB stage
The WB stage should write to the destination register of its own instruction
But the register file receives the destination register of the instruction in the ID stage
[Figure: pipelined datapath in which the RW input of the register file comes from the instruction in the ID stage]
Pipelining the Destination Register
Destination Register should be pipelined from ID to WB
The WB stage writes back data knowing the destination register
[Figure: pipelined datapath in which the destination register number is carried through the pipeline registers (Rd2, Rd3, Rd4) and drives the RW input of the register file from the WB stage]
Graphically Representing Pipelines
Multiple instruction execution over multiple clock cycles
Instructions are listed in execution order from top to bottom
Clock cycles move from left to right
Figure shows the use of resources at each stage and each cycle
[Figure: resource usage of each instruction over clock cycles CC1 to CC8]
Instruction-Time Diagram
Instruction-Time Diagram shows:
Which instruction occupies which stage in each clock cycle
Instruction flow is pipelined over the 5 stages
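An instruction-time diagram can be generated with a few lines of code. The sketch below assumes an ideal pipeline with no stalls, in which instruction i enters IF in cycle i + 1; the instruction names passed in are placeholders.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def instruction_time_diagram(instructions):
    """Return text rows: a header of clock cycles, then one row per instruction."""
    total_cycles = len(STAGES) + len(instructions) - 1
    width = max(len(name) for name in instructions)
    header = " " * width + " " + " ".join(
        f"CC{c:<3}" for c in range(1, total_cycles + 1)
    )
    rows = [header.rstrip()]
    for i, name in enumerate(instructions):
        cells = [""] * total_cycles
        # Instruction i occupies stage s during cycle i + s + 1.
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage
        line = f"{name:<{width}} " + " ".join(f"{cell:<5}" for cell in cells)
        rows.append(line.rstrip())
    return rows


if __name__ == "__main__":
    for row in instruction_time_diagram(["lw", "add", "sub", "or", "sw"]):
        print(row)
```

Five instructions on the five-stage pipeline fill cycles CC1 to CC9, which matches k + n – 1 = 9.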
[Figure: instruction-time diagram over clock cycles CC1 to CC9]
Control Signals
[Figure: pipelined datapath (IF, ID, EX, MEM, WB) annotated with its control signals]
Pipelined Control
[Figure: pipelined datapath with control. The Main & ALU Control unit decodes Op and func and produces PCSrc, ExtOp, RegDst, RegWr, ALUSrc, ALUOp, MemRd, MemWr, and WBdata. The EX, MEM, and WB control signals are pipelined just like data.]
Pipelined Control – Cont'd
ID stage generates all the control signals
Pipeline the control signals as the instruction moves
Extend the pipeline registers to include the control signals
J: PCSrc = 1 (jump target); all other control signals are 0 or X (don't care)
PCSrc = 0 or 2 (BTA) for BEQ and BNE, depending on the zero flag
Next . . .
Pipeline Hazards
Control Hazards
Pipeline Hazards
Hazards: situations that would cause incorrect execution
If next instruction were launched during its designated clock cycle
1. Structural hazards
Caused by resource contention
Using same resource by two instructions during the same cycle
2. Data hazards
An instruction may compute a result needed by next instruction
Data hazards are caused by data dependencies between instructions
3. Control hazards
Caused by instructions that change control flow (branches/jumps)
Delays in changing the flow of control
Hazards complicate pipeline control and limit performance
Structural Hazards
Problem
Attempt to use the same hardware resource by two different
instructions during the same clock cycle
Example
Writing back ALU result in stage 4
Conflicts with writing load data in stage 5
Structural Hazard
Two instructions are attempting to write the register file during the same cycle
Resolving Structural Hazards
Serious Hazard:
Hazard cannot be ignored
Data Hazards
Dependency between instructions causes a data hazard
The dependent instructions are close to each other
Pipelined execution might change the order of operand access
Example of a RAW Data Hazard
Time (cycles) CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8
value of $s2 10 10 10 10 10 20 20 20
The 3 stall cycles insert 3 bubbles (No operations) into the ALU
Implementing Forwarding
Two multiplexers added at the inputs of A & B registers
Data from ALU stage, MEM stage, and WB stage is fed back
[Figure: datapath with two multiplexers, controlled by ForwardA and ForwardB, at the inputs of the A and B registers. Each multiplexer selects between the register file output (input 0) and three values fed back from later stages (inputs 1, 2, 3).]
Forwarding Control Signals
Signal          Explanation
ForwardA = 0    First ALU operand comes from register file = Value of (Rs)
ForwardB = 0    Second ALU operand comes from register file = Value of (Rt)
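The selection of ForwardA and ForwardB can be sketched as a priority check against the destination registers of the instructions further down the pipeline. This is a minimal sketch under stated assumptions: select value 2 (MEM stage) appears in the forwarding example of these slides, while 1 = ALU stage and 3 = WB stage are assumed encodings; register $0 is assumed never to be forwarded; and `forward_select` is a hypothetical helper name.

```python
def forward_select(src_reg, rd2, regwr2, rd3, regwr3, rd4, regwr4):
    """Return the multiplexer select (0..3) for one source register.

    rd2, rd3, rd4: destination registers of the instructions in the
    ALU, MEM, and WB stages; regwr2..regwr4: their RegWr signals.
    The most recent producer (ALU stage) has the highest priority.
    """
    if src_reg == 0:          # assumption: $0 is constant, never forwarded
        return 0
    if regwr2 and src_reg == rd2:
        return 1              # assumed encoding: from ALU stage
    if regwr3 and src_reg == rd3:
        return 2              # from MEM stage
    if regwr4 and src_reg == rd4:
        return 3              # assumed encoding: from WB stage
    return 0                  # value comes from the register file


# Example: sub $t3,$t4,$t7 in ID, ori $t7,... in ALU stage, lw $t4,... in MEM stage.
T4, T7 = 12, 15               # MIPS register numbers of $t4 and $t7
forward_a = forward_select(T4, rd2=T7, regwr2=True, rd3=T4, regwr3=True, rd4=0, regwr4=False)
forward_b = forward_select(T7, rd2=T7, regwr2=True, rd3=T4, regwr3=True, rd4=0, regwr4=False)
print(forward_a, forward_b)
```

Checking the ALU stage first matters when two in-flight instructions write the same register: the younger result is the one the consumer must see.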
Forwarding Example
Instruction sequence:
lw $t4, 4($t0)
ori $t7, $t1, 2
sub $t3, $t4, $t7
When sub instruction is in the ID stage
ori will be in the ALU stage
lw will be in the MEM stage
ForwardA = 2 (from MEM stage)
[Figure: forwarding datapath with sub $t3,$t4,$t7 in the ID stage, ori $t7,$t1,2 in the ALU stage, and lw $t4,4($t0) in the MEM stage]
[Figure: forwarding datapath with a Hazard Detect & Forward unit. Its inputs include Rs, Rt, the pipelined destination registers Rd2, Rd3, Rd4, and the pipelined RegWr signals; its outputs are ForwardA and ForwardB.]
Next . . .
Pipeline Hazards
Control Hazards
Load Delay
Unfortunately, not all data hazards can be resolved by forwarding
Load has a delay that cannot be eliminated by forwarding
Detecting RAW Hazard after Load
Detecting a RAW hazard after a Load instruction:
The load instruction will be in the EX stage
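The detection condition can be sketched as follows. The signal names follow the datapath figures (MemRd, Rd2, Rs, Rt), but the exact condition is an assumption: a stall is requested when the instruction in the EX stage is a load and its destination matches a source register of the instruction in the ID stage. The sketch also assumes that the ID instruction reads both Rs and Rt.

```python
def load_use_stall(rs, rt, rd2, memrd_ex):
    """Return True when the instruction in ID must wait for a load in EX.

    rs, rt   : source registers of the instruction in the ID stage
    rd2      : destination register of the instruction in the EX stage
    memrd_ex : MemRd of the instruction in the EX stage (True for a load)
    """
    # Assumption: register $0 is never a real destination.
    return bool(memrd_ex and rd2 != 0 and rd2 in (rs, rt))


# lw $t4, 4($t0) in EX, sub $t3, $t4, $t7 in ID: one stall cycle is needed.
print(load_use_stall(rs=12, rt=15, rd2=12, memrd_ex=True))
```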
Showing Stall Cycles
Stall cycles can be shown on instruction-time diagram
Hazard is detected in the Decode stage
Stall indicates that instruction is delayed
Instruction fetching is also delayed after a stall
Example:
Data forwarding is shown using green arrows
[Figure: instruction-time diagram with stall cycles, over clock cycles CC1 to CC10]
Hazard Detecting and Forwarding Logic
[Figure: datapath with a Hazard Detect, Forward, and Stall unit. Its inputs include Rs, Rt, Rd2, Rd3, Rd4, RegWr, and MemRd. On a stall it asserts Disable PC and Disable IR, and a multiplexer selects Bubble = 0 in place of the control signals entering the EX stage.]
Code Scheduling to Avoid Stalls
Compilers reorder code to avoid load stalls
Consider the translation of the following statements:
A = B + C; D = E - F; // A thru F are in Memory
Slow code:
    lw  $t0, 4($s0)     # &B = 4($s0)
    lw  $t1, 8($s0)     # &C = 8($s0)
    add $t2, $t0, $t1   # stall cycle
    sw  $t2, 0($s0)     # &A = 0($s0)
    lw  $t3, 16($s0)    # &E = 16($s0)
    lw  $t4, 20($s0)    # &F = 20($s0)
    sub $t5, $t3, $t4   # stall cycle
    sw  $t5, 12($s0)    # &D = 12($s0)
Fast code: No Stalls
    lw  $t0, 4($s0)
    lw  $t1, 8($s0)
    lw  $t3, 16($s0)
    lw  $t4, 20($s0)
    add $t2, $t0, $t1
    sw  $t2, 0($s0)
    sub $t5, $t3, $t4
    sw  $t5, 12($s0)
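The effect of scheduling can be checked by counting load stalls in each sequence. This is a rough sketch, assuming full forwarding so that the only stall occurs when an instruction reads a register loaded by the instruction immediately before it. The tuple encoding (opcode, destination, sources) is made up for illustration, and the final store of D is included in both sequences.

```python
def count_load_stalls(program):
    """Count one stall for each load whose result is used by the next instruction."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        op, dest, _ = prev
        if op == "lw" and dest in curr[2]:
            stalls += 1
    return stalls


# (opcode, destination register or None, source registers)
slow = [
    ("lw", "$t0", ("$s0",)),
    ("lw", "$t1", ("$s0",)),
    ("add", "$t2", ("$t0", "$t1")),
    ("sw", None, ("$t2", "$s0")),
    ("lw", "$t3", ("$s0",)),
    ("lw", "$t4", ("$s0",)),
    ("sub", "$t5", ("$t3", "$t4")),
    ("sw", None, ("$t5", "$s0")),
]

# Same instructions, with all four loads moved to the front.
fast = [slow[0], slow[1], slow[4], slow[5], slow[2], slow[3], slow[6], slow[7]]

print(count_load_stalls(slow), count_load_stalls(fast))
```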
Name Dependence: Write After Read
Instruction J should write its result after it is read by I
Called anti-dependence by compiler writers
I: sub $t4, $t1, $t3 # $t1 is read
J: add $t1, $t2, $t3 # $t1 is written
Results from reuse of the name $t1
NOT a data hazard in the 5-stage pipeline because:
Reads are always in stage 2
Writes are always in stage 5, and
Instructions are processed in order
Pipeline Hazards
Control Hazards
Control Hazards
Jump and Branch can cause great performance loss
Jump instruction needs only the jump target address
Branch instruction needs two things:
Branch Result Taken or Not Taken
Branch Target Address
PC + 4 If Branch is NOT taken
PC + 4 + 4 × immediate If Branch is Taken
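The two cases for the next PC can be written as a small function. This sketch assumes a byte-addressed PC and a 16-bit signed immediate, as in MIPS; `sign_extend16` and `next_pc` are illustrative names.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def next_pc(pc, imm16, taken):
    """Address of the instruction that follows a branch located at pc."""
    if taken:
        return pc + 4 + 4 * sign_extend16(imm16)   # branch target address
    return pc + 4                                  # fall through


print(hex(next_pc(0x1000, 3, taken=True)))    # forward branch
print(hex(next_pc(0x1000, 3, taken=False)))   # not taken
```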
1-Cycle Jump Delay
Control logic detects a Jump instruction in the 2nd Stage
[Figure: instruction-time diagram. J L1 is fetched (IF) and decoded (ID); the target instruction at L1 is fetched one cycle late, using the Jump Target Addr, and then proceeds through IF, Reg, ALU, DM, Reg.]
2-Cycle Branch Delay
Control logic detects a Branch instruction in the 2nd Stage
ALU computes the Branch outcome in the 3rd Stage
Next1 and Next2 instructions will be fetched anyway
Convert Next1 and Next2 into bubbles if branch is taken
[Figure: instruction-time diagram over cc1 to cc7. Once the Branch Target Addr is known, the target instruction at L1 is fetched two cycles late and proceeds through IF, Reg, ALU, DM.]
If Branch is NOT Taken . . .
Branches can be predicted to be NOT taken
No wasted cycles
Pipelined Jump and Branch
[Figure: pipelined datapath with jump and branch support. PCSrc selects the next PC. Kill1 kills the next instruction by replacing it with a bubble (NOP). Kill2 is asserted by a taken branch, which kills two instructions. The Forward & Stall unit uses Rs, Rt, Rd2, Rd3, Rd4, RegWr, and MemRd.]
PC Control for Pipelined Jump and Branch
if ((BEQ && Zero) || (BNE && !Zero))
{ Jmp=0; Br=1; Kill1=1; Kill2=1; }
else if (J)
{ Jmp=1; Br=0; Kill1=1; Kill2=0; }
else
{ Jmp=0; Br=0; Kill1=0; Kill2=0; }
Equivalent logic equations (inputs: BEQ, BNE, J, Zero; outputs: Kill2, Kill1, Br, Jmp):
Br = (BEQ · Zero) + (BNE · !Zero)
Jmp = J · !Br
Kill1 = J + Br
Kill2 = Br
PCSrc = { Br, Jmp } // 0, 1, or 2
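The PC control logic can be exercised in software to confirm that PCSrc only takes the values 0, 1, or 2. The sketch below is a direct transcription into Python; `pc_control` is an illustrative name, and PCSrc is formed by concatenating Br (high bit) and Jmp (low bit).

```python
def pc_control(beq, bne, j, zero):
    """Compute the PC control outputs from the decoded inputs.

    A taken branch has priority over a jump, so Jmp is forced to 0
    when Br is 1 and PCSrc can never be 3.
    """
    br = bool((beq and zero) or (bne and not zero))
    jmp = bool(j and not br)
    kill1 = bool(j or br)      # a jump or a taken branch kills the next instruction
    kill2 = br                 # a taken branch kills a second instruction
    pcsrc = (int(br) << 1) | int(jmp)
    return {"Br": br, "Jmp": jmp, "Kill1": kill1, "Kill2": kill2, "PCSrc": pcsrc}


if __name__ == "__main__":
    print(pc_control(beq=True, bne=False, j=False, zero=True))   # taken BEQ
    print(pc_control(beq=False, bne=False, j=True, zero=False))  # jump
```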
Jump and Branch Impact on CPI
Base CPI = 1 without counting jump and branch
Unconditional Jump = 5%, Conditional branch = 20%
90% of conditional branches are taken
Jump kills next instruction, Taken Branch kills next two
What is the effect of jump and branch on the CPI?
Solution:
Jump adds 1 wasted cycle for 5% of instructions = 1 × 0.05
Branch adds 2 wasted cycles for 20% × 90% of instructions
= 2 × 0.2 × 0.9 = 0.36
New CPI = 1 + 0.05 + 0.36 = 1.41 (due to wasted cycles)
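The same calculation can be parameterized to try other instruction mixes. This is a minimal sketch; the function and parameter names are illustrative, and the default penalties are the 1-cycle jump delay and 2-cycle taken-branch delay used in this example.

```python
def cpi_with_control_hazards(base_cpi, jump_frac, branch_frac, taken_frac,
                             jump_penalty=1, taken_branch_penalty=2):
    """Base CPI plus the average wasted cycles per instruction."""
    jump_cycles = jump_frac * jump_penalty
    branch_cycles = branch_frac * taken_frac * taken_branch_penalty
    return base_cpi + jump_cycles + branch_cycles


cpi = cpi_with_control_hazards(base_cpi=1.0, jump_frac=0.05,
                               branch_frac=0.20, taken_frac=0.90)
print(round(cpi, 2))
```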