Pipelined Processor Design
ICS 233
Computer Architecture and Assembly Language
Prof. Muhamed Mudawar
College of Computer Sciences and Engineering
Pipelined Processor Design
ICS 233 KFUPM
Muhamed Mudawar slide 1
Presentation Outline
Pipelining versus Serial Execution
Pipelined Datapath
Pipelined Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall Unit
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Pipelining Example
Laundry Example: Three Stages
1. Wash dirty load of clothes
2. Dry wet clothes
3. Fold and put clothes into drawers
Each stage takes 30 minutes to complete
Four loads of clothes (A, B, C, D) to wash, dry, and fold
Sequential Laundry
[Figure: timeline from 6 PM to 12 AM in 30-minute steps; loads A, B, C, and D are washed, dried, and folded one after another]
Sequential laundry takes 6 hours for 4 loads
Intuitively, we can use pipelining to speed up laundry
Pipelined Laundry: Start Load ASAP
[Figure: timeline from 6 PM to 9 PM in 30-minute steps; loads A, B, C, and D overlap, with a new load starting every 30 minutes]
Pipelined laundry takes 3 hours for 4 loads
Speedup factor is 2 for 4 loads
Time to wash, dry, and fold one load is still the same (90 minutes)
Serial Execution versus Pipelining
Consider a task that can be divided into k subtasks
The k subtasks are executed on k different stages
Each subtask requires one time unit
The total execution time of the task is k time units
Pipelining is to start a new task before finishing the previous one
The k stages work in parallel on k different tasks
Tasks enter/leave pipeline at the rate of one task per time unit
[Figure: subtasks 1, 2, ..., k of successive tasks, executed back to back without pipelining and overlapped with pipelining]
Without Pipelining
One completion every k time units
With Pipelining
One completion every 1 time unit
Synchronous Pipeline
Uses clocked registers between stages
Upon arrival of a clock edge
All registers hold the results of previous stages
simultaneously
The pipeline stages are combinational logic circuits
It is desirable to have balanced stages
Approximately equal delay in all stages
[Figure: Input, Register, S1, Register, S2, ..., Sk, Register, Output; all registers are driven by a common Clock]
Clock period is determined by the maximum stage delay
Pipeline Performance
Let $\tau_i$ = time delay in stage $S_i$
Clock cycle $\tau = \max(\tau_i)$ is the maximum stage delay
Clock frequency $f = 1/\tau = 1/\max(\tau_i)$
A pipeline can process n tasks in k + n - 1 cycles
k cycles are needed to complete the first task
n - 1 cycles are needed to complete the remaining n - 1 tasks
Ideal speedup of a k-stage pipeline over serial execution
$$S_k = \frac{\text{Serial execution in cycles}}{\text{Pipelined execution in cycles}} = \frac{nk}{k + n - 1}, \qquad S_k \to k \text{ for large } n$$
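The cycle counts above can be checked numerically. The following is a minimal Python sketch, assuming every stage takes exactly one cycle; the function names are illustrative and not part of the course material. It reproduces the laundry example (k = 3 stages, n = 4 loads) and shows the speedup approaching k for large n.

```python
def serial_cycles(k: int, n: int) -> int:
    """Cycles to run n tasks of k subtasks each, one task at a time."""
    return n * k


def pipelined_cycles(k: int, n: int) -> int:
    """k cycles for the first task, then one more cycle per remaining task."""
    return k + n - 1


def speedup(k: int, n: int) -> float:
    """Ideal speedup S_k = nk / (k + n - 1)."""
    return serial_cycles(k, n) / pipelined_cycles(k, n)


# Laundry example: 3 stages, 4 loads -> 12 versus 6 time units
print(serial_cycles(3, 4), pipelined_cycles(3, 4), speedup(3, 4))
# A 5-stage pipeline with many tasks approaches a speedup of 5
print(speedup(5, 1_000_000))
```

With few tasks the fill time dominates, which is why four loads of laundry see a speedup of only 2 on a 3-stage pipeline.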
Single-Cycle Datapath
Shown below is the single-cycle datapath
How to pipeline this single-cycle datapath?
Answer: Introduce registers at the end of each stage
IF = Instruction Fetch
ID = Decode and Register Fetch
EX = Execute and Calculate Address
MEM = Memory Access
WB = Write Back
[Figure: single-cycle datapath with PC, Inc, Next PC, Instruction Memory, Register File (Rs, Rt, Rd, RW, BusW), Ext (Imm16), ALU (zero, ALU result), Data Memory (Address, Data_in), and multiplexers]
Pipelined Datapath
Pipeline registers, in green, separate each pipeline stage
Pipeline registers are labeled by the stages they separate
Is there a problem with the register destination address?
IF = Instruction Fetch
ID = Decode
EX = Execute
MEM = Memory
WB
[Figure: pipelined datapath; the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB separate the five stages, and RW is taken directly from the instruction in the decode stage]
Corrected Pipelined Datapath
Destination register number should come from MEM/WB
Along with the data during the write back stage
Destination register number is passed from ID to WB stage
[Figure: pipelined datapath with stages IF, ID, EX, MEM, WB and pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; the destination register number travels through the pipeline registers and reaches RW from MEM/WB]
Graphically Representing Pipelines
Multiple instruction execution over multiple clock cycles
Instructions are listed in execution order from top to bottom
Clock cycles move from left to right
Figure shows the use of resources at each stage and each cycle

Time (in cycles)   CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8
lw  $6, 8($5)      IM   Reg  ALU  DM   Reg
add $1, $2, $3          IM   Reg  ALU  DM   Reg
ori $4, $3, 7                IM   Reg  ALU  DM   Reg
sub $5, $2, $3                    IM   Reg  ALU  DM   Reg
sw  $2, 10($3)                         IM   Reg  ALU  DM
Instruction-Time Diagram
Diagram shows:
Which instruction occupies what stage at each clock cycle
Instruction execution is pipelined over the 5 stages
Up to five instructions can be in execution during a single cycle
ALU instructions skip the MEM stage.
Store instructions skip the WB stage

Instruction Order   CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9
lw  $7, 8($3)       IF   ID   EX   MEM  WB
lw  $6, 8($5)            IF   ID   EX   MEM  WB
ori $4, $3, 7                 IF   ID   EX   -    WB
sub $5, $2, $3                     IF   ID   EX   -    WB
sw  $2, 10($3)                          IF   ID   EX   MEM  -
Single-Cycle vs Pipelined Performance
Consider a 5-stage instruction execution in which
Instruction fetch = ALU operation = Data memory access = 200 ps
Register read = register write = 150 ps
What is the single-cycle non-pipelined time?
What is the pipelined cycle time?
What is the speedup factor for pipelined execution?
Solution
Non-pipelined cycle = 200+150+200+200+150 = 900 ps
[Figure: two instructions executed back to back, each passing through IF, Reg, ALU, MEM, Reg in one 900 ps cycle]
Single-Cycle versus Pipelined
Pipelined cycle time = max(200, 150) = 200 ps
[Figure: three overlapped instructions, each stage (IF, Reg, ALU, MEM, Reg) occupying one 200 ps cycle]
CPI for pipelined execution = 1
One instruction completes each cycle (ignoring pipeline fill)
Speedup of pipelined execution = 900 ps / 200 ps = 4.5
Instruction count and CPI are equal in both cases
Speedup factor is less than 5 (number of pipeline stages)
Because the pipeline stages are not balanced
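The arithmetic of this example can be restated as a short Python sketch. The stage delays are the ones given in the problem; the dictionary and function names are illustrative assumptions. The second call shows that perfectly balanced stages with the same total delay would reach the ideal speedup of 5.

```python
def cycle_times(stage_delays_ps):
    """Return (single-cycle time, pipelined cycle time, speedup) for the given stage delays."""
    single_cycle = sum(stage_delays_ps.values())   # one long cycle covers every stage
    pipelined = max(stage_delays_ps.values())      # clock is set by the slowest stage
    return single_cycle, pipelined, single_cycle / pipelined


# Delays from the example above
unbalanced = {"IF": 200, "Reg read": 150, "ALU": 200, "MEM": 200, "Reg write": 150}
print(cycle_times(unbalanced))

# Hypothetical balanced design with the same 900 ps total
balanced = {stage: 180 for stage in unbalanced}
print(cycle_times(balanced))
```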
Control Signals
[Figure: pipelined datapath annotated with the control signals RegDst, RegWrite, ALUSrc, ALUOp, beq, bne, j, PCSrc, MemRead, MemWrite, and MemtoReg; the ALU Control block is driven by func and ALUOp]
Similar to control signals used in the single-cycle datapath
Control Signals (cont'd)
Decode stage signal: RegDst
Execute stage signals: ALUSrc, ALUOp, Beq, Bne, J
Memory stage signals: MemRead, MemWrite, MemtoReg
Writeback signal: RegWrite

Op       RegDst   ALUSrc   ALUOp
R-Type   1=Rd     0=Reg    R-Type
addi     0=Rt     1=Imm    ADD
slti     0=Rt     1=Imm    SLT
andi     0=Rt     1=Imm    AND
ori      0=Rt     1=Imm    OR
lw       0=Rt     1=Imm    ADD
sw       x        1=Imm    ADD
beq      x        0=Reg    SUB
bne      x        0=Reg    SUB
j        x        x        x
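A control table like the one above can be sketched as a lookup structure. In the Python sketch below, the RegDst, ALUSrc, and ALUOp values follow the table; the branch, memory, and write-back values are assumptions filled in from what each instruction does (for example, only lw reads data memory), and `None` stands for a don't-care. The helper names are hypothetical.

```python
X = None  # don't care


def signals(reg_dst, alu_src, alu_op, beq=0, bne=0, j=0,
            mem_read=0, mem_write=0, mem_to_reg=0, reg_write=0):
    """Bundle the control signals generated for one instruction in the ID stage."""
    return {
        "RegDst": reg_dst, "ALUSrc": alu_src, "ALUOp": alu_op,
        "Beq": beq, "Bne": bne, "J": j,
        "MemRead": mem_read, "MemWrite": mem_write,
        "MemtoReg": mem_to_reg, "RegWrite": reg_write,
    }


CONTROL = {
    "R-Type": signals(1, 0, "R-Type", reg_write=1),
    "addi": signals(0, 1, "ADD", reg_write=1),
    "slti": signals(0, 1, "SLT", reg_write=1),
    "andi": signals(0, 1, "AND", reg_write=1),
    "ori": signals(0, 1, "OR", reg_write=1),
    "lw": signals(0, 1, "ADD", mem_read=1, mem_to_reg=1, reg_write=1),
    "sw": signals(X, 1, "ADD", mem_write=1, mem_to_reg=X),
    "beq": signals(X, 0, "SUB", beq=1, mem_to_reg=X),
    "bne": signals(X, 0, "SUB", bne=1, mem_to_reg=X),
    "j": signals(X, X, X, j=1, mem_to_reg=X),
}

# Only loads and stores touch data memory; stores, branches, and jumps never write a register
print([op for op, s in CONTROL.items() if s["MemRead"] or s["MemWrite"]])
print([op for op, s in CONTROL.items() if not s["RegWrite"]])
```

The signals that change machine state (MemWrite, RegWrite, and the branch signals) are never don't-cares, since a wrong value there would corrupt memory, registers, or the PC.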
Pipelined Control
[Figure: pipelined datapath with Main Control in the ID stage; the generated signals are grouped as EX (ALUSrc, ALUOp), M (MemRead, MemWrite), and WB (RegWrite, MemtoReg) and carried through ID/EX, EX/MEM, and MEM/WB]
Pass control signals along pipeline just like the data
Pipelined Control Cont'd
ID stage generates all the control signals
Pipeline the control signals as the instruction moves
Extend the pipeline registers to include the control signals
Each stage uses some of the control signals
Instruction Decode and Register Fetch
Control signals are generated
RegDst is used in this stage
Execution Stage => ALUSrc and ALUOp
Next PC uses Beq, Bne, J and zero signals for branch control
Memory Stage => MemRead, MemWrite, and MemtoReg
Write Back Stage => RegWrite is used in this stage
Pipelining Summary
Pipelining doesn't improve latency of a single instruction
However, it improves throughput of entire workload
Instructions are initiated and completed at a higher rate
In a k-stage pipeline, k instructions operate in parallel
Overlapped execution using multiple hardware resources
Potential speedup = number of pipeline stages k
Unbalanced lengths of pipeline stages reduce speedup
Pipeline rate is limited by slowest pipeline stage
Also, time to fill and drain pipeline reduces speedup
Pipeline Hazards
Hazards: situations that would cause incorrect execution
If next instruction were launched during its designated clock cycle
1. Structural hazards
Caused by resource contention
Using same resource by two instructions during the same cycle
2. Data hazards
An instruction may compute a result needed by next instruction
Hardware can detect dependencies between instructions
3. Control hazards
Caused by instructions that change control flow (branches/jumps)
Delays in changing the flow of control
Hazards complicate pipeline control and limit performance
Structural Hazards
Problem
Attempt to use the same hardware resource by two different instructions during the same cycle
Example
Writing back ALU result in stage 4
Conflict with writing load data in stage 5
Structural Hazard: two instructions are attempting to write the register file during same cycle

Instructions        CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9
lw  $6, 8($5)       IF   ID   EX   MEM  WB
ori $4, $3, 7            IF   ID   EX   WB
sub $5, $2, $3                IF   ID   EX   WB
sw  $2, 10($3)                     IF   ID   EX   MEM
Resolving Structural Hazards
Serious Hazard:
Hazard cannot be ignored
Solution 1: Delay Access to Resource
Must have mechanism to delay instruction access to resource
Delay all write backs to the register file to stage 5
ALU instructions bypass stage 4 (memory) without doing anything
Solution 2: Add more hardware resources (more costly)
Add more hardware to eliminate the structural hazard
Redesign the register file to have two write ports
First write port can be used to write back ALU results in stage 4
Second write port can be used to write back load data in stage 5
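The effect of Solution 1 can be illustrated with a small Python sketch that finds cycles in which more than one instruction writes the register file. It assumes one instruction is fetched per cycle with no stalls, and the instruction kinds ("load", "alu", "store") and function names are illustrative only.

```python
from collections import Counter


def writeback_cycles(kinds, alu_wb_stage):
    """Map instruction index -> cycle in which it writes the register file.

    Instruction i (0-based) is fetched in cycle i + 1, so it is in stage s during cycle i + s.
    Loads write in stage 5; ALU instructions write in `alu_wb_stage`; stores never write.
    """
    cycles = {}
    for i, kind in enumerate(kinds):
        if kind == "load":
            cycles[i] = i + 5
        elif kind == "alu":
            cycles[i] = i + alu_wb_stage
    return cycles


def write_conflicts(kinds, alu_wb_stage):
    """Cycles in which two or more instructions need the single write port."""
    counts = Counter(writeback_cycles(kinds, alu_wb_stage).values())
    return sorted(cycle for cycle, n in counts.items() if n > 1)


program = ["load", "alu", "alu", "store"]   # lw, ori, sub, sw
print(write_conflicts(program, alu_wb_stage=4))  # ALU results written in stage 4
print(write_conflicts(program, alu_wb_stage=5))  # all write backs delayed to stage 5
```

With ALU write back in stage 4, the load and the instruction after it both write in cycle 5; moving every write back to stage 5 removes the conflict without adding a second write port.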
Data Hazards
Dependency between instructions causes a data hazard
The dependent instructions are close to each other
Pipelined execution might change the order of operand access
Read After Write RAW Hazard
Given two instructions I and J, where I comes before J
Instruction J should read an operand after it is written by I
Called a data dependence in compiler terminology
I: add $1, $2, $3    # $1 is written
J: sub $4, $1, $3    # $1 is read
Hazard occurs when J reads the operand before I writes it
Example of a RAW Data Hazard

Time (cycles)      CC1  CC2  CC3  CC4  CC5    CC6  CC7  CC8
value of $2        10   10   10   10   10/20  20   20   20
sub $2, $1, $3     IM   Reg  ALU  DM   Reg
and $4, $2, $5          IM   Reg  ALU  DM     Reg
or  $6, $3, $2               IM   Reg  ALU    DM   Reg
add $7, $2, $2                    IM   Reg    ALU  DM   Reg
sw  $8, 10($2)                         IM     Reg  ALU  DM

Result of sub is needed by and, or, add, & sw instructions
Instructions and & or will read old value of $2 from reg file
During CC5, $2 is written and the new value is read
Solution 1: Stalling the Pipeline

Time (in cycles)   CC1  CC2  CC3     CC4     CC5    CC6  CC7  CC8
value of $2        10   10   10      10      10/20  20   20   20
sub $2, $1, $3     IM   Reg  ALU     DM      Reg
and $4, $2, $5          IM   bubble  bubble  Reg    ALU  DM   Reg
or  $6, $3, $2                               IM     Reg  ALU  DM

The and instruction cannot fetch $2 until CC5
The and instruction remains in the IF/ID register until CC5
Two bubbles are inserted into ID/EX at end of CC3 & CC4
Bubbles are NOP instructions: do not modify registers or memory
Bubbles delay instruction execution and waste clock cycles
Solution 2: Forwarding ALU Result
The ALU result is forwarded (fed back) to the ALU input
No bubbles are inserted into the pipeline and no cycles are wasted
ALU result exists in either EX/MEM or MEM/WB register

Time (in cycles)   CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8
sub $2, $1, $3     IM   Reg  ALU  DM   Reg
and $4, $2, $5          IM   Reg  ALU  DM   Reg
or  $6, $3, $2               IM   Reg  ALU  DM   Reg
add $7, $2, $2                    IM   Reg  ALU  DM   Reg
sw  $8, 10($2)                         IM   Reg  ALU  DM
Implementing Forwarding
Two multiplexers added at the inputs of A & B registers
ALU output in the EX stage is forwarded (fed back)
ALU result or Load data in the MEM stage is also forwarded
Two signals: ForwardA and ForwardB control forwarding
[Figure: pipelined datapath with two forwarding multiplexers, controlled by ForwardA and ForwardB, that select between the register file outputs, the ALU result, and the WriteData of the MEM stage]
RAW Hazard Detection
RAW hazards can be detected by the pipeline
Current instruction being decoded is in IF/ID register
Previous instruction is in the ID/EX register
Second previous instruction is in the EX/MEM register
RAW Hazard Conditions:
IF/ID.Rs = ID/EX.Rw or IF/ID.Rt = ID/EX.Rw: RAW hazard detected with previous instruction
IF/ID.Rs = EX/MEM.Rw or IF/ID.Rt = EX/MEM.Rw: RAW hazard detected with second previous instruction
Forwarding Unit
Forwarding unit generates ForwardA and ForwardB
That are used to control the two forwarding multiplexers
Uses Rs and Rt in IF/ID and Rw in ID/EX & EX/MEM
[Figure: pipelined datapath with a Forwarding Unit that reads Rs and Rt from IF/ID and Rw from ID/EX and EX/MEM, and drives the ForwardA and ForwardB multiplexers]
Forwarding Control Signals

Control Signal   Explanation
ForwardA = 00    First ALU operand comes from the register file
ForwardA = 01    Forwarded from the previous ALU result
ForwardA = 10    Forwarded from data memory or 2nd previous ALU result
ForwardB = 00    Second ALU operand comes from the register file
ForwardB = 01    Forwarded from the previous ALU result
ForwardB = 10    Forwarded from data memory or 2nd previous ALU result

if     (IF/ID.Rs == ID/EX.Rw and ID/EX.Rw != 0 and ID/EX.RegWrite)     ForwardA = 01
elseif (IF/ID.Rs == EX/MEM.Rw and EX/MEM.Rw != 0 and EX/MEM.RegWrite)  ForwardA = 10
else                                                                    ForwardA = 00

if     (IF/ID.Rt == ID/EX.Rw and ID/EX.Rw != 0 and ID/EX.RegWrite)     ForwardB = 01
elseif (IF/ID.Rt == EX/MEM.Rw and EX/MEM.Rw != 0 and EX/MEM.RegWrite)  ForwardB = 10
else                                                                    ForwardB = 00
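The forwarding rules above can be written as a small Python function. This is a sketch, assuming each pipeline register is modeled as a dictionary holding only the fields the forwarding unit reads; the dictionary layout and function names are illustrative, not taken from the slides.

```python
def forward_select(src, id_ex, ex_mem):
    """Return the 2-bit mux select (as a string) for one source register number."""
    # The previous instruction (now in ID/EX) has priority: it holds the newest value
    if id_ex["RegWrite"] and id_ex["Rw"] != 0 and src == id_ex["Rw"]:
        return "01"
    # Otherwise check the second previous instruction (now in EX/MEM)
    if ex_mem["RegWrite"] and ex_mem["Rw"] != 0 and src == ex_mem["Rw"]:
        return "10"
    return "00"  # no hazard: use the register file value


def forwarding_unit(if_id, id_ex, ex_mem):
    """Return (ForwardA, ForwardB) for the instruction being decoded."""
    return (forward_select(if_id["Rs"], id_ex, ex_mem),
            forward_select(if_id["Rt"], id_ex, ex_mem))


# sub $8, $4, $7 is in decode; add $7, $5, $6 is in ID/EX; lw $4, 100($9) is in EX/MEM
sub_in_decode = {"Rs": 4, "Rt": 7}
add_in_ex = {"Rw": 7, "RegWrite": True}
lw_in_mem = {"Rw": 4, "RegWrite": True}
print(forwarding_unit(sub_in_decode, add_in_ex, lw_in_mem))
```

Register $0 is excluded because it always reads as zero, so a write to it must never be forwarded.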
Forwarding Example
Instruction sequence:
lw  $4, 100($9)
add $7, $5, $6
sub $8, $4, $7
When lw reaches the MEM stage
add will be in the ALU stage
sub will be in the Decode stage
ForwardA = 10
Forward data from MEM stage
ForwardB = 01
Forward ALU result from ALU stage
[Figure: forwarding datapath with lw $4,100($9) in the MEM stage, add $7,$5,$6 in the ALU stage, and sub $8,$4,$7 in the Decode stage; ForwardA = 10 and ForwardB = 01]
Load Delay
Unfortunately, not all data hazards can be forwarded
Load has a delay that cannot be eliminated by forwarding
In the example shown below
The LW instruction does not have data until end of CC4
AND instruction wants data at beginning of CC4 - NOT possible
However, load can forward data to second next instruction

Time (cycles)      CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8
lw  $2, 20($1)     IF   Reg  ALU  DM   Reg
and $4, $2, $5          IF   Reg  ALU  DM   Reg
or  $6, $3, $2               IF   Reg  ALU  DM   Reg
add $7, $2, $2                    IF   Reg  ALU  DM   Reg
Detecting RAW Hazard after Load
Detecting a RAW hazard after a Load instruction:
The load instruction will be in the ID/EX register
Instruction that needs the load data will be in the IF/ID register
Condition for stalling the pipeline
if ((ID/EX.MemRead == 1) and (ID/EX.Rw != 0) and
((ID/EX.Rw == IF/ID.Rs) or (ID/EX.Rw == IF/ID.Rt))) Stall
Insert a bubble after the load instruction
Bubble is a no-op that wastes one clock cycle
Delays the instruction after load by one cycle
Because of RAW hazard
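The stall condition can be expressed directly in Python. This is a minimal sketch, assuming the two pipeline registers are modeled as dictionaries with only the fields the hazard detection logic reads; the names are illustrative.

```python
def must_stall(if_id, id_ex):
    """True when the instruction in IF/ID needs the result of a load still in ID/EX."""
    return bool(
        id_ex["MemRead"]                              # instruction in ID/EX is a load
        and id_ex["Rw"] != 0                          # it writes a real register
        and id_ex["Rw"] in (if_id["Rs"], if_id["Rt"])  # and the next instruction reads it
    )


# lw $2, 20($1) is in ID/EX while and $4, $2, $5 is in IF/ID
lw = {"MemRead": 1, "Rw": 2}
and_instr = {"Rs": 2, "Rt": 5}
print(must_stall(and_instr, lw))

# An ALU instruction in ID/EX never causes this stall, since its result can be forwarded
sub = {"MemRead": 0, "Rw": 2}
print(must_stall(and_instr, sub))
```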
Stall the Pipeline for one Cycle
Freeze the PC and the IF/ID registers
No new instruction is fetched and instruction after load is stalled
Allow the Load instruction in ID/EX register to proceed
Introduce a bubble into the ID/EX register
Load can forward data to next instruction after delaying it

Time (cycles)      CC1  CC2  CC3     CC4  CC5  CC6  CC7  CC8
lw  $2, 20($1)     IM   Reg  ALU     DM   Reg
and $4, $2, $5          IM   bubble  Reg  ALU  DM   Reg
or  $6, $3, $2                       IM   Reg  ALU  DM   Reg
Hazard Detection and Stall Unit
The pipeline is stalled by making PCWrite = 0 and IF/IDWrite = 0 and introducing a bubble into the ID/EX control signals
Bubble clears control signals
[Figure: pipelined datapath with a Forwarding, Hazard Detection, and Stall Unit; the unit reads MemRead, drives PCWrite, IF/IDWrite, ForwardA, and ForwardB, and controls a mux that replaces the Main Control outputs with 0 to form a bubble]
Compiler Scheduling
Compilers can schedule code in a way to avoid load stalls
Consider the following statements:
a = b + c; d = e - f;

Slow code:
lw  $10, ($1)        # $1 = addr b
lw  $11, ($2)        # $2 = addr c
add $12, $10, $11    # stall
sw  $12, ($3)        # $3 = addr a
lw  $13, ($4)        # $4 = addr e
lw  $14, ($5)        # $5 = addr f
sub $15, $13, $14    # stall
sw  $15, ($6)        # $6 = addr d

Fast code: No Stalls
lw  $10, 0($1)
lw  $11, 0($2)
lw  $13, 0($4)
lw  $14, 0($5)
add $12, $10, $11
sw  $12, 0($3)
sub $15, $13, $14
sw  $15, 0($6)
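The benefit of scheduling can be checked by counting load-use stalls in both versions. The Python sketch below assumes the pipeline described earlier, where only an instruction that immediately follows a load and reads its destination register stalls for one cycle. The tuple encoding `(op, dest, sources)` and the function name are illustrative.

```python
def count_load_stalls(program):
    """Count one-cycle stalls caused by using a loaded register in the very next instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "lw" and prev_dest != 0 and prev_dest in cur_sources:
            stalls += 1
    return stalls


# (op, destination register, source registers); stores have no destination
slow = [
    ("lw", 10, (1,)), ("lw", 11, (2,)), ("add", 12, (10, 11)), ("sw", None, (12, 3)),
    ("lw", 13, (4,)), ("lw", 14, (5,)), ("sub", 15, (13, 14)), ("sw", None, (15, 6)),
]
fast = [
    ("lw", 10, (1,)), ("lw", 11, (2,)), ("lw", 13, (4,)), ("lw", 14, (5,)),
    ("add", 12, (10, 11)), ("sw", None, (12, 3)), ("sub", 15, (13, 14)), ("sw", None, (15, 6)),
]
print(count_load_stalls(slow), count_load_stalls(fast))
```

Both versions execute the same eight instructions; the reordered one simply places an independent instruction after each load whose result is needed next.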
Write After Read (WAR) Hazard
Instruction J should write its result after it is read by I
Called an anti-dependence by compiler writers
I: sub $4, $1, $3    # $1 is read
J: add $1, $2, $3    # $1 is written
Results from reuse of the name $1
Hazard occurs when J writes $1 before I reads it
Cannot occur in our basic 5-stage pipeline because:
Reads are always in stage 2, and
Writes are always in stage 5
Instructions are processed in order
Write After Write (WAW) Hazard
Instruction J should write its result after instruction I
Called an output-dependence in compiler terminology
I: sub $1, $4, $3    # $1 is written
J: add $1, $2, $3    # $1 is written again
This hazard also results from the reuse of name $1
Hazard occurs when writes occur in the wrong order
Can't happen in our basic 5-stage pipeline because:
All writes are ordered and always take place in stage 5
WAR and WAW hazards can occur in complex pipelines
Notice that Read After Read (RAR) is NOT a hazard
Control Hazards
Branch instructions can cause great performance loss
Branch instructions need two things:
Branch Result: Taken or Not Taken
Branch target:
PC + 4 if branch is NOT taken
PC + 4 + 4 × immediate if branch is taken
Branch instruction is decoded in the ID stage
At which point a new instruction is already being fetched
For our pipeline: 2-cycle branch delay
Effective address is calculated in the ALU stage
Branch condition is determined by the ALU (zero flag)
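The two possible next-PC values can be computed as in the following Python sketch. It assumes a 32-bit PC and a 16-bit immediate that is sign-extended before being scaled by 4, as in standard MIPS branches; the function names are illustrative.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits of imm16 as a signed two's complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc(pc, imm16, taken):
    """PC + 4 if the branch is not taken, PC + 4 + 4 * immediate if it is taken."""
    fall_through = (pc + 4) & 0xFFFFFFFF
    if not taken:
        return fall_through
    return (fall_through + 4 * sign_extend16(imm16)) & 0xFFFFFFFF


print(hex(next_pc(0x1000, 3, taken=False)))      # falls through to the next instruction
print(hex(next_pc(0x1000, 3, taken=True)))       # forward branch, 3 instructions past PC + 4
print(hex(next_pc(0x1000, 0xFFFF, taken=True)))  # immediate of -1 branches back to the branch itself
```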
Branch Delay = 2 Clock Cycles
By the time the branch instruction reaches the ALU stage, next1 instruction is in the decode stage and next2 instruction is being fetched

label:
  lw  $8, ($7)
  . . .
  beq $5, $6, label
  next1
  next2

[Figure: pipelined datapath with beq $5,$6,label in the ALU stage (ALU performs SUB, zero = 1, beq = 1, PCSrc = 1), next1 in the decode stage, and next2 being fetched; Next PC produces the Branch Target Address from NPC and Imm16]
2-Cycle Branch Delay
Next1 and Next2 instructions will be fetched anyway
Pipeline should flush Next1 and Next2 if branch is taken
Otherwise, they can be executed if branch is not taken

                                   cc1  cc2  cc3  cc4     cc5     cc6     cc7
beq $5,$6,label                    IF   Reg  ALU
Next1  # bubble                         IF   Reg  Bubble  Bubble  Bubble
Next2  # bubble                              IF   Bubble  Bubble  Bubble  Bubble
label: branch target instruction                  IF      Reg     ALU     MEM
Reducing the Delay of Branches
Branch delay can be reduced from 2 cycles to just 1 cycle
Branches can be determined earlier in the Decode stage
Next PC logic block is moved to the ID stage
A comparator is added to the Next PC logic
To determine branch decision, whether the branch is taken or not
Only one instruction that follows the branch will be fetched
If the branch is taken then only one instruction is flushed
We need a control signal to reset the IF/ID register
This will convert the fetched instruction into a NOP
Modified Datapath
Next PC block is moved to the Instruction Decode stage
PCSrc signal resets the IF/ID register when a branch is taken
Advantage: Branch and jump delay is reduced to one cycle
Drawback: Added delay in decode stage => longer cycle
[Figure: modified pipelined datapath with the Next PC block in the decode stage, fed by NPC, Imm16, and Imm26; PCSrc selects the next PC and drives the reset input of the IF/ID register]
Details of Next PC
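[Figure: Next PC logic. Inputs: NPC (30 bits), Imm16 (extended by Ext), Imm26 (26 bits) with the 4 most significant bits (msb 4) of the PC, Forwarded BusA and BusB, and the beq and bne signals. A 30-bit adder (ADD) and a mux produce the Branch or Jump Target Address; a comparator (=) on BusA and BusB produces zero; the output signal is PCSrc]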
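The branch decision made by the Next PC block can be sketched in Python. This is an illustrative model only, assuming PCSrc is asserted for a jump, for beq when the compared operands are equal, and for bne when they differ; the function names and the way the IF/ID reset is modeled are assumptions.

```python
def pc_src(beq, bne, j, bus_a, bus_b):
    """Return True when the next PC must come from the branch or jump target."""
    zero = bus_a == bus_b  # comparator added to the Next PC logic in the ID stage
    return bool(j or (beq and zero) or (bne and not zero))


def fetch_after_decode(fetched_instruction, redirect):
    """Model the IF/ID reset: a taken branch or jump turns the fetched instruction into a NOP."""
    return "nop" if redirect else fetched_instruction


taken = pc_src(beq=1, bne=0, j=0, bus_a=7, bus_b=7)      # beq with equal operands
not_taken = pc_src(beq=1, bne=0, j=0, bus_a=7, bus_b=8)  # beq with different operands
print(taken, fetch_after_decode("next1", taken))
print(not_taken, fetch_after_decode("next1", not_taken))
```

Only the single instruction fetched behind the branch is flushed, which is the one-cycle branch delay.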
Branch Hazard Alternatives
Predict Branch Not Taken (modified datapath)
Successor instruction is already fetched
About half of MIPS branches are not taken on average
Flush instructions in pipeline only if branch is actually taken
Delayed Branch
Define branch to take place AFTER the next instruction
Compiler/assembler fills the branch delay slot (for 1 delay cycle)
Dynamic Branch Prediction
Can predict backward branches in loops taken most of time
However, branch target address is determined in ID stage
Must reduce branch delay from 1 cycle to 0, but how?
Delayed Branch
Define branch to take place after the next instruction
For a 1-cycle branch delay, we have one delay slot
branch instruction
branch delay slot (next instruction)
branch target (if branch taken)
Compiler fills the branch delay slot
By selecting an independent instruction
From before the branch
If no independent instruction is found
Compiler fills delay slot with a NO-OP

Before filling the delay slot:
label:
  . . .
  add $t2,$t3,$t4
  beq $s1,$s0,label
  Delay Slot

After filling the delay slot:
label:
  . . .
  beq $s1,$s0,label
  add $t2,$t3,$t4
Zero-Delayed Branch
Disadvantages of delayed branch
Branch delay can increase to multiple cycles in deeper pipelines
Branch delay slots must be filled with useful instructions or no-op
How can we achieve zero-delay for a taken branch?
Branch target address is computed in the ID stage
Solution
Check the PC to see if the instruction being fetched is a branch
Store the branch target address in a branch buffer in the IF stage
If branch is predicted taken then
Next PC = branch target fetched from branch target buffer
Otherwise, if branch is predicted not taken then Next PC = PC+4
Branch Target and Prediction Buffer
The branch target buffer is implemented as a small cache
Stores the branch target address of recent branches
We must also have prediction bits
To predict whether branches are taken or not taken
The prediction bits are dynamically determined by the hardware
[Figure: Branch Target & Prediction Buffer. The low-order bits of the PC are used as index into a table holding Addresses of Recent Branches, Target Addresses, and Predict Bits; a comparator (=) checks the stored branch address against the PC, and predict_taken controls a mux that selects between the incremented PC (Inc) and the target address]
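The lookup performed in the IF stage can be sketched as a small Python class. This is a simplified model under stated assumptions: a direct-mapped table indexed by low-order PC bits, a full branch address kept as the tag, and a single prediction bit that records the last outcome. The class name, the 4-bit index, and the method names are all illustrative.

```python
class BranchTargetBuffer:
    """Direct-mapped branch target and prediction buffer (illustrative sketch)."""

    def __init__(self, index_bits=4):
        self.index_bits = index_bits
        self.entries = {}  # index -> (branch address, target address, predict_taken)

    def _index(self, pc):
        # Instructions are word aligned, so drop the two low bits before indexing
        return (pc >> 2) & ((1 << self.index_bits) - 1)

    def next_pc(self, pc):
        """Predicted next PC: the stored target on a hit that is predicted taken, else PC + 4."""
        entry = self.entries.get(self._index(pc))
        if entry is not None:
            branch_pc, target, predict_taken = entry
            if branch_pc == pc and predict_taken:
                return target
        return pc + 4

    def update(self, pc, target, taken):
        """Record the resolved branch: its target and a 1-bit prediction (last outcome)."""
        self.entries[self._index(pc)] = (pc, target, taken)


btb = BranchTargetBuffer()
first = btb.next_pc(0x400)        # unknown branch: falls through
btb.update(0x400, 0x3F0, taken=True)
second = btb.next_pc(0x400)       # now predicted taken, target comes from the buffer
other = btb.next_pc(0x440)        # same index, different address: no hit
print(hex(first), hex(second), hex(other))
```

The address comparison matters because several instructions share one index; without it, a non-branch instruction could be redirected to another branch's target.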
Dynamic Branch Prediction
Prediction of branches at runtime using prediction bits
One or few prediction bits are associated with a branch instruction
Branch prediction buffer is a small memory
Indexed by the lower portion of the address of branch instruction
The simplest scheme is to have 1 prediction bit per branch
We don't know if the prediction bit is correct or not
If correct prediction
Continue normal execution: no wasted cycles
If incorrect prediction (misprediction)
Flush the instructions that were incorrectly fetched: wasted cycles
Update prediction bit and target address for future use
2-bit Prediction Scheme
Prediction is just a hint that is assumed to be correct
If incorrect then fetched instructions are flushed
1-bit prediction scheme has a performance shortcoming
A loop branch is almost always taken, except for last iteration
1-bit scheme will mispredict twice, on first and last loop iterations
2-bit prediction schemes work better and are often used
A prediction must be wrong twice before it is changed
A loop branch is mispredicted only once on the last iteration
[Figure: state diagram with four states, two Predict Taken and two Not Taken; Taken and Not Taken outcomes label the transitions between states]
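The difference between the two schemes can be seen by simulating a loop branch. The Python sketch below uses a 2-bit saturating counter, which is one common way to realize "wrong twice before it is changed" and may differ in detail from the state diagram on the slide. The initial states and the loop shape (9 taken iterations followed by 1 not taken, executed 3 times) are assumptions chosen for illustration.

```python
def mispredictions_1bit(outcomes, predict_taken=True):
    """1-bit scheme: always predict the last outcome of the branch."""
    misses = 0
    for taken in outcomes:
        if taken != predict_taken:
            misses += 1
            predict_taken = taken  # flip the prediction after a single miss
    return misses


def mispredictions_2bit(outcomes, counter=3):
    """2-bit saturating counter: values 2 and 3 predict taken, 0 and 1 predict not taken."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        # Move one step toward the actual outcome, saturating at 0 and 3
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# Loop branch taken 9 times, then not taken on exit; the whole loop runs 3 times
loop_outcomes = ([True] * 9 + [False]) * 3
print(mispredictions_1bit(loop_outcomes), mispredictions_2bit(loop_outcomes))
```

After the first pass, the 1-bit scheme misses twice per loop execution (on re-entry and on exit), while the 2-bit counter misses only on exit.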
Pipeline Hazards Summary
Three types of pipeline hazards
Structural hazards: conflicts using a resource during same cycle
Data hazards: due to data dependencies between instructions
Control hazards: due to branch and jump instructions
Hazards limit the performance and complicate the design
Structural hazards: eliminated by careful design or more hardware
Data hazards are eliminated by forwarding
However, load delay cannot be eliminated and stalls the pipeline
Delayed branching can be a solution when branch delay = 1 cycle
Branch prediction can reduce branch delay to zero
Branch misprediction should flush the wrongly fetched instructions