
Pipelined Processor Design

COE 233
Logic Design and Computer Organization
Dr. Muhamed Mudawar

King Fahd University of Petroleum and Minerals


Presentation Outline

 Serial versus Pipelined Execution

 Pipelined Datapath and Control

 Pipeline Hazards

 Data Hazards and Forwarding

 Load Delay, Hazard Detection, and Stall

 Control Hazards

Pipelined Processor Design COE 233 – Logic Design and Computer Organization © Muhamed Mudawar – slide 2
Laundry Example
 Laundry Example: Three Stages

1. Wash dirty load of clothes

2. Dry wet clothes

3. Fold and put clothes into drawers

 Each stage takes 30 minutes to complete

 Four loads of clothes (A, B, C, D) to wash, dry, and fold

Sequential Laundry

[Figure: timeline from 6 PM to 12 AM in twelve 30-minute slots; loads A, B, C, and D run one after another]

 Sequential laundry takes 6 hours for 4 loads


 Intuitively, we can use pipelining to speed up laundry

Pipelined Laundry: Start Load ASAP

[Figure: timeline from 6 PM to 9 PM; loads A, B, C, and D each occupy three 30-minute slots, each load starting 30 minutes after the previous one]

 Pipelined laundry takes 3 hours for 4 loads

 Speedup factor is 2 for 4 loads

 Time to wash, dry, and fold one load is still the same (90 minutes)

Serial versus Pipelined Execution
 Consider a task that can be divided into k subtasks
 The k subtasks are executed on k different stages
 Each subtask requires one time unit
 The total execution time of the task is k time units
 Pipelining is to overlap the execution
 The k stages work in parallel on k different tasks
 Tasks enter/leave pipeline at the rate of one task per time unit

[Figure: Serial Execution (one completion every k time units) versus Pipelined Execution (one completion every 1 time unit)]

Synchronous Pipeline
 Uses clocked registers between stages
 Upon arrival of a clock edge …
 All registers hold the results of previous stages simultaneously

 The pipeline stages are combinational logic circuits


 It is desirable to have balanced stages
 Approximately equal delay in all stages

 Clock period is determined by the maximum stage delay


[Figure: Input → S1 → S2 → … → Sk → Output, with clocked registers between the stages, all driven by a common Clock]

Pipeline Performance
 Let t_i = time delay in stage S_i
 Clock cycle t = max(t_i) is the maximum stage delay
 Clock frequency f = 1/t = 1/max(t_i)
 A pipeline can process n tasks in k + n – 1 cycles
 k cycles are needed to complete the first task
 n – 1 cycles are needed to complete the remaining n – 1 tasks

 Ideal speedup of a k-stage pipeline over serial execution

$$S_k = \frac{\text{Serial execution in cycles}}{\text{Pipelined execution in cycles}} = \frac{nk}{k + n - 1}, \qquad S_k \to k \ \text{for large } n$$
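
The cycle count and ideal speedup above can be checked numerically. The following is a minimal Python sketch; the function names `pipelined_cycles` and `ideal_speedup` are illustrative and not part of the slides.

```python
def pipelined_cycles(k: int, n: int) -> int:
    """Cycles to process n tasks on a k-stage pipeline:
    k cycles for the first task, then one more cycle per remaining task."""
    return k + n - 1


def ideal_speedup(k: int, n: int) -> float:
    """Serial cycles (n * k) divided by pipelined cycles (k + n - 1)."""
    return (n * k) / pipelined_cycles(k, n)


# Laundry example: k = 3 stages, n = 4 loads
print(pipelined_cycles(3, 4), ideal_speedup(3, 4))

# For large n the speedup approaches the number of stages k
print(ideal_speedup(5, 1_000_000))
```

With k = 3 and n = 4 this gives 6 time units and a speedup of 2, matching the laundry example.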

MIPS Processor Pipeline

 Five stages, one cycle per stage

1. IF: Instruction Fetch from instruction memory

2. ID: Instruction Decode, register read, and jump/branch target address

3. EX: Execute operation or calculate load/store address

4. MEM: Memory access for load and store

5. WB: Write Back result to register

Single-Cycle vs Pipelined Performance
 Consider a 5-stage instruction execution in which …
 Instruction fetch = ALU operation = Data memory access = 200 ps
 Register read = register write = 150 ps
 What is the clock cycle of the single-cycle processor?
 What is the clock cycle of the pipelined processor?
 What is the speedup factor of pipelined execution?
 Solution
Single-Cycle Clock = 200+150+200+200+150 = 900 ps

[Figure: two instructions executed one after another, each taking a full 900 ps cycle (IF, Reg, ALU, MEM, Reg)]

Single-Cycle versus Pipelined – cont’d
 Pipelined clock cycle = max(200, 150) = 200 ps

[Figure: three overlapped instructions, each starting 200 ps after the previous one; every stage (IF, Reg, ALU, MEM, Reg) takes one 200 ps cycle]

 CPI for pipelined execution = 1


 One instruction completes each cycle (ignoring pipeline fill)

 Speedup of pipelined execution = 900 ps / 200 ps = 4.5
 Instruction count and CPI are equal in both cases

 Speedup factor is less than 5 (number of pipeline stages)

 Because the pipeline stages are not balanced
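
The arithmetic of this example can be reproduced with a short Python sketch. The dictionary keys and function names are illustrative; the delays are the ones given in the problem statement.

```python
# Stage delays in picoseconds, as given in the example
stage_delays_ps = {
    "instruction fetch": 200,
    "register read": 150,
    "ALU operation": 200,
    "data memory access": 200,
    "register write": 150,
}


def single_cycle_clock(delays: dict) -> int:
    """Single-cycle clock must cover all stages in sequence."""
    return sum(delays.values())


def pipelined_clock(delays: dict) -> int:
    """Pipelined clock is set by the slowest stage."""
    return max(delays.values())


def speedup(delays: dict) -> float:
    """Same instruction count and CPI, so speedup is the clock ratio."""
    return single_cycle_clock(delays) / pipelined_clock(delays)


print(single_cycle_clock(stage_delays_ps))  # 900
print(pipelined_clock(stage_delays_ps))     # 200
print(speedup(stage_delays_ps))             # 4.5
```

If all five stages took 200 ps, the single-cycle clock would be 1000 ps and the speedup would reach 5.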

Pipeline Performance Summary
 Pipelining doesn’t improve latency of a single instruction

 However, it improves throughput of entire workload


 Instructions are initiated and completed at a higher rate

 In a k-stage pipeline, k instructions operate in parallel


 Overlapped execution using multiple hardware resources

 Potential speedup = number of pipeline stages k

 Pipeline rate is limited by slowest pipeline stage

 Unbalanced lengths of pipeline stages reduce speedup

 Also, time to fill and drain pipeline reduces speedup

Single-Cycle Datapath
 Shown below is the single-cycle datapath
 How to pipeline this single-cycle datapath?

Answer: Introduce pipeline registers at end of each stage


[Figure: single-cycle datapath divided into five stages: IF = Instruction Fetch, ID = Instruction Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back. Jump Target = PC[31:28] ‖ Imm26. Control signals: PCSrc, RegDst, RegWr, ALUSrc, ALUOp, MemRd, MemWr, WBdata]

Pipelined Datapath
 Pipeline registers are shown in green, including the PC
 Same clock edge updates all pipeline registers and PC
 In addition to updating register file and data memory (for store)
[Figure: pipelined datapath with pipeline registers between the five stages: IF = Instruction Fetch, ID = Instruction Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back. Jump Target = PC[31:28] ‖ Imm26. Control signals: PCSrc, RegDst, RegWr, ALUSrc, ALUOp, MemRd, MemWr, WBdata]

Problem with Register Destination
 Instruction in ID stage is different from the one in WB stage
 WB stage should write to the destination register of its own instruction
 As drawn, it writes the destination register of the instruction in the ID stage

[Figure: the same pipelined datapath; the destination register number (Rd or Rt, selected by RegDst) goes from the ID stage directly to the RW input of the register file]

Pipelining the Destination Register
 Destination Register should be pipelined from ID to WB
 The WB stage writes back data knowing the destination register

[Figure: pipelined datapath in which the destination register number is carried through the pipeline registers as Rd2, Rd3, and Rd4; Rd4 drives the RW input of the register file in the WB stage]

Graphically Representing Pipelines
 Multiple instruction execution over multiple clock cycles
 Instructions are listed in execution order from top to bottom
 Clock cycles move from left to right
 Figure shows the use of resources at each stage and each cycle

Time (in cycles)      CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8

Program Execution Order:
lw  $t6, 8($s5)       IM    Reg   ALU   DM    Reg
add $s1, $s2, $s3           IM    Reg   ALU   DM    Reg
ori $s4, $t3, 7                   IM    Reg   ALU   DM    Reg
sub $t5, $s2, $t3                       IM    Reg   ALU   DM    Reg
sw  $s2, 10($t3)                              IM    Reg   ALU   DM

Instruction-Time Diagram
 Instruction-Time Diagram shows:
 Which instruction occupies which stage at each clock cycle
 Instruction flow is pipelined over the 5 stages
 Up to five instructions can be in the pipeline during the same cycle: Instruction Level Parallelism (ILP)
 ALU instructions skip the MEM stage. Store instructions skip the WB stage

Instruction Order     CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9
lw  $t7, 8($s3)       IF    ID    EX    MEM   WB
lw  $t6, 8($s5)             IF    ID    EX    MEM   WB
ori $t4, $s3, 7                   IF    ID    EX    –     WB
sub $s5, $s2, $t3                       IF    ID    EX    –     WB
sw  $s2, 10($s3)                              IF    ID    EX    MEM   –

Control Signals
[Figure: pipelined datapath (IF, ID, EX, MEM, WB) with pipelined destination register Rd2, Rd3, Rd4 and control signals PCSrc, RegDst, RegWr, ALUSrc, ALUOp, MemRd, MemWr, WBdata. Jump Target = PC[31:28] ‖ Imm26]

Same control signals used in the single-cycle datapath

Pipelined Control
[Figure: pipelined datapath with Main & ALU Control in the ID stage (inputs Op and func). Control signals are pipelined through the EX, MEM, and WB registers just like data. PC Control receives Zero, BEQ, BNE, and J and drives PCSrc]
Pipelined Control – Cont'd
 ID stage generates all the control signals
 Pipeline the control signals as the instruction moves
 Extend the pipeline registers to include the control signals

 Each stage uses some of the control signals


 Instruction Decode and Register Read
 Control signals are generated
 RegDst and ExtOp are used in this stage, J (Jump) is used by PC control

 Execution Stage => ALUSrc, ALUOp, BEQ, BNE


 ALU generates zero signal for PC control logic (Branch Control)

 Memory Stage => MemRd, MemWr, and WBdata


 Write Back Stage => RegWr control signal is used in the last stage
Control Signals Summary
          Decode Stage      Execute Stage     Memory Stage             Write Back   PC Control
Op        RegDst   ExtOp    ALUSrc   ALUOp    MemRd   MemWr   WBdata   RegWr        PCSrc
R-Type    1=Rd     X        0=Reg    func     0       0       0        1            0 = next PC
ADDI      0=Rt     1=sign   1=Imm    ADD      0       0       0        1            0 = next PC
SLTI      0=Rt     1=sign   1=Imm    SLT      0       0       0        1            0 = next PC
ANDI      0=Rt     0=zero   1=Imm    AND      0       0       0        1            0 = next PC
ORI       0=Rt     0=zero   1=Imm    OR       0       0       0        1            0 = next PC
LW        0=Rt     1=sign   1=Imm    ADD      1       0       1        1            0 = next PC
SW        X        1=sign   1=Imm    ADD      0       1       X        0            0 = next PC
BEQ       X        X        0=Reg    SUB      0       0       X        0            0 or 2 = BTA
BNE       X        X        0=Reg    SUB      0       0       X        0            0 or 2 = BTA
J         X        X        X        X        0       0       X        0            1 = jump target

PCSrc = 0 or 2 (BTA) for BEQ and BNE, depending on the zero flag
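
One way to picture pipelined control is as a control word generated once in the ID stage, with each later stage picking out only the signals it uses. The Python sketch below illustrates this for a few rows of the table; the names `CONTROL`, `STAGE_SIGNALS`, and `signals_for_stage` are hypothetical, and `None` stands for a don't-care (X).

```python
# A few rows of the control table; None marks a don't-care (X) entry
CONTROL = {
    "R-Type": dict(RegDst=1, ExtOp=None, ALUSrc=0, ALUOp="func",
                   MemRd=0, MemWr=0, WBdata=0, RegWr=1),
    "LW":     dict(RegDst=0, ExtOp=1, ALUSrc=1, ALUOp="ADD",
                   MemRd=1, MemWr=0, WBdata=1, RegWr=1),
    "SW":     dict(RegDst=None, ExtOp=1, ALUSrc=1, ALUOp="ADD",
                   MemRd=0, MemWr=1, WBdata=None, RegWr=0),
    "BEQ":    dict(RegDst=None, ExtOp=None, ALUSrc=0, ALUOp="SUB",
                   MemRd=0, MemWr=0, WBdata=None, RegWr=0),
}

# Which signals each stage consumes, following the column groups of the table
STAGE_SIGNALS = {
    "ID":  ("RegDst", "ExtOp"),
    "EX":  ("ALUSrc", "ALUOp"),
    "MEM": ("MemRd", "MemWr", "WBdata"),
    "WB":  ("RegWr",),
}


def signals_for_stage(op: str, stage: str) -> dict:
    """Return the subset of an instruction's control word used in one stage."""
    word = CONTROL[op]
    return {name: word[name] for name in STAGE_SIGNALS[stage]}


print(signals_for_stage("LW", "MEM"))  # signals LW needs in the Memory stage
print(signals_for_stage("SW", "WB"))   # SW does not write the register file
```

Signals for later stages must travel in the pipeline registers alongside the instruction's data until the stage that uses them.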
Pipeline Hazards
 Hazards: situations that would cause incorrect execution
 If next instruction were launched during its designated clock cycle

1. Structural hazards
 Caused by resource contention
 Using same resource by two instructions during the same cycle

2. Data hazards
 An instruction may compute a result needed by next instruction
 Data hazards are caused by data dependencies between instructions

3. Control hazards
 Caused by instructions that change control flow (branches/jumps)
 Delays in changing the flow of control
 Hazards complicate pipeline control and limit performance
Structural Hazards
 Problem
 Attempt to use the same hardware resource by two different instructions during the same clock cycle
 Example
 Writing back ALU result in stage 4
 Conflict with writing load data in stage 5
 Structural Hazard: two instructions are attempting to write the register file during same cycle (lw and ori in CC5)

Instructions          CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9
lw  $t6, 8($s5)       IF    ID    EX    MEM   WB
ori $t4, $s3, 7             IF    ID    EX    WB
sub $t5, $s2, $s3                 IF    ID    EX    WB
sw  $s2, 10($s3)                        IF    ID    EX    MEM

Resolving Structural Hazards
 Serious Hazard:
 Hazard cannot be ignored

 Solution 1: Delay Access to Resource


 Must have mechanism to delay instruction access to resource
 Delay all write backs to the register file to stage 5
 ALU instructions bypass stage 4 (memory) without doing anything

 Solution 2: Add more hardware resources (more costly)


 Add more hardware to eliminate the structural hazard
 Redesign the register file to have two write ports
 First write port can be used to write back ALU results in stage 4
 Second write port can be used to write back load data in stage 5

Data Hazards
 Dependency between instructions causes a data hazard
 The dependent instructions are close to each other
 Pipelined execution might change the order of operand access

 Read After Write – RAW Hazard


 Given two instructions I and J, where I comes before J
 Instruction J should read an operand after it is written by I
 Called a data dependence in compiler terminology

I: add $s1, $s2, $s3 # $s1 is written


J: sub $s4, $s1, $s3 # $s1 is read
 Hazard occurs when J reads the operand before I writes it
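
A RAW dependence between nearby instructions can be found by comparing each instruction's source registers with the destinations of the instructions just before it. The Python sketch below is illustrative only: the `Instr` tuple, the `raw_dependences` function, and the default window of 3 earlier instructions are assumptions, and the check ignores the case where an intermediate instruction overwrites the same register.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                 # assembly text, for display only
    dest: Optional[str]       # register written, or None (e.g. sw)
    srcs: Tuple[str, ...]     # registers read


def raw_dependences(program: List[Instr], window: int = 3) -> List[Tuple[int, int, str]]:
    """Return (I, J, register) triples where instruction J reads a register
    written by an earlier instruction I at most `window` positions before it."""
    found = []
    for j, consumer in enumerate(program):
        for i in range(max(0, j - window), j):
            dest = program[i].dest
            # Register $zero is never really written, so it carries no dependence
            if dest is not None and dest != "$zero" and dest in consumer.srcs:
                found.append((i, j, dest))
    return found


program = [
    Instr("add $s1, $s2, $s3", "$s1", ("$s2", "$s3")),  # I: $s1 is written
    Instr("sub $s4, $s1, $s3", "$s4", ("$s1", "$s3")),  # J: $s1 is read
]
print(raw_dependences(program))  # [(0, 1, '$s1')]
```

A dependence found this way becomes a hazard only if the pipeline lets J read the register before I has written it.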

Example of a RAW Data Hazard
Time (cycles)         CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
value of $s2          10    10    10    10    10    20    20    20

Program Execution Order:
sub $s2, $t1, $t3     IM    Reg   ALU   DM    Reg
add $s4, $s2, $t5           IM    Reg   ALU   DM    Reg
or  $s6, $t3, $s2                 IM    Reg   ALU   DM    Reg
and $s7, $t4, $s2                       IM    Reg   ALU   DM    Reg
sw  $t8, 10($s2)                              IM    Reg   ALU   DM

 Result of sub is needed by add, or, and, & sw instructions


 Instructions add & or will read old value of $s2 from reg file
 During CC5, $s2 is written at end of cycle, so and also reads the old value
Solution 1: Stalling the Pipeline
Time (in cycles)      CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9
value of $s2          10    10    10    10    10    20    20    20    20

Instruction Order:
sub $s2, $t1, $t3     IM    Reg   ALU   DM    Reg
add $s4, $s2, $t5           IM    Reg   Reg   Reg   Reg   ALU   DM    Reg
                                  stall stall stall
or  $s6, $t3, $s2                                   IM    Reg   ALU   DM

 Three stall cycles during CC3 thru CC5 (wasting 3 cycles)


 The 3 stall cycles delay the execution of add and the fetching of or

 The 3 stall cycles insert 3 bubbles (No operations) into the ALU

 The add instruction remains in the second stage until CC6


 The or instruction is not fetched until CC6
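
The number of stall cycles in this example follows from where the dependent instruction sits relative to the producer. The Python sketch below assumes, as in the example above, no forwarding and a register file whose write at the end of stage 5 is not visible to a read in the same cycle; the function name `stalls_without_forwarding` is hypothetical.

```python
def stalls_without_forwarding(distance: int) -> int:
    """Stall cycles for a consumer `distance` instructions after its producer.

    Assumes the 5-stage pipeline of the example: the producer writes the
    register at the end of its stage 5, and the consumer must read it in
    stage 2 in a later cycle (no forwarding)."""
    if distance < 1:
        raise ValueError("distance must be at least 1")
    return max(0, 4 - distance)


# add immediately follows sub in the example above: 3 stall cycles
print(stalls_without_forwarding(1))

for d in range(1, 6):
    print(d, stalls_without_forwarding(d))
```

A consumer four or more instructions after the producer needs no stall, because the register file has already been updated by the time it reads.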
Solution 2: Forwarding ALU Result
 The ALU result is forwarded (fed back) to the ALU input
 No bubbles are inserted into the pipeline and no cycles are wasted
 ALU result is forwarded from ALU, MEM, and WB stages

Time (cycles)         CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
value of $s2          10    10    10    10    10    20    20    20

Program Execution Order:
sub $s2, $t1, $t3     IM    Reg   ALU   DM    Reg
add $s4, $s2, $t5           IM    Reg   ALU   DM    Reg
or  $s6, $t3, $s2                 IM    Reg   ALU   DM    Reg
and $s7, $s6, $s2                       IM    Reg   ALU   DM    Reg
sw  $t8, 10($s2)                              IM    Reg   ALU   DM

Implementing Forwarding
 Two multiplexers added at the inputs of A & B registers
 Data from ALU stage, MEM stage, and WB stage is fed back

 Two signals: ForwardA and ForwardB to control forwarding

[Figure: two 4-input multiplexers, controlled by ForwardA and ForwardB, select between the register file outputs (BusA, BusB) and the results fed back from the ALU, MEM, and WB stages; the destination register is carried along the pipeline as Rd2, Rd3, Rd4]

Forwarding Control Signals
Signal Explanation
ForwardA = 0 First ALU operand comes from register file = Value of (Rs)

ForwardA = 1 Forward result of previous instruction to A (from ALU stage)

ForwardA = 2 Forward result of 2nd previous instruction to A (from MEM stage)

ForwardA = 3 Forward result of 3rd previous instruction to A (from WB stage)

ForwardB = 0 Second ALU operand comes from register file = Value of (Rt)

ForwardB = 1 Forward result of previous instruction to B (from ALU stage)

ForwardB = 2 Forward result of 2nd previous instruction to B (from MEM stage)

ForwardB = 3 Forward result of 3rd previous instruction to B (from WB stage)

Forwarding Example
Instruction sequence:
lw $t4, 4($t0)
ori $t7, $t1, 2
sub $t3, $t4, $t7

When the sub instruction is in the ID stage:
 ori will be in the ALU stage
 lw will be in the MEM stage
 ForwardA = 2 (from MEM stage)
 ForwardB = 1 (from ALU stage)

[Figure: forwarding datapath with sub $t3,$t4,$t7 in the ID stage, ori $t7,$t1,2 in the ALU stage, and lw $t4,4($t0) in the MEM stage]


RAW Hazard Detection
 Current instruction is being decoded in the Decode stage
 Previous instruction is in the Execute stage
 Second previous instruction is in the Memory stage
 Third previous instruction is in the Write Back stage

If ((Rs != 0) and (Rs == Rd2) and (EX.RegWr)) ForwardA = 1
Else if ((Rs != 0) and (Rs == Rd3) and (MEM.RegWr)) ForwardA = 2
Else if ((Rs != 0) and (Rs == Rd4) and (WB.RegWr)) ForwardA = 3
Else ForwardA = 0

If ((Rt != 0) and (Rt == Rd2) and (EX.RegWr)) ForwardB = 1
Else if ((Rt != 0) and (Rt == Rd3) and (MEM.RegWr)) ForwardB = 2
Else if ((Rt != 0) and (Rt == Rd4) and (WB.RegWr)) ForwardB = 3
Else ForwardB = 0
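
The detection conditions above translate directly into a small function. The following Python sketch uses one function for both operands, since ForwardA and ForwardB differ only in whether Rs or Rt is compared; the name `forward_select` and the use of register numbers as plain integers are assumptions for illustration.

```python
def forward_select(src: int, rd2: int, rd3: int, rd4: int,
                   ex_regwr: bool, mem_regwr: bool, wb_regwr: bool) -> int:
    """Return the forwarding mux select (0..3) for one source register.

    src is Rs (for ForwardA) or Rt (for ForwardB); rd2, rd3, rd4 are the
    destination registers of the instructions in the EX, MEM, and WB stages.
    The most recent producer (EX stage) has priority."""
    if src != 0 and src == rd2 and ex_regwr:
        return 1  # previous instruction, from ALU stage
    if src != 0 and src == rd3 and mem_regwr:
        return 2  # 2nd previous instruction, from MEM stage
    if src != 0 and src == rd4 and wb_regwr:
        return 3  # 3rd previous instruction, from WB stage
    return 0      # no hazard: use the register file value


# sub $t3, $t4, $t7 in ID; ori $t7,... in EX; lw $t4,... in MEM
T4, T7 = 12, 15  # MIPS register numbers of $t4 and $t7
forward_a = forward_select(T4, rd2=T7, rd3=T4, rd4=0,
                           ex_regwr=True, mem_regwr=True, wb_regwr=False)
forward_b = forward_select(T7, rd2=T7, rd3=T4, rd4=0,
                           ex_regwr=True, mem_regwr=True, wb_regwr=False)
print(forward_a, forward_b)  # 2 1
```

The `src != 0` test keeps register 0 from ever being forwarded, since it always reads as zero.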
Hazard Detecting and Forwarding Logic
[Figure: Hazard Detect & Forward unit in the ID stage. Inputs: Rs, Rt, the pipelined destination registers Rd2, Rd3, Rd4, and the RegWr signals of the EX, MEM, and WB stages. Outputs: ForwardA and ForwardB. Main & ALU Control (inputs Op and func) generates the control signals that are pipelined through EX, MEM, and WB]
Load Delay
 Unfortunately, not all data hazards can be resolved by forwarding
 Load has a delay that cannot be eliminated by forwarding

 In the example shown below …


 The LW instruction does not read data until end of CC4
 Cannot forward data to ADD at end of CC3 - NOT possible

Time (cycles)         CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8

Program Order:
lw  $s2, 20($t1)      IF    Reg   ALU   DM    Reg
add $s4, $s2, $t5           IF    Reg   ALU   DM    Reg
or  $t6, $t3, $s2                 IF    Reg   ALU   DM    Reg
and $t7, $s2, $t4                       IF    Reg   ALU   DM    Reg

However, load can forward data to 2nd next and later instructions

Detecting RAW Hazard after Load
 Detecting a RAW hazard after a Load instruction:
 The load instruction will be in the EX stage

 Instruction that depends on the load data is in the decode stage

 Condition for stalling the pipeline


if ((EX.MemRd == 1) // Detect Load in EX stage
and (ForwardA==1 or ForwardB==1)) Stall // RAW Hazard

 Insert a bubble into the EX stage after a load instruction


 Bubble is a no-op that wastes one clock cycle

 Delays the dependent instruction after load by one cycle


 Because of RAW hazard
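
The stall condition can be expressed as a one-line predicate. This is a minimal Python sketch; the function name `load_use_stall` is hypothetical, and the forwarding selects are assumed to follow the encoding used earlier (1 means the operand would come from the instruction in the EX stage).

```python
def load_use_stall(ex_memrd: bool, forward_a: int, forward_b: int) -> bool:
    """True when the instruction in the decode stage must stall for one cycle.

    ex_memrd  -- MemRd of the instruction in the EX stage (1 for a load)
    forward_a -- forwarding select for the first ALU operand
    forward_b -- forwarding select for the second ALU operand"""
    return bool(ex_memrd) and (forward_a == 1 or forward_b == 1)


print(load_use_stall(True, 1, 0))   # load in EX, first operand depends on it
print(load_use_stall(False, 1, 0))  # ALU instruction in EX: forwarding suffices
print(load_use_stall(True, 2, 0))   # load is two stages ahead: no stall needed
```

Only a dependence on the instruction currently in the EX stage matters here, because a load that has already reached the MEM or WB stage can forward its data.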
Stall the Pipeline for one Cycle
 ADD instruction depends on LW  stall at CC3
 Allow Load instruction in ALU stage to proceed
 Freeze PC and Instruction registers (NO instruction is fetched)
 Introduce a bubble into the ALU stage (bubble is a NO-OP)
 Load can forward data to next instruction after delaying it

Time (cycles)         CC1   CC2   CC3     CC4     CC5     CC6     CC7   CC8

Program Order:
lw  $s2, 20($s1)      IM    Reg   ALU     DM      Reg
(bubble)                                  bubble  bubble  bubble
add $s4, $s2, $t5           IM    stall   Reg     ALU     DM      Reg
or  $t6, $s3, $s2                         IM      Reg     ALU     DM    Reg

Showing Stall Cycles
 Stall cycles can be shown on instruction-time diagram
 Hazard is detected in the Decode stage
 Stall indicates that instruction is delayed
 Instruction fetching is also delayed after a stall
 Example:
Data forwarding is shown using green arrows

                      CC1   CC2   CC3    CC4   CC5    CC6   CC7   CC8   CC9   CC10
lw  $s1, ($t5)        IF    ID    EX     MEM   WB
lw  $s2, 8($s1)             IF    Stall  ID    EX     MEM   WB
add $v0, $s2, $t3                        IF    Stall  ID    EX    -     WB
sub $v1, $s2, $v0                                     IF    ID    EX    -     WB

Hazard Detecting and Forwarding Logic
[Figure: Hazard Detect, Forward, and Stall unit. In addition to the forwarding inputs (Rs, Rt, Rd2, Rd3, Rd4, RegWr), it receives MemRd from the EX stage. The Stall signal disables the PC and IR (Inst) registers, and a multiplexer replaces the control signals with 0 (Bubble = 0) for the instruction entering the EX stage]
Code Scheduling to Avoid Stalls
 Compilers reorder code in a way to avoid load stalls
 Consider the translation of the following statements:
A = B + C; D = E – F; // A thru F are in Memory
 Slow code:
lw $t0, 4($s0) # &B = 4($s0)
lw $t1, 8($s0) # &C = 8($s0)
add $t2, $t0, $t1 # stall cycle
sw $t2, 0($s0) # &A = 0($s0)
lw $t3, 16($s0) # &E = 16($s0)
lw $t4, 20($s0) # &F = 20($s0)
sub $t5, $t3, $t4 # stall cycle
sw $t5, 12($s0) # &D = 12($s0)
 Fast code: No Stalls
lw $t0, 4($s0)
lw $t1, 8($s0)
lw $t3, 16($s0)
lw $t4, 20($s0)
add $t2, $t0, $t1
sw $t2, 0($s0)
sub $t5, $t3, $t4
sw $t5, 12($s0)
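
The difference between the two schedules can be checked by counting load-use pairs, that is, instructions that read a register loaded by the instruction immediately before them. The Python sketch below is an illustration under assumptions: instructions are hand-encoded as (op, destination, sources) tuples, the name `count_load_stalls` is hypothetical, and the final `sw` of the fast sequence is assumed to mirror the one in the slow sequence.

```python
def count_load_stalls(program) -> int:
    """Count one stall for each instruction that reads the destination
    register of a lw placed immediately before it (load delay)."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_srcs = curr
        if prev_op == "lw" and prev_dest in curr_srcs:
            stalls += 1
    return stalls


# (op, destination register or None, source registers)
SLOW = [
    ("lw", "$t0", ("$s0",)),
    ("lw", "$t1", ("$s0",)),
    ("add", "$t2", ("$t0", "$t1")),   # stall: $t1 was just loaded
    ("sw", None, ("$t2", "$s0")),
    ("lw", "$t3", ("$s0",)),
    ("lw", "$t4", ("$s0",)),
    ("sub", "$t5", ("$t3", "$t4")),   # stall: $t4 was just loaded
    ("sw", None, ("$t5", "$s0")),
]

FAST = [
    ("lw", "$t0", ("$s0",)),
    ("lw", "$t1", ("$s0",)),
    ("lw", "$t3", ("$s0",)),
    ("lw", "$t4", ("$s0",)),
    ("add", "$t2", ("$t0", "$t1")),
    ("sw", None, ("$t2", "$s0")),
    ("sub", "$t5", ("$t3", "$t4")),
    ("sw", None, ("$t5", "$s0")),     # assumed to match the slow code
]

print(count_load_stalls(SLOW), count_load_stalls(FAST))  # 2 0
```

Moving the two later loads up separates each load from its first use by at least one instruction, which is what removes the stalls.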
Name Dependence: Write After Read
 Instruction J should write its result after it is read by I
 Called anti-dependence by compiler writers
I: sub $t4, $t1, $t3 # $t1 is read
J: add $t1, $t2, $t3 # $t1 is written
 Results from reuse of the name $t1
 NOT a data hazard in the 5-stage pipeline because:
 Reads are always in stage 2
 Writes are always in stage 5, and
 Instructions are processed in order

 Anti-dependence can be eliminated by renaming


 Use a different destination register for add (e.g., $t5)
Name Dependence: Write After Write
 Same destination register is written by two instructions
 Called output-dependence in compiler terminology
I: sub $t1, $t4, $t3 # $t1 is written
J: add $t1, $t2, $t3 # $t1 is written again
 Not a data hazard in the 5-stage pipeline because:
 All writes are ordered and always take place in stage 5

 However, can be a hazard in more complex pipelines


 If instructions are allowed to complete out of order, and
 Instruction J completes and writes $t1 before instruction I

 Output dependence can be eliminated by renaming $t1


 Read After Read is NOT a name dependence
Control Hazards
 Jump and Branch can cause great performance loss
 Jump instruction needs only the jump target address
 Branch instruction needs two things:
 Branch Result Taken or Not Taken
 Branch Target Address
 PC + 4 If Branch is NOT taken
 PC + 4 + 4 × immediate If Branch is Taken

 Jump and Branch targets are computed in the ID stage


 At which point a new instruction is already being fetched
 Jump Instruction: 1-cycle delay
 Branch: 2-cycle delay for branch result (taken or not taken)

1-Cycle Jump Delay
 Control logic detects a Jump instruction in the 2nd Stage

 Next instruction is fetched anyway

 Convert Next instruction into bubble (Jump is always taken)

                          cc1   cc2   cc3     cc4     cc5     cc6     cc7
J L1                      IF    ID
Next instruction                IF    Bubble  Bubble  Bubble  Bubble
. . .
L1: Target instruction                IF      Reg     ALU     DM      Reg

[Jump Target Addr from the ID stage of J is used to fetch the target instruction]

2-Cycle Branch Delay
 Control logic detects a Branch instruction in the 2nd Stage
 ALU computes the Branch outcome in the 3rd Stage
 Next1 and Next2 instructions will be fetched anyway
 Convert Next1 and Next2 into bubbles if branch is taken
                          cc1   cc2   cc3   cc4     cc5     cc6     cc7
Beq $t1,$t2,L1            IF    Reg   ALU
Next1                           IF    Reg   Bubble  Bubble  Bubble
Next2                                 IF    Bubble  Bubble  Bubble  Bubble
L1: target instruction                      IF      Reg     ALU     DM

[Branch Target Addr is used to fetch the target instruction after the ALU stage of Beq]

If Branch is NOT Taken . . .
 Branches can be predicted to be NOT taken

 If branch outcome is NOT taken then


 Next1 and Next2 instructions can be executed

 Do not convert Next1 & Next2 into bubbles

 No wasted cycles

                          cc1   cc2   cc3   cc4   cc5   cc6   cc7
Beq $t1,$t2,L1            IF    Reg   ALU   (NOT Taken)
Next1                           IF    Reg   ALU   DM    Reg
Next2                                 IF    Reg   ALU   DM    Reg

Pipelined Jump and Branch
[Figure: pipelined datapath with jump and branch support. Jump Target = PC[31:28] ‖ Imm26. PC Control (inputs J, BEQ, BNE, Zero) drives PCSrc and the Kill1 and Kill2 signals: a jump kills the next instruction (Kill1) and a taken branch kills two (Kill1 and Kill2) by converting them into bubbles (Bubble = NOP). The Forward & Stall unit (inputs Rs, Rt, Rd2, Rd3, Rd4, RegWr, MemRd) can disable the PC and IR registers]

PC Control for Pipelined Jump and Branch
if ((BEQ && Zero) || (BNE && !Zero))
{ Jmp=0; Br=1; Kill1=1; Kill2=1; }
else if (J)
{ Jmp=1; Br=0; Kill1=1; Kill2=0; }
else
{ Jmp=0; Br=0; Kill1=0; Kill2=0; }

Equivalent logic equations (inputs: BEQ, BNE, J, Zero; outputs: Kill2, Kill1, Br, Jmp, PCSrc):
Br = (BEQ · Zero) + (BNE · !Zero)
Jmp = J · !Br
Kill1 = J + Br
Kill2 = Br
PCSrc = { Br, Jmp } // 0, 1, or 2
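
The PC control logic can be modeled as a small function of its four inputs. The Python sketch below is illustrative; the function name `pc_control` and the dictionary it returns are assumptions. It gives a taken branch priority over a jump, so Jmp is asserted only when no branch is taken.

```python
def pc_control(beq: bool, bne: bool, j: bool, zero: bool) -> dict:
    """Compute Br, Jmp, Kill1, Kill2, and PCSrc from the branch/jump inputs.

    beq, bne -- branch signals of the instruction in the EX stage
    j        -- jump signal of the instruction in the ID stage
    zero     -- zero flag produced by the ALU"""
    br = (beq and zero) or (bne and not zero)   # branch is taken
    jmp = j and not br                          # a taken branch has priority
    kill1 = j or br                             # kill the next instruction
    kill2 = br                                  # taken branch kills a second one
    pc_src = (int(br) << 1) | int(jmp)          # {Br, Jmp}: 0, 1, or 2
    return {"Br": int(br), "Jmp": int(jmp),
            "Kill1": int(kill1), "Kill2": int(kill2), "PCSrc": pc_src}


print(pc_control(beq=True, bne=False, j=False, zero=True))    # taken BEQ
print(pc_control(beq=False, bne=False, j=True, zero=False))   # jump
print(pc_control(beq=True, bne=False, j=False, zero=False))   # BEQ not taken
```

PCSrc never takes the value 3, because Jmp is forced to 0 whenever Br is 1.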
Jump and Branch Impact on CPI
 Base CPI = 1 without counting jump and branch
 Unconditional Jump = 5%, Conditional branch = 20%
 90% of conditional branches are taken
 Jump kills next instruction, Taken Branch kills next two
 What is the effect of jump and branch on the CPI?
Solution:
 Jump adds 1 wasted cycle for 5% of instructions = 1 × 0.05 = 0.05
 Branch adds 2 wasted cycles for 20% × 90% of instructions = 2 × 0.2 × 0.9 = 0.36
 New CPI = 1 + 0.05 + 0.36 = 1.41 (due to wasted cycles)
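
The same calculation can be written as a function so that other instruction mixes can be tried. This is a minimal Python sketch; the function name `cpi_with_control_hazards` and its parameter names are hypothetical, while the default penalties of 1 and 2 cycles are the ones stated above.

```python
def cpi_with_control_hazards(base_cpi: float,
                             jump_frac: float,
                             branch_frac: float,
                             taken_frac: float,
                             jump_penalty: int = 1,
                             taken_branch_penalty: int = 2) -> float:
    """Base CPI plus the average wasted cycles per instruction caused by
    jumps (always pay the penalty) and taken branches."""
    jump_cycles = jump_penalty * jump_frac
    branch_cycles = taken_branch_penalty * branch_frac * taken_frac
    return base_cpi + jump_cycles + branch_cycles


# 5% jumps, 20% conditional branches, 90% of branches taken
cpi = cpi_with_control_hazards(1.0, jump_frac=0.05, branch_frac=0.20, taken_frac=0.90)
print(round(cpi, 2))  # 1.41
```

Branches that are not taken add nothing here, since the two instructions fetched after them are allowed to complete.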
