Pipelined Processor Design: Computer Architecture and Assembly Language Prof. Muhamed Mudawar

This document discusses pipelined processor design. It compares pipelined execution to serial execution and shows how pipelining can speed up tasks like laundry by processing multiple loads simultaneously. It then covers topics like pipelined datapaths, control signals, pipeline hazards and various techniques to address hazards.

Uploaded by

salmansami01

Pipelined Processor Design

ICS 233
Computer Architecture and Assembly Language
Prof. Muhamed Mudawar
College of Computer Sciences and Engineering

Pipelined Processor Design

ICS 233 KFUPM

Muhamed Mudawar slide 1

Presentation Outline
Pipelining versus Serial Execution
Pipelined Datapath
Pipelined Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall Unit
Control Hazards
Delayed Branch and Dynamic Branch Prediction

Pipelining Example
Laundry Example: Three Stages
1. Wash dirty load of clothes
2. Dry wet clothes
3. Fold and put clothes into drawers
Each stage takes 30 minutes to complete
Four loads of clothes (A, B, C, D) to wash, dry, and fold

Sequential Laundry
[Figure: timeline from 6 PM to 12 AM in 30-minute steps; loads A, B, C, and D are processed one after another]
Sequential laundry takes 6 hours for 4 loads
Intuitively, we can use pipelining to speed up laundry

Pipelined Laundry: Start Load ASAP
[Figure: timeline from 6 PM to 9 PM in 30-minute steps; the loads overlap, each starting as soon as the previous load leaves the first stage]
Pipelined laundry takes 3 hours for 4 loads
Speedup factor is 2 for 4 loads
Time to wash, dry, and fold one load is still the same (90 minutes)

Serial Execution versus Pipelining
Consider a task that can be divided into k subtasks
The k subtasks are executed on k different stages
Each subtask requires one time unit
The total execution time of the task is k time units
Pipelining is to start a new task before finishing the previous one
The k stages work in parallel on k different tasks
Tasks enter/leave the pipeline at the rate of one task per time unit
[Figure: successive tasks passing through stages 1, 2, ..., k, without and with overlap]
Without Pipelining: one completion every k time units
With Pipelining: one completion every 1 time unit

Synchronous Pipeline

Uses clocked registers between stages

Upon arrival of a clock edge
All registers hold the results of previous stages simultaneously
The pipeline stages are combinational logic circuits
It is desirable to have balanced stages
Approximately equal delay in all stages
Clock period is determined by the maximum stage delay
[Figure: input flowing through stages separated by clocked registers to the output, all driven by a common clock]

Pipeline Performance

Let $\tau_i$ = time delay in stage $S_i$
Clock cycle $\tau = \max(\tau_i)$ is the maximum stage delay
Clock frequency $f = 1/\tau = 1/\max(\tau_i)$
A pipeline can process n tasks in $k + n - 1$ cycles
k cycles are needed to complete the first task
$n - 1$ cycles are needed to complete the remaining $n - 1$ tasks
Ideal speedup of a k-stage pipeline over serial execution
$$S_k = \frac{\text{Serial execution in cycles}}{\text{Pipelined execution in cycles}} = \frac{nk}{k + n - 1}, \qquad S_k \to k \text{ for large } n$$
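The cycle counts above can be checked numerically. The following is a minimal Python sketch, not taken from the slides; the function names `serial_cycles`, `pipelined_cycles`, and `speedup` are illustrative, and it assumes every stage takes exactly one cycle and no hazards occur.

```python
def serial_cycles(n: int, k: int) -> int:
    """Cycles to run n tasks one after another, k cycles each."""
    return n * k


def pipelined_cycles(n: int, k: int) -> int:
    """k cycles for the first task, then one more completion per cycle."""
    return k + n - 1


def speedup(n: int, k: int) -> float:
    """Ideal speedup of a k-stage pipeline over serial execution."""
    return serial_cycles(n, k) / pipelined_cycles(n, k)


if __name__ == "__main__":
    # Laundry example: 4 loads, 3 stages, 30 minutes per stage.
    print(serial_cycles(4, 3) * 30, "minutes, sequential")
    print(pipelined_cycles(4, 3) * 30, "minutes, pipelined")
    print("speedup:", speedup(4, 3))
    # For a large number of tasks the speedup approaches k.
    print("speedup for n = 1,000,000 and k = 5:", speedup(1_000_000, 5))
```

For the laundry example (n = 4, k = 3) this gives 12 versus 6 half-hour slots, which matches the 6-hour and 3-hour figures and the speedup factor of 2.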


Single-Cycle Datapath

Shown below is the single-cycle datapath
[Figure: single-cycle datapath showing the instruction memory, Inc and Next PC logic, Ext (Imm16, Imm26), ALU, data memory, and multiplexers, divided into five stages]
How to pipeline this single-cycle datapath?
Answer: Introduce registers at the end of each stage
IF = Instruction Fetch
ID = Decode and Register Fetch
EX = Execute and Calculate Address
MEM = Memory Access
WB = Write Back

Pipelined Datapath

Pipeline registers, in green, separate each pipeline stage
Pipeline registers are labeled by the stages they separate: IF/ID, ID/EX, EX/MEM, MEM/WB
Is there a problem with the register destination address?
[Figure: pipelined datapath with the stages IF = Instruction Fetch, ID = Decode, EX = Execute, and MEM = Memory, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB registers]

Corrected Pipelined Datapath
Destination register number should come from MEM/WB
Along with the data during the write back stage
Destination register number is passed from ID to WB stage
[Figure: pipelined datapath in which the destination register number Rd travels through the ID/EX, EX/MEM, and MEM/WB registers before reaching the register file]

Graphically Representing Pipelines
Multiple instruction execution over multiple clock cycles
Instructions are listed in execution order from top to bottom
Clock cycles move from left to right
Figure shows the use of resources at each stage and each cycle
[Figure: program execution order against time in cycles (CC1 to CC8), showing the resources such as Reg and ALU used by each instruction in each cycle]
    lw  $6, 8($5)
    add $1, $2, $3
    ori $4, $3, 7
    sub $5, $2, $3
    sw  $2, 10($3)

Instruction-Time Diagram
Diagram shows:
Which instruction occupies what stage at each clock cycle
Instruction execution is pipelined over the 5 stages
Up to five instructions can be in execution during a single cycle
ALU instructions skip the MEM stage
Store instructions skip the WB stage
[Figure: instruction order against time (CC1 to CC9), one row of stages per instruction]
    lw  $7, 8($3)
    lw  $6, 8($5)
    ori $4, $3, 7
    sub $5, $2, $3
    sw  $2, 10($3)

Single-Cycle vs Pipelined Performance
Consider a 5-stage instruction execution in which
Instruction fetch = ALU operation = Data memory access = 200 ps
Register read = register write = 150 ps
What is the single-cycle non-pipelined time?
What is the pipelined cycle time?
What is the speedup factor for pipelined execution?
Solution
Non-pipelined cycle = 200+150+200+200+150 = 900 ps
[Figure: two consecutive instructions, each passing through IF, Reg, ALU, MEM, Reg within one 900 ps cycle]
Single-Cycle versus Pipelined
Pipelined cycle time = max(200, 150) = 200 ps
[Figure: three overlapped instructions, each stage (IF, Reg, ALU, MEM, Reg) occupying one 200 ps cycle]
CPI for pipelined execution = 1
One instruction completes each cycle (ignoring pipeline fill)
Speedup of pipelined execution = 900 ps / 200 ps = 4.5
Instruction count and CPI are equal in both cases
Speedup factor is less than 5 (number of pipeline stages)
Because the pipeline stages are not balanced
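The arithmetic of this example can be reproduced with a short Python sketch. The stage delays are the ones given above; the dictionary keys and function names are illustrative choices, and pipeline fill time and register overhead are ignored as in the example.

```python
# Stage delays in picoseconds, as stated in the example.
STAGE_DELAY_PS = {
    "instruction fetch": 200,
    "register read": 150,
    "ALU operation": 200,
    "data memory access": 200,
    "register write": 150,
}


def single_cycle_time(delays: dict) -> int:
    """A single-cycle design must fit all stages in one clock period."""
    return sum(delays.values())


def pipelined_cycle_time(delays: dict) -> int:
    """A pipelined design is clocked by its slowest stage."""
    return max(delays.values())


def pipeline_speedup(delays: dict) -> float:
    """Speedup with equal instruction count and CPI = 1 in both designs."""
    return single_cycle_time(delays) / pipelined_cycle_time(delays)


if __name__ == "__main__":
    print(single_cycle_time(STAGE_DELAY_PS), "ps single-cycle")
    print(pipelined_cycle_time(STAGE_DELAY_PS), "ps pipelined")
    print("speedup:", pipeline_speedup(STAGE_DELAY_PS))
```

If all five stages took 200 ps, the same functions would give 1000 ps / 200 ps = 5, the full number of stages; the two 150 ps stages are what hold the speedup at 4.5.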

Control Signals
Similar to control signals used in the single-cycle datapath
[Figure: pipelined datapath annotated with the control signals PCSrc, RegDst, RegWrite, ALUSrc, ALUOp, beq, bne, j, MemRead, MemWrite, and MemtoReg]

Control Signals cont'd
Decode stage signal: RegDst
Execute stage signals: ALUSrc, ALUOp, Beq, Bne, j
Memory stage signals: MemRead, MemWrite, MemtoReg
Writeback stage signal: RegWrite

Op      RegDst  ALUSrc  ALUOp
R-Type  1=Rd    0=Reg   R-Type
addi    0=Rt    1=Imm   ADD
slti    0=Rt    1=Imm   SLT
andi    0=Rt    1=Imm   AND
ori     0=Rt    1=Imm   OR
lw      0=Rt    1=Imm   ADD
sw      x       1=Imm   ADD
beq     x       0=Reg   SUB
bne     x       0=Reg   SUB
j       x       x       x

Pipelined Control
Pass control signals along pipeline just like the data
[Figure: Main Control in the decode stage generates RegDst, ALUSrc, ALUOp, MemRead, MemWrite, MemtoReg, and RegWrite; the M and WB signal groups are carried through the ID/EX, EX/MEM, and MEM/WB registers, and ALU Control uses ALUOp and func]

Pipelined Control Cont'd

ID stage generates all the control signals

Pipeline the control signals as the instruction moves

Extend the pipeline registers to include the control signals

Each stage uses some of the control signals

Instruction Decode and Register Fetch
Control signals are generated
RegDst is used in this stage
Execution Stage => ALUSrc and ALUOp
Next PC uses Beq, Bne, J and zero signals for branch control
Memory Stage => MemRead, MemWrite, and MemtoReg
Write Back Stage => RegWrite is used in this stage
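The idea of carrying control signals in the pipeline registers can be sketched in Python as follows. This is only an illustration: the `Control` record, the `main_control` table, and the signal values for `lw`, `sw`, and `add` are assumptions made for the sketch, and the branch signals are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Control:
    """Control signals generated in ID; the defaults describe a NOP."""
    reg_dst: int = 0
    alu_src: int = 0
    alu_op: str = "NOP"
    mem_read: int = 0
    mem_write: int = 0
    mem_to_reg: int = 0
    reg_write: int = 0


def main_control(op):
    """Hypothetical decoder for three opcodes; anything else is a NOP."""
    table = {
        "lw": Control(alu_src=1, alu_op="ADD", mem_read=1, mem_to_reg=1, reg_write=1),
        "sw": Control(alu_src=1, alu_op="ADD", mem_write=1),
        "add": Control(reg_dst=1, alu_op="R-Type", reg_write=1),
    }
    return table.get(op, Control())


def run(program):
    """Shift control signals through ID/EX, EX/MEM, and MEM/WB.

    Returns, for every clock edge, the signals visible to each stage:
    (ALUOp in ID/EX, memory access in EX/MEM, RegWrite in MEM/WB).
    """
    id_ex = ex_mem = mem_wb = Control()
    trace = []
    for op in list(program) + [None] * 3:  # three extra edges drain the pipeline
        mem_wb = ex_mem                    # update from the back of the pipeline
        ex_mem = id_ex
        id_ex = main_control(op)
        trace.append((id_ex.alu_op, ex_mem.mem_read or ex_mem.mem_write, mem_wb.reg_write))
    return trace


if __name__ == "__main__":
    for edge, signals in enumerate(run(["lw", "sw", "add"]), start=1):
        print(edge, signals)
```

Each instruction's RegWrite reaches the MEM/WB register two clock edges after its signals were latched into ID/EX, so the write-enable arrives together with the data it controls.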


Pipelining Summary

Pipelining doesn't improve latency of a single instruction
However, it improves throughput of entire workload
Instructions are initiated and completed at a higher rate
In a k-stage pipeline, k instructions operate in parallel
Overlapped execution using multiple hardware resources
Potential speedup = number of pipeline stages k
Pipeline rate is limited by slowest pipeline stage
Unbalanced lengths of pipeline stages reduce speedup
Also, time to fill and drain pipeline reduces speedup

Pipeline Hazards
Hazards: situations that would cause incorrect execution
If next instruction were launched during its designated clock cycle
1. Structural hazards
Caused by resource contention
Using same resource by two instructions during the same cycle
2. Data hazards
An instruction may compute a result needed by next instruction
Hardware can detect dependencies between instructions
3. Control hazards
Caused by instructions that change control flow (branches/jumps)
Delays in changing the flow of control
Hazards complicate pipeline control and limit performance

Structural Hazards
Problem
Attempt to use the same hardware resource by two different instructions during the same cycle
Example
Writing back ALU result in stage 4
Conflict with writing load data in stage 5
[Figure: Structural Hazard. Instructions against time (CC1 to CC9); two instructions are attempting to write the register file during same cycle]
    lw  $6, 8($5)
    ori $4, $3, 7
    sub $5, $2, $3
    sw  $2, 10($3)

Resolving Structural Hazards
Serious Hazard:
Hazard cannot be ignored
Solution 1: Delay Access to Resource
Must have mechanism to delay instruction access to resource
Delay all write backs to the register file to stage 5
ALU instructions bypass stage 4 (memory) without doing anything
Solution 2: Add more hardware resources (more costly)
Add more hardware to eliminate the structural hazard
Redesign the register file to have two write ports
First write port can be used to write back ALU results in stage 4
Second write port can be used to write back load data in stage 5

Data Hazards
Dependency between instructions causes a data hazard
The dependent instructions are close to each other
Pipelined execution might change the order of operand access
Read After Write (RAW) Hazard
Given two instructions I and J, where I comes before J
Instruction J should read an operand after it is written by I
Called a data dependence in compiler terminology
    I: add $1, $2, $3    # $1 is written
    J: sub $4, $1, $3    # $1 is read
Hazard occurs when J reads the operand before I writes it

Example of a RAW Data Hazard
[Figure: program execution order against time in cycles (CC1 to CC8), with the value of $2 marked 10/20 in the cycle where it changes]
    sub $2, $1, $3
    and $4, $2, $5
    or  $6, $3, $2
    add $7, $2, $2
    sw  $8, 10($2)
Result of sub is needed by and, or, add, & sw instructions
Instructions and & or will read old value of $2 from reg file
During CC5, $2 is written and the new value is read

Solution 1: Stalling the Pipeline
[Figure: instruction order against time in cycles (CC1 to CC8) for sub $2, $1, $3; and $4, $2, $5; or $6, $3, $2, with bubbles inserted after sub and the value of $2 marked 10/20]
The and instruction cannot fetch $2 until CC5
The and instruction remains in the IF/ID register until CC5
Two bubbles are inserted into ID/EX at end of CC3 & CC4
Bubbles are NOP instructions: do not modify registers or memory
Bubbles delay instruction execution and waste clock cycles

Solution 2: Forwarding ALU Result
The ALU result is forwarded (fed back) to the ALU input
No bubbles are inserted into the pipeline and no cycles are wasted
ALU result exists in either EX/MEM or MEM/WB register
[Figure: program execution order against time in cycles (CC1 to CC8), with the result of sub forwarded to the ALU inputs of the following instructions]
    sub $2, $1, $3
    and $4, $2, $5
    or  $6, $3, $2
    add $7, $2, $2
    sw  $8, 10($2)

Implementing Forwarding
Two multiplexers added at the inputs of A & B registers
ALU output in the EX stage is forwarded (fed back)
ALU result or Load data in the MEM stage is also forwarded
Two signals: ForwardA and ForwardB control forwarding
[Figure: pipelined datapath with two forwarding multiplexers, controlled by ForwardA and ForwardB, fed by the ALU result and by WriteData from the MEM stage]

RAW Hazard Detection
RAW hazards can be detected by the pipeline
Current instruction being decoded is in IF/ID register
Previous instruction is in the ID/EX register
Second previous instruction is in the EX/MEM register
RAW Hazard Conditions:
    IF/ID.Rs = ID/EX.Rw
    IF/ID.Rt = ID/EX.Rw
RAW hazard detected with Previous Instruction
    IF/ID.Rs = EX/MEM.Rw
    IF/ID.Rt = EX/MEM.Rw
RAW hazard detected with Second Previous Instruction

Forwarding Unit
Forwarding unit generates ForwardA and ForwardB
That are used to control the two forwarding multiplexers
Uses Rs and Rt in IF/ID and Rw in ID/EX & EX/MEM
[Figure: pipelined datapath with the Forwarding Unit driving the ForwardA and ForwardB multiplexer selects]

Forwarding Control Signals

Control Signal    Explanation
ForwardA = 00     First ALU operand comes from the register file
ForwardA = 01     Forwarded from the previous ALU result
ForwardA = 10     Forwarded from data memory or 2nd previous ALU result
ForwardB = 00     Second ALU operand comes from the register file
ForwardB = 01     Forwarded from the previous ALU result
ForwardB = 10     Forwarded from data memory or 2nd previous ALU result

if     (IF/ID.Rs == ID/EX.Rw and ID/EX.Rw ≠ 0 and ID/EX.RegWrite)     ForwardA = 01
elseif (IF/ID.Rs == EX/MEM.Rw and EX/MEM.Rw ≠ 0 and EX/MEM.RegWrite)  ForwardA = 10
else                                                                  ForwardA = 00

if     (IF/ID.Rt == ID/EX.Rw and ID/EX.Rw ≠ 0 and ID/EX.RegWrite)     ForwardB = 01
elseif (IF/ID.Rt == EX/MEM.Rw and EX/MEM.Rw ≠ 0 and EX/MEM.RegWrite)  ForwardB = 10
else                                                                  ForwardB = 00
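The forwarding conditions can be written as one small function that is used for both operands. The Python sketch below is an illustration only; the name `forward_select` and its parameters are hypothetical, and register 0 is assumed to be the hardwired zero register that is never forwarded.

```python
def forward_select(src: int,
                   idex_rw: int, idex_regwrite: bool,
                   exmem_rw: int, exmem_regwrite: bool) -> str:
    """Return the 2-bit select for one forwarding multiplexer.

    src is the source register number in IF/ID (Rs for ForwardA,
    Rt for ForwardB). The previous instruction (in ID/EX) is checked
    first because it holds the most recent value of the register.
    """
    if src == idex_rw and idex_rw != 0 and idex_regwrite:
        return "01"  # previous ALU result
    if src == exmem_rw and exmem_rw != 0 and exmem_regwrite:
        return "10"  # data memory or 2nd previous ALU result
    return "00"      # register file


if __name__ == "__main__":
    # sub $8, $4, $7 is in IF/ID, add $7, $5, $6 is in ID/EX,
    # and lw $4, 100($9) is in EX/MEM.
    print("ForwardA =", forward_select(4, 7, True, 4, True))
    print("ForwardB =", forward_select(7, 7, True, 4, True))
```

When both earlier instructions write the same register, the first test wins, so the operand comes from the more recent of the two results.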


Forwarding Example
Instruction sequence:
    lw  $4, 100($9)
    add $7, $5, $6
    sub $8, $4, $7
When lw reaches the MEM stage
add will be in the ALU stage
sub will be in the Decode stage
ForwardA = 10: Forward data from MEM stage
ForwardB = 01: Forward ALU result from ALU stage
[Figure: pipelined datapath with lw $4,100($9) in the MEM stage, add $7,$5,$6 in the ALU stage, and sub $8,$4,$7 in the Decode stage]

Load Delay
Unfortunately, not all data hazards can be forwarded
Load has a delay that cannot be eliminated by forwarding
In the example shown below
The LW instruction does not have data until end of CC4
AND instruction wants data at beginning of CC4 - NOT possible
However, load can forward data to second next instruction
[Figure: program order against time in cycles (CC1 to CC8)]
    lw  $2, 20($1)
    and $4, $2, $5
    or  $6, $3, $2
    add $7, $2, $2

Detecting RAW Hazard after Load
Detecting a RAW hazard after a Load instruction:
The load instruction will be in the ID/EX register
Instruction that needs the load data will be in the IF/ID register
Condition for stalling the pipeline
    if ((ID/EX.MemRead == 1) and (ID/EX.Rw ≠ 0) and
        ((ID/EX.Rw == IF/ID.Rs) or (ID/EX.Rw == IF/ID.Rt))) Stall
Insert a bubble after the load instruction
Bubble is a no-op that wastes one clock cycle
Delays the instruction after load by one cycle
Because of RAW hazard
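The stall condition translates directly into a boolean function. This Python sketch is illustrative; the function and parameter names are hypothetical stand-ins for the ID/EX and IF/ID register fields.

```python
def load_use_stall(idex_memread: bool, idex_rw: int,
                   ifid_rs: int, ifid_rt: int) -> bool:
    """True when the instruction in IF/ID must wait one cycle.

    The instruction in ID/EX is a load (MemRead set) whose destination
    register Rw is a source register of the instruction being decoded.
    Register 0 never causes a stall.
    """
    return (idex_memread
            and idex_rw != 0
            and (idex_rw == ifid_rs or idex_rw == ifid_rt))


if __name__ == "__main__":
    # lw $2, 20($1) is in ID/EX; and $4, $2, $5 is in IF/ID.
    print(load_use_stall(True, 2, 2, 5))
    # Same load followed by an instruction that does not read $2.
    print(load_use_stall(True, 2, 3, 4))
```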


Stall the Pipeline for one Cycle
Freeze the PC and the IF/ID registers
No new instruction is fetched and instruction after load is stalled
Allow the Load instruction in ID/EX register to proceed
Introduce a bubble into the ID/EX register
Load can forward data to next instruction after delaying it
[Figure: program order against time in cycles (CC1 to CC8), with a bubble between the load and the instruction that uses its data]
    lw  $2, 20($1)
    and $4, $2, $5
    or  $6, $3, $2

Hazard Detection and Stall Unit
The pipeline is stalled by making PCWrite = 0 and IF/IDWrite = 0 and introducing a bubble into the ID/EX control signals
Bubble clears control signals
[Figure: pipelined datapath with the Forwarding, Hazard Detection, and Stall Unit, which takes MemRead as input and drives PCWrite, IF/IDWrite, ForwardA, ForwardB, and a Bubble select on a multiplexer that replaces the Main Control outputs with 0]

Compiler Scheduling

Compilers can schedule code in a way to avoid load stalls

Consider the following statements:
    a = b + c; d = e - f;
Slow code:
    lw  $10, ($1)        # $1 = addr b
    lw  $11, ($2)        # $2 = addr c
    add $12, $10, $11    # stall
    sw  $12, ($3)        # $3 = addr a
    lw  $13, ($4)        # $4 = addr e
    lw  $14, ($5)        # $5 = addr f
    sub $15, $13, $14    # stall
    sw  $15, ($6)        # $6 = addr d
Fast code: No Stalls
    lw  $10, 0($1)
    lw  $11, 0($2)
    lw  $13, 0($4)
    lw  $14, 0($5)
    add $12, $10, $11
    sw  $12, 0($3)
    sub $15, $13, $14
    sw  $15, 0($6)
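The effect of scheduling can be checked by counting load-use stalls in both instruction orders. The Python sketch below is a simplified model, not the actual hazard unit: each instruction is a hypothetical `(opcode, destination, sources)` tuple, and a stall is counted whenever an instruction reads a register loaded by the instruction immediately before it, which is the stall condition used in this pipeline.

```python
def count_load_stalls(program) -> int:
    """Count one-cycle stalls caused by a load followed by a dependent instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op, dest, _ = prev
        if op == "lw" and dest != 0 and dest in cur[2]:
            stalls += 1
    return stalls


# a = b + c; d = e - f; registers $1..$6 hold the addresses of b, c, a, e, f, d.
SLOW = [
    ("lw", 10, (1,)),
    ("lw", 11, (2,)),
    ("add", 12, (10, 11)),   # uses $11 right after its load
    ("sw", None, (12, 3)),
    ("lw", 13, (4,)),
    ("lw", 14, (5,)),
    ("sub", 15, (13, 14)),   # uses $14 right after its load
    ("sw", None, (15, 6)),
]

FAST = [
    ("lw", 10, (1,)),
    ("lw", 11, (2,)),
    ("lw", 13, (4,)),
    ("lw", 14, (5,)),
    ("add", 12, (10, 11)),
    ("sw", None, (12, 3)),
    ("sub", 15, (13, 14)),
    ("sw", None, (15, 6)),
]

if __name__ == "__main__":
    print("slow code stalls:", count_load_stalls(SLOW))
    print("fast code stalls:", count_load_stalls(FAST))
```

Both orders contain the same eight instructions; moving the four loads to the front separates each load from its first use by at least one instruction.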

Write After Read (WAR) Hazard
Instruction J should write its result after it is read by I
Called an anti-dependence by compiler writers
    I: sub $4, $1, $3    # $1 is read
    J: add $1, $2, $3    # $1 is written

Results from reuse of the name $1

Hazard occurs when J writes $1 before I reads it

Cannot occur in our basic 5-stage pipeline because:

Reads are always in stage 2, and
Writes are always in stage 5
Instructions are processed in order


Write After Write (WAW) Hazard
Instruction J should write its result after instruction I
Called an output-dependence in compiler terminology
    I: sub $1, $4, $3    # $1 is written
    J: add $1, $2, $3    # $1 is written again
This hazard also results from the reuse of name $1
Hazard occurs when writes occur in the wrong order
Can't happen in our basic 5-stage pipeline because:

All writes are ordered and always take place in stage 5

WAR and WAW hazards can occur in complex pipelines

Notice that Read After Read RAR is NOT a hazard
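The three kinds of dependence can be told apart by comparing the destination and source registers of two instructions. The following Python sketch is illustrative only; the `(destination, sources)` tuple encoding and the name `classify` are assumptions, and the function reports dependences between I and J rather than whether a particular pipeline turns them into hazards.

```python
def classify(i, j):
    """Return the set of dependences of J on an earlier instruction I.

    Each instruction is a (destination, sources) pair of register numbers.
    """
    i_dest, i_srcs = i
    j_dest, j_srcs = j
    found = set()
    if i_dest is not None and i_dest in j_srcs:
        found.add("RAW")  # J reads what I writes
    if j_dest is not None and j_dest in i_srcs:
        found.add("WAR")  # J writes what I reads
    if i_dest is not None and i_dest == j_dest:
        found.add("WAW")  # both write the same register
    return found


if __name__ == "__main__":
    # I: add $1, $2, $3   J: sub $4, $1, $3
    print(classify((1, (2, 3)), (4, (1, 3))))
    # I: sub $4, $1, $3   J: add $1, $2, $3
    print(classify((4, (1, 3)), (1, (2, 3))))
    # I: sub $1, $4, $3   J: add $1, $2, $3
    print(classify((1, (4, 3)), (1, (2, 3))))
```

Two instructions that only share a source register, as in Read After Read, produce an empty set.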


Control Hazards
Branch instructions can cause great performance loss
Branch instructions need two things:
Branch Result: Taken or Not Taken
Branch target:
    PC + 4                    If Branch is NOT taken
    PC + 4 + 4 × immediate    If Branch is Taken

Branch instruction is decoded in the ID stage

At which point a new instruction is already being fetched

For our pipeline: 2-cycle branch delay

Effective address is calculated in the ALU stage
Branch condition is determined by the ALU (zero flag)
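The two possible values of the next PC can be computed as below. This Python sketch assumes a byte-addressed 32-bit PC and a 16-bit immediate that is sign-extended before use, as in standard MIPS; the function names are illustrative.

```python
def sign_extend16(imm16: int) -> int:
    """Interpret the low 16 bits of imm16 as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc(pc: int, imm16: int, taken: bool) -> int:
    """PC + 4 if the branch is not taken, PC + 4 + 4 * immediate if taken."""
    if taken:
        return (pc + 4 + 4 * sign_extend16(imm16)) & 0xFFFFFFFF
    return (pc + 4) & 0xFFFFFFFF


if __name__ == "__main__":
    print(hex(next_pc(0x1000, 3, taken=False)))       # fall through
    print(hex(next_pc(0x1000, 3, taken=True)))        # forward branch
    print(hex(next_pc(0x1000, 0xFFFF, taken=True)))   # immediate of -1
```

An immediate of -1 branches back to the branch instruction itself, because the offset is counted in words from the instruction after the branch.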

Branch Delay = 2 Clock Cycles
By the time the branch instruction reaches the ALU stage, next1 instruction is in the decode stage and next2 instruction is being fetched
    label:
        lw  $8, ($7)
        . . .
        beq $5, $6, label
        next1
        next2
[Figure: pipelined datapath with beq $5,$6,label in the ALU stage (SUB, beq = 1, zero = 1), next1 in the decode stage, and next2 being fetched; PCSrc = 1 selects the Branch Target Address as the next PC]

2-Cycle Branch Delay
Next1 thru Next2 instructions will be fetched anyway
Pipeline should flush Next1 and Next2 if branch is taken
Otherwise, they can be executed if branch is not taken
[Figure: timing over cc1 to cc7 for beq $5,$6,label, Next1 (bubble), Next2 (bubble), and label: branch target instruction]

Reducing the Delay of Branches
Branch delay can be reduced from 2 cycles to just 1 cycle
Branches can be determined earlier in the Decode stage
Next PC logic block is moved to the ID stage
A comparator is added to the Next PC logic
To determine branch decision, whether the branch is taken or not
Only one instruction that follows the branch will be fetched
If the branch is taken then only one instruction is flushed
We need a control signal to reset the IF/ID register
This will convert the fetched instruction into a NOP

Modified Datapath
Next PC block is moved to the Instruction Decode stage
PCSrc signal resets the IF/ID register
Advantage: Branch and jump delay is reduced to one cycle
Drawback: Added delay in decode stage => longer cycle
[Figure: modified pipelined datapath with the Next PC block in the decode stage, taking NPC, Imm16, and Imm26 as inputs, and a reset input on the IF/ID register driven by PCSrc]

Details of Next PC
[Figure: Next PC logic with inputs NPC, Imm16 (through Ext), Imm26, the 4 most significant bits of the PC, beq, bne, and a comparator (=) on the forwarded BusA and BusB that produces zero; outputs are PCSrc and the Branch or Jump Target Address]

Branch Hazard Alternatives

Predict Branch Not Taken (modified datapath)
Successor instruction is already fetched
About half of MIPS branches are not taken on average
Flush instructions in pipeline only if branch is actually taken

Delayed Branch
Define branch to take place AFTER the next instruction
Compiler/assembler fills the branch delay slot (for 1 delay cycle)

Dynamic Branch Prediction
Can predict backward branches in loops: taken most of the time
However, branch target address is determined in ID stage
Must reduce branch delay from 1 cycle to 0, but how?

Delayed Branch
Define branch to take place after the next instruction
For a 1-cycle branch delay, we have one delay slot
    branch instruction
    branch delay slot    (next instruction)
    branch target        (if branch taken)
Compiler fills the branch delay slot
By selecting an independent instruction
From before the branch
If no independent instruction is found
Compiler fills delay slot with a NO-OP
Code before the delay slot is filled:
    label:
        . . .
        add $t2,$t3,$t4
        beq $s1,$s0,label
        Delay Slot
Code after the add is moved into the delay slot:
    label:
        . . .
        beq $s1,$s0,label
        add $t2,$t3,$t4

Zero-Delayed Branch

Disadvantages of delayed branch

Branch delay can increase to multiple cycles in deeper pipelines

Branch delay slots must be filled with useful instructions or no-op
How can we achieve zero-delay for a taken branch?
Branch target address is computed in the ID stage
Solution
Check the PC to see if the instruction being fetched is a branch
Store the branch target address in a branch buffer in the IF stage
If branch is predicted taken then
Next PC = branch target fetched from branch target buffer
Otherwise, if branch is predicted not taken then Next PC = PC+4


Branch Target and Prediction Buffer
The branch target buffer is implemented as a small cache
Stores the branch target address of recent branches
We must also have prediction bits
To predict whether branches are taken or not taken
The prediction bits are dynamically determined by the hardware
[Figure: Branch Target & Prediction Buffer. Low-order bits of the PC are used as index; each entry holds the address of a recent branch, its target address, and predict bits. A comparator (=) checks the PC against the stored branch address, and predict_taken controls a mux that selects between the incremented PC (Inc) and the target address]
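The lookup performed in the IF stage can be sketched in Python as follows. This is a simplified, direct-mapped model with a single prediction bit per entry; the class and method names, the table size of 16, and the replace-on-update policy are assumptions made for the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BTBEntry:
    branch_pc: int       # address of a recent branch (acts as the tag)
    target: int          # its branch target address
    predict_taken: bool  # 1-bit prediction


class BranchTargetBuffer:
    def __init__(self, entries: int = 16):
        self.entries = entries
        self.table: List[Optional[BTBEntry]] = [None] * entries

    def _index(self, pc: int) -> int:
        # Low-order bits of the word address select the entry.
        return (pc >> 2) % self.entries

    def next_pc(self, pc: int) -> int:
        """Next PC chosen during fetch, before the instruction is decoded."""
        entry = self.table[self._index(pc)]
        if entry is not None and entry.branch_pc == pc and entry.predict_taken:
            return entry.target
        return pc + 4

    def update(self, pc: int, taken: bool, target: int) -> None:
        """Record the actual outcome once the branch has been resolved."""
        self.table[self._index(pc)] = BTBEntry(pc, target, taken)


if __name__ == "__main__":
    btb = BranchTargetBuffer()
    print(hex(btb.next_pc(0x40)))   # unknown branch: fall through
    btb.update(0x40, taken=True, target=0x20)
    print(hex(btb.next_pc(0x40)))   # now predicted taken
```

The comparison against `branch_pc` matters because different addresses can map to the same entry; without it, a non-branch instruction could be redirected to another branch's target.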

Dynamic Branch Prediction

Prediction of branches at runtime using prediction bits
One or few prediction bits are associated with a branch instruction
Branch prediction buffer is a small memory
Indexed by the lower portion of the address of branch instruction
The simplest scheme is to have 1 prediction bit per branch
We don't know if the prediction bit is correct or not
If correct prediction
Continue normal execution: no wasted cycles
If incorrect prediction (misprediction)
Flush the instructions that were incorrectly fetched: wasted cycles
Update prediction bit and target address for future use

2-bit Prediction Scheme
Prediction is just a hint that is assumed to be correct
If incorrect then fetched instructions are flushed
1-bit prediction scheme has a performance shortcoming
A loop branch is almost always taken, except for last iteration
1-bit scheme will mispredict twice, on first and last loop iterations
2-bit prediction schemes work better and are often used
A prediction must be wrong twice before it is changed
A loop branch is mispredicted only once on the last iteration
[Figure: state diagram of the 2-bit scheme with two Predict Taken states and two Predict Not Taken states, connected by Taken and Not Taken transitions]
Pipeline Hazards Summary

Three types of pipeline hazards

Structural hazards: conflicts using a resource during same cycle

Data hazards: due to data dependencies between instructions
Control hazards: due to branch and jump instructions
Hazards limit the performance and complicate the design
Structural hazards: eliminated by careful design or more hardware
Data hazards are eliminated by forwarding
However, load delay cannot be eliminated and stalls the pipeline
Delayed branching can be a solution when branch delay = 1 cycle
Branch prediction can reduce branch delay to zero
Branch misprediction should flush the wrongly fetched instructions


No ratings yet
Computer Architecture: Pipelining: Dr. Ashok Kumar Turuk
136 pages
Introduction To MIPS Architecture
No ratings yet
Introduction To MIPS Architecture
10 pages