Computer Architecture and Organization

The document discusses instruction pipelining in CPUs. It explains that pipelining allows for multiple instructions to be overlapped in execution stages to improve throughput. The stages include fetch, decode, operand fetch, execute, and store. Pipelining reduces latency for individual instructions but improves overall throughput by keeping the pipeline full. Hazards like structural hazards can occur if instructions compete for hardware resources and need to be addressed through techniques like stalling.

COMPUTER ARCHITECTURE AND ORGANIZATION

LECTURE TOPICS

 Pipelining
 Latency
 Throughput
 Instruction Pipelining in Detail
 Pipeline Stages
 Pipeline Hazards
   Structural Hazards
   Data Hazards
   Control Hazards
 Pipelining Branch Instructions
 Branch Alternatives
WHAT IS A PIPELINE?

 A conduit of pipe, especially one used for the conveyance of water, gas, or petroleum products.

 A serial arrangement of processors, or a serial arrangement of registers within a processor. Each processor or register performs part of a task and passes results to the next processor; several parts of different tasks can be performed at the same time.
LATENCY

 Time from the initiation of an operation until its results are available.

 Example: it takes 8 minutes from the time you enter a fast-food place until your food is served.
THROUGHPUT

 Rate at which something happens or gets done.

 Example: 2 people per minute get served per counter in the fast-food place.
LATENCY VS. THROUGHPUT

 One counter: L = 8 minutes until food is served; T = 2 people per minute.
 Adding counters leaves latency unchanged: each panel still shows L = 8 minutes until food is served.
 With more counters open: L = 8 minutes until food is served; T = 6 people per minute.
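The counter example can be captured in a short sketch (my own illustration, not from the lecture): adding counters multiplies throughput but leaves each customer's latency unchanged.

```python
# Hypothetical sketch of the fast-food example: more counters raise
# throughput, but each customer still waits the same 8 minutes.
def latency_min(counters):
    # Latency is per-customer; it does not depend on how many counters exist.
    return 8

def throughput_per_min(counters, per_counter=2):
    # Each counter serves 2 people per minute, and counters work in parallel.
    return counters * per_counter

print(latency_min(1), throughput_per_min(1))  # 8 2
print(latency_min(3), throughput_per_min(3))  # 8 6
```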
WHAT IS PIPELINING?

 An implementation technique whereby multiple instructions are overlapped in execution.

 Key to making fast CPUs today.

 As one instruction goes through a phase of the instruction cycle, the following instruction goes through the preceding phase.
WHAT IS PIPELINING?

 Four friends (A, B, C, D) each have one load of clothes to wash, dry, and fold.
 “Washer” takes 30 minutes
 “Dryer” takes 30 minutes
 “Folder” takes 30 minutes
 “Stasher” takes 30 minutes to put clothes into drawers
SEQUENTIAL LAUNDRY

[Figure: task-order chart from 6 PM to 2 AM; A, B, C, and D each run wash, dry, fold, and stash back to back, 30 minutes per step, with each person starting only after the previous one has finished.]

 Sequential laundry takes 8 hours for 4 loads.
 What if they implement pipelining?
PIPELINED LAUNDRY

[Figure: task-order chart from 6 PM to 9:30 PM; each person starts as soon as the washer is free, so the four tasks overlap in 30-minute slots.]

 Start the work ASAP!

 Pipelined laundry takes 3.5 hours for 4 loads!
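The laundry timings can be checked with a small sketch (mine, not from the slides), assuming n tasks flow through s equal 30-minute stages:

```python
def sequential_time(tasks, stages, stage_min=30):
    # Each task runs all stages to completion before the next task starts.
    return tasks * stages * stage_min

def pipelined_time(tasks, stages, stage_min=30):
    # The first task fills the pipeline; each later task finishes one
    # stage-slot after the previous one.
    return (stages + tasks - 1) * stage_min

print(sequential_time(4, 4) / 60)  # 8.0 hours, as on the sequential slide
print(pipelined_time(4, 4) / 60)   # 3.5 hours, as on this slide
```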
HARDWARE REQUIREMENTS

 An extra incrementer to update the PC more often (instead of the ALU).

 A separate MDR for loads (memory to CPU) and stores (CPU to memory).

 High memory bandwidth to accommodate more data to and from memory.
PIPELINING LESSONS 1

[Figure: pipelined laundry chart, 6 PM to 9:30 PM.]

 Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload.

 Multiple tasks operate simultaneously using different resources.

 Potential speedup = number of pipe stages.
PIPELINING LESSONS 2

[Figure: pipelined laundry chart, 6 PM to 9:30 PM.]

 Pipeline rate is limited by the slowest pipeline stage.

 Unbalanced lengths of pipe stages reduce speedup.

 Time to “fill” the pipeline and time to “drain” it reduce speedup.

 Stall for dependencies.
RECAP: INSTRUCTION CYCLE

        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
Load:     F        D        O        E        S

 Fetch - fetch the instruction from the memory

 Decode - decode the instruction

 Operand Fetch - get the necessary operands

 Execute - execute the instruction

 Store - store the result to the appropriate location
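The five-stage flow can be visualized with a tiny script (an illustration of mine, not lecture code) that prints which stage each instruction occupies in each cycle of an ideal pipeline:

```python
STAGES = "FDOES"  # Fetch, Decode, Operand fetch, Execute, Store

def stage_of(instr, cycle):
    # Instruction i (0-based) enters the pipeline in cycle i + 1, then
    # advances one stage per cycle; "-" means it is not in the pipeline.
    k = cycle - 1 - instr
    return STAGES[k] if 0 <= k < len(STAGES) else "-"

for i in range(4):
    print(f"instr {i}: " + " ".join(stage_of(i, c) for c in range(1, 9)))
# instr 0: F D O E S - - -
# instr 1: - F D O E S - -
# ...
```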


REVIEW: VISUALIZING PIPELINING

[Figure: time runs left to right over clock cycles 1-7; four instructions in program order each flow through Ifetch, Reg, ALU, DMem, Reg, with each instruction starting one cycle after the one before it.]
TIMING OF PIPELINE

NON-PIPELINED

  F D O E S F D O E S          --> time

PIPELINED

  F D O E S
    F D O E S
      F D O E S
        F D O E S              --> time

  F - Fetch
  D - Decode
  O - Operand Fetch
  E - Execute
  S - Result Store
THREE, FOUR & FIVE STAGE RISC PIPELINE

RISC II (3 stages):
  Fetch Instruction | Decode Inst. / Select regs. | Execute Inst. / Store result

SPARC MB86900, IBM 801 (4 stages):
  Fetch Instruction | Decode Inst. / Select regs. | Execute Inst. | Store result

MIPS, Intel 486 (5 stages):
  Fetch Instruction | Decode Inst. | Select regs. | Execute Inst. | Store result
THREE, FOUR & FIVE STAGE RISC PIPELINE

3-stage pipeline:
  stage  cc: 1  2  3  4  5  6  7
    1        a  b  c  d  e  f  g
    2        -  a  b  c  d  e  f
    3        -  -  a  b  c  d  e

4-stage pipeline:
  stage  cc: 1  2  3  4  5  6  7
    1        a  b  c  d  e  f  g
    2        -  a  b  c  d  e  f
    3        -  -  a  b  c  d  e
    4        -  -  -  a  b  c  d

5-stage pipeline:
  stage  cc: 1  2  3  4  5  6  7
    1        a  b  c  d  e  f  g
    2        -  a  b  c  d  e  f
    3        -  -  a  b  c  d  e
    4        -  -  -  a  b  c  d
    5        -  -  -  -  a  b  c
OTHER NUMBER OF STAGES

 INTEL
   Pentium I: 7 stages
   Pentium II/III: 12 stages
   Pentium 4: 22 stages
PIPELINE PERFORMANCE

 A greater number of stages can provide better performance. However:

  It increases the overhead of moving information between stages.

  The complexity of the CPU grows.

  It is difficult to keep a large pipeline at maximum rate because of pipeline hazards.
PIPELINE HAZARDS

 Structural Hazards
   Occur when a certain resource is requested by more than one instruction at the same time.

   Solutions:
    • Duplicate certain resources to avoid structural hazards.
    • Extend the clashing cycle by stopping the whole pipeline until both memory accesses are finished (stalling).
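The stalling solution can be sketched as follows (a simplified model of my own, assuming a single memory port, instruction fetch in cycle t, and a load's data access in cycle t + 3, as in the diagram on the next slide):

```python
# Sketch: schedule fetch cycles so at most one memory access (Ifetch or
# a load's DMem) uses the single port per cycle; stall a fetch otherwise.
def schedule(is_load):
    start, busy = [], set()              # busy = cycles the port is taken
    for load in is_load:
        t = (start[-1] + 1) if start else 1   # earliest possible fetch cycle
        # Delay until the fetch cycle and, for loads, the DMem cycle are free.
        while t in busy or (load and t + 3 in busy):
            t += 1
        busy.add(t)
        if load:
            busy.add(t + 3)              # reserve the port for the DMem access
        start.append(t)
    return start

# A load followed by three instructions: the fourth fetch would collide
# with the load's DMem access in cycle 4, so it stalls to cycle 5.
print(schedule([True, False, False, False]))   # [1, 2, 3, 5]
print(schedule([False, False, False, False]))  # [1, 2, 3, 4]
```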
EXAMPLE: ONE MEMORY PORT / STRUCTURAL HAZARD

[Figure: clock cycles 1-7; a Load is followed by Instr 1-4. In cycle 4 the Load’s DMem access and Instr 3’s Ifetch both need the single memory port: a structural hazard.]
RESOLVING STRUCTURAL HAZARDS

 Definition: an attempt to use the same hardware for two different things at the same time.

 Solution 1: Wait
   must detect the hazard
   must have a mechanism to stall

 Solution 2: Throw more hardware at the problem
DETECTING AND RESOLVING STRUCTURAL HAZARD

[Figure: the same instruction sequence, but Instr 3 is stalled for one cycle; a bubble flows down the pipeline and Instr 3’s Ifetch happens only after the Load’s DMem access has freed the memory port.]
ELIMINATING STRUCTURAL HAZARDS AT DESIGN TIME

[Figure: 5-stage datapath with a separate instruction cache and data cache, pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB, register file, ALU, sign extend, and a dedicated adder for the next PC; splitting instruction and data memory removes the single-memory-port hazard. Data path and control path are shown.]
ROLE OF INSTRUCTION SET DESIGN IN STRUCTURAL HAZARD RESOLUTION

 Simple to determine the sequence of resources used by an instruction
   the opcode tells it all

 Uniformity in the resource usage

 Compare MIPS to IA32?

 MIPS approach => all instructions flow through the same 5-stage pipeline
PIPELINE HAZARDS

 Data Hazards
   Occur when an instruction depends on the result of a previous instruction that has not yet terminated.

   Can be avoided by using a technique called “forwarding” or “bypassing”.
DATA HAZARDS

[Figure: clock cycles with stages IF, ID/RF, EX, MEM, WB; add r1,r2,r3 is followed by sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9, and xor r10,r1,r11; each of the later instructions needs r1 before add has written it back.]
THREE GENERIC DATA HAZARDS

Read After Write (RAW)
InstrJ tries to read an operand before InstrI writes it.

  I: add r1,r2,r3
  J: sub r4,r1,r3

 Caused by a “data dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
THREE GENERIC DATA HAZARDS

Write After Read (WAR)
InstrJ writes an operand before InstrI reads it.

  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7

 Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.

 Can’t happen in the MIPS 5-stage pipeline because:
   All instructions take 5 stages, and
   Reads are always in stage 2, and
   Writes are always in stage 5
THREE GENERIC DATA HAZARDS

Write After Write (WAW)
InstrJ writes an operand before InstrI writes it.

  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7

 Called an “output dependence” by compiler writers. This also results from reuse of the name “r1”.

 Can’t happen in the MIPS 5-stage pipeline because:
   All instructions take 5 stages, and
   Writes are always in stage 5

 Will see WAR and WAW in later, more complicated pipes.
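The three cases can be distinguished mechanically. Here is a small classifier of mine (not lecture code) applied to the register examples from these slides:

```python
# Classify the dependence between an earlier instruction I and a later
# instruction J, given each instruction's destination and source registers.
def hazards(I_dst, I_srcs, J_dst, J_srcs):
    found = []
    if I_dst in J_srcs:
        found.append("RAW")   # J reads what I writes
    if J_dst in I_srcs:
        found.append("WAR")   # J writes what I reads
    if I_dst == J_dst:
        found.append("WAW")   # both write the same register
    return found

# The lecture's examples:
print(hazards("r1", ["r2", "r3"], "r4", ["r1", "r3"]))  # ['RAW']
print(hazards("r4", ["r1", "r3"], "r1", ["r2", "r3"]))  # ['WAR']
print(hazards("r1", ["r4", "r3"], "r1", ["r2", "r3"]))  # ['WAW']
```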
FORWARDING TO AVOID DATA HAZARD

[Figure: the same add/sub/and/or/xor sequence; the result of add r1,r2,r3 is forwarded from the pipeline latches straight to the ALU inputs of the following instructions, so they no longer wait for r1 to be written back.]
HW CHANGE FOR FORWARDING

[Figure: datapath with multiplexers added in front of the ALU inputs, selecting among the ID/EX register values, the EX/MEM result, and the MEM/WR result.]
DATA HAZARD EVEN WITH FORWARDING

[Figure: lw r1, 0(r2) followed by sub r4,r1,r6, and r6,r1,r7, or r8,r1,r9; the loaded value is only available after lw’s DMem access, too late to forward to sub’s ALU stage in the very next cycle.]
RESOLVING THIS LOAD HAZARD

 Adding hardware? ... no, forwarding alone cannot remove it.

 Detection?

 Compilation techniques?

 What is the cost of load delays?
RESOLVING THE LOAD DATA HAZARD

[Figure: lw r1, 0(r2) followed by sub r4,r1,r6; a bubble is inserted so sub reaches its ALU stage one cycle later, after the loaded value is available; the and and or instructions behind it slip one cycle as well.]

How is this different from the instruction issue stall?
SOFTWARE SCHEDULING TO AVOID LOAD HAZARDS

Try producing fast code for

  a = b + c;
  d = e - f;

assuming a, b, c, d, e, and f are in memory.

Slow code:            Fast code:
  LW  Rb,b              LW  Rb,b
  LW  Rc,c              LW  Rc,c
  ADD Ra,Rb,Rc          LW  Re,e
  SW  a,Ra              ADD Ra,Rb,Rc
  LW  Re,e              LW  Rf,f
  LW  Rf,f              SW  a,Ra
  SUB Rd,Re,Rf          SUB Rd,Re,Rf
  SW  d,Rd              SW  d,Rd
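The effect of the reordering can be counted with a sketch of mine (assuming a 1-slot load-use delay, as in the MIPS discussion): an instruction stalls if it uses the result of the load immediately before it.

```python
# Instructions are (op, dest, sources). Count load-use stalls: a stall
# occurs when an instruction's source is the destination of the LW
# directly preceding it.
def load_use_stalls(code):
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        if prev[0] == "LW" and prev[1] in cur[2]:
            stalls += 1
    return stalls

slow = [("LW","Rb",[]), ("LW","Rc",[]), ("ADD","Ra",["Rb","Rc"]),
        ("SW","a",["Ra"]), ("LW","Re",[]), ("LW","Rf",[]),
        ("SUB","Rd",["Re","Rf"]), ("SW","d",["Rd"])]
fast = [("LW","Rb",[]), ("LW","Rc",[]), ("LW","Re",[]),
        ("ADD","Ra",["Rb","Rc"]), ("LW","Rf",[]), ("SW","a",["Ra"]),
        ("SUB","Rd",["Re","Rf"]), ("SW","d",["Rd"])]

print(load_use_stalls(slow), load_use_stalls(fast))  # 2 0
```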
INSTRUCTION SET CONNECTION

 What is exposed about this organizational hazard in the instruction set?

 k cycle delay?
   bad: CPI is not part of the ISA

 k instruction slot delay?
   a load should not be followed by a use of the value in the next k instructions

 Nothing, but code can reduce run-time delays
   MIPS did the transformation in the assembler
PIPELINE HAZARDS

 Control Hazards
   produced by branch instructions
   a decision made by a partially executed instruction affects the currently loading instruction
CONTROL HAZARD ON BRANCHES: THREE STAGE STALL

[Figure: 10: beq r1,r3,36 is followed by 14: and r2,r3,r5, 18: or r6,r1,r7, and 22: add r8,r1,r9 before the target 36: xor r10,r1,r11; the three instructions after the branch enter the pipeline before the branch outcome is known.]
EXAMPLE: BRANCH STALL IMPACT

 If 30% of instructions are branches, a 3-cycle stall per branch is significant.

 Two-part solution:
   Determine whether the branch is taken or not sooner, AND
   Compute the taken-branch address earlier

 A MIPS branch tests whether a register = 0 or ≠ 0.

 MIPS solution:
   Move the zero test to the ID/RF stage
   Add an adder to calculate the new PC in the ID/RF stage
   1 clock cycle penalty for a branch, versus 3
PIPELINED MIPS DATAPATH

[Figure: the five stages Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, and Write Back, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers; the zero test and the branch-target adder sit in the decode stage so the next PC is known early.]

• Data stationary control
  – local decode for each instruction phase / pipeline stage
FOUR BRANCH HAZARD ALTERNATIVES

 #1: Stall until the branch direction is clear

 #2: Predict Branch Not Taken
   Execute successor instructions in sequence
   “Squash” instructions in the pipeline if the branch is actually taken
   Takes advantage of the late pipeline state update
   47% of MIPS branches are not taken on average
   PC+4 is already calculated, so use it to get the next instruction

 #3: Predict Branch Taken
   53% of MIPS branches are taken on average
   But the branch target address hasn’t been calculated yet in MIPS
    • MIPS still incurs a 1-cycle branch penalty
    • Other machines: branch target known before outcome
FOUR BRANCH HAZARD ALTERNATIVES

 #4: Delayed Branch
   Define the branch to take place AFTER a following instruction

     branch instruction
     sequential successor 1
     sequential successor 2       <- branch delay of length n
     ........
     sequential successor n
     ........
     branch target if taken

   A 1-slot delay allows a proper decision and branch target address in the 5-stage pipeline
   MIPS uses this
DELAYED BRANCH

 Where to get instructions to fill the branch delay slot?
   From before the branch instruction
   From the target address: only valuable when the branch is taken
   From fall-through: only valuable when the branch is not taken
   Canceling branches allow more slots to be filled

 Compiler effectiveness for a single branch delay slot:
   Fills about 60% of branch delay slots
   About 80% of instructions executed in branch delay slots are useful in computation
   About 50% (60% x 80%) of slots are usefully filled

 Delayed branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
RECALL: SPEEDUP EQUATION FOR PIPELINING
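The equation itself was lost in the conversion of this slide; the standard form (from Hennessy and Patterson, which this lecture appears to follow, assuming an ideal CPI of 1 and equal clock cycles) is:

```latex
% Standard pipeline speedup equation (assumed; the original slide body
% was lost). Stalls reduce the speedup from the full pipeline depth:
\[
\text{Speedup} \;=\;
  \frac{\text{Pipeline depth}}
       {1 + \text{Pipeline stall cycles per instruction}}
\]
```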
EXAMPLE: EVALUATING BRANCH ALTERNATIVES

 Assume: conditional & unconditional branches = 14% of instructions; 65% of branches change the PC.

Scheduling scheme    Branch penalty   CPI    Speedup v. stall
Stall pipeline             3          1.42        1.0
Predict taken              1          1.14        1.26
Predict not taken          1          1.09        1.29
Delayed branch             0.5        1.07        1.31
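The CPI column can be reproduced as CPI = 1 + branch frequency x effective penalty. This sketch is mine, with one inferred assumption: for "predict not taken" the penalty is paid only by the 65% of branches that actually change the PC (that is what makes the 1.09 come out).

```python
BRANCH_FRAC, TAKEN_FRAC = 0.14, 0.65

def cpi(penalty, only_when_taken=False):
    # Base CPI of 1 plus the stall cycles contributed by branches.
    frac = BRANCH_FRAC * (TAKEN_FRAC if only_when_taken else 1.0)
    return 1.0 + frac * penalty

print(round(cpi(3), 2))                        # 1.42  stall pipeline
print(round(cpi(1), 2))                        # 1.14  predict taken
print(round(cpi(1, only_when_taken=True), 2))  # 1.09  predict not taken
print(round(cpi(0.5), 2))                      # 1.07  delayed branch
```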
END OF LECTURE…
