Introduction to Pipelining
Pipelining Works in Daily Life!
Laundry Example
• Four people (A, B, C, D) each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• Folder takes 20 minutes
Sequential Laundry
[Figure: timing diagram, 6 PM to midnight — loads A, B, C, D run one after another; each load takes 30 (wash) + 40 (dry) + 20 (fold) = 90 minutes]
Sequential laundry takes 6 hours for 4 loads.
If they learned pipelining, how long would the laundry take?
Pipelined Laundry
[Figure: timing diagram, 6 PM to midnight — load A starts washing at 6 PM; as soon as the washer frees, B starts, and so on; the 40-minute dryer sets the pace (30, 40, 40, 40, 40, 20)]
Pipelined laundry takes 3.5 hours for 4 loads.
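The two totals above can be checked with a short calculation (a sketch using the stage times from the example; the pipelined formula assumes the 40-minute dryer is the bottleneck stage):

```python
# Stage times from the laundry example (minutes).
WASH, DRY, FOLD = 30, 40, 20

def sequential_time(loads):
    # Each load runs start-to-finish before the next begins.
    return loads * (WASH + DRY + FOLD)

def pipelined_time(loads):
    # After the first wash, progress is limited by the slowest
    # stage (the 40-minute dryer); the last fold finishes the job.
    slowest = max(WASH, DRY, FOLD)
    return WASH + loads * slowest + FOLD

print(sequential_time(4) / 60)  # 6.0 hours
print(pipelined_time(4) / 60)   # 3.5 hours
```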
Pipelining Lessons
[Figure: pipelined laundry timing diagram repeated, with the pipeline's "filling" and "draining" phases marked]
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to "fill" the pipeline and time to "drain" it reduce speedup
What is pipelining?
• A pipeline is a series of stages, where some work is done at each stage.
• It is an implementation technique whereby multiple instructions are overlapped in execution.
• It is an implementation technique that exploits parallelism among the instructions in a sequential instruction stream.
• A pipelined processor consists of a sequence of processing circuits, called segments or stages, through which a stream of operands can be passed.
Why is pipelining desirable?
• It improves performance beyond what can be achieved with non-pipelined processing.
• It yields a reduction in the average execution time per instruction.
Pipeline Designer's Goal
• To balance the length of each pipeline stage
• If the stages are well balanced, then the time per instruction on the pipelined machine is

  T = (time per instruction on the unpipelined machine) / (number of pipe stages)

• The speedup from pipelining then equals the number of pipe stages.
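As a numeric sketch of the formula above (the 10 ns instruction time and 5 stages are illustrative values, not from the slides):

```python
def pipelined_time_per_instr(unpipelined_time, stages):
    # Ideal case from the slide: perfectly balanced stages, no overhead.
    return unpipelined_time / stages

# A 10 ns unpipelined instruction split into 5 balanced stages:
t = pipelined_time_per_instr(10.0, 5)
speedup = 10.0 / t
print(t, speedup)  # 2.0 ns per instruction, speedup 5 = number of stages
```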
Types of Pipelines
• Instruction Pipeline
  – The different stages of instruction fetch and execution are handled in a pipeline.
• Arithmetic Pipeline
  – The different stages of an arithmetic operation are handled along the stages of the pipeline.
Instruction Pipeline
Contd…
• The processing of an instruction need not be divided into only two steps. To gain further speedup, the pipeline must have more stages.
• Consider the following decomposition of instruction execution:
  – Fetch Instruction (FI): Read the next expected instruction into a buffer.
  – Decode Instruction (DI): Determine the opcode and the operand specifiers.
  – Calculate Operands (CO): Calculate the effective address of each source operand.
  – Fetch Operands (FO): Fetch each operand from memory.
  – Execute Instruction (EI): Perform the indicated operation.
  – Write Operand (WO): Store the result in memory.
The timing diagram …
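The timing diagram for the six-stage decomposition can be printed with a short sketch (an assumed ideal pipeline, one stage per clock cycle, no stalls):

```python
# Stage names from the six-stage decomposition above.
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]

def timing_diagram(n_instructions):
    # Instruction i (0-based) occupies stage s in clock cycle i + s + 1,
    # so n instructions finish in (stages + n - 1) cycles.
    total_cycles = len(STAGES) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        row = ["  "] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name
        rows.append("I%d: %s" % (i + 1, " ".join(row)))
    return rows

for line in timing_diagram(4):
    print(line)
# With 6 stages, 4 instructions complete in 6 + 4 - 1 = 9 cycles.
```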
Implementing Pipelining Using DLX
5 Steps of DLX Instr. Execution: Step 1
Step 1: Instruction fetch cycle (IF)
– Read the instruction from memory and store it into IR
  • IR ← Mem[PC]
– Calculate the next instruction address
  • NPC ← PC + 4
  • Each instruction occupies 4 consecutive bytes
[Figure: IF-stage datapath — PC addresses instruction memory, whose output is latched into IR; an adder computes NPC = PC + 4]
5 Steps of DLX Instr. Execution: Step 2
Step 2: Instruction decode/register fetch cycle (ID)
– Read the source registers into A and B
  • A ← Regs[IR6..10]
  • B ← Regs[IR11..15]
– Sign-extend the 16-bit immediate field to a 32-bit value
  • Imm ← ((IR16)^16 ## IR16..31)
– Latch the destination register address: b ← Rd
– Decoding is done in parallel with the register read (fixed-field decoding)
[Figure: ID-stage datapath — the register file outputs A and B; a sign-extend unit widens the 16-bit immediate field to the 32-bit Imm; Rd is latched into b and the opcode into OP]
5 Steps of DLX Instr. Execution: Step 3
Step 3: Execution/effective address cycle (EX)
– Memory reference: effective address calculation
  » ALUOutput ← A + Imm
– Register-register ALU instruction: perform the ALU operation on the registers
  » ALUOutput ← A func B
– Register-immediate ALU instruction: perform the ALU operation with the immediate operand
  » ALUOutput ← A op Imm
– Branch: effective address calculation for the branch target address, and determine the condition
  » ALUOutput ← NPC + Imm; Cond ← (A op 0)
[Figure: EX-stage datapath — MUXes select NPC/A and B/Imm as ALU inputs under control of OP; the ALU produces ALUOut, and a zero-test unit produces Cond]
5 Steps of DLX Instr. Execution: Step 4
Step 4: Memory access/branch completion cycle (MEM)
– Memory reference: access memory
  • for a load: LMD ← Mem[ALUOutput]
  • for a store: Mem[ALUOutput] ← B
– Branch: test the condition
  • if (cond) PC ← ALUOutput; else PC ← NPC
[Figure: MEM-stage datapath — a Cond-controlled MUX selects ALUOut or NPC as the new PC; data memory is read into LMD for loads, or written from B for stores]
5 Steps of DLX Instr. Execution: Step 5
Step 5: Write-back cycle (WB)
– Register-register ALU: store the result into the destination register
  • Regs[IR16..20] ← ALUOutput
– Register-immediate ALU: store the result into the destination register
  • Regs[IR11..15] ← ALUOutput
– Load instruction: store the data read from memory into the destination register
  • Regs[IR11..15] ← LMD
[Figure: WB-stage datapath — a MUX controlled by OP selects LMD or ALUOut and writes it into the register file]
5 Steps of DLX Datapath
[Figure: complete DLX datapath with the IF, ID, EX, MEM, and WB stages laid out left to right — PC, instruction memory, and +4 adder (IF); register file and 16→32 sign-extend unit (ID); input MUXes, ALU, and zero tester (EX); data memory with LMD load output and SMD store input (MEM); write-back MUX into the register file (WB)]
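The five steps above can be sketched as straight-line code for a single register-register ADD (a minimal sketch: instructions are tuples rather than encoded 32-bit words, so the slides' IR field extractions such as IR6..10 become named tuple fields, and only ADD is handled):

```python
def step_instruction(state):
    # Step 1 (IF): fetch the instruction and compute NPC = PC + 4.
    op, rs1, rs2, rd = state["imem"][state["PC"] // 4]
    npc = state["PC"] + 4
    # Step 2 (ID): read the source registers into A and B.
    a, b = state["regs"][rs1], state["regs"][rs2]
    # Step 3 (EX): perform the ALU operation.
    assert op == "ADD"  # only reg-reg ADD in this sketch
    alu_output = a + b
    # Step 4 (MEM): no memory access for a reg-reg ALU instruction.
    # Step 5 (WB): write ALUOutput into the destination register.
    state["regs"][rd] = alu_output
    state["PC"] = npc

# ADD R3, R1, R2 with R1 = 5, R2 = 7:
state = {"PC": 0, "imem": [("ADD", 1, 2, 3)], "regs": [0, 5, 7, 0]}
step_instruction(state)
print(state["regs"][3], state["PC"])  # 12 4
```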
A Simple Implementation
• A multi-cycle implementation
  – needs temporary registers: NPC, IR, A, B, Imm, Cond, ALUOutput, LMD
  – allows CPI improvements
• A single-cycle implementation
  – one long clock cycle
  – very inefficient for most machines, which have a reasonable variation in the amount of work per instruction
  – requires duplication of functional units that could be shared in a multi-cycle implementation
Visualizing the Pipeline
[Figure: pipeline diagram over clock cycles CC1–CC9 — five instructions, in instruction order, each flowing through IM (instruction memory), Reg (register read), ALU, DM (data memory), and Reg (write-back); the leading diagonal shows the pipeline filling, the trailing diagonal shows it draining]
Saving Information Produced by Each Stage of the Pipeline
• Information needs to be stored at the end of each clock cycle; otherwise it is lost.
• Each pipeline stage produces information (data, address, and control) at the end of the clock cycle.
• Thus, we need storage (called an inter-stage buffer) at the end of each pipeline stage.
Inter-Stage Buffers in the DLX Pipeline
• F/D Buffer
  – IR, NPC
• D/EX Buffer
  – A, B, Imm, b (destination register address for storing the result), OP (opcode), cond
  – NPC
• EX/M Buffer
  – ALUout (arithmetic result or effective address)
  – NPC, cond, b, OP
• M/W Buffer
  – LMD (data for a load)
  – ALUout (arithmetic result), b, OP
Pipelined DLX Datapath – Multicycle
[Figure: the same five-stage DLX datapath, now with the F/D, D/EX, EX/M, and M/W inter-stage buffers inserted between the IF, ID, EX, MEM, and WB stages]
Basic Performance Issues
• Pipelining increases the CPU instruction throughput.
• It does not reduce the execution time of an individual instruction; rather, it slightly increases it, due to the overhead in the control of the pipeline.
• The increase in throughput means that a program runs faster and has a lower total execution time.
• Imbalance among the pipe stages reduces performance, since the clock can run no faster than the time needed for the slowest pipeline stage.
• Pipeline overhead arises from pipeline register delay and clock skew.
Contd…
• Buffering between stages marginally increases the cycle time.
• Hazards increase the CPI.
What is a Hazard?
– A hazard is a situation in which the pipeline operation stalls (stops) for one or more clock cycles.
– It prevents the next instruction from executing during its designated clock cycle.
Pipeline Hazards
• There are three classes of hazards:
  – Structural
    • Arise from simultaneous requests for the same resource by two or more instructions
      – E.g., IF and MEM both require a memory port.
  – Data
    • An instruction depends on the result of a prior instruction still in the pipeline.
  – Control
    • Arise from branch and jump instructions.
Structural Hazard
Data Hazard
Data Hazard Solutions
• Types:
  – Interlock: hardware detects the data dependency and stalls the dependent instructions.
Contd…
– Forwarding (or bypassing): forward the result to the EX stage as soon as it is available.
Contd…
• Instruction Scheduling: reorder instructions such that dependent instructions are 2–3 cycles apart.
  – Useful for covering load delays and branch delays
  – Useful for hiding delays due to long-latency FP operations
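The interlock-versus-forwarding tradeoff can be sketched by counting stall cycles (a simplified 5-stage model, with the common textbook assumption that the register file writes in the first half of WB and reads in the second half of ID):

```python
def stalls(producer_is_load, distance, forwarding):
    # distance = how many instructions apart the producer and consumer are.
    if forwarding:
        # ALU results are bypassed straight into EX; only a load followed
        # immediately by its consumer still needs one stall cycle.
        return 1 if (producer_is_load and distance == 1) else 0
    # Without forwarding the consumer's ID must wait for the producer's
    # WB: 2 stalls when adjacent, 1 at distance 2, none at distance >= 3.
    return max(0, 3 - distance)

# ADD R1,R2,R3 followed immediately by SUB R4,R1,R5:
print(stalls(False, 1, forwarding=False))  # 2
print(stalls(False, 1, forwarding=True))   # 0
# LW R1,0(R2) followed immediately by a use of R1:
print(stalls(True, 1, forwarding=True))    # 1
```

This also shows why scheduling dependent instructions 2–3 cycles apart helps: at distance 3 the stall count drops to zero even without forwarding.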
Data Hazard Classification
• True data dependency – read after write (RAW)
• Anti-dependency – write after read (WAR)
• Output dependency – write after write (WAW)
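The three classes can be sketched as a small classifier over register numbers (a sketch; instructions are represented simply as a destination register plus a set of source registers, with `instr_a` earlier in program order):

```python
def classify(instr_a, instr_b):
    # Each instruction is (destination_register, source_registers).
    dest_a, srcs_a = instr_a
    dest_b, srcs_b = instr_b
    kinds = []
    if dest_a in srcs_b:
        kinds.append("RAW")   # true dependency: b reads what a wrote
    if dest_b in srcs_a:
        kinds.append("WAR")   # anti-dependency: b overwrites a's source
    if dest_a == dest_b:
        kinds.append("WAW")   # output dependency: both write the same reg
    return kinds

# ADD R1,R2,R3 then SUB R4,R1,R5 -> b reads the R1 written by a:
print(classify((1, {2, 3}), (4, {1, 5})))  # ['RAW']
# ADD R1,R2,R3 then SUB R2,R4,R5 -> b overwrites a's source R2:
print(classify((1, {2, 3}), (2, {4, 5})))  # ['WAR']
```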
Control Hazard
Next class…