L04-Pipelining
CS 152/252A Computer Architecture and Engineering
Sophia Shao
Lecture 4 – Pipelining
Intel P5 and P6
CISC, RISC, or CRISC??
“Iron Law” of Processor Performance
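$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$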
§ Instructions per program depend on source code,
compiler technology, and ISA
§ Cycles per instruction (CPI) depend on ISA and
µarchitecture
§ Time per cycle depends on the µarchitecture and base
technology
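The Iron Law can be evaluated directly as a product of its three factors. The sketch below is only an illustration: the function name and the workload numbers are hypothetical, chosen to show how a shorter cycle can outweigh a higher CPI.

```python
def execution_time_ns(instructions: int, cpi: float, cycle_time_ns: float) -> float:
    """Iron Law: time/program = instructions/program * cycles/instruction * time/cycle."""
    return instructions * cpi * cycle_time_ns


# Hypothetical workload: 1 million instructions, CPI 1.5, 1 ns cycle.
baseline = execution_time_ns(1_000_000, 1.5, 1.0)

# Hypothetical deeper pipeline: shorter cycle, but more stalls raise CPI.
deeper = execution_time_ns(1_000_000, 1.8, 0.7)

print(f"baseline: {baseline:.0f} ns, deeper pipeline: {deeper:.0f} ns")
```

With these made-up numbers the deeper pipeline still wins, because the cycle time falls by more than the CPI rises.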
Classic 5-Stage RISC Pipeline
Fetch Decode EXecute Memory Writeback
[Figure: 5-stage datapath showing PC, instruction cache, instruction register, registers, immediate, A/B operand registers, ALU, data cache, and store data]
Types of Data Hazards
Consider executing a sequence of register-register
instructions of type:
rk ← ri op rj
Data-dependence: Read-after-Write (RAW) hazard
    r3 ← r1 op r2
    r5 ← r3 op r4
Anti-dependence: Write-after-Read (WAR) hazard
    r3 ← r1 op r2
    r1 ← r4 op r5
Output-dependence: Write-after-Write (WAW) hazard
    r3 ← r1 op r2
    r3 ← r6 op r7
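The three hazard types can be told apart by comparing the destination and source registers of two instructions in program order. The following is a minimal sketch; the `Instr` type and `hazards` function are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Set


class Instr(NamedTuple):
    """A register-register instruction: dest <- src1 op src2."""
    dest: str
    src1: str
    src2: str


def hazards(first: Instr, second: Instr) -> Set[str]:
    """Classify the dependences of `second` on the earlier `first`."""
    found = set()
    if first.dest in (second.src1, second.src2):
        found.add("RAW")  # second reads what first writes
    if second.dest in (first.src1, first.src2):
        found.add("WAR")  # second overwrites what first reads
    if second.dest == first.dest:
        found.add("WAW")  # both write the same register
    return found


producer = Instr("r3", "r1", "r2")
print(hazards(producer, Instr("r5", "r3", "r4")))  # RAW
print(hazards(producer, Instr("r1", "r4", "r5")))  # WAR
print(hazards(producer, Instr("r3", "r6", "r7")))  # WAW
```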
Three Strategies for Data Hazards
§ Interlock
– Wait for hazard to clear by holding dependent
instruction in issue stage
§ Bypass
– Resolve hazard earlier by bypassing value as soon as
available
§ Speculate
– Guess on value, correct if wrong
Interlocking Versus Bypassing
add x1, x3, x5
sub x2, x1, x4
Pipeline diagram (the dependent instruction is interlocked in the decode stage):
bubble            F D X M W
bubble              F D X M W
sub x2, x1, x4        F D X M W
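The cost of interlocking versus bypassing can be estimated by counting bubbles for a RAW dependence on an ALU result. This is a sketch under stated assumptions, not a description of any particular design: it assumes the register file is written in the first half of W and read in the second half of D, and that a bypass path forwards the ALU result into the next instruction's X stage. The function name is hypothetical.

```python
def raw_stall_cycles(distance: int, bypassed: bool) -> int:
    """Bubbles before a consumer issued `distance` instructions after an ALU producer.

    The producer is fetched in cycle 0, so it occupies F=0, D=1, X=2, M=3, W=4.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    if bypassed:
        earliest_cycle = 3  # consumer's X may start the cycle after producer's X
        stage_offset = 2    # consumer reaches X two cycles after its fetch
    else:
        earliest_cycle = 4  # consumer's D may overlap the producer's W
        stage_offset = 1    # consumer reaches D one cycle after its fetch
    unstalled_cycle = distance + stage_offset
    return max(0, earliest_cycle - unstalled_cycle)


# add x1, x3, x5 followed immediately by sub x2, x1, x4 (distance 1).
print(raw_stall_cycles(1, bypassed=False))  # two bubbles when interlocked
print(raw_stall_cycles(1, bypassed=True))   # none with the bypass path
```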
Example Bypass Path
Fetch Decode EXecute Memory Writeback
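[Figure: 5-stage datapath (PC, instruction cache, instruction register, registers, immediate, A/B operand registers, ALU, data cache, store data) with one example bypass path]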
Fully Bypassed Data Path
Fetch Decode EXecute Memory Writeback
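[Figure: fully bypassed 5-stage datapath (PC, instruction cache, instruction register, registers, immediate, A/B operand registers, ALU, data cache, store data), with four back-to-back instructions each flowing through F D X M W]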
Value Speculation for RAW Data Hazards
§ Rather than wait for value, can guess value!
Control Hazards
What do we need to calculate next PC?
Control flow information in pipeline
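Fetch Decode EXecute Memory Writeback
– Fetch: PC known
– Decode: opcode, offset known
– Execute: branch condition, jump register value known
[Figure: 5-stage datapath (PC, instruction cache, instruction register, registers, immediate, A/B operand registers, ALU, data cache, store data) annotated with the information above]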
RISC-V Unconditional PC-Relative Jumps
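[Figure: fetch and decode datapath for jumps. Signals: PCJumpSel, FKill, Jump?. Components: PC_fetch, PC_decode, +4 adder, target adder with immediate, instruction cache, instruction register with Kill bit, registers, A/B operand registers, ALU. The Kill bit turns the instruction into a bubble.]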
Pipelining for Unconditional PC-Relative Jumps
j target    F D X M W
bubble        F D X M W
Branch Delay Slots
§ Early RISCs adopted the idea from pipelined microcode
engines, and changed ISA semantics so that the instruction after
a branch/jump is always executed before the control flow
change occurs:
0x100 j target
0x104 add x1, x2, x3 // Executed before target
…
0x205 target: xori x1, x1, 7
§ Software has to fill delay slot with useful work, or fill with
explicit NOP instruction
j target    F D X M W
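Delay-slot semantics can be illustrated by tracing which addresses execute, in order. The sketch below is a simplified model: `execution_order` and the program encoding are hypothetical, instructions are assumed to be 4 bytes apart, and a jump placed inside a delay slot is ignored. The addresses follow the example above, with one extra hypothetical instruction at 0x108 that is never reached.

```python
def execution_order(program, start, delay_slots=1, limit=16):
    """Return the addresses executed, in order.

    `program` maps address -> (mnemonic, jump_target or None).
    """
    order = []
    pc = start
    pending_target = None  # jump target waiting for delay slots to drain
    slots_left = 0
    while pc in program and len(order) < limit:
        _mnemonic, target = program[pc]
        order.append(pc)
        next_pc = pc + 4
        if pending_target is not None:
            # This instruction sits in a delay slot of an earlier jump.
            slots_left -= 1
            if slots_left == 0:
                next_pc, pending_target = pending_target, None
        elif target is not None:
            if delay_slots == 0:
                next_pc = target  # no delay slot: redirect immediately
            else:
                pending_target, slots_left = target, delay_slots
        pc = next_pc
    return order


program = {
    0x100: ("j", 0x205),
    0x104: ("add", None),   # delay slot
    0x108: ("sub", None),   # hypothetical, skipped by the jump
    0x205: ("xori", None),  # jump target
}
print([hex(a) for a in execution_order(program, 0x100, delay_slots=1)])
print([hex(a) for a in execution_order(program, 0x100, delay_slots=0)])
```

With one delay slot the `add` at 0x104 executes before the target; with none, control goes straight from the jump to the target.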
Post-1990 RISC ISAs don’t have delay slots
§ Encodes microarchitectural detail into ISA
– cf. IBM 650 drum layout
§ Performance issues
– Increased I-cache misses from NOPs in unused delay slots
– I-cache miss on delay slot causes machine to wait, even if delay
slot is a NOP
§ Complicates more advanced microarchitectures
– Consider 30-stage pipeline with four-instruction-per-cycle issue
§ Better branch prediction reduced need
– Branch prediction in later lecture
RISC-V Conditional Branches
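[Figure: conditional-branch datapath. Signals: PCSel, Branch?, Cond?, FKill, DKill. Components: PC_fetch, PC_decode, PC_execute, +4 adder, two target adders, instruction cache, two instruction registers with Kill bits, registers, A/B operand registers, ALU.]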
Pipelining for Conditional Branches
bubble    F D X M W
bubble      F D X M W
Pipelining for Jump Register
§ Register value obtained in execute stage
jr x1     F D X M W
bubble      F D X M W
bubble        F D X M W
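The bubble counts for the three control-flow cases follow from the stage in which the next PC becomes known. The sketch below makes that relationship explicit; the names are hypothetical, and it assumes fetch continues sequentially at PC+4 until redirected, so a not-taken branch costs nothing.

```python
STAGES = ["F", "D", "X", "M", "W"]

# Stage in which the redirect target is known (hypothetical table).
RESOLVE_STAGE = {
    "j": "D",             # opcode and offset known in decode
    "branch_taken": "X",  # branch condition known in execute
    "jr": "X",            # register value obtained in execute
}


def control_bubbles(kind: str) -> int:
    """Number of wrongly fetched instructions that must be killed."""
    if kind == "branch_not_taken":
        return 0  # assumed: sequential fetch was already correct
    # One younger instruction enters the pipeline per stage the
    # control instruction advances before it resolves.
    return STAGES.index(RESOLVE_STAGE[kind])


for kind in ("j", "branch_taken", "jr", "branch_not_taken"):
    print(kind, control_bubbles(kind))
```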
Why an instruction may not be dispatched
every cycle in the classic 5-stage pipeline (CPI > 1)
§ Full bypassing may be too expensive to implement
– typically all frequently used paths are provided
– some infrequently used bypass paths may increase cycle time
and counteract the benefit of reducing CPI
§ Loads have two-cycle latency
– Instruction after load cannot use load result
– MIPS-I ISA defined load delay slots, a software-visible pipeline
hazard (compiler schedules independent instruction or inserts
NOP to avoid hazard). Removed in MIPS-II (pipeline interlocks
added in hardware)
• MIPS: “Microprocessor without Interlocked Pipeline Stages”
§ Jumps/Conditional branches may cause bubbles
– kill following instruction(s) if no delay slots
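The effect of these stall sources on CPI can be estimated by adding each event's frequency times its penalty to the ideal CPI of 1. The sketch below uses hypothetical instruction-mix fractions; the default penalties (one bubble per load-use pair and per jump, two per taken branch) assume a fully bypassed pipeline without delay slots.

```python
def estimated_cpi(load_use_frac: float, jump_frac: float, taken_branch_frac: float,
                  load_use_penalty: int = 1, jump_penalty: int = 1,
                  branch_penalty: int = 2) -> float:
    """Ideal CPI of 1 plus average stall cycles per instruction."""
    stalls = (load_use_frac * load_use_penalty
              + jump_frac * jump_penalty
              + taken_branch_frac * branch_penalty)
    return 1.0 + stalls


# Hypothetical mix: 10% of instructions use a load result immediately,
# 2% are jumps, and 10% are taken branches.
cpi = estimated_cpi(load_use_frac=0.10, jump_frac=0.02, taken_branch_frac=0.10)
print(f"estimated CPI: {cpi:.2f}")
```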
CS252 Administrivia
§ CS252 Readings on
– https://ucb-cs252-sp23.hotcrp.com/u/0/
– Use hotcrp to upload reviews before Wednesday:
• Write one paragraph on main content of paper including good/bad
points of paper
• Also, answer/ask 1-3 questions about paper for discussion
• First two “360 Architecture”, “VAX11-780”
– 2-3pm Wednesday, Soda 606/Zoom
§ CS252 Project Timeline
– Proposal Wed Feb 22
– One page in PDF format including:
• project title
• team members (2 per project)
• what problem are you trying to solve?
• what is your approach?
• infrastructure to be used
• timeline/milestones
Traps and Interrupts
In class, we’ll use the following terminology
§ Exception: An unusual internal event caused by
the program during execution
– E.g., page fault, arithmetic underflow
§ Interrupt: An external event outside of running
program
§ Trap: Forced transfer of control to supervisor
caused by exception or interrupt
– Not all exceptions cause traps (cf. IEEE 754 floating-point
standard)
History of Exception Handling
§ Analytical Engine had overflow exceptions
§ First system with traps was Univac-I, 1951
– Arithmetic overflow would either
• 1. trigger the execution of a two-instruction fix-up routine at
address 0, or
• 2. at the programmer's option, cause the computer to stop
– Later Univac 1103, 1955, modified to add external interrupts
• Used to gather real-time wind tunnel data
§ First system with I/O interrupts was DYSEAC, 1954
– Had two program counters, and I/O signal caused switch between
two PCs
– Also, first system with DMA (Direct Memory Access by I/O device)
– And, first mobile computer!
DYSEAC, first mobile computer!
Trap:
altering the normal flow of control
[Figure: program instructions I(i-1), I(i), I(i+1); a trap at I(i) transfers control to the trap handler instructions HI1, HI2, ..., HIn]
Trap Handler
§ Saves EPC before enabling interrupts to allow
nested interrupts ⇒
– need an instruction to move EPC into GPRs
– need a way to mask further interrupts at least until EPC can be
saved
§ Needs to read a status register that indicates the
cause of the trap
§ Uses a special indirect jump instruction ERET
(return-from-environment) which
– enables interrupts
– restores the processor to user mode
– restores hardware status and control state
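The handler requirements above can be modeled as two state transitions: trap entry and ERET. This is a simplified sketch, not an actual privileged-architecture specification. The `Cpu` fields, `HANDLER_PC`, and the cause code are hypothetical, and the model returns to the saved EPC without deciding whether the trapping instruction should be retried or skipped.

```python
from dataclasses import dataclass

HANDLER_PC = 0x8000  # hypothetical trap handler address


@dataclass
class Cpu:
    pc: int = 0
    epc: int = 0
    cause: int = 0
    interrupts_enabled: bool = True
    user_mode: bool = True


def take_trap(cpu: Cpu, cause: int) -> None:
    """Hardware side of trap entry."""
    cpu.epc = cpu.pc                # remember where to resume
    cpu.cause = cause               # status register read by the handler
    cpu.interrupts_enabled = False  # masked at least until EPC is saved
    cpu.user_mode = False           # forced transfer to supervisor
    cpu.pc = HANDLER_PC


def eret(cpu: Cpu) -> None:
    """Return from the handler: re-enable interrupts, restore user mode."""
    cpu.interrupts_enabled = True
    cpu.user_mode = True
    cpu.pc = cpu.epc


cpu = Cpu(pc=0x104)
take_trap(cpu, cause=2)   # hypothetical cause code
saved_epc = cpu.epc       # handler copies EPC before allowing nesting
eret(cpu)
print(hex(cpu.pc))
```

A handler that wants nested interrupts would copy `epc` to a general-purpose register or the stack, as `saved_epc` does here, before re-enabling interrupts.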
Synchronous Trap
§ A synchronous trap is caused by an exception on
a particular instruction
[Figure: 5-stage pipeline (PC, Inst. Mem, Decode, Execute, Data Mem, Writeback) with asynchronous interrupts as an external input]
Exception Handling 5-Stage Pipeline
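[Figure: 5-stage pipeline (PC, Inst. Mem, Decode, Execute, Data Mem, Writeback) with the commit point at the M stage. Labels: Cause and EPC registers; per-stage PCs (PC D, PC E, PC M); Kill F Stage, Kill D Stage, Kill E Stage, and Kill Writeback signals; Select Handler PC; asynchronous interrupts.]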
Exception Handling 5-Stage Pipeline
§ Hold exception flags in pipeline until commit
point (M stage)
Speculating on Exceptions
§ Prediction mechanism
– Exceptions are rare, so simply predicting no exceptions is very
accurate!
§ Check prediction mechanism
– Exceptions detected at end of instruction execution pipeline,
special hardware for various exception types
§ Recovery mechanism
– Only write architectural state at commit point, so can throw away
partially executed instructions after exception
– Launch exception handler after flushing pipeline
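The recovery mechanism can be sketched as an in-order commit loop: instructions carry their exception flags to the commit point, and the first flagged instruction in program order triggers the trap while everything younger is discarded. The function name and the exception labels are hypothetical.

```python
from typing import List, Optional, Tuple

InFlight = Tuple[int, Optional[str]]  # (pc, exception flag or None)


def commit_in_order(in_flight: List[InFlight]):
    """Commit instructions in program order until one carries an exception.

    Returns (committed_pcs, trap), where trap is (epc, cause) or None.
    """
    committed = []
    for pc, exception in in_flight:
        if exception is not None:
            # Architectural state is only written at commit, so this
            # instruction and all younger ones can simply be thrown away.
            return committed, (pc, exception)
        committed.append(pc)
    return committed, None


# The younger instruction at 0x108 detected its exception earlier in the
# pipeline, but the older page fault at 0x104 is the one that is taken.
pipeline = [(0x100, None), (0x104, "page_fault"), (0x108, "illegal_instruction")]
committed, trap = commit_in_order(pipeline)
print([hex(pc) for pc in committed], trap)
```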
Acknowledgements
§ These slides contain material developed and copyrighted by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)