COA 2013 Chapter 4 The Processor
The Processor
Jiang Jiang
[email protected]
[Adapted from Computer Organization and Design,
4th Edition, Patterson & Hennessy, © 2008, MK]
[Figure: combinational elements, a multiplexer (inputs I0, I1, select S, output Y) and an ALU (inputs A, B, function F, output Y)]
- Multiplexer
  - Y = S ? I1 : I0
- Arithmetic/Logic Unit
  - Y = F(A, B)
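The two combinational elements can be modeled in a few lines. This is a minimal sketch, not the slide author's code: the function names and the choice of ALU operations are assumptions made for illustration, and values are treated as 32-bit words.

```python
MASK32 = 0xFFFFFFFF  # model 32-bit wires


def mux(s: int, i0: int, i1: int) -> int:
    """2-to-1 multiplexer: Y = S ? I1 : I0."""
    return i1 if s else i0


# The function input F is modeled as a key into a table of operations.
# The operation set here is illustrative only.
ALU_OPS = {
    "add": lambda a, b: (a + b) & MASK32,
    "sub": lambda a, b: (a - b) & MASK32,
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
}


def alu(f: str, a: int, b: int) -> int:
    """Arithmetic/Logic Unit: Y = F(A, B)."""
    return ALU_OPS[f](a, b)
```

Both elements are pure functions of their inputs, which is what distinguishes them from the clocked state elements.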
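[Figure: register with input D and output Q, clocked by Clk, with timing diagram; and a register with an additional Write input, with timing diagram]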
[Figure: datapath elements. Annotations: increment by 4 for next instruction (PC+4); 32-bit register; 32-bit buses; sign-bit wire replicated for the immediate/offset; control questions "ld?" and "ld/st?"]
Instruction formats (bit fields):

             31:26     25:21  20:16  15:0
Load/Store   35 or 43  rs     rt     address
Branch       4         rs     rt     address
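The field boundaries above translate directly into shifts and masks. The sketch below is illustrative: the function names are hypothetical, and the sign-extension helper is an assumption based on the "sign-bit wire replicated" annotation, returning a signed Python integer rather than a 32-bit pattern.

```python
def decode_fields(word: int) -> dict:
    """Split a 32-bit load/store/branch instruction into its bit fields."""
    return {
        "opcode": (word >> 26) & 0x3F,   # bits 31:26
        "rs": (word >> 21) & 0x1F,       # bits 25:21
        "rt": (word >> 16) & 0x1F,       # bits 20:16
        "address": word & 0xFFFF,        # bits 15:0
    }


def sign_extend16(value: int) -> int:
    """Interpret a 16-bit field as signed (bit 15 is the sign bit)."""
    return value - 0x10000 if value & 0x8000 else value


# Example word: a load (opcode 35) with rs = 9, rt = 8, offset = -4
word = (35 << 26) | (9 << 21) | (8 << 16) | 0xFFFC
fields = decode_fields(word)
offset = sign_extend16(fields["address"])
```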
[Figure: datapath with control. Annotations: opcode, funct, data of st (store data), "ld?", "ld/st?"]
- Four loads:
  - Speedup = 8/3.5 = 2.3
- Non-stop:
  - Speedup = 2n/(1.5 + 0.5n) ≈ 4 = number of stages
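The two speedup figures follow from one formula. The sketch below assumes what the numbers imply but the slide does not state outright: each load passes through 4 stages of 0.5 hours each, so n loads take 2n hours sequentially and 1.5 + 0.5n hours when pipelined. The function name is hypothetical.

```python
def pipeline_speedup(n: int, stages: int = 4, stage_hours: float = 0.5) -> float:
    """Speedup of a pipelined run of n loads over a sequential run."""
    sequential = n * stages * stage_hours        # 2n hours with the defaults
    pipelined = (stages + n - 1) * stage_hours   # 1.5 + 0.5n hours with the defaults
    return sequential / pipelined


four_loads = pipeline_speedup(4)        # 8 / 3.5
many_loads = pipeline_speedup(10**6)    # approaches the number of stages
```

As n grows, the fill time of the pipeline is amortized and the speedup approaches the stage count.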
Chapter 4 — The Processor — 35
Parallelism
- The most fundamental way to improve performance
- Two types of parallelism
  - Temporal parallelism
    - Pipeline
    - Less work per stage ⇒ shorter clock cycle
    - Larger penalty for control hazards and interrupts
  - Spatial parallelism
    - Multiple function units, multiple cores
    - Replicate resources, multiple issue, multicore
Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
                       (decode)
lw        200 ps       100 ps         200 ps  200 ps         100 ps          800 ps
sw        200 ps       100 ps         200 ps  200 ps                         700 ps
R-format  200 ps       100 ps         200 ps                 100 ps          600 ps
beq       200 ps       100 ps         200 ps                                 500 ps
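The table can be turned into clock periods with two `max` operations. This is a sketch under simplifying assumptions that the table itself does not state: a single-cycle design must fit the slowest instruction in one cycle, a pipelined design must fit the slowest stage, and pipeline-register overhead is ignored. Stages an instruction does not use are entered as 0.

```python
# Stage latencies in ps: fetch, register read, ALU op, memory access, register write
LATENCY_PS = {
    "lw":       [200, 100, 200, 200, 100],
    "sw":       [200, 100, 200, 200, 0],
    "R-format": [200, 100, 200, 0, 100],
    "beq":      [200, 100, 200, 0, 0],
}

# Single-cycle: the clock must cover the longest total instruction time.
single_cycle_ps = max(sum(stages) for stages in LATENCY_PS.values())

# Pipelined: the clock must cover the slowest individual stage.
pipelined_ps = max(max(stages) for stages in LATENCY_PS.values())

# Ratio of clock periods, which bounds the steady-state speedup.
clock_ratio = single_cycle_ps / pipelined_ps
```

The ratio is 4 rather than 5 even though there are five stages, because the stages are not balanced.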
[Figure: pipeline diagram of a data hazard without forwarding. 5-2=3, 2 bubbles!]
Forwarding (aka Bypassing)
- Use result when it is computed
  - Don't wait for it to be stored in a register
  - Requires extra connections in the datapath
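[Figure: forwarding. 4-3=1, no stall! No bubble!]
[Figure: load followed by a dependent instruction. 5-3=2, 1 bubble!]
[Figure: pipeline contents by stage when a branch is fetched]
        IF    ID    EX
...     (I3)  I2    I1
...     (?)   I2?   beq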
Stall on Branch
- Wait until the branch outcome is determined before fetching the next instruction
  - 1 bubble when the outcome is determined in ID
- Is no stall possible? Yes, with prediction in IF
[Figure: stall on branch. 3-1=2, 1 bubble!]
[Figure: prediction correct (reward). 2-1=1, no bubble!]
[Figure: prediction incorrect (penalty). The fetched lw $3, 300($0) is on the wrong path: flush the lw instruction]
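The annotations on these slides (5-2=3, 4-3=1, 5-3=2, 3-1=2, 2-1=1) appear to follow one rule: subtract the stage where a value is needed from the stage where it becomes available, and one less than that difference is the number of bubbles. The sketch below encodes that reading; the rule is inferred from the annotations, not stated by the slides, and the stage numbering (IF = 1 through WB = 5) and the function name are assumptions.

```python
def bubbles(available_stage: int, needed_stage: int) -> int:
    """Bubbles between a producer and a consumer, per the slide annotations."""
    return max(0, available_stage - needed_stage - 1)


CASES = {
    "register write, read in ID, no forwarding": bubbles(5, 2),
    "ALU result forwarded to EX": bubbles(4, 3),
    "load data forwarded to EX": bubbles(5, 3),
    "branch resolved in ID, fetch in IF": bubbles(3, 1),
    "branch predicted in IF": bubbles(2, 1),
}
```

A difference of 1 is the normal spacing of back-to-back instructions, which is why it costs nothing.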
[Figure: pipelined datapath. Right-to-left flow (e.g., WB) leads to hazards]
[Figure series: pipelined datapath stepped through its stages, with numbered timing marks (1 to 6, including 5.5). One figure is annotated "Wrong register number"; the following figure labels the write register field Instruction[20-16]]
[Figure: pipelined datapath, labeled "Temporal parallelism". Annotations: st data, "ld?", "ld/st?", 32-bit bus]
Pipelined Control
- Control signals derived from instruction
  - As in single-cycle implementation
[Figure: pipelined control. The control unit takes the opcode field, Instruction[31-26]; the Load/Store and Branch formats are as shown earlier. Timing marks: 5, 5.5, 6, 6.5]
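One way to picture pipelined control is a table lookup in ID followed by handing the signals down the pipeline, each stage consuming its own group. The sketch below is an assumption-laden illustration: the signal names, their values, the EX/MEM/WB grouping, and opcode 0 for R-format follow the textbook's single-cycle design rather than anything listed on this slide, and don't-care values are written as 0.

```python
# Control values per opcode (35 = load, 43 = store, 4 = branch, 0 = R-format)
CONTROL = {
    0:  dict(RegDst=1, ALUOp=0b10, ALUSrc=0, Branch=0, MemRead=0, MemWrite=0, RegWrite=1, MemtoReg=0),
    35: dict(RegDst=0, ALUOp=0b00, ALUSrc=1, Branch=0, MemRead=1, MemWrite=0, RegWrite=1, MemtoReg=1),
    43: dict(RegDst=0, ALUOp=0b00, ALUSrc=1, Branch=0, MemRead=0, MemWrite=1, RegWrite=0, MemtoReg=0),
    4:  dict(RegDst=0, ALUOp=0b01, ALUSrc=0, Branch=1, MemRead=0, MemWrite=0, RegWrite=0, MemtoReg=0),
}

# Which stage consumes which signals
STAGE_SIGNALS = {
    "EX": ("RegDst", "ALUOp", "ALUSrc"),
    "MEM": ("Branch", "MemRead", "MemWrite"),
    "WB": ("RegWrite", "MemtoReg"),
}


def decode_control(instruction: int) -> dict:
    """Derive all control signals in ID from Instruction[31-26], grouped by stage."""
    signals = CONTROL[(instruction >> 26) & 0x3F]
    return {stage: {name: signals[name] for name in names}
            for stage, names in STAGE_SIGNALS.items()}


def pass_on(bundle: dict, finished_stage: str) -> dict:
    """The next pipeline register keeps only the groups still to be used."""
    return {stage: group for stage, group in bundle.items() if stage != finished_stage}


id_ex = decode_control(35 << 26)   # a load enters ID/EX with all three groups
ex_mem = pass_on(id_ex, "EX")      # EX/MEM carries MEM and WB groups
mem_wb = pass_on(ex_mem, "MEM")    # MEM/WB carries only the WB group
```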
[Figure: pipeline diagram, instructions in program order: "sub $2,", Inst 2, Inst 3, "add $14,$2,"; each row shows IM, Reg, ALU, DM, Reg. Annotations:
- Alternatively, the Reg value is available here (inner-register forwarding)
- Clock edges that control the pipeline registers writing
- Clock edge that controls the register file writing
- Timing mark: 5.5]
Detecting the Need to Forward
- Pass register numbers along pipeline
  - E.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register
- ALU (consumer) operand register numbers in EX stage are given by
  - ID/EX.RegisterRs, ID/EX.RegisterRt
- Data hazards when
  - Bypassing from two stages to ID/EX
  1a. EX/MEM.RegisterRd = ID/EX.RegisterRs   (fwd from EX/MEM pipeline reg)
  1b. EX/MEM.RegisterRd = ID/EX.RegisterRt   (fwd from EX/MEM pipeline reg)
  2a. MEM/WB.RegisterRd = ID/EX.RegisterRs   (fwd from MEM/WB pipeline reg)
  2b. MEM/WB.RegisterRd = ID/EX.RegisterRt   (fwd from MEM/WB pipeline reg)
[Figure: forwarding datapath. Annotations: "ld? Why here?"; "ld -> Rt" (a load writes its result to Rt)]
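Conditions 1a/1b and 2a/2b can be written as a small selection function, called once per ALU operand. This is a sketch, not the slide's own logic: the extra checks (the producer must actually write a register, and its destination must not be register 0) and the priority of EX/MEM over MEM/WB follow the textbook's forwarding unit and are assumptions here, as are the function name and the returned labels.

```python
def forward_source(operand_reg: int,
                   ex_mem_rd: int, ex_mem_reg_write: bool,
                   mem_wb_rd: int, mem_wb_reg_write: bool) -> str:
    """Choose where an ALU operand (ID/EX.RegisterRs or ID/EX.RegisterRt) comes from."""
    # 1a / 1b: the most recent result sits in the EX/MEM pipeline register
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == operand_reg:
        return "EX/MEM"
    # 2a / 2b: an older result sits in the MEM/WB pipeline register
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == operand_reg:
        return "MEM/WB"
    # No hazard: use the value read from the register file
    return "REG"


# Producer writing $2 is one instruction ahead of a consumer reading $2
one_ahead = forward_source(2, ex_mem_rd=2, ex_mem_reg_write=True,
                           mem_wb_rd=0, mem_wb_reg_write=False)
# Producer writing $2 is two instructions ahead
two_ahead = forward_source(2, ex_mem_rd=7, ex_mem_reg_write=True,
                           mem_wb_rd=2, mem_wb_reg_write=True)
```

Checking EX/MEM first means that when both pipeline registers hold a value for the same register, the newer one wins.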
Load-Use Data Hazard
[Figure: pipeline diagram of a load followed by a dependent instruction]
Need to stall for one cycle: 5-3=2, 1 bubble!
[Figure: pipeline diagram, program execution order (in instructions), showing a bubble inserted ahead of the AND instruction, which is kept in place ("Keep"), followed by the OR instruction; each row shows IM, Reg, DM, Reg]
Figure caption: The way stalls are really inserted into the pipeline. A bubble is inserted by changing the AND instruction to a nop. Note that the AND instruction is really fetched and decoded, but its EX stage is delayed versus the unstalled position. Likewise, the OR instruction is fetched, but its ID stage is delayed.
How to Stall the Pipeline
- Force control values in ID/EX register to 0
  - Check in ID stage, and insert a bubble into the EX stage
  - EX, MEM and WB do nop (no-operation)
- Prevent update of PC and IF/ID register
  - Keep the current instruction in IF/ID and the next instruction's address in PC
  - The using (current) instruction is decoded again
  - The following instruction is fetched again (by the same PC)
  - 1-cycle stall allows MEM to read data for lw
    - Can subsequently forward to EX stage in the next cycle
- Alternative: valid signal for each instruction?
Checking Forwarding
[Figure: hazard check in the ID stage against the ld instruction's Rt, Instruction[20-16]]
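The stall mechanism can be sketched as a check made in ID plus a front-end update that either advances or holds. The condition below (a load in EX whose Rt matches a source register of the instruction in ID) follows the textbook's hazard detection unit and is an assumption as far as these slides go; all names are hypothetical.

```python
def load_use_stall(id_ex_mem_read: bool, id_ex_rt: int,
                   if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in ID needs the result of a load still in EX."""
    return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)


def step_front_end(pc: int, if_id, fetched, control: dict, stall: bool):
    """Return (next PC, next IF/ID contents, control values written into ID/EX)."""
    if stall:
        # Hold PC and IF/ID; zeroed control makes EX, MEM and WB do a nop
        return pc, if_id, {name: 0 for name in control}
    return pc + 4, fetched, control


# A load into $2 in EX, followed by an instruction reading $2 and $5 in ID
stall = load_use_stall(True, 2, 2, 5)
next_pc, next_if_id, id_ex_control = step_front_end(
    pc=8, if_id="and", fetched="or", control={"RegWrite": 1, "MemRead": 0}, stall=stall)
```

After the one-cycle stall the load has reached MEM/WB, so its data can be forwarded to the EX stage as usual.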
Flush 3 instructions (set control values to 0), or clear the valid signal
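[Figure: pipelined datapath with the branch decision moved to ID and an IF.Flush control line. 3-1=2, 1 bubble]
[Figure: pipeline timing rows for a branch that depends on earlier instructions]
...           IF  ID  EX  MEM  WB
beq stalled   IF  ID
beq stalled       ID
beq stalled   IF  ID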
Branch Hazard
- Some languages (e.g., C) ignore overflow
  - Use MIPS addu, addiu, subu instructions
- Other languages (e.g., Ada, Fortran) require raising an exception

ADD: Add Word (MIPS I)
Encoding:
  31:26             25:21  20:16  15:11  10:6       5:0
  SPECIAL (000000)  rs     rt     rd     0 (00000)  ADD (100000)
  6                 5      5      5      5          6
Format: ADD rd, rs, rt
Purpose: To add 32-bit integers. If overflow occurs, then trap.
Restrictions:
Operation (excerpt):
  GPR[rd] ← sign_extend(temp31..0)
  endif
Exceptions: Integer Overflow
Programming Notes: ADDU performs the same arithmetic operation but does not trap on overflow.

ADDU (MIPS I)
Format: ADDU rd, rs, rt
Purpose: To add 32-bit integers.
Description: rd ← rs + rt
The 32-bit word value in GPR rt is added to the 32-bit value in GPR rs and the 32-bit arithmetic result is placed into GPR rd.
No Integer Overflow exception occurs under any circumstances.

[Slide source: Chapter 3, Arithmetic for Computers, 113]
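The difference between the two instructions is only what happens on signed overflow. The sketch below models that difference on 32-bit values; the Python exception class stands in for the Integer Overflow trap, and the function names and the sign-based overflow test are illustrative choices, not the manual's pseudocode.

```python
MASK32 = 0xFFFFFFFF
SIGN_BIT = 0x80000000


class IntegerOverflow(Exception):
    """Stands in for the hardware Integer Overflow exception."""


def addu(rs: int, rt: int) -> int:
    """Add 32-bit words; never traps, the result simply wraps."""
    return (rs + rt) & MASK32


def add(rs: int, rt: int) -> int:
    """Add 32-bit words; trap if the signed result overflows."""
    rs &= MASK32
    rt &= MASK32
    result = (rs + rt) & MASK32
    # Signed overflow: both operands have the same sign and the result's sign differs
    if (rs ^ result) & (rt ^ result) & SIGN_BIT:
        raise IntegerOverflow(f"{rs:#x} + {rt:#x}")
    return result
```

For example, `addu(0x7FFFFFFF, 1)` wraps to `0x80000000`, while `add(0x7FFFFFFF, 1)` raises; adding -1 (`0xFFFFFFFF`) and 1 does not overflow in either case.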
Handling Exceptions
- In MIPS, exceptions are managed by a System Control Coprocessor (CP0)
  - CP0 provides the processor control, memory management, and exception handling functions
  - Cf. FP is coprocessor 1 (CP1)
- The mfc0 (move from coprocessor 0's register) instruction can retrieve the EPC value, to return after corrective action
[Figure: pipelined datapath with exception support. Annotations: Load-Use Hazard, exception address, IF.Flush, flush to zero => nop]
Exception Properties
- Restartable exceptions
  - Handler executes, then the instruction is refetched and executed from scratch
  - In IA-64
    - Exception: returns to current instruction
    - Interrupt: returns to next instruction
- PC saved in EPC register
  - Identifies the causing instruction
  - In MIPS, actually PC + 4 is saved (why?)
    - SW handler must adjust
[Figure annotations: "generated by HW"; "50, why? PC = 4C" (the saved value is PC + 4 = 50 hex for the instruction at 4C hex)]
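The adjustment the software handler has to make is a single subtraction. The sketch below uses the addresses from the annotation (4C and 50, read as hexadecimal); the function names are hypothetical, and the explanation that the PC has already been incremented by the time the exception is taken is the usual one rather than something the slide states.

```python
def saved_epc(faulting_pc: int) -> int:
    """Value the hardware saves in EPC: PC + 4, not the faulting PC itself."""
    return faulting_pc + 4


def faulting_instruction(epc: int) -> int:
    """Adjustment the software handler makes to find the causing instruction."""
    return epc - 4


epc = saved_epc(0x4C)
restart_at = faulting_instruction(epc)
```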
Exception Example
[Figure: out-of-order core block diagram. Annotations: extra RF port; Load/Store; hold pending operands; 72 physical registers]
- FP is 5 stages longer
- Up to 106 RISC-ops in progress
- Bottlenecks
  - Complex instructions with long dependencies
  - Branch mispredictions
  - Memory access delays