Branch Prediction
CIS 5710
Computer Organization and Design
This Unit: Branch Prediction
• Control hazards
• Branch prediction
[Figure: system layers (App, System software, Mem / CPU / I/O)]
CIS 5710 | Prof Joseph Devietti 2
Readings
• P&H
• Chapter 4
Control Dependences and
Branch Prediction
What About Branches?
[Figure: pipelined datapath showing the PC, insn memory, register file, and data memory across the F, D, X, M stages]
• Wait for branch outcome (two-cycle penalty)
• Fetch past branch before outcome is known
• Default: assume “not-taken” (at fetch, can’t tell it’s a branch)
Big Idea: Speculative Execution
• Speculation: take risk on chance of profit
• Speculative execution
• Execute before all parameters known with certainty
• Correct speculation
+ Avoid stall, improve performance
• Incorrect speculation (mis-speculation)
– Must abort/flush/squash incorrect insns
– Must undo incorrect changes (recover pre-speculation
state)
• Control speculation: speculation aimed at
control hazards
• Are these the correct insns to execute next?
Control Speculation Mechanics
• Guess branch target, start fetching at guessed
position
• Doing nothing is implicitly guessing target is the next
sequential PC
• We were already speculating before!
• Can actively guess other targets: dynamic branch
prediction
• Execute branch to verify (check) our guess
• Correct speculation? keep going
• Mis-speculation? Flush mis-speculated insns
• Hopefully haven’t modified permanent state (Regfile,
DMem)
👍Happens naturally in our in-order 5-stage pipeline
Branch Prediction Components
[Figure: pipeline with a branch predictor (BP) at fetch, alongside I$, regfile, and D$]
• Step #1: is it a branch?
• Easy after decode...
• Step #2: is the branch taken or not taken?
• Direction predictor (applies to conditional branches only)
• Predicts taken/not-taken
• Step #3: if the branch is taken, where does it go?
• Easy after decode…
Branch Prediction Steps
• Which insn’s behavior are we trying to predict?
• Where does PC come from?
• Is insn a branch? (hardware structure: branch target buffer)
• no: next PC = PC+4
• yes: T or NT? (hardware structure: direction predictor)
• Not Taken: next PC = PC+4
• Taken: next PC = predicted target (hardware structure: branch target buffer)
When to Perform Branch Prediction?
• Option #1: During Decode
• Look at instruction opcode to determine branch instructions
• Can calculate next PC from instruction (for PC-relative
branches)
– One cycle “mis-fetch” penalty even if branch predictor is
correct
• we can do better!
                      1  2  3  4  5  6  7  8  9
bne x3,x0,targ        F  D  X  M  W
targ:add x4,x5,x4           F  D  X  M  W
• Option #2: During Fetch?
• How do we do that?
Dynamic Branch Prediction
[Figure: pipeline datapath with a branch predictor (BP) at fetch, predicted target (TG) latches, a comparator, and nop injection into the flushed stages]
• Dynamic branch prediction: hw guesses outcome
• Start fetching from guessed address
• Flush on mis-prediction
Identifying Branches
Branch Prediction Components
• Step #1: is it a branch?
• Easy after decode... but during fetch we need a predictor
• Step #2: is the branch taken or not taken?
• Direction predictor (later)
• Step #3: if the branch is taken, where does it go?
• Branch target predictor (BTB)
• Supplies target PC if branch is taken
Branch Target Buffer
• Learn from the past to predict the future
• Record the past in a hardware structure
• Branch target buffer (BTB)
• Record a list of branches we have seen
+ code doesn’t change
• PC indexes table of bits
• each entry is 1 bit: is there a branch here?
• set the bit if we see a branch at that index
[Figure: PC bits [9:2] index a BTB of 1-bit "branch" entries; the selected bit answers "is it a branch?"]
BTB Aliasing
• What if two PCs have the same bits 9:2...?
• BTB is just a prediction, processor will still work correctly
• these PCs alias
• Aliasing branches interfere with each other
• In our initial BTB design, we never clear BTB bits…
• If bits 9:2 used to index, there are 256 BTB entries
• A 4MB program has 1M insns
• 4K insns mapping to each BTB entry
• What are the odds that 1 out of 4K insns is a branch?
• BTB will become saturated
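The aliasing arithmetic above can be checked with a short sketch. It assumes 32-bit PCs, 4-byte insns, and a 256-entry table indexed by PC bits 9:2, as on the slide; the name `btb_index` and the two example addresses are made up for illustration.

```python
BTB_ENTRIES = 256  # indexed by PC bits 9:2


def btb_index(pc: int) -> int:
    """Drop the 2 byte-offset bits, then keep the next 8 bits."""
    return (pc >> 2) & (BTB_ENTRIES - 1)


# Two hypothetical PCs that differ only above bit 9 share one entry.
pc_a = 0x0000_1040
pc_b = 0x0000_2040
aliases = btb_index(pc_a) == btb_index(pc_b)

# A 4 MB program of 4-byte insns spread evenly over 256 entries.
program_bytes = 4 * 1024 * 1024
insns = program_bytes // 4
insns_per_entry = insns // BTB_ENTRIES

print(aliases, insns, insns_per_entry)
```

With roughly 4K insns sharing each 1-bit entry and no mechanism to clear bits, it takes only one branch among them to set the entry permanently.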
BTB Tags
• BTB entries are too coarse-grained
+ Record only taken branches
• a never-taken branch might as well be a NOP
– useful, but doesn’t help enough
• better idea: tag each BTB entry
• remember some things precisely, rather than everything
imprecisely
• record a subset of actual taken branches
• is_a_branch = (BTB[PC].branch && BTB[PC].tag == PC)
• How large is each tag?
[Figure: PC bits [9:2] index the BTB; each entry holds a branch bit and a tag that is compared against PC bits [31:10]; output: is it a branch?]
BTB Tags
• Now that we have tags, branch bits are redundant
• tag comparison achieves the same goal
• let’s get rid of them!
[Figure: PC bits [9:2] index the BTB; each entry holds only a tag; the tag comparison (==) with PC bits [31:10] answers "is it a branch?"]
Branch Direction Prediction
Branch Prediction Components
• Step #1: is it a branch?
• Easy after decode... but during fetch we need a predictor
• Step #2: is the branch taken or not taken?
• Direction predictor
• Step #3: if the branch is taken, where does it go?
• Branch target predictor (BTB)
• Supplies target PC if branch is taken
Branch Direction Prediction
• Learn from past, predict the future
• Record the past in a hardware structure
• Direction predictor (DIRP)
• Map conditional-branch PC to taken/not-taken (T/N) decision
• Individual conditional branches often biased or weakly biased
• 90%+ one way or the other considered biased
• Why? Loop back edges, checking for uncommon conditions
• Bimodal predictor: simplest predictor
• PC indexes Branch History Table of bits (0 = N, 1 = T), no tags
• Essentially: branch will go same way it went last time
• What about aliasing?
• Two PCs with the same lower bits?
• No problem, just a prediction!
[Figure: PC bits [9:2] index the BHT; each entry holds T or NT; output: prediction (taken or not taken)]
Bimodal Branch Predictor
• simplest direction predictor
• PC indexes table of bits (0 = N, 1 = T), no tags
• Essentially: branch will go same way it went last time
• Problem: inner loop branch below
    for (i=0;i<100;i++)
      for (j=0;j<3;j++)
        // whatever
– Two “built-in” mis-predictions per execution of the inner loop
– Branch predictor “changes its mind too quickly”

Time  State  Prediction  Outcome  Result?
1     N      N           T        Wrong
2     T      T           T        Correct
3     T      T           T        Correct
4     T      T           N        Wrong
5     N      N           T        Wrong
6     T      T           T        Correct
7     T      T           T        Correct
8     T      T           N        Wrong
9     N      N           T        Wrong
10    T      T           T        Correct
11    T      T           T        Correct
12    T      T           N        Wrong
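The table above can be reproduced with a few lines of simulation. This is a minimal sketch of a single 1-bit entry, assuming it starts at N and is fed the T T T N outcome pattern from the table; the function name is illustrative.

```python
def simulate_one_bit(outcomes, initial=False):
    """Count mis-predictions of one 1-bit entry (True = taken)."""
    state = initial
    wrong = 0
    for taken in outcomes:
        if state != taken:   # prediction is simply the stored bit
            wrong += 1
        state = taken        # remember only the most recent outcome
    return wrong


pattern = [True, True, True, False] * 3   # T T T N, three times
print(simulate_one_bit(pattern))          # 6 mis-predictions out of 12
```

After the first pass, every N outcome costs two mis-predictions: one on the N itself and one on the following T.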
Two-Bit Saturating Counters (2bc)
• Two-bit saturating counters (2bc) [Smith 1981]
• Replace each single-bit prediction
• (0,1,2,3) = (N,n,t,T)
• Adds “hysteresis”
• Force predictor to mis-predict twice before “changing its mind”
• One mispredict each loop execution (rather than two)
+ Fixes this pathology (which is not contrived, by the way)
• Can we do even better?

Time  State  Prediction  Outcome  Result?
1     N      N           T        Wrong
2     n      N           T        Wrong
3     t      T           T        Correct
4     T      T           N        Wrong
5     t      T           T        Correct
6     T      T           T        Correct
7     T      T           T        Correct
8     T      T           N        Wrong
9     t      T           T        Correct
10    T      T           T        Correct
11    T      T           T        Correct
12    T      T           N        Wrong
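A minimal sketch of one two-bit saturating counter, assuming it starts in state N (0) and sees the same T T T N pattern as the table; the function name is illustrative.

```python
def simulate_2bc(outcomes, initial=0):
    """Count mis-predictions of one 2-bit counter: 0..3 = N, n, t, T."""
    counter = initial
    wrong = 0
    for taken in outcomes:
        predicted_taken = counter >= 2        # t and T predict taken
        if predicted_taken != taken:
            wrong += 1
        if taken:
            counter = min(counter + 1, 3)     # saturate at T
        else:
            counter = max(counter - 1, 0)     # saturate at N
    return wrong


pattern = [True, True, True, False] * 3   # T T T N, three times
print(simulate_2bc(pattern))              # 5: two warm-up misses, then one per N
```

Once warmed up, a single N only moves the counter from T to t, so the next T is still predicted correctly.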
Branches may be correlated
• Consider:
for (i=0; i<1000000; i++) {      // Highly biased
  if (i % 3 == 0) {              // Locally correlated
    …
  }
  if (random() % 2 == 0) {       // Unpredictable
    …
  }
  if (i % 3 == 0) {
    …                            // Globally correlated
  }
}
Gshare History-Based Predictor
• Exploits observation that branch outcomes are
correlated
• Maintains recent branch outcomes in Branch
History Register (BHR)
• In addition to BHT of counters (typically 2-bit sat. counters)
• How do we incorporate history into our
predictions?
• Use PC xor BHR to index into BHT. Why?
[Figure: PC xor BHR indexes the BHT, which outputs the direction prediction (T/NT)]
Gshare History-based Predictor
• Gshare working example
• assume program has one branch
• BHT: 1-bit DIRP entries, one per BHR value (State shows the entry the BHR selects)
• 3-bit BHR: last 3 branch outcomes
• train counter, and update BHR after each branch

Time  State  BHR  Prediction  Outcome  Result?
1     N      NNN  N           T        wrong
2     N      NNT  N           T        wrong
3     N      NTT  N           T        wrong
4     N      TTT  N           N        correct
5     N      TTN  N           T        wrong
6     N      TNT  N           T        wrong
7     T      NTT  T           T        correct
8     N      TTT  N           N        correct
9     T      TTN  T           T        correct
10    T      TNT  T           T        correct
11    T      NTT  T           T        correct
12    T      TTT  N           N        correct
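The working example can be replayed in code. This sketch assumes 1-bit BHT entries that all start at N, a 3-bit BHR that starts at NNN, and a single branch whose PC defaults to 0; the function name and parameters are illustrative.

```python
def simulate_gshare(outcomes, pc=0, history_bits=3):
    """Return one bool per branch: True where the prediction was correct."""
    mask = (1 << history_bits) - 1
    bht = [False] * (1 << history_bits)   # 1-bit entries, all N
    bhr = 0                               # history register, all N
    results = []
    for taken in outcomes:
        index = ((pc >> 2) ^ bhr) & mask  # gshare: PC xor BHR
        results.append(bht[index] == taken)
        bht[index] = taken                        # train the selected entry
        bhr = ((bhr << 1) | int(taken)) & mask    # shift in the outcome
    return results


pattern = [True, True, True, False] * 3   # T T T N, three times
results = simulate_gshare(pattern)
print(["correct" if ok else "wrong" for ok in results])
```

Each distinct history selects its own entry, so once every history in the repeating pattern has been seen, the single branch is predicted perfectly, including the N that the bimodal and 2bc predictors always miss.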
Hybrid Predictor
• Hybrid (tournament) predictor [McFarling 1993]
• Attacks correlated predictor BHT capacity problem
• Idea: combine two predictors
• Bimodal predictor for history-independent branches
• Correlated predictor for branches that need history
• Chooser assigns branches to one predictor or the other
• Branches start in simple BHT, move to the correlated predictor after crossing a mis-prediction threshold
+ Correlated predictor can be made smaller, handles fewer branches
+ 90–95% accuracy
[Figure: PC, chooser, two BHTs, and BHR]
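One way to sketch the chooser is as a third table of 2-bit counters that drifts toward whichever predictor was right when the two disagree. That update rule, the table sizes, and all names below are assumptions made for illustration, not details given on the slide.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter (0..3) one step up or down."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)


class HybridPredictor:
    def __init__(self, index_bits=8, history_bits=8):
        size = 1 << index_bits
        self.mask = size - 1
        self.history_mask = (1 << history_bits) - 1
        self.bimodal = [0] * size      # 2bc, indexed by PC
        self.correlated = [0] * size   # 2bc, indexed by PC xor BHR
        self.chooser = [0] * size      # 0,1: use bimodal; 2,3: use correlated
        self.bhr = 0

    def _indices(self, pc):
        pc_bits = (pc >> 2) & self.mask
        return pc_bits, (pc_bits ^ self.bhr) & self.mask

    def predict(self, pc):
        i, j = self._indices(pc)
        if self.chooser[i] >= 2:
            return self.correlated[j] >= 2
        return self.bimodal[i] >= 2

    def update(self, pc, taken):
        i, j = self._indices(pc)
        bimodal_ok = (self.bimodal[i] >= 2) == taken
        correlated_ok = (self.correlated[j] >= 2) == taken
        if bimodal_ok != correlated_ok:      # train chooser only on disagreement
            self.chooser[i] = bump(self.chooser[i], correlated_ok)
        self.bimodal[i] = bump(self.bimodal[i], taken)
        self.correlated[j] = bump(self.correlated[j], taken)
        self.bhr = ((self.bhr << 1) | int(taken)) & self.history_mask


def count_mispredictions(predictor, pc, outcomes):
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


hybrid = HybridPredictor()
warmup_wrong = count_mispredictions(hybrid, 0x1000, [True, True, True, False] * 100)
steady_wrong = count_mispredictions(hybrid, 0x1000, [True, True, True, False] * 10)
print(warmup_wrong, steady_wrong)
```

In this sketch the branch starts out in the bimodal table and migrates to the correlated table only after the bimodal table has repeatedly lost on it, which mirrors the "start simple, move on mis-prediction" idea.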
Branch Target Prediction
Branch Target Buffer, Again
• Branch target buffer (BTB)
• guess the future PC based on past behavior
• “Last time the branch X was taken, it went to address Y”
• “So, if address X is fetched, fetch address Y next”
• Essentially: branch will go to same place it went last time
• PC indexes table of target addresses
• use tags to precisely remember a subset of targets
• What about aliasing?
• Two PCs with the same lower bits?
• No problem, just a prediction!
[Figure: PC bits [9:2] index the BTB; each entry holds a target and a tag; output: predicted target]
Branch Target Buffer
• BTB predicts which insns are branches, and
their targets
• tag each entry with its corresponding PC
• Update BTB on every taken branch insn, record target PC:
• BTB[PC].tag = PC, BTB[PC].target = target of branch
• All insns access BTB at Fetch in parallel with Imem
• If tag matches, indicates insn at that PC is a branch
• otherwise, assume insn is not a branch
• Predicted PC = (BTB[PC].tag == PC) ? BTB[PC].target :
PC+4
[Figure: PC indexes the BTB; a tag match (==) selects the BTB target, otherwise PC+4, as the predicted target]
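The update and lookup rules above can be sketched as a small direct-mapped table. The sketch assumes 32-bit PCs, 4-byte insns, and 256 entries indexed by PC bits 9:2, so the tag is PC bits 31:10 (22 bits); the class and method names are illustrative, and `None` stands in for an entry that has never been written.

```python
class BTB:
    def __init__(self, index_bits=8):
        self.index_bits = index_bits
        self.entries = [None] * (1 << index_bits)   # each: (tag, target)

    def _split(self, pc):
        index = (pc >> 2) & ((1 << self.index_bits) - 1)   # PC[9:2]
        tag = pc >> (2 + self.index_bits)                  # PC[31:10]
        return index, tag

    def update(self, pc, target):
        """Called for every taken branch: record its tag and target."""
        index, tag = self._split(pc)
        self.entries[index] = (tag, target)

    def predict(self, pc):
        """Predicted next PC: stored target on a tag match, else PC+4."""
        index, tag = self._split(pc)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return pc + 4


btb = BTB()
btb.update(0x1040, 0x2000)         # taken branch at 0x1040 went to 0x2000
print(hex(btb.predict(0x1040)))    # tag match: predicted target
print(hex(btb.predict(0x2040)))    # same index, different tag: PC+4
```

The tag check is what keeps the aliasing PC 0x2040 from inheriting the other branch's target; a later taken branch at 0x2040 would simply evict the first entry.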
Why Does a BTB Work?
• Because most control insns use direct targets
• Target encoded in insn itself → same “taken” target every time
• What about indirect targets?
• Target held in a register → can be different each time
• Two indirect call idioms
+ Dynamically linked functions (DLLs): target always the
same
• Dynamically dispatched (virtual) functions: hard but
uncommon
• Also two indirect unconditional jump idioms
• Switches: hard but uncommon
– Function returns: hard and common
Return Address Stack (RAS)
[Figure: BTB lookup (tag ==, target) and PC+4, with a RAS as an additional source of the predicted target]
• Return address stack (RAS)
• Call instruction? RAS[TopOfStack++] = PC+4
• Return instruction? Predicted-target = RAS[--TopOfStack]
• How can you tell if an insn is a call/return before decoding
it?
• mark some BTB entries as “returns”, or use another table
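The two RAS rules above amount to a push on calls and a pop on returns. This sketch assumes a fixed depth of 8 entries that silently wraps on overflow; the depth, the wrap-around policy, and the names are illustrative choices, not taken from the slide.

```python
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.slots = [0] * depth
        self.top = 0

    def on_call(self, pc):
        """Call insn: RAS[TopOfStack++] = PC+4."""
        self.slots[self.top % len(self.slots)] = pc + 4
        self.top += 1

    def on_return(self):
        """Return insn: predicted target = RAS[--TopOfStack]."""
        self.top -= 1
        return self.slots[self.top % len(self.slots)]


ras = ReturnAddressStack()
ras.on_call(0x100)                 # outer call
ras.on_call(0x200)                 # nested call
inner = ras.on_return()            # predicts 0x204
outer = ras.on_return()            # predicts 0x104
print(hex(inner), hex(outer))
```

A BTB alone would predict each return to wherever it went last time; the stack gets nested calls from different call sites right because it tracks the call chain instead.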
Misprediction Recovery
Branch Recovery
[Figure: pipeline datapath with nops injected into the two stages behind the branch]
• Branch recovery: what to do when branch is actually taken
• Insns that are in F and D are wrong
• Flush them, i.e., replace them with NOPs
• They haven’t written permanent state yet (regfile, DMem)
– Two cycle penalty for taken branches
Branch Speculation and Recovery
Correct:              1  2  3  4  5  6  7  8  9
addi x3,x1,1          F  D  X  M  W
bne x3,x0,targ           F  D  X  M  W
sw x6,4(x7)                 F  D  X  M  W      (speculative)
mul x10,x8,x9                  F  D  X  M  W   (speculative)
• Mis-speculation recovery
• Not too painful in a short, in-order pipeline
• Branch resolves in X
+ Younger insns (in F, D) haven’t changed permanent state
• Flush insns currently in F and D (i.e., replace with nops)
Recovery:             1  2  3  4  5  6  7  8  9
addi x3,x1,1          F  D  X  M  W
bne x3,x0,targ           F  D  X  M  W
sw x6,4(x7)                 F  D  -- -- --
mul x10,x8,x9                  F  -- -- -- --
targ:add x4,x4,x5                 F  D  X  M  W
Reducing Taken Branch Penalty
Reducing Penalty: Fast Branches
[Figure: pipeline datapath with the branch comparison moved from X into D]
• Fast branch: can decide at D, not X
• Test must be comparison to zero or equality, no time for ALU
• beq/bne
+ New taken branch penalty is 1
– Additional insns (e.g., slt) for more complex tests such as blt; must bypass to D too
Reducing Penalty: Fast Branches
• Fast branch: targets control-hazard penalty
• Basically, branch insns that can resolve at D, not X
• Test must be comparison to zero or equality, no time for
ALU
+ New taken branch penalty is 1
– Additional comparison insns (e.g., cmplt, slt) for complex
tests
– Must bypass into decode stage now, too
                      1  2  3  4  5  6  7  8  9
bnez r3,targ          F  D  X  M  W
st r6⟶[r7+4]             F  D  -- -- --
targ:add r4⟵r5,r4           F  D  X  M  W
Putting It All Together
• BTB & branch direction predictor during fetch
[Figure: fetch-stage predictor: BTB (tag ==, target), RAS, and PC+4 supply candidate next PCs; the BHT supplies taken/not-taken]
• If branch prediction correct, no taken branch
penalty
Branch Prediction Performance
• Dynamic branch prediction
• 20% of instructions are branches
• Simple predictor: branches predicted with 75% accuracy
• CPI = 1 + (20% * 25% * 2) = 1.1
• More advanced predictor: 95% accuracy
• CPI = 1 + (20% * 5% * 2) = 1.02
• Branch mis-predictions still a big problem though
• Pipelines are long: typical mis-prediction penalty is 10+
cycles
• For cores that do more per cycle, mis-predictions are more costly (later)
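The CPI arithmetic above follows one formula: base CPI plus branch fraction times mis-prediction rate times penalty. A small sketch, assuming a base CPI of 1 and the slide's numbers; the function name and parameter names are illustrative.

```python
def cpi(branch_fraction, mispredict_rate, penalty_cycles, base_cpi=1.0):
    """Average cycles per insn when only branch mis-predictions add stalls."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


simple = cpi(0.20, 0.25, 2)      # 75% accurate predictor, 2-cycle penalty
advanced = cpi(0.20, 0.05, 2)    # 95% accurate predictor, 2-cycle penalty
deep = cpi(0.20, 0.05, 10)       # same predictor, 10-cycle penalty

print(round(simple, 3), round(advanced, 3), round(deep, 3))
```

With a 10-cycle penalty, even the 95% accurate predictor is back to a CPI of about 1.1, which is why longer pipelines keep mis-predictions a big problem.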
Predication
Predication
• Instead of predicting which way we’re going, why
not go both ways?
• compute a predicate bit indicating a condition
• ISAs often include predicated instructions
• predicated insns either execute as normal or as NOPs,
depending on the predicate bit
• Examples
• x86 cmov performs a conditional move into a register (source may be a register or memory)
• 32b ARM allows almost all insns to be predicated
• 64b ARM has predicated reg-reg move, inc, dec, not
• Nvidia GPU ISA supports predication on most insns
• RV does not have predication
Predication Example
• Instead of predicting which way we’re going, why
not go both ways?
• compute a predicate bit indicating a condition
• ISA includes predicated instructions
• predicated insns either execute as normal or as NOPs,
depending on the predicate bits
// C code
if (a >= b) {
  x += y;
} else {
  x -= z;
}

; original RV
      blt x1,x2,else
      add x3,x3,x4
      j after
else:
      sub x3,x3,x5
after:

; imaginary predicated RV
      slt p1,x1,x2
      add.!p1 x3,x3,x4
      sub.p1 x3,x3,x5
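A quick way to convince yourself that the predicated sequence computes the same result is to model both versions and compare them. This sketch only models the semantics (Python's conditional expressions stand in for predicated insns that execute as NOPs); the function names are illustrative.

```python
def with_branch(a, b, x, y, z):
    """The C code: exactly one of the two paths runs."""
    if a >= b:
        x += y
    else:
        x -= z
    return x


def with_predication(a, b, x, y, z):
    """Both insns are issued; the predicate turns one into a NOP."""
    p1 = a < b                      # slt p1,x1,x2
    x = x + y if not p1 else x      # add.!p1 x3,x3,x4
    x = x - z if p1 else x          # sub.p1  x3,x3,x5
    return x


same = all(
    with_branch(a, b, 10, 3, 4) == with_predication(a, b, 10, 3, 4)
    for a in range(-2, 3)
    for b in range(-2, 3)
)
print(same)
```

The predicated version always spends three insns, while the branching version spends two or three plus any mis-prediction penalty.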
Predication Performance
• Predication overhead is additional insns
• Sometimes overhead is zero
• for if-then statement where condition is true
– Most of the times it isn’t
• if-then-else statement, only one of the paths is useful
• For a given branch, predicate (vs speculate) if…
• Average number of additional insns < overall mis-prediction penalty (mis-prediction rate × penalty)
• For an individual branch
• Mis-prediction penalty in a 5-stage scalar pipeline = 2
• Mis-prediction rate is <50%, and often <20%
• Overall mis-prediction penalty <1 and often <0.4
• So when is predication ever worth it?
Predication Performance
• What does predication actually accomplish?
• In a scalar 5-stage pipeline (penalty = 2): nothing!
• In a 4-way superscalar 15-stage pipeline (penalty = 60)?
• Use when mis-predictions >10% and insn overhead <6
• In a 4-way out-of-order superscalar (penalty ~ 150)
• potentially useful in more situations
• typically only desirable for branches that mis-predict
frequently
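The rule of thumb above compares the insn overhead of predication with the expected cost of speculating. A minimal sketch, assuming the penalty is measured in lost insn slots as in the bullets; the function names are illustrative.

```python
def expected_mispredict_cost(mispredict_rate, penalty):
    """Average insn slots lost per dynamic branch when speculating."""
    return mispredict_rate * penalty


def should_predicate(extra_insns, mispredict_rate, penalty):
    """Predicate only if its overhead is below the expected speculation cost."""
    return extra_insns < expected_mispredict_cost(mispredict_rate, penalty)


scalar = should_predicate(1, 0.20, 2)     # 5-stage scalar: cost is only 0.4
wide = should_predicate(4, 0.10, 60)      # 4-wide, 15 stages: cost is 6
too_big = should_predicate(8, 0.10, 60)   # overhead exceeds the cost of 6

print(scalar, wide, too_big)
```

The same branch flips from "speculate" to "predicate" purely because the penalty grows from 2 to 60 slots.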
Predication Pros/Cons
• Other predication advantages
• Low-power: eliminates the need for a large branch predictor
• Real-time: predicated code has consistent latency
• Predication disadvantages
• wasted time/energy compared to correct prediction
• complex to implement
• doesn’t nest well
Summary
• Control hazards
• Branch target prediction
• Branch direction prediction
[Figure: system layers (App, System software, Mem / CPU / I/O)]