Joel Emer
December 7, 2005
6.823, L24-1
Reliable Architectures
Joel Emer
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Strike Changes State of a Single Bit
[Figure: a single stored bit changes state between 0 and 1]
Impact of Neutron Strike on a Si Device
[Figure: neutron strike on a transistor device, showing source and drain]
• Strikes release electron & hole pairs that can be absorbed by source & drain to alter the state of the device
• Secondary source of upsets: alpha particles from packaging
Cosmic Rays Come From Deep Space
[Figure: protons (p) and neutrons (n) cascading down to Earth’s surface]
• Neutron flux is higher at higher altitudes
– 3x - 5x increase in Denver at 5,000 feet
– 100x increase in airplanes at 30,000+ feet
Physical Solutions are hard
• Shielding?
– No practical absorbent (e.g., approximately > 10 ft of concrete would be needed)
– Unlike alpha particles, which can be shielded
• Technology solution: SOI?
– Partially-depleted SOI of some help, effect on logic unclear
– Fully-depleted SOI may help, but is challenging to manufacture
• Circuit level solution?
– Radiation hardened circuits can provide 10x improvement with
significant penalty in performance, area, cost
– 2-4x improvement may be possible with less penalty
Triple Modular Redundancy
(Von Neumann, 1956)
[Figure: three copies of module M feed a voter V, which produces the Result]
V does a majority vote on the results
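As a rough illustration of the voter, a bitwise majority function returns, for each bit position, the value that at least two of the three module outputs agree on. This is a minimal sketch and not taken from the lecture; the function name `vote` and the 4-bit example values are assumptions.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant results (hypothetical helper)."""
    return (a & b) | (a & c) | (b & c)


good = 0b1011                  # value all three modules should produce
struck = good ^ 0b0100         # one module suffers a single-bit upset
result = vote(good, struck, good)   # the two good copies outvote the bad one
```

A single faulty module is masked; two modules failing in the same bit position would not be.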
Dual Modular Redundancy
(e.g., Binac, Stratus)
[Figure: two redundant processors, each with its own Error? signal, feed a comparator C that flags Mismatch?]
• Processing stops on mismatch
• Error signal is used to decide which processor restores state to the other
Pair and Spare Lockstep
(e.g., Tandem, 1975)
[Figure: a Primary pair and a Backup pair of modules M; each pair feeds a comparator C that flags Mismatch?]
• Primary creates periodic checkpoints
• Backup restarts from checkpoint on mismatch
Redundant Multithreading
(e.g., Reinhardt, Mukherjee, 2000)
[Figure: a Leading Thread and a Trailing Thread each execute the sequence X W X X W X X W; a comparator C checks for a Fault? at each write W]
• Writes are checked
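The write-checking idea can be sketched with a toy accumulator program that is run twice, once as the leading thread and once as the trailing thread, with only the written values compared. The program encoding, the `fault_at` parameter, and all function names are assumptions made for illustration; real redundant multithreading runs both copies on one multithreaded core.

```python
def execute(ops, fault_at=None):
    """Run a toy program; return the list of values it writes.

    ops is a list of ("add", n) steps (the X slots) and ("write",) steps
    (the W slots). fault_at optionally names an op index whose result
    suffers a single-bit upset (hypothetical fault model).
    """
    acc, writes = 0, []
    for i, op in enumerate(ops):
        if op[0] == "add":
            acc += op[1]
            if i == fault_at:
                acc ^= 1          # flip the low bit of the result
        else:
            writes.append(acc)
    return writes


def check_writes(leading, trailing):
    """Return the index of the first mismatching write, or None."""
    for i, (a, b) in enumerate(zip(leading, trailing)):
        if a != b:
            return i
    return None


PROGRAM = [("add", 2), ("write",), ("add", 4), ("add", 8), ("write",)]
clean = check_writes(execute(PROGRAM), execute(PROGRAM))
faulty = check_writes(execute(PROGRAM, fault_at=2), execute(PROGRAM))
```

Here `clean` is `None`, while the upset in the leading thread is caught at the second write (`faulty == 1`), before the bad value leaves the checked region.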
Component Protection
[Figure: stored data bits (1 1 0 1 1 … 0 0 …) protected by Parity or ECC; the check raises Error? when the stored bits change]
• Fujitsu SPARC in 130 nm technology (ISSCC 2003)
– 80% of 200k latches protected with parity
– versus very few latches protected in commodity microprocessors
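For reference, a single even-parity bit per word detects any odd number of flipped bits but cannot correct them, and it misses double-bit flips. The sketch below is illustrative only; the 5-bit word and the function names are assumptions, not the Fujitsu design.

```python
def parity(word: int) -> int:
    """Even-parity bit: 1 if the word holds an odd number of 1s."""
    return bin(word).count("1") & 1


def has_error(word: int, stored_parity: int) -> bool:
    """True if the word no longer matches the parity stored with it."""
    return parity(word) != stored_parity


word = 0b11011
p = parity(word)                 # stored alongside the word
single = word ^ (1 << 2)         # one bit flipped by a strike
double = word ^ 0b00011          # two bits flipped

detected_single = has_error(single, p)   # parity catches this
detected_double = has_error(double, p)   # parity misses this
```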
Strike on a bit (e.g., in register file)
• Bit read?
– no: benign fault, no error
– yes: bit has error protection?
   – detection & correction: no error
   – detection only: affects program outcome? yes: True DUE; no: False DUE
   – no protection: affects program outcome? yes: SDC; no: benign fault, no error
SDC = Silent Data Corruption, DUE = Detected Unrecoverable Error
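The decision tree on this slide can be written as a small classification function. This is only a sketch of the slide's logic; the function name and the string labels for the protection levels are assumptions.

```python
def classify_strike(read: bool, protection: str, affects_outcome: bool) -> str:
    """Classify a bit flip following the slide's decision tree.

    protection is one of "none", "detection", "detection+correction"
    (hypothetical labels).
    """
    if not read:
        return "benign fault"              # never read, so no error
    if protection == "detection+correction":
        return "no error"                  # e.g., ECC repairs the bit
    if protection == "detection":
        # Parity flags the error whether or not it would have mattered.
        return "True DUE" if affects_outcome else "False DUE"
    if protection == "none":
        return "SDC" if affects_outcome else "benign fault"
    raise ValueError(f"unknown protection level: {protection}")
```

Note that adding detection turns every would-be SDC into a True DUE, but also turns previously benign faults into False DUEs.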
Metrics
• Interval-based
– MTTF = Mean Time to Failure
– MTTR = Mean Time to Repair
– MTBF = Mean Time Between Failures = MTTF + MTTR
– Availability = MTTF / MTBF
• Rate-based
– FIT = Failure in Time = 1 failure in a billion hours
– 1 year MTTF = 10^9 / (24 * 365) FIT ≈ 114,155 FIT
– SER FIT = SDC FIT + DUE FIT
Hypothetical Example
Cache: 0 FIT
+ IQ: 100K FIT
+ FU: 58K FIT
= Total of 158K FIT
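The conversions between FIT and MTTF, and the way FIT rates add across components, can be checked with a few lines of arithmetic. The helper names are assumptions; the FIT values are the hypothetical ones from this slide.

```python
HOURS_PER_YEAR = 24 * 365
BILLION_HOURS = 1e9


def mttf_years_to_fit(years: float) -> float:
    """FIT rate (failures per 10^9 hours) for a given MTTF in years."""
    return BILLION_HOURS / (years * HOURS_PER_YEAR)


def fit_to_mttf_years(fit: float) -> float:
    """MTTF in years for a given FIT rate."""
    return BILLION_HOURS / (fit * HOURS_PER_YEAR)


def availability(mttf: float, mttr: float) -> float:
    """Availability = MTTF / MTBF, with MTBF = MTTF + MTTR."""
    return mttf / (mttf + mttr)


one_year_fit = mttf_years_to_fit(1)            # about 114,155 FIT
component_fit = {"cache": 0, "IQ": 100_000, "FU": 58_000}
total_fit = sum(component_fit.values())        # FIT rates simply add
total_mttf_years = fit_to_mttf_years(total_fit)
```

With these numbers the hypothetical chip's 158K FIT corresponds to an MTTF of a little under nine months.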
Cosmic Ray Strikes: Evidence & Reaction
• Publicly disclosed incidents
– Error logs in large servers: E. Normand, “Single Event Upset at Ground Level,” IEEE Trans. on Nucl. Sci., Vol. 43, No. 6, Dec. 1996.
– Sun Microsystems found that cosmic ray strikes on an L2 cache with defective error protection caused Sun’s flagship servers to crash: R. Baumann, IRPS Tutorial on SER, 2000.
– Cypress Semiconductor reported in 2004 that a single soft error brought a billion-dollar automotive factory to a halt once a month: Ziegler & Puchner, “SER – History, Trends, and Challenges,” Cypress, 2004.
# Vulnerable Bits Growing with Moore’s Law
[Figure: log-scale plot (1 to 10000) versus Year (2003 to 2012), with curves for 100% Vulnerable and 20% Vulnerable, annotated with a 12x GAP and the 1000 year MTBF Goal]
Typical SDC goal: 1000 year MTBF
Typical DUE goal: 10-25 year MTBF
Architectural Vulnerability Factor (AVF)
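$$\mathrm{AVF}_{bit} = \text{Probability bit matters} = \frac{\text{\# of visible errors}}{\text{\# of bit flips from particle strikes}}$$
$$\mathrm{FIT}_{bit} = \text{intrinsic } \mathrm{FIT}_{bit} \times \mathrm{AVF}_{bit}$$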
Architectural Vulnerability Factor
Does a bit matter?
• Branch Predictor
– Doesn’t matter at all (AVF = 0%)
• Program Counter
– Almost always matters (AVF ~ 100%)
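To see how AVF derates the raw error rate, the effective FIT of a structure can be computed as intrinsic FIT per bit times the number of bits times the AVF. The intrinsic rate of 0.001 FIT/bit and the structure sizes below are made-up illustration values, not figures from the lecture.

```python
def effective_fit(intrinsic_fit_per_bit: float, n_bits: int, avf: float) -> float:
    """FIT contribution of a structure after derating by its AVF."""
    return intrinsic_fit_per_bit * n_bits * avf


INTRINSIC = 0.001   # hypothetical raw FIT per bit

# A large branch predictor contributes nothing, because its AVF is 0%.
predictor_fit = effective_fit(INTRINSIC, n_bits=4096, avf=0.0)

# A 64-bit program counter contributes its full raw rate (AVF ~ 100%).
pc_fit = effective_fit(INTRINSIC, n_bits=64, avf=1.0)
```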
Statistical Fault Injection (SFI) with RTL
[Figure: simulate a strike on a latch (0 → 1) whose output feeds logic; does the fault propagate to architectural state?]
+ Naturally characterizes all logical structures
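A toy version of the procedure: flip one randomly chosen latch bit, run the logic, and count how often the output changes. The "design" here is a stand-in in which only the low four of eight latch bits reach the output; the mask, width, and function names are assumptions, and a real SFI campaign would run against the RTL model.

```python
import random

WIDTH = 8
OUTPUT_MASK = 0x0F   # hypothetical logic: only the low 4 latch bits are observed


def output(latches: int) -> int:
    """Stand-in for the logic between the latches and architectural state."""
    return latches & OUTPUT_MASK


def strike_propagates(latches: int, bit: int) -> bool:
    """Flip one latch bit and report whether the output changes."""
    return output(latches ^ (1 << bit)) != output(latches)


def sfi_estimate(trials: int, seed: int = 0) -> float:
    """Fraction of randomly injected flips that become visible."""
    rng = random.Random(seed)
    visible = 0
    for _ in range(trials):
        latches = rng.randrange(1 << WIDTH)
        bit = rng.randrange(WIDTH)
        visible += strike_propagates(latches, bit)
    return visible / trials
```

For this toy design the estimate converges toward 0.5, since half of the latch bits are logically masked.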
Architecturally Correct Execution (ACE)
[Figure: data flow graph from Program Input to Program Outputs]
• ACE path requires only a subset of values to flow correctly
through the program’s data flow graph (and the machine)
• Anything else (un-ACE path) can be derated away
Example of un-ACE instruction:
Dynamically Dead Instruction
[Figure: data flow graph highlighting a Dynamically Dead Instruction]
Most bits of an un-ACE instruction do not affect
program output
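One way to find dynamically dead instructions in a trace is a backward scan that tracks which registers are still going to be read. The trace format, the function name, and the treatment of end-of-trace register state (assumed dead unless listed in `live_out`) are assumptions for this sketch, not the lecture's tooling.

```python
def dynamically_dead(trace, live_out=()):
    """Return indices of instructions whose results are never used.

    trace is a list of (dest, sources) tuples; dest is None for
    instructions with side effects (e.g., stores), which are never dead.
    """
    live = set(live_out)
    dead = []
    for i in range(len(trace) - 1, -1, -1):
        dest, sources = trace[i]
        if dest is not None and dest not in live:
            dead.append(i)        # overwritten or dropped before any read
            continue              # its sources do not become live through it
        live.discard(dest)
        live.update(sources)
    return sorted(dead)


TRACE = [
    ("r1", ()),         # 0: r1 = ...      (only read by a dead instruction)
    ("r2", ("r1",)),    # 1: r2 = r1 + 1   (r2 is never read)
    ("r1", ()),         # 2: r1 = 5
    (None, ("r1",)),    # 3: store r1
]
dead = dynamically_dead(TRACE)
```

Instruction 1 is dead because nothing reads `r2`; instruction 0 is dead transitively, because its only reader is itself dead.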
Vulnerability of a structure
AVF = fraction of cycles a bit contains ACE state
T=1 ACE% = 2/4
T=2 ACE% = 1/4
T=3 ACE% = 0/4
T=4 ACE% = 3/4
$$\mathrm{AVF} = \frac{(2+1+0+3)/4}{4} = \frac{\text{average number of ACE bits in a cycle}}{\text{total number of bits in the structure}}$$
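The four-cycle example works out numerically as follows. The function name is an assumption; the per-cycle ACE counts are the ones on the slides.

```python
def structure_avf(ace_bits_per_cycle, total_bits):
    """AVF = (average ACE bits per cycle) / (total bits in the structure)."""
    average_ace = sum(ace_bits_per_cycle) / len(ace_bits_per_cycle)
    return average_ace / total_bits


# 4-bit structure observed for 4 cycles with 2, 1, 0, and 3 ACE bits.
avf = structure_avf([2, 1, 0, 3], total_bits=4)   # 1.5 / 4 = 0.375
```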
Little’s Law for ACEs
$$N_{ace} = T_{ace} \times L_{ace}$$
$$\mathrm{AVF} = \frac{N_{ace}}{N_{total}}$$
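A small numeric sketch, reading T_ace as the average throughput of ACE entries into the structure and L_ace as the average time each one stays resident (the usual Little's Law reading; the slide does not spell the symbols out). The queue size and rates below are hypothetical.

```python
def avf_from_littles_law(ace_throughput, ace_latency, total_entries):
    """AVF = N_ace / N_total, with N_ace = T_ace * L_ace."""
    n_ace = ace_throughput * ace_latency   # average ACE entries resident
    return n_ace / total_entries


# Hypothetical queue: 16 entries, one ACE instruction arriving every
# 2 cycles (0.5 per cycle), each resident for 8 cycles on average.
avf = avf_from_littles_law(ace_throughput=0.5, ace_latency=8, total_entries=16)
```

On average 4 of the 16 entries hold ACE state, giving an AVF of 25%.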
Computing AVF
• Approach is conservative
– Assume every bit is ACE unless proven otherwise
• Data Analysis using a Performance Model
– Prove that data held in a structure is un-ACE
• Timing Analysis using a Performance Model
– Tracks the time this data spent in the structure
Dynamic Instruction Breakdown
• ACE: 46%
• NOP: 26%
• Dynamically dead: 20%
• Predicated false: 7%
• Performance inst: 1%
Average across Spec2K slices
Mapping ACE & un-ACE Instructions to
the Instruction Queue
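[Figure: instruction queue entries labeled ACE Inst, ACE Inst, Idle, NOP, Prefetch, Ex-ACE Inst, Wrong-Path Inst]
• Architectural un-ACE: NOP, Prefetch
• Micro-architectural un-ACE: Idle, Ex-ACE Inst, Wrong-Path Inst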
ACE Lifetime Analysis (1)
(e.g., write-through data cache)
• Idle is unACE
Fill Read Read Evict
Idle Valid Valid Valid Idle
• Assuming all time intervals are equal
• For 3/5 of the lifetime the bit is valid
• Gives a measure of the structure’s utilization
– Number of useful bits
– Amount of time useful bits are resident in structure
– Valid for a particular trace
ACE Lifetime Analysis (2)
(e.g., write-through data cache)
• Valid is not necessarily ACE
Fill Read Read Evict
Idle Idle
Write-through Data Cache
• ACE % = AVF = 2/5 = 40%
• Example Lifetime Components
– ACE: fill-to-read, read-to-read
– unACE: idle, read-to-evict, write-to-evict
ACE Lifetime Analysis (3)
(e.g., write-through data cache)
• Data ACEness is a function of instruction ACEness
Fill Read Read Evict
Idle Idle
Write-through Data Cache
• Second Read is by an unACE instruction
• AVF = 1/5 = 20%
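The lifetime bookkeeping for one bit of a write-through cache can be sketched as follows: within each fill-to-evict residency, the bit is ACE from the fill up to the last read by an ACE instruction, and un-ACE otherwise. The event encoding and function name are assumptions, and a residency with no evict before the end of the trace is simply not counted in this sketch.

```python
def avf_of_bit(events, total_time):
    """Fraction of total_time during which the bit holds ACE state.

    events is a time-ordered list of (time, kind, reader_is_ace) with
    kind in {"fill", "read", "evict"}; reader_is_ace only matters for reads.
    """
    ace_time = 0
    fill_time = last_ace_read = None
    for time, kind, reader_is_ace in events:
        if kind == "fill":
            fill_time, last_ace_read = time, None
        elif kind == "read" and reader_is_ace and fill_time is not None:
            last_ace_read = time
        elif kind == "evict":
            if last_ace_read is not None:
                ace_time += last_ace_read - fill_time
            fill_time = last_ace_read = None
    return ace_time / total_time


# Five equal intervals: idle, fill-to-read, read-to-read, read-to-evict, idle.
both_ace = [(1, "fill", False), (2, "read", True), (3, "read", True), (4, "evict", False)]
second_unace = [(1, "fill", False), (2, "read", True), (3, "read", False), (4, "evict", False)]

avf_case2 = avf_of_bit(both_ace, total_time=5)       # 2/5
avf_case3 = avf_of_bit(second_unace, total_time=5)   # 1/5
```

The two calls reproduce the 40% and 20% figures from slides (2) and (3).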
Instruction Queue
• Idle: 31%
• ACE: 29%
• NOP: 15%
• Ex-ACE: 10%
• Dynamically dead: 8%
• Wrong path: 3%
• Predicated false: 3%
• Performance inst: 1%
ACE percentage = AVF = 29%
Strike on a bit (e.g., in register file)
• Bit read?
– no: benign fault, no error
– yes: bit has error protection?
   – detection & correction: no error
   – detection only: affects program outcome? yes: True DUE; no: False DUE
   – no protection: affects program outcome? yes: SDC; no: benign fault, no error
SDC = Silent Data Corruption, DUE = Detected Unrecoverable Error
DUE AVF of Instruction Queue with Parity
• True DUE AVF: 29%
• False DUE AVF: 33%
– Uncommitted: 6%
– Neutral: 11%
– Dynamically dead: 16%
• Idle & Misc: 38%
Methodology: CPU2000, Asim, Simpoint, Itanium®2-like
Sources of False DUE in an Instruction Queue
• Instructions with uncommitted results
– e.g., wrong-path, predicated-false
– solution: π (possibly incorrect) bit till commit
• Instruction types neutral to errors
– e.g., no-ops, prefetches, branch predict hints
– solution: anti-π bit
• Dynamically dead instructions
– instructions whose results will not be used in the future
– solution: π bit beyond commit
Coping with Wrong-Path Instructions
(assume parity-protected instruction queue)
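[Figure: pipeline Fetch → Decode → IQ → RR → Execute → Commit, between the Instruction Cache (IC) and the Data Cache; a strike (X) hits an instruction in the IQ, and the parity check declares the error on issue]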
• Problem: not enough information at issue
The π (Possibly Incorrect) Bit
(assume parity-protected instruction queue)
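[Figure: pipeline Fetch → Decode → IQ → RR → Execute → Commit, between the Instruction Cache (IC) and the Data Cache; on issue the error is posted in the instruction's π bit, which then travels with the instruction through RR, Execute, and Commit]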
At commit point, declare error only if not wrong-path
instruction and π bit is set
Anti-π bit: coping with No-ops
(assume parity-protected instruction queue)
[Figure: pipeline Fetch → Decode → IQ → RR → Execute → Commit, between the Instruction Cache (IC) and the Data Cache; the instruction carries an anti-π bit, which neutralizes the π bit]
On issue, if the anti-π bit is set, then do not set the π bit
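The π and anti-π rules for the instruction queue can be sketched as three small steps: decode marks error-neutral instructions with the anti-π bit, issue posts a parity error into the π bit unless anti-π is set, and commit declares the error only for instructions that are on the correct path. The class, field, and opcode names are assumptions for illustration.

```python
from dataclasses import dataclass

NEUTRAL_OPCODES = {"nop", "prefetch", "branch_hint"}   # hypothetical opcode names


@dataclass
class Inst:
    opcode: str
    wrong_path: bool = False     # resolved by the time the instruction commits
    parity_error: bool = False   # a strike corrupted the IQ entry
    anti_pi: bool = False
    pi: bool = False


def decode(inst: Inst) -> Inst:
    """Mark instruction types that are neutral to errors."""
    inst.anti_pi = inst.opcode in NEUTRAL_OPCODES
    return inst


def issue(inst: Inst) -> Inst:
    """Post the parity error in the pi bit instead of declaring it."""
    if inst.parity_error and not inst.anti_pi:
        inst.pi = True
    return inst


def commit(inst: Inst) -> bool:
    """Return True if a DUE must be declared for this instruction."""
    return inst.pi and not inst.wrong_path


def declares_error(opcode: str, **fields) -> bool:
    return commit(issue(decode(Inst(opcode, **fields))))
```

For example, `declares_error("add", parity_error=True)` is `True`, while the same strike on a wrong-path `add` or on a `nop` is suppressed.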
π bit: avoiding False DUE on Dynamically Dead Instructions
[Figure: pipeline Fetch → Decode → IQ → RR → Execute → Commit, between the Instruction Cache (IC) and the Data Cache; Inst i (write R1) carries the π bit from the IQ through Commit and into R1; Inst i+n (read R1) sees the π bit when it reads R1]
• Declare the error on reading R1, if π bit is set
• If R1 isn’t read (i.e., dynamically dead), then no False DUE
• π bit can be used in caches & main memory …
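Carrying the π bit past commit can be modeled with a register file that stores one π bit per register and raises the error only when a marked register is actually read. The class and exception names are assumptions made for this sketch.

```python
class DetectedUnrecoverableError(Exception):
    """Raised when possibly incorrect data is about to be consumed."""


class RegisterFile:
    def __init__(self):
        self.value = {}
        self.pi = {}

    def write(self, reg, value, pi=False):
        """A write replaces both the value and the pi bit of the register."""
        self.value[reg] = value
        self.pi[reg] = pi

    def read(self, reg):
        """Declare the error only if the marked value is actually read."""
        if self.pi.get(reg, False):
            raise DetectedUnrecoverableError(f"{reg} is possibly incorrect")
        return self.value[reg]


rf = RegisterFile()
rf.write("R1", 7, pi=True)    # Inst i committed with its pi bit set
rf.write("R1", 9)             # overwritten before any read: dynamically dead
safe_value = rf.read("R1")    # no error is declared
```

If the second write were absent, the read would raise `DetectedUnrecoverableError`, which corresponds to a True DUE.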
% False DUE AVF Eliminated
(PI = π)
• anti-PI bit: 48%
• PI bit till register commit: 18%
• PI bit till register read: 14%
• PI bit till I/O commit: 12%
• PI bit till store commit: 8%
Methodology: CPU2000, Asim, Simpoint, Itanium®2-like
Practical to eliminate most of the False DUE AVF