l24 Reliability


Joel Emer

December 7, 2005
6.823, L24-1

Reliable Architectures

Joel Emer
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology

Strike Changes State of a Single Bit

[Figure: a single stored bit changing between 0 and 1]

Impact of Neutron Strike on a Si Device

[Figure: neutron strike on a transistor device, between source and drain]

Strikes release electron & hole pairs that can be absorbed by source & drain to alter the state of the device

• Secondary source of upsets: alpha particles from packaging

Cosmic Rays Come From Deep Space

[Figure: protons (p) and neutrons (n) showering down to Earth’s surface]

• Neutron flux is higher at higher altitudes
– 3x - 5x increase in Denver at 5,000 feet
– 100x increase in airplanes at 30,000+ feet

Physical Solutions are hard


• Shielding?
– No practical absorbent (e.g., approximately > 10 ft of concrete)
– unlike Alpha particles

• Technology solution: SOI?


– Partially-depleted SOI of some help, effect on logic unclear
– Fully-depleted SOI may help, but is challenging to manufacture

• Circuit level solution?


– Radiation hardened circuits can provide 10x improvement with
significant penalty in performance, area, cost
– 2-4x improvement may be possible with less penalty
Triple Modular Redundancy

(Von Neumann, 1956)

[Figure: three modules (M) feed a voter (V) that produces the Result]

V does a majority vote on the results
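
The voting step can be sketched in a few lines. This is a minimal illustration, not the slide's circuit: the names `majority_vote`, `tmr`, `faulty_copy` and `corrupt` are hypothetical, and a strike is mimicked by flipping the low bit of one module's integer result.

```python
from collections import Counter

def majority_vote(results):
    """Return the value produced by a strict majority of modules, or raise if none exists."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority: modules disagree")
    return value

def tmr(module, inputs, faulty_copy=None, corrupt=lambda r: r ^ 1):
    """Run three copies of `module`; optionally corrupt one copy's result to mimic a strike."""
    results = [module(inputs) for _ in range(3)]
    if faulty_copy is not None:
        results[faulty_copy] = corrupt(results[faulty_copy])
    return majority_vote(results)

add = lambda xs: sum(xs)
print(tmr(add, [1, 2, 3]))                 # all three copies agree
print(tmr(add, [1, 2, 3], faulty_copy=1))  # the corrupted copy is outvoted
```

With a single corrupted copy the other two still agree, so the voter masks the fault; if all three results differ, this sketch raises instead of guessing.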


Dual Modular Redundancy

(e.g., Binac, Stratus)

[Figure: two modules feed a comparator (C) that checks for a mismatch; each module also has its own error signal]

• Processing stops on mismatch
• Error signal is used to decide which processor is used to restore state to the other
Pair and Spare Lockstep

(e.g., Tandem, 1975)

[Figure: primary and backup modules (M), each checked by a comparator (C) for a mismatch]

• Primary creates periodic checkpoints


• Backup restarts from checkpoint on mismatch
Redundant Multithreading

(e.g., Reinhardt, Mukherjee, 2000)

[Figure: a leading thread and a trailing thread each execute the sequence X W X X W X X W; a comparator (C) checks each pair of W operations for a fault]

• Writes are checked


Component Protection

[Figure: stored bit patterns protected by Parity or ECC check bits, with an error signal on a failed check]

• Fujitsu SPARC in 130 nm technology (ISSCC 2003)


– 80% of 200k latches protected with parity
– versus very few latches protected in commodity microprocessors
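
As a rough illustration of what parity protection on a latch buys, the sketch below stores one even-parity bit alongside a word and rechecks it on read. The helper names (`parity`, `store`, `check`) and the 5-bit example word are assumptions for illustration; ECC, which can also correct errors, is not shown.

```python
def parity(word: int) -> int:
    """Even-parity bit: 1 if the word contains an odd number of 1 bits."""
    return bin(word).count("1") & 1

def store(word: int):
    """Keep the word together with the parity bit computed at write time."""
    return (word, parity(word))

def check(stored) -> bool:
    """Return True if the stored word no longer matches its parity bit."""
    word, p = stored
    return parity(word) != p

latch = store(0b11011)
single_flip = (latch[0] ^ (1 << 2), latch[1])  # one bit struck
double_flip = (latch[0] ^ 0b11, latch[1])      # two bits struck

print(check(latch), check(single_flip), check(double_flip))
```

A single flipped bit is detected, while an even number of flips leaves the parity unchanged and goes unnoticed. Parity only detects; it cannot say which bit flipped.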
Strike on a bit (e.g., in register file)

Bit read?
  no → benign fault, no error
  yes → bit has error protection?
    detection & correction → no error
    no protection → affects program outcome?
      yes → SDC
      no → benign fault, no error
    detection only → affects program outcome?
      yes → True DUE
      no → False DUE

SDC = Silent Data Corruption, DUE = Detected Unrecoverable Error
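
The outcome classification on this slide can be written as a small function. This is a sketch of the decision flow only; the function name `classify_strike` and the string labels for the protection level are hypothetical.

```python
def classify_strike(bit_read: bool, protection: str, affects_outcome: bool) -> str:
    """Classify a single-bit strike.

    protection is one of "none", "detect_only", "detect_and_correct"
    (labels chosen for this sketch).
    """
    if not bit_read:
        return "benign fault (no error)"
    if protection == "detect_and_correct":
        return "no error"
    if protection == "detect_only":
        # The error is flagged either way; it is "False" if it would not have mattered.
        return "True DUE" if affects_outcome else "False DUE"
    if protection == "none":
        return "SDC" if affects_outcome else "benign fault (no error)"
    raise ValueError(f"unknown protection level: {protection}")

for case in [(False, "none", True), (True, "none", True),
             (True, "detect_only", False), (True, "detect_and_correct", True)]:
    print(case, "->", classify_strike(*case))
```

Note that adding detection turns would-be benign faults into False DUE events, which is why detection-only protection can raise the reported error rate.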


Metrics

• Interval-based
– MTTF = Mean Time to Failure
– MTTR = Mean Time to Repair
– MTBF = Mean Time Between Failures = MTTF + MTTR
– Availability = MTTF / MTBF
• Rate-based
– FIT = Failure in Time = 1 failure in a billion hours

– 1 year MTTF = 10^9 / (24 * 365) FIT = 114,155 FIT

– SER FIT = SDC FIT + DUE FIT

Hypothetical Example
[Image removed due to copyright restrictions.]
Cache: 0 FIT + IQ: 100K FIT + FU: 58K FIT = Total of 158K FIT
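
The conversions between the interval-based and rate-based metrics are simple enough to check numerically. The sketch below uses hypothetical helper names and assumes, as the example above does, that component FIT rates simply add.

```python
HOURS_PER_YEAR = 24 * 365
BILLION_HOURS = 10**9

def mttf_years_to_fit(years: float) -> float:
    """FIT = failures per billion hours, so FIT = 10^9 / MTTF in hours."""
    return BILLION_HOURS / (years * HOURS_PER_YEAR)

def fit_to_mttf_years(fit: float) -> float:
    """Inverse conversion: MTTF in years for a given FIT rate."""
    return BILLION_HOURS / (fit * HOURS_PER_YEAR)

def availability(mttf: float, mttr: float) -> float:
    """Availability = MTTF / MTBF, with MTBF = MTTF + MTTR (same time unit)."""
    return mttf / (mttf + mttr)

component_fit = {"cache": 0, "iq": 100_000, "fu": 58_000}
total_fit = sum(component_fit.values())

print(round(mttf_years_to_fit(1)))             # FIT for a 1 year MTTF
print(total_fit)                               # total FIT of the example
print(round(fit_to_mttf_years(total_fit), 2))  # MTTF of the example, in years
```

The 158K FIT total corresponds to an MTTF of a little under nine months, far short of a multi-year goal.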


Cosmic Ray Strikes: Evidence & Reaction

• Publicly disclosed incidents
– Error logs in large servers: E. Normand, “Single Event Upset at Ground Level,” IEEE Trans. on Nucl. Sci., Vol. 43, No. 6, Dec 1996.
– Sun Microsystems found that cosmic ray strikes on an L2 cache with defective error protection caused Sun’s flagship servers to crash: R. Baumann, IRPS Tutorial on SER, 2000.
– Cypress Semiconductor reported in 2004 that a single soft error brought a billion-dollar automotive factory to a halt once a month: Ziegler & Puchner, “SER – History, Trends, and Challenges,” Cypress, 2004.


# Vulnerable Bits Growing with Moore’s Law

[Figure: log-scale plot (1 to 10000) by Year, 2003 to 2012, with curves for 100% Vulnerable and 20% Vulnerable bits, a 1000 year MTBF Goal line, and a 12x GAP]

Typical SDC goal: 1000 year MTBF
Typical DUE goal: 10-25 year MTBF

Architectural Vulnerability Factor (AVF)

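$$\mathrm{AVF}_{bit} = \text{Probability Bit Matters} = \frac{\text{\# of Visible Errors}}{\text{\# of Bit Flips from Particle Strikes}}$$

$$\mathrm{FIT}_{bit} = \text{intrinsic } \mathrm{FIT}_{bit} \times \mathrm{AVF}_{bit}$$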



Architectural Vulnerability Factor

Does a bit matter?

• Branch Predictor
– Doesn’t matter at all (AVF = 0%)

• Program Counter
– Almost always matters (AVF ~ 100%)
Statistical Fault Injection (SFI) with RTL

[Figure: simulate a strike on a latch (0 flipped to 1) whose output feeds logic blocks. Does the fault propagate to architectural state?]

+ Naturally characterizes all logical structures
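
The idea can be illustrated on a toy model rather than real RTL. In the sketch below everything is an assumption made for illustration: the 16-bit state, the `program_output` function standing in for the logic, and the seeded random sampling. It flips one randomly chosen bit per trial and counts how often the flip becomes visible in the output.

```python
import random

STATE_BITS = 16  # hypothetical: two 8-bit latches, a (bits 0-7) and b (bits 8-15)

def program_output(state: int) -> int:
    """Toy logic: only the low nibble of latch a reaches architectural state."""
    a = state & 0xFF
    return a & 0x0F

def sfi_avf(golden_state: int, trials: int, seed: int = 0) -> float:
    """Estimate AVF as (# visible errors) / (# injected bit flips)."""
    rng = random.Random(seed)
    golden = program_output(golden_state)
    visible = 0
    for _ in range(trials):
        bit = rng.randrange(STATE_BITS)           # pick a bit to strike
        faulty_state = golden_state ^ (1 << bit)  # inject the single-bit flip
        if program_output(faulty_state) != golden:
            visible += 1
    return visible / trials

estimate = sfi_avf(0xBEEF, trials=10_000)
print(f"estimated AVF = {estimate:.2f}")  # the true value for this toy is 4/16
```

Because only 4 of the 16 bits can influence the output, the estimate converges toward 0.25 as the number of trials grows.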


Architecturally Correct Execution (ACE)

[Figure: data flow graph from Program Input to Program Outputs]
• ACE path requires only a subset of values to flow correctly
through the program’s data flow graph (and the machine)
• Anything else (un-ACE path) can be derated away
Example of un-ACE instruction: Dynamically Dead Instruction

[Figure: a Dynamically Dead Instruction highlighted]

Most bits of an un-ACE instruction do not affect program output

Vulnerability of a structure

AVF = fraction of cycles a bit contains ACE state

[Figure: a 4-bit structure observed over 4 cycles]
T=1 ACE% = 2/4
T=2 ACE% = 1/4
T=3 ACE% = 0/4
T=4 ACE% = 3/4

$$\mathrm{AVF} = \frac{(2+1+0+3)/4}{4} = \frac{\text{Average number of ACE bits in a cycle}}{\text{Total number of bits in the structure}}$$
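
The arithmetic for the four-cycle example can be checked directly. The function name `structure_avf` is hypothetical; the per-cycle ACE counts are the ones from the example.

```python
def structure_avf(ace_bits_per_cycle, total_bits):
    """AVF = (average number of ACE bits per cycle) / (total bits in the structure)."""
    average_ace_bits = sum(ace_bits_per_cycle) / len(ace_bits_per_cycle)
    return average_ace_bits / total_bits

avf = structure_avf([2, 1, 0, 3], total_bits=4)
print(avf)
```

For this example the average is 1.5 ACE bits per cycle out of 4 bits, giving an AVF of 37.5%.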

Little’s Law for ACEs

$$N_{ace} = T_{ace} \times L_{ace}$$

$$\mathrm{AVF} = \frac{N_{ace}}{N_{total}}$$
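
A short numeric sketch of the relation follows. The slide does not define the symbols, so the reading used here is an assumption: T_ace is the average rate at which ACE bits enter the structure (bits per cycle), L_ace is the average number of cycles each ACE bit stays resident, and N_total is the structure's size in bits. The function name and the numbers are made up for illustration.

```python
def avf_littles_law(t_ace: float, l_ace: float, n_total: int) -> float:
    """AVF = N_ace / N_total, with N_ace = T_ace * L_ace (Little's Law)."""
    n_ace = t_ace * l_ace  # average number of ACE bits resident at any time
    if n_ace > n_total:
        raise ValueError("more ACE bits resident than the structure can hold")
    return n_ace / n_total

# Hypothetical structure: 64 bits, 2 ACE bits arriving per cycle, each resident 8 cycles.
print(avf_littles_law(t_ace=2, l_ace=8, n_total=64))
```

Under this reading, either admitting fewer ACE bits per cycle or shortening their residency lowers the AVF.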
Computing AVF

• Approach is conservative
– Assume every bit is ACE unless proven otherwise

• Data Analysis using a Performance Model


– Prove that data held in a structure is un-ACE

• Timing Analysis using a Performance Model


– Tracks the time this data spent in the structure

Dynamic Instruction Breakdown

• ACE: 46%
• NOP: 26%
• DYNAMICALLY DEAD: 20%
• PREDICATED FALSE: 7%
• PERFORMANCE INST: 1%

Average across Spec2K slices

Mapping ACE & un-ACE Instructions to the Instruction Queue

[Figure: instruction queue entries holding ACE Inst, NOP, Prefetch, Ex-ACE Inst, Wrong-Path Inst, and Idle slots, with un-ACE entries grouped as Architectural un-ACE or Micro-architectural un-ACE]


ACE Lifetime Analysis (1)

(e.g., write-through data cache)

• Idle is unACE

[Timeline: Idle → Fill → Valid → Read → Valid → Read → Valid → Evict → Idle]

• Assuming all time intervals are equal


• For 3/5 of the lifetime the bit is valid
• Gives a measure of the structure’s utilization
– Number of useful bits
– Amount of time useful bits are resident in structure
– Valid for a particular trace

ACE Lifetime Analysis (2)


(e.g., write-through data cache)

• Valid is not necessarily ACE

[Timeline, write-through data cache: Idle → Fill → ACE → Read → ACE → Read → unACE → Evict → Idle]

• ACE % = AVF = 2/5 = 40%


• Example Lifetime Components
– ACE: fill-to-read, read-to-read
– unACE: idle, read-to-evict, write-to-evict

ACE Lifetime Analysis (3)


(e.g., write-through data cache)

• Data ACEness is a function of instruction ACEness

[Timeline, write-through data cache: Idle → Fill → ACE → Read → unACE → Read (by unACE instruction) → unACE → Evict → Idle]

• Second Read is by an unACE instruction

• AVF = 1/5 = 20%
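
The lifetime bookkeeping for the write-through cache examples can be sketched as follows. The function `lifetime_avf` and its event encoding are hypothetical. The sketch assumes a single fill and evict with only reads in between, equal-length intervals, and one idle interval before the fill and one after the evict. It treats the bit as ACE from the fill until the last read made by an ACE instruction.

```python
def lifetime_avf(events):
    """events: time-ordered (name, reader_is_ace) pairs, e.g. ("read", True).

    The flag is only meaningful for reads; use None for fill and evict.
    """
    names = [name for name, _ in events]
    fill = names.index("fill")
    ace_reads = [i for i, (name, is_ace) in enumerate(events)
                 if name == "read" and is_ace]
    total_intervals = len(events) + 1  # gaps between events plus two idle intervals
    ace_intervals = (max(ace_reads) - fill) if ace_reads else 0
    return ace_intervals / total_intervals

both_reads_ace = [("fill", None), ("read", True), ("read", True), ("evict", None)]
second_read_unace = [("fill", None), ("read", True), ("read", False), ("evict", None)]

print(lifetime_avf(both_reads_ace))     # fill-to-read and read-to-read are ACE
print(lifetime_avf(second_read_unace))  # only fill-to-read is ACE
```

The two traces reproduce the 40% and 20% figures from the lifetime analysis: once the last ACE read has happened, the remaining time until eviction no longer counts.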


Instruction Queue

• IDLE: 31%
• ACE: 29%
• NOP: 15%
• Ex-ACE: 10%
• DYNAMICALLY DEAD: 8%
• WRONG PATH: 3%
• PREDICATED FALSE: 3%
• PERFORMANCE INST: 1%

ACE percentage = AVF = 29%


Strike on a bit (e.g., in register file)

Bit read?
  no → benign fault, no error
  yes → bit has error protection?
    detection & correction → no error
    no protection → affects program outcome?
      yes → SDC
      no → benign fault, no error
    detection only → affects program outcome?
      yes → True DUE
      no → False DUE

SDC = Silent Data Corruption, DUE = Detected Unrecoverable Error


DUE AVF of Instruction Queue with Parity

• True DUE AVF: 29%
• False DUE AVF: 33%
– Uncommitted: 6%
– Neutral: 16%
– Dynamically Dead: 11%
• Idle & Misc: 38%

Setup: CPU2000, Asim, Simpoint, Itanium®2-like
Sources of False DUE in an Instruction Queue

• Instructions with uncommitted results

– e.g., wrong-path, predicated-false


– solution: π (possibly incorrect) bit till commit
• Instruction types neutral to errors
– e.g., no-ops, prefetches, branch predict hints
– solution: anti- π bit
• Dynamically dead instructions
– instructions whose results will not be used in future
– solution: π bit beyond commit
Coping with Wrong-Path Instructions

(assume parity-protected instruction queue)

[Figure: pipeline from Instruction Cache (IC) through Fetch, Decode, IQ, RR, Execute, and Commit to Data Cache; an instruction is struck (X) in the IQ and the error is declared on issue]

• Problem: not enough information at issue


The π (Possibly Incorrect) Bit

(assume parity-protected instruction queue)

[Figure: pipeline from Instruction Cache (IC) through Fetch, Decode, IQ, RR, Execute, and Commit to Data Cache; the error is posted in the π bit on issue, and the instruction carries the π bit from the IQ through Commit]

At commit point, declare error only if not wrong-path instruction and π bit is set
Anti-π bit: coping with No-ops

(assume parity-protected instruction queue)

[Figure: pipeline from Instruction Cache (IC) through Fetch, Decode, IQ, RR, Execute, and Commit to Data Cache; the instruction carries an anti-π bit into the IQ, and the anti-π bit neutralizes the π bit]

On issue, if the anti-π bit is set, then do not set the π bit
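
The interplay of the π and anti-π bits can be sketched as three small pipeline hooks. This is an illustration under assumptions, not the actual hardware: the `Inst` record, the `NEUTRAL_OPCODES` set, and the `decode`/`issue`/`commit` function names are hypothetical, and a parity error is modelled as a flag on the instruction.

```python
from dataclasses import dataclass

NEUTRAL_OPCODES = {"nop", "prefetch", "branch_hint"}  # instruction types neutral to errors

@dataclass
class Inst:
    opcode: str
    parity_error: bool = False  # IQ parity check failed for this entry
    wrong_path: bool = False    # instruction turns out to be on the wrong path
    anti_pi: bool = False
    pi: bool = False

def decode(inst: Inst) -> Inst:
    """Set the anti-pi bit for instruction types that cannot affect the outcome."""
    inst.anti_pi = inst.opcode in NEUTRAL_OPCODES
    return inst

def issue(inst: Inst) -> Inst:
    """On a parity error, post the error in the pi bit unless anti-pi is set."""
    if inst.parity_error and not inst.anti_pi:
        inst.pi = True
    return inst

def commit(inst: Inst) -> bool:
    """Declare an error only if the pi bit is set and the instruction is not wrong-path."""
    return inst.pi and not inst.wrong_path

def run(inst: Inst) -> bool:
    return commit(issue(decode(inst)))

print(run(Inst("add", parity_error=True)))                   # real instruction: error declared
print(run(Inst("add", parity_error=True, wrong_path=True)))  # wrong path: suppressed
print(run(Inst("nop", parity_error=True)))                   # neutral: anti-pi suppresses it
```

Deferring the decision from issue to commit is what supplies the missing information: by then the machine knows whether the struck instruction was on the correct path.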
π bit: avoiding False DUE on Dynamically Dead Instructions

[Figure: pipeline from Instruction Cache (IC) through Fetch, Decode, IQ, RR, Execute, and Commit to Data Cache; Inst i (write R1) picks up a π bit in the IQ and carries it through Commit, so R1 is written with the π bit; a later Inst i+n (read R1) sees the π bit when it reads R1]

• Declare the error on reading R1, if π bit is set


• If R1 isn’t read (i.e., dynamically dead), then no False DUE
• π bit can be used in caches & main memory …
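
Carrying the π bit into the register file can be sketched as below. The class `PiRegisterFile`, the exception name, and the `scenario` helper are hypothetical; the sketch only shows that the error is raised on a read of a π-tagged register and never raised if the register is overwritten first.

```python
class DetectedUnrecoverableError(Exception):
    """Raised when a value tagged as possibly incorrect is actually consumed."""

class PiRegisterFile:
    def __init__(self):
        self.value = {}
        self.pi = {}

    def write(self, reg, value, pi=False):
        """A write replaces both the value and its pi bit."""
        self.value[reg] = value
        self.pi[reg] = pi

    def read(self, reg):
        """Declare the error only when a pi-tagged register is read."""
        if self.pi.get(reg, False):
            raise DetectedUnrecoverableError(f"{reg} is possibly incorrect")
        return self.value[reg]

def scenario(dynamically_dead: bool) -> str:
    rf = PiRegisterFile()
    rf.write("R1", 42, pi=True)  # Inst i was struck; its result carries the pi bit
    if dynamically_dead:
        rf.write("R1", 7)        # R1 is overwritten before anyone reads it
    try:
        rf.read("R1")            # Inst i+n reads R1
    except DetectedUnrecoverableError:
        return "DUE"
    return "no error"

print(scenario(dynamically_dead=False))
print(scenario(dynamically_dead=True))
```

When the struck result is overwritten before being read, the π bit is discarded along with the value and no False DUE is reported.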
% False DUE AVF Eliminated (PI = π)

• anti-PI bit: 48%
• PI bit till register commit: 18%
• PI bit till register read: 14%
• PI bit till I/O commit: 12%
• PI bit till store commit: 8%

Setup: CPU2000, Asim, Simpoint, Itanium®2-like
Practical to eliminate most of the False DUE AVF
