University of Massachusetts Dept. of Electrical & Computer Engineering Fault Tolerant Computing
ECE655/Krishna Part.1 .1
Prerequisites
Basic courses in Digital Design, Hardware Organization/Computer Architecture, and Probability
Traditional Measures
The system is assumed to be in one of two states: up or down
Examples: A lightbulb is either good or burned out; a wire is either connected or broken
Two traditional measures: Reliability and Availability
Reliability, R(t): the probability that the system is up during the whole interval [0,t], given that it was up at time 0
Availability, A(t): the average fraction of time that the system is up during the interval [0,t]
Point Availability, Ap(t): the probability that the system is up at time t
A related measure is MTTF (Mean Time To Failure): the average time the system remains up before it goes down and has to be repaired or replaced
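The difference between these measures can be seen on a single observed history of up periods. The sketch below is only an illustration: the function names and the sample log are hypothetical, and one history yields a single sample, whereas R(t) and Ap(t) are probabilities that would be estimated by averaging over many such histories.

```python
def availability(up_intervals, t):
    """Fraction of [0, t] during which the system was up.

    up_intervals: non-overlapping (start, end) periods in which the system was up.
    """
    if t <= 0:
        raise ValueError("t must be positive")
    up_time = 0.0
    for start, end in up_intervals:
        lo = max(start, 0.0)
        hi = min(end, t)
        if hi > lo:
            up_time += hi - lo
    return up_time / t


def up_at(up_intervals, t):
    """Sample of the point-availability event: is the system up at time t?"""
    return any(start <= t < end for start, end in up_intervals)


def up_throughout(up_intervals, t):
    """Sample of the reliability event: was the system up during all of [0, t]?"""
    return any(start <= 0 and end >= t for start, end in up_intervals)


# Hypothetical history: up on [0,40), down for repair, up on [50,90), up on [95,100)
log = [(0.0, 40.0), (50.0, 90.0), (95.0, 100.0)]
fraction_up = availability(log, 100.0)   # 85 time units up out of 100
```

In this history the system is up at time 60 (so it counts toward Ap(60)) even though it did not stay up throughout [0,60] (so it does not count toward R(60)).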
ECE655/Krishna Part.1 .9 Copyright 2006 Koren & Krishna
Computational Capacity
Example: N processors in a gracefully degrading system
The system recovers from failures of processors and is useful as long as at least one processor remains operational
Let Pi = Prob {i processors are operational}
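One way to turn the probabilities Pi into a single figure is to weight each state by the capacity it delivers and sum. The sketch below does this under assumptions that are not in the slides: processors are up independently with the same probability `p_up` (so Pi is binomial), and the capacity values `c` are hypothetical.

```python
from math import comb


def operational_distribution(n, p_up):
    """P_i for i = 0..n, ASSUMING each processor is up independently with prob. p_up."""
    return [comb(n, i) * p_up**i * (1 - p_up) ** (n - i) for i in range(n + 1)]


def average_capacity(probs, capacities):
    """Expected capacity: sum over i of c_i * P_i."""
    return sum(c * p for c, p in zip(capacities, probs))


N = 4
P = operational_distribution(N, 0.9)

# Hypothetical capacities: c_i proportional to the number of working processors
c = [float(i) for i in range(N + 1)]

avg = average_capacity(P, c)
prob_useful = 1.0 - P[0]   # at least one processor remains operational
```

With capacity proportional to i, the average reduces to the mean of the binomial distribution, N times `p_up`; a different capacity vector (for example, one that accounts for coordination overhead) would give a different result.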
Performability
The application is used to define accomplishment levels L1, L2,...,Ln
Each represents a level of quality of service delivered by the application
Example: Li indicates i system crashes during the mission time period T
Performability is a vector (P(L1),P(L2),...,P(Ln)), where P(Li) is the probability that the computer functions well enough to permit the application to reach up to accomplishment level Li
Connectivity - Examples
A network that breaks into one large component and several small pieces may still be able to function, whereas a network that splinters into a large number of small pieces may not
Another measure of a network's resilience to failure: the probability distribution of the size of the largest component upon disconnection
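The size of the largest surviving component can be computed with a breadth-first search over the links that are still working. The sketch below is a minimal illustration; the six-node ring and the failed links are hypothetical, and estimating a probability distribution would require repeating this over many randomly sampled failure patterns.

```python
from collections import deque


def component_sizes(nodes, links):
    """Sizes of the connected components, largest first."""
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        if a in adjacency and b in adjacency:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen = set()
    sizes = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Hypothetical network: a ring of six nodes
nodes = list(range(6))
ring = [(i, (i + 1) % 6) for i in range(6)]

# Two different pairs of failed links
after_a = [link for link in ring if link not in {(0, 1), (1, 2)}]
after_b = [link for link in ring if link not in {(0, 1), (3, 4)}]

sizes_a = component_sizes(nodes, after_a)   # one large piece plus an isolated node
sizes_b = component_sizes(nodes, after_b)   # two equal halves
```

Both failure patterns disconnect the ring, but the first leaves a component of five nodes while the second leaves no component larger than three.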
Redundancy
Redundancy is at the heart of fault tolerance
Redundancy can be defined as the incorporation of extra parts in the design of a system in such a way that its function is not impaired in the event of a failure
We will study four forms of redundancy:
1. Hardware redundancy
2. Software redundancy
3. Information redundancy
4. Time redundancy
Hardware Redundancy
Extra hardware is added to override the effects of a failed component
Static Hardware Redundancy - for immediate masking of a failure
Example: Use three processors (instead of one), each performing the same function. The majority output of these processors can override the wrong output of a single faulty processor
Dynamic Hardware Redundancy - spare components are activated upon the failure of a currently active component
Hybrid Hardware Redundancy - a combination of static and dynamic redundancy techniques
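The three-processor example can be sketched in software as a majority voter over replicated computations. This is only an illustration of the masking idea, not a hardware design; the functions `good_module` and `faulty_module` are hypothetical stand-ins for the processors.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of modules, or None if there is none."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def run_redundant(modules, x):
    """Run every module on the same input and vote on the results."""
    return majority_vote([module(x) for module in modules])


def good_module(x):
    return x * x


def faulty_module(x):
    # Hypothetical fault: the result is off by one
    return x * x + 1


masked = run_redundant([good_module, faulty_module, good_module], 7)
```

A single faulty module is outvoted by the two correct ones. If two modules fail in different ways there is no majority and the voter returns `None`; if two fail in the same way, the wrong value wins the vote.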
Software Redundancy
Software redundancy is provided by having multiple teams of programmers write different versions of software for the same function
The hope is that such diversity will ensure that not all the copies will fail on the same set of input data
Failure Rate
The rate at which a component suffers faults depends on its age, the ambient temperature, any voltage or physical shocks that it suffers, and the technology
The dependence on age is usually captured by the bathtub curve
Bathtub Curve
When components are very young, the failure rate is quite high: there is a good chance that some units with manufacturing defects slipped through manufacturing quality control and were released
As time goes on, these units are weeded out, and a component spends the bulk of its life showing a fairly constant failure rate
As it becomes very old, aging effects start to take over, and the failure rate rises again
$\lambda = \pi_L \pi_Q (C_1 \pi_T \pi_V + C_2 \pi_E)$
$\pi_L$: Learning factor (how mature the technology is)
$\pi_Q$: Manufacturing process quality factor (0.25 to 20.00)
$\pi_T$: Temperature factor (from 0.1 to 1000), proportional to $\exp(-E_a/kT)$, where $E_a$ is the activation energy in electronvolts associated with the technology, $k$ is the Boltzmann constant and $T$ is the temperature in Kelvin
$\pi_V$: Voltage stress factor for CMOS devices (from 1 to 10, depending on the supply voltage and the temperature); does not apply to other technologies (set to 1)
$\pi_E$: Environment shock factor: from about 0.4 (air ...
$C_1$, $C_2$: Complexity factors; functions of the number of gates on the chip and the number of pins in the package
Further details: MIL-HDBK-217E handbook
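The failure rate expression is a product of the learning and quality factors with a weighted sum of a temperature/voltage term and an environment term. The sketch below evaluates it and shows how the temperature factor scales between two temperatures under the stated proportionality to exp(-Ea/kT). All numeric inputs are hypothetical placeholders, not handbook values, and the function and parameter names are chosen here for illustration.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5   # Boltzmann constant in eV/K


def failure_rate(pi_l, pi_q, c1, pi_t, pi_v, c2, pi_e):
    """lambda = pi_L * pi_Q * (C1 * pi_T * pi_V + C2 * pi_E)."""
    return pi_l * pi_q * (c1 * pi_t * pi_v + c2 * pi_e)


def temperature_factor_ratio(ea_ev, t1_kelvin, t2_kelvin):
    """Ratio pi_T(t2) / pi_T(t1), assuming pi_T is proportional to exp(-Ea / kT)."""
    exponent = -(ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t2_kelvin - 1.0 / t1_kelvin)
    return math.exp(exponent)


# Hypothetical factor values, for illustration only
lam = failure_rate(pi_l=1.0, pi_q=2.0, c1=0.5, pi_t=4.0, pi_v=1.0, c2=0.25, pi_e=0.4)

# Hypothetical activation energy of 0.5 eV, device heated from 300 K to 350 K
ratio = temperature_factor_ratio(0.5, 300.0, 350.0)
```

Because only a ratio of temperature factors is computed, the unknown constant of proportionality cancels. The ratio is greater than 1 whenever the second temperature is higher, which reflects the strong dependence of the failure rate on operating temperature.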
Environment Impact
Devices operating in space, which is replete with energy-charged particles and can subject devices to severe temperature swings, can be expected to fail more often
The same holds for computers in automobiles (high temperature and vibration) and in industrial applications
Containment Zones
To limit such situations, designers incorporate containment zones into systems
Containment zones are barriers that reduce the chance that a fault or error in one zone will propagate to another
A fault-containment zone can be created by providing an independent power supply to each zone; the designer tries to electrically isolate one zone from another
An error-containment zone can be created by using redundant units and voting on their output
Probabilistic Interpretation
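F(t) - the probability that the component will fail at or before time t, where T denotes the lifetime of the component
$F(t) = \mathrm{Prob}(T \le t)$
f(t) - the momentary rate of failure
$f(t)\,dt = \mathrm{Prob}(t \le T \le t+dt)$
Like any density function (defined for $t \ge 0$), f(t) is related to F(t) by
$F(t) = \int_0^t f(s)\,ds$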
$\mathrm{MTTF} = \int_0^\infty e^{-\lambda t}\,dt = 1/\lambda$
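The closed form 1/λ can be checked numerically by integrating the reliability function of a component with constant failure rate λ, taken here as R(t) = exp(-λt). The sketch below uses a plain trapezoidal rule; the value of λ, the truncation horizon and the step count are arbitrary choices made for illustration.

```python
import math


def reliability(t, lam):
    """R(t) = exp(-lam * t) for a constant failure rate lam."""
    return math.exp(-lam * t)


def mttf_numeric(lam, horizon_multiples=50.0, steps=200_000):
    """Trapezoidal estimate of the integral of R(t) over [0, horizon_multiples / lam]."""
    horizon = horizon_multiples / lam
    h = horizon / steps
    total = 0.5 * (reliability(0.0, lam) + reliability(horizon, lam))
    for i in range(1, steps):
        total += reliability(i * h, lam)
    return total * h


lam = 0.002                      # hypothetical failure rate (failures per hour)
estimate = mttf_numeric(lam)     # numerical integral
closed_form = 1.0 / lam          # 500 hours
```

Truncating the integral at 50 mean lifetimes discards a tail of order exp(-50), so the estimate agrees with 1/λ to within the discretization error of the trapezoidal rule.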
A constant failure rate is often assumed, or equivalently, the Exponential distribution for the component lifetime T
There are cases in which this simplifying assumption is inappropriate
Example: during the infant mortality and wear-out phases of the bathtub curve
In such cases, the Weibull distribution for the lifetime T is often used for reliability calculation
where $\Gamma(x)$ is the Gamma function
The special case in which the shape parameter equals 1 is the Exponential distribution with a constant failure rate
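The role of the shape parameter can be illustrated with one common two-parameter form of the Weibull distribution. The parameterization below is an assumption made for this sketch (the symbols `lam` and `beta` and the formulas are not taken from the text above): failure rate λβt^(β-1), reliability exp(-λt^β), and MTTF Γ(1/β) / (β λ^(1/β)).

```python
import math


def weibull_failure_rate(t, lam, beta):
    """Failure rate lam * beta * t**(beta - 1), for t > 0 (assumed parameterization)."""
    return lam * beta * t ** (beta - 1.0)


def weibull_reliability(t, lam, beta):
    """R(t) = exp(-lam * t**beta)."""
    return math.exp(-lam * t ** beta)


def weibull_mttf(lam, beta):
    """MTTF = Gamma(1/beta) / (beta * lam**(1/beta)), the integral of R(t) over t >= 0."""
    return math.gamma(1.0 / beta) / (beta * lam ** (1.0 / beta))


lam = 0.01   # hypothetical scale parameter

# beta < 1: decreasing failure rate (infant mortality phase)
early = [weibull_failure_rate(t, lam, 0.5) for t in (1.0, 10.0, 100.0)]

# beta > 1: increasing failure rate (wear-out phase)
late = [weibull_failure_rate(t, lam, 2.0) for t in (1.0, 10.0, 100.0)]

# beta = 1: constant failure rate, MTTF = 1 / lam
constant = weibull_failure_rate(5.0, lam, 1.0)
mttf_exponential = weibull_mttf(lam, 1.0)
```

With a shape parameter below 1 the failure rate falls with age, above 1 it rises, and at exactly 1 the expressions reduce to the Exponential case with MTTF equal to 1/λ, matching the two ends and the flat middle of the bathtub curve.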