
UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering
Fault Tolerant Computing

ECE 655, Part 1: Introduction
C. M. Krishna, Fall 2006
Copyright 2006 Koren & Krishna

Prerequisites

Basic courses in:
- Digital Design
- Hardware Organization / Computer Architecture
- Probability


Fault Tolerance - Basic Definition

- Fault-tolerant systems - ideally, systems capable of executing their tasks correctly regardless of either hardware failures or software errors
- In practice, we can never guarantee the flawless execution of tasks under all circumstances
- We therefore limit ourselves to the types of failures and errors that are more likely to occur


Need For Fault Tolerance

1. Life-critical applications
2. Harsh environments
3. Highly complex systems


Need For Fault Tolerance - Life-Critical Applications

- Life-critical applications include aircraft, nuclear reactors, chemical plants, and medical equipment
- A malfunction of a computer in such applications can lead to catastrophe
- Their probability of failure must be extremely low, possibly one in a billion per hour of operation


Need For Fault Tolerance - Harsh Environments

- A computing system operating in a harsh environment is subjected to electromagnetic disturbances, particle hits, and the like
- Such a system will suffer a very large number of failures, meaning it will not produce useful results unless some fault tolerance is incorporated


Need For Fault Tolerance - Highly Complex Systems

- Complex systems consist of millions of devices
- Every physical device has a certain probability of failure
- A very large number of devices implies that the likelihood of failures is high
- Without fault tolerance, such a system would experience faults at a frequency that renders it useless


Fault Tolerance Measures

- It is important to have proper yardsticks - measures - by which to quantify the effect of fault tolerance
- A measure is a mathematical abstraction which expresses only some subset of the object's nature
- Which measures should we use?


Traditional Measures

- Assumption: the system can be in one of two states, up or down
  - Examples: a lightbulb is either good or burned out; a wire is either connected or broken
- Two traditional measures: Reliability and Availability
- Reliability, R(t): the probability that the system is up during the interval [0,t], given it was up at time 0
- Availability, A(t): the fraction of time that the system is up during the interval [0,t]
- Point availability, Ap(t): the probability that the system is up at time t
- A related measure is MTTF - Mean Time To Failure - the average time the system remains up before it goes down and has to be repaired or replaced
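The following Python sketch (not part of the original slides) illustrates how reliability, interval availability, and point availability differ for a single repairable unit. The exponential failure and repair distributions and all numeric constants are assumptions chosen only for illustration.

```python
import random

# Monte Carlo sketch of R(T), A(T), and Ap(T) for one repairable unit,
# assuming exponentially distributed failure and repair times (an assumption,
# not something specified on the slide).
FAIL_RATE = 0.01    # failures per hour (assumed)
REPAIR_RATE = 0.5   # repairs per hour (assumed)
T = 100.0           # observation interval [0, T] in hours
TRIALS = 20_000

def one_run():
    # Reliability ignores repair: did the unit survive [0, T] without any failure?
    survived = random.expovariate(FAIL_RATE) > T
    # Availability allows repair: alternate up and down periods until time T.
    t, up, up_time = 0.0, True, 0.0
    while True:
        duration = random.expovariate(FAIL_RATE if up else REPAIR_RATE)
        if t + duration >= T:                 # this period runs past the horizon
            if up:
                up_time += T - t
            return survived, up_time / T, up  # 'up' is the state at time T
        if up:
            up_time += duration
        t += duration
        up = not up                           # failure or repair event

runs = [one_run() for _ in range(TRIALS)]
R  = sum(r for r, _, _ in runs) / TRIALS      # reliability R(T)
A  = sum(a for _, a, _ in runs) / TRIALS      # availability A(T), fraction of up time
Ap = sum(u for _, _, u in runs) / TRIALS      # point availability Ap(T)
print(f"R({T:g}) ~ {R:.3f}   A({T:g}) ~ {A:.3f}   Ap({T:g}) ~ {Ap:.3f}")
```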

Need For More Measures

- The assumption that the system is either up or down is very limiting
- Example: a processor with one of its several hundred million gates stuck at logic value 0, and the rest functional, may affect the processor's output only once in every 25,000 hours of use
- The processor is not fault-free, but it cannot be defined as being down
- More general measures than the traditional reliability and availability are needed

More General Measures

- Capacity Reliability: the probability that the system capacity (as measured, for example, by throughput) at time t exceeds some given threshold at that time
- Another extension is to consider everything from the perspective of the application
- This approach was taken to define the measure known as Performability


Computational Capacity

- Example: N processors in a gracefully degrading system
- The system recovers from processor failures and remains useful as long as at least one processor is operational
- Let Pi = Prob{i processors are operational}
- Reliability of the system: R(t) = Σ_{i=1..N} Pi
- Let c be the computational capacity of a single processor (e.g., the number of fixed-size tasks it can execute)
- Computational capacity of i processors: Ci = i·c
- Computational capacity of the system: Σ_{i=1..N} Ci·Pi
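As a concrete illustration (not taken from the slides), the sketch below evaluates these formulas for a hypothetical gracefully degrading system in which each of the N processors is operational independently with probability p, so that the Pi follow a binomial distribution.

```python
from math import comb

# Computational-capacity sketch for a gracefully degrading N-processor system,
# assuming independent processor failures with per-processor up-probability p.
N = 4        # number of processors (assumed)
p = 0.95     # probability that a given processor is operational (assumed)
c = 100      # capacity of one processor, e.g. fixed-size tasks executed (assumed)

# P[i] = Prob{exactly i processors are operational} -- binomial under independence
P = [comb(N, i) * p**i * (1 - p)**(N - i) for i in range(N + 1)]

reliability = sum(P[1:])                              # R = sum over i >= 1 of Pi
capacity = sum(i * c * P[i] for i in range(N + 1))    # sum of Ci * Pi with Ci = i*c

print(f"R = {reliability:.6f}, expected computational capacity = {capacity:.2f}")
```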



Performability

- The application is used to define accomplishment levels L1, L2, ..., Ln
- Each represents a level of quality of service delivered by the application
- Example: Li indicates i system crashes during the mission time period T
- Performability is the vector (P(L1), P(L2), ..., P(Ln)), where P(Li) is the probability that the computer functions well enough to permit the application to reach up to accomplishment level Li


Network Connectivity Measures

- Focus on the network that connects the processors
- Classical node and line connectivity: the minimum number of nodes and lines, respectively, that have to fail before the network becomes disconnected
- The measure indicates how vulnerable the network is to disconnection
- A network disconnected by the failure of just one (critically positioned) node is potentially more vulnerable than one that requires several nodes to fail before it becomes disconnected
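A brute-force Python sketch of classical node connectivity follows. The 5-node example network is hypothetical, and the exhaustive search over node subsets is only practical for small graphs; it is meant to make the definition concrete, not to be an efficient algorithm.

```python
from itertools import combinations
from collections import deque

# Adjacency lists of a small hypothetical 5-node network.
GRAPH = {
    0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 4], 4: [2, 3],
}

def connected(nodes, adj):
    """BFS check that the subgraph induced on 'nodes' is connected."""
    nodes = set(nodes)
    if len(nodes) <= 1:
        return True
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == nodes

def node_connectivity(adj):
    """Smallest number of node failures that disconnects the network."""
    nodes = list(adj)
    for k in range(1, len(nodes) - 1):
        for removed in combinations(nodes, k):
            if not connected(set(nodes) - set(removed), adj):
                return k
    return len(nodes) - 1   # complete graph: only isolating a node remains

print("node connectivity =", node_connectivity(GRAPH))
```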

Connectivity - Examples


Network Resilience Measures

- Classical connectivity distinguishes between only two network states: connected and disconnected
- It says nothing about how the network degrades as nodes fail before it becomes disconnected
- Two possible resilience measures:
  - Average node-pair distance
  - Network diameter - the maximum node-pair distance
- Both are calculated given the probability of node and/or link failure
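The sketch below computes the two resilience measures for one small hypothetical surviving topology by breadth-first search; weighting the results by node and link failure probabilities, as the slide suggests, would be layered on top of this.

```python
from collections import deque

# Average node-pair distance and network diameter of a small example graph
# (the adjacency lists below are hypothetical, not taken from the slides).
GRAPH = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}

def bfs_distances(adj, src):
    """Hop counts from src to every reachable node (breadth-first search)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

pair_dists = [bfs_distances(GRAPH, u)[v] for u in GRAPH for v in GRAPH if u < v]
print("average node-pair distance =", sum(pair_dists) / len(pair_dists))
print("network diameter           =", max(pair_dists))
```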

More Network Measures

- What happens upon network disconnection?
- A network that splits into one large connected component and several small pieces may still be able to function, unlike a network that splinters into a large number of small pieces
- Another measure of a network's resilience to failure: the probability distribution of the size of the largest component upon disconnection
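A Monte Carlo sketch of this last measure is given below. The example network, the assumption that nodes fail independently with probability q, and the trial count are all illustrative choices rather than anything specified on the slide.

```python
import random
from collections import deque

# Estimate the distribution of the largest component size, conditioned on the
# network being disconnected, assuming independent node failures with prob. q.
GRAPH = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
q, TRIALS = 0.3, 10_000   # assumed node-failure probability and sample size

def component_sizes(alive, adj):
    """Sizes of the connected components induced by the surviving nodes."""
    remaining, sizes = set(alive), []
    while remaining:
        start = remaining.pop()
        seen, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in remaining:
                    remaining.remove(v)
                    seen.add(v)
                    queue.append(v)
        sizes.append(len(seen))
    return sizes

histogram = {}
for _ in range(TRIALS):
    alive = [n for n in GRAPH if random.random() > q]
    sizes = component_sizes(alive, GRAPH)
    if len(sizes) > 1:                     # keep only disconnected outcomes
        largest = max(sizes)
        histogram[largest] = histogram.get(largest, 0) + 1

total = sum(histogram.values())
for size in sorted(histogram):
    print(f"P(largest component = {size} | disconnected) ~ {histogram[size]/total:.3f}")
```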


Redundancy

- Redundancy is at the heart of fault tolerance
- Redundancy can be defined as the incorporation of extra parts in the design of a system in such a way that its function is not impaired in the event of a failure
- We will study four forms of redundancy:
  1. Hardware redundancy
  2. Software redundancy
  3. Information redundancy
  4. Time redundancy


Hardware Redundancy

- Extra hardware is added to override the effects of a failed component
- Static hardware redundancy - for immediate masking of a failure
  - Example: use three processors (instead of one), each performing the same function; the majority output of these processors can override the wrong output of a single faulty processor
- Dynamic hardware redundancy - spare components are activated upon the failure of a currently active component
- Hybrid hardware redundancy - a combination of static and dynamic redundancy techniques
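As a toy illustration of static hardware redundancy (the triple modular redundancy example above), the snippet below masks a single faulty replica with a bitwise majority voter; the replica outputs are made-up values.

```python
# Bitwise majority voter for triple modular redundancy (TMR): with three
# replicated outputs, a single wrong value is outvoted by the other two.
def majority(a: int, b: int, c: int) -> int:
    return (a & b) | (a & c) | (b & c)

good = 0b1011          # correct output produced by two healthy replicas (assumed)
faulty = 0b0011        # wrong output produced by the single faulty replica (assumed)
assert majority(good, good, faulty) == good   # the single fault is masked
print(bin(majority(good, good, faulty)))
```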

Software Redundancy

- Software redundancy is provided by having multiple teams of programmers write different versions of software for the same function
- The hope is that such diversity will ensure that not all the copies fail on the same set of input data


Information and Time Redundancy

- Information redundancy: provided by adding bits to the original data bits so that an error in the data bits can be detected and even corrected
  - Error-detecting and error-correcting codes have been developed and are widely used
  - Information redundancy often requires hardware redundancy to process the additional bits
- Time redundancy: provided by having additional time during which a failed execution can be repeated
  - Most failures are transient - they go away after some time
  - If enough slack time is available, the failed unit can recover and redo the affected computation
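A minimal example of information redundancy is a single even-parity bit, sketched below; it detects, but cannot correct, any single-bit error. The data word is arbitrary, and real systems use more powerful error-correcting codes.

```python
# Even parity: one redundant bit makes the total number of 1s even, so any
# single flipped bit (data or parity) is detectable.
def add_parity(data_bits):
    return data_bits + [sum(data_bits) % 2]

def parity_ok(word):
    return sum(word) % 2 == 0

word = add_parity([1, 0, 1, 1])   # arbitrary 4-bit data word
assert parity_ok(word)            # no error yet
word[2] ^= 1                      # inject a single-bit error
assert not parity_ok(word)        # the error is detected (but not located)
```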

Hardware Faults Classification

Three types of faults:
- Transient faults - disappear after a relatively short time
  - Example: a memory cell whose contents are changed spuriously due to some electromagnetic interference; overwriting the memory cell with the right content makes the fault go away
- Permanent faults - never go away; the component has to be repaired or replaced
- Intermittent faults - cycle between active and benign states
  - Example: a loose connection

Failure Rate

- The rate at which a component suffers faults depends on its age, the ambient temperature, any voltage or physical shocks that it suffers, and the technology
- The dependence on age is usually captured by the bathtub curve


Bathtub Curve

- When components are very young, the failure rate is quite high: there is a good chance that some units with manufacturing defects slipped through quality control and were released
- As time goes on, these defective units are weeded out, and the component spends the bulk of its life showing a fairly constant failure rate
- As it becomes very old, aging effects start to take over, and the failure rate rises again


Empirical Formula for the Failure Rate λ

λ = πL πQ (C1 πT πV + C2 πE)

- πL: learning factor (how mature the technology is)
- πQ: manufacturing-process quality factor (0.25 to 20.00)
- πT: temperature factor (from 0.1 to 1000), proportional to exp(-Ea/kT), where Ea is the activation energy in electron-volts associated with the technology, k is the Boltzmann constant, and T is the temperature in Kelvin
- πV: voltage stress factor for CMOS devices (from 1 to 10, depending on the supply voltage and the temperature); does not apply to other technologies (set to 1)
- πE: environment shock factor: from about 0.4 (air-conditioned environment) to 13.0 (harsh environment)
- C1, C2: complexity factors; functions of the number of gates on the chip and the number of pins in the package
- Further details: the MIL-HDBK-217E handbook
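The sketch below simply plugs illustrative numbers into the formula. None of the factor values are taken from MIL-HDBK-217E, and the proportionality constant used for πT is an arbitrary placeholder, so the resulting λ is meaningful only as a demonstration of the arithmetic.

```python
from math import exp

# Illustrative evaluation of lambda = piL * piQ * (C1*piT*piV + C2*piE).
# Every value below is an assumed placeholder, not a handbook entry.
pi_L, pi_Q = 1.0, 2.0            # learning and quality factors (assumed)
C1, C2 = 0.02, 0.01              # complexity factors (assumed)
pi_V, pi_E = 1.0, 0.4            # voltage stress; benign environment (assumed)

Ea = 0.4                         # activation energy in eV (assumed)
k = 8.617e-5                     # Boltzmann constant in eV/K
T = 273.15 + 85                  # junction temperature in Kelvin (assumed)
pi_T = 1e6 * exp(-Ea / (k * T))  # proportional to exp(-Ea/kT); 1e6 is arbitrary

lam = pi_L * pi_Q * (C1 * pi_T * pi_V + C2 * pi_E)
print(f"pi_T = {pi_T:.2f}, lambda = {lam:.4f} (illustrative value only)")
```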

Environment Impact

- Devices operating in space, which is replete with energy-charged particles and can subject devices to severe temperature swings, can be expected to fail more often
- The same holds for computers in automobiles (high temperature and vibration) and in industrial applications


Faults vs. Errors

- A fault can be either a hardware defect or a software/programming mistake
- By contrast, an error is a manifestation of a fault
- Example: consider an adder circuit, one of whose output lines is stuck at 1
  - This is a fault, but not (yet) an error
  - The fault causes an error when the adder is used and the result on that line is supposed to have been a 0 rather than a 1


Propagation of Faults and Errors

- Both faults and errors can spread through the system
- If a chip shorts out power to ground, it may cause nearby chips to fail as well
- Errors can spread because the output of one processor is frequently used as input by other processors
- Adder example: the erroneous result of the faulty adder can be fed into further calculations, thus propagating the error


Containment Zones

- To limit such situations, designers incorporate containment zones into systems: barriers that reduce the chance that a fault or error in one zone will propagate to another
- A fault-containment zone can be created by providing an independent power supply to each zone; the designer tries to electrically isolate one zone from another
- An error-containment zone can be created by using redundant units and voting on their outputs


Time to Failure - Analytic Model

- Consider the following model: N identical components, all operational at time t=0
- Each component remains operational until it is hit by a failure
- All failures are permanent and occur in each component independently of the other components
- We first concentrate on a single component
- T, the lifetime of one component, is the time until it fails; T is a random variable
- f(t) is the density function of T
- F(t) is the cumulative distribution function of T

Probabilistic Interpretation

- F(t) is the probability that the component will fail at or before time t:
  F(t) = Prob(T ≤ t)
- f(t) is the momentary rate of failure:
  f(t)dt = Prob(t ≤ T ≤ t+dt)
- Like any density function (defined for t ≥ 0):
  f(t) ≥ 0 for all t ≥ 0, and ∫0^∞ f(t) dt = 1
- The two functions are related through
  f(t) = dF(t)/dt and F(t) = ∫0^t f(s) ds

Reliability and Failure (Hazard) Rate

- The reliability of a single component:
  R(t) = Prob(T > t) = 1 - F(t)
- The failure probability of a component at time t, p(t), is the conditional probability that the component will fail at time t, given that it has not failed before:
  p(t) = Prob(t ≤ T ≤ t+dt | T ≥ t) = Prob(t ≤ T ≤ t+dt) / Prob(T ≥ t) = f(t)dt / (1 - F(t))
- The failure rate (or hazard rate) of a component at time t, h(t), is defined as p(t)/dt:
  h(t) = f(t) / (1 - F(t))
- Since dR(t)/dt = -f(t), we get
  h(t) = -(1/R(t)) dR(t)/dt

Constant Failure Rate

- If the failure rate is constant over time, h(t) = λ, then
  dR(t)/dt = -λ R(t),  R(0) = 1
- The solution of this differential equation is
  R(t) = e^(-λt),  f(t) = λ e^(-λt),  F(t) = 1 - e^(-λt)
- A constant failure rate is obtained if and only if T, the lifetime of the component, has an exponential distribution

Mean Time to Failure

- MTTF is the expected value of the lifetime T:
  MTTF = E[T] = ∫0^∞ t f(t) dt
- Since dR(t)/dt = -f(t),
  MTTF = -∫0^∞ t (dR(t)/dt) dt = [-t R(t)]0^∞ + ∫0^∞ R(t) dt
- The term -t R(t) is 0 at t = 0 and as t → ∞ (since R(∞) = 0), therefore
  MTTF = ∫0^∞ R(t) dt
- If the failure rate is a constant λ, then R(t) = e^(-λt) and
  MTTF = ∫0^∞ e^(-λt) dt = 1/λ
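The following sketch numerically checks that MTTF = ∫0^∞ R(t) dt = 1/λ for an assumed constant failure rate, using a simple Riemann sum in place of the closed-form integral.

```python
from math import exp

# Numerical check of MTTF = integral of R(t) dt = 1/lambda for R(t) = exp(-lambda*t).
lam = 0.002                        # constant failure rate in failures/hour (assumed)
dt = 0.1                           # integration step (assumed)
t, mttf = 0.0, 0.0
while t < 20.0 / lam:              # integrate far enough out that R(t) is ~0
    mttf += exp(-lam * t) * dt     # accumulate R(t) * dt
    t += dt

print(f"numerical MTTF ~ {mttf:.1f} hours, analytical 1/lambda = {1/lam:.1f} hours")
```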

Weibull Distribution - Introduction

- In most reliability calculations a constant failure rate is assumed - or, equivalently, an Exponential distribution for the component lifetime T
- There are cases in which this simplifying assumption is inappropriate
  - Example: during the infant-mortality and wear-out phases of the bathtub curve
- In such cases, the Weibull distribution of the lifetime T is often used for reliability calculations


Weibull Distribution - Equation

- The Weibull distribution has two parameters, λ and β
- The density function of the component lifetime T:
  f(t) = λ β t^(β-1) e^(-λ t^β)
- The failure rate for the Weibull distribution:
  h(t) = λ β t^(β-1)
- The failure rate h(t) is decreasing with time for β < 1, increasing with time for β > 1, and constant for β = 1 - appropriate for the infant-mortality, wear-out, and middle phases, respectively

MTTF for the Weibull Distribution

- The reliability for the Weibull distribution is
  R(t) = e^(-λ t^β)
- The MTTF for the Weibull distribution is
  MTTF = Γ(1/β) / (β λ^(1/β))
  where Γ(x) is the Gamma function
- The special case β = 1 is the Exponential distribution with a constant failure rate λ
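The sketch below evaluates the Weibull MTTF formula for a few assumed (λ, β) pairs and confirms that β = 1 reduces to the exponential result 1/λ.

```python
from math import gamma

# MTTF = Gamma(1/beta) / (beta * lambda**(1/beta)) for the Weibull distribution.
def weibull_mttf(lam: float, beta: float) -> float:
    return gamma(1.0 / beta) / (beta * lam ** (1.0 / beta))

lam = 0.01   # assumed scale parameter
print("beta = 0.5 (decreasing failure rate):", weibull_mttf(lam, 0.5))
print("beta = 2.0 (increasing failure rate):", weibull_mttf(lam, 2.0))
print("beta = 1.0 (constant failure rate)  :", weibull_mttf(lam, 1.0), "=", 1 / lam)
```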

