Reliability and Evaluation
Reliability: a measure of the success with which a system
conforms to some authoritative specification of its behaviour
As with reliability, ensuring that an embedded system meets its safety
requirements calls for system safety analysis throughout all stages of
its life cycle
When the behaviour of a system deviates from that which is specified
for it, this is called a failure
Failures result from unexpected problems internal to the system that
eventually manifest themselves in the system's external behaviour
These problems are called errors and their mechanical or algorithmic
causes are termed faults
Systems are composed of components which are themselves
systems: hence
> failure -> fault -> error -> failure -> fault
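The chain above can be sketched in code. This is a minimal illustration under stated assumptions, not a model of any real device: the 8-bit sensor, the stuck-at bit, and the names `read_sensor`, `control_step` and `ComponentFailure` are all hypothetical.

```python
class ComponentFailure(Exception):
    """A component's externally visible behaviour deviated from its spec."""


def read_sensor(raw: int, stuck_bit: bool = False) -> int:
    """Hypothetical 8-bit sensor driver; valid readings are 0..255."""
    # Fault: a stuck-at-1 bit in the hardware (simulated by the flag).
    value = (raw | 0x100) if stuck_bit else raw
    # Error: `value` may now hold incorrect internal state.
    if not 0 <= value <= 255:
        # Failure: the error becomes visible at the component's interface.
        raise ComponentFailure(f"reading {value} out of range")
    return value


def control_step(raw: int, stuck_bit: bool = False) -> str:
    """Enclosing system: the sensor's failure arrives here as a fault."""
    try:
        reading = read_sensor(raw, stuck_bit)
    except ComponentFailure as exc:
        # With no fault tolerance in place, the controller fails in turn.
        return f"controller failed: {exc}"
    return f"ok: {reading}"


print(control_step(42))
print(control_step(42, stuck_bit=True))
```

The failure of the inner component is what the outer component sees as a fault, which is why the chain repeats at each level of the system.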
A transient (temporary) fault starts at a particular time, remains in the
system for some period and then disappears
E.g. hardware components which have an adverse reaction to
radioactivity
Many faults in communication systems are transient
Permanent faults remain in the system until they are repaired; e.g., a
broken wire or a software design error
Intermittent faults are transient faults that occur from time to time
E.g. a hardware component that is heat sensitive: it works for a time,
stops working, cools down and then starts to work again
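The three fault classes differ only in when the fault is active, which a small timeline model can make concrete. This is a rough sketch: the `Fault` class, its fields and the discrete time steps are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fault:
    kind: str          # "transient", "permanent" or "intermittent"
    start: int         # time step at which the fault first appears
    duration: int = 1  # active steps per occurrence (unused if permanent)
    period: int = 1    # steps between occurrences (intermittent only, > 0)

    def active(self, t: int) -> bool:
        """Return True if the fault is present in the system at step t."""
        if t < self.start:
            return False
        if self.kind == "permanent":
            return True  # remains until repaired (repair is not modelled)
        if self.kind == "transient":
            return t < self.start + self.duration
        if self.kind == "intermittent":
            return (t - self.start) % self.period < self.duration
        raise ValueError(f"unknown fault kind: {self.kind}")


def timeline(fault: Fault, steps: int = 12) -> str:
    """Render fault activity as a string: 'X' = active, '.' = inactive."""
    return "".join("X" if fault.active(t) else "." for t in range(steps))


print(timeline(Fault("transient", start=2, duration=3)))               # ..XXX.......
print(timeline(Fault("permanent", start=2)))                           # ..XXXXXXXXXX
print(timeline(Fault("intermittent", start=2, duration=2, period=5)))  # ..XX...XX...
```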
Approaches to Achieving Reliable Systems
Fault prevention attempts to eliminate any possibility of faults
creeping into a system before it goes operational
Fault tolerance enables a system to continue functioning even in the
presence of faults
Both approaches attempt to produce systems which have well-defined
failure modes
Fault Prevention
Two stages: fault avoidance and fault removal
Fault avoidance attempts to limit the introduction of faults during
system construction by:
use of the most reliable components within the given cost and
performance constraints
use of thoroughly-refined techniques for interconnection of
components and assembly of subsystems
use of proven design methodologies
use of software engineering environments to help manipulate
software components and thereby manage complexity
Fault Removal
Design errors (hardware and software) will exist
Fault removal: procedures for finding and removing the causes of
errors;
e.g. design reviews, program verification, code inspections and
system testing
System testing can never be exhaustive, so it cannot remove all
potential faults
A test can only be used to show the presence of faults, not their
absence
Most tests are done with the system in simulation mode and it
is difficult to guarantee that the simulation is accurate
Requirements errors during the system's development may not
manifest themselves until the system goes operational
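The point that a test can show the presence of faults but not their absence can be illustrated with a deliberately faulty function. The function and the test cases are made up for this sketch; `calendar.isleap` from the Python standard library serves as the reference.

```python
import calendar


def is_leap_year(year: int) -> bool:
    # Faulty on purpose: ignores the century rules of the Gregorian calendar.
    return year % 4 == 0


# A plausible-looking test suite that happens to miss the fault.
passing_cases = {1996: True, 2000: True, 2023: False, 2024: True}
assert all(is_leap_year(y) == expected for y, expected in passing_cases.items())

# One further input reveals the fault: 1900 was not a leap year.
print(is_leap_year(1900), calendar.isleap(1900))  # True False
```

Every test in the suite passes, yet the fault is still there. Passing tests only say that the chosen inputs did not expose it.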
Failure of Fault Prevention Approach
In spite of all the testing and verification techniques, hardware
components will fail; the fault prevention approach will therefore be
unsuccessful when
either the frequency of repairs or the duration of repair times is
unacceptable, or
the system is inaccessible for maintenance and repair activities
The alternative is fault tolerance
Levels of Fault Tolerance
Full Fault Tolerance — the system continues to operate in the
presence of faults, although for a limited period, with no significant
loss of functionality or performance
Graceful Degradation (fail soft) — the system continues to operate in
the presence of errors, accepting a partial degradation of functionality
or performance during recovery or repair
Fail Safe — the system maintains its integrity while accepting a
temporary halt in its operation
The level required will depend on the application
Most safety critical systems require full fault tolerance; in
practice, however, many settle for graceful degradation
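One way to picture the three levels is as operating modes chosen from the number of redundant units still working. The sketch below is illustrative only: the thresholds, the four-unit configuration and the names `Mode` and `select_mode` are assumptions, not taken from any particular system.

```python
from enum import Enum


class Mode(Enum):
    FULL = "full service"
    DEGRADED = "reduced service"
    SAFE_HALT = "halted in a safe state"


def select_mode(working: int, full_needs: int, minimum_needs: int) -> Mode:
    """Pick an operating mode from the number of units still working."""
    if working >= full_needs:
        return Mode.FULL        # spare capacity masks the faults entirely
    if working >= minimum_needs:
        return Mode.DEGRADED    # graceful degradation (fail soft)
    return Mode.SAFE_HALT       # fail safe: stop, but preserve integrity


TOTAL_UNITS = 4  # hypothetical system: 3 units needed for full service
for faults in range(TOTAL_UNITS + 1):
    mode = select_mode(TOTAL_UNITS - faults, full_needs=3, minimum_needs=2)
    print(f"{faults} faulty unit(s): {mode.value}")
```

With one spare unit, a single fault is fully tolerated. A second fault forces degraded operation, and any further fault leads to a safe halt.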
A fundamental way of improving the reliability of software systems is design
diversity, in which several different versions of a function are implemented.
To prevent software failures caused by unanticipated conditions, alternative
programs are developed separately, preferably using different programming
logic, algorithms, and programming languages. This diversity is normally
applied in the form of recovery blocks or N-version programming.
Fault-tolerant software seeks to assure system reliability by using protective
redundancy at the software level. There are two basic techniques for obtaining
fault-tolerant software: the recovery block (RB) scheme and N-version
programming (NVP). Both schemes are based on software redundancy.
1. Recovery Block Scheme
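The recovery block scheme consists of three elements for a given task: a
primary module, an acceptance test, and one or more alternate modules. The
simplest form of the recovery block is conventionally written as follows:
    ensure T
    by P
    else by Q1
    else by Q2
    ...
    else by Qn-1
    else error
where T is the acceptance test condition that is expected to be met by
successful execution of either the primary module P or one of the alternate
modules Q1, Q2, ..., Qn-1. P runs first; if its result fails T, the alternates
are tried in turn until one passes, and an error is reported if none does.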
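The control flow of a recovery block can be sketched in a few lines. This is a simplified illustration, not a complete implementation: the modules are assumed to be side-effect-free callables, so no state has to be restored before an alternate runs. The names `recovery_block`, `RecoveryBlockError`, `primary`, `alternate` and `accept` are hypothetical.

```python
import math
from typing import Callable, Sequence, TypeVar

R = TypeVar("R")


class RecoveryBlockError(Exception):
    """All modules were tried and none produced an acceptable result."""


def recovery_block(
    acceptance_test: Callable[[R], bool],
    modules: Sequence[Callable[[], R]],
) -> R:
    """Try the primary module, then each alternate, until a result passes."""
    for module in modules:  # modules[0] is P, the rest are Q1..Qn-1
        try:
            result = module()
        except Exception:
            continue  # a crash counts as a failed attempt
        if acceptance_test(result):
            return result
    raise RecoveryBlockError("no module passed the acceptance test")


x = 2.0


def primary() -> float:
    return x / 2  # deliberately wrong "square root"


def alternate() -> float:
    return math.sqrt(x)


def accept(r: float) -> bool:
    return abs(r * r - x) < 1e-9  # acceptance test T


result = recovery_block(accept, [primary, alternate])
print(result)
```

Here the primary's result is rejected by the acceptance test, so the alternate's result is the one returned. The quality of the scheme rests on the acceptance test: a weak test lets incorrect results through.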
The probability of failure of the RB scheme is expressed in terms of three
quantities for each version i:
the probability of failure for version Pi
the probability that acceptance test i judges an incorrect result as correct
the probability that acceptance test i judges a correct result as incorrect
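A form of this expression that is often quoted for the recovery block model is sketched below. It assumes that versions and acceptance tests fail independently. The symbol names P_RB, e_i, t_1i and t_2i are assumptions introduced for this sketch: e_i is the failure probability of version i, t_1i the probability that test i accepts an incorrect result, and t_2i the probability that test i rejects a correct result.

```latex
P_{RB} = \prod_{i=1}^{n} \left( e_i + t_{2i} \right)
       + \sum_{i=1}^{n} t_{1i}\, e_i \prod_{j=1}^{i-1} \left( e_j + t_{2j} \right)
```

The first term covers the case where the result of every version is rejected, either because the version failed or because a correct result was wrongly rejected. The second term covers the case where the first i-1 results are rejected and the incorrect result of version i is then wrongly accepted.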
2. N-version Programming
NVP is used for providing fault-tolerance in software. In concept, the NVP
scheme is similar to the N-modular redundancy scheme used to provide tolerance
against hardware faults.
NVP is defined as the independent generation of N >= 2 functionally
equivalent programs, called versions, from the same initial specification.
Independent generation of programs means that the programming efforts are
carried out by N individuals or groups that do not interact with respect to the
programming process.
The N versions are usually executed simultaneously and their
results are sent to a decision mechanism which selects the final result.
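A minimal sketch of NVP with a majority voter as the decision mechanism follows. It is illustrative only: the versions run one after another rather than simultaneously, results are assumed to be hashable and comparable for exact equality, and the names `n_version_execute` and `NoConsensus` are hypothetical.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


class NoConsensus(Exception):
    """The decision mechanism could not find a majority result."""


def n_version_execute(
    versions: Sequence[Callable[[int], Hashable]], x: int
) -> Hashable:
    """Run every version on x and return the strict-majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            pass  # a crashed version contributes no vote
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if votes > len(versions) // 2:  # majority of all N versions
            return value
    raise NoConsensus("no result was produced by a majority of versions")


# Three independently written "absolute value" versions; the third is faulty.
versions = [
    abs,
    lambda v: v if v >= 0 else -v,
    lambda v: v,  # wrong for negative inputs
]
print(n_version_execute(versions, -3))  # 3
```

The faulty version is outvoted by the two correct ones. If the versions shared a common design fault, the voter would agree on a wrong result, which is why independent development matters.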
The probability of failure of the NVP scheme, Pn, can be expressed as the sum
of three terms. The first term is the probability that all versions fail. The
second term is the probability that only one version is correct. The third
term, d, is the probability that there are at least two correct results but the
decision algorithm fails to deliver the correct result.
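Under the assumption that the versions fail independently, the three terms described above can be written out as follows. The symbol e_i, the failure probability of version i, is a name assumed for this sketch.

```latex
P_n = \prod_{i=1}^{n} e_i
    + \sum_{i=1}^{n} (1 - e_i) \prod_{\substack{j=1 \\ j \neq i}}^{n} e_j
    + d
```

As a worked example with assumed values n = 3 and e_i = 0.1 for every version: the first term is 0.1^3 = 0.001, and the second is 3 x 0.9 x 0.1^2 = 0.027. This gives P_n = 0.028 + d.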