CBDT3103 Answer
QUESTION 2
At the heart of all fault tolerance techniques is some form of masking redundancy. This
means that components that are prone to defects are replicated in such a way that if a
component fails, one or more of the non-failed replicas will continue to provide service
with no appreciable disruption. There are many variations on this basic theme.
Fault Classifications
Based on duration, faults can be classified as transient or permanent. A transient fault
will eventually disappear without any apparent intervention, whereas a permanent one
will remain unless it is removed by some external agency. While it may seem that permanent faults are more severe, from an engineering perspective they are much easier to diagnose and handle. A particularly problematic type of transient fault is the intermittent fault, which recurs, often unpredictably.
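As an illustration, the following sketch (in Python; the unreliable operation and its failure type are assumptions, not taken from any particular library) shows how bounded retries can mask a transient fault: if the fault disappears within a few attempts the caller sees no failure, whereas a permanent fault eventually surfaces as an error.

import time

def call_with_retries(operation, attempts=3, delay=0.1):
    """Retry an operation to mask transient faults.

    A transient fault disappears on a later attempt; a permanent
    fault exhausts all attempts and is re-raised to the caller.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except IOError as error:   # assumed transient failure type
            last_error = error
            time.sleep(delay)      # brief pause before retrying
    raise last_error               # fault persisted: treat as permanent

An intermittent fault is the awkward middle case: it may happen to be absent during all the retries and then reappear later.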
A different way to classify faults is by their underlying cause. Design faults are the result of design failures, such as bugs in the code. While it may appear that in a carefully designed system all such faults should be eliminated through fault prevention, this is usually not realistic in practice. For this reason, many fault-tolerant systems are built on the assumption that design faults are inevitable, and that mechanisms need to be put in place to protect the system against them. Operational faults, on the other hand, are faults that occur during the lifetime of the system and are invariably due to physical causes, such as processor failures or disk crashes.
Finally, based on how a failed component behaves once it has failed, faults can be classified into the following categories:
Crash faults -- the component either completely stops operating or never returns
to a valid state;
Omission faults -- the component completely fails to perform its service;
Timing faults -- the component does not complete its service on time;
Byzantine faults -- these are faults of an arbitrary nature.
Having introduced these failure models, some examples of how they arise in client-server systems, together with possible solutions, are given below:
Case: Client is unable to locate the server, e.g. the server is down, or the server has changed.
Solution: Use an exception handler, although this is not always possible in the programming language used.
Case: Server crashes after receiving the client request. The problem is that the client may not be able to tell whether the request was carried out.
Solutions: Reboot the server and retry the client request (assuming ‘at least once’ semantics for the request), or give up and report a request failure (assuming ‘at most once’ semantics). What is usually required is ‘exactly once’ semantics, but this is difficult to guarantee.
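To illustrate, the Python sketch below (a minimal illustration; the handler and request format are assumptions) combines client retries (‘at least once’ delivery) with server-side duplicate filtering (‘at most once’ execution), which approximates ‘exactly once’ only for as long as the server, and its table of executed requests, survives.

class DedupServer:
    """Filter duplicate requests so a retried request is not re-executed."""

    def __init__(self, handler):
        self.handler = handler   # the actual service operation
        self.executed = {}       # request_id -> cached reply

    def handle(self, request_id, payload):
        if request_id in self.executed:
            return self.executed[request_id]   # duplicate: replay the old reply
        reply = self.handler(payload)          # first delivery: carry it out
        self.executed[request_id] = reply
        return reply

If the server crashes, the table of executed requests is lost, which is precisely why ‘exactly once’ semantics is so difficult to guarantee.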
Passive Replication
In passive replication, all client requests (via front-end processes) are directed to a nominated primary replica manager (RM). The single primary RM communicates with one or more secondary replica managers (operating as backups) and is responsible for all front-end communication and for updating the backup RMs. Distributed applications communicate with the primary replica manager, which sends copies of the up-to-date data. Requests for data updates from the client interface to the primary RM are distributed to each backup RM. If the primary replica manager fails, a secondary replica manager observes this and is promoted to act as the primary RM. To tolerate n process failures, n+1 RMs are needed. Passive replication cannot tolerate Byzantine failures.
A request is issued to the primary RM, each with a unique id. The primary RM receives the request and checks the request id, in case the request has already been executed. If the request is an update, the primary RM sends the updated state and the unique request id to all backup RMs. Each backup RM sends an acknowledgment to the primary RM. When acknowledgments have been received from all backup RMs, the primary RM sends a request acknowledgment to the front end (client interface). All requests to the primary RM are processed in the order of receipt.
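The Python sketch below traces these steps for a hypothetical primary RM; the backup interface (an apply_update method returning an acknowledgment) is an assumption made purely for illustration.

class PrimaryRM:
    """Passive replication: a single primary pushes state updates to backups."""

    def __init__(self, backups):
        self.backups = backups   # secondary RMs acting as hot standbys
        self.state = {}          # the replicated data
        self.executed = {}       # request_id -> response (duplicate filter)

    def handle(self, request_id, key, value=None):
        # Check the request id in case the request was already executed.
        if request_id in self.executed:
            return self.executed[request_id]
        if value is not None:
            # Update: apply locally, then send the new state and the request
            # id to every backup RM and wait for each acknowledgment.
            self.state[key] = value
            acks = [b.apply_update(request_id, key, value) for b in self.backups]
            response = "ok" if all(a == "ack" for a in acks) else "failed"
        else:
            response = self.state.get(key)   # read-only request
        # Only after all acknowledgments is the front end acknowledged.
        self.executed[request_id] = response
        return response

Because this sketch handles one request at a time, requests are naturally processed in order of receipt, as the protocol requires.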
Active Replication
In the active replication model, there are multiple (a group of) replica managers (RMs), each with an equivalent role. The RMs operate as a group, and each front end (client interface) multicasts requests to the group of RMs. Requests are processed by all RMs independently (and identically). The client interface compares all replies received and can tolerate N failures out of 2N+1 RMs, i.e. consensus is reached when N+1 identical responses are received. This model can also tolerate Byzantine failures.
The client request is sent to the group of RMs using a totally ordered reliable multicast, each request with a unique id. Each RM processes the request and sends its response/result back to the front end, which collects (gathers) the responses from each RM. Fault tolerance: individual RM failures have little effect on performance; to tolerate n process failures, 2n+1 RMs are needed (to leave a majority of n+1 operating).
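A minimal Python sketch of the front end’s voting step follows; plain method calls stand in for the totally ordered reliable multicast, and the rm.process interface is an assumption made for illustration.

from collections import Counter

def active_request(replica_managers, request_id, payload):
    """Gather replies from all RMs and accept the majority answer.

    With 2N+1 RMs, N+1 identical replies outvote up to N faulty
    (even Byzantine) replicas.
    """
    replies = []
    for rm in replica_managers:
        try:
            replies.append(rm.process(request_id, payload))
        except Exception:
            pass   # a crashed RM simply contributes no reply
    if not replies:
        raise RuntimeError("no replies received")
    winner, votes = Counter(replies).most_common(1)[0]
    majority = len(replica_managers) // 2 + 1   # N+1 when there are 2N+1 RMs
    if votes >= majority:
        return winner   # consensus: N+1 identical responses received
    raise RuntimeError("no majority: too many faulty replica managers")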
Gossip Architectures
In gossip architectures, the main concept is to replicate data close to the points where clients need it. The aim is to provide high availability at the expense of weaker data consistency. It is a framework for providing highly available services through the use of replication: RMs exchange (or gossip) updates in the background from time to time. There are multiple replica managers (RMs), and a front end (FE) sends each query or update to any one RM. A given RM may be unavailable, but the system is to guarantee a service.
In the gossip architecture, clients request service operations that are initially processed by a front end, which normally communicates with only one replica manager at a time, although it is free to communicate with others if its usual manager is heavily loaded.
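The toy Python sketch below shows the background exchange, using a last-writer-wins merge on wall-clock timestamps purely for simplicity; real gossip architectures order updates with vector timestamps, which this illustration omits.

import time

class GossipRM:
    """Replica manager that merges its state with a peer from time to time."""

    def __init__(self):
        self.store = {}   # key -> (timestamp, value)

    def update(self, key, value):
        self.store[key] = (time.time(), value)   # answer the client at once

    def query(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None       # may be stale: weak consistency

    def gossip_with(self, peer):
        # Exchange entries in both directions; the newer entry wins.
        for key in set(self.store) | set(peer.store):
            mine = self.store.get(key, (0.0, None))
            theirs = peer.store.get(key, (0.0, None))
            newest = mine if mine[0] >= theirs[0] else theirs
            self.store[key] = peer.store[key] = newest

Clients may read stale data between gossip rounds; this is exactly the trade of weaker consistency for higher availability described above.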
Recovery
Once a failure has occurred, in many cases it is important to recover critical processes to a known state in order to resume processing. The problem is compounded in distributed systems. There are two approaches to recovery in distributed environments.
Forward recovery attempts to bring the system to a new stable state from which it is possible to proceed (applied in situations where the nature of the errors is known and a reset can be applied). Backward recovery attempts to bring the system back to a previous stable state, typically saved as a checkpoint.
Backward recovery is the most extensively used in distributed systems and is generally the safest; it can be incorporated into middleware layers, although it is complicated in the case of process, machine or network failure. It gives no guarantee that the same fault will not occur again (a deterministic view, which affects failure transparency properties), and it cannot be applied to irreversible (non-idempotent) operations, e.g. an ATM withdrawal.
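A minimal backward-recovery sketch in Python using checkpointing follows; the state model is entirely illustrative.

import copy

class CheckpointedProcess:
    """Backward recovery: roll back to the last saved (checkpointed) state."""

    def __init__(self, initial_state):
        self.state = initial_state
        self.checkpoint = copy.deepcopy(initial_state)   # last known-good state

    def save_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)

    def recover(self):
        # After a failure, restore the checkpoint and resume from there.
        # Note: rollback cannot undo irreversible external effects (cash
        # already dispensed by an ATM), which is the limitation noted above.
        self.state = copy.deepcopy(self.checkpoint)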
Conclusion
In conclusion, hardware, software and networks cannot be totally free from failures. Fault tolerance is a non-functional requirement that requires a system to continue to operate even in the presence of faults. Distributed systems can be more fault-tolerant than centralized systems. Agreement in faulty systems and reliable group communication are important problems in distributed systems. Replication of data is a major fault tolerance method in distributed systems, and recovery is another property to consider in faulty distributed environments.