CBDT3103 Answer

The document discusses fault tolerance in distributed systems. It defines key terms like failures, errors, and faults. Faults can be classified as transient or permanent based on duration, and design or operational based on cause. Fault tolerance techniques involve replicating components so the system can continue functioning if one fails. The main replication models discussed are passive replication, where a primary replica coordinates backups, active replication, where replicas operate independently, and gossip architectures, where data is distributed across replicas. The document also covers recovery from failures.

QUESTION 1

QUESTION 2

Generally, fault tolerance is a method of making a computer or network system resistant to software errors and hardware problems. Every operation is performed with a backup system, ensuring that there is no single point of failure.

Failures, Errors, and Faults


Implicit in the definition of fault tolerance is the assumption that there is a specification of what constitutes correct behavior. A failure occurs when an actual running system deviates from this specified behavior. The cause of a failure is called an error. An error represents an invalid system state, one that is not allowed by the system behavior specification. The error itself is the result of a defect in the system, called a fault. In other words, a fault is the root cause of a failure, and an error is merely the symptom of a fault. A fault does not necessarily result in an error, but the same fault may result in multiple errors. Similarly, a single error may lead to multiple failures.

For example, in a software system, an incorrectly written instruction in a program may decrement an internal variable instead of incrementing it. Clearly, if this statement is executed, it will result in an incorrect value being written. If other program statements then use this value, the whole system will deviate from its desired behavior. In this case, the erroneous statement is the fault, the invalid value is the error, and the failure is the behavior that results from the error. Note that if the variable is never read after being written, no failure will occur. Likewise, if the invalid statement is never executed, the fault will not lead to an error. Thus, the mere presence of errors or faults does not necessarily imply system failure.
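
To make the fault, error, failure distinction concrete, the short Python sketch below uses the decrement-instead-of-increment bug described above. It is purely illustrative and not part of the original text; the function name and values are assumptions.

    # Hypothetical illustration of the fault -> error -> failure chain.

    def next_invoice_number(current):
        # FAULT: the programmer wrote "- 1" instead of "+ 1".
        return current - 1

    counter = 100
    counter = next_invoice_number(counter)   # ERROR: counter now holds the invalid value 99

    # FAILURE: observable only when the erroneous state is actually used.
    print(f"Issuing invoice #{counter}")      # prints 99 instead of the specified 101

    # If the faulty line were never executed, there would be no error;
    # if counter were never read afterwards, there would be no failure.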

At the heart of all fault tolerance techniques is some form of masking redundancy. This
means that components that are prone to defects are replicated in such a way that if a
component fails, one or more of the non-failed replicas will continue to provide service
with no appreciable disruption. There are many variations on this basic theme.
Fault Classifications
Based on duration, faults can be classified as transient or permanent. A transient fault will eventually disappear without any apparent intervention, whereas a permanent one will remain unless it is removed by some external agency. While it may seem that permanent faults are more severe, from an engineering perspective they are much easier to diagnose and handle. A particularly problematic type of transient fault is the intermittent fault, which recurs, often unpredictably.

A different way to classify faults is by their underlying cause. Design faults are the result of design mistakes, like our coding example above. While it may appear that in a carefully designed system all such faults should be eliminated through fault prevention, this is usually not realistic in practice. For this reason, many fault-tolerant systems are built with the assumption that design faults are inevitable, and that mechanisms need to be put in place to protect the system against them. Operational faults, on the other hand, are faults that occur during the lifetime of the system and are invariably due to physical causes, such as processor failures or disk crashes.

Finally, based on how a failed component behaves once it has failed, faults can be classified into the following categories:
• Crash faults -- the component either completely stops operating or never returns to a valid state;
• Omission faults -- the component completely fails to perform its service;
• Timing faults -- the component does not complete its service on time;
• Byzantine faults -- faults of an arbitrary nature.

Failure Models in the System


In all of these scenarios, clients use a collection of servers.
Crash: Server halts, but was working correctly until then, e.g. an O.S. failure.
Omission: Server fails to receive or to reply to requests, e.g. server not listening or buffer overflow.
Timing: Server response time is outside its specification; the client may give up.
Response: Incorrect response or incorrect processing due to control flow getting out of synchronization.
Arbitrary value (or Byzantine): Server behaves erratically, for example providing arbitrary responses at arbitrary times. Server output is inappropriate, but it is not easy to determine it to be incorrect. A duplicated message due to a buffering problem may be given as an example. Alternatively, there may be a malicious element involved.

Having introduced the failure models, some examples of how these failures are handled are shown below:

Case: Client unable to locate server, e.g. server down, or server has changed.
Solution: Use an exception handler, but this is not always possible in the programming
language used.

Case: Client request to server is lost.

Solution: Use a timeout to await the server's reply, then re-send, but be careful with operations that are not idempotent. If multiple requests appear to get lost, assume a 'cannot locate server' error. A sketch of this retry loop is given below.
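
As a minimal sketch of the timeout-and-retry approach, assuming a simple TCP request/reply exchange; the server address, timeout values and helper name are illustrative assumptions, not from the original text.

    import socket

    # Hypothetical retry loop for a lost request: resend after a timeout,
    # and report a "cannot locate server" error after a few failed attempts.
    def send_with_retry(request: bytes, addr=("server.example", 9000),
                        timeout=2.0, max_attempts=3) -> bytes:
        for attempt in range(max_attempts):
            try:
                with socket.create_connection(addr, timeout=timeout) as sock:
                    sock.sendall(request)
                    sock.settimeout(timeout)
                    return sock.recv(4096)        # server reply arrived in time
            except (socket.timeout, OSError):
                continue                           # request or reply lost: re-send
        raise RuntimeError("cannot locate server")  # repeated losses: assume server unreachable

    # Retrying is only safe if the request is idempotent (e.g. a read), or if the
    # server filters duplicates by request id, as sketched in the next example.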

Case: Server crash after receiving the client request. The problem is that the client may not be able to tell whether the request was carried out.
Solutions: Retry the client request (assuming 'at least once' semantics), or give up and report a request failure (assuming 'at most once' semantics). What is usually required is 'exactly once' semantics, but this is difficult to guarantee. A sketch of duplicate filtering by request id follows.
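
One common building block toward 'at most once' handling of retried requests is to tag each request with a unique id and have the server cache its reply. The sketch below is illustrative only; the operation, class name and data structures are assumptions.

    # Hypothetical server-side duplicate filtering: execute each request id once,
    # and replay the cached reply for any retransmission of the same id.
    class AtMostOnceServer:
        def __init__(self):
            self.balance = 100
            self.replies = {}                  # request id -> cached reply

        def withdraw(self, request_id: str, amount: int) -> str:
            if request_id in self.replies:     # duplicate (client retry): do not re-execute
                return self.replies[request_id]
            self.balance -= amount             # execute the operation exactly once
            reply = f"ok, new balance {self.balance}"
            self.replies[request_id] = reply
            return reply

    server = AtMostOnceServer()
    print(server.withdraw("req-42", 30))       # executes: balance becomes 70
    print(server.withdraw("req-42", 30))       # retry: cached reply, balance still 70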

Case: Server reply to the client is lost.

Solution: The client can simply set a timer and, if no reply arrives in time, assume that the server is down, the request was lost, or the server crashed while processing the request.
Replication of Data
In this section, the replication of data in distributed systems will be discussed in detail, together with its different models. The main goal of replication of data in distributed systems is to maintain copies of data on multiple computers (e.g. DNS).

The main benefits of replication of data can be classified as follows:


1. Performance enhancement
2. Reliability enhancement
3. Data closer to client
4. Share workload
5. Increased availability
6. Increased fault tolerance

The constraints are classified below:


1. How to keep data consistency (need to ensure a satisfactorily consistent image
for clients)
2. Where to place replicas and how updates are propagated
3. Scalability

Fault Tolerant System Architectures:


Client (C)
Front End (FE) = client interface
Replica Manager (RM) = service provider

Passive Replication
In passive replication, all client requests (via front end processes) are directed to a nominated primary replica manager (RM). The single primary RM communicates with one or more secondary replica managers (operating as backups) and is responsible for all front end communication and for updating the backup RMs. Distributed applications communicate with the primary replica manager, which sends copies of the up-to-date data. Requests for data updates from the client interface to the primary RM are distributed to each backup RM. If the primary replica manager fails, a secondary replica manager observes this and is promoted to act as the primary RM. To tolerate n process failures, n+1 RMs are needed. Passive replication cannot tolerate Byzantine failures.
A request is issued to the primary RM, each with a unique id. The primary RM receives the request and checks the request id, in case the request has already been executed. If the request is an update, the primary RM sends the updated state and the unique request id to all backup RMs. Each backup RM sends an acknowledgment to the primary RM. When acknowledgments have been received from all backup RMs, the primary RM sends a request acknowledgment to the front end (client interface). All requests to the primary RM are processed in the order of receipt. A sketch of this update path is given below.
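
The following Python sketch illustrates the passive-replication update path described above: the primary applies the update, forwards the new state to the backups, and acknowledges the front end only after every backup has acknowledged. Class and method names are illustrative assumptions; in-process calls stand in for the network.

    # Minimal sketch of passive (primary-backup) replication.
    class BackupRM:
        def __init__(self):
            self.state = {}

        def apply(self, request_id, new_state):
            self.state = dict(new_state)       # install the state sent by the primary
            return "ack"

    class PrimaryRM:
        def __init__(self, backups):
            self.state = {}
            self.executed = {}                 # request id -> response (duplicate filtering)
            self.backups = backups

        def handle(self, request_id, key, value):
            if request_id in self.executed:    # request already executed: return old response
                return self.executed[request_id]
            self.state[key] = value            # 1. apply the update locally
            acks = [b.apply(request_id, self.state) for b in self.backups]
            assert all(a == "ack" for a in acks)  # 2. wait for every backup to acknowledge
            response = f"updated {key}"
            self.executed[request_id] = response
            return response                    # 3. acknowledge the front end

    primary = PrimaryRM(backups=[BackupRM(), BackupRM()])
    print(primary.handle("r1", "x", 42))       # front end sees the ack only after the backups do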

Active Replication
In the active replication model, there are multiple (a group of) replica managers (RMs), each with an equivalent role. The RMs operate as a group, and each front end (client interface) multicasts requests to the group of RMs. Requests are processed by all RMs independently (and identically). The client interface compares all replies received and can tolerate N failures out of 2N+1 replicas, i.e. consensus is reached when N+1 identical responses are received. This model can also tolerate Byzantine failures.
A client request is sent to the group of RMs using totally ordered reliable multicast, each request with a unique id. Each RM processes the request and sends its response/result back to the front end. The front end collects (gathers) the responses from each RM. Fault tolerance: individual RM failures have little effect on performance. To tolerate n process failures, 2n+1 RMs are needed (to leave a majority of n+1 operating). A majority-vote sketch is given below.
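
As an illustrative sketch, not from the original text, of how the front end can reach consensus on N+1 identical responses out of 2N+1 replicas:

    from collections import Counter

    # Hypothetical front-end vote: accept the answer returned by a majority of RMs,
    # so up to N faulty replies out of 2N+1 can be out-voted.
    def majority_response(responses):
        answer, count = Counter(responses).most_common(1)[0]
        needed = len(responses) // 2 + 1       # N+1 identical replies out of 2N+1
        if count >= needed:
            return answer
        raise RuntimeError("no majority: too many faulty replicas")

    # Three RMs (N = 1): one Byzantine replica returns garbage, the vote still succeeds.
    print(majority_response(["balance=70", "balance=70", "balance=-999"]))  # balance=70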

Gossip Architectures
In gossip architectures, the main concept is to replicate data close to the points where clients need it first. The aim is to provide high availability at the expense of weaker data consistency.

It is a framework for dealing with highly available services through the use of replication: RMs exchange updates (gossip) with one another in the background from time to time. There are multiple replica managers (RMs) and a single front end (FE), which sends a query or update to any (one) RM. A given RM may be unavailable, but the system is still expected to guarantee a service.

In the gossip architecture, clients request service operations that are initially processed by a front end, which normally communicates with only one replica manager at a time, although it is free to communicate with others if its usual manager is heavily loaded. A sketch of a background gossip exchange is given below.
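
A minimal sketch of the background exchange, assuming last-writer-wins merging by timestamp; the full gossip architecture uses vector timestamps, so this simplification, and all the names below, are assumptions for illustration only.

    # Simplified gossip replica: updates are accepted locally and merged with a
    # peer in the background, the newest timestamp winning (last-writer-wins).
    class GossipRM:
        def __init__(self, name):
            self.name = name
            self.store = {}                      # key -> (timestamp, value)

        def update(self, key, value, ts):
            # The front end may send an update to any single RM.
            self.store[key] = (ts, value)

        def gossip_with(self, peer):
            # Background exchange: both replicas keep the newest entry for every key.
            for key in set(self.store) | set(peer.store):
                newest = max(self.store.get(key, (0, None)),
                             peer.store.get(key, (0, None)))
                self.store[key] = peer.store[key] = newest

    rms = [GossipRM(f"rm{i}") for i in range(3)]
    rms[0].update("x", "hello", ts=1)            # update accepted at one replica only
    rms[0].gossip_with(rms[1])                   # periodic gossip rounds
    rms[1].gossip_with(rms[2])                   # spread the update to every replica
    print([rm.store.get("x") for rm in rms])     # all three now hold (1, 'hello')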
Recovery
Once a failure has occurred, in many cases it is important to recover critical processes to a known state in order to resume processing. The problem is compounded in distributed systems. There are two approaches to recovery in distributed environments.

Backward recovery uses checkpointing (a global snapshot of the distributed system status) to record the system state and roll back to it, but checkpointing is costly (performance degradation).

Forward recovery attempts to bring the system to a new stable state from which it is possible to proceed (applied in situations where the nature of the errors is known and a reset can be applied).

Backward recovery is the most extensively used approach in distributed systems and is generally the safest; it can be incorporated into middleware layers, although it becomes complicated in the case of process, machine or network failure. It gives no guarantee that the same fault will not occur again (a deterministic view, which affects failure transparency properties), and it cannot be applied to irreversible (non-idempotent) operations, e.g. an ATM withdrawal. A checkpoint/restore sketch is given below.
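
A minimal single-process sketch of backward recovery via checkpointing; a real distributed system needs a coordinated global snapshot, and the file name and state shape here are illustrative assumptions.

    import pickle

    # Hypothetical backward recovery: periodically save a checkpoint of the process
    # state, and roll back to the last checkpoint after a failure.
    CHECKPOINT_FILE = "state.ckpt"

    def save_checkpoint(state):
        with open(CHECKPOINT_FILE, "wb") as f:
            pickle.dump(state, f)              # costly if done often (performance degradation)

    def restore_checkpoint():
        with open(CHECKPOINT_FILE, "rb") as f:
            return pickle.load(f)              # known state from which processing resumes

    state = {"processed": 0}
    save_checkpoint(state)                     # checkpoint taken before risky work
    try:
        state["processed"] += 1
        raise RuntimeError("simulated crash")  # failure after the checkpoint
    except RuntimeError:
        state = restore_checkpoint()           # backward recovery: roll back
    print(state)                               # {'processed': 0}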

Conclusion
As a conclusion, hardware, software and networks cannot be totally free from failures. Fault tolerance is a non-functional requirement that requires a system to continue to operate even in the presence of faults. Distributed systems can be more fault tolerant than centralized systems. Agreement in faulty systems and reliable group communication are important problems in distributed systems. Replication of data is a major fault tolerance method in distributed systems. Recovery is another property to consider in faulty distributed environments.
