CBDT3103 Answer
QUESTION 2
At the heart of all fault tolerance techniques is some form of masking redundancy. This
means that components that are prone to defects are replicated in such a way that if a
component fails, one or more of the non-failed replicas will continue to provide service
with no appreciable disruption. There are many variations on this basic theme.
Fault Classifications
Based on duration, faults can be classified as transient or permanent. A transient fault
will eventually disappear without any apparent intervention, whereas a permanent one
will remain unless it is removed by some external agency. While it may seem that permanent faults are more severe, from an engineering perspective they are much easier to diagnose and handle. A particularly problematic type of transient fault is the intermittent fault, which recurs, often unpredictably.
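As an illustration, the following sketch (in Python; the unreliable operation and its failure type are assumptions, not taken from any particular library) shows how bounded retries can mask a transient fault: if the fault disappears within a few attempts the caller sees no failure, whereas a permanent fault eventually surfaces as an error.

import time

def call_with_retries(operation, attempts=3, delay=0.1):
    """Retry an operation to mask transient faults.

    A transient fault disappears on a later attempt; a permanent
    fault exhausts all attempts and is re-raised to the caller.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except IOError as error:   # assumed transient failure type
            last_error = error
            time.sleep(delay)      # brief pause before retrying
    raise last_error               # fault persisted: treat as permanent

An intermittent fault is the awkward middle case: it may happen to be absent during all the retries and then reappear later.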
A different way to classify faults is by their underlying cause. Design faults are the result of design failures, such as bugs in the code. While it may appear that in a carefully designed system all such faults should be eliminated through fault prevention, this is usually not realistic in practice. For this reason, many fault-tolerant systems are built on the assumption that design faults are inevitable, and that mechanisms need to be put in place to protect the system against them. Operational faults, on the other hand, are faults that occur during the lifetime of the system and are invariably due to physical causes, such as processor failures or disk crashes.
Finally, based on how a failed component behaves once it has failed, faults can be classified into the following categories:
Crash faults -- the component either completely stops operating or never returns
to a valid state;
Omission faults -- the component completely fails to perform its service;
Timing faults -- the component does not complete its service on time;
Byzantine faults -- these are faults of an arbitrary nature.
Having introduced these failure models, some examples of how they arise in client-server systems, together with possible solutions, are given below:
Case: Client is unable to locate the server, e.g. the server is down, or the server has changed.
Solution: Use an exception handler, although this is not always possible in the programming language used.
Case: Server crashes after receiving the client request. The problem is that the client may not be able to tell whether the request was carried out.
Solutions: Reboot the server and retry the client request (assuming ‘at least once’ semantics for the request), or give up and report a request failure (assuming ‘at most once’ semantics). What is usually required is ‘exactly once’ semantics, but this is difficult to guarantee.
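To illustrate, the Python sketch below (a minimal illustration; the handler and request format are assumptions) combines client retries (‘at least once’ delivery) with server-side duplicate filtering (‘at most once’ execution), which approximates ‘exactly once’ only for as long as the server, and its table of executed requests, survives.

class DedupServer:
    """Filter duplicate requests so a retried request is not re-executed."""

    def __init__(self, handler):
        self.handler = handler   # the actual service operation
        self.executed = {}       # request_id -> cached reply

    def handle(self, request_id, payload):
        if request_id in self.executed:
            return self.executed[request_id]   # duplicate: replay the old reply
        reply = self.handler(payload)          # first delivery: carry it out
        self.executed[request_id] = reply
        return reply

If the server crashes, the table of executed requests is lost, which is precisely why ‘exactly once’ semantics is so difficult to guarantee.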
Passive Replication
In passive replication, all client requests (via front-end processes) are directed to a nominated primary replica manager (RM). The single primary RM communicates with one or more secondary replica managers (operating as backups) and is responsible for all front-end communication and for updating the backup RMs. Distributed applications communicate with the primary replica manager, which sends copies of the up-to-date data. Requests for data updates from the client interface to the primary RM are distributed to each backup RM. If the primary replica manager fails, a secondary replica manager observes this and is promoted to act as the primary RM. To tolerate n process failures, n+1 RMs are needed. Passive replication cannot tolerate Byzantine failures.
A request is issued to the primary RM, each with a unique id. The primary RM receives the request and checks the request id, in case the request has already been executed. If the request is an update, the primary RM sends the updated state and the unique request id to all backup RMs. Each backup RM sends an acknowledgment to the primary RM. When acknowledgments have been received from all backup RMs, the primary RM sends a request acknowledgment to the front end (client interface). All requests to the primary RM are processed in the order of receipt.
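The Python sketch below traces these steps for a hypothetical primary RM; the backup interface (an apply_update method returning an acknowledgment) is an assumption made purely for illustration.

class PrimaryRM:
    """Passive replication: a single primary pushes state updates to backups."""

    def __init__(self, backups):
        self.backups = backups   # secondary RMs acting as hot standbys
        self.state = {}          # the replicated data
        self.executed = {}       # request_id -> response (duplicate filter)

    def handle(self, request_id, key, value=None):
        # Check the request id in case the request was already executed.
        if request_id in self.executed:
            return self.executed[request_id]
        if value is not None:
            # Update: apply locally, then send the new state and the request
            # id to every backup RM and wait for each acknowledgment.
            self.state[key] = value
            acks = [b.apply_update(request_id, key, value) for b in self.backups]
            response = "ok" if all(a == "ack" for a in acks) else "failed"
        else:
            response = self.state.get(key)   # read-only request
        # Only after all acknowledgments is the front end acknowledged.
        self.executed[request_id] = response
        return response

Because this sketch handles one request at a time, requests are naturally processed in order of receipt, as the protocol requires.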
Active Replication
In the active replication model, there are multiple (a group of) replica managers (RMs), each with an equivalent role. The RMs operate as a group, and each front end (client interface) multicasts requests to the group of RMs. Requests are processed by all RMs independently (and identically). The client interface compares all replies received and can tolerate N failures out of 2N+1 RMs, i.e. consensus is reached when N+1 identical responses are received. This model can also tolerate Byzantine failures.
The client request is sent to the group of RMs using a totally ordered reliable multicast, each request with a unique id. Each RM processes the request and sends its response/result back to the front end, which collects (gathers) the responses from each RM. Fault tolerance: individual RM failures have little effect on performance; to tolerate n process failures, 2n+1 RMs are needed (to leave a majority of n+1 operating).
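A minimal Python sketch of the front end’s voting step follows; plain method calls stand in for the totally ordered reliable multicast, and the rm.process interface is an assumption made for illustration.

from collections import Counter

def active_request(replica_managers, request_id, payload):
    """Gather replies from all RMs and accept the majority answer.

    With 2N+1 RMs, N+1 identical replies outvote up to N faulty
    (even Byzantine) replicas.
    """
    replies = []
    for rm in replica_managers:
        try:
            replies.append(rm.process(request_id, payload))
        except Exception:
            pass   # a crashed RM simply contributes no reply
    if not replies:
        raise RuntimeError("no replies received")
    winner, votes = Counter(replies).most_common(1)[0]
    majority = len(replica_managers) // 2 + 1   # N+1 when there are 2N+1 RMs
    if votes >= majority:
        return winner   # consensus: N+1 identical responses received
    raise RuntimeError("no majority: too many faulty replica managers")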
Gossip Architectures
In gossip architectures, the main concept is to replicate data close to the points where clients need it. The aim is to provide high availability at the expense of weaker data consistency. It is a framework for providing highly available services through the use of replication: RMs exchange (or gossip) updates in the background from time to time. There are multiple replica managers (RMs), and a front end (FE) sends each query or update to any one RM. A given RM may be unavailable, but the system is to guarantee a service.
In the gossip architecture, clients request service operations that are initially processed by a front end, which normally communicates with only one replica manager at a time, although it is free to communicate with others if its usual manager is heavily loaded.
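The toy Python sketch below shows the background exchange, using a last-writer-wins merge on wall-clock timestamps purely for simplicity; real gossip architectures order updates with vector timestamps, which this illustration omits.

import time

class GossipRM:
    """Replica manager that merges its state with a peer from time to time."""

    def __init__(self):
        self.store = {}   # key -> (timestamp, value)

    def update(self, key, value):
        self.store[key] = (time.time(), value)   # answer the client at once

    def query(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None       # may be stale: weak consistency

    def gossip_with(self, peer):
        # Exchange entries in both directions; the newer entry wins.
        for key in set(self.store) | set(peer.store):
            mine = self.store.get(key, (0.0, None))
            theirs = peer.store.get(key, (0.0, None))
            newest = mine if mine[0] >= theirs[0] else theirs
            self.store[key] = peer.store[key] = newest

Clients may read stale data between gossip rounds; this is exactly the trade of weaker consistency for higher availability described above.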
Recovery
Once a failure has occurred, in many cases it is important to recover critical processes to a known state in order to resume processing. The problem is compounded in distributed systems. There are two approaches to recovery in distributed environments.
Forward recovery attempts to bring the system to a new stable state from which it is possible to proceed (applied in situations where the nature of the errors is known and a reset can be applied). Backward recovery attempts to bring the system back to a previous stable state, typically saved as a checkpoint.
Backward recovery is the most extensively used in distributed systems and is generally the safest; it can be incorporated into middleware layers, although it is complicated in the case of process, machine or network failure. It gives no guarantee that the same fault will not occur again (a deterministic view, which affects failure transparency properties), and it cannot be applied to irreversible (non-idempotent) operations, e.g. an ATM withdrawal.
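A minimal backward-recovery sketch in Python using checkpointing follows; the state model is entirely illustrative.

import copy

class CheckpointedProcess:
    """Backward recovery: roll back to the last saved (checkpointed) state."""

    def __init__(self, initial_state):
        self.state = initial_state
        self.checkpoint = copy.deepcopy(initial_state)   # last known-good state

    def save_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)

    def recover(self):
        # After a failure, restore the checkpoint and resume from there.
        # Note: rollback cannot undo irreversible external effects (cash
        # already dispensed by an ATM), which is the limitation noted above.
        self.state = copy.deepcopy(self.checkpoint)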
Conclusion
In conclusion, hardware, software and networks cannot be totally free from failures. Fault tolerance is a non-functional requirement that requires a system to continue to operate even in the presence of faults. Distributed systems can be more fault-tolerant than centralized systems. Agreement in faulty systems and reliable group communication are important problems in distributed systems. Replication of data is a major fault tolerance method in distributed systems, and recovery is another property to consider in faulty distributed environments.