ch08 Ts TK Fault Tolerance I

Uploaded by

Rashi Gupta


Chapter 8: FAULT TOLERANCE I

Continue to operate even when something goes wrong!

Thanks to the authors of the textbook [TS] for providing the base slides. I made several changes/additions.
These slides may incorporate materials kindly provided by Prof. Dakai Zhu.
So I would like to thank him, too.
Turgay Korkmaz

Distributed Systems 1.1 TS

Chapter 8: FAULT TOLERANCE
◼ INTRODUCTION TO FAULT TOLERANCE
⚫ Basic Concepts, Failure Models
◼ PROCESS RESILIENCE
⚫ Design Issues, Failure Masking and Replication
⚫ Agreement in Faulty Systems, Failure Detection
◼ RELIABLE CLIENT-SERVER COMMUNICATION
⚫ Point-to-Point Communication, RPC Semantics
◼ RELIABLE GROUP COMMUNICATION
⚫ Basic Reliable-Multicasting Schemes, Scalability
⚫ Atomic Multicast
◼ DISTRIBUTED COMMIT
⚫ Two-Phase Commit, Three-Phase Commit
◼ RECOVERY
⚫ Introduction
⚫ Checkpointing
⚫ Message Logging
⚫ Recovery-Oriented Computing


Objectives

◼ To understand failures and their implications

◼ To learn about how to deal with failures

What is Fault Tolerance?
From Merriam-Webster:

◼ Failure is a state of inability to perform a normal
function (e.g., a received msg is corrupted)
◼ Error is an act involving an unintentional deviation
from truth or accuracy (e.g., reading 1 instead of 0)
◼ Fault is ….
From our textbook:

◼ Fault is the cause of an error that may lead to a
failure (e.g., software bugs, a broken line, or weather)
◼ It is important to find out what may cause an error
and construct the system in such a way that it can
tolerate faults, i.e., automatically recover and
continue to operate (e.g., re-transmit a damaged msg)


Failure in….
Distributed systems:
◼ Failure is partial
◼ Some components might still be working
◼ Entire system may still function
Non-distributed systems:
◼ Failure is total
◼ All components would be affected
◼ Entire system may be down
Questions:
Can we hide the effects of faults?
Can we recover from partial failures?
Answers are strongly related to what are called
dependable systems
Dependable Systems
◼ A component provides services to clients. To
provide services, the component may require the
services from other components → a component
may depend on some other component.
◼ Dependability implies the following:
⚫ Availability: the system is ready to be used immediately
⚫ Reliability: the system runs continuously without failure
⚫ Safety: a temporary failure should not cause anything catastrophic to happen
⚫ Maintainability: how easily a failed system can be repaired
⚫ Security (ch 9)?
High availability == high reliability?
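One way to approach this question is with numbers. The sketch below assumes two hypothetical systems (the figures are illustrative, not measurements) and computes availability as uptime divided by total time. System A fails very often but only briefly; system B never crashes while running but is switched off for two weeks a year.

```python
def availability(uptime_s: float, downtime_s: float) -> float:
    """Fraction of time the system is ready to be used."""
    return uptime_s / (uptime_s + downtime_s)

HOUR = 3600.0
WEEK = 7 * 24 * HOUR

# Hypothetical system A: down for 1 ms in every hour.
# It fails once per hour, so it is highly available but not reliable.
avail_a = availability(HOUR - 0.001, 0.001)

# Hypothetical system B: runs 50 weeks without any failure,
# then is shut down for 2 weeks. Reliable while up, but less available.
avail_b = availability(50 * WEEK, 2 * WEEK)

print(f"A: availability = {avail_a:.7f}")
print(f"B: availability = {avail_b:.4f}")
```

Under these assumptions A is the more available system and B the more reliable one, so the two properties are not the same thing.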
How to build a dependable system?
How to control faults?
◼ Fault prevention
⚫ prevent the occurrence of a fault
◼ Fault removal
⚫ reduce the presence, number, seriousness of faults

◼ Fault forecasting
⚫ estimate the present number, future incidence, and the
consequences of faults
◼ Fault tolerance
⚫ build a component in such a way that it can meet its
specifications in the presence of faults (i.e., mask the
presence of faults)
Types of Faults
◼ Transient faults
⚫ Occur once and then disappear
⚫ E.g., disturbance during wireless communication
⚫ Try it again, it will work next time!

◼ Intermittent faults
⚫ Disappear and reappear: unpredictable (and notorious)
⚫ E.g., loose contact on a connector
⚫ Hard to diagnose since it sometimes works and sometimes does not!

◼ Permanent faults
⚫ Continue to exist until faulty components are repaired/replaced
⚫ E.g., software bugs or burnt out chips


Failure Models
In DS, we have a collection of servers and channels.
System may fail because servers, channels, or both are not working…

There are various types of failures:

◼ Crash failure
⚫ component simply halts, but behaves correctly before halting
◼ Omission failure
⚫ component fails to receive or send
◼ Timing failure
⚫ correct output, but lies outside a specified real-time interval
◼ Response failure
⚫ incorrect response (wrong value or state transition)
◼ Arbitrary/Byzantine failure:
⚫ Arbitrary/Malicious output
⚫ Cannot be detected easily


Failure Detection
◼ How can clients distinguish between a crashed
component and one that is just a bit slow?
⚫ Consider a server from which a client is expecting output
▪ Is the server perhaps exhibiting timing or omission failures?
▪ Is the channel between client and server faulty?

◼ Assumptions we can make

⚫ Fail-stop : The component exhibits crash failures, but its
failure can be detected (either through announcement or
timeouts)
⚫ Fail-silent : The component exhibits omission or crash
failures; clients cannot tell what went wrong
⚫ Fail-safe : The component exhibits arbitrary, but benign
failures that cannot do any harm (e.g., junk output that can
be recognized)
Fault Tolerance Techniques

◼ Redundancy: key technique to tolerate faults

⚫ Hiding failures and effect of faults

◼ Recovery and rollback (more later in Section 8.6)

⚫ Bringing system to a consistent state


Redundancy Techniques
◼ Information redundancy
⚫ e.g., parity bit and Hamming codes

◼ Time redundancy
⚫ Repeat action
⚫ e.g., re-transmit a msg

◼ Physical (software/hardware) redundancy

⚫ Replication
⚫ e.g., extra CPUs, multiple versions of the software
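The first two techniques can be combined in a few lines. The following is a minimal sketch, not a real protocol: `noisy_channel` is a hypothetical channel that flips at most one bit per frame, `add_parity` provides the information redundancy (an even-parity bit), and `send_reliably` provides the time redundancy (retransmit until the check passes). All names and parameters are assumptions made for illustration.

```python
import random

def add_parity(bits):
    """Information redundancy: append an even-parity bit."""
    return bits + [sum(bits) % 2]

def parity_ok(frame):
    """A frame with even parity has an even number of 1s."""
    return sum(frame) % 2 == 0

def noisy_channel(frame, flip_prob, rng):
    """Hypothetical channel: flips one random bit with probability flip_prob."""
    frame = list(frame)
    if rng.random() < flip_prob:
        frame[rng.randrange(len(frame))] ^= 1
    return frame

def send_reliably(bits, flip_prob=0.3, max_tries=10, seed=0):
    """Time redundancy: retransmit until the parity check passes."""
    rng = random.Random(seed)
    frame = add_parity(bits)
    for attempt in range(1, max_tries + 1):
        received = noisy_channel(frame, flip_prob, rng)
        if parity_ok(received):
            return received[:-1], attempt  # strip the parity bit
    raise RuntimeError("fault did not clear within max_tries")

data, tries = send_reliably([1, 0, 1, 1])
print(data, "delivered after", tries, "attempt(s)")
```

A single parity bit detects only an odd number of flipped bits, which is why the assumed channel flips at most one; correcting errors without retransmission would need a stronger code such as Hamming.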


Physical Redundancy
Triple Modular Redundancy (TMR)

[Figure: TMR circuit with voters V1, V2, V3]

◼ If A2 fails → V1: majority vote → B gets good result

◼ What if V1 fails?!
TMR (cont.)

◼ Correct results are obtained via majority vote

⚫ Mask ONE fault
[Figure: TMR circuit in which one signal is marked "bad" and the others "ok"]

Assume that the probability that Vx fails is 0.1

What is the probability that the above system fails?
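As a starting point for this exercise, the sketch below computes the failure probability of one group of three independent units behind a majority vote: the group fails when at least two of the three fail. It assumes independent failures and covers a single voting stage only; the answer for the whole circuit depends on how the stages in the figure are wired together. The function name is an assumption made for illustration.

```python
from math import comb

def majority_fails(n: int, p: float) -> float:
    """Probability that more than half of n independent units fail,
    each failing with probability p (binomial tail)."""
    need = n // 2 + 1
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(need, n + 1))

p_single = 0.1
p_group = majority_fails(3, p_single)   # 3 * 0.1^2 * 0.9 + 0.1^3 = 0.028
print(f"one unit fails with probability {p_single}")
print(f"a 3-unit majority fails with probability {p_group:.3f}")
```

So under these assumptions triplication lowers the failure probability of one stage from 0.1 to 0.028.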
PROCESS RESILIENCE

Protect yourself against faulty processes by replicating and
distributing computations in a group.


Design Issues
◼ To tolerate a faulty process, organize several
identical processes into a group
◼ A group is a single abstraction of a collection of
processes
⚫ So we can send a message to a group without explicitly
knowing who its members are, how many there are, or where
they are (e.g., e-mail groups, newsgroups)
⚫ Key property: When a message is sent, all members of
the group must receive it. So if one fails, the others can
take over for it.
◼ Groups could be dynamic
⚫ So we need mechanisms to manage groups and
membership (e.g., join, leave, be part of two groups)
Flat vs. Hierarchical Groups
◼ Flat groups: information exchange
immediately occurs with all group
members
⚫ + good for fault tolerance,
⚫ + no single point of failure
⚫ - may impose more overhead as
control is completely distributed
⚫ - hard to implement

◼ Hierarchical groups: all
communication goes through a single
coordinator
⚫ - not really fault tolerant or scalable,
⚫ + but relatively easy to implement.
Group Membership
How do we create/delete groups and handle processes joining/leaving them?

◼ Centralized: have a group server that maintains a
database for each group and handles these requests
⚫ Efficient, easy to implement, but single point of failure
◼ Distributed:
⚫ To join a group, a new process can send a message to all
group members saying that it wishes to join (assuming that
reliable multicasting is available)
⚫ To leave, a process can ideally send a goodbye msg to
all, but if it crashes (not just slows down) then the others must
discover that and remove it from the group!
⚫ What if many members leave? The group may need to be rebuilt.


Failure masking by Replication
Use protocols from Ch 7:
◼ Primary-based
⚫ Organize processes in a hierarchical fashion
⚫ Primary coordinates all write (W) operations
⚫ Primary is fixed but its role can be taken over by a backup
⚫ If the primary fails, backups elect a new primary

◼ Replicated-write protocols

⚫ Organize processes into a flat group
⚫ W operations are performed using active replication or
quorum-based protocols
⚫ No single point of failure, but distributed coordination has a cost

◼ How much replication is needed or enough?

Level of Redundancy
K-Fault Tolerance

◼ A system is said to be k-fault tolerant if it can
survive faults in k components and still meet its
specifications.
◼ How many components (processes) do we need
to provide k-fault tolerance?
◼ It depends on what kinds of faults can happen.


Level of Redundancy
◼ Assume crash failure semantics (i.e., fail-stop)
⚫ k + 1 components are needed to survive k failures
▪ if k of them stop, the last one can still take over
⚫ Ensure at least one functional component!
◼ Assume arbitrary/Byzantine (but non-malicious)
failure semantics (i.e., components continue to run when sick
and send out random or erroneous replies)
⚫ Suppose group output is defined by voting and
component failures are independent
⚫ 2k+1 components are needed
▪ If k are wrong, then k+1 must be good to have a majority
⚫ Theoretically correct, but in practice it is hard to be sure that
k components can fail while k+1 cannot
(some statistical analysis is needed)
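The 2k+1 rule can be checked with a small voting function. This is a minimal sketch assuming that correct replicas all return the same value and that the group output is a strict majority of the replies; `vote` and the sample replies are hypothetical.

```python
from collections import Counter

def vote(replies):
    """Return the value held by a strict majority of replies, or None."""
    value, n = Counter(replies).most_common(1)[0]
    return value if n > len(replies) // 2 else None

k = 2  # number of faulty replicas to tolerate

# 2k+1 = 5 replicas: the k+1 = 3 correct replies outvote the 2 bad ones.
print(vote([42, 42, 42, 7, 13]))

# Only 2k = 4 replicas: if the 2 bad replies happen to coincide,
# no value has a strict majority and the vote is inconclusive.
print(vote([42, 42, 7, 7]))
```

With crash failures no vote is needed, since any surviving replica returns the correct value, which is why k+1 replicas suffice in that case.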


Level of Redundancy: Agreement Problem

◼ Problem: Assume Byzantine (malicious) failure
semantics and the need for agreement among non-faulty
components
⚫ Faulty components cooperate to cheat!!!
⚫ 3k+1 components are needed to tolerate k failures
⚫ Agreement is possible only if more than two-thirds of
the components work properly.
⚫ In a democracy, a majority vote is usually enough, but
certain things require 2/3 (e.g., CS bylaws). Why do
you think this might be the case?
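The three group sizes discussed so far can be put side by side. The sketch below is only a lookup of the bounds stated on these slides (the function and model names are hypothetical), plus a check that with N = 3k+1 the loyal components are indeed more than two-thirds of the group, while with N = 3k they are exactly two-thirds.

```python
def min_group_size(k: int, failure_model: str) -> int:
    """Smallest group that tolerates k faulty components (bounds from the slides)."""
    sizes = {
        "crash": k + 1,                    # one survivor is enough
        "arbitrary_voting": 2 * k + 1,     # k+1 good replies form a majority
        "byzantine_agreement": 3 * k + 1,  # more than two-thirds must be loyal
    }
    return sizes[failure_model]

def more_than_two_thirds_loyal(n: int, k: int) -> bool:
    """True if the n - k loyal components exceed two-thirds of n."""
    return 3 * (n - k) > 2 * n

for k in (1, 2, 3):
    n = min_group_size(k, "byzantine_agreement")
    print(k, n, more_than_two_thirds_loyal(n, k), more_than_two_thirds_loyal(n - 1, k))
```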


Agreement in Faulty systems (1)
◼ A process group needs to reach agreement
on many things (e.g., electing a coordinator, deciding whether to commit a
transaction, dividing tasks among workers, synchronization, etc.).
◼ If all processes and communication channels were
perfect, it would be easy to reach agreement.
◼ But they are not!
◼ So the goal is to have all non-faulty processes
reach consensus and establish this consensus
within a finite number of steps!
◼ Solutions differ under different assumptions.


Agreement in Faulty systems (2)
Reaching agreement is possible only in the cases below

In practice

Sync: there is a constant c ≥ 1 such that if any process has taken c+1 steps,
then every other process has taken at least 1 step

Async: if not sync

Byzantine Agreement Problem
◼ N generals including k traitors
https://www.tutorialspoint.com/what-is-byzantine-fault-tolerance

◼ Problems:
Can trusted generals agree on
their army sizes?
What should be N and k?

◼ Assumptions:
⚫ Traitors can lie; the others don't know
who the traitors are
⚫ Reliable communication channel
more specifically …


Lamport’s Agreement Algorithm

1. Each general i sends its army size v_i to the others

⚫ Loyal generals tell the truth
⚫ Traitors can lie
2. Each general collects the received information in a
vector V such that V[i] == v_i if general i is non-faulty
3. Each general sends its vector to the others
⚫ Loyal generals send what they have
⚫ Traitors can change the vectors
4. Each general determines each vector element by
voting among all the vectors it receives
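The four steps above can be simulated directly. The sketch below is a simplified model under stated assumptions, not a full implementation of Lamport's algorithm: communication is reliable, a traitor sends a fresh bogus value in every message (the hypothetical `say` helper), and an element with no strict majority is recorded as `UNKNOWN`. The general ids and army sizes are made up for illustration.

```python
from collections import Counter
from itertools import count

UNKNOWN = None

def byzantine_agreement(values, traitors):
    """values: {general_id: true army size}; traitors: set of general ids.
    Returns {loyal_id: result vector as {general_id: value or UNKNOWN}}."""
    ids = sorted(values)
    lie = count(start=1000)  # assumed source of distinct bogus values

    def say(sender, truth):
        # Loyal generals report the truth; a traitor lies differently each time.
        return next(lie) if sender in traitors else truth

    # Steps 1-2: every general sends its value; each receiver builds a vector.
    vector = {g: {s: (values[s] if s == g else say(s, values[s])) for s in ids}
              for g in ids}

    # Step 3: every general forwards its vector to all the others.
    received = {g: [{i: say(s, vector[s][i]) for i in ids}
                    for s in ids if s != g]
                for g in ids}

    # Step 4: each loyal general takes a strict majority per element.
    result = {}
    for g in ids:
        if g in traitors:
            continue
        result[g] = {}
        for i in ids:
            value, n = Counter(v[i] for v in received[g]).most_common(1)[0]
            result[g][i] = value if n > len(received[g]) / 2 else UNKNOWN
    return result

# N = 4, k = 1: general 3 is the traitor.
print(byzantine_agreement({1: 1, 2: 2, 3: 3, 4: 4}, traitors={3}))
# N = 3, k = 1: too few generals, so no element reaches a majority.
print(byzantine_agreement({1: 1, 2: 2, 3: 3}, traitors={3}))
```

In the first run every loyal general ends with the same vector, in which only the traitor's entry is unknown. In the second run each loyal general receives just two conflicting vectors, so nothing can be decided.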


An Example: N=4, k=1
N = 3*k+1 for agreement

Majority vote?
1 got (1, 2, ?, 4)
2 got (1, 2, ?, 4)
3 got (1, 2, ?, 4)


An Example: N=3, k=1
For agreement, we need at least
2k+1 correctly functioning nodes
in addition to the k faulty ones,
so N must be at least 3k+1.

Majority vote?
1 got (?, ?, ?)
2 got (?, ?, ?)
Fail to agree!


Failure detection
◼ How can we decide whether a node has failed or is just slow?
◼ There are essentially two mechanisms:
⚫ Actively send "Are you alive?" and expect an answer, or
passively wait until messages come in from others
⚫ Use timeouts:
▪ Setting timeouts properly is difficult and application dependent
▪ Premature timeouts generate false positives
▪ You cannot distinguish process failures from network failures

◼ Also, all non-faulty processes need to decide
(agree on) who has failed and who is still a member!
⚫ Consider failure notification throughout the system:
▪ Gossiping (i.e., proactively disseminate a failure detection)
▪ On failure detection, pretend you failed as well to propagate it recursively
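The passive, timeout-based mechanism can be sketched as a small heartbeat table. This is a minimal illustration, assuming that every node periodically sends a heartbeat and that a node is merely suspected (not known) to have failed once no heartbeat has arrived within `timeout_s`. The class and its parameters are hypothetical; the clock is injectable so the behaviour can be shown without waiting.

```python
import time

class HeartbeatDetector:
    """Suspects a node when its last heartbeat is older than timeout_s."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = {}  # node id -> time of last heartbeat

    def heartbeat(self, node):
        """Record that a message from `node` has just arrived."""
        self.last_seen[node] = self.clock()

    def suspected(self):
        """Nodes that have been silent for longer than the timeout."""
        now = self.clock()
        return {n for n, t in self.last_seen.items() if now - t > self.timeout_s}

# Usage with a fake clock so the example runs instantly.
fake_now = [0.0]
detector = HeartbeatDetector(timeout_s=2.0, clock=lambda: fake_now[0])
detector.heartbeat("a")
detector.heartbeat("b")
fake_now[0] = 1.5
detector.heartbeat("a")      # "a" keeps reporting, "b" goes silent
fake_now[0] = 3.0
print(detector.suspected())  # only "b" has exceeded the timeout
```

Note that the detector cannot tell whether "b" crashed, is slow, or is cut off by the network; a shorter timeout detects failures sooner but raises more false positives.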
