DSECLZG517: Systems for Data Analytics
Session 8: Reliability and availability
Dr. Anindya Neogi
Associate Professor
[email protected]
Topics for today
• Reliability
• Availability
• Single points of failure
Distributed computing – living with failures
• Failures of nodes and links are a common concern in distributed systems
• It is essential to address fault tolerance in the design
• Fault tolerance is a measure of
• How a distributed system functions in the presence of failures of system components
• Tolerance of component faults is measured by parameters
• Reliability - An inverse indicator of failure rate
• How soon a system will fail
• Availability - An indicator of fraction of time a system is available for use
• System is not available during failure
• Serviceability - An indicator of how easy it is to service / fix the system
• Systems have to promise strict RAS guarantees because downtime means lost revenue
Metrics
• MTTF - Mean Time To Failure
• MTTF = 1 / failure rate = Total #hours of operation / Total #units
• MTTF is an averaged value. In reality, the failure rate changes over
time because it may depend on the age of the component.
• Failure rate = 1 / MTTF (assuming average value over time)
• MTTR - Mean Time to Recovery / Repair
• MTTR = Total #hours for maintenance / Total #repairs
• MTTD - Mean Time to Diagnose
• MTBF - Mean Time Between Failures
• MTBF = MTTD + MTTR + MTTF
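As a quick illustration, a minimal Python sketch of these metrics; all fleet and repair numbers below are made up:

# Hypothetical burn-in data (made-up numbers): 100 units are each run
# until they fail, accumulating 900,000 unit-hours of operation in total.
total_hours_of_operation = 900_000
total_units = 100

mttf = total_hours_of_operation / total_units   # MTTF = 9000 hours
failure_rate = 1 / mttf                         # failures per hour

# Repair log (made-up numbers): 50 repairs took 200 hours in total.
total_maintenance_hours = 200
total_repairs = 50
mttr = total_maintenance_hours / total_repairs  # MTTR = 4 hours

mttd = 0                                        # assume MTTD = 0 unless specified
mtbf = mttd + mttr + mttf                       # MTBF = 9004 hours

print(f"MTTF={mttf:.0f}h  failure rate={failure_rate:.2e}/h  "
      f"MTTR={mttr:.0f}h  MTBF={mtbf:.0f}h")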
Reliability - serial assembly
[Diagram: user —> app server —> DB server —> storage/disk]
• MTTF of a system is a function of MTTF of components
• Serial assembly of components
• Failure of any component results in system failure
• Failure rate of C = Failure rate of A + Failure rate of B = 1/ma + 1/mb
• MTTF of C: mc = 1/(1/ma + 1/mb)
[Diagram: A (MTTF = ma) and B (MTTF = mb) in series form C]
Example: a server fails every 90 days and the disk fails every 45 days. In serial assembly, the system fails every 1/(1/90 + 1/45) = 30 days.
• MTTF of system = 1 / SUM (1/MTTFi) for all components i
• Failure rate of system = SUM(1/MTTFi) for all components i
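A minimal Python sketch of the serial formula (mttf_serial is a hypothetical helper name, not from the slides), reproducing the 30-day result:

def mttf_serial(mttfs):
    # Failure rates add in a serial assembly, so the system MTTF
    # is the reciprocal of the summed component failure rates.
    return 1 / sum(1 / m for m in mttfs)

# Server fails every 90 days, disk every 45 days (example from the slide).
print(mttf_serial([90, 45]))   # 30.0 days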
Reliability - parallel assembly
• In a parallel assembly, e.g. a cluster of nodes, both A and B have to fail for C to fail
• MTTF of C = MTTF of A + MTTF of B: mc = ma + mb
• MTTF of system = SUM(MTTFi) for all components i
[Diagram: A (MTTF = ma) and B (MTTF = mb) in parallel form C]
Example: a server fails every 90 days and a disk fails every 45 days. Two redundant disks are connected in parallel, so the disk subsystem fails every 45 + 45 = 90 days. The server in series with the disk subsystem then fails every 1/(1/90 + 1/90) = 45 days.
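The same sketch extended to the parallel case, under the slides' simplified rule that component MTTFs add when every component must fail:

def mttf_parallel(mttfs):
    # Simplified model from the slides: the parallel assembly fails
    # only when every component has failed, so component MTTFs add.
    return sum(mttfs)

def mttf_serial(mttfs):
    # Serial assembly: failure rates add.
    return 1 / sum(1 / m for m in mttfs)

# Two redundant 45-day disks in parallel, in series with a 90-day server.
disk_subsystem = mttf_parallel([45, 45])        # 90 days
print(mttf_serial([90, disk_subsystem]))        # 45.0 days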
Topics for today
• Reliability
• Availability
• Single points of failure
Availability
• Availability = Time system is UP and accessible / Total time observed
• Availability = MTTF / (MTTD* + MTTR + MTTF)
or
• Availability = MTTF / MTBF
• A system is highly available when
• MTTF is high
• MTTR is low
* Unless specified, one can assume MTTD = 0
Example
• A node in a cluster fails every 100 hours, while the other parts never fail.
• On failure of the node, the whole system needs to be shut down, the faulty node replaced, and the system restarted. This takes 2 hours.
• The application then needs to be restarted, which takes 2 hours.
• What is the availability of the cluster?
• If downtime costs $80k per hour, what is the yearly cost?
• Solution
• MTTF = 100 hours
• MTTR = 2 + 2 = 4 hours
• Availability = 100/104 = 96.15%
• Cost of downtime per year = 80,000 × (3.85/100) × 365 × 24 ≈ USD 27 million
https://www.brainkart.com/article/Fault-Tolerant-Cluster-Configurations_11320/
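The same arithmetic as a minimal Python sketch (variable names are my own; the figures are from the problem statement):

mttf = 100          # hours between node failures
mttr = 2 + 2        # hours: replace node + restart application
mttd = 0            # assumed 0, as in the slides

availability = mttf / (mttf + mttd + mttr)               # 0.9615...
downtime_hours_per_year = (1 - availability) * 365 * 24
yearly_cost = 80_000 * downtime_hours_per_year

print(f"Availability = {availability:.2%}")               # 96.15%
print(f"Yearly downtime cost = ${yearly_cost/1e6:.1f}M")  # ~$27.0M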
Availability : Serial and Parallel Systems (1)
Serial (two components, each with Ai = 0.995):
A(system) = Product(Ai) for all i = 0.995 × 0.995 = 0.990025

Parallel (two components, each with Ai = 0.995):
A(system) = 1 - Unavailability(system) = 1 - Product(1 - Ai) for all i
= 1 - (1-0.995)(1-0.995) = 1 - 0.005 × 0.005 = 0.999975
Availability : Parallel Systems (2)
[Diagram: comp1 and comp2 in parallel]
A(S) = A(Comp1 U Comp2)
= A(Comp1) + A(Comp2) - A(Comp 1) * A(Comp2)
= 0.995 + 0.995 - 0.995 * 0.995
= 0.999975
For 3 components ?
A(S) = A1 + A2 + A3 - A1*A2 - A1*A3 - A2*A3 + A1*A2*A3
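A quick sketch confirming that the complement form and the inclusion-exclusion form agree:

a1 = a2 = 0.995

# Complement form: 1 minus the probability that both components are down.
via_unavailability = 1 - (1 - a1) * (1 - a2)

# Inclusion-exclusion form: A(Comp1 U Comp2).
via_inclusion_exclusion = a1 + a2 - a1 * a2

print(via_unavailability, via_inclusion_exclusion)   # both 0.999975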
Reliability block diagrams
• Systems are a complex combination of serial and parallel
connections
• An RBD model is used to analyse availability of a
complex system by encapsulating serial or parallel
connections within blocks
• Sometimes it is non-trivial to create an RBD given the
system dependencies
• Example: the user-to-application path needs both switch 1 and switch 2
available
• The application needs web service 1, which needs either of
the 2 switches available to use the DB
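A minimal sketch of evaluating an RBD by nesting serial and parallel blocks; the availability values below are made up, and the topology follows the switch / web-service / DB example above:

def serial(*avails):
    # A serial block is up only if every sub-block is up.
    p = 1.0
    for a in avails:
        p *= a
    return p

def parallel(*avails):
    # A parallel block is up if at least one sub-block is up.
    q = 1.0
    for a in avails:
        q *= (1 - a)
    return 1 - q

sw1 = sw2 = 0.99          # made-up switch availabilities
ws1, db = 0.999, 0.995    # made-up web service / DB availabilities

# User -> application needs BOTH switches: serial.
user_path = serial(sw1, sw2)

# Web service 1 reaches the DB through EITHER switch: parallel.
db_path = serial(ws1, parallel(sw1, sw2), db)

print(f"user path: {user_path:.6f}, web-service-to-DB path: {db_path:.6f}")

Note that the same switches appear serially on one path and in parallel on another; shared components like this are what make constructing a correct RBD non-trivial.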
Fault Tolerant Clusters – Recovery
• Diagnosis
• Detection of failure and location of the failed component, e.g. using
heartbeat messages between nodes
• Backward recovery
• Periodically take a checkpoint (save a consistent state on stable storage)
• On failure, isolate the failed component, roll back to the last checkpoint,
and resume normal operation
• Easy to implement and application-independent, but execution time is wasted
on rollback, in addition to the work spent on checkpoints that are never used
[Diagram: periodic checkpoints with rollback on errors]
• Forward recovery
• Real-time or other time-critical systems cannot roll back, so the state
is reconstructed on the fly from diagnosis data
• Application-specific and may need additional hardware
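A toy Python sketch of backward recovery (the state, checkpoint interval, and failure model are all made up; real systems write checkpoints to stable storage):

import copy, random

random.seed(7)
state = {"step": 0, "total": 0}
checkpoint = copy.deepcopy(state)        # initial checkpoint

while state["step"] < 10:
    # Save a consistent state (here: every iteration, to keep the
    # sketch short; real systems checkpoint far less often).
    checkpoint = copy.deepcopy(state)
    try:
        state["step"] += 1
        state["total"] += state["step"]
        if random.random() < 0.2:        # injected fault (made-up model)
            raise RuntimeError("component failure")
    except RuntimeError:
        state = copy.deepcopy(checkpoint)   # roll back to last checkpoint
        print("failure: rolled back to step", state["step"])

print("final state:", state)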
Topics for today
• Reliability
• Availability
• Single points of failure
Single Points of Failure in SMP and Clusters
[Diagram: SMP — bus / memory failures? Cluster — Ethernet failures? Node failures?]
Protect against node failures with periodic checkpoints on global storage
Redundancy techniques
• Availability can be increased in 2 ways
» Increase MTTF - almost saturated and expensive to increase further
» Reduce MTTR - have redundancy in the cluster so that another node takes over
as one fails (hiding failures)
» Isolated redundancy - redundant components are isolated, e.g. the backup
node shares nothing with the primary node
» N-version programming - N copies of the software are independently built and
run; the results are compared and a majority vote is taken
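A toy sketch of the N-version majority vote; the three "versions" here are trivially different implementations of the same sum, one deliberately buggy:

from collections import Counter

# Three independently written "versions" of the same computation.
def v1(xs): return sum(xs)
def v2(xs):
    total = 0
    for x in xs:
        total += x
    return total
def v3(xs): return sum(xs) + 1   # deliberately buggy version

def majority_vote(versions, xs):
    results = Counter(v(xs) for v in versions)
    value, votes = results.most_common(1)[0]
    if votes > len(versions) // 2:
        return value
    raise RuntimeError("no majority: versions disagree")

print(majority_vote([v1, v2, v3], [1, 2, 3]))   # 6 (outvotes the buggy v3)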
Review topics
• Session 1: Types of analytics, types of data, intro to caching
• Session 2: Locality of reference - cache hit / miss calculations; given a program or scenario, can you tell
whether it exhibits spatial or temporal locality?
• Session 3: Solving latency and bandwidth issues with caching, block size, prefetching, multi-threading. Interplay
between techniques, e.g. memory bandwidth is impacted when trying to reduce latency with prefetching / multi-
threading.
• Session 4: Various message-passing options - blocking, buffering, buffering in interface cards …. Common
programming features in OpenMPI (distributed memory) and OpenMP (shared memory).
• Session 5: Do you know how to design a parallel program using the right decomposition?
• Session 6: Software and system architectures - given a scenario, can you decide which architecture to use?
Fallacies of distributed systems
• Session 7: Cluster design - components, failover options
• Session 8: Reliability and availability calculations