DSECLZG517: Systems for Data Analytics
Session 8: Reliability and availability
Dr. Anindya Neogi
Associate Professor
[email protected]
Topics for today
• Reliability
• Availability
• Single points of failure
Distributed computing – living with failures
• Failures of nodes and links are a common concern in distributed systems
• It is essential to address fault tolerance in the design
• Fault tolerance is a measure of
• How a distributed system functions in the presence of failures of system components
• Tolerance of component faults is measured by parameters
• Reliability - An inverse indicator of failure rate
• How soon a system will fail
• Availability - An indicator of fraction of time a system is available for use
• System is not available during failure
• Serviceability - An indicator of how easy it is to service / fix the system
• Systems have to promise strict RAS guarantees because downtime means lost revenue
Metrics
• MTTF - Mean Time To Failure
• MTTF = 1 / failure rate = Total #hours of operation / Total #units
• MTTF is an averaged value. In reality, the failure rate changes over
time because it may depend on the age of the component.
• Failure rate = 1 / MTTF (assuming average value over time)
• MTTR - Mean Time to Recovery / Repair
• MTTR = Total #hours for maintenance / Total #repairs
• MTTD - Mean Time to Diagnose
• MTBF - Mean Time Between Failures
• MTBF = MTTD + MTTR + MTTF
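As a quick illustration, a minimal Python sketch of these metrics; all fleet and repair numbers below are made up:

# Hypothetical burn-in data (made-up numbers): 100 units are each run
# until they fail, accumulating 900,000 unit-hours of operation in total.
total_hours_of_operation = 900_000
total_units = 100

mttf = total_hours_of_operation / total_units   # MTTF = 9000 hours
failure_rate = 1 / mttf                         # failures per hour

# Repair log (made-up numbers): 50 repairs took 200 hours in total.
total_maintenance_hours = 200
total_repairs = 50
mttr = total_maintenance_hours / total_repairs  # MTTR = 4 hours

mttd = 0                                        # assume MTTD = 0 unless specified
mtbf = mttd + mttr + mttf                       # MTBF = 9004 hours

print(f"MTTF={mttf:.0f}h  failure rate={failure_rate:.2e}/h  "
      f"MTTR={mttr:.0f}h  MTBF={mtbf:.0f}h")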
Reliability - serial assembly
[Diagram: user —> app server —> DB server —> storage/disk]
• MTTF of a system is a function of MTTF of components
• Serial assembly of components
• Failure of any component results in system failure
• Failure rate of C = Failure rate of A + Failure rate of B = 1/ma + 1/mb
• MTTF of C: mc = 1/(1/ma + 1/mb)
[Diagram: A (MTTF = ma) and B (MTTF = mb) in series form C]
Example: a server fails every 90 days and the disk fails every 45 days. In serial assembly, the system fails every 1/(1/90 + 1/45) = 30 days.
• MTTF of system = 1 / SUM (1/MTTFi) for all components i
• Failure rate of system = SUM(1/MTTFi) for all components i
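A minimal Python sketch of the serial formula (mttf_serial is a hypothetical helper name, not from the slides), reproducing the 30-day result:

def mttf_serial(mttfs):
    # Failure rates add in a serial assembly, so the system MTTF
    # is the reciprocal of the summed component failure rates.
    return 1 / sum(1 / m for m in mttfs)

# Server fails every 90 days, disk every 45 days (example from the slide).
print(mttf_serial([90, 45]))   # 30.0 days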
Reliability - parallel assembly
• In a parallel assembly, e.g. a cluster of nodes, both A and B have to fail for C to fail
• MTTF of C = MTTF of A + MTTF of B: mc = ma + mb
• MTTF of system = SUM(MTTFi) for all components i
[Diagram: A (MTTF = ma) and B (MTTF = mb) in parallel form C]
Example: a server fails every 90 days and a disk fails every 45 days. Two redundant disks are connected in parallel, so the disk subsystem fails every 45 + 45 = 90 days. The server in series with the disk subsystem then fails every 1/(1/90 + 1/90) = 45 days.
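The same sketch extended to the parallel case, under the slides' simplified rule that component MTTFs add when every component must fail:

def mttf_parallel(mttfs):
    # Simplified model from the slides: the parallel assembly fails
    # only when every component has failed, so component MTTFs add.
    return sum(mttfs)

def mttf_serial(mttfs):
    # Serial assembly: failure rates add.
    return 1 / sum(1 / m for m in mttfs)

# Two redundant 45-day disks in parallel, in series with a 90-day server.
disk_subsystem = mttf_parallel([45, 45])        # 90 days
print(mttf_serial([90, disk_subsystem]))        # 45.0 days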
Topics for today
• Reliability
• Availability
• Single points of failure
Availability
• Availability = Time system is UP and accessible / Total time observed
• Availability = MTTF / (MTTD* + MTTR + MTTF)
or
• Availability = MTTF / MTBF
• A system is highly available when
• MTTF is high
• MTTR is low
* Unless specified, one can assume MTTD = 0
Example
• A node in a cluster fails every 100 hours, while the other parts never fail.
• On failure of the node, the whole system needs to be shut down, the faulty node replaced, and the system restarted. This takes 2 hours.
• The application then needs to be restarted, which takes 2 hours.
• What is the availability of the cluster?
• If downtime costs $80k per hour, what is the yearly cost?
• Solution
• MTTF = 100 hours
• MTTR = 2 + 2 = 4 hours
• Availability = 100/104 = 96.15%
• Cost of downtime per year = 80,000 × (3.85/100) × 365 × 24 ≈ USD 27 million
https://www.brainkart.com/article/Fault-Tolerant-Cluster-Configurations_11320/
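The same arithmetic as a minimal Python sketch (variable names are my own; the figures are from the problem statement):

mttf = 100          # hours between node failures
mttr = 2 + 2        # hours: replace node + restart application
mttd = 0            # assumed 0, as in the slides

availability = mttf / (mttf + mttd + mttr)               # 0.9615...
downtime_hours_per_year = (1 - availability) * 365 * 24
yearly_cost = 80_000 * downtime_hours_per_year

print(f"Availability = {availability:.2%}")               # 96.15%
print(f"Yearly downtime cost = ${yearly_cost/1e6:.1f}M")  # ~$27.0M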
Availability : Serial and Parallel Systems (1)
Serial (two components, each with Ai = 0.995):
A(system) = Product(Ai) for all i = 0.995 × 0.995 = 0.990025

Parallel (two components, each with Ai = 0.995):
A(system) = 1 - Unavailability(system) = 1 - Product(1 - Ai) for all i
= 1 - (1-0.995)(1-0.995) = 1 - 0.005 × 0.005 = 0.999975
Availability : Parallel Systems (2)
[Diagram: comp1 and comp2 in parallel]
A(S) = A(Comp1 U Comp2)
= A(Comp1) + A(Comp2) - A(Comp 1) * A(Comp2)
= 0.995 + 0.995 - 0.995 * 0.995
= 0.999975
For 3 components ?
A(S) = A1 + A2 + A3 - A1*A2 - A1*A3 - A2*A3 + A1*A2*A3
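A quick sketch confirming that the complement form and the inclusion-exclusion form agree:

a1 = a2 = 0.995

# Complement form: 1 minus the probability that both components are down.
via_unavailability = 1 - (1 - a1) * (1 - a2)

# Inclusion-exclusion form: A(Comp1 U Comp2).
via_inclusion_exclusion = a1 + a2 - a1 * a2

print(via_unavailability, via_inclusion_exclusion)   # both 0.999975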
Reliability block diagrams
• Systems are a complex combination of serial and parallel
connections
• An RBD model is used to analyse availability of a
complex system by encapsulating serial or parallel
connections within blocks
• Sometimes it is non-trivial to create an RBD given the
system dependencies
• Example: the user-to-application path needs both switch 1 and switch 2
available
• The application needs web service 1, which needs either of
the 2 switches available to use the DB
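A minimal sketch of evaluating an RBD by nesting serial and parallel blocks; the availability values below are made up, and the topology follows the switch / web-service / DB example above:

def serial(*avails):
    # A serial block is up only if every sub-block is up.
    p = 1.0
    for a in avails:
        p *= a
    return p

def parallel(*avails):
    # A parallel block is up if at least one sub-block is up.
    q = 1.0
    for a in avails:
        q *= (1 - a)
    return 1 - q

sw1 = sw2 = 0.99          # made-up switch availabilities
ws1, db = 0.999, 0.995    # made-up web service / DB availabilities

# User -> application needs BOTH switches: serial.
user_path = serial(sw1, sw2)

# Web service 1 reaches the DB through EITHER switch: parallel.
db_path = serial(ws1, parallel(sw1, sw2), db)

print(f"user path: {user_path:.6f}, web-service-to-DB path: {db_path:.6f}")

Note that the same switches appear serially on one path and in parallel on another; shared components like this are what make constructing a correct RBD non-trivial.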
Fault Tolerant Clusters – Recovery
• Diagnosis
• Detection of failure and location of the failed component, e.g. using
heartbeat messages between nodes
• Backward recovery
• Periodically take a checkpoint (save a consistent state on stable storage)
• On failure, isolate the failed component, roll back to the last checkpoint,
and resume normal operation
• Easy to implement and application-independent, but execution time is wasted
on rollback, in addition to the work spent on checkpoints that are never used
[Diagram: periodic checkpoints with rollback on errors]
• Forward recovery
• Real-time or other time-critical systems cannot roll back, so the state
is reconstructed on the fly from diagnosis data
• Application-specific and may need additional hardware
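A toy Python sketch of backward recovery (the state, checkpoint interval, and failure model are all made up; real systems write checkpoints to stable storage):

import copy, random

random.seed(7)
state = {"step": 0, "total": 0}
checkpoint = copy.deepcopy(state)        # initial checkpoint

while state["step"] < 10:
    # Save a consistent state (here: every iteration, to keep the
    # sketch short; real systems checkpoint far less often).
    checkpoint = copy.deepcopy(state)
    try:
        state["step"] += 1
        state["total"] += state["step"]
        if random.random() < 0.2:        # injected fault (made-up model)
            raise RuntimeError("component failure")
    except RuntimeError:
        state = copy.deepcopy(checkpoint)   # roll back to last checkpoint
        print("failure: rolled back to step", state["step"])

print("final state:", state)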
Topics for today
• Reliability
• Availability
• Single points of failure
Single Points of Failure in SMP and Clusters
[Diagram: SMP — bus / memory failures? Cluster — Ethernet failures? Node failures?]
Protect against node failures with periodic checkpoints on global storage
Redundancy techniques
• Availability can be increased in 2 ways
» Increase MTTF - almost saturated and expensive to increase further
» Reduce MTTR - have redundancy in the cluster so that another node takes over
as one fails (hiding failures)
» Isolated redundancy - redundant components are isolated, e.g. the backup
node shares nothing with the primary node
» N-version programming - N copies of the software are independently built and
run; the results are compared and a majority vote is taken
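A toy sketch of the N-version majority vote; the three "versions" here are trivially different implementations of the same sum, one deliberately buggy:

from collections import Counter

# Three independently written "versions" of the same computation.
def v1(xs): return sum(xs)
def v2(xs):
    total = 0
    for x in xs:
        total += x
    return total
def v3(xs): return sum(xs) + 1   # deliberately buggy version

def majority_vote(versions, xs):
    results = Counter(v(xs) for v in versions)
    value, votes = results.most_common(1)[0]
    if votes > len(versions) // 2:
        return value
    raise RuntimeError("no majority: versions disagree")

print(majority_vote([v1, v2, v3], [1, 2, 3]))   # 6 (outvotes the buggy v3)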
Review topics
• Session 1: Types of analytics, types of data, intro to caching
• Session 2: Locality of reference - cache hit / miss calculations; given a program or scenario, can you tell
whether it exhibits spatial or temporal locality?
• Session 3: Solving latency and bandwidth issues with caching, block size, prefetching, multi-threading. Interplay
between techniques, e.g. memory bandwidth is impacted when trying to reduce latency with prefetching / multi-
threading.
• Session 4: Various message-passing options - blocking, buffering, buffering in interface cards …. Common
programming features in OpenMPI (distributed memory) and OpenMP (shared memory).
• Session 5: Do you know how to design a parallel program using the right decomposition?
• Session 6: Software and system architectures - given a scenario, can you decide which architecture to use?
Fallacies of distributed systems
• Session 7: Cluster design - components, failover options
• Session 8: Reliability and availability calculations