11 Coordination

This document discusses synchronization and coordination in distributed systems. It begins by defining the goal of coordination: for a set of processes to agree on values or actions even without a fixed master-slave relationship, with examples such as synchronizing Hadoop cluster nodes and leader election. It then covers failure assumptions, distributed mutual exclusion and its requirements (safety, liveness, ordering), mutual exclusion algorithms such as multicast synchronization and Maekawa's voting algorithm and their performance (bandwidth, delay, throughput), election algorithms such as the bully algorithm, consensus, Paxos, and the Chubby and ZooKeeper coordination services.

Uploaded by

mansoorehe
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
50 views73 pages

11 Coordination

This document discusses synchronization and coordination in distributed systems. It begins by defining the goal of coordination between processes to agree on values or actions even without a master-slave relationship. Examples given include synchronizing hadoop cluster nodes and leader election. The document then covers failure assumptions, distributed mutual exclusion, requirements for mutual exclusion like safety and liveness, and algorithms for mutual exclusion like multicast synchronization and Maekawa's voting algorithm. It evaluates performance metrics for mutual exclusion algorithms like bandwidth, delay, and throughput.

Uploaded by

mansoorehe
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 73

CLOUD COMPUTING

Synchronization &
Coordination
Zeinab Zali
Isfahan University Of Technology
References: -Cloud computing: Theory and practice, Chapter 3
-Distributed systems: concepts and design, George Coulouris, Chapter 15
-https://www.cs.rutgers.edu/~pxk/417/notes/paxos.html
Problem statement

Goal:
– for a set of processes to coordinate their actions
or to agree on one or more values
– The computers must be able to do so even where
there is no fixed master-slave relationship
between the components

Example:
– Synchronizing Hadoop cluster nodes for doing a
single task
– leader/master election
– Use barriers to block processing of a set of nodes
until a condition is met
Failure assumptions and failure
detectors

Each pair of processes is connected by reliable
channels
– although the underlying network components may
suffer failures, the processes use a reliable
communication protocol that masks these failures –
for example, by re-transmitting missing or corrupted
messages

No process failure implies a threat to the other
processes’ ability to communicate.
– This means that none of the processes depends
upon another to forward messages.

Processes may fail only by crashing
Distributed Mutual
Exclusion

Critical section problem

If a collection of processes share a resource
or collection of resources, then often mutual
exclusion is required to prevent interference
and ensure consistency when accessing the
resources

In a distributed system, however, neither
shared variables nor facilities supplied by a
single local kernel can be used to solve it, in
general.

We require a solution to distributed mutual
exclusion: one that is based solely on message
passing
executing a critical section

The application-level protocol for executing a
critical section is as follows:

enter()             // enter critical section – block if necessary
resourceAccesses()  // access shared resources in critical section
exit()              // leave critical section – other processes
                    // may now enter

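The enter/exit bracket above maps naturally onto structured resource management. A minimal sketch in Python, assuming a hypothetical dme object that implements some distributed mutual exclusion algorithm with a blocking enter() and a matching exit():

    from contextlib import contextmanager

    @contextmanager
    def critical_section(dme):
        dme.enter()        # enter critical section – block if necessary
        try:
            yield          # access shared resources in the critical section
        finally:
            dme.exit()     # leave – other processes may now enter

    # usage: exit() runs even if the resource access raises
    # with critical_section(dme):
    #     update_shared_resource()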
Essential requirements for
mutual exclusion

ME1 (safety): At most one process may execute
in the critical section (CS) at a time.

ME2 (liveness): Requests to enter and exit the
critical section eventually succeed
– Deadlock: two or more of the processes
become stuck indefinitely while attempting to enter
or exit the critical section
– Starvation: the indefinite postponement of entry
for a process that has requested it (→ no fairness)

ME3 (ordering): If one request to enter the CS
happened-before another, then entry to the CS is
granted in that order (no ME3 → no fairness)
Evaluating the performance of
ME algorithms

The bandwidth consumed: it is proportional to
the number of messages sent in each entry and
exit operation;

Delay: the client delay incurred by a process at
each entry and exit operation;

Algorithm’s throughput: This is the rate at which
the collection of processes as a whole can access
the critical section
– We measure the effect using the synchronization
delay between one process exiting the critical
section and the next process entering it; the
throughput is greater when the synchronization
delay is shorter.
Multicast synchronization


Basic idea (this is Ricart and Agrawala's
algorithm): processes that require entry to a
critical section multicast a request message,
and can enter it only when all the other
processes have replied to this message

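The heart of this scheme is the rule each process applies when a request arrives. A minimal sketch, assuming Lamport-timestamped requests represented as (timestamp, pid) tuples and a send(dst, msg) transport supplied from outside; class and method names are illustrative:

    RELEASED, WANTED, HELD = range(3)

    class RicartAgrawalaNode:
        def __init__(self, pid, send):
            self.pid = pid
            self.send = send          # send(dst, msg) – assumed transport
            self.state = RELEASED
            self.my_request = None    # (lamport_ts, pid) of our pending request
            self.deferred = []        # requesters we must answer on exit

        def on_request(self, their_request, sender):
            # Defer the reply if we hold the CS, or if our own pending
            # request is ordered first; tuple comparison gives the
            # (timestamp, pid) total order.
            if self.state == HELD or (
                self.state == WANTED and self.my_request < their_request
            ):
                self.deferred.append(sender)
            else:
                self.send(sender, "REPLY")

        def exit_cs(self):
            self.state = RELEASED
            for sender in self.deferred:   # admit the deferred requesters
                self.send(sender, "REPLY")
            self.deferred.clear()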
Maekawa’s voting algorithm

A ‘candidate’ process must collect sufficient
votes to enter (but not all the votes, unlike the
previous multicast method)

Figure 15.6
Maekawa’s algorithm

On initialization
    state := RELEASED;
    voted := FALSE;

For pi to enter the critical section
    state := WANTED;
    Multicast request to all processes in Vi;
    Wait until (number of replies received = K);
    state := HELD;

On receipt of a request from pi at pj
    if (state = HELD or voted = TRUE)
    then
        queue request from pi without replying;
    else
        send reply to pi;
        voted := TRUE;
    end if

For pi to exit the critical section
    state := RELEASED;
    Multicast release to all processes in Vi;

On receipt of a release from pi at pj
    if (queue of requests is non-empty)
    then
        remove head of queue – from pk, say;
        send reply to pk;
        voted := TRUE;
    else
        voted := FALSE;
    end if

Instructor’s Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design, 5th edn.
© Pearson Education 2012
Maekawa’s voting algorithm


What is the optimal solution?
– We should minimize K, the size of each voting set
– Maekawa showed that the optimal solution has K ≈ √N

● It is non-trivial to calculate the optimal sets Vi

– approximation: place the processes in a
matrix and let Vi be the union of the row and
column containing pi

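The row-and-column approximation is easy to make concrete. A minimal sketch, assuming processes are numbered 0..N-1; the function name is illustrative:

    import math

    def voting_set(i, n):
        # Lay ids 0..n-1 out in a near-square grid and take the union of
        # the row and column containing process i. For n a perfect square
        # any two such sets intersect, and |Vi| is about 2*sqrt(n).
        cols = math.isqrt(n - 1) + 1          # grid width, roughly sqrt(n)
        rows = (n + cols - 1) // cols
        r, c = divmod(i, cols)
        row_members = {r * cols + j for j in range(cols) if r * cols + j < n}
        col_members = {j * cols + c for j in range(rows) if j * cols + c < n}
        return row_members | col_members

    # e.g. voting_set(0, 16) == {0, 1, 2, 3, 4, 8, 12} – 7 votes, not 16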
Maekawa’s voting properties
✔ This algorithm achieves the safety property,
ME1
– If it were possible for two processes pi and pj to
enter the critical section at the same time, then the
processes in Vi ∩ Vj would have to have voted
for both i and j; but a process votes for at most
one candidate at a time, so this cannot happen
✗ Unfortunately, the algorithm is deadlock-prone
– Consider three processes p1, p2, p3, with
V1 = {p1, p2}, V2 = {p2, p3} and V3 = {p3, p1}. If the
three processes concurrently request entry to the
critical section:

p1 replies to itself and holds off p2, p2 replies to itself
and holds off p3, and p3 replies to itself and holds off p1
– each then waits for a vote that will never arrive
Maekawa’s voting properties

✔ We can achieve ME2 and ME3 by adapting the
algorithm so that processes queue outstanding
requests in happened-before order

Maekawa’s voting performance

Bandwidth utilization is 2√N messages per
entry to the critical section

and √N messages per exit (taking K ≈ √N)

The total of 3√N is less than the 2(N – 1)
messages required by Ricart and Agrawala's
algorithm (if N > 4)
– for example, with N = 16: 3√16 = 12 messages
versus 2 × 15 = 30

The client delay is the same as that of Ricart
and Agrawala's algorithm
– but the synchronization delay is worse: a
round-trip time instead of a single message
transmission time.
Election

Election in DS

An algorithm for choosing a unique process
to play a particular role is called an election
algorithm

Examples:
– In central-server algorithm for mutual exclusion,
the ‘server’ is elected from among the processes
that need to use the critical section
– Selecting a master among the replicas in the
Google File System

Election applications


A leader is useful for coordination among
distributed servers

Apache Zookeeper
– a centralized service for maintaining configuration
information, naming, providing distributed
synchronization, and providing group services

Google’s Chubby
– Providing lock service for loosely-coupled
distributed systems

Election problem

In a group of processes, elect a Leader to
undertake special tasks
– And let everyone know in the group about this
Leader

What happens when a leader fails (crashes)?
– Some process detects this (using a Failure
Detector!) Then what?

Election algorithm goal:
– 1. Elect one leader only among the non-faulty
processes
– 2. All non-faulty processes agree on who is the
leader
System Model


N processes.

Each process has a unique id.

Messages are eventually delivered.

Failures may occur during the election
protocol.

Calling election

Any process can call for an election

A process can call for at most one election at
a time.

Multiple processes are allowed to call an
election simultaneously.
– All of them together must yield only a
single leader

The result of an election should not depend
on which process calls for it.

Election algorithm requirements

A run of the election algorithm must always
guarantee at the end:
– Safety: For all non-faulty processes p:
p's elected = q (a particular non-faulty
process with the best attribute value) or
Null
– Liveness: For all election runs: the election
run terminates, and for all non-faulty
processes p: p's elected is not Null

Election algorithm requirements


At the end of the election protocol, the non-
faulty process with the best (highest) election
attribute value is elected.
– Common attribute: leader has highest id
– Other attribute examples: leader has
highest IP address, or fastest computation
(lowest computational load), or most disk
space, or most number of files, etc

Bully algorithm

Allows processes to crash during an election,
although it assumes that message delivery
between processes is reliable

The algorithm assumes that the system is
synchronous
– it uses timeouts to detect a process failure

The bully algorithm assumes that each
process knows which processes have higher
identifiers, and that it can communicate with
all such processes
Bully algorithm

All processes know the other processes’ ids

When a process finds the coordinator has
failed (via the failure detector):
– if it knows its id is the highest, it elects
itself as coordinator, then sends a
Coordinator message to all processes with
lower identifiers. Election is completed.
– else it initiates an election by sending an
Election message

Bully algorithm
– else it initiates an election by sending
an Election message

It sends it only to processes that have a
higher id than itself

If it receives no answer within the timeout, it
calls itself leader and sends a Coordinator
message to all lower-id processes. Election
completed.

If an answer is received, however, then there
is some non-faulty higher process => so,
wait for a Coordinator message. If none is
received after another timeout, start a new
election run.
Bully algorithm

A process that receives an Election message
replies with an OK message, and starts its own
leader election protocol (unless it has already
done so)

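Putting these rules together, one round of the bully protocol can be sketched as follows, assuming a send(dst, msg) transport and a wait_for(msg_type, timeout) helper that returns None on timeout; both are placeholders for a real event-driven implementation:

    TIMEOUT = 1.0   # assumed bound: worst-case round trip + processing

    class BullyNode:
        def __init__(self, my_id, all_ids, send, wait_for):
            self.my_id = my_id
            self.all_ids = all_ids      # every process id, known a priori
            self.send = send
            self.wait_for = wait_for

        def start_election(self):
            higher = [p for p in self.all_ids if p > self.my_id]
            if not higher:
                self.become_coordinator()         # highest id elects itself
                return
            for p in higher:
                self.send(p, ("ELECTION", self.my_id))
            if self.wait_for("OK", TIMEOUT) is None:
                self.become_coordinator()         # no live higher process
            elif self.wait_for("COORDINATOR", TIMEOUT) is None:
                self.start_election()             # higher process died mid-run

        def become_coordinator(self):
            for p in self.all_ids:
                if p < self.my_id:
                    self.send(p, ("COORDINATOR", self.my_id))

        def on_election(self, sender_id):
            self.send(sender_id, ("OK", self.my_id))   # bully the requester
            self.start_election()                      # unless already running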
Bully algorithm Example

[Figures: step-by-step example of a bully election run, omitted]

Failure during election run

[Figures: example of processes failing during an election run, omitted]
Failures and Timeouts

If failures stop, eventually a leader will be elected

How do you set the timeouts?

Based on the worst-case time to complete an election
– 5 message transmission times if there are no
failures during the run:
1. Election from the lowest-id server in the group
2. Answer to the lowest-id server from the 2nd-highest-id
process
3. Election from the 2nd-highest-id server to the highest-id
4. Timeout for answers @ the 2nd-highest-id server
5. Coordinator from the 2nd-highest-id server

Analysis

Worst-case completion time: 5 message
transmission times
– When the process with the lowest id in the
system detects the failure.

(N-1) processes altogether begin elections,
each sending messages to processes with
higher ids.

The i-th highest-id process sends (i-1) Election
messages
– Number of Election messages
= (N-1) + (N-2) + ... + 1 = N(N-1)/2 = O(N²)

Analysis


Best-case
– Second-highest id detects leader failure
– Sends (N-2) Coordinator messages
– Completion time: 1 message transmission time

impossibility

Since timeouts are built into the protocol, in the
asynchronous system model:
– the protocol may never terminate => liveness is
not guaranteed
– but it satisfies liveness in the synchronous
system model, where the

worst-case latency can be calculated =
worst-case processing time + worst-case
message latency

Why is Election so Hard?


Because it is related to the consensus
problem!

If we could solve election, then we could
solve consensus!

Elect a process, use its id’s last bit as the
consensus decision

But since consensus is impossible in
asynchronous systems, so is election

Consensus

What is the problem


Problems of agreement:
– the problem is for processes to agree on a
value after one or more of the processes has
proposed what that value should be

Three classic problems


C: Consensus

BG: Byzantine generals problem

IC: Interactive consistency

Once we solve one of these problems, the
others can be solved by reduction to the
solved one

Solving each of the 3 problems
from another's solution

IC from BG

BG from IC

C from IC

IC from C

BG from C

C from BG

Consensus Problem
definition
● every process pi begins in the undecided
state and proposes a single value vi , drawn
from a set D ( i = 1, 2, …, N ).

The processes communicate with one
another, exchanging values.

Each process then sets the value of a
decision variable, di

In doing so it enters the decided state, in
which it may no longer change di (i = 1, 2, ..., N)
Consensus solution
requirement

Termination: Eventually each correct process
sets its decision variable.

Agreement: The decision value of all correct
processes is the same: if pi and pj are correct
and have entered the decided state, then di
= dj ( i, j = 1, 2,..,N ).

Integrity: If the processes (correct or not) all
proposed the same value, then any correct
process in the decided state has chosen that
value.
Consensus in a system with
failure

Consensus is possible to solve in a
synchronous system where message delays
and processing delays are bounded

Consensus is impossible to solve in an
asynchronous system where these delays are
unbounded

Consensus in a
synchronous system

The algorithm uses only a basic multicast
protocol.

It assumes that up to f of the N processes
exhibit crash failures.

To reach consensus, each correct process
collects proposed values from the other
processes.

The algorithm proceeds in f + 1 rounds, in
each of which the correct processes multicast
the values between themselves
48 / 73
Cloud Computing, Zeinab Zali ECE Department, Isfahan University of Technology
Consensus in a
synchronous system

[Figure: the f + 1-round algorithm, omitted – a sketch follows]
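A minimal sketch of such an algorithm, assuming a hypothetical round_exchange(values) primitive that stands in for the bounded-delay multicast of one synchronous round: it sends our newly learned values and returns the sets received from live processes in that round:

    def synchronous_consensus(my_value, f, round_exchange):
        values = {my_value}      # proposed values known so far
        new = {my_value}         # values not yet forwarded to the others
        for _ in range(f + 1):   # f+1 rounds tolerate up to f crashes
            received = round_exchange(new)
            incoming = set().union(*received) if received else set()
            new = incoming - values          # forward only unseen values
            values |= incoming
        return min(values)       # same set everywhere + deterministic rule
                                 # => agreement

Because at most f processes crash, at least one of the f + 1 rounds is crash-free; after that round every correct process holds the same set, and the deterministic minimum rule yields agreement.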
Proof (agreement and
integrity after f+1 rounds)

Assume, to the contrary, that two correct processes
differ in their final set of values: some correct process
pi possesses a value v that another correct process pj
does not. The only way this can happen is if every
process that managed to send v toward pj crashed
before doing so – one such crash in each of the f + 1
rounds – contradicting the assumption of at most f
crashes.

The duration of a round is limited by setting a
timeout based on the maximum time for a
correct process to multicast a message.

Paxos

Paxos


Paxos is an algorithm that is used to achieve
consensus among a distributed set of
computers that communicate via an
asynchronous network.

One or more clients proposes a value to
Paxos and we have consensus when a
majority of systems running Paxos agrees on
one of the proposed values

Paxos liveness


Paxos won’t try to specify precise liveness
requirements.

However, the goal is to ensure that some
proposed value is eventually chosen and, if a
value has been chosen, then a process can
eventually learn the value

Paxos application

The most common use of Paxos is in
implementing replicated state machines, such
as chunk servers in GFS
– To ensure that replicas are consistent,
incoming operations must be processed in the
same order on all systems.
– Each of the servers will maintain a log that is
sequenced identically to the logs on the other
servers.
– A consensus algorithm will decide the next
value that goes on the log.
– Then, each server simply processes the log in
order and applies the requested operations.
Paxos roles


The nodes in Paxos have three roles
– Proposers
– Acceptors
– Learners

Paxos nodes may take multiple roles, even all
of them

Paxos is a two phase algorithm

Phase Promise

Proposer wants to propose a certain value
– It sends prepare(ID) to a majority or all of the
acceptors (IDs must be unique)

If it times out, it retries with a new (greater) ID

ID = timestamp + pid;
send PREPARE(ID)

Acceptor receives a PREPARE message for ID:
    is this ID bigger than any round I have previously received?
    if yes
        store the ID number, max_id = ID
        respond with a PROMISE message
    if no
        do not respond (or respond with a "fail" message)
Phase Accept

If a proposer received a PROMISE message from
the majority of acceptors, it now has to tell the
acceptors to accept that proposal. If not, it has to
start over with another round of Paxos.

send PROPOSE(ID, VALUE)

The acceptor accepts the proposal if the ID
number of the proposal is still the largest one
that it has seen:
    is the ID the largest I have seen so far (max_id == ID)?
    if yes
        reply with an ACCEPTED message & send
        ACCEPTED(ID, VALUE) to all learners
    if no
        do not respond (or respond with a "fail" message)
Contention

[Figure: two proposers duel – one sends prepare(5), the other prepare(6);
acceptors promise the higher-numbered round each time, so propose(5)
fails, prompting prepare(7), which makes propose(6) fail, prompting
prepare(8), and so on – neither proposer makes progress]
Fixing the protocol (I)

acceptor receives a PREPARE(ID) message:
    is this ID bigger than any round I have previously received?
    if yes
        store the ID number, max_id = ID
        did I already accept a proposal?
        if yes
            respond with a PROMISE(ID, accepted_ID,
            accepted_VALUE) message
        if no
            respond with a PROMISE(ID) message
    if no
        do not respond (or respond with a "fail" message)

Fixing the protocol (II)

proposer receives PROMISE(ID, [accepted_ID, accepted_VALUE])
messages:
    do I have PROMISE responses from a majority of acceptors?
    if yes
        do any responses contain accepted values (from other
        proposals)?
        if yes
            pick the value with the highest accepted ID
            send PROPOSE(ID, accepted_VALUE) to at least a
            majority of acceptors
        if no
            we can use our own proposed value
            send PROPOSE(ID, VALUE) to at least a majority of acceptors

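The acceptor side of the fixed protocol fits in a small state machine. A minimal single-decree sketch following the rules above; durable storage, networking and learner notification are assumed to live elsewhere:

    class PaxosAcceptor:
        def __init__(self):
            self.max_id = 0              # highest PREPARE id promised so far
            self.accepted_id = None      # id of any proposal already accepted
            self.accepted_value = None

        def on_prepare(self, proposal_id):
            if proposal_id <= self.max_id:
                return ("FAIL", self.max_id)   # promised a higher round
            self.max_id = proposal_id
            if self.accepted_id is not None:   # report the earlier acceptance
                return ("PROMISE", proposal_id,
                        self.accepted_id, self.accepted_value)
            return ("PROMISE", proposal_id)

        def on_propose(self, proposal_id, value):
            if proposal_id < self.max_id:      # a newer PREPARE intervened
                return ("FAIL", self.max_id)
            self.accepted_id = proposal_id
            self.accepted_value = value
            return ("ACCEPTED", proposal_id, value)  # also sent to learners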
Failure analysis


Suppose failure of each of the players in
different phases and analyze how the
algorithm handle it

Chubby

Chubby Goals


Main Intention: Chubby is a distributed lock
service intended for advisory coarse-grained
synchronization of activities within Google’s
distributed systems;

The primary goals: reliability, availability to a
moderately large set of clients, and easy-to-
understand semantics;

The secondary goals: throughput and storage
capacity

applications

Google File System uses a Chubby lock to
appoint a GFS master server

Bigtable uses Chubby in several ways:
– to elect a master
– to allow the master to discover the servers it
controls
– to permit clients to find the master

both GFS and Bigtable use Chubby as a well-
known and available location to store a small
amount of meta-data;
Chubby Design

A Chubby cell consists of a set of replicas
(the standard is 5), one of which is elected as
the master. Clients c1, c2, . . . , cn communicate
with the master using RPCs
Chubby Design

Chubby replicas run the Paxos algorithm to elect
a new master (which holds a master lease) when
the current one fails

Clients find the master by sending master
location requests to the replicas listed in the
DNS.

Chubby Design


Clients use RPCs to request services from the
master.
– When a master receives a write request, it
propagates the request to all replicas and
waits for a reply from a majority of replicas
before responding.
– The master responds without consulting the
replicas when receiving a read request.

Chubby components


Locks and Sequencers: Each Chubby file and
directory can act as a reader-writer lock
– either one client handle may hold the lock in
exclusive (writer) mode, or any number of
client handles may hold the lock in shared
(reader) mode.

API: Clients see a Chubby handle as a pointer
to an opaque structure that supports various
operations. Handles are created only by
Open(), and destroyed with Close()
Zookeeper

Zookeeper Goals

ZooKeeper is a distributed coordination
service with these goals:
– Simplicity: it coordinates through a shared
hierarchical namespace, organized like a
standard file system and made up of znodes
– Reliability: the system keeps performing even
if one or more nodes fail
– Speed: it is especially fast for read-dominant
workloads, with read:write ratios of around 10:1
– Scalability: performance can be enhanced by
deploying more machines
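As a concrete taste of the service, leader election over the znode namespace can be written with the third-party kazoo Python client (the host address, znode path and identifier below are placeholders):

    from kazoo.client import KazooClient

    def lead():
        # Runs only while this process holds leadership; returning (or
        # crashing, which removes the ephemeral znode) cedes the role.
        print("elected leader - coordinating work")

    zk = KazooClient(hosts="127.0.0.1:2181")   # assumed ZooKeeper ensemble
    zk.start()
    election = zk.Election("/app/election", identifier="node-1")
    election.run(lead)      # blocks until elected, then calls lead()
    zk.stop()

Under the hood the recipe creates an ephemeral sequential znode per contender and grants leadership to the lowest sequence number, so a crashed leader's znode disappears and the next contender takes over.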
Zookeeper Architecture

[Figure: ZooKeeper architecture diagram, omitted]
Companies Using
ZooKeeper

Yahoo

Hadoop and HBase

Facebook

eBay

Twitter

Netflix

Zynga

Nutanix
