Part III Distributed Resource Management

9 Distributed File Systems

9.1 Introduction
9.2 Architecture
9.3 Mechanisms for Building Distributed File Systems
9.3.1 Mounting
9.3.2 Caching
9.3.3 Hints
9.3.4 Bulk Data Transfer
9.3.5 Encryption
9.4 Design Issues
9.4.1 Naming and Name Resolution
9.4.2 Caches on Disk or Main Memory
9.4.3 Writing Policy
9.4.4 Cache Consistency
9.4.5 Availability
9.4.6 Scalability
9.4.7 Semantics
9.5 Case Studies
9.5.1 The Sun Network File System
9.5.2 The Sprite File System
9.5.3 Apollo DOMAIN Distributed File System
9.5.4 Coda
9.5.5 The x-Kernel Logical File System
9.6 Log-Structured File Systems
9.6.1 Disk Space Management
9.7 Summary
9.8 Further Readings
Problems
References

10 Distributed Shared Memory

10.1 Introduction
10.2 Architecture and Motivation
10.3 Algorithms for Implementing DSM
10.3.1 The Central-Server Algorithm
10.3.2 The Migration Algorithm
10.3.3 The Read-Replication Algorithm
10.3.4 The Full-Replication Algorithm
10.4 Memory Coherence
10.5 Coherence Protocols
10.5.1 Cache Coherence in the PLUS System
10.5.2 Unifying Synchronization and Data Transfer in Clouds
10.5.3 Type-Specific Memory Coherence in the Munin System
10.6 Design Issues
10.6.1 Granularity
10.6.2 Page Replacement
10.7 Case Studies
10.7.1 IVY
10.7.2 Mirage
10.7.3 Clouds
10.8 Summary
10.9 Further Reading
Problems
References

11 Distributed Scheduling

11.1 Introduction
11.2 Motivation
11.3 Issues in Load Distributing
11.3.1 Load
11.3.2 Classification of Load Distributing Algorithms
11.3.3 Load Balancing versus Load Sharing
11.3.4 Preemptive versus Nonpreemptive Transfers
11.4 Components of a Load Distributing Algorithm
11.4.1 Transfer Policy
11.4.2 Selection Policy
11.4.3 Location Policy
11.4.4 Information Policy
11.5 Stability
11.5.1 The Queuing-Theoretic Perspective
11.5.2 The Algorithmic Perspective
11.6 Load Distributing Algorithms
11.6.1 Sender-Initiated Algorithms
11.6.2 Receiver-Initiated Algorithms
11.6.3 Symmetrically Initiated Algorithms
11.6.4 Adaptive Algorithms
11.7 Performance Comparison
11.7.1 Receiver-Initiated versus Sender-Initiated Load Sharing
11.7.2 Symmetrically Initiated Load Sharing
11.7.3 Stable Load Sharing Algorithms
11.7.4 Performance Under Heterogeneous Workloads
11.8 Selecting a Suitable Load Sharing Algorithm
11.9 Requirements for Load Distributing
11.10 Load Sharing Policies: Case Studies
11.10.1 The V-System
11.10.2 The Sprite System
11.10.3 Condor
11.10.4 The Stealth Distributed Scheduler
11.11 Task Migration
11.12 Issues in Task Migration
11.12.1 State Transfer
11.12.2 Location Transparency
11.12.3 Structure of a Migration Mechanism
11.12.4 Performance
11.13 Summary
11.14 Further Reading
Problems
References
Part IV Failure Recovery and Fault Tolerance
CHAPTER 11

DISTRIBUTED SCHEDULING

11.1 INTRODUCTION
Distributed systems offer a tremendous processing capacity. However, in order to realize this tremendous computing capacity, and to take full advantage of it, good resource allocation schemes are needed. A distributed scheduler is a resource management component of a distributed operating system that focuses on judiciously and transparently redistributing the load of the system among the computers such that the overall performance of the system is maximized. Because wide-area networks have high communication delays, distributed scheduling is more suitable for distributed systems based on local area networks.

In this chapter, we discuss several key issues in load distributing, including the motivation for load distributing, the tradeoffs between load balancing and load sharing and between preemptive and nonpreemptive task transfers, and stability. In addition, we describe several load distributing algorithms and compare their performance. Surveys of load distributing policies and task migration mechanisms that have been implemented are also presented. This chapter is based on [32].

11.2 MOTIVATION

A locally distributed system consists of a collection of autonomous computers, connected by a local area communication network (Fig. 11.1). Users submit tasks at their host computers for processing. The need for load distributing arises in such environments because, due to the random arrival of tasks and their random CPU service time requirements, there is a good possibility that several computers are heavily loaded (hence suffering from performance degradation) while others are idle or lightly loaded. Clearly, if the workload at some computers is typically heavier than that at others, or if some processors execute tasks at a slower rate than others, this situation is likely to occur often. The usefulness of load distributing is not as obvious in systems in which all processors are equally powerful and, over the long term, have equally heavy workloads. Livny and Melman [24] have shown that even in such homogeneous distributed systems, statistical fluctuations in the arrival of tasks and in task service time requirements at computers lead to a high probability that at least one computer is idle while a task is waiting for service elsewhere. Their analysis, presented next, models a computer in the distributed system by an M/M/1 server.

FIGURE 11.1 A distributed system without load distributing (adapted from [32]).

Consider a system of N identical and independent M/M/1 servers [16]. By identical we mean that all servers have the same task arrival and service rates. Let \rho be the utilization of each server. Then P_0 = 1 - \rho is the probability that a server is idle. Let P be the probability that the system is in a state in which at least one task is waiting for service and at least one server is idle. Then P is given by the expression [24]

P = \sum_{i=1}^{N-1} \binom{N}{i} Q_i H_{N-i}    (11.1)

where Q_i is the probability that a given set of i servers are idle and H_{N-i} is the probability that a given set of (N - i) servers are not idle and at one or more of them a task is waiting for service. Clearly, from the independence assumption,

Q_i = P_0^i    (11.2)

H_{N-i} = {probability that all (N - i) systems have at least one task} - {probability that all (N - i) systems have exactly one task}, i.e.,

H_{N-i} = (1 - P_0)^{N-i} - ((1 - P_0) P_0)^{N-i}    (11.3)

Therefore,

P = \sum_{i=1}^{N-1} \binom{N}{i} P_0^i \{ (1 - P_0)^{N-i} - ((1 - P_0) P_0)^{N-i} \}
  = \{ 1 - (1 - P_0)^N - P_0^N \} - P_0^N \{ (2 - P_0)^N - (1 - P_0)^N - 1 \}
  = 1 - (1 - P_0)^N (1 - P_0^N) - P_0^N (2 - P_0)^N    (11.4)

Figure 11.2 plots the values of P for various values of the server utilization \rho and the number of servers N. For moderate system utilization (\rho = 0.5 to 0.8), the value of P is high, indicating a good potential for performance improvement through load distribution. At high system utilizations, the value of P is low as most servers are likely to be busy, which indicates a lower potential for load distribution. Similarly, at low system utilizations, the value of P is low as most servers are likely to be idle, which again indicates a lower potential for load distribution. Another interesting observation is that, as the number of servers in the system increases, P remains high even at high system utilizations.

FIGURE 11.2 P as a function of \rho and N (adapted from [24]).

Therefore, even in a homogeneous distributed system, system performance can potentially be improved by appropriately transferring load from heavily loaded computers (senders) to idle or lightly loaded computers (receivers). This raises the following two questions. First, what do we mean by performance? The average response time of tasks is perhaps the most widely used performance metric; the response time of a task is the length of the time interval between its origination and its completion, and minimizing the average response time is often the goal of load distributing. Second, what constitutes a proper characterization of load at a node? Proper characterization of load at a node is very important, as load distributing decisions are based on the load measured at one or more nodes. Also, it is crucial that the mechanism used to measure load is efficient and imposes minimal overhead. These issues are discussed next.
11.3 ISSUES IN LOAD DISTRIBUTING

We now discuss several central issues in load distributing that will help the reader understand its intricacies. Note here that the terms computer, machine, host, workstation, and node are used interchangeably, depending upon the context.

11.3.1 Load

Zhou [41] showed that resource queue lengths, and particularly the CPU queue length, are good indicators of load because they correlate well with the task response time. Moreover, measuring the CPU queue length is fairly simple and carries little overhead. If a task transfer involves significant delays, however, simply using the current CPU queue length as a load indicator can result in a node accepting tasks while other tasks it accepted earlier are still in transit. As a result, when all the tasks that the node has accepted have arrived, the node can become overloaded and require further task transfers to reduce its load. This undesirable situation can be prevented by artificially incrementing the CPU queue length at a node whenever the node accepts a remote task. To avoid anomalies when task transfers fail, a timeout (set at the time of acceptance) can be employed. After the timeout, if the task has not yet arrived, the CPU queue length is decremented.

While the CPU queue length has been extensively used in previous studies as a load indicator, it has been reported that little correlation exists between CPU queue length and processor utilization [35], particularly in an interactive environment. Hence, the designers of the V-System used CPU utilization as an indicator of the load at a site. This approach requires a background process that monitors CPU utilization continuously and imposes more overhead, compared to simply finding the queue length at a node (see Sec. 11.10.1).

11.3.2 Classification of Load Distributing Algorithms

The basic function of a load distributing algorithm is to transfer load (tasks) from heavily loaded computers to idle or lightly loaded computers. Load distributing algorithms can be broadly characterized as static, dynamic, or adaptive. Dynamic load distributing algorithms [3, 10, 11, 18, 20, 24, 31, 34, 40] use system state information (the loads at nodes), at least in part, to make load distributing decisions, while static algorithms make no use of such information. In static load distributing algorithms, decisions are hard-wired in the algorithm using a priori knowledge of the system. Dynamic load distributing algorithms have the potential to outperform static load distributing algorithms because they are able to exploit short-term fluctuations in the system state to improve performance. However, dynamic load distributing algorithms entail overhead in the collection, storage, and analysis of system state information. Adaptive load distributing algorithms [20, 31] are a special class of dynamic load distributing algorithms in that they adapt their activities by dynamically changing the parameters of the algorithm to suit the changing system state. For example, a dynamic algorithm may continue to collect the system state irrespective of the system load. An adaptive algorithm, on the other hand, may discontinue the collection of the system state if the overall system load is high, to avoid imposing additional overhead on the system. At such loads, all nodes are likely to be busy and attempts to find receivers are unlikely to be successful.

11.3.3 Load Balancing versus Load Sharing

Load distributing algorithms can further be classified as load balancing or load sharing algorithms, based on their load distributing principle. Both types of algorithms strive to reduce the likelihood of an unshared state (a state in which one computer lies idle while at the same time tasks contend for service at another computer [21]) by transferring tasks to lightly loaded nodes. Load balancing algorithms [7, 20, 24], however, go a step further by attempting to equalize loads at all computers. Because a load balancing algorithm transfers tasks at a higher rate than a load sharing algorithm, the higher overhead incurred by the load balancing algorithm may outweigh this potential performance improvement.

Task transfers are not instantaneous because of communication delays and delays that occur during the collection of task state. Delays in transferring a task increase the duration of an unshared state, as an idle computer must wait for the arrival of the transferred task. To avoid lengthy unshared states, anticipatory task transfers from overloaded computers to computers that are likely to become idle shortly can be used. Anticipatory transfers increase the task transfer rate of a load sharing algorithm, making it less distinguishable from load balancing algorithms. In this sense, load balancing can be considered a special case of load sharing, performing a particular level of anticipatory task transfers.
11.3.4 Preemptive versus Nonpreemptive Transfers

Preemptive task transfers involve the transfer of a task that is partially executed. This transfer is an expensive operation, as the collection of a task's state (which can be quite large and complex) can be difficult. Typically, a task state consists of a virtual memory image, a process control block, unread I/O buffers and messages, file pointers, timers that have been set, etc. Nonpreemptive task transfers, on the other hand, involve the transfer of tasks that have not begun execution and hence do not require the transfer of the task's state. In both types of transfers, information about the environment in which the task will execute must be transferred to the receiving node. This information can include the user's current working directory, the privileges inherited by the task, etc. Nonpreemptive task transfers are also referred to as task placements.

11.4 COMPONENTS OF A LOAD DISTRIBUTING ALGORITHM

Typically, a load distributing algorithm has four components: (1) a transfer policy that determines whether a node is in a suitable state to participate in a task transfer, (2) a selection policy that determines which task should be transferred, (3) a location policy that determines to which node a task selected for transfer should be sent, and (4) an information policy, which is responsible for triggering the collection of system state information. A transfer policy typically requires information on the local node's state to make decisions. A location policy, on the other hand, is likely to require information on the states of remote nodes to make decisions.

11.4.1 Transfer Policy

A large number of the transfer policies that have been proposed are threshold policies [10, 11, 24, 31]. Thresholds are expressed in units of load. When a new task originates at a node and the load at that node exceeds a threshold T, the transfer policy decides that the node is a sender. If the load at a node falls below T, the transfer policy decides that the node can be a receiver for a remote task.

An alternative transfer policy initiates task transfers whenever an imbalance in load among nodes is detected because of the actions of the information policy.

11.4.2 Selection Policy

A selection policy selects a task for transfer, once the transfer policy decides that the node is a sender. Should the selection policy fail to find a suitable task to transfer, the node is no longer considered a sender until the transfer policy decides that the node is a sender again.

The simplest approach is to select newly originated tasks that have caused the node to become a sender by increasing the load at the node beyond the threshold [11]. Such tasks are relatively cheap to transfer, as the transfer is nonpreemptive.

A basic criterion that a task selected for transfer should satisfy is that the overhead incurred in the transfer of the task should be compensated for by the reduction in the response time realized by the task. In general, long-lived tasks satisfy this criterion [4]. Also, a task can be selected for remote execution if the estimated average execution time for that type of task is greater than some execution time threshold [36]. Bryant and Finkel [3] propose another approach based on the reduction in response time that can be obtained for a task by transferring it elsewhere. In this method, a task is selected for transfer only if its response time will be improved upon transfer. (See [3] for details on how to estimate the response time.)

There are other factors to consider in the selection of a task. First, the overhead incurred by the transfer should be minimal. For example, a task of small size carries less overhead. Second, the number of location-dependent system calls made by the selected task should be minimal. Location-dependent calls must be executed at the node where the task originated, because they use resources such as windows or the mouse that exist only at that node [8, 19].

11.4.3 Location Policy

The responsibility of a location policy is to find suitable nodes (senders or receivers) to share load. A widely used method for finding a suitable node is polling. In polling, a node polls another node to find out whether it is a suitable node for load sharing [3, 10, 11, 24, 31]. Nodes can be polled either serially or in parallel (e.g., by multicast). A node can be selected for polling either randomly [3, 10, 11], based on the information collected during previous polls [24, 31], or on a nearest-neighbor basis. An alternative to polling is to broadcast a query to find out if any node is available for load sharing.

11.4.4 Information Policy

The information policy is responsible for deciding when information about the states of other nodes in the system should be collected, where it should be collected from, and what information should be collected. Most information policies are one of the following three types:

Demand-driven. In this class of policy, a node collects the state of other nodes only when it becomes either a sender or a receiver (decided by the transfer and selection policies at the node), making it a suitable candidate to initiate load sharing. Note that a demand-driven information policy is inherently a dynamic policy, as its actions depend on the system state. Demand-driven policies can be sender-initiated, receiver-initiated, or symmetrically initiated. In sender-initiated policies, senders look for receivers to which they can transfer their load. In receiver-initiated policies, receivers solicit load from senders. A symmetrically initiated policy is a combination of both, where load sharing actions are triggered by the demand for extra processing power or extra work.

Periodic. In this class of policy, nodes exchange load information periodically [14, 40]. Based on the information collected, the transfer policy at a node may decide to transfer jobs. Periodic information policies do not adapt their activity to the system state. For example, the benefits due to load distribution are minimal at high system loads because most of the nodes in the system are busy. Nevertheless, the overheads due to periodic information collection continue to increase the system load and thus worsen the situation.

State-change-driven. In this class of policy, nodes disseminate state information whenever their state changes by a certain degree [24]. A state-change-driven policy differs from a demand-driven policy in that it disseminates information about the state of a node, rather than collecting information about other nodes. Under centralized state-change-driven policies, nodes send state information to a centralized collection point. Under decentralized state-change-driven policies, nodes send information to peers.
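The division of labor among these four components can be made concrete with a small sketch. The following Python decomposition is illustrative only: the names (T, transfer_policy, and so on) and the particular choices (a threshold transfer policy, a random location policy, a demand-driven information policy) are assumptions made for the sketch, not part of the text's presentation.

    import random

    T = 2  # threshold, expressed in units of load (CPU queue length); illustrative value

    def transfer_policy(queue_length, new_arrival=False):
        """Classify the local node as a 'sender', a 'receiver', or neither."""
        load = queue_length + (1 if new_arrival else 0)
        if load > T:
            return "sender"
        if load < T:
            return "receiver"
        return "ok"

    def selection_policy(newly_arrived_tasks):
        """Prefer newly originated tasks: they can be moved nonpreemptively."""
        return newly_arrived_tasks[0] if newly_arrived_tasks else None

    def location_policy(other_nodes):
        """Random location policy: pick a destination using no remote state."""
        return random.choice(other_nodes)

    def information_policy(role, collect_remote_state):
        """Demand-driven: gather remote state only when the node becomes
        a sender or a receiver."""
        if role in ("sender", "receiver"):
            return collect_remote_state()
        return None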
11.5 STABILITY

We now describe two views of stability.

11.5.1 The Queuing-Theoretic Perspective

When the long-term arrival rate of work to a system is greater than the rate at which the system can perform work, the CPU queues grow without bound. Such a system is termed unstable. For example, consider a load distributing algorithm performing excessive message exchanges to collect state information. The sum of the load due to the external work arriving and the load due to the overhead imposed by the algorithm can become higher than the service capacity of the system, causing system instability.

Alternatively, an algorithm can be stable but may still cause a system to perform worse than when it is not using the algorithm. Hence, a more restrictive criterion for evaluating algorithms is desirable, and we use the effectiveness of an algorithm as the evaluating criterion. A load distributing algorithm is said to be effective under a given set of conditions if it improves the performance relative to that of a system not using load distributing. Note that while an effective algorithm cannot be unstable, a stable algorithm can be ineffective.

11.5.2 The Algorithmic Perspective

If an algorithm can perform fruitless actions indefinitely with finite probability, the algorithm is said to be unstable [3]. For example, consider processor thrashing. The transfer of a task to a receiver may increase the receiver's queue length to the point of overload, necessitating the transfer of that task to yet another node. This process may repeat indefinitely [3]. In this case, a task is moved from one node to another in search of a lightly loaded node without ever receiving service. Discussions of the various types of algorithmic instability are beyond the scope of this book and can be found in [6].

11.6 LOAD DISTRIBUTING ALGORITHMS

We now describe some load distributing algorithms that have appeared in the literature and discuss their performance.

11.6.1 Sender-Initiated Algorithms

In sender-initiated algorithms, load distributing activity is initiated by an overloaded node (sender) that attempts to send a task to an underloaded node (receiver). This section covers three simple yet effective sender-initiated algorithms studied by Eager, Lazowska, and Zahorjan [11].

Transfer policy. All three algorithms use the same transfer policy, a threshold policy based on the CPU queue length. A node is identified as a sender if a new task originating at the node makes the queue length exceed a threshold T. A node identifies itself as a suitable receiver for a remote task if accepting the task will not cause the node's queue length to exceed T.

Selection policy. These sender-initiated algorithms consider only newly arrived tasks for transfer.

Location policy. These algorithms differ only in their location policy:

Random. Random is a simple dynamic location policy that uses no remote state information. A task is simply transferred to a node selected at random, with no information exchange between the nodes to aid in decision making. A problem with this approach is that useless task transfers can occur when a task is transferred to a node that is already heavily loaded (i.e., its queue length is above the threshold). An issue raised with this policy concerns the question of how a node should treat a transferred task. If it is treated as a new arrival, the transferred task can again be transferred to another node if the local queue length is above the threshold. Eager et al. [11] have shown that if such is the case, then irrespective of the average load of the system, the system will eventually enter a state in which the nodes are spending all their time transferring tasks and not executing them. A simple solution to this problem is to limit the number of times a task can be transferred. A sender-initiated algorithm using the random location policy provides a substantial performance improvement over no load sharing at all [11].

Threshold. The problem of useless task transfers under the random policy can be avoided by polling a node (selected at random) to determine whether it is a receiver (see Fig. 11.3). If so, the task is transferred to the selected node, which must execute the task regardless of its state when the task actually arrives. Otherwise, another node is selected at random and polled. The number of polls is limited by a parameter called PollLimit to keep the overhead low. Note that while nodes are randomly selected, during one searching session a sender node will not poll any node more than once. If no suitable receiver node is found within the PollLimit polls, then the node at which the task originated must execute the task. By avoiding useless task transfers, the threshold policy provides a substantial performance improvement over the random location policy [11].

Shortest. The two previous approaches make no effort to choose the best receiver for a task. Under the shortest location policy, a number of nodes (= PollLimit) are selected at random and are polled to determine their queue lengths [11]. The node with the shortest queue length is selected as the destination for the task transfer unless its queue length >= T. The destination node will execute the task regardless of its queue length at the time of arrival of the transferred task. The performance improvement obtained by using the shortest location policy over the threshold policy was found to be marginal [11], indicating that using more detailed state information does not necessarily result in a significant improvement in system performance.

FIGURE 11.3 Sender-initiated load sharing with the threshold location policy.
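The sender-initiated algorithm with the threshold location policy (the flow shown in Fig. 11.3) can be sketched in a few lines of Python. The Node class, its fields, and the synchronous queue-length "poll" are illustrative assumptions made for the sketch, not part of the original presentation.

    import random

    class Node:
        def __init__(self, name, T=2):
            self.name = name
            self.T = T                      # threshold on the CPU queue length
            self.queue_length = 0

        def accept(self, task):
            self.queue_length += 1          # task joins the CPU queue and is executed
                                            # regardless of the load at arrival time

    def on_task_arrival(task, local, nodes, poll_limit=5):
        """Sender-initiated load sharing with the threshold location policy."""
        if local.queue_length + 1 <= local.T:       # transfer policy: not a sender
            local.accept(task)
            return local
        candidates = [n for n in nodes if n is not local]
        random.shuffle(candidates)                  # poll distinct nodes at random
        for peer in candidates[:poll_limit]:
            # Poll: is the peer a receiver, i.e., will accepting keep it within T?
            if peer.queue_length + 1 <= peer.T:
                peer.accept(task)                   # transfer the newly arrived task
                return peer
        local.accept(task)                          # PollLimit exhausted: run locally
        return local

For example, on_task_arrival(task, nodes[0], nodes) either transfers the new task to the first receiver found within PollLimit polls or executes it locally.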
Information policy. When either the shortest or the threshold location policy is used, the polling activity commences only when the transfer policy identifies a node as a sender of a task. Hence, the information policy can be considered to be of the demand-driven type.

Stability. These three approaches for the location policy, used in sender-initiated algorithms, cause system instability at high system loads, where no node is likely to be lightly loaded and hence the probability that a sender will succeed in finding a receiver is very low. However, the polling activity in sender-initiated algorithms increases as the rate at which work arrives at the system increases, eventually reaching a point where the cost of load sharing is greater than its benefit. At this point, most of the nodes' CPU cycles are wasted in unsuccessful polls and in responding to these polls. As the load due to the work arriving and the load due to the load sharing activity exceed the service capacity of the system, instability occurs. Thus, the actions of sender-initiated algorithms are not effective at high system loads and cause system instability by failing to adapt to the system state.

11.6.2 Receiver-Initiated Algorithms

In receiver-initiated algorithms, the load distributing activity is initiated from an underloaded node (receiver), which tries to obtain a task from an overloaded node (sender). In this section, we describe the policies of an algorithm [31] that is a variant of the algorithm proposed in [10] (see Fig. 11.4).

Transfer policy. The transfer policy is a threshold policy where the decision is based on the CPU queue length. The transfer policy is triggered when a task departs. If the local queue length falls below the threshold T, the node is identified as a receiver for obtaining a task from a node (sender) to be determined by the location policy. A node is identified to be a sender if its queue length exceeds the threshold T.

Selection policy. This algorithm can make use of any of the approaches discussed under the selection policy in Sec. 11.4.2.

Location policy. In this policy, a node selected at random is polled to determine if transferring a task from it would place its queue length below the threshold level. If not, the polled node transfers a task. Otherwise, another node is selected at random, and the above procedure is repeated until either a node that can transfer a task (i.e., a sender) is found or a static PollLimit number of tries have failed to find a sender. If all the polls fail to find a sender, the node waits until another task departs or until a predetermined period is over before initiating the search for a sender again, provided the node is still a receiver. Note that because the search is not restarted until then, the extra processing power available at a receiver is completely lost to the system until another task completes, which may not occur soon.

FIGURE 11.4 Receiver-initiated load sharing.
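A minimal Python sketch of this receiver-initiated polling loop (the flow of Fig. 11.4) follows, assuming the same illustrative Node objects (with queue_length and T attributes) used in the earlier sender-initiated sketch; the function name and the synchronous poll are likewise assumptions made for the sketch.

    import random

    def on_task_departure(local, nodes, poll_limit=5):
        """Receiver-initiated load sharing, triggered when a task departs."""
        if local.queue_length >= local.T:           # transfer policy: not a receiver
            return None
        candidates = [n for n in nodes if n is not local]
        random.shuffle(candidates)
        for peer in candidates[:poll_limit]:
            # Poll: can the peer give up a task without dropping below the threshold?
            if peer.queue_length - 1 >= peer.T:
                peer.queue_length -= 1              # the peer (a sender) transfers one task
                local.queue_length += 1             # the receiver executes it
                return peer
        return None                                 # wait for the next departure or timeout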
Information policy. The information policy is demand-driven, because the polling activity starts only after a node becomes a receiver.

Stability. Receiver-initiated algorithms do not cause system instability, for the following reason. At high system loads, there is a high probability that a receiver will find a suitable sender to share the load within a few polls. This results in the effective usage of polls from receivers and very little wastage of CPU cycles at high system loads. At low system loads, there are few senders but more receiver-initiated polls. These polls do not cause system instability, as spare CPU cycles are available at low system loads.

A drawback. Under the most widely used CPU scheduling disciplines (such as round-robin and its variants), a newly arrived task is quickly provided a quantum of service. In receiver-initiated algorithms, the polling starts when a node becomes a receiver. However, it is unlikely that these polls will be received at senders before the new tasks that have arrived at them have begun executing. As a result, a drawback of receiver-initiated algorithms is that most transfers are preemptive and therefore expensive. Conversely, sender-initiated algorithms are able to make greater use of nonpreemptive transfers because they can initiate load distributing activity as soon as a new task arrives.

11.6.3 Symmetrically Initiated Algorithms

Under symmetrically initiated algorithms [21], both senders and receivers search for receivers and senders, respectively, for task transfers. These algorithms have the advantages of both sender- and receiver-initiated algorithms. At low system loads, the sender-initiated component is more successful in finding underloaded nodes. At high system loads, the receiver-initiated component is more successful in finding overloaded nodes. However, these algorithms are not immune from the disadvantages of both sender- and receiver-initiated algorithms. As in sender-initiated algorithms, polling at high system loads may result in system instability, and as in receiver-initiated algorithms, a preemptive task transfer facility is necessary.

A simple symmetrically initiated algorithm can be constructed by using both the transfer and location policies described in Secs. 11.6.1 and 11.6.2. Another symmetrically initiated algorithm, called the above-average algorithm [20], is described next.

THE ABOVE-AVERAGE ALGORITHM. The above-average algorithm, proposed by Krueger and Finkel [20], tries to maintain the load at each node within an acceptable range of the system average. Striving to maintain the load at a node at the exact system average can cause processor thrashing [3], as the transfer of a task may result in a node becoming either a sender (load above average) or a receiver (load below average). A description of this algorithm follows.

Transfer policy. The transfer policy is a threshold policy that uses two adaptive thresholds. These thresholds are equidistant from the node's estimate of the average load across all nodes. For example, if a node's estimate of the average load is 2, then the lower threshold = 1 and the upper threshold = 3. A node whose load is less than the lower threshold is considered a receiver, while a node whose load is greater than the upper threshold is considered a sender. Nodes that have loads between these thresholds lie within the acceptable range, so they are neither senders nor receivers.

Location policy. The location policy has the following two components (a sketch of the corresponding message handling appears after the two lists):

Sender-initiated component

• A sender (a node that has a load greater than the acceptable range) broadcasts a TooHigh message, sets a TooHigh timeout alarm, and listens for an Accept message until the timeout expires.
• A receiver (a node that has a load less than the acceptable range) that receives a TooHigh message cancels its TooLow timeout, sends an Accept message to the source of the TooHigh message, increases its load value (taking into account the task to be received), and sets an AwaitingTask timeout. Increasing its load value prevents a receiver from overcommitting itself to accepting remote tasks. If the AwaitingTask timeout expires without the arrival of a transferred task, the load value at the receiver is decreased.
• On receiving an Accept message, if the node is still a sender, it chooses the best task to transfer and transfers it to the node that responded.
• When a sender that is waiting for a response to its TooHigh message receives a TooLow message, it sends a TooHigh message to the node that sent the TooLow message. This TooHigh message is handled by the receiver as described under the "Receiver-initiated component."
• On expiration of the TooHigh timeout, if no Accept message has been received, the sender infers that its estimate of the average system load is too low (since no node has a load much lower). To correct this problem, the sender broadcasts a ChangeAverage message to increase the average load estimate at the other nodes.

Receiver-initiated component

• A node, on becoming a receiver, broadcasts a TooLow message, sets a TooLow timeout alarm, and starts listening for a TooHigh message.
• If a TooHigh message is received, the receiver performs the same actions that it does under the sender-initiated negotiation (see above).
• If the TooLow timeout expires before receiving any TooHigh messages, the receiver broadcasts a ChangeAverage message to decrease the average load estimate at the other nodes.
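The following Python sketch condenses the message handling above into one event-handler class. It is purely illustrative: the method names, the send/broadcast helpers on the assumed messaging layer (net), the fixed acceptable range of average +/- 1 (as in the example in the text), and the delivery of timeouts as method calls are assumptions made for the sketch, not part of Krueger and Finkel's description.

    class AboveAverageNode:
        def __init__(self, net, load, avg_estimate):
            self.net = net                  # assumed messaging layer with send/broadcast
            self.load = load
            self.avg = avg_estimate         # local estimate of the system average load
            self.awaiting_accept = False    # a TooHigh broadcast is outstanding

        def check_load(self):
            if self.load > self.avg + 1:    # sender: load above the acceptable range
                self.awaiting_accept = True
                self.net.broadcast("TooHigh", src=self)    # then set the TooHigh timeout
            elif self.load < self.avg - 1:  # receiver: load below the acceptable range
                self.net.broadcast("TooLow", src=self)     # then set the TooLow timeout

        def on_too_high(self, sender):      # receiver side
            if self.load < self.avg - 1:
                self.load += 1              # account for the expected task; avoids overcommitting
                self.net.send(sender, "Accept", src=self)  # then set the AwaitingTask timeout

        def on_awaiting_task_timeout(self): # promised task never arrived
            self.load -= 1

        def on_accept(self, receiver):      # sender side
            if self.load > self.avg + 1:    # still a sender: transfer the chosen task
                self.load -= 1
                self.net.send(receiver, "Task", src=self)

        def on_too_low(self, receiver):     # sender still waiting for an Accept
            if self.awaiting_accept:
                self.net.send(receiver, "TooHigh", src=self)

        def on_too_high_timeout(self):      # no Accept arrived: average estimate too low
            self.awaiting_accept = False
            self.net.broadcast("ChangeAverage", delta=+1, src=self)

        def on_too_low_timeout(self):       # no TooHigh arrived: average estimate too high
            self.net.broadcast("ChangeAverage", delta=-1, src=self)

        def on_change_average(self, delta):
            self.avg += delta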
Selection policy. This algorithm can make use of any of the approaches discussed under the selection policy in Sec. 11.4.2.

Information policy. The information policy is demand-driven. A highlight of this algorithm is that the average system load is determined individually at each node, imposing little overhead and without the exchange of many messages. Another key point to note is that the acceptable range determines the responsiveness of the algorithm. When the communication network is heavily/lightly loaded (indicated by long/short message transmission delays, respectively), the acceptable range can be increased/decreased by each node individually so that the load balancing actions adapt to the state of the communication network as well.

11.6.4 Adaptive Algorithms

A STABLE SYMMETRICALLY INITIATED ALGORITHM. The main cause of system instability due to load sharing in the previous algorithms is the indiscriminate polling by the sender's negotiation component. The stable symmetrically initiated algorithm [31] utilizes the information gathered during polling (instead of discarding it, as was done by the previous algorithms) to classify the nodes in the system as either Sender/overloaded, Receiver/underloaded, or OK (i.e., nodes having manageable load). The knowledge concerning the state of nodes is maintained by a data structure at each node, comprised of a senders list, a receivers list, and an OK list. These lists are maintained using an efficient scheme in which list-manipulative actions, such as moving a node from one list to another or finding the list to which a node belongs, impose a small and constant overhead, irrespective of the number of nodes in the system. (See [31] for more details on the list maintenance scheme.)

Initially, each node assumes that every other node is a receiver. This state is represented at each node by a receivers list that contains all nodes (except itself), an empty senders list, and an empty OK list.

Transfer policy. The transfer policy is a threshold policy where decisions are based on the CPU queue length. The transfer policy is triggered when a new task originates or when a task departs. The transfer policy makes use of two threshold values to classify the nodes: a lower threshold (LT) and an upper threshold (UT). A node is said to be a sender if its queue length > UT, a receiver if its queue length < LT, and OK if LT <= its queue length <= UT.

Location policy. The location policy has the following two components:

Sender-initiated component. The sender-initiated component is triggered at a node when it becomes a sender. The sender polls the node at the head of its receivers list to determine whether it is still a receiver. The polled node removes the sender's ID from the list it is presently in, puts it at the head of its senders list, and informs the sender whether it (the polled node) is currently a receiver, a sender, or OK. On receipt of this reply, the sender transfers the new task if the polled node has indicated that it is a receiver. Otherwise, the polled node's ID is removed from the receivers list and is put at the head of the OK list or at the head of the senders list based on its reply, and the sender polls the node now at the head of its receivers list. The polling process stops if a suitable receiver is found for the newly arrived task, if the number of polls reaches a PollLimit (a parameter of the algorithm), or if the receivers list at the sender node becomes empty. If polling fails to find a receiver, the task is processed locally, though it can later migrate as a result of receiver-initiated load sharing.

Receiver-initiated component. The goal of the receiver-initiated component is to obtain tasks from a sender node. The nodes polled are selected in the following order: head to tail in the senders list (the most up-to-date information is used first), then tail to head in the OK list (the most out-of-date information is used first, in the hope that the node has become a sender), then tail to head in the receivers list (again, the most out-of-date information is used first).

The receiver-initiated component is triggered at a node when the node becomes a receiver. The receiver polls the selected node to determine whether it is a sender. On receipt of the message, the polled node, if it is a sender, transfers a task to the polling node and informs it of its state after the task transfer. If the polled node is not a sender, it removes the receiver node's ID from the list it is presently in, puts it at the head of its receivers list, and informs the receiver whether it (the polled node) is a receiver or OK. On receipt of the reply, the receiver node removes the polled node's ID from whatever list it is presently in and puts it at the head of the appropriate list based on the reply. The polling process stops if a sender is found, if the receiver is no longer a receiver, or if the number of polls reaches a static PollLimit.

Selection policy. The sender-initiated component considers only newly arrived tasks for transfer. The receiver-initiated component can make use of any of the approaches discussed under the selection policy in Sec. 11.4.2.

Information policy. The information policy is demand-driven, as the polling activity starts when a node becomes a sender or a receiver.

Discussion. At high system loads, the probability of a node being underloaded is negligible, resulting in unsuccessful polls by the sender-initiated component. Unsuccessful polls result in the removal of polled node IDs from receivers lists; since receiver-initiated polls to these nodes are unlikely to find them as anything other than senders at such loads, the receivers lists remain empty at high system loads. Hence, sender-initiated load sharing is effectively deactivated at high system loads (note that a sender polls only nodes found in its receivers list), leaving only receiver-initiated load sharing, which is effective at such loads.

At low system loads, receiver-initiated polling generally fails. These failures do not adversely affect performance, because extra processing capacity is available at low system loads. In addition, these polls have the positive effect of updating the receivers lists. With the receivers lists accurately reflecting the system state, future sender-initiated load sharing will generally succeed within a few polls. Thus, by using sender-initiated load sharing at low system loads, receiver-initiated load sharing at high system loads, and symmetrically initiated load sharing at moderate system loads, the stable symmetrically initiated algorithm achieves improved performance over a wide range of system loads while preserving system stability.
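The node-classification lists at the heart of this algorithm can be sketched as follows. This Python version favors clarity over the constant-time list manipulation described in [31]; the class and method names are illustrative.

    class NodeLists:
        """Per-node view of every other node: 'receivers', 'senders', or 'ok'.
        The front of each list holds the most recently updated entries."""

        def __init__(self, all_other_nodes):
            # Initially every other node is assumed to be a receiver.
            self.lists = {"receivers": list(all_other_nodes), "senders": [], "ok": []}
            self.where = {n: "receivers" for n in all_other_nodes}

        def update(self, node, new_state):
            """Move a node to the head of the list named by a poll reply."""
            self.lists[self.where[node]].remove(node)   # O(n) here; O(1) in [31]
            self.lists[new_state].insert(0, node)
            self.where[node] = new_state

        def next_receiver(self):
            """Sender-initiated polling order: the head of the receivers list."""
            return self.lists["receivers"][0] if self.lists["receivers"] else None

        def receiver_poll_order(self):
            """Receiver-initiated order: senders head to tail, then the OK and
            receivers lists tail to head (most out-of-date information first)."""
            return (self.lists["senders"]
                    + list(reversed(self.lists["ok"]))
                    + list(reversed(self.lists["receivers"])))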
A STABLE SENDER-INITIATED ALGORITHM. This algorithm [31] has two desirable properties. First, it does not cause instability. Second, load sharing is due to nonpreemptive transfers (which are cheaper) only. This algorithm uses the sender-initiated load sharing component of the stable symmetrically initiated algorithm as is, but has a modified receiver-initiated component to attract future nonpreemptive task transfers from sender nodes. The stable sender-initiated policy is very similar to the stable symmetrically initiated approach, so only the differences will be pointed out.

In the stable sender-initiated algorithm, the data structure (at each node) of the stable symmetrically initiated algorithm is augmented by an array called the statevector. The statevector is used by each node to keep track of which list (senders, receivers, or OK) it belongs to at all the other nodes in the system. Moreover, the sender-initiated load sharing is augmented with the following step: when a sender polls a selected node, the sender's statevector is updated to reflect that the sender now belongs to the senders list at the selected node. Likewise, the polled node updates its statevector based on the reply it sent to the sender node, to reflect which list it will belong to at the sender.

The receiver-initiated component is replaced by the following protocol: when a node becomes a receiver, it informs all the nodes that are misinformed about its current state. The misinformed nodes are those nodes whose receivers lists do not contain the receiver's ID. This information is available in the statevector at the receiver. The statevector at the receiver is then updated to reflect that it now belongs to the receivers lists at all those nodes that were informed of its current state. By this technique, the algorithm avoids receivers sending broadcast messages to inform the other nodes that they are receivers. Remember that broadcast messages impose message handling overhead at all nodes in the system. This overhead can be high if nodes frequently change their state.

Note that there are no preemptive transfers of partly executed tasks here. The sender-initiated load sharing component will perform any load sharing, if possible, on the arrival of a new task. The stability of this approach is due to the same reasons as for the stability of the stable symmetrically initiated algorithm.

11.7 PERFORMANCE COMPARISON

This section discusses the general performance trends of some of the example algorithms described in the previous sections [32]. Figure 11.5 through Fig. 11.7 plot the average response time of tasks vs. the offered system load for several load sharing algorithms discussed in Sec. 11.6 [32]. The average service demand for tasks is assumed to be one time unit, and the task interarrival times and service demands are independently exponentially distributed. The system load is assumed to be homogeneous; that is, all nodes have the same long-term task arrival rate. The system is assumed to contain 40 identical nodes. The notations used in the figures correspond to the algorithms as follows:

M/M/1    A distributed system that performs no load distributing.
RECV     Receiver-initiated algorithm.
RAND     Sender-initiated algorithm with the random location policy.
SEND     Sender-initiated algorithm with the threshold location policy.
ADSEND   Stable sender-initiated algorithm.
SYM      Symmetrically initiated algorithm (SEND and RECV combined).
ADSYM    Stable symmetrically initiated algorithm.
M/M/K    A distributed system that performs ideal load distributing without incurring any overhead.

A fixed threshold of T = lower threshold = upper threshold = 1 was used for these comparisons. However, the value of T should adapt to the system load and the task transfer cost, because a node is identified as a sender or a receiver by comparing its queue length with T [11]. At low system loads, many nodes are likely to be idle; a low value of T will result in nodes with small queue lengths being identified as senders, who can benefit by transferring load. At high system loads, most nodes are likely to be busy; a high value of T will result in the identification of only those nodes with significant queue lengths as senders, who can benefit the most by transferring load. While a scheduling algorithm may adapt to the system load by making use of an adaptive T, the adaptive stable algorithms of Sec. 11.6.4 adapt to the system load by varying the PollLimit with the help of the lists. Also, low thresholds are desirable for low transfer costs, as smaller differences in node queue lengths can be exploited; high transfer costs demand higher thresholds.

For these comparisons, a small, fixed PollLimit = 5 was assumed. We can see why such a small limit is sufficient by noting that if P is the probability that a particular node is below threshold, then (because the nodes are assumed to be independent) the probability that a node below threshold is first encountered on the ith poll is P(1 - P)^{i-1} [11]. For large P, this expression decreases rapidly with increasing i; the probability of succeeding on the first few polls is high. For small P, the quantity decreases more slowly. However, since most nodes are above threshold, the improvement in systemwide response time that will result from locating a node below threshold is small; quitting the search after the first few polls does not carry a substantial penalty.

Main result. Comparing M/M/1 with the sender-initiated algorithm that uses the random location policy (RAND) in Fig. 11.5, we see that even this simple load distributing scheme provides a substantial performance improvement over a system that does not use load distributing. Considerable further improvement in performance can be gained through the simple sender-initiated (SEND) and receiver-initiated (RECV) load sharing schemes. M/M/K gives the optimistic lower bound on the performance that can be obtained through load distributing, since it assumes no load distributing overhead.

FIGURE 11.5 Average response time vs. system load (adapted from [32]).

11.7.1 Receiver-Initiated versus Sender-Initiated Load Sharing

It can be observed from Fig. 11.5 that the sender-initiated algorithm (SEND) performs marginally better than the receiver-initiated algorithm (RECV) at light to moderate system loads, while the receiver-initiated algorithm performs substantially better at high system loads. Receiver-initiated load sharing is less effective at low system loads because load sharing is not initiated when one of the few nodes becomes a sender, and thus load sharing often occurs late.
277
•,:
DISTRIBITTEO SCHEDULING
ADVANCED CONCEPTS IN OPERATING SYSTEMS 7
276
;:
;:
i 6
i ;:j:
Cl)
-·•·-·-·--·
---0--------------0--
SENO
SYM ;:
• I
. 5
--o------0-- RECV i:
M/M/1
ii i:
Q)
E
FQ)
' i
i
·--•------------·-•··
-·•·-·-·--·
-- -----◊--
---◊---------◊·,
RANDOM
SEND
RECV
M/MIK
C

Cl)
4
- . 0
ii ::
Cl)
C i ◊ a:
i :
g ,/ / 1a i :
Cl) 3
_ ::;; i o?

cY: :::,-
gJ
.--·· / ,o
a: - / d
2 / :'A
a,
::;; ' : C --- - - - - - -2.:7.,...,...-.: --8'/
aoc-=O,:'aC -:-:C-:7-- ,...-o•••

0.5 0.6 0.7 0.8 1.0 FIGURE 11.6


o-+----,-----,---.----.---. 0.9 Average response time vs. system load
0.5 0.6 0.7 0.8 0.9 1.0 Offered System Load (adapted from (32)).

Offered System Load


FIGURE 11.S 11.7.3 Stable Load Sharing Algorithms
Average response time vs. system load (adapted from (321).
The performance of the stable symmetrically initiated algorithm (ADSYM) approaches
that of M/M/K (Fig. 11.7), though this optimistic lowei'bound can never be reached, as
system loads, while the receiver-initiated algorithm performs substantially better at it assumes no load distributing overhead. The performance of ADSYM matches that of
high system loads . Receiver-initiated load sharing is less effective at low system loads the sender-initiated algorithm at low system loads and offers substantial improvements
because load sharing is not initiated when one of the few nodes becomes a sender, and at high loads (> 0.85) over all the nonadaptive algorithms [31). This performance
thus load sharing often occurs late.
improvement is the result of its judicious use of the knowledge gained by polling.
Regarding the robustness of these policies, the receiver-initiated policy has an edge Furthermore, this algorithm does not cause system instability.
over the sender-initiated policies. The receiver-initiated policy performs acceptably with The stable sender-initiated algorithm (ADSEND) yields a better performance than
a single value of the threshold over the entire system load spectrum, whereas the sender the unstable sender-initiated policy (SEND) for system loads > 0.6 and does not cause
initiated policy require an adaptive location policy to perform acceptably at high Joa . system instability. While ADSEND is not as effective as ADSYM. it does not require
It an _be seen from 1 . 11.5 that at high system loads, the receiver-initiated pohcy
expensive preemptive task transfers.
mamtamssys e1?.stab1hty_ because its polls generally find busy nodes, while polls due
to the sender-1m11ated pohcy are generally ineffective and waste resources in efforts to
find underloaded nodes.
11·7· Performance Under·Heterogeneous Workloads
4
-eterogeneous workloads have been shown to be common for distributed systems
11.7.2 Symmetrically Initiated Load Sharing I19). a:gure 11.8 plots mean response time against the number of nonload generating
This policy takes advantage of its sender-initiated load sharing component 0
thseysse· tern loads, its receiver-initiated component at high system loads, and bo
at 1
that
t nodes a constant offered system load of 0.85. These nodes originate none of the
system
th ;orkload, while the remaining nodes originate all of the system workload. From the
components at moderate system loads. Hence, its performance is better or m ate: t of thgure, we observe that RECV becomes unstable at a much lower degree of heterogeneity
of t e s n r-initiated policy at all levels of system load, and is better than :iess, an anyother algorithm. The instability occurs because, iri RECV, the load sharing does
rece1ve -1m11ated policy atlow_to m erate srs1em loads [32] (Fig. 11.6). Ne erthective n Oo dt Start .m accordance with the arrivals of tasks at a few (but.highly overloaded) sender
this policy also causes system 1itstab1lity at high system loads because of the meffi
polling activity of its sender-initiated component at such loads. sub::• and random polling by RECV is likely 10 fail to finda_ s_ender hen onlya small
As fe I of nodes are senders. SEND also beco e u stable ':"'th
mcreasmg eterogeneity. wer nodes receive all the system load,
1t 1s imperative that they quickly transfer
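The M/M/1 and M/M/K curves that bound these plots are analytic results rather than measured algorithms. As a rough illustration (not part of the original text; the node count and service rate below are assumed values), the following sketch computes the mean response time of an M/M/1 queue (each node serving only its own arrivals) and of an M/M/K queue (ideal, overhead-free load sharing) at a given offered load, which is exactly the pair of bounds the figures refer to.

    from math import factorial

    def mm1_response_time(lam, mu):
        # Mean response time of an M/M/1 queue: one node processing its own arrivals.
        assert lam < mu, "queue must be stable"
        return 1.0 / (mu - lam)

    def mmk_response_time(lam, mu, k):
        # Mean response time of an M/M/K queue: K nodes fed from one shared queue,
        # i.e., perfect load distributing with zero transfer overhead.
        rho = lam / (k * mu)
        assert rho < 1, "queue must be stable"
        a = lam / mu
        p0 = 1.0 / (sum(a**n / factorial(n) for n in range(k)) +
                    a**k / (factorial(k) * (1 - rho)))
        erlang_c = (a**k / (factorial(k) * (1 - rho))) * p0   # probability of waiting
        wait = erlang_c / (k * mu - lam)                      # mean queueing delay
        return wait + 1.0 / mu

    if __name__ == "__main__":
        K, mu = 40, 1.0              # 40 nodes, mean service time 1 (assumed values)
        for load in (0.5, 0.7, 0.9):
            lam = load * mu          # per-node arrival rate at this offered load
            print(load, mm1_response_time(lam, mu), mmk_response_time(K * lam, mu, K))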
FIGURE 11.7 Average response time vs. system load (adapted from [32]).

FIGURE 11.8 Mean response time vs. number of non-load-generating nodes at an offered system load of 0.85.
11.8 SELECTING A SUITABLE LOAD SHARING ALGORITHM

Based on the performance trends of load sharing algorithms, one may select a load sharing algorithm that is appropriate to the system under consideration as follows:

• For a system that experiences wide fluctuations in load and has a high cost for the migration of partly executed tasks, stable sender-initiated algorithms are recommended: they perform better than unstable sender-initiated algorithms at all loads, perform better than receiver-initiated algorithms over most system loads, and do not require expensive preemptive task transfers.

• For a system that can experience high system loads, adaptive, stable algorithms are preferable, as they perform better than nonadaptive algorithms and provide substantial performance improvements over them at such loads.
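Put concretely, this guidance can be reduced to a small decision procedure. The sketch below is illustrative only; the categories and thresholds are assumptions, not part of the original guidelines, and the returned labels simply name the algorithm families discussed in Sec. 11.6 and 11.7.

    def recommend_algorithm(peak_load: float,
                            wide_load_fluctuations: bool,
                            preemptive_transfer_cheap: bool) -> str:
        """Illustrative mapping from system characteristics to an algorithm family.

        peak_load: highest offered system load the installation is expected to reach.
        """
        if peak_load < 0.6:
            # Instability is not a concern at low loads; simple sender-initiated
            # load sharing already improves response time substantially.
            return "sender-initiated (e.g., SEND)"
        if wide_load_fluctuations and not preemptive_transfer_cheap:
            # Avoid transfers of partly executed tasks: use the stable sender-initiated policy.
            return "stable sender-initiated (ADSEND)"
        # High loads are possible and preemption is affordable: prefer the stable,
        # symmetrically initiated adaptive policy.
        return "stable symmetrically initiated (ADSYM)"

    print(recommend_algorithm(peak_load=0.9,
                              wide_load_fluctuations=True,
                              preemptive_transfer_cheap=False))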
11.9 REQUIREMENTS FOR LOAD DISTRIBUTING

While improving system performance is the main objective of a load distributing scheme, there are other important requirements it must satisfy.

Scalability. It should work well in large distributed systems. This requires the ability to make quick scheduling decisions with minimum overhead.

Location transparency. A distributed system should hide the location of tasks, just as a network file system hides the location of files from the user. In addition, the remote execution of tasks should not require any special provisions in the programs.

Determinism. A transferred task must produce the same results it would produce if it were not transferred.

Preemption. While utilizing idle workstations in the owner's absence improves the utilization of resources, a workstation's owner must not see degraded performance on his return. Guaranteeing the availability of the workstation's resources to its owner requires that remotely executed tasks be preempted and migrated elsewhere on demand. Alternatively, these tasks may be executed at a lower priority [19].

Heterogeneity. It should be able to distinguish among different architectures, processors of different processing capability, servers equipped with special hardware, etc.

11.10 LOAD SHARING POLICIES: CASE STUDIES

11.10.1 The V-System

The V-System [35] uses a state-change-driven information policy. Each node broadcasts (or publishes) its state whenever its state changes significantly. State information consists of the expected CPU and memory utilizations and particulars about the machine itself, such as its processor type and whether it has a floating-point co-processor. The broadcast state information is cached by all the nodes. If the distributed system is large, each machine can cache information about only the best N nodes (for example, only those nodes having the most available CPU and memory).

The transfer policy used by the V-System is a relative policy: a node is a receiver if it is one of the M most lightly loaded nodes in the system, and a sender if it is not. The selection policy selects only newly arrived tasks for transfer. The location policy is a decentralized policy that locates receivers as follows. When a task arrives at a machine, the machine consults its cache and constructs a list of the M most lightly loaded machines that can satisfy the task's requirements. If the local machine is one of them, the task is scheduled locally. Otherwise, a machine is chosen randomly from the list and is polled to verify the correctness of the cached data; this random selection reduces the chance that multiple machines will select the same remote machine for task execution. If the cached data matches the polled machine's state (within a degree of accuracy), the polled machine is selected for executing the task. Otherwise, the entry for the polled machine is updated with the latest information and the selection procedure is repeated. In practice, the cache entries have been found to be quite accurate, and more than three polls are rarely required [35].

The reason the publishing scheme was chosen instead of the direct queries used in the sender-initiated algorithms described in Sec. 11.6.1 is as follows: under the publishing scheme, the overhead incurred by the information policy is proportional to the number of machines in the system and the rate of change of state. This overhead can be controlled by increasing or decreasing the degree of state change that triggers the publishing of state information. If queries are used, the overhead due to polls is proportional to the number of tasks scheduled, which may limit the number of tasks that can be scheduled. (Note: This problem can be overcome with adaptive location policies, as described in Sec. 11.6.4.)

The load index used by the V-System is the CPU utilization at a node. To measure CPU utilization, a background process that periodically increments a counter is run at the lowest possible priority; the counter is then polled to see what proportion of the CPU has been idle.
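As an illustration of how these pieces fit together, the sketch below combines a state-change-driven broadcast, a best-N cache, and the poll-and-verify location policy. It is not taken from the V-System sources; the class names, constants, and the toy Network stand-in are assumptions made for illustration.

    import random

    N_CACHED = 8        # cache only the best N nodes (assumed value)
    M_LIGHTEST = 4      # receivers are the M most lightly loaded nodes (assumed value)
    SIGNIFICANT = 0.10  # load change that triggers a new broadcast (assumed value)

    class Network:
        """Toy stand-in for the broadcast medium and the polling RPC."""
        def __init__(self):
            self.nodes = {}
        def broadcast(self, sender, load):
            for node in self.nodes.values():
                if node.name != sender:
                    node.receive_broadcast(sender, load)
        def poll(self, name):
            return self.nodes[name].cpu_util

    class Node:
        def __init__(self, name, net):
            self.name, self.net = name, net
            self.cpu_util = self.last_published = 0.0
            self.cache = {}                  # node name -> last published CPU load
            net.nodes[name] = self
        def update_load(self, cpu_util):
            # State-change-driven information policy: publish only on a significant change.
            self.cpu_util = cpu_util
            if abs(cpu_util - self.last_published) >= SIGNIFICANT:
                self.last_published = cpu_util
                self.net.broadcast(self.name, cpu_util)
        def receive_broadcast(self, sender, load):
            self.cache[sender] = load
            if len(self.cache) > N_CACHED:   # keep only the most lightly loaded entries
                del self.cache[max(self.cache, key=self.cache.get)]
        def place_task(self):
            # Location policy: poll a random node among the M lightest cached entries,
            # verify the cached value, and refresh and retry if it was stale.
            for _ in range(3 * M_LIGHTEST):  # in practice, more than three polls are rare
                candidates = sorted(self.cache, key=self.cache.get)[:M_LIGHTEST]
                if not candidates:
                    break
                target = random.choice(candidates)
                actual = self.net.poll(target)
                if abs(actual - self.cache[target]) < SIGNIFICANT:
                    return target            # cached data verified; schedule there
                self.cache[target] = actual  # stale entry: update and repeat
            return self.name                 # give up and execute locally

    net = Network()
    nodes = [Node(f"n{i}", net) for i in range(10)]
    for i, n in enumerate(nodes):
        n.update_load(i / 10.0)
    print(nodes[9].place_task())             # a heavily loaded node picks a receiver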
11.10.2 The Sprite System

The Sprite system [9] is targeted towards workstation-oriented environments. Sprite uses a state-change-driven information policy under which each workstation, on becoming a receiver, notifies a central coordinator process that it is a receiver. The location policy is centralized: to locate a receiver, a workstation contacts the central coordinator process.

Sprite's selection policy is primarily manual. Tasks must be chosen by users for remote execution, and the workstation on which these tasks reside is identified as a sender. Because the Sprite system is targeted for an environment in which workstations are individually owned, it must guarantee the availability of a workstation's resources to the workstation owner. To do so, it evicts foreign tasks from a workstation whenever the owner wishes to use the workstation. During eviction, the selection policy is automatic, and it selects only foreign tasks for eviction. The evicted tasks are returned to their home workstations.

In keeping with its selection policy, the transfer policy used in Sprite is not completely automated; it comes into play only under the following two conditions. First, workstations are identified as receivers only for transfers of tasks chosen by the users. In this case, a threshold-based policy decides that a workstation is a receiver when the workstation has had no keyboard or mouse input for at least 30 seconds and the number of active tasks is less than the number of processors at the workstation. Second, a workstation is identified as a sender only when foreign tasks executing at that workstation must be evicted; for normal transfers, a node is identified as a sender manually and implicitly when the transfer is requested. The Sprite designers used semi-automated selection and transfer policies because they felt that the benefits of completely automated policies would not outweigh the implementation difficulties.

To promote a fair allocation of computing resources, a foreign process can be evicted from a workstation to allow the workstation to be allocated to another foreign process under the following condition: if the central coordinator cannot find an idle workstation for a remote execution request and it finds a user that has been using more than its fair share of workstations, then one of the heavy user's processes is evicted from a workstation. The freed workstation is then allocated to the process that had received less than its fair share. The evicted process may be automatically transferred elsewhere if idle workstations become available.

For a parallelized version of UNIX 'make', the Sprite designers have observed a speed-up factor of 5 for a system containing 12 workstations.
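A minimal sketch of Sprite's centralized location policy and its threshold-based receiver criterion follows. The class names, the coordinator interface, and the bookkeeping are assumptions made for illustration; only the 30-second idle rule and the task-count test come from the description above.

    import time

    IDLE_SECONDS = 30          # no keyboard or mouse input for at least 30 s

    class CentralCoordinator:
        """Sprite-style coordinator: receivers register here; senders ask it for one."""
        def __init__(self):
            self.receivers = set()
        def notify_receiver(self, ws):
            self.receivers.add(ws)
        def withdraw(self, ws):
            self.receivers.discard(ws)
        def find_receiver(self):
            return self.receivers.pop() if self.receivers else None

    class Workstation:
        def __init__(self, name, processors, coordinator):
            self.name = name
            self.processors = processors
            self.coordinator = coordinator
            self.active_tasks = 0
            self.last_input_time = time.time()
        def is_receiver(self):
            # Threshold-based receiver criterion from the text: idle input devices
            # and fewer active tasks than processors.
            idle = (time.time() - self.last_input_time) >= IDLE_SECONDS
            return idle and self.active_tasks < self.processors
        def report_state(self):
            # State-change-driven information policy: tell the coordinator when the
            # workstation becomes (or stops being) a receiver.
            if self.is_receiver():
                self.coordinator.notify_receiver(self)
            else:
                self.coordinator.withdraw(self)
        def run_remotely(self, task):
            # Selection is manual in Sprite: the user nominates the task, and the
            # local workstation implicitly becomes the sender and asks for a receiver.
            target = self.coordinator.find_receiver()
            return target.name if target else self.name   # fall back to local execution

    coord = CentralCoordinator()
    ws = Workstation("idle-ws", processors=1, coordinator=coord)
    ws.last_input_time -= 60        # pretend the owner has been away for a minute
    ws.report_state()
    print(Workstation("busy-ws", 1, coord).run_remotely("simulation"))   # -> idle-ws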
11.10.3 Condor

Condor [23] is concerned with scheduling long-running, CPU-intensive tasks (background tasks) only. Condor is designed for a workstation environment in which the total availability of a workstation's resources is guaranteed to the user logged in at the console (the owner) of the workstation.

Condor's selection and transfer policies are similar to Sprite's in that most transfers are manually initiated by users. Unlike Sprite, however, Condor is centralized, with a certain workstation designated as the controller. To transfer a task, a user links it with a special system-call library and places it in a local queue of background tasks. The controller's duty is to find idle workstations for these tasks. To accomplish this, Condor uses a periodic information policy, in which the controller polls each workstation at 2-minute intervals to find those workstations that are idle and those that have background tasks waiting. A workstation is considered idle only when the owner has not been active for at least 12.5 minutes. Information about background tasks is queued at the controller. If an idle workstation is found, a background task is transferred to that workstation.

If a foreign background task is being served at a workstation, a local scheduler at that workstation checks for local activity from the owner every thirty seconds. If the owner has been active since the previous check, the local scheduler preempts the foreign task and saves its state. If the workstation owner remains active for 5 minutes or more, the foreign task is preemptively transferred back to the workstation from which it originated. The task may later be transferred to an idle workstation if one is located by the controller.

A significant feature of Condor's scheduling scheme is that it provides fair access to computing resources to both heavy and light users. Fair allocation is managed by the Up-Down algorithm, in which the controller maintains an index for each workstation. Initially, the indices are set to zero. Periodically, the indices are updated in the following manner: whenever a task submitted by a workstation is assigned to an idle workstation, the index of the submitting workstation is increased; if, on the other hand, the task is not assigned to an idle workstation, the index is decreased. The controller periodically checks to see whether any new foreign task is waiting for an idle workstation. If such a task is waiting, but no idle workstation is available and some foreign task from the lowest-priority workstation (i.e., the workstation with the highest index value) is running, then that foreign task is preempted and the freed workstation is assigned to the new foreign task. The preempted foreign task is transferred back to the workstation from which it originated.
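The Up-Down bookkeeping can be summarized in a few lines. The sketch below is illustrative only; the increment and decrement amounts and the tie-breaking rule are assumptions, since the description above does not specify them.

    class UpDownScheduler:
        """Illustrative Up-Down fairness index kept by a Condor-style controller."""
        def __init__(self, workstations):
            self.index = {ws: 0 for ws in workstations}   # indices start at zero

        def task_assigned(self, submitter):
            # The submitter is consuming remote capacity: raise its index (lower priority).
            self.index[submitter] += 1

        def task_denied(self, submitter):
            # The submitter is waiting without service: lower its index (higher priority).
            self.index[submitter] -= 1

        def victim_for(self, new_submitter, running_remote_tasks):
            # running_remote_tasks: {submitting workstation -> its remotely running task}.
            # Preempt a task from the lowest-priority submitter (highest index) when the
            # new submitter deserves the capacity more (it has a lower index).
            if not running_remote_tasks:
                return None
            worst = max(running_remote_tasks, key=lambda ws: self.index[ws])
            if self.index[worst] > self.index[new_submitter]:
                return running_remote_tasks[worst]        # this task is evicted and sent home
            return None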
11.10.4 The Stealth Distributed Scheduler

The Stealth Distributed Scheduler [19] differs from the V-System, Sprite, and Condor in the degree to which load distributing cooperates with local resource allocation at individual nodes. Like Condor and Sprite, Stealth is targeted for workstation environments in which the availability of a workstation's resources must be guaranteed to its owner. While Condor and Sprite rely on preemptive transfers to guarantee availability, however, Stealth accomplishes this task through preemptive local resource allocation.

A number of researchers and practitioners have noted that even when workstations are in use by their owners, they are often only lightly utilized, leaving large portions of their processing capacities unused. The designers of Stealth [19] observed that, over a network of workstations, this unused capacity represents a considerable portion of the total unused capacity in the system, often well over half. To exploit this capacity, Stealth allows foreign tasks to execute at workstations even while those workstations are in use by their owners. Owners are insulated from these foreign tasks through prioritized local resource allocation: Stealth includes a prioritized CPU scheduler, a unique prioritized virtual memory system, and a prioritized file system cache. Through these means, owners are assured that their tasks get the resources they need, while foreign tasks receive only the resources that are left over (which are generally substantial). In effect, Stealth replaces an expensive global operation (preemptive transfer) with a cheap local operation (prioritized allocation). By doing so, Stealth is able to increase the accessibility of unused computing capacity (by exploiting underused workstations as well as idle workstations) and to reduce the overhead of load distributing.

Task selection is fully automated under Stealth, and takes into account the availability of CPU and memory resources, as well as past successes and failures with the transfer of similar tasks under similar resource availability conditions. The remainder of Stealth's load distributing policy is identical to the stable sender-initiated adaptive policy discussed in Sec. 11.6. Because, under Stealth, preemptive transfers are not necessary to assure the availability of workstation resources to their owners, Stealth is able to use relatively cheap nonpreemptive transfers almost exclusively. Preemptive transfers are necessary only to prevent the starvation of foreign tasks.
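The kind of automated, history-informed selection decision described above can be sketched as follows. The feature set, the history table, and the acceptance rule are assumptions made for illustration, not details of the Stealth implementation.

    from collections import defaultdict

    class StealthStyleSelector:
        """Illustrative selection policy: consult past outcomes of transferring
        similar tasks under similar resource-availability conditions."""
        def __init__(self):
            # (task class, coarse CPU availability, coarse memory availability) -> [ok, failed]
            self.history = defaultdict(lambda: [0, 0])

        @staticmethod
        def _bucket(avail):
            return "high" if avail > 0.5 else "low"    # coarse bucketing (assumed)

        def record(self, task_class, cpu_avail, mem_avail, succeeded):
            key = (task_class, self._bucket(cpu_avail), self._bucket(mem_avail))
            self.history[key][0 if succeeded else 1] += 1

        def should_transfer(self, task_class, cpu_avail, mem_avail):
            key = (task_class, self._bucket(cpu_avail), self._bucket(mem_avail))
            ok, bad = self.history[key]
            if ok + bad == 0:
                return cpu_avail > 0.5 and mem_avail > 0.5   # no history: rely on availability
            return ok / (ok + bad) >= 0.5                    # transfer if it usually worked

    selector = StealthStyleSelector()
    selector.record("compile", 0.8, 0.7, succeeded=True)
    print(selector.should_transfer("compile", 0.9, 0.6))     # -> True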
11.11 TASK MIGRATION

The performance comparison of several load sharing algorithms (Sec. 11.7) showed that receiver-initiated task transfers can improve system performance at high system loads. However, receiver-initiated transfers (see Sec. 11.6.2) require preemptive task transfers, i.e., the transfers of partially executed tasks. Even though most systems do not operate at high system loads, an occasional occurrence of high system load can disrupt service to the users. If such circumstances are frequent, system designers may want to consider a preemptive task transfer facility. Also, some distributed schedulers for workstation environments guarantee the availability of a workstation's resources to its owner by preempting foreign tasks and migrating them to another workstation, while other distributed schedulers for this environment require preemptive task transfers to avoid starvation. Another situation wherein preemptive transfers are beneficial is when most of the system load originates
at a few nodes in the system (a heterogeneous workload). In this case, receiver-initiated task transfers result in improved system performance.

In this section, we focus on task migration facilities that allow preemptive transfers. At this point, it is necessary to make a distinction between task placement and task migration. Task placement refers to the transfer of a task that is yet to begin execution to a new location, where it starts its execution. Task migration refers to the transfer of a task that has already begun execution to a new location, where it continues its execution. To migrate a partially executed task to a new location, the task's state must be made available at the new location.

The general steps involved in task migration are:

1. State transfer: the transfer of the task's state to the new machine. The task's state includes such information as the contents of the registers, the task stack, whether the task is ready or blocked, the virtual memory address space, file descriptors, any temporary files the task might have created, and buffered messages. In addition, the current working directory, signal masks and handlers, resource usage statistics, references to child processes (if any), etc., may be maintained by the kernel as a part of the task's state [9] (the sketch following this list names these pieces as fields of a record). The task is suspended (frozen) at some point during the transfer so that the state does not change further, and then the transfer of the task's state is completed.

2. Unfreeze: the task is installed at the new machine and is put in the ready queue so that it can continue executing.
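The state enumerated in step 1 can be pictured as a record that the migration mechanism must assemble before the transfer. The sketch below simply names those pieces as fields; the field names and types are assumptions made for illustration, not an actual kernel structure.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class TaskState:
        """Illustrative bundle of the per-task state listed above."""
        registers: Dict[str, int] = field(default_factory=dict)        # CPU register contents
        stack: bytes = b""                                              # task stack
        scheduling_state: str = "ready"                                 # ready, blocked, ...
        address_space: Dict[int, bytes] = field(default_factory=dict)  # page number -> contents
        open_files: List[str] = field(default_factory=list)            # open file descriptors
        temp_files: List[str] = field(default_factory=list)            # temporary files created
        buffered_messages: List[bytes] = field(default_factory=list)
        # State typically kept by the kernel on the task's behalf:
        cwd: str = "/"
        signal_handlers: Dict[int, str] = field(default_factory=dict)
        resource_usage: Dict[str, float] = field(default_factory=dict)
        children: List[int] = field(default_factory=list)              # child process references

    # A snapshot of a frozen task would populate such a record before the transfer.
    state = TaskState(scheduling_state="frozen", cwd="/home/user", children=[4312])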
11.12 ISSUES IN TASK MIGRATION

In the design of a task migration mechanism, several issues play an important role in determining the efficiency of the mechanism. These issues include state transfer, location transparency, and the structure (organization) of the migration mechanism.

11.12.1 State Transfer

There are two important issues to be considered in state transfer. (1) The cost to support remote execution, which includes delays due to freezing the task, obtaining and transferring the state, and unfreezing the task. The lengthy freezing of a task during migration may result in the abortion of tasks interacting with it, as a result of timeouts. Hence, it is desirable that a migrating task be frozen for as little time as possible. (2) Residual dependencies, which refer to the amount of resources a former host of a preempted or migrated task continues to dedicate to servicing requests from the migrated task. The following are examples of where residual dependencies occur. (a) An implementation that does not transfer the entire virtual memory address space at the time of migration, but rather transfers pages to the new host as they are referenced [37]. (b) An implementation that requires a previous host to redirect messages meant for a migrated task to the new host of the migrated task. (c) Location-dependent system calls that use resources existing only at the home node; these system calls must be forwarded to the home node where the task originated [8, 19].

Residual dependencies are undesirable for three reasons, namely, reliability, performance, and complexity [9]. Residual dependencies reduce reliability because, if any one of the hosts from which the task previously migrated fails, the task might be unable to make progress. Residual dependencies affect the performance of the migrated task: since a memory access or a system call made by a migrated task may have to be redirected to a previous host, the communication delays of these remote operations can slow the task's progress. Residual dependencies also reduce the availability of previous hosts by increasing their loads due to remote operations initiated by tasks migrated from them. Finally, residual dependencies complicate the system's operations by distributing a task's state among several nodes [9]. For instance, the checkpointing and recovery of a process become much more complex if its state is distributed among many nodes. As another example, memory management may become complex (as explained under Accent below) because the memory management must distinguish between memory segments that belong to a local task and those that belong to a remote task. The situation may get much worse if a task migrates several times.

We next describe the state transfer mechanisms in the task migration facilities of several experimental distributed operating systems.

THE V-SYSTEM. The migration facility in the V-System [37] attempts to reduce the freezing time of a migrating task by precopying the state. In this technique, the bulk of the task state is copied to the new host before freezing the task, thereby reducing the time during which it is frozen. To precopy the state, after the new host for a task is selected, the task's complete address space is copied as an initial copy to the new host. Then, all the pages that were modified (dirty pages) during the copy are recopied. This recopying of dirty pages is repeated until the number of dirty pages is relatively small or until no significant reduction in the number of dirty pages is achieved. Finally, the task is frozen and the remaining dirty pages and the task's execution state are copied (see Fig. 11.9). A key point to note is that successive copying operations will presumably take less time than earlier copy operations, thereby allowing fewer modifications to occur during their execution.

While precopying the task state in this way reduces the time during which a migrating task is frozen, it increases the number of messages that must be sent to the new host, thus increasing the resource overhead caused by the migration. As a result, this method provides an advantage to migrating tasks at a performance cost to the tasks left behind at the sending host and the tasks already residing at the receiving host.
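A sketch of the precopying loop follows. The page-tracking interface and the termination thresholds are assumptions made for illustration, but the structure (initial full copy, repeated recopying of dirty pages, final copy of the remainder once the task is frozen) mirrors the description above.

    def precopy_migrate(address_space, run_one_round, send_to_new_host,
                        small_enough=8):
        """Illustrative V-System-style precopy loop.

        address_space:    dict page_number -> bytes (the task's pages)
        run_one_round:    callable that lets the task run briefly and returns the
                          set of page numbers it dirtied during that interval
        send_to_new_host: callable taking a dict of pages to transfer
        """
        # Initial copy: the complete address space, while the task keeps executing.
        send_to_new_host(dict(address_space))
        dirty = run_one_round()

        # Recopy dirty pages until few remain or no significant reduction is seen.
        while len(dirty) > small_enough:
            send_to_new_host({p: address_space[p] for p in dirty})
            newly_dirty = run_one_round()
            if len(newly_dirty) >= len(dirty):
                break                      # no significant reduction: stop precopying
            dirty = newly_dirty

        # Freeze the task (not modeled here) and transfer the remaining dirty pages
        # together with the execution state; only this last step stops the task.
        send_to_new_host({p: address_space[p] for p in dirty})

    # Toy usage with synthetic dirtying behaviour:
    pages = {i: bytes(4096) for i in range(64)}
    rounds = iter([set(range(20)), set(range(10)), set(range(3))])
    precopy_migrate(pages, lambda: next(rounds, set()), lambda chunk: None)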
SPRITE. Sprite [8] takes a different approach than the V-System to the transfer of the virtual address space of a migrating task to its new host. To reduce the time during which a task is frozen and to reduce the amount of data transferred to the new host, Sprite makes use of the location-transparent file access mechanism provided by its file system (see Sec. 9.5.2). All the modified pages of the migrating task are swapped out to the file server. The page tables and the file descriptors for the corresponding swap files are then sent to the new host of the task. The address space of the task is then
demand-paged in at the new host (see Fig. 11.9). Note that only pages that have been dirtied are swapped out. The file server generally stores the swapped pages in its cache to avoid slow disk accesses during migration. Further reduction in task freezing time can be obtained by the repeated swapping-out of dirty pages without freezing the task, until relatively few dirty pages are left.

FIGURE 11.9 Different techniques for the transfer of the virtual address space in the V-System, Sprite, and Accent (adapted from [8]).

ACCENT. Task migration in Accent [39] also tries to reduce the time during which a migrating task is frozen and the amount of data transferred to the new host. Reduction in migration time is achieved by using a feature called copy-on-reference. The motivation for this design comes from the observation that tasks use a relatively small part of their address space while executing, and hence the entire address space does not need to be copied to the new host.

The migration of a task in Accent involves copying the task's state (excluding its virtual memory address space), copying its memory maps (which provide addressing information about the virtual memory address space), and initiating the task at the new host. As the task executes at the new host, references to locations that are not present locally result in the creation of those memory segments locally. When the task references a location not present at its new host, the operating system invokes the copy-on-reference mechanism, which transfers the necessary page from the previous hosts (see Fig. 11.9).
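The copy-on-reference idea can be sketched as a page-fault handler that fetches a page from the previous host only when the migrated task first touches it. The interface below is invented for illustration; Accent actually works in terms of memory objects and kernel IPC rather than a Python callable.

    class CopyOnReferenceSpace:
        """Illustrative address space whose pages stay at the previous host until touched."""
        def __init__(self, fetch_from_previous_host):
            self.fetch = fetch_from_previous_host   # RPC-like callable: page number -> bytes
            self.local_pages = {}                   # pages already materialized locally

        def read(self, page_number):
            if page_number not in self.local_pages:
                # "Page fault": the page is not present locally, so the previous host
                # must service the request -- this is the residual dependency.
                self.local_pages[page_number] = self.fetch(page_number)
            return self.local_pages[page_number]

    # Toy usage: only the touched pages ever cross the network.
    remote = {n: bytes([n % 256]) * 4096 for n in range(1024)}   # pages left at the old host
    space = CopyOnReferenceSpace(lambda n: remote[n])
    space.read(3); space.read(3); space.read(7)
    print(len(space.local_pages))    # -> 2: untouched pages were never transferred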
By not copying the entire address space, the time required for migration and the time during which a process is frozen are reduced. However, if the migrated task accesses more than one fourth of its address space, the higher cost of fetching individual pages during remote execution outweighs the savings achieved during migration.

The disadvantages of the Accent migration facility are as follows:

• Memory management is complex because it must distinguish between the memory segments that are present locally and those that are not. If segments are not present locally, the memory subsystem must invoke remote procedure calls to a remote host to transfer the required memory segments. In addition, previous sites must be informed of the demise of memory objects.

• Previous hosts are burdened with servicing the migrated task's memory access requests and must commit memory resources for a remote task. Therefore, the cost due to residual dependencies in Accent can potentially be much higher than for the previous two mechanisms.

• Reduced fault tolerance. If one of the previous hosts of a migrated task fails, a remote task may have to abort because of the unavailability of some of its memory segments.

11.12.2 Location Transparency

Many distributed systems support the notion of location transparency, wherein services are provided to user processes irrespective of the location of the processes and services. In distributed systems that support task migration, it is essential that location transparency be supported. That is, task migration should hide the locations of tasks, just as a distributed file system hides the location of files. In addition, the remote execution of tasks should not require any special provisions in the programs. Location transparency in principle requires that names (e.g., process names, file names) be independent of their locations (i.e., host names). By implementing a uniform name space throughout the system, a task can be guaranteed the same access to resources independent of its present location of execution. In addition, the migration of tasks should be transparent to the rest of the system. In other words, any operation (such as signaling) or communication that was possible before the migration of a task should also be possible after its migration.

Typically, the mapping of names to physical addresses in distributed systems is handled in two ways. First, addresses are maintained as hints. If an access fails, hints can be updated either by multicasting a query or through some other means. This method poses no serious hindrance to task migration; the effect of a task migration in such a system is simply that hints maintaining the task's address are no longer correct. Second, an object can be accessed with the help of pointers. In such cases, whenever a task migrates, pointers may have to be updated to enable continued access to and from the new location. If the pointers are maintained by the kernel, then it is relatively easy to update them. On the other hand, if the pointers are maintained in the address space of tasks, then updating the pointers can become more difficult.
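The first of these two approaches (addresses kept as hints) is easy to sketch: an access is attempted at the hinted location, and a failure triggers a query that refreshes the hint. Everything below, including the authoritative lookup callable, is an invented stand-in for whatever multicast or name-service query a real system would use.

    class HintedLocator:
        """Illustrative name-to-host mapping maintained as hints."""
        def __init__(self, resolve):
            self.resolve = resolve      # authoritative (e.g., multicast) lookup: name -> host
            self.hints = {}             # possibly stale name -> host mappings

        def send(self, name, message, deliver):
            host = self.hints.get(name) or self.resolve(name)
            if not deliver(host, message):      # access failed: the hint was stale
                host = self.resolve(name)       # e.g., multicast a query for the task
                deliver(host, message)          # retry at the refreshed location
            self.hints[name] = host             # remember the (now correct) location
            return host

    # Toy usage: task "t1" has migrated from host A to host B, leaving a stale hint.
    location = {"t1": "B"}
    locator = HintedLocator(resolve=lambda name: location[name])
    locator.hints["t1"] = "A"
    print(locator.send("t1", "hello", deliver=lambda h, m: h == location["t1"]))   # -> B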
Transferring the entire state of a migrating task to the new location also aids in achieving location transparency. This allows most kernel calls to be local rather than remote. For example, the kernel at the new machine can handle the requests for virtual memory management, file I/O, IPC, etc.

SPRITE. In Sprite, location transparency is accomplished through several mechanisms: (1) a location-transparent distributed file system provides file service, (2) the entire state of a migrating task is made available at the new host, and therefore any kernel calls made will be local at the new host, and (3) location-dependent information (such as the current host of a task) is maintained at the home machine of the task. The home machine of a task is the machine on which the task would have executed if there had been no migration of the task at all. To maintain the location-dependent information of a task, a copy of its PCB is maintained at the home machine. This information is used for forwarding signals automatically: whenever a task signals another task, the signal is sent to the task's home machine, from which it is forwarded to the task's current location. Whenever a task forks off a child process, the task's home machine provides the task ID and updates its own data structure to reflect the existence of a new child process and its location. When a process terminates, a similar protocol is used to update the data structure at the process's home machine. Other location-dependent calls, such as the time of day, are also forwarded to a task's home machine to ensure that the task sees monotonically increasing clock values.

Sprite's mechanism leaves no residual dependency on any machine except the task's home machine, leaving the task vulnerable to failure of the home machine.

11.12.3 Structure of a Migration Mechanism

The first issue in the design of a task migration facility is deciding whether to separate the policy-making modules (see Sec. 11.4) from the mechanism modules (the modules responsible for collecting, transferring, and reinstating the state of migrating tasks). This decision is important, as it has implications for both performance and the ease of development. By separating the two, one can easily test different policies without having to change the mechanisms, and vice versa. Thus, the separation of policy and mechanism modules simplifies the development effort.

The second issue in the design of a task migration facility is deciding where the policy and mechanism modules should reside [1]. The first step in the migration of a task is to collect the task state. Typically, a part of the state (such as file pointers and references to child processes) is maintained in the kernel's data structures. In addition, the migration mechanism is closely intertwined with the interprocess communication (IPC) mechanisms, which are generally inside the kernel. Hence, the migration mechanism may best fit inside the kernel [1].

Policy modules decide whether a task transfer should occur. If the process of making these decisions is simple, the policy modules can be placed in the kernel, which makes the implementation more efficient. If the policy modules require large amounts of state information from the kernel to make decisions, then it also may be more efficient to place these modules in the kernel. If the policy modules do not impose a heavy overhead on the system due to their interactions with the kernel, then they fit best in utility processes. This approach is used in Charlotte [1], Sprite [9], and in [25].

Third, the interplay between the task migration mechanism and various other mechanisms plays an important role in deciding where a module resides. Typically, there will be interaction between the task migration mechanism, the memory management system, the interprocess communication mechanisms, and the file system. The mechanisms can be designed to be independent of one another, so that if one mechanism's protocol changes, the others' need not [1]. Another flexibility provided by this principle is that the migration mechanism can be turned off without interfering with other mechanisms. On the other hand, the integration of mechanisms can reduce redundancy as well as make use of existing facilities [1]. For example, Sprite simplified its migration mechanism design by storing the task state as a file and using the distributed file system for transferring the state to a new host. One serious disadvantage of integrated mechanisms, however, is that if one mechanism breaks down, all the other mechanisms that depend on it will also break down.

11.12.4 Performance

Comparing the performance of task migration mechanisms implemented in different systems is a difficult task because of the different hardware, operating systems, IPC mechanisms, file systems, policy mechanisms, etc., on which the mechanisms are based. In this section, we provide the performance figures for two implementations of task migration mechanisms.

SPRITE. The Sprite environment consists of a collection of SPARCstation 1 workstations connected by a local area network. Each workstation runs the Sprite operating system, whose kernel-call interface is much like that of 4.3 BSD UNIX [29]. The task migration mechanism makes use of a remote procedure call mechanism. A remote procedure call has a round-trip latency of about 1.6 milliseconds and a throughput of 480 to 660 Kbytes/second when issued on SPARC workstations (10 MIPS) connected through a 10 Mbits/second Ethernet [9].

Table 11.1 presents the costs associated with task migration. Note that the cost of migration depends on the size of the virtual address space and the number of open files.
TABLE 11.1
Costs associated with process migration in Sprite (adapted from [9])

Action                                                            Time/Rate
Select and release idle host                                      36 msec
Migrate "null" process                                            76 msec
Transfer details of open files                                    9.4 msec/file
Flush modified file blocks to the server                          480 Kbytes/sec
Flush modified pages                                              660 Kbytes/sec
Transfer exec arguments                                           480 Kbytes/sec
Fork, exec null process with migration, wait for child to exit    81 msec
Fork, exec null process locally, wait for child to exit           46 msec

The average time for migrating a task in Sprite is about 330 milliseconds. In Table 11.1, the time for migration does not include the cost of selecting and releasing a host. In Sprite, once a host is selected, many tasks can be migrated to it before releasing it so that it can be assigned to another host. Just selecting and releasing a host takes 36 milliseconds. The "Migrate null process" entry gives the overhead due solely to the migration mechanisms; this includes the cost of transferring the environment of the task. The exec arguments in the table refer to the command line arguments and environment variables.

CHARLOTTE. The Charlotte system consists of VAX-11/750 machines connected by a Pronet token ring [1]. In this system, it takes 11 milliseconds to send a 2-Kbyte packet to another machine, 0.4 millisecond to switch contexts between kernel and process, 10 milliseconds to transfer a single packet between processes residing on the same machine, and 23 milliseconds to transfer a packet between processes residing on different machines. The average elapsed time to migrate a small (32-Kbyte) linkless process is 242 milliseconds after an idle host has been found and has agreed to accept a remote task. Each additional 2 Kbytes of state information adds 12.2 milliseconds to the migration time. The cost of various operations performed during a migration is shown in Table 11.2.

TABLE 11.2
Costs associated with process migration in Charlotte (adapted from [1])

Action at sending host                 Time in msec    Action at receiving host       Time in msec
Handle an offer                        5.0             Handle an offer                5.4
Prepare 2-Kbyte information
  to transfer context                  2.6             Install 2-Kbyte information    1.2
Marshall context                       1.8             Demarshal context              1.2
Other (mostly kernel
  context switching)                   6.9             Other                          4.7

11.13 SUMMARY

Over the last decade, the mode of computing has shifted from mainframes to networks of computers, often networks of workstations. Such networks promise higher performance, better reliability, and improved extensibility over mainframe systems. To realize this high performance potential, a good load distributing scheme is essential. Load distributing schemes improve performance by transferring load from heavily loaded nodes to lightly loaded or idle nodes. If load transfers are to be effective in improving the system's performance, it is important that the metric used to measure the load at nodes characterizes the load properly. The CPU queue length has been found to be a good load indicator.

Load distributing algorithms have been characterized as static, dynamic, or adaptive. Static algorithms do not make use of system state information in making decisions regarding the transfer of load from one node to another. On the other hand, dynamic algorithms do make use of system state information when making decisions; therefore, these algorithms have the potential to outperform static algorithms. Adaptive algorithms are a special class of dynamic algorithms in that they adapt their activities, by dynamically changing the parameters of the algorithm, to suit the changing system state.

Load distributing algorithms can further be classified as load balancing or load sharing algorithms, based on their load distributing principle. Both types of algorithms strive to reduce the likelihood of an unshared state. Load balancing algorithms, however, go a step further by attempting to equalize the loads at all computers. Because a load balancing algorithm transfers tasks at a higher rate than a load sharing algorithm, the higher overhead incurred by load balancing algorithms may outweigh this potential performance improvement.

Typically, load distributing algorithms have four policy components: (1) a transfer policy that determines whether a node is in a suitable state to participate in a task transfer, (2) a selection policy that determines which task should be transferred, (3) a location policy that determines to which node a task selected for transfer should be sent, and (4) an information policy that is responsible for triggering the collection of system state information.

Based on which type of node initiates load distributing actions, load distributing algorithms have been widely referred to as sender-initiated, receiver-initiated, or symmetrically initiated algorithms. In sender-initiated algorithms, senders (overloaded nodes) look for receivers (underloaded or idle nodes) to transfer their load. In receiver-initiated policies, receivers solicit load from senders. A symmetrically initiated policy is a combination of both, where load sharing actions are triggered by the demand for extra processing power or extra work.

The task transfers performed for load distributing can be of two types, nonpreemptive and preemptive. In nonpreemptive transfers, tasks that have not yet begun execution are transferred. Preemptive transfers involve the transfer of tasks that have already begun execution. These transfers are expensive compared to nonpreemptive transfers, because the state of the tasks must also be transferred to the new location.

In this chapter, we described several load sharing algorithms, their performance, and the policies employed in several implementations of load distributing schemes. In addition, we discussed how several task migration implementations have tried to minimize the delay due to the transfer of state.

11.14 FURTHER READING

Rommel [30] presents a general formula for the probability that any one node in the system is underloaded while some other node in the system is overloaded. This probability can be used to define the likelihood of load sharing success in a distributed system.
The availability of idle CPU cycles in a network of workstations is discussed by Mutka and Livny in [26] and by Mutka in [27]. A discussion on the selection of tasks suitable for remote execution can be found in [28]. Utopia [42] is a load sharing facility for large, heterogeneous distributed systems.

In [22], Lin and Keller present a gradient-model load balancing method for a multiprocessor system. Tilborg and Wittie describe a wave scheduling scheme for a network of computers in [38]. In [2], Baumgartner and Wah present a load balancing scheme that has been implemented in a network of Sun workstations.

In [13], Hac discusses an algorithm for improving performance through file replication, file migration, and process migration.

In [17], Kremien and Kramer study the performance, efficiency, and stability of many load sharing algorithms.

In [5], Casavant and Kuhl describe a taxonomy of scheduling schemes for distributed systems. In [12], Eskicioglu presents a bibliography of process migration schemes. Smith discusses a survey of process migration schemes. Jacqmot and Milgrom present a survey of load distributing schemes that have been implemented on UNIX-based systems in [15].

PROBLEMS

11.1 Identify the actions that belong to the transfer policy in the load sharing scheme of the V-System.

11.2 Identify the actions that belong to the location policy in the load sharing scheme of the V-System.

11.3 Discuss how well the three load sharing implementations of Sec. 11.10 satisfy the scalability criterion.

11.4 Under what condition will process migration in the V-System fail to satisfy the stability criterion discussed in Sec. 11.5?

11.5 Predict the performance of the receiver-initiated load sharing algorithm when the entire system workload is generated at only a few nodes in the system instead of equally at all the nodes in the system. (Hint: performance depends on how successful receivers will be in locating senders.)

11.6 Identify all the overheads in a load sharing policy.

11.7 Sender-initiated algorithms cause system instability at high system loads. Predict, analytically, at what system load the instability will occur. Assume ProbeLimit = 5, the average service requirement of a task = 1 second, and the overhead incurred by a processor to poll or to reply to a poll = 3 milliseconds.

REFERENCES

1. Artsy, Y., and R. Finkel, "Designing a Process Migration Facility: The Charlotte Experience," IEEE Computer, vol. 22, no. 9, Sept. 1989, pp. 47-56.
2. Baumgartner, K. M., and B. W. Wah, "GAMMON: A Load Balancing Strategy for Local Computer Systems with Multiaccess Networks," IEEE Transactions on Computers, vol. 38, no. 8, Aug. 1989, pp. 1098-1109.
3. Bryant, R. M., and R. A. Finkel, "A Stable Distributed Scheduling Algorithm," Proceedings of the 2nd International Conference on Distributed Computing Systems, Apr. 1981, pp. 314-323.
4. Cabrera, L. F., "The Influence of Workload on Load Balancing Strategies," Proceedings of the Summer USENIX Conference, June 1986, pp. 446-458.
5. Casavant, T. L., and J. G. Kuhl, "A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems," IEEE Transactions on Software Engineering, vol. 14, no. 2, Feb. 1988.
6. Casavant, T. L., and J. G. Kuhl, "Effects of Response and Stability on Scheduling in Distributed Computing Systems," IEEE Transactions on Software Engineering, vol. 14, no. 11, Nov. 1988, pp. 1578-1587.
7. Chou, T. C. K., and J. A. Abraham, "Load Balancing in Distributed Systems," IEEE Transactions on Software Engineering, vol. 8, no. 4, July 1982.
8. Douglis, F., and J. Ousterhout, "Process Migration in the Sprite Operating System," Proceedings of the 7th International Conference on Distributed Computing Systems, Sept. 1987, pp. 18-25.
9. Douglis, F., and J. Ousterhout, "Transparent Process Migration: Design Alternatives and the Sprite Implementation," Software Practice and Experience, vol. 21, no. 8, Aug. 1991, pp. 757-785.
10. Eager, D. L., E. D. Lazowska, and J. Zahorjan, "A Comparison of Receiver-Initiated and Sender-Initiated Adaptive Load Sharing," Performance Evaluation, North-Holland, vol. 6, no. 1, March 1986, pp. 53-68.
11. Eager, D. L., E. D. Lazowska, and J. Zahorjan, "Adaptive Load Sharing in Homogeneous Distributed Systems," IEEE Transactions on Software Engineering, vol. 12, no. 5, May 1986, pp. 662-675.
12. Eskicioglu, M. R., "Process Migration: An Annotated Bibliography," Newsletter, IEEE Computer Society Technical Committee on Operating Systems and Application Environments, vol. 4, no. 4, Winter 1990.
13. Hac, A., "A Distributed Algorithm for Performance Improvement Through File Replication, File Migration, and Process Migration," IEEE Transactions on Software Engineering, vol. 15, no. 11, Nov. 1989, pp. 1459-1470.
14. Hagmann, R., "Process Server: Sharing Processing Power in a Workstation Environment," Proceedings of the 6th International Conference on Distributed Computing Systems, May 1986, pp. 260-267.
15. Jacqmot, C., and E. Milgrom, "UNIX and Load Balancing: A Survey," European UNIX User Group, Spring Conference, Apr. 1989, pp. 1-15.
16. Kleinrock, L., Queueing Systems, vol. 1: Theory, John Wiley & Sons, New York, 1975.
17. Kremien, O., and J. Kramer, "Methodical Analysis of Adaptive Load Sharing Algorithms," IEEE Transactions on Parallel and Distributed Systems, vol. 3, no. 6, Nov. 1992, pp. 747-760.
18. Krueger, P., Distributed Scheduling for a Changing Environment, PhD thesis, University of Wisconsin-Madison, available as Technical Report 780, June 1988.
19. Krueger, P., and R. Chawla, "The Stealth Distributed Scheduler," Proceedings of the 11th International Conference on Distributed Computing Systems, May 1991, pp. 336-343.
20. Krueger, P., and R. Finkel, "An Adaptive Load Balancing Algorithm for a Multicomputer," Technical Report 539, University of Wisconsin-Madison, 1984.
21. Krueger, P., and M. Livny, "The Diverse Objectives of Distributed Scheduling Policies," Proceedings of the 7th International Conference on Distributed Computing Systems, Sept. 1987, pp. 242-249.
22. Lin, F. C. H., and R. M. Keller, "The Gradient Model Load Balancing Method," IEEE Transactions on Software Engineering, vol. 13, no. 1, Jan. 1987, pp. 32-38.
23. Litzkow, M. J., M. Livny, and M. W. Mutka, "Condor - A Hunter of Idle Workstations," Proceedings of the 8th International Conference on Distributed Computing Systems, June 1988, pp. 104-111.
24. Livny, M., and M. Melman, "Load Balancing in Homogeneous Broadcast Distributed Systems," Proceedings of the ACM Computer Network Performance Symposium, Apr. 1982, pp. 47-55.
