
Snyder et al. Journal of Cloud Computing: Advances, Systems and Applications (2015) 4:11
DOI 10.1186/s13677-015-0036-6

RESEARCH  Open Access

Evaluation and design of highly reliable and highly utilized cloud computing systems

Brett Snyder1, Jordan Ringenberg3, Robert Green2*, Vijay Devabhaktuni1 and Mansoor Alam1

Abstract
The cloud computing paradigm has ushered in the need to provide resources to users in a scalable, flexible, and
transparent fashion much like any other utility. This has led to a need for developing evaluation techniques that can
provide quantitative measures of reliability of a cloud computing system (CCS) for efficient planning and expansion.
This paper presents a new, scalable algorithm based on non-sequential Monte Carlo Simulation (MCS) to evaluate
large-scale CCS reliability, and it develops appropriate performance measures. Also, a new
iterative algorithm is proposed and developed that leverages the MCS method for the design of highly reliable and
highly utilized CCSs. The combination of these two algorithms allows CCSs to be evaluated by providers and users
alike, providing a new method for estimating the parameters of service level agreements (SLAs) and designing CCSs
to match those contractual requirements posed in SLAs. Results demonstrate that the proposed methods are
effective and applicable to systems at a large scale. Multiple insights are also provided into the nature of CCS reliability
and CCS design.
Keywords: Cloud computing; Reliability; System design; Monte Carlo simulation

Introduction
Cloud computing provides a cost-effective means of transparently providing scalable computing resources to match the needs of individual and corporate consumers. Despite the heavy reliance of society on this new technological paradigm, failure and inaccessibility are quickly becoming a major issue. Current reports state that up to $285 million yearly have been lost due to such failures, with an average of 7.74 hours of unavailability per service per year (about 99.91 % availability) [1–3]. Despite these outages, rapid adoption of cloud computing has continued for the mission-critical aspects of the private and public sectors, particularly due to the fact that industrial partners are unaware of this issue [2, 3]. This is particularly disconcerting considering President Obama's $20 billion Federal Cloud Computing Strategy and the rapid migration of government organizations like NASA, the Army, the Federal Treasury, Alcohol, Tobacco, and Firearms, the Government Service agency, the Department of Defense, and the Federal Risk and Authorization Management Program to cloud based IT services [4, 5]. Furthermore, companies such as Netflix, IBM, Google, and Yahoo are heavily investing in cloud computing research and infrastructure to enhance the reliability, availability, and security of their own cloud based services [6–8].

Thus, from the user's perspective, there is a great need to build a highly available and highly reliable cloud. Cloud providers feel the necessity to provide not only high levels of availability and reliability to meet quality-of-service (QoS) requirements and service level agreements (SLAs), but also to build a highly utilized system, with hopes of leading to higher profitability. Under these considerations, maximal utilization of a cloud computing system's (CCS's) resources is in direct conflict with the cloud user's interest in high reliability and availability. In other words, the provider is willing to allow a degradation in reliability as long as its profitability continues since, in reality, it is the user, not the provider, that pays the economic consequences of cloud failures. Note that from a user-based, SLA-driven perspective, reliability refers to the ability of the cloud to serve the user's need over some time period and does not refer to simple failures within a CCS that do not hinder user service.

*Correspondence: [email protected]
2 Department of Computer Science, Bowling Green State University, 1001 E. Wooster St., 43403 Bowling Green, OH, USA
Full list of author information is available at the end of the article

© 2015 Snyder et al. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.

This need to provide a highly reliable, uninterrupted cloud service while effectively utilizing all available resources is highly desired by cloud providers and users and clearly demonstrates a gap in current CCS research, calling for the establishment of efficient methods which can quantitatively evaluate and design CCSs based on the competing needs of users (reliability) and providers (utilization). As such, the goal of this study is the design and evaluation of CCSs considering stochastic failures in the CCS as well as stochastic virtual machine (VM) requests.

In order to achieve this goal, this study makes multiple contributions including 1) Developing a computationally efficient method for evaluating the reliability of CCSs using non-sequential Monte Carlo simulation (MCS) considering stochastic hardware failures and VM requests, 2) Extending this new model in order to design highly reliable and utilized CCSs based on potential workloads, and 3) Discussing the practical implications of the proposed technique. As opposed to most previous work, the proposed method 1) Focuses on simulation-based analysis, 2) Is highly scalable due to the use of MCS, and 3) Uses a newly developed, intuitive system representation.

The remainder of this paper is organized as follows: Section "Related works" reviews background literature that is pertinent to the proposed methodology; Section "Proposed methodologies" presents the newly proposed application of non-sequential MCS, its formulation for assessing the reliability of a CCS, and its use in a new, iterative algorithm for designing highly reliable and highly utilized CCSs; Section "Experimental results" details the experimental results achieved, including CCS test systems designed and evaluated using the proposed methods; Section "Discussion" presents a discussion and comments on using non-sequential MCS as a tool for CCS reliability assessment and the role of this technology in SLAs. Insights gathered during CCS reliability assessments and CCS design are also given in Section "Practical implications"; and, finally, Section "Conclusion" concludes the paper with a summary as well as directions for future work.

Related works
Cloud computing reliability
Many works reference the terms reliability and availability when focused on CCSs. In most cases, though, the terms refer to increasing system stability through active management [9] or redundancy [10, 11]. While these works begin to lay a strong foundation in this area, they also expose certain gaps in knowledge. Most of these works tend to evaluate either some aspect of QoS or the impact of hardware failures. Many of the initial works focus on the use of Markov chains [12–16], as a CCS is effectively a complex network availability problem. Other works focus on conceptual issues [17–19], hierarchical graphs [20], the use of grid computing for dynamic scalability in the cloud [21], and priority graphs [22], or the development of performance indices [23].

When considering QoS, one of the largest bodies of work has been completed by Lin and Chang [24–29]. These works develop a sequential and systematic methodology based on capacitive flow networks for maintaining the QoS of a CCS with an integrated maintenance budget. The main focus of the model is maintaining acceptable transmission times between clients and providers given a certain budget. The work developed in [30] presents a hierarchical method for evaluating availability of a CCS that is focused on the response time of user requests for resources. The majority of the work deals with VM failure rates, bandwidth bottlenecks, response time, and latency issues. The demonstrated solutions to these issues are the use of their newly developed architecture along with request redirection. A similar, though only conceptual, approach is developed in [31, 32] where a Fault Tolerance Manager (FTM) is developed and inserted between the System and Application layers of the CCS. Another approach to this issue is an optimal checkpointing strategy that is used to ensure the availability of a given system [33, 34]. Other methods of approaching fault tolerance from a middleware perspective can be found in [20, 35].

While the previous works have dealt mainly with the modeling of user requests and data transmission, another important aspect of system failure in a CCS is the failure of hardware. The state-of-the-art in this area is embodied in five main works that focus on evaluating data logs from multiple data centers and/or consumer PCs. The evaluation of these logs begins in [36] where hardware failures of multiple data centers are examined to determine explicit rates of failure for different components, namely disks, CPUs, memory, and RAID controllers. The most important finding of this paper is that the largest source of failure in such data centers is disk failure. Intermittent hardware errors are evaluated in [37].

This work continues in [38] where failures in CPU, DRAM, and disks in consumer PCs are evaluated. Special attention is paid to recurring faults as the work suggests that once a PC component fails, it is much more likely to fail again. The paper also examines failures that are not always noticeable to an end-user, such as 1-bit failures in DRAM. A thorough evaluation of failures and reliability at all levels of the CCS is found in [39].

Instead of focusing on internal hardware failures, Gill et al. focus on network failures in data centers [40, 41].

These studies conclude that 1) Data center networks are highly reliable, 2) Switches are highly reliable, 3) Load balancers most often experience faults due to software failures, 4) Network failures typically cause small failures that lose a large number of smaller packets, and 5) Redundancy is useful, but not a perfect solution.

An interesting companion to the study of hardware failures is the large-scale performance study performed in [42]. While this study does not explicitly focus on failures or reliability, it does provide a thorough analysis of resource utilization and general workloads in data centers. The work evaluates the utilization of various hardware pieces including CPUs, memory, disks, and entire file systems.

Monte Carlo simulation
MCS is a stochastic simulation tool which is often used to evaluate complex systems as it remains tractable regardless of dimensionality. The MCS algorithm comes in two varieties: non-sequential and sequential. Sequential MCS is typically used to evaluate complex systems that require some aspect of time dependence. Because of this, this variant of the algorithm requires more computational overhead and takes longer to converge. Non-sequential MCS (referred to as MCS for the remainder of this study) exhibits a higher computational efficiency than sequential MCS. The downside of the non-sequential MCS algorithm is that convergence time typically increases with problem dimensionality or system size. Also note that the rate of convergence for MCS is 1/√N, where N is the number of samples drawn. This means that the convergence rate does not depend upon dimensionality, allowing MCS to handle problems with a large state space. While runtime can become an issue, it is easily handled as the MCS algorithm is highly parallel and, in the case of long-running simulation requirements, may be easily parallelized in order to quickly simulate complex systems.

The general non-sequential MCS algorithm used for evaluating a CCS in this study is shown in Fig. 1. As the general operation of the MCS requires the repeated sampling of a state space and the evaluation of those states sampled, all four steps of the MCS algorithm (sampling, classification, calculation, and convergence) are dependent on an efficient representation of individual states. This representation, as well as further details regarding the implementation of MCS in this study, is detailed in the following section.

Fig. 1 General MCS algorithm. The generic algorithm used for evaluating system reliability. Note that the "Classify Sampled State" and "Perform Calculations" steps are modified in any implementation

Proposed methodologies
This section presents a review of the non-sequential MCS algorithm in a formulation applicable to CCS reliability evaluation. While this formulation is focused on evaluating the reliability of a CCS, this same algorithm can be used to 1) Evaluate the reliability of an already existing CCS under various loads (or, potentially, in real time) and 2) Design a CCS with a high level of reliability that is also highly utilized. As such, this section also presents an iterative algorithm for the design of a highly reliable and highly utilized CCS.

Such a simulation-based technique is required because, when hardware resources are considered, it is important to look beyond simple calculations that determine whether or not enough resources are available. A more complex issue is calculating the amount of resources required in light of the stochastic failure rates of hardware resources in the system, coupled with varying user requests for VMs. In such a case, one must look at the state of the system across multiple "snapshots" of existence in order to ensure that enough resources will be available to handle the workload, even when some portion of hardware fails or general usage increases. Non-sequential MCS allows for such an analysis.

System evaluation using MCS
As described in the previous section, the MCS algorithm is highly dependent on an efficient method for state representation in order to achieve convergence through the iterative process of sampling a state, classifying a state, performing any necessary calculations, and then checking
convergence. Each of these algorithmic steps is discussed in the subsections below. As the study is focused on evaluating and designing systems with a high level of reliability (the probability of the system functioning during some time period, t) from a user-based perspective, throughout this work the assumption is maintained that the system is measured and evaluated while in use. In other words, unallocated resources and their failures are not considered.

The following sections describe the state representation used in the MCS algorithms as well as each stage of the MCS algorithm used in this study (sampling, classification, and determining convergence). According to the process defined in Fig. 1, the MCS algorithm will use the state representation to repeatedly sample the state space, classify each sampled state, and then determine convergence based on these details.

State representation
In this study, we consider the modeling of a single server that exists inside of a CCS. Such a server can be represented as a Y-bit bit field, X, where Y is the number of resource types being considered. Using a bit field representation is not a new concept, as it is commonly used in a variety of disciplines and problem formulations, but the authors are unaware of any use of this methodology to represent CCSs. In the proposed representation, each bit represents the state of a resource; a "1" denotes an up/functioning state and a "0" a down/failed state. This type of state is depicted in Fig. 2. Furthering this representation, the state of a single server can be distilled to a single bit according to (1), where S is a single state with I resources each represented as Xi.

  S = ∏_{i=1}^{I} Xi    (1)

With this methodology, an entire CCS may be represented as a binary vector with each bit representing the state of a single server, either failed (0) or functioning (1). Since each server can take on 2 possible states, the entire state space will consist of 2^N states, where N is the total number of servers. Again, this provides a highly expandable framework for representing and evaluating very large CCSs (i.e. adding a single bit to the binary CCS vector for each additional server). This state representation scheme is highly advantageous, allowing for a high level of customization and extensibility, leading to an array of variations that should be able to model all available cloud computing service types (i.e. IaaS, SaaS, PaaS, etc.). The only change needed to consider an additional resource type is appending an extra bit to each server's binary state string. For example, if there was a need to extend this model to include a network interface card (NIC) on each server, the bit representation could simply be extended by a single digit. This could be done for any variety of resources.

Fig. 2 MCS state representation. An example showing the states of two individual servers. The server on the left has failed while the server on the right has not failed

One objection that may be raised to this methodology is the lack of inclusion of partially failed, de-rated, or grey states. Such states do play an important role, particularly when considering specific resources. For example, portions of a hard drive may be marked as damaged or unusable and, thus, excluded from the total resources available. Though, as the state model is highly malleable, de-rated states may be included through the use of a three-or-more state model, where the 0/1 model currently suggested is replaced by a 0/1/2 model in which zero represents a completely failed resource, one represents a derated resource, and two represents a fully functioning resource. For the purposes of this research, such an extension is left for future work.

For the simulations performed in this study, servers are considered as consisting of CPU, memory, hard disk drive (HDD), and bandwidth resources, or P, M, H, and B respectively. Thus, the state of a single server is represented as a 4-bit bit field (e.g. a state of 1101 represents a server with CPU, memory, and bandwidth in up states and the HDD in a failed state). This state clearly represents the IaaS model of cloud computing (providing requested infrastructure resources) and is chosen as IaaS is the foundation for other types of services (i.e. SaaS is built upon PaaS which is, in turn, built upon IaaS). Accordingly, this state space representation may be
expanded to encompass resources specific to each of these models.
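To make the representation concrete, the following minimal Java sketch (our own illustration, not the authors' simulation code; the class name ServerState is hypothetical) encodes the P, M, H, and B bits of a single server and reduces them to the single server bit S of (1).

import java.util.Arrays;

/** Minimal sketch of the bit-field server state; illustrative only. */
public class ServerState {
    // Bit order follows the paper: P (CPU), M (memory), H (HDD), B (bandwidth); true = up.
    private final boolean[] bits;

    public ServerState(boolean cpu, boolean mem, boolean hdd, boolean bw) {
        this.bits = new boolean[] { cpu, mem, hdd, bw };
    }

    /** Eq. (1): S is the product of the X_i bits, so the server is up only if every resource is up. */
    public int serverBit() {
        for (boolean up : bits) {
            if (!up) {
                return 0;
            }
        }
        return 1;
    }

    public static void main(String[] args) {
        ServerState s = new ServerState(true, true, false, true); // state 1101: HDD failed
        System.out.println(Arrays.toString(s.bits) + " -> S = " + s.serverBit()); // S = 0
        // A whole CCS is then a binary vector of such server bits, one per server,
        // giving a state space of 2^N states for N servers.
    }
}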

Sampling
In order to effectively sample a state from a given state space, a uniform distribution, u, is used. Since the reliability of a device is exponentially distributed according to its annual failure rate (AFR), each uniformly distributed number is transformed into an exponentially distributed number. Thus, ui is transformed into an exponentially distributed random number, ri, using the well-known inversion method, according to (2). An AFR represents the estimated probability that a device will fail during a full year of use. In this study, all AFR values are derived from the work found in [36–42].

The binary state string, X, is constructed by generating a series of random values that are compared to each resource's AFR. Specifically, the value of any given location in the state string will be determined by comparing ri to the AFR of resource i according to (3).

  ri = −ln(1 − ui)/AFRi    (2)

  Xi = { 0 if ri ≤ AFRi; 1 otherwise }    (3)

Note that AFR is a simplistic measure of system availability in contrast to a more robust measure like the forced outage rate (FOR). This is because AFR does not take into account the combination of failure and repair rates that a measure like FOR encompasses. As this is an exploratory study, the authors chose AFR rather than FOR due to the lack of accurate repair and failure rates for CPUs, HDDs, memory, etc.
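As a hedged illustration of this sampling step, the sketch below applies the inversion transform of (2) and the comparison of (3) to build one server's state string; the AFR values are those of Table 1, and the class and method names are our own.

import java.util.Arrays;
import java.util.Random;

/** Sketch of the state-sampling step using Eqs. (2) and (3); illustrative only. */
public class StateSampler {
    // AFRs from Table 1: CPU, memory, HDD, bandwidth.
    private static final double[] AFR = { 0.02, 0.01, 0.08, 0.01 };
    private final Random rng = new Random();

    /** Builds the binary state string X for one server. */
    public int[] sampleServerState() {
        int[] x = new int[AFR.length];
        for (int i = 0; i < AFR.length; i++) {
            double u = rng.nextDouble();                 // uniform u_i in [0, 1)
            double r = -Math.log(1.0 - u) / AFR[i];      // Eq. (2): inversion method
            x[i] = (r <= AFR[i]) ? 0 : 1;                // Eq. (3): 0 = failed, 1 = up
        }
        return x;
    }

    public static void main(String[] args) {
        StateSampler sampler = new StateSampler();
        System.out.println(Arrays.toString(sampler.sampleServerState())); // e.g. [1, 1, 0, 1]
    }
}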
State classification
The state classification step of MCS relies on a straightforward comparison of the resources requested and resources available as a measure of system adequacy. Thus, a state will be sampled and the provided resources are compared to those available. For the system to adequately supply the needed resources, the relation in (4) must hold for each individual CCS resource as defined below.

  Yrequested ≤ Yavailable    (4)

When a CCS supplies more resources than are requested, the system will be in a functioning state. In any other case the system will have failed. The mathematics of this method are shown in (5)–(8). Note that this methodology may easily be extended to any number of resources including databases, software packages, etc.

  Yrequested = Σ_{v=0}^{V} Yv    (5)

  Yavailable = Σ_{s=0}^{S} Ys    (6)

  Ycurtailed = { 0 if Yrequested ≤ Yavailable; 1 otherwise }    (7)

  Sx = { 0 if Σ_Y Ycurtailed > 0; 1 otherwise }    (8)

It should be noted that this is an approximation of a real-world scenario. In reality, the assignment and usage of resources is more accurately calculated using a bin packing formulation, an extension that is currently slated for future work.

Determining convergence
In order to evaluate system level performance using MCS, some measure must be calculated in order to determine convergence of the algorithm. As the goal of this study is the evaluation of reliability and utilization, the metric of interest is R, the probability that a CCS will be encountered in a functional state; it is defined in (9) and (10) in terms of the ratio of failed states sampled to total states sampled (K). While R is the metric of interest in this study, convergence is determined by the metric F, the probability that the CCS will be found in a failed state. In order to determine convergence, the variance (σ²) and standard deviation (σ) of the F value are calculated as defined in (11)–(12). Note that it is well known that MCS converges at a rate of 1/√N and that a more detailed derivation of (9)–(12) for MCS can be found in [43].

  F = (1/K) Σ_{x=1}^{K} Sx    (9)

  R = 1 − F = 1 − (1/K) Σ_{x=1}^{K} Sx    (10)

  σ²(F) = (F − F²)/K    (11)

  σ(F) = √V(F) / F    (12)
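A minimal sketch of the adequacy test of (4)-(8) is given below, assuming the per-resource totals of (5) and (6) have already been aggregated over the requested VMs and the sampled servers; the class name is our own and the example values are taken from Table 3.

/** Sketch of the state-classification step, Eqs. (4)-(8); illustrative only. */
public class StateClassifier {
    /**
     * requested[y] and available[y] hold the aggregated totals of Eqs. (5) and (6)
     * for each resource y in {P, M, H, B}. Returns S_x per Eqs. (7)-(8):
     * 0 if any resource is curtailed, 1 otherwise.
     */
    public static int classify(double[] requested, double[] available) {
        int curtailedSum = 0;
        for (int y = 0; y < requested.length; y++) {
            // Eq. (7): a resource is curtailed when the relation of Eq. (4) does not hold.
            curtailedSum += (requested[y] <= available[y]) ? 0 : 1;
        }
        // Eq. (8): the sampled state is classified as failed if any curtailment occurred.
        return (curtailedSum > 0) ? 0 : 1;
    }

    public static void main(String[] args) {
        double[] requested = { 196, 196, 9800, 9800 };     // cores, GB, GB, Mbps (Table 3)
        double[] available = { 400, 400, 10000, 50000 };
        System.out.println("S_x = " + classify(requested, available)); // S_x = 1
    }
}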
Convergence criteria
The main driver behind the convergence of the MCS algorithm is the sampling of failure states. Accordingly, highly
reliable CCS systems will exhibit few such states and will take longer to converge than a system with a state space containing an abundance of failure states. The sampling of failed states drives σ(R) towards 0 to provide an accurate estimate of R. In this study, there are two rules for determining whether the non-sequential MCS algorithm has converged:

  (iterations > 10 and σ(R) < 0.080)    (13)

or

  (iterations > 20,000 and R > 0.999999).    (14)

The first convergence criterion provides early termination for simulations that have an extremely low R after the first 10 samples (generally, a highly unreliable system). The second convergence criterion keeps highly reliable (R > 0.999999) CCSs from running for long periods of time due to the very sparse distribution of failed states in the state space.
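The bookkeeping implied by (9)-(14) can be sketched as follows; the counters are updated with the S_x value of each classified state, and converged() mirrors the two stopping rules as they are stated. This is our own illustrative reading of the formulas, not the authors' implementation.

/** Sketch of the convergence bookkeeping of Eqs. (9)-(14); illustrative only. */
public class ConvergenceTracker {
    private long k = 0;        // number of states sampled so far (K)
    private long sumSx = 0;    // running sum of the sampled S_x values

    public void record(int sX) {
        k++;
        sumSx += sX;
    }

    public double f() { return (double) sumSx / k; }   // Eq. (9)
    public double r() { return 1.0 - f(); }            // Eq. (10)

    /** Eq. (11): variance of the estimate, (F - F^2) / K. */
    public double variance() {
        double fHat = f();
        return (fHat - fHat * fHat) / k;
    }

    /** Eq. (12), the coefficient-of-variation form sqrt(V)/value, evaluated for R as rule (13) uses it. */
    public double sigmaOfR() {
        return Math.sqrt(variance()) / r();
    }

    /** The two stopping rules, Eqs. (13) and (14). */
    public boolean converged() {
        return (k > 10 && sigmaOfR() < 0.080)
            || (k > 20000 && r() > 0.999999);
    }

    public static void main(String[] args) {
        ConvergenceTracker t = new ConvergenceTracker();
        java.util.Random rng = new java.util.Random();
        while (!t.converged()) {
            t.record(rng.nextDouble() < 0.99 ? 1 : 0);  // stand-in stream of classified states
        }
        System.out.println("K = " + t.k + ", sigma(R) = " + t.sigmaOfR());
    }
}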
System design using MCS
While the proposed implementation of MCS is focused on evaluating the reliability of complex CCSs, the same algorithm also has applications in designing highly reliable, highly utilized CCSs. In this study an iterative algorithm for designing such a system (and, potentially, expanding that system) is developed. Though this algorithm is used in this study for the design of test systems (i.e. model systems that are used for testing the proposed MCS algorithm), the algorithm may also be used for the planning and design of highly reliable, highly utilized CCSs under predicted loads. The algorithm itself relies on the prospect of increasing the amount of available resources that currently cause resource inadequacies. The additional resources yield simultaneous increases in CCS reliability as well as overall resource utilization. The novel algorithmic process of enhancing CCS reliability and resource utilization when the addition of resources is possible is summarized in Algorithm 1, where Rdesired and UTILdesired are the desired levels of system reliability and system utilization, Ractual and UTILactual are the measured levels of system reliability and utilization, and VMcount is the number of VMs currently allocated. While this algorithm may appear to be deterministic, the MCS algorithm embedded inside of it is stochastic and considers the probabilistic failure of available system resources. This means that the amount of resources available is not increased simply to accommodate the amount of resources requested (which is a simple calculation). Instead, the algorithm increases the amount of resources available in order to handle resource requests while also considering stochastic failures of system resources, thus solving a much more difficult problem and leading to a more robust system design.

Algorithm 1 Basic algorithm for iteratively developing a highly reliable, highly utilized cloud computing system
  Choose 0 ≤ Rdesired ≤ 1
  Choose 0 ≤ UTILdesired ≤ 1
  Choose VMcount ≥ 0
  Ractual, UTILactual ← MCS Algorithm (considers stochastic resource failures)
  while (Ractual < Rdesired) and (UTILactual < UTILdesired) do
    for each Resource, Y ∈ {P, M, H, B} do
      if Resourcerequested > Resourceavailable then
        Increase Resourceavailable
      end if
    end for
    Ractual, UTILactual ← MCS Algorithm (considers stochastic resource failures)
    // If necessary, change the number of VMs to achieve the desired result
    Choose VMcount ≥ 0 to achieve the desired CCS load
  end while

If the addition of resources to the CCS is infeasible, one way to control the reliability is to perturb the number of VM allocations as opposed to adjusting the amount of resources. The optimal number of maximum VM allocations to satisfy a pre-specified reliability threshold can be easily calculated by repeatedly applying the non-sequential MCS algorithm. After each MCS iteration, the number of VMs allocated is adjusted up or down if the reliability is higher or lower than the threshold, respectively. Conversely, if the reliability is at a desired level and resources cannot be added to improve resource utilization (and the number of maximum VM allocations is satisfactory), excess, under-utilized resources can be removed from the CCS and re-purposed.

Experimental results
This section will present the implementation details of the simulation software as well as an overview of test system design and the different VM allocation schemes that are used. Actual results from CCS reliability simulations are also introduced and analyzed.

Implementation notes
The simulation software is implemented in Java 7 (using IntelliJ IDEA 12) and is run on a Dell Inspiron E6430 with a 64-bit version of Windows 7, an Intel Core i7 @ 2.4 GHz,
and 8GB RAM. All of the results are stored in a MySQL database for future analysis.

In each simulation, a CCS is abstracted to consist of a finite pool of available and requested resources. The available resources correspond to the servers that compose a cloud. The requested resources consist of the VMs allocated on the cloud. The specific resources focused on within this study include CPUs, memory, hard disk drives (HDD), and bandwidth. Each simulation resource is tied to a pre-specified AFR. The specific AFRs used in this study are shown in Table 1. The AFRs used were gathered and estimated from practical and theoretical research in the current literature that analyzed hardware failures of CCSs [12, 36, 38, 40–42].

Table 1 Annual Failure Rates (AFR) used in the simulations
  Component          AFR
  CPU                2 %
  Memory             1 %
  Hard Disk Drives   8 %
  Bandwidth          1 %

In order to provide for composition of more complex CCSs as well as reusable resource definitions, the software is built around a hierarchy of components. The most basic component is a single server consisting of a finite pool of CPU, memory, HDD, and bandwidth resources. From here, clusters are constructed as groups of servers and clouds are built from a collection of clusters. Likewise, the requested resources are constructed from a hierarchical abstraction. The most basic unit is a VM consisting of a finite pool of CPU, memory, HDD, and bandwidth resources. Individual VMs are combined into groups to represent the total requested resources. In conjunction with grouping individual VMs together, a probability distribution is assigned in accordance with how each type of VM in that group is to be allocated.

The simulations are set up by specifying a particular set of available resources in the form of a cloud, and a set of requested resources in the form of a VM grouping. The non-sequential Monte Carlo algorithm is run to convergence for each simulation. It is important to note that individual component failures are not tracked by the MCS algorithm. In reality, a specific component that fails once has been shown to fail more often than its peers. This simplification of the state space allows the non-sequential MCS algorithm to converge more efficiently than one that keeps track of this additional information. Thus, the AFRs used in each simulation apply to all resources and are never modified based on past inadequacies.

There were two types of simulations performed over the course of this study, which differ in the way VMs are allocated. The first scheme allocates VMs in a static manner, in which all trials of a single simulation have identical VM allocations. This scheme results in a tight bound on the variance of the reliability of repeated simulations because the only stochastic behavior arises within the MCS algorithm. The process for performing a static allocation is shown in Algorithm 2, where VMinstances refers to the current set of VMs being allocated on the CCS, VMinstance is a single VM instance, PROBi is the probability of a single type of VMinstance occurring, and i refers to a singular VM type.

Algorithm 2 Algorithm for static VM allocation
  Choose SET(VMinstances)
  for each VMinstance i do
    Choose 0 ≤ PROBi ≤ 1
  end for
  Require: Σ PROBi = 1
  Choose VMcount ≥ 0
  for each VMinstance i do
    Allocate VMcount * PROBi instances of i
  end for

The second allocation scheme adds a second source of stochastic behavior to the reliability simulation. In this scheme, the allocation of VMs is based on a user-specified probability distribution over a discrete set of VM types. Multiple simulations set up identically under this arrangement will have varying VM allocations. This scheme provides a view of highly dynamic VM allocation policies, and provides insight on how to better control the overall reliability in these rapidly changing environments. Algorithm 3 details the process of dynamic VM allocation.

Algorithm 3 Algorithm for dynamic VM allocation
  Choose SET(VMinstances)
  for each VMinstance i do
    Choose 0 ≤ PROBi ≤ 1
  end for
  Require: Σ PROBi = 1
  Choose VMcount ≥ 0
  for 1 to VMcount do
    sum = 0
    Generate random number 0 ≤ r ≤ 1
    for probi in PROB do
      sum = sum + probi
      if r ≤ sum then
        Allocate VM of type i
      end if
    end for
  end for
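A compact Java sketch of the roulette-wheel selection behind Algorithm 3 is given below; the VM types and probabilities mirror the M1 Small/Medium/Large example used later in the paper, and the class and method names are our own. The sketch breaks out of the inner loop once a type has been chosen, which is the intent of the pseudocode.

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Sketch of the dynamic (probability-based) VM allocation of Algorithm 3; illustrative only. */
public class DynamicAllocator {
    public static Map<String, Integer> allocate(String[] types, double[] prob,
                                                int vmCount, Random rng) {
        Map<String, Integer> allocation = new HashMap<String, Integer>();
        for (String t : types) {
            allocation.put(t, 0);
        }
        for (int n = 0; n < vmCount; n++) {
            double r = rng.nextDouble();     // 0 <= r < 1
            double sum = 0.0;
            for (int i = 0; i < types.length; i++) {
                sum += prob[i];
                if (r <= sum) {              // roulette-wheel selection of a VM type
                    allocation.put(types[i], allocation.get(types[i]) + 1);
                    break;                   // one VM allocated per draw
                }
            }
        }
        return allocation;
    }

    public static void main(String[] args) {
        String[] types = { "M1 Small", "M1 Medium", "M1 Large" };
        double[] prob = { 0.5, 0.3, 0.2 };   // must sum to 1, as Algorithm 3 requires
        System.out.println(allocate(types, prob, 31050, new Random()));
    }
}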

Test systems
Due to a lack of standardized test CCSs within the literature, the authors needed to develop test-bed systems for simulation. In order to accomplish this goal, this study began with the small CCS depicted in Table 2. The table depicts the total available server resources as well as the resources consumed by the allocation of one VM. This CCS formulation provided insights into 1) What caused different types of hardware failures, 2) The ways in which convergence of the MCS is reached (from above or below), and 3) The number of iterations required for convergence. Based on this initial system, further and more complex systems were designed in order to test the proposed methodology.

Table 2 Small test-bed CCS used in initial simulations
              # of Cores   Memory (GB)   HDD Size (GB)   Bandwidth (Mbps)
  Available   400          400           10,000          50,000
  Each VM     2            2             100             100

Static VM allocation
An overview of a test-bed simulation with an allocation of 98 VMs is shown in Table 3, and the reliability averaged over 20 trials is 0.98 ± 0.0025. The results from a typical trial are shown in Figs. 3 and 4. It is evident that the HDD resource is the cause of all 153 failures during the simulation. This is due to the high utilization of the HDD resource (98 %) and the comparatively low utilization of all other resources (49 % for CPU and memory and 20 % utilization for bandwidth). This shows that the cloud provider could serve many more VMs if they were to add more HDD to this particular cloud. The extra HDD resource would allow for more VMs to be allocated on the cloud and in effect allow for a much larger utilization percentage of the other resources.

Table 3 Test-bed CCS - 98 VMs
                # of Cores   Memory (GB)   HDD Size (GB)   Bandwidth (Mbps)
  Available     400          400           10,000          50,000
  Requested     196          196           9,800           9,800
  Difference    204          204           200             40,200
  Utilization   0.49         0.49          0.98            0.20

As such, another simulation is performed with an extra 10,000 GB of HDD resources added to the test-bed CCS (Table 4). The results of this simulation, with an allocation of 194 VMs, are shown in Figs. 5 and 6. The overview table shows that the utilization of the CPU, memory, and bandwidth resources has almost doubled while the HDD utilization has decreased slightly. The resulting reliability averaged over 20 trials is 0.9960 ± 0.0003. The additional HDD resources have yielded a substantial increase in total cloud utilization while simultaneously increasing the reliability by around 2.22 %. Consequently, the cloud provider was able to supply an additional 96 VMs to consumers with a much higher reliability than in the previous example. Also noteworthy is that the main resource inadequacy is still HDD. This behavior is due to the significantly higher AFR of HDD versus the other resources. The ability to quickly modify CCS resources in conjunction with the efficiency of the MCS algorithm's convergence allows for quick and easy design of highly reliable and highly utilized CCSs.

The design of highly reliable and highly utilized clouds strikes a balance between resource utilization that is high enough to use a majority of cloud resources yet is safely below the threshold at which many concurrent failures are likely. Since the reliability of a resource is exponentially distributed in accordance with its AFR, resources with high AFRs must be carefully considered, especially at high utilization percentages. In the previous examples the HDD resource was highly utilized, yet it also has an AFR much higher than any of the other resources considered. This over-utilization of a failure-prone resource provides an opportunity to greatly improve CCS reliability. Using the iterative algorithm from Section "System design using MCS", the test-bed CCS from Table 4 is optimized to obtain a much higher reliability by allocating only 190 VMs (rather than the 194 that were previously requested). The resulting R value averaged over 20 trials is 0.999970 ± 0.000031, which is around a 0.4 % improvement. All resource inadequacies are due to HDD again, reinforcing the detrimental effects of high AFRs on a CCS's reliability. More iterations of Algorithm 1 can be performed in order to further increase CCS reliability, yet resource utilization will be reduced.

Subsequently, more complex CCSs are designed with the insights gained from the test-bed system in mind. After qualifying the impacts of various resource allocations, the authors simulate a real-world virtual CCS run by the Extreme Science and Engineering Discovery Environment (XSEDE) partnership [44]. The XSEDE partnership is composed of numerous United States universities. XSEDE is an advanced, powerful, and robust collection of integrated advanced digital resources and services that supports a CCS composed of 16 supercomputers as well as high-end data visualization and data analysis resources. The hardware resources provided by the XSEDE CCS are depicted in Table 5.

In order to provide many unique and realistic VM instances beyond the one in the initial test-bed system, the authors chose to simulate a large subset of the available Amazon EC2 VMs [45]. A listing of the Amazon EC2 VM instances that were used in simulations is shown in Table 6. Using VMs mirrored after the actual Amazon
EC2 VMs allows for the exploration of complex CCSs with much more complex allocation configurations, especially when the distribution of VMs is assigned in accordance with a probability distribution.

Fig. 3 Pattern of reliability convergence - initial simulation. Pattern of reliability convergence for the initial test-bed CCS simulation as defined in Table 3

Fig. 4 Component failures by resource type - initial simulation. Component failures by resource type for the initial test-bed CCS simulation as defined in Table 3

Simulation of the XSEDE CCS
The XSEDE cloud, which has many more resources available than the original test-bed system, proves much more difficult to balance between high reliability and high utilization. Initially, the allocation of each type of Amazon EC2 VM instance is varied based on the amount of resources provided. Achieving high CCS reliability is quite straightforward in this manner. Yet, finding a balance of EC2 instances that also yields a high utilization of each resource is highly difficult. This makes sense as the XSEDE cloud is primarily aimed at scientific computing and data visualization, which requires allocations that are quite different than the normal EC2 instances. For example, an M1 Small or M1 Medium instance would rarely be allocated on the XSEDE CCS as it would be insufficient for performing large-scale scientific calculations. A more likely scenario would be the allocation of VMs which utilize much higher quantities of each resource, such as the Cluster Compute Eight Extra Large EC2 instance. Yet, even when allocation is performed using only the most intensive EC2 VMs there is still an overreliance on CPU and HDD. The EC2 VMs also have bandwidth requests that are minimal compared to the jobs that are most likely performed on the XSEDE supercomputers.

Many configurations can be achieved using all of the EC2 instances which provide high reliability and high resource utilization of 2 out of the 4 resources (almost exclusively CPU and HDD). Bandwidth is always under-utilized at around 7–10 %. The main conclusion that can be drawn is that in order to effectively serve VM allocations likened to the EC2 instances, the XSEDE CCS would greatly benefit from additional CPU and HDD resources. The addition of these resources would allow the XSEDE CCS to serve many more VMs under the conditions of this study at an even higher reliability level. Another important observation is that VM instances that are light on resources are very good at increasing utilization to a desired level without sacrificing reliability. This behavior is very similar to the way in which a bucket can be filled to the brim with coarse rocks yet there is always room to add in finer particles to fill in the empty voids.

Results of a simulation performed with the XSEDE CCS are shown in Figs. 7 and 8, with the VM allocation percentages shown in Fig. 9. There are a total of 13,000 VM instances allocated, yielding CPU, memory, HDD, and bandwidth utilizations of 0.95, 0.25, 0.98, and 0.08 respectively. Of note is the 5,015,055 sampled states that are
Table 4 Test-bed CCS (with extra HDD) - 194 VM’s 0.89±0.1912. Fig. 10 shows a histogram of the R values for
# of Cores Memory (GB) HDD Size (GB) Bandwidth the 500 simulations with a bin size of 0.01. Although the
(Mbps) resulting distribution of CCS reliability is heavily skewed
Available 400 400 20,000 50,000 to the highly reliable side, it is very important to note that
there are a total of 9 simulations that resulted in immedi-
Requested 388 388 19,400 19,400
ate, total failures (R value of 0.0). A specific allocation that
Difference 12 12 600 30,600
results in a R of 1.0 was characterized by allocation per-
Utilization 0.97 0.97 0.97 0.39 centages of 50.8 %, 29.7 %, and 19.4 % for the M1 Small,
M1 Medium, and M1 Large VMs. Likewise, an allocation
that results in a R of 0.0 consisted of 49.4 %, 30.2 %, and
required for convergence compared with the simulation 20.4 % of M1 Small, M1 Medium, and M1 Large VMs.
displayed in Figs. 5 and 6 in which only 41,082 sam- These numbers illustrate how a slight difference in VM
ples are needed. This increase in sampling illustrates the allocation can result in a large change in the reliability of a
importance of developing more sophisticated failure-state CCS.
sampling schemes to speed up MCS convergence for large For comparison, 500 simulations are performed using
CCS systems. the exact setup described above except that the VM types
are statically allocated. Hence, M1 Small, M1 Medium,
Probability based VM allocation and M1 Large occupy 50 %, 30 %, and 20 % of the 31,050
Simulations are also performed in which multiple, dif- VM allocations respectively for all 500 simulations. This
ferent VM instances are allocated on the CCS in corre- simulation results in a R value of 0.9917±0.0006. The min-
spondence with a specified probability distribution. This imum and maximum R values are 0.989304 and 0.993757,
enables the evaluation of a more realistic set of simula- respectively. These simulations illustrate that even a sim-
tions that showcase how allocation variability affects the ple CCS under a dynamic allocation policy can exhibit
overall reliability of a CCS. The most important observa- highly deceptive reliability characteristics when too few
tion from these simulations is that the reliability of a CCS simulations are performed.
can vary wildly when a fixed number of VMs are allocated This highly erratic reliability behavior is a direct result of
using a specified probability distribution. In the event the combined use of probabilistic methods of VM alloca-
that a multitude of resource intensive VMs are allocated tion and the design of a test system that is highly utilized.
on the cloud the CCS would be very likely to fail. Con- In any highly utilized system, a slight variance in workload
versely, if a majority of the VM allocations are very light can easily move a system from highly utilized and stable to
on resource requirements the CCS would exhibit 100 % over-utilized and unstable. This also showcases the value
reliability. of the novel application of non-sequential MCS in order to
For example, a simulation is performed using three of efficiently simulate a CCS numerous times. This behavior
the Amazon EC2 VM instances: M1 Small, M1 Medium, becomes even more variable when more than three VM
and M1 Large with respective allocation probabilities of types are used in the probability distribution. The abil-
50 %, 30 %, and 20 %, sampled from a uniform distribu- ity to assess the distribution of reliability across a broad
tion. The XSEDE cloud is used as the available resources range of allocation schemes greatly aids in designing a
(Fig. 5) and 31,050 VMs are allocated using the specified CCS that maintains high reliability. Without an abundance
distribution. The CCS’s reliability varies from 0.0 to 1.0 of simulations, the design of highly reliable CCSs is a futile
over the course of 500 simulations. The average R value is exercise within such a dynamic environment.

Fig. 5 Pattern of reliability convergence — initial simulation plus 10,000 GB HDD. Pattern of reliability convergence for the initial test-bed CCS
simulation with 10,000 GB extra HDD resource as defined in Table 4

Fig. 6 Component failures - initial simulation plus 10,000 GB HDD. Component failures by resource type for the initial test-bed CCS simulation with 10,000 GB extra HDD resource as defined in Table 4

Discussion
A recurring observation throughout the empirical simulations is the importance of matching VM allocations to hardware that is specifically suited to the type of job at hand. For instance, naively allocating 116,000 small test-bed VMs on the XSEDE cloud results in a reliability of 98.7 % due to a shortage of hard disk resources. In addition, this setup does not efficiently utilize the available CPU (68 %), memory (34 %), or bandwidth (5 %) resources.

Yet, when a hand-crafted group of VMs that are much more CPU and memory intensive is allocated on the XSEDE cloud, the total pool of resources is much more efficiently utilized, all while retaining a very high degree of reliability (≥ 99.9999 %). It is quickly evident that groups of VMs can be engineered to cause any particular type of resource inadequacy. Furthermore, it is very likely that users requesting resources that do not require a large amount of storage space would typically see failures in other components (CPUs, RAM, etc.). Yet, this same functionality can be leveraged to craft probability distributions over VM instances that allow for highly reliable clouds with each individual resource being highly utilized. By intelligently allocating the proper types of VM instances, overall cloud reliability can be controlled with a fine degree of precision while efficiently using the available resource pool. In fact, if the probability distribution over VM instances and the AFRs are known, limits on CCS reliability can be readily established via simulation.

For instance, Fig. 11 depicts repeated simulations of the XSEDE CCS with varying numbers of VMs allocated.

Table 5 XSEDE Cloud Resources (Est. Bandwidth)
  Cluster                  # of Cores   Memory (GB)   HDD Size (GB)   Bandwidth (Mbps)
  Ranger                   62,976       125,952       1,810,560       3,936,000
  Wispy                    128          512           8,000           32,000
  GordonION                768          3,072         256,000         640,000
  KrakenXT75               112,896      150,528       2,455,488       9,408,000
  Lonestar 4               22,656       45,312        275,648         18,880,000
  Steele                   7,144        28,576        446,500         893,000
  Gordon Compute Cluster   16,384       65,536        4,096,000       102,400,000
  Trestles                 10,368       20,736        143,208         3,240,000
  Quarry                   896          3,584         2,128           112,000
  Stampede                 102,400      204,800       320,000         87,296,000
  Blacklight               1,024        32,768        150,000         15,000
  Keeneland                4,224        8,448         1,761,144       3,600,960
  Totals                   341,864      689,824       11,724,676      236,212,960

Table 6 Amazon EC2 VM instances (Est. Bandwidth)
  VM Name                             # of Cores   Memory (GB)   HDD Size (GB)   Bandwidth (Mbps)
  M1 Small                            2            2             160             100
  M1 Medium                           4            4             410             500
  M1 Large                            8            8             850             1,000
  M1 Extra Large                      16           16            1,690           1,000
  M3 Extra Large                      26           16            1,690           1,000
  M3 Double Extra Large               52           32            3,380           2,000
  High Memory Extra Large             13           18            420             500
  High Memory Double Extra Large      26           35            850             1,000
  High Memory Quad Extra Large        52           70            1,690           1,000
  High CPU Medium                     10           2             350             500
  High CPU Extra Large                40           8             1,690           1,000
  Cluster Compute Eight Extra Large   176          64            3,370           10,000

Fig. 7 Pattern of reliability convergence - XSEDE CCS. Pattern of reliability convergence for the XSEDE CCS (Table 5) simulation using the VM allocation shown in Fig. 9

Fig. 8 Component failures by resource type - XSEDE CCS. Component failures by resource type for the XSEDE CCS (Table 5) simulation using the VM allocation shown in Fig. 9

The VM instances used in this simulation were the EC2 instances M1 Small, M1 Medium, and M1 Large, at distribution percentages of 20 %, 40 %, and 40 %, respectively. The figure shows that the XSEDE CCS can adequately supply resources for 21,567 VMs while retaining a reliability greater than 99.9104 %. If the number of VM allocations is decreased to 21,447, the minimum reliability over 10 trials increases to 99.9992 %. Of course, these results are highly dependent on the accuracy of the AFRs specified at simulation time, as well as the number of trials performed. This method can be extended to establish accurate CCS reliability curves that illustrate CCS reliability characteristics across a wide range of VM allocations.

Limitations
It is important to note that this methodology considers a worst-case scenario in which all VM allocations consume full resources at all times. In reality, a server hypervisor manages the resources requested by each VM, allowing more VM guests to be allocated without detrimental impacts on overall cloud reliability. For instance, one physical CPU core may be mapped to four virtual CPU cores, and each guest VM will wait for its share of virtual cores to be available from the hypervisor to run its compute tasks. Similarly, a storage hypervisor can be used to manage virtualized storage resources to increase utilization rates of disk while maintaining high reliability. Further, this study only considers the availability of the servers hosting the VMs and does not consider other subsystems like external storage, storage area networks (SAN), or failures in other systems such as software bugs or human error. There are also other features of the cloud that cannot be modeled by this approach and that affect the overall reliability, such as live migration of VMs, or specific CCS network topologies such as DCell which have nodes with varying importance to overall CCS reliability. Live migration of VMs can increase the reliability of the cloud by allowing VMs to be seamlessly transferred between servers for load balancing or server repair. The CCS network topology can have large consequences on overall reliability, as server-centric architectures like DCell have relay nodes that are more important than individual compute nodes for CCS reliability. Further, oversubscription has additional implications that may also impact reliability.

Practical implications
The importance of cloud availability and reliability extends beyond academic interest, as they also have monetary consequences when considering cloud SLAs. For example, the Amazon EC2 SLA states that Amazon will use
commercially reasonable efforts to make Amazon EC2 and Amazon EBS each available with a Monthly Uptime Percentage (MUP) of at least 99.95 % [46].

Fig. 9 Distribution of EC2 VM instances (Table 6) used in the XSEDE CCS simulation

Amazon calculates the MUP by subtracting from 100 % the percentage of minutes during the month in which the Amazon service is in a state of "Region Unavailable." This leaves a mere 21.6 minutes of downtime that Amazon is allotted to meet a MUP of 99.95 % in a 30-day month. In the event that Amazon cannot meet the terms of this SLA, a credit is issued to the consumer in accordance with Table 7. Rackspace, on the other hand, guarantees that their infrastructure will be available 100 % of the time and will issue account credits up to the full monthly fee for affected servers based on the number of hours the server is down [47].
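As a quick check of the 21.6-minute figure for a 30-day month:

  30 × 24 × 60 = 43,200 minutes, and 43,200 × (1 − 0.9995) = 21.6 minutes of allowed downtime.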
Fig. 10 Distribution of R values from 500 probability based VM allocation simulations

The cloud resource providers are not the only parties that experience detrimental economic effects from CCS downtime. CCS resource consumers lose money each minute that their website or IT infrastructure is down. According to InformationWeek, IT downtime costs $26.5 billion in lost revenue per year [48]. Thus, there is a mutual interest in making CCS availability and resilience approach 100 %. This is in addition to other cloud computing consequences that could arise, including data loss and data security. In order to retain its customer base, a CCS provider must be highly aware of the consequences of downtime while pro-actively pursuing increased reliability by continually re-evaluating its infrastructure to provide highly available and reliable service. This is especially important in light of the sheer number of CCS failures and issues showcased on the IWGCR website [1]. In order to uphold such stringent uptime requirements, efficient, effective ways of evaluating and improving cloud
Fig. 11 Minimum CCS reliability based on number of VM Allocations (10 trials)

reliability and availability are extremely important. In fact, MCS. It was shown that non-sequential MCS provides
when dealing with such high MUPs, availability, and lost an efficient and flexible way to determine the reliability
revenue requirements, every minute of uptime is cru- of a CCS based on a set of discrete resources. A novel
cial. The methods developed in this study for the design algorithm for CCS expansion planning was also intro-
and evaluation of CCSs provides one set of tools for effi- duced which facilitates the design of highly reliable and
ciently pursuing this goal. CCS providers can leverage the highly utilized CCSs. Finally, new test-bed CCS systems
proposed algorithms to assess their infrastructure from were developed that can be used for future CCS reliability
a vantage point of both reliability and utilization. Non- and availability analyses. Based on the insights garnered
sequential MCS and its derivatives may also be used to during this study, future work may include the following:
validate and assess their SLAs in order to ensure that
they are effortlessly meeting the required MUPs. The • An extension of the CCS expansion planning
resource expansion and planning algorithms can be iter- algorithm developed in this study to incorporate
atively applied to maximize revenue and minimize costs economic factors to simultaneously maximize CCS
associated with providing CCS resources to consumers. reliability and the cloud providers return on
The decision to add resources to a CCS no longer would investment (ROI);
be necessitated by a failure, but could be justified by • As this study considered a CCS to have only CPU,
quantitative metrics provided via the proposed methods. memory, HDD, and bandwidth resources, a future
These methods can also save time by simulating new extension may include graphics processing units
CCS designs prior to actually building physical systems (GPUs), databases, software packages, etc. This
to ensure that the resources will be sufficient for the extension would provide for highly realistic
expected VM allocation load. simulations that take into account many more
Conclusion
This study has presented and analyzed a novel approach to assessing the reliability of CCSs using non-sequential MCS. It was shown that non-sequential MCS provides an efficient and flexible way to determine the reliability of a CCS based on a set of discrete resources. A novel algorithm for CCS expansion planning was also introduced which facilitates the design of highly reliable and highly utilized CCSs. Finally, new test-bed CCS systems were developed that can be used for future CCS reliability and availability analyses. Based on the insights garnered during this study, future work may include the following:

• An extension of the CCS expansion planning algorithm developed in this study to incorporate economic factors, simultaneously maximizing CCS reliability and the cloud provider's return on investment (ROI);
• As this study considered a CCS to have only CPU, memory, HDD, and bandwidth resources, a future extension may include graphics processing units (GPUs), databases, software packages, etc. This extension would provide for highly realistic simulations that take into account many more variables than were considered in this preliminary study;
• An extension of the proposed algorithm to consider partially failed or derated states. For instance, a multiple-core CPU may still function at a reduced level if only a subset of the available cores fails;
• Evaluating the implications of server and storage hypervisors on the reliability of a CCS instead of using a simple resources-requested vs. resources-available scheme; and
• Improving on an important approximation made in this study, namely calculating resource assignment and usage in an additive manner, by formulating the resource assignments as a bin packing problem for more realistic results; a simple placement heuristic of this kind is sketched below.
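As a pointer toward that last item, the sketch below shows a first-fit-decreasing placement of VMs onto servers, one common heuristic for the kind of bin-packing formulation envisioned above. The data layout and function names are assumptions made for illustration and do not represent the authors' planned formulation.

```python
# Illustrative first-fit-decreasing placement of VMs onto servers.
# Each VM and each server is a dict keyed by the same resource names.

RESOURCES = ("cpu", "mem", "hdd", "bw")

def fits(free, vm):
    """True if the server's remaining capacity can host the VM."""
    return all(free[r] >= vm[r] for r in RESOURCES)

def first_fit_decreasing(servers, vms):
    """Return a list of (vm_index, server_index) pairs, or None if some VM
    cannot be placed anywhere. Free capacity shrinks as VMs are packed."""
    free = [dict(s) for s in servers]
    # Largest VMs first; summing heterogeneous units is crude but keeps
    # the sketch short.
    order = sorted(range(len(vms)),
                   key=lambda i: -sum(vms[i][r] for r in RESOURCES))
    placement = []
    for i in order:
        for j, capacity in enumerate(free):
            if fits(capacity, vms[i]):
                for r in RESOURCES:
                    capacity[r] -= vms[i][r]
                placement.append((i, j))
                break
        else:
            return None
    return placement
```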
Competing interests
The authors declare that they have no competing interests.

Authors' contributions
BS developed the simulation software, carried out the simulations, analyzed results, and drafted the manuscript. JR aided in simulation analysis and helped to draft the manuscript. RG participated in the conception, design, and implementation of the study and helped to draft the manuscript. VD helped to conceive of and plan the study while also contributing to the manuscript. MA conceived of the study and helped draft the manuscript. All authors read and approved the final manuscript.

Authors' information
BS received his B.S. in Computer Science and Engineering, B.S. in Electrical Engineering, and M.S. in Computer Science and Engineering from the University of Toledo (UT) in Toledo, OH in 2010 and 2013, respectively. His research interests include the application of computer science to cloud computing, machine learning, artificial intelligence, and signal processing. In 2009, he received the UT Electrical Engineering Student of the Year Award.
JR received his B.S. degree in Computer Science in 2009 and his M.S. degree in Computer Science in 2011 from Bowling Green State University (BGSU). He was awarded his Ph.D. from the University of Toledo in 2014, where his research focused on biomedical imaging and virtual reality-based modeling applications. He is now an Assistant Professor of Computer Science at The University of Findlay, where he is performing research in Human-Computer Interaction (HCI) and usability engineering, in addition to his interests in computer vision and virtual reality.
RG received his B.S. in Computer Science from Geneva College in 2005, his M.S. from Bowling Green State University in 2007, and his Ph.D. from the University of Toledo in 2012. He currently serves as an Assistant Professor at Bowling Green State University. His research is driven by a love for code and covers a variety of areas including High Performance Computing, Population-based Metaheuristics, Software Development, and the application of these interests to the evaluation and analysis of complex networks including the power grid and cloud computing systems.
VD received the B.Eng. degree in EEE and the M.Sc. degree in physics from BITS, Pilani, in 1996, and the Ph.D. in electronics from Carleton University, Canada, in 2003. Since 2008, he has been an Associate Professor in the EECS Department at the University of Toledo. Dr. Devabhaktuni's R&D interests include applied electromagnetics, biomedical applications of wireless sensor networks, computer aided design, device modeling, image processing, infrastructure monitoring, neural networks, optimization methods, power theft modeling and education, RF/microwave devices, and virtual reality.
MA received the B.S. degree in Electrical Engineering with honors from AMU Aligarh, India in 1969. He received the M.E. with distinction and Ph.D. degrees from IISc Bangalore in 1971 and 1974, respectively. He served as the Graduate Director of the EECS Department from 1996 to 1998 and the Undergraduate Director of the CSE program from 1998 to 2001. He received the 2008 EECS Teacher of the Year award, the 2008 College of Engineering Outstanding Teacher award, and the 2006 IEEE Engineer of the Year award of the Toledo Section of the IEEE.

Author details
1 Department of Electrical Engineering & Computer Science, University of Toledo, 2801 W. Bancroft St., 43406 Toledo, OH, USA. 2 Department of Computer Science, Bowling Green State University, 1001 E. Wooster St., 43403 Bowling Green, OH, USA. 3 Department of Computer Science, The University of Findlay, 1000 North Main St., 45840 Findlay, OH, USA.

Received: 9 June 2014 Accepted: 11 May 2015

References
1. IWGCR: International Working Group on Cloud Computing Resiliency. http://iwgcr.org/ (2013)
2. Gagnaire M, Diaz F, Coti C, Cerin C, Shiozaki K, Xu Y, Delort P, Smets JP, Lous JL, Lubiarz S, Leclerc P (2011) Downtime statistics of current cloud solutions. Technical report, International Working Group on Cloud Computing Resiliency (June 2012). https://iwgcr.files.wordpress.com/2012/06/iwgcr-paris-ranking-001-en1.pdf
3. Cerin C, Coti C, Delort P, Diaz F, Gagnaire M, Gaumer Q, Guillaume N, Lous J, Lubiarz S, Raffaelli J, Shiozaki K, Schauer H, Smets J, Seguin L. Downtime statistics of current cloud solutions. Technical report, International Working Group on Cloud Computing Resiliency (June 2013). http://iwgcr.org/wp-content/uploads/2013/06/IWGCR-Paris.Ranking-003.2-en.pdf
4. Kundra V. Federal cloud computing strategy. Technical report, The United States Government. https://www.dhs.gov/sites/default/files/publications/digital-strategy/federal-cloud-computingstrategy.pdf
5. Rosenberg J, Mateos A (2011) The Cloud at Your Service: The When, How, and Why of Enterprise Cloud Computing, 1st edn. Manning Publications, Greenwich, Connecticut
6. Izrailevsky Y, Tseitlin A (2011) The Netflix Simian Army. http://techblog.netflix.com/2011/07/netflix-simian-army.html
7. IBM/Google Academic Cloud Computing Initiative. http://www.cloudbook.net/directories/research-clouds/ibm-google-academic-cloud-computing-initiative (2012)
8. Cloud Computing. http://labs.yahoo.com/nnComputing (2011)
9. Chang CS, Bostjancic D, Williams M (2010) Availability management in a virtualized world. In: Boursas L, Carlson M, Jin H, Sibilla M, Wold K (eds) Systems and Virtualization Management: Standards and the Cloud. Communications in Computer and Information Science, vol 71. Springer, Berlin Heidelberg. pp 87–93
10. Bauer E, Adams R (2012) Reliability and Availability of Cloud Computing, 1st edn. Wiley-IEEE Press, Piscataway, New Jersey
11. Oner KB, Scheller-Wolf A, van Houtum G-J (2013) Redundancy optimization for critical components in high-availability technical systems. Oper Res 61(1):224–264
12. Kim DS, Machida F, Trivedi KS (2009) Availability modeling and analysis of a virtualized system. In: 15th IEEE Pacific Rim International Symposium on Dependable Computing, Shanghai, China. pp 365–371
13. Ghosh R, Trevedi K, Naik V, Kim D (2012) Interacting Markov chain based hierarchical approach for cloud services. Technical report, IBM (April 2010). http://domino.research.ibm.com/library/cyberdig.nsf/papers/AABCE247ECDECE0F8525771A005D42B6
14. Che J, Zhang T, Lin W, Xi H (2011) A Markov chain-based availability model of virtual cluster nodes. In: International Conference on Computational Intelligence and Security, Hainan, China. pp 507–511
15. Zheng J, Okamura H, Dohi T (2012) Component importance analysis of virtualized system. In: International Conference on Ubiquitous Intelligence and Computing, Fukuoka, Japan. pp 462–469
16. Longo F, Trivedi K, Russo S, Ghosh R, Frattini F (2014) Scalable analytics for IaaS cloud availability. IEEE Trans Cloud Comput 99(PrePrints):1
17. Zissis D, Lekkas D (2012) Addressing cloud computing security issues. Future Generation Comput Syst 28(3):583–592
18. Page S. Cloud computing - availability. Technical report, ISA/BIT Learning Centre. http://uwcisa.uwaterloo.ca/Biblio2/Topic/ACC626
19. Ahuja SP, Mani S (2012) Availability of services in the era of cloud computing. Netw Commun Technol 1(1):2–6
20. Wang W, Chen H, Chen X (2012) An availability-aware approach to resource placement of dynamic scaling in cloud. In: IEEE Fifth International Conference on Cloud Computing, Honolulu, Hawaii. pp 930–931
21. Jeong YS, Park JH (2013) High availability and efficient energy consumption for cloud computing service with grid infrastructure. Comput Electrical Eng 39(1):15–23. http://www.sciencedirect.com/science/article/pii/S0045790612000456
22. Manesh RE, Jamshidi M, Zareie A, Abdi S, Parseh F, Parandin F (2012) Presentation an approach for useful availability servers cloud computing in schedule list algorithm. Int J Comput Sci Issues 9(4):465–470
23. Ferrari A, Puccinelli D, Giordano S (2012) Characterization of the impact of resource availability on opportunistic computing. In: MCC Workshop on Mobile Cloud Computing. ACM, Helsinki, Finland. pp 35–40
24. Lin YK, Chang PC (2010) Estimation of maintenance reliability for a cloud computing network. Int J Oper Res 7(1):53–60
25. Lin YK, Chang PC (2011) Maintenance reliability estimation for a cloud computing network with nodes failure. Expert Syst Appl 38(11):14185–14189
26. Lin YK, Chang PC (2011) Performance indicator evaluation for a cloud computing system from QoS viewpoint. Quality & Quantity 47(3):1–12
27. Lin YK, Chang PC (2012) Evaluation of system reliability for a cloud computing system with imperfect nodes. Syst Eng 15(1):83–94
28. Lin YK, Chang PC (2012) Approximate and accurate maintenance reliabilities of a cloud computing network with nodes failure subject to budget. Int J Prod Econ 139(2):543–550
29. Lin YK, Chang PC (2012) Estimation method to evaluate a system reliability of a cloud computing network. United States Patent Application. http://www.google.com/patents/US20120023372
30. Qian H, Medhi D, Trivedi K (2011) A hierarchical model to evaluate quality of experience of online services hosted by cloud computing. In: IFIP/IEEE International Symposium on Integrated Network Management, Dublin, Ireland. pp 105–112
31. Jhawar R, Piuri V, Santambrogio M (2012) A comprehensive conceptual system-level approach to fault tolerance in cloud computing. In: IEEE International Systems Conference (SysCon), Vancouver, BC. pp 1–5
32. Jhawar R, Piuri V (2013) Fault tolerance and resilience in cloud computing environments. In: Vacca JR (ed) Computer and Information Security Handbook, 2nd edn. Morgan Kaufmann, Boston. pp 125–141. http://dx.doi.org/10.1016/B978-0-12-394397-2.00007-6
33. Limrungsi N, Zhao J, Xiang Y, Lan T, Huang HH, Subramaniam S (2013) Providing reliability as an elastic service in cloud computing. Technical report, George Washington University (February 2012). ISBN: 978-1-4577-2052-9
34. Singh D, Singh J, Chhabra A (2012) Failures in cloud computing data centers in 3-tier cloud architecture. Int J Inform Eng Electron Business 4(3):1–8
35. Zhao W, Melliar-Smith PM, Moser LE (2010) Fault tolerance middleware for cloud computing. In: International Conference on Cloud Computing, Miami, Florida. pp 1–8
36. Vishwanath KV, Nagappan N (2010) Characterizing cloud computing hardware reliability. In: Proceedings of the 1st ACM Symposium on Cloud Computing. SoCC '10. ACM, New York, NY, USA. pp 193–204
37. Rashid L, Pattabiraman K, Gopalakrishnan S (2012) Intermittent hardware errors recovery: modeling and evaluation. In: 9th International Conference on Quantitative Evaluation of SysTems. QEST 2012. pp 1–10
38. Nightingale EB, Douceur JR, Orgovan V (2011) Cycles, cells and platters: an empirical analysis of hardware failures on a million consumer PCs. In: Proceedings of the Sixth Conference on Computer Systems. EuroSys '11. ACM, New York, NY, USA. pp 343–356
39. Pham C, Cao P, Kalbarczyk Z, Iyer RK (2012) Toward a high availability cloud: techniques and challenges. In: Second International Workshop on Dependability of Clouds, Data Centers and Virtual Machine Technology, Boston, Massachusetts. pp 1–6
40. Gill P, Jain N, Nagappan N (2011) Understanding network failures in data centers: measurement, analysis, and implications. In: Proceedings of the ACM SIGCOMM 2011 Conference. SIGCOMM '11. ACM, New York, NY, USA. pp 350–361
41. Gill P, Jain N, Nagappan N (2011) Understanding network failures in data centers: measurement, analysis, and implications. SIGCOMM Comput Commun Rev 41(4):350–361
42. Birke R, Chen LY, Smirni E (2012) Data centers in the wild: a large performance study. Technical report, IBM (April 2012). http://domino.research.ibm.com/library/cyberdig.nsf/papers/0C306B31CF0D3861852579E40045F17F
43. Zio E (2013) Monte Carlo simulation: the method. In: The Monte Carlo Simulation Method for System Reliability and Risk Analysis. Springer Series in Reliability Engineering. Springer, London
44. Extreme Science and Engineering Discovery Environment. https://www.xsede.org/home
45. Amazon EC2 Instance Types. http://aws.amazon.com/ec2/instance-types/
46. Amazon Web Services (2013) Amazon EC2 Service Level Agreement. http://aws.amazon.com/ec2-sla
47. Rackspace: Managed Service Level Agreement. http://www.rackspace.com/managed_hosting/support/servicelevels/managedsla/ (2013)
48. Harris C (2011) IT Downtime Costs $26.5 Billion In Lost Revenue. http://www.informationweek.com/storage/disaster-recovery/it-downtime-costs-265-billion-in-lost-re/229625441
