The Evolution of Distributed Computing Systems: From Fundamentals To New Frontiers
Abstract: Distributed systems have been an active field of research for over 60 years and have played a
crucial role in Computer Science, enabling the invention of the Internet that underpins all facets of modern
life. Through technological advancements and their changing role in society, distributed systems have
undergone a perpetual evolution, with each change resulting in the formation of a new paradigm. Each new
distributed system paradigm (prominent modern examples include Cloud computing, Fog computing, and
the Internet of Things (IoT)) allows for new forms of commercial and artistic value, yet also ushers in new
research challenges that must be addressed in order to realize and enhance its operation. However, it is
necessary to precisely identify what factors drive the formation and growth of a paradigm, and how unique
the research challenges within modern distributed systems are in comparison to prior generations of systems.
The objective of this work is to study and evaluate the key factors that have influenced and driven the
evolution of distributed system paradigms, from early mainframes and the inception of the global inter-network
through to contemporary systems such as Edge computing, Fog computing and IoT. Our analysis highlights
that the assumptions that have driven distributed systems appear to be changing, including (i) an accelerated
fragmentation of paradigms, driven by commercial interests and the physical limitations imposed by the end of
Moore's law, (ii) a transition away from generalized architectures and frameworks towards increasing
specialization, and (iii) a pivot within each paradigm's architecture between centralized and decentralized
coordination. Finally, we discuss present-day and future challenges of distributed systems research
pertaining to studying complex phenomena at scale and the role of distributed systems research in
the context of climate change.
Keywords: Distributed Computing, Computing Systems, Evolution, Green Computing
1. Introduction
Societal prosperity since the latter half of the 20th century has been underpinned by the Internet, formed by
large-scale computing infrastructure composed of distributed systems which have accelerated economic,
social and scientific advancement [1]. The complexity and scale of such systems have been driven by
increased societal demand and dependence on such computing infrastructure, which in turn has resulted in
the formation of new distributed system paradigms. In fact, these paradigms have evolved in response to
technological changes and usage, resulting in alterations to the operational characteristics and assumptions of
the underlying computing infrastructure. For example, early mainframe systems provided centralised
computing and storage interfaced by teletype terminals. Clustering and packet switching alongside
advancement in microprocessor technology and GUIs transferred computing from large mainframes operated
remotely to home PCs [5][6]. Standardisation of network protocols enabled global networks-of-networks to
exchange messages for global applications [1]. Organisations developed frameworks and protocols capable
of offloading computation to remote pools of computing resources such as processing, storage and
memory [2][3], eventually incorporating sensing and actuating objects with embedded networking capabilities
[4]. Thus, distributed system paradigms have evolved to distribute and facilitate services from centralised
clusters, extending infrastructure beyond the boundaries of central networks and forming paradigms such as IoT
and Fog computing [8][9].
2. Background
Distributed systems describe a class of computing system in which hardware and software components are
connected by means of a network, and coordinate their actions via message passing in order to meet a shared
objective [11][12]. Whilst paradigms exhibit differing operational behaviour and leverage various
technologies, these systems are defined by their underlying core characteristics and elements that facilitate
their operation.
2.1 Characteristics
Transparent Concurrency: Distributed Systems are inherently concurrent, with any participating resource
accessible via any number of local or remote processes. The capacity and availability of such a system can be
increased by adding resources that require mechanisms for accounting and identification. Such a system is
vulnerable to volatile inter-actor behaviours and must be resilient to node failure as well as lost and delayed
messages [16]. The management and access of objects, hardware or data in a distributed networked
environment is also of particular importance due to potential for physical resource contention [2][6][7][13].
Lack of Shared Clock: Computing systems maintain their own independent time, interpreted from a
variety of sources, and as such Operating Systems (OSs) are susceptible to clock skew and drift.
Furthermore, determining when a message was sent or received is important for ensuring correct system
behaviour. Therefore, events are tracked by means of conceptual Logical and Vector clocks; by sequencing
messages, processes distributed across a network are able to ensure total event ordering [10][14][15].
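The event-ordering mechanism described above can be sketched with a minimal Lamport logical clock; the `LamportClock` class and its method names are illustrative, not drawn from the cited works:

```python
from dataclasses import dataclass

@dataclass
class LamportClock:
    """Logical clock: a counter incremented on local events and
    reconciled on message receipt (in the style of Lamport [10])."""
    time: int = 0

    def tick(self) -> int:
        # Local event: advance the counter.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Timestamp attached to an outgoing message.
        return self.tick()

    def receive(self, msg_time: int) -> int:
        # On receipt, jump ahead of the sender's timestamp so that
        # the receive event is ordered after the send event.
        self.time = max(self.time, msg_time) + 1
        return self.time

# Two processes exchanging one message: the receiver's clock
# always exceeds the send timestamp, preserving causal order.
a, b = LamportClock(), LamportClock()
t_send = a.send()
t_recv = b.receive(t_send)
```

Lamport clocks alone yield a partial order; total ordering, as noted above, additionally requires sequencing (e.g. breaking timestamp ties by process identifier) or Vector clocks.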
Dependable and Secure Operation: Components of a distributed system are autonomous, and service
requests are dependent on the correct operation of, and transactions between, sub-systems. Failure of any subsystem
may affect the result of service requests and may manifest in ways that are difficult to effectively mitigate.
Fault tolerance and dependability are key characteristics towards ensuring the survivability of distributed
systems, allowing services to recover from faults whilst maintaining correct service [16].
2.2 Elements
Physical System Architecture: Physical system architecture identifies the physical devices that exchange
messages in a distributed system and the medium they communicate over. Early distributed systems such as
mainframes were physically connected to clients. Later, packet switching enabled long-haul multi-hop
communication. Cellular networks incorporate mobile computing systems, whilst modern systems host
services on specialised hardware between service providers and consumers. Initial designs of distributed
systems aimed to provide service across local or campus-wide networks of tens to hundreds of machines, and
were focused on the development of operating systems and remote storage [1][2]. Early efforts were
designed to explore potential challenges, demonstrate their feasibility [9], and enhance their functional
and non-functional properties (performance, security, dependability, etc.).
Entities: A logical perspective of a distributed system describes several processes exchanging messages in
order to achieve a common goal [17][18]. Contemporary systems extend this definition by considering
logical and aggregate entities, such as Objects and Components, used for abstracting resources and
functionality [19]. Here, systems are exposed via well-defined interfaces capable of describing the natural
decomposition of functional software requirements, enabling loose coupling between
interchangeable components for domain-specific problems found in distributed computing [20]. More recent
systems leverage web services and micro-services, which consider their deployment to physical hardware as
well as constraints including locality, utilization and stakeholders' policies [35]. Grid and Cloud computing
enable distributed computing by abstracting the aggregation of processing, memory and disk space [21], whereas
Fog and Edge computing emphasize integrating mobile and embedded devices [22][28].
Communication Models: Several communication models support distributed systems [24][25][26],
including (i) Inter-process Communication: enabling two different processes to communicate with each
other by means of operating system primitives such as pipes, streams, and datagrams in a client-server
architecture; (ii) Remote Invocation: mechanisms and concepts enabling a process in one address space to
affect the execution of operations, procedures and methods in another address space; and (iii) Indirect
Communication: mechanisms enabling message exchange between one or many processes via an
intermediary. In contrast with the previous communication models, sending and receiving processes are
decoupled, and responsibility for facilitating message exchange is passed to the intermediary [37][38].
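The decoupling of indirect communication can be sketched with a toy topic-based broker; the `Broker` class and topic names here are illustrative assumptions, not a real messaging API:

```python
from collections import defaultdict
from typing import Callable

class Broker:
    """Illustrative intermediary for indirect communication: senders
    and receivers are decoupled in space, knowing only the topic and
    the broker, never each other's identity."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]):
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str):
        # The broker, not the sender, fans the message out to all
        # registered receivers for the topic.
        for handler in self._subscribers[topic]:
            handler(message)

received = []
broker = Broker()
broker.subscribe("sensor/temp", received.append)
broker.subscribe("sensor/temp", lambda m: received.append(m.upper()))
broker.publish("sensor/temp", "21.5c")
```

Production systems realise the same pattern via application-layer protocols such as MQTT, with the added concerns of persistence, delivery guarantees and federation.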
Consensus and Consistency: Distributed systems make decisions amongst groups of cooperating processes,
each possessing possibly inconsistent state. Consensus algorithms are a mechanism by which a majority
subset of nodes, or 'quorum', can negotiate a truth and fulfil a client request. Replication
and partitioning are common techniques used to improve system scalability, reliability and availability [16]
when exposed to volatile environments. Consistency is a challenge to both replicated and partitioned storage and to
consensus algorithms [10][16].
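The quorum idea can be sketched as follows; `QuorumStore` and its version-tagged replicas are an illustrative simplification (not a full consensus protocol), showing only why majority quorums work: with n replicas, any two majorities of size floor(n/2)+1 must intersect, so a quorum read always observes the latest acknowledged write.

```python
class QuorumStore:
    """Toy majority-quorum replicated register (names illustrative)."""
    def __init__(self, n_replicas: int):
        self.replicas = [(0, None)] * n_replicas  # (version, value)
        self.quorum = n_replicas // 2 + 1

    def write(self, value, reachable):
        # A write succeeds only if a majority of replicas acknowledge.
        if len(reachable) < self.quorum:
            return False
        version = max(v for v, _ in self.replicas) + 1
        for i in reachable:
            self.replicas[i] = (version, value)
        return True

    def read(self, reachable):
        if len(reachable) < self.quorum:
            return None
        # Return the highest-versioned value seen within the quorum.
        return max(self.replicas[i] for i in reachable)[1]

store = QuorumStore(5)
accepted = store.write("x=1", reachable=[0, 1, 2])  # majority: accepted
rejected = store.write("x=2", reachable=[3, 4])     # minority: rejected
value = store.read(reachable=[2, 3, 4])  # intersects the write quorum at 2
```

Real consensus algorithms must additionally handle concurrent proposers, message loss and Byzantine behaviour [10], which this sketch deliberately omits.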
Table 1. Timeline of Distributed System Paradigms Formation and Key Technological Drivers
| Year | Driver | Technology & Paradigm | Model Elements: Physical | Model Elements: Conceptual Entities | Model Elements: Communication |
|---|---|---|---|---|---|
| 1960-1970 | Clustering and packet switching (1967-1977); supercomputer infrastructure; ARPANET and early Internet | Mainframe; client-server | Client terminal connections and telnet clients; local networks interconnected over packet switching, primarily for research activity | Clients (teletype terminals) and servers share mainframe resources; networks provide specific services to private networks, accessible to clients across geographic and organisational boundaries | Inter-process Communication (IPC); datagram transport (ATM, X.25); hosts (servers), switches, routers and mainframes |
| 2000-2010 | High-speed broadband; x86 virtualization; hypervisors; rise of smart phone adoption and mobile computing | Web services; Grid computing; community computing services and resource consolidation to datacenters; Cloud computing | Educational organizations form Grids for scientific goals, mostly provisioned via off-the-shelf machines organised into clusters; virtualized commodity clusters | Grid computing provides service orchestration across organizational boundaries; VM para-virtualization enables resource isolation between applications on shared hardware and application mobility; Grids and Clouds provide resource pooling (CPU, memory, storage); web services allow further service abstraction from physical hardware | Cluster middleware; REST, WSDL, XML, JSON, described by Uniform Resource Locators; MQTT, XMPP (application-layer group communication); Xen and KVM hypervisors |
| 2010-2020 | Software Defined Networks; containerization | IoT; Edge computing; Fog computing | Fog nodes; smart objects and edge infrastructure; cloudlets; edge datacenters | Specialization of computing tasks and hardware (GPU, NPU, smart phones, sensors); remote resources (storage, processing); containers become increasingly prominent | P4, OpenFlow, Open vSwitch |
Consistency in distributed systems can be defined as strong consistency, where any update to a partition of a
data set is immediately reflected in all subsequent accesses, or weak consistency, in which updates may
experience delay before they are propagated through the system and reflected in subsequent accesses.
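The distinction can be illustrated with a toy weakly consistent store, where an update is acknowledged by one replica immediately but reaches the others only when an assumed `sync()` (anti-entropy) step runs; all class and method names here are illustrative:

```python
class Replica:
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    """Toy weak-consistency store: stale reads are possible in the
    window between a write and the next propagation round."""
    def __init__(self, n: int):
        self.replicas = [Replica() for _ in range(n)]

    def write(self, i: int, key: str, value: str):
        # Acknowledged locally, not yet visible elsewhere.
        self.replicas[i].data[key] = value

    def read(self, i: int, key: str):
        return self.replicas[i].data.get(key)

    def sync(self):
        # Anti-entropy: propagate every replica's updates to all others.
        merged = {}
        for r in self.replicas:
            merged.update(r.data)
        for r in self.replicas:
            r.data = dict(merged)

store = EventuallyConsistentStore(3)
store.write(0, "k", "v1")
stale = store.read(1, "k")   # update not yet propagated
store.sync()
fresh = store.read(1, "k")   # visible after propagation
```

A strongly consistent store would instead block the write until all (or a quorum of) replicas applied it, trading latency and availability for the guarantee that no stale read occurs.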
and web-server. Standardisation of TCP/IP provided the infrastructure for an interconnected network of networks
known as the Internet, upon which the World Wide Web (WWW) was built. This enabled explosive growth in the number of hosts connected to
the Internet, and was the public's first large-scale societal exposure to Information Technology [3][6]. Mechanisms
such as Remote Procedure Calls (RPCs) were invented, allowing for the first time applications to interface
with procedures, functions and methods across address spaces and networks [7].
P2P, Grids & Web Services (1994-2000): Peer-to-Peer (P2P) applications such as Napster and Seti@Home
demonstrated that it was feasible for global networks of decentralised cooperating processes to perform large-
scale processing and storage. P2P enabled the division of workload amongst different peers/computing nodes,
whereby peers could communicate with each other directly at the application layer [8] without the
requirement of a central coordinator. The creation of Web Services enabled further abstraction of the system
interface from its implementation in the Web [40]. Rather than facilitating direct communication between clients
and servers, Web Services mediated communication via a brokerage service [33]. Scientific communities
identified that creating federations of large pools of computing resources from commodity hardware could
achieve capability comparable to that of large supercomputing systems [41]. Beowulf enabled resource
sharing amongst processes by means of software libraries and middleware, conceptualising clustered
infrastructure as a single system [42]. Grid computing enabled open access to computing resources and
storage by means of open protocols and middleware. This time period also saw the creation of effective x86
virtualization [43], which became a driving force for subsequent paradigms.
Cloud, Mobile & IoT (2000-2010): A convergence of cluster technology, virtualization, and middleware
resulted in the formation of Cloud computing, which enabled service models for provisioning
applications and computing resources as a service [34]. Driven primarily by large technology organizations that
constructed large-scale datacenter facilities, computation and storage began a transition from the client side
to the provider side, more similar to that of mainframes in the 1960s and 1970s [35][36]. Mobile computing
enabled access to remote resources from resource-constrained devices with limited network access [43][66].
IoT also began to emerge from the mobile computing and sensor network communities, providing common
objects with sensing, actuating and networking capabilities and contributing towards a globally
connected network of 'things' [44].
Fog and Edge Computing (2010-present): Whilst data produced by IoT and Mobile computing platforms
continued to increase rapidly, collecting and processing the data in real time was, and still remains, an
unsolved issue [27]. This resulted in the formation of Edge computing, whereby computing infrastructure such as
power-efficient processors and workload-specific accelerators are placed between consumer devices and
datacenter providers [66]. Fog computing provides mechanisms that allow for provisioning applications upon
edge devices [45][46], capable of coordinating and executing dynamic workflows across decentralised
computing systems. The composition of the Fog and Edge computing paradigms further extended the Cloud
computing model away from centralised stakeholders to decentralized multi-stakeholder systems [45]
capable of providing ultra-low service response times, increased aggregate bandwidth and geo-aware
provisioning [23][27]. Such a system may comprise one-off federations or clusters, realised to meet single
application workflows or to act as intermediate service brokers, and provide common abstractions such as
utility and elastic computing across heterogeneous, decentralised networks of specialised embedded devices,
contrasting with the centralised networks found in clouds [22].
4. Trends & Observations
By appraising the evolution of the past six decades of distributed system paradigms shown in Table 1, it is
apparent that a variety of technological advancements within computer science have driven the formation of
new distributed paradigms. It is thus now possible to observe longer-term trends and characteristics of
particular interest within distributed systems research.
Figure 1. Depiction of distributed system paradigm evolution. [Figure not reproduced: paradigms are plotted along a centralization-decentralization axis, spanning Mainframe (1955), Cluster, Datagram and Unix (1962), Network Computing (1967), ARPANET, TCP/IP, UDP, HTTP and HTML, Grid Computing (1999), Mobile Computing (2004), Cloud Computing (2006), IoT (2008) and Fog Computing (2009).]
nature, with the exception of Cloud computing, which shares many similarities with the centralized
mainframe in terms of the coordination of computational resources within a datacenter facility, which users
access via web APIs.
[Figure 2 not reproduced: a timeline of paradigm conception and realization dates spanning 1952-2011.]
The delay between the description of a potential paradigm and its actual successful implementation appears to
have shortened in recent years in contrast to previous decades, as shown in Figure 2. It is worth noting that
ascertaining the precise publication to be credited with first accurately describing the full realization of a paradigm
by a single individual or group is not necessarily feasible. Thus, we have attempted to identify the papers that
first define the terminology and paradigm description that were later adopted. As shown in Figure 2,
the formative years of distributed systems between 1960 and 1996 saw an average delay of 13 years, whilst after
the adoption of the WWW the average delay was 8.8 years. It is observable that most paradigms are conceived
and created within 3-10 years, with the exception of those between 1960 and 1990, which is likely due to
insufficient technology when first envisioned. Later paradigms again appear to be relatively short in
duration to create, likely a by-product of the increased maturity of the research area, combined with its
pervasiveness within society and the growth of research activity within each respective paradigm (i.e. there is a
sizable proportion of distributed systems researchers who focus on a particular paradigm).
5.1 Accelerated Paradigm Specialization
It is observable that specific distributed system paradigms have a particular affinity for tackling different
objectives; whilst Cloud computing is capable of handling generalized application workloads, paradigms such
as Edge computing and Fog computing have been envisioned to be particularly effective for sensor actuation
and increasingly important latency requirements. A growing number of microprocessors are being designed
to accelerate specific tasks (such as graphics and machine learning using GPUs and NPUs, respectively). In
tandem, the end of Moore's law indicates that by 2025 chip density will reach a scale at which heat dissipation
and quantum uncertainty make transistors unreliable [54]. Combining these factors, it is
apparent that computing systems are in the process of undergoing massive diversification. This diversification is
not solely limited to hardware but can also be observed in software.
For example, the last decade has seen resource management undergo a transition from centralized monolithic
scheduling to decentralized scheduler architectures [47][48][66]. Centralized schedulers maintain a global view of
cluster state and are therefore able to make high-quality placement decisions at the cost of latency [3][4][49]
[50]. Decentralized schedulers, in contrast, maintain only partial state about the cluster, and so are able to
make low-latency decisions at the cost of placement quality [51]. As a result, we envision that further
diversification and fragmentation of the distributed paradigm will continue to accelerate and affect all of its
respective elements. For example, it is not hard to envision that the systems enabling autonomous vehicle
operation will be substantially different to those of remote sensor networks and smart
phones; we are already seeing such diversification in the creation of custom OSs and applications for these
scenarios. In the case of cluster resource management, there has been increased research activity in
hybrid schedulers, capable of multiplexing centralized and decentralized architectures [52][53], and we expect
that future distributed systems must be capable of architectural adaptivity in response to changes in
operation.
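The trade-off between global and partial cluster state can be sketched as follows; the sampling-based placement is written in the spirit of decentralized schedulers such as Sparrow [51], and all function names and the load model are illustrative:

```python
import random

def centralized_place(loads):
    """Global view: scan every node and pick the least loaded.
    High-quality placement, but collecting and scanning full
    cluster state costs latency."""
    return min(range(len(loads)), key=loads.__getitem__)

def decentralized_place(loads, samples=2, rng=random):
    """Partial view: probe a few random nodes and take the best of
    those. Low latency, but placement quality is probabilistic."""
    probed = rng.sample(range(len(loads)), samples)
    return min(probed, key=loads.__getitem__)

loads = [5, 1, 7, 3, 9]                 # pending tasks per node
best = centralized_place(loads)         # the global optimum
quick = decentralized_place(loads)      # best of two random probes
```

A hybrid scheduler in the style of [52][53] would route long-running or placement-sensitive work through the centralized path and latency-sensitive short tasks through the sampling path.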
community, where there is a substantive reliance on simulation or small to medium-scale distributed systems,
it will become increasingly difficult to evaluate the effectiveness of approaches when exposed
to emergent behaviour within systems at scale. Whilst production systems from industry can greatly support
the understanding of distributed systems at scale, they do not provide an avenue for conducting experiments within a
controlled environment to test hypotheses effectively.
Learning systems [73] (comprising clusters of GPUs dedicated to Deep Learning applications) require
effective energy-management-aware scheduling policies [70]. As such, new orchestration mechanisms are required,
capable of capturing GPU, CPU and memory energy characteristics [71] and informing new scheduling algorithms
that prioritise energy consumption in contrast with traditional performance and fairness scheduling objectives
[60][77][78]. Such schedulers should holistically consider energy consumption and account for out-of-band
costs, including the impact of workload consolidation on cooling systems [60][78]. Furthermore, exergy and
energy source can be utilised to further inform datacentre operators about the carbon impact of their
infrastructure. Hybrid energy grids utilizing green, intermittent, decentralised energy sources such as
solar and wind can provide clean energy, whilst brown energy sources can be utilized at peak times, minimizing
reliance on fossil-fuel energy sources and achieving new sustainable computing standards [72].
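An energy-prioritising placement policy of the kind described above might be sketched as follows; the node model, PUE values and cooling penalty are illustrative assumptions, not taken from [60][70][78]:

```python
def energy_aware_place(task_watts, nodes):
    """Toy energy-first placement: estimate the marginal energy of
    running the task on each candidate node, including a crude
    cooling overhead that grows with consolidation, and pick the
    cheapest. Each node is a dict with 'util' (load fraction, 0-1)
    and 'pue' (facility power usage effectiveness)."""
    def marginal_energy(node):
        # Dynamic power added by the task, inflated by facility PUE;
        # heavily loaded nodes pay a higher cooling penalty.
        cooling_penalty = 1.0 + 0.5 * node["util"]
        return task_watts * node["pue"] * cooling_penalty

    return min(range(len(nodes)), key=lambda i: marginal_energy(nodes[i]))

nodes = [
    {"util": 0.9, "pue": 1.6},  # busy node in an inefficient facility
    {"util": 0.2, "pue": 1.2},  # lightly loaded node in a greener facility
]
choice = energy_aware_place(task_watts=50, nodes=nodes)
```

A production policy would replace these constants with measured per-device energy characteristics [71] and weigh energy against the performance and fairness objectives it displaces.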
6. Conclusions
In this paper, we have discussed and evaluated the evolution of the distributed paradigm over the past six
decades by focussing on the development and decentralised pivoting of networked computing systems. We
have identified the core elements of distributed systems by describing their physical infrastructure, logical
entities and communication models. We have examined how cross-cutting factors such as conceptual and physical
models influence centralisation and decentralisation across the various paradigms. We have observed long-term trends
in distributed systems research by identifying influential links between system paradigms and technological
breakthroughs. Of particular interest, we have observed that distributed system paradigms underwent a
long history of decentralisation up until the inception of the World Wide Web. In the following years,
pervasive computing paradigms such as the Internet of Things, brought about by advancements and
specialisation in microprocessor architecture, operating system design, and networking infrastructure,
further diversified both infrastructure and conceptual systems. Furthermore, it is apparent that the
diversification of distributed systems paradigms that began at the conception of the World Wide Web is likely to
accelerate further, due to an increased emphasis on decentralisation and the prioritization of specialized hardware
and software for particular problems within domains such as machine learning and robotics. This is
somewhat removed from the past few decades, which have emphasized the generality and portability of distributed
system operation, and as such will be the focus of research efforts over the coming years. Moreover, there are
potentially difficult challenges on the horizon related to the upfront cost of operating large systems testbeds,
out of reach for most academic laboratories, and the impact of climate change and how it shapes future
system design.
Acknowledgements
This work is supported by the UK Engineering and Physical Sciences Research Council (EP/P031617/1).
References
[1] M. Armbrust et al., “Above the Clouds: A Berkeley View of Cloud Computing,” EECS Dep. Univ.
California, Berkeley, no. JANUARY, pp. 1–25, 2009.
[2] A. Botta, W. De Donato, V. Persico, and A. Pescap, “Integration of Cloud Computing and Internet of
Things : A Survey”, Future Generation Computer Systems, Vol. 56, pp. 684-700, 2016.
[3] M. I. Xinghuo Yu, Fellow IEEE, and Yusheng Xue, “Smart Grids: A Cyber–Physical Systems Perspective,”
Proc. IEEE | Vol. 104, vol. 104, no. 5, pp. 1058–1070, 2016.
[4] Cisco Systems, “Fog Computing and the Internet of Things: Extend the Cloud to Where the Things Are,”
Www.Cisco.Com, p. 6, 2016.
[5] Leslie Lamport, “Time, clocks, and the ordering of events in a distributed system,” Commun. ACM, vol. 21,
no. 7, pp. 558–565, 1978.
[6] K. W. Chow Yuan-Chieh, “Models for dynamic load balancing in a heterogeneous multiple processor
system,” IEEE Trans. Comput., vol. C, no. 5, pp. 354–361, 1979.
[7] A. D. Birrell and B. J. A. Y. Nelson, “Implementing Remote Procedure Calls,” vol. 2, no. 1, pp. 39–59,
1984.
[8] T. G. Walker Bruce, Popek Gerald, English Robert, Kline Charles, “The LOCUS Distributed Operating
System,” pp. 49–70, 1983.
[9] A. D. Birrell, R. Levin, M. D. Schroeder, and R. M. Needham, “Grapevine: an exercise in distributed
computing,” Commun. ACM, vol. 25, no. 4, pp. 260–274, 1982.
[10] L. Lamport, R. Shostak, and M. Pease, “The Byzantine Generals Problem,” ACM Trans. Program. Lang.
Syst., vol. 4, no. 3, pp. 382–401, 1982.
[11] P. H. Enslow, “What is a Distributed Data Processing System?,” vol. 11, no. 1, pp. 13–21, 1978.
[12] L. Gerard, “Distributed Systems - Towards a Formal Approach,” IFIP Congr., 1977.
[13] D. Thain, T. Tannenbaum, and M. Livny, “Distributed computing in practice: The Condor experience,”
Concurr. Comput. Pract. Exp., vol. 17, no. 2–4, pp. 323–356, 2005.
[14] C. Fidge, “Logical Time in Distributed Computing Systems,” Computer, pp. 28–33,
1991.
[15] F. Mattern, “Virtual Time and Global States of Distributed Systems,” SIAM J. Comput., vol. 28, no. 5,
pp. 1829–1847, 1999.
[16] L. C. Algirdas Avižienis, Laprie Jean-Claude, Randell Brian, “Basic Concepts and Taxonomy of Dependable
and Secure Computing,” IEEE Trans. Dependable Secur. Comput., vol. 1, no. 1, pp. 11–33, 2004.
[17] V. S. Sunderam, G. A. Geist, J. Dongarra, and R. Manchek, “The PVM concurrent computing system:
Evolution, experiences, and trends,” Parallel Comput., vol. 20, no. 4, pp. 531–545, 1994.
[18] W. Gropp, “An Introduction to MPI Parallel Programming with the Message Passing Interface,” pp. 1–48,
1998.
[19] P. K. Gummadi, S. D. Gdbble, and U. Washington, “A Measurement Study of Napster and Gnutella as
Examples of Peer-to-Peer File Sharing Systems,” Comput. Commun. Rev., no. January, p. 2002, 2002.
[20] D. P. Anderson, J. Cobb, E. Korpela, M. Lebofsky, and D. Werthimer, “Seti@home An Experiment in
Public-Resource Computing,” Commun. ACM, vol. 45, no. 11, pp. 56–61, 2002.
[21] I. Foster, Y. Zhao, I. Raicu, and S. Lu, “Cloud Computing and Grid Computing 360-degree compared,” Grid
Comput. Environ. Work. GCE 2008, pp. 1–10, 2008.
[22] P. Mell and T. Grance, “The NIST Definition of Cloud Computing Recommendations of the National
Institute of Standards and Technology,” Nist Spec. Publ., vol. 145, p. 7, 2011.
[23] R. K. Naha et al., “Fog Computing: Survey of Trends, Architectures, Requirements, and Research
Directions,” vol. X, pp. 1–31, 2018.
[24] R. Baheti and H. Gill, “Cyber-physical Systems,” Impact Control Technol., no. 1, pp. 161--166, 2011.
[25] S. Karnouskos, “Cyber-physical systems in the SmartGrid,” 2011 9th IEEE Int. Conf. Ind. Informatics, vol.
1 VN-re, 2011.
[26] D. Evans, “The Internet of Things - How the Next Evolution of the Internet is Changing Everything,”
CISCO white Pap., no. April, pp. 1–11, 2011.
[27] S. S. Gill, P. Garraghan, and R. Buyya. "ROUTER: Fog enabled cloud based intelligent resource
management approach for smart home IoT devices." Journal of Systems and Software 154 (2019): 125-138.
[28] S. Singh and I. Chana. "A survey on resource scheduling in cloud computing: Issues and challenges."
Journal of grid computing 14, no. 2 (2016): 217-264.
[29] M. J. Flynn, “Very High-speed Computing Systems,” vol. 54, no. 12, pp. 1901–1909, 1966.
[30] S. Singh, I. Chana and M. Singh. "The journey of QoS-aware autonomic cloud computing." IT Professional
19, no. 2 (2017): 42-49.
[31] J. K. Casavant Thomas, “A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems,”
vol. 14, no. 2, 1988.
[32] K. Compton and S. Hauck, “Reconfigurable Computing : A Survey of Systems and Software,” vol. 34, no. 2,
pp. 171–210, 2002.
[33] J. Yu and R. Buyya, “A Taxonomy of Workflow Management Systems for Grid Computing,” pp. 1–31.
[34] S. Singh and I. Chana, “QoS-Aware Autonomic Resource Management in Cloud Computing: A Systematic
Review,” vol. 48, no. 3, 2015.
[35] A. Celesti, “Open Issues in Scheduling Microservices in the Cloud,” pp. 81–88, 2016.
[36] B. M. Leiner et al., “A Brief History of the Internet,” Internet Society (ISOC), pp. 1–18, 2000.
[37] V. G. Cerf and R. E. Kahn, “A Protocol for Packet Network Intercommunication,” 1974; reprinted in ACM
SIGCOMM Comput. Commun. Rev., vol. 35, no. 2, pp. 71–82, 2005.
[38] D. K. Mockapetris Paul, “Development of the Domain Name System,” SIGCOMM ’88 Symp. Commun.
Archit. Protoc., 1988.
[39] D. Lindsay, S. S. Gill, and P. Garraghan. "PRISM: an experiment framework for straggler analytics in
containerized clusters." In Proceedings of the 5th International Workshop on Container Technologies and
Container Clouds, pp. 13-18. 2019.
[40] C. Peltz, “Web services orchestration and choreography,” IEEE Internet Comput., 36 (10), 46–52, 2003.
[41] I. Foster, C. Kesselman, and S. Tuecke, “The Anatomy of the Grid,” Hand Clin., vol. 17, no. 4, pp. 525–532,
2001.
[42] T. Sterling, D. J. Becker, D. Savarase, J. E. Dorband, U. A. Ranawake, and C. V Packer, “BEOWULF: A
parallel workstation for scientific computation,” Proceedings of the 24th International Conference on
Parallel Processing. pp. 2–5, 1995.
[43] S. S. Gill, X. Ouyang, and P. Garraghan. "Tails in the cloud: a survey and taxonomy of straggler
management within large-scale cloud data centres." The Journal of Supercomputing (2020): 1-40
[44] A. Whitmore, A. Agarwal, and L. Da Xu, “The Internet of Things — A survey of topics and trends,” no.
March 2014, pp. 261–274, 2015.
[45] A. Brogi, S. Forti, C. Guerrero, and I. Lera, “How to Place Your Apps in the Fog - State of the Art and Open
Challenges,” 2019.
[46] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge Computing: Vision and Challenges,” IEEE Internet
Things J., vol. 3, no. 5, pp. 637–646, 2016.
[47] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz, S. Shenker, and I. Stoica,
“Mesos: A platform for fine-grained resource sharing in the data center.,” in NSDI, 2011, vol. 11, pp. 22–22.
[48] V. Vavilapallih, A. Murthyh, C. Douglasm, M. Konarh, R. Evansy, T. Gravesy, J. Lowey, S. Sethh, B. Sahah,
C. Curinom, O. O’Malleyh, S. Agarwali, H. Shahh, S. Radiah, B. Reed, and E. Baldeschwieler, “Apache
Hadoop YARN,” in SoCC , 2013, pp. 1–16.
[49] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes, “Large-scale cluster
management at google with Borg,” in Proceedings of the Tenth European Conference on Computer Systems,
EuroSys ’15, (New York, NY, USA), ACM, 2015, pp. 18:1–18:17.
[50] I. Gog, M. Schwarzkopf, A. Gleave, R. M. N. Watson, and S. Hand, “Firmament: Fast, centralized cluster
scheduling at scale,” in Proc. 12th USENIX Symp. Oper. Syst. Design Implement., 2016, pp. 99–115.
[51] K. Ousterhout, P. Wendell, M. Zaharia, I. Stoica, “Sparrow: distributed, low latency scheduling”,
Proceedings of the 24th ACM Symposium on Operating Systems Principles, 2013, pp. 69-84.
[52] P. Delgado, F. Dinu, A.-M. Kermarrec, and W. Zwaenepoel, “Hawk: Hybrid datacenter scheduling,” in
USENIX ATC, 2015, pp. 499–510.
[53] K. Karanasos, S. Rao, C. Curino, C. Douglas, K. Chaliparambil, G. M. Fumarola, S. Heddaya, R.
Ramakrishnan, and S. Sakalanaga, “Mercury: Hybrid centralized and distributed scheduling in large shared
clusters,” in USENIX ATC, 2015, pp. 485–497.
[54] M. Waldrop “The Chips are Down for Moore’s Law”, Nature, 2016.
[55] G. Blair “Complex Distributed Systems: The Need for Fresh Perspectives”, IEEE ICDCS, 1410-1421, 2018.
[56] X. Liao, “Moving from Exascale to Zettascale Computing: Challenges and Techniques”, Frontiers of
Information Technology & Electronic Engineering, pp. 1236-1244, 2018.
[57] W. V. Heddeghem, et al. “Trends in Worldwide ICT Electricity Consumption from 2007 to 2012”, Computer
Communications, 2014.
[58] C. Gossart, “Rebound Effects and ICT: A Review of the Literature”, ICT Innovations for Sustainability,
pp.435-448, 2014.
[59] IPCC, “Global Warming of 1.5 °C”, Intergovernmental Panel on Climate Change, 2018.
[60] X. Li, et al “Holistic virtual machine scheduling in cloud datacenters towards minimizing total energy”,
IEEE Transactions on Parallel and Distributed Systems, pp. 1317-1331, 2018.
[61] G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,”
AFIPS spring Jt. Comput. Conf., pp. 1–4, 1967.
[62] S. S. Gill and A. Shaghaghi. "Security-Aware Autonomic Allocation of Cloud Resources: A Model, Research
Trends, and Future Directions." Journal of Organizational and End User Computing (JOEUC) 32, no. 3
(2020): 15-22.
[63] P. Garraghan, et al “Emergent Failures: Rethinking Cloud Reliability at Scale”, IEEE Cloud Computing, vol.
5, pp. 12-21, 2018.
[64] J. Gao, “Machine Learning Applications for Data Center Optimization”, Google White Paper, 2014.
[65] W. Xiao, et al, “Gandiva: Introspective Cluster Scheduling for Deep Learning”, in OSDI, 2018.
[66] S. S. Gill et al. "Transformative Effects of IoT, Blockchain and Artificial Intelligence on Cloud Computing:
Evolution, Vision, Trends and Open Challenges." Internet of Things (2019): vol. 8, 100118.
[67] A. J. Ferrer, J. Manuel Marquès, and J. Jorba. "Towards the decentralised cloud: Survey on approaches and
challenges for mobile, ad hoc, and edge computing." ACM Computing Surveys 51, no. 6 (2019): 1-36.
[68] M. A. Khan, F. Algarni, and M. T. Quasim. "Decentralised Internet of Things." In Decentralised Internet of
Things, pp. 3-20. Springer, Cham, 2020.
[69] I. Psaras. "Decentralised edge-computing and iot through distributed trust." In Proceedings of the 16th
Annual International Conference on Mobile Systems, Applications, and Services, pp. 505-507. 2018.
[70] S. S. Gill, P. Garraghan, V. Stankovski, G. Casale, R. K. Thulasiram, S. K. Ghosh, K. Ramamohanarao, and
R. Buyya. "Holistic resource management for sustainable and reliable cloud computing: An innovative
solution to global challenge." Journal of Systems and Software 155 (2019): 104-129.
[71] R. Yang, C. Hu, X. Sun, P. Garraghan, T. Wo, Z. Wen, H. Peng, J. Xu, and C. Li. "Performance-aware
speculative resource oversubscription for large-scale clusters." IEEE Transactions on Parallel and
Distributed Systems 31, no. 7 (2020): 1499-1517.
[72] S. S. Gill, S. Tuli, A. N. Toosi, F. Cuadrado, P. Garraghan, R. Bahsoon, H. Lutfiyya et al. "ThermoSim: Deep
learning based framework for modeling and simulation of thermal-aware resource management for cloud
computing environments." Journal of Systems and Software 164 (2020): 110596.
[73] W. Xiao, R.Bhardwaj, R. Ramjee, M. Sivathanu, N. Kwatra, Z. Han, P. Patel, X. Peng, H. Zhao, Q. Zhang, F.
Yang, L. Zhou. 2018. Gandiva: introspective cluster scheduling for deep learning. In Proceedings of the 13th
USENIX conference on Operating Systems Design and Implementation (OSDI’18). USENIX Association,
USA, 595–610.
[74] B. Burns, B. Grant, D. Oppenheimer, E. Brewer, and J. Wilkes, “Borg, omega, and kubernetes,” Commun.
ACM, vol. 59, no. 5, pp. 50–57, 2016.
[75] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, “Discretized streams: Fault-tolerant
streaming computation at scale,” SOSP 2013 - Proc. 24th ACM Symp. Oper. Syst. Princ., no. 1, pp. 423–
438, 2013.
[76] S. Arnautov et al., “SCONE: Secure linux containers with Intel SGX,” Proc. 12th USENIX Symp. Oper.
Syst. Des. Implementation, OSDI 2016, pp. 689–703, 2016.
[77] M. Kaufmann and K. Kourtis, “The HCl Scheduler: Going all-in on Heterogeneity,” in 9th USENIX
Workshop on Hot Topics in Cloud Computing (HotCloud 17), pp. 1–7, 2017.
[78] K. Ma, X. Li, W. Chen, C. Zhang, and X. Wang, “GreenGPU: A holistic approach to energy efficiency in
GPU-CPU heterogeneous architectures,” Proc. Int. Conf. Parallel Process., pp. 48–57, 2012.
[79] A. Alqahtani, E. Solaiman, P. Patel, S. Dustdar, R. Ranjan (2019). Service level agreement specification for
end-to-end IoT application ecosystems. Software: Practice and Experience, 49, 12, pp. 1689-1711
[80] A. Chandra, J. Weissman, and B. Heintz. "Decentralized edge clouds." IEEE Internet Computing 17, no. 5
(2013).