NILS TRUBKIN
Typeset in LaTeX
Gothenburg, Sweden 2023
An exploratory study of trade-offs in traditional vs. serverless stream processing
NILS TRUBKIN
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
Abstract
A stream is the natural form of data that is perpetually being generated.
Stream processing is a way to draw valuable insights from a data stream. With the
rapid increase in data volumes, primarily driven by IoT devices, stream processing
has emerged as a practical approach to data processing. Some characteristics, such
as the volumes of data and their distribution, can vary over time, leading to changes
in the computational requirements of such streaming applications. Adapting the
frameworks in use to these changing requirements calls for elasticity. As the
traditional frameworks commonly used to run stream processing applications,
known as Stream Processing Engines (SPEs), are not flexible enough, there is often
some degree of over-provisioning, meaning that the allocated resources are greater
than required and remain underutilized. Alternative approaches, such as serverless
computing, can ease scalability, but the approach has both pros and cons, which this
work delves into. This work has implemented an SPE-like API for a serverless
framework and, with its help, explores the differences between the traditional and
serverless models of stream processing using Apache Flink and Apache OpenWhisk.
The study shows that OpenWhisk can be used to implement and execute
streaming applications similar to those run by Flink. With correctly implemented
logic and code, behavior similar to Flink's can be achieved in OpenWhisk. The
serverless nature of OpenWhisk, with its pay-per-use pricing model, allows for reduced
costs while the framework is idle. The performance evaluation used a stateless type
of application (one that does not require the application's state to be preserved
across multiple executions) utilizing the map() API, as well as a stateful type of
application (one that requires the state to be preserved across multiple executions)
utilizing the windowAll() API with a sum aggregate. The findings indicate a latency
increase of 300-400% in the most intensive test cases and a throughput lowered to
50% for OpenWhisk compared to Flink.
The conclusions reveal that Flink exhibits greater capacity and performance
than OpenWhisk for comparable workloads. Flink's extensive resource base,
including its APIs and support resources, makes it easier to develop applications
and positions it as a robust and well-established solution. On the other hand,
OpenWhisk is best suited for projects that do not require rich stream processing
libraries or explicit state management. Its high-level scalability abstraction, built
on Kubernetes, simplifies scaling operations. Both frameworks can be configured
to behave similarly, with various benefits and trade-offs depending on the individual
use case.
Acknowledgements
I’m grateful to Vincenzo Gulisano for their invaluable guidance and feedback that
improved this work.
Contents

1 Introduction
    1.1 Problem definition
    1.2 Scope of the work
    1.3 Thesis Organization

2 Background
    2.1 Stream processing
    2.2 Windows
    2.3 Use Case Example
    2.4 Aggregation of data streams
    2.5 Watermarks in data streams
    2.6 Streaming frameworks
    2.7 Serverless frameworks
    2.8 Apache Flink
    2.9 Apache OpenWhisk
    2.10 Infrastructure for stream processing
    2.11 Scaling of stream processing applications and infrastructure

3 Related Work

4 Use Cases
    4.1 Examples of use cases for Flink
    4.2 Examples of use cases for OpenWhisk
    4.3 Picking the right tool for the job

5 Features and Capabilities

6 Ease of Use and Deployment

7 System Overview and Architecture

8 Evaluation
    8.1 System Overview
    8.2 'KNN'-test
    8.3 'Twitter'-test
    8.4 General observations of OpenWhisk
    8.5 General observations of Flink
    8.6 Results

9 Performance and Scalability

10 Conclusion

Bibliography
1 Introduction
The digitization of society and the emergence of the Internet of Things (IoT) have led
to an explosion of data volumes and requirements for real-time processing. Companies
and organizations need to draw valuable insights from the data and be able to analyze,
alert, and make decisions based on the data being generated and processed.
Computational devices can now be found outside data centers, in small sensor and
monitoring devices. New requirements on data processing techniques have led to new
paradigms, each with its trade-offs. Some approaches revolve around moving the
processing down to the individual devices, while others concentrate on moving the
data up. While each approach has specific trade-offs in efficiency and performance,
stream data processing has gained much traction due to its ability to handle data
quickly in real time and provide valuable insights[1][2].
Fig. 1.1 illustrates the volumes of data and the prediction for the years 2010-2025. In
the last three years (2020-2023), the volumes have doubled, and the trend seems set to
continue for years to come[3]. As shown in Fig. 1.2, the increased volumes come from
IoT devices, as the number of non-IoT devices stays at the same level[4]. As IoT
devices are usually low-power and offer limited computing power, the data must be
sent somewhere else to be processed meaningfully, rapidly increasing transmitted data
volumes. As a result, spending in all sectors utilizing IT in the form of cloud applications
and data analytics is increasing quickly. The increase for 2020/2021 is presented in Fig.
1.3. Spending on cloud applications and cloud infrastructure has increased by 83% and
71%, respectively, while spending on data analytics has increased by 63%[5]. Such a
significant increase over many years will compound and lead to costs that are magnitudes
higher just a few years later if there is no change to the approaches used to process
large data volumes.
Figure 1.1: Data volume growth and prediction for the years 2010-2025[3]
In a batch-based approach, there is a considerable delay from when a data point is
created to the final results. All the data has to arrive before processing begins. Not
only does the batch collection have to be finalized to begin the processing, but it also
has to be uploaded to the data warehouse and processed from start to finish, which
takes time.
Furthermore, this process typically does not involve reusing calculations for
overlapping data parts in subsequent runs. In certain instances, like with a standard
map-reduce operation, it may rely more heavily on disk than on memory. Fig. 1.4 shows
a typical batch-processing timeline. In that timeline, data generation is immediately
followed by storage in the database. The process is repeated for all data generated
until a batch of a predefined size or interval is collected. The collection of one or
multiple batches is then transferred to a data warehouse, where the processing and
associated calculations take place. Once the batch processing is completed, the
output result is produced and can be retrieved and used for the intended purpose.
On the other hand, streaming applications are used to process and analyze data as it
is being generated or received. A streaming application runs on a stream processing
engine (SPE), a framework that can be configured for one or multiple stream jobs.
While relatively new, this approach is well-suited to the task. Almost all data
is created as a continuous stream of events; it is hard to give an example of a
data set that is generated all at once and has no continuity[9]. Examples of data that
can be processed as a stream include stock prices, temperatures, traffic flow, and
manufacturing process data. This type of data is found in modern cyber-physical
systems such as smart grids, intelligent vehicular networks, and agricultural or
ecological projects[10]. The data can be processed and analyzed to help draw
conclusions and gain meaningful insights into how the system could be improved, or
to help detect anomalies quickly. This type of data processing is known as stateful
stream processing. While stateless streaming
operators exist, it is common for streaming operators to maintain a state. A state can
be preserved for an indefinite amount of time, thus avoiding the need for recalculating
it. Stateful stream processing is, therefore, more efficient and offers lower latency
from when a new data point is ingested to the result being produced. Fig. 1.5
shows a typical stream-processing timeline. In the timeline, the generated data goes
straight into SPE for processing, after which the output result is produced. The
delay of such a system can be very low, depending on the application.
When it comes to a batch-based approach, it is easy to identify the beginning and the
end of the finite input data. Streaming, however, presents a challenge because streams,
as described in detail in Sec. 2.1 and Fig. 2.1, are unbounded. Despite this, the state
maintained and updated while processing streams cannot grow indefinitely. To handle
this, the focus typically lies on the most recent portion of the data, which is divided
into finite segments that can be processed in smaller portions using windows, as
explained in Sec. 2.2.
On the other hand, the batch-based approach poses no such challenge since the data
sets always have a defined beginning and end. If any state is maintained, it will only
be kept for the current batch processing.
The following works and examples will explore different ways streaming technology
is used in today's digital systems. They will also explain how it can benefit various
domains, such as data processing and industrial manufacturing.
A work by Najdataei et al. describes a system that can optimize stream processing
and related clustering by reusing the calculated data across multiple iterations of
the algorithm[11]. As the processing of one batch is finished and the next is started,
all the calculated variables are usually reset, and a new batch is calculated from
scratch. In that case, the calculated data is not preserved in memory across the
batches, which may negatively affect latency if it has to be recalculated. Latencies
of a couple of hours to multiple days before a data point appears in the results are
typical[9].
A stream processing system that can help save material, energy, and associated costs
in 3D manufacturing has been described in work by Gulisano et al. By analyzing
the data generated during the process, the system can improve the final product’s
quality and detect defects early on, allowing for timely adjustments to the process[12].
A system like this can be used in many different types of manufacturing where
the process is lengthy and materials are costly.
A typical serverless request flow begins with the client, which creates a request and
sends it to the serverless application endpoint. Once received by the endpoint, the
request fires an event trigger that, according to the configured rules of the application,
invokes one or more functions to process the request. The function returns a result,
and a reply is built to respond to the client.
This work compares a traditional SPE and a serverless framework in the context
of stream processing applications. The comparison aims to convey the differences
between the frameworks in running streaming applications and to discuss the trade-offs
of each approach. The motivation behind comparing a traditional SPE and a serverless
framework in stream processing applications is to understand how a serverless
framework can be configured to perform the tasks of a traditional SPE.
Due to the increasing costs in the IT sector related to stream processing, an option
where a serverless framework handles the task would tackle the problem of
over-provisioning of resources[5]. It is interesting to find out what trade-offs and
challenges are present in the area of serverless computing and how they can be
approached. On a broader level, it is interesting to understand if and how a serverless
framework can replace a traditional SPE and implement an API similar to that of
traditional SPEs. The discussion describes the use cases most appropriate for each
framework, their features and capabilities, performance and scalability, ease of use
and deployment, observations, and references to related research in the field. It aims
to provide insights and a better understanding of the strengths and weaknesses of
each approach.
Fig. 1.7 illustrates a comparative overview of traditional and serverless architectures
in a broad context. The top half of the figure shows a traditional architecture where
a client connects to a server’s front end that handles the requests and computes the
response, often based on a database in the back end. The serverless architecture shown
in the lower half eliminates the need for a dedicated server. Instead, it delegates the
tasks of managing the requests to a serverless framework, which executes functions
and actions to fulfill the request. Once there are no requests to handle, the serverless
framework can go into an idle state, effectively pausing the cloud service billing for
the application. Details on billing for serverless frameworks are discussed in Secs. 7.1
and 6.2.
Figure 1.7: Traditional and serverless architecture comparison
The experimental evaluation assesses the performance metrics for the applications
implemented for each framework. The results translate well to real-world scenarios,
as they simulate a load comparable to that of simple streaming applications. The
work's primary focus is on the frameworks' differences, requirements, and similarities
regarding the goals achievable with each approach.
The discussion will mention and briefly summarize the differences without a detailed
discussion of, for example, redundancy, security, and reliability. Observations and
ease of use will be covered to the extent required during the implementation and
evaluation phases. Deployment will be discussed with different use cases in mind,
based on the experience from this work and the information available on the topic.
A good understanding of the Docker system and the higher-level Kubernetes framework
is necessary to deploy OpenWhisk in its production deployment mode using Kubernetes,
which is outside this paper's scope. Therefore, OpenWhisk was deployed in its
standalone mode.
2 Background
replenishment can be requested at short notice based on the stream processing result.
Stream processing implies that the data is moved up to a processing unit that
accumulates the streams from various sources and processes them according to the
implemented logic.
2.2 Windows
As streams are unbounded and can be arbitrarily large, they create the challenge
of maintaining and updating a state within finite memory. One common way
of approaching this challenge is to divide the stream into finite segments
known as windows. Windows come in many different types and sizes, and various
implementations exist for each type. This approach focuses on a finite portion of the
data, which can be seen as a smaller batch that includes the most recent data.
Windows are often used with aggregate functions, reducing the window’s contents
into a single value based on how the function is designed. Typical aggregate functions
include maximum, minimum, and average, which display the largest, smallest and
average values in given windows. More information on the aggregation of data
streams is presented in Sec. 2.4.
At the end of each hour, it sums the detections to output a throughput metric: the
number of cars in the past hour.
Fig. 2.2 illustrates such a system, with input and output schema and the aggregate
parameters. In the figure, three tumbling windows are defined, with a window size
of 60 minutes (one hour). The first window contains one car, the second window
contains three cars, and the third window contains two cars.
The input tuple schema defines a boolean for detection, which is set to true, and the
timestamp of the event, in this case in minutes. Usually, the timestamp is expressed
in milliseconds since the UNIX epoch[16], but for this particular example, we express
it in minutes and begin at zero for simplicity. What is essential is that the timestamp
is granular enough for the task at hand; in our case, the granularity is one minute.
The output schema contains the throughput result (the sum of all the vehicles in the
past hour) and a timestamp to indicate when the result was produced. The result is
produced at the end of each hour, and a new window is created that begins counting
from zero.
Figure 2.2: Use case example input and output schema and the parameters of the
aggregate
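
To make the use case concrete, the following is a minimal sketch of how this hourly car count could be written against a traditional SPE API, here Flink's DataStream API. The Detection class, the source, and the sink are illustrative stand-ins rather than parts of the original example, and the timestamps are in milliseconds rather than minutes:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;

public class CarThroughputJob {
    // Input tuple from Fig. 2.2: a detection flag and an event timestamp.
    public static class Detection {
        public boolean detected;
        public long timestamp;
        public Detection() {}
        public Detection(boolean detected, long timestamp) {
            this.detected = detected;
            this.timestamp = timestamp;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(new Detection(true, 0L))   // stand-in for the sensor source
           .assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<Detection>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                    .withTimestampAssigner((d, ts) -> d.timestamp))
           .map(d -> 1L).returns(Types.LONG)        // one count per detected car
           .windowAll(TumblingEventTimeWindows.of(Time.hours(1)))
           .reduce(Long::sum)                       // cars per one-hour window
           .print();                                // stand-in for the output sink

        env.execute("car-throughput");
    }
}

The windowAll()-with-sum shape used here is the same kind of stateful aggregate that the evaluation in Sec. 8 exercises.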
Operator chaining is an optimization that reduces the overhead between the different
tasks of a streaming application, increasing processing speed. These tasks are individual
transformations, be it a map or an aggregation function, and they can be chained.
Chaining is achieved by coupling one function's output to another's input via a function
call, avoiding serialization and deserialization, which results in higher efficiency and
faster processing.
Another feature Flink offers is checkpoints. These can be configured to be created at
a specific interval and used as a backup of the current Flink system state. In case of
a system failure, a checkpoint can restore the system to the previous state. Various
configuration options exist depending on how critical it is for tuples to be processed
exactly once, at least once, or neither.
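
As a rough sketch of how these options appear in code, the fragment below enables periodic checkpointing and shows Flink's per-operator chaining controls; the ten-second interval and the toy operators are assumptions for illustration, not the configuration used in this work:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint the job state every 10 seconds; exactly-once is the
        // strictest of the processing guarantees mentioned above.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        env.fromElements(1, 2, 3)
           .map(x -> x * 2).startNewChain()        // begin a new operator chain here
           .filter(x -> x > 2).disableChaining()   // keep this operator unchained
           .print();

        env.execute("checkpointed-job");
    }
}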
Flink can be configured as separate clusters of machines within the same project to
offer parallelization, load balancing, and redundancy. However, while scaling can
be achieved, it requires manual configuration. Such a configuration would require
the developer to identify the requirements, configure the cluster size, and tune Flink
parameters such as checkpointing intervals, network buffers, task parallelism, and
data serialization.
Frameworks offered in the cloud as a service, such as Amazon Kinesis, do not require
manual maintenance and can be automatically updated and maintained by the hosting
provider.
OpenWhisk, on the other hand, aims to be a cloud service first and can be used
on a pay-per-use basis. That means that the developer is billed only for the time
their actions are actively running and not for when they are idle. As a result,
the hosting provider automatically handles all of the infrastructure, allowing the
developer to deploy code to an already configured system. This setup differs from
when a developer chooses to set up their own Kubernetes cluster with the OpenWhisk
framework running, as that requires considerably more configuration and knowledge
to be done correctly.
The described issue is a case of over-provisioning. When a developer rents a
software/hardware setup, the cost is set at a fixed rate and does not depend on the
amount of load or resource utilization. This billing model frequently results in charges
that exceed the actual usage of the system, because the same rate is charged even
during periods of low utilization, when only a tiny amount of work is performed.
The benefits of selecting a cloud instance of OpenWhisk come from the fact that no
billing is incurred when no actions are executed. It is also a much simpler process,
as developers only need to submit their code to deploy it on their instance immediately.
In addition, if the developer wishes to scale up or down, the OpenWhisk provider
takes on the configuration task with minimal developer interaction required.
On the other hand, running a self-hosted instance of OpenWhisk in a cloud or locally
requires the developer to configure the underlying Docker/Kubernetes environment
to deploy the OpenWhisk instance. However, once configured, both cases offer
similar features, as OpenWhisk does not have to be restarted when new code
is submitted. Other factors like downtime, redundancy, and security should also be
considered when picking the deployment variant, as data centers usually have better
security routines for confidentiality, integrity, and availability.
In the system described in Sec. 2.3, the car distribution will not be uniform throughout
the day. Instead, there will be distinct peaks and valleys. We can assume that the
throughput is low, 1000 cars per day, and that adding each entry and aggregating the
values takes a fraction of a second. Recording and processing the data will, over
24 hours, accumulate just a couple of seconds of execution time. Therefore,
an OpenWhisk instance billed per use will only bill a small amount based on action
execution time. However, an identical application deployed on a generic cloud
computing server running an OpenWhisk instance or a traditional SPE will be billed
for the entire 24-hour period, regardless of the system being idle most of the time.
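
A back-of-the-envelope calculation makes the gap explicit (assuming, for illustration, 1000 invocations per day at 100 ms of compute each):

    1000 x 0.1 s = 100 s of billed time per day (pay-per-use)
    24 x 3600 s = 86,400 s of billed time per day (always-on instance)
    utilization = 100 / 86,400 ~ 0.12%

Under these assumed figures, the always-on deployment bills for roughly 860 times more compute time than the work actually requires.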
A serverless approach can shorten the time from the idea to a live application. It
removes the need to understand and configure servers and to scale them up or down
when needed. In addition, this approach puts the logic first and frees up resources
that would otherwise be needed to configure a cluster of data streaming applications.
In contrast, a streaming application query running on a traditional SPE cannot be
changed once deployed. If an application has to be modified, the current job
hosting the application has to be terminated, and a new job containing the updated
logic has to be deployed. Depending on the complexity of the application and the
consequences of downtime, this can introduce additional complexity related to
deploying updates to the application. In some cases, the current state of the
operators might need to be backed up and restored after re-deployment. Maintaining
availability of the service during the re-deployment process can also introduce
additional downsides and requires careful planning of the procedure.
The use case example from Sec. 2.3 could be scaled both in terms of computational
requirements and in terms of results produced and insights gained. Computational
requirements could be increased by more traffic or by additional roads being added to
the streaming application. Additional features could include further data processing
and functionality, such as multiple results produced, for example, for the last hour,
day, and week, with the mean, maximum, and sum of each metric.
3 Related Work
Work within the area of stream processing covers a wide range of topics. This
section will mention some work that focuses mainly on the use cases and the
usefulness of the technology in other systems. It will also discuss other types of work
that concentrate more on the performance and optimization aspects of stream
processing itself. Furthermore, the section will touch upon some work in serverless
processing, most of which aims to understand how to use it most effectively and what
patterns make for good architecture and scalability of the code base.
The works most related to this paper include one by Taibi et al., a multivocal literature
review of patterns for serverless functions, which brings forward the point that while
data processing can be expensive in a serverless setting, it allows many well-defined
patterns to be put together to form a very scalable and fault-tolerant system[19].
The literature review differs from the current work mainly in the scope of
covered use cases. While the current work focuses on streams and compares two
ways of processing them, the literature review focuses on serverless only and gives
a much broader perspective of possible scenarios. It discusses patterns for
authorization, availability, communication, and aggregation. These patterns are not
discussed in depth in the current work, although some are briefly mentioned in Sec. 2.
The system's performance is a topic not covered by the literature review; it is listed
as future work in its conclusion. The current work performs performance measurements
to draw meaningful conclusions via experimental evaluation and to compare the
expected latency and throughput of the tested frameworks.
Another work related to the comparison of stream processing methods is one
comparing window types. Verwiebe et al. detail the types of aggregation that
windows can support and methods for calculating them. The different approaches
can yield different performance depending on, amongst other things, whether all
the tuples have to be stored, as well as the ability to perform approximate window
calculations when precision is not essential. Different frameworks for
processing windows were compared, and Flink was found to have the most extensive
support for various window types and to allow a high level of configuration[20].
The survey relates to the current work in that it compares different approaches to
stream processing. Similar to the previous related work, the survey focuses not on
performance metrics but on each window type's capabilities, benefits, and drawbacks. The
similarities include the comparison of various SPEs to highlight which types of windows
each of them supports.
Window aggregation is a widespread type of stream processing. It is a suitable
candidate use case when discussing and comparing different frameworks and their
stream processing capabilities. Both the survey and the current work frame the
discussion around general-purpose stream processing systems. However, the survey
provides a more detailed examination of specific window types and their resulting
performance consequences.
Stream processing in serverless environments is new, and less work is focused on it
today than on stream processing with the help of traditional SPEs. Works available
today often compare different setups and implementations within traditional SPEs
and only one system, such as Flink. Most stream processing work aims to understand
what bottlenecks exist and how to tackle them. This includes works such as [11],
[18], [21] and [22].
Performing comparable measurements between two different approaches has its
challenges, including minimizing the differences in the testing environment. The
evaluation results can be affected if the frameworks have different ways of executing
logic, such as the programming language used, the libraries, and the available
features.
For the evaluation results to be comparable, differences in the testing environment
should be minimized; one approach is to develop equivalent implementations of
the logic in each framework using comparable code. Minor implementation differences
regarding buffer sizes, memory allocation, and even debug printouts can drastically
shift the performance evaluation metrics, making the difference between the
frameworks look much larger than it is. Such differences can be challenging to
uncover and address without deep knowledge of the languages and algorithms used
in the implementation.
An evaluation will usually include testing the same function over a set of data with
slight changes to the configuration. The frameworks used and the tests performed
will have much in common, making the results consistent with minor variance. However,
evaluating different frameworks, each with its own implementation, can affect the
results, as the degrees of freedom in such an evaluation are much larger. The
environments and tests can often have more differences than commonalities, leading
to increased difficulty in performing a fair and unbiased evaluation. These challenges
could partially explain why most works focus on a more restrictive scope rather than
a wide one: the smaller scope entails less variation and more reliable data.
4 Use Cases
Stream processing can be valuable in many different areas revolving around
cyber-physical systems. Some use cases, such as those used in manufacturing, have
been presented in Sec. 1.
Botev et al. have described a system for stream processing that uses aggregation
to identify instances of fraud and irregularities concerning non-technical energy
losses in the field of electricity infrastructure[23]. The solution mainly applies to
developing countries, where energy losses can be as high as 50% due to theft and
power siphoning.
Duvignau et al. define a system to analyze vehicular network data using stream
aggregation in the context of querying data from individual vehicles as data sets[24].
A system of this type can help in both the areas of traffic control as well as energy
balancing when it comes to the charging of electric vehicles.
suited for Flink. Such use cases are well suited because the state is preserved and
quickly available at a later point in time. Use cases running complex logic that
requires long computational times are also better suited for this type of framework.
An example of utilizing state management to optimize an application is described
by Van Rooij and his team, who developed a system to reduce delays and improve
responsiveness when tuples arrive late. The system uses predictive analysis based on
the received data to achieve this[26].
Flink is a well-established SPE and, as such, has a great deal of documentation and
community resources that can aid a developer in building streaming applications.
The availability of support resources can, on its own, be a considerable argument in
favor of Flink for developing applications that need large amounts of documentation
and examples during the development process, as it can decrease the time features
take to implement.
Lastly, other requirements, such as determinism and correctness guarantees, might be
present and must be considered. Gulisano et al. discuss the role of event-time order
in data streaming analysis. Due to the nature of multi-threaded applications and their
ability to produce results out of order, it is essential to produce deterministic results
no matter the order in which the data arrives. Use cases with strict correctness
requirements need less implementation from scratch in this type of framework; while
the same result is achievable in a serverless setting, it requires implementing such
logic from scratch. Watermarks play a crucial role in achieving this: notifying the
data streaming application that all late tuples should have arrived by some point
in time can, depending on the correctness guarantees, offer a reasonable trade-off
between latency and the correctness of the produced result[27].
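
As a small illustration of the mechanism, a bounded-lateness watermark policy can be declared in Flink roughly as follows; the Event class and the five-second bound are assumptions for illustration:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import java.time.Duration;

public class Watermarks {
    // Minimal event carrying its own event-time timestamp (ms since epoch).
    public static class Event {
        public long timestamp;
    }

    static WatermarkStrategy<Event> boundedLateness() {
        // Tuples up to 5 s out of order are waited for; anything later is
        // the correctness/latency trade-off discussed above.
        return WatermarkStrategy
                .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((event, ts) -> event.timestamp)
                .withIdleness(Duration.ofMinutes(1)); // keep time advancing on idle sources
    }
}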
"data buckets" and performed performance evaluations for data in memory and
dedicated S3 storage. The findings indicate that an application must be made more
non-serverless for optimal outcomes since data must be stored in a specific location.
So while resource provisioning, autoscaling, logging, and fault-tolerance are all the
benefits of serverless applications, the additional latency incurred might make them
a poor fit for latency-sensitive applications[28].
OpenWhisk is relatively new and, as such, has a limited amount of documentation and
examples available. This can lead to increased implementation and testing times, as
the support resources can be scarce and hard to come by. Use cases built on this
type of framework should therefore be ready to implement some logic from scratch
and not rely heavily on the well-written guides on best practices and boilerplate code
that are usually freely available on the internet. Serverless computing is still very
much in its infancy and is an open area of research, which can lead to situations
with minimal support available.
Furthermore, a point can be made about the financial model of serverless stream
processing. Use cases with a low or uneven distribution of resource utilization can be
financed with lesser funds for the same amount of computational power. This
financial model makes the framework a good choice for applications looking to be
deployed with the lowest incurred costs.
Secs. 2.10 and 6.2 discuss some other advantages and disadvantages of serverless
systems.
The example presented in Sec. 2.3 could be implemented in either of the frameworks
due to the simplicity of the task. From the point of view of cost-effectiveness,
OpenWhisk, with its pay-per-use model, offers good efficiency for low-to-medium-rate
streams that have no activity much of the time. As low delay and high throughput
are not essential to produce a result in this application, the choice is dictated by
other factors, such as the specifics of the implementation for the particular system
that detects cars.
5 Features and Capabilities
One of Flink's compelling features is its ability to use many different streams as input:
it supports various interfaces, such as file-based sources, message queues, socket
streams, and custom sources. This ability is essential for incorporating Flink into an
already existing system.
However, Flink requires the data to be converted to one of its internal objects, such
as a DataStream, to be used with its stream processing libraries. This requirement
comes from the fact that the data must be serialized between the different parts of
Flink to perform efficiently. DataStream implements Serializable and can transform
arbitrary data into a native type that Flink understands[29].
Flink's limitations include primarily supporting only two languages, Java and Scala,
although it offers experimental support for Python and SQL as well. As the
primary choice, Java allows parts of the code to be unit tested and is generally
debug-friendly, with the ability to pause the execution of selected parts of the code.
On the other hand, getting useful debug information from the Flink runtime is more
complicated, as it often produces errors that do not convey the precise root of the
exception.
One of Flink's strengths, compared to OpenWhisk, is its close reliance on the hardware.
Tangwongsan et al. explored various general improvements and optimizations for
incremental sliding-window aggregation. While the performance benefits are
measurable, they mostly rely on hardware-side factors such as memory allocation and
cache-miss minimization. Moreover, these improvements can be hard to apply in the
serverless setting, where small functions run in a local scope and the variables stored
in memory are deinitialized on return[21].
The primary use cases Flink performs well in include real-time data processing, data
pipelines, batch processing, and complex event processing.
Flink is meant to be efficient, pipelined, and able to process large volumes of data
with minimal overhead. It offers an API specifically designed for processing data
streams, with many operators for mapping, filtering, and aggregating the data. A work
by Gulisano et al. discusses parallelization overheads within stream processing and
proposes a way to avoid them, improving the scalability and performance of streaming
systems[18].
OpenWhisk aims to be a "one size fits all" framework that can, with minimal effort,
be retrofitted into an existing system to connect various parts of the application.
When looking into different ways to connect actions, OpenWhisk offers nearly limitless
possibilities with rules, triggers, and action chains. The encapsulation paradigm
works well for this approach but can also present increased complexity problems for
larger systems. In addition, the serverless approach is still an active field of study,
and best practices are still being clarified, as some research attempts to formulate
them.
Work has been performed in the context of optimizing operations often performed by
streaming applications. For example, Gulisano et al. have shown methods for
implementing stream "join" operations specifically, increasing performance in shared
memory and determinism. These implementations can benefit use cases where the
system is closely integrated with the hardware and unsuitable for serverless
solutions[22].
6 Ease of Use and Deployment
However, OpenWhisk lacks any library that would make it easy to configure stream
processing, and any code related to it has to be created as a custom implementation.
While a custom implementation is possible, it may result in bugs that are hard to
trace and detect, given OpenWhisk's limited logging and debugging abilities. As a
result, bugs in the stream processing implementation could significantly decrease
performance and even affect correctness in some cases.
The deployment of OpenWhisk in its standalone mode is as simple as the deployment
of Flink. However, the standalone mode is generally used for testing logic and has
many limitations relating to the rate of action execution and, as a result, the maximal
performance. A Kubernetes cluster of isolated Docker containers must be configured
to deploy OpenWhisk in its full-feature variant. The configuration of Kubernetes
is relatively complex and requires knowledge of the cloud computing domain to be
done correctly. It also puts high requirements on the hardware and is meant to
be deployed as a cloud instance. If developers wish to leverage the serverless
paradigm, they have the choice of renting a set-up instance of OpenWhisk from IBM
and financing it using a pay-per-use model.
Deploying the example application presented in Sec. 2.3 using OpenWhisk would not
result in over-provisioning, as the framework's resources would scale down and idle
at times when no new tuples are received. The coding effort would be comparably
higher, as the logic would need to be implemented without native libraries. However,
no other tools would be needed to forward the data from the sensors to the SPE via
HTTP requests, as OpenWhisk natively supports them and can receive and process
data this way.
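
For a sense of what such a custom implementation looks like, below is a minimal sketch of a map()-like action in Java. The OpenWhisk Java runtime invokes a static main(JsonObject) entry point based on Gson; the field names and the doubling transformation are illustrative assumptions:

import com.google.gson.JsonArray;
import com.google.gson.JsonObject;

public class MapAction {
    // OpenWhisk's Java runtime calls this static entry point with the
    // request parameters and returns the JsonObject as the response.
    public static JsonObject main(JsonObject args) {
        JsonArray data = args.getAsJsonArray("data"); // assumed input field
        JsonArray mapped = new JsonArray();
        if (data != null) {
            // Per-tuple transformation; doubling stands in for the
            // application's actual map() logic.
            data.forEach(el -> mapped.add(el.getAsDouble() * 2));
        }
        JsonObject body = new JsonObject();
        body.add("result", mapped);
        JsonObject response = new JsonObject();
        response.add("body", body); // web-action style reply, as in Alg. 1
        return response;
    }
}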
As the data volumes change over time, it may be necessary to overprovision
Flink to handle peak loads, but this would provide more extensive analysis capabilities.
On the other hand, OpenWhisk is better suited for scaling up and down based on
demand but may require more coding effort. Both cases have benefits and drawbacks
and require consideration depending on the intended use.
7 System Overview and Architecture
This section explains the systems being compared and evaluated. It also defines the
testing environment in which the experimental evaluation is performed.
This work selected Apache Kafka as the input to Flink; it serves as a message
queue system connected directly to the Flink input. In addition, a Flask web server
receives requests and stores the result data locally in a file. It acts as the Flink
output sink, to which Flink sends the processed data at the end of the operator
graph. Feeding new input via HTTP requests was chosen to make the evaluation
of Flink and OpenWhisk more comparable, as OpenWhisk only accepts HTTP
requests as input. See Fig. 7.1 for the Flink pipeline.
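
The input side of this pipeline corresponds roughly to the sketch below, using Flink's Kafka connector; the broker address, topic, and consumer group are illustrative assumptions:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaInput {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka buffers the tuples posted by the web server and emits them
        // to Flink at a rate the subscriber can sustain.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("tuples-in")
                .setGroupId("flink-eval")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> input = env.fromSource(
                source, WatermarkStrategy.noWatermarks(), "kafka-input");

        input.print(); // stand-in for the streaming job's operator graph
        env.execute("kafka-pipeline");
    }
}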
The dashboard contains two tables: one for the input tuples sent to the streaming
application and one for the output, used to get the results back from the streaming
application; it is shown in Fig. 7.3. It can switch between the two frameworks under
test (Flink or OpenWhisk) and between multiple benchmark modes with different
computational load types.
Furthermore, it can be configured with an interval for repeating the HTTP requests
to the tested service and with a payload size. Size can mean different things depending
on the test being executed, but the main goal is to create batches of a specific size
and send them all at once with each request. Most tests create the tuples using
random values according to the schema.
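
The request loop of such a client can be sketched as follows; the endpoint URL, schema, and parameter values are illustrative assumptions rather than the dashboard's actual code:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class LoadGenerator {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Random rnd = new Random();
        long intervalMs = 2000;  // time between requests
        int batchSize = 100;     // benchmark-specific "size"

        while (true) {
            // Build a batch of random tuples following the test schema.
            String batch = IntStream.range(0, batchSize)
                    .mapToObj(i -> String.valueOf(rnd.nextDouble()))
                    .collect(Collectors.joining(",", "{\"data\":[", "]}"));

            HttpRequest req = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:5000/ingest")) // assumed endpoint
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(batch))
                    .build();

            long start = System.nanoTime();
            HttpResponse<String> resp =
                    client.send(req, HttpResponse.BodyHandlers.ofString());
            long latencyMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(resp.statusCode() + " latency=" + latencyMs + "ms");

            Thread.sleep(intervalMs);
        }
    }
}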
Various inputs to the Flink framework, such as sockets, files, Kafka, and in-memory
storage, were evaluated. While some methods allow for high performance and
processing throughput, they lack a comparable counterpart in OpenWhisk. When
using OpenWhisk, the input must be provided through HTTP requests due to the
platform's design: an OpenWhisk action or trigger can only be executed via an HTTP
request, and the comparison should have as few differences as possible between the
systems. One such input that Flink supports is Kafka, and it is relatively easy to
have Kafka emit the messages received from a simple HTTP web server.
The dashboard serves as a client. It generates tuples according to a specification
and sends them to a local application, OpenWhisk or Flink. It was configured to
automatically perform many instances of the same test with different variables for
interval and size. Interval defines the time between requests, while size controls the
benchmark-specific variable, for example, the size of a tuple. After the tests, each
test’s data is recorded and exported.
Each test generates two files, 'latency' and 'throughput'. The first file includes the
latencies of each request, which can be used, for example, to compute the average
latency. The second file includes the throughput data for each test; it can be used
to understand how many packets were processed versus how many were dropped
due to high receiver load.
Two scripts were created to generate graphs illustrating the throughput and latency
data. The graphs' x-axis is the size of the test, and the y-axis is either the percentage
of successfully sent data tuples or the latency of the response.
7.2.1 Dashboard
The dashboard described in Sec. 7.2 is illustrated in Fig. 7.3. It includes controls for
selecting the current framework and test, indicated by labels one, two, and three.
The tables for input and output are indicated by labels four and five. The input
fields indicated by label six configure the interval and batch size. Label seven points
to timestamp information used for debugging. The download button for exporting
statistics is indicated by label eight.
Other features not visible on the dashboard front page yet configurable directly in
the code include automatic benchmark execution that loops over multiple test
iterations. After each test, a delay is inserted to keep the test results separate so
that they do not interfere. The results are automatically labeled and exported into
files.
processing old data when new tests are started. All periods are configurable but were
set to static values throughout the evaluation. The phase in which the evaluation is
performed and the results are recorded was set to five minutes. The cooldown period
was set to ten seconds; it is the time after the evaluation during which no new
requests are generated, but replies to old requests can still be received and included
in the result data. Once five minutes and ten seconds have passed, the data is
automatically exported and downloaded as a file, and no new data points can be
added to the set. Lastly, a grace period, set to one full minute, takes place to ensure
that the frameworks' load has subsided, that no more processing of old data is
occurring, and that new tests can begin after the delay.
Algorithm 1 addKNN
1: function addKNN(args)
2:   dataArray ← args?.getAsJsonArray("data")
3:   if dataArray is null then
4:     return { "body": { "success": false, "error": "data is null" } }
5:   end if
6:   data ← Gson().fromJson(dataArray, Array<KNNArray>::class.java)
7:   print "add: data"
8:   knn ← KNN()
9:   predictions ← mutableListOf()
10:  for tuple in data.mapNotNull { it.values } do
11:    prediction ← knn.predict(tuple)
12:    predictions.add(prediction)
13:  end for
14:  print "predictions: predictions"
15:  time ← System.currentTimeMillis()
16:  return { "body": { "timestamp": time, "result": predictions } }
17: end function
The first part is presented as pseudo-code in Algorithm 2. In line 2, the incoming
JSON object is converted to an appropriate object. Lines 5 to 10 relate to
establishing contact with the database and reading its contents. Lines 11
to 15 contain a 'for loop' that writes the new tuple data and updates the related
field for the current count. Lines 16 to 18 write the changes back to the database.
Finally, line 21 returns the result of the database operation; it is not related to the
tuple result itself.
The second part is presented as pseudo-code in Algorithm 3. Lines 2 to 7 relate to
establishing contact with the database and reading its contents. Lines 8 to 13 contain
a 'for loop' that, for each user, sorts the entries according to the associated count and
returns the top three most used words for that user. Finally, line 15 returns the
result as a JSON object.
Figure 7.4 illustrates the operational process of the "Twitter" application.
Algorithm 2 addTwitter
1: function addTwitter(args)
2:   tuple ← args?.getAsJsonObject("data")
3:   print "add: tuple"
4:   result ← null
5:   MongoClients.create(uri).use { mongoClient ->
6:     database ← mongoClient.getDatabase("streaming")
7:     collection ← database.getCollection("twitter", TwitterData::class.java)
8:     query ← Document("user", tuple.user)
9:     existingDoc ← collection.find(query).first()
10:    wordsDoc ← existingDoc.values.toMutableMap()
11:    for word in tuple.values do
12:      currentCount ← wordsDoc.get(word) ?: 0
13:      wordsDoc.set(word, currentCount + 1)
14:      print "word: word, currentCount: currentCount"
15:    end for
16:    update ← Document("$set", Document("values", wordsDoc))
17:    options ← UpdateOptions().upsert(true)
18:    result ← collection.updateOne(query, update, options)
19:    print "result: result"
20:  }
21:  return Gson().toJsonTree({ "success": (result.wasAcknowledged() == true), "result": result }).asJsonObject
22: end function
Algorithm 3 getTwitterResult
1: function getResult
2:   const client ← new MongoClient('mongodb://172.17.0.6:27017')
3:   await client.connect()
4:   const database ← client.db('streaming')
5:   const collection ← database.collection('twitter')
6:   const result ← {}
7:   const users ← await collection.find().toArray()
8:   users.forEach(user → {
9:     const values ← user.values
10:    const sortedValues ← Object.entries(values).sort((a, b) → b[1] − a[1])
11:    const top3 ← sortedValues.slice(0, 3)
12:    result[user.user] ← top3
13:  })
14:  client.close()
15:  return { "body": { "result": result } }
16: end function
8 Evaluation
This section covers the evaluation part of the work and the nature of the experiments,
and discusses their results. The experimental evaluation covers both stateful
and stateless applications. The details of the system on which the evaluation is
performed are explained, and the methodology of the tests is described. Furthermore,
details observed and found relevant to the evaluation are highlighted and discussed.
new data tuples can be sent. Kafka is one of the packages that can be used for
this purpose, and while it can do much more than accept HTTP requests, it is used
exclusively as a communicator between the network and Flink. Furthermore, the
network is limited purely to the local host, and the Flask web server handles the
actual communication with the network outside the local host. Kafka acts as a
buffer that temporarily stores messages from the web server and emits them at a
rate appropriate for the subscriber, Flink.
Some implementation is necessary to store the data related to stateful operations in
a serverless environment. Such an implementation can take many different forms;
the traditional and standard way is to use a database. However, a database does not
have to be used: the state can also be implemented as a file, a variable, or a
persistent state. The primary consideration is that it cannot be implemented inside
the OpenWhisk framework, as the deployed containers are not persistent and can be
shut down or reset at any time, as described in the documentation. While OpenWhisk
has persistent settings and rules that can be written and read, they are not meant
for application-specific variables and should not be used for this purpose.
A local or remote database is the implementation a streaming application will most
likely use, as it is the preferred way in many tutorials and examples[30]. Therefore,
the choice was made to go with MongoDB. The database is deployed using Docker as
an isolated container. It can be reached locally by the OpenWhisk actions to write
and read the data in JSON format, or documents, as they are internally referred to
within MongoDB.
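
The resulting pattern is that each action invocation opens a connection and upserts its partial state. A minimal sketch with the MongoDB Java driver, mirroring the word counts of the 'Twitter' test (the URI and the field names are assumptions):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class ExternalState {
    public static void main(String[] args) {
        // The action's state lives in MongoDB, not in the container, because
        // OpenWhisk containers can be reset between invocations.
        try (MongoClient client =
                 MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> col = client
                    .getDatabase("streaming")
                    .getCollection("twitter");
            col.updateOne(
                    Filters.eq("user", "alice"),
                    Updates.inc("values.hello", 1),    // bump one word count
                    new UpdateOptions().upsert(true)); // create doc if absent
        }
    }
}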
8.2 ’KNN’-test
To perform a test with a stateless application, the application should receive a tuple,
process it in some way, and return the result, all without storing any persistent
variables.
Such a test is well suited to serverless applications. It does not use any database,
since the operation is stateless, and it can be performed over a large set of tuples to
increase the workload. The result returned is a list of predictions for the tuples.
The computations are similar to window aggregation, although on a much shorter
time scale. A test of this type evaluates the system's ability to perform computations
unaffected by the overhead that comes with storing persistent variables; in this case,
there is no overhead associated with contacting a database on the local host.
The KNN test is an implementation of the K-nearest-neighbors algorithm. The Iris
flower data set, also known as Fisher's Iris data set, consisting of 4-dimensional
tuples with float values[31], is used. Requests are sent to the stream processing
framework in batches of a predefined size. Then, all the tuples are processed, a
prediction out of 3 possible classes is made for each tuple, and the predictions are
returned in the response. In this manner, the application can receive and classify
tuples containing information about objects; in this case, the objects classified are
flowers, classified according to type. The prediction is based on the four properties
of each flower.
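
The per-tuple classification can be sketched as a plain K-nearest-neighbors predictor; the code below is a self-contained illustration (the choice of k and the data handling are assumptions, not the thesis's exact implementation):

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class Knn {
    // One labeled training tuple: four Iris features and a class label (0-2).
    record Sample(double[] features, int label) {}

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum); // Euclidean distance over the 4 features
    }

    // Majority vote among the k training samples closest to the query tuple.
    static int predict(List<Sample> train, double[] query, int k) {
        return train.stream()
                .sorted(Comparator.comparingDouble(s -> distance(s.features(), query)))
                .limit(k)
                .collect(Collectors.groupingBy(Sample::label, Collectors.counting()))
                .entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow();
    }
}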
8.3 ’Twitter’-test
To perform a test with a stateful application, the application should receive a tuple,
process it in some way, and return a reply while storing the tuple as a persistent
variable. In this case, the operation is a window aggregation, and the tuples belonging
to a window are the persistent variables.
The Twitter test aims to aggregate strings and find out which words a specific user
uses the most. The request contains a username alongside a set of configurable size
containing randomly selected words from a defined list. Once the stream processor
receives the request, the data is aggregated every 5 seconds, and the result produced
returns the three most used words of all time for the specified user, as well as the
total count for these three words. The operator is stateful and acts as a sum window
of infinite size. It also sorts the results to return the top three most repeated values,
an extra but comparably light operation.
When the limit is exceeded, the responses to action requests return the error
"429 TOO MANY REQUESTS". This limitation makes it impossible to run the
actions more than once per second (one action for sending data and one more
action for fetching the result). To ensure that no overlapping occurs and triggers
the error over time, an interval of 1100 ms was found suitable as the "fastest"
possible rate.
If the processing algorithm does not require long computation times, there is yet
another obstacle when sending data to OpenWhisk: if an HTTP request exceeds a
specific size, OpenWhisk will deny it, claiming that the request is too long.
This limitation can only be solved by splitting the request in two and sending one
after the other, at which point the previous issues regarding the maximum rate of
requests come into play.
The performance and function of OpenWhisk in standalone mode can be affected by
the configured Java heap size, as the standalone mode is essentially a Java application
under the hood. In many long tests with large data sets, a Java out-of-memory error
was observed numerous times when the heap size was set to 4 GB. Increasing the
Java heap size to 8 GB via an environment variable solved the issue for the tests run
within this paper's scope.
Evaluating OpenWhisk accurately in its standalone mode presents some challenges
related to computational overhead, as it exceeds the computational overhead of Flink
and imposes the limitations discussed in Secs. 7.1 and 6.2. Therefore, for use cases
requiring a very high rate of data submission and result polling, OpenWhisk must
be appropriately deployed using Kubernetes and the related configuration file.
OpenWhisk has also shown a significant deviation in latency at the beginning of
some tests, which required some adjustments to the testing methodology: tests and
measurements were performed and recorded after an initial warm-up period, once
the latencies had stabilized. This deviation is caused by the design of OpenWhisk,
which starts and allocates resources for the containers lazily, in other words, only
when required and not earlier.
8.6 Results
Tests performed with the help of the dashboard described in Sec. 7.2.1 produced two
types of results: throughput and latency. The throughput is collected as the total
size of the input packages sent during the test and the responses (acknowledgments
of successful reception) received. Since the length of each test is statically set to five
minutes, the throughput is expressed in units over time, that is, tuples processed per
five minutes.
The second metric collected is latency. The latency is measured at the input request;
therefore, it does not include the entire path through all the intermediary systems. Latency
shows how quickly the pipeline can consume new data and sustain a configured
throughput rate without dropping requests.
On the other hand, if we consider the time it takes for the input request to be
handled and the response to be received, we end up with an accurate metric for the
system load and the time it takes to process a request of a specific size. This latter
metric has shown a stable linear relation between the request size, the system load,
and the observed latency value. Moreover, it shows a linear increase with a bigger
input size or a shorter request interval, which is expected and much more interesting
than a metric limited by the window calculation interval.
The results presented below were produced by running the tests for 5 minutes per
size configuration. Each size was tested with two intervals, 2000 ms and 1100 ms.
The second option creates nearly twice the workload and approaches the standalone
OpenWhisk's maximum capacity without risking triggering the limit error.
Care was taken to ensure that the CPU was not busy performing background tasks,
and an initial grace period, during which the results were not recorded, was allowed
before starting the benchmark. This period ensured that all the systems were
running, all containers had started up, and all one-time calculations at the start had
been executed and would not influence the result.
Test points resulting in no package throughput for an OpenWhisk test mean that,
at that size, the request entity is too large to be processed and is denied by
OpenWhisk entirely, as can be seen in Figs. 8.5, 8.6, 8.7, and 8.8.
Figures 8.1 and 8.2 show the latency and throughput metrics for the Flink KNN test
described in Sec. 8.2 with intervals of 1100 ms and 2000 ms, respectively. Comparable
performance can be observed, indicating that stateless applications scale very well
with Flink. The throughput can be doubled with a negligible effect on latency and
no dropped packets in this test.
Figure 8.1: (a) Latency vs. Batch Size; (b) Throughput vs. Batch Size (Flink KNN test, 1100 ms interval)
Figure 8.2: (a) Latency vs. Batch Size; (b) Throughput vs. Batch Size (Flink KNN test, 2000 ms interval)
Figures 8.3 and 8.4 show the latency and throughput metrics for the Flink Twitter test
described in Sec. 8.3 with intervals of 1100 ms and 2000 ms, respectively. Comparable
performance can be observed, indicating that stateful applications also scale very
well with Flink. The throughput can be doubled with some effect on latency and no
dropped packets in this test. The biggest increase in latency is about 10 ms, which is
acceptable given the doubled throughput.
Figure 8.3: (a) Latency vs. Batch Size; (b) Throughput vs. Batch Size (Flink Twitter test, 1100 ms interval)
Figure 8.4: (a) Latency vs. Batch Size; (b) Throughput vs. Batch Size (Flink Twitter test, 2000 ms interval)
Figures 8.5 and 8.6 show the latency and throughput metrics for the OpenWhisk
KNN test described in Sec. 8.2 with intervals of 1100 ms and 2000 ms, respectively.
Comparable performance can be observed, indicating that stateless applications scale
very well with OpenWhisk. The throughput can be doubled with little effect on
latency. Dropped packets can be observed at large batch sizes; this is related to the
request size being too big and has no connection to the increased throughput. The
biggest latency increase is around 50 ms, which is acceptable given the latency
magnitude of 500 ms.
Figure 8.5: (a) Latency vs. Batch Size; (b) Throughput vs. Batch Size (OpenWhisk KNN test, 1100 ms interval)
Figure 8.6: (a) Latency vs. Batch Size; (b) Throughput vs. Batch Size (OpenWhisk KNN test, 2000 ms interval)
Figures 8.7 and 8.8 show the latency and throughput metrics for the OpenWhisk
Twitter test described in Sec. 8.3 with intervals of 1100 ms and 2000 ms, respectively.
The worsened performance indicates that stateful applications are the most difficult
to scale well with OpenWhisk. The throughput can be doubled with some effect on
latency. Dropped packets can be observed throughout the second part of the test,
related to the increased throughput. The biggest latency increase is around 20 ms,
which is acceptable given the latency magnitude of 140 ms. It is, however, clear that
latency increases quickly with additional throughput, and at a certain point the
framework cannot keep up. Dropped packets can also be observed at large batch
sizes, related to the request size being too big.
Figure 8.7: (a) Latency vs. Batch Size; (b) Throughput vs. Batch Size (OpenWhisk Twitter test, 1100 ms interval)
Figure 8.8: (a) Latency vs. Batch Size; (b) Throughput vs. Batch Size (OpenWhisk Twitter test, 2000 ms interval)
9 Performance and Scalability
As described in Sec. 1.1, this work compares the performance and scalability of the
Flink and OpenWhisk frameworks. The results presented in Sec. 8.6 are discussed in
this section.
or a magnitude lower for applications deployed with Flink. Sustainable data rates
are also higher for Flink, and its limitation on packet size is less restrictive than
OpenWhisk’s.
10 Conclusion
Data volumes are rapidly increasing, primarily due to the growing usage of IoT
devices, as can be observed in Figs. 1.1 and 1.2. Stream processing has been
established as an effective way to process data streams continuously. As streams
evolve, the computational needs and requirements associated with processing them
change. Traditional SPEs lack the necessary elasticity to efficiently adapt to
changing requirements and optimize computational resource usage. As more data
is generated, the costs associated with analyzing and processing it increase. This
means that traditional SPEs are prone to over-provisioning and to not utilizing the
available resources to a high degree, which can become a problem that requires
reconsidering the available deployment options to keep costs down.
This work has implemented a prototype SPE-like API for OpenWhisk so that it can
be utilized similarly to the traditional SPEs. It has shown that streaming applications
can be implemented in both frameworks and highlighted the differences in outcomes.
It was shown that, with the correct code and logic, an API similar to Flink's can be
implemented in OpenWhisk. With the pay-per-use model, costs can be reduced while
the framework is idle.
The evaluation of the frameworks has yielded the findings outlined in Section 8.
The tests performed included a stateless application utilizing Flink's map() API
and the prototype equivalent for OpenWhisk. For the stateful application test, the
second test utilized the windowAll() and watermarks APIs and their prototype
equivalents. The results indicate that Flink possesses greater capacity and
performance for workloads comparable to OpenWhisk's. It is easier to write
applications for Flink due to the provided APIs and the existing resource base
consisting of communities and applications written for it. Flink can be a good choice
for use cases that look for robust and well-established solutions.
Looking at the results for stateless applications, the latency increase in the case of
OpenWhisk was around 500-900%, growing with the batch size. The bandwidth was
equivalent for Flink and OpenWhisk, meaning that both frameworks handled the
data rate as expected, although only up to a specific size, after which OpenWhisk
denied the requests for being too long. In the case of the stateful application, a
300-400% increase in latency was observed. During the most intense part of the test,
bandwidth was about 50% lower for OpenWhisk due to the many requests denied
because of the action limit set in OpenWhisk's standalone mode.
Future work in this direction could explore more generic implementations of the
OpenWhisk-related APIs that could be more easily used in applications, leading to
faster development times when relying on these APIs. The development of such an
API would significantly close the gap between the traditional and serverless
frameworks regarding the development coding effort.
In conclusion, both Apache Flink and Apache OpenWhisk offer valuable options for
stream processing, with their respective strengths and considerations. The choice
between the two frameworks depends on the project’s requirements, the need for
advanced features, the expected development and coding effort, and the financing
strategy.
Bibliography