Introduction to Stream Processing with Apache Flink®
Kostas Kloudas
Vasia Kalavri
Jonas Traub
Who are we?
Kostas: software engineer @ data Artisans
Vasia: PhD student @ KTH Stockholm
Jonas: research associate @ TU Berlin
Overview
What is Stream Processing?
What is Apache Flink?
Windowed computations over streams
Handling time
Handling node failures
Handling planned downtime
Handling code upgrades
Demo instructions…
Robust Stream Processing with Apache Flink®: A Simple Walkthrough
http://data-artisans.com/robust-stream-processing-flink-walkthrough/#more-1181
Make sure you download: Apache Flink 1.0.3
Stateless stream processing
Stateful stream processing
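The difference between the two can be sketched in a few lines of plain Python. This is only an illustration and does not use the Flink API; the names `to_fahrenheit`, `RunningMax`, and the sample sensor readings are assumptions made up for this example.

```python
def to_fahrenheit(reading):
    """Stateless: the output depends only on the current element."""
    sensor_id, celsius = reading
    return (sensor_id, celsius * 9 / 5 + 32)


class RunningMax:
    """Stateful: remembers the highest value seen so far for each sensor."""

    def __init__(self):
        self.max_by_sensor = {}

    def process(self, reading):
        sensor_id, value = reading
        # Fall back to the current value when the sensor has no state yet.
        current = self.max_by_sensor.get(sensor_id, value)
        self.max_by_sensor[sensor_id] = max(current, value)
        return (sensor_id, self.max_by_sensor[sensor_id])


# Hypothetical (sensor_id, value) readings in arrival order.
stream = [("a", 20), ("b", 25), ("a", 15), ("a", 30)]

stateless_out = [to_fahrenheit(r) for r in stream]

operator = RunningMax()
stateful_out = [operator.process(r) for r in stream]
```

The stateless function can be restarted at any point without losing anything, while the stateful operator has to keep `max_by_sensor` safe across failures, which is what the later parts of the talk address.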
Why should you care?
Data production is and has always been a
continuous process.
Stream processing enables the obvious:
Continuous processing on data that is
continuously produced
What is Apache Flink?
A data processing engine
Apache Flink is an open source platform for
distributed stream and batch processing
The Apache Flink Ecosystem
[Figure: ecosystem diagram; surviving label: SQL]
What does Flink provide?
High Throughput and Low Latency
• Yahoo! Benchmark: https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
• Extended by data Artisans: http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
Event-time (out-of-order) processing
Exactly-once semantics
Flexible windowing
Fault-Tolerance
Time for demo…
Robust Stream Processing with Apache Flink®: A Simple Walkthrough
http://data-artisans.com/robust-stream-processing-flink-walkthrough/#more-1181
Setup:
Sensor Data
Windowed computations
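As a rough illustration of what a windowed computation does, the sketch below groups events into fixed-size (tumbling) windows per key and sums their values. It is plain Python, not Flink code; the function name `tumbling_window_sums`, the tuple layout `(key, timestamp, value)`, and the sample data are assumptions for this example.

```python
from collections import defaultdict


def tumbling_window_sums(events, window_size):
    """Sum values per (key, window_start).

    Each window covers the half-open interval
    [window_start, window_start + window_size).
    """
    sums = defaultdict(int)
    for key, timestamp, value in events:
        # Align the timestamp down to the start of its window.
        window_start = timestamp - (timestamp % window_size)
        sums[(key, window_start)] += value
    return dict(sums)


# Hypothetical (sensor, timestamp, value) events.
events = [("s1", 1, 10), ("s1", 4, 5), ("s2", 3, 7), ("s1", 6, 2), ("s2", 11, 1)]
result = tumbling_window_sums(events, window_size=5)
```

This batch-style version sees all events at once. A streaming engine has to decide when a window is complete while data is still arriving, which is where the handling of time comes in.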
Handling time
The system has to respect the same clock
as the data.
Event Time vs Processing Time
Event Time (episode):            IV    V     VI    I     II    III   VII
Processing Time (release year):  1977  1980  1983  1999  2002  2005  2015
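The same example can be written down as data: each film carries an event-time position (its episode number) and a processing-time position (its release year). The sketch below is plain Python; the names `releases` and `ROMAN` are made up for the illustration.

```python
# (episode, release year), listed in the order the films reached the audience.
releases = [
    ("IV", 1977), ("V", 1980), ("VI", 1983),
    ("I", 1999), ("II", 2002), ("III", 2005), ("VII", 2015),
]

# Helper to turn the episode numeral into a sortable event-time position.
ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6, "VII": 7}

# Processing-time order: the order in which the elements arrived.
processing_order = [ep for ep, _ in sorted(releases, key=lambda r: r[1])]

# Event-time order: the order in which the events happened in the story.
event_order = [ep for ep, _ in sorted(releases, key=lambda r: ROMAN[r[0]])]
```

The two orders differ, so a computation that groups by arrival time gives different results from one that groups by the timestamps carried in the data.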
Handling time: Watermarks
Watermarks are special events generated by the sources.
A watermark for time T states that event
time has progressed to T in that particular
stream (or partition).
After it, no events with a timestamp smaller
than T are expected to arrive.
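One way to picture this is a source that trails the highest timestamp it has seen by a fixed allowance. That "bounded delay" heuristic is an assumption chosen for this sketch, not something the slides prescribe, and the function name `assign_watermarks` is hypothetical. Events whose timestamp is already below the current watermark break the watermark's promise and are flagged as late.

```python
def assign_watermarks(timestamps, max_delay):
    """Return (timestamp, watermark, is_late) for each event in arrival order.

    Assumed heuristic: watermark = highest timestamp seen so far - max_delay.
    """
    watermark = float("-inf")
    max_seen = float("-inf")
    trace = []
    for ts in timestamps:
        # An event below the current watermark arrives after the source
        # already announced that event time has passed it.
        is_late = ts < watermark
        max_seen = max(max_seen, ts)
        # Watermarks never move backwards.
        watermark = max(watermark, max_seen - max_delay)
        trace.append((ts, watermark, is_late))
    return trace


# Hypothetical out-of-order event timestamps.
trace = assign_watermarks([1, 3, 2, 7, 4, 9, 5], max_delay=2)
late = [ts for ts, _, is_late in trace if is_late]
```

With a larger `max_delay` fewer events are late, but windows have to wait longer before they can be considered complete.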
Sources emit elements and watermarks…
…operators always emit the lowest watermark received across their inputs
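A small sketch of that rule for an operator with several inputs: it tracks the latest watermark per input and forwards the minimum, and only when that minimum advances. The class name `WatermarkTracker` and its method are hypothetical and written in plain Python, not taken from Flink.

```python
class WatermarkTracker:
    """Tracks per-input watermarks and emits the minimum across all inputs."""

    def __init__(self, num_inputs):
        self.per_input = [float("-inf")] * num_inputs
        self.emitted = float("-inf")

    def on_watermark(self, input_index, watermark):
        """Record a watermark; return the new output watermark, or None."""
        self.per_input[input_index] = max(self.per_input[input_index], watermark)
        candidate = min(self.per_input)
        if candidate > self.emitted:
            self.emitted = candidate
            return candidate
        # The slowest input has not advanced, so nothing new is emitted.
        return None


tracker = WatermarkTracker(num_inputs=2)
outputs = [
    tracker.on_watermark(0, 5),  # input 1 has sent nothing yet
    tracker.on_watermark(1, 3),  # minimum is now 3
    tracker.on_watermark(1, 8),  # minimum is now 5 (held back by input 0)
    tracker.on_watermark(0, 6),  # minimum is now 6
]
```

Taking the minimum is the conservative choice: the operator only claims that event time has reached T once every input has said so.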
Handling node failures
Checkpoints
Sources emit elements and checkpoint barriers…
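The recovery idea can be sketched in a single process: snapshot the operator state together with the input position, and after a failure restore both and replay from that position. This is a deliberately simplified illustration in plain Python. It does not model the markers flowing through a distributed dataflow, and the class `CountingJob` and its methods are hypothetical.

```python
import copy


class CountingJob:
    """Counts events per key and reads its input by offset, so it can be replayed."""

    def __init__(self, events):
        self.events = events
        self.offset = 0
        self.counts = {}

    def step(self):
        key = self.events[self.offset]
        self.counts[key] = self.counts.get(key, 0) + 1
        self.offset += 1

    def snapshot(self):
        # State and input position are captured together.
        return {"offset": self.offset, "counts": copy.deepcopy(self.counts)}

    def restore(self, snap):
        self.offset = snap["offset"]
        self.counts = copy.deepcopy(snap["counts"])


events = ["a", "b", "a", "c", "a", "b"]
job = CountingJob(events)

for _ in range(3):
    job.step()
checkpoint = job.snapshot()

# Two more events are processed, then the job "fails" and that progress is lost.
job.step()
job.step()
job.restore(checkpoint)

# Replay from the checkpointed offset to the end of the input.
while job.offset < len(events):
    job.step()
final_counts = job.counts
```

Because the counts and the offset are restored together, each event contributes to the final state exactly once, even though two events were processed twice.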
Handling planned downtime
Handling code upgrades
Is Apache Flink only that?
Apache Flink is an open source platform for
distributed stream and batch processing
Its lively community
[Figure: Apache Flink Community Growth. Three charts covering Feb. 2015 to Aug. 2016: Stars on GitHub (axis 0 to 1800), Contributors (axis 0 to 250), Forks on GitHub (axis 0 to 1200).]
You can join:
• Follow: @ApacheFlink, @dataArtisans
• Read: flink.apache.org/blog, data-artisans.com/blog
• Subscribe: (news | user | dev) @ flink.apache.org
Its Users
…https://flink.apache.org/poweredby.html
All of them will meet at...
http://flink-forward.org/
Further Reading
Event-time processing:
• The Dataflow Model: http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
• http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/
Checkpointing and State:
• Distributed Snapshots: Determining Global States of Distributed Systems
http://research.microsoft.com/en-us/um/people/lamport/pubs/chandy.pdf
• Lightweight Asynchronous Snapshots for Distributed Dataflows
https://arxiv.org/abs/1506.08603
• Working with State in Flink: https://ci.apache.org/projects/flink/flink-docs-master/dev/state.html
Savepoints:
• https://ci.apache.org/projects/flink/flink-docs-master/setup/savepoints.html