Apache Flink Stream Processing
Suneel Marthi
@suneelmarthi
Washington DC Apache Flink Meetup,
Capital One, Vienna, VA
November 19, 2015
Source Code
2
https://github.com/smarthi/DC-FlinkMeetup
Flink Stack
3
[Diagram: the Flink stack — specialized abstractions/APIs and core APIs layered over the Flink core runtime and deployment options, all built on the streaming dataflow runtime]
The Full Flink Stack
[Diagram: libraries — Gelly, Table, ML, SAMOA, MRQL, Cascading, Dataflow (WiP), Storm compatibility (WiP), Zeppelin — sit on the DataSet (Java/Scala) and DataStream APIs plus Hadoop M/R compatibility, all on the streaming dataflow runtime; deployment modes: Local, Cluster (YARN, Tez), Embedded]
Stream Processing ?
▪ Real-world data doesn’t originate in micro-batches; it is pushed through systems as a continuous stream.
▪ Stream analysis today is largely an extension of the batch paradigm.
▪ Recent frameworks such as Apache Flink and Confluent’s platform are built to handle streaming data natively.
5
[Diagram: Web server → Kafka topic]
Requirements for a Stream Processor
▪ Low Latency
▪ Quick Results (milliseconds)
▪ High Throughput
▪ able to handle millions of events/sec
▪ Exactly-once guarantees
▪ Deliver results in failure scenarios
6
Fault Tolerance in Streaming
▪ at least once: all operators see all events
▪ Storm: re-processes the entire stream in
failure scenarios
▪ exactly once: operators do not perform
duplicate updates to their state
▪ Flink: Distributed Snapshots
▪ Spark: Micro-batches
7
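In Flink, the exactly-once guarantee mentioned above comes from periodic distributed snapshots (checkpoints) of operator state. As a hedged sketch of how this is switched on (the 1000 ms interval is an illustrative value, not from the slides):

```java
// Illustrative only: enable Flink's checkpoint-based fault tolerance.
final StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(1000); // snapshot operator state every second
```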
Batch is an extension of Streaming
▪ Batch: process a bounded
stream (DataSet) on a stream
processor
▪ Form a Global Window over
the entire DataSet for join or
grouping operations
Flink Window Processing
9
Courtesy: Data Artisans
What is a Window?
▪ Grouping of elements into finite buckets
▪ by timestamps
▪ by record counts
▪ Windows have a maximum timestamp, after which all
elements that need to be assigned to the window will
have arrived.
10
Why Window?
▪ Process subsets of Streams
▪ based on timestamps
▪ or by record counts
▪ Windows have a maximum timestamp, after which all
elements that need to be assigned to the window will
have arrived.
11
Different Window Schemes
▪ Global Windows: All incoming elements are assigned to the same
window
stream.window(GlobalWindows.create());
▪ Tumbling time Windows: elements are assigned to a window of a
certain size (5 sec below) based on their timestamp; each element is
assigned to exactly one window
keyedStream.timeWindow(Time.of(5, TimeUnit.SECONDS));
▪ Sliding time Windows: elements are assigned to a window of a
certain size based on their timestamp; windows “slide” by the
provided value and hence overlap
stream.window(SlidingTimeWindows.of(Time.of(5, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS)));
12
Different Window Schemes
▪ Tumbling count Windows: defines a window of 1000
elements that “tumbles”. Elements are grouped in order
of arrival into groups of 1000 elements; each element
belongs to exactly one window
stream.countWindow(1000);
▪ Sliding count Windows: defines a window of 1000
elements that slides every 100 elements; elements
can belong to multiple windows
stream.countWindow(1000, 100);
13
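Independent of Flink, the tumbling vs. sliding count-window semantics above can be sketched in plain Java (the class and method names here are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class CountWindowSketch {
    // Tumbling: non-overlapping groups of `size` elements; a trailing
    // partial group is omitted, since a count window fires only when full.
    static <T> List<List<T>> tumbling(List<T> in, int size) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i + size <= in.size(); i += size) {
            out.add(new ArrayList<>(in.subList(i, i + size)));
        }
        return out;
    }

    // Sliding: windows of `size` elements starting every `slide`
    // elements; consecutive windows overlap when slide < size.
    static <T> List<List<T>> sliding(List<T> in, int size, int slide) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i + size <= in.size(); i += slide) {
            out.add(new ArrayList<>(in.subList(i, i + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> events = List.of(1, 2, 3, 4, 5, 6, 7);
        System.out.println(tumbling(events, 3));   // [[1, 2, 3], [4, 5, 6]]
        System.out.println(sliding(events, 3, 2)); // [[1, 2, 3], [3, 4, 5], [5, 6, 7]]
    }
}
```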
Tumbling Count Windows Animation
14
Courtesy: Data Artisans
Count Windows
15
Tumbling Count Window, Size = 3
Count Windows
22
Tumbling Count Window, Size = 3
Sliding every 2 elements
Flink Streaming API
26
Flink DataStream API
27
public class StreamingWordCount {
  public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    // Create a DataStream from lines in a file
    DataStream<String> text = env.readTextFile("/path");
    DataStream<Tuple2<String, Integer>> counts = text
        .flatMap(new LineSplitter())
        // Converts DataStream -> KeyedStream
        .keyBy(0) // Group by first element of the Tuple
        .sum(1);
    counts.print();
    env.execute("Execute Streaming Word Counts"); // Execute the WordCount job
  }

  // FlatMap implementation which converts each line to many <Word, 1> pairs
  public static class LineSplitter implements
      FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
      for (String word : line.split(" ")) {
        out.collect(new Tuple2<String, Integer>(word, 1));
      }
    }
  }
}
Source code: https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/StreamingWordCount.java
Streaming WordCount (Explained)
▪ Obtain a StreamExecutionEnvironment
▪ Connect to a DataSource
▪ Specify Transformations on the
DataStreams
▪ Specifying Output for the processed data
▪ Executing the program
28
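The transformation steps above (minus the execution environment) can be mirrored in plain Java as a batch analogue — this is not Flink code, and the class name is made up for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WordCountSketch {
    // Mirrors LineSplitter + keyBy(0) + sum(1): split each line
    // into words and keep a running count per word.
    static Map<String, Integer> wordCounts(Iterable<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = wordCounts(java.util.List.of("to be or", "not to be"));
        System.out.println(c); // {to=2, be=2, or=1, not=1}
    }
}
```

A Flink job streams these per-key sums out incrementally instead of returning a final map, which is why the slide's pipeline ends with `counts.print()` rather than a collect.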
Flink DataStream API
29
public class StreamingWordCount {
  public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    // Create a DataStream from lines in a file
    DataStream<String> text = env.readTextFile("/path");
    DataStream<Tuple2<String, Integer>> counts = text
        .flatMap(new LineSplitter())
        // Converts DataStream -> KeyedStream
        .keyBy(0) // Group by first element of the Tuple
        .sum(1);
    counts.print();
    env.execute("Execute Streaming Word Counts"); // Execute the WordCount job
  }

  // FlatMap implementation which converts each line to many <Word, 1> pairs
  public static class LineSplitter implements
      FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
      for (String word : line.split(" ")) {
        out.collect(new Tuple2<String, Integer>(word, 1));
      }
    }
  }
}
Source code: https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/StreamingWordCount.java
Flink Window API
30
Keyed Windows (Grouped by Key)
31
public class WindowWordCount {
  public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment
        .getExecutionEnvironment();
    // Create a DataStream from lines in a file
    DataStream<String> text = env.readTextFile("/path");
    DataStream<Tuple2<String, Integer>> counts = text
        .flatMap(new LineSplitter())
        .keyBy(0) // Group by first element of the Tuple
        // create a window of 'windowSize' records and slide the window
        // by 'slideSize' records
        .countWindow(windowSize, slideSize)
        .sum(1);
    counts.print();
    env.execute("Execute Streaming Word Counts"); // Execute the WordCount job
  }

  // FlatMap implementation which converts each line to many <Word, 1> pairs
  public static class LineSplitter implements
      FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
      for (String word : line.split(" ")) {
        out.collect(new Tuple2<String, Integer>(word, 1));
      }
    }
  }
}
Source code: https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/WindowWordCount.java
Keyed Windows
32
public class WindowWordCount {
  public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment
        .getExecutionEnvironment();
    // Create a DataStream from lines in a file
    DataStream<String> text = env.readTextFile("/path");
    DataStream<Tuple2<String, Integer>> counts = text
        .flatMap(new LineSplitter())
        .keyBy(0) // Group by first element of the Tuple
        // Converts KeyedStream -> WindowedStream
        .timeWindow(Time.of(1, TimeUnit.SECONDS))
        .sum(1);
    counts.print();
    env.execute("Execute Streaming Word Counts"); // Execute the WordCount job
  }

  // FlatMap implementation which converts each line to many <Word, 1> pairs
  public static class LineSplitter implements
      FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
      for (String word : line.split(" ")) {
        out.collect(new Tuple2<String, Integer>(word, 1));
      }
    }
  }
}
Source code: https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/WindowWordCount.java
Global Windows
33
All incoming elements of a given key are assigned to
the same window.
lines.flatMap(new LineSplitter())
    // group by the tuple field "0"
    .keyBy(0)
    // all records for a given key are assigned to the same window
    .window(GlobalWindows.create())
    // and sum up tuple field "1"
    .sum(1)
    // consider only word counts > 1
    .filter(new WordCountFilter());
(Note: a global window never fires on its own; a trigger must be specified for results to be emitted.)
Flink Streaming API (Tumbling Windows)
34
• All incoming elements are assigned to a window of
a certain size based on their timestamp
• Each element is assigned to exactly one window
Flink Streaming API (Tumbling Window)
35
public class WindowWordCount {
  public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment
        .getExecutionEnvironment();
    // Create a DataStream from lines in a file
    DataStream<String> text = env.readTextFile("/path");
    DataStream<Tuple2<String, Integer>> counts = text
        .flatMap(new LineSplitter())
        .keyBy(0) // Group by first element of the Tuple
        // Tumbling window
        .timeWindow(Time.of(1, TimeUnit.SECONDS))
        .sum(1);
    counts.print();
    env.execute("Execute Streaming Word Counts"); // Execute the WordCount job
  }

  // FlatMap implementation which converts each line to many <Word, 1> pairs
  public static class LineSplitter implements
      FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
      for (String word : line.split(" ")) {
        out.collect(new Tuple2<String, Integer>(word, 1));
      }
    }
  }
}
Source code: https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/WindowWordCount.java
Demos
36
Twitter + Flink Streaming
37
• Create a Flink DataStream from a live Twitter feed
• Split the stream into multiple DataStreams based
on some criterion
• Persist the respective streams to storage
https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/twitter
Flink Event Processing: Animation
38
Courtesy: Ufuk Celebi and Stephan Ewen, Data Artisans
39
[Animation: Tumbling Windows of 4 Seconds — arriving elements are grouped into non-overlapping 4-second windows such as 0-3, 4-7, 8-11, 20-23, 24-27, and 32-35]
tl;dr
40
• Event Time Processing is unique to Apache Flink
• Flink provides exactly-once guarantees
• With release 0.10.0, Flink supports streaming
windows, sessions, triggers, multi-triggers, deltas,
and event time.
References
41
• Data Streaming Fault Tolerance in Flink (Flink internals documentation)
• Lightweight Asynchronous Snapshots for Distributed Dataflows:
http://arxiv.org/pdf/1506.08603.pdf
• The Google Dataflow paper
Acknowledgements
42
Thanks to the following folks from Data Artisans for their
help and feedback:
• Ufuk Celebi
• Till Rohrmann
• Stephan Ewen
• Marton Balassi
• Robert Metzger
• Fabian Hueske
• Kostas Tzoumas
Questions?
43