Streaming Data Flow
with Apache Flink
Till Rohrmann
trohrmann@apache.org
@stsffap
Recent History
April ‘14: Project Incubation (v0.5, v0.6, v0.7)
December ‘14: Top Level Project (v0.8)
April ‘15: v0.9
Currently moving towards the 0.10 and 1.0 releases.
What is Flink?
Deployment: Local (Single JVM) · Cluster (Standalone, YARN)
DataStream API: Unbounded Data · DataSet API: Bounded Data
Runtime: Distributed Streaming Data Flow
Libraries: Machine Learning · Graph Processing · SQL-like API
What is Flink?
Streaming: topologies · streams · time windows · count windows · low latency
Long Batch Pipelines: resource utilization
Machine Learning: iterative algorithms (Rating Matrix = User Matrix × Item Matrix)
Graph Analysis: mutable state
Stream Processing
Real-world data is unbounded and is pushed to systems.
Stream Platform Architecture
Server Logs · Trxn Logs · Sensor Logs → Kafka → Flink → Downstream Systems
Kafka
– Gather and back up streams
– Offer streams
Flink
– Analyze and correlate streams
– Create derived streams
Cornerstones of Flink
Low Latency for fast results.
High Throughput to handle many events per second.
Exactly-once guarantees for correct results.
Expressive APIs for productivity.
DataStream API
keyBy · Time Window · sum
DataStream API
StreamExecutionEnvironment env = StreamExecutionEnvironment
    .getExecutionEnvironment();

DataStream<String> data = env.fromElements(
    "O Romeo, Romeo! wherefore art thou Romeo?", ...);
// DataStream Windowed WordCount
DataStream<Tuple2<String, Integer>> counts = data
.flatMap(new SplitByWhitespace()) // (word, 1)
.keyBy(0) // [word, [1, 1, …]] for 10 seconds
.timeWindow(Time.of(10, TimeUnit.SECONDS))
.sum(1); // sum per word per 10 second window
counts.print();
env.execute();
DataStream API
public static class SplitByWhitespace
    implements FlatMapFunction<String, Tuple2<String, Integer>> {

  @Override
  public void flatMap(
      String value, Collector<Tuple2<String, Integer>> out) {

    String[] tokens = value.toLowerCase().split("\\W+");

    for (String token : tokens) {
      if (token.length() > 0) {
        out.collect(new Tuple2<>(token, 1));
      }
    }
  }
}
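A quick way to sanity-check the tokenizer outside of Flink: `String.split` takes a regular expression, so the pattern must be `\\W+` (with an escaped backslash) to split on runs of non-word characters. A minimal standalone sketch:

```java
import java.util.Arrays;

public class TokenizeDemo {
    public static void main(String[] args) {
        // split("\\W+") breaks the line on runs of non-word characters;
        // an unescaped "W+" would split on literal 'W' characters instead.
        String[] tokens = "O Romeo, Romeo! wherefore art thou Romeo?"
                .toLowerCase()
                .split("\\W+");
        System.out.println(Arrays.toString(tokens));
        // [o, romeo, romeo, wherefore, art, thou, romeo]
    }
}
```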
Pipelining
DataStream<String> data = env.fromElements(
    "O Romeo, Romeo! wherefore art thou Romeo?", …);
// DataStream WordCount
DataStream<Tuple2<String, Integer>> counts = data
.flatMap(new SplitByWhitespace()) // (word, 1)
.keyBy(0) // split stream by word
.sum(1); // sum per word as they arrive
Source Map Reduce
Pipelining
S1 M1 R1
S2 M2 R2
Source Map Reduce
Complete pipeline online concurrently.
Pipelining
S1 · M1   R1
S2   M2   R2
Source · Map · Reduce
Chained tasks · Pipelined shuffle
Complete pipeline online concurrently.
Pipelining
Complete pipeline online concurrently.
Worker Worker
Streaming Fault Tolerance
At Most Once
• No guarantees at all
At Least Once
• Ensure that all operators see all events.
Exactly Once
• Ensure that all operators see all events.
• Do not perform duplicate updates to operator state.
Flink gives you all of these guarantees.
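The difference between at-least-once and exactly-once can be illustrated with a toy replay (plain Java, not Flink API): after a failure a source rewinds to some offset and the operator resumes from restored state. Exactly-once requires that the restored state and the replay offset come from the same consistent snapshot; rewinding further back than the snapshot double-counts events. A sketch under these assumptions:

```java
import java.util.Arrays;
import java.util.List;

public class ReplayDemo {
    // Count events, resuming from restored operator state at a replay offset.
    static int process(List<String> events, int restoredCount, int replayFrom) {
        int count = restoredCount;
        for (int i = replayFrom; i < events.size(); i++) {
            count++; // update operator state once per event seen
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList("a", "b", "c", "d");
        int snapshotOffset = 2; // snapshot taken after "a" and "b"
        int snapshotCount  = 2; // operator state stored in the same snapshot

        // Exactly-once: replay offset and state come from the same snapshot.
        System.out.println(process(events, snapshotCount, snapshotOffset)); // 4

        // At-least-once: replaying from an earlier offset re-delivers "b",
        // which is then counted twice in the restored state.
        System.out.println(process(events, snapshotCount, 1)); // 5
    }
}
```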
Distributed Snapshots
Barriers flow through the topology in line with the data.
Flink guarantees exactly-once processing.
Part of snapshot
Distributed Snapshots
Flink guarantees exactly-once processing.
The JobManager (Master) with a State Backend holds the Checkpoint Data:
entries for Sources 1–4, States 1–2, and Sinks 1–2.
Current source offsets: 6791, 7252, 5589, 6843
The JobManager sends a Start Checkpoint message to the sources.
The sources emit barriers and acknowledge with their positions; the checkpoint data now records the offsets (Source 1: 6791, Source 2: 7252, Source 3: 5589, Source 4: 6843).
An operator waits until it has received the barrier at each of its inputs.
Once the barrier has arrived on every input, the operator writes a snapshot (s1) of its state.
Operators acknowledge with pointers to their state; the checkpoint data records State 1: PTR1 and State 2: PTR2 (snapshots s1, s2).
After receiving the barrier at each input, the sinks acknowledge the checkpoint (Sink 1: ACK, Sink 2: ACK).
The checkpoint is complete once all source offsets, state pointers, and sink acknowledgements are recorded.
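The "received barrier at each input" step is barrier alignment: an operator with several input channels snapshots only once the checkpoint barrier has arrived on every channel. The bookkeeping can be sketched in a few lines (plain Java, not Flink internals):

```java
import java.util.HashSet;
import java.util.Set;

public class BarrierAlignment {
    private final int numInputs;
    private final Set<Integer> channelsWithBarrier = new HashSet<>();

    public BarrierAlignment(int numInputs) {
        this.numInputs = numInputs;
    }

    // Called when the checkpoint barrier arrives on one input channel.
    // Returns true exactly when the barrier has been seen on every channel,
    // i.e. the operator may now snapshot its state and forward the barrier.
    public boolean onBarrier(int channel) {
        channelsWithBarrier.add(channel);
        return channelsWithBarrier.size() == numInputs;
    }

    public static void main(String[] args) {
        BarrierAlignment op = new BarrierAlignment(2);
        System.out.println(op.onBarrier(0)); // false: still waiting on channel 1
        System.out.println(op.onBarrier(1)); // true: snapshot now
    }
}
```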
Operator State
Stateless operators:
ds.filter(_ != 0)
System state:
ds.keyBy(0).window(TumblingTimeWindows.of(5, TimeUnit.SECONDS))
User-defined state:
public class CounterSum extends RichReduceFunction<Long> {
  private OperatorState<Long> counter;

  @Override
  public Long reduce(Long v1, Long v2) throws Exception {
    counter.update(counter.value() + 1);
    return v1 + v2;
  }

  @Override
  public void open(Configuration config) {
    counter = getRuntimeContext().getOperatorState("counter", 0L, false);
  }
}
Batch on Streaming
DataStream API: Unbounded Data · DataSet API: Bounded Data
Runtime: Distributed Streaming Data Flow
Libraries: Machine Learning · Graph Processing · SQL-like API
Batch on Streaming
Run a bounded stream (data set) on a stream processor.
Bounded data set · Unbounded data stream
Batch on Streaming
Run a bounded stream (data set) on a stream processor.
Infinite streams: stream windows · pipelined data exchange
Finite streams: global view · pipelined or blocking data exchange
Batch Pipelines
Data exchange is mostly streamed.
Some operators block (e.g. sort, hash table).
DataSet API
ExecutionEnvironment env = ExecutionEnvironment
    .getExecutionEnvironment();

DataSet<String> data = env.fromElements(
    "O Romeo, Romeo! wherefore art thou Romeo?", ...);
// DataSet WordCount
DataSet<Tuple2<String, Integer>> counts = data
.flatMap(new SplitByWhitespace()) // (word, 1)
.groupBy(0) // [word, [1, 1, …]]
.sum(1); // sum per word for all occurrences
counts.print();
Batch-specific optimizations
Cost-based optimizer
• Program adapts to changing data size
Managed memory
• On- and off-heap memory
• Internal operators (e.g. join or sort) with out-of-core support
• Serialization stack for user-types
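For the off-heap case, the underlying JVM mechanism is a direct ByteBuffer: memory outside the garbage-collected heap, accessed through fixed-size segments. A minimal illustration (plain Java; Flink's actual memory-segment abstraction is more involved):

```java
import java.nio.ByteBuffer;

public class OffHeapDemo {
    public static void main(String[] args) {
        // A direct buffer lives outside the JVM heap, so it neither
        // contributes to GC pressure nor moves during collections.
        ByteBuffer segment = ByteBuffer.allocateDirect(32 * 1024);
        segment.putLong(0, 42L);                // write at a fixed offset
        System.out.println(segment.getLong(0)); // 42
        System.out.println(segment.isDirect()); // true
    }
}
```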
Demo Time
Getting Started
Project Page: http://flink.apache.org
Quickstarts: Java & Scala API
Docs: Programming Guides
Get Involved: Mailing Lists, Stack Overflow, IRC, …
Blogs
http://flink.apache.org/blog
http://data-artisans.com/blog
Twitter
@ApacheFlink
Mailing lists
(news|user|dev)@flink.apache.org
Apache Flink
Thank You!

Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
