Apache Flink® Training
Intro
Flink v1.3 – 8.9.2017
Where we’re going today
▪ Stateful stream processing as a paradigm for continuous data
▪ Apache Flink is a sophisticated and battle-tested stateful stream processor with a comprehensive set of features
▪ Efficiency, management, and operational issues for state are taken very seriously
Stream Processing
Your code processes records one at a time: a long running computation, on an endless stream of input.
Distributed Stream Processing
● partitions input streams by some key in the data
● distributes computation across multiple instances
● each instance is responsible for some key range
Stateful Stream Processing
Your code maintains local variables and data structures across events:

var x = …
// update local variables/structures
if (condition(x)) {
  …
}
Stateful Stream Processing

var x = …
// update local variables/structures
if (condition(x)) {
  …
}

● embedded local state backend
● state co-partitioned with the input stream by key
About time ...
When should results be emitted?
● Control for determining when the computation has fully processed all required events
● It’s mostly about time, e.g. have I received all events for 3 - 4 pm?
● Did event B occur within 5 minutes of event A?
● Wall clock time is not correct. Event-time awareness is required.
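The event-time awareness mentioned above can be sketched with Flink's DataStream API. This is a minimal sketch, assuming a hypothetical Event(sensor, timestamp, value) type: timestamps are taken from the events themselves, and watermarks tell Flink how long to wait for out-of-order events before emitting a window's result.

```
// Sketch only: Event is a hypothetical type, not from the slides.
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.windowing.time.Time

case class Event(sensor: String, timestamp: Long, value: Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val events: DataStream[Event] =
  env.fromElements(Event("a", 1000L, 1.0), Event("a", 2000L, 2.0))

val withTimestamps = events.assignTimestampsAndWatermarks(
  // tolerate events arriving up to 10 seconds out of order
  new BoundedOutOfOrdernessTimestampExtractor[Event](Time.seconds(10)) {
    override def extractTimestamp(e: Event): Long = e.timestamp
  })

// results for a 3-4pm window are emitted once the watermark passes 4pm
withTimestamps
  .keyBy(_.sensor)
  .timeWindow(Time.hours(1))
  .sum("value")
```

With this setup, "have I received all events for 3 - 4 pm?" is answered by the watermark, not by the wall clock.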
Traditional batch processing
● Continuously ingesting data
● Time-bounded batch files (e.g. 2017-06-13 10:00pm, 11:00pm, 2017-06-14 00:00am, 01:00am)
● Periodic batch jobs
Traditional batch processing (II)
● Consider computing a conversion metric (# of A → B per hour)
● What if the conversion crossed time boundaries?
  → carry intermediate results to the next batch
● What if events come out of order?
  → ???
The ideal way
A long running computation that accumulates state:
● a view of the “history” of the input stream
● counters, in-progress windows
● parameters of incrementally trained ML models, etc.
● state influences the output
● output depends on a notion of time
● outputs when results are complete
The ideal way (II)
A Stateful Stream Processor that handles large distributed state and time / order / completeness consistently, robustly, and efficiently:
● Stateful stream processing as a new paradigm to continuously process continuous data
● Produces accurate results
● Having results available in real-time (with low latency and high throughput) is a natural consequence of the model
● Processes both real-time and historic data using exactly the same application
Flink APIs and Runtime
Apache Flink Stack
● Libraries
● DataStream API (Stream Processing) and DataSet API (Batch Processing)
● Runtime: Distributed Streaming Data Flow
Streaming and batch as first class citizens.
Programming Model
Sources → Transformations / Computations (each with local state) → Sinks
Parallelism
Distributed Execution
Levels of abstraction
● Stream SQL — high-level language
● Table API (dynamic tables) — declarative DSL
● DataStream API (streams, windows) — stream processing & analytics
● Process Function (events, state, time) — low-level (stateful stream processing)
Process Function

class MyFunction extends ProcessFunction[MyEvent, Result] {

  // declare state to use in the program
  lazy val state: ValueState[CountWithTimestamp] = getRuntimeContext().getState(…)

  def processElement(event: MyEvent, ctx: Context, out: Collector[Result]): Unit = {
    // work with event and state
    (event, state.value) match { … }

    out.collect(…)   // emit events
    state.update(…)  // modify state

    // schedule a timer callback
    ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
  }

  def onTimer(timestamp: Long, ctx: OnTimerContext, out: Collector[Result]): Unit = {
    // handle callback when event-/processing- time instant is reached
  }
}
Data Stream API

val lines: DataStream[String] = env.addSource(
  new FlinkKafkaConsumer010[String](…))

val events: DataStream[Event] = lines.map((line) => parse(line))

val stats: DataStream[Statistic] = events
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .apply(new MyAggregationFunction())

stats.addSink(new RollingSink(path))
Table API & Stream SQL
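A hedged sketch of what the Table API and Stream SQL look like in Flink 1.3, assuming a hypothetical stream of (sensor, value) readings; the table name "sensors" and field names are illustrative, not from the slides.

```
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.table.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = TableEnvironment.getTableEnvironment(env)

// register a DataStream as a dynamic table with named fields
val readings: DataStream[(String, Double)] =
  env.fromElements(("a", 1.0), ("a", 2.0))
tEnv.registerDataStream("sensors", readings, 'sensor, 'value)

// Table API: declarative, composable relational operations
val avgBySensor = tEnv.scan("sensors")
  .groupBy('sensor)
  .select('sensor, 'value.avg as 'avgValue)

// Stream SQL: the same query expressed as standard SQL over the dynamic table
val avgBySensorSql = tEnv.sql(
  "SELECT sensor, AVG(`value`) FROM sensors GROUP BY sensor")
```

Both forms compile to the same runtime dataflow; the choice between them is a matter of taste and tooling.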
Deployment Options
Local Execution
▪ Starts a local Flink cluster
▪ All processes (Job Manager and Task Managers) run in the same JVM
▪ Behaves just like a regular cluster
▪ Local cluster can be started in your IDE!
▪ Very useful for developing and debugging
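A minimal sketch of local execution: createLocalEnvironment starts Flink inside the current JVM, which is effectively what happens when you run a job from your IDE. The job name and elements are illustrative.

```
import org.apache.flink.streaming.api.scala._

// start an in-JVM mini cluster with parallelism 2
val env = StreamExecutionEnvironment.createLocalEnvironment(2)

env.fromElements(1, 2, 3)
   .map(_ * 2)
   .print()

env.execute("local debug job")
```

The same program, unchanged, can later be submitted to a remote cluster.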
Remote Execution
▪ Submit a job to a remotely running cluster (the client sends the job to the Job Manager, which distributes work across Task Managers)
▪ Monitor the status of a job
YARN Job Mode
▪ Brings up a Flink cluster in YARN to run a single job
▪ Better isolation than session mode
YARN Session Mode
▪ Starts a Flink cluster in YARN containers
▪ Multi-user scenario
▪ Resource sharing
▪ Easy to set up
Other Deployment Options
▪ Apache Mesos
• Either with or without DC/OS
▪ Amazon Elastic MapReduce
• Available in EMR 5.1.0
▪ Google Compute Engine
• Available via bdutil
▪ Docker / Kubernetes
Flink in the real world
Flink community
● Github
● 41 meetups
● 16,544 members
Powered by Flink
● Zalando, one of the largest ecommerce companies in Europe, uses Flink for real-time business process monitoring.
● King, the creators of Candy Crush Saga, uses Flink to provide data science teams with real-time analytics.
● Alibaba, the world's largest retailer, built a Flink-based system (Blink) to optimize search rankings in real time.
● Bouygues Telecom uses Flink for real-time event processing over billions of Kafka messages per day.
See more at flink.apache.org/poweredby.html
● Largest job has > 20 operators, runs on > 5,000 vCores in a 1,000-node cluster, processes millions of events per second
● Complex jobs of > 30 operators running 24/7, processing 30 billion events daily, maintaining state of 100s of GB with exactly-once guarantees
● 30 Flink applications in production for more than one year. 10 billion events (2TB) processed daily
What is being built with Flink?
▪ First wave for streaming was the lambda architecture
  • Aid batch systems to be more real-time
▪ Second wave was analytics (real-time and lag-time)
  • Based on distributed collections, functions, and windows
▪ The next wave is much broader: a new architecture for event-driven applications
A complete social network implemented using event sourcing and CQRS (Command Query Responsibility Segregation)
Flink Forward 2016
Flink Forward 2017
San Francisco
• 10-11 April 2017
• The first Flink Forward event outside of Berlin
• Talks are online at sf.flink-forward.org/
Berlin
• 11-13 September 2017
• Over 350 attendees last year
• Registration opening soon!
http://training.data-artisans.com/