Stephan Ewen - Experiences running Flink at Very Large Scale
The document discusses the implementation and challenges of running Apache Flink at large scales, highlighting use cases such as stream ingestion and event-driven applications. Key issues addressed include dependency conflicts, checkpointing, state management, and performance optimization strategies. The talk emphasizes Flink's evolving architecture and the importance of understanding its underlying mechanisms for robust large-scale application deployment.
Various use cases
• Example: Stream ingestion, route events to Kafka, ES, Hive
• Example: Model user interaction sessions
Mix of stateless / moderate state / large state
Stream Processing as a Service
• Launching, monitoring, scaling, updating
Blink: based on Flink
A core system in Alibaba Search
• Machine learning, search, recommendations
• A/B testing of search algorithms
• Online feature updates to boost conversion rate
Alibaba is a major contributor to Flink
Contributing many changes back to open source
Interacting with the environment
Dependency conflicts are amongst the biggest problems
• Next versions trying to radically reduce dependencies
• Make Hadoop an optional dependency
• Rework shading techniques
The deployment ecosystem is crazy complex
• Yarn, Mesos & DC/OS, Docker & K8s, standalone, …
• Containers and overlay networks are tricky
• The authorization and authentication ecosystem is complex in itself
• Continuous work to improve integration
External systems
Dependency on any external system eventually causes downtime
• Mainly: HDFS / S3 / NFS / … for checkpoints
We plan to reduce dependency on those more and more in
the next versions
Type Serialization
Type serialization is a harder problem in streaming than in batch
• The data structure updates require more serialization
• Types are often more complex than in batch
State lives long and across jobs
• Requires "versioning" of state and serializers
• Requires a "schema evolution" path
• Much enhanced support in Flink 1.3, more still to come
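Because state is long-lived, the choice of serializer matters. Here is a minimal sketch (the SessionInfo / SessionTracker names are hypothetical, not from the talk) of declaring keyed state with a flat type and an explicit TypeInformation, so long-lived state does not silently fall back to generic Kryo serialization that is hard to evolve later.

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.util.Collector

// a flat, POJO-like case class keeps the persisted format simple to evolve
case class SessionInfo(userId: String, lastSeen: Long)

class SessionTracker extends ProcessFunction[SessionInfo, String] {

  // explicit TypeInformation: the serializer used for the long-lived state is deliberate
  lazy val session: ValueState[SessionInfo] =
    getRuntimeContext.getState(
      new ValueStateDescriptor[SessionInfo]("session", createTypeInformation[SessionInfo]))

  override def processElement(event: SessionInfo,
                              ctx: ProcessFunction[SessionInfo, String]#Context,
                              out: Collector[String]): Unit = {
    session.update(event)      // state write uses the declared serializer
    out.collect(event.userId)
  }
}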
Understanding Checkpoints
Robustly checkpointing… is the most important part of running a large-scale Flink application.
Two things to watch per checkpoint:

How long do snapshots take?
• Too long means too much state per node, or the snapshot store cannot keep up with the load (low bandwidth)
• Vastly improved with incremental checkpoints in Flink 1.3

How well does the alignment behave? (lower is better)
• delay = end_to_end – sync – async
• A long delay means the job is under backpressure
• Constant backpressure means the application is under-provisioned
• The most important robustness metric
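For illustration (numbers invented, not from the talk): if a checkpoint reports end_to_end = 8 s, sync = 0.5 s and async = 5.5 s, then delay = 8 − 0.5 − 5.5 = 2 s spent waiting in alignment; if a delay of that size shows up on every checkpoint, the job is likely under constant backpressure and therefore under-provisioned.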
Heavy alignments
A heavy alignment typically happens at some point:
• Different load on different paths
• Skewed window emission (lots of data on one node)
• Stall of one operator on the path (for example, a GC stall)
Catching up from heavy alignments
Operators that did a heavy alignment need to catch up again; otherwise, the next checkpoint will have a heavy alignment as well.
(Diagram: input buffered during the alignment is consumed first after the checkpoint completes.)
Catching up from heavy alignments
Give the computation time to catch up before starting the next checkpoint:
• Set the min-time-between-checkpoints (see the configuration sketch below)
• Ideas to change checkpoints to be policy-based (spend x% of capacity on checkpoints)
Asynchronous checkpoints mitigate most of the problem:
• Very short stalls in the pipelines mean a shorter alignment phase
• Catch-up already happens concurrently to state materialization
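A minimal configuration sketch (values are illustrative, not from the talk): enable checkpointing and force a minimum pause between checkpoints so the pipeline has time to drain before the next barrier is injected.

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// trigger a checkpoint roughly every 60 s (illustrative)
env.enableCheckpointing(60000)

// guarantee at least 30 s of catch-up time between the end of one
// checkpoint and the start of the next (illustrative)
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(30000)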
Asynchrony of different state types

State                  | Flink 1.2             | Flink 1.3 | Flink 1.4
Keyed state (RocksDB)  | ✔                     | ✔         | ✔
Keyed state on heap    | ✘ (✔ hidden in 1.2.1) | ✔         | ✔
Timers                 | ✘                     | ✘         | ✔ (PR)
Operator state         | ✘                     | ✔         | ✔
When to use which state backend? (a bit simplified)
• State ≥ memory? yes → RocksDB
• Otherwise: complex objects (expensive serialization)? yes → async heap/FS backend
• Otherwise: high data rate? yes → async heap/FS backend, no → RocksDB
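A minimal sketch of configuring the two backends from the decision list above (the checkpoint URI is a placeholder): an asynchronous heap/FS backend for state that fits in memory, or RocksDB with incremental checkpoints (Flink 1.3+) for state larger than memory; pick one of the two.

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// option 1: heap-based state with asynchronous snapshots (state must fit in memory)
env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints", true))

// option 2: RocksDB with incremental checkpoints enabled (state can exceed memory)
env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true))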
Exceeding FS request capacity
• Job size: several thousand operators
• Checkpoint interval: a few seconds
• State size: KBs per operator, thousands of state chunks
• Via the S3 FS (from Hadoop), each write first ensures the "directory" exists: 2 HEAD requests per write
• Symptom: S3 blocked off connections after exceeding 1000s of HEAD requests / sec
Reducing FS stress for small state
Fs/RocksDB state backend for most states: every task writes its checkpoint data into its own files, while the JobManager's checkpoint coordinator writes the root checkpoint file (metadata).
(Diagram: JobManager with checkpoint coordinator; TaskManagers with tasks writing checkpoint data files.)
Reducing FS stress for small state
Fs/RocksDB state backend for small states: the checkpoint data travels with the ack and is stored directly in the metadata file, so no separate files are written.
Increasing the small-state threshold reduces the number of files (default: 1 KB).
(Diagram: tasks send ack+data to the JobManager's checkpoint coordinator.)
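A hedged configuration sketch: for the Fs/RocksDB backends this threshold is controlled by the flink-conf.yaml option state.backend.fs.memory-threshold; the value below is illustrative.

# flink-conf.yaml
# state chunks smaller than this are stored inline in the checkpoint metadata
# (default: 1024 bytes; value below is illustrative)
state.backend.fs.memory-threshold: 16384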
Deploying Tasks
Happens during initial deployment and recovery.
The JobManager sends a deployment RPC call (via Akka) to each TaskManager. The call contains:
- Job configuration (KBs)
- Task code and objects (up to MBs)
- Recover state handle (KBs)
- Correlation IDs (few bytes)
(Diagram: JobManager and TaskManager, each with an Akka/RPC endpoint and a blob server, exchanging the deployment RPC call.)
RPC volume during deployment
(back of the napkin calculation)
number of tasks × parallelism × size of task objects = RPC volume
1000 × 10 × 2 MB = 20 GB
≈ 20 seconds on a full 10 GBit/s network
> 1 min with an average of 3 GBit/s
> 3 min with an average of 1 GBit/s
Timeouts and Failure detection
Deployment alone can take ≈ 20 seconds on a full 10 GBit/s network, > 1 min at an average of 3 GBit/s, and > 3 min at an average of 1 GBit/s.
Default RPC timeout: 10 secs, so the default settings lead to failed deployments with RPC timeouts.
Solution: increase the RPC timeout (see the configuration sketch below)
Caveat: increasing the timeout makes failure detection slower
Future: reduce the RPC load (next slides)
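A hedged flink-conf.yaml sketch (the value is illustrative): the RPC/ask timeout is raised via akka.ask.timeout so that large deployment messages do not trip the default 10 s limit; keep the failure-detection caveat above in mind.

# flink-conf.yaml
# default is "10 s"; large jobs may need more headroom for deployment RPCs (illustrative value)
akka.ask.timeout: 60 s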
Dissecting the RPC messages

Message part          | Size       | Variance across subtasks and redeploys
Job configuration     | KBs        | constant
Task code and objects | up to MBs  | constant
Recover state handle  | KBs        | variable
Correlation IDs       | few bytes  | variable
Upcoming: Deploying Tasks
Out-of-band transfer and caching of the large and constant message parts:
(1) The deployment RPC call carries only the recover state handle, correlation IDs, and BLOB pointers (KBs)
(2) The TaskManager downloads the BLOBs (job config, task objects; MBs) from the JobManager's Blob Server and keeps them in its Blob Cache
Apache Flink's Layered APIs
• Process Function (events, state, time): stateful event-driven applications
• DataStream API (streams, windows): stream & batch processing
• Table API (dynamic tables) and Stream SQL: analytics
Process Function
class MyFunction extends ProcessFunction[MyEvent, Result] {
// declare state to use in the program
lazy val state: ValueState[CountWithTimestamp] = getRuntimeContext().getState(…)
def processElement(event: MyEvent, ctx: Context, out: Collector[Result]): Unit = {
// work with event and state
(event, state.value) match { … }
out.collect(…) // emit events
state.update(…) // modify state
// schedule a timer callback
ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
}
def onTimer(timestamp: Long, ctx: OnTimerContext, out: Collector[Result]): Unit = {
// handle callback when event-/processing- time instant is reached
}
}
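For context, a brief sketch of how such a function is applied (the events stream and the key selector are placeholders, not from the talk): a ProcessFunction runs on a keyed stream, which is what gives it access to keyed state and timers.

val results: DataStream[Result] = events
  .keyBy(_.userId)          // key by some field of MyEvent (illustrative)
  .process(new MyFunction)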
Data Stream API
val lines: DataStream[String] = env.addSource(
  new FlinkKafkaConsumer09[String](…))
val events: DataStream[Event] = lines.map((line) => parse(line))
val stats: DataStream[Statistic] = events
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .apply(new MyAggregationFunction())
stats.addSink(new RollingSink(path))
Consistency
Classic tiered architecture: distributed transactions at scale, typically at-most / at-least once
Streaming architecture: exactly once per state, snapshot consistency across states
Scaling a Service
Classic tiered architecture: provision compute, plus separately provision additional database capacity
Streaming architecture: provision compute and state together
Rolling out a new Service
Classic tiered architecture: provision a new database (or add capacity to an existing one)
Streaming architecture: provision compute and state together; the new state simply occupies some additional backup space
Repair External State
Streaming architecture:
• The live application keeps writing to the external state
• The input events are backed up (HDFS, S3, etc.)
• To repair, run the application on the backup input and overwrite the external state with the correct results
• Each application doubles as a batch job!