Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForward 2016 09/12/16)

Till Rohrmann
trohrmann@apache.org
@stsffap

Resource Adaption
[Figure: two plots of workload and resources over time]

What Is This Talk About?
§ Flink’s approach to dynamic scaling
§ Current state with demo
§ Outlook on next development steps

Basic Idea
• Spread work across more workers to decrease the workload per worker

Scaling Stateless Jobs
[Figure: Source -> Mapper -> Sink pipeline, shown for scale up and scale down]
• Scale up: Deploy new tasks
• Scale down: Cancel running tasks

Scaling Stateful Jobs
• Problem: Which state to assign to new task?

Keyed vs. Non-keyed State
Keyed
• State bound to a key
• E.g. keyed UDF and window state
Non-keyed
• State bound to a subtask
• E.g. source state

Repartitioning Keyed State
§ Similar to consistent hashing
§ Split key space into key groups
§ Assign key groups to tasks
[Figure: key space divided into key groups #1 to #4]
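The key group idea can be sketched in a few lines of Python. This is a minimal illustration, not Flink's implementation: the CRC32 hash, the function name key_group_for and the choice of four key groups (numbered from 0 here) are assumptions made for the example.

```python
import zlib

NUM_KEY_GROUPS = 4  # assumption: fixed when the job is first started


def key_group_for(key: str, num_key_groups: int = NUM_KEY_GROUPS) -> int:
    """Map a key to a key group using a hash that is stable across runs.

    CRC32 is an arbitrary choice for this sketch; any deterministic hash works.
    """
    return zlib.crc32(key.encode("utf-8")) % num_key_groups


# Group some example keys by key group. The mapping depends only on the key
# and the number of key groups, never on the current parallelism.
keys = ["user-1", "user-2", "user-3", "user-4", "user-5"]
groups = {}
for key in keys:
    groups.setdefault(key_group_for(key), []).append(key)
```

Because a key always lands in the same key group, state can be moved between tasks one key group at a time instead of key by key.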

Repartitioning Keyed State contd.
§ Rescaling changes key group assignment
§ Maximum parallelism defined by number of key groups
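Why rescaling only changes the assignment, and why the number of key groups caps the parallelism, can be seen in a small sketch. The contiguous-range formula and the name assign_key_groups are assumptions chosen for illustration, not necessarily what Flink does.

```python
def assign_key_groups(num_key_groups: int, parallelism: int) -> dict:
    """Assign contiguous ranges of key groups to tasks 0..parallelism-1."""
    if parallelism > num_key_groups:
        # A key group is the smallest unit of state that can be moved,
        # so tasks beyond this limit would have nothing to own.
        raise ValueError("parallelism exceeds number of key groups")
    assignment = {task: [] for task in range(parallelism)}
    for key_group in range(num_key_groups):
        task = key_group * parallelism // num_key_groups
        assignment[task].append(key_group)
    return assignment


before = assign_key_groups(num_key_groups=4, parallelism=2)
after = assign_key_groups(num_key_groups=4, parallelism=4)
print(before)  # {0: [0, 1], 1: [2, 3]}
print(after)   # {0: [0], 1: [1], 2: [2], 3: [3]}
```

The key groups themselves stay the same in both runs; only their owner changes.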

Repartitioning Non-keyed State
§ User-defined merge and split functions
• Most general approach
§ Breaking non-keyed state up into finer granularity
• State has to contain multiple entries
• Automatic repartitioning wrt granularity
[Figure: elements labeled #1, #2 and #3]

Repartitioning Non-keyed State contd.
§ Non-keyed state entries gathered at the job manager
§ Repartitioning schemes
• Repartition & send
• Union & broadcast
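The two schemes can be contrasted with a short sketch. It assumes the entries from all old subtasks have already been collected into one list; the round-robin split and the function names are illustrative assumptions, not Flink's API.

```python
def repartition_and_send(entries: list, parallelism: int) -> list:
    """Split the gathered entries so each new subtask gets a disjoint share."""
    return [entries[i::parallelism] for i in range(parallelism)]


def union_and_broadcast(entries: list, parallelism: int) -> list:
    """Give every new subtask its own copy of all gathered entries."""
    return [list(entries) for _ in range(parallelism)]


gathered = ["e1", "e2", "e3", "e4", "e5"]
print(repartition_and_send(gathered, 2))  # [['e1', 'e3', 'e5'], ['e2', 'e4']]
print(union_and_broadcast(gathered, 2))
```

In the second variant each subtask sees the full list and has to pick the entries it is responsible for on its own.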

Example: Kafka Source
partitionId: 1, offset: 42
• Store offset for each partition
• Individual entries are repartitionable
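For the Kafka source this could look as follows. A sketch under assumptions: offsets are modeled as a dict from partition id to offset, and partitions are assigned to subtasks by partition id modulo parallelism, which is a hypothetical rule chosen for the example.

```python
def rescale_offsets(old_states: list, new_parallelism: int) -> list:
    """Redistribute per-partition offsets from old subtasks to new ones."""
    # Merge the entries of all old subtasks; each partition appears once.
    merged = {}
    for state in old_states:
        merged.update(state)
    # Hand each entry to the subtask that will read that partition.
    new_states = [{} for _ in range(new_parallelism)]
    for partition_id, offset in sorted(merged.items()):
        new_states[partition_id % new_parallelism][partition_id] = offset
    return new_states


# Two source subtasks, scaled up to three.
old = [{0: 17, 1: 42}, {2: 8, 3: 99}]
print(rescale_offsets(old, 3))  # [{0: 17, 3: 99}, {1: 42}, {2: 8}]
```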

Rescaling: Why is That so Hard?
§ Handling of state
§ Repartitioning of keyed & non-keyed state
§ Unique among open source stream processors, as far as I know

Demo Topology
[Figure: Kafka Source -> KeyBy -> Counter]

Current State and Next Steps

Current State
§ Manual rescaling
1. Take savepoint
2. Stop the job
3. Restart job with adjusted parallelism and savepoint
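Scripted, the three steps might look like the sketch below, which only builds the command lines and does not run them. It assumes the flink CLI is on the PATH and uses its savepoint, cancel and run subcommands (run with -s for the savepoint path and -p for the parallelism). Using cancel for the "stop" step is a choice made for this sketch; the job id, savepoint path and jar name are placeholders.

```python
def manual_rescale_commands(job_id: str, savepoint_path: str,
                            jar: str, parallelism: int) -> list:
    """Build the CLI invocations for a manual rescale (nothing is executed)."""
    return [
        # 1. Take a savepoint; the CLI reports where it was written.
        ["flink", "savepoint", job_id],
        # 2. Stop the running job (cancel is an assumption of this sketch).
        ["flink", "cancel", job_id],
        # 3. Restart from the savepoint with the adjusted parallelism.
        ["flink", "run", "-s", savepoint_path, "-p", str(parallelism), jar],
    ]


# All argument values below are placeholders.
for command in manual_rescale_commands("<jobId>", "<savepointPath>", "job.jar", 8):
    print(" ".join(command))
```

In practice the savepoint path comes from the output of the first command, so the steps have to run in order.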

Next Steps
§ Integrate savepoint with stop signal
§ Rescaling individual operators w/o restart
§ Dynamic container de-/allocation
• “Running Flink Everywhere” by Stephan Ewen, 16:45 at Kesselhaus

Auto Scaling Policies
• Latency
• Throughput
• Resource utilization
• Kubernetes on GCE, EC2 and Mesos (marathon-autoscale) already support auto-scaling
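A policy based on one of these metrics could be as simple as a threshold rule. The sketch below is purely hypothetical: the thresholds, the doubling and halving steps and the function name are invented for illustration and do not describe an existing Flink feature.

```python
def target_parallelism(current: int, utilization: float, max_parallelism: int,
                       low: float = 0.3, high: float = 0.8) -> int:
    """Pick a new parallelism from resource utilization (0.0 to 1.0)."""
    if utilization > high:
        # Overloaded: double, but never exceed the maximum parallelism.
        return min(current * 2, max_parallelism)
    if utilization < low:
        # Underused: halve, but keep at least one task.
        return max(current // 2, 1)
    return current


print(target_parallelism(current=4, utilization=0.9, max_parallelism=8))  # 8
print(target_parallelism(current=4, utilization=0.1, max_parallelism=8))  # 2
```

The gap between the two thresholds keeps the job from flapping between scale up and scale down when the load hovers around a single cutoff.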

Conclusion
§ Scaling of keyed and non-keyed state
§ Flink supports manual rescaling with restart (WIP branch: https://github.com/tillrohrmann/flink/tree/partitionable-op-state)
§ Future versions might support scaling on the fly and automatic rescaling policies