Experiences Running
Apache Flink at
Very Large Scale
@StephanEwen
Berlin Buzzwords, 2017
1
Some large scale use cases
2
3
▪ Various use cases
• Example: Stream ingestion, route events to Kafka, ES, Hive
• Example: Model user interaction sessions
▪ Mix of stateless / moderate state / large state
▪ Stream Processing as a Service
• Launching, monitoring, scaling, updating
4
5
▪ Blink based on Flink
▪ A core system in Alibaba Search
• Machine learning, search, recommendations
• A/B testing of search algorithms
• Online feature updates to boost conversion rate
▪ Alibaba is a major contributor to Flink
▪ Contributing many changes back to open source
6
7
Social network implemented using event sourcing
and CQRS (Command Query Responsibility
Segregation) on Kafka/Flink/Elasticsearch/Redis
More: https://data-artisans.com/blog/drivetribe-cqrs-apache-flink
How we learned to view Flink
through its users
8
System for Event-driven Applications
9
Event-driven Applications: stateful, event-driven, event-time-aware processing (event sourcing, CQRS, …)
Stream Processing: (streams, windows, …)
Batch Processing: (data sets)
Event Sourcing + Memory Image
10
(diagram: an event / command arrives; the event log persists the events (temporarily); the process updates local variables/structures in main memory and periodically snapshots the memory)
Event Sourcing + Memory Image
11
Recovery: restore the latest snapshot and replay the events logged since that snapshot (the event log persists the events temporarily for this purpose).
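To make the pattern concrete, here is a minimal, hedged sketch of a single-process memory image; it is not Flink code, and all names and the snapshot/replay mechanics are illustrative assumptions.

case class FollowEvent(follower: String, followee: String)

class MemoryImage {
  // "main memory": local structures updated per event
  private var followers = Map.empty[String, Set[String]].withDefaultValue(Set.empty[String])
  private var appliedOffset = 0L                      // position in the event log

  def applyEvent(event: FollowEvent, offset: Long): Unit = {
    followers += event.followee -> (followers(event.followee) + event.follower)
    appliedOffset = offset
  }

  // periodically snapshot the memory (here: just hand out the immutable structures)
  def snapshot(): (Map[String, Set[String]], Long) = (followers, appliedOffset)

  // recovery: restore a snapshot, then replay the events logged since that snapshot
  def restore(snap: (Map[String, Set[String]], Long), log: Seq[(FollowEvent, Long)]): Unit = {
    followers = snap._1.withDefaultValue(Set.empty[String])
    appliedOffset = snap._2
    log.filter(_._2 > appliedOffset).foreach { case (e, off) => applyEvent(e, off) }
  }
}

Flink's stateful operators generalize exactly this picture: the "memory image" is the operator state, and the checkpoints are the periodic snapshots.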
Distributed Memory Image
12
A distributed application consists of many memory images; their snapshots are all consistent with each other.
Stateful Event & Stream Processing
13
Scalable embedded state
Access at memory speed &
scales with parallel operators
Stateful Event & Stream Processing
14
Re-load state
Reset positions
in input streams
Rolling back computation
Re-processing
Stateful Event & Stream Processing
15
Restore to different
programs
Bugfixes, Upgrades, A/B testing, etc
Compute, State, and Storage
16
Classic tiered architecture: a compute layer on top of a database layer; the database layer holds the application state and its backup.
Streaming architecture: the application state lives in the compute layer; stream storage and snapshot storage serve as the backup.
System for Event-driven Applications
17
Event-driven Applications: stateful, event-driven, event-time-aware processing (event sourcing, CQRS, …)
Stream Processing: (streams, windows, …)
Batch Processing: (data sets)
Apache Flink's Layered APIs
18
Process Function (events, state, time) → stateful event-driven applications
DataStream API (streams, windows) → stream & batch processing
Table API (dynamic tables) / Stream SQL → analytics
Lessons Learned from Running
Flink
19
20
The event/stream pipeline
generally just works
Interacting with the environment
▪ Dependency conflicts are amongst the biggest problems
• The next versions try to radically reduce dependencies
• Make Hadoop an optional dependency
• Rework shading techniques
▪ The deployment ecosystem is crazy complex
• Yarn, Mesos & DC/OS, Docker & K8s, standalone, …
• Containers and overlay networks are tricky
• The authorization and authentication ecosystem is complex in itself
• Continuous work to improve integration
21
External systems
▪ A dependency on any external system eventually causes downtime
• Mainly: HDFS / S3 / NFS / … for checkpoints
▪ We plan to reduce the dependency on those more and more in the next versions
22
Type Serialization
▪ Type serialization is a harder problem in streaming than in batch
• State updates require more (de)serialization
• Types are often more complex than in batch
▪ State lives long and across jobs
• Requires "versioning" state and serializers
• Requires a "schema evolution" path
• Much enhanced support in Flink 1.3, more still to come
23
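One way to picture the versioning problem: state written by one job version must stay readable after the schema changes, so the serialized bytes need a version tag and an upgrade path. The following is an illustrative sketch of that idea, not Flink's TypeSerializer API; all names and the wire format are assumptions.

import java.nio.charset.StandardCharsets.UTF_8

case class SessionStateV1(userId: String, clicks: Int)
case class SessionStateV2(userId: String, clicks: Int, lastSeenMillis: Long)  // evolved schema

object SessionStateCodec {
  // write the current version with a leading version tag
  def writeV2(s: SessionStateV2): Array[Byte] =
    s"2|${s.userId}|${s.clicks}|${s.lastSeenMillis}".getBytes(UTF_8)

  // read either version; migrate V1 records to V2 with a default for the new field
  def read(bytes: Array[Byte]): SessionStateV2 = {
    val parts = new String(bytes, UTF_8).split('|')
    parts(0) match {
      case "1" => SessionStateV2(parts(1), parts(2).toInt, lastSeenMillis = 0L)
      case "2" => SessionStateV2(parts(1), parts(2).toInt, parts(3).toLong)
      case v   => throw new IllegalStateException(s"unknown state version $v")
    }
  }
}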
24
Robustly checkpointing…
…is the most important part of running a large-scale Flink application
Review: Checkpoints
25
(diagram: the checkpoint coordinator triggers a checkpoint; the source / transform tasks inject a checkpoint barrier into the stream, which travels towards the stateful operations)
Review: Checkpoints
26
(diagram: when the barrier reaches a stateful operation, it triggers the state snapshot and the operator takes a snapshot of its state)
Review: Checkpoint Alignment
27
(diagram: when checkpoint barrier n arrives on one input, the operator begins aligning: records from the input that already delivered the barrier are held back in the input buffer until the barrier has arrived on the other inputs)
Review: Checkpoint Alignment
28
(diagram: once barrier n has arrived on all inputs, the operator emits barrier n downstream and continues the checkpoint; the records buffered during alignment are processed first afterwards)
Understanding Checkpoints
29
Understanding Checkpoints
30
How well does the alignment behave? (lower is better)
How long do the snapshots take?
delay = end_to_end - sync - async
Understanding Checkpoints
31
How well does the alignment behave? (lower is better)
• delay = end_to_end - sync - async
• A long delay means the job is under backpressure; constant backpressure means the application is under-provisioned.
• This is the most important robustness metric.
How long do the snapshots take?
• Too long means too much state per node, or the snapshot store cannot keep up with the load (low bandwidth).
• Vastly improved with incremental checkpoints in Flink 1.3.
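As a rough illustration (not from the slides), the delay can be derived from the three per-checkpoint durations Flink reports; the numbers below are made up.

// Hedged helper: alignment-related delay from the reported checkpoint durations (ms).
def checkpointDelayMs(endToEndMs: Long, syncMs: Long, asyncMs: Long): Long =
  endToEndMs - syncMs - asyncMs

// Example with made-up numbers: a 12 s checkpoint with 0.2 s sync and 3 s async parts
// spent roughly 8.8 s on barrier travel and alignment, hinting at backpressure.
val delayMs = checkpointDelayMs(endToEndMs = 12000, syncMs = 200, asyncMs = 3000)  // 8800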
Heavy alignments
▪ A heavy alignment typically happens at some point
▪ Different load on different paths
▪ Skewed window emission (lots of data on one node)
▪ Stall of one operator on the path (for example, a GC stall)
34
Catching up from heavy alignments
▪ Operators that did a heavy alignment need to catch up again
▪ Otherwise, the next checkpoint will have a heavy alignment as well
37
(diagram: the records buffered during alignment are consumed first, after the checkpoint completes)
Catching up from heavy alignments
▪ Give the computation time to catch up before starting the next checkpoint
• Set the min-time-between-checkpoints (see the config sketch below)
• Ideas to make checkpoint triggering policy-based (spend x% of capacity on checkpoints)
▪ Asynchronous checkpoints mitigate most of the problem
• Very short stalls in the pipeline mean a shorter alignment phase
• Catch-up already happens concurrently with the state materialization
38
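A minimal configuration sketch for the point above, assuming the DataStream API of that era; the interval and pause values are made-up examples.

import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(60000, CheckpointingMode.EXACTLY_ONCE)   // checkpoint every 60 s
// guarantee the job some time to catch up before the next checkpoint is triggered
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(30000)     // at least 30 s pause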
Asynchrony of different state types
40
State                   | Flink 1.2              | Flink 1.3 | Flink 1.4
Keyed state (RocksDB)   | ✔                      | ✔         | ✔
Keyed state on heap     | ✘ (✔ hidden in 1.2.1)  | ✔         | ✔
Timers                  | ✘                      | ✘         | ✔ (PR)
Operator state          | ✘                      | ✔         | ✔
When to use which state backend?
41
(decision flowchart, a bit simplified) The choice is between the asynchronous heap/FS backend and RocksDB:
• State ≥ memory? If yes, use RocksDB, which keeps state out of core.
• Complex objects (expensive serialization)? Keeping them as objects on the heap avoids serializing on every access.
• High data rate? The heap backend avoids per-record (de)serialization cost.
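As a sketch of what acting on that decision looks like (not from the slides; the checkpoint paths and constructor flags are assumptions based on the APIs of that era):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// State fits in memory, objects are expensive to serialize, high data rate:
// keep objects on the heap, snapshot asynchronously to the checkpoint filesystem.
env.setStateBackend(new FsStateBackend("s3://my-bucket/checkpoints", true))

// State larger than memory: embedded RocksDB, state lives out of core.
// env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/checkpoints"))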
42
We are hiring!
data-artisans.com/careers
44
Backup Slides
Avoiding DDOSing other systems
45
Exceeding FS request capacity
▪ Job size: multiple 1000s of operators
▪ Checkpoint interval: a few seconds
▪ State size: KBs per operator, 1000s of state chunks
▪ Via the S3 FS (from Hadoop), writes ensure the "directory" exists: 2 HEAD requests each
▪ Symptom: S3 blocked off connections after exceeding 1000s of HEAD requests / sec
46
Reducing FS stress for small state
47
(diagram: with the Fs/RocksDB state backends, for most states the JobManager's checkpoint coordinator writes a root checkpoint file holding the metadata, while every task on the TaskManagers writes its own checkpoint data files)
Reducing FS stress for small state
48
(diagram: with the Fs/RocksDB state backends, small states are sent to the checkpoint coordinator as "ack + data" and stored directly in the metadata file instead of separate files)
Increasing the small-state threshold reduces the number of files (default: 1 KB)
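A hedged example of raising that threshold; the constructor signature and the config key are assumptions based on the FsStateBackend of that era, and the path is made up.

// Raise the "small state" threshold from 1 KB to 4 KB so that more tiny state chunks
// are inlined into the checkpoint metadata instead of producing one file each.
// (The same knob also exists as the config entry state.backend.fs.memory-threshold.)
import java.net.URI
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
val fileStateSizeThreshold = 4096  // bytes
env.setStateBackend(new FsStateBackend(new URI("s3://my-bucket/checkpoints"), fileStateSizeThreshold))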
Distributed Coordination
49
Deploying Tasks
50
Happens during initial deployment and recovery
(diagram: JobManager and TaskManager, each with Akka/RPC and a Blob Server; the JobManager sends a deployment RPC call to the TaskManager)
The deployment RPC call contains:
- Job Configuration
- Task Code and Objects
- Recover State Handle
- Correlation IDs
Deploying Tasks
51
Happens during initial deployment and recovery
The deployment RPC call contains:
- Job Configuration (KBs)
- Task Code and Objects (up to MBs)
- Recover State Handle (KBs)
- Correlation IDs (few bytes)
RPC volume during deployment
52
(back-of-the-napkin calculation)
number of tasks (10) x parallelism (1000) x size of task objects (2 MB) = RPC volume ≈ 20 GB
~20 seconds on a full 10 GBit/s network
> 1 min with an average of 3 GBit/s
> 3 min with an average of 1 GBit/s
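Restating the arithmetic as a quick script (a hedged illustration; the interpretation of the factors is mine):

// Back-of-the-napkin deployment RPC volume and transfer times.
val numTasksInGraph = 10           // operators in the job graph ("number of tasks")
val parallelism     = 1000
val taskObjectsMB   = 2.0          // task code/objects shipped per deployment call

val rpcVolumeGB = numTasksInGraph * parallelism * taskObjectsMB / 1024.0   // ≈ 19.5 GB

def transferSeconds(volumeGB: Double, gbitPerSecond: Double): Double =
  volumeGB * 8 / gbitPerSecond

println(f"~${transferSeconds(rpcVolumeGB, 10)}%.0f s on a full 10 GBit/s network")  // ≈ 16 s
println(f"~${transferSeconds(rpcVolumeGB, 3)}%.0f s at an average of 3 GBit/s")     // ≈ 52 s
println(f"~${transferSeconds(rpcVolumeGB, 1)}%.0f s at an average of 1 GBit/s")     // ≈ 156 s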
Timeouts and Failure detection
53
~20 seconds on a full 10 GBit/s network
> 1 min with an average of 3 GBit/s
> 3 min with an average of 1 GBit/s
Default RPC timeout: 10 secs
The default settings lead to failed deployments with RPC timeouts.
Solution: increase the RPC timeout
Caveat: increasing the timeout makes failure detection slower
Future: reduce the RPC load (next slides)
Dissecting the RPC messages
54
Message part           | Size       | Variance across subtasks and redeploys
Job Configuration      | KBs        | constant
Task Code and Objects  | up to MBs  | constant
Recover State Handle   | KBs        | variable
Correlation IDs        | few bytes  | variable
Upcoming: Deploying Tasks
55
Out-of-band transfer and caching of the large, constant message parts:
(1) The deployment RPC call carries only the Recover State Handle, Correlation IDs, and BLOB pointers (KBs)
(2) The TaskManager downloads and caches the BLOBs (Job Config, Task Objects) from the JobManager's Blob Server into its Blob Cache (MBs)
Layers of abstraction
56
Ogres have
layers
So do
squirrels
Apache Flink's Layered APIs
57
Process Function (events, state, time) → stateful event-driven applications
DataStream API (streams, windows) → stream & batch processing
Table API (dynamic tables) / Stream SQL → analytics
Process Function
58
class MyFunction extends ProcessFunction[MyEvent, Result] {

  // declare state to use in the program
  lazy val state: ValueState[CountWithTimestamp] = getRuntimeContext().getState(…)

  def processElement(event: MyEvent, ctx: Context, out: Collector[Result]): Unit = {
    // work with event and state
    (event, state.value) match { … }

    out.collect(…)   // emit events
    state.update(…)  // modify state

    // schedule a timer callback
    ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
  }

  def onTimer(timestamp: Long, ctx: OnTimerContext, out: Collector[Result]): Unit = {
    // handle callback when the event-/processing-time instant is reached
  }
}
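Not shown on the slide: such a function is applied to a keyed stream, roughly as in this hypothetical sketch (the stream name and key selector are assumptions, and the usual scala DataStream imports are assumed).

// Hypothetical usage sketch: keyed state and timers require a keyed stream.
val results: DataStream[Result] =
  events
    .keyBy(_.userId)
    .process(new MyFunction())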
Data Stream API
59
val lines: DataStream[String] = env.addSource(
  new FlinkKafkaConsumer09[String](…))

val events: DataStream[Event] = lines.map((line) => parse(line))

val stats: DataStream[Statistic] = events
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .aggregate(new MyAggregationFunction())

stats.addSink(new RollingSink(path))
Table API & Stream SQL
60
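The slide itself only shows screenshots. As a hedged illustration (written against a somewhat later Flink Table API than the 1.3-era deck, with assumed table and column names), the windowed aggregation from the DataStream example could be expressed in Stream SQL roughly like this:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.table.api.scala._

case class Reading(sensor: String, value: Double)

val env  = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = TableEnvironment.getTableEnvironment(env)

val readings: DataStream[Reading] = env.fromElements(Reading("a", 1.0), Reading("b", 2.0))

// register the stream as a table with a processing-time attribute
tEnv.registerDataStream("Sensors", readings, 'sensor, 'value, 'proctime.proctime)

// 5-second tumbling-window sum per sensor, as Stream SQL
val stats = tEnv.sqlQuery(
  """SELECT sensor,
    |       TUMBLE_END(proctime, INTERVAL '5' SECOND) AS windowEnd,
    |       SUM(`value`) AS total
    |FROM Sensors
    |GROUP BY sensor, TUMBLE(proctime, INTERVAL '5' SECOND)""".stripMargin)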
Events, State, Time, and Snapshots
61
Events, State, Time, and Snapshots
62
f(a,b)
An event-driven function, executed in a distributed fashion
Events, State, Time, and Snapshots
63
f(a,b)
Maintain fault tolerant local state similar to
any normal application
Main memory +
out of core (for maps)
Events, State, Time, and Snapshots
64
f(a,b)
wall clock
event time clock
Access and react to
notions of time and progress,
handle out-of-order events
Events, State, Time, and Snapshots
65
f(a,b)
wall clock
event time clock
Snapshot point-in-time
view for recovery,
rollback, cloning,
versioning, etc.
Stateful Event & Stream Processing
66
Source → Transformation → Transformation → Sink

val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer09(…))

val events: DataStream[Event] = lines.map((line) => parse(line))

val stats: DataStream[Statistic] = events
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .aggregate(new MyAggregationFunction())

stats.addSink(new RollingSink(path))

Streaming Dataflow: Source → Transform → Window (state read/write) → Sink
Stateful Event & Stream Processing
67
(dataflow: Source → Filter / Transform → Sink, with the operators doing state read/write)
Stateful Event & Stream Processing
68
Scalable embedded state
Access at memory speed &
scales with parallel operators
Stateful Event & Stream Processing
69
Re-load state
Reset positions
in input streams
Rolling back computation
Re-processing
Stateful Event & Stream Processing
70
Restore to different
programs
Bugfixes, Upgrades, A/B testing, etc
"Classical" versus
Streaming Architecture
71
Compute, State, and Storage
72
Classic tiered architecture: a compute layer on top of a database layer; the database layer holds the application state and its backup.
Streaming architecture: the application state lives in the compute layer; stream storage and snapshot storage serve as the backup.
Performance
73
Classic tiered architecture: synchronous reads/writes across the tier boundary
Streaming architecture: all modifications are local; asynchronous writes of large blobs
Consistency
74
Classic tiered architecture: distributed transactions; at scale typically at-most / at-least once
Streaming architecture: exactly once per state, with snapshot consistency across states
Scaling a Service
75
Classic tiered architecture: provision compute; separately provision additional database capacity
Streaming architecture: provision compute and state together
Rolling out a new Service
76
Classic tiered architecture: provision a new database (or add capacity to an existing one)
Streaming architecture: provision compute and state together; the new service simply occupies some additional backup space
Repair External State
77
(diagram, streaming architecture: events feed the live application, which has written wrong results to the external state; the events are also backed up in HDFS, S3, etc.)
Repair External State
78
(diagram, streaming architecture: the same application, run on the backed up data (HDFS, S3, etc.), reprocesses the events and overwrites the external state with the correct results)
Repair External State
79
(same diagram as before, reading the events from the backed up data in HDFS, S3, etc. and overwriting the external state with the correct results)
Each application doubles as a batch job!