Stephan Ewen
@stephanewen
The Stream Processor
as a Database
Apache Flink
2
Streaming technology is enabling the obvious:
continuous processing on data that is
continuously produced
Apache Flink Stack
3
Libraries
DataStream API (Stream Processing)  |  DataSet API (Batch Processing)
Runtime (Distributed Streaming Data Flow)
Streaming and batch as first-class citizens.
Programs and Dataflows
4
Source → Transformation → Transformation → Sink
val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer09(…))

val events: DataStream[Event] = lines.map((line) => parse(line))

val stats: DataStream[Statistic] = events
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .apply(new MyAggregationFunction())

stats.addSink(new RollingSink(path))
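The snippet assumes an Event type, a Statistic type, and a parse helper that the talk does not show; a minimal, purely illustrative sketch of what they might look like:

case class Event(sensor: String, value: Double)
case class Statistic(sensor: String, avg: Double)

// Hypothetical parser for comma-separated "sensor,value" lines
def parse(line: String): Event = {
  val Array(sensor, value) = line.split(",")
  Event(sensor, value.toDouble)
}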
[Streaming dataflow: parallel Source [1], [2] → map() [1], [2] → keyBy()/window()/apply() [1], [2] → Sink [1]]
What makes Flink flink?
5
• Low latency, high throughput, well-behaved flow control (back pressure)
• Make more sense of data: works on real-time and historic data
• True streaming, event time, APIs, libraries
• Stateful streaming: globally consistent savepoints, exactly-once semantics for fault tolerance, windows & user-defined state
• Flexible windows (time, count, session, roll-your-own), Complex Event Processing
The (Classic) Use Case
Realtime Counts and Aggregates
6
(Real)Time Series Statistics
7
stream of events → real-time statistics
The Architecture
8
collect → log → analyze → serve & store
The Flink Job
9
case class Impressions(id: String, impressions: Long)

val events: DataStream[Event] =
  env.addSource(new FlinkKafkaConsumer09(…))

val impressions: DataStream[Impressions] = events
  .filter(evt => evt.isImpression)
  .map(evt => Impressions(evt.id, evt.numImpressions))

val counts: DataStream[Impressions] = impressions
  .keyBy("id")
  .timeWindow(Time.hours(1))
  .sum("impressions")
The Flink Job
10
[Dataflow: Kafka Source → filter() → map() → keyBy() → window()/sum() → Sink, with two parallel instances of each operator]
Putting it all together
11
Periodically (every second), flush the new aggregates to Redis.
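A minimal sketch of such a sink, assuming the Jedis Redis client and a hypothetical RedisAggregateSink class (the talk does not show the actual sink; the flush cadence itself comes from the windows and triggers upstream):

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
import redis.clients.jedis.Jedis

// Hypothetical sink: writes the latest aggregate per key to Redis
class RedisAggregateSink(host: String, port: Int) extends RichSinkFunction[Impressions] {
  @transient private var jedis: Jedis = _

  override def open(parameters: Configuration): Unit = {
    jedis = new Jedis(host, port)  // one connection per parallel sink instance
  }

  override def invoke(value: Impressions): Unit = {
    jedis.set(value.id, value.impressions.toString)  // overwrite the latest count for this key
  }

  override def close(): Unit = {
    if (jedis != null) jedis.close()
  }
}

counts.addSink(new RedisAggregateSink("localhost", 6379))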
How does it perform?
12
Latency · Throughput · Number of Keys
Yahoo! Streaming Benchmark
13
[Chart: 99th percentile latency (sec), lower is better, vs. throughput (1,000 events/sec) for Storm 0.10, Flink 0.10, and Spark Streaming 1.5]
Extended Benchmark: Throughput
14
[Chart: throughput]
• 10 Kafka brokers with 2 partitions each
• 10 compute machines (Flink / Storm)
• Xeon E3-1230-V2@3.30GHz CPU (4 cores HT)
• 32 GB RAM (only 8GB allocated to JVMs)
• 10 GigE Ethernet between compute nodes
• 1 GigE Ethernet between Kafka cluster and Flink nodes
Scaling Number of Users
• The Yahoo! Streaming Benchmark has 100 keys only
  • Every second, only 100 keys are written to the key/value store
  • Quite few compared to many real-world use cases
• Tweet impressions: millions of keys per hour
  • Up to millions of keys updated per second
15
Number of Keys
Performance
16
[Chart: performance vs. number of keys]
The Bottleneck
17
Writes to the key/value store take too long
Queryable State
18
Queryable State
19
Queryable State
20
Optional, and only at the end of windows
Queryable State Enablers
• Flink has state as a first-class citizen (see the sketch below)
• State is fault tolerant (exactly-once semantics)
• State is partitioned (sharded) together with the operators that create/update it
• State is continuous (not mini-batched)
• State is scalable (e.g., embedded RocksDB state backend)
21
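A minimal sketch of what "state as a first-class citizen" looks like in the DataStream API, reusing the Impressions type from the job above (the running-count logic here is illustrative, not the exact job):

import org.apache.flink.streaming.api.scala._

// Per-key running total, kept in Flink's fault-tolerant, partitioned keyed state
val running: DataStream[Impressions] = impressions
  .keyBy(_.id)
  .mapWithState[Impressions, Long] { (evt, state: Option[Long]) =>
    val total = state.getOrElse(0L) + evt.impressions
    (Impressions(evt.id, total), Some(total))  // emit the updated count and store it back
  }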
Queryable State Status
• [FLINK-3779] / Pull Request #2051: Queryable State Prototype
• Design and implementation are still evolving
• Some experiments used earlier versions of the implementation
• Exact numbers may differ in the final implementation, but the order of magnitude is comparable
22
Queryable State Performance
23
Queryable State: Application View
24
The application is only interested in the latest real-time results.
Queryable State: Application View
25
The application requires both the latest real-time results and older results.
[Diagram: the Application queries a Query Service for current-time windows (real-time results) and a Database for past-time windows (older results)]
Apache Flink Architecture Review
26
Queryable State: Implementation
27
[Diagram: a Query Client issues queries of the form /job/operation/state-name/key.
(1) The client asks the Job Manager for the location of the "key-partition" for "operator" of "job".
(2) The State Location Server looks up the location in the ExecutionGraph, which tracks the deploy status of the operators.
(3) The location is returned to the client.
(4) The client queries the state-name and key directly on the Task Manager that holds the partition, where operators such as window()/sum() have registered their local state in a State Registry.]
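The prototype API (FLINK-3779) was still evolving at the time; the following is a rough, non-authoritative sketch of both sides of this interaction, written against the API shape that queryable state later settled on. The state name, host, port, and key below are assumptions:

// Job side: expose the keyed stream as queryable state, registered under a name
counts
  .keyBy(_.id)
  .asQueryableState("impression-counts")

// Client side (separate process): resolve the state location and fetch the value for one key
import org.apache.flink.api.common.state.ValueStateDescriptor
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.queryablestate.client.QueryableStateClient

val client = new QueryableStateClient("taskmanager-host", 9069)  // proxy host/port: assumptions
val resultFuture = client.getKvState(
  jobId,                                // JobID of the running job (obtained elsewhere)
  "impression-counts",                  // registered state name
  "some-ad-id",                         // key to look up
  createTypeInformation[String],
  new ValueStateDescriptor[Impressions]("impression-counts", classOf[Impressions]))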
Contrasting with key/value stores
28
Turning the Database Inside Out
• Cf. Martin Kleppmann's talks on re-designing data warehousing based on log-centric processing
• This viewpoint picks up some of those concepts
• Queryable State in Apache Flink = (Turning the DB inside out)++
29
Write Path in Cassandra (simplified)
30
From the Apache Cassandra docs
Write Path in Cassandra (simplified)
31
From the Apache Cassandra docs
The first step is a durable write to the commit log (as in all databases that offer strong durability).
The memtable is a re-computable view of the commit log actions (and the persistent SSTables).
Write Path in Cassandra (simplified)
32
From the Apache Cassandra docs
The first step is a durable write to the commit log (as in all databases that offer strong durability).
The memtable is a re-computable view of the commit log actions (and the persistent SSTables).
Replication to a quorum happens before the write is acknowledged.
Durability of Queryable state
33
[Diagram: operator state is periodically snapshotted]
Durability of Queryable state
34
The event sequence is the ground truth and is already durably stored in the log.
Queryable state is re-computable from a checkpoint plus the log.
Snapshot replication can happen in the background.
Performance of Flink's State
35
[Diagram: Source / filter() / map() → window()/sum(), backed by a state index (e.g., RocksDB)]
Events are persistent and ordered (per partition / key) in the log (e.g., Apache Kafka).
Events flow without replication or synchronous writes.
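A minimal configuration sketch for this setup; the checkpoint interval and checkpoint directory below are illustrative assumptions:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Draw a consistent snapshot of all operator state every 5 seconds
env.enableCheckpointing(5000)

// Keep operator state in an embedded RocksDB instance;
// snapshots are written to the given checkpoint directory
env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"))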
Performance of Flink's State
36
Trigger checkpoint: a checkpoint barrier is injected at the sources and flows through the dataflow (Source / filter() / map() → window()/sum()).
Performance of Flink's State
37
Take state snapshot: RocksDB triggers a copy-on-write snapshot of the operator state.
Performance of Flink's State
38
Persist state snapshots: the snapshots are durably persisted asynchronously while the processing pipeline continues.
Conclusion
39
Takeaways
• Streaming applications are often not bound by the stream processor itself; cross-system interaction is frequently the biggest bottleneck
• Queryable state mitigates a big bottleneck: communication with external key/value stores to publish real-time results
• Apache Flink's sophisticated support for state makes this possible
40
Takeaways
Performance of Queryable State
• Data persistence is fast with logs (Apache Kafka): append-only writes and streaming replication
• Computed state is fast with local data structures and no synchronous replication (Apache Flink)
• Flink's checkpointing mechanism makes computed state persistent with low overhead
41
Go Flink!
42
• Low latency, high throughput, well-behaved flow control (back pressure)
• Make more sense of data: works on real-time and historic data
• True streaming, event time, APIs, libraries
• Stateful streaming: globally consistent savepoints, exactly-once semantics for fault tolerance, windows & user-defined state
• Flexible windows (time, count, session, roll-your-own), Complex Event Processing
Flink Forward 2016, Berlin
Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016
www.flink-forward.org
We are hiring!
data-artisans.com/careers
