Arbitrary Stateful Aggregations
using Structured Streaming
in Apache Spark™
Burak Yavuz
5/16/2017
2
Outline
• Structured Streaming Concepts
• Stateful Processing in Structured Streaming
• Use Cases
• Demos
3
The simplest way to perform streaming analytics
is not having to reason about streaming at all
4
5
New Model
Input: data from the source as an append-only table
Trigger: how frequently to check the input for new data
Query: operations on the input; the usual map/filter/reduce, plus new window and session ops
[Diagram: with a trigger of every 1 sec, the input table grows over time (data up to 1, data up to 2, data up to 3) and the query runs over it at each trigger]
6
New Model
Result: the final operated table, updated every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
[Diagram: with a trigger of every 1 sec, each trigger produces a result for data up to 1, up to 2, up to 3; in complete mode all rows of the result table are output every time]
7
New Model
Result: the final operated table, updated every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
Append output: write only the new rows added to the result table since the previous batch
*Not all output modes are feasible with all queries
[Diagram: same query as before, but in append mode only the rows that are new since the last trigger are output]
8
9
Output Modes
• Append mode (default) - New rows added to the Result Table
since the last trigger are written to the sink. Rows are output
only once and cannot be rescinded.
Example use cases: ETL
10
Output Modes
• Complete mode - The whole Result Table is written to the sink
after every trigger. This is supported for aggregation queries.
Example use cases: Monitoring
11
Output Modes
• Update mode - (Available since Spark 2.1.1) Only the rows in the
Result Table that were updated since the last trigger are written
to the sink.
Example use cases: Alerting, Sessionization
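The output mode is specified on the stream writer. A minimal sketch, assuming a streaming DataFrame named lines already exists; the grouping column and the console sink are illustrative:

// Hedged sketch: set the output mode when starting the query.
val counts = lines.groupBy("value").count()

val query = counts.writeStream
  .outputMode("complete")   // or "append" / "update"
  .format("console")        // illustrative sink for demonstration
  .start()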
12
Outline
• Structured Streaming Concepts
• Stateful Processing in Structured Streaming
• Use Cases
• Demos
13
Event time Aggregations
Many use cases require aggregate statistics by event time
E.g. what's the #errors in each system in 1 hour windows?
Many challenges
Extracting event time from data, handling late, out-of-order data
DStream APIs were insufficient for event time operations
14
Event time Aggregations
Windowing is just another type of grouping in Structured Streaming

Number of records every hour:
parsedData
  .groupBy(window("timestamp", "1 hour"))
  .count()

Avg signal strength of each device every 10 mins:
parsedData
  .groupBy(
    "device",
    window("timestamp", "10 mins"))
  .avg("signal")

Use built-in functions to extract event time
No need for separate extractors
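For context, a hedged sketch of how parsedData might be prepared with built-in functions only; the JSON schema, column names, and source are assumptions for illustration:

import org.apache.spark.sql.functions._
import spark.implicits._   // assumes an active SparkSession named `spark`

// Parse JSON records and cast the event-time string to a timestamp column.
val parsedData = rawData                               // streaming DataFrame of JSON strings
  .select(from_json($"value", schema).as("js"))        // `schema` assumed to be defined elsewhere
  .select($"js.device", $"js.signal",
          $"js.time".cast("timestamp").as("timestamp"))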
15
Advanced Aggregations
Powerful built-in
aggregations
Multiple simultaneous
aggregations
Custom aggs using
reduceGroups, UDAFs
parsedData
  .groupBy(window("timestamp", "1 hour"))
  .agg(avg("signal"), stddev("signal"), max("signal"))

variance, stddev, kurtosis, stddev_samp, collect_list,
collect_set, corr, approx_count_distinct, ...

// Compute a histogram of signal strength by device type.
val hist = ds.groupByKey(_.`type`).mapGroups {
  case (deviceType, data: Iterator[DeviceData]) =>
    val buckets = new Array[Int](10)
    data.foreach { d => buckets(d.signal / 10) += 1 }
    (deviceType, buckets)
}
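The slide also mentions reduceGroups for custom aggregations; a hedged sketch on the same Dataset, keeping the record with the strongest signal per device type:

// Custom aggregation with reduceGroups (sketch).
val strongest = ds
  .groupByKey(_.`type`)
  .reduceGroups((a, b) => if (a.signal >= b.signal) a else b)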
16
Stateful Processing for Aggregations
In-memory streaming state is maintained for aggregations
[Diagram: hourly window counts kept as state across triggers, with entries marked in red where state was updated by late data]
Keeping state allows late data to update counts of old windows
But the size of the state grows indefinitely if old windows are not dropped
17
18
Watermarking and Late Data
Watermark [Spark 2.1] - a
moving threshold that trails
behind the max seen event time
Trailing gap defines how late
data is expected to be
[Diagram: event-time axis with max event time 12:30 PM and watermark 12:20 PM, a trailing gap of 10 mins; data older than the watermark is not expected]
19
Watermarking and Late Data
Data newer than watermark may
be late, but allowed to aggregate
Data older than watermark is "too
late" and dropped
State older than watermark
automatically deleted to limit the
amount of intermediate state
[Diagram: data between the watermark and the max event time is late but allowed to aggregate; data older than the watermark is too late and dropped]
20
Watermarking and Late Data
[Diagram: allowed lateness of 10 mins between the watermark and the max event time; late data within that gap is allowed to aggregate, data older than the watermark is dropped]
parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window("timestamp", "5 minutes"))
  .count()
Control the tradeoff between state size and lateness requirements
Handle more lateness → keep more state
Reduce state → handle less lateness
21
Watermarking to Limit State [Spark 2.1]
parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window("timestamp", "5 minutes"))
  .count()
[Diagram: event time vs. processing time. The system tracks the max observed event time (12:14); for the next trigger the watermark becomes 12:14 - 10 min = 12:04 and state older than 12:04 is deleted. Data arriving late but newer than the watermark (e.g. 12:08 while wm = 12:04) is still considered in the counts; data older than the watermark is ignored and its state dropped.]
More details in the blog post!
22
23
Working With Time
df.withWatermark("timestampColumn", "5 hours")
  .groupBy(window("timestampColumn", "1 minute"))
  .count()
  .writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))
Separate processing details (output rate, late data tolerance)
from query semantics.
24
Working With Time
df.withWatermark("timestampColumn", "5 hours")
  .groupBy(window("timestampColumn", "1 minute"))
  .count()
  .writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))
How to group
data by time
Same in streaming & batch
25
Working With Time
df.withWatermark("timestampColumn", "5 hours")
  .groupBy(window("timestampColumn", "1 minute"))
  .count()
  .writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))
How late
data can be
26
Working With Time
df.withWatermark("timestampColumn", "5 hours")
  .groupBy(window("timestampColumn", "1 minute"))
  .count()
  .writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))
How often
to emit updates
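Putting the pieces together, a hedged end-to-end sketch; the output mode, console sink, and query binding are illustrative assumptions, not part of the slides:

import org.apache.spark.sql.functions.{col, window}
import org.apache.spark.sql.streaming.Trigger

// Minimal runnable sketch, assuming `df` is a streaming DataFrame with a
// timestampColumn of TimestampType.
val query = df
  .withWatermark("timestampColumn", "5 hours")            // how late data can be
  .groupBy(window(col("timestampColumn"), "1 minute"))    // how to group data by time
  .count()
  .writeStream
  .outputMode("update")                                   // emit only changed windows
  .format("console")                                      // illustrative sink
  .trigger(Trigger.ProcessingTime("10 seconds"))          // how often to emit updates
  .start()

query.awaitTermination()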
27
Arbitrary Stateful Operations [Spark 2.2]
mapGroupsWithState
applies any user-defined
stateful operation to a
user-defined per-group state
Direct support for per-key
timeouts in event-time or
processing-time
supports Scala and Java
ds.groupByKey(groupingFunc)
  .mapGroupsWithState(timeoutConf)(mappingWithStateFunc)

def mappingWithStateFunc(
    key: K,
    values: Iterator[V],
    state: GroupState[S]): U = {
  // update or remove state
  // set timeouts
  // return mapped value
}
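A hedged, concrete sketch of such a mapping function, keeping a running event count per key; the case classes and timeout duration are illustrative assumptions:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// Illustrative types, not from the talk.
case class Event(id: Int, timestamp: java.sql.Timestamp)
case class RunningCount(count: Long)

def countEvents(
    id: Int,
    events: Iterator[Event],
    state: GroupState[RunningCount]): (Int, Long) = {
  if (state.hasTimedOut) {
    // No data for this key within the timeout: clean up and report the last count.
    val last = state.get.count
    state.remove()
    (id, last)
  } else {
    val updated = RunningCount(state.getOption.map(_.count).getOrElse(0L) + events.size)
    state.update(updated)
    state.setTimeoutDuration("30 minutes")   // valid only with ProcessingTimeTimeout
    (id, updated.count)
  }
}

// Usage sketch:
// ds.groupByKey(_.id)
//   .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(countEvents _)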
28
flatMapGroupsWithState
• Applies the given function to each group of data, while
maintaining a user-defined per-group state
• Invoked once per group in batch queries
• Invoked once per trigger for each group that has data, in
streaming queries
• Requires user to provide an output mode for the function
29
flatMapGroupsWithState
• mapGroupsWithState is a special case with
• Output mode: Update
• Output size: 1 row per group
• Supports both Processing Time and Event Time timeouts
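For contrast, a hedged sketch of the call shape: unlike mapGroupsWithState, the state function returns an Iterator (zero, one, or many rows per group) and the output mode is passed explicitly. The names below are illustrative:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// The function may emit any number of rows for a group.
def flatFunc(
    key: Int,
    values: Iterator[Event],
    state: GroupState[RunningCount]): Iterator[(Int, Long)] = {
  // ... update or remove state, set timeouts, then emit rows
  Iterator.empty
}

// Usage sketch:
// ds.groupByKey(_.id)
//   .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.NoTimeout())(flatFunc _)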
30
Outline
• Structured Streaming Concepts
• Stateful Processing in Structured Streaming
• Use Cases
• Demos
31
Alerting
val monitoring = stream
  .as[Event]
  .groupByKey(_.id)
  .flatMapGroupsWithState(Append, GST.ProcessingTimeTimeout) {
    (id: Int, events: Iterator[Event], state: GroupState[…]) =>
      ...
  }
  .writeStream
  .queryName("alerts")
  .foreach(new PagerdutySink(credentials))
  .start()
Monitor a stream using custom stateful logic with timeouts.
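A hedged sketch of what the elided state function might look like, assuming GST abbreviates GroupStateTimeout, that Event carries a level field, and illustrative Alert/ErrorState case classes:

import org.apache.spark.sql.streaming.GroupState

// Illustrative types, not from the talk.
case class Alert(id: Int, numErrors: Long)
case class ErrorState(numErrors: Long)

def alertFunc(
    id: Int,
    events: Iterator[Event],
    state: GroupState[ErrorState]): Iterator[Alert] = {
  if (state.hasTimedOut) {
    // No events for this key within the timeout: clear state, emit nothing.
    state.remove()
    Iterator.empty
  } else {
    val errors = state.getOption.map(_.numErrors).getOrElse(0L) +
      events.count(_.level == "ERROR")        // assumes Event has a `level` field
    state.update(ErrorState(errors))
    state.setTimeoutDuration("10 minutes")    // per-key processing-time timeout
    if (errors > 100) Iterator(Alert(id, errors)) else Iterator.empty
  }
}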
32
Sessionization
val sessions = stream
  .as[Event]
  .groupByKey(_.session_id)
  .mapGroupsWithState(GST.EventTimeTimeout) {
    (id: Int, events: Iterator[Event], state: GroupState[…]) =>
      ...
  }
  .writeStream
  .format("parquet")
  .option("path", "/user/sessions")
  .start()
Analyze sessions of user/system behavior
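A hedged sketch of the elided session function, assuming a watermark is defined on the stream (required for EventTimeTimeout) and illustrative SessionInfo/SessionUpdate case classes:

import org.apache.spark.sql.streaming.GroupState

// Illustrative session types, not from the talk.
case class SessionInfo(numEvents: Long, startMs: Long, endMs: Long)
case class SessionUpdate(sessionId: Int, numEvents: Long, durationMs: Long, expired: Boolean)

def sessionFunc(
    id: Int,
    events: Iterator[Event],
    state: GroupState[SessionInfo]): SessionUpdate = {
  if (state.hasTimedOut) {
    // The watermark passed the session's timeout timestamp: the session is over.
    val s = state.get
    state.remove()
    SessionUpdate(id, s.numEvents, s.endMs - s.startMs, expired = true)
  } else {
    val times = events.map(_.timestamp.getTime).toSeq   // assumes Event.timestamp: java.sql.Timestamp
    val old   = state.getOption.getOrElse(SessionInfo(0L, times.min, times.max))
    val s     = SessionInfo(old.numEvents + times.size,
                            math.min(old.startMs, times.min),
                            math.max(old.endMs, times.max))
    state.update(s)
    // Close the session if no new event arrives within 30 minutes of its last event.
    state.setTimeoutTimestamp(s.endMs, "30 minutes")
    SessionUpdate(id, s.numEvents, s.endMs - s.startMs, expired = false)
  }
}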
33
Demo
34
SPARK SUMMIT 2017
DATA SCIENCE AND ENGINEERING AT SCALE
JUNE 5 – 7 | MOSCONE CENTER | SAN FRANCISCO
ORGANIZED BY spark-summit.org/2017
Discount Code: Databricks
We are hiring!
https://databricks.com/company/careers
Thank You
“Does anyone have any questions for my answers?” - Henry Kissinger
