06/10/2024
Chapter 7
Stream processing
Structured streaming in Spark
Spark DStream
Pain points with DStreams
• Processing with event-time, dealing with late data
  • The DStream API exposes batch time, making it hard to incorporate event-time
• Interoperating streaming with batch AND interactive
  • RDDs and DStreams have similar APIs, but code still requires translation between them
• Reasoning about end-to-end guarantees
  • Requires carefully constructing sinks that handle failures correctly
  • Data consistency in the storage while it is being updated
New model
• Input: data from the source as an append-only table
• Trigger: how frequently to check the input for new data
• Query: operations on the input: the usual map/filter/reduce, plus new window and session operations
New model (2)
• Result: the final operated table, updated every trigger interval
• Output: what part of the result to write to the data sink after every trigger
  • Complete output: write the full result table every time
New model (3)
  • Delta output: write only the rows that changed in the result since the previous batch
  • Append output: write only new rows
• *Not all output modes are feasible with all queries (see the sketch below)
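How these pieces fit together can be sketched with the released Structured Streaming API (readStream/writeStream, which replaced the preview-era stream()/startStream() calls used on the slides that follow); the paths, schema and trigger interval here are illustrative placeholders, not part of the original example.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder.appName("model-sketch").getOrCreate()

// Input: the source is treated as an append-only table.
// File sources need an explicit schema (placeholder columns).
val schema = new StructType()
  .add("device", StringType)
  .add("signal", IntegerType)

val input = spark.readStream
  .format("json")
  .schema(schema)
  .load("source-path")

// Query: ordinary DataFrame operations define the result table.
val result = input.select("device", "signal").where("signal > 15")

// Trigger + Output: how often to look for new data, and which part
// of the result table is written to the sink after each trigger.
val query = result.writeStream
  .format("parquet")
  .option("checkpointLocation", "checkpoint-path")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start("dest-path")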
Batch ETL with DataFrames
input = spark.read • Read from Json file
.format("json")
.load("source-path")
result = input • Select some devices
.select("device",
"signal")
.where("signal > 15")
• Write to parquet file
result.write
.format("parquet")
.save("dest-path")
Streaming ETL with DataFrames

• Read from JSON file stream: replace load() with stream()
input = spark.read
  .format("json")
  .stream("source-path")

• Select some devices: the code does not change
result = input
  .select("device", "signal")
  .where("signal > 15")

• Write to Parquet file stream: replace save() with startStream()
result.write
  .format("parquet")
  .startStream("dest-path")
Streaming ETL with DataFrames (2)
• read…stream() creates a streaming DataFrame; it does not start any computation
• write…startStream() defines where and how to output the data, and starts the processing
(a sketch of the same pipeline in the released API follows)
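A minimal sketch of the same pipeline in the released API (placeholder paths and schema). It shows the slide's point that the transformation itself is identical for batch and streaming; only read/readStream and save/start differ, since the preview-era stream() and startStream() became readStream.load() and writeStream.start().

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder.getOrCreate()
val schema = new StructType()
  .add("device", StringType)
  .add("signal", IntegerType)

// The shared query: works on any DataFrame, batch or streaming.
def strongSignals(df: DataFrame): DataFrame =
  df.select("device", "signal").where("signal > 15")

// Batch: read once, write once.
strongSignals(spark.read.schema(schema).json("source-path"))
  .write
  .parquet("batch-dest-path")

// Streaming: same transformation, executed incrementally.
strongSignals(spark.readStream.schema(schema).json("source-path"))
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "checkpoint-path")
  .start("stream-dest-path")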
Continuous Aggregations

• Continuously compute the average signal across all devices
input.avg("signal")

• Continuously compute the average signal of each type of device (a runnable sketch follows)
input.groupBy("device-type")
  .avg("signal")
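To actually run such an aggregation continuously in the released API, the query needs an output mode and a sink. A hedged sketch, assuming the streaming input DataFrame from the earlier sketch also carries a device-type column as on the slide; the console sink is used only for inspection.

// Continuously updated average signal per device type. Aggregations are
// typically emitted with the "complete" output mode: the full result table
// is rewritten on every trigger.
val avgByType = input
  .groupBy("device-type")
  .avg("signal")

val aggQuery = avgByType.writeStream
  .outputMode("complete")
  .format("console")
  .start()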
Continuous Windowed Aggregations

• Continuously compute the average signal of each type of device over the last 10 minutes, using event-time
input.groupBy(
    $"device-type",
    window($"event-time-col", "10 min"))
  .avg("signal")

• Simplifies event-time stream processing (not possible in DStreams)
• Works on both streaming and batch jobs (see the watermark sketch below)
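In the released API the same windowed aggregation is written with the window() function, and a watermark can be added to bound how late event-time data is accepted, which addresses the late-data pain point from the start of the chapter. A sketch, assuming the streaming input DataFrame from earlier has a timestamp column named event-time-col as on the slide.

import org.apache.spark.sql.functions.window
import spark.implicits._   // for the $"column" syntax

val windowedAvg = input
  .withWatermark("event-time-col", "20 minutes")   // accept data up to 20 min late
  .groupBy(
    $"device-type",
    window($"event-time-col", "10 minutes"))
  .avg("signal")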
Joining streams with static data

• Join streaming data from Kafka with static data via JDBC to enrich the streaming data …
kafkaDataset = spark.read
  .kafka("iot-updates")
  .stream()

staticDataset = ctxt.read
  .jdbc("jdbc://", "iot-device-info")

• … without having to think about the fact that you are joining streaming data (a released-API sketch follows)
joinedDataset = kafkaDataset.join(
  staticDataset,
  "device-type")
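A hedged sketch of the same stream-static join in the released API (it needs the spark-sql-kafka connector and a JDBC driver on the classpath); the broker address, JDBC URL, table name and payload schema are placeholders.

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Streaming side: Kafka delivers binary key/value columns, so the JSON
// payload is parsed into a device schema first (placeholder fields).
val deviceSchema = new StructType()
  .add("device", StringType)
  .add("device-type", StringType)
  .add("signal", IntegerType)

val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "iot-updates")
  .load()
  .select(from_json(col("value").cast("string"), deviceSchema).as("data"))
  .select("data.*")

// Static side: an ordinary batch read over JDBC.
val deviceInfo = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host/iot")
  .option("dbtable", "iot-device-info")
  .load()

// The join is written exactly like a batch join; Spark plans it incrementally.
val joined = kafkaStream.join(deviceInfo, "device-type")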
Output modes
Defines what is output every time there is a trigger. Different output modes make sense for different queries.

• Append mode with non-aggregation queries
input.select("device", "signal")
  .write
  .outputMode("append")
  .format("parquet")
  .startStream("dest-path")

• Complete mode with aggregation queries
input.agg(count("*"))
  .write
  .outputMode("complete")
  .format("parquet")
  .startStream("dest-path")
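The Delta output mentioned earlier did not ship under that name; the released API offers a similar update output mode, which writes only the rows changed since the last trigger. A small sketch, reusing the streaming input DataFrame from earlier; the console sink is used only for inspection, and count comes from org.apache.spark.sql.functions.

import org.apache.spark.sql.functions.count

input.agg(count("*"))
  .writeStream
  .outputMode("update")   // emit only rows updated since the previous trigger
  .format("console")
  .start()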
Query Management
query = result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")

query.stop()
query.awaitTermination()
query.exception()
query.sourceStatuses()
query.sinkStatus()

• query: a handle to the running streaming computation, used to manage it
  • Stop it, wait for it to terminate
  • Get its status
  • Get its error, if it terminated
• Multiple queries can be active at the same time
• Each query has a unique name for keeping track of it (a released-API sketch follows)
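In the released API the handle is a StreamingQuery, and some of the slide's calls were renamed (for example, sourceStatuses()/sinkStatus() became status and lastProgress). A sketch, reusing the result DataFrame from the earlier sketches, with placeholder paths and query name.

val query = result.writeStream
  .queryName("signal-etl")                           // unique name for keeping track
  .format("parquet")
  .outputMode("append")
  .option("checkpointLocation", "checkpoint-path")
  .start("dest-path")

query.status              // is the query waiting for data or processing a batch?
query.lastProgress        // metrics for the most recent trigger
query.exception           // the error, if the query terminated with one
query.stop()              // stop the query
query.awaitTermination()  // block until it has terminated

spark.streams.active      // all queries currently running in this session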
Query execution
• Logically
  • Dataset operations on a table (i.e. as easy to understand as batch)
• Physically
  • Spark automatically runs the query in a streaming fashion (i.e. incrementally and continuously)
Structured Streaming
• High-level streaming API built on Datasets/DataFrames
• Event time, windowing, sessions, sources & sinks
• End-to-end exactly once semantics
• Unifies streaming, interactive and batch queries
• Aggregate data in a stream, then serve using JDBC
• Add, remove, change queries at runtime
• Build and apply ML models
Internal execution
• Batch Execution on Spark SQL (diagram)
• Continuous Incremental Execution (diagram)
• Continuous Aggregations (diagram)
Fault-tolerance
• All data and metadata in the system need to be recoverable / replayable
Fault-tolerant Planner
• Tracks offsets by writing the offset range of each execution to a write-ahead log (WAL) in HDFS
• Reads the log to recover from failures, and re-executes the exact range of offsets
Fault-tolerant Sources
• Structured Streaming sources are replayable by design (e.g. Kafka, Kinesis, files) and generate exactly the same data given the offsets recovered by the planner
Fault-tolerant State
• Intermediate "state data" is maintained in versioned, key-value maps in Spark workers, backed by HDFS
• The planner makes sure the "correct version" of the state is used to re-execute after a failure
Fault-tolerant Sink
• Sinks are idempotent (deterministic) by design, and handle re-executions to avoid double-committing the output
Fault-tolerance
offset tracking in WAL + state management + fault-tolerant sources and sinks
= end-to-end exactly-once guarantees
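In the released API, the offset WAL and the versioned state described above live under a per-query checkpoint directory on fault-tolerant storage; restarting the same query with the same checkpointLocation resumes from the recorded offsets and state. A sketch with a placeholder HDFS path, reusing the result DataFrame from the earlier sketches.

val query = result.writeStream
  .format("parquet")
  .outputMode("append")
  .option("checkpointLocation", "hdfs:///checkpoints/signal-etl")
  .start("dest-path")

// After a failure, starting the identical query with the same
// checkpointLocation replays the exact offset ranges from the log and
// restores the matching state versions; combined with a replayable source
// and an idempotent sink (such as the file sink), this gives the
// end-to-end exactly-once behaviour summarised above.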
Structured streaming
Fast, fault-tolerant, exactly-once stateful stream processing, without having to reason about streaming
Use case
• https://mapr.com/blog/real-time-analysis-popular-uber-locations-spark-structured-streaming-machine-learning-kafka-and-mapr-db/
Use case – Twitter sentiment analysis
• Problem statement
• Importing packages
• Twitter token authorization
• DStream transformation
• Generating tweet data
• Extracting sentiments
• Results
• Output directory
• Output usernames
• Output tweets and sentiments
• Sentiments for Trump
• Applying sentiment analysis
(code and output for these steps are shown on the original slides)
References
• Zaharia, Matei, et al. "Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters." 4th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 12). 2012.
• Armbrust, Michael, et al. "Structured streaming: A declarative API for real-time applications in Apache Spark." Proceedings of the 2018 International Conference on Management of Data. 2018.
Thank you for your attention!