Lecture 7 - 1-Spark - Streaming

The document discusses Structured Streaming in Spark, highlighting the transition from DStreams to a new model that supports event-time processing, continuous aggregations, and fault tolerance. It outlines the advantages of using DataFrames for both batch and streaming ETL, the role of output modes and query management in streaming applications, and the system's ability to provide exactly-once semantics and recoverability in the event of failures.



Chapter 7
Stream processing
Structured streaming in Spark


Spark DStream

Pain points with DStreams


• Processing with event-time, dealing with late data
  • The DStream API exposes batch time, making it hard to incorporate event-time
• Interoperating streaming with batch AND interactive
  • RDD and DStream have similar APIs, but moving between them still requires translation
• Reasoning about end-to-end guarantees
  • Requires carefully constructing sinks that handle failures correctly
  • Data consistency in the storage while it is being updated


New model
• Input: data from the source as an append-only table
• Trigger: how frequently to check the input for new data
• Query: operations on the input, the usual map/filter/reduce plus new window and session operations

New model (2)


• Result: the final result table, updated after every trigger interval
• Output: what part of the result to write to the data sink after every trigger
  • Complete output: write the full result table every time


New model (3)


  • Delta output: write only the rows that changed in the result since the previous batch
  • Append output: write only new rows
• Note: not all output modes are feasible with all queries
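As a minimal, hedged sketch of this model using the released Structured Streaming API (readStream/writeStream, which replaced the pre-release stream()/startStream() calls shown later in these slides), the snippet below uses the built-in "rate" source and the console sink; the source, trigger interval, and output mode are illustrative choices, not part of the original slides.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("new-model-sketch").getOrCreate()

// Input: the "rate" source produces an append-only table of (timestamp, value) rows
val input = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 10)
  .load()

// Query: an ordinary DataFrame aggregation over the input table
val result = input.groupBy().count()

// Trigger + Output: check for new data every 10 seconds and write the
// full result table ("complete" mode) to the console sink
val query = result.writeStream
  .outputMode("complete")
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()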

Batch ETL with DataFrames


• Read from a JSON file

input = spark.read
  .format("json")
  .load("source-path")

• Select some devices

result = input
  .select("device", "signal")
  .where("signal > 15")

• Write to a Parquet file

result.write
  .format("parquet")
  .save("dest-path")


Streaming ETL with DataFrames

input = spark.read
  .format("json")
  .stream("source-path")

result = input
  .select("device", "signal")
  .where("signal > 15")

result.write
  .format("parquet")
  .startStream("dest-path")

• Read from a stream of JSON files: replace load() with stream()
• Select some devices: the query code does not change
• Write to a stream of Parquet files: replace save() with startStream()
• read…stream() creates a streaming DataFrame; it does not start any computation
• write…startStream() defines where and how to output the data, and starts the processing
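The stream()/startStream() calls above come from the pre-2.0 alpha of Structured Streaming; in released Spark versions the same pipeline is written with readStream/writeStream. A hedged sketch of the equivalent, assuming the same JSON input columns and paths (the schema string and checkpoint path are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("streaming-etl").getOrCreate()

// Streaming read: each new JSON file under source-path becomes new rows in the input table
val input = spark.readStream
  .format("json")
  .schema("device STRING, signal DOUBLE")   // streaming file sources need an explicit schema
  .load("source-path")

// Same query code as the batch version
val result = input
  .select("device", "signal")
  .where("signal > 15")

// Streaming write: appends new Parquet files under dest-path as results arrive
val query = result.writeStream
  .format("parquet")
  .option("path", "dest-path")
  .option("checkpointLocation", "dest-path/_checkpoints")  // illustrative location
  .start()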


Continuous Aggregations

input.avg("signal")
• Continuously compute the average signal across all devices

input.groupBy("device-type")
  .avg("signal")
• Continuously compute the average signal of each type of device

Continuous Windowed Aggregations

input.groupBy(
    $"device-type",
    window($"event-time-col", "10 min"))
  .avg("signal")

• Continuously compute the average signal of each type of device over the last 10 minutes, using event-time
• Simplifies event-time stream processing (not possible with DStreams)
• Works on both streaming and batch jobs
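A minimal runnable sketch of the same windowed aggregation with the released API, assuming a JSON file source whose records carry device-type, signal, and event-time columns (the schema string, paths, and sink choice are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("windowed-agg").getOrCreate()
import spark.implicits._

val input = spark.readStream
  .format("json")
  .schema("`device-type` STRING, signal DOUBLE, `event-time` TIMESTAMP")
  .load("source-path")

// Average signal per device type over 10-minute event-time windows
val windowed = input
  .groupBy($"device-type", window($"event-time", "10 minutes"))
  .avg("signal")

// Streaming aggregations are emitted here in complete mode to the console sink
val query = windowed.writeStream
  .outputMode("complete")
  .format("console")
  .start()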


Joining streams with static data

kafkaDataset = spark.read
  .kafka("iot-updates")
  .stream()

staticDataset = ctxt.read
  .jdbc("jdbc://", "iot-device-info")

joinedDataset = kafkaDataset.join(
  staticDataset, "device-type")

• Join streaming data from Kafka with static data read via JDBC to enrich the streaming data …
• … without having to think about the fact that you are joining with streaming data


Output modes
Defines what is outputted every time there is a trigger.
Different output modes make sense for different queries.

• Append mode with non-aggregation queries:

input.select("device", "signal")
  .write
  .outputMode("append")
  .format("parquet")
  .startStream("dest-path")

• Complete mode with aggregation queries:

input.agg(count("*"))
  .write
  .outputMode("complete")
  .format("parquet")
  .startStream("dest-path")


Query Management

query = result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")

query.stop()
query.awaitTermination()
query.exception()
query.sourceStatuses()
query.sinkStatus()

• query: a handle to the running streaming computation, used for managing it
  • Stop it, wait for it to terminate
  • Get its status
  • Get the error, if it terminated
• Multiple queries can be active at the same time
• Each query has a unique name for keeping track of it
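In released Spark versions the handle returned by writeStream...start() is a StreamingQuery with a slightly different surface (sourceStatuses()/sinkStatus() were replaced by status and progress reporting). A hedged sketch, assuming query and spark come from one of the earlier sketches:

// query is the StreamingQuery returned by writeStream...start()
println(query.id)            // unique identifier of this query
println(query.name)          // optional name set via queryName()
println(query.status)        // whether it is active / waiting for data
println(query.lastProgress)  // metrics of the most recent trigger
query.explain()              // physical plan of the incremental execution

// Enumerate all queries active on this session
spark.streams.active.foreach(q => println(q.name))

// Lifecycle management
query.stop()                 // stop the query
query.awaitTermination()     // or block until it stops or fails
query.exception().foreach(e => e.printStackTrace())  // error, if it terminated with one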


Query execution
• Logically
  • Dataset operations on a table (i.e. as easy to understand as batch)
• Physically
  • Spark automatically runs the query in a streaming fashion (i.e. incrementally and continuously)


Structured Streaming
• High-level streaming API built on Datasets/DataFrames
• Event time, windowing, sessions, sources & sinks
• End-to-end exactly once semantics
• Unifies streaming, interactive and batch queries
• Aggregate data in a stream, then serve using JDBC
• Add, remove, change queries at runtime
• Build and apply ML models


Internal execution

Batch Execution on Spark SQL

Continuous Incremental Execution

Continuous Aggregations

Fault-tolerance

• All data and metadata in the system needs to be recoverable / replayable


Fault-tolerant Planner

• Tracks offsets by writing the offset range of each execution to a write-ahead log (WAL) in HDFS
• Reads the log to recover from failures and re-executes the exact range of offsets

Fault-tolerant Sources
• Structured Streaming sources are by design replayable (e.g. Kafka, Kinesis, files) and generate exactly the same data given the offsets recovered by the planner


Fault-tolerant State
• Intermediate "state data" is maintained in versioned key-value maps in Spark workers, backed by HDFS
• The planner makes sure the "correct version" of the state is used to re-execute after a failure


Fault-tolerant Sink

• Sinks are by design idempotent (deterministic), and handle re-executions to avoid committing the output twice


Fault-tolerance
offset tracking in WAL
+ state management
+ fault-tolerant sources and sinks
= end-to-end exactly-once guarantees
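In the released API, the offset WAL and state versions live under a user-supplied checkpoint location; restarting the same query against the same checkpoint resumes from the recorded offsets. A minimal hedged sketch, assuming the streaming result DataFrame from the earlier Streaming ETL sketch (paths are illustrative):

// All offsets, the WAL, and state snapshots for this query are kept here
val checkpointDir = "hdfs:///checkpoints/etl-query"   // illustrative path

val query = result.writeStream
  .format("parquet")
  .option("path", "dest-path")
  .option("checkpointLocation", checkpointDir)
  .start()

// If the driver fails and the application is restarted, starting a query with the
// same checkpointLocation makes the planner re-read the WAL and re-execute the
// exact offset ranges that were in flight, giving end-to-end exactly-once output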


Structured streaming

Fast, fault-tolerant, exactly-once stateful stream processing, without having to reason about streaming


Use case
• https://mapr.com/blog/real-time-analysis-popular-uber-locations-spark-structured-streaming-machine-learning-kafka-and-mapr-db/


Use case – Twitter sentiment analysis


Problem statement


Importing packages


Twitter token authorization
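The slide's code screenshot is not reproduced in this text version. As a minimal sketch of how Twitter credentials are typically supplied to the twitter4j library used by the Spark Twitter connector (the key and token strings are placeholders):

// Placeholders: fill in the credentials of your Twitter developer app
System.setProperty("twitter4j.oauth.consumerKey", "<consumerKey>")
System.setProperty("twitter4j.oauth.consumerSecret", "<consumerSecret>")
System.setProperty("twitter4j.oauth.accessToken", "<accessToken>")
System.setProperty("twitter4j.oauth.accessTokenSecret", "<accessTokenSecret>")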


DStream transformation
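Again the original code is only a screenshot; the sketch below shows the general shape of this step, assuming the spark-streaming-twitter (Bahir) connector and the credentials set in the previous sketch. The hashtag filter, batch interval, and output prefix are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val conf = new SparkConf().setAppName("twitter-sentiment")
val ssc = new StreamingContext(conf, Seconds(10))   // illustrative batch interval

// Create a DStream of tweets matching a filter (credentials are read from the
// twitter4j system properties set in the previous sketch)
val filters = Seq("#Trump")                         // illustrative filter
val tweets = TwitterUtils.createStream(ssc, None, filters)

// Transform the DStream: keep English tweets and extract user name and text
val englishTweets = tweets.filter(_.getLang == "en")
val userAndText = englishTweets.map(status => (status.getUser.getScreenName, status.getText))

userAndText.saveAsTextFiles("output/tweets")        // illustrative output prefix
ssc.start()
ssc.awaitTermination()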


Generating tweet data


Extracting sentiments
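The slides apply a sentiment model to each tweet, but the library used is not visible in this text version. As a stand-in only, the sketch below scores tweets with a tiny hypothetical word-list function so the surrounding DStream code stays runnable; a real pipeline would call an NLP library here instead:

// Hypothetical toy scorer: +1 per positive word, -1 per negative word
val positive = Set("good", "great", "happy", "win")
val negative = Set("bad", "sad", "angry", "lose")

def detectSentiment(text: String): Int =
  text.toLowerCase.split("\\s+").map {
    case w if positive(w) =>  1
    case w if negative(w) => -1
    case _                =>  0
  }.sum

// Attach a sentiment score to each (user, text) pair from the previous sketch
val withSentiment = userAndText.map { case (user, text) =>
  (user, text, detectSentiment(text))
}

withSentiment.print()   // or saveAsTextFiles("output/sentiments")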


Results


Output directory


Output usernames


Output tweets and sentiments


Sentiments for Trump


Applying sentiment analysis


References
• Zaharia, Matei, et al. "Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters." 4th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 12). 2012.
• Armbrust, Michael, et al. "Structured streaming: a declarative API for real-time applications in Apache Spark." Proceedings of the 2018 International Conference on Management of Data. 2018.


Thank you for your attention!
