06/10/2024
Chapter 7
Stream processing
Structured streaming in Spark
Spark DStream
Pain points with DStreams
• Processing with event-time, dealing with late data
  • The DStream API exposes batch time, making it hard to incorporate event-time
• Interoperating streaming with batch AND interactive
  • RDDs and DStreams have similar APIs, but code still requires translation between them
• Reasoning about end-to-end guarantees
  • Requires carefully constructing sinks that handle failures correctly
  • Data consistency in the storage while it is being updated
New model
• Input: data from the source as an append-only table
• Trigger: how frequently to check the input for new data
• Query: operations on the input: the usual map/filter/reduce, plus new window and session operations
New model (2)
• Result: the final operated table, updated every trigger interval
• Output: what part of the result to write to the data sink after every trigger
  • Complete output: write the full result table every time
New model (3)
  • Delta output: write only the rows that changed in the result since the previous batch
  • Append output: write only new rows
• *Not all output modes are feasible with all queries (see the sketch below)
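How these pieces fit together can be sketched with the released Structured Streaming API (readStream/writeStream, which replaced the preview-era stream()/startStream() calls used on the slides that follow); the paths, schema and trigger interval here are illustrative placeholders, not part of the original example.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder.appName("model-sketch").getOrCreate()

// Input: the source is treated as an append-only table.
// File sources need an explicit schema (placeholder columns).
val schema = new StructType()
  .add("device", StringType)
  .add("signal", IntegerType)

val input = spark.readStream
  .format("json")
  .schema(schema)
  .load("source-path")

// Query: ordinary DataFrame operations define the result table.
val result = input.select("device", "signal").where("signal > 15")

// Trigger + Output: how often to look for new data, and which part
// of the result table is written to the sink after each trigger.
val query = result.writeStream
  .format("parquet")
  .option("checkpointLocation", "checkpoint-path")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start("dest-path")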
Batch ETL with DataFrames
input = spark.read • Read from Json file
.format("json")
.load("source-path")
result = input • Select some devices
.select("device",
"signal")
.where("signal > 15")
• Write to parquet file
result.write
.format("parquet")
.save("dest-path")
Streaming ETL with DataFrames

• Read from JSON file stream: replace load() with stream()
input = spark.read
  .format("json")
  .stream("source-path")

• Select some devices: the code does not change
result = input
  .select("device", "signal")
  .where("signal > 15")

• Write to Parquet file stream: replace save() with startStream()
result.write
  .format("parquet")
  .startStream("dest-path")
Streaming ETL with DataFrames (2)
• read…stream() creates a streaming DataFrame; it does not start any computation
• write…startStream() defines where and how to output the data, and starts the processing
(a sketch of the same pipeline in the released API follows)
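A minimal sketch of the same pipeline in the released API (placeholder paths and schema). It shows the slide's point that the transformation itself is identical for batch and streaming; only read/readStream and save/start differ, since the preview-era stream() and startStream() became readStream.load() and writeStream.start().

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder.getOrCreate()
val schema = new StructType()
  .add("device", StringType)
  .add("signal", IntegerType)

// The shared query: works on any DataFrame, batch or streaming.
def strongSignals(df: DataFrame): DataFrame =
  df.select("device", "signal").where("signal > 15")

// Batch: read once, write once.
strongSignals(spark.read.schema(schema).json("source-path"))
  .write
  .parquet("batch-dest-path")

// Streaming: same transformation, executed incrementally.
strongSignals(spark.readStream.schema(schema).json("source-path"))
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "checkpoint-path")
  .start("stream-dest-path")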
Continuous Aggregations

• Continuously compute the average signal across all devices
input.avg("signal")

• Continuously compute the average signal of each type of device (a runnable sketch follows)
input.groupBy("device-type")
  .avg("signal")
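To actually run such an aggregation continuously in the released API, the query needs an output mode and a sink. A hedged sketch, assuming the streaming input DataFrame from the earlier sketch also carries a device-type column as on the slide; the console sink is used only for inspection.

// Continuously updated average signal per device type. Aggregations are
// typically emitted with the "complete" output mode: the full result table
// is rewritten on every trigger.
val avgByType = input
  .groupBy("device-type")
  .avg("signal")

val aggQuery = avgByType.writeStream
  .outputMode("complete")
  .format("console")
  .start()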
Continuous Windowed Aggregations

• Continuously compute the average signal of each type of device over the last 10 minutes, using event-time
input.groupBy(
    $"device-type",
    window($"event-time-col", "10 min"))
  .avg("signal")

• Simplifies event-time stream processing (not possible in DStreams)
• Works on both streaming and batch jobs (see the watermark sketch below)
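In the released API the same windowed aggregation is written with the window() function, and a watermark can be added to bound how late event-time data is accepted, which addresses the late-data pain point from the start of the chapter. A sketch, assuming the streaming input DataFrame from earlier has a timestamp column named event-time-col as on the slide.

import org.apache.spark.sql.functions.window
import spark.implicits._   // for the $"column" syntax

val windowedAvg = input
  .withWatermark("event-time-col", "20 minutes")   // accept data up to 20 min late
  .groupBy(
    $"device-type",
    window($"event-time-col", "10 minutes"))
  .avg("signal")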
Joining streams with static data

• Join streaming data from Kafka with static data via JDBC to enrich the streaming data …
kafkaDataset = spark.read
  .kafka("iot-updates")
  .stream()

staticDataset = ctxt.read
  .jdbc("jdbc://", "iot-device-info")

• … without having to think about the fact that you are joining streaming data (a released-API sketch follows)
joinedDataset = kafkaDataset.join(
  staticDataset,
  "device-type")
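A hedged sketch of the same stream-static join in the released API (it needs the spark-sql-kafka connector and a JDBC driver on the classpath); the broker address, JDBC URL, table name and payload schema are placeholders.

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Streaming side: Kafka delivers binary key/value columns, so the JSON
// payload is parsed into a device schema first (placeholder fields).
val deviceSchema = new StructType()
  .add("device", StringType)
  .add("device-type", StringType)
  .add("signal", IntegerType)

val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "iot-updates")
  .load()
  .select(from_json(col("value").cast("string"), deviceSchema).as("data"))
  .select("data.*")

// Static side: an ordinary batch read over JDBC.
val deviceInfo = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host/iot")
  .option("dbtable", "iot-device-info")
  .load()

// The join is written exactly like a batch join; Spark plans it incrementally.
val joined = kafkaStream.join(deviceInfo, "device-type")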
Output modes
Defines what is output every time there is a trigger. Different output modes make sense for different queries.

• Append mode with non-aggregation queries
input.select("device", "signal")
  .write
  .outputMode("append")
  .format("parquet")
  .startStream("dest-path")

• Complete mode with aggregation queries
input.agg(count("*"))
  .write
  .outputMode("complete")
  .format("parquet")
  .startStream("dest-path")
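The Delta output mentioned earlier did not ship under that name; the released API offers a similar update output mode, which writes only the rows changed since the last trigger. A small sketch, reusing the streaming input DataFrame from earlier; the console sink is used only for inspection, and count comes from org.apache.spark.sql.functions.

import org.apache.spark.sql.functions.count

input.agg(count("*"))
  .writeStream
  .outputMode("update")   // emit only rows updated since the previous trigger
  .format("console")
  .start()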
Query Management
query = result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")

query.stop()
query.awaitTermination()
query.exception()
query.sourceStatuses()
query.sinkStatus()

• query: a handle to the running streaming computation, used to manage it
  • Stop it, wait for it to terminate
  • Get its status
  • Get its error, if it terminated
• Multiple queries can be active at the same time
• Each query has a unique name for keeping track of it (a released-API sketch follows)
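In the released API the handle is a StreamingQuery, and some of the slide's calls were renamed (for example, sourceStatuses()/sinkStatus() became status and lastProgress). A sketch, reusing the result DataFrame from the earlier sketches, with placeholder paths and query name.

val query = result.writeStream
  .queryName("signal-etl")                           // unique name for keeping track
  .format("parquet")
  .outputMode("append")
  .option("checkpointLocation", "checkpoint-path")
  .start("dest-path")

query.status              // is the query waiting for data or processing a batch?
query.lastProgress        // metrics for the most recent trigger
query.exception           // the error, if the query terminated with one
query.stop()              // stop the query
query.awaitTermination()  // block until it has terminated

spark.streams.active      // all queries currently running in this session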
Query execution
• Logically
  • Dataset operations on a table (i.e. as easy to understand as batch)
• Physically
  • Spark automatically runs the query in a streaming fashion (i.e. incrementally and continuously)
Structured Streaming
• High-level streaming API built on Datasets/DataFrames
• Event time, windowing, sessions, sources & sinks
• End-to-end exactly once semantics
• Unifies streaming, interactive and batch queries
• Aggregate data in a stream, then serve using JDBC
• Add, remove, change queries at runtime
• Build and apply ML models
Internal execution
• Batch Execution on Spark SQL (diagram)
• Continuous Incremental Execution (diagram)
• Continuous Aggregations (diagram)
Fault-tolerance
• All data and metadata in the system need to be recoverable / replayable
Fault-tolerant Planner
• Tracks offsets by writing the offset range of each execution to a write-ahead log (WAL) in HDFS
• Reads the log to recover from failures, and re-executes the exact range of offsets
Fault-tolerant Sources
• Structured Streaming sources are replayable by design (e.g. Kafka, Kinesis, files) and generate exactly the same data given the offsets recovered by the planner
Fault-tolerant State
• Intermediate "state data" is maintained in versioned, key-value maps in Spark workers, backed by HDFS
• The planner makes sure the "correct version" of the state is used to re-execute after a failure
Fault-tolerant Sink
• Sinks are idempotent (deterministic) by design, and handle re-executions to avoid double-committing the output
Fault-tolerance
offset tracking in WAL + state management + fault-tolerant sources and sinks
= end-to-end exactly-once guarantees
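In the released API, the offset WAL and the versioned state described above live under a per-query checkpoint directory on fault-tolerant storage; restarting the same query with the same checkpointLocation resumes from the recorded offsets and state. A sketch with a placeholder HDFS path, reusing the result DataFrame from the earlier sketches.

val query = result.writeStream
  .format("parquet")
  .outputMode("append")
  .option("checkpointLocation", "hdfs:///checkpoints/signal-etl")
  .start("dest-path")

// After a failure, starting the identical query with the same
// checkpointLocation replays the exact offset ranges from the log and
// restores the matching state versions; combined with a replayable source
// and an idempotent sink (such as the file sink), this gives the
// end-to-end exactly-once behaviour summarised above.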
Structured streaming
Fast, fault-tolerant, exactly-once stateful stream processing, without having to reason about streaming
Use case
• https://mapr.com/blog/real-time-analysis-popular-uber-locations-spark-structured-streaming-machine-learning-kafka-and-mapr-db/
Use case – Twitter sentiment analysis
• Problem statement
• Importing packages
• Twitter token authorization
• DStream transformation
• Generating tweet data
• Extracting sentiments
• Results
• Output directory
• Output usernames
• Output tweets and sentiments
• Sentiments for Trump
• Applying sentiment analysis
(code and output for these steps are shown on the original slides)
References
• Zaharia, Matei, et al. "Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters." 4th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 12). 2012.
• Armbrust, Michael, et al. "Structured streaming: A declarative API for real-time applications in Apache Spark." Proceedings of the 2018 International Conference on Management of Data. 2018.
Thank you for your attention!