The Future of
Real-Time in Spark
Reynold Xin @rxin
Spark Summit, New York, Feb 18, 2016
Why Real-Time?
Making decisions faster is valuable.
• Preventing credit card fraud
• Monitoring industrial machinery
• Human-facing dashboards
• …
Streaming Engine
noun.
Takes an input stream and produces an output stream.
Spark Unified Stack
SQL Streaming MLlib GraphX
Spark Core
Spark Streaming
• Introduced 3 years ago, in Spark 0.7
• 50% of users consider it the most important component of Spark
• First attempt at unifying streaming and batch
• State management built in
• Exactly once semantics
• Features required for large clusters
• Straggler mitigation, dynamic load balancing, fast fault-recovery
Streaming computations don’t run in isolation.
Use Case: Fraud Detection
• STREAM: a machine learning model continuously updates to detect new anomalies
• ANOMALY: ad-hoc analysis of historic data
Continuous Application
noun.
An end-to-end application that acts on real-time data.
Challenges Building Continuous Applications
Integration with non-streaming systems is often an afterthought
• Interactive, batch, relational databases, machine learning, …
Streaming programming models are complex
Integration Example
A stream of (page, timestamp) events arriving at 10:08, 10:09, 10:10, … flows into a streaming engine, which maintains per-minute visit counts in MySQL:

Page    | Minute | Visits
home    | 10:09  | 21
pricing | 10:10  | 30
...     | ...    | ...

What can go wrong?
• Late events
• Partial outputs to MySQL
• State recovery on failure
• Distributed reads/writes
• ...
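The late-events pitfall can be sketched in plain Python (a conceptual sketch, not the Spark API; `minute_counts` is a hypothetical stand-in for the engine's aggregation):

```python
from collections import defaultdict

def minute_counts(events):
    """Count page visits per (page, minute) from a list of (page, minute) events."""
    counts = defaultdict(int)
    for page, minute in events:
        counts[(page, minute)] += 1
    return dict(counts)

# The engine emits counts for 10:09 once that minute has "passed"...
on_time = [("home", "10:09")] * 21
emitted = minute_counts(on_time)
assert emitted[("home", "10:09")] == 21

# ...but a late event for 10:09 arrives afterwards. If the 10:09 row was
# already written to MySQL, the stored count of 21 is now stale: the true
# count is 22, and the engine must either rewrite the row or stay wrong.
late = [("home", "10:09")]
true_counts = minute_counts(on_time + late)
assert true_counts[("home", "10:09")] == 22
```

The other failure modes (partial output, state recovery) have the same shape: once results leave the engine, correcting them requires coordination the user must otherwise build by hand.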
Complex Programming Models
• Data: late arrival, varying distribution over time, …
• Processing: business logic changes & new operators (windows, sessions)
• Output: how do we define output over time & correctness?
Structured Streaming
The simplest way to perform streaming analytics
is not having to reason about streaming.
Spark 1.3: Static DataFrames
Spark 2.0: Infinite DataFrames
Single API!
Structured Streaming
High-level streaming API built on Spark SQL engine
• Runs the same queries on DataFrames
• Event time, windowing, sessions, sources & sinks
Unifies streaming, interactive and batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply ML models
Trigger: every 1 sec
At each trigger t = 1, 2, 3, …, the query runs over all input data up to processing time t, producing the result for data up to t. In complete output mode, the entire result table is emitted at every trigger.

Trigger: every 1 sec
Same model, but in delta output mode only the changes to the result since the previous trigger are emitted.
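The trigger model can be simulated in plain Python (a conceptual sketch; `run_query` and `trigger` are hypothetical helpers, not the Spark API): each trigger runs the query over the whole input prefix, and the output mode decides how much of the result is emitted.

```python
from collections import Counter

def run_query(prefix):
    """The 'query': count records per key over all input seen so far."""
    return Counter(prefix)

def trigger(batches, mode):
    """Simulate triggers: at each trigger the query runs over the prefix
    of the input up to that point; output depends on the output mode."""
    seen, prev, outputs = [], Counter(), []
    for batch in batches:
        seen.extend(batch)
        result = run_query(seen)
        if mode == "complete":
            outputs.append(dict(result))        # whole result table each trigger
        elif mode == "delta":
            changed = {k: v for k, v in result.items() if prev.get(k) != v}
            outputs.append(changed)             # only rows that changed
        prev = result
    return outputs

batches = [["home"], ["home", "pricing"], ["pricing"]]
assert trigger(batches, "complete") == [
    {"home": 1}, {"home": 2, "pricing": 1}, {"home": 2, "pricing": 2}]
assert trigger(batches, "delta") == [
    {"home": 1}, {"home": 2, "pricing": 1}, {"pricing": 2}]
```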
Model Details
Input sources: append-only tables
Queries: new operators for windowing, sessions, etc.
Triggers: based on time (e.g. every 1 sec)
Output modes: complete, deltas, update-in-place
Example: ETL
Input: files in S3
Query: map (transform each record)
Trigger: “every 5 sec”
Output mode: “new records”, into S3 sink
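A minimal plain-Python simulation of this ETL pattern (hypothetical helper names, not the Spark API): each trigger reads only files not yet processed and appends the transformed records to the sink.

```python
def etl_trigger(source, processed, transform, sink):
    """One trigger of the ETL query: pick up only files not yet processed,
    transform each record, and append the results to the sink."""
    new_files = [f for f in source if f not in processed]
    for f in new_files:
        sink.extend(transform(rec) for rec in source[f])
        processed.add(f)
    return len(new_files)

source = {"s3://logs/a.json": [{"u": "a", "t": 1}]}    # hypothetical S3 listing
processed, sink = set(), []
transform = lambda rec: {**rec, "t_ms": rec["t"] * 1000}

assert etl_trigger(source, processed, transform, sink) == 1
source["s3://logs/b.json"] = [{"u": "b", "t": 2}]      # a new file arrives
assert etl_trigger(source, processed, transform, sink) == 1  # only b.json is read
assert sink == [{"u": "a", "t": 1, "t_ms": 1000},
                {"u": "b", "t": 2, "t_ms": 2000}]
```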
Example: Page View Count
Input: records in Kafka
Query: select count(*) group by page, minute(evtime)
Trigger: “every 5 sec”
Output mode: “update-in-place”, into MySQL sink
Note: this will automatically update “old” records on late data!
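Update-in-place on late data can be sketched in plain Python (conceptual only; `MySqlSink` is a hypothetical stand-in for the JDBC sink): the engine keeps running counts keyed by (page, minute), so a late event simply increments an old key and rewrites that row.

```python
from collections import defaultdict

class MySqlSink:
    """Stand-in for a MySQL table keyed by (page, minute); upsert semantics."""
    def __init__(self):
        self.rows = {}
    def upsert(self, key, visits):
        self.rows[key] = visits

def process_batch(batch, counts, sink):
    """Fold a micro-batch into running per-(page, minute) counts and
    update-in-place every key the batch touched, even 'old' minutes."""
    for page, minute in batch:
        key = (page, minute)
        counts[key] += 1
        sink.upsert(key, counts[key])

counts, sink = defaultdict(int), MySqlSink()
process_batch([("home", "10:09")] * 21, counts, sink)
assert sink.rows[("home", "10:09")] == 21
# A late 10:09 event arrives later -- the old row is simply rewritten.
process_batch([("home", "10:09"), ("pricing", "10:10")], counts, sink)
assert sink.rows[("home", "10:09")] == 22
assert sink.rows[("pricing", "10:10")] == 1
```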
Execution
Logically: DataFrame operations on static data (i.e. as easy to understand as batch)
Physically: Spark automatically runs the query in streaming fashion (i.e. incrementally and continuously)
DataFrame → Logical Plan → Catalyst optimizer → continuous, incremental execution
Example: Batch Aggregation
logs = ctx.read.format("json").open("s3://logs")
logs.groupBy(logs.user_id).agg(sum(logs.time))
    .write.format("jdbc")
    .save("jdbc:mysql//...")
Example: Continuous Aggregation
logs = ctx.read.format("json").stream("s3://logs")
logs.groupBy(logs.user_id).agg(sum(logs.time))
    .write.format("jdbc")
    .stream("jdbc:mysql//...")
Automatic Incremental Execution
T=0 Aggregate
T=1 Aggregate
T=2 Aggregate
…
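The incremental execution idea can be sketched in plain Python (a conceptual sketch, not Spark internals): keep the running aggregates as state and fold in only each new batch, yielding the same answer as re-running the batch query over all data.

```python
from collections import defaultdict

def incremental_agg(state, batch):
    """One step of incremental execution: instead of re-scanning all input,
    fold only the new batch into the running per-user sums."""
    for user_id, t in batch:
        state[user_id] += t
    return state

# sum(time) grouped by user_id, as in the continuous aggregation example
state = defaultdict(int)
incremental_agg(state, [("u1", 5), ("u2", 3)])   # T=0
incremental_agg(state, [("u1", 2)])              # T=1
assert dict(state) == {"u1": 7, "u2": 3}

# Equivalent batch query over all data -- same answer, more work per trigger:
all_data = [("u1", 5), ("u2", 3), ("u1", 2)]
batch_result = defaultdict(int)
for u, t in all_data:
    batch_result[u] += t
assert batch_result == state
```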
Rest of Spark will follow
• Interactive queries should just work
• Spark’s data source API will be updated to support seamless
streaming integration
• Exactly once semantics end-to-end
• Different output modes (complete, delta, update-in-place)
• ML algorithms will be updated too
What can we do with this that’s hard
with other engines?
Ad-hoc, interactive queries
Dynamically changing queries
Benefits of Spark: elastic scaling, straggler mitigation, etc
Use Case: Fraud Detection
• STREAM: a machine learning model continuously updates to detect new anomalies
• ANOMALY: analysis of historic data
Timeline
Spark 2.0:
• API foundation
• Kafka, file systems, and databases
• Event-time aggregations
Spark 2.1+:
• Continuous SQL
• BI app integration
• Other streaming sources / sinks
• Machine learning
Thank you.
@rxin