Temporal Joins in Kafka Streams and ksqlDB
Matthias J. Sax | Software Engineer
@MatthiasJSax
Ecosystem
ksqlDB: streaming database for Apache Kafka
• SQL interface to process data stored in Apache Kafka
• Declarative approach to stream processing
• Queries instead of “programming”
Kafka Streams: Java library for stream processing
• Part of Apache Kafka
• “Functional” DSL, but still programming
Both ksqlDB and Kafka Streams support joins.
Joins are powerful but streaming joins can be difficult to understand.
Joins: The Basics
https://www.confluent.io/kafka-summit-ny19/zen-and-the-art-of-streaming-joins/
Temporal Joins – Why should I give a Damn?
Static Data vs Streaming Data
• Data is constantly in motion
• Input tables are not static but updated all the time
• The result must be updated continuously and with deterministic semantics
Relational Joins are Defined over (static) Tables only:
• What about joining streams?
• What about joining a stream and a table?
Temporal Joins define deterministic (event-time) semantics
over continuously changing inputs.
Event-time vs Processing-time
Database Transactions are not predictable!
Database transactions (Txs) offer ACID guarantees that are defined over processing time:
• If you run a set of concurrent (read/write) transactions over a database multiple times, there is no guarantee that you get the same result!
• You “only” get a guarantee that each “run” produces a consistent result
Example: Tx Processing
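[Figure: Tx1 (write) and Tx2 (write) execute concurrently with Tx3 (read, join); which writes Tx3 observes is not determined up front.]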
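The effect can be illustrated with a small sketch. It assumes, purely for illustration, that Tx1 writes one join input, Tx2 writes the other, and Tx3 reads both; the names `run`, `left`, and `right` are hypothetical and not part of any database API. Every serial order is a valid outcome, yet the join result seen by Tx3 differs between them.

```python
from itertools import permutations

def run(order):
    """Execute three hypothetical transactions serially in the given order."""
    db = {"left": None, "right": None}
    result = None
    for tx in order:
        if tx == "Tx1":      # assumed: writes the left join input
            db["left"] = "L"
        elif tx == "Tx2":    # assumed: writes the right join input
            db["right"] = "R"
        elif tx == "Tx3":    # reads both inputs and "joins" them
            result = (db["left"], db["right"])
    return result

# Every serial order is consistent on its own, but the results differ.
outcomes = {order: run(order) for order in permutations(["Tx1", "Tx2", "Tx3"])}
distinct = set(outcomes.values())
```

Each entry in `distinct` is a consistent result, which is all that processing-time semantics promise.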
Streams, Records, Timestamps
Topic can be processed as:
• Event Stream (STREAM in ksqlDB / KStream in Kafka Streams)
• Changelog Stream (TABLE in ksqlDB / KTable in Kafka Streams)
• “Tx Order” is determined upstream
Topic contains:
• Timestamped records
• Timestamps define “Tx Order”
• Need to obey pre-defined “Tx Order” when processing the data streams (ie, event-time semantics)
• Timestamps are data!
• Temporal joins are defined on event-time, which provides deterministic processing semantics
All* joins in Kafka Streams and ksqlDB are temporal joins!
* GlobalKTables in Kafka Streams are the one exception (ie, the stream-GlobalKTable join is non-deterministic)
Versioned Tables
Tables evolve over time:
We can associate a different table version for each point in stream-time
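Changelog Stream (timestamp, key): (14:01, a), (14:03, b), (14:05, c), (14:08, b), (14:11, a)
Table Versions along stream-time (each entry shows a key and the timestamp of its latest update):
• Version 14:01: a (14:01)
• Version 14:03: a (14:01), b (14:03)
• Version 14:05: a (14:01), b (14:03), c (14:05)
• Version 14:08: a (14:01), b (14:08), c (14:05)
• Version 14:11: a (14:11), b (14:08), c (14:05)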
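A minimal sketch of how the changelog on this slide maps to table versions. This is a plain-Python model, not the Kafka Streams state store; the function name `table_versions` is made up for illustration, and record values are omitted because the slide only shows keys and timestamps.

```python
def table_versions(changelog):
    """Replay a changelog of (timestamp, key) updates.

    Returns {version_timestamp: {key: timestamp_of_latest_update}}.
    """
    versions = {}
    current = {}
    for ts, key in changelog:
        # Copy on every update so that older versions stay untouched.
        current = {**current, key: ts}
        versions[ts] = current
    return versions

changelog = [("14:01", "a"), ("14:03", "b"), ("14:05", "c"),
             ("14:08", "b"), ("14:11", "a")]
versions = table_versions(changelog)
```

Looking up `versions["14:08"]` yields the table as it existed at stream-time 14:08, independent of later updates.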
Temporal Table-Table Join
Join tables with the same version (ie, event-time)
[Figure: Left Table, Right Table, and Result Table along the stream-time axis; timestamps marked on the axis: 14:01, 14:03, 14:05, 14:08, 14:11 and 14:02, 14:04, 14:06, 14:07, 14:09, 14:10.]
Example: Table-Table Join
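As a rough illustration, a table-table join over two changelogs can be modeled by replaying both inputs in timestamp order and emitting a result update whenever a key is present on both sides. The sketch below is a simplified model under that assumption; the function name, keys, values, and timestamps are hypothetical and not taken from the slide.

```python
def table_table_join(left_log, right_log):
    """Inner-join two changelogs of (timestamp, key, value) in event-time order.

    Emits (timestamp, key, (left_value, right_value)) for each update whose
    key exists on both sides at that point in stream-time.
    """
    tagged = ([(ts, "L", k, v) for ts, k, v in left_log] +
              [(ts, "R", k, v) for ts, k, v in right_log])
    left, right, out = {}, {}, []
    # Processing by timestamp (not arrival order) makes the result deterministic.
    for ts, side, key, value in sorted(tagged, key=lambda rec: rec[0]):
        (left if side == "L" else right)[key] = value
        if key in left and key in right:
            out.append((ts, key, (left[key], right[key])))
    return out

left_log = [("14:01", "a", "l1"), ("14:05", "a", "l2")]
right_log = [("14:02", "a", "r1"), ("14:06", "a", "r2")]
result = table_table_join(left_log, right_log)
```

Because the inputs are merged by timestamp, handing the same records over in a different list order produces the same result table updates.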
Data Enrichment: Stream-Table Join
Enrich events with table data: “lookup join”
For each event-stream record, do a table lookup:
• Temporal table lookup: join a stream record with event-time T to table version T
[Figure: Changelog Stream, Input Table, Input Stream, and Result Stream; timestamps shown range from 14:02 to 14:10.]
Example: Stream-Table Join
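The temporal lookup can be sketched as follows: each stream record with event-time T is joined against the table version at T. This is a conceptual model only, not how Kafka Streams stores table state; `VersionedTable`, its methods, and all keys, values, and timestamps are hypothetical.

```python
from bisect import bisect_right

class VersionedTable:
    """Keeps the full update history per key (assumes in-order updates per key)."""

    def __init__(self):
        self._history = {}  # key -> (list of timestamps, list of values)

    def put(self, ts, key, value):
        times, values = self._history.setdefault(key, ([], []))
        times.append(ts)
        values.append(value)

    def get(self, key, ts):
        """Return the value of `key` in table version `ts`, or None."""
        times, values = self._history.get(key, ([], []))
        i = bisect_right(times, ts)  # number of updates with timestamp <= ts
        return values[i - 1] if i else None

def stream_table_join(stream, table):
    """Inner join: drop stream records that find no table entry at their event-time."""
    out = []
    for ts, key, value in stream:
        looked_up = table.get(key, ts)
        if looked_up is not None:
            out.append((ts, key, value, looked_up))
    return out

table = VersionedTable()
table.put("14:02", "k", "v1")
table.put("14:06", "k", "v2")
stream = [("14:04", "k", "e1"), ("14:07", "k", "e2"), ("14:01", "k", "e0")]
joined = stream_table_join(stream, table)
```

The record at 14:01 finds no match because the table had no entry for its key at that point in stream-time.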
There is no concept of “bootstrapping” a table:
• Table versions evolve based on processing progress, ie, stream-time.
• This ensures that the correct table version is loaded at each point in stream-time.
Joining Event Streams – How to Handle Infinite Input
Event Streams are infinite and there is no concept of “versions”
Limit join “scope” with a temporal join condition, ie, a time-band-join.
-- mental model
SELECT * FROM stream1, stream2
WHERE
  -- equi-join condition
  stream1.key = stream2.key
  AND
  -- time condition
  stream1.ts - windowSize <= stream2.ts
  AND stream2.ts <= stream1.ts + windowSize
Joining Event Streams – How to Handle Infinite Input
Example: join window size 5
SELECT *
FROM leftStream AS l JOIN rightStream AS r
WITHIN 5 minutes ON l.id = r.id;
Records are shown as (timestamp, id):
Left Stream: (14:04, 1), (14:11, 2), (14:12, 3)
Right Stream: (14:01, 1), (14:16, 3)
Result Stream: (14:04, 1), (14:16, 3)
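The windowed join on this slide can be reproduced with a small model. It assumes the figure reads as left stream (14:04, 1), (14:11, 2), (14:12, 3) and right stream (14:01, 1), (14:16, 3), and that each result carries the larger of the two input timestamps. The helper names `minutes` and `window_join` are hypothetical; this is not the ksqlDB or Kafka Streams implementation.

```python
def minutes(hhmm):
    """Convert an "HH:MM" timestamp to minutes since midnight."""
    h, m = hhmm.split(":")
    return int(h) * 60 + int(m)

def window_join(left, right, window_size):
    """Inner join of two (timestamp, id) streams within +/- window_size minutes."""
    out = []
    for l_ts, l_id in left:
        for r_ts, r_id in right:
            same_id = l_id == r_id                                        # equi-join condition
            in_band = abs(minutes(l_ts) - minutes(r_ts)) <= window_size   # time condition
            if same_id and in_band:
                out.append((max(l_ts, r_ts), l_id))
    return sorted(out)

left = [("14:04", 1), ("14:11", 2), ("14:12", 3)]
right = [("14:01", 1), ("14:16", 3)]
result = window_join(left, right, 5)
```

The record with id 2 produces no inner-join result because the right stream never contains that id.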
Left/Outer Stream-Stream Join
Example: spurious left join result with window size 5
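Records are shown as (timestamp, id):
Left Stream: (14:04, 1), (14:11, 2), (14:12, 3)
Right Stream: (14:01, 1), (14:16, 3)
Result Stream, joined records: (14:04, 1), (14:16, 3)
Result Stream, left records emitted without a match: (14:11, 2), (14:12, 3)
The unmatched result for id 3 is spurious: the matching right record arrives at 14:16, still inside the join window.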
Left/Outer Stream-Stream Join
Example: delayed left join result with window size 5 (WIP)
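Records are shown as (timestamp, id):
Left Stream: (14:04, 1), (14:11, 2), (14:12, 3)
Right Stream: (14:01, 1), (14:16, 3)
Result Stream, joined records: (14:04, 1), (14:16, 3)
Result Stream, left record emitted without a match (delayed): (14:11, 2)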
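The difference between the eager (spurious) and the delayed left join can be sketched with a simplified model. It assumes records are processed in timestamp order and that, in delayed mode, an unmatched left record is only emitted once stream-time has passed the end of its window or the input ends. The function `left_join` and its `delayed` flag are hypothetical and do not mirror the Kafka Streams API.

```python
def minutes(hhmm):
    h, m = hhmm.split(":")
    return int(h) * 60 + int(m)

def left_join(left, right, window, delayed):
    """Left join of (timestamp, id) streams, processed in timestamp order.

    Returns (left_ts, id, right_ts) in emission order; right_ts is None for
    a left record that is emitted without a match.
    """
    events = sorted([(ts, "L", rid) for ts, rid in left] +
                    [(ts, "R", rid) for ts, rid in right])
    left_buf, right_buf, pending, out = [], [], [], []

    def in_window(a, b):
        return abs(minutes(a) - minutes(b)) <= window

    for ts, side, rid in events:
        # Delayed mode: emit unmatched left records whose window has closed.
        for rec in [p for p in pending if minutes(p[0]) + window < minutes(ts)]:
            pending.remove(rec)
            out.append((rec[0], rec[1], None))
        if side == "L":
            left_buf.append((ts, rid))
            matches = [r_ts for r_ts, r_id in right_buf
                       if r_id == rid and in_window(ts, r_ts)]
            out.extend((ts, rid, r_ts) for r_ts in matches)
            if not matches:
                if delayed:
                    pending.append((ts, rid))    # wait for a possible match
                else:
                    out.append((ts, rid, None))  # eager: may turn out spurious
        else:
            right_buf.append((ts, rid))
            for l_ts, l_id in left_buf:
                if l_id == rid and in_window(l_ts, ts):
                    out.append((l_ts, rid, ts))
                    if (l_ts, l_id) in pending:
                        pending.remove((l_ts, l_id))
    # End of input: flush whatever is still unmatched.
    out.extend((l_ts, l_id, None) for l_ts, l_id in pending)
    return out

left = [("14:04", 1), ("14:11", 2), ("14:12", 3)]
right = [("14:01", 1), ("14:16", 3)]
eager = left_join(left, right, 5, delayed=False)
lazy = left_join(left, right, 5, delayed=True)
```

In the eager run, id 3 appears twice (once unmatched, once joined); in the delayed run only id 2 is emitted without a match.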
Timestamping Result Records
Result determinism requires deterministic result-record event-timestamps
Out-of-order data processing needs to be considered
Example: Stream-Stream join with window size 5
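Records are shown as (timestamp, id), in processing order:
One input stream: (14:04, 1), (14:16, 2), (14:08, 2)
Other input stream: (14:01, 1), (14:11, 2), (14:23, 2)
Result Stream: (14:04, 1), (14:16, 2), (14:11, 2)
Result timestamp: max(l.ts; r.ts)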
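A short sketch of why `max(l.ts; r.ts)` gives deterministic result timestamps even when records arrive out of order. It assumes the two inputs from this slide, buffers every record forever (no retention or grace period is modeled), and uses hypothetical names throughout. The timestamp of each result depends only on the two joined records, never on when they were processed.

```python
from itertools import permutations

def minutes(hhmm):
    h, m = hhmm.split(":")
    return int(h) * 60 + int(m)

def join_in_arrival_order(arrivals, window):
    """arrivals: (side, timestamp, id) tuples in the order they are processed."""
    buffers = {"L": [], "R": []}
    out = []
    for side, ts, rid in arrivals:
        other = "R" if side == "L" else "L"
        for o_ts, o_id in buffers[other]:
            if o_id == rid and abs(minutes(ts) - minutes(o_ts)) <= window:
                # Result timestamp is taken from the data, not from the clock.
                out.append((max(ts, o_ts), rid))
        buffers[side].append((ts, rid))
    return out

one = [("L", "14:04", 1), ("L", "14:16", 2), ("L", "14:08", 2)]
other = [("R", "14:01", 1), ("R", "14:11", 2), ("R", "14:23", 2)]

expected = sorted(join_in_arrival_order(one + other, 5))
# Every possible arrival order yields the same set of timestamped results.
all_orders_agree = all(
    sorted(join_in_arrival_order(list(order), 5)) == expected
    for order in permutations(one + other)
)
```

Only the emission order varies between runs; the content and timestamps of the result records do not.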
The Outlier: GlobalKTables
GlobalKTables have no concept of stream-time
Designed for “static” (but still mutable) data
• In contrast to regular tables, a GlobalKTable is bootstrapped at startup
• GlobalKTable updates are applied unsynchronized
• Stream-GlobalKTable join is non-deterministic on GlobalKTable updates
[Figure: Global Changelog, Global Table, and Input Stream; timestamps shown range from 14:02 to 14:10.]
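The non-determinism can be illustrated with a model in which table updates are applied as soon as they are processed and stream records always look up the latest value, ignoring timestamps. This is an illustrative sketch only; the function name and the records are hypothetical.

```python
def stream_global_table_join(arrivals):
    """arrivals: (kind, timestamp, key, value) in processing order.

    kind is "table" for a global-table update or "stream" for a stream
    record. Timestamps are carried along but never used for the lookup.
    """
    table, out = {}, []
    for kind, ts, key, value in arrivals:
        if kind == "table":
            table[key] = value  # applied immediately, unsynchronized
        elif key in table:
            out.append((ts, key, value, table[key]))
    return out

update_1 = ("table", "14:02", "k", "v1")
update_2 = ("table", "14:10", "k", "v2")
event = ("stream", "14:04", "k", "e1")

# Same input data, two different processing interleavings.
run_a = stream_global_table_join([update_1, event, update_2])
run_b = stream_global_table_join([update_1, update_2, event])
```

In `run_b` the event at 14:04 is joined with an update from 14:10, which a time-synchronized table lookup would not allow.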
Broadcast vs Replication and Temporal Semantics
             | time synchronized | unsynchronized
sharded      | TABLE / KTable    | TABLE* / KTable*
replicated   | n/a               | n/a (ksqlDB) / GlobalKTable (Kafka Streams)
(*) with a custom timestamp extractor that ensures “preferred processing”, e.g., one that always returns timestamp zero
Wrapping Up
Temporal Joins are a Key Concept in Data Stream Processing
• Generalization of SQL joins (for snapshots) to continuously changing data
• Ensure deterministic / reproducible results
• Types of Temporal Joins:
• Joining evolving tables
• Joining streams to evolving tables
• Stream-Stream join
• Outlier: GlobalKTables
• Sharding vs replication & time synchronized vs unsynchronized/non-deterministic
Thanks! We are hiring!
@MatthiasJSax
matthias@confluent.io | mjsax@apache.org

Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent
