PyFlink Table API for non-JVM stream processing on a data streaming platform
Diptiman Raichaudhuri
Staff Developer Advocate - Confluent
draichaudhuri@confluent.io
Agenda
● Kafka 101 - Introduction
● Apache Kafka - The Big Picture
● Apache Flink as a Data Stream
Processor
○ Flink Dynamic Table
○ Flink Window Functions
○ Code + Demo
● DSP - Data Streaming Platform
Kafka 101 - Introduction
Core storage abstraction
• DB - table
• Hadoop - file
• Kafka - the LOG
Immutable Event Log
Old → New
Messages are added at the end of the log
Messages are just K/V bytes, plus headers + timestamp
Header
Timestamp
Key
Value
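The four fields above are all a Kafka record carries; a minimal plain-Python sketch of that anatomy (the `make_record` helper is illustrative, not a client API):

```python
import json
import time

def make_record(key, value, headers=None):
    """Build a Kafka-style record: key and value are opaque bytes,
    plus optional headers and an epoch-millisecond timestamp."""
    return {
        "key": key.encode("utf-8"),                   # keys are just bytes
        "value": json.dumps(value).encode("utf-8"),   # so are values
        "headers": [(k, v.encode("utf-8")) for k, v in (headers or {}).items()],
        "timestamp": int(time.time() * 1000),         # epoch millis, like Kafka
    }

record = make_record("device-42", {"temp": 34.5}, {"source": "iot"})
print(record["key"], record["headers"])
```

The broker never interprets key or value; serialization is entirely the client's concern.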
Topics
Clicks
Orders
Customers
Topics are similar in concept
to tables in a database
Partitions
Clicks: p0, p1, p2
Messages are guaranteed to be strictly ordered within a partition
Pub / Sub
Producing data
Old → New
Messages are added at the end of the log
Consuming data - access is sequential: read to offset & scan
Old → New
Consumers have a position of their own
Old → New
Sally, Fred, and Rick each scan forward from their own offset
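These independent positions can be simulated in a few lines of Python; the log and offsets here are plain data structures, not a Kafka client:

```python
# An append-only log; each consumer tracks its own read position (offset).
log = ["m0", "m1", "m2", "m3", "m4"]
offsets = {"sally": 4, "fred": 2, "rick": 0}  # independent positions

def poll(consumer, n=1):
    """Read the next n messages for this consumer and advance only its offset."""
    start = offsets[consumer]
    batch = log[start:start + n]
    offsets[consumer] += len(batch)
    return batch

print(poll("fred", 2))  # Fred reads m2, m3; Sally's and Rick's offsets are untouched
```

Because consuming only moves the reader's own offset, many consumers can scan the same log at their own pace.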
Producing to Kafka - No Key
Time →
Partition 1
Partition 2
Partition 3
Partition 4
Messages will be produced in a round-robin fashion
Producing to Kafka - With Key
Time →
Partition 1 ← A
Partition 2 ← B
Partition 3 ← C
Partition 4 ← D
hash(key) % numPartitions = N
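Both routing strategies boil down to simple arithmetic; a sketch with 4 partitions, using CRC32 in place of the real client's murmur2 partitioner:

```python
import zlib
from itertools import count

NUM_PARTITIONS = 4

def partition_for(key, _round_robin=count()):
    """No key: spread messages round-robin across partitions.
    With a key: hash(key) % numPartitions, so equal keys always co-locate."""
    if key is None:
        return next(_round_robin) % NUM_PARTITIONS
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

# Keyed messages are sticky: the same key always lands on the same partition.
print(partition_for("A") == partition_for("A"))  # True
# Unkeyed messages cycle through the partitions.
print([partition_for(None) for _ in range(5)])
```

Key-based routing is what preserves per-key ordering, since ordering is only guaranteed within a partition.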
Consuming From Kafka - Single Consumer
Partition 1, Partition 2, Partition 3, Partition 4 → consumer C
Consuming From Kafka - Multiple Consumers
Partition 1, Partition 2 → C1
Partition 3, Partition 4 → C2
Consuming From Kafka - Grouped Consumers
Consumer group CC:
Partition 1, Partition 2 → C1
Partition 3, Partition 4 → C2
Linearly Scalable Architecture
Producers → Brokers → Consumers
Single topic:
- Many producer machines
- Many consumer machines
- Many broker machines
No bottleneck!
Replicate to get fault tolerance
msg → leader (Machine A) → replicate msg → Machine B
Partition Leadership and Replication
Partition 1: TopicX partition1, TopicY partition1, TopicZ partition1
Partition 2: TopicX partition2, TopicY partition2, TopicZ partition2
Partition 3: TopicZ partition3, TopicX partition3, TopicY partition3
Partition 4: TopicY partition4, TopicZ partition4, TopicX partition4
Spread across Broker 1 - Broker 4 (leader / follower)
Replication provides resiliency
A replica takes over on machine failure
Partition Leadership and Replication
Partition 1: TopicX partition1
Partition 2: TopicX partition2
Partition 3: TopicX partition3
Partition 4: TopicX partition4
Each partition replicated across Broker 1 - Broker 4 (leader / follower)
The log is a type of durable messaging system
Similar to a traditional messaging system (ActiveMQ, RabbitMQ, etc.) but with:
• Far better scalability
• Built-in fault tolerance/HA
• Storage
Apache Kafka - The Big Picture
Apache Kafka - Producer Internal
Apache Kafka - Consumer Internal
# Partitions > # Consumers
# Partitions == # Consumers
# Partitions < # Consumers
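The three cases above follow from how a group spreads partitions across its members. A round-robin split approximates it; the real rebalancing protocol is more involved, but the arithmetic is the same:

```python
def assign(partitions, consumers):
    """Spread partitions across the group's consumers round-robin.
    Each partition is owned by exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# #Partitions > #Consumers: each consumer owns several partitions.
print(assign([1, 2, 3, 4], ["C1", "C2"]))
# #Partitions < #Consumers: the extra consumer sits idle.
print(assign([1, 2], ["C1", "C2", "C3"]))
```

This is why partition count caps a group's useful parallelism.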
Apache Flink as a Data Stream
Processor
Real-time services rely on stream processing
A sale, a shipment, a customer experience, a trade →
Real-time stream processing →
Rich front-end customer experiences
Real-time backend operations
What is Apache Flink used for?
Sources: transactions, logs, IoT, events, interactions
  (from messaging systems, files, databases, key/value stores)
Use cases: event-driven applications, analytics, data integration / ETL
Sinks: messaging systems, files, databases, key/value stores, applications, …
Let’s start with an {event} stream
{EVENT} Stream processing
Event sources:
• Internet of Things
• Business process change
• User interaction
• Microservice output
Example events:
{
  "device_id": "01:B8:4R:7Y",
  "temp": 34.5,
  "humidity": 0.45,
  "motion": "true"
}
{
  "cust_id": 0011223344,
  "loan_type": "housing",
  "status": "Y"
}
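Events like the IoT payload above are typically just serialized JSON that consumers parse and react to; a minimal sketch using the slide's field names:

```python
import json

raw = b'{"device_id": "01:B8:4R:7Y", "temp": 34.5, "humidity": 0.45, "motion": "true"}'

event = json.loads(raw)
# An event is self-describing: a consumer can react to individual fields.
if event["temp"] > 30.0 and event["motion"] == "true":
    alert = f'device {event["device_id"]} is hot and moving'
print(alert)
```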
Event at a minimum
• Key
• Value
Event at a minimum - for Kafka
Mandatory: Topic, Value
Optional: Partition, Key, Header, Timestamp (TS)
Data Stream
• Events are ingested as an unbounded stream
• Events are ingested perpetually, until the event producers stop
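An unbounded stream can be modeled as a Python generator: it yields events for as long as the producer runs, and a consumer never sees "the end", only as many events as it chooses to take:

```python
import itertools

def sensor_stream():
    """Unbounded event source: yields forever (until the producer stops)."""
    for i in itertools.count():
        yield {"seq": i, "temp": 20.0 + (i % 10)}

# A consumer just takes a slice of the infinite stream.
first_three = list(itertools.islice(sensor_stream(), 3))
print([e["seq"] for e in first_three])  # [0, 1, 2]
```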
Data Streaming Platform - Common Components
Flink as a streaming data processor
• Flink can provide insights into the stream:
○ using a DSL, queries, mutations
○ using SQL statements
○ using aggregations
Real-time services rely on stream processing
Sources (files, Kafka, apps, databases, key/value stores) →
Real-time stream processing (SQL) →
Sinks
A Real World Example
Flink Dynamic Table
Flink’s APIs
Flink SQL / Table API (dynamic tables) - declarative DSL
DataStream API (streams, windows) - stream processing & analytics
Process Functions (events, state, time) - low-level stateful stream processing
Flink Dynamic Table
● Dynamic tables change over time
● Querying dynamic tables yields a Continuous Query
● A continuous query never terminates and produces dynamic
results -> another dynamic table.
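The continuous-query idea can be simulated without Flink: each arriving row incrementally updates a result "table", so the query never terminates and its result keeps changing (a toy GROUP BY count, not Flink's runtime):

```python
from collections import Counter

result = Counter()  # the "dynamic table": counts per key, updated continuously

def on_row(row):
    """Continuous query: SELECT key, COUNT(*) ... GROUP BY key,
    re-evaluated incrementally as each row arrives."""
    result[row["key"]] += 1
    return dict(result)  # current snapshot of the dynamic result table

print(on_row({"key": "clicks"}))  # {'clicks': 1}
print(on_row({"key": "orders"}))  # {'clicks': 1, 'orders': 1}
print(on_row({"key": "clicks"}))  # {'clicks': 2, 'orders': 1}
```

Each snapshot is itself a table that changes over time, which is exactly the "dynamic table yields another dynamic table" loop.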
Anatomy of a Flink Dynamic Table
Stream Table Duality - Append only table
Stream Table Duality - Update Table
Flink Table to Stream Conversion
● Append-only stream - INSERT only
● Retract stream - INSERT encoded as an add message, DELETE as a retract
message, UPDATE as a retract of the old row plus an add of the new row
● Upsert stream - UPSERT + DELETE
○ Main difference from retract: updates are encoded in a single
message, and hence more efficient
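The retract-vs-upsert difference can be made concrete with a toy encoder; the tuple tags below are illustrative, not Flink's actual changelog row kinds:

```python
def retract_update(key, old, new):
    """Retract stream: an UPDATE becomes two messages --
    retract the old row, then add the new one."""
    return [("-U", key, old), ("+U", key, new)]

def upsert_update(key, new):
    """Upsert stream: an UPDATE is a single message; the key alone
    identifies which row to overwrite, hence more efficient."""
    return [("+U", key, new)]

print(retract_update("cust-1", 100, 150))  # two messages on the wire
print(upsert_update("cust-1", 150))        # one message on the wire
```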
Flink Window Functions
Events over time
Tumbling Window - Concept
Hopping Window - Concept
Session Window
Flink Dynamic Table - Windowing
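Tumbling and hopping window assignment is just arithmetic on the event timestamp; a pure-Python sketch (session windows are gap-based rather than size-based, so they are omitted here):

```python
def tumbling_window(ts, size):
    """Tumbling: fixed-size, non-overlapping; each event is in exactly one window."""
    start = (ts // size) * size
    return (start, start + size)

def hopping_windows(ts, size, hop):
    """Hopping: fixed-size windows started every `hop`; when hop < size
    they overlap, so one event can fall into several windows."""
    windows = []
    start = (ts // hop) * hop      # latest window start at or before ts
    while start > ts - size:       # every window that still covers ts
        windows.append((start, start + size))
        start -= hop
    return list(reversed(windows))

print(tumbling_window(7, 5))      # (5, 10)
print(hopping_windows(7, 10, 5))  # [(0, 10), (5, 15)]
```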
Kafka PyFlink Getting Started Series - diptimanrc
Flink AI Remote Inference (OpenAI) Blog
Code Explanation
Flink DataFlow
FlinkSQL Transforms
Flink Table API for a Kafka Topic
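A Kafka-backed table is declared to Flink as DDL (executed from the Table API via `execute_sql`); the topic, servers, and schema below are placeholders for illustration, not from the talk:

```sql
-- Hypothetical Kafka-backed source table (names/servers are examples)
CREATE TABLE clicks (
    device_id STRING,
    temp      DOUBLE,
    ts        TIMESTAMP(3),
    WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'clicks',
    'properties.bootstrap.servers' = 'localhost:9092',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'json'
);
```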
Flink Table API - Tumbling Window Transform
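A tumbling-window aggregation in the PyFlink Table API might look like the sketch below; it assumes `apache-flink` is installed and a `clicks` table with `device_id`, `temp`, and event-time column `ts` was registered earlier, so the names are illustrative:

```python
# Sketch only: requires `pip install apache-flink` and a registered `clicks` table.
from pyflink.table import EnvironmentSettings, TableEnvironment
from pyflink.table.expressions import col, lit
from pyflink.table.window import Tumble

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

windowed = (
    t_env.from_path("clicks")
    # 1-minute tumbling windows on the event-time column `ts`
    .window(Tumble.over(lit(1).minutes).on(col("ts")).alias("w"))
    .group_by(col("w"), col("device_id"))
    .select(
        col("device_id"),
        col("w").start.alias("window_start"),
        col("temp").avg.alias("avg_temp"),
    )
)
windowed.execute().print()
```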
Flink SQL - Tumbling Window Transform
The same transform using FlinkSQL
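In Flink SQL the same shape of transform uses the TUMBLE table-valued function; table and column names below are illustrative:

```sql
-- 1-minute tumbling windows over a hypothetical `clicks` table
SELECT
    device_id,
    window_start,
    window_end,
    AVG(temp) AS avg_temp
FROM TABLE(
    TUMBLE(TABLE clicks, DESCRIPTOR(ts), INTERVAL '1' MINUTE))
GROUP BY device_id, window_start, window_end;
```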
Flink Table API - EXPLAIN PLAN
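EXPLAIN can be run on any query to inspect the optimized plan before executing it; a minimal illustrative example (table name is a placeholder):

```sql
EXPLAIN PLAN FOR
SELECT device_id, COUNT(*) AS events
FROM clicks
GROUP BY device_id;
```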
DSP - Data Streaming Platform
DSP - The End Product
Stream (Kafka) at the core, with Connect, in-stream processing, and Schema Registry
Via Connect: custom apps & microservices, operational apps & data systems,
databases, log data & messaging systems
Data streams are read as Data Products by data warehouses / data lakes
and third-party compute engines
Built on: decoupled architecture, event-driven design, immutable logs
Tableflow (Iceberg) - COMING SOON
Governance: Stream Catalog, Stream Lineage, Data Portal
Stream Governance - Getting Started Series
Confluent Developer Newsletter - Kafka, Flink Latest …
Thanks / Q&A
Diptiman Raichaudhuri
Staff Developer Advocate - Confluent
draichaudhuri@confluent.io
Sign Up for Confluent Cloud
Get $400 worth of free credits for your first 30 days
Use promo code POPTOUT000MZG62 to skip the paywall!