Hyderabad Meetup Dec 7th 2024 - Diptiman - Confluent

The document presents an overview of the PyFlink Table API for non-JVM stream processing within a data streaming platform, focusing on Apache Kafka and Apache Flink. It covers key concepts such as Kafka's architecture, message production and consumption, and Flink's capabilities in real-time stream processing, including dynamic tables and window functions. The agenda includes a demonstration of Flink's functionalities and its integration with Kafka for effective data streaming solutions.

PyFlink Table API for non-JVM stream processing on a data streaming platform

Diptiman Raichaudhuri
Staff Developer Advocate - Confluent
draichaudhuri@confluent.io
Agenda

● Kafka 101 - Introduction
● Apache Kafka - The Big Picture
● Apache Flink as a Data Stream Processor
  ○ Flink Dynamic Table
  ○ Flink Window Functions
  ○ Code + Demo
● DSP - Data Streaming Platform


Kafka 101 - Introduction

Core storage abstraction:
• DB - table
• Hadoop - file
• Kafka - ? → the LOG

Immutable Event Log

[Diagram: the log grows Old → New] Messages are added at the end of the log.

Messages are just K/V bytes plus headers + timestamp:
Header | Timestamp | Key | Value
Topics

Example topics: Clicks, Orders, Customers

Topics are similar in concept to tables in a database.
Partitions

[Diagram: the Clicks topic split into partitions p0, p1, p2]

Messages are guaranteed to be strictly ordered within a partition.
Pub / Sub

Producing data

[Diagram: the log grows Old → New] Messages are added at the end of the log.

Consuming data - access is sequential

Read to an offset & scan forward (Old → New).
Consumers have a position of their own

[Diagram: Sally scans the log; her position is marked "Sally is here".]

Consumers have a position of their own

[Diagram: Fred and Sally scan the same log, each at their own position.]

Consumers have a position of their own

[Diagram: Rick, Fred, and Sally scan the same log, each at their own position.]
Producing to Kafka - No Key

[Diagram: over time, messages spread evenly across Partitions 1-4.]

Messages will be produced in a round-robin fashion.
Producing to Kafka - With Key

[Diagram: keys A, B, C, D map to Partitions 1-4 over time.]

hash(key) % numPartitions = N

Every message with the same key lands on the same partition.
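The keyed-partitioning behaviour is easy to see from Python. Below is a minimal sketch using the confluent-kafka client; the broker address, topic name, and keys are assumptions for illustration. The client hashes the message key to choose a partition, so records sharing a key always land together:

from confluent_kafka import Producer

# Assumed local broker; replace with your bootstrap servers.
producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_delivery(err, msg):
    # Report which partition the key was hashed to.
    if err is None:
        print(f"key={msg.key()} -> partition {msg.partition()}")

# Same key -> same partition; no key -> partition chosen by the client
# (round-robin-style distribution across partitions).
for key in ["A", "B", "C", "D", "A"]:
    producer.produce("clicks", key=key, value=f"event-for-{key}",
                     on_delivery=on_delivery)
producer.produce("clicks", value="keyless-event", on_delivery=on_delivery)

producer.flush()  # block until all delivery reports arrive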
Consuming From Kafka - Single Consumer

[Diagram: a single consumer C reads from all four partitions.]
Consuming From Kafka - Multiple Consumers

[Diagram: consumers C1 and C2 read from the topic, C1 from partitions 1-2 and C2 from partitions 3-4.]
Consuming From Kafka - Grouped Consumers

[Diagram: consumers C1 and C2 form a consumer group (CC); the four partitions are divided between them.]
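A hedged sketch of the grouped-consumer pattern with the confluent-kafka client (broker address, group id, and topic name are invented for the example). Every consumer started with the same group.id joins the group and is assigned a share of the partitions:

from confluent_kafka import Consumer

# Assumed settings; consumers sharing this group.id split the partitions.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "clicks-readers",   # the consumer group ("CC" in the diagram)
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["clicks"])

try:
    while True:
        msg = consumer.poll(1.0)    # wait up to 1s for a record
        if msg is None or msg.error():
            continue
        print(f"p{msg.partition()}@{msg.offset()}: {msg.key()} -> {msg.value()}")
finally:
    consumer.close()

Starting a second process with the same group.id triggers a rebalance, after which each consumer owns a disjoint subset of the partitions.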
Linearly Scalable Architecture

[Diagram: many Producers write to a single topic; many Consumers read from it.]

Single topic:
- Many producer machines
- Many consumer machines
- Many broker machines
No bottleneck!
Replicate to get fault tolerance

[Diagram: a msg written to the leader on Machine A is replicated to Machine B.]
Partition Leadership and Replication

[Diagram: partitions 1-4 of TopicX, TopicY, and TopicZ spread across Brokers 1-4; each partition has one leader replica and follower replicas on other brokers. Legend: Leader / Follower.]
Replication provides resiliency

A replica takes over on machine failure


Partition Leadership and Replication

[Diagram, shown over two slides: partitions 1-4 of TopicX replicated across Brokers 1-4, one leader plus followers per partition; when a broker fails, a follower replica is promoted to leader. Legend: Leader / Follower.]
The log is a type of durable messaging system

Similar to a traditional messaging system (ActiveMQ, RabbitMQ, etc.) but with:
• Far better scalability
• Built-in fault tolerance/HA
• Storage
Apache Kafka - The Big Picture
Apache Kafka - Producer Internals
Apache Kafka - Consumer Internals
# Partitions > # Consumers
# Partitions == # Consumers
# Partitions < # Consumers
Apache Flink as a Data Stream Processor
Real-time services rely on stream processing

[Diagram: events such as A Sale, A Shipment, A Customer Experience, and A Trade flow as real-time data into real-time stream processing, powering rich front-end customer experiences and real-time backend operations.]
What is Apache Flink used for?

[Diagram: Transactions, Logs, IoT, Events, and Interactions arrive from messaging systems, files, databases, and key/value stores; Flink serves three broad use cases - event-driven applications, analytics, and data integration/ETL - writing results back to messaging systems, files, databases, key/value stores, and applications.]
Let’s start with an {event} stream

{EVENT} Stream processing

Where events come from:
• Internet of Things
• Business process change
• User Interaction
• Microservice output

Example events:

{
  "device_id": "01:B8:4R:7Y",
  "temp": 34.5,
  "humidity": 0.45,
  "motion": "true"
}

{
  "cust_id": 0011223344,
  "loan_type": "housing",
  "status": "Y"
}
Event at a minimum

Key
Value

Event at a minimum - for Kafka

Mandatory: Topic, Value
Optional: Partition, Key, Header, Timestamp (TS)
Data Stream
• Events are ingested as an unbounded stream.
• Events are ingested perpetually, until the event producers stop.
Data Streaming Platform - Common Components
Flink as a streaming data processor

• Flink can provide insights into the stream:
  ○ using a DSL (queries, mutations)
  ○ using SQL statements
  ○ using aggregations
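As a taste of what "using SQL statements" looks like from Python, here is a minimal, self-contained PyFlink sketch; the datagen source and column names are invented for the example:

from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming-mode Table environment.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Invented demo source: the built-in datagen connector fabricates rows.
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id INT,
        url     STRING
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '5'
    )
""")

# A SQL aggregation over the unbounded stream; print() runs until cancelled.
t_env.execute_sql(
    "SELECT user_id, COUNT(url) AS click_cnt FROM clicks GROUP BY user_id"
).print()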
Real-time services rely on stream processing

[Diagram: sources - Files, Kafka, Apps, Databases, Key/Value Stores - feed real-time stream processing (SQL), which writes results out to sinks.]
A Real World Example
Flink Dynamic Table
Flink’s APIs (top = most abstract):

• Flink SQL - declarative
• Table API (dynamic tables) - declarative DSL
• DataStream API (streams, windows) - stream processing & analytics
• Process Functions (events, state, time) - low-level stateful stream processing
Flink Dynamic Table
Flink Dynamic Table

● Dynamic tables change over time.
● Querying a dynamic table yields a continuous query.
● A continuous query never terminates and produces dynamic results -> another dynamic table.
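A minimal sketch of these three bullets in PyFlink (table and column names are assumptions): the GROUP BY query below never terminates, and its result is itself a dynamic table whose rows keep updating as new events arrive.

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.execute_sql("""
    CREATE TABLE orders (
        customer STRING,
        amount   DOUBLE
    ) WITH ('connector' = 'datagen', 'rows-per-second' = '2')
""")

# Continuous query: a dynamic result table that keeps changing
# as the 'orders' dynamic table grows.
revenue = t_env.sql_query(
    "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
)
revenue.execute().print()   # runs until cancelled - it never 'finishes'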
Anatomy of a Flink Dynamic Table
Stream Table Duality - Append only table
Stream Table Duality - Update Table
Flink Table to Stream Conversion

● Append-only stream - INSERT only
● Retract stream - INSERT encoded as an add message, DELETE as a retract message, UPDATE as a retract (old row) + add (new row)
● Upsert stream - UPSERT + DELETE
  ○ Main difference from a retract stream: updates are encoded as a single message, and hence more efficient.
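In PyFlink these conversions surface when a Table is turned back into a DataStream. A hedged sketch using to_changelog_stream (table and column names invented); each emitted Row carries a RowKind marking it as an insert (+I), retraction (-U), update (+U), or delete (-D):

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

t_env.execute_sql("""
    CREATE TABLE orders (
        customer STRING,
        amount   DOUBLE
    ) WITH ('connector' = 'datagen', 'rows-per-second' = '2')
""")

# An updating result: each new order retracts the old sum and adds the new one.
totals = t_env.sql_query(
    "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
)

# Each Row in this stream is tagged with its RowKind (+I, -U, +U, -D).
t_env.to_changelog_stream(totals).print()
env.execute("changelog-demo")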
Flink Window Functions
Events over time
Tumbling Window - Concept
Hopping Window - Concept
Session Window
Flink Dynamic Table - Windowing
Kafka Pyflink Getting Started Series - diptimanrc
Flink AI Remote Inference (OpenAI) Blog
Code Explanation
Flink DataFlow
FlinkSQL Transforms
Flink Table API for a Kafka Topic
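The slide's code is not preserved here, so below is a hedged sketch of how a Kafka topic is typically registered as a table with PyFlink's TableDescriptor API; the topic name, brokers, and schema are all assumptions:

from pyflink.table import (
    DataTypes, EnvironmentSettings, Schema, TableDescriptor, TableEnvironment
)

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register the assumed 'sensors' Kafka topic as a dynamic table.
# Requires the flink-sql-connector-kafka jar on the classpath.
t_env.create_temporary_table(
    "sensor_readings",
    TableDescriptor.for_connector("kafka")
        .schema(
            Schema.new_builder()
                .column("device_id", DataTypes.STRING())
                .column("temp", DataTypes.DOUBLE())
                # Event time from the Kafka record timestamp, 5s watermark.
                .column_by_metadata("ts", DataTypes.TIMESTAMP_LTZ(3), "timestamp")
                .watermark("ts", "ts - INTERVAL '5' SECOND")
                .build()
        )
        .option("topic", "sensors")
        .option("properties.bootstrap.servers", "localhost:9092")
        .option("properties.group.id", "pyflink-demo")
        .option("scan.startup.mode", "earliest-offset")
        .format("json")
        .build(),
)

readings = t_env.from_path("sensor_readings")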
Flink Table API - Tumbling Window Transform
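A sketch of a tumbling-window transform in the Table API, continuing from the sensor_readings sketch above (window size and column names are assumptions):

from pyflink.table.expressions import col, lit
from pyflink.table.window import Tumble

# One-minute tumbling windows over event time, averaging temp per device.
windowed = (
    readings.window(
        Tumble.over(lit(1).minutes).on(col("ts")).alias("w")
    )
    .group_by(col("w"), col("device_id"))
    .select(
        col("device_id"),
        col("w").start.alias("window_start"),
        col("w").end.alias("window_end"),
        col("temp").avg.alias("avg_temp"),
    )
)
windowed.execute().print()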
Flink SQL - Tumbling Window Transform
The same transform using FlinkSQL
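A sketch of the same transform expressed in Flink SQL from PyFlink, using the standard TUMBLE windowing table-valued function (continuing with the t_env and sensor_readings assumed above):

# One-minute tumbling windows, averaging temperature per device.
t_env.execute_sql("""
    SELECT window_start, window_end, device_id, AVG(temp) AS avg_temp
    FROM TABLE(
        TUMBLE(TABLE sensor_readings, DESCRIPTOR(ts), INTERVAL '1' MINUTE))
    GROUP BY window_start, window_end, device_id
""").print()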
Flink Table API - EXPLAIN PLAN
Flink Table API - EXPLAIN PLAN
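The Table API can print the optimizer's plan for any table. A brief sketch, continuing with the windowed table assumed above:

from pyflink.table import ExplainDetail

# Print the abstract syntax tree and the optimized physical plan.
print(windowed.explain())

# Optionally include changelog details in the plan output.
print(windowed.explain(ExplainDetail.CHANGELOG_MODE))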
DSP - Data Streaming Platform
DSP - The End Product

[Diagram of the Data Streaming Platform: the Stream (Kafka) at the core, with Connect bridging custom apps & microservices, databases, and log data & messaging systems on one side, and operational apps & data systems and data warehouses / data lakes on the other, each able to READ the stream AS a Data Stream or a Data Product; in-stream processing and Schema Registry sit on the stream; Stream Governance spans the platform (Stream Catalog, Stream Lineage, Data Portal); Tableflow (Iceberg, COMING SOON) exposes streams to third-party compute engines. Underlying principles: decoupled event-driven architecture design and immutable logs.]
Stream Governance - Getting Started Series
Confluent Developer Newsletter - Kafka, Flink Latest …
Thanks / Q&A
Diptiman Raichaudhuri
Staff Developer Advocate - Confluent
draichaudhuri@confluent.io
Sign Up for Confluent Cloud

Get $400 worth of free credits for your first 30 days.

Use promo code POPTOUT000MZG62 to skip the paywall!
