
Building Real-Time Streaming Pipelines with Apache Flink & PyFlink | by Yousef Yousefi | ... https://medium.com/@usefusefi/building-real-time-streaming-pipelines-with-apache-flink-pyfl...

Building Real-Time Streaming Pipelines with Apache Flink & PyFlink

Yousef Yousefi · 3 min read · Feb 21, 2025


Apache Flink is a high-performance real-time stream processing engine designed for stateful and event-driven applications. While batch processing frameworks like Apache Spark operate on stored data, Flink excels at processing continuous event streams with ultra-low latency.

This guide provides a deep dive into advanced Flink optimizations using PyFlink (the Python API for Apache Flink):

• Optimizing event-time processing with watermarks

• Tuning RocksDB for large-scale stateful streaming

• Checkpointing & fault tolerance in production

• Optimizing performance with task slots, parallelism, and configuration

• Building a real-time fraud detection pipeline with Kafka & Flink

Understanding Flink’s Execution Model

Stream Processing Paradigm


Unlike traditional batch processing, Flink operates on unbounded event
streams, supporting:

• Event Time Processing (timestamps from event source)

• Processing Time (system clock-based execution)

• Stateful Processing (keeping track of data across multiple events)
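To make the stateful idea concrete, here is a minimal plain-Python sketch (not the PyFlink API) of what keyed state means: state is partitioned per key and persists across events. The event shape and the threshold rule are made-up examples:

```python
# Plain-Python sketch of keyed state: a running count per user,
# conceptually what Flink keeps in its state backend per key.
from collections import defaultdict

def process_stream(events):
    state = defaultdict(int)  # per-key state, survives across events
    alerts = []
    for event in events:
        key = event["user_id"]
        state[key] += 1  # update state for this key only
        if state[key] == 3:  # emit once a key has been seen 3 times
            alerts.append(key)
    return alerts

events = [{"user_id": u} for u in ["a", "b", "a", "a", "b"]]
print(process_stream(events))  # ['a']
```

In real Flink, the same pattern would live in a `KeyedProcessFunction` with the count held in keyed state, so it survives failures and scales across parallel instances.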

Optimizing Stateful Processing with RocksDB

Why RocksDB for Large Stateful Applications?


When Flink applications handle millions of events per second, purely in-memory state storage becomes a bottleneck. RocksDB is a strong choice because it:


• Supports disk-based storage (scales beyond heap memory)

• Handles large stateful computations efficiently

• Is optimized for fast write-heavy workloads

Configuring RocksDB as the State Backend in PyFlink

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.state_backend import EmbeddedRocksDBStateBackend

env = StreamExecutionEnvironment.get_execution_environment()
env.set_state_backend(EmbeddedRocksDBStateBackend())  # use RocksDB for state management

Optimizing RocksDB Performance in Flink

state.backend.rocksdb.writebuffer.size: 64mb      # buffer size before flushing to disk
state.backend.rocksdb.compaction.style: LEVEL     # leveled compaction suits streaming writes
state.backend.rocksdb.block.cache-size: 256mb     # cache for frequently accessed blocks

Event-Time Processing with Watermarks


Why is this important?


In real-time streaming, events don’t always arrive in order due to network latency or processing delays. Watermarks let Flink handle late events correctly by tracking how far event time has progressed in the stream.
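As an illustration of the mechanics (plain Python, not the Flink API), a bounded-out-of-orderness watermark simply trails the largest event timestamp seen so far by a fixed delay:

```python
# Sketch of a bounded-out-of-orderness watermark: it trails the
# maximum event timestamp seen so far by a fixed delay (all in ms),
# so it never moves backwards even when events arrive out of order.
def watermarks(event_times_ms, max_out_of_orderness_ms):
    max_seen = float("-inf")
    out = []
    for t in event_times_ms:
        max_seen = max(max_seen, t)  # late event cannot lower the maximum
        out.append(max_seen - max_out_of_orderness_ms)
    return out

# The third event (2000) is late, but the watermark holds steady.
print(watermarks([1000, 4000, 2000, 9000], 10_000))
# [-9000, -6000, -6000, -1000]
```

A window is only finalized once the watermark passes its end, which is how the 10-second bound below translates into tolerated lateness.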

Assigning Event-Time Timestamps in PyFlink

from pyflink.common import Duration
from pyflink.common.watermark_strategy import WatermarkStrategy, TimestampAssigner

class EventTimestampAssigner(TimestampAssigner):
    def extract_timestamp(self, event, record_timestamp):
        return event["timestamp"]  # event time in epoch milliseconds

watermark_strategy = WatermarkStrategy \
    .for_bounded_out_of_orderness(Duration.of_seconds(10)) \
    .with_timestamp_assigner(EventTimestampAssigner())

stream = env.add_source(kafka_source) \
    .assign_timestamps_and_watermarks(watermark_strategy)

Key Points:

• We define a WatermarkStrategy that tolerates events arriving up to 10 seconds out of order

• Events with delayed timestamps within that bound are still processed correctly


Advanced Windowing Techniques

Tumbling Windows (Fixed Intervals)

from pyflink.common.time import Time
from pyflink.common.typeinfo import Types
from pyflink.datastream.window import TumblingEventTimeWindows
from pyflink.datastream.functions import ReduceFunction

class SumTransactions(ReduceFunction):
    def reduce(self, a, b):
        return {"user_id": a["user_id"], "amount": a["amount"] + b["amount"]}

windowed_stream = stream \
    .key_by(lambda event: event["user_id"], key_type=Types.STRING()) \
    .window(TumblingEventTimeWindows.of(Time.minutes(5))) \
    .reduce(SumTransactions())

windowed_stream.print()

Aggregates events into fixed 5-minute windows; late arrivals within the watermark bound are still included.
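Under the hood, assigning an event to a tumbling window is just integer arithmetic on its timestamp; a plain-Python sketch (assuming epoch-millisecond timestamps):

```python
# Sketch: the tumbling window of size `size_ms` that an event with
# timestamp `ts_ms` falls into is the half-open range [start, start + size_ms).
def tumbling_window(ts_ms, size_ms):
    start = ts_ms - (ts_ms % size_ms)  # round down to the window boundary
    return (start, start + size_ms)

five_min = 5 * 60 * 1000  # 5 minutes in milliseconds
print(tumbling_window(737_000, five_min))    # (600000, 900000)
print(tumbling_window(1_200_000, five_min))  # (1200000, 1500000)
```

Because every event maps to exactly one window, tumbling windows never overlap; sliding windows would relax that property.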

Building a Real-Time Fraud Detection Pipeline


Now, let’s create a real-time fraud detection system using Kafka + Flink + Elasticsearch.

Kafka → Flink → Elasticsearch Pipeline in PyFlink

import json

from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import FlinkKafkaConsumer, FlinkKafkaProducer

env = StreamExecutionEnvironment.get_execution_environment()

# Source: read raw transaction strings from Kafka
kafka_source = FlinkKafkaConsumer(
    topics="transactions",
    deserialization_schema=SimpleStringSchema(),
    properties={"bootstrap.servers": "kafka-broker:9092", "group.id": "flink-fraud"}
)
transactions = env.add_source(kafka_source) \
    .map(json.loads)  # parse each JSON record into a dict

# Process: detect fraudulent transactions
def detect_fraud(transaction):
    return transaction["amount"] > 5000  # example rule: flag transactions > $5000

fraud_alerts = transactions.filter(detect_fraud)

# Sink: send fraud alerts back to Kafka (an Elasticsearch sink can be attached similarly)
kafka_sink = FlinkKafkaProducer(
    topic="fraud-alerts",
    serialization_schema=SimpleStringSchema(),
    producer_config={"bootstrap.servers": "kafka-broker:9092"}
)
fraud_alerts.map(json.dumps).add_sink(kafka_sink)


env.execute("Fraud Detection Job")

What’s happening?

• Flink reads streaming transactions from Kafka

• Filters suspicious transactions (e.g., large transactions > $5000)

• Sends alerts back to Kafka for further processing
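The filtering step is easy to reason about in isolation; here is a plain-Python equivalent of the rule (the threshold is the article's example value, and the transaction records are made up):

```python
# Plain-Python equivalent of the pipeline's fraud rule:
# keep only transactions whose amount exceeds $5000.
def detect_fraud(transaction):
    return transaction["amount"] > 5000

transactions = [
    {"user_id": "u1", "amount": 120},
    {"user_id": "u2", "amount": 7500},
    {"user_id": "u3", "amount": 5000},  # boundary value is NOT flagged
]
alerts = [t for t in transactions if detect_fraud(t)]
print([t["user_id"] for t in alerts])  # ['u2']
```

A production rule would typically be stateful (e.g., velocity checks per card over a window) rather than a single amount threshold.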

Scaling Flink for High-Throughput Processing

Set Parallelism for Maximum Throughput

env.set_parallelism(8) # Use 8 parallel tasks for processing

Optimize Checkpoints for High Availability


env.enable_checkpointing(30000) # Checkpoint every 30 seconds

Configure Task Slots & Resources

taskmanager.memory.process.size: 8GB
taskmanager.numberOfTaskSlots: 4
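As a sanity check on sizing, the cluster's total slot count bounds the job parallelism you can set; a quick sketch of the arithmetic (the TaskManager count here is a hypothetical example):

```python
# Total task slots = number of TaskManagers x slots per TaskManager.
# A job's default parallelism must fit within the available slots.
def total_slots(num_taskmanagers, slots_per_taskmanager):
    return num_taskmanagers * slots_per_taskmanager

# e.g. 2 TaskManagers, each with taskmanager.numberOfTaskSlots: 4
print(total_slots(2, 4))  # 8 -> env.set_parallelism(8) fully uses the cluster
```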

In Summary

Apache Flink is an industry-leading real-time processing engine, capable of handling millions of events per second with low-latency stateful processing. This article covered:

• Stateful processing with RocksDB

• Advanced event-time processing with watermarks

• Optimizing Flink performance

• Building a Kafka → Flink → Elasticsearch fraud detection pipeline


Find the source code on GitHub: Apache Flink Real-Time Streaming Optimization
Tags: Apache Flink, PyFlink, Realtime Streaming, Stream Processing, Data Engineering

Written by Yousef Yousefi, Data Engineer (usefusefi.com)