Introduction to Apache Flink™:
How Stream Processing is Shaping
the Data Engineering Space
Tzu-Li (Gordon) Tai
[email protected] @tzulitai
Who am I?
● 戴資力(Gordon)
● Apache Flink Committer
● Co-organizer of Apache Flink Taiwan User Group
● Software Engineer @ VMFive
● Java, Scala
● Enjoy developing distributed systems
Data Streaming is becoming
increasingly popular
Stream processing is enabling the
obvious: continuous processing on
data that is continuously produced
Streaming is the next programming
paradigm for data applications, and
you need to start thinking in terms
of streams
01 The Traditional Batch Way
(Diagram: data arriving continuously over time, written to HDFS files, processed by periodic batch jobs)
● Continuously ingesting data into HDFS files
● Periodic batch files
● Periodic batch jobs (MapReduce / Spark / Flink)
01 The Traditional Batch Way
(Diagram: intermediate results that cross the batch boundary)
● Jobs often have “dangling” intermediate results near batch boundaries
● Need to save them, and feed them into the next batch job
02 Key Observations for Batch
● Way too many moving parts
● Implicit treatment of time (the batch boundaries)
● Treating continuous state as discrete
● Troublesome to get accurate, correct results
03 The “Ideal” Streaming Way
A streaming processor that handles …
(1) continuous state
(2) out-of-order events
scalably, robustly, and efficiently
04 Apache Flink
Apache Flink
an open-source platform for distributed
stream and batch data processing
● Apache Top-Level Project since Jan. 2015
● Streaming Dataflow Engine at its core
○ Low latency
○ High Throughput
○ Stateful
○ Accurate
○ Distributed
04 Apache Flink
Apache Flink
an open-source platform for distributed
stream and batch data processing
● ~260 contributors, ~25 Committers / PMC
● User adoption:
○ Alibaba - realtime search optimization
○ Uber - ride request fulfillment marketplace
○ Netflix - Stream Processing as a Service (SPaaS)
○ King (gaming) - realtime data science dashboard
○ ...
05 Scala Collection-like API
case class Word(word: String, count: Int)

DataSet API
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap(_.split(" ")).map(word => Word(word, 1))
  .groupBy("word").sum("count")
  .print()

DataStream API
val lines: DataStream[String] = env.addSource(new KafkaSource(...))
lines.flatMap(_.split(" ")).map(word => Word(word, 1))
  .keyBy("word").timeWindow(Time.seconds(5)).sum("count")
  .print()
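For reference, a minimal, self-contained sketch of the DataStream word count above. It swaps the Kafka source for a local socket source so it runs without extra connector dependencies; the host/port and job name are placeholders, not part of the original slide.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object StreamingWordCount {
  case class Word(word: String, count: Int)

  def main(args: Array[String]): Unit = {
    // Set up the streaming execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Placeholder source: lines of text from a local socket (e.g. started with `nc -lk 9999`)
    val lines: DataStream[String] = env.socketTextStream("localhost", 9999)

    lines
      .flatMap(_.split(" "))
      .map(word => Word(word, 1))
      .keyBy("word")                  // group the unbounded stream by word
      .timeWindow(Time.seconds(5))    // tumbling 5-second windows
      .sum("count")
      .print()

    // A streaming job only starts running once execute() is called
    env.execute("Streaming WordCount")
  }
}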
05 Scala Collection-like API
.filter(...).flatMap(...).map(...).groupBy(...).reduce(...)
● Becoming the de facto standard API style for expressing data pipelines in the new generation of frameworks
● Apache Spark, Apache Flink, Apache Beam ...
06 What does Flink’s Engine do?
● Processes records one-at-a-time as they flow through your code
● Computation on a never-ending stream of data records
06 What does Flink’s Engine do?
● The system distributes the computation (your code) across the cluster
07 Streaming Dataflow Runtime
● Client: application code (DataSet / DataStream) is translated by the graph generator into a logical JobGraph
● JobManager: the optimizer turns the JobGraph into a parallel ExecutionGraph and schedules it onto the TaskManagers
● TaskManagers: execute the parallel tasks concurrently
● Distributed queues serve as push-based data shipping channels between tasks
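This translation can be inspected without submitting a job: Flink's StreamExecutionEnvironment can return the plan it builds from your code. A small sketch; the example pipeline and object name are made up for illustration.

import org.apache.flink.streaming.api.scala._

object ShowPlan {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    env.fromElements("a", "b", "a")
      .map(word => (word, 1))
      .keyBy(0)
      .sum(1)
      .print()

    // Returns the dataflow plan the client builds from this program, as JSON;
    // since execute() is never called, nothing actually runs
    println(env.getExecutionPlan)
  }
}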
07 Streaming Dataflow Runtime
● A slightly closer look at the transmission of data:
  1. Record “A” enters Task 1, and is processed
  2. The record is serialized into an output buffer at Task 1 (taken from an output buffer pool)
  3. The buffer is shipped to Task 2’s input buffer (taken from an input buffer pool)
● Observation: buffers need to be available throughout the process (think blocking queues used between threads)
07 Streaming Dataflow Runtime
● Natural, built-in backpressure
● Needed when data is received at a higher rate than the system can process during a temporary load spike
  ○ ex. GC stalls at processing tasks
  ○ ex. a natural load spike at the data source
(Diagram panels: normal stable case, temporary load spike, ideal backpressure handling)
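The “blocking queues between threads” analogy can be made concrete. The following is an illustrative sketch only, not Flink code: a bounded queue between a fast producer and a slow consumer blocks the producer when the queue is full, which is the same effect Flink's bounded buffer pools have between tasks.

import java.util.concurrent.ArrayBlockingQueue

object BackpressureAnalogy extends App {
  // Bounded "buffer pool": only 10 in-flight records allowed
  val buffers = new ArrayBlockingQueue[String](10)

  val producer = new Thread(new Runnable {
    override def run(): Unit = {
      var i = 0
      while (true) {
        buffers.put(s"record-$i")   // blocks when all 10 slots are taken
        i += 1
      }
    }
  })

  val consumer = new Thread(new Runnable {
    override def run(): Unit = {
      while (true) {
        buffers.take()              // frees a slot, letting the producer continue
        Thread.sleep(100)           // simulate a slow task (e.g. a GC stall)
      }
    }
  })

  producer.start()
  consumer.start()
}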
07 Flexible Windows
● Due to one-at-a-time processing, Flink has very powerful built-in windowing, among the strongest in current streaming frameworks (a DataStream API sketch of each type follows below)
  ○ Time-driven: Tumbling window, Sliding window
  ○ Data-driven: Count window, Session window
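A hedged sketch of these window types in the Scala DataStream API. The object name, the tiny placeholder source, and the window sizes are assumptions for illustration; with such a small finite source the windows may not fire before the job ends, so this only shows the API shape.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows
import org.apache.flink.streaming.api.windowing.time.Time

object WindowExamples {
  case class Word(word: String, count: Int)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Placeholder source so the sketch is self-contained
    val counts: DataStream[Word] =
      env.fromElements(Word("flink", 1), Word("kafka", 1), Word("flink", 1))

    // Time-driven windows
    val tumbling = counts.keyBy("word").timeWindow(Time.seconds(5)).sum("count")                   // fixed 5 s windows, no overlap
    val sliding  = counts.keyBy("word").timeWindow(Time.seconds(10), Time.seconds(5)).sum("count") // 10 s windows, evaluated every 5 s

    // Data-driven windows
    val counted  = counts.keyBy("word").countWindow(100).sum("count")                              // fires after 100 elements per key
    val sessions = counts.keyBy("word")
      .window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))                              // window closes after 10 min of inactivity
      .sum("count")

    tumbling.print()
    env.execute("Window examples")
  }
}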
07 Time Windows
Tumbling Time Window / Sliding Time Window (diagrams)
07 Count-Triggered Windows
07 Session Windows
08 What does Flink’s Engine do?
● Computation and state, ex.:
  ○ counters
  ○ in-progress windows
  ○ state machines
  ○ trained ML models
● Results depend on the history of the stream
● A stateful stream processor gives you tools to manage state (a sketch follows below)
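A hedged sketch of Flink's managed keyed state (ValueState) in the Scala API: a running count per word, kept in state that Flink checkpoints for you. The class and field names are illustrative, not from the slides.

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// Keeps a running count per key; the count lives in Flink-managed state,
// so it is scoped to the current key and survives failures via checkpoints
class RunningCount extends RichFlatMapFunction[(String, Long), (String, Long)] {

  @transient private var countState: ValueState[java.lang.Long] = _

  override def open(parameters: Configuration): Unit = {
    countState = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Long]("count", classOf[java.lang.Long]))
  }

  override def flatMap(in: (String, Long), out: Collector[(String, Long)]): Unit = {
    val previous = Option(countState.value()).map(_.longValue()).getOrElse(0L)
    val updated  = previous + in._2
    countState.update(updated)          // the new value is part of the next checkpoint
    out.collect((in._1, updated))
  }
}

// Usage: must be applied on a keyed stream, e.g.
//   pairs.keyBy(0).flatMap(new RunningCount)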
09 What does Flink’s Engine do?
(Diagram: events arrive out of order (t4, t2, t3, t1) and are assigned to the windows t1 - t2 and t3 - t4)
● Processing depends on the timestamps of when events were generated
● The core mechanism is called watermarks: essentially a way to measure and advance the event-time clock, instead of relying on machine time
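A hedged sketch of wiring up event time and watermarks in the Scala DataStream API. The Click event type, its fields, the placeholder source, and the 10-second out-of-orderness bound are all assumptions for illustration.

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.windowing.time.Time

object EventTimeExample {
  case class Click(user: String, timestamp: Long)   // hypothetical event with an epoch-millis timestamp

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Use the time events were generated, not the machine clock of the processor
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // Placeholder source; note the third event arrives out of order
    val clicks: DataStream[Click] = env.fromElements(
      Click("alice", 5000L), Click("bob", 7000L), Click("alice", 1000L))

    val withTimestamps = clicks.assignTimestampsAndWatermarks(
      // Watermarks trail the largest timestamp seen by 10 seconds, so events
      // up to 10 seconds out of order are still assigned to the right window
      new BoundedOutOfOrdernessTimestampExtractor[Click](Time.seconds(10)) {
        override def extractTimestamp(click: Click): Long = click.timestamp
      })

    withTimestamps
      .map(c => (c.user, 1))
      .keyBy(0)
      .timeWindow(Time.minutes(5))   // event-time window: closes when the watermark passes its end
      .sum(1)
      .print()

    env.execute("Event-time example")
  }
}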
09 Different Kinds of “Time”
09 Why Wall Time is Incorrect
● Think of a Twitter hashtag count every 5 minutes
  ○ We want the result to reflect the number of tweets actually tweeted within a 5-minute window
  ○ Not the number of tweet events the stream processor happens to receive within 5 minutes
09 Why Wall Time is Incorrect
● Think of replaying a Kafka topic through a windowed streaming application …
  ○ If you are replaying a queue, windows are definitely wrong when using a wall clock: all of the historical events arrive within a short span of processing time, so they collapse into the wrong windows
10 Flink’s Streaming Fault Tolerance
(Diagram: a streaming topology in which every operator holds its own state alongside your code)
● Any operator in a Flink streaming topology can be stateful
● How to ensure that the states are correct upon failure?
10 Flink’s Streaming Fault Tolerance
● First, a recap of some guarantee concepts:
○ At-least-once: records may be processed more than once.
  Think counting: may over-count, resulting in wrong state
○ Exactly-once “state”: records appear to be processed only once, with respect to the state.
  Think counting: even on failure, each record is counted exactly once
○ End-to-end exactly-once: records appear to be processed only once, even to external systems.
  Think counting: for results stored externally, the results remain correct even after failure
11 Flink’s Streaming Fault Tolerance
● Flink checkpoints: a combined snapshot of all operator state, together with the corresponding position in the source
● Based on the Chandy-Lamport algorithm: does not halt any computation while taking consistent snapshots
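A minimal sketch of turning checkpointing on for a job. The interval and object name are illustrative; exactly-once is the default mode for state and is set here only to make the guarantee explicit.

import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._

object CheckpointConfigExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Snapshot all operator state (plus source positions) every 10 seconds
    env.enableCheckpointing(10000)   // interval in milliseconds

    // Exactly-once with respect to state; AT_LEAST_ONCE is the lighter-weight alternative
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)

    // ... define sources, transformations, and sinks, then call env.execute(...)
  }
}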
12 Flink’s Savepoints
● Flink checkpoints: consistent snapshots of the whole
topology state that the system periodically takes
● Flink savepoints: manually triggered checkpoints that can
be persisted, and used to initialize state for a new streaming
job
(Timeline: savepoints triggered at t1, t2, t3 capture the state at t1, t2, t3 respectively)
13 So, back to this ...
(The traditional batch way again: data continuously written to HDFS files, processed by periodic MapReduce / Spark / Flink jobs)
13 So, back to this ...
A streaming processor that handles …
(1) continuous state
(2) out-of-order events
scalably, robustly, and efficiently
14 What Flink provides, in a nutshell example
● No stateless point-in-time processing
14 What Flink provides, in a nutshell example
● Processing, or re-processing, in the batch way
14 What Flink provides, in a nutshell example
● Batch is inherently unsuitable for the nature of continuously generated data
● State is corrupted at batch boundaries
14 What Flink provides, in a nutshell example
● Flink’s stateful streaming naturally treats state continuously as it processes your continuous data, and continuously generates results
14 What Flink provides, in a nutshell example
● On reprocessing: the initial state for the job reflects all previous historical data in the stream
14 What Flink provides, in a nutshell example
● On reprocessing: event-time processing guarantees correct results, even when fast-forwarding to the head of the stream
15 Final Takeaways
● Stateful streaming correctly embraces the nature of continuously generated data, and is the new programming paradigm for data applications.
● Streaming isn’t only about real-time. Real-time is just a natural advantage of streaming.
15 Final Takeaways
● The choice is all about your data, and your code.
● Think:
○ Is your data unbounded, or bounded?
■ Unbounded: click streams, page visits, impressions …
■ Bounded: (???)
● Think:
○ Does your code change faster than your data?
■ Data exploration, data mining, feature engineering …
■ In this case, it doesn’t really matter whether you use batch or streaming
○ Or does your data change faster than your code?
■ Production ETL pipelines, warehousing, serving, etc.
■ For accuracy and robustness, definitely think and design in terms of
streaming
15 Final Takeaways
● Upcoming features in Flink:
○ Dynamic scaling, with stateful streaming
○ Queryable state
○ Incremental state checkpointing
○ Even more savepoint functionality
15 Final Takeaways
● How Flink’s technology covers the application space:
Application                     Technology
Realtime applications           Low-latency stateful streaming
Continuous applications         High-latency stateful streaming
Analytics on historical data    Batch as a special case of streaming
Request/Response apps           Queryable state