ITHome - Deep Dive Into Apache Flink - Gordon

The document introduces Apache Flink as an open-source platform for distributed stream and batch data processing, emphasizing the shift from traditional batch processing to continuous stream processing. It highlights Flink's capabilities, such as low latency, high throughput, and stateful processing, while discussing its architecture and fault tolerance mechanisms. The final takeaways suggest that stateful streaming is essential for handling continuously generated data and outlines upcoming features in Flink.


Introduction to Apache Flink™:
How Stream Processing is Shaping
the Data Engineering Space

Tzu-Li (Gordon) Tai


[email protected]
@tzulitai
Who am I?
● 戴資力(Gordon)
● Apache Flink Committer
● Co-organizer of Apache Flink Taiwan User Group
● Software Engineer @ VMFive
● Java, Scala
● Enjoy developing distributed systems
Data Streaming is becoming
increasingly popular

1
Stream processing is enabling the
obvious: continuous processing on
data that is continuously produced

2
Streaming is the next programming
paradigm for data applications, and
you need to start thinking in terms
of streams

3
01 The Traditional Batch Way
(figure: a timeline of continuously arriving events, cut into periodic files)

● Continuously ingesting data into HDFS files

● Periodic batch files

● Periodic batch jobs (MapReduce / Spark / Flink)
4
01 The Traditional Batch Way
(figure: cross-boundary intermediate results near the batch cuts)

● Jobs often have “dangling” results near batch boundaries

● Need to save them, and feed them into the next batch job
5
02 Key Observations for Batch

● Way too many moving parts

● Implicit treatment of time (the batch boundaries)

● Treating continuous state as discrete

● Troublesome to get accurate, correct results

6
03 The “Ideal” Streaming Way
(figure: one streaming processor consuming the event timeline directly)

A streaming processor that handles …
(1) continuous state
(2) out-of-order events
… scalably, robustly, and efficiently
7
04 Apache Flink

Apache Flink
an open-source platform for distributed
stream and batch data processing

● Apache Top-Level Project since Jan. 2015

● Streaming Dataflow Engine at its core


○ Low latency
○ High Throughput
○ Stateful
○ Accurate
○ Distributed
8
04 Apache Flink

Apache Flink
an open-source platform for distributed
stream and batch data processing

● ~260 contributors, ~25 Committers / PMC

● User adoption:
○ Alibaba - realtime search optimization
○ Uber - ride request fulfillment marketplace
○ Netflix - Stream Processing as a Service (SPaaS)
○ King (gaming) - realtime data science dashboard
○ ...
9
05 Scala Collection-like API
case class Word(word: String, count: Int)

DataSet API

val lines: DataSet[String] = env.readTextFile(...)

lines.flatMap(_.split(" "))
  .map(word => Word(word, 1))
  .groupBy("word").sum("count")
  .print()

DataStream API

val lines: DataStream[String] = env.addSource(new KafkaSource(...))

lines.flatMap(_.split(" "))
  .map(word => Word(word, 1))
  .keyBy("word").timeWindow(Time.seconds(5)).sum("count")
  .print()
11
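For comparison, the same word-count shape can be sketched with plain Java streams over an in-memory collection, with no Flink involved; the `WordCount` class and `count` method are illustrative names, not Flink's API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class WordCount {
    // flatMap(split) -> map(word -> (word, 1)) -> groupBy(word) -> sum(count),
    // expressed over an in-memory collection instead of a DataSet
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }
}
```

The point of the comparison: the pipeline reads the same whether the collection is bounded or unbounded; only the execution engine changes.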
05 Scala Collection-like API
.filter(...).flatMap(...).map(...).groupBy(...).reduce(...)

● Becoming the de facto standard for new-generation APIs to
  express data pipelines

● Apache Spark, Apache Flink, Apache Beam ...

12
06 What does Flink’s Engine do?
(figure: your code applied to records one-at-a-time from a stream)

● Computation on a never-ending stream of data records
13
06 What does Flink’s Engine do?

(figure: parallel instances of your code running across the cluster)

● System distributes the computation across the cluster

14
07 Streaming Dataflow Runtime
● Distributed queues act as push-based data shipping channels

(figure: on the client, application code (DataSet / DataStream) goes
through the Optimizer / Graph Generator to a logical JobGraph; the
Job Manager turns it into a parallel ExecutionGraph, deployed to Task
Managers and executed concurrently)
15
07 Streaming Dataflow Runtime
● A slightly closer look into the transmission of data ...

1. Record “A” enters Task 1, and is processed
2. The record is serialized into an output buffer at Task 1
   (taken from an output buffer pool)
3. The buffer is shipped to Task 2’s input buffer
   (taken from an input buffer pool)

● Observation: buffers need to be available throughout the process
  (think blocking queues used between threads)
16
07 Streaming Dataflow Runtime
● Natural, built-in backpressure

● Receiving data at a higher rate than the system can process
  during a temporary load spike
  ○ ex. GC stalls at processing tasks
  ○ ex. natural load spike at the data source

(figure: buffer occupancy in the normal stable case, during a
temporary load spike, and with ideal backpressure handling)
17
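The mechanism can be imitated with a bounded `BlockingQueue` between two threads: when the receiver is slow, `put()` blocks the sender, and that blocking is the backpressure. This is a sketch of the idea, not Flink's actual network buffer pool; `Backpressure` and `transfer` are invented names:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class Backpressure {
    // A bounded buffer pool between two tasks: a full queue stalls the sender.
    static int transfer(int records, int bufferSize, long consumerDelayMs) {
        BlockingQueue<Integer> buffers = new ArrayBlockingQueue<>(bufferSize);
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < records; i++) {
                    buffers.take();                 // Task 2 drains an input buffer
                    Thread.sleep(consumerDelayMs);  // simulated slow processing (e.g. GC stall)
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        int sent = 0;
        try {
            for (int i = 0; i < records; i++) {
                buffers.put(i);  // blocks when all buffers are in use: sender slows down
                sent++;
            }
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sent;  // every record eventually gets through, at the receiver's pace
    }
}
```

No records are dropped and no special throttling code is needed; the bounded buffers alone pace the producer to the consumer.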
07 Flexible Windows
● Due to one-at-a-time processing, Flink has very powerful
built-in windowing (certainly among the best in the current
streaming framework solutions)

○ Time-driven: Tumbling window, Sliding window


○ Data-driven: Count window, Session window

18
07 Time Windows

(figures: tumbling time windows vs. sliding time windows)

19
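Window assignment for the two time-driven kinds is simple arithmetic over the event timestamp. A sketch, with class and method names of my own (not Flink's window assigner API):

```java
import java.util.ArrayList;
import java.util.List;

class TimeWindows {
    // A timestamp belongs to exactly one tumbling window; return its start time.
    static long tumblingWindowStart(long ts, long size) {
        return ts - (ts % size);
    }

    // A timestamp belongs to several overlapping sliding windows;
    // return their start times, newest first.
    static List<Long> slidingWindowStarts(long ts, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long first = ts - (ts % slide);  // last window start at or before ts
        for (long s = first; s > ts - size; s -= slide) {
            if (s >= 0) starts.add(s);
        }
        return starts;
    }
}
```

For example, with size 10 and slide 5, an event at timestamp 7 falls into the windows starting at 5 and at 0.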
07 Count-Triggered Windows

20
07 Session Windows

21
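A data-driven session window can be sketched as gap-based grouping: a new session starts whenever the gap to the previous event exceeds a threshold. Illustrative code over sorted timestamps, not Flink's implementation:

```java
import java.util.ArrayList;
import java.util.List;

class SessionWindows {
    // Split sorted event timestamps into sessions separated by gaps > `gap`.
    static List<List<Long>> sessions(List<Long> sortedTs, long gap) {
        List<List<Long>> out = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTs) {
            if (!current.isEmpty() && ts - current.get(current.size() - 1) > gap) {
                out.add(current);            // gap exceeded: close the session
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) out.add(current);
        return out;
    }
}
```

Unlike time windows, session boundaries are decided by the data itself, which is why they cannot be computed by fixed arithmetic on the timestamp.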
08 What does Flink’s Engine do?

(figure: your code on the stream, backed by a state store)

● Computation and state, ex.:
  ○ counters
  ○ in-progress windows
  ○ state machines
  ○ trained ML models

● Results depend on the history of the stream

● A stateful stream processor gives tools to manage state
22
09 What does Flink’s Engine do?
(figure: out-of-order events t4, t2, t3, t1 assigned to the
event-time windows t1 - t2 and t3 - t4)

● Processing depends on the timestamps of when events were
  generated

● The core mechanic is called watermarks: basically a way to
  measure and advance clock time, instead of relying on machine
  time
23
09 Different Kinds of “Time”

24
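One common watermark strategy, bounded out-of-orderness, can be sketched in a few lines: the watermark trails the highest event timestamp seen by a fixed allowed lateness. The `Watermarks` class and its fields are invented for illustration, not Flink's generator API:

```java
class Watermarks {
    private long maxTimestamp = Long.MIN_VALUE;
    private final long maxOutOfOrderness;  // how late an event may arrive

    Watermarks(long maxOutOfOrderness) {
        this.maxOutOfOrderness = maxOutOfOrderness;
    }

    // Advance the clock based on event timestamps, never machine time.
    void onEvent(long eventTimestamp) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
    }

    // The watermark asserts: no event with a timestamp at or below this
    // value is expected anymore, so windows up to here may be finalized.
    long currentWatermark() {
        return maxTimestamp - maxOutOfOrderness;
    }
}
```

Note that a late event with a smaller timestamp never moves the watermark backwards; it only widens the gap it was allowed.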
09 Why Wall Time is Incorrect

● Think Twitter hash-tag count every 5 minutes

○ We would want the result to reflect the number of tweets
  actually tweeted in a 5-minute window

○ Not the number of tweet events the stream processor receives
  within 5 minutes

25
09 Why Wall Time is Incorrect

● Think replaying a Kafka topic on a windowed


streaming application …

○ If you’re replaying a queue, windows are


definitely wrong if using a wall clock

26
10 Flink’s Streaming Fault Tolerance

(figure: a streaming topology in which every operator instance keeps
its own local state)

● Any operator in a Flink streaming topology can be stateful

● How to ensure that the states are correct upon failure?
27
10 Flink’s Streaming Fault Tolerance
● First, a recap of some guarantee concepts:

○ At-least-once: records may be processed more than once.
  Think counting: may over-count, resulting in wrong state

○ Exactly-once “state”: records appear to be processed only once,
  with respect to the state.
  Think counting: even on failure, each record is counted exactly
  once

○ End-to-end exactly-once: records appear to be processed only
  once, even to external systems.
  Think counting: for results stored externally, even after
  failure, the results remain correct
28
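The difference between the first two guarantees can be made concrete with a toy counter that crashes and replays from a checkpointed source offset. The scenario and method names are invented for illustration:

```java
class CountingGuarantees {
    // At-least-once: replay from the checkpointed offset, but the counter
    // was NOT rolled back to the snapshot, so replayed records count twice.
    static long atLeastOnce(int totalRecords, int processedBeforeCrash, int checkpointOffset) {
        long count = processedBeforeCrash;            // state survives the crash as-is
        for (int i = checkpointOffset; i < totalRecords; i++) count++;
        return count;  // over-counts by (processedBeforeCrash - checkpointOffset)
    }

    // Exactly-once state: counter and source offset are restored from the
    // SAME consistent snapshot, so each record is reflected exactly once.
    static long exactlyOnceState(int totalRecords, int processedBeforeCrash, int checkpointOffset) {
        long count = checkpointOffset;                // state rolled back to the snapshot
        for (int i = checkpointOffset; i < totalRecords; i++) count++;
        return count;  // always equals totalRecords
    }
}
```

With 10 records, a crash after 7, and a checkpoint at offset 5, at-least-once yields 12 while exactly-once state yields the correct 10.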
11 Flink’s Streaming Fault Tolerance

(figure: the same stateful topology, snapshotted as a whole)

● Flink checkpoints: a combined snapshot of all operator state,
  with the corresponding position in the source

● Based on the Chandy-Lamport algorithm: does not halt any
  computation while taking consistent snapshots
29
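The checkpoint idea, operator state snapshotted together with the source position it corresponds to, can be sketched with a toy word counter that crashes and recovers. This is illustrative code, not Flink's snapshotting mechanism:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CheckpointDemo {
    // A checkpoint couples operator state with the source offset it matches.
    static class Checkpoint {
        final int sourceOffset;
        final Map<String, Long> state;
        Checkpoint(int sourceOffset, Map<String, Long> state) {
            this.sourceOffset = sourceOffset;
            this.state = state;
        }
    }

    // Count words, crash at `crashAt`, then recover from the last checkpoint:
    // restore the state AND rewind the source to the matching offset.
    static Map<String, Long> runWithFailure(List<String> words, int checkpointAt, int crashAt) {
        Map<String, Long> state = new HashMap<>();
        Checkpoint cp = new Checkpoint(0, new HashMap<>());
        for (int i = 0; i < crashAt; i++) {
            state.merge(words.get(i), 1L, Long::sum);
            if (i + 1 == checkpointAt) {
                cp = new Checkpoint(i + 1, new HashMap<>(state)); // consistent snapshot
            }
        }
        // crash! recovery: restore the snapshot, replay from its offset
        state = new HashMap<>(cp.state);
        for (int i = cp.sourceOffset; i < words.size(); i++) {
            state.merge(words.get(i), 1L, Long::sum);
        }
        return state; // identical to a failure-free run
    }
}
```

Because state and offset come from the same snapshot, the recovered run produces exactly the counts a failure-free run would have.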
12 Flink’s Savepoints

● Flink checkpoints: consistent snapshots of the whole topology
  state that the system periodically takes

● Flink savepoints: manually triggered checkpoints that can be
  persisted, and used to initialize state for a new streaming job

(figure: savepoints triggered at t1, t2, t3, each persisting the
state at that point in time)

30
13 So, back to this ...
(figure: the traditional batch pipeline again: files landing in HDFS,
processed by periodic MapReduce / Spark / Flink jobs)

31
13 So, back to this ...
(figure: the ideal streaming alternative: a streaming processor that
handles (1) continuous state and (2) out-of-order events scalably,
robustly, and efficiently)

32
14 What Flink provides, in a nutshell example

● No stateless point-in-time

33
14 What Flink provides, in a nutshell example

● Processing, or re-processing, in the batch way

34
14 What Flink provides, in a nutshell example

● Batch is inherently unsuitable for the nature of continuously
  generated data

● State is corrupted at batch boundaries
35
14 What Flink provides, in a nutshell example

● Flink’s stateful streaming naturally treats state continuously as
  it processes your continuous data, and continuously generates
  results

36
14 What Flink provides, in a nutshell example

● On reprocessing: the initial state for the job reflects all
  previous history in the stream

37
14 What Flink provides, in a nutshell example

event-time processing

● On reprocessing: event-time processing guarantees correct
  results, even when fast-forwarding to the head of the stream

38
15 Final Takeaways

● Stateful streaming correctly embraces the nature of continuously
  generated data, and is the new programming paradigm for data
  applications.

● Streaming isn’t only about real-time. Real-time is only a natural
  advantage of streaming.

39
15 Final Takeaways
● The choice is all about your data, and your code.

● Think:
○ Is your data unbounded, or bounded?
■ Unbounded: click streams, page visits, impressions …
■ Bounded: (???)

● Think:
○ Does your code change faster than your data?
■ Data exploration, data mining, feature engineering …
■ In this case, it doesn’t really matter whether you use batch or streaming

○ Or does your data change faster than your code?


■ Production ETL pipelines, warehousing, serving, etc.
■ For accuracy and robustness, definitely think and design in terms of
streaming

40
15 Final Takeaways

● Upcoming features in Flink:

○ Dynamic scaling, with stateful streaming


○ Queryable state
○ Incremental state checkpointing
○ Even more savepoint functionality

41
15 Final Takeaways

● How Flink’s technology covers the application space:

Application                  | Technology
-----------------------------|--------------------------------------
Realtime applications        | Low-latency stateful streaming
Continuous applications      | High-latency stateful streaming
Analytics on historical data | Batch as a special case of streaming
Request/Response apps        | Queryable state
42
