Fault Tolerance and Job
Recovery in Apache Flink™
Till Rohrmann
trohrmann@apache.org
@stsffap
Better be safe than sorry
• Failures will happen
• EMC estimated $1.7 billion in costs due to data loss and system downtime
• Recovery will save you time and costs
• Switch between algorithms
• Live upgrade of your system
Fault Tolerance
Fault tolerance guarantees
• At most once
    - No guarantees at all
• At least once
    - Sufficient for many applications
• Exactly once
• Flink provides all guarantees
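To make the three guarantees concrete, here is a toy model in plain Java (not Flink code). It assumes a job that sums ten records, a last checkpoint that recorded source offset 4, and a crash after six records were processed. The class name, the scenario, and the numbers are illustrative assumptions, not taken from the talk.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the three delivery guarantees; hypothetical, not Flink code. */
class DeliveryGuaranteeDemo {

    /** Continues summing records from fromOffset on top of the restored state. */
    static long resume(List<Integer> records, int fromOffset, long restoredState) {
        long sum = restoredState;
        for (int i = fromOffset; i < records.size(); i++) {
            sum += records.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> records = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

        // At most once: nothing is restored or replayed, so records 1..6 are lost.
        long atMostOnce = resume(records, 6, 0L);

        // At least once: the restored state already contains record 5, but the
        // source rewinds to offset 4, so record 5 is counted twice.
        long atLeastOnce = resume(records, 4, 1 + 2 + 3 + 4 + 5);

        // Exactly once: state and source offset come from the same consistent cut.
        long exactlyOnce = resume(records, 4, 1 + 2 + 3 + 4);

        System.out.println(atMostOnce + " " + atLeastOnce + " " + exactlyOnce);
    }
}
```

The true sum of the ten records is 55. Only the exactly-once case reproduces it; at most once undercounts and at least once overcounts.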
Checkpoints
• Consistent snapshots of the distributed data stream and operator state
Barriers
• Markers for checkpoints
• Injected into the data flow
• Alignment for multi-input operators
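Barrier alignment can be sketched in plain Java as follows. This is a simplified model, not Flink's implementation: the class `BarrierAligner` and its methods are hypothetical, the operator state is a single running sum, and forwarding the barrier downstream is left out. Once a channel has delivered its barrier, later records from that channel are held back until every other channel has delivered its barrier too.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy multi-input operator that aligns checkpoint barriers; hypothetical, not Flink code. */
class BarrierAligner {
    private final boolean[] barrierSeen;          // per input channel
    private final List<Queue<Integer>> buffered;  // records held back during alignment
    private int barriersPending;
    private long sum = 0;                         // the operator's state
    final List<Long> snapshots = new ArrayList<>();

    BarrierAligner(int numInputs) {
        barrierSeen = new boolean[numInputs];
        buffered = new ArrayList<>();
        for (int i = 0; i < numInputs; i++) {
            buffered.add(new ArrayDeque<>());
        }
        barriersPending = numInputs;
    }

    /** A data record arrives on the given channel. */
    void onRecord(int channel, int value) {
        if (barrierSeen[channel]) {
            buffered.get(channel).add(value); // belongs to the next checkpoint
        } else {
            sum += value;
        }
    }

    /** A checkpoint barrier arrives on the given channel. */
    void onBarrier(int channel) {
        if (barrierSeen[channel]) {
            return; // this toy ignores duplicate barriers
        }
        barrierSeen[channel] = true;
        barriersPending--;
        if (barriersPending == 0) {
            snapshots.add(sum); // contains only pre-barrier records
            for (int c = 0; c < barrierSeen.length; c++) {
                barrierSeen[c] = false;
                Queue<Integer> q = buffered.get(c);
                while (!q.isEmpty()) {
                    sum += q.poll(); // release the records that were held back
                }
            }
            barriersPending = barrierSeen.length;
        }
    }

    long currentSum() {
        return sum;
    }
}
```

For example, with two inputs: record 1 on channel 0, barrier on channel 0, record 10 on channel 0 (held back), record 2 on channel 1, barrier on channel 1. The snapshot then holds 3, and the running sum becomes 13 once the buffered record is released.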
Operator State
• Stateless operators
    ds.filter(_ != 0)
• System state
    ds.keyBy(0).window(TumblingTimeWindows.of(5, TimeUnit.SECONDS))
• User-defined state
    public class CounterSum extends RichReduceFunction<Long> {
        // Counts how often reduce() has been called.
        private OperatorState<Long> counter;

        @Override
        public Long reduce(Long v1, Long v2) throws Exception {
            counter.update(counter.value() + 1);
            return v1 + v2;
        }

        @Override
        public void open(Configuration config) throws Exception {
            // Register the state under the name "counter" with default value 0.
            counter = getRuntimeContext().getOperatorState("counter", 0L, false);
        }
    }
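What checkpointing user-defined state amounts to can be shown without any Flink dependency. The sketch below is an assumption-laden stand-in: `CountingSum`, `snapshot()`, and `restore()` are hypothetical names, and in Flink the runtime, not the user, triggers the equivalent of these calls.

```java
/** Toy reducer with explicit snapshot/restore hooks; hypothetical, not Flink code. */
class CountingSum {
    private long counter = 0; // user-defined state: number of reduce calls

    long reduce(long v1, long v2) {
        counter++;
        return v1 + v2;
    }

    /** Called when a checkpoint is taken. */
    long snapshot() {
        return counter;
    }

    /** Called on recovery with the value from the last completed checkpoint. */
    void restore(long checkpointed) {
        counter = checkpointed;
    }

    long counter() {
        return counter;
    }
}
```

A fresh instance that calls `restore` with the last snapshot continues counting from the checkpointed value rather than from zero.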
Advantages
• Separation of app logic from recovery
    - Checkpointing interval is just a config parameter
• High throughput
    - Controllable checkpointing overhead
• Low impact on latency
Cluster High Availability
Without high availability
[Diagram: JobManager, TaskManager]
With high availability
[Diagram: JobManager, stand-by JobManager, TaskManager, Apache ZooKeeper™; image text: "KEEP GOING"]
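How a stand-by JobManager could take over can be illustrated with a common ZooKeeper election recipe: every candidate registers an ephemeral, sequentially numbered entry, and the lowest surviving number leads. The slides do not say which recipe Flink uses, so the in-memory class below is a hypothetical stand-in and not a ZooKeeper or Flink API.

```java
import java.util.Map;
import java.util.TreeMap;

/** In-memory stand-in for ZooKeeper-style leader election; hypothetical. */
class ToyLeaderElection {
    private final TreeMap<Integer, String> candidates = new TreeMap<>(); // sequence -> id
    private int nextSeq = 0;

    /** Registers a JobManager, like creating an ephemeral sequential node. */
    int join(String jobManagerId) {
        int seq = nextSeq++;
        candidates.put(seq, jobManagerId);
        return seq;
    }

    /** Removes a JobManager, like its session expiring after a crash. */
    void fail(int seq) {
        candidates.remove(seq);
    }

    /** The candidate with the lowest surviving sequence number, or null if none. */
    String leader() {
        Map.Entry<Integer, String> first = candidates.firstEntry();
        return first == null ? null : first.getValue();
    }
}
```

If "jm-1" joins before "jm-2", then "jm-1" leads; after `fail` is called for "jm-1", `leader()` returns "jm-2".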
Persisting jobs
[Diagram: Client, JobManager, TaskManagers, Apache ZooKeeper™]
1. Submit job
2. Persist execution graph
3. Write handle to ZooKeeper
4. Deploy tasks
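The four steps above can be sketched with two in-memory maps standing in for durable storage and ZooKeeper. Everything here is an illustrative assumption: the class names, the "graphs/" handle format, and the string that represents the execution graph. The point is that ZooKeeper holds only a small handle, while the graph itself sits in durable storage, so a newly elected JobManager can find and redeploy it.

```java
import java.util.HashMap;
import java.util.Map;

/** Shared stores that survive a JobManager crash; hypothetical stand-ins. */
class ToyStores {
    final Map<String, String> durableStorage = new HashMap<>(); // handle -> execution graph
    final Map<String, String> zooKeeper = new HashMap<>();      // job id -> handle
}

/** Toy JobManager following the four submission steps; not Flink code. */
class ToyJobManager {
    private final ToyStores stores;
    final Map<String, String> runningJobs = new HashMap<>();    // lost when this JobManager dies

    ToyJobManager(ToyStores stores) {
        this.stores = stores;
    }

    /** Step 1: the client submits a job. */
    void submit(String jobId, String executionGraph) {
        String handle = "graphs/" + jobId;
        stores.durableStorage.put(handle, executionGraph); // 2. persist execution graph
        stores.zooKeeper.put(jobId, handle);               // 3. write handle to ZooKeeper
        runningJobs.put(jobId, executionGraph);            // 4. deploy tasks
    }

    /** What a newly elected JobManager could do: follow the handles and redeploy. */
    void recover() {
        for (Map.Entry<String, String> e : stores.zooKeeper.entrySet()) {
            runningJobs.put(e.getKey(), stores.durableStorage.get(e.getValue()));
        }
    }
}
```

A second `ToyJobManager` built on the same `ToyStores` starts with no running jobs and has all of them back after `recover()`.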
Handling checkpoints
[Diagram: Client, JobManager, TaskManagers, Apache ZooKeeper™]
1. Take snapshots
2. Persist snapshots
3. Send handles to JM
4. Create global checkpoint
5. Persist global checkpoint
6. Write handle to ZooKeeper
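The six checkpoint steps can be modeled the same way. In this sketch the class `ToyCheckpointCoordinator`, the handle format, and the ZooKeeper key "latest-checkpoint" are hypothetical, and plain maps stand in for durable storage and ZooKeeper. A global checkpoint is created only after every task has reported the handle of its persisted snapshot.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the six checkpoint steps; hypothetical, not Flink code. */
class ToyCheckpointCoordinator {
    private final int numTasks;
    private final Map<String, String> durableStorage;
    private final Map<String, String> zooKeeper;
    // checkpoint id -> (task index -> snapshot handle)
    private final Map<Long, Map<Integer, String>> pending = new HashMap<>();

    ToyCheckpointCoordinator(int numTasks, Map<String, String> durableStorage,
                             Map<String, String> zooKeeper) {
        this.numTasks = numTasks;
        this.durableStorage = durableStorage;
        this.zooKeeper = zooKeeper;
    }

    /** Steps 1 and 2, task side: take a snapshot and persist it; returns the handle. */
    String persistTaskSnapshot(long checkpointId, int task, String state) {
        String handle = "chk-" + checkpointId + "/task-" + task;
        durableStorage.put(handle, state);
        return handle;
    }

    /** Step 3: a handle reaches the JobManager. Returns true once the checkpoint is complete. */
    boolean acknowledge(long checkpointId, int task, String handle) {
        Map<Integer, String> acks = pending.computeIfAbsent(checkpointId, id -> new TreeMap<>());
        acks.put(task, handle);
        if (acks.size() < numTasks) {
            return false; // still waiting for other tasks
        }
        pending.remove(checkpointId);
        String globalHandle = "chk-" + checkpointId + "/global";
        // 4 and 5: create the global checkpoint (the list of task handles) and persist it
        durableStorage.put(globalHandle, String.join(",", acks.values()));
        // 6: write the handle to ZooKeeper
        zooKeeper.put("latest-checkpoint", globalHandle);
        return true;
    }
}
```

Until the last task acknowledges, nothing is written to ZooKeeper, so a recovering JobManager in this model only ever sees completed checkpoints.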
Conclusion
TL;DL
• Job recovery mechanism with low latency and high throughput
• Exactly-once processing semantics
• No single point of failure
→ Flink will always keep processing your data
flink.apache.org
@ApacheFlink

Fault Tolerance and Job Recovery in Apache Flink @ FlinkForward 2015