Flink Basics
Features
Apache Flink is an excellent choice to develop and run many different types of applications due to its extensive feature set.
• Flink’s features include support for stream and batch processing,
• sophisticated state management,
• event-time processing semantics,
• and exactly-once consistency guarantees for state.
• Moreover, Flink can be deployed on various resource providers such as YARN and Kubernetes, but also as a stand-alone cluster on
bare-metal hardware.
• Configured for high availability, Flink does not have a single point of failure. Fault tolerance is achieved by periodically writing
checkpoints to remote persistent storage.
• Flink has been proven to scale to thousands of cores and terabytes of application state, delivers high throughput and low latency, and
powers some of the world’s most demanding stream processing applications.
• Flink’s outstanding feature for event-driven applications, however, is the savepoint. A savepoint is a consistent image of an
application’s state that can be used as a starting point for compatible applications. Given a savepoint, an application can be updated
or rescaled, or multiple versions of an application can be started for A/B testing.
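• Supplier<Stream<Integer>> str = () -> Stream.of(1, 2, 3, 4); // needs java.util.function.Supplier and java.util.stream.Stream
• long str1 = str.get().map(x -> x).count(); // the original omits the mapper; the identity function stands in for it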
Execution Environment
• Flink integrates with all common cluster resource managers, such as
• Hadoop YARN
• Kubernetes,
• but can also be set up to run as a stand-alone cluster.
• When deploying a Flink application, Flink automatically identifies the
required resources based on the application’s configured parallelism and
requests them from the resource manager. In case of a failure, Flink replaces
the failed container by requesting new resources. All communication to
submit or control an application happens via REST calls. This eases the
integration of Flink in many environments.
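As a rough illustration of how configured parallelism turns into a resource request, the sketch below estimates how many worker containers a job might need. It assumes Flink's default slot sharing, under which a job needs as many task slots as its highest operator parallelism. The class and method names are hypothetical; Flink does this bookkeeping internally and exposes no such helper.

```java
import java.util.List;

/** Hypothetical helper, not part of Flink: estimates containers from parallelism. */
final class ResourceEstimate {

    /** Assumption: with default slot sharing, required slots = highest operator parallelism. */
    static int requiredSlots(List<Integer> operatorParallelisms) {
        int max = 0;
        for (int p : operatorParallelisms) {
            max = Math.max(max, p);
        }
        return max;
    }

    /** Containers needed for the given slots, rounded up to whole containers. */
    static int requiredContainers(int slots, int slotsPerContainer) {
        if (slotsPerContainer <= 0) {
            throw new IllegalArgumentException("slotsPerContainer must be positive");
        }
        return (slots + slotsPerContainer - 1) / slotsPerContainer;
    }

    public static void main(String[] args) {
        // Illustrative parallelism values for a source, a map, and a sink operator.
        int slots = requiredSlots(List.of(4, 8, 2));
        System.out.println("slots=" + slots
                + ", containers=" + requiredContainers(slots, 3));
    }
}
```

With these illustrative numbers the job needs 8 slots, which at 3 slots per container rounds up to 3 containers. If one container fails, the same arithmetic tells the framework how much replacement capacity to request.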
Kubernetes in Brief
• Kubernetes is a portable, extensible, open-source platform for managing
containerized workloads and services that facilitates both declarative
configuration and automation. It has a large, rapidly growing ecosystem.
Kubernetes services, support, and tools are widely available.
• Containers
• A technology for packaging an application along with its runtime
dependencies.
• Workloads
• Pods are the smallest deployable compute objects in Kubernetes;
higher-level abstractions help you run them.
• Hadoop YARN (MRv2) roles compared with MRv1:
• Node manager == Task tracker.
• Application master == Job tracker, but with only half of the MRv1 Job
tracker's responsibilities: it drives job execution and scheduling, takes
updates from the node managers, and asks the resource manager for
resources, so it sits in between the two.
• The resource-allocation half of the MRv1 Job tracker is now assigned to
the resource manager in MRv2.
Flink Application Structure
1. Streams
2. State
3. Time
Streams are a fundamental aspect of stream processing. However, streams can have different characteristics that
affect how a stream can and should be processed. Flink is a versatile processing framework that can handle any
kind of stream.
• Bounded and unbounded streams: Streams can be unbounded or bounded, i.e., fixed-sized data sets. Flink
has sophisticated features to process unbounded streams, but also dedicated operators to efficiently process
bounded streams.
• Real-time and recorded streams: All data are generated as streams. There are two ways to process the data:
processing it in real time as it is generated, or persisting the stream to a storage system, e.g., a file system or
object store, and processing it later. Flink applications can process recorded or real-time streams.
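The difference between bounded and unbounded input can be sketched in plain Java, without any Flink dependency. This is an analogy only: the class and method names are hypothetical, and the `take` parameter exists purely so the demo stops observing an otherwise endless source.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Stream;

/** Plain-Java analogy, not Flink API: bounded versus unbounded input. */
final class BoundedVsUnbounded {

    /** Bounded input: all data is available, so a single final result exists. */
    static int sumBounded(List<Integer> recorded) {
        int sum = 0;
        for (int value : recorded) {
            sum += value;
        }
        return sum;
    }

    /** Unbounded input: it never ends, so results are emitted incrementally. */
    static List<Long> runningSums(Iterator<Long> unbounded, int take) {
        List<Long> emitted = new ArrayList<>();
        long sum = 0;
        for (int i = 0; i < take && unbounded.hasNext(); i++) {
            sum += unbounded.next();
            emitted.add(sum);   // one updated result per incoming element
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(sumBounded(List.of(1, 2, 3, 4)));
        Iterator<Long> endless = Stream.iterate(1L, n -> n + 1).iterator();
        System.out.println(runningSums(endless, 4));
    }
}
```

The bounded case can wait for the last element before answering. The unbounded case has no last element, so it must publish a result after each one.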
State
Only applications that apply transformations on individual events do not require state.
Any application that runs basic business logic needs to remember events or intermediate results to access them at a
later point in time, for example when the next event is received or after a specific time duration.
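A minimal plain-Java sketch of that distinction follows. It is not Flink code, and the temperature conversion and per-key counter are hypothetical examples chosen only to contrast the two cases.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java illustration, not Flink API: stateless versus stateful processing. */
final class StatelessVsStateful {

    /** Stateless: the output depends only on the current event. */
    static double toFahrenheit(double celsius) {
        return celsius * 9.0 / 5.0 + 32.0;
    }

    /** Stateful: remembers a count per key between events. */
    static final class PerKeyCounter {
        private final Map<String, Long> counts = new HashMap<>();

        /** Returns how many events with this key have been seen so far. */
        long onEvent(String key) {
            return counts.merge(key, 1L, Long::sum);
        }
    }
}
```

Calling `toFahrenheit` twice with the same input always gives the same answer, whereas calling `onEvent("a")` twice returns 1 and then 2, because the second call depends on what was remembered from the first.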
• Features of state:
• Exactly-once state consistency: Flink’s checkpointing and recovery algorithms guarantee the consistency of
application state in case of a failure. Hence, failures are transparently handled and do not affect the correctness of an
application.
• Very Large State: Flink is able to maintain application state of several terabytes in size due to its asynchronous and
incremental checkpoint algorithm.
• Scalable Applications: Flink supports scaling of stateful applications by redistributing the state to more or fewer
workers.
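The ideas of snapshotting state, recovering from a snapshot, and redistributing state across workers can be sketched as an in-memory toy. It is not how Flink implements checkpoints (those are asynchronous, incremental, and written to remote storage), and all names here are hypothetical. The hash-based partitioning is an assumed stand-in for whatever scheme the runtime really uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory toy, not Flink API: snapshot, restore, and redistribute keyed state. */
final class StateSnapshotSketch {
    private Map<String, Long> counts = new HashMap<>();

    void onEvent(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    /** Copies the state so that later events cannot alter the snapshot. */
    Map<String, Long> snapshot() {
        return new HashMap<>(counts);
    }

    /** Rolls the state back to a snapshot, as recovery after a failure would. */
    void restore(Map<String, Long> snapshot) {
        counts = new HashMap<>(snapshot);
    }

    long count(String key) {
        return counts.getOrDefault(key, 0L);
    }

    /** Rescaling: splits keyed state across the given number of workers by key hash. */
    static List<Map<String, Long>> redistribute(Map<String, Long> state, int workers) {
        List<Map<String, Long>> parts = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            parts.add(new HashMap<>());
        }
        state.forEach((key, value) ->
                parts.get(Math.floorMod(key.hashCode(), workers)).put(key, value));
        return parts;
    }
}
```

After `restore`, the state is exactly what it was when the snapshot was taken, so events processed after the snapshot are no longer reflected in it. Because every key lands in exactly one partition, `redistribute` can move the same state onto more or fewer workers without losing or duplicating entries.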
Time
Time is another important ingredient of streaming applications. Most event streams have inherent time semantics
because each event is produced at a specific point in time.
Moreover, many common stream computations are based on time, such as window aggregations, sessionization,
pattern detection, and time-based joins. An important aspect of stream processing is how an application
measures time, i.e., the difference between event time and processing time.
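To make the distinction concrete, the plain-Java sketch below assigns events to fixed-size (tumbling) windows using only the timestamp each event carries. It is not Flink's windowing API; the names and the choice of tumbling windows are assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration, not Flink API: windowing by event time. */
final class EventTimeWindows {

    /** Start of the tumbling window that contains the given event timestamp. */
    static long windowStart(long eventTimeMillis, long windowSizeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, windowSizeMillis);
    }

    /** Counts events per window using only the timestamps carried by the events. */
    static Map<Long, Integer> countPerWindow(List<Long> eventTimes, long windowSizeMillis) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long t : eventTimes) {
            counts.merge(windowStart(t, windowSizeMillis), 1, Integer::sum);
        }
        return counts;
    }
}
```

Because the arrival order never enters the computation, feeding the same events in a different order yields the same counts. A processing-time version would instead group events by the wall clock at arrival, so its result would depend on when each event happened to show up.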
Flink provides a rich set of time-related features.
• Event-time Mode: Applications that process streams with event-time semantics compute results based on timestamps of
the events. Thereby, event-time processing allows for accurate and consistent results regardless of whether recorded or
real-time events are processed.
• Watermark Support: Flink employs watermarks to reason about time in event-time applications. Watermarks are also a
flexible mechanism to trade off the latency and completeness of results.
• Late Data Handling: When processing streams in event-time mode with watermarks, it can happen that a computation
has been completed before all associated events have arrived. Such events are called late events. Flink features multiple
options to handle late events, such as rerouting them via side outputs and updating previously completed results.
• Processing-time Mode: In addition to its event-time mode, Flink also supports processing-time semantics, which perform
computations as triggered by the wall-clock time of the processing machine. The processing-time mode can be suitable for
certain applications with strict low-latency requirements that can tolerate approximate results.
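Watermarks and late events can be sketched in plain Java as follows. This is a simplified model under stated assumptions, not Flink's implementation: the watermark is taken to be the highest timestamp seen minus a fixed out-of-orderness bound, windows are tumbling, and an event counts as late when its window ended at or before the current watermark. The `late` list merely stands in for a side output, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model, not Flink API: watermarks and late-event routing. */
final class WatermarkSketch {
    private final long maxOutOfOrdernessMillis;
    private final long windowSizeMillis;
    private long maxTimestampSeen = Long.MIN_VALUE;

    final List<Long> onTime = new ArrayList<>();
    final List<Long> late = new ArrayList<>();   // stands in for a side output

    WatermarkSketch(long maxOutOfOrdernessMillis, long windowSizeMillis) {
        this.maxOutOfOrdernessMillis = maxOutOfOrdernessMillis;
        this.windowSizeMillis = windowSizeMillis;
    }

    /** Assumed rule: watermark = highest timestamp seen minus the allowed delay. */
    long currentWatermark() {
        return maxTimestampSeen == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxTimestampSeen - maxOutOfOrdernessMillis;
    }

    void onEvent(long eventTimeMillis) {
        long windowEnd = eventTimeMillis
                - Math.floorMod(eventTimeMillis, windowSizeMillis) + windowSizeMillis;
        if (windowEnd <= currentWatermark()) {
            late.add(eventTimeMillis);      // the event's window has already closed
        } else {
            onTime.add(eventTimeMillis);
        }
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimeMillis);
    }
}
```

The out-of-orderness bound is the latency versus completeness dial. With 10-second windows and a 2-second bound, an event stamped 4000 that arrives after one stamped 13000 is routed to `late`. With a 5-second bound the same event is still on time, at the cost of every window result being held back longer.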
Layered APIs
Flink provides three layered APIs. Each API offers a different trade-off
between conciseness and expressiveness and targets different use cases.