Apache Kafka/Spark Notes
Introduction
• In Big Data, an enormous volume of data is involved. Regarding this data, we have two main challenges: the first is how to collect the large volume of data, and the second is how to analyze the collected data. To overcome these challenges, you need a messaging system.
• Kafka is designed for distributed high throughput systems.
Kafka tends to work very well as a replacement for a more
traditional message broker. In comparison to other
messaging systems, Kafka has better throughput, built-in
partitioning, replication and inherent fault-tolerance,
which makes it a good fit for large-scale message
processing applications.
What is a Messaging System?

• A messaging system is responsible for transferring data from one application to another, so the applications can focus on the data without worrying about how to share it. Distributed messaging is based on the concept of reliable message queuing: messages are queued asynchronously between the client applications and the messaging system. Two types of messaging patterns are available: one is point-to-point and the other is publish-subscribe (pub-sub). Most messaging patterns follow pub-sub.
Point to Point Messaging System

• In a point-to-point system, messages are persisted in a queue. One or more consumers can consume the messages in the queue, but a particular message can be consumed by at most one consumer. Once a consumer reads a message from the queue, it disappears from that queue. The typical example of this system is an order processing system, where each order is processed by one order processor, but multiple order processors can work at the same time.
Publish-Subscribe Messaging System

• In the publish-subscribe system, messages are persisted in a topic. Unlike the point-to-point system, consumers can subscribe to one or more topics and consume all the messages in those topics. In the publish-subscribe system, message producers are called publishers and message consumers are called subscribers. A real-life example is Dish TV, which publishes different channels like sports, movies, music, etc., and anyone can subscribe to their own set of channels and get them whenever their subscribed channels are available.
What is Kafka?

• Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data and enables you to pass messages from one end-point to another. Kafka is suitable for both offline and online message consumption. Kafka messages are persisted on disk and replicated within the cluster to prevent data loss. Kafka is built on top of the ZooKeeper synchronization service. It integrates very well with Apache Storm and Spark for real-time streaming data analysis.
Benefits

• Following are a few benefits of Kafka −
• Reliability − Kafka is distributed, partitioned, replicated and fault tolerant.
• Scalability − The Kafka messaging system scales easily without downtime.
• Durability − Kafka uses a distributed commit log, which means messages are persisted on disk as fast as possible, hence it is durable.
• Performance − Kafka has high throughput for both publishing and subscribing messages. It maintains stable performance even when many terabytes of messages are stored.
• Kafka is very fast and guarantees zero downtime and zero data loss.
Use Cases
• Kafka can be used in many Use Cases. Some of them are listed
below −

• Metrics − Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.
• Log Aggregation Solution − Kafka can be used across an organization to collect logs from multiple services and make them available in a standard format to multiple consumers.
• Stream Processing − Popular frameworks such as Storm and Spark Streaming read data from a topic, process it, and write the processed data to a new topic where it becomes available for users and applications. Kafka's strong durability is also very useful in the context of stream processing.
Need for Kafka
• Kafka is a unified platform for handling all the real-time
data feeds.
• Kafka supports low latency message delivery and gives
guarantee for fault tolerance in the presence of machine
failures.
• It has the ability to handle a large number of diverse
consumers.
• Kafka is very fast and can perform around 2 million writes per second.
• Kafka persists all data to the disk, which essentially means
that all the writes go to the page cache of the OS (RAM).
• This makes it very efficient to transfer data from page
cache to a network socket.
Terminologies
• The main terminologies are topics, brokers, producers and consumers. The following table describes these components in detail.
1. Topics − A stream of messages belonging to a particular category is called a topic. Data is stored in topics. Topics are split into partitions. For each topic, Kafka keeps a minimum of one partition. Each such partition contains messages in an immutable ordered sequence. A partition is implemented as a set of segment files of equal size.

2. Partition − Topics may have many partitions, so a topic can handle an arbitrary amount of data.

3. Partition offset − Each partitioned message has a unique sequence id called an offset.

4. Replicas of partition − Replicas are nothing but backups of a partition. Replicas are never used to read or write data; they are used only to prevent data loss.

5. Brokers − Brokers are simple systems responsible for maintaining the published data. Each broker may have zero or more partitions per topic. Assume there are N partitions in a topic and N brokers: each broker will have one partition. Assume there are N partitions in a topic and more than N brokers (N + M): the first N brokers will have one partition each and the next M brokers will not have any partition for that particular topic. Assume there are N partitions in a topic and fewer than N brokers (N − M): each broker will have one or more partitions shared among them. This scenario is not recommended due to unequal load distribution among the brokers.

6. Kafka Cluster − A Kafka deployment having more than one broker is called a Kafka cluster. A Kafka cluster can be expanded without downtime. These clusters are used to manage the persistence and replication of message data.

7. Producers − Producers are the publishers of messages to one or more Kafka topics. Producers send data to Kafka brokers. Every time a producer publishes a message to a broker, the broker simply appends the message to the last segment file. Actually, the message is appended to a partition. Producers can also send messages to a partition of their choice.

8. Consumers − Consumers read data from brokers. Consumers subscribe to one or more topics and consume published messages by pulling data from the brokers.

9. Leader − The leader is the node responsible for all reads and writes for a given partition. Every partition has one server acting as the leader.
Cluster Architecture of Kafka
1. Broker − A Kafka cluster typically consists of multiple brokers to maintain load balance. Kafka brokers are stateless, so they use ZooKeeper for maintaining their cluster state. One Kafka broker instance can handle hundreds of thousands of reads and writes per second, and each broker can handle terabytes of messages without performance impact. Kafka broker leader election is done by ZooKeeper.

2. ZooKeeper − ZooKeeper is used for managing and coordinating the Kafka brokers. The ZooKeeper service is mainly used to notify producers and consumers about the presence of a new broker in the Kafka system, or about the failure of a broker. Based on the notification received from ZooKeeper regarding the presence or failure of a broker, producers and consumers take decisions and start coordinating their tasks with some other broker.

3. Producers − Producers push data to brokers. When a new broker is started, all the producers search for it and automatically send messages to that new broker. A Kafka producer does not wait for acknowledgements from the broker and sends messages as fast as the broker can handle.

4. Consumers − Since Kafka brokers are stateless, the consumer has to keep track of how many messages have been consumed by using the partition offset. If the consumer acknowledges a particular message offset, it implies that the consumer has consumed all prior messages. The consumer issues an asynchronous pull request to the broker to have a buffer of bytes ready to consume. Consumers can rewind or skip to any point in a partition simply by supplying an offset value. The consumer offset value is notified by ZooKeeper.
Workflow of Pub-Sub Messaging
Following is the step wise workflow of the Pub-Sub Messaging −
• Producers send messages to a topic at regular intervals.
• Kafka broker stores all messages in the partitions configured for that particular topic. It
ensures the messages are equally shared between partitions. If the producer sends
two messages and there are two partitions, Kafka will store one message in the first
partition and the second message in the second partition.
• Consumer subscribes to a specific topic.
• Once the consumer subscribes to a topic, Kafka will provide the current offset of the topic to the consumer and also save the offset in the Zookeeper ensemble.
• The consumer will request Kafka at a regular interval (for example, every 100 ms) for new messages.
• Once Kafka receives the messages from producers, it forwards these messages to the
consumers.
• Consumer will receive the message and process it.
• Once the messages are processed, consumer will send an acknowledgement to the
Kafka broker.
• Once Kafka receives an acknowledgement, it changes the offset to the new value and updates it in the Zookeeper. Since offsets are maintained in the Zookeeper, the consumer can read the next message correctly even during server outages.
• The above flow will repeat until the consumer stops the request.
• The consumer has the option to rewind/skip to the desired offset of a topic at any time and read all the subsequent messages. A minimal client sketch of this workflow is shown below.
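The following sketch uses the third-party kafka-python client, which is an assumption (these notes do not prescribe a particular client); the broker address, topic name and group id are placeholder values.

```python
# A minimal sketch of the pub-sub workflow above, using the kafka-python client.
# Broker address, topic name and group id are placeholders.
from kafka import KafkaConsumer, KafkaProducer

# Producer side: publish a few messages to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(5):
    producer.send("demo-topic", value=f"message-{i}".encode("utf-8"))
producer.flush()  # ensure buffered messages actually reach the broker

# Consumer side: subscribe to the topic and poll for new messages.
consumer = KafkaConsumer(
    "demo-topic",
    bootstrap_servers="localhost:9092",
    group_id="demo-group",
    auto_offset_reset="earliest",  # start from the beginning if no saved offset
    enable_auto_commit=True,       # periodically commit the consumed offsets
)
for record in consumer:            # blocks, receiving messages as they arrive
    print(record.partition, record.offset, record.value)
```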
Workflow of Queue Messaging / Consumer Group
In a queue messaging system, instead of a single consumer, a group of consumers having the same Group ID subscribes to a topic. In simple terms, consumers subscribing to a topic with the same Group ID are considered a single group, and the messages are shared among them. Let us check the actual workflow of this system.

• Producers send messages to a topic at regular intervals.


• Kafka stores all messages in the partitions configured for that particular topic similar to the
earlier scenario.
• A single consumer subscribes to a specific topic, assume Topic-01 with Group ID as Group-1.
• Kafka interacts with the consumer in the same way as in Pub-Sub Messaging until a new consumer subscribes to the same topic, Topic-01, with the same Group ID, Group-1.
• Once the new consumer arrives, Kafka switches its operation to share mode and shares the data between the two consumers. This sharing will go on until the number of consumers reaches the number of partitions configured for that particular topic.
• Once the number of consumers exceeds the number of partitions, a new consumer will not receive any further messages until one of the existing consumers unsubscribes. This scenario arises because each consumer in Kafka is assigned a minimum of one partition, and once all the partitions are assigned to the existing consumers, new consumers have to wait.
• This feature is also called a Consumer Group. In this way, Kafka provides the best of both systems in a very simple and efficient manner. A short sketch of two consumers sharing a topic is given below.
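As a rough illustration (again assuming the kafka-python client; the broker address, topic and group id are placeholders), two consumers started with the same group_id share the topic's partitions between them:

```python
# Run this script in two terminals: both consumers join "group-1", so Kafka
# assigns each of them a subset of the partitions of the "orders" topic
# instead of delivering every message to both.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="group-1",  # same Group ID => the consumers form one group
)
for record in consumer:
    print(f"partition={record.partition} offset={record.offset} value={record.value}")
```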
Core Concepts of Kafka.
• Kafka is simply a collection of topics split into one or more
partitions.
• A Kafka partition is a linearly ordered sequence of messages, where each message is identified by its index (called the offset).
• All the data in a Kafka cluster is the disjoint union of partitions. Incoming messages are written at the end of a partition and messages are read sequentially by consumers. Durability is provided by replicating messages to different brokers.
• Kafka provides both pub-sub and queue-based messaging in a fast, reliable, persisted, fault-tolerant and zero-downtime manner. In both cases, producers simply send messages to a topic and consumers can choose either type of messaging system depending on their need.
Role of ZooKeeper
• A critical dependency of Apache Kafka is Apache Zookeeper,
which is a distributed configuration and synchronization service.
• Zookeeper serves as the coordination interface between the
Kafka brokers and consumers.
• The Kafka servers share information via a Zookeeper cluster.
• Kafka stores basic metadata in Zookeeper such as information
about topics, brokers, consumer offsets (queue readers) and so
on.
• Since all the critical information is stored in the Zookeeper and it
normally replicates this data across its ensemble, failure of Kafka
broker / Zookeeper does not affect the state of the Kafka cluster.
• Kafka will restore the state, once the Zookeeper restarts. This
gives zero downtime for Kafka. The leader election between the
Kafka broker is also done by using Zookeeper in the event of
leader failure.
Apache Spark
Introduction
• Apache Spark is an open-source cluster computing
framework. Its primary purpose is to handle the real-time
generated data.
• Spark was built on top of Hadoop MapReduce. It is optimized to run in memory, whereas alternative approaches like Hadoop's MapReduce write data to and from computer hard drives. So, Spark processes the data much more quickly than the alternatives.
• Apache Spark is a lightning-fast cluster computing technology designed for fast computation. It was built on top of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computations, which include interactive queries and stream processing.
History of Apache Spark
• Spark is one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open-sourced in 2010 under a BSD license.
• It was donated to the Apache Software Foundation in 2013, and Apache Spark became a top-level Apache project in February 2014.
Features of Apache Spark

• Speed − Spark helps to run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk and storing the intermediate processing data in memory.
• Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying.
• Advanced Analytics − Spark not only supports 'Map' and 'Reduce'. It also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.
Uses of Spark
• Data integration: The data generated by different systems is often not consistent enough to combine for analysis. To fetch consistent data from these systems we can use processes like Extract, Transform, and Load (ETL). Spark is used to reduce the cost and time required for this ETL process.
• Stream processing: It is always difficult to handle real-time generated data such as log files. Spark is capable of operating on streams of data and can, for example, reject potentially fraudulent operations.
• Machine learning: Machine learning approaches become more feasible and increasingly accurate due to the growth in the volume of data. As Spark is capable of storing data in memory and can run repeated queries quickly, it makes it easier to work on machine learning algorithms.
• Interactive analytics: Spark is able to respond rapidly, so instead of running only pre-defined queries, we can handle the data interactively.
Spark Built on Hadoop
• There are three ways of Spark deployment as explained below.
• Standalone − Spark Standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
• Hadoop YARN − Hadoop YARN deployment means, simply, that Spark runs on YARN without any pre-installation or root access required. It helps to integrate Spark into the Hadoop ecosystem or Hadoop stack and allows other components to run on top of the stack.
• Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in addition to standalone deployment. With SIMR, the user can start Spark and use its shell without any administrative access.
Spark Components

• The Spark project consists of different types of tightly integrated components. At its core, Spark is a computational engine that can schedule, distribute and monitor multiple applications.
Spark Core

• Spark Core is the heart of Spark and performs the core functionality.
• It holds the components for task scheduling, fault recovery, interacting with storage systems, and memory management.
Spark SQL
• Spark SQL is built on top of Spark Core and provides support for structured and semi-structured data.
• It allows data to be queried using SQL as well as the Hive Query Language (HQL).
Spark Streaming
• Spark Streaming is a Spark component that supports
scalable and fault-tolerant processing of streaming data.
• It uses Spark Core's fast scheduling capability to perform
streaming analytics.
• It accepts data in mini-batches and performs RDD
transformations on that data.
• Its design ensures that the applications written for
streaming data can be reused to analyze batches of
historical data with little modification.
• The log files generated by web servers can be considered
as a real-time example of a data stream.
MLlib
• The MLlib is a Machine Learning library that
contains various machine learning algorithms.
• These include correlations and hypothesis
testing, classification and regression,
clustering, and principal component analysis.
• It is nine times faster than the disk-based
implementation used by Apache Mahout.
GraphX
• The GraphX is a library that is used to
manipulate graphs and perform graph-parallel
computations.
• It facilitates creating a directed graph with arbitrary properties attached to each vertex and edge.
• To manipulate graphs, it supports various fundamental operators like subgraph, joinVertices, and aggregateMessages.
Spark - RDD
• Resilient Distributed Datasets (RDD) is a fundamental
data structure of Spark.
• It is an immutable distributed collection of objects. Each
dataset in RDD is divided into logical partitions, which
may be computed on different nodes of the cluster.
• RDDs can contain any type of Python, Java, or Scala
objects, including user-defined classes.
• Formally, an RDD is a read-only, partitioned collection of
records. RDDs can be created through deterministic
operations on either data on stable storage or other
RDDs. RDD is a fault-tolerant collection of elements that
can be operated on in parallel.
There are two ways to create RDDs
• Parallelizing an existing collection in the driver program
• Referencing a dataset in an external storage
system, such as a shared filesystem, HDFS,
HBase, or any data source offering a Hadoop
InputFormat.
Parallelized Collections
• To create a parallelized collection, call SparkContext's parallelize method on an existing collection in the driver program. Each element of the collection is copied to form a distributed dataset that can be operated on in parallel.
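A minimal PySpark sketch (the sample list and variable names are illustrative):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Copy the elements of a local Python list into a distributed dataset (RDD).
data = [1, 2, 3, 4, 5]
dist_data = sc.parallelize(data)

# The RDD can now be operated on in parallel, e.g. summing its elements.
print(dist_data.reduce(lambda a, b: a + b))  # 15
```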
External Datasets
• In Spark, distributed datasets can be created from any type of storage source supported by Hadoop, such as HDFS, Cassandra, HBase and even our local file system. Spark provides support for text files, SequenceFiles, and other types of Hadoop InputFormat.
• SparkContext's textFile method can be used to create an RDD from a text file. This method takes a URI for the file (either a local path on the machine or an hdfs:// URI) and reads the data of the file.
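A minimal sketch, assuming a local file exists at the placeholder path data.txt:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# The path is a placeholder; it may be a local path or an hdfs:// URI.
lines = sc.textFile("data.txt")

# Each element of the RDD is one line of the file.
line_lengths = lines.map(len)
print(line_lengths.sum())  # total number of characters across all lines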
RDD Operations
RDDs provide two types of operations:

• Transformation
• Action
Transformation

• In Spark, the role of a transformation is to create a new dataset from an existing one. Transformations are considered lazy, as they are only computed when an action requires a result to be returned to the driver program.
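For example (illustrative data; note that no job runs yet because no action has been called):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations such as map() and filter() build new RDDs lazily;
# nothing is actually computed at this point.
squares = numbers.map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)
```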
Action
• In Spark, the role of an action is to return a value to the driver program after running a computation on the dataset.
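Continuing the kind of example above, an action such as count() or collect() triggers the actual computation (illustrative data):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

numbers = sc.parallelize([1, 2, 3, 4, 5])
squares = numbers.map(lambda x: x * x)   # lazy transformation

# Actions trigger the computation and return a value to the driver.
print(squares.count())     # 5
print(squares.collect())   # [1, 4, 9, 16, 25]
```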
RDD Persistence
• Spark provides a convenient way to work on a dataset by persisting it in memory across operations. When persisting an RDD, each node stores any partitions of it that it computes in memory, and they can then be reused in other tasks on that dataset.
• We can use either the persist() or the cache() method to mark an RDD to be persisted. Spark's cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
• Different storage levels are available for storing persisted RDDs. Use these levels by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method uses the default storage level, which is StorageLevel.MEMORY_ONLY.
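A small sketch (the data is illustrative; persist() with an explicit level is shown commented out because an RDD's storage level can only be set once):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()

words = sc.parallelize(["spark", "kafka", "spark", "hadoop"])
pairs = words.map(lambda w: (w, 1))

# cache() persists with the default level (MEMORY_ONLY).
pairs.cache()
# pairs.persist(StorageLevel.MEMORY_AND_DISK)   # alternative with an explicit level

print(pairs.count())        # first action computes and caches the RDD
print(pairs.countByKey())   # reuses the cached partitions
```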
RDD Shared Variables
In Spark, when a function is passed to a transformation operation, it is executed on a remote cluster node. It works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machines are propagated back to the driver program.

Broadcast variable
• Broadcast variables allow a read-only variable to be cached on each machine rather than shipping a copy of it with tasks. Spark uses efficient broadcast algorithms to distribute broadcast variables in order to reduce communication cost.
Accumulator
• Accumulators are variables that are used to perform associative and commutative operations such as counters or sums. Spark provides support for accumulators of numeric types.
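A small sketch showing one broadcast variable and one accumulator (the lookup table and data are illustrative):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Broadcast variable: a read-only lookup table cached once per machine.
country_names = sc.broadcast({"IN": "India", "US": "United States"})

# Accumulator: tasks only add to it; the driver reads the final value.
unknown_count = sc.accumulator(0)

def resolve(code):
    if code in country_names.value:
        return country_names.value[code]
    unknown_count.add(1)
    return "unknown"

codes = sc.parallelize(["IN", "US", "XX", "IN"])
print(codes.map(resolve).collect())   # ['India', 'United States', 'unknown', 'India']
print(unknown_count.value)            # 1, counted while the collect() above ran
```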
SPARK: WORKING WITH PAIRED RDDS
• A pair RDD is a special kind of RDD whose elements are key/value pairs. Pair
RDDs are a useful building block in many
programs, as they expose operations that allow
you to act on each key in parallel or regroup
data across the network. For example, pair
RDDs have a reduceByKey() method that can
aggregate data separately for each key, and
a join() method that can merge two RDDs
together by grouping elements with the same
key.
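For example, a pair RDD can be built by mapping each record to a (key, value) tuple (illustrative data):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# A pair RDD is simply an RDD of (key, value) tuples.
words = sc.parallelize(["kafka", "spark", "kafka", "storm"])
pairs = words.map(lambda w: (w, 1))          # e.g. ('kafka', 1)

print(pairs.reduceByKey(lambda a, b: a + b).collect())
# [('kafka', 2), ('spark', 1), ('storm', 1)]  (order may vary)
```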
Transformations on Pair RDDs
• All the transformations available for a standard RDD are also available for pair RDDs; the only difference is that we need to pass functions that operate on tuples rather than on individual elements. Some examples are map(), reduce(), and filter(). Now, what new transformations do pair RDDs provide us with? Let's try some of those transformations:
reduceByKey()

It runs several parallel reduce operations, one for each key in the dataset, where each operation combines values that have the same key. Data is combined at each partition, so only one output per key at each partition is sent over the network. reduceByKey requires combining all your values into another value of the exact same type.

Note: Since datasets can have very large numbers of keys, reduceByKey() is not implemented as an action that returns a value to the user program. Instead, it returns a new RDD consisting of each key and the reduced value for that key.
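A short sketch (illustrative data):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

sales = sc.parallelize([("apple", 2), ("banana", 5), ("apple", 3), ("banana", 1)])

# Values with the same key are first combined within each partition,
# then the per-partition results are merged across the network.
totals = sales.reduceByKey(lambda a, b: a + b)

print(totals.collect())   # [('apple', 5), ('banana', 6)]  (order may vary)
```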
groupByKey()
• On applying groupByKey() on a dataset of (K, V) pairs, the data is shuffled according to the key K into another RDD. This transformation causes a lot of unnecessary data transfer over the network. While both reduceByKey and groupByKey produce the same answer, the reduceByKey version works much better on a large dataset. That is because Spark knows it can combine output with a common key on each partition before shuffling the data. When calling groupByKey, on the other hand, all the key-value pairs are shuffled around, which is a lot of unnecessary data to transfer over the network.
So, we should avoid groupByKey and rather use reduceByKey.
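A short sketch contrasting the two (illustrative data):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

sales = sc.parallelize([("apple", 2), ("banana", 5), ("apple", 3)])

# groupByKey shuffles every (key, value) pair, then sums the grouped values
# after the shuffle: more data over the network.
grouped = sales.groupByKey().mapValues(lambda values: sum(values))

# reduceByKey combines values per partition before shuffling: usually preferred.
reduced = sales.reduceByKey(lambda a, b: a + b)

print(grouped.collect())   # [('apple', 5), ('banana', 5)]  (order may vary)
print(reduced.collect())   # same result, less shuffle traffic
```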
foldByKey()
• The foldByKey operation is used to aggregate values based on keys. It is quite similar to fold(); both use a zero value of the same type as the data in our RDD and a combination function. As with fold(), the provided zero value for foldByKey() should have no impact when combined with another element using your combination function.
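For example (illustrative data):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

scores = sc.parallelize([("alice", 10), ("bob", 7), ("alice", 5)])

# 0 is the zero value: combining it with (+) has no effect on the result.
totals = scores.foldByKey(0, lambda a, b: a + b)

print(totals.collect())   # [('alice', 15), ('bob', 7)]  (order may vary)
```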
mapValues(func)
• mapValues is similar to map, except that it is only applicable to pair RDDs, meaning RDDs of the form RDD[(A, B)]. In that case, mapValues operates on the value only (the second part of the tuple), while map operates on the entire record (the tuple of key and value). There might be cases where we are interested in accessing only the value (and not the key); in those cases, we can use mapValues() instead of map().
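For example (illustrative data):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

prices = sc.parallelize([("pen", 10.0), ("book", 250.0)])

# mapValues touches only the value part; the keys (and their partitioning) are kept.
with_tax = prices.mapValues(lambda p: round(p * 1.18, 2))

print(with_tax.collect())   # [('pen', 11.8), ('book', 295.0)]
```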
AggregateByKey()
• aggregateByKey() will combine the values for a particular key, and the result of such a combination can be any object that you specify. You have to specify how the values are combined ("added") inside one partition (which is executed on the same node) and how you combine the results from different partitions (which may be on different nodes). reduceByKey is a particular case, in the sense that the result of the combination (e.g. a sum) is of the same type as the values, and the operation when combining results from different partitions is the same as the operation when combining values inside a partition. The sketch below first shows the same operation used for everything, using reduceByKey; then, using aggregateByKey, 0 is the initial value, values are added inside a partition, and partial results are added between partitions.
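A sketch of both calls (illustrative data):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

marks = sc.parallelize([("math", 80), ("math", 90), ("physics", 70)])

# Same operation everywhere: a plain per-key sum with reduceByKey.
sums_rbk = marks.reduceByKey(lambda a, b: a + b)

# The equivalent with aggregateByKey: 0 is the initial value, the second
# argument adds values inside a partition, the third adds partial results
# coming from different partitions.
sums_abk = marks.aggregateByKey(0, lambda acc, v: acc + v, lambda a, b: a + b)

# aggregateByKey also allows a result type different from the value type,
# e.g. a (sum, count) pair per key that can later yield an average.
sum_count = marks.aggregateByKey(
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)

print(sums_rbk.collect())    # [('math', 170), ('physics', 70)]  (order may vary)
print(sums_abk.collect())    # same result
print(sum_count.collect())   # [('math', (170, 2)), ('physics', (70, 1))]
```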
join()
• A query that accesses multiple rows of the same or different
tables at one time is called a join query. Inner joins require
a key to be present in both RDDs whereas
Outer joins do not require a key to be present in both RDDs.
• All keys that appear in the final result of an inner join are common to rdd1 and rdd2. This is similar to the relational database operation INNER JOIN.
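A short sketch (illustrative data; leftOuterJoin is shown as one of the outer-join variants):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

names = sc.parallelize([(1, "alice"), (2, "bob"), (3, "carol")])
ages = sc.parallelize([(1, 34), (2, 41), (4, 29)])

# Inner join: only keys present in both RDDs (1 and 2) survive.
print(names.join(ages).collect())
# [(1, ('alice', 34)), (2, ('bob', 41))]  (order may vary)

# Outer joins keep keys that are missing on one side, filling in None.
print(names.leftOuterJoin(ages).collect())
# [(1, ('alice', 34)), (2, ('bob', 41)), (3, ('carol', None))]
```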
cogroup()
• Given two RDDs sharing the same key type K, with the types of the respective values as V and W, the resulting RDD is of type [K, (Iterable[V], Iterable[W])]. If a key appears in at least one of the two RDDs, it will appear in the final result.
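For example (illustrative data):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

emails = sc.parallelize([("alice", "a@x.com"), ("bob", "b@x.com")])
phones = sc.parallelize([("alice", "111"), ("carol", "333")])

# cogroup keeps every key that appears in either RDD, pairing two iterables
# of values (one per input RDD) under that key.
grouped = emails.cogroup(phones).mapValues(lambda vs: (list(vs[0]), list(vs[1])))

print(grouped.collect())
# [('alice', (['a@x.com'], ['111'])), ('bob', (['b@x.com'], [])), ('carol', ([], ['333']))]
# (order may vary)
```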
sortByKey()
• sortByKey() is part of OrderedRDDFunctions and works on key/value pairs. It receives key-value pairs (K, V) as input, sorts the elements by key in ascending or descending order, and generates a dataset in that order.
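For example (illustrative data):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

word_counts = sc.parallelize([("spark", 3), ("kafka", 5), ("hadoop", 1)])

# Sort by key in ascending order (pass ascending=False for descending).
print(word_counts.sortByKey().collect())
# [('hadoop', 1), ('kafka', 5), ('spark', 3)]
```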
