Unit 5 Apache Kafka Notes
Introduction
• In Big Data, an enormous volume of data is used.
Regarding data, we face two main challenges: the first
is how to collect the large volume of data, and the
second is how to analyze the collected data. To
overcome those challenges, you need a messaging
system.
• Kafka is designed for distributed high-throughput systems.
Kafka tends to work very well as a replacement for a more
traditional message broker. In comparison to other
messaging systems, Kafka has better throughput, built-in
partitioning, replication, and inherent fault tolerance,
which makes it a good fit for large-scale message
processing applications.
What is a Messaging System?
S.No Components and Description
2. Partition: Topics may have many partitions, so a topic can handle an arbitrary amount of data.
6. Kafka Cluster: A Kafka deployment having more than one broker is called a Kafka cluster. A Kafka cluster can be expanded without downtime. These clusters are used to manage the persistence and replication of message data.
7. Producers: Producers are the publishers of messages to one or more Kafka topics. Producers send data to Kafka brokers. Every time a producer publishes a message to a broker, the broker simply appends the message to the last segment file. Actually, the message is appended to a partition. Producers can also send messages to a partition of their choice.
8. Consumers: Consumers read data from brokers. Consumers subscribe to one or more topics and consume published messages by pulling data from the brokers.
9. Leader: The leader is the node responsible for all reads and writes for a given partition. Every partition has one server acting as the leader.
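
As a rough illustration of the producer side, here is a minimal sketch using the third-party kafka-python client. The topic name "my-topic" and the broker address localhost:9092 are assumptions for the example, not details from these notes.

from kafka import KafkaProducer

# Connect to an assumed broker running at localhost:9092
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Let Kafka choose the partition for this message
producer.send("my-topic", b"hello kafka")

# Or, as described above, send to a partition of our choice
producer.send("my-topic", b"hello partition 0", partition=0)

# Block until all buffered messages are actually sent
producer.flush()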
Cluster diagram of Kafka.
S.No Components and Description
1. Broker: A Kafka cluster typically consists of multiple brokers to maintain load balance. Kafka brokers are stateless, so they use ZooKeeper for maintaining their cluster state. One Kafka broker instance can handle hundreds of thousands of reads and writes per second, and each broker can handle terabytes of messages without performance impact. Kafka broker leader election is done by ZooKeeper.
2. ZooKeeper: ZooKeeper is used for managing and coordinating Kafka brokers. The ZooKeeper service mainly notifies producers and consumers about the presence of a new broker in the Kafka system or the failure of a broker. Based on the notification received from ZooKeeper regarding the presence or failure of a broker, producers and consumers decide how to coordinate their tasks with some other broker.
3. Producers: Producers push data to brokers. When a new broker is started, all producers search for it and automatically send messages to that new broker. A Kafka producer doesn't wait for acknowledgements from the broker and sends messages as fast as the broker can handle.
4. Consumers: Since Kafka brokers are stateless, the consumer has to keep track of how many messages have been consumed by using the partition offset. If the consumer acknowledges a particular message offset, it implies that the consumer has consumed all prior messages. The consumer issues an asynchronous pull request to the broker to have a buffer of bytes ready to consume. Consumers can rewind or skip to any point in a partition simply by supplying an offset value. The consumer offset value is tracked in ZooKeeper.
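
Since the consumer tracks its position via the partition offset, it can rewind at will. Here is a minimal sketch with kafka-python, again assuming a broker at localhost:9092 and a topic named "my-topic":

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         group_id="demo-group")

# Manually assign one partition so we control the offset ourselves
tp = TopicPartition("my-topic", 0)
consumer.assign([tp])

# Rewind to any point in the partition by supplying an offset value
consumer.seek(tp, 0)

for msg in consumer:
    print(msg.offset, msg.value)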
Workflow of Pub-Sub Messaging
Following is the step wise workflow of the Pub-Sub Messaging −
• Producers send messages to a topic at regular intervals.
• Kafka broker stores all messages in the partitions configured for that particular topic. It
ensures the messages are equally shared between partitions. If the producer sends
two messages and there are two partitions, Kafka will store one message in the first
partition and the second message in the second partition.
• Consumer subscribes to a specific topic.
• Once the consumer subscribes to a topic, Kafka will provide the current offset of the
topic to the consumer and also saves the offset in the Zookeeper ensemble.
• The consumer will request Kafka for new messages at a regular interval (like 100 ms).
• Once Kafka receives the messages from producers, it forwards these messages to the
consumers.
• The consumer will receive the message and process it.
• Once the messages are processed, the consumer will send an acknowledgement to the
Kafka broker.
• Once Kafka receives an acknowledgement, it changes the offset to the new value and
updates it in ZooKeeper. Since offsets are maintained in ZooKeeper, the
consumer can read the next message correctly even during server outages.
• The above flow will repeat until the consumer stops the request.
• The consumer has the option to rewind/skip to the desired offset of a topic at any time and
read all the subsequent messages, as sketched below.
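
The consumer side of this workflow can be sketched with kafka-python. Auto-commit is disabled so that the explicit commit() plays the role of the acknowledgement step above; the broker address, topic name, group id, and the process() helper are all assumptions for the example.

from kafka import KafkaConsumer

def process(value):
    # Hypothetical message handler for the example
    print("processed:", value)

consumer = KafkaConsumer("my-topic",
                         bootstrap_servers="localhost:9092",
                         group_id="demo-group",
                         enable_auto_commit=False)

while True:
    # Ask the broker for new messages at a regular interval (100 ms here)
    records = consumer.poll(timeout_ms=100)
    for partition, messages in records.items():
        for msg in messages:
            process(msg.value)
    # Acknowledge: commit the new offset so it survives restarts
    consumer.commit()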
Workflow of Queue Messaging / Consumer Group
In a queue messaging system, instead of a single consumer, a group of consumers having the
same Group ID subscribes to a topic. In simple terms, consumers subscribing to a topic with the
same Group ID are treated as a single group, and the messages are shared among them, as
sketched below.
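
A minimal sketch of this with kafka-python: two consumers created with the same group_id form one group, and Kafka shares the topic's partitions between them. The names and addresses are assumptions, and in practice each consumer would run in its own process.

from kafka import KafkaConsumer

# Both consumers use the same group_id, so they form a single
# consumer group and the topic's partitions are split between them
consumer_a = KafkaConsumer("my-topic",
                           bootstrap_servers="localhost:9092",
                           group_id="shared-group")
consumer_b = KafkaConsumer("my-topic",
                           bootstrap_servers="localhost:9092",
                           group_id="shared-group")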
Spark RDDs support two types of operations:
• Transformation
• Action
Broadcast variable
• Broadcast variables let the programmer keep a read-only
variable cached on each machine rather than shipping a copy of it
with tasks. Spark uses efficient broadcast algorithms to distribute
broadcast variables in order to reduce communication cost.
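
A minimal PySpark sketch of a broadcast variable; the lookup-table contents are made up for the example:

from pyspark import SparkContext

sc = SparkContext("local", "broadcast-demo")

# Cache a read-only lookup table on every machine instead of
# shipping a copy of it with each task
lookup = sc.broadcast({"a": 1, "b": 2})

rdd = sc.parallelize(["a", "b", "a"])
print(rdd.map(lambda x: lookup.value[x]).collect())  # [1, 2, 1]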
Accumulator
• Accumulators are variables that are used to perform associative
and commutative operations such as counters or sums. Spark
provides native support for accumulators of numeric types.
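
A minimal PySpark sketch of a numeric accumulator used as a sum:

from pyspark import SparkContext

sc = SparkContext("local", "accumulator-demo")

# An accumulator supports an associative, commutative operation
# (here: addition)
total = sc.accumulator(0)

sc.parallelize([1, 2, 3, 4]).foreach(lambda x: total.add(x))
print(total.value)  # 10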
SPARK: WORKING WITH PAIRED RDDS
• A pair RDD is one of the kinds of RDDs. These
RDDs contain key/value pairs of data. Pair
RDDs are a useful building block in many
programs, as they expose operations that allow
you to act on each key in parallel or regroup
data across the network. For example, pair
RDDs have a reduceByKey() method that can
aggregate data separately for each key, and
a join() method that can merge two RDDs
together by grouping elements with the same
key.
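
A minimal PySpark sketch of the join() case (reduceByKey() is covered in its own section below); the data are made up for the example:

from pyspark import SparkContext

sc = SparkContext("local", "pair-rdd-demo")

# Two pair RDDs: each element is a (key, value) tuple
totals = sc.parallelize([("apples", 7), ("pears", 2)])
prices = sc.parallelize([("apples", 0.5), ("pears", 0.8)])

# join() merges the two RDDs by grouping elements with the same key
print(totals.join(prices).collect())
# [('apples', (7, 0.5)), ('pears', (2, 0.8))]  (order may vary)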
Transformations on Pair RDDs
• Whatever transformations are available for a
standard RDD are also available for pair RDDs;
the only difference is that we need to pass
functions that operate on tuples rather than
on individual elements. Some examples
are map(), reduce(), and filter() (see the
filter() sketch below). Now, what new
transformations do pair RDDs provide us with?
Let's try some of those transformations:
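
First, a quick sketch of a standard transformation on a pair RDD: the function passed to filter() receives the whole (key, value) tuple. The data are made up for the example.

from pyspark import SparkContext

sc = SparkContext("local", "pair-filter-demo")

pairs = sc.parallelize([("short", 1), ("a much longer key", 2)])

# The function operates on the tuple, not on a single element
print(pairs.filter(lambda kv: len(kv[0]) < 10).collect())  # [('short', 1)]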
reduceByKey()
• It runs several parallel reduce operations, one for each key in the
dataset, where each operation combines values that have the same
key. Data are combined at each partition, with only one output per key
at each partition to send over the network. reduceByKey() requires
combining all your values into another value of the exact same type.
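
A classic word-count sketch of reduceByKey() in PySpark (the data are made up). Values are combined inside each partition first, so only one output per key per partition crosses the network:

from pyspark import SparkContext

sc = SparkContext("local[2]", "reduce-by-key-demo")

# Two partitions, so per-partition combining actually happens
words = sc.parallelize(["kafka", "spark", "kafka"], 2)

# Both inputs and the combined output are ints: the same type, as required
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())  # [('spark', 1), ('kafka', 2)]  (order may vary)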