Kafka Interview Questions
Source: https://github.com/OBenner/data-engineering-interview-questions/blob/master/content/kafka.md#What-is-Apache-Kafka
Producers can publish data to
Kafka topics without being concerned about how the data will be
processed. Similarly, consumers can read data from topics
without needing to coordinate with producers. This decoupling
simplifies system architecture and enhances flexibility.
In Apache Kafka, ISR stands for In-Sync Replicas. It's a concept related
to Kafka's high availability and fault tolerance mechanisms.
For each partition, Kafka maintains a list of replicas that are considered
"in-sync" with the leader replica. By default, the leader replica
handles all read and write requests for a specific partition, while the
follower replicas replicate the leader's log. A follower counts as
in-sync while it keeps fetching from the leader and has caught up to
the end of the leader's log within the window set by
replica.lag.time.max.ms. A message is considered committed once all
replicas in the ISR have replicated it.
The ISR ensures data durability and availability. If the leader fails,
Kafka can elect a new leader from the in-sync replicas, minimizing data
loss and downtime.
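As a rough illustration of how ISR membership could be tracked, the sketch below models a leader that drops a follower from the ISR once it has not been caught up within a lag window. The class, method, and follower names are hypothetical, the window only loosely mirrors the broker setting replica.lag.time.max.ms, and this is a simplified model rather than Kafka's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLeader:
    """Toy model of a partition leader tracking its in-sync replicas."""
    max_lag_ms: int                  # stands in for replica.lag.time.max.ms
    log_end_offset: int = 0          # offset the next appended record will get
    # follower id -> time (ms) at which it was last fully caught up
    last_caught_up_ms: dict = field(default_factory=dict)

    def append(self, n_records: int) -> None:
        """Simulate the leader appending records to its log."""
        self.log_end_offset += n_records

    def on_fetch(self, follower: str, fetch_offset: int, now_ms: int) -> None:
        """Record a fetch; the follower is caught up if it reached the log end."""
        if fetch_offset >= self.log_end_offset:
            self.last_caught_up_ms[follower] = now_ms

    def isr(self, now_ms: int) -> set:
        """The leader plus every follower caught up within the lag window."""
        in_sync = {
            f for f, t in self.last_caught_up_ms.items()
            if now_ms - t <= self.max_lag_ms
        }
        return {"leader"} | in_sync


leader = PartitionLeader(max_lag_ms=30_000)
leader.on_fetch("f1", 0, now_ms=0)
leader.on_fetch("f2", 0, now_ms=0)
leader.append(5)
leader.on_fetch("f1", 5, now_ms=35_000)   # f1 reaches the log end
leader.on_fetch("f2", 3, now_ms=35_000)   # f2 is still two records behind
print(sorted(leader.isr(now_ms=40_000)))  # ['f1', 'leader']
```

In this model f2 leaves the ISR because it was last caught up at time 0, which is outside the 30-second window, and it would rejoin as soon as a fetch reaches the end of the log again.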
If a replica stays out of the ISR (In-Sync Replicas) for a long time, it
indicates that the replica is not able to keep up with the leader's log
updates. This can be due to network issues, hardware failure, or high
load on the broker. While it is out of the ISR, the replica does not
count toward acknowledging writes sent with acks=all, and it cannot be
elected leader if the current leader fails (unless unclean leader
election is enabled). The partition therefore runs with reduced
redundancy, and if the ISR shrinks below min.insync.replicas, producers
using acks=all have their writes rejected.
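To make the effect of a shrinking ISR on write acknowledgement concrete, here is a minimal sketch of the check a leader applies to a produce request. The function name is hypothetical; the parameters are named after the producer setting acks and the topic/broker setting min.insync.replicas, and the logic is a simplification of the real broker behaviour.

```python
def accepts_write(isr_size: int, min_insync_replicas: int, acks: str) -> bool:
    """Return True if the leader would accept a produce request.

    Simplified model: min.insync.replicas is only enforced for acks="all";
    with acks="0" or acks="1" the leader does not check the ISR size.
    """
    if acks == "all":
        return isr_size >= min_insync_replicas
    return True


# Replication factor 3, min.insync.replicas=2 (illustrative values).
print(accepts_write(isr_size=3, min_insync_replicas=2, acks="all"))  # True
print(accepts_write(isr_size=1, min_insync_replicas=2, acks="all"))  # False
print(accepts_write(isr_size=1, min_insync_replicas=2, acks="1"))    # True
```

The second call corresponds to the case where two of three replicas have fallen out of the ISR: the partition is still online, but producers that require acks=all are refused until a follower catches up.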
The preferred replica is the first replica in a partition's replica
assignment list, and Kafka tries to make it the leader so that
leadership stays evenly spread across brokers. If the preferred replica
is not in the In-Sync Replicas (ISR), it cannot be elected leader, so
another in-sync replica continues to act as leader, and a preferred
leader election does not move leadership until the preferred replica
catches up and rejoins the ISR. Producers and consumers are not blocked
in the meantime: they learn the current leader from cluster metadata
and send their requests there. The main side effect is that leadership,
and therefore load, may be unevenly distributed across brokers.
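The relationship between the preferred replica and the ISR can be sketched as a small leader-selection function. This is a simplified illustration under stated assumptions, not the controller's real algorithm: the function name and broker ids are hypothetical, the preferred replica is taken to be the first entry of the replica assignment list, and only in-sync replicas are eligible (unclean leader election is ignored).

```python
from typing import Optional


def choose_leader(replicas: list, isr: set, current_leader: Optional[int]) -> Optional[int]:
    """Pick a leader for one partition in a simplified model.

    replicas: broker ids in assignment order; replicas[0] is the preferred replica.
    isr: broker ids currently in sync.
    """
    preferred = replicas[0]
    if preferred in isr:
        return preferred            # leadership can move to the preferred replica
    if current_leader in isr:
        return current_leader       # otherwise the current leader stays in place
    for broker in replicas:         # leader lost: first in-sync replica in list order
        if broker in isr:
            return broker
    return None                     # no in-sync replica: partition is offline


print(choose_leader([1, 2, 3], isr={1, 2, 3}, current_leader=2))  # 1
print(choose_leader([1, 2, 3], isr={2, 3}, current_leader=2))     # 2
print(choose_leader([1, 2, 3], isr={3}, current_leader=2))        # 3
```

In the second call broker 1 is preferred but out of sync, so broker 2 keeps leadership and clients continue to talk to it.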
Apache Kafka, Apache Storm, and Apache Flink are all distributed
systems designed for processing large volumes of data, but they serve
different purposes and operate differently:
In Kafka, the offset is a sequential identifier for each record within a
partition of a topic. It is unique within that partition and denotes the
record's position in it. Consumers use offsets to track which records
have been read and which haven't, allowing for fault-tolerant and
scalable message consumption. By committing offsets, a consumer can
resume from the last committed position after a failure or restart.
Whether records can be missed or read more than once depends on when
offsets are committed: committing after processing gives at-least-once
delivery (duplicates are possible), while committing before processing
gives at-most-once delivery (loss is possible).
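The role of the committed offset on restart can be shown with a small in-memory model. The sketch below is not a Kafka client: the partition is a plain Python list, the function name and the crash_before_commit flag are hypothetical, and offsets are committed only after processing.

```python
def consume(log, committed_offset, max_records, crash_before_commit=False):
    """Read a batch starting at the committed offset, then commit.

    Returns (records_processed, new_committed_offset). If the consumer
    crashes after processing but before committing, the offset is unchanged.
    """
    batch = log[committed_offset:committed_offset + max_records]
    processed = list(batch)                 # stand-in for real processing
    if crash_before_commit:
        return processed, committed_offset  # commit never happened
    return processed, committed_offset + len(batch)


log = ["a", "b", "c", "d", "e"]             # one partition; list index = offset

seen, committed = consume(log, 0, max_records=2)
print(seen, committed)                      # ['a', 'b'] 2

seen, committed = consume(log, committed, max_records=2, crash_before_commit=True)
print(seen, committed)                      # ['c', 'd'] 2

seen, committed = consume(log, committed, max_records=2)
print(seen, committed)                      # ['c', 'd'] 4
```

After the simulated crash the consumer resumes at offset 2 and processes "c" and "d" a second time. Nothing is lost, but duplicates occur, which is the at-least-once behaviour of committing after processing.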
Leader: For each partition of a topic, there is one broker that acts
as the leader. The leader is responsible for handling all read and
write requests for that partition. When messages are produced to
a partition, they are sent to the leader broker, which then writes
the messages to its local storage. The leader broker ensures that
messages are stored in the order they are received.
Follower: Followers are other brokers in the cluster that replicate
the data of the leader for fault tolerance. Each follower
continuously pulls messages from the leader to stay up-to-date,
so that its log matches the leader's. If the leader broker fails,
one of the in-sync followers is elected as the
new leader, ensuring high availability.
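The leader and follower roles described above can be sketched with a toy replica class in which the follower pulls records starting from the end of its own log. The class, method, and parameter names are hypothetical, and the model leaves out networking, acknowledgements, and leader election.

```python
class Replica:
    """Toy replica holding an append-only log for one partition."""

    def __init__(self, broker_id: int):
        self.broker_id = broker_id
        self.log = []

    def append(self, record) -> int:
        """Leader-side write: append in arrival order and return the offset."""
        self.log.append(record)
        return len(self.log) - 1

    def fetch_from(self, leader: "Replica", max_records: int = 100) -> int:
        """Follower-side pull: copy records it does not have yet."""
        start = len(self.log)               # next offset this follower needs
        new_records = leader.log[start:start + max_records]
        self.log.extend(new_records)
        return len(new_records)


leader, follower = Replica(1), Replica(2)
for record in ["a", "b", "c"]:
    leader.append(record)

follower.fetch_from(leader, max_records=2)
print(follower.log)                         # ['a', 'b']
follower.fetch_from(leader)
print(follower.log == leader.log)           # True
```

Once the follower's log matches the leader's, it holds everything needed to take over as leader if broker 1 fails.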