Data Engineering 101
Kafka Core Concepts
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101 - Kafka
Kafka Broker
A Kafka broker is a server that
runs the Kafka software and is
responsible for storing and
serving data. Brokers receive
messages from producers, assign
offsets to messages, and store
them on disk. Example: In a Kafka
cluster, multiple brokers work
together to ensure data is reliably
stored and served.
Topics
Topics are logical channels to
which messages are sent by
producers and from which
messages are read by
consumers. A topic is divided into
multiple partitions to allow
parallel processing.
Example: A "user_activity" topic
might be divided into several
partitions to handle high
message volume.
Partitions
Partitions are subdivisions of
topics. Each partition is an
ordered, immutable sequence of
messages that is continually
appended to. Partitions enable
Kafka to scale horizontally while
preserving message order within
each partition.
Example: Partition 0 of the
"user_activity" topic stores
messages for a specific subset of
users.
Producers
Producers are clients that send
messages to Kafka topics. They
can send messages to specific
partitions based on a partitioning
strategy or distribute them evenly
across all partitions.
Example: A web application that
logs user activity sends these logs
to a Kafka topic as messages.
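The key-based partitioning strategy described above can be illustrated with a minimal sketch. It assumes every message carries a key and uses CRC32 as a stand-in hash; real Kafka clients use their own hash function, and the names `choose_partition` and `events` are hypothetical.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index (illustrative only)."""
    # CRC32 is an assumption made for this sketch, not Kafka's actual hash.
    return zlib.crc32(key) % num_partitions


# Hypothetical user-activity events keyed by user id.
events = [(b"user-1", "login"), (b"user-2", "click"), (b"user-1", "logout")]

for key, action in events:
    # The same key always lands on the same partition, which keeps
    # one user's events in order relative to each other.
    print(key.decode(), action, "-> partition", choose_partition(key, 3))
```

Because the mapping depends only on the key and the partition count, changing the number of partitions changes where keys land.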
Consumers
Consumers are clients that read
messages from Kafka topics.
Consumers can operate
individually or as part of a
consumer group, which allows for
parallel processing of messages.
Example: An analytics service
reads user activity logs from a
Kafka topic to generate reports.
Consumer Groups
Consumer groups allow multiple
consumers to collaborate on
processing messages from a
topic. Each partition in a topic is
assigned to only one consumer
within a group at a time, ensuring
parallel processing and load
balancing.
Example: Three consumers in a
group process messages from six
partitions.
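The three-consumers, six-partitions example can be sketched as a simple round-robin assignment. This is a simplified model, not any specific Kafka assignment strategy; `assign_round_robin` and the consumer names are hypothetical.

```python
def assign_round_robin(partitions, consumers):
    """Give each partition to exactly one consumer, cycling through consumers."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Six partitions shared by a group of three consumers.
assignment = assign_round_robin(range(6), ["c1", "c2", "c3"])
for consumer, owned in assignment.items():
    print(consumer, "reads partitions", owned)
```

Each partition has a single owner in the group, so adding consumers beyond the partition count would leave the extra consumers idle.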
Offsets
Offsets are unique identifiers
assigned to each message within
a partition. Consumers use offsets
to track which messages have
been read.
Example: A consumer reads
messages up to offset 105 and
resumes from offset 106 after a
restart.
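The restart example above can be sketched with an in-memory dictionary standing in for one partition. The function `consume` and the sample data are hypothetical; the point is that the saved position is the next offset to read.

```python
# A partition modeled as offset -> message (offsets 100..109).
partition_log = {offset: f"event-{offset}" for offset in range(100, 110)}


def consume(log, start_offset, max_messages):
    """Read up to max_messages from start_offset; return (messages, next_offset)."""
    messages = []
    offset = start_offset
    while offset in log and len(messages) < max_messages:
        messages.append(log[offset])
        offset += 1
    return messages, offset


# Before the restart: read offsets 100 through 105.
first_batch, next_offset = consume(partition_log, 100, 6)
committed = next_offset  # saved position: the next offset to read

# After the restart: resume from the saved position.
after_restart, _ = consume(partition_log, committed, 10)
print("resumed at", committed, "with", after_restart[0])
```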
Kafka Cluster
A Kafka cluster is composed of
multiple brokers that work
together. Clusters provide fault
tolerance and high availability.
Example: A cluster with three
brokers can continue operating if
one broker fails.
Replication
Kafka replicates partitions across
multiple brokers to ensure fault
tolerance. Each partition has a
leader and several followers. By
default, the leader handles all
reads and writes, while followers
replicate the data.
Example: Partition 0 has one
leader and two followers across
three brokers.
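A rough model of leader failover for the one-leader, two-follower example: when the leader's broker fails, another in-sync replica takes over. The function `elect_leader` and the broker ids are hypothetical, and real Kafka performs this election inside the cluster controller.

```python
def elect_leader(replicas, in_sync, failed):
    """Return the first replica that is in sync and still alive, or None."""
    for broker in replicas:
        if broker in in_sync and broker not in failed:
            return broker
    return None


replicas = [1, 2, 3]   # broker ids hosting partition 0
in_sync = {1, 2, 3}    # replicas that are fully caught up

leader = elect_leader(replicas, in_sync, failed=set())
new_leader = elect_leader(replicas, in_sync, failed={1})
print("leader:", leader, "after broker 1 fails:", new_leader)
```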
ZooKeeper
In ZooKeeper-based deployments,
ZooKeeper provides distributed
coordination and metadata
management for Kafka: broker
metadata, controller election, and
configuration. Newer Kafka
versions can run without it by
using KRaft mode.
Example: If the broker leading a
partition fails, the controller
(coordinated through ZooKeeper)
elects a new leader.
Producers and
ACKs
Producers send messages to
brokers and can configure
acknowledgment settings (ACKs)
to ensure reliable message
delivery.
Example: A producer configures
ACKs to wait for confirmation
from all in-sync replicas before
considering a message sent.
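The producer `acks` setting accepts values such as `0`, `1`, and `all`. The sketch below models how many confirmations each value waits for; the helper names `acks_needed` and `is_delivered` are hypothetical and the model ignores timeouts and retries.

```python
def acks_needed(acks: str, in_sync_replicas: int) -> int:
    """Number of replica confirmations the producer waits for (simplified)."""
    if acks == "0":
        return 0                  # do not wait for any confirmation
    if acks == "1":
        return 1                  # wait for the leader only
    if acks == "all":
        return in_sync_replicas   # wait for the leader and in-sync followers
    raise ValueError(f"unsupported acks value: {acks!r}")


def is_delivered(acks: str, confirmations: int, in_sync_replicas: int) -> bool:
    """True once enough replicas have confirmed the write."""
    return confirmations >= acks_needed(acks, in_sync_replicas)


for setting in ("0", "1", "all"):
    print(setting, "->", acks_needed(setting, in_sync_replicas=3), "confirmations")
```

Higher settings trade latency for durability: `all` waits longest but survives the loss of the leader.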
Retention Policy
Kafka topics can have retention
policies that determine how long
messages are stored. Policies can
be time-based or size-based.
Example: A topic is configured to
retain messages for 7 days, after
which they are deleted.
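A time-based policy like the 7-day example can be modeled as a filter on record age. This is a simplification: Kafka removes whole log segments rather than individual records, and the `Record` type and `apply_time_retention` function are hypothetical.

```python
from dataclasses import dataclass

DAY_MS = 24 * 60 * 60 * 1000


@dataclass
class Record:
    offset: int
    timestamp_ms: int


def apply_time_retention(records, now_ms, retention_ms):
    """Keep only records whose age is within the retention period."""
    return [r for r in records if now_ms - r.timestamp_ms <= retention_ms]


# Records written on day 0, day 5, and day 9; "now" is day 10.
records = [Record(0, 0), Record(1, 5 * DAY_MS), Record(2, 9 * DAY_MS)]
kept = apply_time_retention(records, now_ms=10 * DAY_MS, retention_ms=7 * DAY_MS)
print("offsets kept:", [r.offset for r in kept])
```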
Log Compaction
Log compaction ensures that only
the latest message for each key is
retained in a topic, useful for
maintaining the latest state.
Example: A log-compacted topic
retains only the latest update for
each user profile.
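The effect of compaction on the user-profile example can be shown with a small sketch that keeps the last value seen for each key. The function `compact` and the sample log are hypothetical; real compaction runs in the background and works on log segments.

```python
def compact(log):
    """Keep only the most recent (offset, key, value) for each key."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later entries overwrite earlier ones
    return sorted((offset, key, value) for key, (offset, value) in latest.items())


# Profile updates in the order they were written.
log = [("alice", "v1"), ("bob", "v1"), ("alice", "v2")]
compacted = compact(log)
print(compacted)
```

Surviving records keep their original offsets, so the compacted log has gaps rather than renumbered entries.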
Kafka Connect
Kafka Connect is a framework for
integrating Kafka with other data
systems. It provides connectors to
move data in and out of Kafka.
Example: Using Kafka Connect to
sync data between a MySQL
database and a Kafka topic.
Kafka Streams
Kafka Streams is a library for
building stream processing
applications on top of Kafka. It
allows processing and
transforming data in real time.
Example: An application using
Kafka Streams aggregates
clickstream data to generate
real-time metrics.
MirrorMaker
MirrorMaker is a tool for
replicating data between Kafka
clusters, often used for cross-
datacenter replication.
Example: Using MirrorMaker to
replicate messages from a
primary datacenter to a backup
datacenter.
Kafka API
Kafka provides APIs for producing,
consuming, and managing data,
including Producer API, Consumer
API, and Admin API.
Example: Using the Producer API
to send messages from a Java
application to a Kafka topic.
Security
Kafka supports various security
features, including SSL encryption,
SASL authentication, and ACLs for
authorization.
Example: Configuring SSL to
encrypt data in transit and SASL
for client authentication.
AdminClient API
The AdminClient API allows
programmatic management of
Kafka topics, brokers, and
configurations.
Example: Using AdminClient to
create a new topic and configure
its retention policy.
Monitoring and
Metrics
Kafka provides metrics for
monitoring cluster health and
performance. Tools like
Prometheus and Grafana can be
used to visualize these metrics.
Example: Monitoring consumer
lag and broker health using
Prometheus and Grafana
dashboards.
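Consumer lag, mentioned in the example above, is commonly computed per partition as the log end offset minus the group's committed offset. A minimal sketch with made-up numbers; `consumer_lag` is a hypothetical helper, not a Kafka API.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: messages written but not yet consumed by the group."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


log_end_offsets = {0: 120, 1: 80}     # next offset to be written, per partition
committed_offsets = {0: 106, 1: 80}   # next offset the group will read

lag = consumer_lag(log_end_offsets, committed_offsets)
print("lag per partition:", lag, "total:", sum(lag.values()))
```

A lag that keeps growing usually means consumers are slower than producers on that partition.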
Message Delivery
Semantics
Kafka supports three types of
message delivery semantics: at
most once, at least once, and
exactly once.
Example: Configuring a producer
for exactly-once delivery to
ensure no message is lost or
duplicated.
Stateful
Processing
Kafka Streams supports stateful
processing, allowing applications
to maintain state across
messages using state stores.
Example: A stream processing
application that maintains a
running count of events over a
window of time.
Windowed
Operations
Kafka Streams provides support
for windowed operations,
enabling time-based
aggregations and
transformations.
Example: Calculating the average
number of user clicks per minute
using windowed operations.
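The clicks-per-minute example can be sketched with fixed, non-overlapping one-minute windows. This is plain Python rather than the Kafka Streams API; `clicks_per_window` and the sample timestamps are hypothetical.

```python
from collections import Counter

WINDOW_S = 60  # window length in seconds


def clicks_per_window(timestamps_s):
    """Count clicks in each fixed one-minute window, keyed by window start."""
    counts = Counter()
    for ts in timestamps_s:
        window_start = ts - (ts % WINDOW_S)
        counts[window_start] += 1
    return dict(counts)


clicks = [0, 15, 59, 60, 61, 130]  # click times in seconds
per_minute = clicks_per_window(clicks)

# Average over windows that received at least one click.
average = sum(per_minute.values()) / len(per_minute)
print(per_minute, "average per minute:", average)
```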
KSQL
KSQL (now ksqlDB) is a SQL-like interface for
stream processing in Kafka,
simplifying the creation of stream
processing applications.
Example: Using KSQL to filter,
aggregate, and transform
streams of data in real time.
Kafka Ecosystem
Kafka's ecosystem includes
various tools and frameworks for
comprehensive data processing,
such as Kafka Connect, Kafka
Streams, and KSQL.
Example: Integrating Kafka with a
relational database using Kafka
Connect and processing the data
with Kafka Streams.
Publish/Subscribe
Messaging
Pub/Sub systems allow
decoupling of message
producers and consumers. Kafka
acts as a broker facilitating this.
Example: An application publishes
user activity logs which can be
consumed by analytics services.
Message and
Batches
Messages are the basic unit of
data in Kafka, stored as byte
arrays. Messages are written in
batches for efficiency.
Example: A batch of log messages
sent from an application.
Schemas
Schemas define the structure of
messages, ensuring consistency.
Apache Avro is a common
serialization framework used with
Kafka.
Example: Avro schema for user
profile data.
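An Avro schema is written as JSON. The sketch below builds one possible schema for user profile data as a Python dictionary and prints it; the record name, namespace, and fields are hypothetical and not taken from any real system.

```python
import json

# Hypothetical Avro record schema for a user profile.
user_profile_schema = {
    "type": "record",
    "name": "UserProfile",
    "namespace": "com.example",
    "fields": [
        {"name": "user_id", "type": "string"},
        # Optional field: a union with "null" listed first and a null default.
        {"name": "email", "type": ["null", "string"], "default": None},
        {"name": "signup_ts", "type": "long"},
    ],
}

schema_json = json.dumps(user_profile_schema, indent=2)
print(schema_json)
```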
Topics and
Partitions
Topics are categories to which
messages are published. Topics
are divided into partitions for
scalability; redundancy comes
from replicating those partitions.
Example: A "user_activity" topic
with partitions.
Producers and
Consumers
Producers create and send
messages to Kafka topics.
Consumers read messages from
topics.
Example: A microservice
producing order data and
another consuming for
processing.
Brokers and
Clusters
A broker is a Kafka server that
stores data and serves clients.
Multiple brokers form a Kafka
cluster, providing fault tolerance
and scalability.
Example: A Kafka cluster with
three brokers.
Disk-Based
Retention
Kafka retains messages on disk
for a configured period, allowing
consumers to read at their pace.
Example: Retaining logs for 7 days.
Multiple Producers
and Consumers
Kafka supports multiple
producers and consumers for the
same topic, enabling flexible data
pipelines.
Example: Multiple sensors
producing data to a single topic,
multiple analytics services
consuming it.
High Throughput
Kafka can handle large volumes
of messages efficiently because it
appends to partitioned logs
sequentially and batches messages.
Example: Processing millions of
log entries per second.
Stream Processing
Kafka supports real-time
processing of streams of data
using tools like Kafka Streams.
Example: Real-time analytics on
incoming transaction data.
Kafka Connect
Kafka Connect simplifies the
integration of Kafka with other
data systems.
Example: Using Kafka Connect to
sync data between a database
and a Kafka topic.
Kafka Streams API
Kafka Streams API allows building
stream processing applications
with Kafka.
Example: An application that
aggregates user clickstream data
in real-time.
Log Compaction
Kafka can retain only the latest
message per key in a log-
compacted topic, useful for
changelog data.
Example: Keeping only the latest
update to user profiles.
Exactly Once
Semantics
Kafka can provide exactly-once
processing, even in distributed
systems, when idempotent
producers and transactions are
used.
Example: Financial transactions
processed without duplicates.
Idempotent
Producer
Producers can safely retry
sending messages without
duplicating them.
Example: Retrying a payment
confirmation message after a
timeout without writing it to the
topic twice.
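Safe retries can be modeled by having the log remember the highest sequence number it has accepted from each producer and ignore anything it has already seen. This is a simplified sketch; the `PartitionLog` class and its fields are hypothetical.

```python
class PartitionLog:
    """Toy partition that drops retried writes it has already stored."""

    def __init__(self):
        self.records = []
        self.last_sequence = {}  # producer id -> highest sequence appended

    def append(self, producer_id, sequence, value):
        """Append value unless this sequence number was already written."""
        if sequence <= self.last_sequence.get(producer_id, -1):
            return False  # duplicate caused by a retry; ignore it
        self.records.append(value)
        self.last_sequence[producer_id] = sequence
        return True


log = PartitionLog()
first_try = log.append("producer-1", 0, "payment-confirmed")
# The acknowledgment was lost, so the producer retries the same message.
retry = log.append("producer-1", 0, "payment-confirmed")
print("stored records:", log.records)
```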
Transactions
Kafka supports atomic writes
across multiple partitions and
topics using transactions.
Example: Ensuring that a series of
related messages are either all
written or none are.
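The all-or-nothing behavior can be illustrated with a toy writer that buffers messages until commit. Real Kafka brokers implement transactions differently; the `TransactionalWriter` class and partition names are hypothetical and only show the visible outcome.

```python
class TransactionalWriter:
    """Toy model: buffered writes become visible only on commit."""

    def __init__(self, partitions):
        self.partitions = partitions  # partition name -> committed messages
        self.pending = []

    def send(self, partition, message):
        if partition not in self.partitions:
            raise KeyError(f"unknown partition: {partition}")
        self.pending.append((partition, message))

    def commit(self):
        for partition, message in self.pending:
            self.partitions[partition].append(message)
        self.pending = []

    def abort(self):
        self.pending = []  # discard everything sent in this transaction


partitions = {"orders-0": [], "payments-0": []}
writer = TransactionalWriter(partitions)

writer.send("orders-0", "order-42 created")
writer.send("payments-0", "order-42 charged")
writer.abort()
visible_after_abort = {name: list(msgs) for name, msgs in partitions.items()}

writer.send("orders-0", "order-42 created")
writer.send("payments-0", "order-42 charged")
writer.commit()
print(visible_after_abort, partitions)
```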
MirrorMaker
Tool for replicating Kafka topics
across clusters, useful for disaster
recovery and multi-datacenter
setups.
Example: Mirroring production
data to a backup datacenter.
Security
Kafka supports authentication,
authorization, and encryption to
secure data.
Example: Using SSL for encrypting
data in transit.
Kafka AdminClient
AdminClient API allows
programmatic management of
Kafka.
Example: Creating topics, altering
configurations programmatically.
Monitoring and
Metrics
Kafka provides metrics and
monitoring tools to track cluster
performance.
Example: Monitoring consumer
lag and broker health.
Serialization and
Deserialization
Data must be serialized to bytes
before it is sent to Kafka; common
formats include Avro and JSON.
Example: Serializing user data to
Avro format before sending to
Kafka.
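A minimal round trip shows what serialization means in practice: a record becomes bytes on the producer side and a record again on the consumer side. The sketch uses JSON from the standard library in place of Avro, and the `serialize` and `deserialize` helpers and sample record are hypothetical.

```python
import json


def serialize(record: dict) -> bytes:
    """Turn a record into bytes suitable for a message value."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def deserialize(payload: bytes) -> dict:
    """Turn message bytes back into a record."""
    return json.loads(payload.decode("utf-8"))


user = {"user_id": "u-1", "email": "u1@example.com"}
payload = serialize(user)    # what a producer would send
restored = deserialize(payload)  # what a consumer would recover
print(type(payload).__name__, len(payload), restored == user)
```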
Message Ordering
Kafka maintains the order of
messages within a partition,
important for consistency.
Example: Ensuring order of
transaction logs.
Consumer Group
Consumers can join groups to
balance load so that each
message is delivered to only one
consumer in the group.
Example: Multiple consumers
processing a high-volume topic
collaboratively.
Offset
Management
Consumers commit the offsets
they have processed so that Kafka
can track their progress.
Example: Storing offsets in Kafka
to resume processing after a
restart.
Topic Replication
Kafka replicates partitions across
multiple brokers for fault
tolerance.
Example: A partition replicated
across three brokers to handle
broker failure.
Message
Compression
Kafka supports compressing
messages to save bandwidth and
storage.
Example: Compressing log
messages before sending to
Kafka.
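The saving from compressing a batch of similar log messages can be seen with gzip, one of the codecs Kafka supports. This sketch uses Python's standard library directly instead of a Kafka client, and the sample log lines are made up.

```python
import gzip

# A batch of repetitive log lines, joined into one payload.
messages = [f"INFO request {i} handled".encode("utf-8") for i in range(1000)]
batch = b"\n".join(messages)

compressed = gzip.compress(batch)
restored = gzip.decompress(compressed)

# Repetitive text compresses well, so the compressed batch is smaller.
print("original bytes:", len(batch), "compressed bytes:", len(compressed))
```

Compressing a whole batch tends to work better than compressing messages one at a time, because repeated text across messages is shared.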
ZooKeeper
Kafka uses ZooKeeper for
distributed coordination and
metadata management.
Example: ZooKeeper managing
broker metadata and leader
election.
Kafka Broker
Configuration
Brokers can be configured for
performance, retention policies,
and more.
Example: Configuring a broker to
retain messages for 30 days.
Producer
Configuration
Producers have configurable
parameters for message delivery,
retries, and more.
Example: Setting producer retries
to handle transient failures.
Consumer
Configuration
Consumers can be configured for
fetch sizes, timeout settings, and
more.
Example: Configuring consumer
fetch size for optimal
performance.
Topic
Management
Topics can be created, deleted,
and managed programmatically
or via CLI.
Example: Creating a new topic for
storing event logs.
Quotas and
Throttling
Kafka supports setting quotas to
control resource usage by clients.
Example: Throttling a high-
volume producer to prevent
overwhelming the cluster.
Rebalance
Protocol
Kafka handles rebalancing of
consumers within a group to
maintain load balance.
Example: Rebalancing partitions
when a new consumer joins the
group.
Kafka REST Proxy
Provides a RESTful interface to
interact with Kafka clusters.
Example: Sending messages to
Kafka using HTTP requests.
Kafka API
Kafka provides APIs for producing,
consuming, and managing data.
Example: Using the Kafka
Producer API to send messages
from a Java application.
Schema Registry
Confluent Schema Registry
manages and enforces schemas
for Kafka messages.
Example: Ensuring all messages in
a topic follow a predefined
schema.
Kafka Streams DSL
A high-level API for stream
processing in Kafka.
Example: Using Kafka Streams DSL
to filter and transform a stream of
events.
Fault Tolerance
Kafka achieves high availability
and fault tolerance by replicating
partitions across brokers.
Example: Automatic failover to
replicas when a broker fails.
Real-Time
Analytics
Kafka supports real-time data
analytics and processing.
Example: Real-time dashboard
updating with live metrics from
Kafka.
ETL Pipelines
Kafka can be used to build
efficient ETL pipelines for data
integration.
Example: Extracting data from
databases, transforming it, and
loading it into a data warehouse
via Kafka.
Kafka Upgrades
Kafka supports rolling upgrades
to minimize downtime.
Example: Upgrading Kafka brokers
without disrupting message flow.
Message
Timestamping
Kafka messages can have
timestamps for time-based
processing.
Example: Using timestamps for
event time processing in Kafka
Streams.
State Stores
Kafka Streams allows maintaining
stateful processing with state
stores.
Example: Counting occurrences of
events over a window of time
using state stores.
Windowed
Operations
Kafka Streams supports
windowed operations for
aggregations over time windows.
Example: Calculating the sum of
transactions every minute.
KSQL
KSQL is a SQL-like interface for
stream processing with Kafka.
Example: Using KSQL to perform
real-time filtering and
aggregations on Kafka topics.
Kafka Ecosystem
Kafka’s ecosystem includes tools
like Connect, Streams, KSQL, and
more for comprehensive data
processing.
Example: Using Kafka Connect to
integrate with databases, Kafka
Streams for processing, and KSQL
for querying streams.
Kafka Connectors
Pre-built connectors for
integrating Kafka with various
data sources and sinks.
Example: Using a JDBC connector
to sync data between a database
and Kafka.
Kafka Cluster
Management
Tools and practices for managing
Kafka clusters efficiently.
Example: Using tools like Kafka
Manager for monitoring and
managing cluster health.
Tiered Storage
Kafka’s tiered storage allows
offloading older data to cheaper
storage.
Example: Storing older Kafka topic
data in S3 to reduce on-prem
storage costs.