
Apache kafka notes

02 November 2023 05:34 PM

Domain 1.0: Kafka fundamentals


Apache kafka
• What is an event?
o Anything that has happened and can be recorded (a change in a state of a thing)
▪ Examples
• Purchased a product
• Payment made successfully
• Click of a link
• Impression you showed on a specific area of the webpage
o It’s a combination of
▪ notifications
▪ state
• it is small in size (often well under a MB)
• represented in a structured format
o Avro
o Parquet
o JSON
o Protocol Buffers
▪ it is serialized into some format before being sent
o An event in Kafka is a sequence of bytes
▪ timestamp
• by default, it's the time the message was produced
• we can explicitly tell the producer to use the timestamp of when the event actually occurred rather than the current time
▪ key
• used for enforcing ordering
• used for co-locating data with the same key in the same partition
• used for key-based retention (compaction)
▪ Value
• Payload is typically in the value (actual info)
▪ optional headers
• metadata about the data the event holds
• read by consumers to make meaningful decisions.
o A continuous sequence of events is known as a stream
o Both keys and values are byte arrays
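A minimal Java sketch of these event parts: a record with a key, value, explicit event-time timestamp and an optional header. The topic name, broker address and payload are illustrative assumptions, not values from the notes.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class EventAnatomyExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            long occurredAt = 1698917640000L;                // event-time timestamp instead of "now"
            ProducerRecord<String, String> record = new ProducerRecord<>(
                    "thermostat-readings",                   // topic (hypothetical)
                    null,                                    // let the partitioner pick the partition
                    occurredAt,                              // explicit timestamp
                    "thermostat-42",                         // key: co-locates all readings of this device
                    "{\"tempC\": 21.5}");                    // value: the actual payload
            // optional header: metadata consumers can use to make decisions
            record.headers().add("source", "building-7".getBytes(StandardCharsets.UTF_8));
            producer.send(record);
        }
    }
}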
• Apache kafka
o Loosely typed (does not enforce any structure on the data)
o Distributed messaging system
o Highly scalable
o Native language is java
o High-level architecture
▪ Storage layer
• Kafka cluster
o Distributed system
o Consist of kafka brokers
▪ Primitive APIs
• Consumer API
o Read and process events from storage layer
• Producer API
o Publish event to storage layer
▪ Higher-level APIs
• Kafka Connect API
o Source connectors
▪ Integrate the rest of the ecosystem with Kafka by importing data into Kafka
o Sink connectors
▪ Flow the data out of Kafka to the rest of the ecosystem
• Processing API
o Kafka streams API
o ksqlDB
▪ SQL-like syntax that allows applications to process the data continuously
o the storage layer is independent of the processing layer
▪ each can be scaled out independently whenever needed
o data retention in Kafka
▪ the default is 1 week, set globally
• retention can be configured per topic or globally, including keeping data forever
o troubleshooting
▪ confluent control center
• commercial offering to monitor kafka cluster
▪ use logs to troubleshoot what's happening with the cluster
• each node will have its own log
o can be configured to use centralized logging instead
o security
▪ encrypt data in transit
• client to broker
• broker to broker
• broker to zookeeper
▪ authenticate and authorise communication from client to cluster
• the client presents a name and password (or other credentials)
o the cluster then checks whether the user is authorised to perform the requested operation
▪ authorization is managed by the cluster itself, which maps principals to capabilities
▪ no encryption at rest
• data stored in partition is not encrypted
o alternatives
▪ deploy disk/volume encryption on the brokers
▪ implement encryption at the application level (see the sketch at the end of this section)
• write a wrapper around the producer and consumer libraries
o performs the encryption/decryption
o handles the keys at the application level
o control and data plane
▪ control plane
• controls metadata
▪ data plane
• controls the actual data
• Kafka topics
o Fundamental unit of events organization
o Acts like a table in a relational database
▪ It has got a name
▪ Contains things which are similar to one another in the same table
o Different topics are created to hold different types of events
o Topics can also be created to hold duplicated or filtered versions of events
▪ e.g., one topic holds all the thermostat data
▪ another topic holds only the readings from places where the temperature was hot
o Sometimes referred to as queues, but they are not really queues
▪ Logs/Messages in topics are immutable
▪ They are append only
• Cannot put a message in the middle if you have missed it
• Always goes at the end
▪ Can only be read by offset
• Messages are not indexed by their contents
o An index would let us search for what is inside a message; Kafka does not provide this
o Logs/messages in kafka topic are durable
▪ Nothing is temporary
▪ Can be configured to expire by age (retention period) or by size
▪ They are like files on a disk
• Logs are persistent because they are stored on disk
▪ Reading the messages will not delete them.
• They can be read over and over again if needed.
o Compacted topic
▪ Used when you only want to keep the most recent value for each key
• Saves storage
▪ Compaction is configured per topic
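A small Java AdminClient sketch showing how per-topic retention and compaction can be set at creation time. Topic names, partition/replica counts and the 30-day retention value are illustrative assumptions.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateTopicsExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            // topic that keeps events for 30 days instead of the default retention
            NewTopic readings = new NewTopic("thermostat-readings", 6, (short) 3)
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG,
                            String.valueOf(30L * 24 * 60 * 60 * 1000)));

            // compacted topic that keeps only the latest value per key
            NewTopic latest = new NewTopic("thermostat-latest", 6, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                            TopicConfig.CLEANUP_POLICY_COMPACT));

            admin.createTopics(List.of(readings, latest)).all().get();
        }
    }
}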
• Kafka partitions
o A topic is divided further into partitions so that it can behave like a distributed system
▪ This makes sure a single topic's produce and consume load is spread out instead of stressing a single machine
o Messages are distributed (part of producer)
▪ If there is no key
• Distributed evenly (round robin strategy)
▪ If there is a key
• Hash function output % number of partitions = partition number
• Same key messages land in the same partition and in order.
o There is also an option to write a custom partitioner
▪ Not usually needed, as the default Kafka partitioner does a great job (see the sketch after this list)
o Offsets
▪ Starts with 0
▪ Increases monotonically.
▪ Once an offset is used, it is never reused within the same partition
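A simplified sketch of the keyed-partition assignment described above. The real default partitioner uses murmur2 hashing of the serialized key; this only illustrates the "hash % number of partitions" idea and why the same key always lands in the same partition.

import org.apache.kafka.common.utils.Utils;
import java.nio.charset.StandardCharsets;

public class PartitionForKeyExample {
    // roughly what happens for a non-null key: hash(keyBytes) % numPartitions
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        // the same key always maps to the same partition, which preserves per-key ordering
        System.out.println(partitionFor("thermostat-42", 6));
        System.out.println(partitionFor("thermostat-42", 6)); // same result as above
    }
}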
• Kafka brokers
o A network of machines called brokers, each running the Kafka broker process
▪ Can be in the same server
▪ Can be cloud instances
▪ Containers running in a pod
▪ Running on virtualized servers
▪ Running on physical servers in a data centre somewhere
o Broker handles two types of requests
▪ Produce request from the producer
▪ Fetch request from the consumer
o Each broker hosts some set of partitions
▪ Because every broker has its own storage
▪ A broker's basic function is to manage its partitions
o There is no fixed relationship between the number of partitions and the number of brokers
▪ The number of partitions can be equal to the number of brokers
▪ The number of partitions can be more than the number of brokers
▪ The number of brokers can be more than the number of partitions
o Brokers also handle replication of partitions between one another
o Functioning of the broker for a produce request
▪ Once a broker receives a produce request, it first lands in the socket receive buffer
▪ It will be picked up by one of the network threads
• Once a network thread picks up a client connection, it sticks with that client connection for its entire existence in the broker
• It then forms a produce request object and puts that into a shared request queue
▪ It will then be picked up by the second thread pool in Kafka, the I/O threads
• Unlike network threads, they can handle requests from any client
• Once the data is picked up from the shared queue, the I/O thread first validates the CRC of the data associated with the partition
• It then appends the data for the partition to a data structure called the commit log
▪ Commit log
• It is organized in a bunch of segments
o Each segment has two parts
▪ Actual data
▪ Index structure
• Provides mapping from the offset to the position of this record
within this log file
▪ Purgatory (map)
• By default, for durability, the broker only acknowledges a produce request once the data has been replicated across the brokers
• While waiting for the data to be fully replicated, the broker cannot hold the request inside an I/O thread (the pool is shared), so the pending request is parked in purgatory
• Once the acknowledgements are received from the other brokers, the response is put into the response queue, which exists separately for each network thread
▪ The network thread then picks up the generated response
• It sends the response to the socket send buffer
• The network thread is also responsible for enforcing the ordering of requests from a client
o It only handles one request from a client at a time; only once it has finished sending the response does it take the next request
o Functionality of broker for fetch request
▪ Received by the socket receive buffer
▪ Picked up by network thread
▪ Put into the shared request queue
▪ I/O thread will use the index structure to find the corresponding file byte range using the
offset index
• Sometimes a topic might have no new data
o The consumer can specify the minimum number of bytes it wants before a response is sent
o The consumer can specify the maximum time it is willing to wait for a response
o While waiting for either condition, the request is sent to purgatory to free up the I/O threads
▪ After either of the conditions is matched, the response is generated and sent to the
response queue
▪ Network thread will pick it up
• Zero-copy transfer
o Since Kafka is an event streaming platform, recently written data usually still sits in the page cache, so the broker does not have to make an extra trip to disk
o Sometimes, when older data is needed, fetching it from disk can tie up a network thread, and since network threads are shared between clients, some other processing might be delayed
▪ The response is sent to the socket send buffer, which is then received by the consumer client
• Kafka replication
o Partitions are replicated into several different brokers to facilitate fault tolerance
▪ The default replication factor is 1
o All copies of a partition are called replicas
o Every partition that gets replicated has one leader replica and N-1 follower replicas
o Data is produced to and consumed from the leader first
o Replication is turned on by default
• In-sync replicas (ISR)
o When the followers have all the messages that are available in the leader, it is then said that
they are in-sync
• Leader epoch
o Each leader is associated with a leader epoch
o It's a unique, monotonically increasing number that identifies the lifetime of a particular leader
o It is a critical component because it is used for log reconciliation among the replicas
• Committed
o Once a particular record is present in all the in-sync replicas, it is considered committed
o The offset up to which all records are committed is marked as the high watermark
• Kafka producers
o We program producer to work with our kafka broker
o producer does not know anything about the consumer
▪ It can scale independently
▪ It can fail independently
▪ Its slow speed will not affect consumer
o The producer decides which partition each message is sent to
o Producers write to the leader; followers then replicate the data written to the leader
o Functioning of a producer
▪ When the producer sends a message, it first goes to the serializer
• The serializers are configured in the producer library
o They define what the types of the key and value are
• The message is then turned into bytes by the serializer
▪ Bytes are then passed to the partitioner
• Checks if there is a key
o If the key is not null
▪ The key is hashed to pick the partition
▪ Sending one message at a time isn't efficient
• Messages are buffered per partition in an in-memory data structure known as a record batch
o Compression is also performed (if configured) for more efficient transfer
speed
▪ The producer also controls when to send a record batch as a produce request to the broker
• linger.ms
• batch.size
▪ Producer guarantee
• acks = 0
o Don't wait for any confirmation, just keep sending messages
o Might lose some messages
• acks = 1 (leader)
o Wait for the confirmation from the leader only
o Latency is higher than acks = 0
o Might lose messages if the leader goes down before the messages were replicated to the followers
• acks = -1 (all)
o Wait for the confirmation from all in-sync replicas as well as the leader
o Latency is higher than acks = 1
o No data loss (as long as at least one in-sync replica remains available)
▪ Delivery guarantees (based on when the ack for receiving the message is sent)
• At most once
o Some messages may be lost
• At least once
o Some messages may be duplicated
• Exactly once
o No lost and no duplicate messages
o Transactional API
▪ Available for both producer and consumer
▪ Protects against messages being written more than once (duplicate messages in the log)
• It switches the producer into idempotent mode; when the broker detects a duplicate (for example, a retried send), it discards it, so the same message is not written again if it was already written previously
▪ Idempotent processing
• Processing a message once or many times gives the same result
o Reading and processing the same message more than once does not hurt anything (and nothing is lost)
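A minimal producer sketch tying the pieces above together: serializers, acks, batching (linger.ms / batch.size) and idempotence. The broker address, topic name, key and payload are placeholders.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class PaymentsProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");                           // wait for all in-sync replicas
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");            // broker discards duplicate retries
        props.put(ProducerConfig.LINGER_MS_CONFIG, "20");                       // wait up to 20 ms to fill a batch
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, String.valueOf(32 * 1024)); // 32 KB record batches

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payments", "order-123", "payment made successfully"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();                        // delivery failed after retries
                        } else {
                            System.out.printf("acked at partition %d offset %d%n",
                                    metadata.partition(), metadata.offset());
                        }
                    });
        }
    }
}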
• Kafka consumer
o We program consumer to work with our kafka broker
o Consumers read the data from the leaders
▪ Data can also be read from follower replicas, not just the leader, to help with load balancing
o Consumer does not know anything about the producer
▪ It can scale independently
▪ It can fail independently
▪ Its slow speed will not affect producer
o Consumer offset
▪ A special topic in the Kafka cluster (__consumer_offsets) remembers the last offset read by a consumer
• This will make consumers stateful
o They will not get the same message again
o Functioning of consumer
▪ It just sends a request by specifying three things
• Topic
• Partition
• offset
• Kafka consumer group
o Every consumer in Kafka is a part of a consumer group
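A minimal consumer sketch: it joins a consumer group, polls records (topic, partition, offset) and commits its offsets so they are remembered in the consumer-offsets topic. The group id, topic name and broker address are placeholders.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class PaymentsConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-processor");       // consumer group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");        // start from the beginning if no offset is stored
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");          // commit offsets manually

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
                consumer.commitSync(); // remember the last processed offsets for this group
            }
        }
    }
}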
• Zookeeper
o A group of ZooKeeper nodes is known as an ensemble
o It manages brokers to make sure all the brokers are in sync with what's happening in the Kafka cluster
o ACLs are stored in it
o When a broker fails and it is the leader for a particular partition, a new leader is elected for that partition
▪ Topic replication metadata is also managed by ZooKeeper
• Which broker is in charge of each replica is tracked through ZooKeeper
• How replicas are organized across brokers is tracked through ZooKeeper
o Failover is managed through ZooKeeper
• Kafka connect
o Distributed, fault-tolerant, pluggable, declarative data integration framework
o Has its own server
▪ Every Connect cluster has some number of Connect workers
• Every worker can run one or more tasks
o Each task runs a piece of a connector (if the work can be partitioned)
▪ A connector is partitioned across multiple workers to parallelize reading/writing to the multiple partitions of the topics
• If a sink connector reads a topic that has multiple partitions,
o each task is assigned a subset of that topic's partitions for parallelization
▪ If a connector is not partitioned, then a worker will run only one task for the connector
o State of each connector is stored in a kafka topic
o Runs externally from the kafka cluster
o Confluent has its own libraries for connectors
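As a sketch of the declarative style, a connector is typically just a small config. The example below uses the FileStreamSink connector that ships with Kafka; the connector name, topic and file path are illustrative assumptions.

name=local-file-sink
connector.class=org.apache.kafka.connect.file.FileStreamSinkConnector
tasks.max=2
topics=thermostat-readings
file=/tmp/thermostat-readings.txt

In standalone mode this can be passed to the worker as a properties file; for a distributed Connect cluster the same settings are usually submitted as JSON to the Connect REST API.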
• KIP-500
o Kafka Improvement Proposal 500
▪ Plans to remove ZooKeeper completely from Kafka
• Confluent REST proxy
o REST wrapper around producer and consumer
▪ If you want to use HTTP proxy to access kafka
• Confluent schema registry
o Schema-related issues are addressed with it
▪ Issue of scale – when a lot of producers, consumers and topics are involved, you might want to upgrade producers or consumers to a newer version for better performance and security. If they upgrade but are no longer compatible with the old data, there will be a problem
o Define the schema
o Producers and consumers assume
▪ Data produced in a given format will be understood by the consumers
▪ Data received in a given format will stay in that format for the rest of the lifecycle of that topic
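A hedged sketch of how a producer is commonly pointed at Schema Registry using Confluent's Avro serializer. The serializer class and registry URL are the commonly documented ones, but treat the exact values as assumptions for your setup.

import org.apache.kafka.clients.producer.ProducerConfig;
import java.util.Properties;

public class AvroProducerConfigExample {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer registers/looks up the schema in Schema Registry
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");              // assumed registry address
        return props;
    }
}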
• Kafka streams
o Functional Java API
o Uses
▪ Microservices
▪ Continuous queries
▪ Continuous transformations
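A minimal Kafka Streams sketch of a continuous transformation: filtering hot readings from one topic into another. The topic names and the value-to-temperature parsing are illustrative assumptions.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class HotReadingsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "hot-readings-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> readings = builder.stream("thermostat-readings");
        readings.filter((key, value) -> Double.parseDouble(value) > 30.0)        // keep only "hot" readings
                .to("hot-thermostat-readings");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();                                                         // runs the topology continuously
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}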
• Librdkafka
o is a C library that implements the Apache Kafka protocol, providing Producer, Consumer and
Admin clients for interacting with Kafka clusters
o documentation link -
https://docs.confluent.io/platform/current/clients/librdkafka/html/md_INTRODUCTION.html
o features
▪ High performance: It can handle millions of messages per second with low latency and
minimal memory copying.
▪ Compatibility: It supports all Kafka broker versions from 0.8.x to 2.x and provides feature
discovery to automatically adapt to the broker capabilities.
▪ Reliability: It handles message delivery failures, network errors, leader changes and
other scenarios gracefully and transparently.
▪ Configurability: It exposes a rich set of configuration properties to tune the library for
different use cases and environments.
▪ Native C++ interface: It also provides a C++ wrapper that simplifies the usage of the
library for C++ applications.
• Kafka security
o At the cluster level, we can create an API key and use it with the services that want to connect to the cluster
▪ It consists of an API key and an API secret
• The secret is visible only once and can be downloaded when the API key is generated
• If you lose the API secret, the API key itself becomes useless
o Authentication
▪ Types of authentications in Apache Kafka
• Users
o Email ID and password
o SSO (single sign on)
▪ E.g., you log in to Office 365 and can then access any other Microsoft product without needing to log in again for authentication
• Services or applications
o API-keys
▪ Resource specific keys
• Able to manage specific resources such as clusters, schema
registry, ksqlDB and connect.
▪ Cloud keys
• Able to manage entire organization.
o OAuth
▪ It allows you to access your resources and data without having to
share or store user credentials.
▪ It's built on cryptographically signed access tokens that allow your application or service to authenticate.
▪ It is an industry standard for providing authentication.
▪ It uses identity pools to map groups of identities to your role-based
access control or access control list (ACL) policies.
• E.g. let’s say you have a group of applications that all need
access to the same cluster or Kafka topic. Rather than giving
each application individual access, or controlling access on a
per-application basis, you can create an identity pool, and use
an identity pool filter to map each application identity to this
identity pool. Then, using RBAC or ACL policies you can control
permissions for the pool.
▪ Advantages
• It is a cloud-native authentication solution.
o Most companies already employ an OAuth solution for
other cloud resources they use for authentication/identity
management.
• It provides centralized identity management.
o This is advantageous because system administrators have
only one place to manage authentication for all their
systems rather than having to create and manage
credentials, or set up a sync system, for each independent
service in their organization.
• It allows you to scale to thousands of identities and credentials,
making sure that your applications only have the access they are
supposed to have.
▪ Points to remember about API keys
• Only use service-account API keys in production for maximum security; that way, a user leaving the org will not affect the system's functionality
• Manage and delete unneeded keys and service accounts
• Rotate API keys regularly
o By rotating, you are making sure that a breach can only get access to the
data for a specific amount of time as for the other times, there’s a different
key
o By rotating the keys, you are also making sure that only the people who
should have access to the data are able to access it. People with outdated
information about the API keys can’t.
• API-keys can be created and destroyed without affecting any policy such as ACLs
or RBAC role bindings.
• Use audit logs to track the use of API keys
• If organization provides a large number of service accounts, or there are a large
number of applications accessing Confluent Cloud, it may run out of API keys.
▪ Points to remember about OAuth
• You can use OAuth with Apache Kafka 3.2.1, Confluent Platform 7.2.1 and 7.1.3 and later, and librdkafka 1.9.2 or later.
• You will need to upgrade any older or legacy clients that don’t support OAuth.
o While most applications or services have support built in, you may have
some legacy systems that will need to be upgraded, or which may require
you to use API keys.
• You should group clients into identity pools when possible.
o If you come from using API keys, your first inclination might be to create
identity pools for each service account. While this is possible, take some
time to really look at your clients and see if there are groups that you can
combine and control with one RBAC or ACL policy.
▪ Types of accounts
• User accounts
• Service accounts
▪ kafkaPrincipal
• it is an entity that can be authenticated by the authorizer.
• if you use SSL, the principal type is User and the name is the subject of the client
certificate.
• If you use SASL, the principal type is User and the name is the username.
• Usage
o use Kafka principals to define access control lists (ACLs) that specify the
permissions for each principal on different resources or groups of resources
in Kafka.
▪ types of interaction with authentication
• clients to broker
• broker to clients
• broker to broker
▪ listeners
• It is a way of specifying how a broker can communicate with other brokers and
clients.
• one or more listeners are specified when configuring a broker
o host/IP
o Port
o Security protocol
▪ PLAINTEXT – not secure
▪ SASL_PLAINTEXT – not secure
▪ SSL – secure
• When it is enabled for kafka listener, all traffic in the channel
will be encrypted using TLS cryptographic protocol
o TLS uses digital certificates to verify identities
• To use SSL for client authentication, set ssl.client.auth=required in the broker configuration
o ssl.client.auth=requested is also possible (not recommended)
▪ Clients with a certificate will be identified
▪ Clients without a certificate will be assigned the anonymous user principal
• Default behavior of Kafka clients
o They verify that the hostname in the URL matches the hostname in the broker certificate
▪ Disable this by setting ssl.endpoint.identification.algorithm=""
▪ Not recommended for production environments
▪ SASL_SSL – secure
• Simple Authentication and Security Layer
• A common choice when integrating with Kerberos or directory services such as LDAP or Active Directory
• 4 different SASL authentication mechanisms
o GSSAPI
▪ Best suited for Kerberos servers
o SCRAM-SHA-256 or SCRAM-SHA-512
o OAUTHBEARER
o PLAIN
• types of listeners
o internal listeners
o external listeners
o authorization
▪ what a user can do once it has authenticated itself
▪ ACLs
• It describes which users are permitted to perform which operations on specific resources or groups of resources
• Example – User:Alice has Allow permission for Write to Topic:customer (see the sketch at the end of this section)
o User:Alice is the principal
o Allow is the permission
o Write is the operation
o Topic:customer is the resource
▪ Resource names are by default matched as literals
• The pattern type can be changed to prefixed
• The wildcard character "*" can also be used
o Encryption
▪ By default, encryption is disabled for performance
• Enabling it may impact the performance
▪ Encryption in-transit is the only option available with kafka out of the box
• Need to use platform specific tools to encrypt data at rest.
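A sketch of creating the Alice ACL from the example above with the Java AdminClient. The principal and topic come from that example; the broker address is an assumption.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;
import java.util.List;
import java.util.Properties;

public class CreateAliceAclExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

        try (Admin admin = Admin.create(props)) {
            // User:Alice (principal) is Allowed (permission) to Write (operation) to Topic:customer (resource)
            AclBinding binding = new AclBinding(
                    new ResourcePattern(ResourceType.TOPIC, "customer", PatternType.LITERAL),
                    new AccessControlEntry("User:Alice", "*", AclOperation.WRITE, AclPermissionType.ALLOW));
            admin.createAcls(List.of(binding)).all().get();
        }
    }
}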
• Kafka quotas
o An Apache Kafka® cluster has the ability to enforce quotas on requests to control the broker
resources used by clients.
o Two types of client quotas can be enforced by Kafka brokers for each group of clients sharing a
quota:
▪ Network bandwidth quotas define byte-rate thresholds (version 0.9 and later)
▪ Request rate quotas define CPU utilization thresholds as a percentage of network and
I/O threads (version 0.11 and later)
o Order of precedence
▪ It is based on the level of specificity of the user and client ID groups.
• The more specific the group, the higher the precedence.
• For example, a specific user and a specific client ID is more specific than a specific
user and any client ID, which is more specific than any user and any client ID.
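A hedged sketch of setting a client quota programmatically with the AdminClient. The quota key used here (producer_byte_rate) is the commonly documented network-bandwidth quota; the user, client id, rate and broker address are assumptions for illustration.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class SetClientQuotaExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

        try (Admin admin = Admin.create(props)) {
            // the quota applies to the specific user "alice" and the specific client id "payments-app",
            // which is the most specific (highest precedence) combination
            ClientQuotaEntity entity = new ClientQuotaEntity(Map.of(
                    ClientQuotaEntity.USER, "alice",
                    ClientQuotaEntity.CLIENT_ID, "payments-app"));
            // limit produce throughput for this user/client-id pair to 1 MB/s
            ClientQuotaAlteration alteration = new ClientQuotaAlteration(entity,
                    List.of(new ClientQuotaAlteration.Op("producer_byte_rate", 1_048_576.0)));
            admin.alterClientQuotas(List.of(alteration)).all().get();
        }
    }
}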

Confluent CLI
• confluent environment list
o To list all the environments
o An environment is different from a cluster
• confluent environment use ID_name
o ID_name can be found in the output generated by the above command
• confluent kafka cluster list
o To see all the available clusters
• confluent kafka cluster use ID_name
o ID_name can be found in the output generated by the above command
• confluent api-key create --resource ID_name
o ID_name here refers to cluster ID
o This will generate an API key to access cluster in CLI
• confluent api-key use api_key_num --resource ID_name
o api_key_num is the api key generated by the above command
o ID_name here is again the cluster’s ID
o This command will help you connect with the cluster in CLI
• confluent kafka topic list
o This will list all the available topics
• confluent kafka topic consume --from-beginning topic_name
o Here topic_name refers to the name of the topic available from the available topic’s list
o This will start consuming the messages from the selected topic
• confluent kafka topic produce topic_name --parse-key
o Here topic_name refers to the name of the selected topic
o This command will help you start producing messages right from the CLI
o Colon-delimited messages will be parsed into a key and a value and sent to the selected topic
▪ 7:"I love kafka"
• confluent kafka topic describe topic_name
o Here topic_name is the topic you want to select
o This command will help you see the topic configuration
• confluent kafka topic create --partitions num_of_partitions topic_name
o num_of_partitions is replaced by the desired number of partitions
o topic_name is replaced by the name of the topic
o this command will create a new topic with the provided number of partitions
