Kafka & Confluent: A Technical Guide

Kafka is a distributed streaming platform for publishing and subscribing to streams of records. It is scalable, fault-tolerant, and fast. Kafka uses a publish-subscribe messaging model and is designed around a distributed commit log. Topics are split into partitions spread across the nodes of a cluster. Producers write data to topics, and consumer groups pull that data from the brokers. Zookeeper coordinates the cluster metadata.


Kafka - Spark streaming - ESGI 2020.

IPPON 2019
● What is Kafka
● Motivations
● Kafka architecture
● Produce and consume messages
● Start Kafka

What is Kafka ?

● Kafka
● Confluent


LinkedIn before Kafka



LinkedIn after Kafka



What is Kafka ?

● MOM (Message Oriented Middleware)


● Used to publish and subscribe to streams of records
● It is scalable
● It is polyglot (clients exist for many languages)
● It is fast



What about today ?

● Created by LinkedIn in 2009


● Open source since 2011
● Part of the Apache Software Foundation
○ Very active community
○ Current version 2.3.0

● Spinoff company Confluent created in 2014


○ Founded by Jay Kreps, Neha Narkhede and Jun Rao
○ Created the Confluent Platform
○ Valued at $2.5 billion (funding round of 23/01/2019)



What is Kafka ?

● Message bus
○ Written in Scala
○ Heavily inspired by transaction logs
● Initially created at LinkedIn in 2010
○ Open sourced in 2011
○ Became an Apache top-level project in 2012
● Designed to support batch and real time analytics
● Performs very well, especially at very large scale
What is Confluent ?

● Founded in 2014 by the creators of Kafka


● Provides support, training etc. for Kafka
● Provides the Confluent Platform
○ A lot of products to work with Kafka: produce messages, transform data, etc.
Motivations

● Traditional systems
● The importance of real time
● The birth of Kafka

Traditional systems

● In a traditional system, data is dispatched across several storage systems
○ A database, HDFS, etc.
○ Each producer implements its own transformation logic and writes into the database
● Over time, the system grows
○ The codebase grows and becomes hard to maintain
Traditional systems

● At first, it is easy to connect several systems and data sources to databases
Traditional systems

● But eventually it becomes hard to maintain


The importance of real time

● Batch processing is traditional and well known
○ We use this approach with Spark
○ Every day, week, etc., I run my batch job
● But it implies a strong restriction
○ I need to wait for the batch to finish before I can start analysing the data
The importance of real time

● Nowadays, it is really common to have real-time processing needs
○ Fraud detection
○ Recommender systems
○ Log monitoring
○ Real-time feeding of HDFS
○ etc.
Kafka

● Kafka was created to solve 2 issues
○ Simplify the architecture of data flows
○ Handle data streaming

● Kafka separates data production from data consumption
○ In traditional systems, both are usually tied into one application
○ "Publish" / "Subscribe" concepts
Kafka

● Kafka is designed to work in a cluster


● A cluster is a set of instances (nodes) that know each other
Kafka

● Once the data is in Kafka, it can be read by several, different consumers
○ One consumer writing to HDFS, another applying an alerting process, etc.
● Increasing the number of consumers does not have any significant impact on performance
● A consumer can be added without touching the producer
The architecture

● Fundamentals
● Producing messages
● Partitioning
● Consuming messages
● Zookeeper
Fundamentals

● Data sent to Kafka takes the form of messages
○ Each message is a key / value pair
○ By default, messages do not have any schema
● Each message is written to a topic
○ A topic is a way to group messages
○ Conceptually very close to a message queue
● Topics can be created in advance or dynamically by the producers
Fundamentals

● The 4 key components of Kafka are


○ Producers
○ Brokers
○ Consumers
○ Zookeeper
The producer

● It has the task of sending messages to the Kafka cluster


● One can write a producer in a lot of programming languages
○ Java, C, Python, Scala etc. In our case, it will be Scala
About messages

● A message is a key / value pair
● Keys and values can be of any type
○ You provide a serializer to tell the producer how to turn the data into a byte array
● The key is optional
○ It is used for partitioning (more on that soon)
○ Without a key, the message can be written to any partition
About partitioning

● Topics are split into partitions
● Each partition contains a subset of the topic's messages
● Kafka hashes the key to choose the partition where the message will be written (see the sketch below)
● Partitions are dispatched across the whole cluster
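To make the idea concrete, here is a minimal sketch (not the actual Kafka code) of key-based partitioning; the real default partitioner hashes the serialized key with murmur2, but the principle is the same:

    // Illustrative only: the same key always maps to the same partition.
    def choosePartition(key: String, numPartitions: Int): Int =
      (key.hashCode & Int.MaxValue) % numPartitions

    choosePartition("user-42", 3)   // always returns the same partition for "user-42"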
The broker

● The broker is the heart of Kafka


● It receives messages and persists them
● Highly performant (can handle several million messages per second)
The broker

● A Kafka cluster usually contains several brokers
○ For development / testing purposes, we may work with only one
● Each broker handles one or several partitions
○ Partitions are dispatched over the whole cluster
The consumer

● It reads messages from Kafka


● Several consumers can read the same topic
○ Each consumer will receive all messages from the topic (default behaviour)
● It receives messages by pulling them from Kafka
○ Other products push messages to the consumers
○ The main advantage of pulling is that it does not overload the consumer (backpressure)
○ The consumer reads at its own speed
Zookeeper

● An Apache project
● A centralised configuration and coordination service
● Used by Kafka's internals
Global architecture
Kafka versus

● HDFS & RDBMS
● CAP theorem
HDFS & RDBMS

● Kafka is similar to products like RabbitMQ
○ But RabbitMQ pushes messages to consumers
● Kafka can be used as a database, by increasing the message retention duration
○ It is not its main purpose
○ It is hard to manipulate messages individually
● It is a kind of orchestrator: it feeds different services and different databases
○ Such as HDFS
HDFS

● Distributed file system
● Scales extremely well
○ Even when the cluster is composed of more than a thousand nodes
● Not so true for Cassandra or MongoDB
○ Beyond a certain number of nodes, performance decreases
CAP theorem

● Consistency, Availability, Partition tolerance

● A distributed system can satisfy at most 2 of these 3 properties at the same time
○ An RDBMS is CA: it is not distributed, so partition tolerance does not apply
            C   A   P
Kafka       X   X
MongoDB     X       X
Cassandra       X   X
HDFS        X       X
Advanced architecture

● Partitions
● Commit log
● Consumer group and offset
● Replicas
Partitions

● Each topic is divided into one or several partitions
● Partitions are distributed over all the brokers in the cluster
Partitions

● With partitions, we can scale: data is no longer centralised but distributed
● Inside a given partition, data is read in the same order it was written. Order is guaranteed within the partition
● On the other hand, from the point of view of the topic, there is no order guarantee between messages coming from different partitions
● This is why it is important to choose the right key if order matters
Commit log

● The data of each partition is persisted in a commit log

● Commonly implemented as a file in "append only" mode
○ Thus, data is immutable and reads / writes are highly efficient
● Also used by classical RDBMSs
○ To trace all the changes that happen on tables
Consumer group

● Several consumers can consume together as a consumer group
○ They will not read the same messages from a given topic
○ They share the messages: a message is read only once within the group
● Each consumer will read from one or several partitions
● Data from a given partition is read by only one consumer in the group
Consumer group

● Consumers in a group share the partitions
● A single consumer consumes all the partitions
Consumer group

● The number of useful consumers is limited by the number of partitions
○ A useful consumer receives data
○ The others do not; they are idle
Offset

● For each consumer group and each partition, Kafka keeps an offset (an
integer)
● It is the position of the last element read by a given consumer group in a
given partition
Offset

● When a consumer asks for messages, Kafka looks up the offset it holds for this consumer group on each partition of the requested topic and sends the following messages
● When a consumer has processed a message, it commits it
● When a consumer commits, Kafka advances the offset for the given partition
● We can also ask Kafka to read from a specific offset, so the consumer can resume from wherever it wants (see the sketch below)
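As an illustration, here is a small sketch of reading from a chosen offset with the Java client used from Scala; the topic name "test", partition 0, offset 42 and the group id "offset-demo" are arbitrary values for the example:

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
    import org.apache.kafka.common.TopicPartition
    import org.apache.kafka.common.serialization.StringDeserializer

    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-demo")
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)
    val partition = new TopicPartition("test", 0)
    consumer.assign(Collections.singletonList(partition))   // manual assignment instead of subscribe
    consumer.seek(partition, 42L)                            // the next poll() starts at offset 42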
Replicas

● It is possible (and recommended) to replicate partitions


● Replicas are perfect copies of main partitions

Topic-Partition-1 Topic-Partition-2

Topic-Replica-2 Topic-Replica-1

Broker 1 Broker 2
Replicas

● If a broker goes down, a replica becomes the leader partition, and thus we can still consume / produce messages

Topic-Partition-1 Topic-Partition-2

Topic-Replica-2 Topic-Partition-1

Broker 1 Broker 2
Produce and consume

● Start Kafka
● Dependencies
● Produce
● Consume
Start Kafka

● Download Zookeeper and Kafka
○ https://www-us.apache.org/dist/zookeeper/current/zookeeper-3.4.12.tar.gz
○ https://www.apache.org/dyn/closer.cgi?path=/kafka/2.1.0/kafka_2.12-2.1.0.tgz
● bin/zookeeper-server-start.sh config/zookeeper.properties
● bin/kafka-server-start.sh ./config/server.properties
Console

● Kafka provides command line tools to manipulate topics, consume messages, etc.
● To create a topic
○ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
Console

● To produce a message
○ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
● To consume a topic
○ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
Scala dependencies

● The Kafka client library is available as a regular dependency
○ One can use Maven or SBT
○ With Maven:
Scala dependencies

● The Kafka client library is available as a regular dependency
○ One can use Maven or SBT
○ With SBT:
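As a sketch, the dependencies could look like this in build.sbt; the artifact names are the standard Kafka ones, and 2.3.0 is the version mentioned earlier in the deck (adjust to the version you downloaded):

    // build.sbt
    libraryDependencies += "org.apache.kafka" % "kafka-clients" % "2.3.0"
    // Needed later for the Kafka Streams examples (Scala DSL):
    libraryDependencies += "org.apache.kafka" %% "kafka-streams-scala" % "2.3.0"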
Produce

● To start, we need to instantiate a producer


Produce

● Then we need to configure the producer. There are 3 mandatory properties:
○ The address of at least one broker
○ The serializers for the key and the value
○ Serializers for common types are provided by Kafka, and we can define our own
Produce

● Kafka provides a utility class to simplify the configuration
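Putting the previous slides together, a minimal producer setup in Scala could look like the following sketch; localhost:9092 comes from the broker started earlier, and string keys/values are an assumption for the example:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig}
    import org.apache.kafka.common.serialization.StringSerializer

    val props = new Properties()
    // The three mandatory properties, using the ProducerConfig utility class:
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

    val producer = new KafkaProducer[String, String](props)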


Produce

● There are a lot of possible parameters
● Everything is documented
Produce

● To send a message
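A minimal send, continuing with the producer created above; the topic "test" is the one created with the console tools, the key and value are arbitrary:

    import org.apache.kafka.clients.producer.ProducerRecord

    val record = new ProducerRecord[String, String]("test", "my-key", "hello kafka")
    producer.send(record)   // fire-and-forget, asynchronous
    producer.flush()        // make sure buffered records are actually sent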
Produce

● The call to producer.send() is asynchronous (non-blocking)
● It does not block the code
● To force a synchronous (blocking) call, we need to call producer.send().get()
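For example, still with the same producer and record as above:

    // Asynchronous: returns immediately, the record is sent in the background
    producer.send(record)

    // Synchronous: .get() blocks until the broker acknowledges the write
    val metadata = producer.send(record).get()
    println(s"written to partition ${metadata.partition()} at offset ${metadata.offset()}")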
Produce

● To get the result, there are two ways


● The call to producer.send() returns a Future
○ Unfortunately, it is a Java Future, which is awkward to use from Scala
Produce

● The method producer.send() can also take a callback function as a second parameter
● When the send completes, the callback function is called
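A sketch of the callback variant (the Java client takes an org.apache.kafka.clients.producer.Callback as second argument):

    import org.apache.kafka.clients.producer.{Callback, RecordMetadata}

    producer.send(record, new Callback {
      override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit =
        if (exception != null) exception.printStackTrace()   // the send failed
        else println(s"acked: partition ${metadata.partition()}, offset ${metadata.offset()}")
    })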
Consume

● As for the producer, we need to instantiate the consumer


Consume

● As for the producer, we need to configure the consumer


● One more parameter is mandatory: the group id
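A minimal consumer setup mirroring the producer above; the group id "my-group" is an arbitrary choice for the example:

    import java.util.Properties
    import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
    import org.apache.kafka.common.serialization.StringDeserializer

    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group")   // the mandatory group id
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)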
Consume

● Several other parameters can be set
● For example, the parameter enable.auto.commit tells the consumer whether it has to commit automatically. Otherwise, it has to be done manually
○ If the property is set to true, the consumer commits every auto.commit.interval.ms (5000 ms by default)
● By default, enable.auto.commit is set to true
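For instance, to take control of commits, the property can be added to the consumer configuration above before the consumer is created:

    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")
    // When left at "true", auto.commit.interval.ms (5000 ms by default) controls how often
    // the consumer commits; the line below would change that interval:
    // props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000")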
Consume

● Then, we need to subscribe to the topics we wish to consume
● Kafka will then dispatch the partitions among the consumers of a given group
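Continuing with the consumer above; "test" is the topic created earlier:

    import java.util.Collections

    consumer.subscribe(Collections.singletonList("test"))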
Consume

● Then we can fetch the results
● The call to poll is synchronous. If no message is available, the consumer waits up to the duration given as a parameter before giving control back to the caller
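A typical poll loop, continuing with the subscribed consumer; the 1-second timeout is an arbitrary choice:

    import java.time.Duration
    import scala.collection.JavaConverters._

    while (true) {
      val records = consumer.poll(Duration.ofMillis(1000))   // waits up to 1 s if nothing is available
      for (record <- records.asScala)
        println(s"key=${record.key()} value=${record.value()} offset=${record.offset()}")
    }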
Consume

● If we set the parameter enable.auto.commit to false, we will have to commit manually, otherwise we will keep re-reading the same messages
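With auto-commit disabled, a commit can be placed after the records of a poll have been processed, continuing the loop above:

    val records = consumer.poll(Duration.ofMillis(1000))
    records.asScala.foreach(record => println(record.value()))
    consumer.commitSync()   // blocking: marks the polled offsets as processed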
Consume

● We can also commit asynchronously
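A sketch of an asynchronous commit, optionally with a callback to be notified of failures:

    import java.util.{Map => JMap}
    import org.apache.kafka.clients.consumer.{OffsetAndMetadata, OffsetCommitCallback}
    import org.apache.kafka.common.TopicPartition

    consumer.commitAsync(new OffsetCommitCallback {
      override def onComplete(offsets: JMap[TopicPartition, OffsetAndMetadata],
                              exception: Exception): Unit =
        if (exception != null) println(s"commit failed: ${exception.getMessage}")
    })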


Confluent ecosystem

● Schema Registry
○ Offers the possibility to apply schemas to messages
● Kafka Streams
○ High-level library (offering a DSL) to transform data between topics
○ Plays the role of the T in ETL
● Kafka Connect
○ Offers connectors to feed Kafka with data or move data from Kafka to other systems
■ There are connectors for HDFS, the file system, Cassandra, etc.
○ Plays the role of the E in ETL if the connector is a source, and the L if it is a sink
● etc.
Kafka Streams

● High-level API to consume and produce messages between topics
○ It is used to transform data
○ Kafka Streams also offers a low-level API. We will concentrate on the high-level API
● It is an alternative to
○ Spark Streaming
○ Apache Storm
○ Akka Streams
○ etc.
Kafka Streams

● Kafka Streams has 2 main abstractions
● KStream
○ The topic is seen as a data flow, where every record is independent from the others
● KTable
○ Similar to a changelog. Each record is seen as an update (depending on the key)
● For example, I have a topic with two elements ("euro", 5) and ("euro", 1)
○ If I create a KStream on this topic and sum the values for "euro", I will get 6
○ If I create a KTable, I will get 1
Kafka Streams

● Kafka Streams offers the usual high-level functions:


○ map
○ filter
○ groupByKey
○ count
○ etc.
Kafka Streams

● Simple example
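A small sketch of a Kafka Streams topology using the Scala DSL: it reads the "test" topic, drops empty values and upper-cases the rest into an output topic. The application id and the output topic name "test-uppercase" are arbitrary choices for the example:

    import java.util.Properties
    import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
    import org.apache.kafka.streams.scala.StreamsBuilder
    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.Serdes._

    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "simple-streams-app")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

    val builder = new StreamsBuilder()
    builder
      .stream[String, String]("test")           // KStream over the input topic
      .filter((_, value) => value.nonEmpty)     // drop empty messages
      .mapValues(_.toUpperCase)                 // transform each value
      .to("test-uppercase")                     // write to an output topic

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
    sys.addShutdownHook(streams.close())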
Kafka Streams

● Word count example
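A sketch close to the standard Kafka Streams word count, again with arbitrary topic names ("text-input", "word-counts") and application id:

    import java.util.Properties
    import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
    import org.apache.kafka.streams.scala.StreamsBuilder
    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.Serdes._

    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

    val builder = new StreamsBuilder()
    val wordCounts = builder
      .stream[String, String]("text-input")          // one line of text per message
      .flatMapValues(_.toLowerCase.split("\\W+"))    // split each line into words
      .groupBy((_, word) => word)                    // re-key the stream by word
      .count()                                       // KTable[String, Long] of counts

    wordCounts.toStream.to("word-counts")            // publish every updated count

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()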
