Kafka - Spark streaming - ESGI 2020.
Agenda
● What is Kafka?
● Motivations
● Kafka architecture
● Produce and consume messages
● Start Kafka

What is Kafka?
● Kafka
● Confluent

LinkedIn before Kafka
LinkedIn after Kafka
What is Kafka?
● MOM (Message Oriented Middleware)
● Used to publish and subscribe to streams of records
● It's scalable
● It's polyglot
● It's fast
What about today?
● Created by LinkedIn in 2009
● Open source since 2011
● Part of the Apache foundation
○ Very active community
○ Current version: 2.3.0
● Spinoff company Confluent created in 2014
○ Founded by Jay Kreps, Neha Narkhede and Jun Rao
○ Created the Confluent Platform
○ Valued at several billion dollars ($2.5 billion as of 2019-01-23)
What is Kafka?
● Message bus
○ Written in Scala
○ Heavily inspired by transaction logs
● Initially created at LinkedIn in 2010
○ Open sourced in 2011
○ Became an Apache top-level project in 2012
● Designed to support batch and real-time analytics
● Performs very well, especially at very large scale
What is Confluent?
● Founded in 2014 by the creators of Kafka
● Provides support, training, etc. for Kafka
● Provides the Confluent Platform
○ A suite of products to work with Kafka: produce messages, transform data, etc.
Motivations
● Traditional systems
● The importance of real time
● The birth of Kafka
Traditional systems
● In a traditional system, data is dispatched across several data stores
○ A database, HDFS, etc.
○ Each producer implements its own transformation logic and writes into the destination store
● Over time, the system grows
○ The codebase grows and becomes hard to maintain
Traditional systems
● At first, it is easy to connect several systems and data sources to databases
Traditional systems
● But eventually it becomes hard to maintain
The importance of real time
● Batch processing is traditional and well known
○ We use this approach with Spark
○ Every day, week, etc., I run my batch processing
● But it implies a strong restriction
○ I need to wait for the batch to finish before I can start analysing the data
The importance of real time
● Nowadays, it is really common to have real-time processing needs
○ Fraud detection
○ Recommender systems
○ Log monitoring
○ Real-time feeding of HDFS
○ etc.
Kafka
● Kafka was created to solve two issues
○ Simplify the architecture of data flows
○ Handle data streaming
● Kafka separates data production from data consumption
○ In traditional systems, both are usually tied together in one application
○ These are the "publish" / "subscribe" concepts
Kafka
● Kafka is designed to work in a cluster
● A cluster is a set of instances (nodes) that know each other
Kafka
● Once the data is in Kafka, it can be read by several different consumers
○ One consumer writing to HDFS, another applying an alerting process, etc.
● Increasing the number of consumers does not have any significant impact on performance
● A consumer can be added without touching the producer
The architecture
● Fundamentals
● Producing messages
● Partitioning
● Consuming messages
● Zookeeper
Fundamentals
● Data sent to Kafka takes the form of messages
○ Each message is a key / value pair
○ By default, messages do not have any schema
● Each message is written to a topic
○ A topic is a way to group messages
○ Conceptually very close to a message queue
● Topics can be created in advance or dynamically by the producers
Fundamentals
● The 4 key components of Kafka are
○ Producers
○ Brokers
○ Consumers
○ Zookeeper
The producer
● Its task is to send messages to the Kafka cluster
● A producer can be written in many programming languages
○ Java, C, Python, Scala, etc. In our case, it will be Scala
About messages
● A message is a key / value pair
● Keys and values can be of any type
○ You provide a serializer to tell the producer how to transform the data into a byte array
● The key is optional
○ It is used for partitioning (more on that soon)
○ Without a key, the message can be written to any partition
About partitioning
● Topics are split into partitions
● Each partition contains a subset of the topic's messages
● Kafka uses a hash of the key to choose the partition the message will be written to (see the sketch below)
● Partitions are dispatched over the whole cluster
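As a minimal sketch of this key-based routing (simplified: Kafka's actual default partitioner applies a murmur2 hash to the serialized key bytes):

  // Simplified sketch of key-based partition selection.
  // (The real DefaultPartitioner hashes the serialized key with murmur2.)
  def choosePartition(key: String, numPartitions: Int): Int =
    (key.hashCode & Int.MaxValue) % numPartitions

  choosePartition("euro", 3) // the same key always lands in the same partition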
The broker
● The broker is the heart of Kafka
● It receives messages and persists them
● Highly performant (can handle several million messages per second)
The broker
● A Kafka cluster usually contains several brokers
○ For development / testing purposes, we may work with only one
● Each broker handles one or several partitions
○ Partitions are dispatched over the whole cluster
The consumer
● It reads messages from Kafka
● Several consumers can read the same topic
○ Each consumer will receive all messages from the topic (default behaviour)
● It receives messages by pulling them from Kafka
○ Other products push messages to the consumers
○ The main advantage of pulling is that it does not overload the consumer (backpressure)
○ The consumer reads at its own pace
Zookeeper
● An Apache project
● A centralised configuration / coordination service
● Used by Kafka's internals
Global architecture
Kafka versus
● HDFS & RDBMS
● CAP Theorem
HDFS & RDBMS
● Kafka is similar to products like RabbitMQ
○ But RabbitMQ pushes messages to the consumers
● Kafka can be used as a database, by increasing the message retention duration
○ This is not its main purpose
○ It is hard to manipulate messages individually
● It is a kind of orchestrator: it feeds different services and different databases
○ Such as HDFS
HDFS
● Distributed file system
● Scales extremely well
○ Even when the cluster is composed of more than a thousand nodes
● Not so true for Cassandra or MongoDB
○ Beyond a certain number of nodes, performance decreases
CAP theorem
● Consistency, Availability, Partition tolerance
● A distributed system can satisfy at most 2 of these properties
○ An RDBMS is CA: it is not distributed, so partition tolerance does not apply
            C   A   P
Kafka       X   X
MongoDB     X       X
Cassandra       X   X
HDFS        X       X
Advanced architecture
● Partitions
● Commit log
● Consumer group and offset
● Replicas
Partitions
● Each topic is divided into one or several partitions
● Partitions are distributed over all the brokers in the cluster
Partitions
● With partitions, we can scale: data is no longer centralised but distributed
● Inside a given partition, data is read in the same order it was written; order is guaranteed within the partition
● On the other hand, from the point of view of the topic, there is no order guarantee between messages coming from different partitions
● This is why it is important to choose the right key when ordering matters
Commit log
● The data of each partition is persisted in a commit log
● Commonly implemented as a file in "append only" mode
○ Thus, data is immutable and reads / writes are highly efficient
● Also used by classical RDBMS
○ To trace all the changes that happen on tables
Consumer group
● Several consumers can consume together as a consumer group
○ They will not read the same messages from a given topic
○ They share the messages: a given message is read only once within the group
● Each consumer will read from one or several partitions
● Data from a partition will be read by only one consumer in the group
Consumer group
● Consumers in a group share the partitions; a single consumer consumes all the partitions
Consumer group
● The number of useful consumers is limited by the number of partitions
○ A useful consumer receives data
○ The others receive nothing; they just wait
Offset
● For each consumer group and each partition, Kafka keeps an offset (an integer)
● It is the position of the last element read by a given consumer group in a given partition
Offset
● When a consumer asks for a message, Kafka looks up the offset it holds for this consumer group (in each partition of the requested topic) and sends the corresponding message
● When a consumer gets a message, it commits it
● When a consumer commits, Kafka increments the offset for the given partition
● We can also ask Kafka to read from a specific offset, so the consumer can consume from wherever it wants (see the sketch below)
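As a sketch with the Java client (the topic name and offset are arbitrary, and `consumer` is assumed to be an already configured KafkaConsumer):

  import org.apache.kafka.common.TopicPartition

  // Read partition 0 of the "test" topic starting from offset 42 (arbitrary values)
  val partition = new TopicPartition("test", 0)
  consumer.assign(java.util.Collections.singletonList(partition))
  consumer.seek(partition, 42L)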
Replicas
● It is possible (and recommended) to replicate partitions
● Replicas are perfect copies of main partitions
[Diagram: Broker 1 hosts Topic-Partition-1 and Topic-Replica-2; Broker 2 hosts Topic-Partition-2 and Topic-Replica-1]
Replicas
● If a broker goes down, the replica becomes the leader partition, and thus we can still consume and produce messages
[Diagram: after the failure, the replica of Topic-Partition-1 hosted on Broker 2 is promoted and becomes the leader Topic-Partition-1]
Produce and consume
● Start Kafka
● Dependencies
● Produce
● Consume
Start Kafka
● Download Zookeeper and Kafka
○ https://www-us.apache.org/dist/zookeeper/current/zookeeper-3.4.12.tar.gz
○ https://www.apache.org/dyn/closer.cgi?path=/kafka/2.1.0/kafka_2.12-2.1.0.tgz
● bin/zookeeper-server-start.sh config/zookeeper.properties
● bin/kafka-server-start.sh ./config/server.properties
Console
● Kafka provides command line tools to manipulate topics, consume messages, etc.
● To create a topic
○ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
Console
● To produce a message
○ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
● To consume a topic
○ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
Scala dependencies
● The Kafka client is available as a standard Scala / Java dependency
○ One can use Maven or SBT
○ With Maven:
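For instance, a minimal dependency on the Kafka client (the version is an assumption, chosen to match the 2.x versions mentioned earlier):

  <dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>2.3.0</version>
  </dependency>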
Scala dependencies
● The Kafka client is available as a standard Scala / Java dependency
○ One can use Maven or SBT
○ With SBT:
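For instance, the equivalent SBT line (version again an assumption):

  libraryDependencies += "org.apache.kafka" % "kafka-clients" % "2.3.0"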
Produce
● To start, we need to instantiate a producer
Produce
● Then we need to configure the producer. There are 3 mandatory properties:
○ The address of at least one broker
○ The serializers for the key and the value
○ Serializers for common types are provided by Kafka, and we can also define our own
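A minimal sketch of the instantiation and configuration, assuming a broker on localhost:9092 and String keys and values:

  import java.util.Properties
  import org.apache.kafka.clients.producer.KafkaProducer

  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // at least one broker
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)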
Produce
● Kafka provides a utility class to simplify the configuration
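This presumably refers to the ProducerConfig class, which exposes the property names as constants; a sketch:

  import org.apache.kafka.clients.producer.ProducerConfig

  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")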
Produce
● There are a lot of possible parameters
● Everything is documented
Produce
● To send a message
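A minimal sketch (the topic name, key and value are arbitrary):

  import org.apache.kafka.clients.producer.ProducerRecord

  val record = new ProducerRecord[String, String]("test", "myKey", "myValue")
  producer.send(record)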
Produce
● The call to producer.send() is asynchronous (non-blocking)
● It does not block the code
● To force a synchronous (blocking) call, we need to call producer.send().get()
Produce
● To get the result, there are two ways
● The call producer.send() returns a Future
○ Unfortunately, it is a Java Future, which is hard to use idiomatically in Scala
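A sketch of the blocking variant (the Future is a java.util.concurrent.Future[RecordMetadata]):

  // get() blocks until the broker acknowledges the write
  val metadata = producer.send(record).get()
  println(s"partition=${metadata.partition()} offset=${metadata.offset()}")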
Produce
● The method producer.send() can also take a function as a parameter: a callback
● When the call completes, the callback function is invoked
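A sketch using the client's Callback interface:

  import org.apache.kafka.clients.producer.{Callback, RecordMetadata}

  producer.send(record, new Callback {
    override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit =
      if (exception != null) exception.printStackTrace() // the send failed
      else println(s"written to partition ${metadata.partition()} at offset ${metadata.offset()}")
  })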
Consume
● As for the producer, we need to instantiate the consumer
Consume
● As for the producer, we need to configure the consumer
● One more parameter is mandatory: the group id
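A sketch of the instantiation and configuration (the group id and broker address are assumptions):

  import java.util.Properties
  import org.apache.kafka.clients.consumer.KafkaConsumer

  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("group.id", "my-group") // the mandatory group id
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

  val consumer = new KafkaConsumer[String, String](props)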
Consume
● Several other parameters can be set
● For example, the parameter enable.auto.commit tells the consumer whether to commit automatically; otherwise, committing has to be done manually
○ If the property is set to true, the consumer commits every auto.commit.interval.ms (5000 ms by default)
● By default, enable.auto.commit is set to true
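For example, to switch to manual commits:

  props.put("enable.auto.commit", "false") // we now have to commit ourselves
  // props.put("auto.commit.interval.ms", "5000") // only relevant when auto commit is on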
Consume
● Then, we need to subscribe to the topics we wish to consume
● Kafka will then dispatch the partitions between all the consumers of a given group
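A sketch (subscribe takes a Java collection of topic names; "test" is the topic created earlier):

  consumer.subscribe(java.util.Arrays.asList("test"))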
Consume
● Then we can fetch the results
● The call to poll is synchronous: if no message is available, the consumer waits for the duration given as a parameter before giving control back to the caller
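A minimal polling loop (using the Kafka 2.x signature that takes a java.time.Duration):

  import java.time.Duration
  import scala.collection.JavaConverters._

  while (true) {
    // Wait at most 100 ms for new messages
    val records = consumer.poll(Duration.ofMillis(100))
    for (record <- records.asScala)
      println(s"${record.key()} -> ${record.value()} (partition ${record.partition()}, offset ${record.offset()})")
  }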
Consume
● If we set the parameter enable.auto.commit to false, we have to commit manually; otherwise we will read the same messages indefinitely
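A sketch of a synchronous manual commit:

  // Commits the offsets of the records returned by the last poll()
  consumer.commitSync()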
Consume
● We can also commit asynchronously
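A sketch: commitAsync returns immediately, and can take a callback to inspect the result:

  import java.util.{Map => JMap}
  import org.apache.kafka.clients.consumer.{OffsetAndMetadata, OffsetCommitCallback}
  import org.apache.kafka.common.TopicPartition

  consumer.commitAsync(new OffsetCommitCallback {
    override def onComplete(offsets: JMap[TopicPartition, OffsetAndMetadata],
                            exception: Exception): Unit =
      if (exception != null) exception.printStackTrace() // the commit failed
  })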
Confluent ecosystem
● Schema Registry
○ Offers the possibility to apply schemas to messages
● Kafka Streams
○ High-level library (offering a DSL) to transform data between topics
○ Plays the role of the T in ETL
● Kafka Connect
○ Offers connectors to feed Kafka with data or move data from Kafka to other systems
■ There are connectors for HDFS, the local file system, Cassandra, etc.
○ Plays the role of the E in ETL if the connector is a source, and of the L if it is a sink
● etc.
Kafka Streams
● High-level API to consume and produce messages between topics
○ It is used to transform data
○ Kafka Streams also offers a low-level API; we will concentrate on the high-level one
● It is an alternative to
○ Spark Streaming
○ Apache Storm
○ Akka Streams
○ etc.
Kafka Streams
● Kafka Streams has 2 core concepts
● KStream
○ The topic is seen as a data flow, where every record is independent from the others
● KTable
○ Similar to a changelog: each record is seen as an update (depending on its key)
● For example, say a topic contains two elements, ("euro", 5) and ("euro", 1)
○ If I create a KStream on this topic and sum the values in euros, I get 6
○ If I create a KTable, I get 1 (see the sketch below)
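A sketch of the difference, using the kafka-streams-scala DSL (the topic name "euros" is made up; in a real topology you would build one view or the other, not both on the same topic):

  import org.apache.kafka.streams.scala.StreamsBuilder
  import org.apache.kafka.streams.scala.ImplicitConversions._
  import org.apache.kafka.streams.scala.Serdes._

  val builder = new StreamsBuilder()

  // KStream view: every record counts, so the running sum for "euro" becomes 5 + 1 = 6
  val sums = builder.stream[String, Int]("euros").groupByKey.reduce(_ + _)

  // KTable view: each record overwrites the previous value for its key, so "euro" -> 1
  val latest = builder.table[String, Int]("euros")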
Kafka Streams
● Kafka Streams offers the usual high-level functions:
○ map
○ filter
○ groupByKey
○ count
○ etc.
Kafka Streams
● Simple example
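A comparable minimal topology, as a sketch (topic names are made up; same Scala DSL imports as above):

  import org.apache.kafka.streams.scala.StreamsBuilder
  import org.apache.kafka.streams.scala.ImplicitConversions._
  import org.apache.kafka.streams.scala.Serdes._

  val builder = new StreamsBuilder()
  builder.stream[String, String]("input")
    .filter((_, value) => value.nonEmpty) // drop empty messages
    .mapValues(_.toUpperCase)             // transform each value
    .to("output")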
Kafka Streams
● Word count example
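A sketch of the classic word count, assuming the kafka-streams-scala DSL and made-up topic names:

  import java.util.Properties
  import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
  import org.apache.kafka.streams.scala.StreamsBuilder
  import org.apache.kafka.streams.scala.ImplicitConversions._
  import org.apache.kafka.streams.scala.Serdes._

  object WordCount extends App {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-example")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

    val builder = new StreamsBuilder()
    builder.stream[String, String]("text-input")
      .flatMapValues(_.toLowerCase.split("\\W+")) // split each line into words
      .groupBy((_, word) => word)                 // re-key the stream by word
      .count()                                    // KTable of word -> count
      .toStream
      .to("word-counts")

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
    sys.addShutdownHook(streams.close())
  }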