A visual introduction to Apache Kafka

Introducing Apache Kafka®
Paul Brebner
Technical Evangelist February 2019

What is Kafka?
Message flow
Distributed streams
Processing System
Messages sent by
Distributed Producers
to
Distributed Consumers
Consumers
via
Distributed Kafka Cluster
Cluster

Kafka Benefits
 Fast
 Scalable
 Reliable
 Durable
 Open Source
 Managed Service
Kafka benefits

Kafka Benefits
 Fast – high throughput and low latency
 Scalable – horizontally scalable with nodes and partitions
 Reliable – distributed and fault tolerant
 Durable - zero data loss, messages persisted to disk with immutable log
 Open Source – An Apache project
 Available as a Managed Service - on multiple cloud platforms
Kafka benefits

“A” is the Producer – sends a message, to

“B” the Consumer (recipient) of the message

Due to decline in “snail mail” direct deliveries

Instead … “Poste Restante”

“Poste Restante”?
• Not a post office in a
restaurant
• General delivery (in the
US)
• The mail is delivered to
a post office, they hold
it for you until you call

Consumers “poll” for messages by visiting the
Poste Restante counter at the post office

Kafka topics act like a Post Office

Benefits include
• Disconnected delivery –
consumer doesn’t need to
be available to receive
messages
• Less effort for the
messaging service – only
has to deliver to a few
locations not every
consumer
• Can scale better and
handle more complex
delivery semantics!

Scalability? Many consumers for a topic?

A single counter introduces delays

More counters increases concurrency

Kafka topics have >= 1 Partitions (“counters”)
• Partitions increase
consumer concurrency
• Increase throughput
• Reduce latency

Santa
North Pole
What’s a Kafka message?
A Record – like a letter

Santa
North PoleTopic
Topic is the destination

Santa
North PoleTopic
Timestamp, offset, partition
The “Postmark”

Santa
North PoleTopic
The “Postmark”
Time semantics are flexible
time of event creation,
ingestion,
or processing.

Santa
North PoleTopic
Key -> Partition (optional)
Key (optional)

Santa
North PoleTopic
Key (optional)
Refines the destination
Send to Santa not just any Elf

Santa
North PoleTopic
Value is contents (byte array)

Santa
North PoleTopic
Kafka Producers and Consumers
need a serializer and de-serializer
to write & read key and value

• Kafka doesn’t look
at the value
• Consumer can
read value
• And try to make
sense of the
message
• What will Santa
be delivering?!

Next…
Delivery Semantics
Do we care
if the message arrives?

Yes! Guaranteed delivery is desirable

But homing pigeons got lost or eaten

One pigeon may make it
Eaten
Lost

Producer
Broker
1
Broker
2
Broker
3
M1
How does Kafka guarantee delivery?

Producer
Broker
1
Broker
2
Broker
3
M1
A Message (M1) is written to a broker (2)

Producer
Broker
1
Broker
2
Broker
3
M1
M1
The message is always persisted to disk.

Producer
Broker
1
Broker
2
Broker
3
M1
M1
This makes it resilient to loss of power.

Producer
Broker
1
Broker
2
Broker
3
M1 M1
M1 M1M1
The message is also replicated on multiple “brokers”

Producer
Broker
1
Broker
2
Broker
3
M1
M1 M1M1
Which makes it resilient to loss of most servers

Producer
Broker
1
Broker
2
Broker
3
Acknowledgement
M1 M1 M1
Producer gets acknowledgement
once the message is persisted and replicated
(configurable)

Consumer
Broker
1
Broker
2
Broker
3
M1 M1 M1
Multiple Brokers and Partitions 
increased read availability and concurrency

Producer
Consumer
Consumer
Consumer
Consumer
The 2nd aspect of delivery semantics:
Who gets the messages?
How many times are messages delivered?

Producer
Consumer
Consumer
Consumer
Consumer
Delivery Semantics - Kafka is “pub-sub”
- Loosely coupled
- Producers and consumers don’t know about each other

Producer
Topic “Parties”
Consumer
Consumer
Consumer
Consumer
Which consumers get which messages
(filtering), is topic based
Topic “Work”

Producer
Topic “Parties”
Consumer
Consumer
Consumer
Consumer
Topic “Work”
Consumers Subscribe to topic “Parties”
Subscribe

Producer
Topic “Parties”
Consumer
Consumer
Consumer
Consumers
Subscribed to Topic “Parties”
Consumer
Publishers send messages to topics
Send
Topic “Work”Send

Producer
Topic “Parties”
Consumer
Consumer
Consumer
Consumers
Subscribed to Topic “Parties”
Consumer
Consumers only receive messages from
subscribed topics
Consumers Poll
To receive messages
from ”Parties”
Topic “Work” Consumers not subscribed to “Work”
Don’t receive any “Work” messages

Partitions and Consumer Groups
Enable sharing of work across
consumers

Duplicate message delivery
Each message is
delivered to each
subscribed
consumer group

Producer
Partition 1
Partition 2
Partition n
Topic “Parties”
C1
Consumer Group
Consumer Group
Consumer
Consumer
Consumer
Consumer
Consumers subscribed to topic are allocated partitions
They will only get messages from their allocated partitions.

Producer
Partition 1
Partition 2
Partition n
Topic “Parties”
C1
Consumer Group
Consumer
Consumer
Consumer
Consumers share work
within groups
Consumers in the same group share the work around
Each consumer gets only a subset of messages

Producer
Partition 1
Partition 2
Partition n
Topic “Parties”
C1
Consumer Group
Consumer Group
Consumer
Consumer
Consumer
Consumer
Messages are duplicated across
Consumer groups
Multiple groups enable message broadcasting
Messages are duplicated across groups, each consumer
group receives a copy of each message.

Key
Partition based delivery
Which messages are delivered to
which consumers?
If a message has a key, then Kafka
uses Partition based delivery.
Messages with the same key are
always sent to the same partition
and therefore the same
consumer.
And the order is guaranteed.

No Key
Round robin delivery
If the key is null, then
Kafka uses round robin
delivery
Each message is delivered
to the next partition

Time for an Example, with 2 consumer groups.

Consumer Group = Nerds
Multiple consumers

Consumer Group = Nerds
Multiple consumers
Consumer Group = Hairy
Single consumer

Producer
Partition 1
Partition 2
Partition n
Topic “Parties”
C1
Group “Nerds”
Group “Hairy”
Consumer 1 (Bill)
Consumer 2 (Paul)
Consumer n
Consumer 1 (Chewy)
Consumers
Subscribed to “Parties”
No Key
Round Robin
M1
M2
etc
M1
M1
M1
M2
M2
M2
Case 1: No Key
Message (M1, M2, etc) sent to the next partition
All consumers allocated to that partition will receive a message when they poll next.

Here’s what happens (not showing producer or topics, have to imagine them)

1. Both Groups subscribe to Topic “parties” (11 partitions, so 1 consumer per partition).
Subscribe to “Parties” Subscribe to “Parties”

2. Producer sends record “Cool pool party – Invitation”
<key=null, value=“Cool pool party - Invitation”> to “parties” topic (no key)

2. Producer sends record “Cool pool party - Invitation”> to “parties” topic
3. Bill and Chewbacca receive a copy of the invitation and plan to attend

4. Producer sends another record “Cool pool party – Cancelled”
<key=null, value=“Cool pool party - Cancelled”> to “parties” topic

4. Producer sends another record <key=null, value=“Cool pool party - Cancelled”> to “parties” topic
5. Paul and Chewbacca receive the cancellation.
Paul gets the message this time as it’s round robin, ignores it as he didn’t get the invitation. Bill wastes his
time trying to go to cancelled party. The rest of the gang aren’t surprised at not receiving any party invites and
stay at home to do some hacking. Chewy is only consumer in his group so gets all messages, plans something
fun instead…

Producer
Partition 1
Partition 2
Partition n
Topic “Parties”
C1
Group “Nerds”
Group “Hairy”
Consumer 1 (Bill)
Consumer 2 (Paul)
Consumer n
Consumer 1 (Chewy)
Consumers
Subscribed to “Parties”
Key
Hashed to partition
M1, M2
etc
M1, M2
M1, M2
M1, M2
M3
M3
M3
M3
Case 2: If there is a Key
A key is hashed to a partition, and a Message with that key is always sent to that partition.
Assume there are 3 messages, and messages 1 and 2 are hashed to same partition.

Here’s what happens with a key: key is “title” of the message (e.g. “Cool pool party”)
Same set up as before:
1. Both Groups subscribe to Topic “parties” (11 partitions).

2. Producer sends record <key=“Cool pool party”, value=“Invitation”> to “parties” topic

2. Producer sends record <key=“Cool pool party”, value=“Invitation”> to “parties” topic
3. As before Bill and Chewbacca receive a copy of the invitation and plan to attend

4. Producer sends another record <key=“Cool pool party”, value=“Cancelled”> to “parties” topic

4. Producer sends another record <key=“Cool pool party”, value=“Cancelled”> to “parties” topic
5. Bill and Chewbacca receive the cancellation (same consumers this time, as identical key)

6. Producer sends another record <key=“Horrible Halloween party”, value=“Invitation”> to ”parties” topic

6. Producer sends another record <key=“Horrible Halloween party”, value=“Invitation”> to ”parties” topic
7. Paul and Chewy receive the invitation
Paul receives the Halloween invitation as the key is different and the record is sent to the partition that Paul is
allocated to
Chewy is the only consumer in his group so he gets every record no matter what partition it’s sent to

Real-time data pipeline
Read-time data pipeline features:
• Ingestion of multiple heterogeneous sources
• Sending data to multiple heterogeneous sinks
• Acts as a buffer to smooth out load spikes
• Enables use cases which reprocess data (e.g. disaster recovery)

Anomaly Detection Pipeline
Real-time Event processing pipeline:
• Simple event driven applications (If X then Y…)
• May write and read from other data sources (e.g. Cassandra)
• New Events sent back to Kafka or to other systems
• E.g. Anomaly Detection, check out my current blog series if you are interested in this example.

Kafka Streams Processing (Kongo IoT Blog series)
Streams processing features:
• Complex streams processing (multiple events and streams)
• Time, windows, and transformations
• Uses Kafka Streams API, includes state store
• Visualization of the streams topology
• Continuously computes the loads for trucks and checks if they are overloaded.

Linkedin - Before Kafka (BK)
A real example from Linkedin, who developed Kafka.
Before Kafka they had spaghetti integration of monolithic applications.
To accommodate growing membership and increasing site complexity, they migrated from a monolithic
application infrastructure to one based on microservices, which made the integration even more complex!

After Kafka (AK)
Rather than maintaining and scaling each pipeline individually, they invested in the
development of a single, distributed pub-sub platform - Kafka was born.
The main benefit was better Service decoupling and independent scaling.

The End (of the introduction) -
Find out more
Apache Kafka: https://kafka.apache.org/
Instaclustr blogs
• Mix of Cassandra, Spark, Zeppelin and Kafka
https://www.instaclustr.com/paul-brebner/
• Kafka introduction
https://insidebigdata.com/2018/04/12/developing-deeper-understanding-apache-kafka-architecture/
https://insidebigdata.com/2018/04/19/developing-deeper-understanding-apache-kafka-architecture-part-2-
• Kongo – Kafka IoT logistics application blog series
https://www.instaclustr.com/instaclustr-kongo-iot-logistics-streaming-demo-application/
• Anomaly detection with Kafka and Cassandra (and Kubernetes), current blog series
https://www.instaclustr.com/anomalia-machina-1-massively-scalable-anomaly-detection-with-apache-kafka-
Instaclustr’s Managed Kafka (Free trial)
https://www.instaclustr.com/solutions/managed-apache-kafka/

A visual introduction to Apache Kafka

Recommended

More Related Content

What's hot (20)

Similar to A visual introduction to Apache Kafka (14)

More from Paul Brebner (20)

Recently uploaded (20)

A visual introduction to Apache Kafka

Editor's Notes