Thoughts on Kafka Capacity Planning
Jamie Alquiza
Sr. Software Engineer
A Lot of Kafka
– Multi-petabyte footprint
– 10s of GB/s in sustained bandwidth
– Globally distributed infrastructure
– Continuous growth
Motivation
1. Poor utilization at scale is $$$🔥
2. Desire for predictable performance ⚡
Kafka Resource Consumption
CPU
- Message rate
- Compression
- Compaction (if used)
Memory
- Efficient, steady heap usage
- Page cache
Disk
- Bandwidth
- Storage capacity
Network
- Consumers
- Replication
Kafka Makes Capacity Planning Easy
Through the lens of the Universal Scalability Law (formula below):
- low contention, crosstalk
- no complex queries
Exposes mostly bandwidth problems:
- highly sequential, batched ops
- primary workload is streaming the reads/writes of bytes
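For reference (not from the slides), Gunther's Universal Scalability Law models relative capacity C(N) at concurrency N with a contention coefficient α and a coherency/crosstalk coefficient β; the claim above is that Kafka's streaming workload keeps both terms small, so throughput stays close to linear in broker count:

```latex
% Universal Scalability Law (Gunther): relative capacity at concurrency N.
% \alpha: contention (serialization) penalty, \beta: coherency (crosstalk) penalty.
C(N) = \frac{N}{1 + \alpha\,(N - 1) + \beta\,N\,(N - 1)}
```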
Kafka Makes Capacity Planning Hard
The default tools weren't made for scaling:
- reassign-partitions focused on simple partition placement
No administrative API:
- no endpoint to inspect or manipulate resources
A Scaling Model
Created Kafka-Kit (open-source):
- topicmappr for intelligent partition placement
- registry (WIP): a Kafka gRPC/HTTP API
Defined a simple workload pattern:
- topics are bound to specific broker sets ("pools")
- multiple pools per cluster
- primary drivers: disk capacity & network bandwidth
– topicmappr builds optimal partition -> broker pool mappings
– topic/pool sets are scaled individually
– topicmappr handles repairs, storage rebalancing, pool expansion
A large cluster is composed of:
- dozens of pools
- hundreds of brokers
Sizing Pools
- Storage utilization is targeted at 60-80%, depending on topic growth rate
- Network capacity depends on several factors:
  - consumer demand
  - MTTR targets (20-40% headroom for replication)
Determining broker counts (sketched in code below):
storage: n = fullRetention / (storagePerNode * 0.8)
network: n = consumerDemand / (bwPerNode * 0.6)
pool size = max(ceil(storage), ceil(network))
(we do a pretty good job at actually hitting this)
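A minimal sketch of the broker-count math above; the function and parameter names are illustrative, not part of Kafka-Kit:

```go
package main

import (
	"fmt"
	"math"
)

// poolSize returns the broker count for a pool given full-retention storage
// (GB), per-node storage (GB), aggregate consumer demand (MB/s), and per-node
// network bandwidth (MB/s). The 0.8 and 0.6 factors reserve storage and
// network headroom, matching the formulas on the slide.
func poolSize(fullRetentionGB, storagePerNodeGB, consumerDemandMBps, bwPerNodeMBps float64) int {
	storage := fullRetentionGB / (storagePerNodeGB * 0.8)
	network := consumerDemandMBps / (bwPerNodeMBps * 0.6)
	return int(math.Max(math.Ceil(storage), math.Ceil(network)))
}

func main() {
	// Hypothetical pool: 400 TB retained, 16 TB per node, 8 GB/s consumer
	// demand, 1.25 GB/s (10 Gb/s) per-node network.
	fmt.Println(poolSize(400_000, 16_000, 8_000, 1_250)) // 32 brokers (storage-bound)
}
```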
Instance Types
- If there's a huge delta between the counts required for network vs. storage: probably the wrong type
- Remember: sequential, bandwidth-bound workloads
- AWS: d2, i3, h1 class
Instance Types (AWS)
d2: the spinning rust is actually great
Good for:
- Storage/$
- Retention biased workloads
Problems:
- Disk bw far exceeds network
- Long MTTRs
Instance Types (AWS)
h1: a modernized d2?
Good for:
- Storage/$
- Balanced, lower retention / high throughput workloads
Problems:
- ENA network exceeds disk throughput
- Recovery times are disk-bound
- Disk bw / node < d2
Instance Types (AWS)
i3: bandwidth monster
Good for:
- Low MTTRs
- Concurrent i/o outside the page cache
Problems:
- storage/$
Instance Types (AWS)
r4, c5, etc. + gp2 EBS
It actually works well; EBS perf isn't a problem
Problems:
- low EBS channel bandwidth in relation to instance size
- the burden of running a distributed/replicated store, hinging it on tech that solves 2009 problems
- may want to consider Kinesis / etc.?
Data Placement
topicmappr optimizes for:
- maximum leadership distribution
- replica rack.id isolation
- maximum replica list entropy
Data Placement
Maximum replica list entropy(?)
"For all partitions that a given broker holds, ensuring that the partition replicas are distributed among as many other unique brokers as possible"
Data Placement
Maximum replica list entropy
It's possible to have maximal partition distribution but a low number of unique broker-to-broker relationships
Example: broker A holds 20 partitions, yet all 20 replica sets contain only 3 other brokers
Data Placement
Maximum replica list entropy!
- topicmappr expresses this as node degree distribution
- broker-to-broker relationships: it's a graph
- replica sets are partial adjacency lists
Data Placement
A graph of replicas
Given the following partition replica sets:
p0: [1, 2, 3]
p1: [2, 3, 4]
Broker 3's adjacency list -> [1, 2, 4] (degree = 3)
(a code sketch of this follows)
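A minimal sketch of the degree computation described above; the types and function names are illustrative, not topicmappr's actual API:

```go
package main

import "fmt"

// brokerDegrees builds each broker's adjacency set from partition replica
// lists (each replica list is a partial adjacency list) and returns the
// resulting node degree per broker.
func brokerDegrees(replicaSets [][]int) map[int]int {
	adj := map[int]map[int]struct{}{}
	for _, rs := range replicaSets {
		for _, a := range rs {
			if adj[a] == nil {
				adj[a] = map[int]struct{}{}
			}
			for _, b := range rs {
				if a != b {
					adj[a][b] = struct{}{}
				}
			}
		}
	}
	degrees := map[int]int{}
	for broker, peers := range adj {
		degrees[broker] = len(peers)
	}
	return degrees
}

func main() {
	// p0: [1, 2, 3], p1: [2, 3, 4] -> broker 3 is adjacent to {1, 2, 4}.
	fmt.Println(brokerDegrees([][]int{{1, 2, 3}, {2, 3, 4}})[3]) // 3
}
```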
Data Placement
Maximizing replica list entropy (is good)
In broker failures/replacements:
- probabilistically increases replication sources
- faster, lower-impact recoveries
Data Placement
topicmappr optimizes for:
- maximum leadership distribution ✅
- replica rack.id isolation ✅
- maximum replica list entropy ✅
Maintaining Pools
Most common tasks:
- ensuring storage balance
- simple broker replacements
Both of these are (also) done with topicmappr
Maintaining Pools
Broker storage balance
- finds offload candidates: n distance below the harmonic mean storage free
- plans relocations to least-utilized brokers
- fair-share, first-fit descending bin-packing
- loops until no more relocations can be planned (sketched below)
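A simplified sketch of one offload pass as described above, assuming candidates have already been split into over-utilized sources and destination brokers at or above the pool's mean free storage; the names are illustrative, not topicmappr internals (topicmappr repeats passes like this until nothing more fits):

```go
package main

import (
	"fmt"
	"sort"
)

type partition struct {
	id     string
	sizeGB float64
}

// planRelocations places partitions from offload candidates onto the
// least-utilized brokers, largest partition first (first-fit descending),
// never pushing a destination below the target mean free storage.
func planRelocations(free map[string]float64, offload map[string][]partition, meanFreeGB float64) map[string]string {
	moves := map[string]string{} // partition id -> destination broker

	// Destination candidates: brokers at or above the mean free storage.
	var dests []string
	for b, f := range free {
		if f >= meanFreeGB {
			dests = append(dests, b)
		}
	}

	for src, parts := range offload {
		// First-fit descending: try the largest partitions first.
		sort.Slice(parts, func(i, j int) bool { return parts[i].sizeGB > parts[j].sizeGB })
		for _, p := range parts {
			// Prefer the least-utilized (most free) destination that still fits.
			sort.Slice(dests, func(i, j int) bool { return free[dests[i]] > free[dests[j]] })
			for _, d := range dests {
				if d != src && free[d]-p.sizeGB >= meanFreeGB {
					moves[p.id] = d
					free[d] -= p.sizeGB
					free[src] += p.sizeGB
					break
				}
			}
		}
	}
	return moves
}

func main() {
	free := map[string]float64{"b1": 200, "b2": 1200, "b3": 1400}
	offload := map[string][]partition{"b1": {{"t0-3", 300}, {"t0-7", 150}}}
	fmt.Println(planRelocations(free, offload, 600)) // map[t0-3:b3 t0-7:b2]
}
```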
Maintaining Pools
Broker storage balance (results)
Maintaining Pools
Broker replacements
- When a single broker fails, how is a replacement chosen?
- Goal: retain any previously computed storage balance (via 1:1 replacements)
- Problem: dead brokers are no longer visible in ZK
- topicmappr can be provided several hot spares from varying AZs (rack.id)
- it infers a suitable replacement (the "substitution affinity" feature)
Maintaining Pools
Broker replacements - inferring replacements
- traverse all ISRs, build a set of all rack.ids: G = {1a,1b,1c,1d}
- traverse affected ISRs, build a set of live rack.ids: L = {1a,1c}
- build the set of suitable rack.ids to choose from: S = { x ∈ G | x ∉ L } = {1b,1d}
- automatically chooses a hot spare from 1b or 1d (sketched below)
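A minimal sketch of the set difference above; the helper name is illustrative, not the actual substitution-affinity code:

```go
package main

import "fmt"

// suitableRackIDs returns rack.ids seen across all ISRs (G) but absent from
// the ISRs affected by the dead broker (L): S = G \ L. A hot spare is then
// chosen from a rack.id in S.
func suitableRackIDs(all, live []string) []string {
	inLive := map[string]bool{}
	for _, r := range live {
		inLive[r] = true
	}
	var s []string
	for _, r := range all {
		if !inLive[r] {
			s = append(s, r)
		}
	}
	return s
}

func main() {
	G := []string{"1a", "1b", "1c", "1d"}
	L := []string{"1a", "1c"}
	fmt.Println(suitableRackIDs(G, L)) // [1b 1d]
}
```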
Maintaining Pools
Broker replacements - inferring replacements
Outcome:
- Keeps brokers bound to specific pools
- Simple repairs that maintain storage balance and high utilization
Scaling Pools
When: >90% storage utilization in 48h
How:
- add brokers to the pool
- run a rebalance
- autothrottle takes over
(autothrottle is a service that dynamically manages replication rates)
Increasing capacity also improves storage balance
What's Next
- precursor to fully automated capacity management
- continued growth, dozens more clusters
- new infrastructure
- (we're hiring)
Thank you
Jamie Alquiza
Sr. Software Engineer
twitter.com/jamiealquiza
