Thoughts on Kafka Capacity Planning
Jamie Alquiza
Sr. Software Engineer
A Lot of Kafka
– Multi-petabyte footprint
– 10s of GB/s in sustained bandwidth
– Globally distributed infrastructure
– Continuous growth
Motivation
1. Poor utilization at scale is $$$🔥
2. Desire for predictable performance ⚡
Kafka Resource Consumption
CPU
- Message rate
- Compression
- Compaction (if used)
Memory
- Efficient, steady heap usage
- Page cache
Disk
- Bandwidth
- Storage capacity
Network
- Consumers
- Replication
Kafka Makes Capacity Planning Easy
Through the lens of the Universal Scalability Law (formula below):
- low contention, crosstalk
- no complex queries
Exposes mostly bandwidth problems:
- highly sequential, batched ops
- primary workload is streaming the reads/writes of bytes
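For reference (not from the slides), Gunther's Universal Scalability Law models relative capacity C(N) at concurrency N with a contention coefficient α and a coherency/crosstalk coefficient β; the claim above is that Kafka's streaming workload keeps both terms small, so throughput stays close to linear in broker count:

```latex
% Universal Scalability Law (Gunther): relative capacity at concurrency N.
% \alpha: contention (serialization) penalty, \beta: coherency (crosstalk) penalty.
C(N) = \frac{N}{1 + \alpha\,(N - 1) + \beta\,N\,(N - 1)}
```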
Kafka Makes Capacity Planning Hard
The default tools weren't made for scaling:
- reassign-partitions focused on simple partition placement
No administrative API:
- no endpoint to inspect or manipulate resources
A Scaling Model
Created Kafka-Kit (open-source):
- topicmappr for intelligent partition placement
- registry (WIP): a Kafka gRPC/HTTP API
Defined a simple workload pattern:
- topics are bound to specific broker sets ("pools")
- multiple pools per cluster
- primary drivers: disk capacity & network bandwidth
– topicmappr builds optimal partition -> broker pool mappings
– topic/pool sets are scaled individually
– topicmappr handles repairs, storage rebalancing, pool expansion
A large cluster is composed of:
- dozens of pools
- hundreds of brokers
Sizing Pools
- Storage utilization is targeted at 60-80%, depending on topic growth rate
- Network capacity depends on several factors:
  - consumer demand
  - MTTR targets (20-40% headroom for replication)
Determining broker counts (sketched in code below):
storage: n = fullRetention / (storagePerNode * 0.8)
network: n = consumerDemand / (bwPerNode * 0.6)
pool size = max(ceil(storage), ceil(network))
(we do a pretty good job at actually hitting this)
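A minimal sketch of the broker-count math above; the function and parameter names are illustrative, not part of Kafka-Kit:

```go
package main

import (
	"fmt"
	"math"
)

// poolSize returns the broker count for a pool given full-retention storage
// (GB), per-node storage (GB), aggregate consumer demand (MB/s), and per-node
// network bandwidth (MB/s). The 0.8 and 0.6 factors reserve storage and
// network headroom, matching the formulas on the slide.
func poolSize(fullRetentionGB, storagePerNodeGB, consumerDemandMBps, bwPerNodeMBps float64) int {
	storage := fullRetentionGB / (storagePerNodeGB * 0.8)
	network := consumerDemandMBps / (bwPerNodeMBps * 0.6)
	return int(math.Max(math.Ceil(storage), math.Ceil(network)))
}

func main() {
	// Hypothetical pool: 400 TB retained, 16 TB per node, 8 GB/s consumer
	// demand, 1.25 GB/s (10 Gb/s) per-node network.
	fmt.Println(poolSize(400_000, 16_000, 8_000, 1_250)) // 32 brokers (storage-bound)
}
```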
Instance Types
- If there's a huge delta between the counts required for network vs. storage: probably the wrong type
- Remember: sequential, bandwidth-bound workloads
- AWS: d2, i3, h1 class
Instance Types (AWS)
d2: the spinning rust is actually great
Good for:
- Storage/$
- Retention biased workloads
Problems:
- Disk bw far exceeds network
- Long MTTRs
Instance Types (AWS)
h1: a modernized d2?
Good for:
- Storage/$
- Balanced, lower retention / high throughput workloads
Problems:
- ENA network exceeds disk throughput
- Recovery times are disk-bound
- Disk bw / node < d2
Instance Types (AWS)
i3: bandwidth monster
Good for:
- Low MTTRs
- Concurrent i/o outside the page cache
Problems:
- storage/$
Instance Types (AWS)
r4, c5, etc. + gp2 EBS
It actually works well; EBS perf isn't a problem
Problems:
- low EBS channel bandwidth in relation to instance size
- the burden of running a distributed/replicated store, hinging it on tech that solves 2009 problems
- may want to consider Kinesis / etc.?
Data Placement
topicmappr optimizes for:
- maximum leadership distribution
- replica rack.id isolation
- maximum replica list entropy
Data Placement
Maximum replica list entropy(?)
"For all partitions that a given broker holds, ensuring that the partition replicas are distributed among as many other unique brokers as possible"
Data Placement
Maximum replica list entropy
It's possible to have maximal partition distribution but a low number of unique broker-to-broker relationships
Example: broker A holds 20 partitions, yet all 20 replica sets contain only 3 other brokers
Data Placement
Maximum replica list entropy!
- topicmappr expresses this as node degree distribution
- broker-to-broker relationships: it's a graph
- replica sets are partial adjacency lists
Data Placement
A graph of replicas
Given the following partition replica sets:
p0: [1, 2, 3]
p1: [2, 3, 4]
Broker 3's adjacency list -> [1, 2, 4] (degree = 3)
(a code sketch of this follows)
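A minimal sketch of the degree computation described above; the types and function names are illustrative, not topicmappr's actual API:

```go
package main

import "fmt"

// brokerDegrees builds each broker's adjacency set from partition replica
// lists (each replica list is a partial adjacency list) and returns the
// resulting node degree per broker.
func brokerDegrees(replicaSets [][]int) map[int]int {
	adj := map[int]map[int]struct{}{}
	for _, rs := range replicaSets {
		for _, a := range rs {
			if adj[a] == nil {
				adj[a] = map[int]struct{}{}
			}
			for _, b := range rs {
				if a != b {
					adj[a][b] = struct{}{}
				}
			}
		}
	}
	degrees := map[int]int{}
	for broker, peers := range adj {
		degrees[broker] = len(peers)
	}
	return degrees
}

func main() {
	// p0: [1, 2, 3], p1: [2, 3, 4] -> broker 3 is adjacent to {1, 2, 4}.
	fmt.Println(brokerDegrees([][]int{{1, 2, 3}, {2, 3, 4}})[3]) // 3
}
```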
Data Placement
Maximizing replica list entropy (is good)
In broker failures/replacements:
- probabilistically increases replication sources
- faster, lower-impact recoveries
Data Placement
topicmappr optimizes for:
- maximum leadership distribution ✅
- replica rack.id isolation ✅
- maximum replica list entropy ✅
Maintaining Pools
Most common tasks:
- ensuring storage balance
- simple broker replacements
Both of these are (also) done with topicmappr
Maintaining Pools
Broker storage balance
- finds offload candidates: n distance below the harmonic mean storage free
- plans relocations to least-utilized brokers
- fair-share, first-fit descending bin-packing
- loops until no more relocations can be planned (sketched below)
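A simplified sketch of one offload pass as described above, assuming candidates have already been split into over-utilized sources and destination brokers at or above the pool's mean free storage; the names are illustrative, not topicmappr internals (topicmappr repeats passes like this until nothing more fits):

```go
package main

import (
	"fmt"
	"sort"
)

type partition struct {
	id     string
	sizeGB float64
}

// planRelocations places partitions from offload candidates onto the
// least-utilized brokers, largest partition first (first-fit descending),
// never pushing a destination below the target mean free storage.
func planRelocations(free map[string]float64, offload map[string][]partition, meanFreeGB float64) map[string]string {
	moves := map[string]string{} // partition id -> destination broker

	// Destination candidates: brokers at or above the mean free storage.
	var dests []string
	for b, f := range free {
		if f >= meanFreeGB {
			dests = append(dests, b)
		}
	}

	for src, parts := range offload {
		// First-fit descending: try the largest partitions first.
		sort.Slice(parts, func(i, j int) bool { return parts[i].sizeGB > parts[j].sizeGB })
		for _, p := range parts {
			// Prefer the least-utilized (most free) destination that still fits.
			sort.Slice(dests, func(i, j int) bool { return free[dests[i]] > free[dests[j]] })
			for _, d := range dests {
				if d != src && free[d]-p.sizeGB >= meanFreeGB {
					moves[p.id] = d
					free[d] -= p.sizeGB
					free[src] += p.sizeGB
					break
				}
			}
		}
	}
	return moves
}

func main() {
	free := map[string]float64{"b1": 200, "b2": 1200, "b3": 1400}
	offload := map[string][]partition{"b1": {{"t0-3", 300}, {"t0-7", 150}}}
	fmt.Println(planRelocations(free, offload, 600)) // map[t0-3:b3 t0-7:b2]
}
```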
Maintaining Pools
Broker storage balance (results)
Maintaining Pools
Broker replacements
- When a single broker fails, how is a replacement chosen?
- Goal: retain any previously computed storage balance (via 1:1 replacements)
- Problem: dead brokers are no longer visible in ZK
- topicmappr can be provided several hot spares from varying AZs (rack.id)
- it infers a suitable replacement (the "substitution affinity" feature)
Maintaining Pools
Broker replacements - inferring replacements
- traverse all ISRs, build a set of all rack.ids: G = {1a,1b,1c,1d}
- traverse affected ISRs, build a set of live rack.ids: L = {1a,1c}
- build the set of suitable rack.ids to choose from: S = { x ∈ G | x ∉ L } = {1b,1d}
- automatically chooses a hot spare from 1b or 1d (sketched below)
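A minimal sketch of the set difference above; the helper name is illustrative, not the actual substitution-affinity code:

```go
package main

import "fmt"

// suitableRackIDs returns rack.ids seen across all ISRs (G) but absent from
// the ISRs affected by the dead broker (L): S = G \ L. A hot spare is then
// chosen from a rack.id in S.
func suitableRackIDs(all, live []string) []string {
	inLive := map[string]bool{}
	for _, r := range live {
		inLive[r] = true
	}
	var s []string
	for _, r := range all {
		if !inLive[r] {
			s = append(s, r)
		}
	}
	return s
}

func main() {
	G := []string{"1a", "1b", "1c", "1d"}
	L := []string{"1a", "1c"}
	fmt.Println(suitableRackIDs(G, L)) // [1b 1d]
}
```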
Maintaining Pools
Broker replacements - inferring replacements
Outcome:
- Keeps brokers bound to specific pools
- Simple repairs that maintain storage balance and high utilization
Scaling Pools
When: >90% storage utilization in 48h
How:
- add brokers to the pool
- run a rebalance
- autothrottle takes over
(autothrottle is a service that dynamically manages replication rates)
Increasing capacity also improves storage balance
What's Next
- precursor to fully automated capacity management
- continued growth, dozens more clusters
- new infrastructure
- (we're hiring)
Thank you
Jamie Alquiza
Sr. Software Engineer
twitter.com/jamiealquiza
