A beginner's guide to balancing your data across Apache Kafka partitions
Abstract
- the number of partitions depends on how data is consumed
- find the key with the highest cardinality and the most critical ordering
- keep data distribution over time in mind
- observe and monitor
- be ready for change
Partitioning:
- a mechanism to distribute the topic data across multiple brokers
- a way to parallelize the production and consumption of the messages
- the key to scale the system horizontally
Distributed systems are complex:
- durable messaging
- concurrent distributed transactions
- HA, scaling, security
Measuring performance:
- throughput: the number of messages that go through the system in a given amount of time
- how many people fit into a bus
- latency: overall time it takes to process each message
- total time for a traveler to complete the journey
- lag: delta between the last produced message and the last consumer’s committed message
- how far the last passenger is behind the front of the arriving group
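The three metrics can be sketched with hypothetical per-message timestamps and offsets (all numbers below are made up for illustration):

```python
# Sketch: computing throughput, latency, and lag from hypothetical data.
# Timestamps are in seconds; offsets are per-partition positions.

produce_times = {101: 0.0, 102: 0.5, 103: 1.0, 104: 1.5}   # offset -> produced at
consume_times = {101: 0.2, 102: 0.9, 103: 1.6}             # offset -> processed at

window_seconds = 2.0
throughput = len(consume_times) / window_seconds            # messages per second

# Latency: per-message time from production to processing.
latencies = [consume_times[o] - produce_times[o] for o in consume_times]
avg_latency = sum(latencies) / len(latencies)

# Lag: last produced offset minus last committed offset.
last_produced_offset = max(produce_times)                   # 104
last_committed_offset = max(consume_times)                  # 103
lag = last_produced_offset - last_committed_offset          # 1 message behind
```

A growing `lag` under steady production is the usual first sign that you have too few partitions or too few consumers.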
How many partitions do you need?
Having too few partitions:
- low throughput
- increase in lag
- caps the maximum number of consumers in a group (at most one consumer per partition) - leads to struggling consumers
Too many partitions:
- backups, upgrades, and other admin operations require more effort
- more potential consumer rebalances
- producers are slower at batching and distributing data, compression is less effective, etc.
Start with the end in mind.
Default partitioners
With the hash-based partitioner, changing the number of partitions may compromise the order of records, because the
hash(key) % number of partitions
formula may no longer send a record to the same partition it used before the change. Sticky partitioning may improve performance by ~50%.
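The formula can be sketched in Python; here `zlib.crc32` stands in for the Java client's murmur2 hash (the idea is the same, the actual hash function differs):

```python
import zlib

def pick_partition(key: bytes, num_partitions: int) -> int:
    # Stand-in for the default hash-based partitioner:
    # same idea (hash(key) % number of partitions), different hash function.
    return zlib.crc32(key) % num_partitions

keys = [b"user-1", b"user-2", b"user-3", b"user-4"]

before = {k: pick_partition(k, 6) for k in keys}   # topic with 6 partitions
after = {k: pick_partition(k, 8) for k in keys}    # topic grown to 8 partitions

# Keys whose partition changed: per-key ordering guarantees no longer
# hold across the boundary of the partition-count change for these keys.
moved = [k for k in keys if before[k] != after[k]]
```

Because the modulus changed, most keys map to a different partition after the change, which is why growing a keyed topic silently breaks ordering.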
The relevant producer settings are linger.ms (default is 0 ms) and batch.size (default is 16 KB).

Unbalanced partitions
How to avoid?
- pick the key with the highest cardinality and the most critical ordering
Data is not produced evenly over time, so we can tweak the linger.ms and batch.size settings.

Sticky partitioner problem, how to solve:
- Kafka >= 3.3:
  - set partitioner.adaptive.partitioning.enable=true
  - set partitioner.availability.timeout.ms > 0
- Kafka < 3.3:
  - increase linger.ms
  - rebalance partitions on a regular basis
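The producer settings above could look like this (Java-client property names; the values are illustrative, not recommendations):

```properties
# Give the producer time to fill batches (default linger.ms=0).
linger.ms=50
# Batch size in bytes (default 16384, i.e. 16 KB).
batch.size=32768
# Kafka >= 3.3 only: adaptive partitioning sends fewer records to slow brokers.
partitioner.adaptive.partitioning.enable=true
# A value > 0 excludes partitions whose broker does not respond in time.
partitioner.availability.timeout.ms=200
```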
Another cause: the fetch size is too low on the consumer side.
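Consumer-side fetch behavior is tuned with these properties (Java-client names; the values are illustrative):

```properties
# Minimum amount of data the broker returns per fetch (default 1 byte).
fetch.min.bytes=1024
# Upper bound per partition per fetch (default 1048576, i.e. 1 MB).
max.partition.fetch.bytes=2097152
```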
Measuring and monitoring
- consumer lag
- under-replicated partitions
- fetch-latency-avg and fetch-latency-max
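Consumer lag can be inspected with the CLI tool that ships with Kafka (the group name and broker address below are placeholders); the LAG column is the per-partition delta described above:

```shell
# Per-partition committed offset, log end offset, and lag for one group:
bin/kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group
```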
Many hot partitions
- create a new topic and move the data
- useful for
- rebalancing records across partitions
- disaster recovery
- scaling up and down
- changing schema
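Creating the replacement topic is done with the standard CLI (topic name and counts below are placeholders); the data move itself is a consume/produce job, e.g. MirrorMaker 2 or a small custom copier:

```shell
# Create the replacement topic with the new partition count:
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic orders-v2 --partitions 12 --replication-factor 3

# Then copy records over with a consume/produce job and
# switch producers and consumers to the new topic.
```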