Beginner's guide to balancing your data across Apache Kafka partitions

Abstract

  • number of partitions depends on how data is consumed
  • find the key that gives the highest cardinality while preserving the most critical ordering
  • keep in mind data distribution over time
  • observe and monitor
  • be ready for change

Partitioning:

  • a mechanism to distribute the topic data across multiple brokers
  • a way to parallelize the production and consumption of the messages
  • the key to scale the system horizontally

Distributed systems are complex:

  • durable messaging
  • concurrent distributed transaction
  • HA, scaling, security

Measuring performance:

  • throughput: number of messages that go through the system in a given amount of time
    • how many people fit into a bus
  • latency: overall time it takes to process each message
    • total time for a traveler to complete the journey
  • lag: delta between the last produced offset and the consumer's last committed offset
    • how far is the farthest passenger behind the front of the arriving group
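The three metrics above can be computed directly from offsets and timestamps. A minimal sketch with made-up numbers (all values below are hypothetical, just to show the arithmetic):

```python
# Illustrative computation of throughput, latency, and lag.
# All numbers are invented for the example.

# Throughput: messages processed per unit of time.
messages_processed = 120_000
window_seconds = 60
throughput = messages_processed / window_seconds  # 2000 msg/s

# Latency: time from produce to the end of processing for one message.
produce_ts_ms = 1_700_000_000_000
processed_ts_ms = 1_700_000_000_250
latency_ms = processed_ts_ms - produce_ts_ms  # 250 ms

# Lag: last produced offset minus the consumer's last committed offset.
log_end_offset = 1_050_000
committed_offset = 1_020_000
lag = log_end_offset - committed_offset  # 30000 messages behind

print(throughput, latency_ms, lag)
```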

How many partitions do you need?

Having too few partitions:

  • low throughput
  • increase in lag
  • caps the maximum number of consumers in a group (each partition is served by at most one consumer) - leads to struggling consumers
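The consumer cap follows from the assignment rule: within one group, a partition goes to at most one consumer, so any consumer beyond the partition count sits idle. A simplified round-robin assignment sketch (Kafka's real assignors are more sophisticated, but the cap is the same):

```python
def assign(partitions, consumers):
    """Simplified round-robin assignment: each partition to exactly one consumer.
    Illustrative stand-in for Kafka's partition assignors."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = list(range(3))           # topic with 3 partitions
consumers = ["c1", "c2", "c3", "c4"]  # group with 4 consumers

result = assign(partitions, consumers)
idle = [c for c, ps in result.items() if not ps]
print(idle)  # ['c4'] - the fourth consumer gets nothing to do
```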

Too many partitions:

  • backups, upgrades, and other admin operations require more effort
  • more potential consumer rebalances
  • producers batch and distribute data less efficiently, compression becomes less effective, etc.

Start with the end in mind.

Default partitioners

With the hash-based partitioner, changing the number of partitions can compromise record ordering: the hash(key) % number_of_partitions formula may route a key to a different partition than it did before the change, so old and new records for the same key end up in different partitions.

Sticky partitioning may improve performance by ~50%.

  • linger.ms (default is 0ms)
  • batch.size (default is 16KB, i.e. 16384 bytes)
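These two settings control producer batching. A sketch of a tuned configuration, written as a plain dict using the Apache Kafka producer config names (the values are illustrative, not recommendations):

```python
# Producer batching knobs (config names from the Apache Kafka producer;
# values chosen only to illustrate the trade-off).
producer_config = {
    "linger.ms": 10,       # default 0: wait up to 10 ms to fill a batch
    "batch.size": 32_768,  # default 16384 bytes (16 KB); doubled here
}
# Larger batches trade a little latency (linger.ms) for better throughput
# and more effective compression.
print(producer_config)
```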

Unbalanced partitions

How to avoid?

  • find the highest cardinality, with most critical ordering

Data is not produced evenly over time, so we can tweak the linger.ms and batch.size settings.

Sticky partitioner problem, how to solve it:

  • kafka >= 3.3
    • partitioner.adaptive.partitioning.enable=true
    • partitioner.availability.timeout.ms > 0
  • kafka < 3.3
    • increase linger.ms
    • rebalance partitions on a regular basis
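The two branches above can be sketched as producer config fragments (plain dicts with the Apache Kafka config names; the timeout and linger values are illustrative):

```python
# Kafka >= 3.3 (KIP-794): adaptive partitioning reacts to slow brokers.
fix_new = {
    "partitioner.adaptive.partitioning.enable": True,  # default true in 3.3+
    "partitioner.availability.timeout.ms": 500,        # > 0 enables the check; value illustrative
}

# Kafka < 3.3: no adaptive partitioner, so mitigate instead.
fix_old = {
    "linger.ms": 25,  # illustrative: fuller batches smooth out sticky bursts
}
print(fix_new, fix_old)
```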

Also watch out for a fetch size that is too low on the consumer side.

Measuring and monitoring

  • consumer lag
  • under-replicated partitions
  • fetch-latency-avg and fetch-latency-max

Many hot partitions

  • create a new topic and move the data
  • useful for
    • rebalancing records across partitions
    • disaster recovery
    • scaling up and down
    • changing schema
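Moving data to a new topic amounts to re-producing every record, letting the partitioner spread keys over the new partition count. A pure-Python simulation of that redistribution (the toy hash stands in for the producer's partitioner; records are hypothetical):

```python
from collections import defaultdict

def route(records, num_partitions, hash_fn):
    """Group keyed records into partitions by hash(key) % num_partitions."""
    parts = defaultdict(list)
    for key, value in records:
        parts[hash_fn(key) % num_partitions].append((key, value))
    return dict(parts)

# Hypothetical keyed records; hash_fn is a toy stand-in for the partitioner.
records = [(f"user-{i % 5}", i) for i in range(20)]
hash_fn = lambda k: sum(k.encode())

old_topic = route(records, 2, hash_fn)  # hot: too few partitions
new_topic = route(records, 4, hash_fn)  # re-produced into more partitions

# Per-key ordering survives the move: each key still lands in one partition.
print({p: len(rs) for p, rs in sorted(new_topic.items())})
```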