Kafka cheatsheet
Concepts
Kafka is a distributed, replicated commit log:
- Distributed: Kafka is deployed as a cluster of nodes, for both fault tolerance and scale
- Replicated: messages are usually replicated across multiple nodes
- Commit Log: messages are stored in partitioned, append-only logs called Topics
Kafka components:
- A Kafka cluster stores messages in categories called Topics
- A cluster consists of one or more servers called Brokers
- Each Topic maintains a partitioned log
- A Topic may have many partitions; each partition is the unit of parallelism
- A partition's Replicas are the Brokers that replicate the log for that partition
Producers and consumers
Producers are client applications that publish events to Kafka.
Consumers are client applications that subscribe to and process these events.
In Kafka, producers and consumers are fully decoupled and agnostic of each other, which is a key design element for achieving high scalability (producers never need to wait for consumers).
Topics
Events are organized and durably stored in Topics.
Topics are split into multiple Partitions. For each partition, one replica in the cluster acts as the leader; the remaining replicas are followers.
+---------+ +---------+
| Topic A | | Topic B |
| | | |
|+-------+| |+-------+|
||Part 0 || ||Part 0 ||
|+-------+| |+-------+|
| | | |
|+-------+| |+-------+|
||Part 1 || ||Part 1 ||
|+-------+| |+-------+|
| | | |
|+-------+| +---------+
||Part 2 ||
|+-------+|
| |
+---------+
Each partition is a separate data structure that guarantees message ordering, i.e. message ordering is only guaranteed within a single partition.
The producer simply appends each message to the end of a partition.
Multiple consumers can consume from the same partition, each reading from its own offset.
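For keyed messages, the partition is chosen by hashing the key, so all messages with the same key land in the same partition (and thus stay ordered relative to each other). A minimal sketch, using String.hashCode() as an illustrative stand-in for the murmur2 hash that Kafka's default partitioner actually applies to the serialized key:

```java
// Simplified sketch of keyed partition selection: partition = hash(key) % N.
// Kafka's default partitioner hashes the serialized key with murmur2;
// String.hashCode() here is just an illustrative stand-in.
public class KeyedPartitioner {
    public static int partitionFor(String key, int numPartitions) {
        // Math.floorMod keeps the result in [0, numPartitions),
        // even for negative hash codes
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition
        System.out.println(partitionFor("user-42", 3));
        System.out.println(partitionFor("user-42", 3));
    }
}
```

Messages without a key are instead spread across partitions (round-robin or sticky batching, depending on the client version), so they have no ordering guarantee across partitions.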
+--------+ writes
|producer|-----------+
+--------+ |
partition 1 v
+-+-+-+-+-+-+-+-+
| | | | | | | | |
+-+-+-+-+-+-+-+-+
^ ^
| | +----------+
| +----------|consumer 1|
| reads +----------+
| +----------+
+-----------|consumer 2|
reads +----------+
Consumer groups
A consumer group may contain multiple consumers, where each consumer in the group processes a subset of all the messages in the topic.
1 producer, 3 partitions and 1 consumer group with 3 consumers:
+--------+
|producer|
+--------+
|
+-----+-----+
| | |
v v v
+----------------+
|+--+ +--+ +--+|
Topic A ||P0| |P1| |P2||
|+--+ +--+ +--+|
+----------------+
| | |
v v v
+----------------+
|+--+ +--+ +--+|
||C0| |C1| |C2||
|+--+ +--+ +--+|
+----------------+
consumer group
1 producer, 5 partitions and 1 consumer group with 3 consumers:
+--------+
|producer|
+--------+
|
+-------+-------+-------+-------+
| | | | |
v v v v v
+--------------------------------------+
|+----+ +----+ +----+ +----+ +----+|
Topic A || P0 | | P1 | | P2 | | P3 | | P4 ||
|+----+ +----+ +----+ +----+ +----+|
+--------------------------------------+
| | | | |
+---+---+ | +---+---+
| | |
v v v
+------------------------------+
| +--+ +--+ +--+ |
| |C0| |C1| |C2| |
| +--+ +--+ +--+ |
+------------------------------+
consumer group
One idle consumer:
+--------+
|producer|
+--------+
|
+-------+-------+
| | |
v v v
+----------------------+
|+----+ +----+ +----+|
Topic A || P0 | | P1 | | P2 ||
|+----+ +----+ +----+|
+----------------------+
| | |
v v v
+---------------------------+
|+--+ +--+ +--+ +--+|
||C0| |C1| |C2| |C3||
|+--+ +--+ +--+ +--+|
+---------------------------+
consumer group
Rebalancing / Repartition
Rebalancing is automatically triggered after:
- a consumer joins a consumer group
- a consumer leaves a consumer group
- new partitions are added
Rebalancing will cause a short period of extra latency while consumers stop reading batches of messages and get assigned to different partitions.
:warning: Upon rebalancing, any in-memory data becomes useless unless the consumer gets assigned back to the same partition. Consumers therefore need to implement re-partitioning logic: either maintain the state by persisting it externally, or remove it from memory.
Consumers can implement the interface org.apache.kafka.clients.consumer.ConsumerRebalanceListener, a listener that is called when a rebalancing occurs.
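The state handling such a listener typically performs can be sketched with a stdlib-only simulation (the real callbacks are onPartitionsRevoked/onPartitionsAssigned on ConsumerRebalanceListener; here they are plain methods so the logic runs standalone, and the "external store" is a plain map standing in for a real database):

```java
import java.util.*;

// Stdlib-only sketch of the state handling a rebalance listener performs.
// The real callbacks come from org.apache.kafka.clients.consumer
// .ConsumerRebalanceListener; here they are plain methods so the logic
// runs standalone.
public class StatefulRebalanceHandler {
    // In-memory, per-partition state (e.g. running aggregates)
    private final Map<Integer, String> stateByPartition = new HashMap<>();
    // Hypothetical external store used to survive rebalances
    private final Map<Integer, String> externalStore = new HashMap<>();

    public void put(int partition, String state) {
        stateByPartition.put(partition, state);
    }

    // Called before partitions are taken away: persist, then drop local state
    public void onPartitionsRevoked(Collection<Integer> partitions) {
        for (int p : partitions) {
            String s = stateByPartition.remove(p);
            if (s != null) externalStore.put(p, s);
        }
    }

    // Called when partitions are (re)assigned: restore any persisted state
    public void onPartitionsAssigned(Collection<Integer> partitions) {
        for (int p : partitions) {
            String s = externalStore.get(p);
            if (s != null) stateByPartition.put(p, s);
        }
    }

    public String stateOf(int partition) {
        return stateByPartition.get(partition);
    }

    public static void main(String[] args) {
        StatefulRebalanceHandler h = new StatefulRebalanceHandler();
        h.put(0, "count=7");
        h.onPartitionsRevoked(List.of(0));   // state persisted, memory cleared
        h.onPartitionsAssigned(List.of(0));  // state restored from the store
        System.out.println(h.stateOf(0));    // count=7
    }
}
```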
Partition assignment strategies
The Kafka clients library provides 3 built-in strategies: RangeAssignor, RoundRobinAssignor and StickyAssignor.
RangeAssignor
RangeAssignor is the default strategy.
This strategy is useful for joining records from two topics that have the same number of partitions and the same key-partitioning logic.
Topic A Topic B
+-----------+ +-----------+
| A0 A1 | | B0 B1 |
| +--+ +--+ | | +--+ +--+ |
| | | | | | | | | | | |
| | | | | | | | | | | |
| +--+ +--+ | | +--+ +--+ |
+-----------+ +-----------+
| | | |
+++-----------+ |
|| |+-----------+
|| ||
vv vv
+----------------+
| +--+ +--+ +--+ |
| |C1| |C2| |C3| |
| +--+ +--+ +--+ |
+----------------+
consumer group
Assignments:
C1 = {A0, B0}
C2 = {A1, B1}
C3 = {}
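The assignment above can be reproduced with a small stdlib-only simulation of the per-topic range logic (an illustrative sketch, not the real RangeAssignor code): for each topic, sorted partitions are split into contiguous ranges over the sorted consumers, with the first (partitions % consumers) consumers getting one extra partition.

```java
import java.util.*;

// Stdlib-only simulation of the RangeAssignor's per-topic logic.
public class RangeAssignment {
    public static Map<String, List<String>> assign(List<String> consumers,
                                                   Map<String, Integer> topics) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String c : consumers) result.put(c, new ArrayList<>());
        for (Map.Entry<String, Integer> t : topics.entrySet()) {
            int n = t.getValue(), k = consumers.size();
            int per = n / k, extra = n % k, start = 0;
            for (int i = 0; i < k; i++) {
                int count = per + (i < extra ? 1 : 0);
                for (int p = start; p < start + count; p++)
                    result.get(consumers.get(i)).add(t.getKey() + p);
                start += count;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> topics = new LinkedHashMap<>();
        topics.put("A", 2); // partitions A0, A1
        topics.put("B", 2); // partitions B0, B1
        System.out.println(assign(List.of("C1", "C2", "C3"), topics));
        // {C1=[A0, B0], C2=[A1, B1], C3=[]}
    }
}
```

Because ranges are computed per topic, the first consumers get the low-numbered partitions of every topic, which is what keeps co-partitioned keys together on one consumer (and what leaves C3 idle here).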
RoundRobinAssignor
Maximizes the number of consumers used, but does not attempt to reduce partition movements when the number of consumers changes.
Topic A Topic B
+-----------+ +-----------+
| A0 A1 | | B0 B1 |
| +--+ +--+ | | +--+ +--+ |
| | | | | | | | | | | |
| | | | | | | | | | | |
| +--+ +--+ | | +--+ +--+ |
+-----------+ +-----------+
| | | |
+-+----------------+
| | |
| | |
| | +--+
| | |
v v v
+----------------+
| +--+ +--+ +--+ |
| |C1| |C2| |C3| |
| +--+ +--+ +--+ |
+----------------+
consumer group
Assignments:
C1 = {A0, B1}
C2 = {A1}
C3 = {B0}
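This assignment can likewise be reproduced with a stdlib-only simulation of the round-robin logic (a sketch, not the real RoundRobinAssignor code): all partitions of all subscribed topics are sorted, then dealt one by one to the consumers in a circle.

```java
import java.util.*;

// Stdlib-only simulation of the RoundRobinAssignor's logic: sort all
// partitions across topics, then deal them out to consumers in turn.
public class RoundRobinAssignment {
    public static Map<String, List<String>> assign(List<String> consumers,
                                                   List<String> partitions) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String c : consumers) result.put(c, new ArrayList<>());
        List<String> sorted = new ArrayList<>(partitions);
        Collections.sort(sorted);
        int i = 0;
        for (String p : sorted) {
            result.get(consumers.get(i % consumers.size())).add(p);
            i++;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(assign(List.of("C1", "C2", "C3"),
                                  List.of("A0", "A1", "B0", "B1")));
        // {C1=[A0, B1], C2=[A1], C3=[B0]}
    }
}
```

Note that because the deal restarts from scratch whenever the group changes, removing one consumer can move partitions between consumers that were unaffected by the failure.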
Topic A Topic B
+-----------+ +-----------+
| A0 A1 | | B0 B1 |
| +--+ +--+ | | +--+ +--+ |
| | | | | | | | | | | |
| | | | | | | | | | | |
| +--+ +--+ | | +--+ +--+ |
+-----------+ +-----------+
| | | |
+-+-----------+ |
| | |
| | |
| +----+-------+
| |
v v
+----------------+
| +--+ +--+ +--+ |
| |C1| |XX| |C3| |
| +--+ +--+ +--+ |
+----------------+
consumer group
C2 dies; new assignments:
C1 = {A0, B0}
C2 = {}
C3 = {A1, B1}
⇒ such unnecessary partition movements may impact consumer performance.
StickyAssignor
Same as RoundRobinAssignor, but tries to minimize partition movements between two assignments.
Same scenario as RoundRobinAssignor, and if C2 dies / leaves:
Topic A Topic B
+-----------+ +-----------+
| A0 A1 | | B0 B1 |
| +--+ +--+ | | +--+ +--+ |
| | | | | | | | | | | |
| | | | | | | | | | | |
| +--+ +--+ | | +--+ +--+ |
+-----------+ +-----------+
| | | |
+-+----------------+
| | |
| | |
| +----+--+
| |
v v
+----------------+
| +--+ +--+ +--+ |
| |C1| |XX| |C3| |
| +--+ +--+ +--+ |
+----------------+
consumer group
C2 dies; new assignments:
C1 = {A0, B1}
C2 = {}
C3 = {A1, B0}
C1 stays assigned to the same partitions.
Resources
Useful commands
Using Kafka client scripts
Using Docker
Same as using Kafka client scripts, but using the docker image:
Using kafkactl
In Java
Add kafka-clients dependency:
Producer:
Consumer:
docker-compose
Single node
Cluster of 3 nodes
Note: we cannot use the docker-compose --scale
feature, as each Kafka node needs its own
advertised listeners and ZooKeeper connection configuration.
SASL + SSL
Generate certificates
Execute the script in a folder /path/to/kafka/secrets.
Create JAAS config files for authentication
Add the JAAS conf files in the same /path/to/kafka/secrets
folder.
zk_jaas.conf:
kafka_jaas.conf:
Launch Kafka in SASL_SSL mode
Connect with Kafka scripts
Create client_security.properties with the following content to connect to Kafka in SASL_SSL:
Then:
Connect using Kafkactl
Add the following context:
Connect using Java
Web UI
Using Kowl as a web UI for exploring messages, consumers, configurations:
Create a config.yml with the following content:
Add the service in the docker-compose.yml:
Partitions
How to choose the number of partitions for a topic
Simple formula
#Partitions = max(Np, Nc), where:
- Np is the number of required producers, determined by calculating Tt/Tp
- Nc is the number of required consumers, determined by calculating Tt/Tc
- Tt is the total expected throughput for the system
- Tp is the max throughput of a single producer to a single partition
- Tc is the max throughput of a single consumer from a single partition
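A worked example of the formula, with illustrative numbers (Tt = 100 MB/s, Tp = 10 MB/s, Tc = 5 MB/s are assumptions for the sake of the example, not measurements):

```java
// Worked example of #Partitions = max(Np, Nc).
// With Tt = 100 MB/s total, Tp = 10 MB/s per producer-partition and
// Tc = 5 MB/s per consumer-partition: Np = 10, Nc = 20, so 20 partitions.
public class PartitionCount {
    public static int partitions(double tt, double tp, double tc) {
        int np = (int) Math.ceil(tt / tp); // producers needed: Tt/Tp
        int nc = (int) Math.ceil(tt / tc); // consumers needed: Tt/Tc
        return Math.max(np, nc);
    }

    public static void main(String[] args) {
        System.out.println(partitions(100, 10, 5)); // 20
    }
}
```

Since the consumer side is the bottleneck here, the partition count is driven by Nc: fewer than 20 partitions would cap consumer parallelism below the required throughput.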
Considerations
- more partitions lead to higher throughput
- more partitions requires more open file handles
- more partitions may increase unavailability
- more partitions may increase end-to-end latency
- more partitions may require more memory in the client
Sources
- https://docs.cloudera.com/runtime/7.2.0/kafka-performance-tuning/topics/kafka-tune-sizing-partition-number.html
- https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster/
- https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
Add partitions to an existing topic
However, be aware of re-partitioning when using keys:
One use case for partitions is to semantically partition data, and adding partitions doesn't change the partitioning of existing data, so this may disturb consumers that rely on that partitioning. That is, if data is partitioned by hash(key) % number_of_partitions, then this partitioning will potentially be shuffled by adding partitions, but Kafka will not attempt to automatically redistribute data in any way.
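The shuffle can be demonstrated with a short stdlib-only sketch (String.hashCode() stands in for Kafka's murmur2 hash, and the key names are made up):

```java
import java.util.*;

// Shows why adding partitions shuffles the key-to-partition mapping:
// hash(key) % N changes when N changes. String.hashCode() is a
// stand-in for Kafka's murmur2 hash.
public class PartitionShuffle {
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        List<String> keys = List.of("alice", "bob", "carol", "dave");
        for (String k : keys) {
            // The same key can map to a different partition once the
            // topic grows from 3 to 4 partitions; old records stay where
            // they were written, new records go to the new partition.
            System.out.println(k + ": " + partitionFor(k, 3)
                               + " -> " + partitionFor(k, 4));
        }
    }
}
```

If strict per-key ordering matters, plan the partition count up front or route through a new topic instead of expanding the existing one.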