What is this post about?
This post has the intention of introducing Kafka and the reference for all you are going to find here is this documentation
Later on I will try to dig deeper and share my understanding with you guys. I hope this post could easily communicate with people from different technical background and will receive feedbacks as well.
How Kafka is working?
Kafka core API:
Using this API we can send messages to Kafka
Using this API we can retrieve messages from Kafka(or retrieve data). Usually they are long running background thread or jobs that are consuming messages.
Streams are something like streams in Java 8, but they are iterating over Topics. Topics is generally a concept/category which sit between Producers and Consumers. Mainly Relation of Producer and Topic is Many-to-One (Many producers … One Topic, Although producers can send meesages to multiple topics asynchronously) and Relation of Topic and Consumer is One-to-Many(One Topic … Many Consumers). Getting back to Stream, as a user it is possible to write a transformation from input stream to output stream. for example consuming data from multiple topics and transform and joined those data into one Topic.
Connectors are basically small piece of code which are running inside the Kafka Clusters or Kafka broker with the intention of fetching data from external services or external system like database or vice versa transfer(export) some data to external services or resources.
Communication between client and server is based on TCP protocol.
Topic and Partitioned Log
In short they are the building blocks of Kafka. Topic is a structure which is responsible for storing receiving data. there should be some other sort of structure which is responsible for keeping the info of those who are interested in these receiving data. those component interested in receiving data are called subscriber/ consumer. A topic can have zero, one, or many consumers. also there should be a function who notify all subscriber.
In Kafka any thing received are saved as a Record. So we generally deal with streams of records. In Kafka for a topic we can have Partition. it can be one or many of them. So data received are saved as a record in a partition inside of the topic. The partition can be only apended(once a record put in partition it cannot be changed) and the records are ordered.The order of receiving recors(messages) are preserved(ordering applied per partition).For each record a unique incremental id is assigned which is called an offset. Offset zero is older than offset 1, so new record is added at the end of the partition.
The Kafka cluster durably persists all published records. Durability of Kafka means that if a message is sent to the topic but some of those subscriber due to some reason went offline, they will receive such messages when they came back online. the persistence means that if failure occurs during the message processing the message is still be in Kafka cluster and it can be processed again. [whether or not the messages are consumed - using the retention period]
reference to Kafka/Intro: e.g., If the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space.
It is the consumer who will decide how to read data(offset) depends on their need they can read the most recent data or specific offset.
For each topic, the Kafka cluster maintains a partitioned log: my_topic has a Log and a Log hast 3 partiotion: Partition0, Partition1, Partition2
A Kafka cluster can have several servers. A partition in a log can be hosted in differnt server. and can be replicated across a configurable number of servers for fault tolerance.
“Leader” and “Follower” are basically a simple concept which is telling the Kafka cluster who is the first option to read from or write in. These concepts are applied to partition. it means that a server who host that partition is acting as a leader and is the first choice when it comes to read data or write in some data by Kafka cluster. if the server who is a leader is failed then one of the follower automatically becomes a leader. each server can act as a leader for some of its partition and act as a follower for some of its partition.
Consumer and Consumer group
- Consumer group as it name shows is way to grouping the Consumers. Consumer group is the one who subscribe itself to a topic. and when it comes to broad casting the messages are broadcasted to one instance within the consumer group. consumer instances can be in separate server or in same machine but different process. we can grouping our consumers based on “logical” functionality. this way we can do the clustering of our consumers instead of having a single process. Kafka dynamically handle the maintainance of instances. so each instance withing the group is responsible for a number of partition. if an instance dies the responsibility of that instance handed over to other instances in the group or if a new instance joined then this new joiner will take some responsibility from the existing instances.
the following picture is from Kafka Documentation and shows how it works:
- If a user wants to maintain the order of records in whole partitions he should use only one partition within the topic as the order is only preserved withing a partition.
Kafka as Messaging System
This type of messaging is also called PTP. This usually works with a one-to-one approach. A component will send a message to a Queue and another component receive that message from this Queue. The messages are stayed within a Queue until they are consumed or expired. It doesnt matter that if there is no Consumer exist yet at the time of sending messages. The “Acknowledge” concept comes to play when a consumer received a message. The consumer send back the “Acknowledgement” to Queue. This way a Queue could understand that if the message is consumed correctly. It has a gaurantee that the message is delivered to and processed by exactly one consumer. We can define multiple consumer and balance the load of processing of messages. as soon as the message is consumed it will be deleted from the queue. The messages are processed out of order as due to some reason messages in the queue will be resend at a later time like network failure.
This type of messaging is for broadcasting behavior. A publisher(producer) or many of them, publish a message to a Topic and one or many clients who previously registered themselves to that topic (Subscribed) will receive the messages that appeared while he was registered(there is also durable subscription concept which make the messaging system save messages published while the subscriber is not listening). In contrast with Queuing each consumer receive every message from the topic(unless using shared subscription or message selectors) and in exact order in which they received. They are usually used in stateful application as the ordering of messages are important in stateful application. In stateful application what happens previously is important because based on previous state they apply some logic so the ordering is important.
Kafka is using a mixed way of processing message. By introducing the concept of Consumer group. In queuing we saw how to balance the load of processing of messages by defining several consumers registered to one queue. so in Kafka topic has both properties of traditional messaging with the help of topic and consumer group. The ordering in Kafka is much more better than the previous traditional way of messaging. previously as we saw if we have multiple consumers the messages are handed out in the order they are stored. but they might be arrived out of order in different consumer. Kafka by the help of partition in the topics do it better. A partition in a topic is assigned to a consumer in the consumer group so each partition is exactly consumed by one consumer in the group. so at the same time parallelism and ordering is guaranteed.
Kafka as Storage System
Data received by Kafka are stored and replicated as I did explain it above.
Kafka as Stream Processing System
For stream processing the Stream API should be used.
Extra general information
Everything in kafka is stored in binary format
Each record consists of a key, a value(can be json or whatever), and a timestamp.
Kafka Connect supports message headers in both sink and source connectors
In general there are two types of topic in Kafka(based on clean up policy). Regular topics has default policy which apply retention to seven days. The other one has cleanup.policy=compact. Compacted topics no longer accept messages without key and an exception is thrown by the producer if this is attempted.
- Log Compaction Basics
- Kafka can delete older records based on time or size of a log. Log compaction means that Kafka keep the lastest version of a record and delete the older versions during a log compaction. Log compaction retains at least the last known value for each record key for a single topic partition. Compacted logs are useful for restoring state after a crash or system failure or reloading caches after application restarts during operational maintenance.
- In the above picture as we can see we have head and tail. the head is the same as traditional Kafka log. New records are apended to the end of the head. Log compaction applies to the tail of the log. or in other word only tail gets compacted. Records in the tail kept their original offset when written after compaction cleanup applied.
- As the following picture shows the compaction when applied, we will keep only the latest version of data based on the Key of an offset and the original offset is preserved.
- records with a given key will always be retained.
- To turn on compaction for a topic use topic config log.cleanup.policy=compact.
- previously the zookeeper stored the information about what were the offset consumed by each groupid on a topic by partition. Now they are stored on an internal topic called __consumer_offsets
- This can be considered as a storage for the consumers to remember until which message they read and in case of failure, to restart from there.
- to recap some info: consumer consumes the partitions of some topics. a consumer belongs to a groupid. Each consumed partitions has its own last offset consumed for each couple(topic,groupid)