All about Kafka

Siddheshwar Kumar
7 min read · Dec 21, 2020

This post covers the design philosophy, components, and architecture of Kafka, i.e. everything that matters!

Let’s understand what stream processing is before we delve deeper. Jay Kreps, who implemented Kafka along with other members of the team while working at LinkedIn, explains stream processing here.

Let’s cover programming paradigms -

  1. Request/Response- Send ONE request and wait for ONE response (a typical HTTP or REST call).
  2. Batch- Send ALL inputs; the batch job does the data crunching/processing and then returns ALL output in one go.
  3. Stream Processing- The program has control in this model: it takes (a bunch of) inputs and produces SOME output. How much output depends on the program; it can return ALL of the output, just ONE piece, or anything in between. So this is basically a generalization of the two extreme models above.

Stream processing is generally asynchronous. It has also been popularized by lambdas and frameworks like Rx, where the processing is confined to a single process. In the case of Kafka, however, stream processing is distributed and operates at a very large scale!
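
To make that concrete, here is a minimal Kafka Streams sketch in Java (my illustration, not part of the original post): it reads an unbounded stream of records from one topic and emits some output to another. The topic names, application id, and broker address are assumptions.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-stream-app");      // assumed name
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Take an unbounded stream of records and emit some output per record:
        // neither "one response per request" nor "all output at the end".
        KStream<String, String> orders = builder.stream("orders");
        orders.mapValues(value -> value.toUpperCase())
              .to("orders-uppercased");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```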

What is an Event Stream?

Events are actions that generate data!

Databases store the current state of the data, which has been reached through a sequence (or stream) of events. Visualizing data as a stream of events might not be very obvious, especially if you have grown up seeing data stored as rows in databases.

Let’s take the example of your bank balance: your current balance is the result of all the credit and debit events that have occurred in the past. Events are the business story that resulted in the current state of the data. Similarly, the current price of a stock is the result of all the buy and sell events that have happened since the day it got listed on a bourse; in the retail domain, data can be seen as a stream of orders, sales, and price adjustments. Google earns billions of dollars by capturing click events and impression events.
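
To make this concrete, here is a tiny, self-contained Java sketch (illustrative only; the amounts are made up) that derives the current bank balance by replaying credit and debit events:

```java
import java.util.List;

public class BalanceFromEvents {

    // An event is just a signed amount: credits are positive, debits negative.
    record AccountEvent(String type, long amountCents) {}

    public static void main(String[] args) {
        List<AccountEvent> events = List.of(
                new AccountEvent("CREDIT", 10_000),   // salary
                new AccountEvent("DEBIT",  -2_500),   // groceries
                new AccountEvent("CREDIT",  1_000));  // refund

        // The "current state" (the balance) is a fold over the event stream.
        long balanceCents = events.stream()
                .mapToLong(AccountEvent::amountCents)
                .sum();

        System.out.println("Current balance: " + balanceCents + " cents"); // 8500
    }
}
```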

Now companies want to record all user events on their sites; this helps them profile customers better and offer more customized services. It has led to tons of data being generated. So when we hear the term big data, it basically means capturing all of these events, which most companies were ignoring earlier.

Kafka’s Core Data Structure

There are quite a few similarities between a stream of records and application logs. Both order their entries by time. At its core, Kafka uses something similar to record all streams of events: the commit log. I have discussed commit logs separately, here.

Kafka provides a commit log of updates. Data sources (or producers) publish a stream of events or records which get stored in the commit log, and then subscribers or consumers (like a DB, cache, or HTTP service) read those events. Consumers are independent of each other and keep their own reference points into the log, i.e. events do not get lost (or marked as read) just because one consumer has read them.

Commit logs can be partitioned (sharded) across a cluster of nodes, and they are also replicated to achieve fault tolerance.
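
To illustrate the idea (this is not how Kafka is actually implemented), here is a tiny in-memory commit-log sketch: appends get an offset, and each consumer keeps its own read position, so one consumer reading a record does not affect another:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TinyCommitLog {
    private final List<String> log = new ArrayList<>();             // append-only records
    private final Map<String, Integer> positions = new HashMap<>(); // per-consumer offsets

    // Append a record and return its offset.
    public synchronized int append(String record) {
        log.add(record);
        return log.size() - 1;
    }

    // Read the next record for the named consumer; each consumer advances independently.
    public synchronized String poll(String consumer) {
        int pos = positions.getOrDefault(consumer, 0);
        if (pos >= log.size()) return null;
        positions.put(consumer, pos + 1);
        return log.get(pos);
    }

    public static void main(String[] args) {
        TinyCommitLog topic = new TinyCommitLog();
        topic.append("order-created");
        topic.append("order-shipped");

        System.out.println(topic.poll("cache"));    // order-created
        System.out.println(topic.poll("database")); // order-created (independent offset)
        System.out.println(topic.poll("cache"));    // order-shipped
    }
}
```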

Read more about how Kafka's storage works internally:

  • Zero copy: https://www.ibm.com/developerworks/library/j-zerocopy/
  • How Kafka's storage internals work: https://thehoard.blog/how-kafkas-storage-internals-work-3a29b02e026

Key Concepts and Terminologies

Let’s cover the important components and aspects of Kafka:

Message: A message is a record or piece of information which gets persisted in Kafka for processing. It's the fundamental unit of data in Kafka's world. Kafka stores messages in a binary format.

Topics: Messages in Kafka are categorized under topics. A topic is like a database table (so we can have multiple topics). Messages are always published to and subscribed from a given topic (by name). For each topic, Kafka maintains a structured commit log with one or more partitions.

Partition: Partitions are Kafka's way to scale and provide high availability for a topic. A topic can have multiple partitions, as shown in the figure above. Kafka appends a new message at the end of a partition. Each message in a topic is assigned a unique identifier known as an offset. Writes to a partition are sequential (from left to right), but writes across different partitions can happen in parallel, since each partition may live on a different box/node. An offset uniquely identifies a message within a given partition. In the picture above, the next offset to be written in partition 0 is 12.

Ordering of records is guaranteed only within a partition of a given topic. Partitions allow Kafka to go beyond the limitations of a single server: a single topic can be scaled horizontally across multiple servers. At the same time, each individual partition must fit on a single host.
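
As an illustrative sketch (not from the original post), a multi-partition topic can be created with the Java AdminClient; the topic name, partition count, replication factor, and broker address below are assumed values:

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions let writes/reads spread across brokers;
            // replication factor 2 keeps a copy of each partition on a second broker.
            NewTopic orders = new NewTopic("orders", 3, (short) 2);
            admin.createTopics(Collections.singleton(orders)).all().get();
        }
    }
}
```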

Each partition has one server which acts as the leader and zero or more servers which act as followers. The leader handles all read and write requests for that partition while the followers replicate it. If the leader fails, one of the followers is chosen as the new leader. Each server acts as the leader for some of its partitions and as a follower for others, so the load is balanced.

Producers: As the name suggests, producers post messages to topics. The producer is responsible for assigning each message to a partition of the given topic. The producer connects to any of the alive nodes and requests metadata about the leaders of the topic's partitions, which lets it send each message directly to the lead broker of its partition.
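
A minimal producer sketch with the Java client follows; the broker address, topic, key, and value are assumptions. If a record has a key, the default partitioner hashes the key to pick the partition, so records with the same key land in the same partition:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("user-42") decides the partition; the same key always maps to the same partition.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders", "user-42", "order-created");
            producer.send(record, (metadata, exception) -> {
                if (exception == null) {
                    System.out.printf("Wrote to partition %d at offset %d%n",
                            metadata.partition(), metadata.offset());
                }
            });
        } // closing the producer flushes any pending sends
    }
}
```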

Consumers: Consumers subscribe to one or more topics and read messages for further processing. It is the consumer's job to keep track of which messages have been read, using the offset. Consumers can re-read past messages or jump ahead to newer ones, which is possible because Kafka retains all messages for a (pre-configured) period of time.
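
And a minimal consumer sketch (again illustrative; the topic, group id, and broker address are assumptions). poll() returns whatever records are available, and methods like seek() or seekToBeginning() let a consumer go back and re-read older offsets within the retention period:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");          // assumed group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");         // start from the oldest retained record

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```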

Consumer Group: Kafka scales consumption by grouping consumers and distributing partitions among them. Each partition of a topic is assigned to only one member of the group. The group coordination protocol is built into Kafka itself (earlier it was managed through Zookeeper). For each group, one broker is selected as the group coordinator. Its main job is to control partition assignment whenever there is a change (addition/removal) in the membership of the group.

(Consumer group illustration from the Kafka documentation: https://kafka.apache.org/0110/images/consumer-groups.png)

How Consumer Group helps:

  • If all consumer instances have the same consumer group, records will be load-balanced over the consumer instances.
  • If the consumer instances all have different consumer groups, each record will be broadcast to all consumer processes (see the sketch below).
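
Here is a minimal sketch of how group.id drives both behaviors; the group names, topic, and broker address are made up:

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupSketch {
    // Build a consumer for the given group id (broker and topic are assumed).
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singleton("orders"));
        return consumer;
    }

    public static void main(String[] args) {
        // Same group id: these two consumers split the partitions of "orders" (queue semantics).
        var worker1 = consumerFor("order-processors");
        var worker2 = consumerFor("order-processors");

        // Different group ids: each group independently receives every record (publish-subscribe semantics).
        var billing = consumerFor("billing-service");
        var audit   = consumerFor("audit-service");
        // Each consumer would then run its own poll() loop, as in the consumer sketch above.
    }
}
```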

Brokers: Kafka is a distributed system, so topics can be spread across different nodes in a cluster. These individual nodes or servers are known as brokers.

A broker's job is to manage the persistence and replication of messages. Brokers scale well because they are not responsible for tracking offsets or messages for individual consumers, and the rate at which consumers consume messages is not their concern. A single broker can easily handle thousands of partitions and millions of messages.

Within a cluster of brokers, one will be elected as the cluster controller. This happens automatically among the live brokers.

A partition is owned by a single broker in the cluster, and that broker acts as the partition leader. To achieve replication, the partition is also assigned to other brokers, which hold replicas. This provides redundancy for the partition, and one of the replicas takes over as leader if the original one fails.

Leader: The node or broker responsible for all the reads and writes of a given partition.

Replication Factor: The replication factor controls the number of replica copies of each partition. If a topic is un-replicated, the replication factor is 1.

With a traditional majority-vote (quorum) design you need 2f+1 replicas to tolerate f failures; Kafka's in-sync replica (ISR) approach lets a topic tolerate f failures with only f+1 replicas.
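
To see which broker leads each partition and which replicas are in sync, here is a small AdminClient sketch (illustrative; the topic name "orders" and the broker address are assumptions):

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class DescribeTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin.describeTopics(Collections.singleton("orders"))
                    .all().get()
                    .get("orders");
            // Print the leader, replica set, and in-sync replicas for each partition.
            description.partitions().forEach(p ->
                    System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}
```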

Architecture

The below diagram shows all the important components of Kafka and their relationship-

Role of Zookeeper

Zookeeper is an open-source, high-performance coordination service for distributed applications. In distributed systems, Zookeeper helps with configuration management, consensus building, coordination, and locks (Hadoop also uses Zookeeper). It acts as a middleman among all the nodes and helps with different coordination activities; it's the source of truth.

In Kafka, it’s mainly used to track the status of the cluster's nodes and to keep track of topics, partitions, and so on. So, before starting the Kafka server, you should start Zookeeper.

Kafka vs. Messaging Systems

Kafka can be used like traditional messaging systems (or brokers) such as ActiveMQ, RabbitMQ, or Tibco. Traditional messaging systems have two models, queuing and publish-subscribe, and each of these models has its own strengths and weaknesses -

Queuing- In this model, a pool of consumers reads from the server and each record goes to one of them. This allows us to divide the processing over multiple consumers and helps scale processing. But queues are not multi-subscriber: once one subscriber reads a record, it's gone.

Publish-Subscribe- This model allows you to broadcast data to multiple subscribers. But the approach doesn't scale processing, as every message goes to every consumer/subscriber.

Kafka uses the consumer group model, so every topic has both of these properties (queue and publish-subscribe): it can scale processing and it is also multi-subscriber. A consumer group lets you divide the processing among the consumers of the group (just like the queue model), and, just like the traditional pub-sub model, it lets you broadcast messages to multiple consumer groups.

Ordering is not guaranteed in traditional systems: if records are delivered to consumers asynchronously, they may reach different consumers out of order. Kafka does better by creating multiple partitions for a topic, with each partition consumed by exactly one consumer in the group, so ordering is preserved within a partition.

Originally published at http://geekrai.blogspot.com.


Siddheshwar Kumar

Principal Software Engineer; enjoy building highly scalable systems. Before moving to this platform I used to blog on http://geekrai.blogspot.com/