What is Kafka Used For and When Not to Use It
Thinking about data as streams is a popular approach nowadays. In many cases, it makes it possible to build a data engineering architecture more efficiently than thinking about data as state. But supporting the streaming paradigm requires additional technologies. One of the most popular tools for working with streaming data is Apache Kafka. In this article, we will discuss the best scenarios for deploying Kafka, as well as the cases where it is a poor fit.
What is Apache Kafka?
Apache Kafka is an open-source streaming platform. It was originally developed at LinkedIn as a messaging queue, but today Kafka is much more than that. It is a powerful tool for working with data streams and fits many use cases.
Kafka is distributed, which means that it can be scaled out when needed. All you need to do is add new nodes (servers) to the Kafka cluster.
Kafka offers high throughput: it can handle a large volume of data per unit of time. It also has low latency, which allows data to be processed in real time.
Apache Kafka is written in Scala and Java, but client libraries exist for many other popular programming languages.
Kafka is different from traditional message queues (like RabbitMQ). Kafka retains a message for a period of time after it has been consumed (the default is 7 days), while RabbitMQ removes a message as soon as the consumer’s acknowledgement is received. RabbitMQ also pushes messages to consumers and keeps track of their load: it decides how many messages each consumer should be processing at once (there are settings for this behaviour). Kafka consumers instead fetch (pull) messages themselves. Kafka is designed to scale horizontally, by adding more nodes, whereas traditional message queues are typically scaled vertically, by adding more power to the same machine. These are the most important differences between Kafka and traditional messaging systems.
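The retention difference described above can be illustrated with a small in-memory sketch. This is not Kafka's or RabbitMQ's API: the classes AckQueue and RetainedLog are hypothetical names invented for illustration, and time is passed in as a plain number of seconds.

```python
from collections import deque


class AckQueue:
    """Toy queue: a message disappears once a consumer takes and acknowledges it."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def consume_and_ack(self):
        # Returns None when nothing is left to consume.
        return self._messages.popleft() if self._messages else None


class RetainedLog:
    """Toy log: messages stay readable until they outlive the retention period."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._records = []      # (timestamp, message), oldest first
        self._base_offset = 0   # offset of the oldest retained record

    def append(self, message, now):
        self._records.append((now, message))

    def read_from(self, offset):
        # Consumers pull: they ask for everything from a given offset onwards.
        start = max(offset - self._base_offset, 0)
        return [message for _, message in self._records[start:]]

    def expire(self, now):
        # Drop records that are at least retention_seconds old.
        while self._records and now - self._records[0][0] >= self.retention_seconds:
            self._records.pop(0)
            self._base_offset += 1


SEVEN_DAYS = 7 * 24 * 60 * 60
queue = AckQueue()
log = RetainedLog(retention_seconds=SEVEN_DAYS)
for text in ("first", "second"):
    queue.publish(text)
    log.append(text, now=0)

queue.consume_and_ack()
queue.consume_and_ack()
first_reader = log.read_from(0)    # reads both messages
second_reader = log.read_from(0)   # a later reader still sees both messages
```

After the two acknowledgements the toy queue is empty, while the toy log still serves both messages to any reader until expire is called with a time past the retention period.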
Apache Kafka Core Concepts
Let’s look at how Kafka works in more detail.
The first thing that everyone who works with streaming applications should understand is the concept of the event. An event is an atomic piece of data. For example, when a user registers with the system, the action creates an event. You can also think of an event as a message with data. This message can be processed and, if needed, saved somewhere. The registration event is a message that can include information such as the user’s name, email, password, location, etc. Kafka is a platform that works with streams of events.
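As a rough illustration, a registration event could be modelled as a small immutable record that is turned into bytes before it is sent. The field names and the choice of JSON are assumptions made for this sketch; Kafka itself only sees the resulting bytes and does not prescribe a format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RegistrationEvent:
    """Hypothetical 'new user is registered' event."""

    user_name: str
    email: str
    location: str
    registered_at: str  # ISO 8601 timestamp, kept as text for easy JSON encoding


def serialize(event: RegistrationEvent) -> bytes:
    # Encode the event as UTF-8 JSON bytes, ready to be handed to a producer.
    return json.dumps(asdict(event)).encode("utf-8")


def deserialize(payload: bytes) -> RegistrationEvent:
    # Rebuild the event on the consumer side.
    return RegistrationEvent(**json.loads(payload.decode("utf-8")))


event = RegistrationEvent(
    user_name="alice",
    email="alice@example.com",
    location="Berlin",
    registered_at="2024-01-01T12:00:00+00:00",
)
payload = serialize(event)
restored = deserialize(payload)
```

The record is frozen because an event describes something that has already happened and should not be modified afterwards.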
Events are constantly written by producers. They are called producers because they write events (data) to Kafka. There are many kinds of producers. Examples of producers include web servers, components of applications, entire applications, IoT devices, monitoring agents, etc. The component of a website that is responsible for user registrations can produce a “new user is registered” event. A weather sensor (IoT device) can produce hourly “weather” events with information about temperature, humidity, wind speed, and so on. So, a producer is anything that creates data.
Consumers are entities that use data (events). In other words, they receive data written by producers and use it. There are many examples of data consumers. The same entities (components of applications, whole applications, monitoring systems, etc.) can act as both producers and consumers; it all depends on the particular architecture of the system. In general, though, entities like databases, data lakes, and data analytics applications act as data consumers, because the generated data often needs to be stored somewhere.
Kafka is the middleman between applications that generate data and applications that consume data. A Kafka deployment is called a Kafka cluster because it can consist of multiple elements. These elements are called nodes, and the software component that runs on each node is called a broker. This is why Kafka is categorized as a distributed system. Data in the Kafka cluster is distributed amongst several brokers, and the cluster keeps several copies of the same data. These copies are called replicas. This mechanism makes Kafka more stable, fault-tolerant, and reliable. If one broker fails, the information is not lost, and another broker takes over the functions of the failed broker.
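The failover idea can be sketched with a toy model. This is a deliberate simplification and not how Kafka is implemented: the class ReplicatedPartition is hypothetical, every replica receives each event synchronously, and the new leader is simply the next surviving broker.

```python
class ReplicatedPartition:
    """Toy model of one stream of data copied onto several brokers."""

    def __init__(self, broker_ids):
        # One copy (replica) of the data per broker.
        self.replicas = {broker_id: [] for broker_id in broker_ids}
        self.leader = broker_ids[0]

    def append(self, event):
        # Simplification: every live replica receives the event immediately.
        for replica_log in self.replicas.values():
            replica_log.append(event)

    def fail_broker(self, broker_id):
        # The failed broker's copy is gone; promote a survivor if it was the leader.
        del self.replicas[broker_id]
        if broker_id == self.leader:
            if not self.replicas:
                raise RuntimeError("all replicas lost")
            self.leader = next(iter(self.replicas))

    def read_all(self):
        # Reads are served from the current leader's copy.
        return list(self.replicas[self.leader])


partition = ReplicatedPartition(broker_ids=[1, 2, 3])
partition.append("user registered")
partition.append("user logged in")

partition.fail_broker(1)             # the leader goes down
surviving_leader = partition.leader  # another broker has taken over
events_after_failure = partition.read_all()
```

Because every broker held a copy, the events written before the failure are still readable from the new leader.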
Producers publish events to Kafka topics. Consumers subscribe to topics to gain access to the data they require. A Kafka topic is an immutable log (an ordered sequence) of events. Each topic can serve data to many consumers, which is why producers are sometimes called publishers and consumers are called subscribers. For example, the registration component of a website can publish events (via a Kafka producer) into the “registration” topic. Subscribers such as analytics apps, newsfeed apps, monitoring apps, and databases can then consume events from the “registration” topic for their own needs.
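The publish/subscribe relationship can be sketched in memory as follows. ToyBroker is a hypothetical class written for this article, not a Kafka client: each topic is an append-only list, and each subscriber keeps its own read position, so one subscriber reading an event does not take it away from the others.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for topics with independent subscribers."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic name -> append-only list of events
        self._positions = defaultdict(int) # (subscriber, topic) -> next index to read

    def publish(self, topic, event):
        self._topics[topic].append(event)

    def poll(self, subscriber, topic):
        # Return the events this subscriber has not seen yet and advance its position.
        log = self._topics[topic]
        start = self._positions[(subscriber, topic)]
        self._positions[(subscriber, topic)] = len(log)
        return log[start:]


broker = ToyBroker()
broker.publish("registration", {"user": "alice"})
broker.publish("registration", {"user": "bob"})

analytics_batch = broker.poll("analytics", "registration")  # both events
newsfeed_batch = broker.poll("newsfeed", "registration")    # both events again

broker.publish("registration", {"user": "carol"})
analytics_next = broker.poll("analytics", "registration")   # only the new event
```

Keeping the read position per subscriber is what lets the analytics app and the newsfeed app consume the same “registration” topic at their own pace.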
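Partitions serve to distribute a topic’s data across brokers. Each Kafka topic is divided into partitions, and each partition can be placed on a separate node. Partitions are also the unit that is replicated: the replicas mentioned above are copies of partitions.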
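One common way to decide which partition an event goes to is to hash a key attached to the event. The sketch below assumes that approach and uses CRC32 from the Python standard library purely as a stand-in hash; the function name choose_partition and the user-id keys are hypothetical, and real Kafka clients use their own partitioning logic.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    # A stable hash means the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]  # one append-only list per partition

incoming = [
    ("user-1", "registered"),
    ("user-2", "registered"),
    ("user-1", "logged_in"),
]
for user_id, action in incoming:
    index = choose_partition(user_id, NUM_PARTITIONS)
    partitions[index].append((user_id, action))
```

Since all events with the same key land in the same partition, their relative order is preserved there, while different keys can be spread over different nodes.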
Best Apache Kafka Use Cases
Let’s look at the common use cases of Apache Kafka.
Real-time data processing
Many modern systems require data to be processed as soon as it becomes available. For example, in the finance domain, it is important to block fraudulent transactions the instant they occur. In predictive maintenance, models should constantly analyse streams of metrics from working equipment and trigger alarms as soon as deviations are detected. IoT devices are often useless without the ability to process data in real time. Kafka can be useful here since it is able to transmit data from producers to data handlers and then on to data stores.
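To make the predictive maintenance example more concrete, here is a minimal sketch of a handler that inspects readings one at a time as they arrive. The rule (compare each reading with the mean of the previous few) and the names detect_deviations, window_size, and tolerance are assumptions chosen for illustration; in a real system the readings would come from a Kafka consumer rather than a list.

```python
from collections import deque
from statistics import mean


def detect_deviations(readings, window_size, tolerance):
    """Yield (index, value) for readings far from the mean of the preceding window."""
    window = deque(maxlen=window_size)
    for index, value in enumerate(readings):
        # Only judge a reading once a full window of earlier readings exists.
        if len(window) == window_size and abs(value - mean(window)) > tolerance:
            yield index, value
        window.append(value)


temperature_stream = [20.0, 20.5, 19.5, 20.0, 35.0, 20.0]
alerts = list(detect_deviations(temperature_stream, window_size=3, tolerance=5.0))
```

Because the function is a generator, each alert is produced as soon as the offending reading is seen, without waiting for the rest of the stream.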
Application activity tracking
This is the use case Kafka was originally developed for at LinkedIn. Each event that occurs in the application can be published to a dedicated Kafka topic. User clicks, registrations, likes, time spent on certain pages, orders, and so on can all be sent to Kafka topics. Then, other applications (consumers) can subscribe to these topics and process the received data for different purposes, including monitoring, analysis, reports, newsfeeds, and personalization.
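As a sketch of what a consumer of activity events might do, the snippet below aggregates a handful of events for a report. The event shape (a dict with "user" and "action") and the function names are hypothetical; the list stands in for events read from an activity topic.

```python
from collections import Counter

# Stand-in for events a consumer has read from an activity topic.
activity_events = [
    {"user": "alice", "action": "click"},
    {"user": "bob", "action": "click"},
    {"user": "alice", "action": "like"},
    {"user": "alice", "action": "click"},
    {"user": "bob", "action": "order"},
]


def count_actions(events):
    # Overall number of events per action type, e.g. for a dashboard.
    return Counter(event["action"] for event in events)


def actions_per_user(events):
    # Per-user breakdown, e.g. as input for personalization.
    per_user = {}
    for event in events:
        per_user.setdefault(event["user"], Counter())[event["action"]] += 1
    return per_user


totals = count_actions(activity_events)
by_user = actions_per_user(activity_events)
```

Two different consumers could compute these two views independently from the same topic.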
Logging and/or monitoring system
Apache Kafka can be used for logging or monitoring. It is possible to publish logs into Kafka topics. The logs can be stored in a Kafka cluster for some time. There, they can be aggregated or processed. It is possible to build pipelines that consist of several producers/consumers where the logs are transformed in a certain way. In the end, logs can be saved in a traditional log-storage solution.
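A log pipeline of this kind can be sketched as a chain of small stages. In the sketch below each function stands in for one consumer/producer step that would read from one topic and write to the next; the log line format and all function names are assumptions made for illustration.

```python
from collections import Counter

# Stand-in for raw log lines published to a topic.
raw_logs = [
    "INFO service=auth user logged in",
    "ERROR service=payments card declined",
    "ERROR service=auth token expired",
    "INFO service=payments order paid",
    "ERROR service=payments gateway timeout",
]


def parse(lines):
    # Stage 1: turn each raw line into a structured record.
    for line in lines:
        level, service_field, message = line.split(" ", 2)
        yield {
            "level": level,
            "service": service_field.split("=", 1)[1],
            "message": message,
        }


def only_errors(records):
    # Stage 2: keep only the records worth acting on.
    return (record for record in records if record["level"] == "ERROR")


def errors_per_service(records):
    # Stage 3: aggregate before handing off to long-term log storage.
    return Counter(record["service"] for record in records)


summary = errors_per_service(only_errors(parse(raw_logs)))
```

Each stage only depends on the output of the previous one, which is what makes it possible to split them across separate producers and consumers.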
For the monitoring case, suppose you have a special component of the system that is dedicated to monitoring and alerting. This component (monitoring application) can read data from Kafka topics. This makes Kafka useful for monitoring purposes, especially if it is real-time monitoring.
When Not To Use Kafka
- Kafka is overkill when you need to process only a small number of messages per day (up to several thousand). Kafka is designed to cope with high load. Use a traditional message queue like RabbitMQ when you don’t have a lot of data.
- Kafka is a great solution for delivering messages. But despite the fact that Kafka has a Streams API, it is not easy to perform data transformations on the fly. You need to build a complex pipeline of interactions between producers and consumers and then maintain the entire system. This requires a lot of work and effort. So, avoid using Kafka for ETL jobs, especially where real-time processing is needed.
- When all you need is a simple task queue, use a tool built for that purpose. Kafka is not designed to be a task queue. Other tools, for example RabbitMQ, are better suited to such use cases.
- If you need a database, use a database, not Kafka. Kafka is not good for long-term storage. It supports saving data during a specified retention period, but generally, that period should not be very long. Kafka also stores redundant copies of data, which can increase storage costs. Databases are optimized for storing fresh data. They also have versatile query languages and support efficient data insertion and retrieval. If a relational database is not what you need for your use case, look at non-relational ones (for example, MongoDB), but don’t use Kafka.
In this article, we described Apache Kafka and the most suitable use cases for deploying this tool. From what you have learned in this article, it is easy to see why it is such a powerful streaming platform. Kafka is a valuable tool in scenarios requiring real-time data processing and application activity tracking, as well as for monitoring purposes. At the same time, Kafka shouldn’t be used for on-the-fly data transformations, for long-term data storage, or when all you need is a simple task queue.