Apache Kafka is an open-source event streaming platform originally developed at LinkedIn. It began as a messaging queue but took on a life of its own and was eventually donated to the Apache Software Foundation for further development. While Kafka operates like a traditional messaging queue such as RabbitMQ, in that it allows you to publish and subscribe to streams of messages, there are three core differences. First, Kafka runs as a distributed, fault-tolerant cluster that can scale to handle very large volumes of messages. Second, Kafka stores streams of messages durably for as long as you configure it to, so consumers can re-read and replay them. Third, Kafka can process streams of messages as they occur, not merely pass them along.
Combined, these factors make Kafka much more than just a messaging queue.
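The difference between a queue that deletes messages on delivery and Kafka's retained, replayable log can be sketched with a toy in-memory model. This is purely illustrative (`ToyBroker` is an invented name, not Kafka's API; real deployments use a client library such as confluent-kafka or kafka-python against a running cluster):

```python
# Toy, in-memory sketch of Kafka's core abstraction: an append-only log
# per topic, with an independent read offset per consumer group.
# Illustrative only -- not the real Kafka client API.

class ToyBroker:
    def __init__(self):
        self.topics = {}    # topic name -> list of messages (the log)
        self.offsets = {}   # (topic, group) -> next offset to read

    def publish(self, topic, message):
        # Messages are appended and retained, never deleted on read.
        self.topics.setdefault(topic, []).append(message)

    def consume(self, topic, group):
        """Return this group's unread messages and advance its offset."""
        log = self.topics.get(topic, [])
        start = self.offsets.get((topic, group), 0)
        self.offsets[(topic, group)] = len(log)
        return log[start:]

broker = ToyBroker()
broker.publish("payments", {"id": 1, "amount": 42.0})
broker.publish("payments", {"id": 2, "amount": 9.5})

# Two independent consumer groups each see the full stream, because
# reading does not remove messages -- unlike a traditional queue.
print(broker.consume("payments", "fraud-detection"))  # both messages
print(broker.consume("payments", "dashboard"))        # both messages again
print(broker.consume("payments", "dashboard"))        # [] -- caught up
```

The key design point the sketch captures: the broker tracks *where each group has read to*, rather than destroying messages, which is what lets many unrelated applications consume the same stream.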
The simple answer is that Apache Kafka is used for event streaming. But what is event streaming? According to the Kafka project itself, “Event streaming is the digital equivalent of the human body’s central nervous system. It is the technological foundation for the ‘always-on’ world where businesses are increasingly software-defined and automated, and where the user of the software is more software”.
Event streaming captures data in real time from sources such as databases, cloud services, software applications, mobile devices, and IoT sensors, in the form of streams of events. These event streams are stored for later retrieval, but they can also be transformed and processed, either in real time or retrospectively, and routed to different technologies as needed.
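The capture–store–process–route pattern just described can be sketched in a few lines. Everything here is an invented, in-memory stand-in (the event shapes and the `routes` table are assumptions for illustration, not a Kafka API):

```python
# Minimal sketch of the event-streaming pattern: capture events,
# retain them for later replay, transform them in flight, and route
# them to downstream consumers. Illustrative names, not a Kafka API.

events = [
    {"source": "iot-sensor", "temp_c": 21.5},
    {"source": "mobile-app", "action": "login"},
    {"source": "iot-sensor", "temp_c": 38.2},
]

store = []                                  # "stored for later retrieval"
routes = {"iot-sensor": [], "mobile-app": []}

for event in events:                        # processed in arrival order
    store.append(event)                     # retain the raw event
    if event["source"] == "iot-sensor" and event["temp_c"] > 30:
        event = {**event, "alert": True}    # enrich/transform in flight
    routes[event["source"]].append(event)   # route to the right consumer

print(len(store))                # 3 -- every raw event is retained
print(routes["iot-sensor"][1])   # the hot reading, enriched with an alert
```

Because the raw events are retained in `store`, the same enrichment could also be re-run retroactively over the full history, which is the "processed in real-time or retroactively" property described above.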
For instance, in the financial sector, Kafka can gather and process large numbers of payments and financial transactions in real time; stock exchanges, banks, and insurance companies use it to feed live updates to dashboards, pricing, and more. In shipping, it can monitor and update the location of hundreds of cargo vessels in real time for up-to-date cargo delivery estimates. It can also capture sensor data from wells or mining equipment and deliver it to logistics and tracking applications.
Apache Kafka can perform functions similar to an ETL (extract, transform, load) tool: it allows for the extraction of data, the transformation and processing of that data, and the loading of it into a data repository or another program. But just because it can perform these functions does not mean it is merely an ETL tool.
ETL tools are designed to move data out of one system and add it to another, typically by connecting databases to a data warehouse in batches. They were also often designed to handle only transactional data, and in the modern world there are many other types of data. ETL tools were not designed to handle the large volumes of data produced by IoT sensors, mobile applications, gaming platforms, and the like, particularly in real time.
Kafka, on the other hand, is designed to move streaming data, not batches: it feeds real-time data from unconventional sources into storage, integrates it with off-the-shelf applications and data systems, and empowers custom applications with triggers fired from these data streams. While it offers some basic data transformation capabilities, transformation is not the core focus of the tool.
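The "triggers from data streams" idea contrasts with batch ETL: instead of waiting for a nightly job, a handler fires the moment a matching event arrives. Below is a hedged sketch of that pattern using an invented registration decorator (`on_event`, `flag_large_payment`, and the event shapes are all illustrative assumptions, not Kafka constructs):

```python
# Sketch of stream-driven triggers: handlers register a predicate and
# fire per event, as the stream arrives -- not in a nightly batch.
# All names here are illustrative, not part of any Kafka API.

handlers = []

def on_event(predicate):
    """Register a handler that fires when predicate(event) is true."""
    def register(fn):
        handlers.append((predicate, fn))
        return fn
    return register

fired = []

@on_event(lambda e: e["amount"] > 1000)
def flag_large_payment(event):
    # Custom application logic triggered by the stream.
    fired.append(f"review payment {event['id']}")

# Simulate events arriving one at a time.
for event in [{"id": 1, "amount": 250}, {"id": 2, "amount": 5000}]:
    for predicate, fn in handlers:
        if predicate(event):
            fn(event)

print(fired)   # ['review payment 2']
```

A batch ETL tool would have discovered the suspicious payment hours later, when the next load ran; the streaming trigger reacts as the event is published.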
In a sense, one could argue that Kafka is a database. With transactions enabled, it can provide ACID-like (atomicity, consistency, isolation, durability) guarantees, and it is commonly used for mission-critical deployments. It offers different options for querying historical data and has native add-ons for data processing and event-based long-term storage. Stateful applications (programs that carry client data from the activities of one session to the next) can even be built on Kafka clients alone, without an external database.
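The "stateful application without an external database" claim rests on replay: because the log retains every event, an application can rebuild its current state from history on startup, much as a Kafka-backed service restores state from a changelog topic. A minimal in-memory sketch of that idea, with invented event shapes (not the Kafka client API):

```python
# Hedged sketch of state-from-replay: current state is a fold over the
# full event history, so no external database is needed to restore it.
# The event log and shapes are illustrative assumptions.

event_log = [
    ("deposit",  {"account": "a1", "amount": 100}),
    ("deposit",  {"account": "a1", "amount": 50}),
    ("withdraw", {"account": "a1", "amount": 30}),
]

def replay(log):
    """Rebuild account balances by replaying the log from offset 0."""
    balances = {}
    for kind, payload in log:
        delta = payload["amount"] if kind == "deposit" else -payload["amount"]
        balances[payload["account"]] = balances.get(payload["account"], 0) + delta
    return balances

print(replay(event_log))   # {'a1': 120}
```

If the process crashes, it loses only its in-memory state; re-running `replay` over the retained log reproduces it exactly, which is why durable retention makes the log itself the system of record.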
However, just because Kafka can be used as a database does not mean it is one. It is a scalable event streaming platform for messaging, storage, processing, and integration of real-time data. Long-term storage, query functionality, and data processing are complementary capabilities that enhance Kafka, not its core purpose. Database solutions such as MongoDB, MySQL, Elasticsearch, and Hadoop are complementary pieces alongside Kafka in an enterprise solution, not technologies it replaces.
In other words, Kafka offers some storage and querying functionality, but is not a full-fledged database.