Apache Kafka and Amazon Kinesis are two of the more widely adopted message queuing systems. Many organizations dealing with stream processing or similar use cases debate whether to use open-source Kafka or Amazon’s managed Kinesis service as their data streaming platform.
This article compares Apache Kafka and Amazon Kinesis on decision points such as setup, maintenance, costs, performance, and incident risk management.
Once you have your stream processing in place, you’ll want to make sure you have the right tools to integrate and analyze streaming data. Get a free trial of Upsolver or check out our previous guide to Apache Kafka with or without a Data Lake.
Apache Kafka started as a general-purpose publish and subscribe messaging system and eventually evolved into a fully developed, horizontally scalable, fault-tolerant, and highly performant streaming platform.
Kafka runs on a cluster in a distributed environment, which may span multiple data centers. The Kafka cluster is made up of multiple Kafka brokers (the nodes of the cluster). A topic stores a data stream as an ordered, partitioned, immutable sequence of records. Each topic is divided into multiple partitions, and each broker stores one or more of those partitions. Applications called producers send records to the partitions of a topic, and other applications called consumers read and process those records, e.g., to get insights on the data through analytics applications. Multiple producers and consumers can publish and retrieve messages at the same time.
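The relationship between topics, partitions, and offsets can be illustrated with a small in-memory model. This is a minimal sketch, not Kafka itself: the class ToyTopic and its methods are hypothetical names, and CRC32 stands in for the hash that a real Kafka producer applies to the record key.

```python
import zlib


class ToyTopic:
    """In-memory stand-in for a partitioned, append-only topic (hypothetical)."""

    def __init__(self, num_partitions):
        # Each partition is an ordered list of records; records are only appended.
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        # Assumption: CRC32 replaces the real partitioner's hash function.
        # The property that matters is that the same key always maps to the
        # same partition, so per-key ordering is preserved.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition, offset):
        # A consumer reads a partition sequentially, starting from an offset.
        return self.partitions[partition][offset:]


topic = ToyTopic(num_partitions=3)
first = topic.send("user-42", "login")
second = topic.send("user-42", "click")
```

Both records for "user-42" land in the same partition with consecutive offsets, which is why a consumer of that partition sees them in the order they were produced.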
Like Apache Kafka, Amazon Kinesis is a publish and subscribe messaging solution. However, it is offered as a managed service in the AWS cloud and, unlike Kafka, cannot be run on-premises.
The Kinesis producer continuously pushes data to Kinesis streams. A producer can be any source of data: a web-based application, a connected IoT device, or any other data-producing system. The consumer, such as a custom application, Apache Hadoop, Apache Storm running on Amazon EC2, an Amazon Kinesis Data Firehose delivery stream, or Amazon Simple Storage Service (S3), processes the data in real time. Similar to partitions in Kafka, Kinesis splits a data stream across shards. The number of shards is configurable, but most of the maintenance and configuration is hidden from the user.
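How a record ends up in a particular shard can be sketched as follows. Kinesis hashes each record's partition key with MD5 into a 128-bit integer, and every shard owns a contiguous range of that hash space. The sketch assumes the hash space is split evenly across shards, which is not guaranteed for a stream that has been resharded; the function names are hypothetical and no AWS call is made.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def even_shard_ranges(num_shards):
    """Split the hash space into contiguous, equally sized ranges (assumption)."""
    step = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        start = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == num_shards - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    hashed = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hashed <= end:
            return index
    raise ValueError("ranges do not cover the hash space")


ranges = even_shard_ranges(4)
shard = shard_for_key("device-17", ranges)
```

As with Kafka partitions, records that share a partition key always go to the same shard, so ordering holds per key rather than across the whole stream.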
Decision Points to Choose Apache Kafka vs Amazon Kinesis
Choosing a streaming data solution is not always straightforward. The decision depends on the metrics you want to achieve and on the business use case. The following decision points can help you compare Apache Kafka and Amazon Kinesis as a data streaming solution:
Setup, Management, and Administration
Apache Kafka takes days to weeks to set up as a full-fledged, production-ready environment, depending on the expertise in your team. As an open-source distributed system, it requires its own cluster with a high number of nodes (brokers), plus replication and partitioning for fault tolerance and high availability. Setting up a Kafka cluster requires learning (if there is no prior experience in setting up and managing a Kafka cluster) as well as distributed systems engineering practice and capabilities for cluster management, provisioning, auto-scaling, load balancing, configuration management, and a lot of distributed DevOps.
Kinesis, on the other hand, is comparatively easier to set up than Apache Kafka, and a production-ready stream processing solution may take a couple of hours at most. Since it is a managed service, AWS manages the infrastructure, storage, networking, and configuration needed to stream data on your behalf. On top of that, Amazon Kinesis takes care of provisioning, deployment, and ongoing maintenance of the hardware, software, and other services behind your data streams. Additionally, Kinesis producers and consumers can run outside AWS and interact with the Kinesis service through the Kinesis APIs and the Amazon Web Services (AWS) SDKs.
Performance Tuning – Throughput, Latency, Durability, and Availability
Tuning Apache Kafka for optimal throughput and latency requires tuning both Kafka producers and Kafka consumers. Producers can be tuned for the number of bytes of data to collect before sending it to the broker, and consumers can be tuned to consume data efficiently by adjusting the ratio of the number of consumers for a topic to the number of its partitions.
In addition, server-side configurations such as the replication factor and the number of partitions play an important role in achieving top performance by means of parallelism. To achieve durability, i.e., to guarantee that committed messages are not lost, the data can be configured to persist for as long as disk space allows. The distributed nature of the Kafka framework is designed to be fault-tolerant. For high availability, Kafka needs to be configured to recover from failures as soon as possible.
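The effect of the consumer-to-partition ratio can be sketched with a simplified assignment function. Within a consumer group, each partition is read by at most one consumer, so consumers beyond the partition count receive nothing. The round-robin logic below is a simplification of Kafka's real assignors, the function names are hypothetical, and the producer values are illustrative assumptions rather than recommendations (batch.size and linger.ms are the names of real Kafka producer settings).

```python
# Producer-side knobs for batching; values are illustrative assumptions.
producer_tuning = {
    "batch.size": 65536,  # bytes to collect per partition before sending
    "linger.ms": 10,      # how long to wait for a batch to fill up
}


def assign_partitions(num_partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


def idle_consumers(assignment):
    """Consumers that received no partition and therefore do no work."""
    return [c for c, partitions in assignment.items() if not partitions]


balanced = assign_partitions(6, ["c1", "c2", "c3"])
oversized = assign_partitions(4, ["c1", "c2", "c3", "c4", "c5", "c6"])
```

In the second call, two of the six consumers stay idle, which is why adding consumers past the number of partitions does not increase read parallelism.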
In contrast, Amazon Kinesis is a managed service and does not give you a free hand in system configuration. High availability of the system is the responsibility of AWS. Kinesis ensures availability and durability of data by synchronously replicating it across three availability zones. However, compared to Kafka, Kinesis only lets you configure the retention period of a stream, and only up to a service-imposed maximum (originally 7 days, later raised by AWS to 365 days). The throughput of a Kinesis stream can be raised by increasing the number of shards within the data stream.
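Because throughput scales with the shard count, sizing a stream is mostly arithmetic. The sketch below assumes the per-shard limits AWS documents for provisioned streams with shared-throughput consumers (1 MB/s or 1,000 records/s for writes, 2 MB/s for reads); treat the constants as assumptions to verify against the current quotas, and the function name as hypothetical.

```python
import math

# Assumed per-shard limits; check the current AWS quotas before relying on them.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def shards_needed(write_mb_per_s, records_per_s, read_mb_per_s):
    """Smallest shard count that satisfies all three per-shard limits."""
    by_write_bytes = math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD)
    by_write_records = math.ceil(records_per_s / WRITE_RECORDS_PER_SHARD)
    by_read_bytes = math.ceil(read_mb_per_s / READ_MB_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_write_bytes, by_write_records, by_read_bytes)


# 5 MB/s in, 3,000 records/s, 12 MB/s out across all consumers.
required = shards_needed(write_mb_per_s=5, records_per_s=3000, read_mb_per_s=12)
```

Here the read side is the bottleneck: writes alone would fit in five shards, but 12 MB/s of reads needs six.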
Human Costs and Machine Costs
Setting up and maintaining Kafka often requires significant technical resources, which means billable engineering hours for the setup and a 24/7 operational burden of managing your own infrastructure. Moreover, there are costs associated with dedicated hardware, although these can be controlled or lowered by investing more human time (and cost) in tuning the machines so that they are used to full capacity.
Amazon’s model for Kinesis is pay-as-you-go. There are no upfront setup costs, and the amount you pay depends on the services you use. Kinesis pricing is based on two core dimensions: 1) the number of shards needed for the required throughput and 2) PUT payload units, i.e., the amount of data the producer transmits to the Kinesis data stream, counted in chunks of 25 KB. This saves companies the time and monetary expense of building infrastructure and maintaining it constantly. Moreover, Kinesis costs typically come down over time, depending on how closely your workload resembles a typical AWS workload.
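The two pricing dimensions can be combined into a rough monthly estimate. This is a sketch under stated assumptions: the prices are parameters with no real values filled in because they vary by region and change over time, a payload unit is taken to be 25 KB with each record rounded up to whole units, and the function name is hypothetical.

```python
import math

PAYLOAD_UNIT_KB = 25  # assumed size of one PUT payload unit


def estimate_kinesis_cost(num_shards, records_per_s, avg_record_kb,
                          price_per_shard_hour, price_per_million_units,
                          hours=730):
    """Shard-hour cost plus PUT payload unit cost over a number of hours."""
    # Each record is billed in whole payload units, so a 30 KB record costs 2.
    units_per_record = math.ceil(avg_record_kb / PAYLOAD_UNIT_KB)
    total_units = records_per_s * units_per_record * hours * 3600
    shard_cost = num_shards * hours * price_per_shard_hour
    put_cost = total_units / 1_000_000 * price_per_million_units
    return shard_cost + put_cost


# Placeholder prices for illustration only; look up the current regional rates.
monthly = estimate_kinesis_cost(
    num_shards=6, records_per_s=3000, avg_record_kb=2,
    price_per_shard_hour=0.02, price_per_million_units=0.02,
)
```

Since records are rounded up to whole payload units, many small records cost more in payload units than the same volume sent as fewer records close to 25 KB, which is one reason producers often aggregate records before sending.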
Incident Risk Management
As long as Kafka has a really good monitoring system in place that alerts on failures in time, and a 24/7 DevOps team handles potential failures and recovery, the risk of incidents is low. The main decision point here is whether you can afford outages and data loss if you do not have 24/7 monitoring, alerting, and a DevOps team to recover from failures. With Kinesis, as a managed service, Amazon itself takes care of the high availability of the system, so such incidents are less likely to occur.
Closing: Kinesis or Kafka for your streaming data?
As with most tech decisions, there is no single right answer to which streaming solution to use. While Kinesis might seem like the more cloud-native solution, a Kafka cluster can also be deployed on Amazon EC2, which provides a reliable and scalable infrastructure platform. However, monitoring, scaling, managing, and maintaining the servers, software, and security of the cluster would still create IT overhead. (There are also fully managed Kafka services, offered by Confluent as well as by Amazon as Amazon Managed Streaming for Apache Kafka, or Amazon MSK.)
Choosing a data streaming solution may depend on company resources, engineering culture, monetary budget, and the decision points above. For example, if you are (or have) a team of distributed systems engineers with extensive Linux experience and a considerable workforce for distributed cluster management, monitoring, stream processing, and DevOps, then the flexibility and open-source nature of Kafka could be the better choice. Alternatively, if you are looking for a managed solution, or you do not currently have the time, expertise, or budget to set up and take care of distributed infrastructure and only want to focus on your application, you might lean towards Amazon Kinesis.
Whether you choose Kafka or Kinesis, Upsolver provides a complete solution for ingesting streaming data into your data lake, optimizing data for consumption, and creating ETL pipelines to Amazon Athena, Redshift and more. Check out our technical white paper to see how it’s done.