Apache Kafka and Amazon Kinesis are two of the most widely adopted platforms for streaming data. Organizations building stream processing pipelines often debate whether to run open-source Kafka or to use Amazon’s managed Kinesis service.
Apache Kafka started as a general-purpose publish and subscribe messaging system and eventually evolved into a fully developed, horizontally scalable, fault-tolerant, and highly performant streaming platform.
Kafka runs as a cluster in a distributed environment, which may span multiple data centers. The cluster is made up of multiple Kafka brokers (the nodes of the cluster). A topic stores a data stream as a partitioned, immutable sequence of records, and records are ordered within each partition. Each topic is divided into multiple partitions, and each broker stores one or more of those partitions. Applications publish records to a topic's partitions via producers, and other applications consume and process those records via consumers, e.g., to get insights on data through analytics applications. Multiple producers and consumers can publish and retrieve messages at the same time.
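To make the topic and partition model concrete, here is a minimal in-memory sketch in Python. It is not Kafka client code: the partition count, the record keys, and the `choose_partition` and `publish` helpers are hypothetical, and `zlib.crc32` stands in for the hash that Kafka's default partitioner applies to the record key. It only illustrates that records with the same key land in the same partition and keep their order there.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count for a single topic


def choose_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to a partition; the same key always maps to the same one."""
    # Stand-in hash for illustration only (assumption, not Kafka's own hash).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def publish(topic_log, key: str, value: str):
    """Append a record to its partition and return (partition, offset)."""
    partition = choose_partition(key)
    topic_log[partition].append((key, value))
    return partition, len(topic_log[partition]) - 1


# partition number -> append-only list of (key, value) records
topic_log = defaultdict(list)
for key, value in [("user-1", "login"), ("user-2", "login"), ("user-1", "click")]:
    publish(topic_log, key, value)
```

Ordering is only preserved per partition, so records that must be read in sequence should share a key.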
Like Apache Kafka, Amazon Kinesis is a publish and subscribe messaging solution. However, it is offered as a managed service in the AWS cloud and, unlike Kafka, cannot be run on-premises.
The Kinesis producer continuously pushes data to Kinesis Data Streams. A producer can be any source of data: a web-based application, a connected IoT device, or any other data-producing system. The consumer, such as a custom application, Apache Hadoop or Apache Storm running on Amazon EC2, or an Amazon Kinesis Data Firehose delivery stream, processes the data in real time and can store its results in a service such as Amazon Simple Storage Service (S3). Similar to partitions in Kafka, Kinesis splits each data stream across shards. The number of shards is configurable, but most of the maintenance and configuration is hidden from the user.
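The way Kinesis spreads records across shards can be sketched as follows. Kinesis hashes each record's partition key with MD5 into a 128-bit integer, and each shard owns a range of that hash space. The helper names below are hypothetical, and the even split of the hash space is an assumption for illustration; real streams may have uneven ranges after shards are split or merged.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def shard_ranges(num_shards: int):
    """Split the hash space into contiguous, equal-sized ranges (assumed layout)."""
    step = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        start = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == num_shards - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key: str, ranges) -> int:
    """Return the index of the shard whose hash range contains the key's MD5 hash."""
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hash_value <= end:
            return index
    raise ValueError("hash value outside all shard ranges")
```

As with Kafka keys, all records that share a partition key go to the same shard, so a single very hot key can overload one shard regardless of how many shards the stream has.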
Decision Points to Choose Apache Kafka vs Amazon Kinesis
Choosing a streaming data solution is not always straightforward. The decision depends on the metrics you want to achieve and on your business use case. The following decision points can help you compare Apache Kafka and Amazon Kinesis as a data streaming solution:
Setup, Management, and Administration
Setting up a full-fledged, production-ready Apache Kafka environment can take days to weeks, depending on the expertise in your team. As an open-source distributed system, Kafka requires its own cluster with enough nodes (brokers), replicas, and partitions to provide fault tolerance and high availability. If your team has no prior experience in setting up and managing Kafka clusters, expect a learning curve. You will also need distributed systems engineering capabilities for cluster management, provisioning, auto-scaling, load balancing, configuration management, and the DevOps work that comes with them.
On the other hand, Kinesis is comparatively easier to set up than Apache Kafka: a production-ready stream can be set up in a couple of hours at most. Since it is a managed service, AWS manages the infrastructure, storage, networking, and configuration needed to stream data on your behalf. Amazon Kinesis also takes care of provisioning, deployment, and ongoing maintenance of the hardware, software, and other services behind your data streams. Additionally, producers and consumers can run outside AWS and interact with the Kinesis service by means of the Kinesis APIs and the Amazon Web Services (AWS) SDKs.
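As a rough illustration of a producer talking to Kinesis through an SDK, the sketch below wraps the `PutRecord` API call. The `send_event` helper, the stream name, and the event shape are hypothetical. The `client` argument is assumed to be a Kinesis client from the AWS SDK for Python, for example one created with `boto3.client("kinesis")`, and it is passed in rather than created here so the function stays independent of credentials and region setup.

```python
import json


def send_event(client, stream_name: str, event: dict, partition_key: str):
    """Serialize one event as JSON and write it to a Kinesis data stream."""
    response = client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),  # payload is sent as bytes
        PartitionKey=partition_key,  # determines which shard receives the record
    )
    # PutRecord reports which shard stored the record and its sequence number.
    return response["ShardId"], response["SequenceNumber"]
```

Because the call goes over HTTPS to a public AWS endpoint, the same function can run on an on-premises server or a device, provided valid AWS credentials are available to the client.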
Performance Tuning – Throughput, Latency, Durability, and Availability
Tuning Apache Kafka for optimal throughput and latency requires tuning both producers and consumers. Producers can be tuned for the number of bytes to collect in a batch before sending it to the broker. Consumers can be tuned for how much data they fetch per request and for the ratio of the number of consumers reading a topic to its number of partitions.
In addition, server-side configurations such as the replication factor and the number of partitions play an important role in achieving top performance by means of parallelism. For durability, that is, to guarantee that committed messages are not lost, each partition is replicated across brokers, and the retention period is configurable, so data can be kept for as long as disk space allows. The distributed design of Kafka makes it fault-tolerant, but for high availability it must be configured to recover from failures as quickly as possible.
In contrast, Amazon Kinesis is a managed service and does not give you a free hand in system configuration. High availability is the responsibility of AWS. Kinesis ensures availability and durability by synchronously replicating data across three availability zones. Compared to Kafka, however, it exposes few settings. The retention period is set per stream (not per shard); it defaults to 24 hours and originally could not exceed 7 days, a limit AWS has since raised to 365 days. Throughput is increased by adding shards to a data stream.
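The tuning knobs described above might look like the following sketch. The setting names (`batch.size`, `linger.ms`, `fetch.min.bytes`, `fetch.max.wait.ms`) are standard Kafka client configuration keys, but the values are illustrative assumptions rather than recommendations. The `consumer_utilization` helper is hypothetical. It relies on the fact that, within one consumer group, each partition is read by exactly one consumer, so consumers beyond the partition count sit idle.

```python
# Illustrative values only (assumptions); tune against your own workload.
producer_config = {
    "batch.size": 65536,  # bytes to collect per partition before sending a batch
    "linger.ms": 10,      # how long to wait for a batch to fill before sending
}

consumer_config = {
    "fetch.min.bytes": 1024,   # minimum data the broker returns per fetch
    "fetch.max.wait.ms": 500,  # longest the broker waits to reach that minimum
}


def consumer_utilization(num_partitions: int, num_consumers: int):
    """Return (active, idle) consumer counts for one consumer group on one topic."""
    active = min(num_partitions, num_consumers)
    idle = max(num_consumers - num_partitions, 0)
    return active, idle
```

Larger batches and longer linger times generally trade latency for throughput, and adding consumers only helps until their number matches the partition count.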
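One way to reason about durability and availability together is to look at how many replicas a partition can lose while producers can still write with full acknowledgement. The sketch below assumes producers use `acks=all` and that the topic sets `min.insync.replicas`. The function name and the example values are hypothetical.

```python
def tolerated_broker_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Replicas a partition can lose while still accepting acks=all writes."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and the replication factor")
    # Writes with acks=all succeed while at least min.insync.replicas replicas are in sync.
    return replication_factor - min_insync_replicas


# Example settings (illustrative assumption): three replicas, two must acknowledge.
durability_settings = {
    "replication.factor": 3,
    "min.insync.replicas": 2,
    "acks": "all",
}
```

With the example settings, a partition keeps accepting writes after losing one replica. Lowering `min.insync.replicas` to 1 raises availability but lets a write be acknowledged by a single replica.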
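Since throughput scales with shard count, a first estimate of the shards a stream needs can be computed from the per-shard limits. The sketch below assumes the documented limits for provisioned streams (1 MB per second or 1,000 records per second for writes, and 2 MB per second for reads, per shard). Verify these against the current AWS quotas before relying on them. The function name and parameters are hypothetical.

```python
import math

# Assumed per-shard limits for provisioned Kinesis data streams.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def required_shards(write_mb_per_sec: float, records_per_sec: float, read_mb_per_sec: float) -> int:
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,  # a stream always has at least one shard
        math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD),
        math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_per_sec / READ_MB_PER_SHARD),
    )
```

The binding constraint is whichever limit demands the most shards, which is often the record rate for small messages and the byte rate for large ones.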
Human Costs and Machine Costs
Setting up and maintaining Kafka often requires significant technical resources, in the form of billed engineer hours for setup and the 24/7 ongoing operational burden of managing your own infrastructure. Moreover, there are costs associated with dedicated hardware, though these costs can be controlled or lowered by investing more labor (and cost) in optimizing the machines to operate at full capacity.
Amazon’s model for Kinesis is pay-as-you-go: there are no upfront setup costs, and the amount you pay depends on the services you use. This saves companies the time and expense of building and constantly maintaining infrastructure. Kinesis pricing is based on two core dimensions:
- the number of shards needed for the required throughput, billed per shard hour
- the number of PUT payload units, that is, the volume of data producers write to the Kinesis data stream, counted in 25 KB chunks per record.
Moreover, Kinesis costs tend to fall over time without any action on your part, depending on how closely your workload resembles a typical AWS workload.
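The two pricing dimensions can be combined into a rough monthly estimate, as in the sketch below. It assumes shard-hour billing and 25 KB PUT payload units billed per million. The function names are hypothetical, and the prices are deliberately left as parameters because actual rates vary by region and change over time.

```python
import math

PAYLOAD_UNIT_KB = 25  # a record is billed in 25 KB chunks


def payload_units(record_size_kb: float) -> int:
    """Number of PUT payload units consumed by one record."""
    return max(1, math.ceil(record_size_kb / PAYLOAD_UNIT_KB))


def monthly_cost(shards, records_per_sec, record_size_kb,
                 price_per_shard_hour, price_per_million_units, hours=720):
    """Estimate the monthly cost from shard hours plus PUT payload units."""
    shard_cost = shards * hours * price_per_shard_hour
    total_units = records_per_sec * payload_units(record_size_kb) * hours * 3600
    put_cost = total_units / 1_000_000 * price_per_million_units
    return shard_cost + put_cost


# Placeholder prices (assumption): substitute the current rates for your region.
estimate = monthly_cost(shards=2, records_per_sec=100, record_size_kb=30,
                        price_per_shard_hour=0.02, price_per_million_units=0.02)
```

Note that a 30 KB record counts as two payload units, so keeping records under 25 KB, or batching small records together, can noticeably change the second cost component.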
Incident Risk Management
As long as Kafka has a good monitoring system in place that alerts on failures promptly, and a 24/7 DevOps team to handle failures and recovery, the risk of incidents is lower. The main decision point is whether you can afford outages and data loss if you do not have 24/7 monitoring, alerting, and a DevOps team to recover from failures. With Kinesis, as a managed service, Amazon itself takes care of the high availability of the system, so such incidents are less likely to occur.
Closing: Kinesis or Kafka for your streaming data?
As with most tech decisions, there is no single right answer to which streaming solution to use. While Kinesis might seem like the more cloud-native solution, a Kafka cluster can also be deployed on Amazon EC2, which provides a reliable and scalable infrastructure platform. However, monitoring, scaling, managing, and maintaining the servers, software, and security of the cluster would still create IT overhead. (Fully managed Kafka services are also available from Confluent, as well as from AWS as Amazon Managed Streaming for Apache Kafka, or Amazon MSK.)
Choosing a data streaming solution may depend on company resources, engineering culture, and budget, as well as the decision points above. For example, if you have a team of distributed systems engineers with extensive Linux experience and enough people to handle distributed cluster management, monitoring, stream processing, and DevOps, then the flexibility and open-source nature of Kafka could be the better choice. Alternatively, if you are looking for a managed solution, or you do not currently have the time, expertise, or budget to set up and take care of distributed infrastructure and want to focus only on your application, you might lean towards Amazon Kinesis.