Streaming data can be a complex beast and typically comes with a host of challenges: from ensuring exactly-once processing, to schema discovery and enforcement, cost-effective storage and finally querying terabytes of data in a scalable, performant way. There is often no single best way to solve all of these problems, and most companies will use some combination of tools to handle the various workloads involved.
In this guide we’ll look at some of the more popular tools for working with streaming data on Amazon Web Services. We will try to cover each step of the typical streaming data architecture - stream processing, ingestion, storage, ETL and analytics. This list is not meant to be comprehensive but rather to give you a general notion of the more popular tools available to operationalize your streaming data on AWS.
Upsolver comes with everything you need to transform streaming data into value and helps data-intensive companies transform petabyte-scale data into workable datasets, including blazing-fast streaming ETL and industry-leading integration with AWS tools such as Amazon Athena. You can learn more here.
The first component in any streaming data architecture is a message broker that processes streaming events sent from sources such as IoT devices, websites and mobile apps, and passes them onwards to be written into storage. In AWS, the most popular choices would be:
Apache Kafka is an open-source stream processing platform that has become extremely commonplace in recent years, often replacing traditional message brokers such as RabbitMQ (see our comparison between the two).
You can run your own Kafka deployment on an AWS EC2 instance for a high level of control and flexibility, and to reduce software licensing costs. However, you will need to manage clusters when event throughput changes and to ensure data is continuously available for your production environments.
You can read more about best practices for running Kafka on AWS on the Amazon blog.
Amazon Kinesis Data Streams
Kinesis Data Streams (KDS) is a proprietary event streaming tool offered as a managed service by AWS. It works very similarly to Kafka’s pub-sub model, including elastic scaling, durability and low-latency message transfer (within 70ms of the data being collected according to Amazon’s marketing).
As with most managed services, using KDS means sacrificing some level of control and customization in exchange for ease-of-use and reduced focus on infrastructure. We’ve explored this issue in more depth in our previous blog post comparing Apache Kafka vs Amazon Kinesis.
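To make the publishing side of KDS a little more concrete, here is a minimal sketch of a producer. It assumes `client` behaves like `boto3.client("kinesis")` and uses its `put_record` call; the stream name, the `device_id` key field and the event shape are hypothetical and only there for illustration.

```python
import json


def send_event(client, stream_name, event, key_field="device_id"):
    """Serialize one event as JSON and publish it to a Kinesis data stream.

    `client` is assumed to behave like boto3.client("kinesis").
    `key_field` is a hypothetical field name used as the partition key.
    """
    response = client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        # Records that share a partition key are routed to the same shard,
        # which keeps them in order relative to each other.
        PartitionKey=str(event[key_field]),
    )
    return response["SequenceNumber"]
```

With boto3 installed and AWS credentials configured, you would call it along the lines of `send_event(boto3.client("kinesis"), "clickstream", {"device_id": 42, "temp": 21.5})`. Passing the client in as an argument keeps the function easy to test without touching AWS.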
Amazon Managed Kafka
Amazon Managed Streaming for Apache Kafka (MSK) is meant for organizations that want to use Kafka without dealing with the challenges of configuring, resizing and managing Kafka clusters on an ongoing basis. This is a relatively new service that only became generally available in May 2019. You can read a detailed review of the service here.
Using Amazon MSK, you can continue leveraging the open-source nature and highly developed ecosystem of Apache Kafka, even if you don’t have a large team of data engineers to manage the infrastructure side of things. However, this comfort comes at a price, as you will be paying an additional markup to AWS compared to running Kafka on your own EC2 instances (see MSK pricing plans).
Once you’ve got your message broker in place and are collecting data, you’ll want to ingest that data into your AWS cloud object storage (see below). Popular options include:
The Upsolver platform provides managed, self-service data lake ingestion that is fully automated and configured via UI and declarative SQL. Upsolver will connect to Kafka, Kinesis Data Streams, or Amazon MSK and write all incoming events to S3 while ensuring exactly-once processing.
Upsolver automatically creates two copies of the data on S3, with raw historical data stored as Avro files, in addition to analytics-ready Parquet files stored in a separate bucket. The platform also handles partitioning, small file compaction and makes schema-on-read instantly available as the data is being ingested.
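Exactly-once processing is easier to reason about with a toy example. The sketch below shows one common general technique, an idempotent sink that skips replayed records by tracking the last offset written per partition. It is not a description of how Upsolver implements this internally; the class and its in-memory state are purely hypothetical.

```python
class IdempotentSink:
    """Toy sink that writes each (partition, offset) at most once.

    Assumes offsets increase monotonically within a partition, as they do
    in Kafka. A real sink would persist the offsets together with the data.
    """

    def __init__(self):
        self.rows = []
        self._last_offset = {}  # partition -> highest offset already written

    def write(self, partition, offset, payload):
        # Anything at or below the last written offset is a redelivery.
        if offset <= self._last_offset.get(partition, -1):
            return False
        self.rows.append(payload)
        self._last_offset[partition] = offset
        return True
```

If the broker redelivers a record after a consumer restart, `write` returns `False` and the output is left unchanged, so at-least-once delivery upstream still results in each event landing once.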
Amazon Kinesis Data Firehose
If you’re using KDS, you can use Kinesis Data Firehose - Amazon’s managed service for writing stream data to Amazon S3 or Redshift. It scales automatically according to inbound throughput, and also provides encryption, data batching and compression.
While Firehose is easy to use and does not require much ongoing administration, certain data transformations need to be invoked by a Lambda function, which adds complexity and can require development resources to maintain.
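To give a sense of what such a Lambda function involves, here is a minimal sketch of a Firehose transformation handler. It follows the record contract Firehose uses for data transformation (base64-encoded `data`, and a `result` of `Ok`, `Dropped` or `ProcessingFailed` for every `recordId`). The `event_type` field, the heartbeat filter and the `processed_by` enrichment are hypothetical examples, not part of any real pipeline.

```python
import base64
import json


def _result(record, status, data=None):
    """Build one output record; unchanged data is echoed back as-is."""
    return {
        "recordId": record["recordId"],
        "result": status,
        "data": record["data"] if data is None else data,
    }


def lambda_handler(event, context):
    """Transform a batch of Firehose records; every recordId is returned."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
        except (ValueError, TypeError):
            # Not valid base64 or JSON: let Firehose treat it as failed.
            output.append(_result(record, "ProcessingFailed"))
            continue
        if not isinstance(payload, dict):
            output.append(_result(record, "ProcessingFailed"))
            continue

        # Hypothetical filter: drop keep-alive events we never analyze.
        if payload.get("event_type") == "heartbeat":
            output.append(_result(record, "Dropped"))
            continue

        # Hypothetical enrichment, newline-delimited so files stay splittable.
        payload["processed_by"] = "transform-lambda"
        line = (json.dumps(payload) + "\n").encode("utf-8")
        output.append(_result(record, "Ok", base64.b64encode(line).decode("utf-8")))
    return {"records": output}
```

Even this small function needs error handling, deployment and monitoring of its own, which is the extra maintenance burden mentioned above.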
Confluent S3 Connect
For Kafka users, Kafka S3 Connect is available both as open-source and as a managed service by Confluent. It provides exactly-once delivery when writing to Amazon S3, but is limited in terms of data transformation - e.g. files can only be written as Avro or JSON, so conversion to Parquet for analytic workloads would need to be handled separately.
So you’ve set up your streaming ingest pipelines, but where are you ingesting to? The first question you need to ask is whether you’ll be using a database such as Amazon Redshift or implementing a data lake architecture with Amazon S3 as the storage layer. We’ve written about this before, and you can go check out the signs you’ve outgrown your data warehouse.
For most large-scale streaming data architectures, where you’re ingesting and storing very large volumes of semi-structured data at very high velocity, you would probably choose a data lake. If you’re doing so in the Cloud, AWS can provide the storage layer:
Amazon’s cloud object storage is built to store near-infinite amounts of data at low costs. Unlike a database, storage resources are decoupled from compute power (which you would purchase separately via EC2 instances), making it much easier and more affordable to scale data volumes without overly worrying about retention policies and cluster resizing.
For streaming data on AWS, you’re likely to be using S3 as your storage layer. To improve performance and control costs, you should implement various optimizations such as partitioning and compression, many of which we’ve covered on our previous post about S3 data partitioning.
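As a small illustration of the partitioning idea, the sketch below builds a Hive-style S3 object key from an event timestamp, so that files land under date and hour "folders". The `dt=` / `hour=` layout, the prefix and the file name are assumptions chosen for the example; pick whatever granularity matches your query patterns.

```python
from datetime import datetime, timezone


def partitioned_key(prefix, event_time_ms, file_name):
    """Build a Hive-style S3 key partitioned by event date and hour (UTC).

    `event_time_ms` is the event timestamp in epoch milliseconds.
    """
    ts = datetime.fromtimestamp(event_time_ms / 1000, tz=timezone.utc)
    return f"{prefix}/dt={ts:%Y-%m-%d}/hour={ts:%H}/{file_name}"


# An event from 2019-06-01 13:30 UTC:
print(partitioned_key("events", 1559395800000, "part-0001.parquet"))
# events/dt=2019-06-01/hour=13/part-0001.parquet
```

Query engines that understand this layout can skip every prefix outside the requested time range instead of scanning the whole bucket.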
Now that we’ve got our data ingested and stored on S3, it’s time to actually put it to use! Since data streams from Kafka / Kinesis would typically arrive in semi-structured format, you’ll need to transform, structure and enrich them before being able to run any meaningful kind of analytics. Here are your options:
Since this is our own tool and we’ve mentioned it in this article already, we’ll keep it brief: Upsolver is a data lake ETL platform that’s purpose-built to operationalize streaming data on the AWS cloud. Upsolver provides an end-to-end solution for ingestion, storage optimization and ETL, with a host of additional features to help you manage your data operations more effectively - such as automatic schema discovery on-read.
The main advantage of Upsolver over the alternatives is in its end-to-end approach and ease of use - where the entire process of turning streaming data from raw events into analytics-ready datasets is handled within a single, fully-managed service that any SQL-fluent user can operate.
Apache Spark needs no introduction - it is one of the most widely used open-source frameworks for big data processing. Within AWS, you can choose to run Spark clusters on EMR, which will integrate with the Glue Data Catalog and give you the regular benefits of the Cloud - namely not having to worry about physical infrastructure.
While Spark is definitely a reliable work-horse that can get most ETL jobs done, it does have major drawbacks in terms of complexity and the ability to provide low-latency data. We’ve covered these in detail in our list of alternatives to Spark.
AWS Glue is a fully-managed service for ETL and data discovery, built on Apache Spark. Using Glue you can execute ETL jobs against S3 to transform streaming data, including various transformations and conversion to Apache Parquet.
Glue is still evolving as a service and while it removes the need to manage Spark clusters, it is still confined to the batch nature of Spark, which entails certain latencies and limitations.
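Whichever tool you choose, the "structure" step usually boils down to turning nested, semi-structured events into flat columns. Here is a minimal, tool-agnostic sketch of that idea in plain Python; the dotted column naming and the sample event are assumptions for illustration, and real ETL tools also handle arrays, type conflicts and schema changes.

```python
def flatten(event, parent="", sep="."):
    """Flatten a nested dict into a single level with dotted column names."""
    flat = {}
    for key, value in event.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the path as a prefix.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


raw = {"user": {"id": 7, "geo": {"country": "DE"}}, "event_type": "click"}
print(flatten(raw))
# {'user.id': 7, 'user.geo.country': 'DE', 'event_type': 'click'}
```

Once events are flat, they map naturally onto columnar formats such as Parquet and onto SQL tables.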
Finally, we’ve got our data on S3, we’ve finished the various transformations and enrichments required to prepare the data for analysis, and now we want to get some insights. The major services you would use to analyze streaming data on AWS include:
Athena is a serverless, interactive query service that is used to query very large amounts of data on Amazon S3. It is likely to be part of many types of analytic workflows that involve streaming data on AWS, including ad-hoc analytics, reporting and data science.
We’ve talked extensively about Athena elsewhere, and you can go check out some of these resources, including our articles on how to improve Athena performance and on data preparation tips for Athena, as well as our webinar on ETL for Athena.
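For a rough idea of what running an Athena query from code looks like, here is a minimal sketch. It assumes `client` behaves like `boto3.client("athena")` and uses its `start_query_execution` call; the `clickstream` table, the `dt` partition column, the database name and the results bucket are all hypothetical.

```python
# Filtering on the partition column (dt) lets Athena skip S3 prefixes
# that cannot contain matching rows, which reduces the data scanned.
DAILY_EVENTS_SQL = """
SELECT event_type, COUNT(*) AS events
FROM clickstream
WHERE dt = '2019-06-01'
GROUP BY event_type
"""


def start_athena_query(client, sql, database, output_location):
    """Submit a query to Athena and return its execution id.

    `client` is assumed to behave like boto3.client("athena").
    `output_location` is the S3 URI where Athena writes the results.
    """
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

The call is asynchronous: you get an execution id back immediately and poll for completion separately, for example with the client's `get_query_execution` call.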
Amazon Redshift and Redshift Spectrum
While Redshift is a relational database more suited for storing structured data rather than semi-structured or unstructured event streams, it is often used to power various processes where consistent performance is critical, such as operational dashboards.
In these cases, you would typically ETL only the data you need into Redshift, while keeping the rest on S3 to reduce storage and processing costs. The recently introduced Redshift Spectrum adds serverless query capabilities and enables you to join S3 data with Redshift tables. To learn more, check out our comparison of Redshift and Athena.
Want to simplify your data architecture? Schedule a demo of Upsolver to learn how a single platform can replace half a dozen tools and thousands of lines of code.