The following article is based on a presentation given by Roy Hasson, Senior Business Development Manager at Amazon Web Services, as part of our recent joint webinar - Frictionless Data Lake ETL for Petabyte-scale Streaming Data. You can watch the full presentation and webinar for free right here.
Whether you’re using Athena or Redshift, Upsolver is the ultimate Data Lake ETL tool to transform event streams into usable data and unlock the value of your big data architecture. Get a free demo now or check out our architecture overview to learn more.
With the near-ubiquity of Amazon Web Services in the cloud space, it’s easy to forget how new many of these services are. Redshift has only been commercially available since 2013; Amazon Athena has been around less than three years. The rapid release of new databases and analytics tools - both by AWS and newer players such as Snowflake - can leave software architects baffled as to which stack they should turn to in order to solve a specific business scenario.
“Tool X vs Tool Y” comparisons usually start and end with a cop-out stating that everything depends on your particular circumstances, there is no single correct answer, etc. (we’ve done it too when we compared Kafka to Kinesis.) Like most cliches, this one is also true; technology decisions are rarely black or white and if two platforms are commonly used, it’s often because each can be useful for some users in some scenarios.
Nevertheless, there are some high-level factors you can look at to help you gauge which tech could be the more relevant for your situation. In this article, we’ll try to do exactly that - help you identify the use cases where you’d use Redshift for your analytic workloads, versus those where you’d lean more towards Athena. We’ll try to do that with 2 sentences, 4 questions, or a detailed feature-by-feature breakdown.
In this article you will learn:
- An overview of Athena and Redshift
- Benefits of both Athena and Redshift
- Athena or Redshift? The short answer
- Comparing features of Athena and Redshift
Once you understand the differences between an enterprise data warehouse such as Redshfit and an interactive query service such as Athena, you can identify the scenarios in which one will be more useful than the other.
Let’s give the 30 second overview of each one:
- Serverless query service - AWS manages all infrastructure aspects
- Works on top of Amazon S3 object storage
- Typically only read-only and does not alter data on S3; however query results can be written to S3
- Uses ANSI SQL and Presto to query data
- Utilizes Amazon Glue as a Hive Metastore
- Cloud data warehouse based on PostgreSQL
- Composed of leader nodes and compute nodes. Users can scale by adding compute nodes as needed
- Coupled architecture (storage and compute); serverless functionality available via Redshift Spectrum
There’s a metric ton more that can be said about each of these tools - the AWS website would be a good place to start if you want to further your research (Redshift, Athena). However, for the purposes of this article, these are the important aspects you need to understand to answer the question of...
Let’s start with the simplest, most basic way to approach the question of where to analyze your cloud data.
- Redshift is faster and more robust (as long as you’re willing to add computer power), but also more expensive and complicated to manage
- Athena makes it is easier to get started and is more flexible in the types of data it can query, but performance is not guaranteed without significant data preparation
From here it’s a question of how much data you have, what your use case is and how much you are willing to invest in money and human hours.
Let’s go a little bit more in-depth with...
The caveat from above still applies - your mileage may vary, every organization and every architecture is different. However, here are the key factors to consider when debating between Redshift and Athena:
1. What is your use case?
Redshift is the more natural choice for data warehouse reporting, Athena for ad-hoc queries against S3 storage
Redshift would be the better choice if you have data coming in from diverse sources and you would like to transform that data, enforce consistent schema and formats, and create a single source that feeds a large number of business intelligence and reporting tools across the organization (see the AWS Athena FAQ).
If you’re running the same query every hour or day to power a BI dashboard, and the structure of the underlying data stays consistent, Redshift might be the better choice - especially if you’re willing to invest in infrastructure to ensure consistent performance come hell or high water.
Athena, on the other hand, might be your go-to for an ad-hoc query - most queries will ‘just work’, even on semi-structured or unstructured data, and they will return results without extensive tuning. However, if you are running your analytic workflows at higher scales, you will need to apply a comprehensive set.of best practices in order to improve Athena performance.
2. Do you have the human resources to manage infrastructure?
Redshift requires DBA resources to manage and resize clusters, with Athena there is no infrastructure to manage
While both Athena and Redshift are offered as managed services by AWS, Redshift still requires non-negligible effort to initialize and configure clusters (last year’s release of Elastic Resize is meant to streamline this process). If data volumes and retention policies are in constant flux - which would often be the case if you work with high volumes of streaming data - you will need to invest significant human resources in cluster management. Additional time needs to be spent loading the data into Redshift.
Athena, on the other hand, is by definition serverless - there is no infrastructure to manage except object storage on Amazon S3, which scales virtually infinitely. This removes the reliance on DBA resources, but also means you can’t add compute resources to improve performance when data loads increase.
Redshift Spectrum, released in late 2017, is an attempt to bridge the gap by providing a serverless extension to Redshift that allows users to join data from S3 with Redshift tables, without managing any infrastructure. However, Spectrum is only available if you already have a Redshift cluster, and some consider it to be less mature than Athena (see this related discussion on Stack Overflow).
3. How much are you willing to invest in performance?
Redshift is more expensive as you are paying for both storage and compute, compared to Athena’s decoupled architecture.
A highly optimized Redshift cluster with sufficient compute resources will most likely return faster than the same query in Athena. However, getting to that stage could cost you an arm and a leg.
Since Redshift is a database, you are paying per node, and the size of your cluster will greatly depend on the volume and velocity of the data you’re dealing with. This can get prohibitively expensive at higher scales, forcing you to either devote increasingly large budgets to your cloud infrastructure or playing a game of cat-and-mouse with your data retention policy (a telltale sign that you should move to a big data architecture). This is without even discussing the software and human costs of data transformation, which can be even more signficant.
Athena runs directly on top of S3 and is priced per terabyte of data scanned. This decoupled architecture allows you to leverage inexpensive storage on Amazon S3 which should not make much of a dent in your AWS bill. Querying costs can also be significantly reduced by means of effective data preparation for Athena, which can also have an impact on performance.
4. How structured is your data?
Redshift is a relational database and best suited for tabular data; Athena is better for semi-structured and unstructured data
As an RDBMS, Redshift stores data in tables and enforces schema-on-write. This makes it better suited for structured data that is ingested already in tabular format, which would often be the case with business application sources such as CRM or HR systems. For these sources, the ETL effort is less significant and there are plenty of tools available to help you get the job done.
However, if you are working with unstructured or semi-structured data - such as raw event data captured from clickstream, IoT sensors or online advertising - this ETL becomes a much more acute problem, requiring complex coding in Spark/Hadoop, workflow orchestration and more. In these cases it might make more sense to store schemaless data on S3 first, and solve the ETL problem once you actually have your analytic use case figured out.
This is not to say you cannot store semi-structured data in Redshift - it’s just a question of whether you are willing to invest this significant ETL effort on ingest. If not, Athena might be the better choice as it queries S3 directly, and can handle complex data types, including nested JSON.
Now that you are hopefully able to place your scenario broadly within the scope of one tool or the other, you can use the following table and resources to continue comparing Redshift and Athena on a feature-by-feature basis.
Streaming data is becoming a core component of enterprise data architecture due to the explosive growth of data from non-traditional sources such as IoT sensors, security logs and web applications.
Streaming technologies are not new, but they have considerably matured in recent years. The industry is moving from painstaking integration of open-source Spark/Hadoop frameworks, towards full stack solutions that provide an end-to-end streaming data architecture built on the scalability of cloud data lakes.
Want to see how leading organizations design their big data infrastructure? Check out these 4 real-life examples of streaming architectures.