Explore our expert-made templates & start with the right one for you.
With the near-ubiquity of Amazon Web Services in cloud computing, it’s easy to forget how new many of its services are. Redshift has only been commercially available since 2013; Amazon Athena has only been around since 2016. The rapid release of new databases and analytics tools – both by AWS and newer players such as Snowflake – can leave software architects baffled as to which stack they should adopt in order to solve a specific business scenario.
“Tool X vs Tool Y” comparisons usually start and end with a cop-out stating that everything depends on your particular circumstances, there is no single correct answer, etc. (we’ve done it too when we compared Kafka to Kinesis.) Like most cliches, this one is also true; technology decisions are rarely black and white and if two platforms are commonly used, it’s often because each can be useful for some users in some scenarios.
Nevertheless, there are some high-level factors you can look at to help you gauge which tech could be the more relevant for your situation. In this article, we’ll try to do exactly that – help you identify the use cases where you’d use Redshift for your analytic workloads, versus those where you’d lean more towards Athena. We’ll try to do that with 2 sentences, 4 questions, and a detailed feature-by-feature breakdown.
Redshift and Athena from a Bird’s Eye View
Once you understand the differences between an enterprise data warehouse such as Redshfit and an interactive query service such as Athena, you can identify the scenarios in which one will be more useful than the other.
Let’s give the 30 second overview of each one:
Amazon Athena is a serverless query engine used to retrieve data from Amazon S3 using SQL.
- Serverless query service – AWS manages all infrastructure aspects
- Works on top of Amazon S3 object storage
- Typically only read-only and does not alter data on S3; however query results can be written to S3
- Uses ANSI SQL and Presto to query data
- Utilizes Amazon Glue as a Hive Metastore
Amazon Redshift is a cloud data warehouse optimized for analytics performance.
- Cloud data warehouse based on PostgreSQL
- Composed of leader nodes and compute nodes. Users can scale by adding compute nodes as needed
- Coupled architecture (storage and compute); serverless functionality available via Redshift Spectrum
There’s a metric ton more that can be said about each of these tools – the AWS website would be a good place to start if you want to further your research (Redshift, Athena). However, for the purposes of this article, these are the important aspects you need to understand to answer the question of…
Redshift or Athena? The super-short answer
Let’s start with the simplest, most basic way to approach the question of where to analyze your cloud data.
- Redshift is faster and more robust (as long as you’re willing to add compute power), but also more expensive and complicated to manage
- Athena makes it is easier to get started and is more flexible in the types of data it can query, but performance is not guaranteed without significant data preparation
From here it’s a question of how much data you have, what your use case is, and how much you are willing to invest in money and human hours.
Let’s go a little bit more in-depth with…
4 Questions to decide where to run your analytic workloads
The caveat from above still applies – your mileage may vary, every organization and every architecture is different, and so on. However, here are the key factors to consider when deciding between Redshift and Athena:
1. What is your use case?
Redshift is the more natural choice for data warehouse reporting, Athena for ad-hoc queries against S3 storage.
Redshift would be the better choice if you have data coming in from diverse sources and you would like to transform that data, enforce consistent schema and formats, and create a single source that feeds a large number of business intelligence and reporting tools across the organization (see the AWS Athena FAQ).
If you’re running the same query every hour or day to power a BI dashboard, and the structure of the underlying data stays consistent, Redshift might be the better choice – especially if you’re willing to invest in infrastructure to ensure consistent performance come hell or high water.
Athena, on the other hand, might be your go-to for an ad-hoc query – most queries will just work, even on semi-structured or unstructured data, and they will return results without extensive tuning. However, if you are running your analytic workflows at higher scales, you will need to apply a comprehensive set of best practices to improve Athena performance.
2. Do you have the human resources to manage infrastructure?
Redshift requires DBA resources to manage and resize clusters; with Athena there is no infrastructure to manage.
While both Athena and Redshift are offered as managed services by AWS, Redshift still requires non-negligible effort to initialize and configure clusters (the release of Elastic Resize is meant to streamline this process). If data volumes and retention policies are in constant flux – which would often be the case if you work with high volumes of streaming data – you will need to invest significant human resources in cluster management. Additional time needs to be spent loading the data into Redshift.
Athena, on the other hand, is by definition serverless – there is no infrastructure to manage except object storage on Amazon S3, which scales virtually infinitely. This removes the reliance on DBA resources, but also means you can’t add compute resources to improve performance when data loads increase.
Redshift Spectrum, released in late 2017, is an attempt to bridge the gap by providing a serverless extension to Redshift that allows users to join data from S3 with Redshift tables, without managing any infrastructure. However, Spectrum is only available if you already have a Redshift cluster, and some consider it to be less mature than Athena (see this related discussion on Stack Overflow).
3. How much are you willing to invest in performance?
Redshift is more expensive as you are paying for both storage and compute, compared to Athena’s decoupled architecture.
A highly optimized Redshift cluster with sufficient compute resources will most likely return results faster than the same query in Athena. However, getting to that stage could cost you an arm and a leg.
Since Redshift is a database, you are paying per node, and the size of your cluster greatly depends on the volume and velocity of the data you’re dealing with. This can get prohibitively expensive at higher scales, forcing you to either devote increasingly large budgets to your cloud infrastructure or play a game of cat-and-mouse with your data retention policy (a telltale sign that you should move to a big data architecture). This is without even considering the software and human costs of data transformation, which can be even more significant.
Athena runs directly on top of S3 and is priced per terabyte of data scanned. This decoupled architecture allows you to leverage inexpensive storage on Amazon S3, which should not make much of a dent in your AWS bill. Querying costs can also be significantly reduced by means of effective data preparation for Athena, which can also have an impact on performance.
4. How structured is your data?
Redshift is a relational database and best suited for tabular data; Athena is better for semi-structured and unstructured data.
As an RDBMS, Redshift stores data in tables and enforces schema-on-write. This makes it better suited for structured data that is ingested already in tabular format, which would often be the case with business application sources such as CRM or HR systems. For these sources, the ETL effort is less significant and there are plenty of tools available to help you get the job done.
However, if you are working with unstructured or semi-structured data – such as raw event data captured from clickstreams, IoT sensors, or online advertising – this ETL becomes a much more acute problem, requiring complex coding in Spark/Hadoop, workflow orchestration, and more. In these cases it might make more sense to store schemaless data on S3 first, and solve the ETL problem once you actually have your analytic use case figured out.
This is not to say you cannot store semi-structured data in Redshift – it’s just a question of whether you are willing to invest this significant ETL effort on ingest. If not, Athena might be the better choice as it queries S3 directly, and can handle complex data types, including nested JSON.
Now that you are hopefully able to place your scenario broadly within the scope of one tool or the other, you can use the following table and resources to continue comparing Redshift and Athena on a feature-by-feature basis.
|Setup||Need to initialize and config cluster, and load data||Instant – no need to setup infrastructure|
|Partitioning||Does not support table partitioning; CREATE TABLE is used to define distribution and optimize processing||Data partitions on Hive Metastore correspond to folders on S3, with support for custom partitioning|
|UDFs||Users can create user-defined functions using an SQL SELECT clause or Python||Does not support UDFs or UDAFs|
|Pricing (details below)||– Dense Compute: $0.25 per hour for dc2.large or $4.80 per hour for dc2.8xlarge|
– Dense Storage: $0.85 per hour for ds2.xlarge or $6.80 per hour for ds2.8xlarge
|– Price per query: $5 per terabyte scanned|
– Minimum 10mb scanned per query
– Storage on Amazon S3
|Data formats||AVRO, CSV, JSON, Parquet, ORC, and TXT.||CSV, TSV, JSON or text files; highly recommended to store data in columnar format such as Apache Parquet or Apache ORC|
|Primary Keys||No-build in constraint on primary key; needs to be declared at onset||No indexing, as it works on top of S3; no primary keys|
|Security||Cluster-level security with support for encryption||Relies on Amazon Identity Management (IAM); access to data based on S3 permissions; can read encrypted data from S3|
Understanding Athena vs Redshift Pricing
Athena charges per-query, based on the bytes of data scanned and rounded up to the nearest MB, at a rate of $5 per terabyte (though this can vary by region) and a minimum of 10MB per query. If you’re using federated queries, this charge applies to the aggregate of data scanned across all data sources.
Redshift: While Athena pricing is fairly straightforward, Redshift costs are more complicated due to the many options available. At a high level, there are three main ways you can pay for Redshift usage:
- On-demand Redshift instances are charged by the hour, based on the work performed by the cluster and the amount of resources allocated to it (where you can increase cluster size to provide better performance)
- Reserved instances allow you to reduce costs by paying in advance based on your forecasted usage
For a deep dive into this topic, check out our comprehensive guide to Redshift pricing.
Explore Additional Resources
- Check out our Amazon Athena resource page to learn more.
- Comparing serverless options? See our comparison between Athena and Redshift Spectrum
- Looking in other directions? Check out our breakdown of Databricks vs Snowflake or Snowflake vs AWS
- Read this case study on the AWS blog: How ironSource uses both Athena and Redshift in its data lake architecture
This video from the AWS official Youtube channel can also help you get your bearings:
Learn everything you need to build performant cloud architecture on Amazon S3 with our ultimate Amazon Athena pack, including:
– Ebook: Partitioning data on S3 to improve Athena performance
– Recorded Webinar: Improving Athena + Looker Performance by 380%
– Recorded Webinar: 6 Must-know ETL tips for Amazon Athena
– Athena compared to Google BigQuery + performance benchmarks
And much more. Get the full bundle for FREE right here.