This article is part 1 of 2 from our Comprehensive Guide to Understanding and Reducing Redshift Costs. You can read part 2 on our blog, or get access to the full document with 3 additional videos right here.
Amazon Redshift is a fully managed and scalable data warehouse service in the cloud. Redshift’s speed, flexibility, and scalability, along with a frequent stream of new features and capabilities from AWS, make it one of the most popular data warehouses in use.
But Redshift can be expensive. Its pricing model can come across as byzantine. Examining the nuances, options, and methods of Redshift charges is time-consuming. It can be hard to feel confident that you have optimized your Redshift bill.
In this first of two articles, we’ll help you understand the different pricing methods available, as well as the different performance/cost tradeoffs, so you can decide with confidence how to spend your Redshift budget predictably. (Part 2 delves into the tools and techniques you can use to limit or even reduce your Redshift investment and keep costs under control.)
Surveying the many pricing options in Amazon Redshift
Here’s a scene we can envision, after a DevOps or data practitioner explores the cost of configuring and deploying Redshift and reports back to management:
BOSS: OK, Dude. What’d we find out about Redshift?
DATA DUDE: Well, we can pay by the hour on demand, unless we want to reserve compute resources in advance and get a discount.
BOSS: Great – let’s go for the discount.
DATA DUDE: OK. But then we have to figure out whether to pay up front partially or fully, or not at all, so we’ll need to closely analyze and monitor current and anticipated usage to make the most cost-effective decision.
BOSS: Great – we’ll do that.
DATA DUDE: OK. But before we do that we have to figure out if we want a DC2-type node or an RA3-type node, though before we do THAT we should determine whether we want a large node or a more expensive extra-large node.
BOSS: Great – we’ll do – wait – what’s the difference?
DATA DUDE: Well, the DC2 large and the RA3 extra large are more cost-effective in some cases, although the DC2 extra-large and the RA3 large are a better choice in other cases. In either case we can save money by having fewer nodes.
BOSS: So let’s have fewer nodes.
DATA DUDE: Sure, but it’s hard to scale down and easier to pay for nodes we’re not fully using.
BOSS: But – why should we pay for resources we’re not using?
DATA DUDE: Exactly. Also don’t forget about where our data resides because Redshift charges higher prices to store data in certain regions but also charges to transfer it to other regions.
BOSS: But compliance and governance – we don’t always have flexibility –
DATA DUDE: We can also save money by using Redshift Spectrum to query data in our data lake instead of in Redshift itself, although that costs money per volume of data queried to use Redshift Spectrum, too, so we want to watch how much data we’re querying, though we can be more efficient and get free credits by choosing concurrency scaling –
BOSS: Concurrency scaling.
DATA DUDE: But we should be careful not to exceed our credits or we’ll be charged by the second.
BOSS: Um – uhhh –
DATA DUDE: And no matter what, don’t forget about the technical overhead involved in managing and maintaining clusters, and no matter what again we can save a lot of money by fine-tuning our queries, and we might also consider Redshift Serverless, which Amazon just came out with, to manage capacity when our usage spikes, and – boss? Boss?
BOSS: (staring off into space)
We’ve seen Apache Airflow DAGs a lot less complex than that.
But we can flatten this out a bit and help clarify just what your options are.
Let’s get started.
How Redshift charges
Here we break down Amazon Redshift pricing as follows:
- Base pricing
- Types of nodes
- Number of nodes
- Vehicles for reducing pricing
- Maintenance costs
Amazon Redshift base pricing
First, foremost, and generally speaking, Redshift charges by the hour, based on the type and number of nodes in your cluster. (There are ancillary charges that can be significant; we cover those later in this blog.) A Redshift node is a finite set of resources optimized for compute and storage. A node includes an engine and a database.
- Use the native Redshift query engine to query data stored in Redshift; this engine is based on PostgreSQL.
- Use Redshift Spectrum, a serverless query processing engine, to query data stored outside of Redshift (in Amazon S3, for example). Redshift Spectrum costs extra and has its own pricing scale.
Pricing gets confusing because of the variety of pricing models: by the hour based on your node usage (on demand), by the number of bytes scanned (Redshift Spectrum), by the time spent over your free daily credits (concurrency scaling), or at a discount for committing to a 1- or 3-year term (reserved instance).
Types of Amazon Redshift Nodes
There are 2 types of nodes:
- Dense compute (DC2)
- RA3 (these supersede the earlier dense storage nodes, as per Amazon’s recommendation)
Each type comes in different sizes:
- Large
- Extra large
- Extra large plus (for RA3-type nodes)
If you frequent Starbucks, you could think of them as Tall, Grande, and Venti.
Number of Redshift nodes to purchase
How many nodes do you need, and of what type? That’s dictated by the amount of data you’re working with.
- DC2-type nodes are optimized for faster queries and are preferable for smaller data sets.
- RA3-type nodes are more expensive than dense compute but are better optimized for storing large amounts of data. They include a feature called Managed Storage, in which they offload less-frequently-used data to less-expensive Amazon S3 object storage.
- With RA3-type nodes only, you pay for compute and storage separately: per hour for compute, plus per GB per hour for data stored on the nodes.
- Extra large nodes offer more storage and on average cost roughly between 8x and 20x more than large nodes. (HDD storage was a characteristic of the earlier dense storage nodes; DC2 and RA3 nodes use SSD storage.)
Amazon Web Services recommends DC2-type nodes for datasets <1TB uncompressed. For fast-growing datasets, or datasets >1TB, AWS recommends RA3-type nodes.
Adding nodes, of course, can improve query performance as well as expand storage capacity. Be sure to factor in query performance and disk I/O requests in addition to data volume. Also be aware you cannot mix and match node types.
The base cost of your Redshift cluster is generally determined by the hourly rate for the node type x the number of nodes x the hours in use. Simple enough. Head to Amazon’s Redshift site for the latest pricing. But here, too, there are multiple variations on the theme.
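As a rough illustration of that formula, here is a minimal Python sketch. The function name and the example rate of $0.25 per node-hour are assumptions chosen for illustration, not quoted AWS prices; substitute the current figures from Amazon’s pricing page.

```python
def base_cluster_cost(hourly_rate_per_node: float, node_count: int, hours_in_use: float) -> float:
    """Base on-demand cost: hourly rate for the node type x n nodes x hours in use."""
    if hourly_rate_per_node < 0 or node_count < 1 or hours_in_use < 0:
        raise ValueError("rate and hours must be non-negative; node_count must be >= 1")
    return hourly_rate_per_node * node_count * hours_in_use


# Hypothetical inputs: a 4-node cluster at an assumed $0.25 per node-hour,
# left running for a 730-hour month.
monthly_cost = base_cluster_cost(0.25, 4, 730)
print(f"Estimated monthly base cost: ${monthly_cost:,.2f}")
```

This covers only the base charge; Spectrum, concurrency scaling, and data transfer are billed on top of it.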
Vehicles for reducing Amazon Redshift pricing
- Reserved instance. You get a discount for reserving compute capacity in advance. The discount varies widely – from 20% off all the way up to 76% off – based on:
  - whether you commit to a 1- or 3-year term
  - whether and how much you pay up front. You can make a full payment, a partial payment, or no payment.
- Concurrency Scaling. Each cluster earns up to one hour of free credits per day. If you exceed that, Amazon charges the per-second on-demand rate. AWS recommends concurrency scaling as a way to maintain high performance even with “…virtually unlimited concurrent users and concurrent queries.”
- Use Redshift Spectrum to query data directly in S3. Spectrum is an additional cost – per byte scanned, rounded up by megabyte, with a 10MB minimum per query. But it’s intended to more than compensate for that cost by minimizing the data you must load into Redshift tables (as opposed to leaving it in dirt cheap S3 storage). In addition, you can prepare the data on S3 such that your queried data is small and thus your Spectrum queries incur a minimal load.
- Cluster location. Redshift’s pricing varies widely across regions. Clusters in Asia cost more than clusters in the U.S. In addition, Amazon adds data transfer charges for inter-region transfer and for every transfer involving data movement from a non-AWS location.
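To make the Spectrum and concurrency scaling rules above more concrete, here is a minimal Python sketch. It assumes binary megabytes and terabytes, and it takes the Spectrum price per terabyte, the free credit balance, and the on-demand hourly rate as inputs rather than hard-coding them, because none of those figures appear in this article. All function names are hypothetical.

```python
MB = 1024 ** 2  # assumption: a megabyte is 2**20 bytes
TB = 1024 ** 4  # assumption: a terabyte is 2**40 bytes


def spectrum_billed_bytes(bytes_scanned: int) -> int:
    """Round the scan up to the next whole megabyte, then apply the 10MB minimum."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    rounded_up = -(-bytes_scanned // MB) * MB  # integer ceiling to a whole MB
    return max(rounded_up, 10 * MB)


def spectrum_query_cost(bytes_scanned: int, price_per_tb: float) -> float:
    """Cost of one Spectrum query at a caller-supplied price per terabyte scanned."""
    return spectrum_billed_bytes(bytes_scanned) / TB * price_per_tb


def concurrency_scaling_charge(seconds_used: float, free_credit_seconds: float,
                               on_demand_hourly_rate: float) -> float:
    """Charge only the seconds beyond the free credits, at the per-second on-demand rate."""
    billable_seconds = max(0, seconds_used - free_credit_seconds)
    return billable_seconds * on_demand_hourly_rate / 3600


# A query that scans a few kilobytes is still billed for 10MB.
print(spectrum_billed_bytes(4096) // MB, "MB billed")
# 90 minutes of concurrency scaling against one hour of credits, at an assumed $2.00/hour.
print(concurrency_scaling_charge(5400, 3600, 2.0))
```

The 10MB floor is one reason many tiny Spectrum queries can cost more than a few larger ones over the same data.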
Amazon Redshift maintenance costs
Finally, there are other costs to factor in when deploying and maintaining Redshift:
- If your data volumes are dynamic – often the case with high volumes of streaming data – you may find yourself investing significant engineering time in cluster management.
- Much maintenance in Redshift is time-consuming, executed manually via a command line interface. Take into account the staffing resources you will spend running your commands, updating rows, and monitoring your clusters for better performance.
Given the above, Amazon recently introduced Redshift Serverless. Redshift Serverless is intended to get you up and running quickly by automatically provisioning the necessary compute resources. It automates cluster setup and management and makes it much simpler to manage variable capacity. Redshift Serverless introduces yet more pricing methods:
- per second for compute (measured in this case in Redshift Processing Units, or RPUs)
- per amount of data stored in Redshift-managed storage. Amazon says this is similar to the cost of a provisioned cluster using RA3 instances.
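The two Serverless charges can be sketched the same way. This is a minimal illustration under stated assumptions: compute is metered in RPU-seconds and priced per RPU-hour, storage is priced per GB-month, and both prices are supplied by the caller because this article does not quote them. The function names are hypothetical.

```python
def serverless_compute_cost(rpus: float, seconds_active: float, price_per_rpu_hour: float) -> float:
    """Per-second compute billing: RPUs x active seconds, converted to RPU-hours."""
    if rpus < 0 or seconds_active < 0 or price_per_rpu_hour < 0:
        raise ValueError("inputs must be non-negative")
    return rpus * seconds_active / 3600 * price_per_rpu_hour


def serverless_storage_cost(stored_gb: float, price_per_gb_month: float) -> float:
    """Managed storage charge for one month (the GB-month unit is an assumption)."""
    if stored_gb < 0 or price_per_gb_month < 0:
        raise ValueError("inputs must be non-negative")
    return stored_gb * price_per_gb_month


# Hypothetical workload: 8 RPUs active for 30 minutes at an assumed $0.50 per RPU-hour.
print(serverless_compute_cost(8, 1800, 0.5))
```

Because compute is billed only while it is active, this model tends to favor spiky or intermittent workloads over clusters that run flat out all day.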
How to home in on the right Redshift cluster size for your situation
One important thing you can do is identify the scale-cost-performance combination most appropriate for your organization. These 3 factors are usually in tension, so deciding which one matters most helps you make intelligent trade-offs.
- Is latency your primary concern? Adding nodes to a cluster gives more storage space, more memory, and more CPU to allocate to your queries, enhancing performance in roughly linear fashion (so an 8-node cluster, for example, can process a query about 2x as fast as a 4-node cluster).
- Is cost your primary concern? You can try removing one or more nodes. You can use AWS Cost Explorer (explained a bit further down) to calculate whether you have enough capacity in other nodes to pick up the slack.
- Is scale your primary concern? Again, add enough nodes to cover any anticipated usage spikes (though you may wind up paying for compute resources that sit idle much of the time).
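The latency-versus-cost tension can be sketched numerically. The following Python snippet assumes the idealized linear scaling described above, which real workloads will only approximate, and uses a hypothetical baseline (a 60-second query on 4 nodes at an assumed $1.00 per node-hour).

```python
def scaling_estimate(node_count: int, baseline_nodes: int,
                     baseline_query_seconds: float, hourly_rate_per_node: float):
    """Return (estimated query seconds, cluster cost per hour) under linear scaling."""
    if node_count < 1 or baseline_nodes < 1:
        raise ValueError("node counts must be >= 1")
    # Linear scaling assumption: doubling the nodes halves the query time.
    query_seconds = baseline_query_seconds * baseline_nodes / node_count
    hourly_cost = hourly_rate_per_node * node_count
    return query_seconds, hourly_cost


# Hypothetical baseline: 60 seconds on 4 nodes, $1.00 per node-hour.
for nodes in (2, 4, 8):
    seconds, cost = scaling_estimate(nodes, 4, 60.0, 1.0)
    print(f"{nodes} nodes: ~{seconds:.0f}s per query, ${cost:.2f} per hour")
```

Under this assumption every added node buys speed at a proportional hourly price, so the question becomes whether the cluster stays busy enough to justify the extra nodes.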
It’s important to measure your current data usage and estimate future usage as accurately as possible, so you’ll know how much Redshift to buy. Amazon provides a couple of tools to help you analyze your usage and adjust your spend accordingly:
- AWS Cost Explorer, to visualize, understand, and manage your AWS costs and usage over time.
- Amazon Redshift Advisor, to identify undesirable end user behaviors such as large uncompressed columns that aren’t sort key columns, and come up with recommendations to improve performance and reduce cost. Redshift Advisor recommendations are viewable on the AWS Management Console.
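If you prefer to pull the Cost Explorer numbers programmatically, the sketch below shows one way to do it with boto3’s Cost Explorer client. It assumes boto3 is installed, that your credentials allow `ce:GetCostAndUsage`, and that Redshift appears under the service name "Amazon Redshift" in your billing data; verify that value in your own account. The helper names are hypothetical, and pagination is ignored for brevity.

```python
from datetime import date


def redshift_cost_request(start: date, end: date) -> dict:
    """Build GetCostAndUsage parameters for daily Redshift cost, grouped by usage type."""
    if end <= start:
        raise ValueError("end must be after start")
    return {
        # Cost Explorer treats the End date as exclusive.
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        # Assumption: the SERVICE dimension value for Redshift in your account.
        "Filter": {"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Redshift"]}},
        "GroupBy": [{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    }


def fetch_redshift_costs(start: date, end: date) -> list:
    """Call Cost Explorer and return the first page of per-day results."""
    import boto3  # imported lazily so the request builder works without boto3

    client = boto3.client("ce")
    response = client.get_cost_and_usage(**redshift_cost_request(start, end))
    return response["ResultsByTime"]


# Inspect the request without calling AWS.
print(redshift_cost_request(date(2024, 1, 1), date(2024, 2, 1)))
```

Grouping by usage type is a design choice here: it separates node hours from Spectrum scans and concurrency scaling, which is usually the first thing you want to see when a bill grows.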
Reining in Amazon Redshift costs
There are multiple pathways for keeping Redshift costs in check. In part 2 of this blog, we go into greater detail on the tools, technologies, and techniques you can use to wring the maximum amount of value from your Redshift investment.