This article is part 2 of 2 from our Comprehensive Guide to Understanding and Reducing Redshift Costs. You can read part 1 on our blog, or get access to the full document with 3 additional videos right here.
In Part I of this blog we described the many different variables associated with charges in Amazon Redshift. We affirmed it can be difficult to keep your costs in check and be confident you’re getting the most from your Redshift investment.
Here in Part II, we’ll review some of the tools and methods you can use to track your Redshift usage and reduce and maximize your Redshift spend.
Note that our intent is not to discourage you from using Redshift which, as we earlier indicated, is powerful, versatile, and highly scalable. Rather, we want to help you envision how you can most cost-effectively leverage the speed and power of Redshift (and, by extension, any data warehouse solution).
Understand your Amazon Redshift usage patterns
Know as best you can how you’re using your data warehouse. And forecast as best you can how you’ll be using it in the foreseeable future. This helps you focus on just what needs changing.
- Minimize compute by “right sizing” your data set to match requirements. Do you store raw data when per-second aggregates would be sufficient? Do you store per-second aggregates when all of your reporting and query responses only require hourly or daily results? The most important factor in cost reduction is minimizing compute time. The easiest way to do this is to right size your data. Aggregate rows to what is needed and filter columns that are rarely or never used.
- Use reserved instances if you have steady workloads. Reserve a set number of nodes for 1 or 3 years – 3 years gets you a bigger discount — and calculate whether a partial or full upfront payment is preferable.
- Turn off or pause unused clusters. For Redshift clusters used for development, testing, staging, or any other non-24×7 purpose, you can turn them off when you don’t need them. For clusters in production, use the pause and resume feature in the Redshift console to suspend billing when nodes are not in use over consecutive hours. You can do this manually, or on a set schedule. The underlying data structures remain intact.
When you pause a Redshift cluster, it may take several minutes for the cluster actually to cease working; likewise, it may take several minutes for the cluster to resume processing after it has been paused. But Amazon stops charging you from the moment the pause begins, and doesn’t resume charging until the cluster is fully back in action. Redshift also completes any outstanding queries before any pause takes effect.
Finally, you can also shut down a cluster. Before shutting it down you can opt to create a final snapshot. To resume running the cluster you can restore that snapshot and resume querying data from it; anecdotally, this can take up to 30 minutes or more. Tools such as AWS CloudFormation can help you automate the restore operation.
- Minimize data transfer charges. While not always practical, try to keep all your deployment and data in as few regions as possible. Where this isn’t possible – for example, to comply with data residency requirements – factor in data transfer costs when you budget. You can also consider aggregating the data – say from minutes to hours or days – before transferring it to reduce the overall volume-based transfer cost.
- Use available tools. Amazon provides tools that give you visibility into your Redshift usage:
- Amazon Web Services Cost Explorer enables you to view and analyze your costs and usage for up to the prior 12 months, and also forecast how much you’re likely to spend for the next 12 months. Then it recommends what Reserved Instances to purchase. Viewing data is free of charge; there’s a fee for accessing your data via the Cost Explorer API.
- Redshift Advisor analyzes performance and usage metrics for your cluster. It recommends changes you can make to improve cluster performance and decrease operating costs.
- The Amazon Redshift Partners site lists more than 100 partner companies that provide complementary or supplementary products and services.
- Consider offloading ETL to the data lake. While you can transform data into smaller tables for analytics inside your data warehouse, a data lake can usually handle the job much more cost-effectively. If you find data lake processing a daunting subject, consider Upsolver, which is a tool that enables you to build continuous pipelines to your data warehouse using only SQL to define the transformations.
What to look for to right-size your Amazon Redshift clusters
In addition to right-sizing your data, you can also rely on certain usage statistics to determine whether to add or remove nodes from your Redshift cluster:
- Current and historic disk utilization. Keep disk utilization below 70-80% to leave room for query execution. If it’s much lower than this you might be able to downsize without impacting users.
- Workload management concurrency and queue wait-times. Determine whether queries are sitting around waiting for cluster resources to become available. Use your query response SLA to guide whether to resize.
- Percentage of disk-based queries. Examine the breakdown of memory usage per queue. Typically, keep the percentage of disk-based queries to less than 10% as data retrieval from disk creates a severe performance impact.
Finally, other AWS tools and services can help you assess your cluster usage. But these tools’ nomenclature alone could add to the confusion. For example, in addition to Amazon Cloudformation (mentioned above), Amazon Cloudwatch automatically collects a range of cluster metrics and makes them available via a Web-based monitoring interface. Finally there are third-party tools as well; companies such as SumoLogic and DataDog also promote comprehensive Redshift monitoring (among other capabilities).
Identify what data to store where
Store as much data as you can in cheap object storage (Amazon S3). Understanding your total data volume can lead you to the best cost/performance combination of node type and cluster size.
- Keep operational data in Redshift. This is the most current and frequently-queried data. Redshift stores data in tables and enforces schema-on-write, so using its native query engine provides good performance.
- Pre-aggregate when you can. Rarely is the granularity of the raw data equivalent to the granularity required for analytics. For example, do you need seconds, or are hours sufficient? Do you need Postal Code, or will Metro Area or State/Province suffice?
- Keep all other data in raw S3 storage. Store historical data, and data you don’t query often, more economically in a data lake. Keep it prepared into smaller tables based on your likely query needs, and use Redshift Spectrum when you do wish to access this data.
- Also store unstructured or semi-structured data – such as raw event data captured from clickstreams, IoT sensors, or online advertising – especially streaming data – in S3. When you need to query it, use a third-party tool such as Upsolver to transform this schemaless data and then load it into Redshift. This saves significant time and money you’d otherwise spend on coding ETL, workflow orchestration, and so on.
- Delete orphaned snapshots. Redshift backups are in the form of snapshots. If you delete clusters, be sure to delete these snapshots.
- Vacuum tables. Be sure to run vacuum periodically after deletes, as Redshift doesn’t do this automatically. This saves space and, by extension, cost.
Optimize your data for faster, cheaper Redshift queries
Fine-tuning your queries can greatly improve cluster performance. In turn this can significantly lower your Redshift compute costs. If you’re using Spectrum to query data in AWS S3, commit to common best practices in data processing to control query costs.
Optimizing queries on cloud object storage entails a range of best practices in data transformation preparation. It can be challenging to do this via manual coding using Scala, Python, and so on, but there are third-party tools you can take advantage of to automate the process and save significant time and effort. Depending on your requirements you can do this in advance. This gives you the added benefit of using a different or supplemental query tool in addition to Redshift Spectrum, such as Athena, depending on the use case.
These best practices include:
- Optimizing the storage layer, including partitioning, compacting, and converting into columnar Parquet.
- Converting data into generic columnar Parquet files.
- Compressing the data to save space and improve performance, which could reduce the number of nodes you require.
- Compacting data such that file sizes are as big as possible while still fitting into memory.
- Partitioning wisely so as to best answer the queries being run.
- Flattening data. Redshift does not support nested data types (such as JSON files), meaning it’s up to you to get nested data ready for querying.
- Vacuuming on S3 during the data prep process (reducing the data in S3 before it goes to Redshift).
Specific examples of data preparation in Redshift could include:
- Casting data types to support Redshift’s supported data types.
- Aligning field names in the data to Redshift column naming conventions.
- Removing unused fields to reduce storage requirements on Redshift.
- Aggregating data to match query requirements while reducing Redshift storage and compute.
- Joining two streaming sources.
- Performing high cardinality joins before moving the data to Redshift, so you can use Redshift’s resources for querying rather than for infrastructure development.
Amazon Redshift and Upsolver – a complementary pair
Upsolver is a continuous pipeline platform that lets you build data preparation using only SQL, run the processing on affordable EC2 Spot instances, and output live tables to Amazon Redshift. Upsolver provides click-to-connect data sources and outputs, declarative pipeline construction with just SQL and a visual IDE, and automated orchestration, table management, and optimization. These make it easy to implement continuous production pipelines at scale – for Redshift, standalone query engines, data stores, stream processors, and so on. And all of the best practices mentioned above – which are very complex to implement – are handled automatically, invisible to the user.
Peruse the Upsolver Web site for much more information about the Upsolver platform, including how it automates a full range of data best practices, real-world stories of successful implementations, and more.
To speak with an expert, please schedule a demo: https://www.upsolver.com/schedule-demo
Or just see for yourself how easy it is: download and use the free version: https://app.upsolver.com/signup