How to Reduce Your Data Platform Costs with Value-Based Pricing

Upsolver offers a batch and streaming data pipeline platform that runs on cloud resources. Until recently, we charged for our service based on compute “units.” We recently decided to make a major change to our pricing model to tie it more closely to the value the customer receives. This post describes our journey and thinking around pricing.

When it comes to building a data platform, there is a wide range of tools and services you can employ to deliver value to your users. Some can be self-hosted and others are fully hosted and offered as a SaaS product or as a serverless engine. In either case, the way these tools and services charge – Upsolver included until recently – is typically based on the amount of compute resources consumed over a period of time.

For example, Databricks charges per “DBU,” Snowflake per Credit, AWS Glue per “DPU,” and Confluent Cloud per “CKU,” partitions, and a base price. These currencies are controlled by the vendor, similar to airline miles. Furthermore, pricing is dependent on the edition, cluster configuration, and types of workloads. Some services bundle the cost of EC2 compute; others don’t.

Compute-based pricing models are difficult to understand and even harder to predict as the volume and variety of data and the complexity of jobs increase. How do you understand the cost of your data platform? How can you budget reliably as you add more data and use cases?

Too many variables make it difficult to understand what your data platform actually costs

Not to pick on Databricks, but let’s use them as an example. 

First, you are asked to choose between Standard, Premium, and Enterprise editions. This is usually done at the organization level, so let’s assume it’s not your worry. 

Second, you need to understand which workloads you will run, along with their performance SLAs and their security and isolation requirements – only then can you choose a cluster type to launch.

Third, you must select the required functionality from 5 options: Jobs Compute, All-Purpose Compute, Jobs Lite Compute, SQL Compute, and Serverless SQL Compute. 

Now we get to the hard part!

Fourth, you must size your cluster. You have to figure out the CPU, memory, and disk space required by your job so you can select an EC2 instance type and the number of nodes you need to process the data. This tells you the cost per DBU. However, that’s only a rough estimate.
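As a rough illustration of the arithmetic behind this step, the sketch below combines node count, DBUs per node-hour, price per DBU, and EC2 price into an hourly and monthly estimate. The class, function names, and every rate are hypothetical placeholders, not actual Databricks or AWS prices.

```python
from dataclasses import dataclass


@dataclass
class ClusterConfig:
    nodes: int                 # number of nodes in the cluster
    dbu_per_node_hour: float   # DBUs one node consumes per hour (hypothetical)
    price_per_dbu: float       # USD per DBU for the chosen edition and compute type (hypothetical)
    ec2_price_per_hour: float  # USD per node-hour for the instance type (hypothetical)


def hourly_cost(cfg: ClusterConfig) -> float:
    """Platform (DBU) charge plus infrastructure (EC2) charge for one cluster-hour."""
    dbu_cost = cfg.nodes * cfg.dbu_per_node_hour * cfg.price_per_dbu
    ec2_cost = cfg.nodes * cfg.ec2_price_per_hour
    return dbu_cost + ec2_cost


def monthly_job_cost(cfg: ClusterConfig, runtime_hours: float, runs_per_month: int) -> float:
    """Cost of a recurring job; runtime_hours is the value a POC has to discover."""
    return hourly_cost(cfg) * runtime_hours * runs_per_month


# Two candidate configurations with made-up rates.
small = ClusterConfig(nodes=4, dbu_per_node_hour=1.0, price_per_dbu=0.25, ec2_price_per_hour=0.5)
large = ClusterConfig(nodes=8, dbu_per_node_hour=2.0, price_per_dbu=0.25, ec2_price_per_hour=1.0)

print(hourly_cost(small), hourly_cost(large))
print(monthly_job_cost(small, runtime_hours=2.0, runs_per_month=30))
```

Even in this simplified form, the estimate depends on five inputs, and the job runtime cannot be known before the workload has actually been run.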

To get a true feel for your cost you need to run a proof of concept (POC) with real data and a real ETL job. You need to run the same workload with different cluster types and compute configurations to arrive at the optimal cost. A POC also helps you estimate how long your job will run given the cluster type and instance type you selected. POCs take significant time, resources, and experience to get right, and the results are only relevant for a short time.

Finally, after you complete your analysis and a POC, you are able to come up with a good idea of your costs. Are you exhausted yet? Because I am.

Wait! We didn’t even consider EC2 Spot instances and how to estimate your costs when the cluster dynamically scales up and down based on the workload.  Allowing a badly configured cluster to autoscale with the needs of workloads can very quickly lead to massive costs – I’ve seen this over and over again.

Data platform pricing variables are becoming more abstract

The Databricks example shows the growing complexity of compute-based cost models. However, they aren’t alone. Snowflake’s model is based on choosing a virtual data warehouse (VDW) size and using credits to pay for the clock time that the VDW is active, whether or not it is fully utilized while running. Before you can figure out how many credits your workload will use, you need to decide which instance type to use for your data warehouse. But even this is abstracted – under the guise of “simplicity” – into T-shirt “warehouse” sizes that still don’t explain how much actual computing capacity you will get. A good post, How to calculate your Snowflake monthly costs by Joris Van den Borre from Tropos, delves into more detail. We wrote a similar article to help customers understand the pricing dimensions of Amazon Redshift and how to manage your costs.
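To make the credit arithmetic described above concrete, here is a minimal sketch of billing by active clock time. The credits-per-hour table assumes each T-shirt size doubles the rate of the previous one, and the price per credit is a made-up number; check the vendor's current rate card before relying on either.

```python
# Credits consumed per hour of active time, doubling with each size step.
# Assumed values for illustration only.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}


def warehouse_credits(size: str, active_hours: float) -> float:
    """Credits billed for the clock time the warehouse is active.

    Utilization is deliberately not an input: an idle but running
    warehouse is billed the same as a fully loaded one.
    """
    return CREDITS_PER_HOUR[size] * active_hours


def monthly_cost(size: str, active_hours_per_day: float,
                 price_per_credit: float, days: int = 30) -> float:
    """Monthly spend in USD for a warehouse that is active a fixed number of hours a day."""
    return warehouse_credits(size, active_hours_per_day * days) * price_per_credit


# Hypothetical price of 3.0 USD per credit.
print(monthly_cost("M", active_hours_per_day=8, price_per_credit=3.0))
print(monthly_cost("L", active_hours_per_day=8, price_per_credit=3.0))
```

Moving up one size doubles the bill for the same hours, yet nothing in the calculation tells you whether the workload will finish in half the time.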

Abstracting the pricing variables simplifies your bill, but makes it extremely difficult to understand what exactly is driving your costs and how to optimize them.  This is especially important during an economic downturn when you must optimize as much as possible.  Kris Peeters from Data Minded discusses this in more detail in his post Why rising cloud costs are the silent killers of data platforms, which includes examples of how this cost multiplies as you build out your modern data stack.

If there are too many pricing variables, and if the majority of these variables are abstracted, how can you predict costs as your business evolves and you scale to meet increased data volumes and computing demands? 

At Upsolver we’ve thought long and hard about this problem. We discussed it with many current and potential customers and came to the conclusion that pricing models need to closely align to the value they reflect. When they do, costs become easy to understand and predict.

Listening to our customers’ concerns around pricing

Upsolver’s previous pricing model was based on “Upsolver Units.” A unit roughly equates to 8 CPU cores of compute capacity per hour.  But for customers it was difficult to know how many units they needed.  They didn’t want to buy too much or too little, but couldn’t easily predict what they would need.
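The unit definition above can be sketched as a simple calculation. This assumes units scale linearly with cores and hours, which simplifies the "roughly equates" description; the function name and cluster sizes are hypothetical.

```python
CORES_PER_UNIT = 8  # from the post: one unit is roughly 8 CPU cores for one hour


def units_consumed(total_cores: int, hours: float) -> float:
    """Approximate Upsolver Units burned by a cluster, independent of how busy it is."""
    return total_cores / CORES_PER_UNIT * hours


# Same workload for one day on two cluster sizes (hypothetical core counts).
print(units_consumed(total_cores=16, hours=24))
print(units_consumed(total_cores=32, hours=24))
```

Under this model, moving to instances with twice the cores doubles the units consumed even when the workload and its output are unchanged, and an idle cluster consumes units at the same rate as a busy one.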

Here are a few examples of the feedback we received:

  • “I switched to larger instances with more memory but now I’m paying Upsolver double for the same workload and value.”
  • “A data scientist ran a pipeline that scanned our historical data (replay) which consumed a large portion of our remaining Upsolver compute units. I wasn’t expecting to have to renew so soon.”
  • “I don’t have the cycles to run a POC on production data just to understand Upsolver’s cost.”
  • “We can’t give our analysts access to create jobs on Upsolver since we can’t control how much they spend.”
  • “I accidentally left my cluster up and burned through a lot of units without gaining any value.”

In all the cases above, customers vocalized their concerns about our pricing, because more compute didn’t translate to more value. We dove deep with them and created our value-based pricing tenets:

  1. Pricing must be based on dimensions that align directly to business value.
  2. Pricing must be easy to understand and easy to calculate. 
  3. Pricing must be predictable.
  4. Pricing must incentivize best practices.  
  5. Pricing must encourage adoption.

These 5 pricing tenets guided us as we evaluated different approaches and finally settled on our current model.

Upsolver’s value-based pricing model

To provide simplicity and predictability, Upsolver’s price is based on only 2 parameters:

  • Volume of ingested data measured as uncompressed JSON.  When organizations generate more data, it’s usually because the business is growing and therefore the value derived from the data pipeline platform increases.
  • Number of concurrently running transformation jobs, which maps to use cases indicating value is being derived from the ingested data.

We offer our customers the option of on-demand or an annual commitment. Customers can run their data pipelines on the Upsolver Cloud or deploy Upsolver into their own AWS account. With on-demand pricing, customers pay per GB ingested and can run up to a maximum number of concurrent jobs. For customers with higher volumes of data or more concurrent jobs, we offer volume discounts for an upfront commitment. Very simple to understand and predict.
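A minimal sketch of how an on-demand estimate could be computed from the two parameters follows. The price per GB and the concurrency limit are invented for illustration, and treating an exceeded limit as an error is an assumption of this sketch, not a description of how Upsolver bills.

```python
def on_demand_monthly_cost(ingested_gb: float, price_per_gb: float,
                           concurrent_jobs: int, max_concurrent_jobs: int) -> float:
    """Estimate a monthly on-demand bill from ingested volume alone.

    The number of jobs does not change the price; it only has to fit
    within the plan's concurrency limit (hypothetical behavior).
    """
    if concurrent_jobs > max_concurrent_jobs:
        raise ValueError("planned jobs exceed the on-demand concurrency limit")
    return ingested_gb * price_per_gb


# Hypothetical rate of 0.5 USD per GB and a limit of 10 concurrent jobs.
print(on_demand_monthly_cost(2000, 0.5, concurrent_jobs=5, max_concurrent_jobs=10))
print(on_demand_monthly_cost(4000, 0.5, concurrent_jobs=5, max_concurrent_jobs=10))
```

Both inputs are quantities a customer typically already knows, so the estimate requires no cluster sizing or POC.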

Let’s test this model against the pricing tenets:

  1. Aligned directly to business value? Yes.
    Growth in ingested data is almost always an indicator of a customer’s growth – number of users, ad spend, connected devices, payments, and so on. Growth in the number of concurrently running jobs means the customer is using the ingested data to build more tables to be used for analytics, ML, and business insights, delivering more business value.
  2. Easy to understand and calculate? Yes.
    Customers immediately understand the meaning of ingested data and concurrent jobs. We no longer need to teach our salespeople and customers what an Upsolver unit is and how to estimate costs based on expected usage.
  3. Is pricing predictable? Yes.
    Customers we interviewed were able to quickly and accurately estimate the amount of data they would be ingesting and how many jobs they would need to run to deliver on planned use cases. When usage increased in data ingested, jobs, or both, customers were able to easily predict the expected increase in Upsolver costs.
  4. Incentivize best practices? Yes.
    Customers should be allowed and empowered to optimize and reduce their costs, but not to the detriment of the business and its users. When cost is based on raw bytes scanned, the billed volume can vary widely depending on the file format and compression used, which makes it difficult to calculate how many GBs will be ingested. By measuring the data as uncompressed JSON, we normalize the ingested volume, so the choice of file format or compression does not change the billed volume.
  5. Encourage adoption? Yes.
    Adoption is usually measured by active users of the data platform, but in the case of Upsolver it’s measured by the amount of data ingested and the number of running jobs. The more concurrent jobs, the greater the adoption.
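The normalization mentioned under tenet 4 can be illustrated with a short sketch that measures a batch of records as uncompressed JSON and compares it with the gzipped size of the same records. The exact serialization rules (compact separators, one newline per record) are assumptions made for this example, not Upsolver's documented metering method.

```python
import gzip
import json


def uncompressed_json_bytes(records: list[dict]) -> int:
    """Size of the records as newline-delimited, uncompressed UTF-8 JSON."""
    return sum(
        len(json.dumps(r, separators=(",", ":")).encode("utf-8")) + 1  # +1 for the newline
        for r in records
    )


# Hypothetical sample events.
records = [{"user_id": i, "event": "click", "amount": 1.5} for i in range(1000)]

# The normalized volume depends only on the records themselves.
billable = uncompressed_json_bytes(records)

# The on-disk size depends on how the same records happen to be stored.
ndjson = "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records).encode("utf-8")
gzipped = len(gzip.compress(ndjson))

print(billable, gzipped)
```

The compressed size is far smaller than the normalized size, and it would change again with a different codec or file format, while the uncompressed JSON measure stays the same for the same records.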

During the process of developing our new pricing model, we were able to answer additional key questions that helped refine our approach:

  • If customers exceed their number of concurrent jobs mid-contract, should Upsolver charge overage?
    We decided not to charge overage since it breaks predictability.
  • Should we have 3-4 product editions with a different price based on features?
    We decided every user should get all features since it makes understanding costs easier, encourages adoption, and incentivizes best practices deployment.
  • Should we price based on processed data or ingested data?
    Ingested data size and volume are known to customers and therefore predictable. Processed data size and volume can only be discovered after a POC and would make it difficult to understand, calculate, and predict. Additionally, charging on ingested data means you pay only once for your data. If you produce 5 output tables from the same source data, you just pay once for the data ingested. This makes it easier to predict, as well as encourages new use cases that utilize existing data.
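The difference between the two billing bases in the last question can be sketched as follows. The price is hypothetical, and the processed-data model assumes that every output table reads the full source once, which is a simplification.

```python
def ingested_based_cost(source_gb: float, price_per_gb: float, output_tables: int) -> float:
    """Bill once for the source data, no matter how many tables are built from it."""
    # output_tables is intentionally unused: it does not affect the price.
    return source_gb * price_per_gb


def processed_based_cost(source_gb: float, price_per_gb: float, output_tables: int) -> float:
    """Bill for every pass over the data (assumes one full pass per output table)."""
    return source_gb * price_per_gb * output_tables


# 100 GB of source data feeding 5 output tables at a hypothetical 0.5 USD per GB.
print(ingested_based_cost(100, 0.5, output_tables=5))
print(processed_based_cost(100, 0.5, output_tables=5))
```

With ingested-based billing, adding a sixth table to the same source leaves the bill unchanged, which is what makes new use cases on existing data inexpensive to try.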

How can Upsolver help with your data platform ROI?

At Upsolver, we’re taking steps towards making our pricing simple, clear, and predictable. We believe our customers should pay for the value they receive, not for how many resources or seats they consume. Upsolver’s core value is allowing users to quickly ingest and transform data using SQL, which is only part of our customers’ journey to extract insights. We want this part of their journey to be simple, reliable, and cost-effective so they can focus on innovating and solving business problems.

Upsolver pricing offers the following benefits:

  1. Pay only when ingesting data.
  2. Deploy new use cases on existing data with no risk of overspending.
  3. Automatically optimize jobs and data to reduce overall operating costs.
  4. Pay on-demand or commit up front and receive volume discounts.

With Upsolver, users build production-ready data pipelines and ad-hoc data jobs using SQL. Jobs are always on and automatically process new data as it arrives; you never need external orchestration such as Apache Airflow. Upsolver automatically stores and manages data in the data lake using optimal partitioning, file formats, and small file compaction, before loading transformed data into your data warehouse, such as Amazon Redshift or Snowflake.

Upsolver enables you to reduce the complexity of your data stack, deliver self-service analytics, and future-proof your data with a predictable and easy-to-understand pricing model. To learn more, visit our website and SQL pipeline template gallery.

Try SQLake for Free (Early Access)

SQLake is Upsolver’s newest offering. It lets you build and run reliable data pipelines on streaming and batch data via an all-SQL experience. Try it for free. No credit card required. 

Published in: Blog, Building Data Pipelines
Roy Hasson

Roy Hasson is the head of product @ Upsolver. Previously, Roy was a product manager for AWS Glue and AWS Lake Formation.
