Explore our expert-made templates & start with the right one for you.
Snowflake vs. Databricks: A Practical Comparison
Evaluating cloud database performance? Read our recent benchmark report to understand how to best ingest and process streaming data in Snowflake and why this can dramatically impact performance. Get the full benchmark right here, FREE.
Things are heating up. As the big data space edges closer to maturity, so does the rivalry between Databricks and Snowflake, two of its biggest players. Regardless of the bad blood and PR battle between the two companies, their products are indeed two of the most common choices for cloud analytics workloads. This guide is going to focus on the practical aspects of choosing between the two tools, and outline the considerations you might take into account before deciding on one or the other.
How objective are we? Here at Upsolver we have no rooting interest; our product SQLake is an all-SQL data pipeline platform that lets you just “write a query and get a pipeline” for batch and streaming data . It automates everything else, including orchestration, file system optimization and infrastructure management. Query the live tables we create in your data lake via Databricks SQL or other query engines, or in your data warehouse such as Snowflake or Redshift. To check it out, you can execute sample pipeline templates, or start building your own, in Upsolver SQLake for free.
as a SQL pipeline platform for cloud data lakes, many of our customers are also Snowflake or Databricks customers. In both cases they can reduce their cloud bill and leverage an open architecture by using Upsolver to ingest and continually process data in the lake before running their analytics queries. From this vantage point, we’ll hopefully be able to isolate the signal from the noise. Throughout the article, we’ll highlight the places where we believe Upsolver can be relevant, but you’re welcome to skip these sections if you just want the comparison.
Find out how to reduce data warehouse costs by downloading our benchmark report.
Key Concepts to Avoid Confusion: Data Lake, Data Warehouse, and Data Lakehouse
Below we illuminate the playing field, clarify what’s behind this data dogfight, and provide some practical ways to compare the two so you might better decide where to run your analytical workloads.
Relevant terms (a more detailed comparison is here):
- A data warehouse stores data in an organized hierarchical structure – typically in the form of database tables – for immediate analysis and business intelligence. Storage and compute are tightly coupled. Data typically is structured.
- A data lake is an architectural design pattern centered around a virtually limitless repository that stores raw data for subsequent analysis. Storage is decoupled from compute. Structured, semi-structured, or unstructured data can be stored equally well.
- A data lakehouse architecture is a hybrid. It combines the cheaper mass storage of a lake with other tools and services to achieve the high performance of a data warehouse.
Snowflake began as a cloud-native data warehouse centered around SQL. It now decouples storage from compute, which might lead some to mistakenly refer to Snowflake as a data lake. Snowflake includes capabilities such as support for Python and unstructured data and the ability to query external tables (stored in a data lake, for example). Snowflake promotes itself as a complete cloud data platform. Yet at its core it is still a data warehouse, relying on a proprietary data format.
Databricks began as a processing engine – essentially, managed Apache Spark. It now includes a data lake (Delta Lake), a SQL engine (Databricks SQL Analytics), and other services. It promotes itself as a data lakehouse.
Both companies now compete to be a “one-stop shop” for all of your data and analytics needs, from ingestion to storage to BI, ML and AI, and so on.
Architecture and Vendor Lock-In: Which Platform is More Open?
The differences between the two vendors are diminishing, as neither supports a fully-open data architecture.
Snowflake follows the traditional enterprise data warehouse; you can’t query data in Snowflake except by using the Snowflake query engine. Databricks promotes the data lakehouse paradigm but is also pointed in the same direction as Snowflake; you must use Delta Lake, and while Delta Lake is open core its utility is limited without Databricks’ proprietary enhancements.
- Based on open core distribution model.
- Reserves key features for its commercial products. The resulting “walled garden” creates vendor lock-in no different than a top-to-bottom proprietary stack.
- Requires a specialized skill set to manage and maintain infrastructure. For example, demands experience fine-tuning Spark, which in turn calls for expertise in Scala, Python, and distributed systems.
- Relies on proprietary Delta Lake file format on top of a cloud object storage layer and SQL query engine to expand beyond the original data science use cases for which it had been known.
- Cloud-based from its inception.
- Upon ingest, organizes data into the company’s proprietary columnar format, creating vendor lock-in.
- Built on top of cloud infrastructure in AWS, GCP, and Azure. Recently added support for unstructured data to expand beyond the original BI and reporting use cases for which it had been known.
|Based on open core distribution model.||Cloud-based from its inception.|
|Reserves key features for its commercial products. The resulting “walled garden” creates vendor lock-in no different than a top-to-bottom proprietary stack.||Upon ingest, organizes data into the company’s proprietary columnar format, creating vendor lock-in|
|Requires a specialized skill set to manage and maintain infrastructure. For example, demands experience fine-tuning Spark, which in turn calls for expertise in Scala, Python, and distributed systems.||SQL-centric.|
|Relies on proprietary Delta Lake file format on top of a cloud object storage layer and SQL query engine to expand beyond the original data science use cases for which it had been known.||Built on top of cloud infrastructure in AWS, GCP, and Azure. Recently added support for unstructured data to expand beyond the original BI and reporting use cases for which it had been known.|
How Upsolver fits into a data lake architecture
Upsolver continuously delivers up-to-the-minute analytics-ready tables to query engines, cloud data warehouses, and streaming platforms – including nested, semi-structured, and array data.
- Single platform to unify ETL for streaming and batch data.
- Eliminates orchestration (No DAGs) and automates file system optimization
- Can be queried by Databricks Photon engine or output to Snowflake tables.
- Open architecture, fully cloud-based (can read on-prem sources too).
To learn more, download the Architecture Overview
Typical Use Cases: What are Snowflakes and Databricks Used For?
Reporting and SQL-based analytics: advantage Snowflake. ML and AI: advantage Databricks. Streaming and CDC: neither is optimal.
- BI use cases such as SQL-based data analysis; reporting; and dashboarding.
- Business problems that call for repeatable queries using high-quality cleaned data, usually batch, with a low- to moderate query load.
- Not typical are data science use cases such as ML and AI; these require additional tools, such as the Snowpark API, to build applications on top of Snowflake.
- Data science/ML, such as: predictive analytics; recommendation engines; and intrusion detection.
- Used for big data batch ETL jobs.
- Not typical are use cases requiring frequent addition or modification of data pipelines.
Neither Snowflake nor Databricks is ideal when:
- You rely heavily on streaming analytics at scale (use cases such as online gaming and advertising, financial transactions, real-time IoT analysis, and so on).
- You’re not sure what you want to do with all your data.
- You want to keep your architecture as open as possible.
- You rely on change data capture.
|Data science/ML, such as: predictive analytics; recommendation engines; and intrusion detection.||BI use cases such as SQL-based data analysis; reporting; and dashboarding.|
|Used for big data batch ETL jobs.||Business problems that call for repeatable queries using high-quality cleaned data, usually batch, with a low- to moderate query load.|
|Not typical are use cases requiring frequent addition or modification of data pipelines.||Not typical are data science use cases such as ML and AI; these require additional tools, such as the Snowpark API, to build applications on top of Snowflake.|
Typical use cases for Upsolver in the ETL layer
Upsovler improves performance, reduces cost, and eliminates data preparation bottlenecks for a range of data-intensive use cases, including:
- Streaming analytics at scale (digital ads, online gaming, log analytics, real-time IoT analysis, and so on).
- Database CDC and replication.
- Data warehouse staging area (reduce cost by offloading real-time data prep to data lake)
Learn more about our data lake solutions.
How Databricks and Snowflake Price Usage
Both are consumption-based. Streaming data can spike costs with Snowflake. Manual engineering work can spike costs for Databricks.
- Consumption-based: DBU compute time per second; rate based on node type, number, and cluster type.
- Storage (S3) and compute (EC2) charged to customer VPC.
- Engineering expertise is separate and can cause costs to spike, as tuning and customization is code-based.
- Consumption-based: rate based on a combination of clock time (running time of virtual data warehouse, regardless of load); pre-configured sizes per VDW; and service tier.
- Storage charges are separate.
- Complex datasets at scale can cause costs to spike, as can streaming data (which creates a continuous load by keeping the Snowflake VDW meter running 24/7).
More pricing resources:
|Consumption-based: DBU compute time per second; rate based on node type, number, and cluster type.||Consumption-based: rate based on a combination of clock time (running time of virtual data warehouse, regardless of load); pre-configured sizes per VDW; and service tier|
|Storage (S3) and compute (EC2) charged to customer VPC.||Storage charges are separate.|
|Engineering expertise is separate and can cause costs to spike, as tuning and customization is code-based.||Complex datasets at scale can cause costs to spike, as can streaming data (which creates a continuous load by keeping the Snowflake VDW meter running 24/7).|
With Upsolver you pay $99/TB ingested, transformations are free.
We don’t believe in baffling users with “credit-hour” pricing schemes. Upsolver’s pricing is based solely on your data volume ingested. Since data is retained in the data lake, you can run an unlimited number of transformations on your ingested data for no additional cost.
- $99/TB (> 10 cents per GB) of data ingested. Retained in data lake for repeat use.
- Unlimited transformations.
Performance: A Raging Debate
Both perform best on data in their own proprietary formats. Snowflake has an advantage on ad hoc queries, batch transformation workloads, BI reporting. Databricks uses Spark to query semi-structured and schema-less data, and add-on tools to run SQL.
- Uses Spark to run analytics queries against semi-structured, schema-less data.
- Must query Delta Lake tables in Delta file format (when using the Databricks SQL Analytics query engine). Otherwise…
- Offers Delta Engine, a federated query engine optimized to run SparkSQL and that includes a caching layer on top of Delta Lake-versioned Parquet.
- Added SQL query engine (DB SQL Analytics) to support BI reporting and ad hoc query use cases.
- Automatically enforces a schema, enhancing query performance.
- Automatic query tuning.
- Optimizes storage for data access by its native SQL query engine during ingestion; uses columnar format.
- Closely ties storage and querying; solves well for interactive queries.
More performance resources:
- Our summary of Databricks and other data lake query engines
- Comparing Snowflake and Redshift on AWS
Recent Performance Benchmarks
Benchmarking query performance has become a point of strong contention between Snowflake and Databricks. The two came to verbal blows over a benchmark test of processing speed called TPC-DS. Databricks claimed significantly faster performance. Snowflake claimed Databricks’ announcement was misleading and lacked integrity. Databricks implied Snowflake pre-processed the data it used in the test to obtain better results.
Regardless, storage and compute functions are now and will remain decoupled. When it comes to batch workload speed either company will likely satisfy most organizations’ needs. For streaming data, we’ve included a link to our benchmark report at the bottom of this post.
|Uses Spark to run analytics queries against semi-structured, schema-less data.||Automatically enforces a schema, enhancing query performance.|
|Must query Delta Lake tables in Delta file format (when using the Databricks SQL Analytics query engine). Otherwise…||Automatic query tuning.|
|Offers Delta Engine, a federated query engine optimized to run SparkSQL and that includes a caching layer on top of Delta Lake-versioned Parquet.||Optimizes storage for data access by its native SQL query engine during ingestion; uses columnar format.|
|Added SQL query engine (DB SQL Analytics) to support BI reporting and ad hoc query use cases.||Closely ties storage and querying; solves well for interactive queries.|
More benchmarking resources:
- Databricks and Snowflake collide – a summary of the 2021 ‘blog warfare’ between the two companies
Scalability: Will Snowflake and Databricks Support Larger Workloads?
Both offer auto-scaling and can handle most enterprise-level loads. Scaling up or down is generally simpler in Snowflake.
- Auto-scales based on load.
- Optimized auto-scaling goes to maximum in 2 steps; scales down under a variety of circumstances (percentage of current nodes, time of underutilization).
- Standard auto-scaling can take multiple steps to reach maximum; follows stricter guidelines for scaling down (must be 100% idle for longer).
- Removes idle workers on under-utilized clusters. Adds additional workers during more computationally-demanding phases of a job. Workloads run faster as clusters are properly provisioned.
- Auto-scaling for ETL or processing throughput is challenging.
- Scales without affecting queries in process, enabling just in-time scaling based on current demand.
- Enables near-infinite scalability by isolating simultaneous workloads on dedicated resources.
- 1-click cluster resize, but no choice of node size.
- Can scale out to use multi-cluster warehouses to accommodate a large number of concurrent users and/or queries.
- Processing and storage layers can scale independently.
|Auto-scales based on load.||Scales without affecting queries in process, enabling just in-time scaling based on current demand.|
|Optimized auto-scaling goes to maximum in 2 steps; scales down under a variety of circumstances (percentage of current nodes, time of underutilization).||Enables near-infinite scalability by isolating simultaneous workloads on dedicated resources.|
|Standard auto-scaling can take multiple steps to reach maximum; follows stricter guidelines for scaling down (must be 100% idle for longer).||1-click cluster resize, but no choice of node size.|
|Removes idle workers on under-utilized clusters. Adds additional workers during more computationally-demanding phases of a job. Workloads run faster as clusters are properly provisioned.||Can scale out to use multi-cluster warehouses to accommodate a large number of concurrent users and/or queries.|
|Auto-scaling for ETL or processing throughput is challenging.||Processing and storage layers can scale independently.|
How Upsolver handles scale
Upsolver provides high performance at petabyte-scale on complex data. You choose a scaling strategy to keep consistent low latency.
- Storage (S3) and compute (EC2 spot instances) are decoupled.
- Scales out the compute cluster to handle increases in message volume.
- Scaling is linear; local disks are not used.
- Visual IDE, SQL enable rapid construction of declarative pipelines at PB scale.
Though implementation details vary, security features in both Databricks and Snowflake are likely sufficient for most enterprises.
Databricks Security Features
- Encryption at rest.
- Provides separate customer keys.
- Role-based access control for workspace objects, clusters, jobs, pools, and table-level.
- Row- and column-level security in Delta Lake provided through Databricks partners.
Snowflake Security Features
- Always-on encryption, depending on product edition.
- Provides separate customer keys.
- Role-based access control.
- Row-level and column-level security (aka data masking).
|Encryption at rest.||Always-on encryption, depending on product edition.|
|Provides separate customer keys.||Provides separate customer keys.|
|Role-based access control for workspace objects, clusters, jobs, pools, and table-level.||Role-based access control.|
|Row- and column-level security in Delta Lake provided through Databricks partners.||Includes row-level security and column-level security (aka data masking)|
More security resources:
- Databricks Platform Security – webpage the Databricks website detailing platform security features
- Databricks Security Guide – Databricks documentation covering topics such as how to set up access control and secure cluster connectivity
- Snowflake security features – official documentation from Snowflake
- Snowflake security: access controls explained – detailed guide by Satori Cyber
How Upsolver handles security
Upsolver never stores any of your data; you choose where Upsolver’s processing, serving, and streaming clusters reside.
- Data access and server management are decoupled. Upsolver does not store data.
- Clusters can reside in customer’s VPC (managed remotely, from within customer’s AWS account) or on an Upsolver VPC (processing data on EC2 clusters in Upsolver AWS account but still storing data on customer’s AWS account).
- Single sign-on and role-based access to Upsolver cluster.
Questions to Ask When Examining Snowflake and Databricks
When comparing features and functions, it’s critical also to understand as clearly as possible:
- What you wish to do, and plan on doing, with your data.
- What skills you have in-house (the biggest difference between Snowflake and Databricks is the skills required to use them).
Answers to the above questions will guide your answers to the below:
- How much do you plan to analyze (vs. how much you plan to store)?
- Are you primarily processing batch jobs or data streams in real- or near-real-time?
- How complex are your transformations? If you have in-house expertise in Scala, Python, and so on, highly-complex transformations may be easier done manually in Spark rather than with SQL.
- Are you deriving insights from data streams, such as social media, e-commerce, or sensors and the internet of things (IoT)?
- What is your 3 – 5 year forecast for your business in terms of growth and expansion?
If You’re Evaluating or Using Snowflake, You Need This Benchmark Report
Our Snowflake benchmark report reviews four approaches to processing data for near real-time analytics, using Snowflake Data Platform as the destination for analytics. Our goal was to understand the differences and trade-offs across these requirements:
- Latency to obtain query-ready data
- Cloud compute cost to process the data
- Time required to develop and maintain the data pipeline
- Cost and complexity of infrastructure the data team needs to maintain as scale, velocity and consumption requirements change
Get the report now (it’s free and instantly available), or watch the accompanying webinar here.
Learn More About How Upsolver SQLake Powers a Truly Open Architecture
SQLake is Upsolver’s newest offering. It lets you build and run reliable data pipelines on streaming and batch data via an all-SQL experience.
With SQLake users can easily develop, test, and deploy pipelines that extract, transform, and load data in the data lake and data warehouse in minutes instead of weeks. SQLake simplifies pipeline operations by automating tasks like job orchestration and scheduling, file system optimization, data retention, and the scaling of compute resources.
SQLake neither stores data nor analyzes data. But it is an integral part of an end-to-end streaming data architecture. SQLake outputs data as optimized, open-source Apache Parquet – and as such it is more likely to complement than compete with a cloud data warehouse or data lake.
Try SQLake for free for 30 days. No credit card required. Browse our growing library of declarative pipeline templates, and get started right away either with our sample data or with data from your own S3 bucket.
Visit the SQLake Builders Hub, where you can browse our pipeline templates and consult an assortment of how-to guides, technical blogs, and product documentation.
If you have any questions, or wish to discuss this integration or explore other use cases, start the conversation in our Upsolver Community Slack channel.
To speak with an expert, please schedule a demo: https://www.upsolver.com/schedule-demo.