Snowflake vs. Databricks: A Practical Comparison

As the “big data” space heats up, so does the rivalry between Databricks and Snowflake, two of its biggest players.  Here at Upsolver we have no rooting interest; many of our customers are also Snowflake and/or Databricks customers.  So we’ll try to isolate the signal from the noise. 

Below we illuminate the playing field, clarify what’s behind this data dogfight, and provide some practical ways to compare the two so you might better decide where to run your analytical workloads.

Relevant terms (a more detailed comparison is here):

  • A data warehouse stores data in an organized hierarchical structure – typically in the form of database tables – for immediate analysis and business intelligence.  Storage and compute are tightly coupled.  Data typically is structured. 
  • A data lake is an architectural design pattern centered around a virtually limitless repository that stores raw data for subsequent analysis.  Storage is decoupled from compute.  Structured, semi-structured, or unstructured data can be stored equally well.
  • A data lakehouse architecture is a hybrid.  It combines the cheaper mass storage of a lake with other tools and services to achieve the high performance of a data warehouse.

Snowflake began as a cloud-native data warehouse centered around SQL.  It now decouples storage from compute.  Snowflake includes capabilities such as support for Python and unstructured data and the ability to query external tables (stored in a data lake, for example).  Snowflake promotes itself as a complete cloud data platform.  Yet at its core it is still a data warehouse, relying on a proprietary data format.
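As a rough illustration of how SQL-centric that model is, here is a minimal sketch of querying Snowflake from Python with the snowflake-connector-python package. The account, credentials, warehouse, and table names below are placeholders, not a prescribed setup.

```python
# Minimal sketch: running a query against Snowflake from Python with the
# snowflake-connector-python package. Account, credentials, warehouse, and
# table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="ANALYTICS_WH",   # the virtual warehouse that supplies the compute
    database="SALES",
    schema="PUBLIC",
)

cur = conn.cursor()
# Once defined, external tables (data living in a lake such as S3) are queried
# with the same SQL as native Snowflake tables.
cur.execute("SELECT region, SUM(amount) AS total FROM orders GROUP BY region")
for region, total in cur.fetchall():
    print(region, total)

cur.close()
conn.close()
```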

Databricks began as a processing engine – essentially, managed Apache Spark.  It now includes a data lake (Delta Lake), a SQL engine (Databricks SQL Analytics), and other services.  It promotes itself as a data lakehouse.   
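For comparison, here is a minimal PySpark sketch of the Delta Lake flow: land raw data in object storage, write it as a Delta table, then query it with Spark SQL. It assumes a Databricks cluster (or local Spark with the delta-spark package installed); bucket paths and table names are illustrative.

```python
# Minimal PySpark sketch: write semi-structured data to a Delta Lake table,
# then query it with Spark SQL. Paths and names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # on Databricks, `spark` already exists

raw = spark.read.json("s3://example-bucket/raw/events/")          # semi-structured input
raw.write.format("delta").mode("append").save("s3://example-bucket/delta/events/")

# Query the same data back through Spark SQL
spark.read.format("delta").load("s3://example-bucket/delta/events/") \
     .createOrReplaceTempView("events")
spark.sql("SELECT event_type, count(*) AS n FROM events GROUP BY event_type").show()
```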

Both companies now compete to be a “one-stop shop” for all of your data and analytics needs, from ingestion to storage to BI, ML and AI, and so on.

Snowflake and Databricks: architecture and vendor lock-in

The differences between the two vendors are diminishing, as neither supports a fully open data architecture.

 

Snowflake follows the traditional enterprise data warehouse model; you can’t query data in Snowflake except by using the Snowflake query engine.  Databricks promotes the data lakehouse paradigm but points in the same direction as Snowflake; you must use Delta Lake, and while Delta Lake is open core, its utility is limited without Databricks’ proprietary enhancements.

Databricks

  • Based on an open core distribution model.
  • Reserves key features for its commercial products.  The resulting “walled garden” creates vendor lock-in no different than a top-to-bottom proprietary stack.
  • Requires a specialized skill set to manage and maintain infrastructure.  For example, demands experience fine-tuning Spark, which in turn calls for expertise in Scala, Python, and distributed systems.
  • Relies on the proprietary Delta Lake file format on top of a cloud object storage layer, plus a SQL query engine, to expand beyond the data science use cases for which it was originally known.

 Snowflake

  • Cloud-based from its inception.
  • Upon ingest, organizes data into the company’s proprietary columnar format, creating vendor lock-in.
  • SQL-centric.
  • Built on top of cloud infrastructure in AWS, GCP, and Azure.  Recently added support for unstructured data to expand beyond the BI and reporting use cases for which it was originally known.

Upsolver architecture

Continuously delivers up-to-the-minute analytics-ready tables to query engines, cloud data warehouses, and streaming platforms, including nested, semi-structured, and array data.

  • Centralized data platform that automates labor-intensive aspects of working with streaming data: message ingestion, batch and streaming ETL, storage management, and data preparation for analytics.
  • Open architecture, fully cloud-based.
  • No vendor lock-in; works with a wide range of both proprietary and open components.

 

Snowflake and Databricks: Typical use cases

Reporting and SQL-based analytics: advantage Snowflake.  ML and AI: advantage Databricks.  Streaming and CDC: neither is optimal.

 

Snowflake

  • BI use cases such as SQL-based data analysis, reporting, and dashboarding.
  • Business problems that call for repeatable queries using high-quality cleaned data, usually batch, with a low to moderate query load.
  • Not typical are data science use cases such as ML and AI; these require additional tools, such as the Snowpark API, to build applications on top of Snowflake (see the sketch after this list).
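As a rough illustration of the Snowpark-for-Python route: the filter and aggregation below are pushed down and executed inside Snowflake, and only the result is pulled back for model training. The connection parameters and the FEATURES table are placeholders, not a real schema.

```python
# Hypothetical Snowpark-for-Python sketch: the transformation runs inside
# Snowflake; only the result comes back as a pandas DataFrame for ML work.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

connection_parameters = {
    "account": "my_account",
    "user": "my_user",
    "password": "my_password",
    "warehouse": "ML_WH",
    "database": "ANALYTICS",
    "schema": "PUBLIC",
}
session = Session.builder.configs(connection_parameters).create()

features = (
    session.table("FEATURES")                      # placeholder table
    .filter(col("EVENT_DATE") >= "2023-01-01")
    .group_by("CUSTOMER_ID")
    .agg(sum_(col("PURCHASES")).alias("TOTAL_PURCHASES"))
)
training_df = features.to_pandas()   # hand off to an ML library of your choice
```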

Databricks

  • Data science/ML use cases such as predictive analytics, recommendation engines, and intrusion detection.
  • Used for big data batch ETL jobs.
  • Not typical are use cases requiring frequent addition or modification of data pipelines.

Neither Snowflake nor Databricks is ideal when:

  • You rely heavily on streaming analytics at scale (use cases such as online gaming and advertising, financial transactions, real-time IoT analysis, and so on).
  • You’re not sure what you want to do with all your data.
  • You want to keep your architecture as open as possible.
  • You rely on change data capture.

Upsolver typical use cases

Improves performance, reduces cost, and eliminates data preparation bottlenecks for a range of data-intensive use cases, including:

  • Streaming analytics at scale (online gaming and advertising, financial transactions, real-time IoT analysis, and so on).
  • CDC and database replication.
  • Real-time analytics.

 

Databricks and Snowflake: Pricing

Both are consumption-based.  Streaming data can spike costs with Snowflake.  Manual engineering work can spike costs for Databricks.

 

Databricks

  • Consumption-based: DBUs are billed per second of compute time; the rate depends on node type, node count, and cluster type.
  • Storage (S3) and compute (EC2) run in the customer’s VPC and are billed directly to the customer’s cloud account.
  • Engineering expertise is a separate cost and can cause expenses to spike, as tuning and customization are code-based. 

Snowflake

  • Consumption-based: the rate is based on a combination of clock time (how long the virtual data warehouse runs, regardless of load), the pre-configured size of each VDW, and the service tier.
  • Storage charges are separate.
  • Complex datasets at scale can cause costs to spike, as can streaming data, which creates a continuous load by keeping the Snowflake VDW meter running 24/7 (see the cost sketch after this list). 
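To see why a continuous load matters, here is a back-of-the-envelope sketch of the pricing model above. The credits-per-hour figures follow Snowflake’s published size doubling (XS = 1, S = 2, M = 4, and so on); the dollars-per-credit figure is an assumed example, since it varies by edition, cloud, and region.

```python
# Back-of-the-envelope sketch of Snowflake compute cost: warehouse size and
# running time drive the bill. DOLLARS_PER_CREDIT is an assumed example value.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}
DOLLARS_PER_CREDIT = 3.00   # assumption: depends on edition, cloud, and region

def monthly_cost(size: str, hours_running_per_day: float, days: int = 30) -> float:
    """Estimate monthly compute cost for one virtual warehouse."""
    return CREDITS_PER_HOUR[size] * hours_running_per_day * days * DOLLARS_PER_CREDIT

# A Medium warehouse running 4 hours/day for batch reporting:
print(monthly_cost("M", 4))    # 4 * 4 * 30 * 3.00 = 1440.0
# The same warehouse kept running 24/7 by a continuous streaming load:
print(monthly_cost("M", 24))   # 4 * 24 * 30 * 3.00 = 8640.0
```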

More pricing resources:

Databricks’ pricing page

Pricing breakdown, Databricks and Upsolver

Snowflake’s pricing page


Upsolver pricing

Upsolver pricing is based on Upsolver’s server usage.

  • Consumption-based, on-demand
  • Compute-based model where pricing is based on Upsolver’s usage of EC2 servers.
  • Minimum 4-CPU instances.
  • Free edition includes fixed-capacity compute and unlimited storage on the customer’s cloud account (20 GB on the Upsolver cloud account).

 

Snowflake and Databricks: Performance

Both perform best on data in their own proprietary formats.  Snowflake has an advantage on ad hoc queries, batch transformation workloads, and BI reporting.  Databricks uses Spark to query semi-structured and schema-less data, and add-on tools to run SQL.

 

Databricks

  • Uses Spark to run analytics queries against semi-structured, schema-less data.
  • When using the Databricks SQL Analytics query engine, must query Delta Lake tables stored in the Delta file format.
  • Offers Delta Engine, a federated query engine optimized to run Spark SQL, with a caching layer on top of Delta Lake-versioned Parquet.
  • Added a SQL query engine (Databricks SQL Analytics) to support BI reporting and ad hoc query use cases.

Snowflake

  • Automatically enforces a schema, enhancing query performance.
  • Automatic query tuning.
  • Optimizes storage for data access by its native SQL query engine during ingestion; uses columnar format.
  • Closely ties storage and querying; solves well for interactive queries.

More performance resources:

Databricks and other data lake query engines

Snowflake and Redshift on AWS

Snowflake and Databricks: Benchmarking performance

Benchmarking query performance has become a point of strong contention between Snowflake and Databricks.  The two came to verbal blows over a benchmark test of processing speed called TPC-DS.  Databricks claimed significantly faster performance.  Snowflake claimed Databricks’ announcement was misleading and lacked integrity.  Databricks implied Snowflake pre-processed the data it used in the test to obtain better results. 

Regardless, storage and compute functions are now decoupled and will remain so.  When it comes to batch workload speed, either company will likely satisfy most organizations’ needs.


More benchmarking resources:

Databricks and Snowflake collide

The TPC-DS  decision support benchmark

Test of Snowflake’s processing speed both with Upsolver and without

Upsolver performance

Automatically implements numerous best practices for designing and operationalizing a high-performance data lake.

  • Automated best practices such as partitioning, file conversion, compaction, and compression substantially improve the performance of cloud data warehouses and query engines.
  • Continuously optimizes S3 storage for faster analysis.
  • Declarative pipelines perform 10-100x faster than Spark.

 

Snowflake and Databricks: Scalability

Both offer auto-scaling and can handle most enterprise-level loads.  Scaling up or down is generally simpler in Snowflake.

 

Databricks

  • Auto-scales based on load (see the configuration sketch after this list).
  • Optimized auto-scaling can reach the maximum in two steps and scales down under a variety of conditions (based on a percentage of current nodes and how long they have been underutilized).
  • Standard auto-scaling can take multiple steps to reach the maximum and follows stricter guidelines for scaling down (the cluster must be 100% idle for a longer period).
  • Removes idle workers from under-utilized clusters and adds workers during the more computationally demanding phases of a job, so workloads run faster when clusters are properly provisioned.
  • Auto-scaling for ETL or processing throughput is challenging.
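As a concrete illustration of how that worker range is expressed, here is a sketch of creating an autoscaling cluster through the Databricks Clusters REST API. The workspace URL, access token, node type, and runtime version are placeholders, not recommendations.

```python
# Sketch: create an autoscaling Databricks cluster via the Clusters REST API.
# Databricks adds or removes workers within the min/max range based on load.
import requests

payload = {
    "cluster_name": "etl-autoscaling",
    "spark_version": "13.3.x-scala2.12",        # illustrative runtime version
    "node_type_id": "i3.xlarge",                 # illustrative AWS node type
    "autoscale": {"min_workers": 2, "max_workers": 8},
}

resp = requests.post(
    "https://<your-workspace>.cloud.databricks.com/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
    timeout=30,
)
print(resp.json())   # returns the new cluster_id on success
```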

Snowflake

  • Scales without affecting queries in progress, enabling just-in-time scaling based on current demand.
  • Enables near-infinite scalability by isolating simultaneous workloads on dedicated resources.
  • 1-click cluster resize, but no choice of node size.
  • Can scale out to multi-cluster warehouses to accommodate a large number of concurrent users and/or queries (see the sketch after this list).
  • Processing and storage layers can scale independently.
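Here is a sketch of those knobs in practice, issued through the Python connector: resizing a warehouse (scale up) and allowing multiple clusters (scale out). The warehouse name and connection details are placeholders, and multi-cluster warehouses require an edition that supports them.

```python
# Sketch: Snowflake warehouse scaling via SQL, issued from the Python connector.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password"
)
cur = conn.cursor()

# Scale up: pick a larger pre-configured size (no per-node choices to make).
cur.execute("ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE'")

# Scale out: let the warehouse run up to 4 clusters to absorb concurrent queries.
cur.execute(
    "ALTER WAREHOUSE analytics_wh SET "
    "MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 4 SCALING_POLICY = 'STANDARD'"
)

cur.close()
conn.close()
```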

Upsolver scalability

Provides high performance at scale on complex data.  You choose a scaling strategy to keep latency consistently low.

  • Storage (S3) and compute (EC2 spot instances) are decoupled.
  • Scales out the compute cluster to handle increases in message volume.
  • Scaling is linear; local disks are not used.
  • A visual IDE and SQL enable rapid construction of declarative pipelines at petabyte scale.

 

Snowflake and Databricks: Security

Though implementation details vary, security features in both Databricks and Snowflake are likely sufficient for most enterprises.

 

Databricks

  • Encryption at rest.
  • Provides separate customer keys.
  • Role-based access control for workspace objects, clusters, jobs, pools, and tables.
  • Row- and column-level security in Delta Lake is provided through Databricks partners.

Snowflake

  • Always-on encryption, depending on product edition.
  • Provides separate customer keys.
  • Role-based access control.
  • Row-level and column-level security (aka data masking); see the sketch after this list.
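As an illustration, here is a sketch of Snowflake column-level security (dynamic data masking) applied through the Python connector. The policy, role, and table names are placeholders, and the feature depends on your Snowflake edition.

```python
# Sketch: define a masking policy and attach it to a column in Snowflake.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    database="SALES", schema="PUBLIC", warehouse="ADMIN_WH",
)
cur = conn.cursor()

# Analysts see real values; every other role sees a masked value.
cur.execute("""
    CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
        CASE WHEN CURRENT_ROLE() IN ('ANALYST_ROLE') THEN val ELSE '***MASKED***' END
""")

# Attach the policy to a column; queries from other roles now return masked values.
cur.execute("ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask")

cur.close()
conn.close()
```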

More security resources:

Databricks Platform Security

Databricks Security Guide

Snowflake security features

https://satoricyber.com/snowflake-security/snowflake-security/

Upsolver security

Upsolver never stores any of your data; you choose where Upsolver’s processing, serving, and streaming clusters reside.

  • Data access and server management are decoupled.  Upsolver does not store data.
  • Clusters can reside in the customer’s VPC (managed remotely, from within the customer’s AWS account) or in an Upsolver VPC (processing data on EC2 clusters in Upsolver’s AWS account while still storing data in the customer’s AWS account).
  • Single sign-on and role-based access to the Upsolver cluster.

 

Questions to ask when examining Snowflake and Databricks

When comparing features and functions, it’s critical also to understand as clearly as possible:

  • What you wish to do, and plan on doing, with your data.
  • What skills you have in-house (the biggest difference between Snowflake and Databricks is the skills required to use them).

Answers to the above questions will guide your answers to the questions below:

  • How much do you plan to analyze (vs. how much you plan to store)?
  • Are you primarily processing batch jobs or data streams in real- or near-real-time?
  • How complex are your transformations?  If you have in-house expertise in Scala, Python, and so on, highly complex transformations may be easier to implement manually in Spark than with SQL.
  • Are you deriving insights from data streams, such as social media, e-commerce, or sensors and the internet of things (IoT)?
  • What is your 3- to 5-year forecast for business growth and expansion?

About Upsolver

Upsolver provides low-code data pipelines for cloud data lakes.  Our goal is to eliminate complex data engineering so organizations can easily build and affordably run advanced cloud analytics workloads, at scale and in real-time.

Upsolver neither stores nor analyzes data.  But it is an integral part of an end-to-end streaming data architecture, and as such it is more likely to complement than compete with a cloud data warehouse or data lake.

To speak with an expert, please schedule a demo: https://www.upsolver.com/schedule-demo

Or just see for yourself how easy it is: download and use the free version: https://app.upsolver.com/signup

Jerry Franklin

Jerry is a senior content manager at Upsolver. He has over 20 years of experience developing technical content for SaaS companies, and has worked as a technical writer at Box, SugarSync, and Navis. Connect with Jerry on LinkedIn
