Migrating a Self-Managed Data Lake to Snowflake

Shoutout to Rachel Horder for the alternate perspectives and lively debates!

A brief history 

Data lakes allow companies to cost-effectively consolidate and store vast amounts of structured and unstructured data in open file formats, to be leveraged in analytics and ML use cases. Before data lakes, businesses stored mainly structured data in databases and data warehouses, which were limited in their ability to cost-effectively scale to meet both storage and query needs. To address the scale challenges, companies that could afford it began building data lakes on top of the Apache Hadoop Distributed File System (HDFS). By lowering the cost barrier to scale, these lakes allowed engineers to deliver more use cases with their data. However, Hadoop-based data lakes required a dedicated team of data engineers to build and maintain, keeping them out of reach for the majority of companies.

With the modernization of technology stacks on the public cloud, companies of any size could adopt object stores to cheaply store any and all data. Hadoop’s, and later Apache Spark’s, ability to separate compute from storage greatly simplified data processing in the cloud. This was the lightbulb moment: the creation of the cloud-native data lake.

Companies flocked to this new concept. However, building a data lake on cloud-native components still required a great deal of engineering effort to get right. Data modeling, ontological architecture, metadata management, retention and removal, and access and governance are just some of the pieces that must be done well when building and maintaining a data lake. Without proper architecture, data lakes can fail to deliver their promised value of expedited data activation and monetization. Many smaller companies realized the complexity of self-managing even a cloud-native data lake soon after embarking on its implementation, and either abandoned the project or persisted to the detriment of business progress and innovation.

An unlikely hero came to the rescue: the data warehouse—albeit a far more modern version—in particular, Snowflake.

Today’s challenge

The challenge today is migrating data from a self-managed cloud-native data lake, or the remnants of one, to Snowflake or another modern data warehouse. We created this post as an actionable guideline to follow for such a migration.

In this post you will learn: 

  • The drivers behind the decision to migrate from a self-managed data lake to Snowflake.
  • What components you need to migrate and their equivalents in Snowflake. 

In the next post, you will learn:

  • Best practices for migrating your data and the pitfalls to avoid.
  • How Upsolver can help you efficiently migrate your data lake to Snowflake.

When is it time to migrate?

First, let’s acknowledge the pros of a data lake. Data lakes offer a single platform for storing and analyzing data, making it easy for data engineers, data scientists, and analysts to work with raw data without needing to switch between different systems. They can handle large amounts of data cost-effectively. The architecture is versatile because it supports diverse data formats, from structured to semi-structured and unstructured data, consolidated in a single, accessible location. Storing data in its native format facilitates querying and analysis by different user personas within your organization, who need to consume the data in unique ways. 

In fact, data lakes enable diverse use cases, beyond human-readable queries, such as sharing and monetizing data across business functions, machine learning, and other automated software-level applications. Additionally, by leveraging open file formats, data lakes make data portable. This gives you the ability to implement new tools and use cases quickly, avoiding the dreaded lock-in.

This all sounds great, so why would you want to migrate away from your data lake?

Data engineers are under pressure from two business requirements that are in tension: faster data delivery and ever-increasing data sources (and volume) to ingest.

The business wants to leverage its data for ever faster decision-making and to constantly reveal new revenue streams that data can help unlock. Data SLAs shorten, as faster data activation provides competitive advantage.

On the flip side, coveted growth in product usage means more user sessions to track, more transactions in production databases, and more events transferred through message buses. In today’s distributed world connected by the internet of things, growth can also mean more edge devices deployed, each with more sensors embedded. Data continues to grow exponentially.

As engineers, we are tasked with overseeing access, security, and compliance of all business data, while constantly incorporating new data sources, scaling to meet the volume of data generation, and simultaneously facing demands from the business to expedite data availability.

Self-managed data lakes run counter to all of these requirements. Data lakes are prone to becoming data swamps, which are characterized by unorganized data whose quality cannot be consistently calibrated—otherwise known as untrustworthy data. Manually identifying and optimizing file formats, compacting, partitioning, and caching data for fast access are labor-intensive processes in data lake management that are aggravated by increased data volume. Dedicated engineering effort is also needed for governance and quality control of data to prevent data breaches and compliance issues. 

Additionally, without experienced data engineers owning continuous data lake management, its full potential—hyper-optimized storage efficiency and computation performance combined with maximal business-strategic organization—could remain untapped. Companies that succeed in building a data lake but cannot maintain it fail to realize any real business value from this powerful technology.

Meanwhile, Snowflake’s automated data management and optimization capabilities make it an attractive alternative, even for companies with resources to expend.  

Let’s dig into two specific engineering pain points of data lake management and how Snowflake simplifies them:

  • Query Performance: Returning query results within the shortest possible timescale is essential for analysts to deliver timely business insights. With a data lake’s ability to store millions of files and tables, it’s not uncommon for bottlenecks to occur. A high number of small files can slow performance due to I/O limitations, so files need to be consolidated and compacted to overcome this. Furthermore, without caching, repeatedly reading the same data can degrade performance. When data lakes expand to petabytes in size, the volume of metadata required to maintain the relationships between all that data becomes a factor as well. Lastly, deleted files in cloud data lakes may continue to persist without the knowledge of those responsible for managing the lake, further impacting performance and exposing companies to security and privacy risks. Snowflake’s built-in data retention and cleanup functionality alleviates these concerns with minimal effort.
  • Data Governance: Historically, data lakes have been difficult to secure. With GDPR and CCPA laws firmly in place to protect customer data, companies must respond to requests for updates and deletes to personal information. Handling updates and deletes to Parquet files held in the lake is compute-intensive and hard to achieve, as all relevant files must be identified and rewritten in order to amend the data. Furthermore, to avoid security risks, governance experts must implement explicit measures and access controls to protect sensitive data from unauthorized access. Enabling fine-grained access controls in the data lake requires a well-orchestrated, i.e. labor-intensive, integration between different, often incompatible, policy decision and policy enforcement systems. By contrast, Snowflake offers tag- and resource-based access controls that are integrated with its own policy management capability as well as a wide range of third-party solutions, simplifying the enforcement and audit of user access.
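
The tag-based controls described above can be scripted. Below is a minimal sketch, in Python, that generates the Snowflake DDL for a tag-driven masking policy; the tag, policy, and role names (`pii`, `mask_pii`, `PRIVACY_ADMIN`) are hypothetical, and the statements follow Snowflake’s CREATE MASKING POLICY and ALTER TAG syntax.

```python
# Sketch: generating Snowflake tag-based masking DDL from a simple spec.
# Tag, policy, and role names are hypothetical placeholders.

def masking_policy_ddl(policy: str, tag: str, allowed_role: str) -> list[str]:
    """Return statements that mask string columns tagged with `tag`
    for every role except `allowed_role`."""
    return [
        f"CREATE TAG IF NOT EXISTS {tag};",
        (
            f"CREATE MASKING POLICY IF NOT EXISTS {policy} AS (val STRING) "
            f"RETURNS STRING -> CASE WHEN CURRENT_ROLE() = '{allowed_role}' "
            f"THEN val ELSE '***MASKED***' END;"
        ),
        # Attaching the policy to the tag masks every column carrying the tag.
        f"ALTER TAG {tag} SET MASKING POLICY {policy};",
    ]

for stmt in masking_policy_ddl("mask_pii", "pii", "PRIVACY_ADMIN"):
    print(stmt)
```

Once the tag is attached to columns, new tables that reuse the tag are protected automatically, which is the main operational win over per-file controls in the lake.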

In addition to abstracting away performance tuning and data optimization, Snowflake also eliminates manual governance and security tasks, greatly reducing companies’ risk exposure. Lastly, through its UI-based data exploration and summarization features, Snowflake brings a level of ease of use for both technical and non-technical users.

The components to migrate

Building and managing a data lake requires you to implement several core functional components which must be integrated correctly to allow you to efficiently store, process, catalog and analyze your data. Furthermore, security, performance and scale requirements introduce added layers of complexity as these components, individually and together, evolve to meet business needs.

Each component of the data lake architecture may be served by one or more tools. For example, metadata management requires a technical catalog (AWS Glue Data Catalog, Hive Metastore, etc.) that is integrated with the processing layer (Spark, Trino, etc.) in order to serve schemas and table statistics. Technical catalogs often lack business-level features, so companies often need to integrate an additional business catalog to enable search, collaboration, and business ontologies.

Although not an exhaustive list, the following components are most commonly used when you build a self-managed data lake on AWS. The components are the same in other public clouds, but the tools differ.

Data storage

Amazon S3 is a cloud-native object store that allows you to store data as files in any format. Commonly, raw data in S3 is stored in CSV or JSON format and then converted into an optimized columnar format like Apache Parquet. Data organization in S3 must be continually optimized to speed up access to the data from query engines. These optimizations include partitioning (grouping the data on common columns) and compaction (merging many small files into fewer, larger ones).
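
To make the compaction step concrete, here is a small sketch of the planning half of a compaction job: bin-packing a file listing into groups of roughly one target output file each. The file names and the 128 MB target are illustrative assumptions; a real job would read the listing from S3 and rewrite each group with Spark or a CTAS query.

```python
# Sketch: planning a compaction pass over small data lake files.
TARGET_BYTES = 128 * 1024 * 1024  # aim for ~128 MB output files (assumption)

def plan_compaction(files: list[tuple[str, int]]) -> list[list[str]]:
    """Greedily bin-pack (name, size) listings into groups of ~TARGET_BYTES,
    so each group can be merged into one larger Parquet file."""
    groups, current, current_size = [], [], 0
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > TARGET_BYTES:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups

# 20 files of 16 MB each collapse into 3 output files of up to 128 MB.
listing = [(f"events/part-{i:04d}.parquet", 16 * 1024 * 1024) for i in range(20)]
print(plan_compaction(listing))
```

This is exactly the kind of housekeeping loop that must run continuously in a self-managed lake, and that Snowflake performs internally on your behalf.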

When migrating to Snowflake, you need to consider, at a minimum, the following:

Storage design delineates how to lay out your data across storage layers. For example, should you retain all raw data in its original form in S3 and load only the data you need into Snowflake, or should you load all of your data into Snowflake?

Data retention considers how long data should be kept and needs to apply to all data across one or more storage layers. Both S3 and Snowflake provide features that allow you to delete or archive old or unused data.

Disaster recovery is often ignored until it’s too late. The majority of cloud-native services, like S3 and Snowflake, are designed and built to be resilient and highly available, and provide means to recover from failures. However, you can’t ignore that disasters come in many shapes and sizes. Planning for disaster could mean archiving historical data in S3 with cross-region replication enabled, and utilizing Snowflake’s database and/or object replication to another region or cloud provider.

Privacy compliance is extremely important when handling consumer data such as PII and PHI. GDPR and CCPA require companies to ensure private information is protected and removed according to the law. Enabling storage-wide encryption is table stakes; however, masking and deleting customer records is more difficult on S3 and requires additional tools to identify, locate, and remove records across potentially millions of files. Snowflake makes it easier to implement these policies through deletes and updates to data.

Cost is always a consideration. If you expect to collect, store and analyze large amounts of data, consider archiving raw, historical data in S3 because it offers a range of cost-effective storage classes. Store actively used data in Snowflake with the proper retention policies.
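The archiving strategy above is typically expressed as an S3 lifecycle configuration. The sketch below builds one; the `raw/` prefix and the day counts are assumptions, and the dict is shaped the way boto3’s `put_bucket_lifecycle_configuration` call expects it.

```python
# Sketch: an S3 lifecycle configuration that tiers raw historical data to
# cheaper storage classes and eventually expires it. Prefix and day counts
# are illustrative; apply with
#   s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=lifecycle)

lifecycle = {
    "Rules": [
        {
            "ID": "archive-raw-data",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90, "StorageClass": "STANDARD_IA"},  # infrequent access
                {"Days": 365, "StorageClass": "GLACIER"},     # long-term archive
            ],
            "Expiration": {"Days": 7 * 365},  # delete after ~7 years
        }
    ]
}
print(lifecycle["Rules"][0]["ID"])
```

Pairing a rule like this on the raw layer with Snowflake retention policies on the actively queried tables keeps both tiers from growing without bound.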

Data processing and ETL

Apache Spark is the most popular open source data processing framework for data lakes. Whether you’re using a self-managed Spark deployment or a fully managed offering like Amazon EMR, Databricks or AWS Glue, you’ve developed Spark applications that need to be migrated to Snowflake.

To migrate these workloads to Snowflake, you have two options: porting them to Snowpark or to SQL.

Porting Spark code to Snowpark

The Snowpark Client Library provides DataFrame (Python, Java and Scala) and SQL interfaces with syntax similar to Spark, making the code migration straightforward. You can also leverage UDFs to port custom functions. 

This approach is recommended if you want to continue using your existing development pipeline, which may already include version control, code reviews, testing, and CI/CD. Developers will need to learn the new platform and APIs, but you can expect a quick return to productivity.

Porting Spark code to SQL

The second option is to port the Spark code directly to SQL and execute it on Snowflake as you would any other SQL. If your original Spark code is based on DataFrames, not SparkSQL, then this approach requires more effort, but ultimately yields a more portable code artifact that’s easier to migrate to other platforms in the future. If your Spark code includes lots of custom functions and/or depends on third-party libraries, porting to SQL requires a great deal of refactoring and redeveloping functionality in native UDFs or stored procedures. Expect this migration to take time and testing.

A best practice when migrating transformation code between systems is to pick a few low-hanging-fruit jobs (those that aren’t business critical) and migrate them first. You will gain experience with the API, identify any nuances in how features are implemented, and be able to gauge the overall complexity of migrating larger Spark applications. Once you’re confident and have a migration and testing plan in place, you can start migrating the larger Spark codebases.
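
Part of that testing plan is verifying that a migrated job produces the same results as the original. Here is a minimal, hedged sketch of a parity check comparing row counts and an order-insensitive fingerprint of the two result sets; the sample rows are inlined, whereas in practice you would fetch them from both engines.

```python
# Sketch: a parity check for a migrated job -- compare the old pipeline's
# output with the new Snowflake version before cutting over.
import hashlib

def fingerprint(rows: list[tuple]) -> str:
    """Order-insensitive digest of a result set."""
    digests = sorted(hashlib.sha256(repr(r).encode()).hexdigest() for r in rows)
    return hashlib.sha256("".join(digests).encode()).hexdigest()

def results_match(old_rows, new_rows) -> bool:
    return len(old_rows) == len(new_rows) and fingerprint(old_rows) == fingerprint(new_rows)

spark_out = [("us-east", 120), ("eu-west", 87)]
snowflake_out = [("eu-west", 87), ("us-east", 120)]  # same rows, different order
print(results_match(spark_out, snowflake_out))  # True
```

Running a check like this on every migrated job, before decommissioning its Spark counterpart, turns the cutover into a verifiable step rather than a leap of faith.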

This approach is recommended if you’re starting with SparkSQL and don’t have many external dependencies or custom code. 

Data catalogs

There are two types of catalogs that you’ll need to consider during migration: 

Technical data catalog

A technical data catalog stores table metadata like schemas and partition information. It exposes an API that engines use to access the metadata when planning to execute a query. More specifically, when you execute SELECT id, region FROM table_customers, the query planner asks the technical catalog for the full schema, table statistics, partitions, and file locations for “table_customers”. Using this information, the engine determines how best to select and return all of the values for the requested columns. There are only a few products available for technical catalogs; most commonly it will be an Apache Hive Metastore-compatible catalog like AWS Glue Data Catalog or Databricks Unity Catalog.

For Snowflake, the technical catalog is materialized in two ways. The first is an internal-only catalog that Snowflake uses to register the schema and lots of other metadata about datasets it maintains in local storage. This is the native behavior when loading external files into Snowflake using the COPY INTO command, Snowpipe, or Upsolver. The second is external tables, which give Snowflake access to datasets stored in the data lake. Metadata can be automatically refreshed in Snowflake once data is available in the external storage. However, there is no direct integration with an external technical catalog, so all of the external metadata needs to be reproduced and updated in Snowflake.
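
Reproducing that metadata usually means generating external table DDL from your existing catalog entries. The sketch below does this for one table; the stage, table, and column names are hypothetical, and the statement follows Snowflake’s CREATE EXTERNAL TABLE syntax, where each column is projected out of the semi-structured VALUE record.

```python
# Sketch: turning a technical-catalog entry into a Snowflake external table.
# Stage path, table, and column names are hypothetical placeholders.

def external_table_ddl(table: str, stage_path: str, columns: dict[str, str]) -> str:
    # External table columns are expressions over the VALUE variant record.
    cols = ", ".join(
        f"{name} {typ} AS (VALUE:{name}::{typ})" for name, typ in columns.items()
    )
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols}) "
        f"LOCATION = @{stage_path} "
        f"FILE_FORMAT = (TYPE = PARQUET) AUTO_REFRESH = TRUE;"
    )

ddl = external_table_ddl(
    "ext_customers", "lake_stage/customers/", {"id": "NUMBER", "region": "VARCHAR"}
)
print(ddl)
```

A loop over your Glue or Hive Metastore table listing, feeding a generator like this, is typically how the lake’s metadata gets replayed into Snowflake during migration.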

Business data catalog

A business data catalog stores metadata, tags, annotations, taxonomies and relationships about datasets. It’s used by teams to search, discover and understand the many datasets available throughout the company’s data systems. It’s a business tool that is used by end users and not by systems, like query engines. Acryl Data, Atlan and Alation are a few popular business catalogs that integrate with data lakes, Snowflake and many other data systems.

During migration, the business data catalog needs to be updated to point to tables that have been migrated from the lake to Snowflake. This process should be fairly straightforward since business catalog solutions come with crawlers that can scrape metadata from different systems and update the catalog automatically. 

Analytics and query engines

One of the most difficult and time consuming parts of any migration is to transition users and their workloads to the new system. Users want to connect to their data in different ways, such as BI dashboards, notebooks, and query engines. Each tool must be integrated with Snowflake so users can access the data. Thanks to the market adoption of Snowflake and Snowflake’s commitment to remaining a fully-featured data warehouse, the large majority of tools commonly used by data analysts, scientists and engineers support executing queries on Snowflake.

Query execution engines on data lakes, like Apache Trino and Presto, Amazon Athena, Dremio, Starburst, and Amazon Redshift, are mostly ANSI SQL compliant, which makes them straightforward to migrate to Snowflake. What introduces complexity are custom extensions to the standard SQL dialect, built-in functions (like date/time, string, and JSON manipulation), and any UDFs or stored procedures that were created using the unique interfaces of the particular engine. Similar to data processing and ETL migration, these analysis frameworks require additional effort, time, and testing to fully migrate to Snowflake.
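
To illustrate the kind of dialect mapping involved, here is a deliberately toy rename pass for two engine-specific functions when porting Presto/Athena queries to Snowflake. A real migration needs a SQL parser (plain text replacement can collide with strings and identifiers), and note that even renamed functions can differ in argument syntax, as the JSON path example shows.

```python
# Sketch: a toy dialect-rename pass from Presto/Athena to Snowflake.
# Illustrative only; use a proper SQL parser/transpiler in practice.

RENAMES = {
    # Presto/Athena name            -> Snowflake name
    "json_extract_scalar": "JSON_EXTRACT_PATH_TEXT",  # path argument syntax also differs
    "approx_distinct": "APPROX_COUNT_DISTINCT",
}

def translate(sql: str) -> str:
    for presto_fn, snowflake_fn in RENAMES.items():
        sql = sql.replace(presto_fn, snowflake_fn)
    return sql

query = "SELECT approx_distinct(user_id) FROM events"
print(translate(query))  # SELECT APPROX_COUNT_DISTINCT(user_id) FROM events
```

Cataloging these mappings up front, per engine, is what turns dialect migration from ad hoc firefighting into a checklist.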

Access controls

We could write an entire blog series on data governance, security, and access controls, since it’s a complex topic with many touch points throughout your entire data stack. It helps to start by migrating user access roles from your existing systems to Snowflake, since this has the most direct impact on users.

Data lake user access control

In the data lake, user access is typically managed with either open source tools like Apache Ranger or commercial ones like AWS Lake Formation or Azure Purview. These tools allow you to define fine-grained access policies to control the tables, columns, and even rows that users are allowed to retrieve. Policies are defined in these tools, and enforcement is done at the query engine. This separation can lead to incompatibility between the type of policies you can define and what can actually be enforced. The problem is further amplified when companies implement governance tools, like Collibra and Alation, that provide lots of security and access control capabilities but are not integrated with the enforcement layer of the query engine.

Since access control in the data lake is still in its infancy, the scope of work required to translate policies and migrate them to Snowflake should remain fairly minimal. To simplify future migrations, tools like Privacera provide a middle layer that allows you to define your policies once, or ingest them from another system like Collibra, and then push them down to different engines, like Snowflake or AWS Lake Formation, for enforcement. This unified translation layer could be a lifesaver if you need to manage lots of users, frequently changing policies, and multiple points of data access.

Snowflake user access control

Snowflake offers a robust set of security and access control capabilities that make it easy to implement a wide range of access management strategies, including centralized and distributed. 

Although Snowflake provides the primitives, defining and maintaining them for a large, dynamic organization can be challenging. Consider employing a strategic approach for automating, tracking and auditing access control. Two such approaches are as follows.

  1. Leverage a commercial tool like Privacera or Collibra to define, manage, version, and audit policies using a simple GUI. This is often needed in enterprise-size companies with many users, technical and non-technical, spanning analytics, security, legal, and compliance teams. This single pane of glass reduces risk and improves transparency.
  2. Adopt a DevOps methodology for deploying, tracking, and auditing policies. Using APIs together with Snowflake’s policy DDL, you can implement a process in which any new or updated policy is checked into a Git repository, where it can be versioned, reviewed, tested, and approved for deployment. Then, using a CI/CD orchestrator like GitHub Actions or Jenkins, deploy the policy to Snowflake. As with the first approach, users have access to a log of all changes for transparency and auditing. Since this methodology is more technical, it may require custom frontend development to expose the access control policies to non-technical users like compliance officers and auditors.
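
The reconcile step of such a Git-driven pipeline can be sketched as a diff between the desired grants checked into the repo and what is currently live, emitting the GRANT/REVOKE statements a CI/CD job would run. The role and table names here are hypothetical, and real pipelines cover far more privilege types than SELECT.

```python
# Sketch: diff desired vs. current role grants and emit reconciling DDL.
# Only SELECT grants are modeled; names are illustrative placeholders.

def reconcile_grants(desired: dict[str, set], current: dict[str, set]) -> list[str]:
    stmts = []
    for role in sorted(set(desired) | set(current)):
        want, have = desired.get(role, set()), current.get(role, set())
        for table in sorted(want - have):      # missing grants
            stmts.append(f"GRANT SELECT ON {table} TO ROLE {role};")
        for table in sorted(have - want):      # stale grants
            stmts.append(f"REVOKE SELECT ON {table} FROM ROLE {role};")
    return stmts

desired = {"ANALYST": {"sales", "customers"}}
current = {"ANALYST": {"sales", "legacy_events"}}
for stmt in reconcile_grants(desired, current):
    print(stmt)
```

Because the output is plain DDL, the same diff can be reviewed in a pull request before the orchestrator applies it, giving auditors the change log mentioned above.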

The final piece

With strategies in place for the above components, what’s left for migrating a data lake to Snowflake is moving the physical data. The recommended approach is to copy business-critical data originally stored in S3 (or another object store) into Snowflake and let Snowflake manage the data going forward. Any updates to the data, or new data inserted into these tables, should happen directly in Snowflake. Less critical or less frequently accessed tables can remain in S3 and be exposed to Snowflake as external tables. In this paradigm, Snowflake becomes the new single point of access to all the data, but less critical data remains in slightly colder, and cheaper, storage. As demand for these tables increases, consider migrating them into Snowflake as well to improve query performance against them.
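
This split can be driven from a small manifest: business-critical tables get a COPY INTO statement, while colder tables are left in S3 for external-table access. The sketch below illustrates the idea; stage paths and table names are hypothetical, and the statements follow Snowflake’s COPY INTO syntax for Parquet loads.

```python
# Sketch: planning the bulk load from a table manifest. Critical tables are
# copied into Snowflake; the rest stay in S3 as external-table candidates.
# Names and paths are illustrative placeholders.

manifest = [
    {"table": "orders", "path": "lake_stage/orders/", "critical": True},
    {"table": "click_logs", "path": "lake_stage/clicks/", "critical": False},
]

def plan_migration(entries: list[dict]) -> tuple[list[str], list[str]]:
    copies, external = [], []
    for e in entries:
        if e["critical"]:
            copies.append(
                f"COPY INTO {e['table']} FROM @{e['path']} "
                f"FILE_FORMAT = (TYPE = PARQUET) "
                f"MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;"
            )
        else:
            external.append(e["table"])  # expose via external table instead
    return copies, external

copies, external = plan_migration(manifest)
print(copies[0])
print(external)  # ['click_logs']
```

Keeping the manifest in version control alongside the policy definitions gives you one auditable record of what moved, what stayed, and when.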

In the next blog post, we outline the path to a secure migration that minimizes the risk of data loss. Along with these steps, we also walk through exactly how to perform the migration using Upsolver. In the meantime, you can chat with an Upsolver Solutions Architect to learn how you can begin migrating from your data lake to Snowflake today.

Published in: Blog, Data Lakes
Santona Tuli, PhD

Santona is a physicist turned machine learning engineer turned data workflow architect and product strategist. She likes making data pipeline authoring friendly to different types of developers.
