Moving data from a lake to Snowflake

Rachel Horder
Data Lakes
December 8, 2023

This blog post is the second in a series on migrating from a self-managed data lake to Snowflake.

In the previous post we discussed:

The drivers behind the decision to migrate from a self-managed data lake to Snowflake.
The data lake components that must be migrated and their counterparts in Snowflake.

In this post, you will learn:

Best practices for migrating your data and the pitfalls to avoid.
How Upsolver can help you efficiently migrate your data lake to Snowflake.

Not using Snowflake as your data warehousing solution? Not to worry—much of the guidance here applies to migrating to any modern data warehouse.

Best practices for migrating data from a lake to a warehouse

Today, Snowflake supports a powerful version of the data warehouse, with lake and lakehouse capabilities sprinkled, and enables applications ranging from analytics to machine learning and AI. It’s now easier than ever to migrate your self-managed data lake to leverage the performance, data governance, and concurrency benefits of the modern data warehouse no matter what format your data comes in.

The tricky part is to ensure you follow, during migration, best practices for data itself in order to ensure warehousing success:

Keep Data Raw

The staging area for all data ingested into analytical systems should be used as the landing zone for your raw data. In order to preserve the state and provenance of data for lineage purposes, transformations—with the exception of PII handling—and aggregations should not be applied to data without securing a copy of the raw data first.

Your self-managed data lake should have a staging area, as should your new data warehouse. Staging for a Snowflake data warehouse can be external to Snowflake, but with appropriate pointers and connections so the data can be ingested quickly. Migrating the historically staged data should be straightforward, and could be as simple as using the same cloud storage blob or bucket with a new access role. But pay attention to make sure new data to be ingested into Snowflake also goes to the correct staging environment and in the correct format.

As an aside—with the advent of sophisticated open table formats like Apache Iceberg, it is even possible to have staged data and modeled data in the same place with correct metadata organization. Even if the Iceberg-format data lives externally to Snowflake, Snowflake can manage the data. This is the lakehouse architecture that modern warehouses have begun to integrate.

Uphold Privacy

In order to comply with GDPR and other privacy regulations, ensure you mask sensitive or personally identifiable information (PII) prior to loading it into the data warehouse. Although cheap storage is plentiful in the cloud, PII must be protected on all historical and current data to prevent any dangerous breaches.

The best way to ensure such compliance is to have processes in place that redact, anonymize, or scrub PII data automatically and at the first point of arrival or within a preconfigured compliant time interval. These policies should already exist in your self-managed data lake, and you can port them over to the Snowflake infrastructure easily. Again, PII handling should happen very early in the data pipeline, so don’t wait until the warehouse to implement such processes.

Stay Secure

Beyond PII handling, data privacy and privilege enforcement extends to provisioning data access appropriately for each data consumer and their use cases for maximal security. Whether or not data security was a priority in your self-managed data lake, make sure to secure your data after the migration using the sophisticated and fine-grained role and view-based controls afforded in every modern data warehouse. To remind yourself about access controls in Snowflake, refer back to the last section of the previous blog post in this series.

Have a Curation Policy

Lastly, as you migrate, don’t forget to leverage the original strength of data warehouses: data modeling. The exact modeling framework is not as important as having a framework in place and maintaining self-consistency. But if you are looking for guidance, give this blog post from dbt a read to learn the basics of four types of data models popular today.

Make sure to catalog your data and document your model in order to offer downstream users such as data analysts a reliable description of data semantics and ontology to make data discovery meaningful and data activation rapid. An ongoing curation policy will ensure your catalog remains current and useful, and returns quality results every time.

Options for migrating your data to Snowflake

With these best practices in mind, let’s look at our options for migrating data from a self-managed data lake to Snowflake. Remember that this is only one of many components of your data lake migration, albeit a crucial one. Here are your three main options for moving over the data itself, along with their pros and cons.

Use the COPY command

You can manually migrate your historical data from your data lake using Snowflake’s COPY command. First, you will need to create your database and destination tables in Snowflake. Then you can specify the cloud storage location, for example your Amazon S3 bucket, where your data lake data is stored as the externalLocation to COPY FROM. Alternatively, you can create a named externalStage pointing to the externalLocation first. Either way, you need to provide your cloud account credentials when specifying your externalLocation so Snowflake has access. You can then use the COPY command to write SQL statements to copy the data from your lake’s data store to Snowflake. Note that the COPY command does not automatically ingest data when new files arrive in the staging area unless explicitly orchestrated.

Write an ETL script using Snowpipe

Snowpipe, the native Snowflake data ingestion service, is a good, if not particularly feature-rich, option to ingest your data. Snowpipe is simply a wrapper around the COPY command discussed above with a mechanism for executing the command whenever new data is available in the source. You will once again need to define and configure access to the external location where your data lake data is stored. Additionally, you will need to create an event notification on your storage service to publish messages to an event queue, such as Amazon SQS, or to a REST endpoint when new files arrive in order to inform Snowpipe. Moreover, if using Snowpipe, you must ensure that the source schema matches the target table schema for proper data loading.

Build a pipeline using Upsolver’s powerful no-code feature

Upsolver is ideal for ingesting your high-scale workloads continuously and offers more intelligent ingestion features than Snowpipe. Using the guided UI with no-code functionality, connect to your source—the data lake—and target—Snowflake—, and then apply any essential transformations, data masking, and quality expectations to the data in flight. Or write SQL to do the same in a worksheet or even in a dbt model. Upsolver takes care of the rest, including upserts, creating new tables in Snowflake, and managing schema evolution. The pipeline continues to run until you stop it, and you can use the built-in observability dashboard to monitor your job and data quality.

Efficiently Migrate your Data Lake Workloads to Snowflake with Upsolver

The remainder of this article details exactly how to perform a complete data lake migration to Snowflake using Upsolver in three steps through a specific user example.

Upsolver is a cloud-native data movement solution that’s designed to make it easy for you to ingest high-volume, relationally-complex data to your target, without delay or quality blindspots. By decoupling compute instances from data and metadata storage—just like Snowflake—Upsolver guarantees highly reliable ingestion workflows that are fully auto-healing.

With an emphasis on ease of use, Upsolver’s no-code solution removes the need to write complex ETL/ELT code, while the low code option supports the most advanced ingestion jobs, without losing simplicity in the process.

Step 1 – Create Pipelines to Migrate from Your Data Lake to Snowflake

To ingest data from your data lake to Snowflake, you first create a connection to AWS in your Upsolver account. This connection is shared across your organization and is persistent, meaning it is always on and you won’t need to re-connect each time. The following example creates a connection to Amazon S3:

CREATE S3 CONNECTION my_s3_storage 
AWS_ROLE = 'my_role_arn'
AWS_ACCESS_KEY_ID = 'my_access_key' 
AWS_SECRET_ACCESS_KEY = 'my_secret_key';

The quickest way to create your first pipeline is to use the no-code option in Upsolver to generate a script (like the code snippet shown below) for you. In this script, we create a new job to ingest the orders data in the security-sensor bucket to the SENSOR_ALERTS table in Snowflake. Similar SQL scripts can ingest data from the other buckets holding the rest of the data in the lake.

CREATE SYNC JOB ingest_s3_sensor_alerts_to_snowflake
CREATE_TABLE_IF_MISSING = true
CONTENT_TYPE = AUTO
COMMIT_INTERVAL = 1 MINUTES
EVENT_TIME_COLUMN = UPSOLVER_EVENT_TIME
ADD_MISSING_COLUMNS = true
AS COPY FROM my_s3_storage 
LOCATION = 's3://security-sensor/alerts'
INTO my_snowflake_connection.DEMO.SENSOR_ALERTS;

Unlike the process for ingesting data using a COPY command in Snowflake or writing an ETL script using Snowpipe, Upsolver automatically creates the target table in Snowflake if we include the CREATE_TABLE_IF_MISSING job option. This functionality goes beyond initial job creation—if configured, Upsolver can create and update tables as needed as the schema of the ingested data evolves, without any developer intervention. Keeping in mind the inevitability of source data schema evolution due to fast-paced application development and changes to API payloads, this is the best way to guarantee you have a robust and reliable ingestion pipeline that does not fall over with every deviation from the data contracts, but rather draws attention to the change while continuing to maintain delivery SLAs.

Step 2 – Apply Data Quality Measures

An essential step in migrating your data is to ensure you meet the data quality standards set by your business. “Garbage in, garbage out”, often abbreviated as GIGO, is a fundamental principle in the field of data science and computing that emphasizes that the quality of the output of a system is directly related to the quality of the input data. In other words, if you feed a system or algorithm inaccurate, incomplete, or low-quality data, it will produce unreliable, inaccurate, or low-quality results. GIGO underscores the importance of data quality and integrity forethought in any data-driven process.

With Upsolver, you can guarantee that clean, accurate, and reliable data is ingested into Snowflake, enhancing the effectiveness and credibility of your data-driven endeavors. The following enhancements can be added to your ingestion pipelines:

Transformations Use our library of SQL functions and operators to apply in-flight transformations to your pipelines.
Data Privacy Safeguard PII using masking functions to protect data destined for the warehouse.
Expectations Add expectations to your jobs to root out or tag rows that have incorrect or missing data. Leverage any of our functions to create an expectation across one or more columns that drops rows that don’t meet your quality requirements or generates a warning.

Step 3 – Monitor Your Pipeline

After your ingesting jobs are up and running, you’ll want to ensure they persist in delivering quality data to Snowflake. Data observability is an essential practice that involves monitoring and ensuring the health, reliability, and performance of data ingestion pipelines and the data they contain.

Upsolver’s built-in observability dashboard provides a clear and real-time understanding of the state of your data as it flows through various stages of the pipeline, delivering transparency and insights into data quality, latency, and any potential issues such as data loss or corruption. Without data observability, you risk operating in the dark, leading to blindspots and potential data-related problems further down the pipeline that can have far-reaching consequences.

By implementing data observability in data ingestion pipelines, you can proactively identify and address issues, maintain data accuracy and consistency, and ultimately improve the overall trustworthiness and reliability of your data, which is paramount for informed decision-making and effective data-driven operations. Read about the 5 pillars of data observability and the Upsolver features that fortify them.

Technical Support

Upsolver has been designed to make it as easy as possible to build your pipelines and, with the tutorials and guides in our documentation and a built-in observability dashboard to discover and fix problems, the process is straightforward. However, if you do need help, our technical support team is eager to assist during and beyond your data migration.

Conclusion

Data lakes offer flexibility, portability and control. However, data lakes come with a lot of complexity that, if self-managing, must be solved using bespoke tools and custom code. This makes building, maintaining and scaling data lakes difficult without a team of experienced data engineers.

Migrations aren’t simple, they involve changing many components, processes and routines. In the first part of this blog series, we covered key components for migration: data storage, processing, cataloging, analyzing and controlling access, and the challenges therein.

In this post, we dove deeper into approaches, processes and best practices for migrating your data and metadata from a self-managed data lake on AWS to Snowflake. We saw how Upsolver can help simplify and accelerate this migration, empowering you to focus on building out your analytics and ML applications on Snowflake.

To find out more about Upsolver, why not start your free trial, or schedule a no-obligation demo with one of our in-house solutions architects.

Published in: Blog , Data Lakes

Rachel Horder

Rachel spent the first half of her career building websites using .NET and later shifted her focus to SQL Server, honing her expertise in data migration and integration. Driven by her enthusiasm for data and ETL processes, she co-authored the book "What's New in SQL Server 2012" and contributed to writing the official training material for Microsoft SQL Server 2016 and Power BI. In her current role as Senior Technical Writer at Upsolver, Rachel continues to combine her passions for data and writing.

Moving data from a lake to Snowflake