How to Implement Data Retention in Your Data Lake with SQLake

Three simple steps to data retention in a lake

Data retention is an important aspect of any data governance program. Implementing data retention reduces storage costs and enables companies to comply with data privacy regulations.

Data retention should be configured at every phase of the data lifecycle, from raw to mart. There are several criteria to consider when defining a retention policy, such as:

  • Auditability: How far back do you need to retain data for the purpose of audit? In financial services this can be up to 10 years.
  • Recovery: If data becomes corrupt or quality is degraded, how far back must you go to replay and fix? In many cases this is 3 months, but other use cases require going back multiple years.
  • Privacy: If a dataset contains private or sensitive information, how long should it be retained? In many cases this is 3 months, but with proper fine-grained access controls it may be retained for longer.
  • Cost: If portions of a dataset become cold, or are accessed less frequently, it may make sense to delete these portions or at least transition them to a less expensive storage tier (when available). This is most common with mart tables representing aggregations and metrics.

There are other criteria to consider, including those defined at the government, industry, and company levels. , but the above are the primary ones.

Upsolver SQLake enables you to define the retention policy when you create a table. Retention is defined by the number of days you must keep data around. To keep track of the number of days elapsed, SQLake requires tables to be partitioned by a column of type date. Best practice suggests to always partition tables by a date column. If one is not available in the data, SQLake automatically injects one named $event_date. Note that you must have at least one partition column of type date for retention to work.

There are three properties you must be aware of when configuring retention:

  • RETENTION_DATE_PARTITION.  If there is only one partition column of type date, you can leave this property blank. If there are multiple partition columns, set this property to the column name to use, of type date.
  • TABLE_DATA_RETENTION.  Set the value from 1 to 9999 DAYS to define how long to retain data. SQLake deletes data within partitions that are greater than the value set in this property – for example, 30 DAYS. The keyword “DAYS” is required.
  • COLUMN_DATA_RETENTION.  This is a list of column names and retention period pairs. Set the retention value from 1 to 9999 DAYS to define how long to retain data of a particular column. Once the threshold is reached, SQLake overwrites the column values with NULL. Currently column retention does not support nested fields such as customer.email. Here is an example of column level retention configuration:

                      (
                         (COLUMN=credit_card, DURATION=1 DAYS),        
                         (COLUMN=email_address, DURATION=3 DAYS)
                       )

Once the retention is defined on your target table, SQLake automatically monitors and deletes data as per the retention policy. There is nothing else for you to do or manage.

 There are 3 steps to run this pipeline:

  1. Create the connection to the source data
  2. Create a staging table with retention policies
  3. Create an ingestion job to load data into the staging table

Step 1: Create the connection to the source data

Before you can ingest data into the staging table you must create a connection to the source system. This tutorial shows how to connect to the Upsolver Samples S3 bucket as follows:

CREATE S3 CONNECTION upsolver_s3_samples
   AWS_ROLE = 'arn:aws:iam::949275490180:role/upsolver_samples_role'
   EXTERNAL_ID = 'SAMPLES'
   READ_ONLY = TRUE;

Once you execute this code in SQLake, SQLake stores the connection information, which we’ll use in the next step.

Step 2: Create a staging table with retention policies

In this step you create the target table to store the ingested raw data.  

Because this is raw data we want to keep it around for 30 days.  We also know there are some sensitive columns we do not want to keep around for very long.  After we have transformed and aggregated these columns, we no longer need them.  We’ll set those columns to retain data for only 1 day.

CREATE TABLE default_glue_catalog.database_2777eb.orders_raw_data () 
  PARTITIONED BY $event_date
  RETENTION_DATE_PARTITION = $event_date
  TABLE_DATA_RETENTION = 30 DAYS
  COLUMN_DATA_RETENTION = ( (COLUMN=customer, DURATION=1 DAYS) );

Step 3: Create an ingestion job to load data into the staging table

You’ve now created the table with the appropriate retention policies defined. The next step is to create an ingestion job to copy data from the source and load it into your target table.  Use the following code to create this job:

CREATE SYNC JOB ingest_order_data
  CONTENT_TYPE = AUTO
AS COPY FROM S3 upsolver_s3_samples 

   BUCKET = 'upsolver-samples' 

   PREFIX = orders/'
INTO default_glue_catalog.database_2777eb.orders_raw_data;

Execute this SQL code to create the job. Once created, the job remains active until you delete it

It may take a couple of minutes for data to show up in the table. SQLake automatically deletes data within partitions and columns when the retention threshold is reached.

Summary – Achieving Regulatory Compliance and Cost Reduction

Configuring table and column retention can help you meet data privacy regulations and reduce the cost of your data lake storage. We recommend that you set retention on all of the tables – raw, intermediate, and marts – that represent business metrics. SQLake automatically manages the deletion process for you so you don’t need to manage additional processes or systems.

Get started today for free with sample data or bring your own.

To learn more, visit SQLake Builders Hub, where you can browse our pipeline templates and consult an assortment of how-to guides, technical blogs, and product documentation.

ctaForm

Start for free - No credit card required

Batch and streaming pipelines.
No DAGs.

Accelerate data lake queries

Real-time ETL for cloud data warehouse

Build real-time data products

Get Started Now

Templates

All Templates

Explore our expert-made templates & start with the right one for you.