Data retention is an important aspect of any data governance program. Implementing data retention reduces storage costs and enables companies to comply with data privacy regulations.
Data retention should be configured at every phase of the data lifecycle, from raw to mart. There are several criteria to consider when defining a retention policy, including requirements defined at the government, industry, and company levels.
Upsolver SQLake enables you to define the retention policy when you create a table. Retention is defined as the number of days data must be kept. To keep track of the number of days elapsed, SQLake requires tables to be partitioned by a column of type date; best practice is to always partition tables by a date column. If one is not available in the data, SQLake automatically injects one named $event_date. Note that you must have at least one partition column of type date for retention to work.
There are three properties to be aware of when configuring retention: TABLE_DATA_RETENTION, which sets how long rows are kept in the table; COLUMN_DATA_RETENTION, which sets how long values are kept in specific columns; and RETENTION_DATE_PARTITION, which names the date partition column used to track elapsed time. Column-level retention is defined as a list of column and duration pairs, for example:

COLUMN_DATA_RETENTION = (
    (COLUMN=credit_card, DURATION=1 DAYS),
    (COLUMN=email_address, DURATION=3 DAYS)
)
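To show how these properties fit together, here is a minimal sketch of a CREATE TABLE statement. The catalog, database, and table names are placeholders, and credit_card and email_address are just the example columns from above:

CREATE TABLE my_catalog.my_database.orders_staging ()
    PARTITIONED BY $event_date
    RETENTION_DATE_PARTITION = $event_date
    TABLE_DATA_RETENTION = 30 DAYS
    COLUMN_DATA_RETENTION = (
        (COLUMN=credit_card, DURATION=1 DAYS),
        (COLUMN=email_address, DURATION=3 DAYS)
    );

The tutorial below walks through a concrete version of this statement against the Upsolver sample data.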
Once the retention is defined on your target table, SQLake automatically monitors and deletes data as per the retention policy. There is nothing else for you to do or manage.
There are three steps to run this pipeline: create a connection to the source data, create the target table with the retention policies defined, and create an ingestion job to copy data into it.
Before you can ingest data into the staging table, you must create a connection to the source system. This tutorial connects to the Upsolver Samples S3 bucket as follows:
CREATE S3 CONNECTION upsolver_s3_samples
    AWS_ROLE = 'arn:aws:iam::949275490180:role/upsolver_samples_role'
    EXTERNAL_ID = 'SAMPLES'
    READ_ONLY = TRUE;
Once you execute this code, SQLake stores the connection information, which we'll use in the next step.
In this step you create the target table to store the ingested raw data.
Because this is raw data, we want to keep it around for 30 days. We also know there are some sensitive columns, such as customer details, that we don't need for long: once they have been transformed and aggregated, we can delete them. We'll set those columns to retain data for only 1 day.
CREATE TABLE default_glue_catalog.database_2777eb.orders_raw_data ()
    PARTITIONED BY $event_date
    RETENTION_DATE_PARTITION = $event_date
    TABLE_DATA_RETENTION = 30 DAYS
    COLUMN_DATA_RETENTION = (
        (COLUMN=customer, DURATION=1 DAYS)
    );
You’ve now created the table with the appropriate retention policies defined. The next step is to create an ingestion job to copy data from the source and load it into your target table. Use the following code to create this job:
CREATE SYNC JOB ingest_order_data
    CONTENT_TYPE = AUTO
AS COPY FROM S3 upsolver_s3_samples
    BUCKET = 'upsolver-samples'
    PREFIX = 'orders/'
INTO default_glue_catalog.database_2777eb.orders_raw_data;
Execute this SQL code to create the job. Once created, the job remains active until you delete it.
It may take a couple of minutes for data to show up in the table. SQLake automatically deletes data within partitions and columns when the retention threshold is reached.
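If you want to confirm that rows have landed before retention kicks in, you can run an ad hoc query against the staging table. This assumes your SQLake environment allows SELECT statements against managed tables; the LIMIT is only there to keep the result small:

SELECT *
FROM default_glue_catalog.database_2777eb.orders_raw_data
LIMIT 10;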
Configuring table and column retention can help you meet data privacy regulations and reduce the cost of your data lake storage. We recommend that you set retention on all of your tables, from raw and intermediate tables to the marts that represent business metrics. SQLake automatically manages the deletion process, so you don't need to run additional processes or systems.
Get started today for free with sample data or bring your own.
To learn more, visit SQLake Builders Hub, where you can browse our pipeline templates and consult an assortment of how-to guides, technical blogs, and product documentation.