Best Practices for Data Retention: Dealing with Logs, Data Warehouses, and Data Lakes

Data retention decisions have a high impact on business and data processing. But most data products in the industry aren’t designed for long-term data retention beyond object storage or JBOD (“Just a Bunch Of Discs” – a collection of disks in a computer system or array that are combined as one logical volume).  Retaining a large amount of historical data is expensive and when you need to analyze it, it can take a very long time to process.  

This article examines data retention best practices and strategies in 4 types of data platforms:

  1. Cloud data warehouses
  2. Log data processing engines
  3. Event-processing platforms
  4. Data lakes

Retaining Data in Data Warehouses

When loading new data into a data warehouse it’s a common practice to truncate data in the staging area.  The SQL TRUNCATE function clears all data out of the table while keeping the columns. Every time you load fresh data, historical data is being truncated (that is, discarded) to keep storage and compute costs low. Here’s an example – a 4-step data retention procedure from Snowflake:

Step 1: Create a table in the staging area named customer.

USE STAGING;
CREATE OR REPLACE TABLE stg_customer ( 
cid int,
customer_name string,
mobile bigint,
city string,
ordername string
) ;

Step 2: Insert sample data into the customer table and use a SELECT statement to ensure the data is loaded properly.

INSERT INTO stg_customer ( cid, customer_name, mobile, city, ordername ) values 
(2, 'Thomas Shelby',8797899999, 'chicago','flowers'),
(3, 'Arthur Shelby',7657575657,'austin','pandas'),
(4, 'John Shelby',77868768768,'chicago','table'),
(5, 'Polly Gray',9878979888,'chicago','chair'),
(6, 'Michael Gray',8098080808,'austin','notebooks'),
(7, 'Alfie Solomon',8080809800,'austin','pedigree');

SELECT * FROM stg_customer;

Step 3: Create a target table and perform transformations.

In this case, you create a table containing customers who bought flowers.

USE TARGET;
CREATE OR REPLACE TABLE tgt_customer_flowers ( 
cid int,
customer_name string,
mobile bigint,
city string,
ordername string
) ;

INSERT INTO tgt_customer_flowers
SELECT 
cid int,
customer_name string,
mobile bigint,
city string,
ordername string
   FROM stg_customer
WHERE ordername = 'flowers';

Step 4: Truncate staging table before the next load.

TRUNCATE TABLE IF EXISTS stg_customer;

This approach helps organizations keep costs down because less data is being processed. But it can also be very risky and problematic because the architecture is still complex and error-prone. For example, data may become unmodifiable if its origin changes due to human error or a change in business structure. After you use the TRUNCATE function, you can no longer restore anything from the staging area, either.

Retaining Data in Log Analytics Engines

Splunk and Elasticsearch are great for log analytics. But they are not designed to store and process historical data.  As with data warehouses, we recommend you create indexes periodically – daily is most common – and then delete or archive the index when the logs reach a certain age. Index lifecycle management (ILM) policies typically manage indices according to performance, resilience, and retention requirements. 

For example, in Elasticsearch, you could use ILM to:

  • Spin up a new index when the current  index reaches a certain size or maximum number of documents
  • Create a new index each day, week, or month and archive previous ones
  • Delete stale indices to enforce data retention standards

This approach has data access limitations similar to those of data warehouses, mentioned above. Also, there are cost implications, as vendors usually charge by the amount of data being indexed. When data sources are high-volume – for example, machine-generated data such as VPC flow logs – it becomes unrealistic to index everything into the log analytics platform.

Retaining Data in Event Processing Platforms

Apache Kafka and Amazon Kinesis are pub-sub, queue-based messaging systems. Producers send messages to a topic and consumers can subscribe to any topic. These systems are not designed for long-term data storage. The default retention period for Kafka is 7 days; for Kinesis, it’s 24 hours. 

Holding data for the long term in the messaging system typically results in order-of-magnitude higher costs than cloud storage because data grows exponentially over time. That’s because by default Kafka stores 3 replicas of each message, and the messages are stored on expensive hard drives connected to servers. These factors contribute to at least 10 times the cost of storing the data on a data lake such as Amazon S3 (which manages redundancy internally) and is why a data lake is the most common storage strategy. 

When data retention is limited, it becomes difficult to replay data when needed.  Schema evolution can cause metadata to fall out of sync easily. And there is no way to guarantee data integrity beyond the retention window.

So let’s see where we are.  Data warehouses require you to incur a high cost or a risk of lost data.  Both log systems and event processing platforms are very expensive for storage as well. We’ll now turn our attention to data lakes, both unmanaged and managed.

Retaining Data in a Data Lake Architecture

A data lake enables you to store data in its raw form in object storage – the most inexpensive storage around – and build your data products on top of it. This approach likely meets all your requirements — technical, business, and regulatory. 

The issue with the data lake is a lack of built-in data management on top of the cloud object store. How do you guarantee exactly once processing, total ordering guarantee, or optimizations for performant queries?  How do you manage ingestion, or inserts and deletes with immutable objects?  

Answering any of these questions can quickly add complexity and workload to the data engineering team. You can architect to address issues of cost and performance optimization, but it takes a long time to implement and manage. A DIY approach to data lake management is also error-prone; everything must be written and managed by code that is in turn written and managed by the engineers. 

Challenges to building your own data lake management solution include:

  • Small bugs can significantly impact data quality, system performance, and analytics efficiency. 
  • Onboarding new data and building new data products takes too long because they require the data engineering staff to modify the data pipeline code, then regression test the changes, then schedule and execute the deployment. 
  • Businesses must wait for data to become available, and any small errors require engineers to change various parts of the pipeline, leading to downtime for analytics users.
  • Tech debt accumulates and creates fragility, delays, and cost. The DIY code must be maintained across the life of the pipeline – years, usually – often without the help of the original author of the code.

So to leverage the low cost of object storage for data retention you must add a data lake management layer to simplify and speed up data product delivery.  This enables you to retain raw data forever in object storage (unless there’s a business need to delete it) plus transform the raw data into data products for widespread consumption. Because your raw data is always available, this architecture enables you to “time travel” to a previous state and replay / reprocess raw data as needed, without having to worry about adjusting and redeveloping existing pipelines. 

Upsolver SQLake is a data pipeline platform that integrates such a data lake management layer with a processing engine and SQL interface (accessible via CLI or API).  It uses an event sourcing approach where the cloud object store is an immutable store for raw data; materialized views project always-live tables (the data products) from that raw store for analytics use.

Here’s an example of data retention using SQLake:

/* STEP 1: Create S3 connection */
CREATE S3 CONNECTION s3_conn
   AWS_ROLE = 'arn:aws:iam::949275490180:role/upsolver_samples_role'
   EXTERNAL_ID = 'SAMPLES'
   READ_ONLY = TRUE;
 
/* STEP 2: Create orders staging table. By omitting the TABLE_DATA_RETENTION option in the CREATE TABLE statement, the data is being retained forever. This allows users to replay data whenever desired. 
 */
CREATE TABLE default_glue_catalog.upsolver_samples.orders_raw
   PARTITIONED BY $event_date;
 
/* STEP 3: Create a job that uses the connection to load data from s3 to the staging table */
CREATE JOB load_orders_raw
   START_FROM = NOW
   CONTENT_TYPE = JSON
   AS COPY FROM S3 s3_conn BUCKET = 'upsolver-samples' PREFIX = 'orders/'
   INTO default_glue_catalog.upsolver_samples.orders_raw;
 
/* STEP 4: Make sure your raw data is staged properly.*/
SELECT * FROM default_glue_catalog.upsolver_samples.orders_raw LIMIT 10;

You can keep the raw data in the staging area forever if the data is used downstream for analytics or compliance. The data is purged after the retention period if you define the TABLE_DATA_RETENTION option.

Try SQLake for Free

SQLake is Upsolver’s newest offering. It lets you build and run reliable data pipelines on streaming and batch data via an all-SQL experience. Try it for free. No credit card required. And continue the discussion and ask questions in our Slack Community.

Published in: Blog , Building Data Pipelines
Mei Long
Mei Long

Mei Long is a Product Manager at Upsolver. She is on a mission to make data accessible, usable, and manageable in the cloud. Previously, Mei played an instrumental role working with the teams that contributed to the Apache Hadoop, Spark, Zeppelin, Kafka, and Kubernetes projects.

Keep up with the latest cloud best practices and industry trends

Get weekly insights from the technical experts at Upsolver.

Subscribe

Templates

All Templates

Explore our expert-made templates & start with the right one for you.