Explore our expert-made templates & start with the right one for you.
How to Structure a Data Lake: Draining the Data Swamp
What is a data lake meant to achieve?
The idea of a data lake serving as a company-wide data repository has existed for many years. But only recently has it become commonplace for companies to invest in building a central place to cost effectively collect, store, and catalog their rapidly increasing amount of data. Today’s data lake architectures separate storage from metadata and compute, persisting files in Amazon S3 and cataloging technical metadata in AWS Glue Data Catalog and business metadata in Atlan (among others). Ingesting data into the data lake requires implementing different tools such as Apache Kafka for streaming, AWS Database Migration Service (DMS) for historical data from traditional databases and Fivetran to replicate changed data from SaaS products.
Each system collects, models, and persists data in the data lake in different ways. Not all tools support Hive-style partitions, or store data in optimized file formats such as Apache Parquet, or use consistent folder and file naming conventions. Some enable users to define transformations using a GUI, domain specific language (DSL), or SQL. Data retention, file size optimization (compaction), and ACID transactions are not always integrated and require special handling using additional tools.
Making data lakes work – easier said than done?
The nuances of all these tooling, configuration, and optimization choices make it very difficult for companies to enable self-service analytics, while also maintaining a data lake based on best practices that scales with demand. This results in slower than anticipated adoption of data lakes and an increase in frustration from companies who want to be more data-driven.
In my experience working with many of these companies, their frustration has led to a shift towards “data clouds” where ingestion, storage, transformation, and query are all provided by a single vendor, such as Snowflake, Google BigQuery, or Databricks (read more about Snowflake and Databricks). The data cloud is a walled garden. It is easier to start with, but it creates vendor lock-in which leads to a slower pace of 3rd-party integration and increasing costs. And a data cloud does not prevent the occurrence of a data swamp.
In this post, I will cover an opinionated technique for structuring your data in the data lake so that your datasets can be easily discovered, managed, and secured, helping users be independent and reducing the possibility of creating a data swamp.
Organizing your data lake: staging, refined, and mart zones
dbt Labs provides a best practices guide to structuring your dbt project that proposes creating staging, intermediate, and mart zones. In the dbt Labs guide, they describe this design from the perspective of organizing and storing your SQL models – queries executed against data in the data warehouse. I propose extending this model to also reflect how data is stored in the data lake to allow users to access it using their choice of tools, not just the data warehouse.
This approach isn’t new. For years data lakes were designed this way. However, what’s novel is the ability of our tools to automatically implement these patterns and best practices without requiring data engineers to do extra work or data analysts to know the ideal file size or how to partition their data.
Let’s review the three zones:
Staging zone (also known as raw or bronze): stores and catalogs the raw data as in source systems. It maintains history for as long as you need. The data is as close to the original form as possible. It’s not yet modeled or formatted (partitioned, using columnar file format), which makes it difficult to query or load into the DWH efficiently. Data commonly must be replayed to fix corrupt or bad data, train and test ML models, audit, or simply to archive it. In many cases, raw data is not easily discoverable, resulting in users creating duplicates for their own purposes.
Refined zone (also known as intermediate or silver): stores the results of raw data being cleaned, formatted, and modeled. Data should contain fine-grained data, without aggregations. Column names and data type inconsistencies have been corrected, date and time fields have been normalized, and data quality has been validated and fixed to the extent possible.
Additionally, changed events from operational databases using Change Data Capture (CDC) are merged into the refined table by inserting, updating, and deleting records. The refined zone is used by analysts, data scientists, and those users who need access to full datasets for exploration, model training, deep dives, and analytics.
The staging zone is used to store the raw data before any transformations, merging, or modeling. The refined zone is used to store the same data after it was cleaned, formatted, merged, and modeled. You’re probably wondering why you need both zones? Aren’t we duplicating data?
Each zone has a different purpose. The staging zone is used to retain the original shape, structure, and meaning of the data. It’s the original data, held outside of the source system, reducing cost and complexity. It allows you to replay the raw data from any point in time, a super power every data engineer needs.
For example, your CFO rushes into your office screaming that for the last month the company has underreported the taxes paid for all transactions in the US. You realize that the customer purchase events you’ve loaded into the data warehouse contained the wrong tax calculation. You fix the calculation, but need to go back 1 month and replay all of the events. Without a staging zone, this will require keeping historical data in the source system, which is costly and difficult to scale. The refined zone holds a clean and modeled representation of the staged data. It is easy for business users to understand and use, without having to model the same raw data every time. So you will have duplicate data, but it more than pays for itself by avoiding the cost of reprocessing raw data every time you need to use it
Mart zone (also known as curated or gold): stores datasets that represent specific business views, like metrics. The mart zone is used by people such as business analysts, data scientists, product managers, and application developers to join and aggregate data from the refined zone into views that represent specific business needs. In some cases, curated data may not even be stored in the data mart zone of the data lake but published directly from the refined zone into a target datastore more suitable for the user’s needs.
For example, you may curate data from the refined zone, to join, aggregate, and produce curated data containing business metrics. That curated dataset is loaded into your data warehouse for BI reporting and dashboarding. Another common example is to consolidate and deduplicate logs, to reduce the total data volume, and publish it to Elasticsearch for the operations teams to visualize and alert.
There is a fundamental difference between how marts are created and managed in a data warehouse compared to a data lake. In the data lake, mart views are accessible to any query engine and can be published to various target systems to serve specific use cases. In a data warehouse, the marts are only accessible from that system. This can be limiting. Furthermore, this approach enables you to publish analytics data back into your operational data stores, serving the purpose of reverse ETL without the need to implement new tools.
Implementing best practices for data lake design
The staging zone
The first step is to define the staging zone in your Amazon S3 data lake. You define the folder structure to map to the source systems you are ingesting. This allows you to create a single source of truth that’s not too specific or too broad in the way it describes the data. I recommend against grouping your sources based on business functions like marketing or customer support at this stage. If you have a microservices architecture, I recommend grouping data sources based on the microservices that produce and own the data. For example:
data_lake_bucket/staging |__store_front_mysql | |__production_db | | |__customer_tbl | | |__product_sku_details_tbl | | |__product_sku_quantities_tbl |__store_front_kafka | |__shopping_cart_events | |__buy_events
If you are following a data mesh or a multi-account design pattern, you can manage each of the staging locations under a different AWS cloud account. For example:
Account: 11111111111 store_front_lake_bucket/staging |__store_front_mysql | |__production_db | | |__customer_tbl | | |__product_sku_details_tbl | | |__product_sku_quantities_tbl |__store_front_kafka | |__shopping_cart_events | |__buy_events Account: 222222222222 ops_lake_bucket/staging |__k8s_infra_kafka | |__application_log_events | |__container_log_events | |__cluster_log_events |__aws_infra_logs | |__vpcflowlogs | | |__account=111111111111 | | | |__region=us-east-1 | | |__account=333333333333 | | | |__region=us-east-1 | | | |__region=eu-west-2 | |__cloudtrail
With this approach, each team or line of business owns its own data.
Using AWS VPC Flow Logs, let’s demonstrate how to create your staging zone with Upsolver SQLake in 3 simple steps.
1. Connecting to your data
First, create an S3 connection. SQLake connections hold the credentials required to access the source system, in this case the IAM role to access S3. Once created, these connections are persistent. They can be created once by an admin or data engineer and reused by other users. This DRY (Don’t Repeat Yourself) approach makes sure that source system credentials are stored securely and are not shared with other users.
Here is an example of how to create a connection to S3 in SQLake:
CREATE S3 CONNECTION logs_s3_connection AWS_ROLE = ‘arn:aws:iam::111111111111:role/s3_log_access EXTERNAL_ID = ‘XXXXX’ ;
2. Creating the staging table in the data lake
SQLake uses Amazon S3 to stage all of your physical data and AWS Glue Data Catalog to maintain the table and partition metadata, making your staged data easily discoverable and accessible.
Before writing data, create the staging table in the catalog. Partition the table by the aws-account-id, aws-region and event_date columns to efficiently filter the table at query time. Configure the storage location to match the original data lake design. Lastly, configure a table retention of 30 days to avoid excess storage charges.
Here is an example of creating a table in the AWS Glue Data Catalog with partitioning. This command is executed directly in SQLake. You can use the Glue Data Catalog APIs to create tables, but some table options such as retention will not be available.
CREATE TABLE default_glue_catalog.staging.aws_infr_logs_vpcflowlogs( aws-account-id string, aws-region string, event_date date, ) PARTITIONED BY aws-account-id, aws-region, event_date STORAGE_LOCATION = ‘s3://ops_lake_bucket/staging/aws_infra_logs/vpcflowlogs’ TABLE_DATA_RETENTION = 30 days ;
3. Creating a staging job
Next, create a job to copy data from your source bucket to the target table. This job simply reads from the source, optimizes the underlying data, and writes to the staging table in the data lake.
CREATE JOB aws_infra_logs_vpcflowlogs_stage_job START_FROM = NOW AS COPY FROM S3 logs_s3_connection BUCKET = ‘aws_vpc_flow_logs’ PREFIX = ‘AWSLogs/’ INTO default_glue_catalog.staging.aws_infr_logs_vpcflowlogs ;
Since we configured START_FROM = NOW, SQLake will consume only new log files and will not try to read any past data. To log past data, set this parameter to BEGINNING or a timestamp. As data arrives, it is processed, optimized, and stored in the location you defined with the CREATE TABLE statement.
When you created the staging table, you only declared the columns you wanted to use to partition your table. The remaining columns will be automatically discovered and created for you by the ingestion job. As the schema evolves, the job continues to update the table accordingly. Automatic schema detection and evolution are powerful capabilities in SQLake that make self-service easier.
To keep things DRY, you may choose to develop staging jobs for all of your core datasets. This eliminates the need for users to understand anything about the source systems when building staging jobs. This decoupled design enables you to provide self-service analytics in a way that best suits the organization. For example, if you don’t have data engineers, your data consumers can easily create connections and staging jobs themselves, since it’s only a few lines of SQL. If you have data engineers, they can handle ingestion, curate the company datasets, and allow data consumers to work directly with pre-staged data.
The refined zone
The next step is to create the refined or intermediate zone. Think about how your business users will want to combine the datasets into meaningful views that will later be stored in the data mart zone. A best practice is to structure the refined zone based on the business grouping or meaning of the datasets. Let’s look at a simple example:
data_lake_bucket/refined |__storefront_saas | |__customers | | |__customer_detail_tbl | | |__purchases_tbl | | |__returns_tbl | |__products | | |__product_sku_details_tbl | | |__product_sku_quantities_tbl |__security_and_ops | |__infrastructure | | |__s3_access_logs_tbl | |__payment_processing_service | |__recommendation_service |__marketing | |__email_campaigns |__finance | |__orders | |__revenue
In this example, data is grouped into meaningful categories that make it easy for the systems and tools to find and access information efficiently. It also enables admins to easily audit and archive the physical data. This data may be presented to data consumers as raw tables or as views that materialize results based on the business logic you define. As with a data warehouse, your data lake processing engine is responsible for transforming and materializing the data, whether that’s executed via SQLake, Databricks, or self-managed Apache Spark.
Continuing with the previous example, let’s transform and model VPC Flow Logs data.
4. Modeling and refining the staged data
To make the log data more useful, create a basic model that makes the dataset easier to understand. For instance, transform the data to adjust timestamps and add computed fields that may benefit your users when they perform analysis or build reports.
Let’s define the refined table in the AWS Glue Data Catalog.
CREATE TABLE default_glue_catalog.security_and_ops.infra_aws_vpcflowlogs( aws-account-id string, aws-region string, event_date date, ) PARTITIONED BY aws-account-id, aws-region, event_date STORAGE_LOCATION = ‘s3://ops_lake_bucket/refined/security_and_ops/infrastructure/vpcflowlogs’ TABLE_DATA_RETENTION = 30 days ;
Next, you’ll create the job to model and refine the VPC Flow Logs into something your security team can understand and analyze.
CREATE JOB model_and_refine_vpcflowlogs START_FROM = BEGINNING ADD_MISSING_COLUMNS = TRUE RUN_INTERVAL = 1 MINUTE AS INSERT INTO default_glue_catalog.security_and_ops.infra_aws_vpcflowlogs MAP_COLUMNS_BY_NAME SELECT account-id as source_account_id, interface-id as capture_network_interface_id, srcaddr as source_ip_address, srcport as source_port, dstaddr as destination_ip_address, dstport as destination_port, protocol as ip_protocol, packets as num_packets_sent, bytes as num_bytes_sent, time_of_first_packet, time_of_last_packet, instance-id as instance_id, vpc-id as vpc_id, flow-direction as flow_direction, traffic-path as egress_traffic_path -- there are more possible fields, will ignore for brevity FROM default_glue_catalog.staging.aws_infr_logs_vpcflowlogs -- optional calculated fields LET time_of_first_packet = to_date(“start”::string), time_of_last_packet = to_date(“end”::string) WHERE $commit_time BETWEEN run_start_time() AND run_end_time() ;
This refined table will serve as the base for one or more marts. Having the refined tables makes it easy for users to understand the data and eliminates the need for users to clean and pre-process data over and over.
The mart zone
Finally, we need to create the data mart or curated zone. This is where everything comes together to represent business concepts such as customer, order, and revenue. To create these business concepts, you join and aggregate datasets from the refined zone. To store these marts you can use the following structure:
data_lake_bucket/marts |__storefront_saas | |__customers | | |__revenue_by_customer | | |__orders_by_customer | | |__returns_by_customer | |__products | | |__product_by_category | | |__product_availability_by_country |__security_and_ops | |__payment_processing_metrics | | |__successful_payment_events | | |__failed_payment_events |__marketing | |__email_campaigns | | |__open_rate_by_customer | | |__open_rate_by_campaign
Notice how the datasets are organized similarly to the refined zone but with more focus on the business concepts they represent. Tables are partitioned on columns that are commonly used to filter query results relevant to the use case. For example, the open_rate_by_campaign dataset can be partitioned by campaign_date, which enables users to execute fast queries that filter results based on when the campaign started and ended.
5. Creating a business data mart
To make the log data more useful for the security team, you may want to create a table with specific KPIs they can build a dashboard around for near real-time monitoring. In this example we count the number of packets that our security group and network ACLs rejected.
Define the mart table in the AWS Glue Data Catalog.
CREATE TABLE default_glue_catalog.security_and_ops.daily_not_permitted_traffic( dt date ) PARTITIONED BT dt STORAGE_LOCATION = ‘s3://ops_lake_bucket/marts/security_and_ops/infrastructure/traffic_metrics’ TABLE_DATA_RETENTION = 5 days ;
Next you define a job to count the packets that were rejected.
CREATE JOB load_daily_not_permitted_traffic START_FROM = BEGINNING ADD_MISSING_COLUMNS = TRUE AS INSERT INTO default_glue_catalog.security_and_ops.daily_not_permitted_traffic MAP_COLUMNS_BY_NAME SELECT srcaddr, dstaddr, “action", to_date(“start”::string) as dt, count(*) as "total_count" FROM default_glue_catalog.security_and_ops.infra_aws_vpcflowlogs WHERE $commit_time BETWEEN run_start_time() - INTERVAL 1 DAYS AND run_end_time() GROUP BY 1,2,3,4 ;
Since jobs in SQLake run automatically, you don’t need to do any additional work or job orchestration. Data flow is automatically coordinated between jobs so you always get reliable results. Now, this behavior is not exclusive to the data lake; SQLake can write staging and mart tables to a wide range of targets, such as data warehouses. However, the staging data is always written to S3, which enables lower cost for storing historical data and the ability to quickly scale to meet demand for large volumes of data.
In summary, to avoid creating a data swamp, whether in your data lake or data warehouse, you need a layered design. Datasets can be staged by a central team or individual LOBs that own the data and make it easily discoverable. Raw data can be refined and modeled to make it easier for data consumers to analyze and build business-centric marts that represent specific views and metrics. In line with the best practice dbt Labs recommends, it’s important to structure your data storage in the data lake and metadata storage into technical and business catalogs. This enables faster discovery of data, simpler onboarding of new use cases, and offloading the tedious, mundane tasks data engineers need to perform to make data performant and reliable. Upsolver SQLake automatically implements this design pattern and related data lake best practices so you don’t need it.
Try SQLake for Free for 30 Days
SQLake is Upsolver’s newest offering. It lets you build and run reliable data pipelines on streaming and batch data via an all-SQL experience. Try it for free for 30 days. No credit card required.