How to publish data to another AWS account’s catalog

Upsolver SQLake provides an easy and flexible way for data producers to create and share datasets across AWS accounts. Data engineers often create a dataset in a landing zone inside their development account. After testing and validation, they want to point their pipeline to write the dataset to another AWS account, such as the company data lake or a line-of-business-specific landing zone. This distributed approach, sometimes called Data Mesh, makes it possible to share and publish data quickly.

In this tutorial we will cover a simple multi-tenant data lake solution. The producer of the data is in one AWS account and the consumer, the tenant, is in a different AWS account.

Run it!
1. Stage the raw data in Account A’s data lake
2. Model, transform and write the source data into an intermediate table in Account A
3. Create a connection to Account B’s data lake (S3 and AWS Glue Catalog)
4. Continuously load data from Account A to Account B’s data lake

Stage the raw data in Account A’s data lake

SQLake uses staging tables to store a raw and immutable copy of the data in your Amazon S3-based data lake. When creating the staging table, you can omit defining the schema because SQLake will dynamically detect it and update the AWS Glue Data Catalog as data is ingested. 

Furthermore, a best practice is to partition staging tables by $event_date, an automatically generated system column which is the timestamp of the incoming event.

First, create the table definition in Account A’s catalog – default_glue_catalog.

CREATE TABLE default_glue_catalog.database_2777eb.orders_raw_data()
PARTITIONED BY $event_date;

Next, create an ingestion job to copy the raw events and store them in the staging table. SQLake copies the data continuously, minute by minute, and handles all table management (partitioning, compaction, and cataloging) automatically. The data is written into Account A’s data lake.

CREATE SYNC JOB load_orders_raw_data_from_s3
    CONTENT_TYPE = JSON
AS COPY FROM S3 upsolver_s3_samples 
    BUCKET = 'upsolver-samples' 
    PREFIX = 'orders/'
INTO default_glue_catalog.database_2777eb.orders_raw_data;

You can learn more about the different configuration options for the COPY FROM command when reading from S3 in the SQLake documentation.

Model, transform and write the source data into an intermediate table in Account A

At this stage, SQLake is running your ingestion job, processing new data files as they arrive and loading them into your staging table. You can query the staging table directly in SQLake by executing a standard SELECT query or by using a data lake query engine like Amazon Athena.
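
For example, a quick sanity check of the staging table can be run directly in SQLake. The columns referenced below (orderid, ordertype, nettotal) come from the sample orders dataset used in the transformation step later in this tutorial; substitute the fields of your own data as needed.

SELECT orderid,
      ordertype,
      nettotal
FROM default_glue_catalog.database_2777eb.orders_raw_data
LIMIT 10;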

Since the data is raw, you may want to model and prepare it so it’s easily understood and consumable by your end user – Account B. You may also choose to transform, enhance, aggregate, and redact the dataset before sharing it.

First, create an intermediate table in Account A’s data lake to hold the modeled and transformed data that you will share with Account B.

CREATE TABLE default_glue_catalog.database_2777eb.orders_transformed_data (
  partition_date date)
PARTITIONED BY partition_date;

Next, create a job to read from the staging table, transform the data and write the results to the output table you created in the previous step. For this example, you will combine the name fields into a full name field, use the MD5 hash function to mask customer emails and reduce the dataset to only SHIPPING orders.

CREATE SYNC JOB transform_orders_and_insert_into_athena
   START_FROM = BEGINNING
   ADD_MISSING_COLUMNS = TRUE 
   RUN_INTERVAL = 1 MINUTE
   AS INSERT INTO default_glue_catalog.database_2777eb.orders_transformed_data 
MAP_COLUMNS_BY_NAME
   SELECT
     orderid AS order_id,
     MD5(customer.email) AS customer_id, 
     customer_name, 
     nettotal AS total,
     $event_time AS partition_date
   FROM default_glue_catalog.database_2777eb.orders_raw_data
   LET customer_name = customer.firstname || ' ' || customer.lastname 
   WHERE ordertype = 'SHIPPING'
     AND $event_time BETWEEN run_start_time() AND run_end_time();

When you execute the above statement, SQLake begins processing data. It takes 2-3 minutes for the data to be ready to query in Account A’s table.

Query Account A’s intermediate table

To inspect the results of your transformation job, you can execute the following query. Since this table is available in Account A’s data lake, any data lake-compatible query engine, such as Amazon Athena, Amazon Redshift, or Databricks, can query it with ease.

SELECT order_id,
      customer_id,
      customer_name,
      total
FROM default_glue_catalog.database_2777eb.orders_transformed_data 
LIMIT 10;
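
If you would rather inspect the table from Amazon Athena, the same data is registered in Account A’s AWS Glue Data Catalog, so a query along the following lines should work (assuming Athena is pointed at that catalog and the Glue database is named database_2777eb, matching the SQLake table reference above):

SELECT order_id,
      customer_id,
      customer_name,
      total
FROM database_2777eb.orders_transformed_data
LIMIT 10;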

Here is a sample of the results:

Create a connection to Account B’s data lake (S3 and AWS Glue Catalog)

Connecting to another data lake involves two simple steps:

  1. Create a connection to the target S3 bucket where table data is stored
  2. Create a connection to the target data catalog where metadata is stored

Once these connections are created, building pipeline jobs is exactly as before, only now you point to a different data lake. Let’s see how to do this in SQLake.

Create an S3 connection:

CREATE S3 CONNECTION catalog_storage_conn
   AWS_ROLE = 'arn:aws:iam::<account_B>:role/datalake_role'
   EXTERNAL_ID = 'some_external_id';

Create an AWS Glue Data Catalog connection:

CREATE GLUE_CATALOG CONNECTION consumer_catalog
   AWS_ROLE = 'arn:aws:iam::<account_B>:role/datalake_name'
   EXTERNAL_ID = 'some_external_id'
   DEFAULT_STORAGE_CONNECTION = catalog_storage_conn
   DEFAULT_STORAGE_LOCATION = 's3://account_b_datalake/'
   REGION = 'us-east-1';

Note: the IAM roles used to create the S3 and AWS Glue Catalog connections must have cross-account permissions. These are commonly created with help from DevOps or CloudOps engineers.

Continuously load data from Account A to Account B’s data lake

Similar to steps 1 and 2, you’ll create a target table in Account B’s data lake where the dataset will be stored. Note that you don’t need to log in to Account B’s AWS account to do this, which makes life simpler.

Create a table in the consumer_catalog within Account B’s data lake.

CREATE TABLE consumer_catalog.db.orders_transformed_data (
  partition_date date)
PARTITIONED BY partition_date;

Next, create a job that reads from Account A’s intermediate table and writes the results to Account B’s table.

CREATE SYNC JOB publish_to_consumer
START_FROM = BEGINNING
ADD_MISSING_COLUMNS = TRUE 
RUN_INTERVAL = 1 MINUTE
AS INSERT INTO consumer_catalog.db.orders_transformed_data 
MAP_COLUMNS_BY_NAME
SELECT *
FROM default_glue_catalog.database_2777eb.orders_transformed_data
WHERE $event_time BETWEEN run_start_time() AND run_end_time();

When you execute the above statement, SQLake begins processing data. It takes 2-3 minutes for the data to be ready to query in the target table.

The job creates another copy of the data in Account B. It continuously manages this data and the related metadata by compacting files and updating schema and partition information. Furthermore, the same dataset is available to both Account A and B’s consumers within their respective data lakes. 

If you choose to only make the dataset available to Account B, you can skip step 4 and update step 2 to use Account B’s catalog. This will model and transform the data, writing the results directly into Account B’s data lake without making a copy in Account A’s catalog.
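
As a sketch of what that shortened pipeline could look like, the step 2 transformation job would simply target the table created in consumer_catalog (as in step 4) instead of the intermediate table in Account A. The job name below is hypothetical; the statement otherwise just recombines the syntax already shown above.

-- Variant of the step 2 job: transform the staged data and write it
-- directly into Account B's catalog, skipping the intermediate copy in Account A.
CREATE SYNC JOB transform_orders_into_consumer_catalog
   START_FROM = BEGINNING
   ADD_MISSING_COLUMNS = TRUE
   RUN_INTERVAL = 1 MINUTE
AS INSERT INTO consumer_catalog.db.orders_transformed_data
MAP_COLUMNS_BY_NAME
   SELECT
     orderid AS order_id,
     MD5(customer.email) AS customer_id,
     customer_name,
     nettotal AS total,
     $event_time AS partition_date
   FROM default_glue_catalog.database_2777eb.orders_raw_data
   LET customer_name = customer.firstname || ' ' || customer.lastname
   WHERE ordertype = 'SHIPPING'
     AND $event_time BETWEEN run_start_time() AND run_end_time();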

SQLake allows you to query the data in Account B, again without the need to log into AWS. You can validate the results are correct and let your consumers know they can query this table from their favorite data lake compatible query engine.

Query Account B’s target table

SELECT order_id,
      customer_id,
      customer_name,
      total
FROM consumer_catalog.db.orders_transformed_data 
LIMIT 10;

Here is a sample of the results – no different from what Account A sees.

Summary

SQLake makes it easy for data producers to publish datasets to consumers. Producers are always in full control of what consumers see. They can model, enhance, aggregate and redact the data before publishing it to consumers. SQLake takes care of all the data and metadata management, eliminating the need to create and manage additional processes inside the consumer accounts, like running AWS Glue Crawlers for schema management.

Get started today for free with sample data or bring your own.

Chandra Peddireddy

Chandra Peddireddy is a Senior Solutions Architect at Upsolver. He possesses 16 years of IT experience in data architecture, data engineering, data modeling, business intelligence, and analytics. His long list of expertise includes designing and building data pipelines of varied scales involving multiple data technologies.
