AWS Lake Formation – How to Set Up a Secure Data Lake

Everything You Need to Know About AWS Lake Formation

A data lake is a secure, centralized data repository (a single source) for all your enterprise data. It includes raw and transformed data such as source system data, sensor data, and social data. This source data is further sub-categorized into structured, semi-structured, unstructured, and BLOB data:

  • Structured data is extracted from relational databases.
  • Semi-structured data includes CSV files, audit logs, as well as XML and JSON data.
  • Unstructured data comprises PDFs, emails, and documents.
  • BLOB data is made up of images, audio, and video files.


The data lake’s raison d’être is the role it plays in tasks such as “reporting, visualization, advanced analytics, and machine learning.” In other words, when data is added to the lake, existing data silos are broken down and combined, allowing different types of analytics to be applied; the resulting insights can then guide better business decisions.

At this juncture, it is vital to note that a data swamp is a deteriorated, unmanaged data lake with security and governance issues. It is therefore imperative to ensure that your company’s data lake does not degenerate into a data swamp.

How?

To answer this question, let’s consider how to set up a secure data lake with AWS Lake Formation and use Lake Formation permissions for governance and simplified security management.

What Is AWS Lake Formation?

AWS Lake Formation is a service that simplifies and speeds up the process of setting up a secure data lake.

Here’s an intro to AWS data lakes.

Under normal circumstances, setting up a data lake is a time-consuming, manual process.

Why?

Big Data and the data lake described above have their data types and volumes in common. Consequently, these large volumes of data take a long time to extract, transform, and load (ETL) into the lake.

Enter the AWS Lake Formation service.

How Does AWS Lake Formation Work?

Lake Formation is used to set the data lake’s access and security policies (more on AWS data lake best practices). Once this information has been entered into the service, Lake Formation enforces its own permissions model, which augments the AWS Identity and Access Management (IAM) permissions model.
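
For instance, before Lake Formation can govern data in Amazon S3, the S3 location must be registered with the service. Here is a minimal boto3 sketch of that step, assuming a hypothetical bucket named company-taxi-datalake and the default service-linked role:

import boto3

lakeformation = boto3.client("lakeformation")

# Register a hypothetical S3 location so Lake Formation can govern access to it.
lakeformation.register_resource(
    ResourceArn="arn:aws:s3:::company-taxi-datalake",  # placeholder bucket name
    UseServiceLinkedRole=True  # let Lake Formation assume its service-linked role
)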

The salient point here is that most or all of this data is classifiable as sensitive. Consequently, only authorized users should be allowed access to the data, as well as to the information returned by data manipulation and analysis procedures.

This raises the question: how are permissions set using the AWS Lake Formation service?

Let’s answer this question by considering the following scenario:

Note: The data set used is public domain or open data, more specifically the New York City Taxi and Limousine Commission (TLC) Trip Record Data, available on the AWS website.

Imagine that you are a taxi company owner in New York City who would like to analyze the 2019 and 2020 data to determine the times of day (or night) when taxis are most and least in demand, taking into account external weather conditions, the day of the week, and the trip type. Your aim is to optimize your company’s taxi services to streamline operations and reduce costs.

The last assumption we must make is that all of this data is sensitive. It is your company’s intellectual property, so it must not fall into the hands of your competition; that would give them a competitive edge and cost your company many of its routes.

Let’s pick up the story after the data has been loaded into the data lake using the Lake Formation Service.

The next important issue to deal with is user permissions and persona types. How are they implemented using the Lake Formation service?

As described above, one of the aims of the Lake Formation service is “simplified security management.” This translates into a three-tier permissions process that defines permissions right down to the database, table, and column level.
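
To illustrate how fine-grained these permissions can be, here is a hedged boto3 sketch that grants a user SELECT on just two columns of a table; the database, table, column, and user names are assumptions made for this scenario:

import boto3

lakeformation = boto3.client("lakeformation")

# Grant SELECT on two specific columns only; all names below are hypothetical.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::account-id:user/analyst01"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "nyc_taxi",
            "Name": "trips_2019",
            "ColumnNames": ["pickup_datetime", "trip_type"]
        }
    },
    Permissions=["SELECT"]
)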

The permissions process begins with defining individual personas: the data engineer, the data analyst, and the business manager.

AWS Lake Formation’s Two Types of Resources

  • Metadata, which is stored in a data dictionary known as the AWS Glue Data Catalog. Metadata is also known as data about data; in other words, it is information about the databases, tables, and columns in which the data is housed.
  • The physical data stored in the lake, i.e., the underlying Amazon S3 locations (a sketch of how each resource type is granted follows this list).
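
As a sketch of granting each resource type, the snippet below gives a principal DESCRIBE on a Data Catalog database (metadata) and DATA_LOCATION_ACCESS on a registered S3 location (physical data); the database name, bucket ARN, and username are hypothetical:

import boto3

lakeformation = boto3.client("lakeformation")
principal = {"DataLakePrincipalIdentifier": "arn:aws:iam::account-id:user/dataeng01"}

# Metadata resource: a database in the AWS Glue Data Catalog.
lakeformation.grant_permissions(
    Principal=principal,
    Resource={"Database": {"Name": "nyc_taxi"}},
    Permissions=["DESCRIBE"]
)

# Physical data resource: a registered S3 location.
lakeformation.grant_permissions(
    Principal=principal,
    Resource={"DataLocation": {"ResourceArn": "arn:aws:s3:::company-taxi-datalake"}},
    Permissions=["DATA_LOCATION_ACCESS"]
)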

In this scenario, the data engineer needs access to both the metadata and the data. These are therefore administrator-type permissions, which also give the data engineer the right to grant and revoke other users’ permissions.

Secondly, the data analyst needs access to the data but not to the AWS Glue Data Catalog, because the analyst’s role is to apply statistical analysis algorithms and formulae that turn the data into useful information.

Thirdly, the business manager does not need access at any data level, least of all to the metadata. The manager only needs access to the front end, i.e., the information visualization level, to make the management decisions that streamline the company’s business processes.
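
To make the contrast between these personas concrete, here is a hedged boto3 sketch: the data engineer receives SELECT together with the right to grant it to others (administrator-style), while the data analyst receives SELECT only. The usernames and table names are assumptions:

import boto3

lakeformation = boto3.client("lakeformation")
table = {"Table": {"DatabaseName": "nyc_taxi", "Name": "trips_2019"}}  # hypothetical names

# Data engineer: SELECT plus the right to grant/revoke SELECT for other users.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::account-id:user/dataeng01"},
    Resource=table,
    Permissions=["SELECT"],
    PermissionsWithGrantOption=["SELECT"]
)

# Data analyst: SELECT only, with no right to pass the permission on.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::account-id:user/analyst01"},
    Resource=table,
    Permissions=["SELECT"]
)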

These permissions are created in the AWS console: navigate to AWS Console > IAM > Policies > Create Policy.

Here is an example of the data engineer’s policy creation:

  1. DataEngineerLakeFormationPolicy
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess",
                "lakeformation:GrantPermissions",
                "lakeformation:RevokePermissions",
                "lakeformation:BatchGrantPermissions",
                "lakeformation:BatchRevokePermissions",
                "lakeformation:ListPermissions"
            ],
            "Resource": "*"
        }
    ]
}

The data engineer’s PassRole policy is created as follows:

2. DataEngineerPassRole

{
    "Version": "2020-10-01",
    "Statement": [
        {
            "Sid": "PassRolePermissions",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::account-id:role/workflow_role"
            ]
        }
    ]
}
  3. The creation of an S3NycBucketRead policy is required because the New York City Taxi and Limousine Commission dataset (nyc-tlc) is not registered as a data lake location, so explicit IAM permissions are required.
{
    "Version": "2020-10-01",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::nyc-tlc",
                "arn:aws:s3:::nyc-tlc/*"
            ]
        }
    ]
}

Once the policies have been created, the next step is to create the user that is linked to them. In our scenario, the data engineer user must be created. Navigate to AWS Console > IAM > Create User to achieve this goal.

Finally, let’s assume that the data engineer’s username is dataeng01. When creating this user, the following policies must be attached to it: DataEngineerLakeFormationPolicy, DataEngineerPassRole, and S3NycBucketRead.
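
The same steps can be scripted rather than performed in the console. Here is a minimal boto3 sketch, assuming the three policies above already exist as customer-managed policies in your account (account-id is a placeholder):

import boto3

iam = boto3.client("iam")

# Create the data engineer's user ID.
iam.create_user(UserName="dataeng01")

# Attach the three customer-managed policies created above.
for policy_name in (
    "DataEngineerLakeFormationPolicy",
    "DataEngineerPassRole",
    "S3NycBucketRead",
):
    iam.attach_user_policy(
        UserName="dataeng01",
        PolicyArn=f"arn:aws:iam::account-id:policy/{policy_name}"  # placeholder account ID
    )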

Final Thoughts

One of the Fourth Industrial Revolution’s technological advancements is the role that Big Data plays, and will continue to play, in management decision-making across all sectors of society: from industry, retail, and finance across the spectrum to government and non-governmental organizations.

Apart from creating the data lake and running the ETL processes required to load data into it, the most critical role the AWS Lake Formation service plays is in creating and implementing user roles and permissions, ensuring that all data stored in the lake is secured and governed according to the data owners’ policies.
