Defining the Intersection and Union of Data Lake Governance and Security Best Practices

Upsolver Team
Data Lakes
November 10, 2020

The rapid adoption of Big Data and the growing demand for versatile, agile analytics will more than likely push an organization to the limits of its existing data infrastructure.

The results of a 2017 research study reported by globenewswire.com show that the “telecommunications and financial services [were] the leading industry adopters, with 87 percent and 76 percent respectively reporting current usage.”

The same study also reported that, over three years, the researchers found a significant improvement in the numbers of companies adopting the Big Data analytics methodology and model. In other words, “Big Data is becoming less an experimental endeavor and more of a practical pursuit within organizations.”

Data Lake Explained

Although a full treatment of this topic is beyond the scope of this discussion, for the sake of completeness, let’s consider a brief definition of a data lake, more specifically the AWS data lake.

The best place to look for a useful AWS data lake definition is Amazon itself.

“A data lake is a centralized repository [in the cloud] that allows you to store all your structured and unstructured data at any scale.”

Note: For many companies, a data lake is a logical progression from the data warehouse. It is necessary to note that the data warehouse and data lake are not the same. In fact, the only commonality between the two products is that they both store data.

According to a 2017 Aberdeen survey, organizations that implemented a data lake, including governance and security architectures, outperformed competitors that did not use a data lake to store, manage, and report on their Big Data by 9% in organic growth.

What is Data Lake Governance?

Data lake governance is a data management methodology that ensures that only high-quality data is loaded into the data lake, so that the information derived from the data is accurate and insightful.

Why?

Succinctly stated, business management decisions are based on this information. Therefore, the data must be supervised. Otherwise, the company runs the risk of making the wrong decisions, and the outcomes of those decisions can be disastrous and are often irrecoverable.

Two of the most common data governance architectures are Lambda and Delta. Let’s consider a definition as well as the merits of each.

Lambda Architecture

This data governance standard was developed by Nathan Marz of Twitter. In order to understand the Lambda architecture, let’s consider the following statement:

“Lambda architecture provides a clear set of architecture principles that allows both batch and real-time or stream data processing to work together while building immutability and recomputation into the system.”

One of the most critical points of the Lambda architecture is that its architecture includes a built-in set of data governance rules to ensure that the data is processed as accurately as possible.

Lambda also combines real-time data processing with batch processing to handle high data volumes with minimal intervention. The most significant difference between the two is that batch processing runs data input, processing, and output as separate steps over an accumulated set of data, while real-time processing handles input, processing, and output as one continuous stream.

The Lambda architecture consists of three layers.

The batch layer precomputes results using a distributed processing system that can handle large data volumes. It aims for complete accuracy by processing all available data when it generates the data views.
The speed layer processes data in real-time without worrying about what the data looks like as it streams into the lake, allowing low latency access to this data.
The serving layer stores precomputed data views or builds data views in response to ad-hoc queries.
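To make the three layers more concrete, here is a minimal sketch in Python. It is not a distributed implementation; it only illustrates how a batch view and a speed view can be merged by a serving function. The event data and the function names (`batch_view`, `speed_view`, `serve`) are assumptions made up for this illustration.

```python
from collections import Counter

# Hypothetical click events as (user_id, page) pairs.
master_dataset = [("u1", "home"), ("u2", "home"), ("u1", "pricing")]  # already batch-processed
recent_events = [("u3", "home"), ("u1", "home")]  # arrived after the last batch run


def batch_view(events):
    """Batch layer: recompute page counts from all historical data."""
    return Counter(page for _, page in events)


def speed_view(events):
    """Speed layer: update counts one event at a time as data streams in."""
    view = Counter()
    for _, page in events:
        view[page] += 1
    return view


def serve(page, batch, speed):
    """Serving layer: answer a query by merging the batch and speed views."""
    return batch[page] + speed[page]


batch = batch_view(master_dataset)
speed = speed_view(recent_events)
home_views = serve("home", batch, speed)
pricing_views = serve("pricing", batch, speed)
```

In this sketch, the batch view can always be rebuilt from the master dataset, which is the recomputation idea mentioned in the quotation above.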

Some of the challenges with data governed by the Lambda architecture include unsafe writes to the data lake, orphan data, no data schema or schema evolution, no audit history, and the separation of batch and streamed data.

Delta Architecture

The Delta architecture is more recent than Lambda, and it is designed to address several of the Lambda challenges listed above.

The Delta architecture documentation describes it as “an open-source storage layer that brings reliability to data lakes.” Its fundamental design is to provide “ACID (atomicity, consistency, isolation, durability) transactions, metadata handling, schema enforcement, audit history, time travel, full DML support, and unified batch and real-time data stream processing.”

One of the fundamental differences between these two architectures is that the Lambda architecture processes and analyzes the raw data batches and streams in the lake, while the Delta architecture is described as a “transactional storage layer” that runs on top of the data lake’s raw data object storage layer.
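The idea of a transactional layer on top of raw storage can be illustrated with a toy example. The sketch below is not the Delta Lake API; `ToyTable` and its methods are hypothetical names used only to show schema enforcement, all-or-nothing commits, and time travel over an append-only commit log held in memory.

```python
class ToyTable:
    """Toy append-only table with a commit log (illustration only)."""

    def __init__(self, schema):
        self.schema = schema  # column name -> expected Python type
        self.commits = []     # each commit is a list of rows

    def commit(self, rows):
        # Schema enforcement: validate every row before writing anything,
        # so a bad batch is rejected as a whole (all-or-nothing).
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"unexpected columns: {sorted(row)}")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise ValueError(f"{col} must be {typ.__name__}")
        self.commits.append(list(rows))
        return len(self.commits)  # the new version number

    def read(self, version=None):
        # Time travel: replay the log up to the requested version.
        upto = len(self.commits) if version is None else version
        return [row for commit in self.commits[:upto] for row in commit]


table = ToyTable({"id": int, "amount": float})
v1 = table.commit([{"id": 1, "amount": 9.5}])
v2 = table.commit([{"id": 2, "amount": 3.0}])

try:
    # The second row has a string id, so the whole batch is rejected.
    table.commit([{"id": 3, "amount": 1.0}, {"id": "4", "amount": 2.0}])
    rejected = False
except ValueError:
    rejected = True
```

Because the rejected batch never reaches the log, readers do not see a partially written commit, and `table.read(version=1)` still returns the table as it looked after the first commit.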

Now that we have considered two of the most common architectures, Lambda and Delta, let’s take a brief look at the two other important aspects of the data lake governance construct: Data Discovery and Cataloging, and Data Lineage.

Data Discovery and Cataloging

The earliest challenges of building and maintaining a data lake involved the ability to keep track of all the data or assets loaded into the lake. However, tracking raw data is not the only challenge. The changes to the data, including the assets and versions created by the data transformation, processing, and analytics, must also be tracked.

Enter the data catalog.

Once again, let’s turn to the Delta architecture to provide us with a succinct definition of the data catalog.

It is a list of all the assets stored in the data lake. And it is “designed to provide a single source of truth about the contents of the data lake.”

The data discovery element is achieved by utilizing the data catalog’s “queryable interface of all assets stored in the data lake’s S3 buckets.” In other words, you can discover what assets are housed in the data lake by querying the data catalog.
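As a rough illustration of querying a catalog, consider the following in-memory sketch. It does not use any real catalog service; the `DataCatalog` class, the asset names, and the `s3://example-lake/...` locations are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    location: str  # made-up object-store path, for illustration only
    fmt: str
    tags: set = field(default_factory=set)


class DataCatalog:
    """Toy catalog: one record per asset, searchable by tag and format."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, tag=None, fmt=None):
        results = []
        for entry in self._entries.values():
            if tag is not None and tag not in entry.tags:
                continue
            if fmt is not None and entry.fmt != fmt:
                continue
            results.append(entry.name)
        return sorted(results)


catalog = DataCatalog()
catalog.register(CatalogEntry("raw_orders", "s3://example-lake/raw/orders/", "json", {"sales", "raw"}))
catalog.register(CatalogEntry("clean_orders", "s3://example-lake/clean/orders/", "parquet", {"sales"}))
catalog.register(CatalogEntry("web_logs", "s3://example-lake/raw/web_logs/", "json", {"raw"}))

sales_assets = catalog.search(tag="sales")
json_assets = catalog.search(fmt="json")
```

The point of the sketch is that discovery happens against the catalog's metadata, not by scanning the stored objects themselves.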

Data Lineage

The best way to describe Data Lineage is to cite the following quotation from dataversity.net.

“Data Lineage describes data origins, movements, characteristics, and quality.” In other words, Data Lineage typically describes “where the data begins and how it is changed to the final outcome.” It’s the journey that the data takes from its origins through all its transformations over time.

Why is it necessary to ensure that changes to the raw data are tracked and monitored as part of the Data Lake governance protocol?

In summary, Data Lineage is all about data quality and data compliance. For data to answer the questions that company management asks to facilitate responsible decision-making, the data uploaded to the data lake must meet minimum quality standards. Secondly, this uploaded data must meet compliance standards designed to keep sensitive data secure and organized according to organizational and government rules and regulations.
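One simple way to picture lineage is as a graph that records, for each dataset, which datasets it was derived from and how. The sketch below assumes a hypothetical set of dataset names and transformation descriptions; it is an illustration of the concept, not a lineage tool.

```python
# Each dataset maps to a list of (parent dataset, transformation) pairs.
# Datasets with no parents are the original sources.
lineage = {
    "raw_orders": [],
    "raw_customers": [],
    "clean_orders": [("raw_orders", "drop rows with null order_id")],
    "orders_by_customer": [("clean_orders", "join"), ("raw_customers", "join")],
}


def origins(dataset, graph):
    """Return the source datasets that `dataset` ultimately derives from."""
    parents = graph[dataset]
    if not parents:
        return {dataset}
    found = set()
    for parent, _ in parents:
        found |= origins(parent, graph)
    return found


def history(dataset, graph):
    """List every upstream transformation step, earliest steps first."""
    steps = []
    for parent, transform in graph[dataset]:
        steps.extend(history(parent, graph))
        steps.append(f"{parent} -> {dataset}: {transform}")
    return steps


report_origins = origins("orders_by_customer", lineage)
report_history = history("orders_by_customer", lineage)
```

With a record like this, a quality or compliance question about a report can be traced back to the raw data it came from and every change made along the way.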

What is Data Lake Security?

Data lake security is defined as a set of implementable policies that protect the data from unauthorized access to prevent intentional or unintentional destruction, modification, or disclosure.

The modern organization collects massive volumes of data, classified under the Big Data ambit. There are also data security compliance issues that the organization must consider. Any data collected and stored in a data lake must therefore be protected from hackers. This data includes proprietary data specific to the organization as well as personal data from customers and suppliers. Data breaches damage the organization’s professional reputation, and they can open the company to litigation from entities whose data has been stolen.

Thus, the implementation of a data security protocol (composed of multiple policies) is imperative to protecting the organization’s data.

These policies are divided up into four categories: authentication, authorization, encryption, and auditing.

Authentication

Succinctly stated, user authentication is the process of verifying that the user attempting to access the data lake is who they claim to be and is allowed to access it. Authentication methods include a single layer, like entering a username and password into the security layer, or a double layer, like entering a username and password and then acknowledging a prompt on another device such as a smartphone.
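The single-layer and double-layer checks can be sketched with Python's standard library. This is a simplified illustration, not a production login system: the in-memory `users` dictionary, the iteration count, and the `second_factor_ok` flag (standing in for the result of a prompt on another device) are all assumptions.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor, chosen only for this illustration


def hash_password(password, salt):
    """Derive a salted hash so the plain password is never stored."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)


def register(users, username, password):
    salt = os.urandom(16)
    users[username] = (salt, hash_password(password, salt))


def authenticate(users, username, password, second_factor_ok=None):
    """Single layer if second_factor_ok is None; otherwise both layers must pass."""
    record = users.get(username)
    if record is None:
        return False
    salt, stored = record
    # Constant-time comparison of the stored and the supplied hash.
    password_ok = hmac.compare_digest(stored, hash_password(password, salt))
    if second_factor_ok is None:
        return password_ok
    return password_ok and second_factor_ok


users = {}
register(users, "dana", "correct horse")
single_layer = authenticate(users, "dana", "correct horse")
double_layer_unconfirmed = authenticate(users, "dana", "correct horse", second_factor_ok=False)
```

In the double-layer case, a correct password alone is not enough: the login fails until the prompt on the second device is acknowledged.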

Authorization

Authorization policies function at the level below the user authentication policy, and they define user-access rules. The authorization policy describes which parts of the lake the user is allowed to access. For instance, super-user or admin access might be granted only to data engineers.

This is because admin access often allows the user to update elements of the data lake. Most users only need read-only access, that is, the ability to view the data in the data lake. Giving too many people the option to update data and change data lake settings can result in a disastrous loss of data, even if only by accident.
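A role-based check is one common way to express such rules. The sketch below is a minimal illustration; the role names, user names, and the "raw" and "curated" zones are hypothetical and do not come from any particular product.

```python
# Which actions each role may perform in each zone of the lake (made-up values).
ROLE_GRANTS = {
    "data_engineer": {
        "raw": {"read", "write", "admin"},
        "curated": {"read", "write", "admin"},
    },
    "analyst": {
        "curated": {"read"},
    },
}

USER_ROLES = {"dana": "data_engineer", "sam": "analyst"}


def is_allowed(user, zone, action):
    """Deny by default: anything not explicitly granted is refused."""
    role = USER_ROLES.get(user)
    grants = ROLE_GRANTS.get(role, {})
    return action in grants.get(zone, set())


analyst_can_read = is_allowed("sam", "curated", "read")
analyst_can_write = is_allowed("sam", "curated", "write")
engineer_can_admin = is_allowed("dana", "raw", "admin")
```

Denying by default keeps the read-only majority from changing data or settings unless a role explicitly grants that ability.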

Encryption

Data encryption policies must be implemented at the data lake level, at the user interface level, and in the data pipelines that transfer the data from its source to the lake and from the lake to the user interface. Otherwise, there is a risk that hackers and unauthorized users will gain access to the data and steal it. The risk of data theft rose sharply in 2020, so implementing data encryption policies at all levels is imperative to data lake security.

Auditing

Keeping audit logs is a critical element of implementing data security protocols and policies. Succinctly stated, a record is written to the audit log for every user transaction, irrespective of the user’s access permissions. The primary function of these logs is to track and monitor access to the data lake and to look for anomalies such as unauthorized access attempts.
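The following sketch shows the general shape of such a log and one very simple anomaly check. The record fields, the user names, and the threshold of two denied attempts are assumptions chosen for illustration only.

```python
from datetime import datetime, timezone

audit_log = []


def record(user, action, resource, allowed):
    """Write one audit entry per transaction, whether or not it was allowed."""
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "resource": resource,
        "allowed": allowed,
    })


def denied_attempts(log, threshold=2):
    """Flag users whose number of denied requests reaches the threshold."""
    counts = {}
    for entry in log:
        if not entry["allowed"]:
            counts[entry["user"]] = counts.get(entry["user"], 0) + 1
    return {user: n for user, n in counts.items() if n >= threshold}


record("sam", "read", "curated/orders", True)
record("mallory", "read", "raw/customers", False)
record("mallory", "read", "raw/customers", False)
record("sam", "write", "curated/orders", False)

suspicious = denied_attempts(audit_log)
```

Here a single denied request is treated as an ordinary mistake, while repeated denials from the same user are surfaced for review.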

Final Thoughts

Most of this discussion has focused on the individual elements that make up both the data lake governance and security best practice methods and models.

Finally, let’s consider the intersection and union of these methods to describe how they function together with the sole aim of ensuring that the data is both governed and protected.

The intersection of these best practices is the point at which they meet. In this scenario, data lake governance and data lake security meet at their common goal or function: the data. In other words, the raison d’être for both is the data stored in the data lake. Data lake governance ensures data quality and data consistency. Data lake security ensures that the data is protected from unauthorized access, whether intentional or not.

The union of the data lake constructs, the data governance and the data security, is the combination of all the elements found in both constructs to ensure the quality and the protection of the data.

Ultimately, both the union and the intersection are about the data in the data lake. In other words, the data is the alpha and the omega of both the data governance and the data security best practice methods.

