Data Lake Management: Reevaluating the Traditional Data Lake

“To succeed, every modern company will need to be not just a software company but also a data company” – Matt Turck

We are currently in the midst of the Fourth Industrial Revolution (4IR). According to brookings.edu, this revolution is “characterized by the fusion of the digital, biological, and physical worlds, as well as the growing utilization of new technologies such as artificial intelligence, cloud computing, robotics, 3D printing, the Internet of Things, and advanced wireless technologies.” 

Modern data science and data storage technologies, including the AWS data lake and its associated data management and data processing services, have redefined and continue to redefine the digital age. As Matt Turck’s quotation above notes, every modern company must also be a data company.

In other words, every organization must implement a cloud-based enterprise architecture that can store and process massive volumes of data in multiple formats.

The Data Lake: Tradition vs. Modernization

As described by Ori Rafael of Upsolver.com, data storage is relatively cheap. The challenging part is increasing the data’s value by turning it into useful information.

With that in mind, let’s briefly consider the differences between the traditional and the modern data lake, especially the way data is stored in each. Although they sound similar, the only thing the two have in common is that they store data. Their differences are vast and deserve consideration.

The Traditional Data Lake

In summary, a data lake is a raw data storage system that holds data in its native format until it is required. The traditional data lake is not cloud-based; instead, it is hosted on physical servers owned or rented by the enterprise whose data it stores. It is highly scalable, and Apache Hadoop is the framework that facilitates the “distributed processing of large data sets across clusters of computers.”

The Hadoop Distributed File System (HDFS) is the file system that stores the information on those physical servers. Its architecture is similar to that of other distributed file systems. The HDFS architecture guide notes that the merits of this storage model are that HDFS is highly fault-tolerant, can be deployed on low-cost hardware, and provides high-throughput access to the information.
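To make this concrete, here is a minimal sketch of reading a raw file straight out of HDFS with pyarrow. The namenode host, port, and file path are illustrative assumptions, and the code presumes a working Hadoop client (libhdfs) is available on the machine running it.

```python
# Minimal sketch: read a raw file from HDFS with pyarrow.
# The namenode host, port, and path below are placeholder assumptions.
from pyarrow import fs

# Connect to the (assumed) HDFS namenode.
hdfs = fs.HadoopFileSystem(host="namenode.example.internal", port=8020)

# Stream the file exactly as it was landed in the lake -- no schema is imposed.
with hdfs.open_input_stream("/datalake/raw/events/2021-01-01.json") as stream:
    raw_bytes = stream.read()

print(f"Read {len(raw_bytes)} bytes from HDFS")
```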

Note: The traditional data lake can quickly turn into a data swamp without the implementation of the best-practice data quality, data governance, and data management methodologies.

The Modern Data Lake

In every sense, the modern data lake is a technological adaptation of, and an improvement on, the traditional data lake. It is an immense cloud-based pool of raw data that requires an equally vast amount of physical storage. This data is malleable, is stored as objects, can be analyzed quickly, and is well suited to machine learning algorithms.

Object Storage vs. Block Storage

As described above, data is stored in different formats in the traditional data lake and the modern, cloud-based data lake: the traditional data lake uses block storage, while the modern data lake uses object storage.

At the outset of this discussion, it is worth noting that object storage is a vast improvement on block storage. Succinctly stated, object storage is designed to make data storage scalable, cost-effective, and reliable. By contrast, block storage is based on a flat, linear structure burdened with superfluous, non-essential elements.

The question that follows is: how can these superfluous, non-essential functions and elements be removed to make storage scalable, cost-effective, and reliable?

The answer is simple: use objects to store data. In other words, manage data by storing each data entity as an object, a self-contained structure that holds all the elements of that entity.

In summary, the use of objects in the data science and software development industries was fundamentally driven by the need to simplify entities and to store all of an entity’s elements, including its data, metadata, and a unique identifier, in one place. Because everything stored in an object is linked via its unique identifier, it does not matter where the data is physically stored in the cloud; it can be housed on servers in multi-regional data centers. Consequently, object storage is highly scalable and easy to access via the unique identifier.
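As an illustration, the boto3 sketch below stores and retrieves one such object; the bucket name, key, and metadata are assumptions, but they show how the data, its metadata, and its unique identifier travel together.

```python
# Hedged sketch of object storage with boto3: each object bundles the data,
# descriptive metadata, and a unique identifier (the bucket + key).
# The bucket name and key are illustrative assumptions.
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="example-data-lake",                  # assumed bucket
    Key="raw/readings/2021/device-42.json",      # the object's unique identifier
    Body=b'{"device_id": 42, "reading": 98.6}',  # the data itself
    Metadata={"source": "field-sensor", "schema-version": "1"},
)

# Retrieval needs only the identifier -- not knowledge of where the bytes
# physically live within the provider's multi-regional data centers.
obj = s3.get_object(Bucket="example-data-lake", Key="raw/readings/2021/device-42.json")
print(obj["Metadata"], obj["Body"].read())
```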

Flat-file or block storage, on the other hand, splits the data into evenly sized blocks laid out next to each other, each with its own address, but it includes no metadata or other identifying information. Block storage is also not easily scalable: when the server’s hard disk drive (HDD) runs out of space, the IT engineer must either add HDD storage to the server or install an HDD with greater capacity and copy all the data across to the new drive. This is both costly and time-consuming.

For instance, let’s assume you are the developer of an advanced data science reporting tool designed to provide statistical analysis and insights for a biotechnology research lab that collects human blood donations and processes them into plasma, serum, and whole blood. The research lab aims to sell these samples to institutions that conduct research into treatments for diseases like COVID-19.

Therefore, instead of keeping individual silos of data such as customer, supplier, donor, blood type, and end product, the better method is to create a blood donation object that includes all the data relevant to a single donation: the donor, blood type, end product, and customer.
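A hypothetical version of that blood donation object might look like the sketch below; the field names, identifiers, and bucket are invented for illustration.

```python
# Hypothetical "blood donation" object that gathers every element related to
# a single donation in one place, instead of scattering it across silos.
import json
import boto3

donation = {
    "donation_id": "BD-2021-000153",   # unique identifier for the object
    "donor": {"donor_id": "D-10482", "blood_type": "O+"},
    "end_products": ["plasma", "serum", "whole blood"],
    "customer": {"name": "Example Research Institute", "order_id": "ORD-889"},
    "collected_at": "2021-07-14T09:30:00Z",
}

# Store the whole entity as a single object; bucket and key are assumptions.
boto3.client("s3").put_object(
    Bucket="example-biotech-lake",
    Key=f"donations/{donation['donation_id']}.json",
    Body=json.dumps(donation).encode("utf-8"),
)
```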

Facing the Challenges and Shortcomings of the Traditional and Modern Data Lake

As described above, while the modern data lake design is a vast improvement on the traditional data lake, accessing the modern lake’s data from outside the lake remains a substantial challenge.

Why?

The brief answer to this question is the monolith. 

In other words, proprietary software must be used to access the data in both types of data lake. And if you do not use this software, you cannot access the data.

Typical examples are the Oracle Big Data Service, Spark, or a Delta Query API in the data lake. Whichever data lake type you are working with, you face the same challenge, albeit with different software.

How do you solve this challenge? 

Enter the combination of the data lake and the data warehouse. 

The Combination of Lake and Warehouse: Improving Current Data Lake Design and Method

In summary, the biggest challenge in both the lake and warehouse models is the inability to access data using cross-platform or platform-independent tools. Consequently, you end up in a no-win scenario.

The question, then, is: how do you improve on both the data warehouse and data lake models?

Suppose you were to combine the data warehouse and the data lake models into a single solution: one lake and many warehouses. In other words, you would simplify the integration between the data in the lake and the warehouses by using cloud object storage.

What about the increased cost? Doesn’t this solution require additional storage for the multiple iterations of the same data contained in each warehouse?

The straightforward answer is: no, not really. These multiple iterations are subsets of the same data, so the data is not stored multiple times in the data lake. This model’s fundamental aim is to minimize the data storage cost while increasing the data processing capacity.

Secondly, the cost is further reduced by centralizing storage, metadata, and access control. The AWS services required to achieve this are S3, the Glue Data Catalog, and Lake Formation, respectively.
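As a rough sketch of that centralization, the snippet below registers the S3-resident data once in the Glue Catalog so that any number of downstream warehouses or query engines can reference it without copying it. The database name, table definition, and S3 location are assumptions.

```python
# Hedged sketch: register S3 data once in the AWS Glue Catalog so multiple
# "warehouses" / engines can query the same objects. All names are assumptions.
import boto3

glue = boto3.client("glue")

glue.create_table(
    DatabaseName="lake_catalog",   # assumed Glue database
    TableInput={
        "Name": "donations",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "donation_id", "Type": "string"},
                {"Name": "blood_type", "Type": "string"},
                {"Name": "end_product", "Type": "string"},
            ],
            "Location": "s3://example-biotech-lake/donations/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
            },
        },
    },
)
```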

Finally, AWS offers Athena, an interactive query service that makes it easy to analyze the data in S3 storage using SQL. Athena allows you to run different queries on the same data simultaneously, scanning large volumes of data very quickly.
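A minimal sketch of such a query via boto3 might look like the following; the database, table, and results bucket are assumptions carried over from the earlier examples.

```python
# Minimal sketch: run a SQL query over the S3-backed table with Athena.
# Database, table, and result bucket are illustrative assumptions.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT blood_type, COUNT(*) AS donations "
                "FROM donations GROUP BY blood_type",
    QueryExecutionContext={"Database": "lake_catalog"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)

print("Query started:", response["QueryExecutionId"])
```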

Final Thoughts

The one-lake-multiple-warehouse solution is elegant, time-saving, reliable, and cost-effective. Its underlying premise is that data stored in a centralized location in the cloud can answer multiple questions and gives the senior management of any organization, from the smallest business to the largest multinational corporation, the opportunity to make information-based decisions that drive the organizational vision.

Finally, the data engineer’s role in implementing, monitoring, and managing this one-lake-multiple-warehouse model is to run a continuous discover-and-learn lifecycle: understanding how to improve the quality of the data captured and the relationships the data elements have with each other, and identifying usage patterns in order to reduce risk and increase user trust.

 
