AWS Data Lake: Architecture, Best Practices and Tutorials

Why AWS Data Lakes?

A decade ago, data lakes were considered a niche interest in enterprise data management, and most companies still relied on the enterprise data warehouse (EDW) as the foundation of their data architecture. However, as data has grown increasingly complex – semi-structured, continuously generated, voluminous, and lacking a predefined schema – data lakes have grown more prominent as a solution for dealing with data that the EDW just couldn’t handle.

Today, it’s difficult to find a large-scale data infrastructure that doesn’t incorporate data lake design patterns: unstructured storage, open-source file formats, and best-of-breed analytics tools for different use cases rather than monolithic enterprise-wide deployments.

Alongside the growing volume and complexity of data, there has been a movement away from on-premises deployments and towards cloud-based infrastructure-as-a-service. Amazon Web Services is the leading provider in this space, and offers a variety of on-demand services for storage, compute, analytics and more – either through its own tools, or as part of its large partner network.

This page collects the essential resources we’ve published over the years around building, maintaining and managing your AWS data lake. We hope it helps guide you on your data lake journey!

Make your data lake analytics-ready with Upsolver, the only platform that lets you build continuous SQL pipelines directly on your data lake. Start for free!

Cloud Data Lakes: The Basics

Data Lake as a Service: Is There a GUI-based Data Lake? Why are data lakes still so difficult in the age of everything-as-a-service? Why are they still dependent on a select group of specialists skilled in arcane programming languages and frameworks? In other words – what’s stopping the data lake from becoming ‘productized’? We try to answer these questions.

Cloud Data Lake vs On-Premises Data Lake: What You Need to Know: Is it time to move your data lake to the cloud? As with any infrastructural choice, there are advantages and trade-offs to deploying in the cloud vs on-premises.

Understanding Data Lakes and Data Lake Platforms: This quick guide covers the basics of data lakes and data lake platforms, and should keep you from asking embarrassing questions such as “what kind of data lake do you think we should buy?”

A Data Lake Approach to Event Stream Analytics: In this in-depth technical whitepaper, we examine the common architectural challenges of operationalizing log data at scale, and then suggest a solution.

Data Lake, Data Warehouse, or Data Lakehouse: Organizations of all sizes can now capture more data from more sources, more quickly than ever before. But what good is all that real-time data if it takes six months until you can use it? This conundrum is at the core of the data warehouse vs data lake debate. We cover the differences between the three and how to choose among them.

What Is An Open Data Lakehouse? A World Without Monoliths: Whether you work on-premises or in the cloud, the coding and expertise demanded by the complex Hadoop/Spark stack often turn data lakes into data swamps. Learn how open cloud lakehouses can be the next evolutionary step in data infrastructure.