Why AWS Data Lakes?
A decade ago, data lakes were considered a niche interest in enterprise data management, and most companies were still relying on the enterprise data warehouse (EDW) as the foundation of their data architecture. However, as data has grown increasingly complex – semi-structured, continuously generated, voluminous, and lacking a predefined schema – data lakes have grown more prominent as a solution for dealing with data that the EDW just couldn’t handle.
Today, it’s difficult to find a large-scale data infrastructure that doesn’t incorporate data lake design patterns: unstructured storage, open-source file formats, and best-of-breed analytics tools for different use cases rather than monolithic enterprise-wide deployments.
Alongside the growing volume and complexity of data, there has been a movement away from on-premises deployments and towards cloud-based infrastructure-as-a-service. Amazon Web Services is the leading provider in this space, and offers a variety of on-demand services for storage, compute, analytics and more – either through its own tools or through its large partner network.
This page collects the essential resources we’ve published over the years around building, maintaining and managing your AWS data lake. We hope it helps guide you on your data lake journey!
Cloud Data Lakes: The Basics
- Data Lake as a Service: Is There a GUI-based Data Lake? Why are data lakes still so difficult in the age of everything-as-a-service? Why are they still dependent on a select group of specialists skilled in arcane programming languages and frameworks? In other words – what’s stopping the data lake from becoming ‘productized’? We try to answer these questions. Read the article
- Cloud Data Lake vs On-Premises Data Lake: What You Need to Know: Is it time to move your data lake to the cloud? As with any infrastructural choice, there are advantages and trade-offs to deploying in the cloud vs on-premises. Read the article
- Understanding Data Lakes and Data Lake Platforms: We’ve prepared this quick guide to help get you ready for that day, and keep you from asking embarrassing questions such as “what kind of data lake do you think we should buy?” Read the article
- A Data Lake Approach to Event Stream Analytics: In this in-depth technical whitepaper, we examine the common architectural challenges of operationalizing log data at scale, thereafter suggesting a solution. Read the whitepaper
- Data Lake, Data Warehouse, or Data Lakehouse: Organizations of all sizes can now capture more data from more sources – more quickly than ever before. But what good is all that real-time data if it takes six months until you can use it? This conundrum is at the core of the data warehouse vs data lake debate. We cover the differences and how to choose between alternatives. Read the article
- What Is An Open Data Lakehouse? A World Without Monoliths: Whether you work on-premises or in the cloud, coding and expertise in the complex Hadoop/Spark stack often turn data lakes into data swamps. Learn how open cloud lakehouses can be the next evolutionary step in data infrastructure. Read the article
AWS Data Lake Architecture
- Intro to AWS Data Lakes: Components & Architecture: This article is based on a presentation given by Roy Hasson, Senior BDM at Amazon Web Services, during a recent webinar. Roy covers the basics of what a data lake is and why to build one, and the different components of a cloud data lake. Read the article
- 4 Guiding Principles for Modern Data Lake Architecture: Getting data lakes right can be tricky. How do you design a data lake that will serve the business, rather than weigh down your IT department with technical debt and constant data pipeline rejiggering? In this article we cover the four essential principles for effectively architecting your data lake. Read the article
- Streaming Data and Data Lake Architecture: The Ultimate Guide: What’s the best way to design and build a data platform that’s aligned with your use cases? How do you decide which stack to choose for a given business scenario, given the proliferation of new databases and analytics tools? We cover these and other questions in our comprehensive guide. Read the ebook
- Eliminating the Ugly Plumbing of Data Lake Pipelines: Data lake complexities leave too many engineers treading water, manually coding, configuring and optimizing pipelines instead of working on projects that drive business value. We review why this is the case, and then discuss how things can change. Read the article
- Apache Airflow – When to Use it, When to Avoid it when Building a Data Lake: Apache Airflow is a powerful and widely-used open-source workflow management system (WMS) designed to programmatically author, schedule, orchestrate, and monitor data pipelines and workflows. In this deep dive, we review scenarios in which Airflow is a good solution for your data lake, and ones where it isn’t. Read the article
AWS Data Lake Tutorials
- Approaches to Updates and Deletes (Upserts) in Data Lakes: Updating or deleting data is surprisingly difficult to do in data lake storage. Finding the records to update or delete requires a full data scan. But scanning an entire data store for each upsert is expensive and time-consuming. We look at a few ways to solve this challenge. Read the article
- How to Leverage Snowflake in an Optimized Data Lake: Use Snowflake WITH a data lake? Use Snowflake AS a data lake? Use Snowflake INSTEAD OF a data lake? It’s easy to feel a bit adrift as to whether and how to plug Snowflake into your cloud data stack. Let’s look, simply and clearly, at where Snowflake best fits. Read the article
- MySQL CDC and Database Replication for the Data Lake Age: In this post we discuss the different CDC approaches and patterns, and cover key Upsolver CDC capabilities that make it an indispensable tool in high-performance data pipelines. Read the article
- Apache Kafka with and without a Data Lake: How should you design your data architecture to build a scalable, cost effective solution for working with Kafka data? We look at two approaches – reading directly from Kafka vs creating a data lake – and understand when and how you should use each. Read the article
- 7 Guidelines for Ingesting Big Data to Data Lakes: In this article we look at 7 best practices for big data ingestion – from strategic principles down to the more tactical (and technical) issues that you should be aware of when building your ingest pipelines. Read the article
- 7 Best Practices for High-performance Data Lakes: In this article, we’ll present seven of the key best practices you should adhere to when designing, implementing and operationalizing your data lake. Read the article
- Partitioning Data on S3 to Improve Performance in Athena/Presto: In an AWS S3 data lake architecture, partitioning plays a crucial role when querying data in Amazon Athena or Redshift Spectrum. This article will cover the S3 data partitioning best practices you need to know. Read the article
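As a rough illustration of the partitioning idea in the last item above, here is a minimal Python sketch that builds a date-partitioned object key using Hive-style `key=value` prefixes, a common convention that Athena can map to partitions. The function name `partition_key`, the `events` prefix, the `year/month/day` layout and the file name are all assumptions made for this example; they are not taken from the linked article, and the right partition columns depend on how your data is actually queried.

```python
from datetime import datetime, timezone


def partition_key(prefix: str, event_time: datetime, filename: str) -> str:
    """Build a Hive-style, date-partitioned object key.

    Hypothetical layout: <prefix>/year=YYYY/month=MM/day=DD/<filename>.
    event_time must be timezone-aware; it is normalized to UTC so that
    events land in the same partition regardless of where they were produced.
    """
    t = event_time.astimezone(timezone.utc)
    return f"{prefix}/year={t:%Y}/month={t:%m}/day={t:%d}/{filename}"


# Example event timestamp and file name (both made up for illustration).
key = partition_key(
    "events",
    datetime(2021, 3, 7, 15, 30, tzinfo=timezone.utc),
    "part-0001.parquet",
)
print(key)  # events/year=2021/month=03/day=07/part-0001.parquet
```

With a layout like this, a query that filters on `year`, `month` and `day` only needs to read the objects under the matching prefixes rather than scanning the whole dataset.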
AWS Data Lake Ecosystem and Tools
- Getting Data Lake ETL Right: 6 Guidelines for Evaluating Tools: We look at the top 6 factors you should consider when evaluating a data lake ETL platform – whether open-source, proprietary, or custom-developed. Read the article
- Databricks Delta Lake vs Data Lake ETL: Overview and Comparison: In this article we take a closer look at Delta Lake and compare it to a data lake ETL approach, in which data transformations are performed in the lake rather than by a separate storage layer. Read the article
AWS Data Lake Architecture Diagrams and Use Cases
- Examples of Data Lake Architecture on Amazon S3: This post is all about real-life examples of companies that have built their data lakes on Amazon S3. Use it for inspiration, reference, or as your gateway to learn more about the different components you’ll need to become familiar with for your own initiative. Read the article
- Frictionless Data Lake ETL for Petabyte-scale Streaming Data: ironSource, an app monetization and video advertising platform, managed to transform 500K events per second, using only a visual interface and SQL, saving thousands of engineering hours, reducing latency, and increasing system scale by a factor of ten. Discover how Upsolver helped ironSource reach these results. Watch the webinar
- Data Lake ETL for IoT Data: From Streams to Analytics: In this article we’ll look at some of the data integration challenges and opportunities around IoT data, and suggest a reference architecture for simplifying data lake ETL for IoT streams using Upsolver on AWS. Read the article
- Data Architecture for AWS Athena: 6 Examples to Learn From: In this article we’ll look at a few examples of how you can incorporate Athena into different data architectures to support various use cases – streaming analytics, ad-hoc querying and Redshift cost reduction. For each use case, we’ve included a conceptual AWS-native example, and a real-life example provided by Upsolver customers. Read the article
Making AWS Data Lakes Analytics-ready with Upsolver
- Upsolver Technical Whitepaper: In this in-depth technical paper we will present the infrastructural challenges of working with big data streams, and how to tackle these challenges using a data pipeline platform that provides data management, processing and delivery as services – all within a data lake architecture. Read the ebook