So you’ve decided it’s time to overhaul your data architecture. What’s next? How do you go about building a data lake that delivers the results you’re expecting?
Well, we’re strong believers in the notion that an example is worth a thousand convoluted explanations. That's why this post is all about real-life examples of companies that have built their data lakes on Amazon S3. Use it for inspiration, reference or as your gateway to learn more about the different components you'll need to become familiar with for your own initiative.
Sisense Builds a Versatile Data Lake with Minimal Engineering Overhead
Sisense is a leading global provider of business intelligence software, and data-driven decision making is embedded in its DNA. One of the richest sources of data the company has to work with is product usage logs, which capture all manner of user interactions with the Sisense server, browser and cloud-based applications.
Over time, and with the rapid growth in Sisense's customer base, this data had accumulated to more than 70 billion records. To manage and analyze it effectively, the company realized it would need a data lake architecture, and decided to build one using the AWS ecosystem. We've written a more detailed case study about this architecture, which you can read here.
The Data Lake
To generate value for the business quickly and avoid the complexities of a Spark/Hadoop-based project, Sisense's CTO Guy Boyangu opted for a solution based on Upsolver, S3 and Amazon Athena.
Product logs are streamed via Amazon Kinesis and processed using Upsolver, which then writes CSV and columnar Parquet files to S3. These files are used for visualization and business intelligence in Sisense's own software. Additionally, structured tables are made available in Athena to support ad-hoc analysis and data science use cases.
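To make "columnar" a bit more concrete, here is a minimal sketch in plain Python of pivoting row-oriented log records into a column-oriented layout, which is the basic idea behind formats like Parquet. It is an illustration only and not how Upsolver or Sisense implement it: the field names and the rows_to_columns helper are hypothetical, and real Parquet files add encoding, compression and metadata on top.

```python
def rows_to_columns(rows):
    """Pivot row-oriented records into a column-oriented layout.

    Fields missing from a row are filled with None so that every
    column holds exactly one value per input row.
    """
    field_names = []
    for row in rows:
        for name in row:
            if name not in field_names:
                field_names.append(name)  # keep first-seen field order
    return {name: [row.get(name) for row in rows] for name in field_names}


# Hypothetical product-usage log records (field names are illustrative).
logs = [
    {"user_id": 1, "action": "open_dashboard", "duration_ms": 120},
    {"user_id": 2, "action": "run_query", "duration_ms": 340},
    {"user_id": 1, "action": "export"},
]

columns = rows_to_columns(logs)

# A query that only needs "action" can read this one list and skip the rest.
actions = columns["action"]
```

The practical payoff is that an engine such as Athena only has to scan the columns a query references, which is why columnar files tend to be cheaper and faster to query than row-oriented ones.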
To learn more about Sisense's data lake architecture, check out the case study.
Depop Goes From Data Swamp to Data Lake
Depop is a peer-to-peer social shopping app based in London, serving thousands of users. These users take various actions in the app - following, messaging, purchasing and selling products, etc. - creating a constant stream of events.
The Depop team documented their journey in two excellent blog posts. After an initial attempt to create replicas of the data on Redshift, they quickly realized that performance tuning and schema maintenance on Redshift would prove highly cumbersome and resource-intensive. This led Depop to adopt a data lake approach using Amazon S3.
The Data Lake
Image source: Depop Engineering Blog.
The data lake at Depop consists of three different pipelines:
- Ingest: Messages are written via RabbitMQ and dispatched via a fanout Lambda function.
- Fanout: The Lambda function sets up the relevant AWS infrastructure based on event type and creates an Amazon Kinesis stream.
- Transform: The final step creates columnar Parquet files from the raw JSON data, and is handled using AWS Glue ETL and a Glue crawler. From there, the data is available in Athena for analysis.
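The fanout idea can be sketched as a small routing function that groups incoming messages by event type, one batch per target stream. This is a rough illustration under stated assumptions, not Depop's actual code: the "type" field, the stream naming convention and both helper functions are hypothetical, and the sketch only groups messages in memory rather than creating or writing to real Kinesis streams.

```python
from collections import defaultdict


def stream_name_for(event_type, prefix="events"):
    """Derive a per-type stream name (hypothetical naming convention)."""
    return f"{prefix}-{event_type}"


def fan_out(messages):
    """Group incoming messages by event type, one batch per target stream."""
    batches = defaultdict(list)
    for message in messages:
        # Messages without a type go to a catch-all stream instead of being lost.
        event_type = message.get("type", "unknown")
        batches[stream_name_for(event_type)].append(message)
    return dict(batches)


# Hypothetical app events, loosely modeled on the actions mentioned above.
messages = [
    {"type": "follow", "user": "a", "target": "b"},
    {"type": "purchase", "user": "a", "item": 42},
    {"type": "follow", "user": "c", "target": "a"},
    {"user": "d"},
]

batches = fan_out(messages)
```

Keeping one stream per event type means each downstream transform only has to deal with a single schema, which is part of what makes the later JSON-to-Parquet step manageable.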
For more information about Depop's data lake, check out their blog on Medium.
SimilarWeb Crunches Hundreds of Terabytes of Data
SimilarWeb is a leading market intelligence company that provides insights into the digital world. To provide this service at scale, the company collects massive amounts of data from various sources, which it uses to better understand the way users interact with websites.
In a recent blog post published on the AWS Big Data Blog, Yossi Wasserman from SimilarWeb details the architecture that the company uses to generate insights from the hundreds of terabytes of anonymous data it collects from mobile sources.
The Data Lake
Image source: AWS blog
The SimilarWeb solution uses S3 as its event storage layer, Amazon Athena for SQL querying, and Upsolver for data preparation and ETL. In his article, Wasserman details the way data is sent from Kafka to S3, reduced to include only the relevant fields needed for analysis, and then made available as structured tables in Athena for querying and analysis.
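The "reduce to only the relevant fields" step is essentially a projection over raw events. Here is a minimal sketch of that idea in plain Python, assuming newline-delimited JSON input. The field names in RELEVANT_FIELDS and both helper functions are hypothetical; SimilarWeb performs this step with Upsolver, not with code like this.

```python
import json

# Hypothetical set of fields an analyst actually needs.
RELEVANT_FIELDS = ("device_id", "app_id", "timestamp")


def project(event, fields=RELEVANT_FIELDS):
    """Keep only the requested fields that are present in the event."""
    return {name: event[name] for name in fields if name in event}


def reduce_events(lines, fields=RELEVANT_FIELDS):
    """Parse newline-delimited JSON and yield projected events.

    Blank lines, malformed JSON and non-object values are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(event, dict):
            yield project(event, fields)


raw_lines = [
    '{"device_id": "d1", "app_id": "com.example", "timestamp": 1, "debug": "x"}',
    "not json",
    "",
    '{"device_id": "d2", "timestamp": 2}',
]

reduced = list(reduce_events(raw_lines))
```

Dropping unused fields before the data reaches the query layer keeps the resulting tables narrow, which reduces both storage and the amount of data each query has to scan.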
Read more about SimilarWeb's data lake solution on the AWS blog.
An Event-driven, Serverless Architecture at Natural Intelligence
Natural Intelligence runs comparison sites across many different verticals. As Denise Schlesinger details on her blog, the company was looking to accomplish two different goals with its data:
- Query raw, unstructured data for real-time analytics, alerts and machine learning
- Store structured and optimized data for BI and analytics
To effectively work with unstructured data, Natural Intelligence decided to adopt a data lake architecture based on AWS Kinesis Firehose, AWS Lambda, and a distributed SQL engine.
The Data Lake
Image source: Denise Schlesinger on Medium
S3 is used as the data lake storage layer into which raw data is streamed via Kinesis. AWS Lambda functions are written in Python to process the data, which is then queried via a distributed engine and finally visualized using Tableau.
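As a rough idea of what a Python Lambda in such a pipeline can look like, here is a minimal sketch of a Kinesis Data Firehose data-transformation handler. It assumes the Lambda is attached to Firehose as a transformation function, which the original post does not confirm, and the transform step and the "event_type" field are hypothetical. What Firehose does expect from a transformation function is one output record per input record, each with its recordId, a result of "Ok", "Dropped" or "ProcessingFailed", and base64-encoded data.

```python
import base64
import json


def transform(payload):
    """Hypothetical processing step: drop untyped events, tag the rest."""
    if "event_type" not in payload:
        return None
    return {**payload, "processed": True}


def lambda_handler(event, context):
    """Transform a batch of Firehose records, returning one result per record."""
    output = []
    for record in event["records"]:
        record_id = record["recordId"]
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            result = transform(payload)
        except (ValueError, TypeError):
            # Bad base64, bad JSON or an unexpected payload shape:
            # hand the original data back and flag it as failed.
            output.append({"recordId": record_id,
                           "result": "ProcessingFailed",
                           "data": record["data"]})
            continue
        if result is None:
            output.append({"recordId": record_id,
                           "result": "Dropped",
                           "data": record["data"]})
            continue
        # Newline-delimit the output so objects landing in S3 are line-oriented JSON.
        line = (json.dumps(result) + "\n").encode("utf-8")
        output.append({"recordId": record_id,
                       "result": "Ok",
                       "data": base64.b64encode(line).decode("utf-8")})
    return {"records": output}
```

Because the handler is a plain function over a dictionary, it can be exercised locally with a hand-built event before it is ever deployed.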
For more details about this architecture, check out Denise's blog on Medium.
Ready to build your own data lake?
Upsolver is the fastest and easiest way to get your S3 data lake from 0 to 1. Schedule a demo to learn how you can go from streams to analytics in a matter of minutes.