This article presents reference architectures and examples of data lake implementations, from around the web as well as from Upsolver customers. For a more detailed, hands-on example of building a data lake to store, process and analyze petabytes of data, check our data lake webinar with ironSource and Amazon Web Services.
So you’ve decided it’s time to overhaul your data architecture. What’s next? How do you go about building a data lake that delivers the results you’re expecting?
Well, we’re strong believers in the notion that an example is worth a thousand convoluted explanations. That’s why this post is all about real-life examples of companies that have built their data lakes on Amazon S3. Use it for inspiration, reference, or as your gateway to learn more about the different components you’ll need to become familiar with for your own initiative.
Sisense Builds a Versatile Data Lake with Minimal Engineering Overhead
As a leading global provider of business intelligence software, Sisense has data-driven decision making embedded in its DNA. One of the richest sources of data the company has to work with is product usage logs, which capture all manner of users interacting with the Sisense server, the browser, and cloud-based applications.
Over time, and with the rapid growth in Sisense’s customer base, this data had accumulated to more than 70bn records. In order to effectively manage and analyze this data, the company quickly realized it would have to use a data lake architecture, and decided to build one using the AWS ecosystem. We’ve written a more detailed case study about this architecture, which you can read here.
The Data Lake
To quickly generate value for the business and avoid the complexities of a Spark/Hadoop-based project, Sisense’s CTO Guy Boyangu opted for a solution based on Upsolver, S3, and Amazon Athena.
Product logs are streamed via Amazon Kinesis and processed using Upsolver, which then writes columnar CSV and Parquet files to S3. These are used for visualization and business intelligence using Sisense’s own software. Additionally, structured tables are sent to Athena to support ad-hoc analysis and data science use cases.
To learn more about Sisense’s data lake architecture, check out the case study.
Depop Goes From Data Swamp to Data Lake
Depop is a peer-to-peer social shopping app based in London, serving thousands of users. These users take various actions in the app – following, messaging, purchasing and selling products, and so on – creating a constant stream of events.
The Depop team documented their journey in two excellent blog posts. After an initial attempt to create replicas of the data on Redshift, they quickly realized that performance tuning and schema maintenance on Redshift would prove highly cumbersome and resource intensive. This led Depop to adopt a data lake approach using Amazon S3.
The Data Lake
Click to enlarge. Image source: Depop Engineering Blog.
The data lake at Depop consists of three different pipelines:
- Ingest: Messages are written via RabbitMQ, and dispatched via a fanout lambda function.
- Fanout: The lambda function sets up the relevant AWS infrastructure based on event type and creates an AWS Kinesis stream.
- Transform: The final step is creating columnar Parquet files from the raw JSON data, and is handled using the AWS Glue ETL and Crawler. From there data is outputted to Athena for analysis.
For more information about Depop’s data lake, check out their blog on Medium.
ironSource Streams its Way to Hypergrowth
ironSource is the leading in-app monetization and video advertising platform. ironSource includes the industry’s largest in-app video network. The company collects, stores, and prepares vast amounts of streaming data from millions of end devices.
When ironSource’s rapid data growth effectively obsoleted its previous system, it chose Upsolver to help it adopt a streaming data lake architecture, including storing raw event data on object storage.
The Data Lake
- Using Upsolver’s no-code self-service UI, ironSource ingests Kafka streams of up to 500K events per second, and stores the data in S3.
- Upsolver automatically prepares data for consumption in Athena, including compaction, compression, partitioning, and creating and managing tables in the AWS Glue Data Catalog.
- ironSource’s BI teams use Upsolver to enrich and filter data and write it to Redshift to build reporting dashboards in Tableau and send tables to Athena for ad-hoc analytic queries.
Bigabid Builds a High-Performance Real-time Architecture with Minimal Data Engineering
Bigabid brings performance-based advertisements to app developers, so clients only pay when new users come to their application through the ad. Bigabid takes advantage of machine learning (ML) for predictive decision-making and goes beyond social media and search engine data sourcing to create an in-depth customer user profile.
Bigabid had to replace its daily batch processing with real-time stream processing, so the company could update user profiles based on users’ most recent actions and continue to develop and scale ad campaigns. Using Upsolver’s visual no-code UI, Bigabid built its streaming architecture so quickly, it saved the equivalent of 6-12 months of engineering work from four dedicated engineers.
The Data Lake
Bigabid uses Kinesis Firehose to ingest multiple data streams into its Amazon S3 data lake, then uses Upsolver for data ETL, combining, cleaning, and enriching data from multiple streams to build complete user profiles in real-time. The company also uses Upsolver and Athena for business intelligence (BI) reporting that is used by its data science team to improve machine learning models.
Upsolver also automatically prepares data for Athena, optimizing Athena’s storage layer (Parquet format, compaction of small files) and sharply accelerating queries.
SimilarWeb Crunches Hundreds of Terabytes of Data
SimilarWeb is a leading market intelligence company that provides insights into the digital world. To provide this service at scale, the company collects massive amounts of data from various sources, which it uses to better understand the way users interact with websites.
In a recent blog post published on the AWS Big Data Blog, Yossi Wasserman from Similar Web details the architecture that the company uses to generate insights from the hundreds of terabytes of anonymous data it collects from mobile sources.
The Data Lake
Image source: AWS blog
The SimilarWeb solution utilizes S3 as its events storage layer, Amazon Athena for SQL querying, and Upsolver for data preparation and ETL. In his article, Wasserman details the way data is sent from Kafka to S3, reduced to include only the relevant fields needed for analysis, and then sent as structured tables to Athena for querying and analysis.
Read more about Similar Web’s data lake solution on the AWS blog.
An Event-driven, Serverless Architecture at Natural Intelligence
Natural Intelligence runs comparison sites across many different verticals. As Denise Schlesinger details on her blog, the company was looking to accomplish two different goals with this data:
- Query raw, unstructured data for real-time analytics, alerts and machine learning
- Store structured and optimized data for BI and analytics
To effectively work with unstructured data, Natural Intelligence decided to adopt a data lake architecture based on AWS Kinesis Firehose, AWS Lambda, and a distributed SQL engine.
The Data Lake
Image source: Denise Schlesinger on Medium
S3 is used as the data lake storage layer into which raw data is streamed via Kinesis. AWS Lambda functions are written in Python to process the data, which is then queried via a distributed engine and finally visualized using Tableau.
For more details about this architecture, check out Denise’s blog on Medium.
Peer39 Contextualizes Billions of Pages for Targeting and Analytics
Peer39 is an innovative leader in the ad and digital marketing industry. Each day, Peer39 analyzes more than 450 million unique web pages holistically to contextualize the true meaning of the page text and topics. It helps advertisers to optimize their spend and place ads in the right place, at the right time.
The Data Lake
Upsolver automates and orchestrates Peer39’s data workflow, incorporating built-in data lake best practices including:
- guaranteed message delivery
- exactly-once processing
- optimization of Athena’s storage layer (Parquet format, compaction of small files) so queries run much faster.
Peer39’s updated streaming data architecture unified teams across data science, analytics, data engineering, and traditional database administration, enabling the company to speed go-to-market with existing staff.
Browsi – Managing ETL Pipelines for 4 Billion Events with a Single Engineer
Browsi provides an AI-powered adtech solution that helps publishers monetize content by offering advertising inventory-as-a-service. Browsi automatically optimizes ad placements and layout to ensure relevant ad content.
Browsi needed to move from batch processing to stream processing, and move off of manually-coded data solutions.
The Data Lake
Browsi implemented Upsolver to replace both the Lambda architecture used for ingest and the Spark/EMR implementation used to process data.
Events generated by scripts on publisher websites are streamed via Amazon Kinesis Streams. Upsolver ingests the data from Kinesis and writes it to S3 while enforcing partitioning, exactly-once processing, and other data lake best practices.
From there, Browsi outputs ETL flows to Amazon Athena, which it uses for data science as well as BI reporting via Domo. End-to-end latency from Kinesis to Athena is now mere minutes. Meanwhile, a homegrown solution creates internal reports from Upsolver’s daily aggregations of data.
And it’s all managed by a single data engineer.
Ready to build your own data lake?
Upsolver is the fastest and easiest way to get your S3 data lake from 0 to 1. Schedule a demo to learn how you can go from streams to analytics in a matter of minutes. For additional data infrastructure best practices, check out some of these data pipeline architecture diagrams.