Browsi: How a Single Data Engineer Manages ETL Pipelines for 4bn Events

CUSTOMER STORY

INDUSTRY: Advertising Technology
USE CASE: Replacing Spark/Hadoop-based ingestion and ETL with an automated ETL pipeline
DATA: Online advertising logs - 4bn events per day from Kinesis, stored on S3 and queried in Athena
4bn
events processed per day
40%
reduction in costs compared to Spark on EMR
Minutes
end-to-end latency

Browsi provides an AI-powered adtech solution that helps publishers monetize content by offering ad inventory-as-a-service. The company’s platform automatically optimizes ad placements and layout to ensure users are served relevant ad content without hurting the overall user experience. The company serves high-traffic publishers such as Hearst, Minute Media, Graham Media Group and Ynet.

The Goal

As an AI company, Browsi puts data at the heart of all of its operations. The company processes over 4bn events per day, generated by the JavaScript-powered engine it hosts on publisher websites. It uses this data for analytics (internal and customer-facing dashboards) as well as data science (refining the company’s predictive algorithms).

Matan Ghuy, Big Data Engineer at Browsi, was tasked with maintaining the company’s data infrastructure. Browsi had built its data lake on ingestion from Amazon Kinesis via a Lambda function that ensured exactly-once processing, while ETL was handled by a batch process, coded in Spark/Hadoop, that ran on an Amazon EMR cluster once a day.

Amazon Athena was used to query the data, and due to the batch latencies, the data in Athena was either up to 24 hours old, or expensive and slow to query as it had not yet been compacted. Additionally, the overall solution was cumbersome and difficult to maintain, and each new ETL pipeline required additional effort from Matan, which prevented him from focusing on other back-end development tasks. When the company began evaluating Upsolver, Matan immediately saw its value as a self-service platform that would replace the manual infrastructure work that was taking up dozens of hours each week.

The Solution

After a short proof of concept, Upsolver was seamlessly integrated into Browsi’s AWS account and quickly became the company’s main data lake ETL platform.

The company implemented Upsolver to replace both the Lambda architecture used for ingest and the Spark/EMR implementation used to process data, transitioning from batch to stream processing and enabling end-to-end latency (Kinesis -> Athena) of mere minutes.

While the previous implementation relied on manual coding, Upsolver enables Matan to manage all ETL flows through its visual interface, without writing any code.

Events generated by scripts on publisher websites are streamed via Amazon Kinesis Data Streams. Upsolver ingests the data from Kinesis and writes it to S3, enforcing partitioning, exactly-once processing, and other data lake best practices.
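The date-based partitioning described above can be illustrated with a minimal sketch. This is not Upsolver's implementation; the event schema and key layout below are hypothetical, showing only the general idea of deriving a Hive-style S3 key from an event's timestamp so a query engine like Athena can prune partitions:

```python
from datetime import datetime, timezone

def partition_key(event: dict, prefix: str = "events") -> str:
    """Derive a Hive-style, date-partitioned S3 key from an event's
    epoch timestamp (fields here are illustrative, not Browsi's schema)."""
    ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    return (
        f"{prefix}/year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/"
        f"{event['event_id']}.json"
    )

# Example: an ad event lands under its event-date partition.
event = {"event_id": "abc123", "timestamp": 1700000000, "type": "impression"}
print(partition_key(event))
# events/year=2023/month=11/day=14/abc123.json
```

Partition pruning on keys like these is what keeps Athena queries fast and cheap once the small raw files are also compacted into larger columnar ones.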

From there, the company built its output ETL flows to Amazon Athena, which is used for data science as well as BI reporting via Domo. For internal reporting, Upsolver creates daily aggregations of the data which are processed by a homegrown reporting solution.
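The kind of daily rollup fed to a reporting tool can be sketched as follows. Again a hedged illustration only: the field names and grouping are hypothetical, not Browsi's actual schema or Upsolver's aggregation logic.

```python
from collections import defaultdict

def daily_aggregates(events):
    """Roll up raw events into per-day, per-publisher counts --
    the shape of a daily aggregation consumed by a reporting tool."""
    counts = defaultdict(int)
    for e in events:
        counts[(e["date"], e["publisher"])] += 1
    return dict(counts)

events = [
    {"date": "2020-01-01", "publisher": "hearst", "type": "impression"},
    {"date": "2020-01-01", "publisher": "hearst", "type": "click"},
    {"date": "2020-01-01", "publisher": "ynet", "type": "impression"},
]
print(daily_aggregates(events))
# {('2020-01-01', 'hearst'): 2, ('2020-01-01', 'ynet'): 1}
```

Pre-aggregating like this means the downstream reporting solution scans a few rows per day instead of billions of raw events.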

"I used to spend dozens of hours on infrastructure - today I spend virtually none. Upsolver has made my life way better because now I can actually work on developing new features for Browsi’s back-end, rather than coding and maintaining ETL pipelines."
- Matan Ghuy, Big Data Engineer, Browsi
