
Upsolver + Browsi

How a Single Data Engineer Manages ETL Pipelines for 4bn Daily Events

CASE STUDY

INDUSTRY

Advertising Technology

USE CASE

Replacing Spark/Hadoop-based ingestion and ETL with automated ETL pipelines

DATA INTEGRATIONS

Online advertising logs - 4bn events per day from Kinesis, stored on S3 and queried in Athena



The Backstory

Browsi provides an AI-powered adtech solution that helps publishers monetize content by offering ad inventory-as-a-service. The company’s platform automatically optimizes ad placements and layout to ensure users are being served relevant ad content without hurting the overall user experience. The company serves high-traffic publishers such as Hearst, Minute Media, Graham Media Group and Ynet.

Browsi platform (image from gobrowsi.com)




The Challenge

Reduce engineering overhead and shift focus from infrastructure to features



As an AI company, data is at the heart of all of Browsi’s operations. The company processes over 4bn events per day, generated by the JavaScript-powered engine it hosts on publisher websites. This data powers both analytics, in the form of internal and customer-facing dashboards, and data science, where it is used to refine the company’s predictive algorithms.

Matan Ghuy, Big Data Engineer at Browsi, was tasked with maintaining the company’s data infrastructure. Browsi had built its data lake by ingesting data from Amazon Kinesis via a Lambda function that ensured exactly-once processing, while ETL was handled by a batch process written in Spark/Hadoop and run once a day on an Amazon EMR cluster.
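In this context, exactly-once processing usually comes down to deduplicating events by a unique ID before they are written downstream. The sketch below illustrates the idea only; it is not Browsi’s actual Lambda code, and the `seen` set stands in for whatever durable state a real function would use:

```python
def dedupe(events, seen=None):
    """Drop events whose ID has already been processed.

    'seen' stands in for durable state (e.g. an external key-value
    store) that a real ingestion function would consult so that
    retried batches do not produce duplicate records.
    """
    if seen is None:
        seen = set()
    out = []
    for e in events:
        if e["id"] not in seen:
            seen.add(e["id"])
            out.append(e)
    return out

# A retried batch containing a duplicate of event "a":
batch = [{"id": "a"}, {"id": "b"}, {"id": "a"}]
unique = dedupe(batch)  # [{"id": "a"}, {"id": "b"}]
```

Keeping the dedup state outside the function is what makes the guarantee hold across retries and restarts, which is the hard part a managed platform takes off your hands.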

Amazon Athena was used to query the data. Because of the batch latencies, data in Athena was either up to 24 hours old or, if not yet compacted, expensive and slow to query. The overall solution was also cumbersome and difficult to maintain, and each new ETL pipeline required additional effort from Matan, preventing him from focusing on other back-end development tasks. When the company began evaluating Upsolver, Matan immediately saw its value as a self-service platform that could replace the manual infrastructure work consuming dozens of hours each week.

The Requirements

  • Handle data at massive scale - over 4bn events ingested from Kinesis daily
  • Reduce friction and complexity compared to the previous solution based on Lambda + Spark
  • Remain easy to maintain, without additional engineering overhead


The Solution

From manual coding to self-service streaming ETL


4bn events processed per day

40% reduction in costs compared to Spark on EMR

Minutes end-to-end latency

Browsi data lake ETL architecture


After a short proof of concept, Upsolver was seamlessly integrated into Browsi’s AWS account and quickly became the company’s main data lake ETL platform.

The company implemented Upsolver to replace both the Lambda function used for ingestion and the Spark/EMR implementation used to process data, transitioning from batch to stream processing and reducing end-to-end latency (Kinesis -> Athena) to mere minutes.

While the previous implementation relied on manual coding, Upsolver enables Matan to manage all ETL flows through its visual interface, without writing any code.

Events generated by scripts on publisher websites are streamed via Amazon Kinesis Data Streams. Upsolver ingests the data from Kinesis and writes it to S3 while enforcing partitioning, exactly-once processing, and other data lake best practices.
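Partitioning here means organizing S3 objects by event time so that a query engine like Athena can prune partitions and scan only the relevant data. As a rough illustration of the layout (a sketch, not Upsolver’s actual implementation; the stream name and key scheme are assumptions), a function mapping an event to a Hive-style date-partitioned S3 key might look like:

```python
from datetime import datetime, timezone

def partitioned_key(stream: str, event_id: str, event_time_ms: int) -> str:
    """Map an event to a Hive-style date-partitioned S3 key.

    A table partitioned on day/hour lets Athena prune partitions
    in queries filtered on those columns, scanning far less data.
    """
    ts = datetime.fromtimestamp(event_time_ms / 1000, tz=timezone.utc)
    return (
        f"{stream}/"
        f"day={ts.strftime('%Y-%m-%d')}/"
        f"hour={ts.hour:02d}/"
        f"{event_id}.json"
    )

# Example: an event from 2020-01-15 13:45 UTC
ts_ms = int(datetime(2020, 1, 15, 13, 45, tzinfo=timezone.utc).timestamp() * 1000)
key = partitioned_key("ad-events", "abc123", ts_ms)
# → "ad-events/day=2020-01-15/hour=13/abc123.json"
```

At billions of events per day, small files written this way also need periodic compaction into larger objects, which is the other half of what the managed pipeline handles.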

From there, the company built its output ETL flows to Amazon Athena, which is used for data science as well as BI reporting via Domo. For internal reporting, Upsolver creates daily aggregations of the data which are processed by a homegrown reporting solution.
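A daily aggregation of this kind can be pictured as grouping events by day and some dimension, then counting. The sketch below is illustrative only; field names like `publisher` are assumptions, not Browsi’s schema, and a production pipeline would express the same rollup as a GROUP BY over the partitioned data:

```python
from collections import Counter

def daily_rollup(events):
    """Aggregate raw events into (day, publisher) -> event count.

    Each event is a dict with 'day' (YYYY-MM-DD) and 'publisher'
    keys; this mirrors a SQL GROUP BY day, publisher.
    """
    counts = Counter()
    for e in events:
        counts[(e["day"], e["publisher"])] += 1
    return dict(counts)

events = [
    {"day": "2020-01-15", "publisher": "hearst"},
    {"day": "2020-01-15", "publisher": "hearst"},
    {"day": "2020-01-15", "publisher": "ynet"},
]
rollup = daily_rollup(events)
# {('2020-01-15', 'hearst'): 2, ('2020-01-15', 'ynet'): 1}
```

Precomputing these daily tables is what keeps the downstream reporting solution fast: it reads small aggregates instead of billions of raw events.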

“I used to spend dozens of hours on infrastructure - today I spend virtually none. Upsolver has made my life way better because now I can actually work on developing new features for Browsi’s back-end, rather than coding and maintaining ETL pipelines.”

Matan Ghuy, Big Data Engineer, Browsi



The Results

Minutes

End-to-end latency

40%

Reduction in infrastructure costs

1 hour

per month devoted to ETL maintenance (down from 40-50)