Replacing Spark/Hadoop-based ingestion and ETL with an automated ETL pipeline
Online advertising logs - 4bn events per day from Kinesis, stored on S3 and queried in Athena
Browsi provides an AI-powered adtech solution that helps publishers monetize content by offering ad inventory-as-a-service. The company’s platform automatically optimizes ad placements and layout to ensure users are being served relevant ad content without hurting the overall user experience. The company serves high-traffic publishers such as Hearst, Minute Media, Graham Media Group and Ynet.
Reduce engineering overhead and shift focus from infrastructure to features
Matan Ghuy, Big Data Engineer at Browsi, was tasked with maintaining the company’s data infrastructure. The data lake ingested data from Amazon Kinesis via a Lambda function that ensured exactly-once processing, while ETL was handled by a batch process coded in Spark/Hadoop and running on an Amazon EMR cluster once a day.
Amazon Athena was used to query the data, and due to the batch latencies, the data in Athena was either up to 24 hours old, or expensive and slow to query as it had not yet been compacted. Additionally, the overall solution was cumbersome and difficult to maintain, and each new ETL pipeline required additional effort from Matan, which prevented him from focusing on other back-end development tasks. When the company began evaluating Upsolver, Matan immediately saw its value as a self-service platform that would replace the manual infrastructure work that was taking up dozens of hours each week.
- Handle data at massive scale: over 4bn events processed in Kinesis daily
- Reduce friction and complexity compared to the previous solution based on Lambda + Spark
- Easy to maintain without additional engineering overhead
From manual coding to self-service streaming ETL
4bn events processed per day
40% reduction in costs compared to Spark on EMR
Minutes end-to-end latency
After a short proof of concept, Upsolver was seamlessly integrated into Browsi’s AWS account and quickly became the company’s main data lake ETL platform.
The company implemented Upsolver to replace both the Lambda function used for ingestion and the Spark/EMR implementation used to process data, transitioning from batch to stream processing and bringing end-to-end latency (Kinesis -> Athena) down to mere minutes.
While the previous implementation was based on manual coding, Upsolver enables Matan to manage all ETL flows from its visual interface, without writing any code.
Events are generated by scripts on publisher websites and streamed via Amazon Kinesis Streams. Upsolver ingests the data from Kinesis and writes it to S3, enforcing partitioning, exactly-once processing, and other data lake best practices.
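To make the partitioning and exactly-once ideas concrete, here is a minimal Python sketch. It does not show how Upsolver works internally, which this case study does not describe. It assumes each event carries a unique `event_id` and an epoch-seconds `timestamp` (both field names are hypothetical), keeps partitions in memory instead of writing to S3, and uses deduplication by ID as a stand-in for exactly-once delivery.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(event: dict) -> str:
    """Build a Hive-style date/hour partition path from the event timestamp.

    The "events/dt=.../hour=..." layout is an assumed convention, chosen
    because engines such as Athena can prune partitions laid out this way.
    """
    ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    return f"events/dt={ts:%Y-%m-%d}/hour={ts:%H}"


def ingest(records, seen_ids: set, partitions: dict) -> int:
    """Append each record to its partition at most once.

    Streams may redeliver records; skipping IDs already seen makes the
    write idempotent, so each event lands in the lake exactly once.
    Returns the number of records actually written.
    """
    written = 0
    for record in records:
        if record["event_id"] in seen_ids:
            continue  # duplicate delivery: ignore
        seen_ids.add(record["event_id"])
        partitions[partition_key(record)].append(record)
        written += 1
    return written


# Hypothetical usage: the second call simulates a redelivered batch.
seen, parts = set(), defaultdict(list)
batch = [
    {"event_id": "a1", "timestamp": 1_600_000_000, "type": "impression"},
    {"event_id": "a2", "timestamp": 1_600_003_600, "type": "click"},
]
ingest(batch, seen, parts)
ingest(batch, seen, parts)
```

In a production pipeline the set of seen IDs would need to be durable and bounded (for example, scoped to a time window), which this sketch leaves out.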
From there, the company built its output ETL flows to Amazon Athena, which is used for data science as well as BI reporting via Domo. For internal reporting, Upsolver creates daily aggregations of the data, which are processed by a homegrown reporting solution.
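As a rough illustration of what a daily aggregation over such events might look like, the sketch below counts events per UTC day, publisher, and event type. The field names (`timestamp`, `publisher`, `type`) and the choice of grouping keys are assumptions made for the example. The case study does not specify which aggregations Browsi actually computes.

```python
from collections import Counter
from datetime import datetime, timezone


def daily_counts(events) -> Counter:
    """Count events per (UTC date, publisher, event type).

    Each event is assumed to be a dict with an epoch-seconds "timestamp",
    a "publisher" name, and an event "type".
    """
    counts = Counter()
    for event in events:
        day = datetime.fromtimestamp(
            event["timestamp"], tz=timezone.utc
        ).date().isoformat()
        counts[(day, event["publisher"], event["type"])] += 1
    return counts


# Hypothetical usage with three events spanning two days.
sample = [
    {"timestamp": 1_600_000_000, "publisher": "example-news", "type": "impression"},
    {"timestamp": 1_600_000_060, "publisher": "example-news", "type": "impression"},
    {"timestamp": 1_600_086_400, "publisher": "example-news", "type": "click"},
]
report = daily_counts(sample)
```

A downstream reporting tool could then read one small row per key rather than scanning the raw event stream.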
“I used to spend dozens of hours on infrastructure - today I spend virtually none. Upsolver has made my life way better because now I can actually work on developing new features for Browsi’s back-end, rather than coding and maintaining ETL pipelines.”
Matan Ghuy, Big Data Engineer, Browsi
40% reduction in infrastructure costs
a month devoted to ETL maintenance (down from 40-50)