- Marketing company, anomaly detection for mobile data collection
- 100TB a month, JSON files, >1000 fields with hierarchies & arrays
- Implemented in an hour using self-serve, detected data collection anomalies
SimilarWeb is the pioneer of market intelligence and the standard for understanding the digital world. SimilarWeb’s solutions provide customers with insights to help them understand, track and grow their digital market share.
SimilarWeb has thousands of customers and works with some of the largest global brands including Google, Walmart, Airbnb, HSBC, adidas and eBay. Our team is spread across 8 global offices and includes over 350 employees and counting.
SimilarWeb’s mobile data collection relies on a mobile SDK which is installed on more than 800 million devices, generating over 100TB of data every month.
SimilarWeb’s data collection process is business critical: the company can’t sell customers insights based on bad or missing data. SimilarWeb’s data collection team needs analytics to track new types of data, partner integrations and overall performance. The earlier they can identify and address anomalies, the better.
The Technical Challenges
- Complex streaming data - over 100TB per month of nested JSON files with hundreds of fields.
- Agility is important - every team member should be able to create new reports, investigate discrepancies and add automated tests. Coding, scaling and maintaining ETLs and databases takes time and can become a burden on our team.
- Fresh data - our Hadoop cluster runs daily jobs, so we might only see anomalies after 24 hours, which is too late. We wanted to get this number to under 1 hour.
- The count distinct problem - we track the number of unique visitors for billions of possible segments (device, OS, country, etc.). Count distinct is a non-additive aggregation, so calculating an accurate number of unique visitors usually requires many memory-intensive compute nodes.
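Why count distinct is non-additive can be shown with a toy example (the device IDs below are invented for illustration): per-segment or per-hour distinct counts cannot simply be summed, because the same visitor appears in several of them.

```python
# Toy illustration of the count distinct problem: summing per-hour
# distinct counts overcounts devices that appear in both hours.
hour_1 = {"dev-a", "dev-b", "dev-c"}
hour_2 = {"dev-b", "dev-c", "dev-d"}

naive_sum = len(hour_1) + len(hour_2)   # 6: counts dev-b and dev-c twice
true_distinct = len(hour_1 | hour_2)    # 4: union first, then count

print(naive_sum, true_distinct)  # 6 4
```

This is why an accurate answer normally requires keeping the full set of identifiers (or a mergeable sketch of it) in memory, rather than just per-bucket counters.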
"The Upsolver ETL pipeline helped improve our efficiency and reduce the time from ingestion to insight from 24 hours to minutes."
- Yossi Wasserman, Data Collection & Innovation Team Leader, SimilarWeb
The solution includes Amazon Athena for SQL analytics, Amazon S3 for event storage and Upsolver for data preparation.
Step 1 - get the raw data from Kafka to S3
SimilarWeb's SDK data resides on-premises in Kafka. They use Upsolver’s built-in connector to read events from Kafka and store them on S3 with exactly-once delivery.
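The source doesn't describe how Upsolver achieves exactly-once delivery, but one common pattern is idempotent writes keyed by Kafka position. A conceptual sketch (an in-memory dict stands in for the S3 bucket; this is not Upsolver's actual implementation):

```python
# Conceptual sketch of exactly-once delivery via idempotent writes:
# derive the object key from the Kafka (topic, partition, offset), so a
# retried batch overwrites the same object instead of creating a duplicate.
object_store = {}  # stands in for an S3 bucket


def write_batch(topic: str, partition: int, first_offset: int, events: list) -> str:
    key = f"{topic}/{partition:05d}/{first_offset:012d}.json"
    object_store[key] = events  # retry overwrites the same key: no duplicate
    return key


# A retry after a timeout targets the same key, so no duplicate object appears.
write_batch("sdk-events", 0, 1000, [{"id": 1}, {"id": 2}])
write_batch("sdk-events", 0, 1000, [{"id": 1}, {"id": 2}])  # retry
print(len(object_store))  # 1
```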
Step 2 - create a reduced stream on S3
Storing the full 100TB per month on S3 was unnecessary for the analytics SimilarWeb needs: the full stream includes about 400 fields, while they only need 20-30 of them. SimilarWeb uses Upsolver to create a smaller stream on S3.
The reduced stream contains some of the fields from the original stream plus some new calculated fields they added with Upsolver. There were also events in the same Kafka topic that weren’t relevant to their use case, so they filtered those out.
While SimilarWeb only keeps the raw data for 1 day, they store the reduced stream for 1 year. The reduced stream keeps them flexible, providing a 1-year event source at a much lower cost than storing the full raw data.
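The reduce step described above (filter out irrelevant events, project a subset of fields, add calculated fields) can be sketched as a simple transform. The event type and field names below are invented for illustration:

```python
# Hypothetical sketch of the "reduced stream" transform: keep a handful of
# the ~400 fields, drop irrelevant events, and add a calculated field.
KEEP_FIELDS = ["device_id", "os", "country", "app_version"]


def reduce_event(event: dict):
    if event.get("type") != "sdk_ping":  # filter out events we don't need
        return None
    reduced = {f: event.get(f) for f in KEEP_FIELDS}
    # Example calculated field: a coarse OS family derived from "os".
    reduced["os_family"] = (
        "android" if str(event.get("os", "")).startswith("Android") else "other"
    )
    return reduced


event = {"type": "sdk_ping", "device_id": "d1", "os": "Android 12",
         "country": "US", "app_version": "3.1", "extra": "dropped"}
print(reduce_event(event))
```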
Step 3 - create and manage tables in Athena
The AWS Glue Data Catalog and Amazon S3 are key components when using Athena: table schema definitions are managed in the Glue Data Catalog, and the data itself is stored in S3.
SimilarWeb used an Upsolver Output to map the nested JSON files into Athena tables. When they run the Output, Upsolver creates:
- Flat Parquet files on S3 - SimilarWeb mostly used the "aggregation" option, since they track the number of hourly distinct SDK users for various segments.
- Tables via the Glue Data Catalog API - the tables change over time as a result of schema changes and partition management. Since SimilarWeb used daily partitions, Upsolver added four partition columns to their tables.
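Mapping nested JSON into flat Parquet-style columns amounts to flattening each record's hierarchy into column names. A minimal sketch (the dot-joined naming convention is illustrative, not necessarily the exact one Upsolver uses):

```python
# Sketch of flattening a nested JSON record into flat column names, the
# shape a Parquet-backed Athena table expects. Dot-joined names are an
# illustrative convention only.
def flatten(record: dict, prefix: str = "") -> dict:
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))  # recurse into nesting
        else:
            flat[name] = value
    return flat


nested = {"device": {"os": "Android", "model": "Pixel"}, "country": "US"}
print(flatten(nested))
```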
Step 4 - SQL analytics in Athena
SimilarWeb developers use Athena's SQL console to look for anomalies in new data that SimilarWeb collects or in partner integrations. In some cases, they also run queries as part of CI jobs.
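The kind of query involved is an hourly distinct-user count per segment. Athena runs Presto SQL; the sketch below uses Python's built-in sqlite3 (with invented table and column names) only so the example runs locally, but the query shape is the same:

```python
import sqlite3

# Illustrative stand-in for the hourly distinct-device queries the team
# runs in Athena (which uses Presto SQL); sqlite3 is used here only so
# the sketch is runnable locally. Table and column names are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (hour TEXT, country TEXT, device_id TEXT)")
rows = [("2019-01-01T00", "US", "d1"), ("2019-01-01T00", "US", "d2"),
        ("2019-01-01T00", "US", "d1"),   # duplicate device in the same hour
        ("2019-01-01T01", "US", "d1")]   # a drop worth investigating
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

hourly = conn.execute("""
    SELECT hour, country, COUNT(DISTINCT device_id) AS distinct_devices
    FROM events
    GROUP BY hour, country
    ORDER BY hour
""").fetchall()
print(hourly)  # [('2019-01-01T00', 'US', 2), ('2019-01-01T01', 'US', 1)]
```

Comparing consecutive hours of `distinct_devices` per segment is the basis for spotting collection anomalies like the drop above.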
Why We Chose Upsolver
- It’s easy to configure at scale - Upsolver is configured with a graphical user interface. It took us about an hour to build a pipeline from Kafka to Athena including the number of distinct visitors for every segment we track. Upsolver’s Stream Discovery tool was very helpful when we created our tables in Athena. We could see the exact distribution of values for every field in our JSON, which helped us find the fields we actually wanted, and make corrections where necessary.
- We can now identify anomalies after 1 hour instead of 24 hours - Upsolver allows us to output hourly statistics to Athena & S3.
- Easy to maintain - Upsolver is a serverless platform with no IT overhead.
- Data privacy - our data remains private in our own S3 bucket. Although our data is anonymous, we didn’t want it leaving our control.
Why We Chose Athena
- Fast queries using SQL - our team wants to use SQL to query the data, but traditional SQL databases are hard to scale to 100s of TBs. Athena is a good fit since it runs distributed SQL queries on S3 using Apache Presto.
- Easy to maintain - Athena is a serverless platform with no IT overhead.
- Low cost - storing so much data in a data warehouse would be very expensive, since we would need enough database nodes to hold all the data. Since we only query a small portion of the data, $5 per TB scanned is appealing.