SimilarWeb Analyzes Hundreds of Terabytes of Data
with Amazon Athena and Upsolver
Market intelligence company.
Handling large amounts of data across several different databases diverts the team’s focus from operations.
Hundreds of TB of data.
The data-collection process is critical for SimilarWeb, because the company can’t provide customers with insights based on flawed or incomplete data. The data collection team needs analytics to track new types of data, partner integrations, overall performance, and more, as quickly and effectively as possible. It’s imperative for the team to identify and address anomalies early. Any tool that supports this process gives a significant advantage.
Bridge the data lake and the analytics users who aren’t big data engineers.
Hundreds of TB of data are streamed into SimilarWeb every month from different sources. The data is complex. It contains hundreds of fields, many of which are deeply nested and some of which hold null values. This complexity creates a technical challenge because the data must be cleaned, normalized, and prepared for querying.
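To make the preparation step concrete, here is a minimal Python sketch of flattening a deeply nested record and filling null values so it can be queried as a flat row. It is an illustration only, not SimilarWeb's or Upsolver's actual implementation; the field names (`device`, `geo`, `visits`, `referrer`) and the helper functions `flatten` and `normalize` are hypothetical.

```python
from typing import Any, Dict, List, Optional


def flatten(record: Dict[str, Any], parent: str = "", sep: str = ".") -> Dict[str, Any]:
    """Collapse nested dicts into one level, joining keys with `sep`."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects and keep the full path as the column name.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def normalize(
    record: Dict[str, Any],
    columns: List[str],
    defaults: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Project a nested record onto a fixed column list, filling nulls from `defaults`."""
    flat = flatten(record)
    defaults = defaults or {}
    row: Dict[str, Any] = {}
    for col in columns:
        value = flat.get(col)  # missing fields are treated like nulls
        if value is None:
            value = defaults.get(col)  # stays None when no default is given
        row[col] = value
    return row


# Hypothetical event with a nested structure and a null value.
event = {"device": {"os": "android", "geo": {"country": None}}, "visits": 3}
row = normalize(
    event,
    columns=["device.os", "device.geo.country", "visits", "referrer"],
    defaults={"device.geo.country": "unknown"},
)
print(row)
```

Projecting every record onto a fixed column list keeps the output schema stable even when sources add or omit fields, which is one common way to make semi-structured data queryable with SQL.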
The first option was to use the existing on-premises Hadoop cluster, which processes all of SimilarWeb’s data in a daily batch job that takes a few hours to run. For business-critical monitoring, a 24-hour delay is not acceptable.
SimilarWeb considered developing a new process using Hadoop, but that would require the team to shift its focus away from daily operations in order to code, scale, and maintain extract, transform, and load (ETL) jobs. Having to deal with different databases would also divert the team’s focus from operations. They therefore wanted an agile solution in which every team member could create new reports, investigate discrepancies, and add automated tests.
Why SimilarWeb chose Amazon Athena and Upsolver.
Fast queries using SQL
SimilarWeb chose Upsolver. Upsolver bridges the data lake and the analytics users who aren’t big data engineers. Its cloud data lake platform helps organizations manage a data lake efficiently. Upsolver enables a single user to control big streaming data from ingestion through management and preparation for analytics platforms such as Amazon Athena, Amazon Redshift, and Amazon Elasticsearch Service (Amazon ES).
Upsolver's ETL pipeline helped improve our efficiency and reduce the time from ingestion to insight from 24 hours to minutes.
Yossi Wasserman, Data Collection & Innovation Team Leader, SimilarWeb
Upsolver is the shortest path from streaming to usable data.
The new pipeline improved SimilarWeb’s efficiency and reduced the time from ingestion to insight from 24 hours to minutes.