Upsolver + SimilarWeb

SimilarWeb Analyzes Hundreds of Terabytes of Data with Amazon Athena and Upsolver

 

CASE STUDY

Industry

Market intelligence company.

Use case

Working with large volumes of data spread across different databases diverts the team’s focus from operations.

Data

Hundreds of TB of data.

Feature Highlights

The Backstory

The data collection process is critical for SimilarWeb, because the company can’t provide customers with insights based on flawed or incomplete data. The data collection team needs analytics to track new types of data, partner integrations, overall performance and more, as effectively and as quickly as possible. It’s imperative for the team to identify and address anomalies as early as possible. Any tool that supports this process gives a significant advantage.

The Goal

Bridge the gap between the data lake and the analytics users who aren’t big data engineers.

Hundreds of TB of data are streamed into SimilarWeb every month from different sources. The data is complex: it contains hundreds of fields, many of them deeply nested and some with null values. This complexity creates a technical challenge because the data must be cleaned, normalized and prepared for querying.
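To make the preparation step concrete, here is a minimal Python sketch of flattening a nested record and dropping null fields so that it maps onto flat table columns. It is an illustration only, not Upsolver's implementation: the `flatten` and `prepare` helpers, the underscore separator, and the sample event with its field names are all assumptions.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], parent: str = "", sep: str = "_") -> Dict[str, Any]:
    """Turn nested dictionaries into a single level of column-like keys."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the path as a prefix.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def prepare(record: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a record and drop fields whose value is null (None)."""
    return {k: v for k, v in flatten(record).items() if v is not None}


# Hypothetical event; the field names are invented for illustration.
event = {
    "source": "partner_a",
    "device": {"os": {"name": "android", "version": None}},
    "metrics": {"visits": 3},
}

if __name__ == "__main__":
    print(prepare(event))
```

At hundreds of fields and hundreds of terabytes, the same idea has to run continuously on the stream, which is the work the team did not want to code and maintain by hand.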

The first option was to use the existing on-premises Hadoop cluster, which processes all of SimilarWeb’s data in a daily batch process that takes a few hours to run. For business-critical monitoring, a 24-hour delay is not acceptable.

SimilarWeb considered developing a new process using Hadoop, but that would require the team to shift its focus away from daily operations in order to code, scale, and maintain extract, transform and load (ETL) jobs. Having to deal with different databases would also divert the team’s focus from operations. They therefore wanted an agile solution where every team member could create new reports, investigate discrepancies, and add automated tests.

The Requirements

  • An agile solution where every team member can create new reports, investigate discrepancies, and add automated tests.
  • SQL access to the data, even though traditional SQL databases are hard to scale to hundreds of terabytes.
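As a rough illustration of the kind of SQL-based check these requirements point to, the Python sketch below builds an hourly event count query and submits it to Athena. The table name, the `event_time` and `source` columns, and the helper names are hypothetical; `run_query` assumes a boto3 Athena client is passed in and uses its `start_query_execution` call.

```python
import re
from datetime import date, timedelta


def hourly_counts_sql(table: str, day: str) -> str:
    """Build a query that counts events per hour and source for one day."""
    # Accept only simple identifiers so the table name cannot inject SQL.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unexpected table name: {table!r}")
    start = date.fromisoformat(day)  # raises ValueError on a malformed date
    end = start + timedelta(days=1)
    return (
        "SELECT date_trunc('hour', event_time) AS hour, source, count(*) AS events\n"
        f"FROM {table}\n"
        f"WHERE event_time >= TIMESTAMP '{start.isoformat()} 00:00:00'\n"
        f"  AND event_time < TIMESTAMP '{end.isoformat()} 00:00:00'\n"
        "GROUP BY 1, 2\n"
        "ORDER BY 1, 2"
    )


def run_query(athena_client, sql: str, database: str, output_location: str) -> str:
    """Submit the query and return its execution id.

    athena_client is assumed to be boto3.client("athena"); output_location
    is an S3 URI where Athena writes the result files.
    """
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

A team member could compare the hourly counts per source against the previous day to spot a partner integration that has gone quiet.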

The Solution

Why SimilarWeb chose Amazon Athena and Upsolver.

Fast queries using SQL

Low maintenance

Low cost

SimilarWeb chose Upsolver, which bridges the gap between the data lake and the analytics users who aren’t big data engineers. Its cloud data lake platform helps organizations manage a data lake efficiently. Upsolver enables a single user to control big streaming data from ingestion through management and preparation for analytics platforms like Athena, Amazon Redshift and Amazon Elasticsearch Service (Amazon ES).

 

Upsolver's ETL pipeline helped improve our efficiency and reduce the time from ingestion to insight from 24 hours to minutes.

Yossi Wasserman, Data Collection & Innovation Team Leader, SimilarWeb

Upsolver is the shortest path from streaming to usable data.

The Results

The new pipeline improved SimilarWeb’s efficiency and reduced the time from ingestion to insight from 24 hours to minutes.

Business Benefits

  • Building a pipeline from Kafka to Athena takes an hour.
  • Data remains private in SimilarWeb’s own S3 bucket.
  • Upsolver is a serverless platform with little IT overhead.

Engineering Benefits

  • Anomalies are identified after 1 hour instead of 24 hours.
  • Upsolver is easy to configure at scale through its graphical user interface (GUI).
  • The Stream Discovery tool helped the team create tables in Athena.