Joining Impression and Click Streams in Minutes Using SQL

Last week we officially launched Upsolver SQL – our proprietary SQL-based interface for building, maintaining, and orchestrating data lake ETL flows (press release is here in case you missed it). Today we’d like to dive a bit deeper and show you how to use it to quickly and easily create a dataset for predicting whether a user will click on an ad.

We’ve looked more deeply into this example in our previous webinar on Online Inference with User Personalization at Scale, which you should definitely go ahead and watch – but this post will focus specifically on building the dataset using Upsolver’s SQL functionality.

The Challenge 

To build this prediction dataset, we need to join a stream of ad impressions with a stream of user clicks. To improve the accuracy of our predictions, we’ll want to look at a relatively large window of historical user data – 90 days.

Essentially we want to create a unified dataset with user-level information that we can then analyze using our query engine of choice. While this might sound simple enough, there are several challenges that need to be addressed:

  1. Handling delayed events: Data is typically generated by end devices (such as mobile phones) that might lose internet connectivity, resulting in events arriving at different times – including clicks that arrive before impressions. To avoid inaccuracies, additional effort needs to be invested in the timing and orchestration of ETL jobs.
  2. Managing state: Since the data we are joining comes either from live streams or from schemaless data lake storage, we are working without indices. Solutions based on Apache Spark would require us to spin up an additional NoSQL database (e.g. Apache Cassandra) to act as a key-value store in order to join the historical event data.
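To make the delayed-events problem concrete, here is a minimal Python sketch (the event schema and field names are hypothetical, not Upsolver’s actual format). It shows how labeling an impression the instant it arrives misses a click that shows up a few minutes later, while holding the join open for a wait window catches it:

```python
from datetime import datetime, timedelta

# Toy events: the click for impression "i1" arrives 4 minutes after the impression.
impression = {"imp_id": "i1", "ts": datetime(2023, 1, 1, 12, 0)}
click = {"imp_id": "i1", "ts": datetime(2023, 1, 1, 12, 4)}

def label(imp, clicks, wait):
    """Label an impression as clicked (1) or not (0), counting only clicks
    that arrived within `wait` of the impression's timestamp."""
    cutoff = imp["ts"] + wait
    hit = any(c["imp_id"] == imp["imp_id"] and c["ts"] <= cutoff for c in clicks)
    return 1 if hit else 0

# Labeling immediately mislabels the impression; waiting 10 minutes fixes it.
print(label(impression, [click], timedelta(minutes=0)))   # 0 (click missed)
print(label(impression, [click], timedelta(minutes=10)))  # 1 (click captured)
```

This is exactly the timing logic that otherwise has to be hand-orchestrated across ETL jobs.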

These challenges can be solved using highly optimized architectures built on Spark/Hadoop along with a separate NoSQL database such as Cassandra or DynamoDB. However, this type of project can take 3-6 months and requires a team of data engineers. Instead, let’s look at an easier way to do it with just a few lines of SQL using Upsolver.

How Joins work in Upsolver SQL

Upsolver’s SQL uses familiar SQL syntax that includes time-based joins and window functions in addition to regular ANSI SQL statements. It leverages Upsolver’s built-in data lake indexing, which lets you work with schemaless data as if you were using a relational database, using the SQL you already know and love.

Using Upsolver’s SQL, you can create joins between streams, or between a stream and historical big data; the enriched stream will continuously query your data sources and return accurate data in near real-time, without needing to update the query or manage orchestration.

In our not-so-objective opinion this is kind of a big deal because:

  • Batch and streaming ETL is implemented using familiar SQL and nothing else
  • ETL processes each event separately, so data is available in real-time and processing is fast and efficient
  • ETL is fully stateful without an additional key-value store
  • ETL runs on raw data, with no time spent structuring it into a table model

Creating the Dataset

Okay, enough preamble – let’s show you how it’s done. Here’s the SQL we’re running in Upsolver to create our labeled dataset:

WITH click_data AS
  (SELECT data.imp_id AS imp_id, LAST(data.click_timestamp) AS click_time
   FROM Clicks),
user_data AS
  (SELECT data.exch_user AS exch_user, COUNT(*) AS num_clicks_90d,
          LAST(data.click_timestamp) AS last_click_time,
          LAST(data.user_id) AS device_id
   FROM Clicks
   GROUP BY data.exch_user
   WINDOW 90 DAYS)
SELECT data.win_timestamp AS win_time, -- impression data
       data.imp_id AS imp_id,
       data.campaign_id AS cmp_id,
       data.exch_user AS exch_user_id,
       IF_ELSE(click_data.click_time IS NULL, 0, 1) AS is_click,
       user_data.num_clicks_90d AS num_clicks_90d,
       user_data.last_click_time AS last_click_time,
       user_data.device_id AS device_id
FROM Impressions
LEFT OUTER JOIN click_data WAIT 10 MINUTES ON click_data.imp_id = data.imp_id
LEFT OUTER JOIN user_data ON user_data.exch_user = data.exch_user
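If you’d like to trace what the query computes outside of Upsolver, here is a rough in-memory Python equivalent, using plain dictionaries in place of Upsolver’s data lake indices (the sample events and timestamps are made up; field names follow the SQL above):

```python
from datetime import datetime, timedelta

# Made-up sample events following the query's field names.
clicks = [
    {"imp_id": "i1", "exch_user": "u1", "user_id": "dev-1",
     "click_timestamp": datetime(2023, 3, 1, 12, 5)},
    {"imp_id": "i3", "exch_user": "u1", "user_id": "dev-1",
     "click_timestamp": datetime(2023, 2, 10, 9, 0)},
]
impressions = [
    {"imp_id": "i1", "exch_user": "u1", "campaign_id": "c9",
     "win_timestamp": datetime(2023, 3, 1, 12, 0)},
    {"imp_id": "i2", "exch_user": "u2", "campaign_id": "c9",
     "win_timestamp": datetime(2023, 3, 1, 12, 1)},
]

now = datetime(2023, 3, 1, 13, 0)
window_start = now - timedelta(days=90)  # the WINDOW 90 DAYS cutoff

# click_data: last click time per impression id.
click_data = {}
for c in clicks:
    prev = click_data.get(c["imp_id"])
    if prev is None or c["click_timestamp"] > prev:
        click_data[c["imp_id"]] = c["click_timestamp"]

# user_data: 90-day aggregates per exchange user.
user_data = {}
for c in clicks:
    if c["click_timestamp"] < window_start:
        continue  # outside the 90-day window
    u = user_data.setdefault(c["exch_user"], {
        "num_clicks_90d": 0, "last_click_time": None, "device_id": None})
    u["num_clicks_90d"] += 1
    if u["last_click_time"] is None or c["click_timestamp"] > u["last_click_time"]:
        u["last_click_time"] = c["click_timestamp"]
        u["device_id"] = c["user_id"]

# Left-join both onto the impression stream to produce the labeled dataset.
dataset = []
for imp in impressions:
    u = user_data.get(imp["exch_user"], {})
    dataset.append({
        "win_time": imp["win_timestamp"],
        "imp_id": imp["imp_id"],
        "cmp_id": imp["campaign_id"],
        "exch_user_id": imp["exch_user"],
        "is_click": 1 if imp["imp_id"] in click_data else 0,
        "num_clicks_90d": u.get("num_clicks_90d", 0),
        "last_click_time": u.get("last_click_time"),
        "device_id": u.get("device_id"),
    })
```

Of course, the point of the SQL above is precisely that you don’t write this bookkeeping yourself – Upsolver maintains the per-key state and the join timing continuously as events stream in.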

As you can see, we’ve done everything with regular ANSI SQL except for two instances where we relied on proprietary SQL syntax:

  • WINDOW is used to signify the time window for historical event data we want to query in our Amazon S3 data lake
  • WAIT is used to delay the join operation, addressing the challenge of events arriving at different times

Using nothing but SQL that any developer or analyst can immediately understand, we’ve created a dataset that is updated in near real-time with both the latest events and 90 days’ worth of historical data. No Spark jobs, no convoluted architectures – write SQL and you’re good to go.

This data is immediately ready to be queried in tools such as Amazon Athena, and queries will run super-fast since Upsolver optimizes the underlying storage on S3 (we’ve covered this in our previous article on data preparation for Athena).

Got any questions or feedback? We’d love to hear it – get in touch with us today if you’d like to learn more. Or try SQLake for free (early access): SQLake is Upsolver’s newest offering, letting you build and run reliable data pipelines on streaming and batch data via an all-SQL experience, with no credit card required.

You can learn more about how Upsolver works by checking out our technical whitepaper.

Published in: Blog , Use Cases
Eran Levy

As an SEO expert and content writer at Upsolver, Eran brings a wealth of knowledge from his ten-year career in the data industry. Throughout his professional journey, he has held pivotal positions at Sisense and Adaptavist. Eran's written work has been showcased on well-respected platforms, including DZone, Smart Data Collective, and Amazon Web Services' big data blog. Connect with Eran on LinkedIn



