Last week we officially launched SQL – our proprietary SQL-based interface for building, maintaining and orchestrating data lake ETL flows (press release is here in case you missed it). Today we’d like to dive a bit deeper and show you how you can use SQL to quickly and easily create a dataset for predicting whether a user will click on an ad.
We’ve looked more deeply into this example in our previous webinar on Online Inference with User Personalization at Scale, which you should definitely go ahead and watch – but this post will focus specifically on building the dataset using Upsolver’s SQL functionality.
To answer this question, we need to join a stream of ad impressions with a stream of user clicks. To improve the accuracy of our predictions we’ll want to look at a relatively large time window of historical user data – 90 days:
Essentially we want to create a unified dataset with user-level information that we can then analyze using our query engine of choice. While this might sound simple enough, there are several problems challenges that need to be addressed:
- Handling delayed events: Data is typically being generated by end devices (such as mobile phones) that might lose internet connectivity, resulting in events arriving at different times – including clicks that arrive before impressions. To avoid inaccuracies, additional effort needs to be invested in timing and orchestration of ETL jobs.
- Managing state: Since the data we are joining is either coming from live streams or schemaless data lake storage, we are working without indices. Solutions based on Apache Spark would require us to spin up an additional NoSQL database (e.g. Apache Cassandra) to act as a key-value store in order to join the historical event data.
- Deploying live models: If we use batch ETL for model training, we will need a separate streaming architecture to actually deploy the model, requiring us to maintain two separate and complex pipelines.
These challenges can be solved using highly optimized architectures built on Spark/Hadoop along with a seperate NoSQL database such as Cassandra or DynamoDB. However, this type of project can take 3-6 months and require a team of data engineers. Instead, let’s look at an easier way to do this with just a few lines of SQL using Upsolver.
How Joins work in Upsolver SQL
Upsolver’s SQL is a familiar SQL syntax which includes time-based joins and window functions, in addition to regular ANSI SQL statements. SQL leverages Upsolver’s built-in data lake indexing, which allows you to work with schemaless data as if you were using a relational database, using the SQL you already know and love.
Using Upsolver’s SQL, you can create joins between streams, or between a stream and historical big data; the enriched stream will run continuously query your data sources and return accurate data in near real-time, without needing to update the query or to manage orchestration.
In our not-so-objective opinion this is kind of a big deal because:
- Batch and streaming ETL is implemented using familiar SQL and nothing else
- ETL is processing each event separately so data is available in real-time and processing is fast and efficient.
- ETL is fully stateful without an additional key-value store
- ETL runs on raw data without spending any time on structuring into a table model
Creating the Dataset
Okay, enough preamble – let’s show you how it’s done. Here’s the SQL we’re running in Upsolver to create our labeled dataset:
WITH click_data AS ( SELECT data.id AS imp_id, LAST(data.click_timestamp) AS click_time FROM Clicks GROUP BY data.id ), user_data AS ( SELECT data.exch_user AS exch_user, COUNT(*) AS num_clicks_90d, LAST(data.click_timestamp) AS last_click_time, LAST(data.user_id) AS device_id FROM Clicks GROUP BY data.exch_user WINDOW 90 DAYS ) SELECT data.win_timestamp AS win_time, // Impression data data.id AS imp_id, data.campaign_id AS cmp_id, data.exch_user AS exch_user_id, IF_ELSE(click_data.click_time IS NULL, 0, 1) AS is_click, user_data.num_clicks_90d AS num_clicks_90d, user_data.last_click_time AS last_click_time, user_data.device_id AS device_id FROM Impressions LEFT OUTER JOIN click_data WAIT 10 MINUTES ON click_data.imp_id = data.id LEFT OUTER JOIN user_data ON user_data.exch_user = data.exch_user
As you can see, we’ve done everything with regular ANSI SQL except two instances where we relied on proprietary SQL syntax:
- WINDOW is used to signify the time window for historical event data we want to query in our Amazon S3 data lake
- WAIT is used to delay the join operation, addressing the challenge of events arriving at different times
Using nothing but SQL that any developer and analyst can immediately understand, we’ve created a dataset that is updated in near real-time with both the latest events, and 90 days worth of historical data. No Spark jobs, no convoluted architectures – write SQL and you’re good to go.
This data is immediately ready to be queried in tools such as Amazon Athena and queries will run super-fast since Upsolver optimizes the underlying storage on S3 (we’ve covered this in our previous article on data preparation for Athena)
Got any questions or feedback? We’d love to hear it. Get in touch with us today if you’d like to learn more, or try Upsolver for yourself in our 14 day free trial; or you can learn more about how Upsolver works by checking out our technical whitepaper.