Performing a join on multiple data streams is easy with Upsolver. This guide provides the instructions on joining a primary data source called Impressions and a secondary data stream called Clicks. Impressions have the ad campaign information, while Clicks keeps track of the number of clicks on an ad. All clicks will be associated with an ad impression, but not all ad impressions result in clicks.
1. Click on OUTPUTS on the left and then NEW on the right upper corner.
2. SELECT Amazon Athena as the data output.
3. Click on ADD to add as many data sources as you need. Click on NEXT to continue.
2. The sample SQL performs a LEFT OUTER JOIN between impressions and clicks data streams. The LEFT OUTER JOIN is creating a lookup table behind the scenes. It enable users to index data by a set of keys and then retrieve the results in milliseconds. Read more about Upsolver lookup tables here.
SELECT data.id AS impression_id,data.win_timestamp AS impression_time,data.campaign_id AS campaign_id,data.exch_user AS user_id,IF_ELSE ( click_data.click_time IS NULL, 0, 1) AS is_click //if click_time exists returns 1, else returns 0FROM ImpressionsLEFT OUTER JOIN (SELECT data.id AS imp_id, LAST(data.click_timestamp) AS click_time //last click_timestamp is the click_timeFROM ClicksGROUP BY data.id) AS click_dataWAIT 10 MINUTES ON click_data.imp_id = data.id //wait for 10 minutes before performing the join since clicks usually arrive after impressions.
2. Define the compute cluster that you would like to use and the time range of the data you would like to output. Keep in mind that setting ENDING AT to Never means it’s a continuous stream. Click on DEPLOY.
1, Check to make sure the output data is up to date by clicking on the PROGRESS tab.
2. Run a query in Athena to make sure you get the correct results.
SUM(is_click)/COUNT(*) AS CTR //divide the sum of clicks for a campaign to the total number of impressions to get the click through rate
GROUP BY campaign_id
Use Upsolver to index less data into Splunk.