Combining data from multiple data sources

Before combining your data sources, make sure that you have deployed Upsolver and created two data sources.

Most of our users have data from various sources, including real-time streaming data and historical data. Combining the data from these sources is essential for analytics. This guide provides an overview of how to combine multiple data sources.

Let’s get started.

 

Create an AWS Athena data output

 

1. Click on OUTPUTS on the left and then NEW in the upper-right corner.

 

 

2. Select Amazon Athena as the data output.

 

 

3. Click on Add to add as many data sources as you need. Click on Next to continue.

 

 

Transform your aggregated data

 

1. Use the + sign next to data to map your fields to the output.

 

 

2. Deselect the columns that you don’t want to output. The plus sign next to data brings in all fields unless you deselect the ones you don’t want. To add fields individually instead, click on the plus sign next to each field. For this example, we’re going to output all fields, so leave everything checked and click on ADD FIELDS.

 

 

3. Repeat the same steps for the second data source. Click on the + sign next to data and select ADD FIELDS to add all the fields from the second data source as well.

 

5. Click on the SQL tab in the upper-right corner.

6. By default, when two data sources have the same column name, Upsolver adds _<number> to the duplicated column. For example, if both data sets have a column named “id”, the output will have “id” from one data source and “id_1” from the other. Upsolver does not merge the two columns automatically because columns with the same name do not always mean the same thing. If the columns do mean the same thing and should be merged, a COALESCE statement will combine the columns from the two data sources.

 

 

7. Click on PREVIEW to review your data, then click on RUN.
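The duplicate-column naming and the COALESCE merge described in step 6 can be illustrated with a small Python sketch. This is not Upsolver code: the helper names `suffix_duplicates` and `coalesce`, the sample field names, and the assumption that each combined row carries values from only one source (with the other source's columns empty) are all hypothetical and serve only to show the idea.

```python
from collections import Counter


def suffix_duplicates(column_names):
    """Keep the first occurrence of a name; later repeats get _1, _2, ..."""
    seen = Counter()
    result = []
    for name in column_names:
        count = seen[name]
        result.append(name if count == 0 else f"{name}_{count}")
        seen[name] += 1
    return result


def coalesce(*values):
    """Return the first value that is not None, like SQL COALESCE."""
    return next((v for v in values if v is not None), None)


# Hypothetical fields mapped from two data sources; "id" appears in both.
output_columns = suffix_duplicates(["id", "name", "id", "price"])

# Assumed shape of the combined output: each row comes from one source,
# so the columns that belong to the other source are empty (None).
rows = [
    {"id": 1, "name": "alice", "id_1": None, "price": None},
    {"id": None, "name": None, "id_1": 2, "price": 9.5},
]

# Merging the two id columns into one, as COALESCE(id, id_1) would.
merged_ids = [coalesce(row["id"], row["id_1"]) for row in rows]
```

Under these assumptions, `output_columns` keeps the first “id” and renames the second to “id_1”, and `merged_ids` holds one id per row regardless of which source the row came from.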

 

 

Configure your output

 

1. Enter the run parameters for your output and click on NEXT.

 

2. Select the COMPUTE CLUSTER that you want to use. Choose the time range that you want to load your data from. For live streaming data, leave ENDING AT as Never. Click on DEPLOY.

 

 

Verify your output

 

1. Track the progress of your data output from the UI. Check your output data to make sure that everything was written correctly.
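One possible way to check the output data outside the UI is to query the Athena table directly. The sketch below assumes you pass in a boto3 Athena client. The function name `run_athena_query` is hypothetical, and any table name, database name, or S3 results location that you supply are your own values rather than anything defined by this guide.

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Run a query through a boto3 Athena client and return the data rows."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until Athena reports a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    # Only the first page of results is read here, which is enough for a
    # simple check such as a row count.
    results = athena.get_query_results(QueryExecutionId=query_id)
    rows = results["ResultSet"]["Rows"]
    # For SELECT queries the first row holds the column headers, so skip it.
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows[1:]]
```

For example, calling it with `boto3.client("athena")` and a `SELECT COUNT(*)` query against your output table returns the count as a single-cell row, which you can compare with what you expect from your data sources.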

 

 

What’s next?

Joining multiple data streams for real-time analytics.