Data Processing and ETL Software for Cloud Data Lakes

How SQLake Makes it Easy to Parse User Agent Strings

Many applications generate and track events of website and application traffic. Included in these events is a user agent string that contains information about the user’s web browser, operating system, and other bits of information. This string often needs to be parsed so that specific information can be extracted for tracking and reporting purposes.

Upsolver SQLake provides a useful function called USER_AGENT_PARSER that enables you to easily parse user agent strings and extract various attributes, such as browser, device type, and operating system.

The USER_AGENT_PARSER function takes two parameters:

an attribute (or array of attributes) we wish to extract
a user agent string to be parsed.

USER_AGENT_PARSER(attributes1[, attributes2, …, attributesN], “User Agent String”)

In SQLake you use this function with a transformation job as shown below:

CREATE SYNC JOB parse_user_agent
    RUN_INTERVAL = 1 MINUTE
    ADD_MISSING_COLUMNS - TRUE
AS INSERT INTO catalog.prod_db.user_agent_parsed MAP_COLUMNS_BY_NAME
SELECT
  user_agent as user_agent,
  USER_AGENT_PARSER('Device Type', user_agent) as device_type
FROM catalog.stage_db.raw_data
WHERE $event_time BETWEEN RUN_START_TIME() AND RUN_END_TIME();

The output produced by this job would look something like the following::

user_agent	device_type
“Mozilla/5.0 (iPad; CPU OS 9_3_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13F69 Safari/601.1 9 Very common”	“Desktop”
“Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; LG-L38C Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1 MMS/LG-Android-MMS-V1.0/1.2”	“Mobile”
“Dalvik/1.6.0 (Linux; U; Android 4.4.4; WT19M-FI Build/KTU84Q) Common”	“Tablet”

Using SQLake to Parse User Agent Strings

To run it:

Create a connection to Amazon S3 with sample data
Create a staging table and job to ingest raw data
Create a target table to hold enriched data
Create transformation job that uses USER_AGENT_PARSER to enrich the data

Let’s walk through these steps

1. Create a connection to Amazon S3 with sample data

To help demonstrate how to use USER_AGENT_PARSER to extract multiple attributes from a user agent string, such as browser and device type, we will use sample data hosted by Upsolver and made available through S3.

Create a connection to S3. The IAM role in the following snippet is provided for you to access the data in S3.

CREATE S3 CONNECTION upsolver_s3_samples
   AWS_ROLE = 'arn:aws:iam::949275490180:role/upsolver_samples_role'
   EXTERNAL_ID = 'SAMPLES'
   READ_ONLY = TRUE;

2. Create a staging table and job to ingest raw data

Create a staging table in the data lake to hold the raw data. This step is simple and you don’t need to know the raw schema; SQLake automatically detects it and updates the AWS Glue Data Catalog.

Create the staging table in the AWS Glue Data Catalog. Note that you must update the <db_name> to use your database name.

CREATE TABLE default_glue_catalog.<db_name>.sales_info_raw_data()
PARTITIONED BY $event_date;

We’re using the $event_date system column to partition the staging table. This column contains the original date the event was generated (if available in the event), or one that SQLake auto-generates when the event is ingested.

Once the table is created, create an ingestion job that copies the raw events from the source S3 bucket into the target table. SQLake converts the data to columnar format, partitions it, and manages it over time to ensure best query performance.

Create the ingestion job to copy raw data into the target table.

CREATE SYNC JOB load_sales_info_raw_data_from_s3
  CONTENT_TYPE = JSON
  AS COPY FROM S3 upsolver_s3_samples
     BUCKET = 'upsolver-samples'
     PREFIX = 'sales_info/'
  INTO default_glue_catalog.<db_name>.sales_info_raw_data;

The following screenshot shows the execution of the ingestion job. In the console, SQLake reports the progress and notifies you when the data is available to query.

You can easily query the staging table directly from the SQLake worksheet. For example:

SELECT saleinfo.web.useragent
FROM default_glue_catalog.<db_name>.sales_info_raw_data
WHERE saleinfo.source = 'Web' LIMIT 10;

The following screenshot shows the results of the query. You can see the useragent string as part of the salesinfo.web nested field. We will use this in the next step.

3. Create a target table to hold the enriched data

Now that the raw data is staged, create the target table in the AWS Glue Data Catalog to hold the enriched data.

Create the target table:

CREATE TABLE default_glue_catalog.<db_name>.sales_user_agent_enrichment (
  partition_date date
)
PARTITIONED BY partition_date;

Notice we’re defining partition_date column as the partition column for this table. This column will get its value from the $event_time column.

4. Create a transformation job that uses USER_AGENT_PARSER to enrich the data

This is the step where the interesting work is performed. We create a transformation job that uses the USER_AGENT_PARSER function to extract multiple attributes from the salesinfo.web.useragent field.

Create the transformation job:

CREATE SYNC JOB sales_user_agent_enrichment
   START_FROM = BEGINNING
   RUN_INTERVAL = 1 MINUTE
   ADD_MISSING_COLUMNS = TRUE
AS INSERT INTO default_glue_catalog.<db_name>.sales_user_agent_enrichment MAP_COLUMNS_BY_NAME
SELECT
   orderid as orderid,
   saleinfo.web.useragent as useragent,
   USER_AGENT_PARSER('Browser',saleinfo.web.useragent) as browser,
   USER_AGENT_PARSER(['Device Brand Name'
                       , 'Device Code Name'
                       , 'Device Pointing Method'
                       , 'Device Type'
                       , 'Device Maker'
                       , 'Device Name'],saleinfo.web.useragent) as device,
   USER_AGENT_PARSER('Crawler',saleinfo.web.useragent) as is_crawler,
   USER_AGENT_PARSER(['Platform Maker','Platform','Platform Version'],saleinfo.web.useragent) as platform,
   $event_date as partition_date
FROM default_glue_catalog.database_e809da.sales_info_raw_data
WHERE saleinfo.source = 'Web' AND
   $event_time BETWEEN RUN_START_TIME() AND RUN_END_TIME();

By executing the previous code in SQLake, the job is created. After a couple minutes the data is fully transformed and ready to query. The following screenshot shows the execution of the job.

Query the target table to see the results of the enrichment:

SELECT *
FROM default_glue_catalog.<db_name>.sales_user_agent_enrichment
LIMIT 50;

The output of the query looks similar to the following screenshot:

Summary

That’s it!!

In just a few minutes and with several lines of SQL you were able to ingest data from S3 into the data lake, extract device, platform, and browser information using the USER_AGENT_PARSER function, and produce an enriched dataset you can use for analytics, reporting, auditing, and alerting.

SQLake makes it easy to work and extract meaning from data quickly. You can find more useful functions in the documentation page.

Get started today for free with sample data or bring your own.

Parse Information from Web and App Traffic Events

How SQLake Makes it Easy to Parse User Agent Strings

Using SQLake to Parse User Agent Strings

1. Create a connection to Amazon S3 with sample data

2. Create a staging table and job to ingest raw data

3. Create a target table to hold the enriched data

4. Create a transformation job that uses USER_AGENT_PARSER to enrich the data

Summary

Discover how companies like yours use Upsolver

Data preparation for Amazon Athena.

Building a multi-purpose data lake.

Database replication and CDC.

Real-time machine learning.

Start for free - No credit card required

Batch and streaming pipelines.

Get Started Now

Templates

All Templates