How to Monitor Network Device Uptime so You’re Not “Driving Blind”

Reading and processing remote network device data typically incurs a high I/O cost during the ETL process.  That’s because the data tends to consist of an extremely high volume of small events, often from a large number of devices and device types – network devices (routers, switches, access points, and so on) and IoT endpoints (sensors on factory equipment (IIoT), energy infrastructure, or even smartphones and fitness devices).

The Perils of Real-Time Streaming ETL

In a typical ETL process, there are discrete steps for:

  1. Data ingestion
  2. Cleaning
  3. Transformation
  4. Loading into a target system

If you’re dealing with real-time streaming data, there’s even more complexity: when a job fails, you must recreate the stream, process it offline, and sync the results back into the system.

Meanwhile, device failure can result in downtime, loss of productivity, or even security risks. Hardware issues and errors can bring your entire infrastructure to a halt and cause heavy losses to your business. If you don’t track the state of your remote devices, you’re basically driving blind – headlights off on a dark road.

You can implement a complex solution using multiple technologies and systems.  For example, to ingest real-time events, you can connect Apache Kafka to Apache Spark for ingestion, store the data in Amazon S3, run a separate set of jobs to transform and maintain the data, and employ Apache Airflow to orchestrate all of these jobs.  For streaming data, you’d add Apache Flink or Spark Streaming as well.  But most of these tools require their own specialized skill set, adding more complexity and cost, and making the project harder to accomplish with your existing team and resources.

Managing that complexity, plus the time required to build and test all the various failure scenarios for a streaming system, may not be worth the tremendous effort.  As a result, companies often decide to put off their data project, meaning they don’t get business value out of their network or IoT data.  They’ll store it, but the data just sits unused in cheap lake storage because it’s too complicated to exploit.

And it’s that complexity that Upsolver SQLake aims to address.

Data Points for a Network Device Monitoring Solution

Let’s look at how Upsolver SQLake approaches this issue.  

SQLake handles the complexities of sessionizing network device data with a built-in function called DYNAMIC_SESSIONS.  The DYNAMIC_SESSIONS function helps you reduce the amount of code and the number of reads on the source data.  Here’s how you can employ it to build a sample solution using just SQLake – no additional tools or technologies:

Assume a hypothetical network with n devices, each of which generates an event every 5 minutes.

You must capture every event while continuously monitoring the data streams.  If you don’t receive an event from a device for 3 consecutive time intervals (15 minutes total), the system must flag the device as “down,” issue an alert, and generate a corresponding signal.  Despite the intuitive description, this is a complex problem, especially as data volume grows.  At scale, it becomes increasingly difficult, time-consuming, and risky to monitor the network manually and react quickly when systems are flagged as offline.
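
Before jumping into the SQLake implementation, it helps to see the detection rule itself.  The following is a minimal sketch in generic SQL; the table and column names (device_events, device, event_time) are hypothetical, purely for illustration – the SQLake solution below uses DYNAMIC_SESSIONS rather than a query like this:

/* Illustration only: flag devices that have not sent an event in the last 15 minutes */
SELECT device,
       MAX(event_time) AS last_seen
FROM device_events
GROUP BY device
HAVING MAX(event_time) < CURRENT_TIMESTAMP - INTERVAL '15' MINUTE;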

Using SQLake to solve for network device monitoring at scale

Proposed architecture for network device monitoring

In the above architecture, device event data can come in as a set of files stored on S3 by a network device manager such as AWS IoT SiteWise or as a stream of events captured by Amazon Kinesis Data Streams or Apache Kafka.  Data is then processed by SQLake in a streaming manner and the results are stored in your data lake or data warehouse.

For this use case, we assume that events are stored in S3 and that we want to write the results into the data lake, so they can be queried with Amazon Athena.

Using SQLake, you can automatically monitor your network devices in 4 simple steps:

  1. Create a source table that receives events every 5 minutes
  2. Create a materialized view to sessionize each device
  3. Determine which devices were offline during a predetermined period
  4. Calculate total uptime per device and output to a data lake table for reporting

How to monitor your network devices

Before you begin, please be sure to sign up for SQLake for free.  No credit card is required.

Step 1: Create a source table that receives events every 5 minutes

Sample source structure:

  • Device name
  • Timestamp
  • Attribute 1
  • Attribute 2
  • And so on
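
For illustration, a raw CSV event might look like the row below.  The column names match those used in the queries that follow (device, att1, att2, dt_updated); the values are invented:

device,att1,att2,dt_updated
device_1,13,74,2022-12-05 10:05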

The following code first creates the staging table in the data lake. Then it creates a job that reads from the source, converts the file format, and partitions the output before loading it into the target table:

/* Create a staging table in the data lake; columns are detected automatically from the data */
CREATE TABLE default_glue_catalog.<DB_NAME>.event_raw_data()
    PARTITIONED BY $event_date;
 
/* Create an ingestion job that copies raw CSV files from S3 into the staging table */
CREATE JOB event_staging_job
    CONTENT_TYPE = CSV
    AS COPY FROM S3 Upsolver_SA_S3 BUCKET = 'upsolver-samples' PREFIX = 'demos/iot/'
    INTO default_glue_catalog.<DB_NAME>.event_raw_data;
 
/* Inspect the raw data in the staging table */
SELECT device, att1, att2, dt_updated FROM default_glue_catalog.<DB_NAME>.event_raw_data;

Query result:

Step 2: Create a materialized view with device session data

The materialized view uses a built-in function, DYNAMIC_SESSIONS, to group each device’s events into sessions, starting a new session whenever the gap between consecutive events exceeds the given threshold – in this case 15 minutes. It then aggregates the results by the device ID (the device column) and returns a single row per device.

Code example to create the materialized view:

CREATE MATERIALIZED VIEW default_glue_catalog.<DB_NAME>.event_rollup_sessions_lookup
AS
SELECT
    device AS device,
    MIN(timestamp_cast) AS first_seen_date,
    MAX(timestamp_cast) AS last_seen_date,
    DYNAMIC_SESSIONS(unix_timestamp / 1000, (15 * 60)) AS sessions_15m
FROM default_glue_catalog.<DB_NAME>.event_raw_data
/* LET derives helper columns: parse the raw timestamp string, then convert it to epoch milliseconds */
LET timestamp_cast = PARSE_DATETIME(dt_updated, 'yyyy-MM-dd HH:mm'),
    unix_timestamp = TO_UNIX_EPOCH_MILLIS(timestamp_cast)
GROUP BY device;
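
Note the shape of the sessions_15m column, since the next step relies on it: DYNAMIC_SESSIONS returns an array of session records, each with a startTime and endTime in epoch seconds (the input above is epoch milliseconds divided by 1,000).  A device that never went quiet for more than 15 minutes has a single session; each outage starts a new one.  A row in the materialized view might look roughly like this (values are illustrative only):

device: device_1
first_seen_date: 2022-12-05 10:00
last_seen_date: 2022-12-05 12:00
sessions_15m: [{startTime: 1670234400, endTime: 1670238000}, {startTime: 1670240700, endTime: 1670241600}]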

Step 3: Determine devices that did not respond in a given period of time

In this step you create a job that joins the raw event data with the materialized view containing each device’s session start and end times. The job then calculates the duration of each session (representing device uptime), the first and last time the device was detected, and the most recent attributes the device published.

Note: If a given device has more than 1 entry in the table, it means the device was offline for more than 15 minutes – that is, it failed to send an event for 3 consecutive 5-minute periods.

First create an output table to store the results. Then create a job to calculate session duration and aggregate results:

CREATE TABLE default_glue_catalog.<DB_NAME>.event_logs_flatten_sessions();
 
CREATE JOB event_logs_flatten_sessions
    START_FROM = BEGINNING
    ADD_MISSING_COLUMNS = TRUE
    RUN_INTERVAL = 1 MINUTE
    AS INSERT INTO default_glue_catalog.<DB_NAME>.event_logs_flatten_sessions MAP_COLUMNS_BY_NAME
SELECT
    r.device AS device,
    FORMAT_DATETIME(FROM_UNIXTIME(s.sessions_15m[].startTime), 'yyyy-MM-dd HH:mm:ss') AS s_sessions_15m_starttime,
    FORMAT_DATETIME(FROM_UNIXTIME(s.sessions_15m[].endTime), 'yyyy-MM-dd HH:mm:ss') AS s_sessions_15m_endtime,
    /* Session duration in minutes (session timestamps are in epoch seconds) */
    (s.sessions_15m[].endTime - s.sessions_15m[].startTime) / 60 AS session_minutes,
    LAST(att1) AS att1,
    LAST(att2) AS att2
FROM default_glue_catalog.<DB_NAME>.event_raw_data r
LEFT JOIN default_glue_catalog.<DB_NAME>.event_rollup_sessions_lookup s ON s.device = r.device
/* Process only events committed in the current run window, with a 1-minute lookback */
WHERE $commit_time BETWEEN run_start_time() - PARSE_DURATION('1m') AND run_end_time()
GROUP BY r.device,
    FORMAT_DATETIME(FROM_UNIXTIME(s.sessions_15m[].startTime), 'yyyy-MM-dd HH:mm:ss'),
    FORMAT_DATETIME(FROM_UNIXTIME(s.sessions_15m[].endTime), 'yyyy-MM-dd HH:mm:ss'),
    (s.sessions_15m[].endTime - s.sessions_15m[].startTime) / 60;
 
/* Query to see output */
SELECT * FROM default_glue_catalog.<DB_NAME>.event_logs_flatten_sessions;

Output result:
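
Per the note above, any device with more than one session row was offline at some point.  To list only those devices, a quick follow-up query against the same table might look like this (a sketch, not part of the pipeline itself):

/* Devices with more than one session row missed at least 3 consecutive 5-minute intervals */
SELECT device, COUNT(*) AS session_count
FROM default_glue_catalog.<DB_NAME>.event_logs_flatten_sessions
GROUP BY device
HAVING COUNT(*) > 1;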

Step 4: Determine the total uptime per device

We create another job to calculate the uptime for each device.  The job returns total uptime as a fraction of the time between the first and last event seen.  If the value is 1 (equivalent to 100%), the device is functioning and has been sending events continuously during each 15-minute window.  A value less than 1 indicates the device was offline at some point; for example, a device whose first and last events are 120 minutes apart, but whose sessions total only 105 minutes of activity, has an uptime of 105 / 120 = 0.875.

Create the target table, including a primary key on the device ID. Then create the job that calculates the device uptime and updates the target table. If a row with the given device ID already exists, SQLake automatically updates its values with the most recent ones. Otherwise it inserts a new row.

CREATE TABLE default_glue_catalog.<DB_NAME>.event_logs_device_uptime(
    device string)
PARTITIONED BY device
PRIMARY KEY device;
 
CREATE JOB event_logs_device_uptime
    START_FROM = BEGINNING
    ADD_MISSING_COLUMNS = TRUE
    RUN_INTERVAL = 1 MINUTE
AS
    MERGE INTO default_glue_catalog.<DB_NAME>.event_logs_device_uptime AS target
    USING (
    SELECT
        r.device AS device,
        r.att1 AS att1,
        r.att2 AS att2,
        s.first_seen_date AS first_seen_date,
        s.last_seen_date AS last_seen_date,
        /* Total session minutes divided by total minutes between first and last event */
        ((ARRAY_SUM(s.sessions_15m[].endTime - s.sessions_15m[].startTime) / 60.0) / DATE_DIFF('minute', s.first_seen_date, s.last_seen_date)) AS uptime
    FROM default_glue_catalog.<DB_NAME>.event_raw_data r
    LEFT JOIN default_glue_catalog.<DB_NAME>.event_rollup_sessions_lookup s ON s.device = r.device
    WHERE $commit_time BETWEEN run_start_time() AND run_end_time()
    ) AS source
    ON (source.device = target.device)
    WHEN MATCHED THEN REPLACE
    WHEN NOT MATCHED THEN INSERT MAP_COLUMNS_BY_NAME;
 
/* Query the output table */
SELECT * FROM default_glue_catalog.<DB_NAME>.event_logs_device_uptime;

Output result:

You can see from the results above that two of the devices have gone offline, prompting further investigation.
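
From here, alerting is just a query away.  As a sketch, an Athena-style query that surfaces unhealthy devices might look like the following – the 0.95 threshold is an arbitrary example, and the table reference should match how the table is registered in your Glue catalog:

/* Example only: list devices whose uptime has dropped below an arbitrary 95% threshold */
SELECT device, uptime
FROM default_glue_catalog.<DB_NAME>.event_logs_device_uptime
WHERE uptime < 0.95
ORDER BY uptime;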

Summary: Using SQL to Track Your Network Devices

In this post, you learned how to use SQLake to ingest raw network or IoT device events and sessionize them with the built-in DYNAMIC_SESSIONS function, which enables you to calculate the uptime and availability of your network devices. With these KPIs available in your data lake, you can easily trigger alerts or include them in reports.

SQLake provides a scalable, simple-to-manage, and cost-effective streaming and batch data processing platform that enables you to develop ingestion, transformation, and enrichment jobs and pipelines using SQL. All of the complex orchestration, data maintenance, optimizations, and application of best practices are automatically performed by SQLake.

Try SQLake for free

Try SQLake today for free for 30 days. No credit card required. Continue the discussion and ask questions in our Slack Community.
