Data is a valuable currency in the modern digital economy. However, it is still a challenge to keep pace with the rapid changes in enterprise data and the growing demand for information; thereby, providing the raison d’etre for the movement of data from legacy mainframe server-type infrastructures to the cloud to scale data-driven decision-making.
Experts agree that cloud-based technologies are integral to ensuring data security, management, process compliance, and privacy across the entire enterprise data spectrum. However, there is one ongoing conundrum or debate, and that is how to process data faster with particular reference to streaming data versus batch data.
Consequently, let’s open this discussion with a brief definition of, and comparison between, streaming data and batch data.
Differences Between Streaming Data and Batch Data
1. Batch data
Succinctly stated, batch data processing is the processing of large data volumes in batches or groups. This data typically consists of millions of daily records, accumulated and stored in a central location before being processed.
In other words, as explained by the Upsolver.com team, batch processing is the mechanism used by users to “collect and store data in batches during a batch window. This can save time and improves the efficiency of processing the data and helps organizations and companies in managing large amounts of data and processing it quickly.”
Ergo, it is a highly effective method of processing sequential records collected over time in a non-stop, sequential order.
2. Streaming data
The counterbalance to batch data is streaming data. It is the mechanism by which data is analyzed almost instantaneously or continuously. This analysis occurs as the data flows through the system with no time limitations on the output. It is highly effective and beneficial if the events that produce this data occur frequently and occur at short intervals.
The Upsolver.com article quoted above adds value to the “what is streaming data” discussion by stating the following.
“Streaming allows you to stream events from any source, and it helps you make changes to the data while you’re on the run.”
The SQL WINDOW function and its advantages over traditional SQL aggregation functions
A SQL WINDOW function is an aggregate function that performs a calculation across a set of table rows related to the current row. This is similar to the traditional aggregate function where according to the Microsoft Transact SQL documentation,
“An aggregate function performs a calculation on a set of values and returns a single value. Aggregate functions are often used with the GROUP BY clause of the SELECT statement.”
However, the SQL WINDOW function takes the basic aggregate function further in that it does not group rows into a single output row. The rows retain their separate identities. It is also worth noting that the WINDOW function and the SQL aggregation function cannot simultaneously be used in the same SQL query.
By way of expanding on this definition, let’s look at the difference between the aggregate and WINDOW function, using the following FOREX data on currency exchange rates.
This table contains the closing prices for the currency pairs, GBP/USD, and EUR/USD. Let’s look at the SQL statement used to calculate the average closing price for each currency pair.
SELECT CurrencyPair, AVG(Close) as average_price FROM CURRENCYTRADE GROUP BY CurrencyPair;
Two rows or a single row output for each currency pair are returned consisting of the currency pair name and the average closing price for the currency pair.
|Currency Pair||Average Price|
The WINDOW function’s output is different. Let’s use the same base data as described above and look at the SQL statement and the outcome.
SELECT *, AVG(Close) OVER (PARTITION BY CurrencyPair) AS average_closing FROM CURRENCYTRADE;
|Currency Pair||DateTime||Close||Average Closing|
There are considerable advantages to using the WINDOW function over the traditional aggregation function. The SQL code is simpler and easier to maintain, and it is much simpler to assign a value to a current row from a previous or successive row. This is a faster option than creating cursors and subqueries.
However, its most important benefit is found in its fundamental difference. Ergo, the primary difference between the aggregation function and the WINDOW function is that the rows in the WINDOW function’s result set are not collapsed. This construct is beneficial as information because it displays both the aggregated information and the single rows simultaneously.
Breaking the traditional mold: The SQL WINDOW function in stream analytics
The SQL WINDOW function has been adopted by data streaming experts to provide similar information. It is very useful for streaming data analytics.
Nonetheless, the principle is more complicated. Let’s look at a step-by-step guide to creating and implementing this function on streamed data.
Note: Data stream-driven architecture produces individual records known as events. Users can subscribe to the events in a particular service and will receive the analytical information as soon as it is available. And, because every service has near-real-time access to the data streamed events, they do not have to wait for ETL processes to finish.
Because streams are by definition unbounded, creating aggregated views is challenging without introducing boundaries.
Enter windowing: a mechanism to group the streamed events based on common event attributes. As an aside, the most common characteristic used to group events into windows is time. Also, a single event might be assigned to multiple windows. For instance, the output of event counts based on overlapping windows will differ from the output of event counts based on distinct windows.
There are four different, commonly used window functions.
1. The Tumbling Window
Each tumbling window has a fixed length and is placed consequently in a row on a time axis. Tumbling windows do not overlay each other. In other words, each event is only assigned to one window, and each window fits on the time axis one after the other.
Tumbling windows are designed using one attribute: time. Consequently, each window only contains a single event, providing a robust use case for simple reporting. For instance, tumbling windows produce useful results when asked to aggregate all the number of website clicks, based on the visitor’s country of origin, and grouped in a 20-second tumbling window.
The SQL code is as follows.
SELECT Country, Count (*) AS VisitorCount FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, TumblingWindow(second, 20);
2. The Hopping Window
The hopping window also has a fixed length. However, it has two configuration parameters (the time s and the hop size h) instead of just one (s) as with the tumbling window. Thus, instead of moving the window forward by s, it is driven forward by h.
Actually, tumbling windows are a form of hopping windows where s = h. And the challenge is that if s > h (overlapping windows) and if s < h, some events might not be assigned to a window. This is a point to take note of when implementing the hopping window construct. The most common use case for hopping windows is to calculate moving averages.
Let’s reuse the example cited in the tumbling window; except, we want to see the moving average of the number of website clicks, based on the visitor’s country of origin, grouped in a 20-second window hopping 2 seconds.
SELECT Country, AVG (*) AS Average_Clicks FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, HoppingWindow(second, 20, 2);
3. The Sliding Window
Sliding windows are hopping windows, where h 0. In other words, this window does not hop by a time interval; it slides along. It moves along the time axis as the tumbling window does, but it groups together events within the window length s. In other words, the input stream is discrete, but the subsequent, aggregated stream is not. Because the data points are discrete, it is possible to implement a sliding window based on actual events rather than continuous time. The next window is constructed whenever an event enters or exits in the forward-moving sliding window.
As with hopping windows, sliding windows are used to compute moving averages. Their uniqueness is that they provide a resolution based on the event time pattern in the data stream. If the event density increases, the moving average will increase. Juxtapositionally, if no new events are entering the sliding window, the moving average stays the same.
This SQL statement is similar to the hopping window’s SQL statement.
SELECT Country, AVG (*) AS Average_Clicks FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, SlidingWindow(second, 20);
4. The Session Window
The session window function is the only window that is not a fixed length; it is an ever-changing length. Therefore, when using this function, you must enumerate the maximum time threshold between successive events. This window can be implemented by keeping current events in a buffer and adding new events as long as they are within the session interval. The challenge here is that, because streams are unbounded, these sessions can grow indefinitely. Consequently, some sessions take a second parameter, the maximum session time, or the maximum number of events per session.
A typical use case for this window is to group together related events when they happen close together. For example, session windows are used to group website clicks within a single user session. This SQL statement counts the total number of website clicks at a countrywide level grouped in a five-second interval session window lasting at most 20 seconds.
SELECT Country, Count (*) AS VisitorCount FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, SessionWindow(second, 5, 20);
Why use the WINDOW function in data streaming
The Amazon AWS web documentation provides us with the justification for using the WINDOW function in data streaming.
“By using WINDOW functions, you can enable your users to create analytic business queries more efficiently.”
The WINDOW function is a simple and elegant way to analyze streamed data in near-real-time.
As described above, the events found in a data stream are discretized into groups of events, known as windowing. And the WINDOW function returns results for each window. In other words, a window can be understood as a database row in the FOREX currency pairs data example cited above. And the WINDOW function provides the same benefits in data streaming that it does when used to analyze batch data. Ergo, the result set comprises both aggregated event data and data about individual events.
As described at the start of this article, data is an extremely valuable currency in the digital economy. Companies of all sizes and industries can, and do, generate massive amounts of data that must be analyzed near-real-time to provide useful information that is used to make the correct business decisions that will drive the organization forward. Therefore, the almost-instant analysis of these data streams is vital to organizational success.