Real-Time Data Streaming for AWS Data Lake

Data lakes and real-time (or near real-time) data streaming have become core components of enterprise data architecture. Data is generated by many sources, including human-generated data such as banking transactions and clickstreams (Internet browsing data), and machine-generated data such as security logs, application system logs, and IoT sensor readings. This data must be stored, monitored, and analyzed to keep the related business units and information technology systems functioning continuously and optimally.

The question, then, is how companies should handle the voluminous log files that these sources generate.

To explore this question, let’s look at the best practices for streaming data into an AWS data lake in real time, and for processing that data, by studying the following scenario.

Real-Time Data Streaming: A Case Study

Let’s assume that you are a dairy farmer with over 300 milking cows. You use an automated milking system to milk your cows twice a day, with regular intervals between milking sessions. The system manages the milking process from the time the cows walk into the milking parlor to the time they walk out, having been milked. In effect, this system is an agricultural robot that interfaces with computer hardware and software, IoT devices, and mobile devices such as smartphones and tablets. The IoT devices include automated gate openers, the milking machine, and the devices that measure and release food for the cows to eat while they are being milked. A large number of machine logs are generated and must be analyzed in real time to ensure that the milking process runs smoothly.

These logs serve two purposes. First, analyzing them in near real-time ensures that the software applications and hardware are running optimally at all times, and that any anomalies can be addressed immediately so that the system continues to function effectively and efficiently. Second, and equally important, monitoring the security logs ensures that the data remains secure at all times.

In summary, the most effective way to store, monitor, and transform these logs is to use a data streaming architecture built on top of an AWS data lake.

Data Streaming and the AWS Data Lake

AWS defines data streaming as “data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously, and in small sizes (order of Kilobytes).”

Two elements of this definition stand out:

  • Large data streams: These are records, such as application log files recording events, that are generated continuously at high volume and high velocity.
  • Real-time processing: These data streams must be processed in near real-time so that the data owner can react almost instantaneously to the streamed events.

These log files record events that have taken place and are either semi-structured or unstructured, usually as JSON or XML key-value pairs. The sheer volume of events streamed into the data lake, combined with the lack of a fixed schema, makes it impractical to analyze the raw data directly with SQL-based query and analytical tools. Consequently, the data must be parsed and structured before it can be analyzed.
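As a rough illustration of what "parsed and structured" can mean in practice, the sketch below flattens nested JSON log events into single-level rows of the kind a SQL table expects. The field names (`device_id`, `ts`, `event`) and the sample log line are invented for this example and do not come from any particular milking system.

```python
import json

def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into a single-level dict with joined keys."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat

def parse_log_line(line):
    """Parse one JSON log line into a flat row, or return None if it is malformed."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict):
        return None
    return flatten(event)

# Hypothetical raw log lines: one valid event and one malformed line.
raw_lines = [
    '{"device_id": "gate-01", "ts": "2020-09-01T05:00:00Z", '
    '"event": {"type": "open", "duration_ms": 820}}',
    "not json at all",
]

rows = [row for row in map(parse_log_line, raw_lines) if row is not None]
print(rows)
```

Malformed lines are dropped here for brevity; a real pipeline would more likely route them to a separate location for inspection rather than discard them.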

The Data Streaming Architecture

As highlighted above, streaming data includes machine logs from IoT devices, real-time advertising or clickstream data, security logs, and server logs. This data is typically challenging to work with because of its volume and its lack of structure and schema. In layman’s terms, massive amounts of unstructured data are pouring into a data lake and need to be analyzed in real time. Before the data can be analyzed, however, it must be processed, transformed, and loaded into a structured query environment.

Enter the data streaming architecture. 

The data streaming architecture can be defined as “a framework of software components built to ingest and process large volumes of streaming data from multiple sources.”

The key difference between a traditional data solution and a data streaming architecture is that the traditional solution reads and writes data in batches. The data streaming architecture, on the other hand, “consumes data immediately as it is generated, persists it to storage, and may include various additional components per use case – such as tools for real-time processing, data manipulation and analytics.”
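The contrast between the two styles can be shown with a toy calculation. In this minimal sketch, the batch function needs the complete set of readings before it produces a result, while the streaming function emits an updated result as each reading arrives. The readings (liters of milk per session) are made-up values.

```python
def batch_average(readings):
    """Batch style: wait for the full collection, then compute once."""
    return sum(readings) / len(readings)

def streaming_average(readings):
    """Streaming style: yield an updated average as each reading arrives."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count

# Hypothetical milk yields in liters for three sessions.
readings = [12.0, 14.0, 13.0]

print(batch_average(readings))            # 13.0
print(list(streaming_average(readings)))  # [12.0, 13.0, 13.0]
```

Because `streaming_average` is a generator, it works equally well on an unbounded source, which is the situation a streaming architecture is designed for.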

Consequently, the data streaming architecture must handle the unique characteristics of data streams, including their massive volumes (from terabytes to petabytes) and their lack of structure. It must therefore have the capacity to process the data and perform intensive ETL operations before the data is in a useful format.

Therefore, the ability to process these voluminous data streams requires several building blocks built on top of each other. 

  • The stream processor is the tool that collects the data from its source, translates it into a standard message format, and ensures that this data stream is continuous. 
  • This streamed data is stored in cost-effective cloud object storage.
  • Batch and data stream aggregation tools are used to enrich, structure, and transform the data from the streams so that SQL-based analytics tools can be used to analyze the data.
  • The final part of this architecture is the data analytics or serverless query engine. Tools like Amazon Athena are used to analyze this data to provide meaningful information. 
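The four building blocks above can be sketched end to end in a few lines. This is an in-memory toy under stated assumptions, not an AWS implementation: a `defaultdict` stands in for cloud object storage, a plain function stands in for the query engine, and the function names, storage key layout, device names, and events are all hypothetical.

```python
import json
from collections import Counter, defaultdict

def to_message(source, payload):
    """Stage 1: wrap a raw event in a standard message envelope."""
    return {"source": source, "ts": payload["ts"], "body": payload}

def persist(object_store, message):
    """Stage 2: append the message to a date-partitioned object (storage stand-in)."""
    key = f"logs/date={message['ts'][:10]}/events.jsonl"
    object_store[key].append(json.dumps(message))

def aggregate(object_store):
    """Stage 3: build a structured table of event counts per (source, status)."""
    counts = Counter()
    for lines in object_store.values():
        for line in lines:
            message = json.loads(line)
            counts[(message["source"], message["body"]["status"])] += 1
    return counts

def query(table, status):
    """Stage 4: filter the structured table, standing in for a SQL query engine."""
    return {source: n for (source, s), n in table.items() if s == status}

store = defaultdict(list)
events = [
    ("milking-machine-3", {"ts": "2020-09-01T05:00:00Z", "status": "ok"}),
    ("milking-machine-3", {"ts": "2020-09-01T05:00:05Z", "status": "error"}),
    ("gate-01", {"ts": "2020-09-01T05:00:07Z", "status": "ok"}),
]
for source, payload in events:
    persist(store, to_message(source, payload))

table = aggregate(store)
print(query(table, "error"))  # {'milking-machine-3': 1}
```

Keeping each stage as a separate function mirrors the layered design described above: any one stage could be swapped for a managed service without changing the others.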

Why Real-Time Data Streaming?

Note: The phrase “real-time data streaming” is a misnomer. The correct term should instead be “near real-time data streaming.”

The most important reason for implementing a near real-time data streaming solution is to ensure that critical enterprise data is processed as it is produced. Companies can no longer wait for the results of a traditional data batch processing solution. 

Let’s return to our case study for a moment. The automated milking process is crucial to the dairy farm’s success. All 300 cows must be milked at least twice a day, and the milk yield per cow depends on a regular milking schedule. If the agricultural robot breaks down and the cows are milked only once a day instead of twice, the farm’s overall milk yield goes down, and each cow’s yield per session also drops. In other words, if something goes wrong with the robotic system, the dairy manager must be informed as soon as the breakdown occurs so that it can be fixed without significantly interrupting the milking schedule.
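One simple way to notice a breakdown quickly is to check each incoming log event for an error level and to watch for devices that have gone silent. The sketch below assumes hypothetical device names, a `level` field, and a 60-second silence threshold; it uses the timestamp of the latest event as "now", whereas a live system would also check against the wall clock.

```python
from datetime import datetime, timedelta

def process_event(last_seen, event):
    """Record the event time and report whether the event itself signals an error."""
    last_seen[event["device_id"]] = event["ts"]
    return event.get("level") == "ERROR"

def find_silent_devices(last_seen, now, max_silence):
    """Return devices whose most recent log event is older than max_silence."""
    return sorted(
        device for device, ts in last_seen.items() if now - ts > max_silence
    )

# Hypothetical event stream from two devices.
start = datetime(2020, 9, 1, 5, 0, 0)
stream = [
    {"device_id": "milking-robot", "ts": start, "level": "INFO"},
    {"device_id": "feed-dispenser", "ts": start + timedelta(seconds=50), "level": "INFO"},
    {"device_id": "feed-dispenser", "ts": start + timedelta(seconds=110), "level": "ERROR"},
]

last_seen = {}
alerts = []
for event in stream:
    if process_event(last_seen, event):
        alerts.append(f"error reported by {event['device_id']}")
    for device in find_silent_devices(last_seen, event["ts"], timedelta(seconds=60)):
        alerts.append(f"no logs from {device} for over 60 seconds")

print(alerts)
```

With the sample stream, the third event raises two alerts: the feed dispenser reports an error, and the milking robot has not logged anything for 110 seconds.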

Final Thoughts 

This scenario illustrates why machine log files must be analyzed in real time or near real-time: doing so makes it possible to act immediately when anomalies or deviations from standard patterns appear.

On another level, it also describes how to analyze log files containing security events to control and prevent security breaches. This scenario has always been topical. However, in the current COVID-19 pandemic, it has become especially relevant. As Interpol describes it:

“Cybercriminals are developing and boosting their attacks at an alarming pace, exploiting the fear and uncertainty caused by the unstable social and economic situation created by COVID-19.”

Consequently, it is vital to remain proactive and monitor all logs to ensure that your SaaS client data remains secure irrespective of the global geopolitical and socioeconomic conditions. 

