Companies and individuals create a vast amount of data every day, with most of this data being unstructured. Businesses are also becoming increasingly aware of the need to analyze this data, transforming it into meaningful information used to drive strategic business decision-making.
However, partnered with this voluminous data comes the day-to-day challenge of managing and analyzing these massive amounts of unstructured data. The four fundamental modern data challenges include lack of scalability, inefficient data warehousing, dissimilar data sources, and overwhelming security and regulatory compliance needs.
There is a simple yet highly effective solution to these practical data challenges using Amazon Athena and Upsolver as the primary driver for data storage and analysis mechanisms.
Note: Upsolver.com is Amazon Athena’s official partner. The Amazon Athena web page describes Upsolver as a “data lake ETL service. It provides a visual, SQL-based interface for creating real-time tables in Athena with little engineering overhead and according to performance best practices. Upsolver’s ETL also enables updates/deletes to tables in Athena for common CDC and compliance use cases.”
As a result, by way of expanding on this point, let’s look at four parts to the solution to these data challenges.
1. Security and regulatory challenges: Handling GDPR and CCPA requests
The way global companies handle data has come under increased scrutiny. The two most prominent regulatory acts are the European Union General Data Protection Regulation (GDPR) and The California Consumer Privacy Act have introduced restrictions on the way companies collect, store, and use Personally Identifiable Information (PII). Additionally, these acts call for harsh penalties and sanctions on breaches of these regulations.
One of the most challenging aspects to deal with is the erasure requests. In other words, customers have the right to request that a company deletes all of their data, and the company must comply. While this is quite simple to achieve in a traditional relational database like Oracle or Redshift, it is difficult in the S3 data lake architecture where the data is queried by Athena.
Data is ingested and stored sequentially and in chronological order in the data lake. It is hard to modify tables or records without the use of an upsert API. The Upsolver/Athena integration has solved this issue by changing how data preparation is handled, simplifying the identification and deletion of all records that belong to a specific user ID.
Secondly, it is essential to delete all of the events that contain this user ID from the S3 buckets where the data is ingested. It is also important to note that this data cleansing process runs in parallel with all other methods, so it does not slow down the day-to-day data ingestion, management, and analysis processes.
Lastly, is it necessary to delete every record linked to the specific user ID in the S3 data lake, but it is also critical to set up a continuous process that deletes all records related to this User ID as the data is written from Kinesis into the S3 bucket.
2. Lack of scalability: Providing users with information based on their unique records
A significant part of any software application or data architecture is the built-in reporting function. This feature is often described as embedded analytics, or the customer-facing dashboards and analytic capabilities built into the software, providing customers with the ability to view analytical reporting based on the data ingested by the application.
The challenge that software providers face is that it is incredibly challenging to run efficient queries on this voluminous data where there is no data mingling. The data must be available in near real-time, and the software application remains performant.
To describe the solution to this problem, let’s consider the following scenario. Let’s assume that our client is a digital marketing company that places PPC adverts on different publisher sites for its clients (advertisers). Apache Kafka is the message broker that collects the clickstream data in real-time and writes it to the S3 data lake.
The caveat here is that the client would like each advertiser to view a dashboard of the overall advertisement performance per publisher. Athena is used to query this data and send it to each advertiser’s dashboard.
The two possible challenges of concern right at the outset of this solution are the potential reduction in processing speed or a reduction in the optimal or performant processing speed. The second challenge is the potential for data contamination between advertisers. These challenges are because Athena is querying the same voluminous data that is only increasing exponentially because of the additional data flowing into the data lake every day.
Succinctly stated, the Upsolver/ Athena integration solves this problem. Upsolver partitions customer data into its own data partition in the Amazon data lake. This partitioning is logical rather than physical. The physical files or data remain in the S3 data lake, and the AWS Glue Data Catalog is to create logical partitions, pointing the data to its mapped partition.
Finally, the logical way to partition this data is by time. However, there is a second requirement, partitioning the data by customer ID. The Upsolver self-service data lake ETL creates custom partitions for each report, logically separating the data by timestamp and customer ID.
3. Detect and prevent security and intrusion: Analyze VPC flow logs
Amazon VPC logs are log files that capture information about the IP traffic that moves between the resources contained within a Virtual Private Cloud (VPC). Flow logs are the main source of information about the network traffic within and to and from the VPC. These flow logs are published to the Amazon S3 data lake or Amazon CloudWatch Logs. After the flow logs have been created, it is possible to retrieve and view the data by analyzing the logs.
When analyzed, these logs are designed to help with security by detecting network anomalies. You can create alarms indicating unauthorized IP addresses and unauthorized destination port redirections. Additionally, flow logs are useful when troubleshooting connection issues as well as detecting threats by monitoring port scanning.
Before beginning the analysis process, you must access the VPC flow logs. The best way to access them is by publishing the flow logs to Amazon S3. You may use Upsolver to enrich and process raw logs on S3. Athena can be utilized to query the transformed data on S3. In addition, a subset of “hot” data can be sent to the log analytics tool of your choice.
4. Analyzing CloudWatch logs with Amazon Athena
Amazon CloudWatch is a “monitoring service for AWS cloud resources and the applications you run on AWS.”
CloudWatch provides basic analysis capabilities as well as the ability to view the log files. However, it is often necessary to perform additional data operations that are not supported by CloudWatch, such as aggregations, cleansing, SQL querying. Although Amazon AWS has tools and databases, including Redshift, Elasticsearch, and CloudWatch, that can perform these data operations; Amazon Athena is best suited for the job.
There are three top reasons for reaching this decision.
- Optimum SQL access: Athena provides analysts with the ability to query the data using ANSI or standard SQL.
- Joins and enrichments: Exporting the log files from CloudWatch creates the ability to enrich and transform the data.
- Reduce costs: Athena’s serverless architecture translates into inexpensive S3 data lake storage rather than costly database storage.
Before querying the log files in Athena, a robust, automated pipeline must be constructed to continuously deliver data from CloudWatch to Athena while optimizing the data for performance. Consequently, the best way, or the recommended solution to build the data pipeline, is to use Upsolver to optimize the S3 data using the self-service and the “SQL-based data transformation interface, and to create structured tables that our analyst can immediately query in Athena.”
In summary, the recommended data architecture is as follows.
- Use CloudWatch to monitor multiple EC2 machines.
- Export and store these log files in the S3 data lake.
- Define an S3 data source in Upsolver.
- Configure an aggregated Athena output in Upsolver, partitioned by time and aggregated by billed-duration.
- To keep the cloud costs low, use Upsolver to configure retention on your data stored in S3 only to be stored for as long as you need it.
- Lastly, run the queries in Athena.
Directly and indirectly, the Upsolver/Amazon Athena integration provides the answers to all four practical data challenges highlighted at the outset of this article.
The S3 data lake and its integration with Amazon, using Upsolver, provides a scalable solution that can store and analyze real-time streaming data such as clickstream data and log files like VPC flow logs and CloudWatch monitoring logs.
Secondly, the inefficient data warehousing challenge and data hosted in different data sources is solved by ingesting data from the various data sources into the S3 data lake. Complex ETLs are not necessary to analyze the data. Amazon Athena interfaces with the Glue Data Catalog and uses ANSI SQL to write queries that analyze structured and unstructured data.
Lastly, the Upsolver/Athena combination resolves the security and regulatory compliance requirements, including the complex requirement to erase all PII data per user ID.