Amazon Athena is a serverless “interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.” It is also one of the fast-growing services in the Amazon cloud, widely used for ad-hoc querying of structured and semi-structured data.
What is possibly less widely known is that, on 13 November 2020, Amazon posted a press release stating that the original Athena engine had been upgraded to Athena engine 2.
Additionally, Upsolver, a cloud-native data engineering service with a unique Athena integration, is an official Amazon Athena partner. Notably, Upsolver is the only Amazon Athena partner.
Thus, the question is: what do this partnership and the Athena engine 2 upgrade bring to the table for existing and new Amazon data lake and Athena clients?
To answer this question, let’s consider the following points.
Athena Engine 2 Upgrade: What, Why, and How?
As described at the outset of this discussion, Amazon Athena is a serverless interactive query service providing users with the ability to analyze data using standard SQL. There are a few additional benefits to using Athena.
- Because it is serverless, there is no server infrastructure to manage, saving time and money. It decouples compute from storage.
- You only pay for the queries you run. There is no monthly subscription fee.
- It is straightforward to use.
- It is not necessary to perform complex ETL operations to prepare the data in the S3 data lake.
- You simply point the Athena engine at the S3 data lake, define the schema, and run standard SQL queries against this structured or semi-structured data.
- It makes it easy for analysts or engineers with SQL skills to query large datasets quickly and efficiently.
- Finally, Athena is scalable.
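The workflow described in the list above (point Athena at S3, define the schema, run SQL) can be sketched in Python as follows. This is a minimal sketch rather than an official recipe: the bucket, database, table, and column names are all hypothetical, and `run_query` assumes the `boto3` package and valid AWS credentials are available. The request fields mirror the parameters of Athena's `StartQueryExecution` API.

```python
# Hypothetical names throughout: replace with your own bucket, database, and table.
DATA_LOCATION = "s3://example-data-lake/events/"
RESULTS_LOCATION = "s3://example-athena-results/"
DATABASE = "example_db"

# Step 1: define a schema over files that already sit in S3 (no load step).
CREATE_TABLE_SQL = f"""
CREATE EXTERNAL TABLE IF NOT EXISTS events (
    user_id string,
    event_type string,
    event_time timestamp
)
STORED AS PARQUET
LOCATION '{DATA_LOCATION}'
"""

# Step 2: query the data with standard SQL.
COUNT_BY_TYPE_SQL = """
SELECT event_type, COUNT(*) AS event_count
FROM events
GROUP BY event_type
ORDER BY event_count DESC
"""


def build_query_request(sql, database, output_location):
    """Build the keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql.strip(),
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(sql, database=DATABASE, output_location=RESULTS_LOCATION):
    """Submit a query to Athena and return its execution id (needs AWS access)."""
    import boto3  # imported lazily so the helpers above work without boto3

    athena = boto3.client("athena")
    request = build_query_request(sql, database, output_location)
    return athena.start_query_execution(**request)["QueryExecutionId"]
```

Calling `run_query(CREATE_TABLE_SQL)` once and then `run_query(COUNT_BY_TYPE_SQL)` would submit the two statements. Athena runs queries asynchronously, so the returned id would still need to be polled for completion before results are read.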
The Athena engine 2 upgrade includes several performance enhancements to JOIN, ORDER BY, and AGGREGATE operations. It also adds the new features described below, along with others not covered in this article, such as support for reading nested schemas to reduce costs.
1. The Federated Query
The federated query is a feature that data analysts, engineers, and data scientists can use to execute queries across a wide variety of data sources, including relational, non-relational, object, and custom data sources.
With this feature, users can submit a single query and analyze data from multiple sources, whether hosted in the cloud or running on on-premises servers. Athena engine 2 uses Data Connectors that run on AWS Lambda, with open-source Data Connectors for Amazon S3, Apache HBase, Amazon DocumentDB and Redshift, AWS CloudWatch Metrics, and JDBC-compliant relational databases such as MySQL and PostgreSQL.
Should users require data connectors to other data sources, Amazon provides a Query Federation SDK for building connectors to any proprietary data source.
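As an illustration of one query spanning two sources, the sketch below builds a federated join in Python. The connector function name, schemas, tables, and columns are all hypothetical. The `"lambda:<function name>"` catalog prefix is, to the best of my knowledge, the form Athena accepts for referencing a connector's Lambda function directly; if the connector is registered under a catalog name, that name would be used instead.

```python
def federated_table(lambda_function, schema, table):
    """Reference a table exposed by a Lambda-based data connector."""
    return f'"lambda:{lambda_function}".{schema}.{table}'


def build_federated_join(s3_table, remote_table, join_key):
    """Join a table in the S3 data lake with a table from another source."""
    return (
        f"SELECT c.{join_key}, COUNT(*) AS event_count\n"
        f"FROM {s3_table} AS e\n"
        f"JOIN {remote_table} AS c ON e.{join_key} = c.{join_key}\n"
        f"GROUP BY c.{join_key}"
    )


# Hypothetical example: events in S3 joined with customers held in MySQL.
customers = federated_table("example_mysql_connector", "sales", "customers")
sql = build_federated_join("events", customers, "customer_id")
print(sql)
```

The resulting statement is ordinary SQL. The only federation-specific part is the catalog prefix on the remote table.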
2. Schema Evolution Support
One of the fundamental aspects of data management is schema evolution. Real-world data is fluid. It can change based on the changing environment producing the data. Consequently, data schemas require the ability to evolve or change over time to accurately represent the data in the S3 data lake or alternative data source.
Athena engine 2, coupled with the AWS Glue Data Catalog, supports the discovery and evolution of the schemas held in the Glue Data Catalog.
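To make the idea of schema evolution concrete, here is a small, self-contained Python sketch of additive evolution: new fields seen in incoming records are appended to the schema, and older records read under the newer schema simply yield no value for those fields. This illustrates the concept only. It is not how Athena or the Glue Data Catalog implement it, and all field names are hypothetical.

```python
def evolve_schema(schema, record):
    """Return a copy of the schema extended with any new fields in the record."""
    evolved = dict(schema)
    for field, value in record.items():
        if field not in evolved:
            evolved[field] = type(value).__name__
    return evolved


def read_with_schema(record, schema):
    """Project a record onto a schema; fields the record lacks become None."""
    return {field: record.get(field) for field in schema}


schema_v1 = {"user_id": "str", "amount": "float"}
old_record = {"user_id": "u1", "amount": 9.5}
new_record = {"user_id": "u2", "amount": 3.0, "currency": "USD"}

# The producer started emitting a "currency" field, so the schema grows.
schema_v2 = evolve_schema(schema_v1, new_record)

# Old data stays readable under the evolved schema.
print(read_with_schema(old_record, schema_v2))
```

Adding columns in this way is the simplest kind of change to handle. Renaming or retyping a column generally needs more care.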
3. Geospatial Functions
Aws.amazon.com defines geospatial queries as specialized SQL queries that express relationships between geometry data types, such as distance, crosses, touches, overlaps, and disjoint. These queries operate on geometry data types, including point, line, multiline, polygon, and multipolygon.
As a result, you can run queries that find the distance between two points, check whether one area touches another area, and check whether a line or polygon crosses another line or polygon.
The same aws.amazon.com page cites the following example.
To “obtain a point geometry data type from values of type double for the geographic coordinates of Mount Rainier in Athena, use the ST_Point (longitude, latitude) geospatial function.”
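The sketch below shows how such a query could be assembled in Python. The coordinates are approximate values chosen for illustration, and the helper names are hypothetical. `ST_Point` and `ST_Distance` are Athena geospatial functions, and note the argument order: longitude first, then latitude.

```python
# Approximate (longitude, latitude) pairs, for illustration only.
MOUNT_RAINIER = (-121.7602, 46.8527)
SEATTLE = (-122.3321, 47.6062)


def st_point(longitude, latitude):
    """Render an ST_Point constructor call (longitude first, then latitude)."""
    return f"ST_Point({longitude}, {latitude})"


def distance_query(a, b):
    """Build a query returning the distance between two points."""
    return f"SELECT ST_Distance({st_point(*a)}, {st_point(*b)}) AS distance"


print(distance_query(MOUNT_RAINIER, SEATTLE))
```

As far as I understand the function, `ST_Distance` on plain geometries returns a planar distance in the units of the input coordinates (degrees here), so the result is not a distance in kilometers.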
The Athena engine 2 upgrade includes many additional geospatial functions, divided into categories such as constructor, operation, and accessor functions.
For detailed information on these geospatial functions, go to the Amazon Athena geospatial function list.
The Upsolver and Athena Partnership
As highlighted at the beginning of this discussion, Upsolver is the only technology partner to Amazon Athena.
What Upsolver is, and what it offers, is best answered by looking at the Upsolver descriptions found on the AWS Marketplace web page.
This page notes that Upsolver is “an industry-leading Data Lake Platform that empowers any developer to manage, integrate and structure streaming data for analysis at unprecedented ease.”
It also describes Upsolver as a “visual, SQL-based ETL service that makes it easy to combine streaming and big historical data for analytics and ML. Upsolver cuts 95% of the data lake ETL effort and improves the performance of services like AWS Athena by 100X.”
Let’s expand on these descriptions by drilling down into the details of the Amazon Athena and Upsolver Data Lake ETL integration.
Upsolver’s unique value proposition is its easy-to-use, aesthetically pleasing user interface, undergirded by powerful Data Lake ETL functionality. The graphical user interface is intuitive, guiding even novice users through setting up the S3 data lake integration and simplifying the data collection and analytics process. Upsolver also integrates with services like Amazon Athena, and it is in this integration that the application’s true power is seen.
That power lies in Upsolver’s ETL technology and its deep integration with S3, Glue, and Athena. Athena is powerful on its own, without Upsolver. Joined together, however, the two deliver markedly better query performance and scalability.
The data is ingested into the S3 data lake from Kafka or Kinesis and stored there in an optimized file format based on Apache Parquet.
Using Upsolver’s GUI, users can decide whether to partition data by a custom data field or by event time. Athena’s base functionality relies on batch processing of the data ingested into the S3 lake. Upsolver improves on this by allowing users to query data streams in real time (or near real time).
The same GUI allows users to create and edit tables directly in Athena without negatively affecting the analysis of the data. Historical tables that provide an instant snapshot of any point in time can be created directly on S3.
Lastly, the Upsolver GUI includes a SQL editor that allows users to create custom tables from disparate event streams on the fly.
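The choice between partitioning by event time and partitioning by a custom field, mentioned above, comes down to how objects are laid out under the table's S3 location. The Python sketch below shows one common convention, Hive-style `key=value` prefixes, which Athena can map to partitions. It is an assumption for illustration, not a description of the layout Upsolver actually writes, and the bucket and field names are hypothetical.

```python
from datetime import datetime, timezone


def event_time_partition(base_prefix, event_time):
    """Partition prefix keyed on the UTC date of the event."""
    day = event_time.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"{base_prefix.rstrip('/')}/event_date={day}/"


def custom_field_partition(base_prefix, field, value):
    """Partition prefix keyed on a custom data field."""
    return f"{base_prefix.rstrip('/')}/{field}={value}/"


base = "s3://example-data-lake/events/"
when = datetime(2020, 11, 13, 23, 30, tzinfo=timezone.utc)

print(event_time_partition(base, when))
print(custom_field_partition(base, "country", "US"))
```

A query that filters on the partition column (for example, a single `event_date`) then only needs to scan the objects under the matching prefix, which is where the cost and speed benefit of partitioning comes from.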
The true value of the Athena and Upsolver Data Lake ETL integration is best described by considering the SimilarWeb case study.
For SimilarWeb, as for most companies in business today, the data collection and analysis process is critical. Decisions based on a robust statistical analysis of company data are key to organizational success. Conversely, incorrect statistics can have disastrous consequences, something that no organization can afford.
Upsolver provided the solution for SimilarWeb’s data collection and analysis process. In SimilarWeb’s words, Upsolver’s “ETL pipeline helped improve our efficiency and reduce the time from ingestion to insight from 24 hours to minutes.”