The traditional data warehousing architecture, for all intents and purposes, has become a monolithic construct. There was a time and place for data warehouses, with large enterprises using this architecture for a long time because its data storage capacities are useful for making data valuable. However, data warehouses can no longer keep up with data processing requirements like data streaming, machine learning, and exponential volumetric growth.
Why the AWS data lake
Consequently, the modern enterprise business started looking for an agile and flexible data storage and analytics solution.
Enter the data lake, especially the AWS data lake.
Developed by Amazon to support one of the largest logistics networks on the planet, the Amazon Web Services data lake is a “centralized repository that allows you to store all your structured and unstructured data at any scale.” It also provides secure access to this data enabling you to store, govern, discover, and share all of your structured and unstructured data at any scale.
The salient point of the data lake is that you can store your raw data in the lake without having to structure the data or perform ELT actions on the data before running different types of analytics on this data to produce valuable information, designed to answer the “why” and “how” questions, and to drive business growth and development forwards.
AWS data lake query engines
As described above, once the raw data has been loaded into the data lake, it must be processed and transformed into meaningful information. Therefore, by way of expanding on this statement, let’s consider the different query engine options available to analyze this data.
According to the AWS website, Athena is an “interactive query service that makes it easy to analyze data in Amazon S3 [data lake] using standard SQL.” It is also important to note that
“Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.”
Upsolver, the only recommended Amazon Athena partner, has developed a solution that integrates with a “visual and SQL-based interface to create tables easily.” After you have created these tables, the next step is to develop and run queries on your S3 data. Upsolver describes the advantages of using their platform with the following statement.
Once the queries have been created, run them, and “watch your queries run faster than you ever thought possible thanks to [our] groundbreaking ETL technology and deep integration with Amazon S3, Athena, and Glue.”
Apart from the pros of using the Athena/Upsolver integration, the other benefits of using Athena to query data in the S3 data lake include the following.
- Athena is serverless; no ETL is involved in uploading the data to the data lake.
- You pay per query. Ergo, you only pay for the queries you run.
- Athena is built on Presto, which uses standard SQL resulting in an easy-to-use construct.
- Athena is extremely fast, even when running queries on large datasets.
Finally, there are no real disadvantages of using Athena, especially when integrated with Upsolver. Therefore, the only criteria that need to be taken into account when deciding to use Athena or not are whether this query engine is fit for purpose.
2. Redshift Spectrum
Redshift Spectrum is a feature within the AWS Redshift data warehousing service. It aims to allow data analysts to conduct fast, complex analysis of objects stored within the AWS cloud. In other words, the analyst can perform SQL queries on data stored in the S3 buckets within the data lake. It is not necessary to move the data from the S3 buckets into a database. Redshift Spectrum directly queries unstructured data within the S3 bucket.
The most crucial benefit of using Redshift Spectrum is that it expands the scope of a query by extending beyond the existing Redshift data warehouse nodes and large volumes of unstructured data within the S3 data lake.
Redshift Spectrum is similar to Athena. However, these services address different requirements. The primary disadvantage of using Redshift Spectrum is that it resides on dedicated Redshift servers independent of the data lake. And it needs a Redshift cluster and a SQL client to execute SQL queries. Finally, Redshift Spectrum does not support the SQL insert statement. It treats the data in the S3 bucket as read-only.
3. Presto/Spark on EMR
Amazon EMR is a managed cluster that simplifies running big data frameworks, such as Apache Spark on AWS. In other words, EMR is a data store. And, the Presto/EMR cluster and Spark/EMR are different from each other. Therefore, let’s consider each one individually.
As highlighted above, EMR is a data store, And Presto is a distributed SQL query engine. Thus, users can query data in EMR and combine it with data from many other sources. Presto has data connectors to RDBMSs, NoSQL database files, flat files, object stores, and Elasticsearch.
On the other hand, Spark is a general cluster-computing framework that can process data sitting in Amazon EMR. However, the Spark core does not support SQL. Therefore, should you wish to run SQL queries using Spark, you must first install the Spark SQL module. Succinctly stated, Spark is not designed for interactive or ad-hoc queries. And it is not intended to combine data from different sources as with Presto.
There are several disadvantages to using EMR.
- It does not offer native support for Presto. Users must create a Presto metastore, configure connectors, and install and configure the needed tools.
- EMR is complex. It takes much longer to configure and deploy the EMR/ Spark/S3 integration.
- EMR is more expensive than using the EC2 compute engine. Therefore, persisting your EMR data in S3 will add to the costs of running an AWS cloud-based infrastructure.
4. Presto/SparkSQL/Hive on EMR
As described above, Presto and SparkSQL are used to query data stored in the Amazon EMR. The pros and cons of the Presto/SparkSQL integration remain valid in this scenario. However, Hive is added into the mix.
Succinctly stated, Hive can be used for data warehousing that sits on top of the Hadoop Distributed File System (HDFS), simplifying data encapsulation, ad-hoc queries, and the analysis of massive datasets. Lastly, you can query data contained within the EMR clusters using a SQL-like language.
The benefits of Hive on EMR are that EMR offers native support for Hive, making it easy to manage Hive clusters from the AWS management console. Secondly, it is highly available, manages to scale and descale effectively, and performs well under load.
The disadvantages of using EMR, as described above, remain valid in this scenario.
Dremio is a proprietary data lake engine that “operationalizes your data lake storage and speeds [up] your analytics processes with a high-performance and high-efficiency query engine”
In other words, Dremio is a query engine that is easy to launch and deploy in your AWS account. It provides a streamlined and highly automated engine that is based on Apache Arrow. The developers explain how they developed their highly automated, high-speed query engine in the following quotation:
“Dremio combines an Apache Arrow-based query engine with proprietary acceleration technologies to speed queries directly on your existing S3 data by up to 100x.”
As far as the benefits of Dremio go, it stands to reason that the automation, speed, and ease of use are seen as significant wins for clients who deploy Dremio in their AWS account. Other pros include cost savings with AWS EC2 integration and the fact that your data stays in S3. It is not locked into a proprietary format.
Juxtapositionally, the only real negative attached to Dremio is that it is a proprietary product. Although it is based on an open-source engine, uses are still reliant on a third-party for updates and maintenance. While it is improbable, the developers will go out of business; it is still a consideration when deciding whether to use Dremio.
Starburst is a distributed SQL engine. According to the developers’ website, it has the “power to query anything with SQL.” It is “fast, scalable, SQL-on-anything.” Starburst offers a SQL engine that’s decoupled from the storage.
The benefits of this product include the facilitation of the provisioning of SQL clusters. All users need to do is to deploy a Starburst environment to leverage Starburst’s flexible compute capabilities against any data source of any size.
Starburst also allows users to connect to any data source and analyze the data without owning it and moving it into a proprietary format. Starburst is built to deliver more efficient query processing time and improve cost efficiencies.
7. Databricks SQL Analytics
Databricks SQL Analytics is an improved SQL analytics service that provides Databricks users with the ability to perform BI and SQL workloads on Delta Lake.
According to the Databricks website, this SQL analytics service “provides a dedicated SQL-native workspace, built-in connectors to let analysts query data lakes with the BI tools they already use, query performance innovations that deliver fast results on larger and fresher data sets than analysts traditionally have access to, and new governance and administration capabilities.”
The benefits of using this platform is that it’s very tightly integrated with the existing Databricks environment. It provides SQL capabilities on Spark and it performs better than the older version because it was rewritten in C++ to provide cpu vectorization and address underlying JVM limitations.
Some of the negatives include lock-in and performance. Databricks SQL Analytics is designed to run with Databricks’ platform. It does not provide flexibility and makes it difficult for users to choose and integrate with solutions outside of Databricks. Also, even with the performance improvements, the latency is still relatively high since Apache Spark is an improvement from the Hadoop ecosystem, which was designed for batch processing.
8. Qubole (Idera)
Idera.Inc has just acquired Qubole. And it will join Idera’s database tools’ business unit, “significantly expanding the breadth and depth of the company’s cross-platform data-centric solutions.”
This news should not affect Qubole. Therefore, let’s look at what the product is as well as its pros and cons.
Qubole is the leading Data Lake Platform that is cross-platform and is used to analyze unstructured data. It is not cloud-specific and has an integrated environment for all cloud providers, including AWS, Google, and Azure, and data processing engines.
Its fundamental purpose is to provide end-to-end services that reduce the time and effort required to run data pipelines, streaming analytics, and machine learning workloads on any cloud platform. It also includes optimized versions of Hive, Airflow, Spark, and Presto. Finally, intelligent automation strategy and technology is built into this platform to scale up and down to meet service-level requirements, reducing costs.
The most significant advantage of deploying Qubole is that it is not vendor-specific. The same products, code, and configuration settings are usable across all the major cloud providers. In other words, the tools Qubole contains are not tied to a specific cloud vendor.
On the other hand, the biggest challenge with Qubole is that it is not easy to use. The cluster-admin tasks are not easy to complete, especially when creating an individual cluster for each team. Finally, it is also not possible to share code between users of other teams and clusters.
Data is a precious commodity or currency for the modern digital company. And because of the volumes and the speed at which the data is processed, it is crucial to implement a robust, secure cloud-based data analytics architecture that includes storing and analyzing both structured and unstructured data with any analytics engine that fits your use case and workload. Finally, the best way to solve this conundrum is to utilize the cloud data lake construct with one or more of the query engines highlighted above.