What is Apache Presto and Why You Should Use It

“The pace of technological change sweeping boldly across business and society is breathtaking—and it is taking us from the digital age towards a new reality, a new era… [known as] …the post-digital era.” – Paul Daugherty of Accenture.

The post-digital age, together with the Fourth Industrial Revolution (4IR) and its associated technologies, including IoT, Big Data, data analytics, Artificial Intelligence and machine learning, quantum computing, and cloud-based technologies, is driving the digitization of the global GDP and subsequent economic activity.

Consequently, Big Data and its processing under the ambit of data science and analytics are central to the 4IR paradigm.

Why?

Put simply, as these technologies continue to improve, they produce more data, which must be analyzed to answer the questions asked by industry and by academia.

Thus, it stands to reason that massive amounts of data (billions of records, or petabytes) must be queried and analyzed in the most efficient, cost-effective manner possible.

How?

Enter cloud computing, Amazon S3, and Apache Presto.

What is Apache Presto?

AWS describes Presto as follows:

“Presto (or PrestoDB) is an open source, distributed SQL query engine, designed from the ground up for fast analytic queries against data of any size.”

It interfaces with both non-relational data sources, such as Amazon S3, Hadoop HDFS, MongoDB, and HBase, and relational databases, such as MySQL, PostgreSQL, and Microsoft SQL Server.

Apache Presto’s power and value proposition lie in the fact that it can query data wherever it is stored, without the need to move the data into a separate, structured system like a relational database or data warehouse. Query execution runs in parallel over a scalable “pure memory-based architecture,” with most results returning in seconds.

Presto was first developed by Facebook to run interactive queries on a 300 petabyte (PB) data warehouse, structured as Hadoop clusters. It was implemented across the entire company in 2013. 

Towards the end of 2013, Facebook released Presto as open source under the Apache License and made it available for anyone to download from GitHub. On 23 September 2019, the Linux Foundation announced that it would host Presto from then on. According to the announcement, the most significant advantage of this change is that “the newly-established Presto Foundation will have an open and neutral governance model that will enable Presto to scale and diversify its community.”

Presto has since become a popular choice for data engineers who want to run interactive queries on Amazon S3 and Hadoop. Facebook currently uses Presto to run over 30,000 queries every day, processing 1PB of data daily.

Apache Presto’s Core Concepts

To leverage Presto’s power and ability to query Big Data, it is essential to understand the query engine’s core concepts.

SQL statements and queries are familiar terms. However, several other concepts are worth understanding.

1. Server Types

There are two server types: the coordinator and the worker. These names describe their behavior in that the coordinator is responsible for managing worker nodes, parsing SQL statements, and planning queries. The coordinator also fetches the query results from the workers and passes them to the client. 

In contrast, the worker is responsible for executing tasks and fetching data from the connectors. When a worker node starts up, it advertises itself to the discovery server, which typically runs inside the coordinator. In this way, the coordinator knows that the worker is active and ready to work.
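The startup handshake described above can be pictured with a toy model. This is only a sketch: the Coordinator class, its method names, and the worker IDs are invented for illustration and are not part of Presto.

```python
from dataclasses import dataclass, field


@dataclass
class Coordinator:
    """Toy coordinator with an embedded discovery registry (not Presto code)."""
    active_workers: set = field(default_factory=set)

    def announce(self, worker_id: str) -> None:
        # A worker calls this on startup to advertise itself.
        self.active_workers.add(worker_id)

    def schedulable_workers(self) -> list:
        # Only workers that have announced themselves can receive tasks.
        return sorted(self.active_workers)


coordinator = Coordinator()
for worker_id in ("worker-1", "worker-2"):  # hypothetical worker IDs
    coordinator.announce(worker_id)

print(coordinator.schedulable_workers())
```

The point of the sketch is the direction of the handshake: workers register themselves, and the coordinator schedules work only on nodes it has heard from.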

2. Data Source

The data source concept covers connectors, catalogs, schemas, and tables. Schemas and tables are widely understood as fundamental elements of a relational database, so let’s look at a succinct definition of a connector and a catalog.

In summary, a connector is a link between Presto and a data source, such as Amazon S3 or a relational database. Another way to describe a connector is that it is like a database driver.

Secondly, every catalog is associated with a specific connector. The catalog configuration file must contain the property connector.name, which the catalog manager uses to create a connector for that catalog.

Finally, a catalog contains schemas and references a data source via a connector. It is essential to note that when calling a table in Presto, “the fully qualified table name is always rooted in a catalog.” 

For instance, the MySQL table called user_name in the users schema is referred to as

catalog_name.schema_name.table_name

or

mysql_catalog.users.user_name

Catalogs are defined in properties files stored in the Presto configuration directory.
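To make the catalog idea concrete, here is a minimal sketch that reads a catalog properties file and splits a fully qualified table name into its three parts. The file contents, host, user, and catalog name (written in lowercase here) are hypothetical, and the two helper functions are illustrative, not Presto APIs.

```python
# Hypothetical contents of a catalog file, e.g. etc/catalog/mysql_catalog.properties.
CATALOG_FILE = """
connector.name=mysql
connection-url=jdbc:mysql://example.net:3306
connection-user=example_user
"""


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def split_table_name(qualified_name: str) -> tuple:
    """Split catalog.schema.table into its three parts."""
    parts = qualified_name.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected catalog.schema.table, got {qualified_name!r}")
    return tuple(parts)


props = parse_properties(CATALOG_FILE)
catalog, schema, table = split_table_name("mysql_catalog.users.user_name")
print(props["connector.name"], catalog, schema, table)
```

The connector.name value tells Presto which connector backs the catalog, while the first part of the table name tells it which catalog to look in.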

3. Query Execution Model

Succinctly stated, Presto executes “SQL statements and turns these statements into queries that are executed across a distributed cluster of coordinator and workers.” While the statements and queries concepts are well-known, there are other important concepts that are part of the query execution model.

Stage

A stage is simply part of the execution of a query statement. For instance, if Presto has to aggregate one billion rows stored in Amazon S3, it goes about this task by creating a root stage and several other stages. The root stage aggregates the data returned from the other stages.

The stages hierarchy looks like a tree with one root stage and many other stages, all related to the root stage.

While it is reasonable to assume that stages run on worker nodes, they do not. They are used by the coordinator to model a distributed query plan for the workers.
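The split between a root stage and the other stages can be sketched as a two-level aggregation. This is a simplified illustration under stated assumptions: the data, the function names, and the choice of a per-country sum are invented, and each call to partial_aggregation merely stands in for the work a non-root stage describes for one slice of the data.

```python
from collections import Counter

# Hypothetical slices of a larger table; each one is aggregated separately.
partitions = [
    [("ZA", 10), ("US", 5)],
    [("US", 7), ("DE", 3)],
    [("ZA", 1), ("DE", 4)],
]


def partial_aggregation(rows):
    """Non-root stage work: sum amounts per country for one slice."""
    partial = Counter()
    for country, amount in rows:
        partial[country] += amount
    return partial


def final_aggregation(partials):
    """Root stage work: merge the partial sums from the other stages."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


result = final_aggregation(partial_aggregation(rows) for rows in partitions)
print(result)
```

Because each slice is reduced before the results are merged, the root stage only has to combine a handful of partial sums rather than every row.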

Tasks and Splits

Even though stages do not execute on the worker nodes, tasks do. Thus, as part of the distributed query plan, a stage is a series of tasks executed and distributed over worker nodes.

In the Presto architecture, the distributed query plan is broken down into many stages. These are then translated into tasks that process or retrieve data from splits (sections of a larger dataset).

When Presto schedules a query, the coordinator queries a connector for a list of the splits available for a particular table. For instance, if the query has to aggregate all the rows in the user transactions table in the MySQL database, the coordinator will ask the connector for a list of all the splits in this table. The coordinator keeps track of which servers are running parts of the query in parallel and which splits are being processed by which tasks.
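As a rough picture of this bookkeeping, the sketch below hands out splits to workers and records who is processing what. Round-robin assignment is an assumption chosen for simplicity, not a description of Presto's actual scheduler, and the split and worker names are hypothetical.

```python
from collections import defaultdict
from itertools import cycle


def assign_splits(splits, workers):
    """Assign splits to workers round-robin (a simplification)."""
    assignment = defaultdict(list)
    for split, worker in zip(splits, cycle(workers)):
        assignment[worker].append(split)
    return dict(assignment)


# Hypothetical split identifiers a connector might return for one table.
splits = [f"split-{i}" for i in range(5)]
workers = ["worker-1", "worker-2"]

assignment = assign_splits(splits, workers)
for worker, assigned in assignment.items():
    print(worker, assigned)
```

The resulting mapping is the kind of information the coordinator needs in order to know which splits are being processed where.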

Driver, Operator, and Exchange

These concepts are the last three essential ingredients of the successful Presto model.

Tasks contain at least one parallel driver. Each driver acts on the data and combines operators to produce output that is then aggregated by a task and delivered to a task in another stage, and ultimately, via the coordinator, to the client.

Operators consume, transform, and produce data. For instance, a table scan operator fetches data from a connector and produces data that other operators can consume, while a filter operator consumes data and produces the subset that matches a predicate.

Finally, an exchange transfers the data between different nodes for the various stages of the query.
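One way to picture a driver chaining operators is with Python generators, where each operator consumes rows from the previous one. This is an analogy only: the operator functions, the predicate, and the rows are invented for illustration, and the exchange between nodes is not modeled.

```python
def table_scan(rows):
    """Source operator: produces rows (here from an in-memory list)."""
    for row in rows:
        yield row


def filter_op(rows, predicate):
    """Consumes rows and produces only those matching the predicate."""
    return (row for row in rows if predicate(row))


def project_op(rows, column):
    """Consumes rows and produces a single column."""
    return (row[column] for row in rows)


def driver(rows):
    """Chains operators into one pipeline, as a driver would."""
    scanned = table_scan(rows)
    filtered = filter_op(scanned, lambda row: row["amount"] > 5)
    return list(project_op(filtered, "user"))


# Hypothetical rows from one split.
rows = [
    {"user": "ann", "amount": 10},
    {"user": "bob", "amount": 3},
    {"user": "cat", "amount": 8},
]
print(driver(rows))
```

Rows flow through the pipeline one at a time, so no operator needs to hold the full dataset before passing results on.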

The central reason why you should implement Apache Presto in your cloud computing stack

Now that we have a comprehensive understanding of Apache Presto and its architecture and core concepts, let’s consider the fundamental reason why you should implement Apache Presto in your cloud computing stack.

As described above, Apache Presto can interface directly, through its connectors, with a wide variety of data sources, including raw data stored in data lakes such as Amazon S3 and HDFS data blocks, as well as relational databases like MySQL and Microsoft SQL Server.
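In practice, this means a single SQL statement can join tables that live in different systems. The sketch below builds such a statement as a string; the catalog, schema, table, and column names are hypothetical, and the helper is for illustration only (it interpolates names directly, so it should not be fed untrusted input).

```python
def federated_join(left_table: str, right_table: str, key: str) -> str:
    """Build a SQL join across two fully qualified (catalog.schema.table) names."""
    for name in (left_table, right_table):
        if name.count(".") != 2:
            raise ValueError(f"expected catalog.schema.table, got {name!r}")
    return (
        f"SELECT l.{key}, count(*) AS events\n"
        f"FROM {left_table} AS l\n"
        f"JOIN {right_table} AS r ON l.{key} = r.{key}\n"
        f"GROUP BY l.{key}"
    )


# Hypothetical tables: click logs in a Hive catalog over S3, users in MySQL.
sql = federated_join("hive.web.clicks", "mysql_catalog.users.user_name", "user_id")
print(sql)
```

Each table name starts with its catalog, so the engine knows which connector to use for each side of the join without the data being copied into one system first.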

It also has a hosted cloud version – Amazon Athena, a serverless, “interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.”

Upsolver, the data lake ETL service, is the only official partner to Amazon Athena

Upsolver provides a visual, SQL-based interface for creating real-time tables in Athena with little engineering overhead and according to performance best practices.  

Upsolver also enables updates and deletes on tables in Athena for common change data capture (CDC) and compliance use cases.

Thus, the combination of Apache Presto, Amazon Athena, and Upsolver offers a solution to the challenges of analyzing petabytes of data quickly and effectively, with low overhead costs.

Final Thoughts

Analyzing real-time data, or the billions of records that the modern enterprise produces, requires solutions such as Apache Presto or Amazon Athena, Upsolver, and Amazon S3 to ensure that data is analyzed timeously, cost-effectively, and with low overhead in cloud-based storage and architectures.
