Upsolver CTO and Co-founder Yoni Eini recently presented a session for the Apache Presto foundation on the topic of querying streaming data with Presto, Amazon Athena and Upsolver. You can find the video below, or – if you prefer – a written summary immediately after:
Presto is a popular choice for data engineers wanting to run interactive queries on large file stores, such as Hadoop or AWS S3. Historically it’s been most useful for queries that are not time-critical. Its performance drops off significantly when it’s used to achieve up-to-the-moment insight.
In this post we review why Presto is in common use, explore its limitations, and review what it takes to use Presto for real- and near-real-time streaming analytics.
Presto: Aptly Named Search – for Batch Data
As its name indicates, Presto is, well, fast. It works especially well with data at scale. You can run Presto in your own cluster using multiple systems and query large volumes of data quickly and easily. It’s excellent for ad hoc queries and analytics of batch data, ideally data that’s been transformed and stored in a column-based format such as Parquet.
But Presto is not so well-suited for high-volume streams of events or data you must analyze with near-zero latency, such as user activity, real-time bidding on digital ads, or CDC. In these cases companies use other tools for streaming analytics and relegate Presto to a secondary role, such as long-term historic analysis. That means the time and expense of maintaining two systems. It also means maintaining multiple discrete data stores. So querying data from a year ago to up to an hour ago, for example, is exceptionally difficult.
Also, batch updates are more forgiving than streaming data. Any process errors in batch data aren’t time-critical, and data engineers can address them asynchronously. You can also manage updates using a tool such as Airflow. Streaming data is different. Since engineers aren’t on call 24×7, orchestration and other processes must be automated.
Presto isn’t designed to manage this sort of complex, large-scale orchestration, data interaction such as stream synchronization and timing (what data to include, when to cut it off) are error-prone and difficult to manage manually; for example, if the cutoff point isn’t consistent you could experience data loss or duplication. You must view all the data in a consolidated way, and as you perform joins, aggregations, and more on incoming events you must enforce clear and consistent boundaries.
File management is another consideration. Presto doesn’t handle file partitioning, so as data streams in and partitions fill up with small files (to support real-time analysis), Presto’s query performance degrades dramatically.
Finally, there’s exactly-once message processing. Regardless of whether an event stream is out of order – for example due to a network outage, or a process that was handed off from one server to another – you still must process that event only once, in the proscribed order. Duplicate or missed events can significantly hurt the reliability of the data stored in your lake, but exactly-once processing is notoriously difficult to implement since your storage layer is only eventually (not instantly) consistent.
So there’s much to do to overcome Presto’s innate design limitations while gaining the advantage of Presto’s speed and scalability for streams as well as batches. The key is preparation. If you properly prepare streaming data and optimize data lake storage in advance, you can greatly expand Presto’s utility.
Upsolver provides a low-code, self-service, optimized compute layer on your data lake that makes your data lake high-performing so it can in turn serve as a cost-effective analytics store. Learn more
Making Data Streams Presto-Ready
Preparing data in a data lake is challenging. Manual coding using tools such as AWS Glue or Apache Spark can get you part of the way. But it’s complex and resource-intensive. Automating the process so that data is consistent and easily queryable, if done correctly, paves the way for using Presto as easily with streams as with batches.
This is where Upsolver’s data lake engineering platform comes into play. Upsolver provides a low-code, self-service, optimized compute layer on your data lake that makes your data lake high-performing so it can in turn serve as a cost-effective analytics store.
Upsolver pulls data in from the source and stores it on S3. It incorporates data lake engineering best practices to make processing efficient, handling the prerequisites to high-performing Presto queries. It also automates important but time-consuming data pipeline work that every data lake requires to perform well. With Upsolver, data practitioners do this automatically and declaratively, using either the Upsolver visual IDE or ANSI SQL.
- ensures all data is optimized as a usable data lake.
- runs the transformations, which are defined in SQL and include filters, aggregations, joins and window operations.
- converts data into query-efficient Parquet files and manages them as part of a continuous compaction process; this is what helps Presto with file management. Upsolver writes files once to the data lake, partitioning them appropriately to further speed response.
- manages file metadata by leveraging the data catalog (such as Amazon Glue and Hive Metastore).
As new data arrives, Upsolver detects INSERT, UPDATE, and DELETE changes to existing rows and reflects those in output tables queried by Presto. This ensures consistency with the current state of the source data.
Upsolver also uses idempotent operations, which essentially are guarantees that when a file is written, it will be read in the exact same way, regardless of who’s reading it. You can reapply the same operation to the file without changing its final state.
When Upsolver has prepared the data, optimized it, and written it to S3, you can run a query from a Presto cluster and get results on the actual full data. You get an accurate, consistent, and current view of the data in Presto, and Presto’s improved performance makes dashboarding easier and more powerful.
Despite its popularity as a query engine, Presto was built in a batch world, and isn’t natively suited for streaming analytics, upon which organizations increasingly rely. You can overcome this limitation by using Upsolver to automatically store data in cloud object storage such as a data lake, prepare the data, and optimize the lake for enhanced querying. In this way you can apply the speed and simplicity of Presto to up-to-the-minute data that’s stored economically in a data lake.