Trying to build a file system for your data lake? Check out our free ebook on data partitioning on S3. If you’re looking for a Spark alternative that offers true self-service with no data engineering bottlenecks, you can get our technical whitepaper to learn more about Upsolver.
Apache Spark is an open-source framework for distributed data processing. It has become an essential tool for most developers and data scientists who work with big data. Spark is powerful and useful for diverse use cases, but it is not without drawbacks. Many organizations wrestle with the complexity and engineering burden of using and managing Spark, or they might require fresher data than Spark’s batch processing is able to deliver.
In this article we look at 4 common use cases for Apache Spark, and suggest a few alternatives for each one.
What is Spark and what is it used for?
Apache Spark is a fast, flexible engine for large-scale data processing. It executes batch, streaming, or machine learning workloads that require fast iterative access to large, complex datasets.
Arguably one of the most active Apache projects, Spark works best for ad-hoc jobs and large batch processes. Using Spark requires knowledge of how distributed systems work. Further, people with the expertise required to use it efficiently and correctly in production systems are hard to find and expensive to hire.
For organizations that need fresher data than Spark’s batch processing can deliver, the Apache project released an extension of Apache Spark’s core API called Apache Spark Streaming. Spark Streaming enables data engineers and data scientists to process real-time data from message queues such as Apache Kafka and Amazon Kinesis, web APIs such as Twitter, and more. But it may surprise you to know that Spark Streaming isn’t a pure streaming solution; it breaks data streams down into micro-batches and so retains some of the challenges of batch processing, such as latency. The micro-batch model also means that scheduling Spark jobs and managing them over a streaming data source requires extensive coding. Many organizations struggle to get to production with Spark Streaming, as it has a high technical barrier to entry and requires extensive dedicated engineering resources. Read more about Spark Streaming and Spark Structured Streaming.
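To make the micro-batch idea concrete, here is a minimal sketch in plain Python (it does not use Spark's API). It groups timestamped events into fixed-interval batches, roughly the way a micro-batch engine would. The names `micro_batches` and `wait_until_batch_closes`, the sample events, and the one-second interval are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, payload)


def micro_batches(events: Iterable[Event], interval: float) -> Dict[int, List[str]]:
    """Group events into fixed-length batches keyed by batch index."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for arrival, payload in events:
        batches[int(arrival // interval)].append(payload)
    return dict(batches)


def wait_until_batch_closes(arrival: float, interval: float) -> float:
    """Time an event sits in its batch before the batch can be processed."""
    batch_end = (int(arrival // interval) + 1) * interval
    return batch_end - arrival


# Hypothetical sample stream.
events = [(0.1, "click"), (0.9, "view"), (1.2, "click"), (2.7, "purchase")]
batches = micro_batches(events, interval=1.0)
```

An event that arrives just after a batch opens waits almost a full interval before any processing starts, which is the latency floor that a micro-batch design inherits from batch processing.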
Spark use case: Extract-transform-load (ETL)
ETL tasks are commonly required for any application that works with data. And building ETL pipelines is a significant portion of a data engineer’s responsibilities.
Spark has often been the ETL tool of choice for wrangling datasets that typically are too large to transform using relational databases (big data); it can scale to process petabytes of data. Still, creating efficient ETL processes with Spark takes substantial manual effort to optimize Spark code, manage Spark clusters, and orchestrate workflows. Depending on your data sources, you also may have to code your own connectors.
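For readers new to the pattern, the three ETL stages can be sketched on a single machine with Python's standard library. This is not Spark code, and it ignores everything that makes ETL hard at scale (distribution, cluster management, orchestration). The sample data, table name, and function names are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; the third row has a malformed amount.
RAW_CSV = """user_id,country,amount
1,us,19.99
2,de,5.00
3,us,not_a_number
"""


def extract(text):
    """Extract: parse raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cast types, normalize values, and drop malformed rows."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not numeric
        clean.append((int(row["user_id"]), row["country"].upper(), amount))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into a target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases "
        "(user_id INTEGER, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT user_id, country, amount FROM purchases ORDER BY user_id"
).fetchall()
```

At big data scale each stage has to be partitioned across many machines, which is where the tuning and orchestration effort described above comes in.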
Spark alternatives for ETL
There are multiple ETL frameworks you can use in place of Spark.
Open source ETL frameworks
Open source ETL frameworks include:
- Apache Storm
- Apache Flink
- Apache Flume
These frameworks differ primarily in the type of data for which they’re intended. Apache Storm is designed to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Apache Flink is a framework and distributed processing engine for stateful computations over both unbounded and bounded data streams; it treats batches as data streams with finite boundaries. Apache Flume is designed to collect, aggregate, and move large volumes of log data from sources such as web servers into systems based on the Hadoop Distributed File System (HDFS).
It’s important to keep in mind that, while powerful, these open source frameworks are also complex and require extensive engineering effort to work properly. See our post on open-source stream processing frameworks for more detail.
There are managed service alternatives to Spark for ETL as well. Among the most prominent of these are:
- AWS Glue Studio
- Upsolver
AWS Glue Studio is not a Spark alternative but rather a Spark “helper.” It is the component of AWS Glue, AWS’ data integration service, that provides a visual UI for creating, scheduling, running, and monitoring Spark-based ETL workflows, which run on AWS Glue’s own serverless Spark environment. Glue Studio is a serverless offering that also handles dependency resolution, job monitoring, and retries. But you still must perform a lot of optimization on the storage layer to improve query performance (for example, to compact small files on S3).
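As an illustration of the kind of storage-layer work involved, the sketch below plans a small-file compaction: it groups files that are well under a target size so that each group can be rewritten as one larger file. Greedy grouping is just one simple heuristic chosen for this example; it is not how AWS Glue or any particular tool does it, and the file names, sizes, and target are hypothetical.

```python
from typing import List, Tuple

FileInfo = Tuple[str, int]  # (object key, size in MB)


def plan_compaction(files: List[FileInfo], target_mb: int) -> List[List[str]]:
    """Greedily group small files so each group stays within target_mb.

    Files already at or above the target are left alone, and a group with
    a single file is dropped because rewriting it would gain nothing.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for key, size in sorted(files, key=lambda f: f[1]):
        if size >= target_mb:
            continue  # large enough already; no need to rewrite it
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        groups.append(current)
    return [group for group in groups if len(group) > 1]


# Hypothetical listing of one partition.
files = [("a.parquet", 10), ("b.parquet", 20), ("c.parquet", 30),
         ("d.parquet", 50), ("e.parquet", 200)]
plan = plan_compaction(files, target_mb=64)
```

Here the three smallest files are merged into one 60 MB output, while the 50 MB and 200 MB files are left untouched. A real compaction job also has to rewrite the data safely while queries are running, which this sketch does not attempt.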
Upsolver is a fully-managed self-service data pipeline tool that is an alternative to Spark for ETL. It processes batch and stream data using its own scalable engine. It uses a novel declarative approach where you use SQL to specify sources, destinations, and transformations. All of the “PipelineOps” work – the tasks that make Spark so code-intensive – are automated. This includes orchestration, file system optimization (such as compression and compaction) and state management. It runs on AWS and Azure and enables you to continuously process live data streams and historical big data for use by data lake query engines, data warehouses, or other analytics systems.
Watch this 10-minute video to learn how mobile monetization leader ironSource uses Upsolver to build and maintain petabyte-scale pipelines.
Spark use case: Ad-hoc data exploration and research
Typically this is the domain of data scientists, who are tasked with answering specific business questions using very large datasets. To do this they must understand the data, clean it, and combine it with other data sources.
In this context, larger and more complex datasets make an excellent use case for Apache Spark, and it has few competitors. But if the data is smaller or less complex there are simpler alternatives that get the job done as well as or better than Spark.
Spark alternatives for ad hoc data exploration
There is a spectrum of technologies and tools from which data practitioners can choose, including coding languages and analytics engines.
You can use Python or R to code the logic for queries that can be processed locally. In this case you may also need an orchestration tool such as Airflow to plan out the query execution. Spark offers high-level APIs for these languages so you can leverage Spark as the processing engine if you wish. However, while many analysts and data scientists possess the coding skills required, not all do.
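For data that fits in memory, a local exploration can be as small as the sketch below, which summarizes a metric per group using only Python's standard library. The sample data and the `summarize` helper are hypothetical.

```python
import statistics
from collections import defaultdict

# Hypothetical sample: (plan, session length in minutes).
sessions = [("free", 3.0), ("free", 5.0), ("pro", 12.0), ("pro", 8.0), ("pro", 10.0)]


def summarize(rows):
    """Return the count, mean, and median of the value for each group."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {
        group: {
            "count": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
        }
        for group, values in groups.items()
    }


summary = summarize(sessions)
```

Once the data no longer fits on one machine, the same logic has to be handed to a distributed engine, which is the point at which Spark's language APIs become relevant.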
SQL query engines
Different factors determine which might be the best fit, including data architecture, volume and variety of data, storage type (data lake, data warehouse, and so on), and the business analysts’ skillset.
Some of the most popular tools in this crowded category include:
- AWS Athena
- Proprietary (data stored in data lake):
  - Amazon Redshift Spectrum
  - Snowflake External Tables
  - Google BigQuery External Tables
  - Microsoft Azure Synapse External Tables
AWS Athena is a serverless interactive query service that reads data directly from Amazon S3 object storage. It’s based on the open-source Presto engine, but is offered exclusively as a managed service by Amazon Web Services.
Ahana is a cloud-native managed service for Presto on AWS. It promises data platform teams high-performance SQL analytics on their S3 data lakes and other data sources. Ahana simplifies the deployment, management, and integration of Presto and enables cloud and data platform teams to provide self-service SQL analytics for their organization’s analysts and scientists.
Starburst is a fast, high-volume analytics engine based on open source Trino (formerly PrestoSQL). Starburst works with a wide array of databases, and you can join data across different databases and data stores to perform ad hoc queries without centralizing all your data.
Amazon Redshift Spectrum is a serverless feature within the Amazon Redshift data warehousing service that enables Redshift users to query data stored in Amazon S3 buckets, and to join the results of these queries with tables in Redshift.
Snowflake External Tables is Snowflake’s method for accessing data from files stored outside of Snowflake without actually moving them into Snowflake. External Tables helps Snowflake users evaluate data sets and identify next steps by enabling them to run ad-hoc queries directly on raw data before ingesting the data into Snowflake.
Google BigQuery External Tables are tables that act like a standard BigQuery table. The table metadata, including the table schema, is stored in BigQuery storage, but the data itself resides in the external source.
External tables can be temporary or permanent. A permanent external table is contained inside a dataset; you can view the table properties, set access controls, and so forth. You can query the table and join it with other tables.
Microsoft Azure Synapse External Tables make it possible to query external data assets without moving them from your Data Lake.
External tables point to data located in Hadoop, Azure Storage blob, or Azure Data Lake storage. They have a well-defined schema and are used to read data from files or write data to files in Azure Storage. Most commonly the data is stored in a standard format such as CSV, JSON, Parquet, AVRO, and so on.
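The idea common to all of these external table features (the engine keeps the schema, the data stays where it is) can be sketched in a few lines of Python. This is a conceptual toy, not the API of Snowflake, BigQuery, Redshift, or Synapse. The `ExternalTable` class, the CSV file, and its columns are hypothetical, and a local temporary file stands in for object storage.

```python
import csv
import os
import tempfile
from dataclasses import dataclass
from typing import Callable, Dict, Iterator


@dataclass
class ExternalTable:
    """Schema lives with the 'engine'; rows stay in the external file."""
    path: str
    schema: Dict[str, Callable[[str], object]]  # column name -> type converter

    def scan(self) -> Iterator[dict]:
        # Read the file at query time; nothing is copied into the engine.
        with open(self.path, newline="") as handle:
            for raw in csv.DictReader(handle):
                yield {col: cast(raw[col]) for col, cast in self.schema.items()}


# Stand-in for a file sitting in a data lake (hypothetical data).
tmp_dir = tempfile.mkdtemp()
data_path = os.path.join(tmp_dir, "events.csv")
with open(data_path, "w", newline="") as handle:
    handle.write("event,duration_ms\nclick,120\nview,340\nclick,80\n")

table = ExternalTable(path=data_path, schema={"event": str, "duration_ms": int})

# "Query" the external data: total duration of click events.
total_click_ms = sum(r["duration_ms"] for r in table.scan() if r["event"] == "click")
```

Because every query re-reads the underlying files, file format and file layout directly affect query latency.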
With most of the tools above, SQL knowledge is all you need to use them. However, latency can vary significantly; factors such as query optimization, file formats, and scale all can affect query performance, perhaps substantially.
Spark use case: Business intelligence and reporting
Companies that work with event streams often wish to incorporate up-to-date data into their business intelligence and reporting to internal stakeholders (for example, a near-real-time dashboard summarizing user interactions with mobile apps). Since event data is often semi-structured, you must transform the data and normalize its structure before you can query it.
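As a small example of what normalizing semi-structured data means in practice, the sketch below flattens a nested JSON event into a single level of columns that a SQL engine or BI tool could query. The event shape and the `flatten` helper are hypothetical, and real pipelines also have to deal with arrays, missing fields, and schema changes.

```python
import json


def flatten(record, parent="", sep="_"):
    """Flatten nested dictionaries into a single level of columns."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical mobile app event.
raw_event = '{"user": {"id": 42, "device": {"os": "ios"}}, "action": "tap", "ts": 1700000000}'
row = flatten(json.loads(raw_event))
```

The resulting `row` has the columns `user_id`, `user_device_os`, `action`, and `ts`, which map directly onto a table schema.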
The challenge with using Spark for BI is the engineering and maintenance complexity we described above. There’s also a disconnect between the BI developer in the business unit who’s building the dashboards and the data engineer on the IT team who must write or update Spark jobs each time new data is needed.
Spark alternatives for BI and reporting
While Spark is not a mainstream BI engine, if you need to enrich your reporting with big data you must decide how to process that data.
SQL query engines
Many of the options covered in the section on ad hoc analytics can be applied to BI and reporting since it is really just programmatic querying of the same data.
Many organizations turn to cloud data warehouses, such as Amazon Redshift, Snowflake, Google BigQuery, and Microsoft Azure Synapse, as the underlying database for their BI reporting.
These services are more alike than they are different. They can provide excellent performance and a self-service experience for BI developers. The biggest risk in relying heavily or exclusively on data warehouses for your analytics needs is cost; as data volume scales compute costs can spiral out of control. This is especially true with streaming data, where the compute meter essentially runs 24/7.
Upsolver automates transformations and best practices to output analytics-ready data that substantially improves query performance. Watch our webinar for a glimpse of our latest benchmarking tests showing substantial cost and performance differences when processing streaming data in Snowflake. You may also wish to learn about how Upsolver enables you to combine Snowflake with a data lake that’s optimized for querying to keep performance high and costs low.
Spark use case: Machine learning
Machine learning involves a complex series of processes, including building a training dataset with labeled data, training the ML model, deploying it to production, and then regularly updating the model to improve its accuracy over time. Each poses its own challenge to the machine learning engineer who programs and trains the model, as well as to the data engineer responsible for supplying structured data in a timely fashion.
You can use Apache Spark to convert raw data to build the training dataset; Spark can perform large-scale transformations on complex data. But when you deploy that model to production you also need a separate system capable of serving data in real-time, such as Spark Streaming coupled with a key-value store such as RocksDB, Redis, or Cassandra. So you must build your ETL flow twice, using both batch and streaming processing. And as we’ve detailed in a previous blog post on orchestrating batch and streaming ETL for machine learning, the need to manage two separate architectures and ensure they produce the same results is a daunting technical challenge.
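One common way to reduce the risk of the two flows drifting apart is to define the feature logic once and call it from both paths. The sketch below shows that idea in plain Python. It is an illustration under assumed names (`featurize`, `batch_path`, `StreamingPath`) with hypothetical events, and an in-memory dict stands in for a key-value store such as Redis.

```python
from typing import Dict, Iterable, List


def featurize(event: Dict) -> Dict:
    """Single definition of the feature logic, shared by both paths."""
    return {
        "user_id": event["user_id"],
        "is_large_order": event["amount"] >= 100,
        "amount_bucket": min(int(event["amount"] // 50), 4),
    }


def batch_path(history: Iterable[Dict]) -> List[Dict]:
    """Training path: transform the full historical dataset at once."""
    return [featurize(e) for e in history]


class StreamingPath:
    """Serving path: transform events one at a time and keep the latest
    features per user in a dict that stands in for a key-value store."""

    def __init__(self) -> None:
        self.store: Dict[int, Dict] = {}

    def on_event(self, event: Dict) -> Dict:
        features = featurize(event)
        self.store[event["user_id"]] = features
        return features


# Hypothetical events, replayed through both paths.
history = [
    {"user_id": 1, "amount": 30.0},
    {"user_id": 2, "amount": 120.0},
    {"user_id": 1, "amount": 260.0},
]
stream = StreamingPath()
streamed = [stream.on_event(e) for e in history]
parity_ok = streamed == batch_path(history)
```

Sharing a function is easy in a toy like this. In practice the batch and streaming systems often run on different engines with different state handling, which is why keeping their results identical is so difficult.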
Spark alternatives for machine learning
There are several discrete ML services that facilitate or enable machine learning.
Google Dataflow provides portability with processing jobs written using the open source Apache Beam libraries. It also automates infrastructure provisioning and cluster management. But it’s available only within Google Cloud, and you may need additional tools such as Vertex AI and TensorFlow – in addition to Apache Beam – to build end-to-end ML pipelines.
Flink ML is a machine learning library for the open-source Apache Flink. At the beginning of 2022 the community released a major refactor of the prior Flink ML library that extends the Flink ML API and is the first of multiple planned enhancements aimed at opening Flink to a wider range of machine learning use cases, including real-time machine learning scenarios. Explore the Flink ML GitHub repository.
Upsolver empowers data scientists and data engineers to build reliable models and make accurate predictions by unifying historical, live, and labeled data in a simple, declarative way.
Use Upsolver both for preparing training data and as an operational key-value store for joining and serving data in real-time (sub-second latency). Since Upsolver converts batch data to streams, there’s no need to orchestrate two separate ETL flows. You can watch this webinar to learn more. Or see how to use Upsolver and Amazon Sagemaker to train and deploy models within the same architecture, without any of the complex coding you would need if you were building a solution with Apache Spark, Apache Cassandra, and similar tools.
Summary: Spark is powerful, but it is not “one size fits all”
From the discussion above you can see why Spark is popular: it is a high-performance, scalable, general purpose data processing engine. The challenge with Spark is the high degree of technical acumen required to build and maintain apps and processes on top of it. We have offered a quick survey of options for each type of use case if you find the data engineering overhead of Spark to be too resource-intensive.
Try Upsolver SQLake for free for 30 days. SQLake is Upsolver’s newest offering. It lets you build and run reliable data pipelines on streaming and batch data via an all-SQL experience. Try it for free. No credit card required.
If you prefer, you can speak with an expert; please schedule a demo: https://www.upsolver.com/schedule-demo.
And if you have any questions, or wish to discuss this integration or explore other use cases, start the conversation in our Upsolver Community Slack channel.
Still planning out your data lake? Check out the 4 Building Blocks of Streaming Data Architectures, or our recent comparison between Athena and Redshift. If you’re considering managed Spark tools, you might want to read about how Databricks compares to Snowflake. More of a hands-on type? Get a free trial of Upsolver and start building simple, SQL-based data pipelines in minutes!