Trying to build a file system for your data lake? Check out our free ebook on data partitioning on S3. If you’re looking for a Spark alternative that offers true self-service with no data engineering bottlenecks, you can get our technical whitepaper to learn more about Upsolver.
Apache Spark is an open-source framework for distributed data processing, which has become an essential tool for most developers and data scientists who work with Big Data. Spark is powerful and useful for diverse use cases, but it is not without drawbacks. Many organizations struggle with the complexity and engineering costs of managing Spark, or they might require fresher data than Spark’s batch processing is able to deliver.
In this article we’ll look at four common use cases for Apache Spark and suggest a few alternatives for each one.
ETL
Spark is frequently used as an ETL tool for wrangling datasets that are too large to transform using relational databases. This is commonly the case with streaming data, which tends to be both voluminous and complex due to its semi-structured nature.
In these scenarios, Spark will often be the default choice as it is fully-featured enough to process very large volumes of data. However, there is often a lot of manual effort required to optimize Spark code as well as manage clusters and orchestrate workflows; in addition, data might be delayed for up to 24 hours before it is actually available to query due to latencies that result from batch processing.
Spark alternatives for ETL:
- Open-source frameworks: Apache Storm and Apache Flink offer real-time stream processing, while Apache Flume is a popular choice for collecting and moving large amounts of log data (see our comparison of open-source stream processing frameworks). While these alternatives could be a better fit than Spark for particular use cases such as real-time stream processing, they are still complex and require extensive engineering effort to work properly.
- AWS Glue offers a serverless environment to run Spark ETL jobs using resources that it provisions automatically. This can reduce the hassle of ongoing cluster management, but data freshness could still be an issue, and a lot of optimization still needs to be done on the storage layer to get good query performance (e.g. compacting small files on S3).
- Upsolver is a fully-managed, self-service data lake ETL tool that combines batch and stream processing, automatic orchestration, and metadata management using only SQL.
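To make the small-files point above more concrete, here is a minimal sketch in plain Python of how a compaction job might group many small objects into fewer, larger ones. The `plan_compaction` function, the object keys and the 128 MB target size are all hypothetical and chosen for illustration only. The code plans the batches and does not read from or write to S3.

```python
from typing import Dict, List


def plan_compaction(sizes: Dict[str, int], target_bytes: int) -> List[List[str]]:
    """Greedily group object keys into batches of roughly target_bytes each."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    # Sort by key so files from the same partition and time range stay together.
    for key in sorted(sizes):
        size = sizes[key]
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches


MB = 1024 * 1024
# Hypothetical listing: ten 40 MB event files in a single partition.
small_files = {f"events/dt=2020-01-01/part-{i:04d}.json": 40 * MB for i in range(10)}
plan = plan_compaction(small_files, target_bytes=128 * MB)
print(len(small_files), "files ->", len(plan), "compacted files")
```

Each inner list would become one merged output file. A real job would also need to rewrite the data and swap the files in atomically, which is where much of the engineering effort tends to go.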
Ad-hoc data exploration and research
In this scenario, the user is typically a data scientist who is trying to answer a specific business question with a very large dataset. In order to do so they need to understand the data, clean it, and combine it with other data sources.
For larger and more complex datasets, this is an excellent use case for Apache Spark and one where it has few competitors. However, if the data is smaller or less complex, simpler alternatives can get the job done.
Spark alternatives for data discovery:
- For data that can be processed locally, you could use Python or R, which most data scientists will be very well-versed in
- Relational databases such as MySQL
- Amazon Athena can be used to query terabytes of data. If you’re running this query repeatedly, you should definitely invest in data preparation to reduce costs and improve performance, which would either bring you back to Spark or lead you to use a tool such as AWS Glue or Upsolver (see above under “Spark alternatives for ETL”).
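For data small enough to process locally, the understand, clean and combine steps described above can be sketched with nothing more than Python's standard library. The sample data, column names and the `revenue_by_country` helper below are invented for illustration. In practice a data scientist would more likely reach for a dataframe library, but the logic is the same.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; in practice these would be files on disk.
ORDERS_CSV = """order_id,user_id,amount
1,u1,20.50
2,u2,
3,u1,5.00
4,u3,12.25
"""
USERS_CSV = """user_id,country
u1,US
u2,DE
u3,US
"""


def revenue_by_country(orders_csv: str, users_csv: str) -> dict:
    """Join orders to users and sum the order amounts per country."""
    # Build a lookup table from the smaller dataset (the join side).
    users = {
        row["user_id"]: row["country"]
        for row in csv.DictReader(io.StringIO(users_csv))
    }
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(orders_csv)):
        # Clean: skip rows where the amount is missing.
        if not row["amount"]:
            continue
        country = users.get(row["user_id"], "unknown")
        totals[country] += float(row["amount"])
    return dict(totals)


result = revenue_by_country(ORDERS_CSV, USERS_CSV)
print(result)
```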
Business intelligence and reporting
Companies that work with event streams will often have use cases around business intelligence, analytics and reporting to internal and external stakeholders (e.g., a dashboard summarizing user interactions with mobile apps). Since the data is semi-structured at best, it needs to be transformed and structured before it can be visualized with tools such as Tableau, Looker or Sisense.
Using Spark for these pipelines poses two problems. First, Spark is built more for ad-hoc jobs than for production systems. Second, there is a disconnect between the BI developer who builds the dashboards and the data engineer who must constantly write and update Spark jobs whenever new data is needed.
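As a rough illustration of what structuring semi-structured event data involves, the sketch below flattens a nested event into a single row with dotted column names, the kind of tabular shape a BI tool expects. The `flatten` helper and the sample event are hypothetical. Real pipelines also have to deal with arrays, changing schemas and type conflicts, which this sketch ignores.

```python
from typing import Any, Dict


def flatten(event: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dictionaries into one level with dotted column names."""
    row: Dict[str, Any] = {}
    for key, value in event.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the column prefix.
            row.update(flatten(value, prefix=f"{column}."))
        else:
            row[column] = value
    return row


# Hypothetical mobile app event.
event = {
    "user": {"id": "u1", "device": {"os": "iOS"}},
    "action": "click",
    "ts": 1577836800,
}
row = flatten(event)
print(row)
```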
Spark alternatives for BI and reporting:
- High-performance data warehouses such as Snowflake and Google BigQuery can provide excellent performance and a self-service experience for BI developers; however, they become prohibitively expensive at higher scales.
- Upsolver provides automated, production-ready ETL pipelines for streaming data on Amazon S3, including native integration with query engines such as Amazon Athena.
Machine learning
Machine learning is a complex process with many moving parts, including building a training dataset with labeled data, training the ML model, and then deploying it to production. Each of these stages poses its own challenge to the data scientist who programs and trains the model, as well as the data engineer responsible for supplying structured data in a timely fashion.
Apache Spark can be used to build the training dataset due to its ability to perform large-scale transformations on complex data. However, when deploying that model to production, one would need a separate system capable of serving data in real time, typically a key-value store such as Redis or Cassandra.
As we’ve detailed in our previous blog post on orchestrating batch and streaming ETL for machine learning, the need to manage two separate architectures and ensure they produce the same results is one of the foremost obstacles for current data science projects.
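One common way to reduce the risk of the two architectures drifting apart is to define each feature once and reuse that definition in both the batch and the real-time path. The sketch below illustrates the idea under simplifying assumptions: an in-memory dictionary stands in for a key-value store such as Redis, and the `update` function, the `OnlineStore` class and the events are hypothetical. It is not a description of how any particular product works.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[str, float]  # (user_id, purchase_amount)


def update(state: Dict[str, float], event: Event) -> None:
    """Single definition of the feature: total spend per user."""
    user_id, amount = event
    state[user_id] += amount


def build_training_features(history: Iterable[Event]) -> Dict[str, float]:
    """Batch path: replay all historical events at once."""
    state = defaultdict(float)
    for event in history:
        update(state, event)
    return dict(state)


class OnlineStore:
    """Streaming path: an in-memory dict standing in for a key-value store."""

    def __init__(self) -> None:
        self.state = defaultdict(float)

    def ingest(self, event: Event) -> None:
        # Apply the same feature logic to each event as it arrives.
        update(self.state, event)

    def get(self, user_id: str) -> float:
        # Look up the current feature value without creating new keys.
        return self.state.get(user_id, 0.0)


events = [("u1", 10.0), ("u2", 5.0), ("u1", 2.5)]
batch = build_training_features(events)
store = OnlineStore()
for e in events:
    store.ingest(e)
print(batch, store.get("u1"))
```

Because both paths call the same `update` function, the value served at prediction time matches the value the model saw during training, provided both paths receive the same events.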
Spark alternatives for machine learning:
- Google Dataflow provides a unified platform for batch and stream processing, but it is only available within Google Cloud, and additional tools are required to build end-to-end ML pipelines.
- FlinkML is a machine learning library for (open-source) Apache Flink
- Upsolver can be used both for preparing training data and as an operational key-value store for joining and serving data in real-time (sub-second latency). Since everything is done using the same platform, there’s no need to orchestrate two separate ETL flows. You can watch this webinar to learn more.
Still planning out your data lake? Check out the 4 Building Blocks of Streaming Data Architectures, or our recent comparison between Athena and Redshift. More of a hands-on type? Get a free trial of Upsolver and start building simple, SQL-based data pipelines in minutes!