Apache Spark


Apache Spark is a fast, flexible engine for large-scale data processing.

More specifically, Apache Spark is a parallel processing framework that boosts the performance of big-data analytic applications.  It executes streaming, machine learning or SQL workloads that require fast iterative access to large, complex datasets.

Apache Spark’s DAG (directed acyclic graph) execution engine can operate as a stand-alone install, as a cloud service, or in cluster mode in an existing on-premises data center.

Apache Spark provides primitives for in-memory cluster computing. A Spark job can load data into memory, cache it, and query it repeatedly. This in-memory computing is substantially faster than batch-processing frameworks such as MapReduce, which processes data stored on the Hadoop Distributed File System (HDFS).
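As a minimal PySpark sketch of that idea (the events.parquet file and its status column are hypothetical): the dataset is cached after the first read, so repeated queries hit memory instead of re-reading from disk.

    # Minimal caching sketch – file path and column names are illustrative only.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-example").getOrCreate()

    events = spark.read.parquet("events.parquet")
    events.cache()  # keep the DataFrame in executor memory after first use

    # Repeated queries reuse the cached partitions instead of rereading from disk
    total = events.count()
    errors = events.filter(events.status == "ERROR").count()
    print(total, errors)

    spark.stop()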

Now arguably one of the most active Apache projects, Spark was initially created to improve processing on Hadoop systems. It provides APIs that enable developers to build applications in Java, Python, Scala, and R, and as a result it has a large, vibrant ecosystem that is continually evolving.

Apache Spark works best for ad-hoc work and large batch processes.  Scheduling Spark jobs or managing them over a streaming data source requires extensive coding, and many organizations struggle with Spark’s complexity and engineering costs.  Also, while learning and using Spark for ad-hoc querying requires some knowledge of how distributed systems work, the expertise required to use it efficiently and correctly in production systems is expensive and difficult to obtain.  Finally, some organizations might require fresher data than Spark’s batch processing is able to deliver.

How to use Apache Spark

Apache Spark runs on both Windows and UNIX-like systems (Linux, macOS) and works on any platform that runs a supported version of Java.

Apache Spark can run on its own or on top of several existing cluster managers. It supports four cluster manager types (a minimal configuration sketch follows the list):

  1. Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
  2. Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications.
  3. Hadoop YARN – the resource manager in Hadoop 2.
  4. Kubernetes – an open-source system for automating deployment, scaling, and management of containerized applications.
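The cluster manager is selected through the master URL a job is submitted with, either via spark-submit or, as sketched below in PySpark, on the SparkSession builder. The host names and ports are placeholders; in real deployments the master is usually supplied by spark-submit rather than hard-coded.

    # Illustrative master URLs only – hosts and ports are placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("cluster-manager-example")
        # .master("spark://master-host:7077")          # Spark standalone
        # .master("mesos://mesos-master:5050")         # Apache Mesos
        # .master("yarn")                              # Hadoop YARN
        # .master("k8s://https://k8s-apiserver:6443")  # Kubernetes
        .master("local[*]")                            # local testing, all cores
        .getOrCreate()
    )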

Many data scientists, analysts, and general business intelligence users rely on interactive SQL queries for exploring data. Spark SQL is a Spark module for structured data processing.
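As a brief illustration, here is a hedged PySpark sketch of Spark SQL: a small DataFrame is registered as a temporary view and queried with standard SQL. The sales data and column names are made up.

    # Spark SQL sketch – register a DataFrame as a view and query it with SQL.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

    sales = spark.createDataFrame(
        [("US", 120.0), ("DE", 80.0), ("US", 45.5)],
        ["country", "amount"],
    )
    sales.createOrReplaceTempView("sales")

    spark.sql("""
        SELECT country, SUM(amount) AS total
        FROM sales
        GROUP BY country
        ORDER BY total DESC
    """).show()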

When Spark was first released, developers interacted with it via RDDs (Resilient Distributed Datasets). RDDs are an abstraction: an immutable, distributed collection of objects partitioned across the nodes of the cluster. Operations on RDDs execute in parallel and fall into two groups:

  1. Transformations are operations on an RDD that produce a new RDD – join, filter, map, and so on.
  2. Actions compute a result from an RDD and return a value to the driver – count, reduce, and so on (see the sketch below).
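A minimal PySpark sketch of the distinction: the filter and map transformations below are lazy and only describe a new RDD, while count and reduce are actions that actually trigger the distributed computation.

    # Transformations vs. actions on a small RDD of the numbers 1..10.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-example").getOrCreate()
    sc = spark.sparkContext

    numbers = sc.parallelize(range(1, 11))

    # Transformations: nothing executes yet
    evens   = numbers.filter(lambda n: n % 2 == 0)
    squares = evens.map(lambda n: n * n)

    # Actions: these kick off the computation
    print(squares.count())                     # 5
    print(squares.reduce(lambda a, b: a + b))  # 4 + 16 + 36 + 64 + 100 = 220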

Spark 2.0, however, consolidated DataFrames and Datasets, APIs that provide a higher level of abstraction and have since displaced RDDs as the most commonly used APIs, especially in streaming architectures. DataFrames and Datasets are the building blocks of Spark Structured Streaming. In Structured Streaming, Spark divides the incoming stream into micro-batches to reduce latency. Spark 2.3 added Continuous Processing, which reduces end-to-end latency to as low as 1 millisecond.
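Below is a hedged Structured Streaming sketch in PySpark, based on the classic word-count example: lines arriving on a socket are split into words, counted, and written to the console in micro-batches. The host and port are placeholders.

    # Structured Streaming sketch – micro-batch word count over a socket source.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("structured-streaming-example").getOrCreate()

    lines = (
        spark.readStream
        .format("socket")
        .option("host", "localhost")   # placeholder source
        .option("port", 9999)
        .load()
    )

    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    query = (
        counts.writeStream
        .outputMode("complete")
        .format("console")
        .trigger(processingTime="5 seconds")  # emit a micro-batch every 5 seconds
        .start()
    )
    query.awaitTermination()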

In general, you can use Apache Spark for common applications such as real-time marketing campaigns, online product recommendations, cybersecurity analytics, and machine log monitoring.  Some popular Apache Spark use cases include:

  • Machine learning.  Spark is designed to “…make practical machine learning scalable and easy.”  It enables ML algorithms to run in a distributed environment at scale, although implementing machine learning can still be challenging (a short MLlib sketch follows this list).
  • ETL.  Spark can wrangle datasets that are typically too large and expensive to transform using relational databases.
  • Analytical processing on streaming data, for example from sensors on a factory floor.
  • Applications requiring multiple operations, such as most machine-learning algorithms.   
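As an illustration of the machine-learning use case above, the following PySpark sketch fits a logistic regression with MLlib on a tiny, made-up DataFrame; the feature columns f1 and f2 and their values are purely illustrative.

    # MLlib sketch – assemble features and fit a logistic regression on toy data.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("mllib-example").getOrCreate()

    data = spark.createDataFrame(
        [(0.0, 1.2, 0), (1.5, 0.3, 1), (0.2, 1.0, 0), (2.1, 0.1, 1)],
        ["f1", "f2", "label"],
    )

    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    train = assembler.transform(data)

    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
    model.transform(train).select("label", "prediction").show()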

Benefits of Apache Spark

The primary reasons organizations employ an Apache Spark architecture as part of their overall data lake architecture are:

  • Speed
  • Versatility
  • Fault tolerance
  • Support

Speed — Apache Spark is faster than its predecessor MapReduce; Spark works on the whole data set in one fell swoop, whereas MapReduce operates in steps.  Spark operates both in-memory and on disk, reading and analyzing data and then writing results to the cluster in near real-time.

Versatility — It’s relatively straightforward to leverage Apache Spark’s core APIs to write robust applications in a range of languages including Java, Scala, Python, and R.  In addition to accessing a range of data sources, Spark runs almost anywhere — on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.

Spark also supports real-time streaming.  And it can combine multiple workloads into one seamless job — that is, querying, analytics, machine learning, and graph processing.

Fault tolerance — Apache Spark distributes RDDs across the cluster, in memory or on disk, and can recover fully from faults or failures.  For real-time streams, Spark replicates received data across nodes, so lost data can be recovered from the replicas, keeping live-streamed data fault tolerant.

Support — Because it’s open source, Apache Spark is supported by a large global development community that provides documentation, regular build updates, and even tech support for Spark-based solutions.

Apache Spark has become the most widely used tool for big data processing.  But it remains a complex framework that requires substantial specialized knowledge, and it often demands time-intensive, costly custom coding to realize the desired use case.  So organizations seeking to implement or extend a data lake often evaluate one or more of the many Apache Spark alternatives.
