Real-time or near-real-time data processing and analysis are crucial for most big data applications. To run at scale, these applications typically require multiple parallel platforms that are fault-tolerant and can handle stragglers (long-running tasks or slow nodes). Unfortunately, traditional stream processing models struggle with fault recovery, often requiring hot replication and long recovery times, and they handle stragglers poorly.
Both faults and stragglers are inevitable in large data clusters, so streaming applications must recover quickly and efficiently. A 30-second delay while the application recovers from a fault or straggler can make the difference between making the right decision and the wrong one. It is therefore essential to minimize this delay whenever streaming data is used to drive critical decisions.
To address these issues, the research paper “Discretized Streams: Fault-Tolerant Streaming Computation at Scale” (PDF) introduces a new processing model called DStreams or Discretized Streams. These eventually formed the basis for Spark Structured Streaming.
DStreams offer a parallel recovery mechanism that is more efficient than traditional replication and backup schemes while effectively tolerating stragglers.
This article examines the differences between Spark Structured Streaming and Apache Spark Streaming, focusing on how Spark Structured Streaming improves upon the latter.
Apache Spark Streaming Overview
Apache Spark Streaming is a separate library within the Spark engine designed to process streaming, continuously flowing data. It uses the DStream API, powered by Spark RDDs (Resilient Distributed Datasets), to partition data into chunks for processing before forwarding it to its destination.
Resilient Distributed Datasets (RDDs) are a fundamental data structure in Apache Spark and Spark Streaming, providing fault tolerance and efficient parallel processing. An RDD is an immutable, distributed collection of objects that can be stored in memory or on disk and partitioned across multiple nodes in a cluster. RDDs support fault tolerance by dividing data into partitions, each of which can be replicated to multiple nodes in the cluster, enabling recovery from node failures.
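As a sketch of the DStream model, here is the classic word count over a socket source in Scala. The host, port, and batch interval are illustrative assumptions, not values from the article:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DStreamWordCount").setMaster("local[2]")
    // Every 5-second batch of input becomes one RDD in the DStream.
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print() // prints the counts computed for each micro-batch
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that each transformation (`flatMap`, `map`, `reduceByKey`) is applied batch by batch to the underlying RDDs, which is what makes DStreams a sequence of small batch jobs rather than a true record-at-a-time stream.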
Spark Structured Streaming Overview
Introduced in Spark 2.0, Spark Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. Leveraging the Spark SQL library, Structured Streaming provides an enhanced method for handling continuously streaming data without the challenges of fault and straggler handling present in Apache Spark Streaming. The Spark SQL engine incrementally and continuously processes data streams and updates the final result as new data arrives. This streaming model is based on the Dataset and DataFrame APIs and is compatible with Java, Scala, Python, and R.
Spark Structured Streaming Explained
Spark Structured Streaming shipped as part of the Spark 2.0 release. As noted above, it builds on the Spark SQL engine and avoids the fault- and straggler-handling challenges of the original DStream model.
Succinctly stated, the Spark SQL engine takes care of running the data stream “incrementally and continuously and updating the final result as streaming data continues to arrive.”
This streaming model is based on the Dataset and DataFrame APIs, consumable from Java, Scala, Python, and R. Consequently, we can work with streamed data through the untyped DataFrame API, or through typed operations on the Dataset API in Scala and Java.
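The same word count expressed in Structured Streaming looks like ordinary DataFrame code; the stream is treated as an unbounded table that Spark updates incrementally. Again, the socket source details are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

object StructuredWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StructuredWordCount")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Read lines from a socket as an unbounded DataFrame.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // SQL-like transformations; Spark maintains the running counts
    // incrementally as new data arrives.
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    val query = counts.writeStream
      .outputMode("complete") // re-emit the full result table each trigger
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

There is no batch interval in the code: triggering and incremental state management are handled by the engine, which is what allows Structured Streaming to offer its end-to-end guarantees.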
Main Differences Between Spark Streaming and Structured Streaming
Spark Streaming and Spark Structured Streaming differ in APIs, performance, and guarantees. Spark Streaming utilizes the DStream API, while Structured Streaming employs the DataFrame and Dataset APIs. Spark Streaming is designed for continuous transformation, while Structured Streaming allows for SQL-like queries on streaming data. Furthermore, Structured Streaming provides end-to-end guarantees, which Spark Streaming lacks. (Learn more about the limitations of Apache Spark)
DStreams versus DataFrames
Both Apache Spark Streaming and Structured Streaming use micro-batching as their primary processing mechanism. Apache Spark Streaming employs DStreams, whereas Structured Streaming utilizes DataFrames to process data streams. DStreams, represented as sequences of RDD blocks, are suitable for low-level RDD-based batch workloads. However, they are less efficient than Structured Streaming’s DataFrames. The time taken to partition data into RDD chunks can negatively impact analytical outcomes, leading to higher latency and slower data processing through the ETL pipeline.
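To make the "sequence of RDD blocks" point concrete, the DStream API exposes each micro-batch as a raw RDD via `foreachRDD`. The socket source and batch interval below are illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RawBatches {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RawBatches").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Each 1-second micro-batch of the stream arrives as one RDD block,
    // which the application can manipulate with low-level RDD operations.
    ssc.socketTextStream("localhost", 9999).foreachRDD { rdd =>
      println(s"records in this RDD block: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

This direct RDD access is flexible for low-level batch workloads, but it is exactly the layer that Structured Streaming hides behind DataFrames.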
In contrast, Spark Structured Streaming, built on the Spark SQL engine, eliminates the need for direct RDD block access. Instead, it relies on DataFrames, which offer lower latency, higher throughput, and guaranteed message delivery.
RDD API vs DataFrame API
RDDs are distributed collections of data elements, but their API can be opaque and challenging to understand. Additionally, the RDD API does not optimize the data transformation chain, resulting in increased latency and time delays, particularly in the presence of faults and stragglers. On the other hand, Spark DataFrames are distributed collections of data organized into named columns. The DataFrame API provides a higher level of abstraction, enabling users to manipulate data as it moves through the ETL pipeline, with query optimizations executed by Spark through a series of logical and physical plans.
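The contrast can be sketched with the same aggregation written both ways. The data and application names below are made up for illustration (batch code is used for brevity):

```scala
import org.apache.spark.sql.SparkSession

object ApiComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("ApiComparison")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    val words = Seq("error", "info", "error")

    // RDD API: the chain of lambdas is opaque to Spark, so it is
    // executed exactly as written, with no plan-level optimization.
    val rddCounts = spark.sparkContext
      .parallelize(words)
      .map(w => (w, 1))
      .reduceByKey(_ + _)
      .collect()

    // DataFrame API: named columns let the Catalyst optimizer rewrite
    // the logical plan into an efficient physical plan before execution.
    val dfCounts = words.toDF("word")
      .groupBy("word")
      .count()
      .collect()

    spark.stop()
  }
}
```

Both produce the same counts, but only the DataFrame version gives Spark enough structure to optimize the query.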
Additional Limitations of Apache Spark Structured Streaming
- Users must write programming code and understand complex Apache Spark concepts to set up an optimally performing ETL pipeline.
- The number of available connectors is limited, requiring users to develop and maintain their own connector and sink code.
- Apache Spark was originally developed as an on-premises-only solution. Scaling it as part of a cloud solution necessitates complex management and the addition of an orchestration layer. This can be challenging to manage and optimize due to checkpoint complexity, partitioning, shuffling, and spilling to local disks.
Learn more about Spark alternatives.
While both Apache Spark Streaming and Structured Streaming are based on micro-batch processing models, Structured Streaming's use of DataFrames is an improvement over the DStreams of the original Spark Streaming model.
DStreams, composed of sequential RDD blocks, offer fault tolerance, but partitioning streamed data into RDD blocks is slower than processing with the DataFrames employed by Structured Streaming. Consequently, data stream processing and analysis are slower, latency is higher, and message delivery is less reliable.
In contrast, Spark Structured Streaming, built on top of the Apache Spark engine, does not directly access the RDD block construct. Instead, it uses DataFrames as its primary data processing and analysis mechanism. This approach results in a more efficient and reliable interface for software developers with diverse experience in implementing distributed system architectures.
Learn more about stream processing frameworks.