The vast majority of Big Data applications must process and analyze data in real time or, more accurately, near real time, since textbook real-time processing is rarely achievable at scale. Running these applications at scale typically requires parallel platforms that are fault-tolerant and can handle stragglers, or slow nodes.
Unfortunately, traditional stream processing models recover from faults expensively: they rely on costly hot replication or suffer long recovery times, and they do not handle stragglers well.
To solve this challenge, the authors of the research paper titled “Discretized Streams: Fault-Tolerant Streaming Computation at Scale” propose a new processing model: Discretized Streams, or D-Streams.
To quote the authors,
“D-Streams enable a parallel recovery mechanism that improves efficiency over traditional replication and backup schemes and tolerates stragglers.”
Both faults and stragglers are inevitable in large data clusters. Therefore, streaming applications must recover quickly and efficiently. A 30-second delay while the streaming application tries to recover from a fault or straggler can make the difference between making the right decision or not. Therefore, it is essential to reduce this delay when streamed data is used to make critical decisions.
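The intuition behind D-Streams’ parallel recovery can be sketched in plain Python (a hypothetical simulation, not Spark code): when a node fails, the lost partitions are recomputed from their lineage in parallel across the surviving nodes, so recovery time shrinks as the cluster grows, instead of one standby node replaying everything serially.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def recompute(partition_id):
    """Recompute one lost partition from its lineage (simulated work)."""
    time.sleep(0.05)  # stand-in for a deterministic recomputation
    return partition_id, "recovered"

lost_partitions = list(range(8))

# Serial recovery: one partition at a time (roughly 8 x 0.05 s).
start = time.perf_counter()
serial = [recompute(p) for p in lost_partitions]
serial_time = time.perf_counter() - start

# Parallel recovery: surviving workers recompute partitions concurrently.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(recompute, lost_partitions))
parallel_time = time.perf_counter() - start

assert serial == parallel           # same deterministic results either way
assert parallel_time < serial_time  # parallelism shortens the recovery window
```

The key enabler is determinism: because each partition can be rebuilt from its recorded lineage, any idle node can take over any lost piece of work.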
To understand the difference between these concepts and products, let’s consider a succinct definition of each and why Spark Structured Streaming is an improvement on the Apache Spark Streaming model.
Apache Spark Streaming Explained
Apache Spark Streaming is a separate library in the Spark engine designed to process streaming or continuously flowing data. It utilizes the DStream API, powered by Spark RDDs (Resilient Distributed Datasets), to divide the data into chunks before processing it. Once these data chunks have been processed, they are sent to their destination.
As an aside, Resilient Distributed Datasets (RDDs) are a fundamental data structure of Spark. According to the Apache Spark documentation, an RDD is an “immutable distributed collection of objects.” In other words, each RDD is a read-only partitioned collection of data records. RDDs are created via deterministic operations on either data in stable storage or other RDDs. Finally, the RDD construct is fault-tolerant and can be operated on in parallel.
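These RDD properties can be mimicked with a tiny pure-Python class (a hypothetical illustration, not the Spark API): an immutable, partitioned collection whose transformations create new collections and record lineage rather than mutating data, so any partition can be rebuilt deterministically.

```python
class MiniRDD:
    """Toy stand-in for an RDD: read-only partitions plus recorded lineage."""
    def __init__(self, partitions, lineage="source"):
        self.partitions = tuple(tuple(p) for p in partitions)  # immutable
        self.lineage = lineage

    def map(self, fn):
        # A transformation yields a *new* MiniRDD; the parent is untouched,
        # and the lineage records how to recompute this data if lost.
        new_parts = [[fn(x) for x in part] for part in self.partitions]
        return MiniRDD(new_parts, lineage=(self.lineage, "map", fn))

    def collect(self):
        return [x for part in self.partitions for x in part]

base = MiniRDD([[1, 2], [3, 4]])       # two partitions
doubled = base.map(lambda x: x * 2)

assert base.collect() == [1, 2, 3, 4]     # parent unchanged (read-only)
assert doubled.collect() == [2, 4, 6, 8]  # derived deterministically
```

Because each partition of `doubled` depends only on the corresponding partition of `base` and a deterministic function, a lost partition can be recomputed independently, which is the essence of RDD fault tolerance.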
Before we look at the limitations of this product, let’s consider the Spark structured streaming construct.
Spark Structured Streaming Explained
Spark Structured Streaming was introduced in the Spark 2.0 release. It is a scalable, fault-tolerant stream processing engine built on the Spark SQL engine, and it offers an improved way to handle continuously streaming data without the fault- and straggler-handling challenges described above.
Succinctly stated, the Spark SQL engine takes care of running the data stream “incrementally and continuously and updating the final result as streaming data continues to arrive.”
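What “incrementally and continuously updating the final result” means can be sketched without Spark (a hypothetical pure-Python simulation, not the Spark SQL engine): the engine keeps running state and folds each new micro-batch into it, rather than recomputing over all data seen so far.

```python
from collections import Counter

running_counts = Counter()  # the continuously updated "final result"

def process_batch(batch):
    """Fold one micro-batch of words into the running aggregate."""
    running_counts.update(batch)  # incremental: only the new data is touched
    return dict(running_counts)

# Streaming data arrives as successive micro-batches.
result = process_batch(["spark", "stream"])
result = process_batch(["spark", "sql"])

assert result == {"spark": 2, "stream": 1, "sql": 1}
```

Each batch costs time proportional to its own size, not to the total history of the stream, which is why incremental execution keeps latency low as the stream grows.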
This streaming model is based on the Dataset and DataFrame APIs, consumable in Java, Scala, Python, and R. Consequently, we can express streaming computations with the DataFrame API, or apply typed operations on the streamed data using the Dataset API.
Apache Spark and Spark Structured Streaming Compared
From the discussion highlighted above, it is clear that Apache Spark Streaming and Spark Structured Streaming are similar; there is considerable overlap between the two. Thus, by way of defining the difference between these two constructs, let’s consider Apache Spark Streaming’s limitations.
DStreams versus DataFrames
Both the Apache Spark Streaming and the Structured Streaming models use micro- (or mini-) batching as their primary processing mechanism. The difference lies in the details: Apache Spark Streaming uses DStreams, while Structured Streaming uses DataFrames to process the streams of data pouring into the analytics engine.
DStreams are represented as sequences of RDD blocks, which makes them a natural fit for low-level, RDD-based batch workloads. The challenge with DStreams and their RDD blocks, however, is that this representation is not as efficient as Structured Streaming’s DataFrames.
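The “sequence of RDD blocks” idea amounts to slicing the incoming stream into fixed time intervals. A hypothetical sketch (plain Python, not the DStream API):

```python
def discretize(events, interval):
    """Group (timestamp, value) events into fixed-interval batches,
    mimicking how a DStream becomes a sequence of per-interval RDDs."""
    batches = {}
    for ts, value in events:
        batches.setdefault(ts // interval, []).append(value)
    return [batches[k] for k in sorted(batches)]

# Events tagged with arrival times 0..5; a 2-unit batch interval
# yields three discrete batches, each processed as one unit.
events = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (5, "e")]
assert discretize(events, interval=2) == [["a", "b"], ["c", "d"], ["e"]]
```

The batch interval is the knob that trades latency for throughput: shorter intervals deliver results sooner but amortize scheduling overhead over less data.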
While RDDs are fault-tolerant and support parallel processing, the overhead of dividing the data into RDD chunks can harm the analytical outcomes of time-critical data. Consequently, this data processing model has a relatively high latency, slowing the streamed data’s path through the ETL pipeline.
As described above, Spark’s Structured Streaming model is an extension built on top of the Spark SQL engine. Therefore, users no longer need to access the RDD blocks directly. The Structured Streaming model utilizes DataFrames, which offer the benefits of lower latency, greater throughput, and guaranteed message delivery.
The RDD API vs. the DataFrame API
As described above, the RDD construct is a distributed collection of data elements. However, RDDs present challenges: the data and functions inside an RDD are opaque to the Spark engine, so it cannot effectively optimize them. As a result, the RDD API does not optimize the data transformation chain, leading to increased latency and time delays, especially when recovering from faulty or slow nodes.
A Spark DataFrame is defined as a “distributed collection of data organized into named columns.” It is conceptually the same as a relational database table but with richer features. Thus, the DataFrame API provides a higher level of abstraction allowing the users to manipulate the data as it moves through the ETL pipeline.
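The difference in abstraction level can be illustrated with a toy example (hypothetical Python, not the Spark API): an RDD-style lambda is an opaque function the engine cannot look inside, whereas a named-column predicate is data the engine can inspect, reorder, or push down.

```python
rows = [
    {"user": "ann", "latency_ms": 12},
    {"user": "bob", "latency_ms": 95},
    {"user": "cid", "latency_ms": 7},
]

# RDD-style: an opaque function; an optimizer cannot see what it does.
slow_rdd_style = [r for r in rows if (lambda x: x["latency_ms"] > 50)(r)]

# DataFrame-style: the predicate is plain data (column, op, value), so an
# engine could inspect it, push it down to storage, or skip partitions.
predicate = ("latency_ms", ">", 50)

def apply_predicate(row, pred):
    col, op, value = pred
    assert op == ">"  # only '>' is supported in this sketch
    return row[col] > value

slow_df_style = [r for r in rows if apply_predicate(r, predicate)]

assert slow_rdd_style == slow_df_style == [{"user": "bob", "latency_ms": 95}]
```

Both styles produce the same answer; the point is that only the declarative form gives the engine enough structure to optimize the query before executing it.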
Spark handles these query optimizations through three logical plans and one physical plan:
- First, the query is parsed into a logical plan.
- Second, that plan is analyzed, resolving references against the data.
- Third, the analyzed plan is optimized by a further logical plan.
- Finally, the optimized plan is executed as a physical plan.
The first two logical plans work through a series of rules to parse and analyze the query. The optimization stage then allows Spark to apply a set of built-in and user-defined optimization rules. These plans all sit behind the DataFrame API.
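The four stages can be mimicked in miniature (hypothetical Python, loosely patterned on this parse/analyze/optimize/execute pipeline, not Spark’s actual classes): each stage rewrites a small plan tree until a physical plan can run.

```python
# A plan is a nested tuple, e.g. ("filter", pred, ("scan", data)).
data = [1, 2, 3, 4, 5, 6]

# "Parsed" plan: two stacked filters over a table scan.
parsed = ("filter", lambda x: x > 1,
          ("filter", lambda x: x > 3,
           ("scan", data)))

def analyze(plan):
    """Analysis would resolve column names; this toy plan is already bound."""
    return plan

def optimize(plan):
    """Optimizer rule: collapse two adjacent filters into one node."""
    if plan[0] == "filter" and plan[2][0] == "filter":
        p1, p2, child = plan[1], plan[2][1], plan[2][2]
        return ("filter", lambda x: p1(x) and p2(x), optimize(child))
    return plan

def execute(plan):
    """The 'physical plan': walk the tree and produce rows."""
    if plan[0] == "scan":
        return list(plan[1])
    if plan[0] == "filter":
        return [x for x in execute(plan[2]) if plan[1](x)]

optimized = optimize(analyze(parsed))
assert optimized[2][0] == "scan"        # the two filters were merged
assert execute(optimized) == [4, 5, 6]  # same answer, one less tree node
```

Rule-based rewrites like this filter merge are exactly the kind of optimization an engine can only perform when the query is expressed as an inspectable plan rather than opaque user functions.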
Other Apache Spark Structured Streaming Limitations
Finally, there are additional limitations to consider when using Apache Spark Structured Streaming to process data streams.
- Users must write programming code and understand complex Apache Spark concepts to set up an optimal performing ETL pipeline.
- The number of offered connectors is minimal. Consequently, users must develop and maintain their own connector and sink source code.
- Apache Spark was originally developed as an on-premises-only solution. Therefore, scaling it as part of a cloud solution requires both complex management and the addition of an orchestration layer. This isn’t easy to manage and optimize due to the checkpoint complexity, partitioning, shuffling, and spilling to local disks.
As described above, even though both Apache Spark streaming and structured streaming are based on mini-batch processing models, the structured streaming construct using DataFrames is an improvement on the use of DStreams as found in the original Apache Spark streaming model.
In summary, DStreams are constructed of sequential RDD blocks. This construct is fault-tolerant, but dividing the streamed data into RDD blocks is slower than using the DataFrames employed by Structured Streaming. Therefore, the processing and analysis of the data streams are slower, increasing latency and reducing message-delivery reliability.
By contrast, because Structured Streaming is built on top of the Spark SQL engine, it does not access the RDD block construct directly, instead using DataFrames as its primary data processing and analysis mechanism.
As a result, the Spark structured streaming model provides a useful interface for accomplished software developers with a wide range of experience in implementing distributed system architectures.