Apache Spark Streaming is an extension of the core Apache Spark API, a distributed general-purpose cluster computing framework that natively supports both batch and streaming workloads. Spark Streaming serves as the entry point for live streaming data, and allows data engineers and data scientists to process real-time data from sources including, but not limited to Kafka, Kinesis, Flume and web APIs such as Twitter. Processed data from these sources can be delivered to file systems, databases, live dashboards or other destinations.
How does Spark Streaming handle these various sources, and deliver to its targets? The key abstraction used by Spark Streaming is called a Discretized Stream, or DStream, which in turn is built upon Spark’s key abstraction, RDDs, or resilient distributed datasets. The DStream represents a continuous stream of data, either the input stream from a designated source, or the processed stream generated by transforming the source stream. Each RDD in a DStream contains data from a certain interval. Internally, the flow is as follows:
The underlying RDD functionality ensures that Spark Streaming seamlessly integrates with other Spark components such as MLlib (Spark’s Machine Learning Library) and Spark SQL. The key advantage of this is that a single framework can be used to satisfy advanced processing needs.
Kafka is a message queue that is used to move one or more streams of data from producers – mobile apps, remote servers, web services, etc. – to consumers such as databases or apps. Kafka is not an ETL tool or a database and has limited capabilities when it comes to data transformation. It i\ typically used to read a stream of data and write it to target with little or no changes.
Spark Streaming, on the other hand, is primarily a streaming ETL framework – used to transform and operationalize data after it has already landed in storage. So while both of these tools are part of a streaming data toolkit, they are typically utilized in separate parts of the value chain. In fact, they can be used together, with Kafka as one of many potential sources of streaming data.
Spark Structured Streaming is part of the Spark 2.0 release, and holds many similarities to Spark Streaming. There are some key differences though. Spark Streaming is based on the core elements of Spark, and while DStreams offers some optimization, the fundamental reliance on the base RDD abstraction means that it is not keyed to any particular workload. In a complex environment with multiple sources and outputs, Spark Streaming can be a better option, simply because of a greater degree of flexibility, while maintaining the advantage of being able to leverage the Spark topology.
“Data can be ingested from many sources like Kafka, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like ‘map’, ‘reduce’, ‘join’ and ‘window’.” – Spark Streaming Programming Guide
It has been argued that Spark Streaming is not truly stream processing, even with the DStream abstraction layer, because it still breaks down streams of data into what can be termed micro-batches, using the RDDs handled by the Apache Spark engine. Batch-based processing of a stream of data is seen as problematic because of the potential for data to arrive out of order and/or late. Spark Streaming batches, no matter how small/micro, still contain a collection of events that arrive over the batch period, regardless of when the data is actually created.
This essentially means that data may have to be processed more than once, or be held in wait until the proper sequence of batches can be established, resulting in latency due to ‘stragglers’ or faults. The Spark Engine accounts for this in terms of data integrity, ensuring that data is processed in order and in its entirety via the RDDs, but because RDDs are not optimized for some datasets, queries and processing directly against them can be problematic, or at least, cannot be performed quickly enough to be considered real-time.
Spark Structured Streaming, on the other hand, is also a fault-tolerant and scalable stream processing engine, but built on the Spark SQL engine, rather than directly on the Spark engine itself. It offers an improved way of handling continuously streaming data, limiting the challenges of faults and stragglers mentioned above. Because it is based on the Spark SQL engine, you can express your streaming computations in the same way you would batch computations on static data. Spark SQL’s engine runs the streaming data incrementally and continuously, updating the final result when the data finally arrives.
“Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.” – Spark Structured Streaming Programming Guide
Structured Streaming also makes use of incremental batches of data, but because of the DataFrames API it uses, and its ties to the Spark SQL engine, more actions can be performed on the data in transit, and the handling of faults and stragglers is both smoother and more intuitive. The Spark Dataframe is characterized as “a distributed collection of data organized into named columns”. This is conceptually the same as a relational database table but with richer features, allowing users to manipulate data as it moves through an ETL pipeline.
However, there are some limitations to Structured Streaming – while you may be able to do more with the data in transit, there are fewer connector options available for Structured Streaming. Users must develop and maintain their own connectors for out-of-box functionality. This makes it less flexible in terms of data sources or targets. For both Spark Streaming and Structured Streaming, a fair bit of first hand experience and learning is required. To create an ETL pipeline solution, users must write programming code (Python, Scala or Java) and understand complex Apache Spark concepts. Also, Apache Spark is fundamentally designed as an on-premise solution, making scaling as a managed cloud solution difficult, requiring a complex management structure, and an orchestration layer.