Stream processing is a critical part of the big data stack in data-intensive organizations. Tools like Apache Storm and Samza have been around for years, and are joined by newcomers like Apache Flink and managed services like Amazon Kinesis Streams.
Today, there are many fully managed frameworks to choose from that all set up an end-to-end streaming data pipeline in the cloud. Making sense of the relevant terms so you can select a suitable framework is often challenging. This guide will shed light on this topic and help you navigate the landscape with ease.
In this post you will learn:
- What are big data stream processing frameworks
- 3 types of stream processing engines
- Architecture and pros and cons of 8 stream processing tools:
What Are Big Data Stream Processing Frameworks?
Developers use stream processing to query continuous data streams and react to important events, within a short timeframe ranking from milliseconds to minutes.
Stream processing is closely related to real time analytics, complex event processing, and streaming analytics. Today stream processing is the primary framework used to implement all these use cases. Stream processing engines are runtime libraries which help developers write code to process streaming data, without dealing with lower level streaming mechanics.
Types of Stream Processing Engines
There are three major types of processing engines.
Open Source Compositional Engines
In a compositional stream processing engines, developers define the Directed Acyclic Graph (DAG) in advance and then process the data. This may simplify code, but also means developers need to plan their architecture carefully to avoid inefficient processing.
Challenges: Compositional stream processing are considered the “first generation” of stream processing and can be complex and difficult to manage.
Examples: Compositional engines include Samza, Apex and Apache Storm.
Managed Declarative Engines
Developers use declarative engines to chain stream processing functions. The engine calculates the DAG as it ingests the data. Developers can specify the DAG explicitly in their code, and the engine optimizes it on the fly.
Challenges: While declarative engines are easier to manage, and have readily-available managed service options, they still require major investments in data engineering to set up the data pipeline, from source to eventual storage and analysis.
Examples: Declarative engines include Apache Spark and Flink, both of which are provided as a managed offering.
Fully Managed Self-Service Engines
A new category of stream processing engines is emerging, which not only manages the DAG but offers an end-to-end solution including ingestion of streaming data into storage infrastructure, organizing the data and facilitating streaming analytics.
Examples: Upsolver is a fully managed stream processing engine which handles huge volumes of streaming data, stores it in a high-performance cloud data lake architecture, and enables real-time access to data and SQL-based analytics. To learn more, check out the architecture overview.
Comparing Popular Stream Processing Frameworks
Spark is an open-source distributed general-purpose cluster computing framework. Spark’s in-memory data processing engine conducts analytics, ETL, machine learning and graph processing on data in motion or at rest. It offers high-level APIs for the programming languages: Python, Java, Scala, R, and SQL.
The Apache Spark Architecture is founded on Resilient Distributed Datasets (RDDs). These are distributed immutable tables of data, which are split up and allocated to workers. The worker executors implement the data. The RDD is immutable, so the worker nodes cannot make alterations; they process information and output results.
Pros: Apache Spark is a mature product with a large community, proven in production for many use cases, and readily supports SQL querying.
- Spark can be complex to set up and implement
- It is not a true streaming engine (it performs very fast batch processing)
- Limited language support
- Latency of a few seconds, which eliminates some real-time analytics use cases
Apache Storm has very low latency and is suitable for near real time processing workloads. It processes large quantities of data and provides results with lower latency than most other solutions.
The Apache Storm Architecture is founded on spouts and bolts. Spouts are origins of information and transfer information to one or more bolts. This information is linked to other bolts, and the entire topology forms a DAG. Developers define how the spouts and bolts are connected.
Source: Apache Storm
- Probably the best technical solution for true real-time processing
- Use of micro-batches provides flexibility in adapting the tool for different use cases
- Very wide language support
- Does not guarantee ordering of messages, may compromise reliability
- Highly complex to implement
Apache Samza uses a publish/subscribe task, which observes the data stream, processes messages, and outputs its findings to another stream. Samza can divide a stream into multiple partitions and spawn a replica of the task for every partition.
Apache Samza uses the Apache Kafka messaging system, architecture, and guarantees, to offer buffering, fault tolerance, and state storage. Samza relies on YARN for resource negotiation. However, a Hadoop cluster is needed (at least HDFS and YARN).
Samza has a callback-based process message API. It works with YARN to provide fault tolerance, and migrates your tasks to another machine if a machine in the cluster fails. Smaza processes messages in the order they were written and ensures that no message is lost. It is also scalable as it is partitioned and distributed at all levels.
- Offers replicated storage that provides reliable persistency with low latency.
- Easy and inexpensive multi-subscriber model
- Can eliminate backpressure, allowing data to be persisted and processed later
- Only supports JVM languages
- Does not support very low latency
- Does not support exactly-once semantics
Flink is based on the concept of streams and transformations. Data comes into the system via a source and leaves via a sink. To produce a Flink job Apache Maven is used. Maven has a skeleton project where the packing requirements and dependencies are ready, so the developer can add custom code.
Apache Flink is a stream processing framework that also handles batch tasks. Flink approaches batches as data streams with finite boundaries.
- Stream-first approach offers low latency, high throughput
- Real entry-by-entry processing
- Does not require manual optimization and adjustment to data it processes
- Dynamically analyzes and optimizes tasks
- Some scaling limitations
- A relatively new project with less production deployments than other frameworks
Amazon Kinesis Streams
Amazon Kinesis Streams is a durable and scalable real time service. It can collect gigabytes of data per seconds from hundreds of thousands of sources, including database event streams, website clickstreams, financial transactions, IT logs, social media feeds, and location-tracking events. The data captured is provided in milliseconds for real time analytics use cases, including real time anomaly detection, real time dashboards, and dynamic pricing.
You can build data-processing applications, called Kinesis Data Stream (KDS) applications. Typically, a kinesis data stream application interprets data from a data stream as data records. The application can run on Amazon EC2 and can use the kinesis client library.
- A robust managed service that is easy to set up and maintain
- Integrates with Amazon’s extensive big data toolset
- Commercial cloud service, priced per hour per shard (see pricing)
Apex offers a platform for batch and stream processing using Hadoop’s data-in-motion architecture by YARN. The platform provides integration with different data platforms. Apex also provides a framework that is easy to use.
Operationally, Apex utilizes native HDFS for persisting state and the YARN features found in Hadoop such as scheduling, resource management, jobs, security, multi-tenancy, and fault-tolerance. Functionally, developers can integrate Apex APIs with other data processing systems.
Apex allows for high throughput, low latency, reliability, and unified architecture, for batch and streaming use cases. It can process unbound data sets, which can grow infinitely.
- Design focuses on enterprise readiness
- Strong processing guarantees (end-to-end exactly once)
- Highly scalable, high throughput with low latency
- Secure, supports fault-tolerance and multi-tenancy
- Apex is no longer widely used and no vendor is currently supporting this framework at scale (see article)
- Limited support for SQL
- Difficult to find skilled users
Flume is a reliable, distributed service for aggregating, collecting and moving massive amounts of log data. It has a flexible and basic architecture. It is fault-tolerant and hardy with failover and recovery features and tunable reliability. It operates an extensible data model that
The key concept behind the design of Flume is to capture streaming data from web servers to Hadoop Distributed File System (HDFS).
- Central master server controls all nodes
- Fault tolerance, failover and advanced recovery and reliability features
- Difficult to understand and configure with complex logical/physical mapping
- Big footprint, over 50,000 lines of Java code
Do it Yourself or End to End Stream Platforms?
There are many excellent options for building stream processing pipelines, but all of them require expertise and hard work to create an end-to-end solution. Managed streaming frameworks like Upsolver can reduce the time required for your streaming project from weeks or months to hours, while allowing compelling use cases like persisting events to a data lake.
Request a free consultation with Upsolver’s streaming data experts
See how you can spin up an end-to-end streaming data pipeline in minutes.
Schedule a Demo