This article is an excerpt from our comprehensive, 40-page eBook: The Architect’s Guide to Streaming Data and Data Lakes. Read on to discover design patterns and guidelines for for streaming data architecture, or get the full eBook now (FREE) for in-depth tool comparisons, case studies, and a ton of additional information.
Streaming data is becoming a core component of enterprise data architecture due to the explosive growth of data from non-traditional sources such as IoT sensors, security logs, and web applications.
Streaming technologies are not new, but they have considerably matured in recent years. The industry is moving from painstaking integration of open-source Spark/Hadoop frameworks, towards full stack solutions that provide an end-to-end streaming data architecture built on the scalability of cloud data lakes.
What is Streaming Data and Streaming Data Architecture?
Streaming data refers to data that is continuously generated, usually in high volumes and at high velocity. A streaming data source would typically consist of a stream of logs that record events as they happen – such as a user clicking on a link in a web page, or a sensor reporting the current temperature.
Common examples of streaming data include:
- IoT sensors
- Server and security logs
- Real-time advertising
- Click-stream data from apps and websites
In all of these cases we have end devices that are continuously generating thousands or millions of records, forming a data stream – unstructured or semi-structured form, most commonly JSON or XML key-value pairs. Here’s an example of how a single streaming event would look – in this case the data we are looking at is a website session:
A single streaming source will generate massive amounts of these events every minute. In its raw form, this data is very difficult to work with as the lack of schema and structure makes it difficult to query with SQL-based analytic tools; instead, data needs to be processed, parsed, and structured before any serious analysis can be done.
Learn more about common streaming data use cases.
Streaming Data Architecture
A streaming data architecture is a framework of software components built to ingest and process large volumes of streaming data from multiple sources. While traditional data solutions focused on writing and reading data in batches, a streaming data architecture consumes data immediately as it is generated, persists it to storage, and may include various additional components per use case – such as tools for real-time processing, data manipulation, and analytics.
Streaming architectures must account for the unique characteristics of data streams, which tend to generate massive amounts of data (terabytes to petabytes) that it is at best semi-structured and requires significant pre-processing and ETL to become useful.
Stream processing is a complex challenge rarely solved with a single database or ETL tool – hence the need to “architect” a solution consisting of multiple building blocks. Part of the thinking behind Upsolver is that many of these building blocks can be combined and replaced with declarative data pipelines within the platform, and we will demonstrate how this approach manifests within each part of the streaming data supply chain.
Why Streaming Data Architecture? Benefits of Stream Processing
Stream processing used to be a niche technology used only by a small subset of companies. However, with the rapid growth of SaaS, IoT, and machine learning, organizations across industries are now dipping their toes into streaming analytics. It’s difficult to find a modern company that doesn’t have an app or a website; as traffic to these digital assets grows, and with the increasing appetite for complex and real-time analytics, the need to adopt modern data infrastructure is quickly becoming mainstream.
While traditional batch architectures can be sufficient at smaller scales, stream processing provides several benefits that other data platforms cannot:
- Able to deal with never-ending streams of events—some data is naturally structured this way. Traditional batch processing tools require stopping the stream of events, capturing batches of data and combining the batches to draw overall conclusions. In stream processing, while it is challenging to combine and capture data from multiple streams, it lets you derive immediate insights from large volumes of streaming data.
- Real-time or near-real-time processing—most organizations adopt stream processing to enable real-time data analytics. While real time analytics is also possible with high performance database systems, often the data lends itself to a stream processing model.
- Detecting patterns in time-series data—detecting patterns over time, for example looking for trends in website traffic data, requires data to be continuously processed and analyzed. Batch processing makes this more difficult because it breaks data into batches, meaning some events are broken across two or more batches.
- Easy data scalability—growing data volumes can break a batch processing system, requiring you to provision more resources or modify the architecture. Modern stream processing infrastructure is hyper-scalable, able to deal with gigabytes of data per second with a single stream processor. This enables you to easily deal with growing data volumes without infrastructure changes.
To learn more, you can read our previous article on stream vs batch processing.
The Components of a Streaming Architecture
Most streaming stacks are still built on an assembly line of open-source and proprietary solutions to specific problems such as stream processing, storage, data integration, and real-time analytics. At Upsolver we’ve developed a modern platform that combines most building blocks and offers a seamless way to transform streams into analytics-ready datasets. You can check out our technical white paper for the details.
Whether you go with a modern data lake platform or a traditional patchwork of tools, your streaming architecture must include these four key building blocks:
1. The Message Broker / Stream Processor
This is the element that takes data from a source, called a producer, translates it into a standard message format, and streams it on an ongoing basis. Other components can then listen in and consume the messages passed on by the broker.
The first generation of message brokers, such as RabbitMQ and Apache ActiveMQ, relied on the Message Oriented Middleware (MOM) paradigm. Later, hyper-performant messaging platforms (often called stream processors) emerged that are more suitable for a streaming paradigm. Two popular stream processing tools are Apache Kafka and Amazon Kinesis Data Streams.
Unlike the old MOM brokers, streaming brokers support very high performance with persistence, have massive capacity of a gigabyte per second or more of message traffic, and are tightly focused on streaming with little support for data transformations or task scheduling (although Confluent’s KSQL offers the ability to perform basic ETL in real-time while storing data in Kafka).
2. Batch and Real-time ETL Tools
Data streams from one or more message brokers must be aggregated, transformed, and structured before data can be analyzed with SQL-based analytics tools. This would be done by an ETL tool or platform rthat eceives queries from users, fetches events from message queues, then applies the query to generate a result – in the process often performing additional joins, transformations, or aggregations on the data. The result may be an API call, an action, a visualization, an alert, or in some cases a new data stream.
Image Source: InfoQ
A few examples of open-source ETL tools for streaming data are Apache Storm, Spark Streaming, and WSO2 Stream Processor. While these frameworks work in different ways, they are all capable of listening to message streams, processing the data, and saving it to storage.
Some stream processors, including Spark and WSO2, provide a SQL syntax for querying and manipulating the data; however, for most operations you would need to write complex code in Java or Scala. Upsolver’s data lake ETL is built to provide a self-service solution for transforming streaming data using only SQL and a visual interface, without the complexity of orchestrating and managing ETL jobs in Spark. You can start a free trial here.
3. Data Analytics / Serverless Query Engine
After streaming data is prepared for consumption by the stream processor, it must be analyzed to provide value. There are many different approaches to streaming data analytics. Here are some of the tools most commonly used for streaming data analytics.
|Analytics Tool||Streaming Use Case||Example Setup|
|Amazon Athena||Distributed SQL engine||Streaming data is saved to S3. You can set up ad hoc SQL queries via the AWS Management Console, Athena runs them as serverless functions and returns results.|
|Amazon Redshift||Data warehouse||Amazon Kinesis Streaming Data Firehose can be used to save streaming data to Redshift. This enables near real-time analytics with BI tools and dashboards you have already integrated with Redshift.|
|Kafka Connect can be used to stream topics directly into Elasticsearch. If you use the Avro data format and a schema registry, Elasticsearch mappings with correct data types are created automatically. You can then perform rapid text search or analytics within Elasticsearch.|
Low latency serving of streaming events to apps
|Kafka streams can be processed and persisted to a Cassandra cluster. You can implement another Kafka instance that receives a stream of changes from Cassandra and serves them to applications for real-time decision making.|
Streaming Data Storage
With the advent of low cost storage technologies, most organizations today are storing their streaming event data. Here are several options for storing streaming data, and their pros and cons.
|Streaming Data Storage Option||Pros||Cons|
|In a database or data warehouse – for example, PostgreSQL or Amazon Redshift||Easy SQL-based data analysis.||Hard to scale and manage. If cloud-based, storage is expensive.|
|In the message broker – for example, using Kafka persistent storage||Agile, no need to structure data into tables. Easy to set up, no additional components.||Data retention is an issue since Kafka storage is up to 10x more expensive compared to data lake storage. Kafka performance is best for reading recent (cached) data.|
|In a data lake – for example, Amazon S3||Agile, no need to structure data into tables. Low cost storage.||High latency, makes real time analysis difficult. Difficult to perform SQL analytics.|
A data lake is the most flexible and inexpensive option for storing event data, but it is often very technically involved to build and maintain one. We’ve written before about the challenges of building a data lake and maintaining lake storage best practices, including the need to ensure exactly-once processing, partitioning the data, and enabling backfill with historical data. It’s easy to just dump all your data into object storage; creating an operational data lake can often be much more difficult.
Upsolver’s data lake pipeline platform reduces time-to-value for data lake projects by automating stream ingestion, schema-on-read, and metadata extraction. This allows data consumers to easily prepare data for analytics tools and real time analysis. To learn more, you can check out our Product page.
Modern Streaming Architecture
In modern streaming data deployments, many organizations are adopting a full stack approach rather than relying on patching together open-source technologies. The modern data platform is built on business-centric value chains rather than IT-centric coding processes, wherein the complexity of traditional architecture is abstracted into a single self-service platform that turns event streams into analytics-ready data.
The idea behind Upsolver is to act as the centralized data platform that automates the labor-intensive parts of working with streaming data: message ingestion, batch and streaming ETL, storage management, and preparing data for analytics.
Benefits of a modern streaming architecture:
- Can eliminate the need for large data engineering projects
- Performance, high availability, and fault tolerance built in
- Newer platforms are cloud-based and can be deployed very quickly with no upfront investment
- Flexibility and support for multiple use cases
Here’s how you would use Upsolver’s streaming data tool to analyze advertising data in Amazon Athena:
Examples of modern streaming architectures on AWS
Since most of our customers work with streaming data, we encounter many different streaming use cases, mostly around operationalizing Kafka/Kinesis streams in the Amazon cloud. Below you will find some case studies and reference architectures that can help you understand how organizations in various industries design their streaming architectures:
Sisense is a late-stage SaaS startup and one of the leading providers of business analytics software. It was seeking to improve its ability to analyze internal metrics derived from product usage – over 70bn events and growing.
Read the full case study here.
Bigabid develops a programmatic advertising solution built on predictive algorithms. By implementing a modern real-time data architecture, the company was able to improve its modeling accuracy by a scale of 200x over one year
Read the full case study on the AWS website.
IronSource is a leading in-app monetization and video advertising platform. In a recent case study published on the AWS blog, we describe how the company built a versatile data lake architecture capable of handling petabyte-scale streaming data.
Read the full case study on the AWS blog.
Learn how Meta Networks (acquired by Proofpoint) achieved several operational benefits by moving its streaming architecture from a data warehouse to a cloud data lake on AWS.
Read the full case study here.
The Future of Streaming Data
Streaming data architecture is in constant flux. Three trends we believe will be significant in 2022 and beyond:
- Fast adoption of platforms that decouple storage and compute—streaming data growth is making traditional data warehouse platforms too expensive and cumbersome to manage. Data lakes are increasingly used, both as a cheap persistence option for storing large volumes of event data and as a flexible integration point, allowing tools outside the streaming ecosystem to access streaming data.
- From table modeling to schema-less development—data consumers don’t always know the questions they will ask in advance. They want to run an interactive, iterative process with as little initial setup as possible. Lengthy table modeling, schema detection, and metadata extraction are a burden.
- Automation of data plumbing—organizations are becoming reluctant to spend precious data engineering time on data plumbing instead of activities that add value, such as data cleansing or enrichment. Increasingly, data teams prefer full stack platforms that reduce time-to-value over tailored home-grown solutions.
Want to learn more about streaming data analytics and architecture? Get our Ultimate Guide to Streaming Data:
- Get an overview of common options for building an infrastructure.
- See how to turn event streams into analytics-ready data.
- Cut through some of the noise of all the “shiny new objects.”
- Come away with concrete ideas for wringing all you want from your data streams.
You can read more of our predictions for streaming data trends here to see how many of them we got right, or check out some other articles we’ve written about cloud architecture as well as other streaming data topics.
Want to build or scale up your streaming architecture? Upsolver is a streaming data platform that processes event data and ingests it into data lakes, data warehouses, serverless platforms, Elasticsearch, and more, making SQL-based analytics instantly available. Upsolver also enables real time analytics, using low-latency consumers that read from a Kafka stream in parallel. It is a fully integrated solution that can be set up in hour. Schedule a demo to learn how to build your next-gen streaming data architecture, or watch the webinar to learn how it’s done.