The following article offers a practical guide to building real-time ETL pipelines. To learn more about this topic, download our streaming architecture white paper (free PDF).
Stream processing must become a core competency for any data engineering team that wants to keep pace with current technology. Real-time data is seeing tremendous growth as new data sources such as IoT devices, real-time applications, and mobile devices become more integrated into business operations.
From an engineering perspective, the nature of real-time data requires a paradigm shift in how you build and maintain ETL data pipelines. Streaming data is continuously generated, and while the inflow of data can be fairly predictable, the structure of the data may change just as frequently.
In this article, we’ll cover the requirements for designing and managing a real-time streaming ETL pipeline that ingests streaming data, stores it in cloud object storage, and outputs an analytics-ready dataset for query engines.
Streaming vs Batch Pipelines: Why You Need Streaming ETL and Why It’s Challenging
To make data useful outside of the source systems in which it is generated – such as business apps, mobile devices, or spreadsheets – you will almost always need to run some kind of ETL pipeline. Data must be extracted, normalized, joined with additional sources, cleaned, and optimized before being made available for querying in target systems – and this is true for any type of workload or data source. However, there are differences between batch and streaming ETL.
Batch processing is typically used when moving data from departmental data marts (such as CRM, finance, and HR systems) into an enterprise data warehouse. A delay of a few hours or even a few days is usually acceptable in these cases.
In a batch process, we’re waiting for large volumes of data to accumulate before running a transformation or moving data into the target system. The challenge is often to do so at scale and without degrading the performance of operational databases.
Streaming ETL is used when data freshness is of paramount importance, and when data is generated continuously in very small bursts. The data can originate from the Internet of Things (IoT), transactions, cloud applications, web interactions, mobile devices, or machine sensors.
For example, if we’re trying to analyze user behavior on a website or a mobile app, that data is going to take the form of thousands of events associated with thousands of users – easily adding up to millions of new records each hour. If organizations need to respond to changes in the data as they happen – such as to detect app outages or to offer promotions in real time – stream processing is really the only viable solution.
Benefits of Stream Processing
- Data freshness – since events are processed close to the time they are generated, you avoid the delays involved with batch processing and can make data available for analysis faster.
- Cost – events are processed one at a time, removing the need to provision large servers for periodic bulk operations. This helps keep your compute costs under control.
However, stream processing is not without challenges:
- Since you need to process data in motion, you do not have the ability to load it into a data warehouse and use familiar SQL tools to perform your transformation.
- Data schema and structure often change, which can break your pipeline or cause it to return incorrect data.
- Missing and late events (such as due to poor internet connectivity on end devices) can create inconsistencies in the target dataset
You can read more about stream, batch and micro-batch processing here.
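To make the late-event problem above concrete, here is a minimal pure-Python sketch of one common mitigation: counting events into tumbling windows and discarding events that arrive past an allowed-lateness cutoff. The window size, lateness threshold, and field names are illustrative, not part of any specific product.

```python
from collections import defaultdict

WINDOW_SECONDS = 60      # tumbling window size
ALLOWED_LATENESS = 120   # how long after a window closes we still accept events

def window_start(event_time: int) -> int:
    """Map an event timestamp (epoch seconds) to its tumbling window."""
    return event_time - (event_time % WINDOW_SECONDS)

def aggregate(events):
    """Count events per window, setting aside those that arrive too late.

    Each event is a dict with 'event_time' (when it happened) and
    'arrival_time' (when we received it). Returns (counts, dropped).
    """
    counts = defaultdict(int)
    dropped = []
    for e in events:
        w = window_start(e["event_time"])
        # An event that shows up long after its window ended is treated
        # as too late to include in the aggregate.
        if e["arrival_time"] > w + WINDOW_SECONDS + ALLOWED_LATENESS:
            dropped.append(e)
        else:
            counts[w] += 1
    return dict(counts), dropped
```

Real stream processors track watermarks rather than a fixed cutoff, but the trade-off is the same: wait longer for stragglers and gain consistency, or close windows early and gain freshness.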
How We Define a Streaming Data Pipeline
Stream processing typically entails several tasks on the incoming series of data, such as transformations, filters, aggregations, enrichment, or ingestion of the data, before publishing the processed messages.
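Schematically, those stages can be chained as plain Python generator functions, each consuming the previous stage's output. The event shape and the country lookup table below are invented for illustration:

```python
import json

def parse(lines):
    """Deserialize raw messages (JSON strings) into event dicts."""
    for line in lines:
        yield json.loads(line)

def keep_clicks(events):
    """Filter: pass through only click events."""
    for e in events:
        if e.get("type") == "click":
            yield e

def enrich(events, user_countries):
    """Enrichment: join each event with a reference lookup table."""
    for e in events:
        e["country"] = user_countries.get(e["user_id"], "unknown")
        yield e

raw = [
    '{"type": "click", "user_id": 1}',
    '{"type": "view",  "user_id": 2}',
    '{"type": "click", "user_id": 3}',
]
lookup = {1: "US", 3: "DE"}
processed = list(enrich(keep_clicks(parse(raw)), lookup))
```

Because each stage is lazy, records flow through one at a time rather than accumulating in memory, which is the defining property of stream (as opposed to batch) processing.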
A streaming data pipeline flows a constant feed of data from one point to another as the data is created. It can be used to populate data lakes or data warehouses, or to publish to a messaging system or data stream. Streaming pipelines are used for fast data that needs to be captured in real time, and when capturing large volumes of raw data is essential.
A well-architected streaming architecture can handle millions of events at scale and in real-time. This allows data engineers to compile, analyze, and store vast amounts of information. Data can then be used for applications, analytics, and synchronous reporting.
Streaming Data Pipeline Architecture
A streaming pipeline starts with a raw event stream and ends with an analytics-ready table that data analysts can query with SQL. If you want to explore this topic in-depth, you can read our guide to real-time architecture; but schematically, your pipeline might look something like this:
First you need to stream your real-time data into a streaming platform – a message broker that processes streaming events from client apps or devices and ensures they are sent to target storage systems.
Apache Kafka is an extremely popular message bus used in many software development environments. Amazon’s Kinesis Data Streams is also common in AWS deployments. To learn more about different streaming tools, see our comparisons of Kafka vs RabbitMQ or between Kafka and Kinesis.
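As a rough illustration of how client events reach such a message bus, here is a hedged Python sketch. It assumes the third-party kafka-python package and a broker at localhost:9092, neither of which is implied by the article; only the serialization helper is exercised without a broker.

```python
import json

def to_record(event: dict):
    """Serialize an event into a (key, value) byte pair for the bus.

    Keying by user_id keeps each user's events ordered within a partition.
    """
    key = str(event["user_id"]).encode("utf-8")
    value = json.dumps(event).encode("utf-8")
    return key, value

def publish(events, topic="user-events", servers="localhost:9092"):
    """Send events to Kafka. Requires the kafka-python package and a
    reachable broker -- illustrative, and the topic name is invented."""
    from kafka import KafkaProducer  # imported lazily; needs kafka-python
    producer = KafkaProducer(bootstrap_servers=servers)
    for e in events:
        key, value = to_record(e)
        producer.send(topic, key=key, value=value)
    producer.flush()
```

A Kinesis producer would look much the same, with `boto3`'s `put_record` in place of `producer.send` and a partition key instead of a message key.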
Real-time Data Ingestion
Real-time data ingestion is the process of writing data collected by a message bus into a storage layer.
Like other data pipelines, a streaming pipeline delivers data from source to target; but unlike a batch system which would typically be triggered on a periodic basis, or once a certain amount of data has accumulated, a streaming pipeline runs continuously and delivers events as they are generated, in real or near-real time.
Often the source systems, such as IoT sensors, do not store the data for long, while retaining it in a message bus such as Kafka or Kinesis is costly. This necessitates moving the data quickly into storage.
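A common convention when landing stream data in object storage is to build time-partitioned keys, so query engines can prune by date and hour instead of scanning the whole bucket. The sketch below shows that convention; the `write_batch` helper assumes the boto3 package and AWS credentials, and the bucket layout is illustrative.

```python
from datetime import datetime, timezone

def s3_key(stream: str, event_time: float, batch_id: int) -> str:
    """Build a Hive-style time-partitioned object key from an event
    timestamp (epoch seconds, interpreted as UTC)."""
    t = datetime.fromtimestamp(event_time, tz=timezone.utc)
    return (f"{stream}/year={t.year}/month={t.month:02d}/"
            f"day={t.day:02d}/hour={t.hour:02d}/batch-{batch_id}.json")

def write_batch(bucket: str, key: str, lines: list) -> None:
    """Write one newline-delimited JSON object to S3.
    Requires boto3 and credentials -- illustrative only."""
    import boto3  # imported lazily so the helper above runs standalone
    body = "\n".join(lines).encode("utf-8")
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)
```

Partitioning by event time (rather than arrival time) is what lets downstream engines like Athena answer "what happened yesterday" without reading today's objects.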
Once you have a stream of incoming events, you need to store it somewhere. One option would be to use a traditional database. However, choosing that option limits your flexibility (since you have to commit to a certain schema) and the storage costs would be high. There are also databases built specifically for streaming, but they have limitations as well – as we’ve covered in our previous article, 4 Challenges of Using Database for Streaming Data.
The other option would be storing your data in a data lake. Data lakes are based on object storage services such as Amazon S3 and Google Cloud Storage, which are cheap and reliable options for storing data in the cloud. This is the best choice for handling high volumes of streaming data, since object storage is a natural fit for this type of fluid, often semi-structured data.
Amazon S3 is schema-agnostic: it doesn’t care about data formats or structure, so you can store whatever data you want at low cost. It’s still important to store the data in the lake in an optimal manner – for example, avoid small files and use the format best suited to your use case (read more about dealing with small files on S3).
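One simple way to avoid the small-files problem is to buffer events in memory and flush them in large chunks. The sketch below stands in for real object-storage writes with an in-memory list, and the 128 MB default threshold is just a common rule of thumb, not a prescription:

```python
class Buffer:
    """Accumulate records and flush them in sizeable chunks, so the lake
    holds a few large objects instead of many tiny ones."""

    def __init__(self, flush_bytes=128 * 1024 * 1024):
        self.flush_bytes = flush_bytes
        self.items, self.size = [], 0
        self.flushed = []  # stand-in for writes to object storage

    def add(self, record: bytes):
        self.items.append(record)
        self.size += len(record)
        if self.size >= self.flush_bytes:
            self.flush()

    def flush(self):
        """Emit the current chunk as one newline-delimited object."""
        if self.items:
            self.flushed.append(b"\n".join(self.items))
            self.items, self.size = [], 0
```

A production ingester would also flush on a timer, so a quiet stream still lands its data within a bounded delay.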
While S3 is an excellent and low-cost option for storage, it doesn’t give you tools to manage schema, which means you’re not always sure exactly what’s going into your lake. Maintaining a schema layer on top of your data lake helps you maintain control and avoid “data swamp” scenarios. The solution for that is either to develop a schema management tool yourself or use off-the-shelf tools to do it such as Upsolver, which provides automatic schema-on-read.
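The core idea behind schema-on-read can be sketched in a few lines: infer each field's type from incoming events and flag anything new or changed. This is a toy version for flat events only, and is not how Upsolver or any particular tool implements it:

```python
def infer_schema(event: dict) -> dict:
    """Map each field to its observed type name (flat events only)."""
    return {k: type(v).__name__ for k, v in event.items()}

def detect_drift(known: dict, event: dict) -> dict:
    """Return fields that are new or whose type changed versus the
    schema seen so far, then fold them into the known schema."""
    observed = infer_schema(event)
    drift = {k: t for k, t in observed.items() if known.get(k) != t}
    known.update(observed)
    return drift
```

Even this toy version catches the two drift cases that most often break pipelines silently: a brand-new field, and an existing field whose type changed (e.g. an ID that flips from integer to string).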
Learn more about using schema discovery to explore streaming data, or watch this 2.5 minute video to see how schema discovery is handled within an Upsolver pipeline:
Data Preparation for Analytics
So you have your data ingested into a data lake, and you know how it’s structured. Nice work! However, this is probably not the end of the task; you probably want to do something with your data, such as running analytics queries, running machine learning flows, or even just storing a subset of the data in a database.
This is where the transformation phase of the ETL process comes in. As with any ETL pipeline, moving your data into a queryable state is a concern in the real-time use case as well.
As we’ve previously seen, streaming data comes in several shapes (such as hierarchical JSON) and formats (CSV, TSV, Parquet, Avro, and so on). A single stream of real-time data may also change over time. Since we are dealing with real-time data, such changes can be frequent and could easily break your ETL pipeline. To transform your data, you need either to use a data pipeline tool such as Upsolver or to code your own solution using, for example, Apache Spark.
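For example, a common first transformation is flattening hierarchical JSON into flat, column-like records that file formats such as Parquet and engines such as Athena handle well. A minimal sketch (the field names are invented):

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dot-separated columns, e.g.
    {"user": {"id": 1}} -> {"user.id": 1}. Lists are left as-is here;
    real pipelines often explode them into separate rows instead."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat
```

Running this over each event yields records with stable column names that map directly onto a table definition.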
When you’re done moving your data into a queryable state, you need to distribute it to one or more targets, depending on your use case. This can be done with an orchestration tool such as Apache Airflow, which requires some expertise and coding, or by developing your own orchestration logic on top of Spark.
Building Real-time ETL Pipelines in Upsolver
Let’s look at an example use case in which you want to turn your real-time streaming data from Kinesis into queryable data, and send it to Athena. We will walk you through the process of how you can do this in Upsolver in just 3 steps.
If you want to see a full end-to-end process in more detail, watch the video below:
Step 1: Extract real-time streaming data from Kinesis
This step is also known as the ETL data ingestion process. We load the real-time streaming data from Kinesis to Upsolver’s data lake. The data is stored in Amazon S3.
Step 2: Transform the data into a queryable state (using the UI or SQL)
This step is also known as the ETL transformation phase. During this stage, we transform the raw data into queryable data that we can query in Athena. Upsolver offers a visual interface from which you can map the fields from your raw data to columns in your Athena table.
Upsolver also enables you to work with nested data in the same manner.
If you are used to working with SQL, Upsolver enables you to switch between SQL/UI modes and develop your transformations in SQL as well. Read more about that on our blog.
Step 3: Load the transformed data to Athena
Upsolver ETLs are automatically orchestrated whether you run them continuously or on specific time frames. This means there is no need to write orchestration code in Apache Spark or Airflow.
An Upsolver ETL to Athena creates Parquet files on S3 and a table in the Glue Data Catalog. During this process, Upsolver will convert the event files into optimized Apache Parquet and merge small files for optimal performance. When this process is complete, the table you’ve created will instantly be available to query in Athena.
With that – we’re done. We’ve built a continuous ETL pipeline that ingests, transforms, and delivers structured data for analytics, and can easily be duplicated or modified to fit changing needs.
Putting It All Together: Learn How ironSource Runs Petabyte-Scale Data Pipelines
If any company in the world knows streaming data, it’s ironSource. The mobile monetization leader works with massive volumes of data from its various networks, and that data grows and changes every day.
In the following 10 minute video, you can hear from the VP of R&D at ironSource on why he chose to move his streaming architecture from a data warehouse to a data lake.
If you want to further explore this case study and hear additional details from AWS and Upsolver about how the solution was implemented, watch the webinar recording here.