The following article offers a practical guide to build real-time streaming ETL. Learn more about this topic by downloading our streaming architecture white paper (free, pdf).
Real-time data is seeing tremendous growth as new data sources such as IoT devices, real-time applications, and mobile devices become more integrated into business operations.
From an engineering perspective, the nature of real-time data requires a paradigm shift in how you build and maintain ETL data pipelines. Streaming data is continuously generated – and while the inflow of data can be fairly predictable, the structure of the data may change in the same frequency.
In this article, we’ll cover the requirements for designing and managing a real-time streaming ETL pipeline that ingests streaming data, stores it in cloud object storage, and outputs an analytics-reedy dataset for query engines.
Streaming vs Batch Pipelines: Why You Need Streaming ETL and Why It’s Challenging
To make data useful outside of the source systems in which it is generated – such as business apps, mobile devices, or spreadsheets – you will almost always need to run some kind of ETL pipeline. Data must be extracted, normalized, joined with additional sources, cleaned, and optimized before being made available for querying in target systems – and this is true for any type of workload or data source.However, there are differences between batch and streaming ETL.
Batch processing is typically used when moving data between departmental data marts (such as CRM, finance, and HR systems) to an enterprise data warehouse. A delay of a few hours or even a few days is usually acceptable in these cases.
In a batch process, we’re waiting for large volumes of data to accumulate before running a transformation or moving data into the target system. The challenge is often to do so at scale and without degrading the performance of operational databases.
Steaming ETL is used when data freshness is of paramount importance, and when data is generated continuously in very small bursts. For example, if we’re trying to analyze user behavior on a website or a mobile app, that data is going to take the form of thousands of events associated with thousands of users – easily adding up to millions of new records each hour. If organizations need to respond to changes in the data as they happen – such as to detect app outages or to offer promotions in real time – stream processing is really the only viable solution.
Benefits of stream processing
- Data freshness/latency – since you are processing one event at a time in real-time or near real-time, your data is always fresh.
- Cost – no need to run large operations on small servers. This helps keep your processing footprint small and, as a result, your cloud bill, as well. You have a very small amount of processing at every single point in time since you’re typically only working with the latest events.
You can read more about stream, batch and micro-batch processing here.
However, stream processing is not without challenges:
- Since you need to process data in motion, you do not have the ability to load it into a data warehouse and use familiar SQL tools to perform your transformation.
- Data schema and structure often changes, which can potentially break your pipeline or result in it returning the wrong data
- Missing and late events (such as due to poor internet connectivity on end devices) can create inconsistencies in the target dataset
Streaming Data Pipeline Architecture
A streaming pipeline starts with a raw event stream and ends with an analytics-ready table that data analysts can query with SQL. If you want to explore this topic in-depth, you can read our guide to real-time architecture; but schematically, your pipeline might look something like this:
First you need to stream your real-time data into a streaming platform – a message broker that processes streaming events from client apps or devices and ensures it is sent to target storage systems. Apache Kafka is an extremely popular message bus used in many software development environments. Amazon’s Kinesis Data Streams is also common in AWS deployments.
Once you have a stream of incoming events, you need to store it somewhere. One option would be to use a traditional database. However, choosing that option limits your flexibility (since you have to commit to a certain schema) and the storage costs would be high. There are also databases built specifically for streaming, but they have limitations as well – as we’ve covered in our previous article, 4 Challenges of Using Database for Streaming Data.
The other option would be storing your data in a data lake. Data lakes are based on object storage services such as Amazon S3 and Google Cloud Storage. These are cheap and reliable options to store data in the cloud. This is the best choice for handling high volumes of streaming data, since object storage fits in nicely with this type of fluid and often only partially-structured data.
Amazon S3 is schema-agnostic. It doesn’t care about data formats and structure – you can just store whatever data you want and it deals with it perfectly and at a low cost. It’s important to store the data in the lake in an optimal manner. For example: avoid small files and use the best optimal format for your use case (read more about dealing with small files on S3).
While S3 is an excellent and low-cost option for storage, it doesn’t give you tools to manage schema, which means you’re not always sure exactly what’s going into your lake. Maintaining a schema layer on top of your data lake helps you maintain control and avoid “data swamp” scenarios. The solution for that is either to develop a schema management tool yourself or use off-the-shelf tools to do it such as Upsolver, which provides automatic schema-on-read.
Learn more about using schema discovery to explore streaming data, or watch this 2.5 minute video to see how schema discovery is handled within an Upsolver pipeline:
Data Preparation for Analytics
So you have your data ingested into a data lake, and you know how it’s structured. Nice work! However, this is probably not the end of the task; you probably want to do something with your data, such as running analytics queries, running machine learning flows, or even just storing a subset of the data in a database.
Here comes the transformation phase in the ETL process. As with every ETL, moving your data into a queryable state is a concern for the real-time use case as well.
As we’ve previously seen, streaming data comes in several forms (such as hierarchical JSON) and shapes (such as various file formats: CSV, TSC, Parquet, AVRO, and so on). And a single stream of real-time data may change over time, as well. Since we are dealing with real-time data such changes might be frequent and could easily break your ETL pipeline. So for transforming your data you need either to use a data pipeline tool such as Upsolver or code your own solution using Apache Spark, for example.
When you’re done moving your data into a queryable state, you need to distribute it to one or more targets, depending on your use case. This can be done using tools such as Apache Airflow, which requires some expertise and coding, or you can develop your own orchestration tool by yourself using Spark.
Building Real-time ETL Pipelines in Upsolver
Let’s look at an example use case in which you want to turn your real-time streaming data from Kinesis into queryable data, and send it to Athena. We will walk you through the process of how you can do this in Upsolver in just 3 steps.
If you want to see a full end-to-end process in more detail, watch the video below:
Step 1: Extract real-time streaming data from Kinesis
This step is also known as the ETL data ingestion process. We load the real-time streaming data from Kinesis to Upsolver’s data lake. The data is stored in Amazon S3.
Step 2: Transform the data into a queryable state (using the UI or SQL)
This step is also known as the ETL transformation phase. During this stage, we transform the raw data into queryable data that we can query in Athena. Upsolver offers a visual interface from which you can map the fields from your raw data to columns in your Athena table.
Upsolver also enables you to work with nested data in the same manner.
If you are used to working with SQL, Upsolver enables you to switch between SQL/UI modes and develop your transformations in SQL as well. Read more about that on our blog.
Step 3: Load the transformed data to Athena
Upsolver ETLs are automatically orchestrated whether you run them continuously or on specific time frames. This means there is no need to write orchestration code in Apache Spark or Airflow.
An Upsolver ETL to Athena creates Parquet files on S3 and a table in the Glue Data Catalog. During this process, Upsolver will convert the event files into optimized Apache Parquet and merge small files for optimal performance. When this process is complete, the table you’ve created will instantly be available to query in Athena.
With that – we’re done. We’ve built a continuous ETL pipeline that ingests, transforms, and delivers structured data for analytics, and can easily be duplicated or modified to fit changing needs.
Putting It All Together: Learn How ironSource Runs Petabyte-Scale Data Pipelines
If any company in the world knows streaming data, it’s ironSource. The mobile monetization leader works with massive volumes of data from its various networks, and that data grows and changes every day.
In the following 10 minute video, you can hear from the VP of R&D at ironSource on why he chose to move his streaming architecture from a data warehouse to a data lake.
If you want to further explore this case study and hear additional details from AWS and Upsolver about how the solution was implemented, watch the webinar recording here.