Last updated: March 8, 2021
The following article offers a practical guide to build real-time streaming ETL. Learn more about this topic by downloading our streaming architecture white paper (free, pdf).
Real-time data is seeing tremendous growth as new data sources such as IoT devices, real-time applications, and mobile devices become more integrated into business operations.
From an engineering perspective, the nature of real-time data requires a paradigm shift in how you build and maintain your ETL data pipeline. This type of is continuously streamed in very expected loads, however, the structure of the data may change in the same frequency.
In this article, we’ll cover all the bits and bytes required for a real-time streaming ETL pipeline that addresses the challenges of dealing with real-time streaming data.
To start with, you need to stream your real-time data into a streaming platform – a message broker that processes streaming events from client apps or devices and ensures it is sent to target storage systems. There are many to choose from. If you’re uncertain which to choose, use one of the popular streaming platforms such as Apache Kafka and Amazon Kinesis. (See our comparison of Kafka vs RabbitMQ).
Benefits of stream processing
- Data freshness/latency – since you are processing one event at a time in real-time or near real-time, your data is always fresh.
- Cost – no need to run large operations on small servers. This helps keep your processing footprint small and, as a result, your cloud bill, as well. You have a very small amount of processing at every single point in time since you’re typically only working with the latest events.
You can read more about stream, batch and micro-batch processing here.
Once you have a stream of incoming events, you need to store it somewhere. One option would be to use a traditional database. However, choosing that option limits your flexibility (since you have to commit to a certain schema) and the storage costs would be high.
Another option would be storing your data in a data lake. Data lakes are based on object storage services such as Amazon S3 and Google Cloud Storage. These are cheap and reliable options to store data in the cloud. This is the best choice for handling high volumes of streaming data, since object storage fits in nicely with this type of fluid and often only partially-structured data.
Amazon S3 is schema-agnostic. It doesn’t care about data formats and structure – you can just store whatever data you want and it deals with it perfectly and at a low cost. It’s important to store the data in the lake in an optimal manner. For example: avoid small files and use the best optimal format for your use case (read more about dealing with small files on S3).
While S3 is an excellent and low-cost option for storage, it doesn’t give you tools to manage schema, which means you’re not always sure exactly what’s going into your lake. Maintaining a schema layer on top of your data lake helps you maintain control and avoid “data swamp” scenarios. The solution for that is either to develop a schema management tool yourself or use off-the-shelf tools to do it such as Upsolver Data Lake ETL, which provides automatic schema on read. Read more about using schema discovery to explore streaming data.
Data Preparation for Analytics
So you have your data ingested into a data lake, and you know how it’s structured. Nice work! However, this is probably not the end of the task; you probably want to do something with your data, such as running analytics queries, running machine learning flows, or even just storing a subset of the data in a database.
Here comes the transformation phase in the ETL process. As with every ETL, moving your data into a queryable state is a concern for the real-time use case as well.
As we’ve previously seen, streaming data comes in several forms (such as hierarchical JSON) and shapes (such as various file formats: CSV, TSC, Parquet, AVRO, and so on). And a single stream of real-time data may change over time, as well. Since we are dealing with real-time data such changes might be frequent and could easily break your ETL pipeline. So for transforming your data you need either to use a data lake ETL tool such as Upsolver or code your own solution using Apache Spark, for example.
When you’re done moving your data into a queryable state, you need to distribute it to one or more targets, depending on your use case. This can be done using tools such as Apache Airflow, which requires some expertise and coding, or you can develop your own orchestration tool by yourself using Spark.
Building Real-time ETL Pipelines in Upsolver
Let’s look at an example use case in which you want to turn your real-time streaming data from Kinesis into queryable data, and send it to Athena. We will walk you through the process of how you can do this in Upsolver in just 3 steps.
Step 1: Extract real-time streaming data from Kinesis
This step is also known as the ETL data ingestion process. We load the real-time streaming data from Kinesis to Upsolver’s data lake. The data is stored in Amazon S3.
Step 2: Transform the data into a queryable state (using the UI or SQL)
This step is also known as the ETL transformation phase. During this stage, we transform the raw data into queryable data that we can query in Athena. Upsolver offers a visual interface from which you can map the fields from your raw data to columns in your Athena table.
Upsolver also enables you to work with nested data in the same manner.
If you are used to working with SQL, Upsolver enables you to switch between SQL/UI modes and develop your transformations in SQL as well. Read more about that on our blog.
Step 3: Load the transformed data to Athena
Upsolver ETLs are automatically orchestrated whether you run them continuously or on specific time frames. This means there is no need to write orchestration code in Apache Spark or Airflow.
An Upsolver ETL to Athena creates Parquet files on S3 and a table in the Glue Data Catalog. During this process, Upsolver will convert the event files into optimized Apache Parquet and merge small files for optimal performance. When this process is complete, the table you’ve created will instantly be available to query in Athena.
With that – we’re done. We’ve built a continuous ETL pipeline that ingests, transforms, and delivers structured data for analytics, and can easily be duplicated or modified to fit changing needs.