The following article offers a practical guide to build real-time streaming ETL. Learn more about this topic by downloading our streaming architecture white paper (free, pdf).
Real-time data is seeing tremendous growth as new data sources such as IoT devices, real-time applications, and mobile devices become more integrated into business operations.
From an engineering perspective, the nature of real-time data requires a paradigm shift in how you build and maintain your ETL data pipeline. This type of is continuously streamed in very expected loads, however, the structure of the data may change in the same frequency.
In this article, we’ll cover all the bits and bytes required for a real-time streaming ETL pipeline which should address the challenges of dealing with real-time streaming data.
To start with, you need to stream your real-time data into a streaming platform – a message broker which processes streaming events from client apps or devices, and ensures it is sent to target storage systems. There are many to choose from, but if you don’t know any – using one of the popular streaming platforms such as Apache Kafka and Amazon Kinesis will do the trick (see our comparison of Kafka vs RabbitMQ).
Benefits of stream processing
- Data freshness/latency – since you are processing one event at a time in real-time or near real-time, your data is always fresh.
- Cost – no need to run large operations on small servers. This helps us keep our processing footprint small and as a result – so does our cloud bill. We have a very small amount of processing at every single point in time since we are typically only working with the latest events.
You can read more about stream, batch and micro-batch processing here.
Once you have a stream on incoming events, you need to store it somewhere. One option would be to use a traditional database. However, choosing that option will limit your flexibility (since you have to commit to a certain schema) and the storage costs would be high.
Another option would be storing your data in a data lake. Data lakes are based on object storage services such as Amazon S3 and Google Cloud Storage which are cheap and reliable options to store data in the cloud. This is the option to go for when handling high volumes streaming data, since object storage fits in nicely with this type of fluid and often only partially-structured data.
Amazon S3 is schema-agnostic. It doesn’t care about data formats and structure – you can just store whatever data you want and it deals with it perfectly and at a low cost. It’s important to store the data in the lake in an optimal manner, for example: avoid small files and use the best optimal format for your use-case (read more about dealing with small files on S3).
While S3 is an excellent and low-cost option for storage, it doesn’t give you tools to manage schema, which means you’re not always sure exactly what’s going into your lake. Maintaining a schema layer on-top of your data lake helps you maintain control and avoid ‘data swamp’ scenarios. The solution for that is to either develop a schema management tool yourself or use over the shelf tools to do it yourself such as Upsolver Data Lake ETL which provides automatic schema-on read. Read more about using schema discovery to explore streaming data.
Data Preparation for Analytics
So, you have your data ingested into a data lake, and know how it’s structured. Nice work! However, this is probably not the end of the road for you since you would probably want to do something with your data, such as running analytics queries, run machine learning flows, or even just store a subset of the data in a database.
Here comes the transformation phase in the ETL process. Like with every ETL, moving your data into a queryable state is a concern for the real-time use-case as well.
As we’ve previously seen, streaming data comes in several forms (for example hierarchical JSON) and shapes (for example various file formats: CSV, TSC, Parquet, AVRO, etc), and a single stream of real-time data may change over time as well. Since we are dealing with real-time data such changes might be frequent and may easily break your ETL pipeline. So, for transforming your data you either need to use a data lake ETL tool such as Upsolver or code your own solution using Apache Spark, for example.
When you’re done moving your data into a queryable state, you need to distribute it to one or more targets, depending on your use-case. This can be done both using tools such as Apache Airflow, which requires some expertise and coding or you can develop your own orchestration tool by yourself using Spark.
Building Real-time ETL Pipelines in Upsolver
Let’s look at an example use case in which you want to send your real-time streaming data from Kinesis, turn it into queryable data, and send it to Athena. We will walk you through the process of how it can be done in Upsolver in just 3 steps.
Step 1: Extract real-time streaming data from Kinesis
This step is also known as the ETL data ingestion process. We load the real-time streaming data from Kinesis to Upsolver’s data lake. The data is stored in Amazon S3.
Step 2: Transform the data into a queryable state (using UI or SQL)
This step is also known as the ETL transformation phase. During this stage, we transform the raw data into a queryable data which we can query in Athena. Upsolver offers a visual interface from which you can map the fields from your raw data to columns in your Athena table.
Upsolver also enables you to work with nested data in the same manner.
If you are used to working with SQL, Upsolver enables you to switch between SQL/UI modes and develop your transformations in SQL as well. Read more about that on our blog.
Step 3: Load the transformed data to Athena
Upsolver ETLs are automatically orchestrated whether you run them continuously or on specific time frames – this means there is no need to write orchestration code in Apache Spark or Airflow.
An Upsolver ETL to Athena creates Parquet files on S3 and a table in the Glue Data Catalog. During this process, Upsolver will convert the event files into optimized Apache Parquet and merge small files for optimal performance. When this process is complete, the table you’ve created will instantly be available to query in Athena.
With that – we’re done. We’ve built a continuous ETL pipeline that ingests, transforms and delivers structured data for analytics, and can easily be duplicated or modified to fit changing needs.