If you’re working with streaming data in 2019, odds are you’re using Kafka – either in its open-source distribution or as a managed service via Confluent or AWS. The stream processing platform, originally developed at LinkedIn and available under the Apache license, has become pretty much standard issue for event-based data, spanning diverse use cases from sensors to application logs to clicks on online advertisements.
However, making that data useful means setting up some kind of extract-transform-load (ETL) pipeline to manage the data you’re bringing in from Kafka, and to enable data scientists and analysts to query it and extract insights. This article will cover the basics of ETL for Kafka streams and give an overview of three basic approaches to building this type of pipeline.
Topics covered in this article:
- Why do you need to ETL your Kafka data?
- What to expect from your ETL pipeline
- Three types of pipeline architectures
- Further reading
Why do you need to ETL your Kafka data?
While Kafka’s functionality goes beyond that of traditional message brokers such as RabbitMQ (you can read our comparison between the two here), it is still essentially a transport layer that moves unstructured event data from A to B. Data in Kafka is not structured or optimized for reads, which in turn means that trying to run an SQL query on raw Kafka data will not end well.
Hence the need for ETL: we must extract the data from Kafka or a staging environment, transform the raw JSON into structured data that can be easily queried, and then load the data into our database or analytics tool of choice.
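As a minimal illustration of the “transform” step, the sketch below parses a raw JSON message as it might arrive on a Kafka topic and flattens it into a structured row. The field names (`ts`, `type`, `user`) are hypothetical – your events will have their own schema:

```python
import json
from datetime import datetime, timezone

def transform_event(raw: bytes) -> dict:
    """Flatten a raw JSON Kafka message into a structured, queryable row."""
    event = json.loads(raw)
    return {
        # Convert an epoch-millisecond timestamp to an ISO-8601 string
        "event_time": datetime.fromtimestamp(
            event["ts"] / 1000, tz=timezone.utc
        ).isoformat(),
        # Pull nested fields up to the top level for easy querying
        "user_id": event.get("user", {}).get("id"),
        "event_type": event.get("type", "unknown"),
    }

# Example: a raw click event as it might appear on the topic
raw = b'{"ts": 1546300800000, "type": "click", "user": {"id": 42}}'
row = transform_event(raw)
```

In a real pipeline this function would run against every message consumed from the topic before the rows are loaded into the target store.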
What to expect from your ETL pipeline
ETL pipelines for Apache Kafka are uniquely challenging: in addition to the basic task of transforming the data, we need to account for the distinctive characteristics of event stream data. Additionally, Kafka often captures the type of data that lends itself to exploratory analysis – such as application logs, clickstream and sensor data, which are frequently used in data science initiatives. This creates another set of requirements, since we are often unsure at the outset what exactly we will do with the data, or which parts of it we will need and which we can discard.
With that in mind, you should look out for the following capabilities when building out your ETL flows:
- Read only new data: Kafka serves recent data from memory (the operating system’s page cache), and attempting to read older, ‘cold’ data from disk can cause significant production issues – hence we need to ensure our pipeline processes only newly arriving data.
- Replay without disrupting production: while we are only reading new data, we might still need to be able to ‘replay’ the events exactly as they were recorded (e.g. in order to join with additional data or clean up data errors). In these cases we would like to have a copy of the raw data that we can easily ETL into our analytics tools without disrupting Kafka production.
- Support a large number of target outputs for the data: because Kafka supports such a broad range of business processes and use cases, we will want to maintain a similar level of flexibility when it comes to analyzing the data: developers might want to use Elasticsearch and Kibana, while data analysts might prefer working with tools such as Amazon Athena and Tableau.
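The first requirement above – reading only new data – largely comes down to consumer configuration. A minimal sketch in kafka-python style, where the broker address, topic and group id are illustrative assumptions:

```python
# Consumer settings (kafka-python style) for the "read only new data" requirement.
consumer_config = {
    "bootstrap_servers": "localhost:9092",  # assumption: local broker
    "group_id": "etl-pipeline",             # committed offsets let us resume without re-reading
    "auto_offset_reset": "latest",          # with no stored offset, start from new messages only
    "enable_auto_commit": True,             # periodically commit our position
}

# Usage (requires a running Kafka broker):
# from kafka import KafkaConsumer
# consumer = KafkaConsumer("clickstream", **consumer_config)
# for message in consumer:
#     process(message.value)
```

With a stable `group_id` and committed offsets, a restarted pipeline resumes from where it left off instead of re-reading cold data.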
The three paradigms for Kafka ETL
Now that we’ve defined the requirements, let’s look at the three most common approaches when it comes to building an ETL pipeline for Kafka data.
Conventional approach: Kafka to database, ETL on ingest
This is the traditional way to do ETL – data is transformed on ingest and written, with a new structure and schema, to a database or data warehouse. In this case we’re treating Kafka as just another data source, as we would any other organizational system of record.
Pros: The main advantage of this type of pipeline is that we end up with structured data in a relational database – which is a very good place to be since databases are familiar and comfortable to work with. If your throughput in Kafka is fairly low this is probably the way to go.
Cons: Problems with this approach begin to appear as scale increases. Once you’re past a few hundred events per second, you are likely to encounter scaling issues, including:
- Latency due to batch processing
- High costs of constantly writing new records to database storage, particularly in databases where storage and compute are closely coupled, such as Amazon Redshift
- Loss of historical data due to the need to impose schema on write
Data lake approach: store raw event streams, ETL on output
In this approach we forgo schema-on-write and store the raw Kafka data in object storage (such as Amazon S3), performing batch and stream ETL on read, per use case, using tools such as Upsolver or Spark Streaming.
Pros:
- Inexpensive, infinitely scalable storage
- Historical data is readily available for replay purposes
- Ability to support a broad range of use cases, including real-time
Cons:
- Data lake architectures are often complex and require data engineers to orchestrate many moving parts. You can circumvent this issue by using a self-service streaming data platform.
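On the storage side of this approach, raw events are typically written to date-partitioned object keys, so that downstream ETL jobs can read only the partitions they need. A small sketch using a Hive-style `key=value` layout (the `raw/` prefix and naming scheme are assumptions, not a fixed convention):

```python
from datetime import datetime, timezone

def object_key(topic: str, ts_ms: int, batch_id: int) -> str:
    """Build a date-partitioned object-store key for a batch of raw events."""
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    # Hive-style partitioning lets query engines such as Athena prune by date
    return (
        f"raw/{topic}/year={dt.year}/month={dt.month:02d}/"
        f"day={dt.day:02d}/batch-{batch_id}.json"
    )

key = object_key("clickstream", 1546300800000, 7)
```

A replay or backfill job can then list only the prefixes for the dates it needs, instead of scanning the whole bucket.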
Stream processing approach: ETL within Kafka
A third option is to transform the data while it is stored in your Kafka cluster, either by writing code or using something like KSQL, and then run your analytics queries directly in Kafka or output the transformed data to a separate storage layer.
We’ve covered reading data directly from Kafka in a previous article on this topic, and you can find the details there if you’re interested. In a nutshell, this can be a good fit for use cases that only look at small, recent slices of the data, such as real-time monitoring. At higher scales and over longer timeframes, however, keeping all your data in Kafka will run into the same storage-cost and performance issues as a relational database – only more severe.