Data pipelines have evolved in parallel with the emergence of new types of data and new types of data management infrastructure systems. This article discusses the two key components of data pipelines – data integration and data transformation – and provides evaluation criteria for helping you design the right architecture for your needs.
A data pipeline is a mechanism for moving data from where it was created to where it will be consumed. Along the way the data is usually lightly or heavily processed to make it more “consumable” by end-users, applications, or processes. It’s useful to think about data pipelines in the context of two steps: data integration and data transformation.
You can perform these steps either in batches or via continuous real-time processing; which method is most appropriate depends on the use case and on the particular data source(s).
What is Data Integration?
Data integration is the process of moving data between systems. When we need data pipelines, we tend to think of either end of the ETL process:
- Extracting data from a source system – a database, application, or message queue such as Kafka or Kinesis – reliably and faithfully.
- Loading data into a destination system such as a database, data lake, or data warehouse. Sometimes transformed data is loaded back into a new topic/stream in a message queue for downstream processing.
You may want to modify the data as you move it. But pure data integration systems tend to be limited to basic changes, such as field mapping, field conversion, record deduplication, or PII obfuscation. They’re also often limited to moving data in batches.
There are many topologies for data pipelines. The simplest is a one-to-one pipeline, which over time can evolve into a many-to-many mesh. To regain governance control, companies then move to a hub-and-spoke model in which a central system acts as a clearinghouse between sources and destinations.
The process of data integration starts with extraction of data from a source system. In the past this was limited to a handful of business systems for marketing, manufacturing, finance, and sales, all of which were deployed in an on-premises data center. An ETL team built and managed the data pipelines that pulled data into the data warehouse. This team was also in full control of the updates and underlying schema to each of those systems. All data was delivered in batches, often as a nightly job so as not to interfere with source system operation.
Today, data integration is much more complex. Data originates from a huge variety of sources, which could be deployed on-prem, in a cloud VPC, or in a vendor’s multi-tenant SaaS cloud.
Sources fall primarily into three categories:
- Databases (Oracle, IBM DB2, Postgres, MySQL, SQL Server, MongoDB, AWS RDS, and so on). Here the schema is controlled by the engineering team responsible for the source system. They must coordinate to avoid breaking pipelines or dropping data due to unannounced schema changes. The optimal method for extraction is log-based Change Data Capture (CDC). CDC reduces the amount of data transported by only passing along information about changes to the source tables. This reduces the impact to the source system (as opposed to making snapshot copies of the entire table).
- Third-party applications include traditional ERP systems such as on-premises CRM, supply chain management accounting systems, or SaaS apps such as Google Analytics, Salesforce, NetSuite, and so on. Data typically arrives in these systems via API calls. But the method for extracting data is not always well-documented. For cloud apps, it can be challenging to keep up with changes to the API specification and data schema.
- Machine-generated log and event data (IoT sensors, app server logs, security system logs, network logs and protocol messages, website clickstreams). This is often called big or complex data. It comprises high volumes of data, often delivered in a semi-structured format with a complex, hierarchical schema. Raw source data is often sent as events through a message queue such as Kafka or Amazon Kinesis and deposited in a data lake (HDFS or cloud object storage) for cost reasons. This data is unusable in its raw state and requires significant preparation before analysis.
As SaaS-based applications and managed infrastructure services proliferate, the number and breadth of data sources keeps growing. Add to that the problem of replicating traditional relational databases in a way that doesn’t affect database performance, and you have an ever-expanding data engineering task list for integrating data.
For analytics workloads, the output of the data pipeline is usually a structured data store (such as a relational database, a data warehouse, or a search, time-series, real-time, or graph database). There are far fewer destination types than source types.
Data integration is (mostly) a solved problem
Since data integration is mostly a connectivity problem, companies such as Fivetran, Airbyte, and others – “connector” companies – have sprung up to address modern data integration. These companies build and maintain connectors to hundreds of underlying source systems and dozens of data destinations. They also monitor the performance and integrity of the data movement process.
They try to cover as many connectors as possible so customers can avoid building their own integrations. They tend not to provide support for streaming data (although some support CDC), tend to have limited data transformation capabilities, or are limited in the scale at which they can process data. They also tend not to support non-app sources such as logs and IoT sensors. These limitations result in a functionality gap when companies must transform complex or streaming data.
What is Data Transformation?
Data transformation is the raw data processing that occurs either between extract and load (for an ETL pipeline) or in the destination system after load (for an ELT pipeline). These processing jobs can be a periodic batch jobs or continuous streaming jobs. Transformations range from light (changing a field type, replacing null values, standardizing units) to heavy (hashing PII fields, aggregations, filters, streaming joins, sessionizing event data).
Why data transformation matters: Data transformation has become more important because big and streaming data in its raw form is often unusable, for multiple reasons. The sheer volume of data impedes performance and also can make it cost-prohibitive to work with. How the data is structured could be another barrier – for example, if it includes nested data, which you must flatten for use in many destination systems. It may be impractical to join multiple data sources into a target table such that the table contains only required data. And so on.
Where is data transformed: Transformations can occur either before loading the data via a dedicated transformation processing engine (ETL) or after loading the data using the destination system’s computing resources (ELT). ETL vs. ELT is another decision point. If the transformations are compute-intensive – big data, streaming data that requires continuous processing, heavy transformations – and/or the output tables must be centrally-managed and delivered to multiple destinations, then ETL makes the most sense. ELT is better for smaller jobs where the output will only be used on that single destination system; the price/performance disadvantage of using the destination system for this purpose is not consequential, as opposed to an optimized transformation engine.
The transformation ecosystem: When it comes to processing data transformations, you have a few choices. For the ETL model and traditional (small, structured) data, you can use a traditional ETL tool such as Informatica or Talend. For big data, most pipelines use affordable and scalable processing on a data lake (such as EC2 on AWS S3 storage). You can use a framework such as Apache Spark, where you code transformation logic in Java, Python, or Scala; in this case you also need an orchestration tool such as Airflow to plan out the query execution. Newer tools such as Upsolver use SQL as the language for specifying the transformation and they automate ingestion and orchestration to simplify the process further.
For ELT pipelines there are push-down processors for data warehouses such as dbt. dbt uses SQL to orchestrate jobs that are run on (pushed down to) the destination system. You can use the native capabilities of the destination system such as Snowpipe for ingestion and Snowflake staging tables for that platform.
But not all destinations support streaming data. Data warehouses are optimized for user queries, which tend to be less compute-intensive than many transformations; thus running transformations on a data warehouse can lead to high bills. For instance, for Upsolver customer AppsFlyer, 75% of their cloud data warehouse spend was for transforming data into an analytics-ready state.
SQL – the Sequel
SQL, designed decades ago for querying a database, is more relevant than ever, largely because the data warehousing companies have enhanced it and data transformation products are starting to use it to define processing jobs. New products now are bringing SQL’s broadened capabilities to bear on data transformations: dbt for batch ELT orchestration on a data warehouse, and Upsolver for ELT for batch and streaming transformations for output to data lakes, warehouses, databases, or into message queues.
Factors to Consider as you Determine your Pipeline Approach
As the modern data stack continues to evolve, the goal of having just one central repository for all an organization’s needs seems more like a pipe dream (pun intended). There will always be an ecosystem embracing data lakes, data warehouses, and other analytics systems that requires data pipelines to connect all the pieces and serve analytics-ready data to various consumers from BI reporting to data science, machine learning, and feeding data apps.
If you consider the business and technical questions below when planning a modern data stack and data pipeline architecture, you’ll be well on your way to identifying the best tools and technologies for your needs.
6 business factors to consider as you determine your pipeline approach
1. What are your use cases for consuming data?
There are all types of data consumption patterns, from BI and reporting to ad hoc analytics and data science. In BI and reporting, query patterns are predictable and data freshness SLAs clear (daily, hourly, real-time), so you can confidently prepare data. In ad hoc, queries can be anything and live users expect human-scale response times (a few seconds, perhaps). Feeding operational systems (alerting, automated decision-making, supplying data to customer-facing data apps) creates its own requirements for:
- transformation intensity
- data freshness
- query performance on the output data
- pipeline processing costs
- frequency of pipeline changes
A review of how the various use cases impact requirements would be its own lengthy blog; for this post, suffice it to say that “always start with the end in mind” is good advice. So an audit of the consumption use case requirements can be extremely helpful in narrowing down data pipeline technology options.
2. What are your data freshness requirements?
Are nightly updates sufficient or is there value in having data updated hourly or even each minute? The more stringent the latency requirement, the more specialized your processing options get as you’ll be running more frequent jobs or even running continuous transformations as new event data arrives from a streaming source.
There is almost always friction between data freshness and cost, since managed cloud data stores charge for compute consumption and updating an output table can require a time- and budget-consuming full scan of a very large table. (Tools such as Upsolver avoid this through incremental ETL / UPSERTs).
3. What is your data engineering skill set?
Some tools require Java or Python programming skills, as well as skills in deploying processing jobs across complex distributed processing systems such as Spark and Airflow. Others, like Upsolver, enable you to use SQL, a declarative language, and automate the execution details. Matching skills to tooling and pipeline complexity is critical for making a smooth data pipeline operation.
4. Do you have a self-service strategy for data pipelines?
If so, consider the skills of these self-service users, who usually are less accustomed to deep programming than those of a central data engineering team. In particular, analysts and data scientists may not have the Python or Scala coding skills required to use Spark.
If you are handling your pipelines in distributed, domain-centric centers (as advocated by the data mesh paradigm), consider the skills held in each of those business units or whether they can expect those skills in a central shared services team. The former can get expensive and hard to staff; the latter means the engineers won’t be experts on the data. Partnership is critical.
5. Is vendor lock-in a concern?
Where you choose to run your transformations impacts your ability to switch vendors or technologies in the future. An ELT process ties your transformations to the destination system; if you choose to switch from Snowflake to Redshift or Firebolt your transformation tool may also have to change, or you may have to rework all of your pipelines to handle differences in the support each destination system provides. A transformation engine running on a data lake and using a broadly-supported open file format such as Parquet is the most open option, enabling you to output tables to just about any destination system.
6. How cost sensitive are you?
Costs tend to land in 3 buckets:
- Hard costs for software licenses and cloud storage and consumption
- Soft costs for the use of staff to build and maintain the pipelines
- Invisible costs (or benefits) accrued from expedited (or delayed) delivery of analytics-ready data to consumers
Most purchase decisions consider these in the opposite order that they should. Invisible costs (or lost benefits) can be the largest and most strategic to the business, especially if your data pipeline approach is causing delays to customer-facing products or operations that could generate revenue or reduce churn.
Hard costs are often tied directly to data granularity and freshness requirements. As you move from daily to per-minute data you not only have to process data more often (perhaps continually) but your savings from time-based aggregation (rollups) is curtailed.
Soft (labor) costs are dictated by the skills required in staff – someone who writes Spark jobs costs more than a SQL expert – but also how many hours are required to build and then maintain the pipeline in the face of some expected change frequency over a reasonable time horizon. At some point you just can’t afford the expert engineers and instead suffer from pipeline development backlogs, creating more “invisible costs.”
As the modern data stack continues to evolve, the goal of having just one central repository for all an organization’s needs seems more like a pipe dream. There will always be an ecosystem embracing data lakes, data warehouses, and other analytics systems that requires data pipelines to connect all the pieces and serve analytics-ready data to various consumers from BI reporting to data science, machine learning, and feeding data apps.
5 Technological factors to consider as you determine your pipeline approach
1. What is the shape and scale of your source data?
Do you only deal with structured data from apps and databases, or do you also have semi-structured / nested and streaming event data to integrate and transform? If you deal with these more complex data types you need a specialized solution that uses a streaming tool such as Kafka or Kinesis to land the unusable raw data in a cheap object store (data lake) and then transform the data into something an analytics system can better digest. Transforming this kind of data can get expensive. Run a proof of concept (POC) to gain a real-world understanding of likely processing costs using competing approaches.
Do you have operational, analytical, or regulatory requirements to retain raw big data? For how long? Usually it’s the semi-structured and streaming data that drives you to the higher levels of storage. Again, you’ll want a data lake for storage and an affordable at-scale transformation engine. Note that even traditional data can become “big data” if you must search far back in time (several years’ worth) as opposed to just accessing, say, the past month of data.
2. How intensive are your transformations?
Are you only normalizing data to corporate standards; cleansing data; or performing compute-intensive operations such as aggregations or joins across data sets? Simpler transformations are less expensive and more broadly supported in data pipeline tools. More intensive transformations require platforms that support complex operations (e.g. window functions on streaming data) and can optimize the execution plan to obtain the best price performance, especially at scale.
3. How often will there be changes to source data or consumption requirements?
Change requests come from adding new use cases or modifying existing ones. They also can be imposed by changes in source data, such as new fields being added or changes to the semantics or structure of existing fields. In an environment where changes occur frequently you’ll need tooling that makes it easy to implement, test, and deploy pipeline logic changes with a minimum of skill. Otherwise data engineering becomes a bottleneck, bogged down with pipeline maintenance at the expense of more strategic work.
4. How will you manage pipeline changes?
After a pipeline is written it must be monitored and maintained. If the schema changes, or if the underlying API breaks, someone has to fix it – especially if it’s mission critical.
The more frequent you expect changes and the tighter your SLA for delivering changes, the tighter your process must be and the more pipeline engineering staff you must have. Of course, having a tool with automation reduces the staffing need, all else being equal. A tool that uses a common language like SQL further expands the pool of people available to maintain pipelines and make each change quicker to implement.
5. How do you deploy data infrastructure?
If you’re running transformations in the destination system (ELT), your tool must run in that environment. If you’re running ETL, your transformation system must be deployed in a manner acceptable to the company. For instance, some companies don’t allow data to leave their data center or cloud account, which precludes the use of a SaaS-only data integration service. And of course if you are hybrid and/or multi-cloud and want to standardize on a single vendor across these, your choices are also limited
Pipeline Evaluation Summary
We hope this article has provided a foundation for making a decision around data pipelines. In reality you’re likely to purchase a mix of data integration and data transformation tooling; in some cases you’ll build ETL pipelines and in others use ELT and push-down processing. You must always be cognizant of the skills burden you are putting on the organization with these choices, as well as how these choices may look several years from now in the face of changing requirements (data freshness in particular) and new analytics systems options (whether your choices create lock-in). Last but not least, the cost of turning raw data into analytics-ready data sets is large and growing in many companies. Your data pipeline technology choices have a major impact on your analytics cost profile.
More Information About the Upsolver Platform
There’s much more information about the Upsolver platform, including how it automates a full range of data best practices, real-world stories of successful implementations, and more, at www.upsolver.com.
Read about how you can build, scale, and future-proof your data pipeline architecture.
To speak with an expert, please schedule a demo: https://www.upsolver.com/schedule-demo.
See for yourself how easy Upsolver is to use, and how well it works. Download and use the free version: https://app.upsolver.com/signup.
If you have any questions, or wish to discuss this integration or explore other use cases, start the conversation in our Upsolver Community Slack channel.