In going from raw data to insights, data traverse a pipeline where they are transformed, dissected, augmented, aggregated and modeled. Data pipelines have many constraints. For example, they cannot be cyclic: the data must flow in one direction, from beginning to end. Data can also take many forms and may flow through the pipeline continuously or in bursts. As data engineers, we are tasked with authoring, operationalizing, orchestrating, and maintaining these various data pipelines. Anything that simplifies that work for us is a gift. In this article, let’s discuss the possibility of reducing the orchestration burden through self-orchestration frameworks.
Orchestration has two main components: scheduling and dependency management.
- Scheduling: A pipeline is triggered under certain conditions. Triggering can be achieved via a CRON expression (e.g. once every day at 2 am), or an external trigger (e.g. an API call).
- Dependency management: Pre-defined relationships between pipeline components (tasks) dictate the execution state and order of each individual task. Inter-task relationships are specified through explicit configuration, such as “when task A finishes, task B should begin execution,” “tasks B, C, and D are generated by and then executed following task A,” or “the outcome of computations in task A determines whether task B or task C is executed.”
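To make the dependency-management half concrete, here is a minimal, hypothetical sketch of explicit dependency configuration: a hand-declared task graph resolved into an execution order. The task names and the `graphlib`-based resolution are illustrative only, not any particular orchestrator’s API.

```python
from graphlib import TopologicalSorter

# Hypothetical task functions; in a real orchestrator these would be
# operators or pipeline steps.
def make_task(name):
    def run():
        return f"ran {name}"
    return run

tasks = {name: make_task(name) for name in "ABCD"}

# Explicit dependency configuration: "when task A finishes,
# tasks B, C, and D begin execution."
dependencies = {"B": {"A"}, "C": {"A"}, "D": {"A"}}

# The orchestrator resolves an execution order from the declared edges.
order = list(TopologicalSorter(dependencies).static_order())
results = [tasks[name]() for name in order]
print(order)  # "A" comes first, then B, C, D in some topological order
```

Note that every edge here had to be written down by hand; that is exactly the burden self-orchestration removes.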
When we say self-orchestrating, we mean that neither the trigger conditions nor the dependencies need to be specified by the user. The underlying system (such as Upsolver SQLake) infers these from the incoming data and the outcome the user expresses.
But how exactly does the system infer the requisite orchestration?
The fundamental difference between explicit and implicit orchestration is that the former is unaware of the data moving through the workflow it’s orchestrating, while the latter is fully aware. Being unaware has some advantages: orchestrators can be used for any kind of pipeline (not just data pipelines), and the pipeline author can be sure that the path she delineated will be followed exactly (if possible). However, it also has strong drawbacks. Orchestrators are incapable of making decisions based on the contents and conditions of the pipeline, such as reacting to unprecedented anomalies in the data. Unless a path is specified for the set of conditions that arises, the data are prevented from traversing the pipeline, or worse yet, the orchestrator simply follows naive instructions, leading to garbage in, garbage out.
Self-orchestrating execution engines, because they are aware of the processes, inputs and outputs, can inherently handle anomalies in and evolution of data. They are purpose-built for data pipelines, and use data to decide what to do. In short, they are adaptable.
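As a toy illustration of data-aware behavior (a hypothetical sketch, not any engine’s real logic): a fixed-path orchestrator would push every record through the same steps, whereas a data-aware engine can divert records based on what it actually sees.

```python
# Toy data-aware routing: records that don't match the expected shape are
# quarantined instead of being pushed downstream and producing garbage output.
def process(records, expected_keys):
    clean, quarantined = [], []
    for record in records:
        if expected_keys <= record.keys():
            clean.append(record)
        else:
            quarantined.append(record)  # anomaly: handled, not propagated
    return clean, quarantined

records = [{"id": 1, "amount": 9.5}, {"id": 2}]  # second record is anomalous
clean, quarantined = process(records, {"id", "amount"})
print(len(clean), len(quarantined))  # 1 1
```

The point is not the routing itself but that the decision is driven by the data, with no pre-specified path for this particular anomaly.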
Here are a few examples of self-orchestration in action:
When you write the following in SQLake to ingest data:
CREATE JOB <job_name>
  AS COPY FROM <source_type> <source_name> <other_source_parameters>
  INTO <destination_table_name>
SQLake creates a task that knows it’s reading data from a source into the lake. This task infers the type and schema of the data, so you can immediately query the data.
For instance, use
SELECT * FROM <destination_table_name> LIMIT 10
to see a sample of the ingested table.
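The schema inference behind this can be pictured with a toy sketch. The following is purely illustrative Python, not SQLake’s implementation, which handles nested fields, type widening over time, and much more.

```python
# Toy schema inference: scan ingested records and derive a column -> type map.
def infer_schema(records):
    schema = {}
    for record in records:
        for column, value in record.items():
            inferred = type(value).__name__
            # On conflicting types, widen to string -- a common fallback.
            if column in schema and schema[column] != inferred:
                schema[column] = "str"
            else:
                schema.setdefault(column, inferred)
    return schema

events = [
    {"user_id": 7, "action": "click"},
    {"user_id": 8, "action": "view", "duration_ms": 132},
]
print(infer_schema(events))
# {'user_id': 'int', 'action': 'str', 'duration_ms': 'int'}
```

Because the engine maintains this mapping as data arrives, the ingested table is immediately queryable without the user declaring a schema.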
In the above, if you simply replace CREATE JOB with CREATE SYNC JOB, the table in the data lake will stay up to date with any data added to the source, in perpetuity.
You can also build pipelines with transformations and aggregations on the ingested data that stay up to date within a freshness interval you specify. Try for example:
CREATE SYNC JOB <job_name>
  START_FROM = BEGINNING
  RUN_INTERVAL = 1 MINUTE
AS INSERT INTO <destination_table_name> MAP_COLUMNS_BY_NAME
  SELECT <aggregations_on_columns>
  FROM <source_table_name>
  WHERE $event_time BETWEEN run_start_time() AND run_end_time()
  GROUP BY <group_by_columns>
Rather than configuring when to perform the aggregation, as you would for an orchestrator, here you declare the minimum required freshness interval for updated aggregations. How and when the data need to be ingested and aggregated are handled by the execution engine. If the interval passes with no new data added at the source, unlike an explicit orchestrator scheduled to run on the clock, the self-orchestrating engine avoids running an unnecessary job that would attempt to aggregate non-existent data.
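The skip-when-empty behavior can be sketched in a few lines. This is a hypothetical simplification of what a data-aware engine decides at each interval boundary, not SQLake code; the function and field names are invented for illustration.

```python
def plan_run(new_events, run_start, run_end):
    """Decide whether an aggregation job should run for this interval.

    A clock-driven orchestrator would fire unconditionally; a data-aware
    engine first checks whether any events actually arrived in the window.
    """
    in_window = [e for e in new_events if run_start <= e["event_time"] < run_end]
    if not in_window:
        return {"run": False, "reason": "no new data in interval"}
    return {"run": True, "events": len(in_window)}

# No events arrived between t=60 and t=120: the job is skipped.
print(plan_run([{"event_time": 30}], 60, 120))
# With one event in the window, the job runs.
print(plan_run([{"event_time": 90}], 60, 120))
```

The declared RUN_INTERVAL sets the freshness ceiling; the engine still decides, per interval, whether there is anything worth doing.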
These simple examples demonstrate how being data aware allows self-orchestrating frameworks to reduce the engineering burden of authoring and maintaining data pipelines.
Data as a product
A powerful innate benefit of self-orchestration is treating data as a product. Rather than envisioning and incorporating every transformation required of data in a pipeline, we can think of our deliverables as what we want to ask of the data. Self-orchestrating engines such as SQLake produce cataloged and versioned data assets at each step, from every job. For a desired deliverable, we simply point to the relevant data asset as the source and declare its required SLA. We can do so as many times as we want, reusing the same data asset for different deliverables, without worrying about scaling pains destroying the overall pipeline. In this framework, downstream applications simply cannot ruin data quality or otherwise break upstream portions of the pipeline. To achieve the same level of data integrity when manually orchestrating a pipeline, the data engineer must explicitly checkpoint every potential data asset along the way.
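The data-as-a-product idea can be sketched as a catalog of append-only, versioned assets that many deliverables read from. This is a minimal hypothetical sketch of the concept, with invented class and method names, not SQLake’s catalog API.

```python
# Minimal sketch of a versioned data-asset catalog (hypothetical API).
# Each job's output is registered as an immutable version; downstream
# consumers read a version, so they cannot corrupt upstream data.
class Catalog:
    def __init__(self):
        self.assets = {}  # asset name -> list of versions (append-only)

    def publish(self, name, data):
        versions = self.assets.setdefault(name, [])
        versions.append(tuple(data))       # stored immutably
        return len(versions) - 1           # version id

    def read(self, name, version=-1):
        return self.assets[name][version]  # latest version by default

catalog = Catalog()
catalog.publish("raw_events", [1, 2, 3])

# Two independent deliverables reuse the same cataloged asset; neither
# can mutate it, so neither can break the other or anything upstream.
totals = sum(catalog.read("raw_events"))
count = len(catalog.read("raw_events"))
print(totals, count)  # 6 3
```

Because every step’s output is a checkpointed asset by construction, the manual checkpointing described above comes for free.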
Data engineering has many fun challenges, but writing orchestrator code isn’t one of them. A data-fueled, self-orchestrating pipeline builder not only alleviates the burden of orchestration, but also comes with perks of data integrity and data as a product.