Explore our expert-made templates & start with the right one for you.
If you’re a data engineer or software architect, you need a copy of this new O’Reilly report. Download it to learn about the complexity of modern data pipelines, education on new techniques being employed to address it, and advice on which approach to take for each use case so that both internal users and customers have their analytics needs met. Download the report now
Apache Airflow is a powerful and widely-used open-source workflow management system (WMS) designed to programmatically author, schedule, orchestrate, and monitor data pipelines and workflows. Airflow enables you to manage your data pipelines by authoring workflows as Directed Acyclic Graphs (DAGs) of tasks. There’s no concept of data input or output – just flow. You manage task scheduling as code, and can visualize your data pipelines’ dependencies, progress, logs, code, trigger tasks, and success status.
Airflow was developed by Airbnb to author, schedule, and monitor the company’s complex workflows. Airbnb open-sourced Airflow early on, and it became a Top-Level Apache Software Foundation project in early 2019.
Written in Python, Airflow is increasingly popular, especially among developers, due to its focus on configuration as code. Airflow’s proponents consider it to be distributed, scalable, flexible, and well-suited to handle the orchestration of complex business logic. According to marketing intelligence firm HG Insights, as of the end of 2021 Airflow was used by almost 10,000 organizations, including Applied Materials, the Walt Disney Company, and Zoom. (And Airbnb, of course.) Amazon offers AWS Managed Workflows on Apache Airflow (MWAA) as a commercial managed service. Astronomer.io and Google also offer managed Airflow services.
Apache Airflow: Orchestration via DAGs
Airflow enables you to:
- orchestrate data pipelines over object stores and data warehouses
- run workflows that are not data-related
- create and manage scripted data pipelines as code (Python)
Airflow organizes your workflows into DAGs composed of tasks. A scheduler executes tasks on a set of workers according to any dependencies you specify – for example, to wait for a Spark job to complete and then forward the output to a target. You add tasks or dependencies programmatically, with simple parallelization that’s enabled automatically by the executor.
The Airflow UI enables you to visualize pipelines running in production; monitor progress; and troubleshoot issues when needed. Airflow’s visual DAGs also provide data lineage, which facilitates debugging of data flows and aids in auditing and data governance. And you have several options for deployment, including self-service/open source or as a managed service.
Ideal Use Cases for Airflow
Prior to the emergence of Airflow, common workflow or job schedulers managed Hadoop jobs and generally required multiple configuration files and file system trees to create DAGs (examples include Azkaban and Apache Oozie).
But in Airflow it could take just one Python file to create a DAG. And because Airflow can connect to a variety of data sources – APIs, databases, data warehouses, and so on – it provides greater architectural flexibility.
Airflow orchestrates batch jobs, and is most suitable:
- when you must automatically organize, execute, and monitor data flow
- when your data pipelines change slowly (days or weeks – not hours or minutes), are related to a specific time interval, or are pre-scheduled
- for ETL pipelines that extract batch data from multiple sources, and run Spark jobs or other data transformations
Other applicable use cases include:
- machine learning model training, such as triggering a SageMaker job
- automated generation of reports
- backups and other DevOps tasks, such as submitting a Spark job and storing the resulting data on a Hadoop cluster
As with most applications, Airflow is not a panacea, and is not appropriate for every use case. There are also certain technical considerations even for ideal use cases.
Reasons Managing Workflows with Airflow can be Painful
First and foremost, Airflow orchestrates batch workflows. It is not a streaming data solution. The scheduling process is fundamentally different:
- batch jobs (and Airflow) rely on time-based scheduling
- streaming pipelines use event-based scheduling
Airflow doesn’t manage event-based jobs. It operates strictly in the context of batch processes: a series of finite tasks with clearly-defined start and end tasks, to run at certain intervals or trigger-based sensors. Batch jobs are finite. You create the pipeline and run the job. But streaming jobs are (potentially) infinite, endless; you create your pipelines and then they run constantly, reading events as they emanate from the source. It’s impractical to spin up an Airflow pipeline at set intervals, indefinitely.
Pipeline versioning is another consideration. Storing metadata changes about workflows helps analyze what has changed over time. But Airflow does not offer versioning for pipelines, making it challenging to track the version history of your workflows, diagnose issues that occur due to changes, and roll back pipelines.
Also, while Airflow’s scripted “pipeline as code” is quite powerful, it does require experienced Python developers to get the most out of it. Python expertise is needed to:
- create the jobs in the DAG
- stitch the jobs together in sequence
- program other necessary data pipeline activities to ensure production-ready performance
- debug and troubleshoot
As a result, Airflow is out of reach for non-developers, such as SQL-savvy analysts; they lack the technical knowledge to access and manipulate the raw data. This is true even for managed Airflow services such as AWS Managed Workflows on Apache Airflow or Astronomer.
Also, when you script a pipeline in Airflow you’re basically hand-coding what’s called in the database world an Optimizer. Databases include Optimizers as a key part of their value. Big data systems don’t have Optimizers; you must build them yourself, which is why Airflow exists.
But despite Airflow’s UI and developer-friendly environment, Airflow DAGs are brittle. Big data pipelines are complex. There are many dependencies, many steps in the process, each step is disconnected from the other steps, and there are different types of data you can feed into that pipeline. A change somewhere can break your Optimizer code. And when something breaks it can be burdensome to isolate and repair. For example, imagine being new to the DevOps team, when you’re asked to isolate and repair a broken pipeline somewhere in this workflow:
Source: Apache Software Foundation
Finally, a quick Internet search reveals other potential concerns:
- steep learning curve
- Operators execute code in addition to orchestrating workflow, further complicating debugging
- many components to maintain along with Airflow (cluster formation, state management, and so on)
- difficulty sharing data from one task to the next
It’s fair to ask whether any of the above matters, since you cannot avoid having to orchestrate pipelines. And Airflow is a significant improvement over previous methods; is it simply a necessary evil? Well, not really – you can abstract away orchestration in the same way a database would handle it “under the hood.”
Eliminating Complex Orchestration with Upsolver’s Declarative Pipelines
Upsolver is a low-code data pipeline platform for cloud data lakes that greatly simplifies building batch and streaming pipelines and automates data management on object storage services (such as S3 and Azure Blob). One of the functions it automates is pipeline workflow management. This means for Upsolver transformations you do not need Airflow.
Upsolver provides a declarative approach to building pipelines and designing your pipeline architecture patterns. In a declarative data pipeline, you specify (or declare) your desired output, and leave it to the underlying system to determine how to structure and execute the job to deliver this output.
But Airflow requires scripted (or imperative) programming; you must decide on and indicate the “how” in addition to just the “what” to process. In a way, it’s the difference between asking someone to serve you grilled orange roughy (declarative), and instead providing them with a step-by-step procedure detailing how to catch, scale, gut, carve, marinate, and cook the fish (scripted). Apologies for the roughy analogy!
The difference from a data engineering standpoint?
- Some data engineers prefer scripted pipelines, because they get fine-grained control; it enables them to customize a workflow to squeeze out that last ounce of performance.
- Others might instead favor sacrificing a bit of control to gain greater simplicity, faster delivery (creating and modifying pipelines), and reduced technical debt.
But there’s another reason, beyond speed and simplicity, that data practitioners might prefer declarative pipelines: Orchestration in fact covers more than just moving data. It also describes workflow for data transformation and table management.
Upsolver’s declarative pipelines handle the entire orchestration process, inferring the workflow from the declarative pipeline definition. And since SQL is Upsolver’s configuration language for declarative pipelines, anyone familiar with SQL can create and orchestrate their own workflows. Further, SQL is a strongly-typed language, so mapping the workflow is strongly-typed, as well (meaning every data item has an associated data type that determines its behavior and allowed usage). Airflow, by contrast, requires manual work in Spark Streaming, or Apache Flink or Storm, for the transformation code.
Upsolver also provides a visual method for building pipelines that’s synchronized with SQL commands that you can edit directly. Using only Upsolver’s visual IDE, you can build pipelines that ingest data, read data from various streaming sources and data lakes (including Amazon S3, Amazon Kinesis Streams, and Apache Kafka), and write data to the desired target (e.g. Amazon Athena, Amazon Redshift, ElasticSearch, and RDS). You also specify data transformations visually or in SQL. Upsolver automates the management and optimization of output tables, including:
- various partitioning strategies
- conversion to column-based format
- optimizing file sizes (compaction)
- upserts (updates and deletes)
Upsolver ETLs are automatically orchestrated whether you run them continuously or on specific time frames, without the need to write any orchestration code in Apache Spark or Airflow. This is how, in most instances, Upsolver basically makes Airflow redundant, including orchestrating complex workflows at scale for a range of use cases, such as machine learning.
- Airflow fills a gap in the big data ecosystem by providing a simpler way to define, schedule, visualize and monitor the underlying jobs needed to operate a big data pipeline.
- Airflow was built for batch data, requires coding skills, is brittle, and creates technical debt.
- Upsolver uses a declarative approach to pipelines and automates workflow orchestration so you can eliminate the complexity of Airflow to deliver reliable pipelines on batch and streaming data at scale.
There’s much more information about the Upsolver platform, including how it automates a full range of data best practices, real-world stories of successful implementation, and more, at www.upsolver.com.
To speak with an expert, please schedule a demo: https://www.upsolver.com/schedule-demo.
Try SQLake for free (early access). SQLake is Upsolver’s newest offering. It lets you build and run reliable data pipelines on streaming and batch data via an all-SQL experience. Try it for free. No credit card required.