What is workflow management?
Data engineers, data scientists, analysts, and anyone working in any kind of data role have to juggle an ever-increasing number of scheduled tasks. It’s rare for a task to stand alone; there are usually several dependencies between tasks, creating a complex web of interrelated batch or streaming jobs made up of strings of tasks that must be completed in a specific order.
Many are critical tasks that could open serious security holes or undermine the reliability of crucial models if they get overlooked. When the workload grows too heavy, it’s no longer possible to keep track of them with cron jobs or spreadsheets. That’s when you need workflow management software (WMS) to help automate all the processes. Directed acyclic graphs, or DAGs, are one way to plot complicated data workflows and keep track of the interlinked tasks that need to be performed, but as tasks multiply, DAGs too can get out of hand.
Keeping all your DAGs under control, visible, and trackable enables data teams to spot where errors arise. A WMS organizes DAGs to help keep bad data out of the ecosystem, often by preventing downstream tasks from running until upstream failures have been resolved.
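To make the DAG idea concrete, here is a minimal, dependency-free sketch of how a WMS walks a graph of tasks: run them in topological order and block anything downstream of a failure. The task names are made up for illustration, and a real WMS would execute actual jobs rather than just record names.

```python
# Toy model of DAG execution: tasks run in dependency order, and a
# failed task blocks everything downstream of it.
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "clean": {"extract"},
    "load": {"clean"},
    "report": {"load"},
}

def run_dag(dag, failing=()):
    """Run tasks in dependency order; skip anything downstream of a failure."""
    finished, skipped = [], []
    for task in TopologicalSorter(dag).static_order():
        upstream_ok = all(dep in finished for dep in dag[task])
        if task in failing or not upstream_ok:
            skipped.append(task)   # bad data never flows downstream
        else:
            finished.append(task)  # a real WMS would execute the job here
    return finished, skipped

finished, skipped = run_dag(dag, failing={"clean"})
# "extract" succeeds; "clean" fails, so "load" and "report" are blocked.
```

This is exactly the behavior described above: once "clean" fails, the WMS refuses to run "load" and "report" until the failure is cleared up.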
WMS solutions are still evolving to meet different use cases and team needs. Two popular options on the market are Apache Airflow and Apache Luigi. Here’s a brief overview of each solution, and a head-to-head comparison to help you choose between them.
What is Airflow?
Airflow is the WMS that Airbnb built to help its data engineers, data scientists, and analysts stay on top of building, monitoring, and retrofitting data pipelines, because they couldn’t find an existing setup that met their needs. When the system was complete, Airbnb decided to share it as open-source software under the Apache license.
Airflow pipelines are defined in Python to make pipeline generation from configuration files or other metadata more fluid. You can introspect code, subclass, metaprogram, and import libraries when building your pipelines. Airflow defines workflows as DAGs, and tasks are instantiated dynamically.
Airflow workflows can be as complex or simple as you like. Airbnb uses them for tasks like data warehousing and preprocessing; growth analytics; A/B testing analysis; database maintenance; analytics for search, session, and email metrics; and more. It comes with prebaked operators that you can use to build tasks, or you can create new ones from scratch. Each task can be broken down into smaller executable pieces, which makes it all more flexible, and dependencies are specified separately from the task itself.
Airflow is built using:
- Hooks to abstract information
- Operators to generate tasks that become nodes
- Executors (usually Celery) that run jobs remotely, handle message queuing, and decide which worker will execute each task
The rest of the architecture includes a metadata repository, usually MySQL or Postgres, that tracks task status and other persistent information, and a scheduler that enables distributed execution and parallelization. You can set DAGs to run automatically by letting the scheduler trigger them at specified intervals.
There’s a rich command line interface (CLI) for testing, running, backfilling, describing, and clearing DAGs, and the webapp adds a powerful, user-friendly UI for exploring DAGs that makes the entire solution much easier to use. You can use the UI for an impressive number of tasks, from visualizing pipeline dependencies and analyzing time usage to changing task statuses and forcing a task to run. It’s also possible to run SQL queries against the registered connections, check result sets, and create and share simple charts.
Finally, Airflow is highly extensible. It plays well with Hive, Presto, MySQL, HDFS, Postgres, and S3, and allows you to trigger arbitrary scripts.
What is Luigi?
Luigi was built by Spotify for its data science teams to use to build long-running pipelines of thousands of tasks that stretch across days or weeks. It was never intended to replace lower-level processors like Hadoop, but to help stitch tasks together into smooth workflows. Like Airbnb, Spotify made Luigi available on an open-source license under Apache.
Luigi is a Python package, but you can also use it to trigger non-Python tasks and write pipelines in other languages. With Luigi, it’s easy to reuse code, fork execution paths, and write complex dependency graphs, and there’s a large library of stock tasks and target data systems, including Hadoop, Hive queries, Scalding, Redshift, PostgreSQL, and Google BigQuery.
Spotify uses Luigi for data processing and the modeling that underpins recommendations, for A/B test analysis, to power dashboards and reports, and for typically long-running work like Hadoop jobs, dumping data to and from databases, and running ML algorithms. Luigi enables complex data pipelines for batch jobs, dependency resolution, workflow management, pipeline visualization, failure handling, command line integration, and more.
With Luigi, you can set workflows as tasks and dependencies, like with Airflow, but unlike Airflow, Luigi doesn’t use DAGs. Instead, Luigi refers to “tasks” and “targets.” Targets are both the results of a task and the input for the next task.
Luigi has 3 steps to construct a pipeline:
- requires() defines the dependencies between the tasks
- output() defines the target of the task
- run() defines the computation performed by each task
In Luigi, tasks are intricately connected with the data that feeds into them, making it hard to create and test a new task in isolation rather than just stringing tasks together. Because of this setup, it can also be difficult to change a task, since you’ll have to change each dependent task individually as well.
Luigi offers a web interface in the form of a webapp that can run locally. You can use it to search and filter tasks, see visualizations that track pipelines and follow progress through tasks, and view which tasks are running, failed, or completed, as well as a graph of dependencies. However, it’s hard to see task logs and failures; to do so, you need to examine the cron worker logs and find the relevant task log. You also can’t inspect tasks before execution in Luigi, so you can’t know in advance what code a given task will run.
Airflow vs. Luigi
Although Airflow and Luigi share some similarities (both are open source under the Apache license and, like most WMS, defined in Python), the two solutions are quite different. Luigi is based on pipelines of tasks that share input and output information and is target-based, while Airflow is based on DAG representation and doesn’t have a concept of input or output, just of flow.
When it comes to running a complex string of tasks, Luigi doesn’t really have a straightforward option. There’s no simple way to set one task to begin before the first task has completely finished, even though overlapping tasks would speed things up. You can do it, but it demands complex code. In contrast, Airflow makes it easy to add tasks or dependencies programmatically in a loop, with simple parallelization that’s enabled automatically by the Executor.
When it comes to scheduling workflows and tasks, Airflow wins hands down. You can set up distributed execution through an executor and then leave the built-in scheduler to start tasks automatically at the right time. Airflow can run multiple DAGs at once and trigger a workflow at specified intervals or times, while Luigi has none of this functionality. With Luigi, there’s no built-in triggering and no support for distributed execution. Pipelines always begin manually or through a cron job from the command line, although that does enable you to start tasks independently or use custom scheduling to pick specific tasks.
Airflow’s UI is also far superior to Luigi’s, which is frankly minimal. With Airflow, you can see and interact with running tasks and executions much better than you can with Luigi.
When it comes to restarting and rerunning pipelines, Luigi again has pros and cons. Luigi makes it easy to restart a failed pipeline after you’ve addressed the failure, but once a pipeline has completed, it’s hard to rerun. Airflow’s Celery executor, by contrast, makes it easy both to restart failed pipelines and to rerun completed ones.
If you’re hoping to scale, Airflow is the best choice, since Luigi’s lack of execution distribution holds it back from scaling beyond a single pipeline. But smaller organizations may still find that Luigi provides enough power for their needs and appreciate the range of prebaked templates and tasks that it offers.