What is workflow management?
Data engineers, data scientists, analysts, and anyone else in a data role have to juggle an ever-increasing number of scheduled tasks. It’s rare for a task to stand alone; there are usually dependencies between them, creating a complex web of interrelated batch or streaming jobs whose tasks must be completed in a specific order.
Many of these are critical tasks that, if overlooked, could open serious security holes or undermine the reliability of crucial models. When the workload gets too heavy, it’s no longer possible to keep track of everything with cron jobs or spreadsheets. That’s when you need workflow management software (WMS) to automate the processes. Directed acyclic graphs, or DAGs, are one way to plot complicated data workflows and track the interlinked tasks that need to be performed, but as tasks multiply, even DAGs can get out of hand.
Keeping all your DAGs under control, visible, and trackable enables data teams to spot where errors arise. A WMS organizes DAGs to help keep bad data out of the ecosystem, often preventing downstream tasks from running until upstream failures have been resolved.
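The blocking behavior can be sketched in a few lines of Python (a toy model, not a real WMS): represent the DAG as a mapping from each task to its upstream dependencies, and only mark a task runnable when every upstream task has succeeded.

```python
# Toy sketch (illustrative, not a real WMS): a DAG as a dict of
# task -> upstream dependencies, where a failed task blocks
# everything downstream of it.

def runnable_tasks(dag, statuses):
    """Return tasks whose upstream dependencies have all succeeded."""
    return [
        task for task, upstream in dag.items()
        if statuses.get(task) is None            # not yet run
        and all(statuses.get(dep) == "success" for dep in upstream)
    ]

dag = {
    "extract": [],
    "validate": ["extract"],
    "load": ["validate"],    # blocked while 'validate' is failed
}

statuses = {"extract": "success", "validate": "failed"}
print(runnable_tasks(dag, statuses))  # -> [] : 'load' is blocked
```

Once the failure is cleared (`statuses["validate"] = "success"`), `load` becomes runnable again; real workflow managers implement the same gate, just with persistent state and retries.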
WMS is still evolving and developing to meet different use cases and team needs. Two popular options on the market are Apache Airflow and Apache Luigi. Here’s a brief overview of each solution, and a head-to-head comparison to help you choose between them.
What is Airflow?
Airflow is the WMS that Airbnb built to help their data engineers, data scientists, and analysts keep on top of the tasks of building, monitoring, and retrofitting data pipelines, because they couldn’t find a setup that met their needs. When the system was complete, Airbnb decided to share it as open source software under the Apache license.
Airflow pipelines are defined in Python to make pipeline generation from configuration files or other metadata more fluid. You can introspect code, subclass, use meta-programming, and import libraries when building your pipelines. Airflow defines workflows as DAGs, and tasks are instantiated dynamically.
Airflow workflows can be as complex or simple as you like. Airbnb uses them for tasks like data warehousing and preprocessing; growth analytics; A/B testing analysis; database maintenance; analytics for search, session, and email metrics; and more. Airflow comes with prebaked operators that you can use to build tasks, or you can create new ones from scratch. Each task can be broken down into smaller executable pieces, which makes the whole workflow more flexible, and dependencies are specified separately from the task itself.
Airflow is built using:
- Hooks, which abstract connections to external platforms and databases
- Operators, which generate the tasks that become nodes in the DAG
- Executors (usually Celery) that run jobs remotely, handle message queuing, and decide which worker will execute each task
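Putting those pieces together, a DAG file is just Python. Here is a minimal sketch, assuming Apache Airflow 2.x is installed; the DAG id, task names, and schedule are illustrative, not taken from the article.

```python
# Minimal Airflow DAG sketch (assumes apache-airflow 2.4+ is installed;
# all names here are illustrative).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling rows from the source")

def load():
    print("writing rows to the warehouse")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",      # the scheduler triggers runs automatically
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task   # dependency: extract runs before load
```

Dropping a file like this into Airflow's `dags/` folder is enough for the scheduler to pick it up and trigger a run on each interval.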
The rest of the architecture includes a metadata repository, usually MySQL or Postgres, that tracks task status and other persistent state, and the scheduler, which enables distributed execution and parallelization. You can set DAGs to run automatically by letting the system trigger them at specific intervals.
There’s a rich command line interface (CLI) for testing, running, backfilling, describing, and clearing the DAGs, but the webapp adds a powerful and user-friendly UI for exploring DAGs and makes the entire solution much easier to use. You can use the UI for an impressive number of tasks, from visualizing pipeline dependencies and analyzing time usage to changing task statuses and forcing a task to run. It’s also possible to run SQL queries against the registered connections, check result sets, and create and share simple charts.
Finally, Airflow is highly extensible. It plays well with Hive, Presto, MySQL, HDFS, Postgres, and S3, and allows you to trigger arbitrary scripts.
What is Luigi?
Luigi was built by Spotify for its data science teams to use to build long-running pipelines of thousands of tasks that stretch across days or weeks. It was never intended to replace lower-level processors like Hadoop, but to help stitch tasks together into smooth workflows. Like Airbnb, Spotify released Luigi as open source under the Apache license.
Luigi is a Python package, but you can also use it to trigger non-Python tasks and write pipelines in other languages. With Luigi, it’s easy to reuse code, fork execution paths, and write complex dependency graphs, and there’s a large library of stock tasks and target data systems, including Hadoop, Hive queries, Scalding, Redshift, PostgreSQL, and Google BigQuery.
Spotify uses Luigi for data processing and the modeling that underpins recommendations; for A/B test analysis; to power dashboards and reports; and for typically long running processes such as Hadoop jobs, dumping data to/from databases, and running ML algorithms. Luigi enables complex data pipelines for batch jobs, dependency resolution, workflow management, pipelines visualization, handling failures, command line integration, and more.
With Luigi, you can set workflows as tasks and dependencies, as with Airflow. But unlike Airflow, Luigi doesn’t use DAGs. Instead, Luigi refers to “tasks” and “targets.” Targets are both the results of a task and the input for the next task.
Luigi uses three methods to construct a pipeline:
- requires() defines the dependencies between the tasks
- output() defines the target of the task
- run() defines the computation performed by each task
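To make the pattern concrete, here is a small pure-Python imitation of Luigi's task model. It does not use the real luigi package; the `Task` base class, `build()` runner, and file-path targets are simplified stand-ins for Luigi's equivalents.

```python
# Pure-Python sketch of Luigi's task pattern (NOT the real luigi package):
# each task declares requires(), output() (its target), and run().
import os
import tempfile

WORKDIR = tempfile.mkdtemp()

class Task:
    def requires(self):        # upstream tasks this task depends on
        return []
    def output(self):          # the target (a file path here)
        raise NotImplementedError
    def complete(self):        # a task is done once its target exists
        return os.path.exists(self.output())
    def run(self):
        raise NotImplementedError

def build(task):
    """Tiny runner: recursively satisfy dependencies, then run the task."""
    for dep in task.requires():
        build(dep)
    if not task.complete():
        task.run()

class Extract(Task):
    def output(self):
        return os.path.join(WORKDIR, "raw.txt")
    def run(self):
        with open(self.output(), "w") as f:
            f.write("hello world")

class Shout(Task):
    def requires(self):
        return [Extract()]
    def output(self):
        return os.path.join(WORKDIR, "shout.txt")
    def run(self):
        with open(Extract().output()) as f:  # upstream target is our input
            data = f.read()
        with open(self.output(), "w") as f:
            f.write(data.upper())

build(Shout())
```

Note how `Extract`'s target doubles as `Shout`'s input: that target-passing is exactly the coupling between tasks and data described below.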
In Luigi, tasks are tightly coupled to the data that feeds into them, which makes it hard to create and test a new task in isolation rather than just stringing tasks together. This coupling also makes tasks difficult to change, because you’ll have to update each dependent task individually as well.
Luigi offers a web interface in the form of a webapp that can run locally. You can use it to search and filter tasks, see visualizations that track pipelines and follow progress through tasks, and view which tasks are running, failed, or completed, along with a graph of dependencies. However, it’s hard to see task logs and failures; to do so, you need to dig through the cron worker logs to find the relevant task log. You also can’t inspect tasks before execution in Luigi, so you can’t see what code will run for a given task.
Airflow vs. Luigi
Although Airflow and Luigi share some similarities—both are open-source, both operate on an Apache license, and both, like most WMS, are defined in Python—the two solutions are quite different. Luigi is based on pipelines of tasks that share input and output information and is target-based, while Airflow is based on DAG representation and doesn’t have a concept of input or output, just of flow.
When it comes to running a complex string of tasks, Luigi doesn’t really have a straightforward option. There’s no simple way to set one task to begin before the first task has completely finished, even though overlapping tasks would speed things up. You can do it, but it demands complex code. In contrast, Airflow makes it easy to add tasks or dependencies programmatically in a loop, with simple parallelization that’s enabled automatically by the Executor.
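Conceptually, what the Executor automates is fan-out: tasks with no unmet dependencies can run on workers in parallel. A plain-Python sketch of that idea using a thread pool (an analogy, not Airflow code; the `load` function and table names are invented for illustration):

```python
# Sketch of executor-style fan-out: independent tasks, generated in a
# loop, run in parallel once their shared upstream work is done.
from concurrent.futures import ThreadPoolExecutor

def load(table):
    # stand-in for a real per-table load task
    return f"loaded {table}"

tables = ["users", "orders", "events"]  # tasks generated in a loop
with ThreadPoolExecutor() as pool:
    results = list(pool.map(load, tables))

print(results)  # -> ['loaded users', 'loaded orders', 'loaded events']
```

In Airflow, the same loop would emit one operator per table inside the DAG file, and the Executor would distribute them across workers without any extra orchestration code.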
When it comes to scheduling workflows and tasks, Airflow wins hands down. You can set up distributed execution through an executor and then leave the built-in scheduler to start tasks automatically at the right time. Airflow can run multiple DAGs at once and trigger a workflow at specified intervals or times, while Luigi has none of this functionality. With Luigi, there’s no built-in triggering and no support for distributed execution. Pipelines always begin manually or through a cron job from the command line, although that does enable you to start tasks independently or use custom scheduling to pick specific tasks.
Airflow’s UI is also far superior to Luigi’s, which is frankly minimal. With Airflow, you can see and interact with running tasks and executions much better than you can with Luigi.
When it comes to restarting and rerunning pipelines, Luigi again has its pros and cons. Luigi makes it easy to restart a failed pipeline after you’ve addressed the failure, but once a pipeline is completed, it’s hard to rerun. But Airflow’s Celery executor makes it easy to restart failed pipelines and to rerun a completed one.
If you’re hoping to scale, Airflow is the best choice, since Luigi’s lack of execution distribution holds it back from scaling beyond a single pipeline. But smaller organizations may still find that Luigi provides enough power for their needs and appreciate the range of prebaked templates and tasks that it offers.
Upsolver SQLake – An Alternative Approach
SQLake is a great alternative to tools like Apache Airflow and Luigi for data pipeline orchestration because it offers a number of key benefits, including:
- Simplified data pipeline management: SQLake allows you to build, test, and maintain data pipelines using familiar SQL syntax, which can be easier and more intuitive than using other tools. Additionally, SQLake provides a unified platform for managing all aspects of your data pipelines, which can help to reduce the complexity of working with large data lakes.
- Improved scalability and performance: SQLake is designed to handle the scale and complexity of modern data lakes, and provides efficient data processing and query performance. The compute cluster scales up and down automatically, which can help to simplify the deployment and management of your data pipelines.
- Enhanced security and governance: SQLake includes robust security and governance features to help organizations ensure that their data is protected and managed in compliance with industry regulations and best practices.
In the following example, we use sample data to demonstrate how to create a job that reads from a staging table, applies business logic transformations, and inserts the results into an output table. You can test this code in SQLake with or without the sample data.
```sql
CREATE JOB transform_orders_and_insert_into_athena_v1
    START_FROM = BEGINNING
    ADD_MISSING_COLUMNS = TRUE
    RUN_INTERVAL = 1 MINUTE
AS INSERT INTO default_glue_catalog.database_2883f0.orders_transformed_data MAP_COLUMNS_BY_NAME
    -- Use the SELECT statement to choose columns from the source and implement your business logic transformations.
    SELECT
      orderid AS order_id,                -- rename columns
      MD5(customer.email) AS customer_id, -- hash or mask columns using built-in functions
      customer_name,                      -- computed field defined later in the query
      nettotal AS total,
      $commit_time AS partition_date      -- populate the partition column with the processing time of the event, automatically cast to DATE type
    FROM default_glue_catalog.database_2883f0.orders_raw_data
    LET customer_name = customer.firstname || ' ' || customer.lastname -- create a computed column
    WHERE ordertype = 'SHIPPING'
      AND $commit_time BETWEEN run_start_time() AND run_end_time();
```