Airflow Backfill: 3 Tips to Avoid the Pitfalls

Airflow offers a backfill feature that allows you to run a DAG or individual tasks within a DAG retrospectively. This can be useful if you need to add in missing data or if you want to re-run a task to fix an error.

In this article, we will discuss some best practices for managing Airflow’s backfill feature and introduce an alternative approach for orchestration that streamlines the process of managing data pipelines for data engineers.

Backfill in Airflow: The Basics

Backfilling is a crucial concept in Airflow that allows you to execute tasks retrospectively. This feature is essential when you need to process historical data or correct an error in a previous execution of the DAG. Backfilling is invoked with the backfill CLI command (airflow dags backfill in Airflow 2.x, airflow backfill in 1.x), which runs the DAG for the date range you specify, executing the task instances in that range that have not already completed successfully.
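
For illustration, here is a minimal sketch of a DAG that a backfill might target, together with the command for re-running a bounded date range. The DAG id, task, and dates are hypothetical, and the command assumes the Airflow 2.x CLI (in Airflow 1.x the subcommand is airflow backfill):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_sales_etl",          # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
) as dag:
    load = BashOperator(task_id="load_sales", bash_command="echo 'loading sales'")

# Re-run the DAG for the first week of January 2023:
#   airflow dags backfill --start-date 2023-01-01 --end-date 2023-01-07 daily_sales_etl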

Backfilling can be resource-intensive because it requires the scheduler to execute a large number of tasks in a short period. Depending on the size of the DAG and the length of the backfill period, it may take a considerable amount of time to complete the backfill. Therefore, it’s essential to understand how to control and optimize backfilling in Airflow.

Tip 1: Prevent Backfilling of Missed Tasks in Airflow

A common pitfall is the scheduler backfilling runs that were missed while it was turned off. This is expected behavior: when the scheduler comes back up, it catches up on any schedule intervals that were not run. To prevent this, you can set catchup=False on the DAG so that missed intervals are skipped, and set the depends_on_past parameter in the default_args of your DAG to True, which makes a task run only if its previous scheduled run completed successfully.
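
Continuing the hypothetical daily_sales_etl sketch from above, here is roughly how those two settings look in a DAG definition (the DAG id, dates, and task are illustrative):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    # A task instance runs only if the same task succeeded in the previous scheduled run.
    "depends_on_past": True,
}

with DAG(
    dag_id="daily_sales_etl",              # hypothetical DAG id from the earlier sketch
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,                         # skip schedule intervals missed while the scheduler was off
    default_args=default_args,
) as dag:
    load = BashOperator(task_id="load_sales", bash_command="echo 'loading sales'")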

Tip 2: Control Backfilling in Airflow by Scoping the clear Command to a Date Range

A second pitfall is the scheduler backfilling all tasks from the start date of the DAG up to the current time. This is also expected behavior when you run the airflow clear command: clearing removes the state of the task instances in the DAG, so the scheduler re-runs them once it picks the DAG up again. To prevent this, you can set the start date of your DAG to a date in the future, so that the scheduler only starts running tasks once that date is reached. You can also pass a start date and end date to the clear command so that only a specific date range is cleared and backfilled.
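
For example, a re-run can be limited to one specific week rather than the DAG's entire history. Below is a rough programmatic sketch of such a scoped clear, assuming a configured Airflow 2.x environment and the hypothetical daily_sales_etl DAG from the earlier sketches; the equivalent CLI command is shown in the comment.

from datetime import datetime
from airflow.models import DagBag

# Equivalent Airflow 2.x CLI, scoped to one week instead of the entire DAG history
# (in Airflow 1.x the subcommand is `airflow clear`):
#   airflow tasks clear --start-date 2023-01-01 --end-date 2023-01-07 daily_sales_etl

dag = DagBag().get_dag("daily_sales_etl")   # hypothetical DAG id from the earlier sketches
dag.clear(
    start_date=datetime(2023, 1, 1),        # clear, and therefore re-run, only this window
    end_date=datetime(2023, 1, 7),
)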

Tip 3: Optimize Resource Utilization During Backfills

It’s important to note that backfilling can be a resource-intensive operation, as it requires the scheduler to run a large number of tasks in a short period. If you have a large DAG or a start date far in the past, it may be better to run the tasks manually with the airflow run command (airflow tasks run in Airflow 2.x) instead of backfilling.
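
For instance, a single task instance can be run by hand for one execution date at a time, and if you do need to backfill, the DAG-level max_active_runs argument (a standard Airflow setting, not mentioned above) caps how many DAG runs execute concurrently so a backfill does not starve regularly scheduled work. A hedged sketch, reusing the hypothetical daily_sales_etl DAG and assuming Airflow 2.x:

from datetime import datetime
from airflow import DAG

# Run one task instance by hand for a single execution date (Airflow 2.x CLI;
# in Airflow 1.x the subcommand is `airflow run`):
#   airflow tasks run daily_sales_etl load_sales 2023-01-01

with DAG(
    dag_id="daily_sales_etl",              # hypothetical DAG id from the earlier sketches
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    max_active_runs=1,                     # the scheduler and backfills respect this limit
) as dag:
    ...  # task definitions go here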

Backfilling in Airflow is a way to run tasks retroactively in a DAG. It can help fill in missing data or re-run tasks to fix errors, but it can also be resource-intensive. You can control backfilling by disabling catchup and setting the depends_on_past parameter in the default_args of your DAG, and by setting the start date of your DAG to a date in the future. You can also limit the clear command to a specific date range so that only those runs are backfilled.

Tired of Debugging Airflow? Try SQLake

While Airflow is a useful tool, it can be difficult to troubleshoot. SQLake offers a simpler solution for automating data pipeline orchestration.

With SQLake you can:

  • Build reliable, maintainable, and testable data ingestion and processing pipelines for batch and streaming data, using familiar SQL syntax.
  • Run jobs that execute once and continue to run until stopped, with no scheduling or orchestration required.
  • Rely on a compute cluster that scales up and down automatically, simplifying the deployment and management of your data pipelines.

Here is a code example that ingests multiple S3 data sources into SQLake, staging the raw orders and sales data so that it can then be joined and enriched with simple transformations.

/* Ingest data into SQLake */
-- 1. Create a connection to SQLake sample data source.
CREATE S3 CONNECTION upsolver_s3_samples
    AWS_ROLE = 'arn:aws:iam::949275490180:role/upsolver_samples_role'
    EXTERNAL_ID = 'SAMPLES'
    READ_ONLY = TRUE;
-- 2. Create empty tables to use as staging for orders and sales.
CREATE TABLE default_glue_catalog.database_a137bd.orders_raw_data()
    PARTITIONED BY $event_date;
CREATE TABLE default_glue_catalog.database_a137bd.sales_info_raw_data()
    PARTITIONED BY $event_date;
-- 3. Create streaming jobs to ingest raw orders and sales data into the staging tables.
CREATE SYNC JOB load_orders_raw_data_from_s3
   CONTENT_TYPE = JSON
   AS COPY FROM S3 upsolver_s3_samples 
      BUCKET = 'upsolver-samples' 
      PREFIX = 'orders/' 
   INTO default_glue_catalog.database_a137bd.orders_raw_data; 
CREATE SYNC JOB load_sales_info_raw_data_from_s3
   CONTENT_TYPE = JSON
   AS COPY FROM S3 upsolver_s3_samples 
      BUCKET = 'upsolver-samples' 
      PREFIX = 'sales_info/'
   INTO default_glue_catalog.database_a137bd.sales_info_raw_data;
