How to Run Airflow as a Daemon

Restarting the Airflow webserver process may disrupt workflows and tasks, and the webserver may depend on other resources that must be taken into account. Careful planning and coordination are necessary to minimize disruption during the restart.

Instead, data engineers may choose to implement a data architecture that utilizes automated orchestration with tools like SQLake.

Airflow and systemd Service Manager

If you are using Airflow for your data pipeline project and need to restart the Airflow webserver process on your server, you can use the systemd service manager to run the webserver as a daemon process. This lets you manage the webserver easily and ensure that it runs reliably in your server environment.

To use systemd to run the Airflow webserver as a daemon process, follow these steps:

  1. Create a “unit” file for the Airflow webserver in the systemd configuration directory. This file should specify the dependencies, environment variables, and other details about the webserver process, such as the user and group it should run as, the command to start the process, and how to handle restarts and failures. As an example, you can use the following unit file:
[Unit]
Description=Airflow webserver daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service

[Service]
PIDFile=/run/airflow/webserver.pid
EnvironmentFile=/home/airflow/airflow.env
User=airflow
Group=airflow
Type=simple
ExecStart=/bin/bash -c 'export AIRFLOW_HOME=/home/airflow ; airflow webserver --pid /run/airflow/webserver.pid'
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID
Restart=on-failure
RestartSec=42s
PrivateTmp=true

[Install]
WantedBy=multi-user.target

Note: Be sure to change the value of AIRFLOW_HOME to the directory that contains your Airflow configuration.
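Once the unit file is saved (for example as /etc/systemd/system/airflow.service, the usual location for locally created units), systemd has to be told to pick it up before the commands in the next step will work. A minimal sketch, assuming that file name and path:

sudo systemctl daemon-reload        # reload systemd so it sees the new unit
sudo systemctl enable airflow       # optional: start the webserver automatically at boot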

  2. Use the systemd commands systemctl start airflow, systemctl stop airflow, and systemctl restart airflow to start, stop, and restart the Airflow webserver, respectively.

For example, to start the webserver, you can use the following command:

systemctl start airflow

To stop the webserver, you can use the following command:

systemctl stop airflow

To restart the webserver, you can use the following command:

systemctl restart airflow
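
To verify that the webserver came up after a start or restart, or to troubleshoot a unit that fails, you can inspect its status and follow its logs with the standard systemd tools (assuming the unit is named airflow, as above):

systemctl status airflow
journalctl -u airflow -f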

By using systemd to run the Airflow webserver as a daemon process, you can easily manage the webserver and ensure that it is running reliably in your server environment. This is especially useful when you need to make changes to the webserver configuration and want to reflect those changes in the running process.

Is It Necessary for Data Engineers to Deal with the Quirks of Airflow in 2023?

While Airflow is a useful tool, it can be difficult to troubleshoot. SQLake offers a simpler solution for automating data pipeline orchestration.

With SQLake you can:

  • Build reliable, maintainable, and testable data ingestion and processing pipelines for batch and streaming data, using familiar SQL syntax.
  • Run jobs once and have them continue running until stopped, with no manual scheduling required.
  • Rely on data-driven, automated orchestration.
  • Let the compute cluster scale up and down automatically, simplifying the deployment and management of your data pipelines.

Here is a code example that ingests data from multiple S3 sources into SQLake and stages it for downstream enrichment.

/* Ingest data into SQLake */
-- 1. Create a connection to SQLake sample data source.
CREATE S3 CONNECTION upsolver_s3_samples
    AWS_ROLE = 'arn:aws:iam::949275490180:role/upsolver_samples_role'
    EXTERNAL_ID = 'SAMPLES'
    READ_ONLY = TRUE;
-- 2. Create empty tables to use as staging for orders and sales.
CREATE TABLE default_glue_catalog.database_a137bd.orders_raw_data()
    PARTITIONED BY $event_date;
CREATE TABLE default_glue_catalog.database_a137bd.sales_info_raw_data()
    PARTITIONED BY $event_date;
-- 3. Create streaming jobs to ingest raw orders and sales data into the staging tables.
CREATE SYNC JOB load_orders_raw_data_from_s3
   CONTENT_TYPE = JSON
   AS COPY FROM S3 upsolver_s3_samples 
      BUCKET = 'upsolver-samples' 
      PREFIX = 'orders/' 
   INTO default_glue_catalog.database_a137bd.orders_raw_data; 
CREATE SYNC JOB load_sales_info_raw_data_from_s3
   CONTENT_TYPE = JSON
   AS COPY FROM S3 upsolver_s3_samples 
      BUCKET = 'upsolver-samples' 
      PREFIX = 'sales_info/'
   INTO default_glue_catalog.database_a137bd.sales_info_raw_data;
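
As a next step, you would typically add a transformation job that reads from a staging table and writes enriched rows to a target table. The sketch below follows the pattern used in Upsolver's published samples; the target table name (orders_enriched), the column names (orderid, nettotal, customer.firstname), and the job options are assumptions based on those samples, so adapt them to your own schema:

-- 4. Create a target table and a transformation job that applies a simple enrichment.
-- NOTE: table, column, and option names below are illustrative assumptions.
CREATE TABLE default_glue_catalog.database_a137bd.orders_enriched()
    PARTITIONED BY $event_date;
CREATE SYNC JOB enrich_orders
   RUN_INTERVAL = 1 MINUTE
   ADD_MISSING_COLUMNS = TRUE
   AS INSERT INTO default_glue_catalog.database_a137bd.orders_enriched MAP_COLUMNS_BY_NAME
   SELECT
      orderid,
      UPPER(customer.firstname) AS customer_name,
      nettotal AS order_total
   FROM default_glue_catalog.database_a137bd.orders_raw_data
   WHERE $event_time BETWEEN run_start_time() AND run_end_time();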
