Scheduling workflows in Zetaris using Apache Airflow

Apache Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows. Airflow’s extensible Python framework enables you to build workflows connecting with virtually any technology. A web interface helps manage the state of your workflows.


Install Apache Airflow

 

Prerequisite steps:

  1. An Ubuntu VM should be up and running.

  2. Python version 3.8 or higher should be installed. If it is not installed, follow the steps below:

    sudo apt update
    sudo apt install software-properties-common
    sudo add-apt-repository ppa:deadsnakes/ppa
    sudo apt update
    sudo apt install python3.8
    python3 --version
  3. pip3 should be installed. If it is not, follow the steps below:

    sudo apt update
    sudo apt install python3-pip
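
    To confirm pip3 is available, check its version:

    pip3 --version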

Installation steps

  1. a) Installation using pip (recommended)

    pip3 install apache-airflow
    pip3 install apache-airflow-providers-ssh

    OR

    b) Installation using pip & constraint files

    export AIRFLOW_HOME=~/airflow 
    AIRFLOW_VERSION=2.4.2
    PYTHON_VERSION="$(python3 --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
    CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
    pip3 install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
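
    Whichever option you choose, you can confirm the installation by checking the Airflow version:

    airflow version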
  2. It is recommended to change the Airflow metadata database from SQLite to PostgreSQL for better performance. To do this, execute the SQL statements below (locally or on your cloud PostgreSQL instance):
    CREATE DATABASE airflow_db;
    CREATE USER airflow_user WITH PASSWORD 'airflow_pass';
    GRANT ALL PRIVILEGES ON DATABASE airflow_db TO airflow_user;
    ALTER USER airflow_user SET search_path = public;
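
    If PostgreSQL runs on the same Ubuntu VM, one way to execute these statements is from the psql shell as the postgres system user:

    sudo -u postgres psql

    Paste the four statements at the psql prompt, then exit with \q.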
  3. Change the sql_alchemy_conn setting in the Airflow config file. The PostgreSQL driver is required for this connection, so install it first:

    pip3 install psycopg2

    Then open the config file (located in your Airflow home directory, e.g. /home/zetaris/airflow):

    vi airflow.cfg

    Change the connection string,
    sql_alchemy_conn = sqlite:////home/zetaris/airflow/airflow.db
    to
    sql_alchemy_conn = postgresql+psycopg2://user:pass@host_address:port/database (modify the statement with your details)
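
    For example, with the database and user created in step 2 and PostgreSQL running locally on its default port, the line would look like:

    sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow_db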
  4. Create the Airflow admin user (copy the below as one single command). Note that this requires an initialised metadata database, so run airflow db init (setup step 2 below) first if you have not already.

    airflow users create \
        --username admin \
        --firstname <First_name> \
        --lastname <Last_Name> \
        --role Admin \
        --email <email> \
        --password <password>
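
    Once the user has been created, you can verify it from the CLI:

    airflow users list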

Set up Apache Airflow

  1. Create a folder named ‘dags’. Airflow scans this folder for DAG definition files; a sample DAG is sketched after these steps.
    cd /home/zetaris/airflow
    mkdir dags
    chmod 775 dags
  2. Initialise the Airflow metadata database.

    airflow db init
  3. Start the Airflow webserver.

    airflow webserver
  4. In another terminal window, run the below command to start the Airflow scheduler.

    airflow scheduler
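
    If you would rather not keep two terminal windows open, both commands accept a -D flag in Airflow 2.x to run as background daemons:

    airflow webserver -D
    airflow scheduler -D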
  5. Load the Airflow web interface and log in using the credentials you created in installation step 4.

    http://<your_ubuntu_public_ip>:8080
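
With Airflow up and running, you can drop a DAG definition into the dags folder created in step 1 and Airflow will pick it up automatically. Below is a minimal sketch (the file name, DAG id, and task are hypothetical; a real pipeline would more likely use the SSHOperator from apache-airflow-providers-ssh to run Zetaris jobs):

    # dags/zetaris_example.py -- a minimal, hypothetical example DAG
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="zetaris_example",        # hypothetical DAG name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",      # run once per day
        catchup=False,                   # do not backfill past runs
    ) as dag:
        # Placeholder task; swap in an SSHOperator to invoke a Zetaris script.
        hello = BashOperator(
            task_id="say_hello",
            bash_command="echo 'Hello from Airflow'",
        )

Once saved, the DAG should appear in the web UI within a minute or so; you can also list and trigger it from the CLI:

    airflow dags list
    airflow dags trigger zetaris_example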
 

Walkthrough of the entire scheduling process (video)

[Video: Pipeline Scheduling Process]

[Video: Pipeline/Script Scheduling Process with Email & Teams Notifications]