/airflow-dag-operator

Use K8s CustomResourceDefinition to define and create Airflow Dags

Primary Language: Java

Airflow Dag Operator

Use a Kubernetes CustomResourceDefinition (CRD) in place of the Airflow Git Sync strategy. The main idea of the project is to run a synchronization service, built as a Quarkus operator, on each Airflow pod to synchronize DAGs and other files into the DAG folder.

This project includes the Docker image packaging (via BuildConfig or a Quarkus build) and a modified Helm chart template based on the one in the official Airflow project (https://github.com/apache/airflow/tree/main/chart).

Operation Strategy

  • Use kubectl to create the RBAC rules and the CustomResourceDefinition (the operator can create the resources automatically).
  • Create some DAG resource instances.
  • Start the Quarkus operator.
  • The operator lists the DAG resources and watches them for changes, then creates, updates, or deletes the corresponding DAGs in the DAG folder.
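
The last step can be sketched in plain Java as follows. This is a minimal illustration, not the operator's actual code: the class DagFolderSync and its method names are hypothetical, and the real operator receives its events through the Quarkus Operator SDK rather than through direct method calls.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that mirrors one DAG resource into the DAG folder. */
final class DagFolderSync {
    private final Path dagFolder;

    DagFolderSync(Path dagFolder) {
        this.dagFolder = dagFolder.toAbsolutePath().normalize();
    }

    /** Writes the file only when it is missing or its content differs; returns true if written. */
    boolean createOrUpdate(String relativePath, String fileName, String content) throws IOException {
        Path target = resolve(relativePath, fileName);
        Files.createDirectories(target.getParent());
        if (Files.exists(target)
                && Files.readString(target, StandardCharsets.UTF_8).equals(content)) {
            return false; // already in sync, nothing to do
        }
        Files.writeString(target, content, StandardCharsets.UTF_8);
        return true;
    }

    /** Removes the file that belongs to a deleted resource; returns true if a file was removed. */
    boolean delete(String relativePath, String fileName) throws IOException {
        return Files.deleteIfExists(resolve(relativePath, fileName));
    }

    /** An empty or null path means the root of the DAG folder; paths escaping the folder are rejected. */
    private Path resolve(String relativePath, String fileName) {
        String sub = (relativePath == null) ? "" : relativePath;
        Path target = dagFolder.resolve(sub).resolve(fileName).normalize();
        if (!target.startsWith(dagFolder) || target.equals(dagFolder)) {
            throw new IllegalArgumentException("target must be a file inside the DAG folder: " + target);
        }
        return target;
    }
}
```

Comparing the content before writing keeps repeated reconciliations from touching files that have not changed, so the scheduler is not asked to re-parse them unnecessarily.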

CRD Schema

The resource definition can be found in 02-crd.yaml. The most important attributes of the CRD are described here:

| Parameter | Description | Default |
|-----------|-------------|---------|
| type | Type of the resource: dag_file, file, or dag_yaml. dag_file must be a DAG description file. file can be a Python file or any other text file. dag_yaml follows dag-factory, with some changes. | dag_file |
| path | File path. If empty, the file is placed in the root directory of dags; otherwise it is a subdirectory under dags. | |
| file_name | Required if type is file. | |
| dag_name | Required if type is dag_file or dag_yaml. If dag_name does not have a .py suffix, the operator appends it automatically. | CRD name |
| content | If type is dag_file or file, the content of the file. | |
| paused | If not empty, the operator scans the DAG status and automatically pauses or unpauses the DAG. | |
| dag_yaml | The DAG description in YAML. For details, refer to dag-factory. | |
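
The naming rules in the table can be illustrated with a short Java sketch. The class DagFileNaming and its method are hypothetical names introduced here; the sketch only encodes the defaults stated above (type defaults to dag_file, dag_name defaults to the CRD name, and .py is appended when missing) and is not taken from the operator's source.

```java
/** Hypothetical helper that derives the target file name from the CRD attributes. */
final class DagFileNaming {
    private DagFileNaming() {
    }

    static String targetFileName(String type, String fileName, String dagName, String crdName) {
        if (crdName == null || crdName.isEmpty()) {
            throw new IllegalArgumentException("crdName is required");
        }
        // type defaults to dag_file when it is not set
        String effectiveType = (type == null || type.isEmpty()) ? "dag_file" : type;
        switch (effectiveType) {
            case "file":
                if (fileName == null || fileName.isEmpty()) {
                    throw new IllegalArgumentException("file_name is required when type is file");
                }
                return fileName;
            case "dag_file":
            case "dag_yaml":
                // dag_name defaults to the CRD name; the .py suffix is appended if absent
                String name = (dagName == null || dagName.isEmpty()) ? crdName : dagName;
                return name.endsWith(".py") ? name : name + ".py";
            default:
                throw new IllegalArgumentException("unknown type: " + effectiveType);
        }
    }
}
```

For example, a resource named my-dag with no type and no dag_name would map to my-dag.py under these rules.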

Dev Or Test

We can run our application in dev mode that enables live coding using:

./mvnw compile quarkus:dev

An example is provided in the /example folder. It includes the RBAC and CRD manifests, some sample cases, and a Deployment for testing.

Quarkus Image Build

If we use OpenShift, we can use a BuildConfig or Tekton Pipelines to build a native image. Otherwise, we can create a native executable using:

./mvnw package -Pnative
# on macOS, add -Dquarkus.native.container-build=true so that the native build runs in Docker with a Linux environment
docker build -f src/main/docker/Dockerfile.native -t quarkus/airflow-dag-operator .

Or, if we don't have GraalVM installed, we can run the native executable build in a container using:

./mvnw package -Pnative -Dquarkus.native.container-build=true
docker build -f src/main/docker/Dockerfile.native -t quarkus/airflow-dag-operator .

Airflow Helm Chart

Run a Helm dependency update to pull in the postgresql chart, then lint the chart. Helm 3 is required.

# dependency update
helm dep update

# lint
helm lint

# debug
helm install --dry-run --debug -f values.yaml airflow -n airflow .

Deploy Chart

# install
helm install -f values.yaml airflow -n airflow .

# upgrade
helm upgrade -f values.yaml airflow -n airflow .

# uninstall
helm uninstall airflow -n airflow

Support pause

To support pausing, we need to rebuild the image:

# to support pause, build with `quarkus.datasource.jdbc` changed from false to true
./mvnw package -Pnative -Dquarkus.datasource.jdbc=true

Because parsing a DAG's Python code is complex, for now we need to ensure that dag_name and dag_id are identical.
Note that the Helm chart has not been modified for this yet. By design, the operator enables pause support only on the scheduler node, to avoid repeated executions.
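
The pause behavior described for the paused attribute amounts to comparing the desired state with the state Airflow reports. The Java sketch below shows only that decision. PauseReconciler and Action are hypothetical names, and modeling paused as a nullable Boolean (null meaning "empty") is an assumption; how the operator actually reads and writes the DAG status is not shown.

```java
/** Hypothetical decision helper for the paused attribute. */
final class PauseReconciler {
    enum Action { NONE, PAUSE, UNPAUSE }

    private PauseReconciler() {
    }

    /**
     * @param desiredPaused   value of the paused attribute, or null when it is empty
     * @param currentlyPaused state of the DAG as reported by Airflow
     */
    static Action decide(Boolean desiredPaused, boolean currentlyPaused) {
        if (desiredPaused == null) {
            return Action.NONE; // attribute is empty: leave the DAG untouched
        }
        if (desiredPaused == currentlyPaused) {
            return Action.NONE; // already in the desired state
        }
        return desiredPaused ? Action.PAUSE : Action.UNPAUSE;
    }
}
```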

History

  • 2021-10-09 1.0.0 First Commit
  • 2021-11-16 1.0.1 Update to quarkus-operator-sdk 2.0.0
  • 2022-03-27 1.0.1 Update to quarkus-operator-sdk 3.0.5
  • 2022-07-15 1.0.2 Update to quarkus-operator-sdk 4.0.0.RC / Support paused