This repository contains a sample Airflow DAG (Directed Acyclic Graph) that uses the SparkSubmitOperator to submit Spark jobs. It is a basic setup demonstrating how to integrate Apache Airflow with Apache Spark.
The project is organized as follows:
- spark_dag_src/: This directory contains the Spark applications that serve as tasks for your DAG. You can place your Spark job scripts here; a minimal example is sketched below.
- spark_test/: This directory contains unit tests for your Spark tasks. It's important to thoroughly test your Spark code to ensure it behaves as expected; see the test sketch after this list.
- dag_spark_submit_operator.py: This is the main Airflow DAG definition file. It orchestrates the execution of Spark jobs using the SparkSubmitOperator. You can customize this file to define your own DAG structure; a minimal version is sketched after this list.
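For orientation, a minimal PySpark job of the kind that could live in spark_dag_src/ might look like the sketch below. The file name and the word-count logic are illustrative assumptions, not the repo's actual job:

```python
# spark_dag_src/word_count.py -- hypothetical example job
import sys

from pyspark.sql import SparkSession


def run_word_count(input_path: str, output_path: str) -> None:
    """Count word occurrences in a text file and write the result as CSV."""
    spark = SparkSession.builder.appName("word_count").getOrCreate()
    try:
        lines = spark.read.text(input_path)  # one row per line, column "value"
        counts = (
            lines.selectExpr("explode(split(value, ' ')) AS word")
            .groupBy("word")
            .count()
        )
        counts.write.mode("overwrite").csv(output_path)
    finally:
        spark.stop()


if __name__ == "__main__":
    run_word_count(sys.argv[1], sys.argv[2])
```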
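A matching unit test in spark_test/ could exercise the same transformation on a local Spark session. This sketch assumes pytest and the hypothetical word-count job above:

```python
# spark_test/test_word_count.py -- hypothetical example test
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # A local session so the test needs no cluster.
    session = SparkSession.builder.master("local[1]").appName("tests").getOrCreate()
    yield session
    session.stop()


def test_word_count_logic(spark):
    lines = spark.createDataFrame([("hello world hello",)], ["value"])
    counts = (
        lines.selectExpr("explode(split(value, ' ')) AS word")
        .groupBy("word")
        .count()
    )
    result = {row["word"]: row["count"] for row in counts.collect()}
    assert result == {"hello": 2, "world": 1}
```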
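And a minimal version of dag_spark_submit_operator.py might look like this. The dag_id, schedule, conn_id, and application path are placeholder assumptions, so check the actual file for the repo's settings:

```python
# dag_spark_submit_operator.py -- minimal illustrative sketch
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_submit_example",   # placeholder name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",      # placeholder schedule
    catchup=False,
) as dag:
    submit_job = SparkSubmitOperator(
        task_id="submit_spark_job",
        conn_id="spark_default",                    # Spark connection defined in Airflow
        application="spark_dag_src/word_count.py",  # path to the Spark job script
        application_args=["/data/input.txt", "/data/output"],  # illustrative args
    )
```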
To get started, first make sure the following prerequisites are installed:
- Git (for cloning the repository)
- Python (for running Airflow and installing dependencies)
- Apache Airflow (if not already installed, you can follow the official installation guide)
- Apache Spark (if not already installed, you can follow the official installation guide)
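For example, Airflow and the Spark provider package (which ships the SparkSubmitOperator) can typically be installed with pip; the official guide recommends pinning against a constraints file matching your Python and Airflow versions:

pip install apache-airflow
pip install apache-airflow-providers-apache-spark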
Then follow these steps:
- Clone the project:
git clone https://github.com/mhzauser/airflow-dag-spark.git
- Install the dependencies listed in requirements.txt:
pip install -r requirements.txt
- Update the GitLab CI variables: YOUR_REGISTERY_IMAGE, AIRFLOW_DEVELOPMENT_PATH, PRODUCTION_REPO_URL, YOUR_REGISTERY_PRODUCTION_IMAGE, and YOUR_AIRFLOW_HOST_IP.
- Read the DAG code and replace the placeholder values with your own settings, for example via Airflow Variables as sketched below.
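If you parameterize the DAG through Airflow Variables, the relevant part of the DAG file could look like this; the variable names spark_conn_id and spark_application are illustrative assumptions, not names defined by the repo:

```python
# Hypothetical: reading DAG settings from Airflow Variables.
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(dag_id="spark_submit_example", start_date=datetime(2023, 1, 1), catchup=False) as dag:
    submit_job = SparkSubmitOperator(
        task_id="submit_spark_job",
        # Both variables are assumed to be set in the Airflow UI or CLI.
        conn_id=Variable.get("spark_conn_id", default_var="spark_default"),
        application=Variable.get("spark_application"),
    )
```

Variables can be set from the CLI, e.g. airflow variables set spark_application /path/to/job.py.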
- Push your changes and set a tag; when the GitLab runners start, your DAG is updated automatically.
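For example (the tag name is only a placeholder):

git tag v1.0.0
git push origin v1.0.0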
enjoy :D
TODO:
- Add a description of the DAG.
- Add a description of the cutoff and lineage concepts.