The intention of this project is to provide a boilerplate for a data pipeline infrastructure using Apache Airflow. It is meant to serve as a starting point for building more complex data pipeline infrastructure.
Apache Airflow is an open-source tool for orchestrating complex computational workflows and data processing pipelines. It is a platform to programmatically author, schedule, and monitor workflows. With Airflow you author workflows as directed acyclic graphs (DAGs) of tasks, and the Airflow scheduler executes those tasks while distributing the workload across workers in a cluster.
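As a minimal illustration of what authoring a DAG looks like (a generic sketch, not code from this repository; the DAG id, schedule, and callable are made up):

```python
# Minimal illustrative DAG -- a generic sketch, not part of this repository.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def say_hello():
    print("Hello from Airflow!")

with DAG(
    dag_id="hello_world",            # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,          # run only when triggered manually
    catchup=False,
) as dag:
    PythonOperator(
        task_id="say_hello",
        python_callable=say_hello,
    )
```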
This project has containerized the Airflow environment and added customizations to the Airflow configuration.
The repository contains one example DAG that creates a mocked Polars DataFrame and performs various tasks on the data. It uses XCom to pass information between tasks and the PythonOperator to run Python functions. The data is stored as Parquet files during the DAG run.
```python
import logging
from datetime import datetime

import polars as pl

def createDataFrame(ti):
    logging.info("############## Creating DataFrame")
    ...
    # Persist the DataFrame and share its path with downstream tasks via XCom
    filePath = f"data/dataframe_{datetime.now().strftime('%Y%m%d_%H%M%S%f')[:-3]}.parquet"
    df.write_parquet(filePath)
    ti.xcom_push(key='filePath', value=filePath)

def statistics(ti):
    ...
    # Retrieve the file path pushed by the createDataFrame task
    filePath = ti.xcom_pull(task_ids='createDataFrame', key='filePath')
    df = pl.read_parquet(filePath)
    statistics = df.describe()
```
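The DAG file in the repository registers these functions as tasks; the wiring typically looks like the following sketch. The dag_id, schedule, and task ordering below are assumptions, except that the task_id `createDataFrame` must match the id used in `xcom_pull` above:

```python
# Sketch of the task wiring -- dag_id and schedule are assumptions, not copied from the repository.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="polars_example",           # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_dataframe = PythonOperator(
        task_id="createDataFrame",     # must match task_ids used in xcom_pull
        python_callable=createDataFrame,
    )
    compute_statistics = PythonOperator(
        task_id="statistics",
        python_callable=statistics,
    )

    create_dataframe >> compute_statistics
```

Note that XCom only carries the small file path string here; the DataFrame itself moves between tasks through the Parquet file on disk.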
When the containers are running, the Airflow web interface can be accessed at http://localhost:8080.
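To quickly confirm that the webserver is responding, Airflow exposes a /health endpoint; a small check (assuming the default port 8080) could look like:

```python
# Ping the Airflow webserver health endpoint (assumes the default port 8080).
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8080/health") as resp:
    print(json.dumps(json.loads(resp.read()), indent=2))
```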
Here are some screenshots of the interface!
You need to have Docker and Docker Compose installed on your machine. To start the project, run:

```bash
python __main__.py
```
If you see a permission-related error, ensure that the folders (config, dags, logs, plugins) have the right permissions. On Unix systems, you can run the following command:

```bash
chmod -R 777 config dags logs plugins
```
LinkedIn : martin-karlsson
Twitter : @HelloKarlsson
Email : hello@martinkarlsson.io
Webpage : www.martinkarlsson.io
Project Link: github.com/martinkarlssonio/apache-airflow