# ETL build

As part of the "Daas (Data as a service) repo", this POC project demos how to build an ETL system for data engineering/science via Airflow. Main focus: 1) ETL/ELT (extract, transform, load) environment setup 2) ETL code base development 3) ETL code testing 4) 3rd-party API integration (Instagram, Slack, ...) 5) dev-op tools (Travis CI). A minimal DAG sketch follows the overview list below.
- Daas (Data as a service) repo : Data infra build -> ETL build -> DS application demo
- Airflow Heroku demo : airflow-heroku-dev
- Mlflow Heroku demo : mlflow-heroku-dev
- Programming : Python 3, Java, Shell
- Framework : Airflow, Spark, InstaPy, scikit-learn, Keras
- dev-op : Docker, Travis
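
To make the ETL-via-Airflow focus concrete, below is a minimal sketch of a DAG with separate extract / transform / load tasks. It assumes Airflow 1.10-style imports and uses placeholder data; the DAG id `demo_etl` and the callables are illustrative only, not taken from this repo's `dags` folder.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10-style import

default_args = {
    "owner": "airflow",
    "start_date": datetime(2019, 1, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}


def extract(**context):
    # placeholder extract step: a real job would call an API or read a DB
    return [{"id": 1, "value": 10}, {"id": 2, "value": 20}]


def transform(**context):
    # pull the extract output from XCom and apply a trivial transformation
    records = context["ti"].xcom_pull(task_ids="extract")
    return [dict(r, value=r["value"] * 2) for r in records]


def load(**context):
    # placeholder load step: print instead of writing to a warehouse / S3
    rows = context["ti"].xcom_pull(task_ids="transform")
    print("loading {} rows".format(len(rows)))


with DAG("demo_etl",
         default_args=default_args,
         schedule_interval="@daily",
         catchup=False) as dag:

    t_extract = PythonOperator(task_id="extract", python_callable=extract, provide_context=True)
    t_transform = PythonOperator(task_id="transform", python_callable=transform, provide_context=True)
    t_load = PythonOperator(task_id="load", python_callable=load, provide_context=True)

    t_extract >> t_transform >> t_load
```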
```
# .
# ├── Dockerfile             : Dockerfile defining the Astro Airflow env
# ├── Dockerfile_dev         : Dockerfile for dev
# ├── README.md
# ├── airflow_quick_start.sh : commands to help start Airflow
# ├── clean_airflow_log.sh   : clean Airflow job logs / config before rebooting Airflow
# ├── dags                   : Airflow job main scripts
# ├── ig                     : IG job scripts
# ├── install_pyspark.sh     : script to help install PySpark locally
# ├── packages.txt           : system-level packages for Astro Airflow
# ├── plugins                : plugins that help run Airflow jobs
# ├── populate_creds.py      : script to help populate credentials (.creds.yml) into Airflow (see the sketch after this tree)
# ├── requirements.txt       : Python-level packages for Astro Airflow
# ├── .creds.yml             : yml storing credentials to access services (slack/s3/...)
```
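
For illustration, here is a minimal sketch of what a credential-population step could look like: it reads `.creds.yml` with PyYAML and stores each top-level block as an Airflow Variable. The function name, the assumed YAML layout, and the Variable-per-service design are assumptions for this example, not necessarily what `populate_creds.py` in this repo actually does.

```python
import yaml  # PyYAML
from airflow.models import Variable


def populate_creds(path=".creds.yml"):
    """Load the credential YAML file and store each service block as an Airflow Variable.

    Assumed .creds.yml layout (illustrative only):
        slack:
          token: xoxb-...
        s3:
          aws_access_key_id: ...
          aws_secret_access_key: ...
    """
    with open(path) as f:
        creds = yaml.safe_load(f)
    for service, values in creds.items():
        # store each service's credentials as a JSON-serialized Airflow Variable
        Variable.set(service, values, serialize_json=True)


if __name__ == "__main__":
    populate_creds()
```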
- Copy `.creds.yml.dev` to `.creds.yml`, then fill in your credentials.
- Please have a look at the Airflow documentation before starting.
- There is a known issue when running Spark jobs via Astro Airflow; feel free to open a PR for it 🙏 (see the Spark task sketch after this list).
- Astro Airflow quick start
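
Regarding the Spark note above, the sketch below shows one common way to wire a Spark job into an Airflow DAG with `SparkSubmitOperator` (Airflow 1.10 contrib import path). The application path and connection id are assumptions; this is not necessarily the exact setup that triggers the issue mentioned above.

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator  # Airflow 1.10 contrib path

with DAG("demo_spark_job",
         start_date=datetime(2019, 1, 1),
         schedule_interval=None,
         catchup=False) as dag:

    # "spark_default" must be an Airflow connection pointing at your Spark master;
    # dags/scripts/wordcount.py is a hypothetical PySpark application path
    spark_task = SparkSubmitOperator(
        task_id="spark_wordcount",
        application="dags/scripts/wordcount.py",
        conn_id="spark_default",
        verbose=True,
    )
```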