Data pipeline that schedules the generation of performance analysis reports.
- Consists of two main Python scripts, each of which contains a DAG implemented in Apache Airflow.
- Each script defines its operators as Python callables (via PythonOperator) or FileSensors.
- The first script is the initialization script, "init_dag.py", whose main purpose is to perform pre-processing and then "ignite" the main DAG, which goes on to generate the reports (see the first sketch after this list).
- The second script, "main_dag.py", reads the data for all users and generates a report for a particular day.
- "main_dag.py" is configured to re-trigger itself, appending rows to the report for one user at a time, until it has done so for all users (see the second sketch after this list).
- The finished report is then sent out via email to a specified address or mailing list.
- All intermediate files and databases created along the way are then deleted.
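As a rough illustration of the structure described above, here is a minimal sketch of what "init_dag.py" could look like, assuming Airflow 2.4+. The file path, task IDs, schedule, and the body of `preprocess()` are hypothetical placeholders; only the FileSensor → PythonOperator → TriggerDagRunOperator flow comes from the description above.

```python
# Hypothetical sketch of init_dag.py -- paths, task IDs, and schedule are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.sensors.filesystem import FileSensor


def preprocess():
    """Placeholder for the pre-processing step (e.g. staging the input data)."""


with DAG(
    dag_id="init_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Wait for the day's input data to land (path is a placeholder).
    wait_for_input = FileSensor(
        task_id="wait_for_input",
        filepath="/data/input/users.csv",
        poke_interval=60,
    )

    # Run the pre-processing callable.
    run_preprocessing = PythonOperator(
        task_id="preprocess",
        python_callable=preprocess,
    )

    # "Ignite" the main DAG that generates the report.
    ignite_main = TriggerDagRunOperator(
        task_id="trigger_main_dag",
        trigger_dag_id="main_dag",
    )

    wait_for_input >> run_preprocessing >> ignite_main
```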
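And a corresponding sketch of the self-triggering loop in "main_dag.py", under the same assumptions. The user list, cursor file, report path, and recipient address are all hypothetical (the real pipeline reads these from its data), and sending email assumes SMTP is configured for Airflow; the part taken from the description above is the branch that either re-triggers the DAG for the next user or emails the finished report and cleans up.

```python
# Hypothetical sketch of main_dag.py -- user list, paths, and recipient are placeholders.
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.email import EmailOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

REPORT = "/tmp/report.csv"         # intermediate report file (placeholder)
CURSOR = "/tmp/next_user.txt"      # tracks which user to process next (placeholder)
USERS = ["alice", "bob", "carol"]  # in the real pipeline, read from the database


def append_user_row():
    """Append the next user's row to the report and advance the cursor."""
    idx = int(open(CURSOR).read()) if os.path.exists(CURSOR) else 0
    with open(REPORT, "a") as f:
        f.write(f"{USERS[idx]},<metrics for {USERS[idx]}>\n")
    with open(CURSOR, "w") as f:
        f.write(str(idx + 1))


def more_users_left():
    """Branch: re-trigger this DAG if users remain, otherwise send the report."""
    return "retrigger_self" if int(open(CURSOR).read()) < len(USERS) else "send_report"


def cleanup():
    """Delete the intermediate files once the report has been sent."""
    for path in (REPORT, CURSOR):
        if os.path.exists(path):
            os.remove(path)


with DAG(dag_id="main_dag", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False) as dag:
    append_row = PythonOperator(task_id="append_row", python_callable=append_user_row)

    check_remaining = BranchPythonOperator(task_id="check_remaining",
                                           python_callable=more_users_left)

    # Trigger another run of this same DAG to handle the next user.
    retrigger_self = TriggerDagRunOperator(task_id="retrigger_self",
                                           trigger_dag_id="main_dag")

    # Email the finished report (assumes SMTP is configured in airflow.cfg).
    send_report = EmailOperator(
        task_id="send_report",
        to="reports@example.com",  # placeholder mailing list
        subject="Daily performance report",
        html_content="Report attached.",
        files=[REPORT],
    )

    destroy_intermediates = PythonOperator(task_id="cleanup", python_callable=cleanup)

    append_row >> check_remaining >> [retrigger_self, send_report]
    send_report >> destroy_intermediates
```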
- Environment: Linux (Red Hat)
- Python 3 or Anaconda 3
- Apache Airflow
- Python 3, along with pip, can be installed from the terminal. On Red Hat the package manager is yum (or dnf), e.g.:
  - sudo yum update
  - sudo yum install python3 python3-pip
- Apache Airflow can be installed using the 'pip' command; detailed installation instructions can be found in the official Apache Airflow documentation (see the example command after this list).
- A virtual machine image compatible with VirtualBox, containing all of the required software, can be found in the materials of the course below.
- A complete course is also available for getting familiar with Airflow.
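As an example of the pip-based install mentioned above, the Airflow documentation recommends pinning against its published constraints file; the Airflow and Python versions below are placeholders that should be matched to your environment:

```bash
# Install Airflow pinned to the official constraints file (versions are placeholders).
pip install "apache-airflow==2.9.3" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.9.3/constraints-3.8.txt"
```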