
Using R, Python, and Apache Airflow to build Data Pipelines

This repository uses Apache Airflow with R and Python to schedule data analysis tasks. The main purpose of this repository is to document my journey learning about data pipelines, with a focus on using them to make routine data analysis simpler, faster, and more repeatable.

To keep the learning engaging, I picked a real-world data set and used a series of modular scripts to analyze it and create visualizations. Special thanks to Laura Calcagni for the inspiration to learn about Apache Airflow, and to John Graves for teaching me everything I know about R and programming.
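For orientation, here is a minimal sketch of the kind of Airflow DAG this setup implies. The DAG id, schedule, paths, and the visualization script name are assumptions for illustration; only "scripts/data_aggregation_task.R" appears in this repository. The R steps are invoked through Rscript with a BashOperator.

```python
# Minimal illustrative DAG; dag_id, schedule, paths, and the visualization
# script name are assumptions, not this repository's actual configuration.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="trade_analysis_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    # Aggregate the raw trade data (this script exists in the repo).
    aggregate = BashOperator(
        task_id="data_aggregation",
        bash_command="Rscript scripts/data_aggregation_task.R",
    )

    # Build the visualizations (hypothetical script name).
    visualize = BashOperator(
        task_id="create_visualizations",
        bash_command="Rscript scripts/visualization_task.R",
    )

    # Run aggregation before visualization.
    aggregate >> visualize
```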

About the Data

As in my other work, I use the Atlas of Economic Complexity from the Growth Lab at Harvard University. The reasons I love this data source are threefold: 1) it is detailed down to the individual products that each country in the world trades, covering 1962 to 2019; 2) it is standardized, which simplifies building time series to track changes over time; and 3) it is a regularly used and highly cited source with over fifty thousand downloads. It is also publicly available and can be downloaded here.

Exploratory Data Analysis

After downloading the data and saving the raw files to AWS S3, I use the script at "scripts/data_aggregation_task.R" to add country identifiers based on each country's 3-digit code.
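The aggregation step itself lives in R, but the staging logic is easy to sketch in Python. The bucket, file, and column names below are hypothetical placeholders, not this repository's actual layout; the sketch just shows the shape of the step: pull the raw file from S3, then attach country identifiers by joining on the 3-digit code.

```python
# Hypothetical sketch of the staging and identifier step; bucket, key, and
# column names are placeholders, not this repository's real configuration.
import boto3
import pandas as pd

s3 = boto3.client("s3")
s3.download_file("my-trade-data-bucket",      # hypothetical bucket
                 "raw/atlas_trade_data.csv",  # hypothetical key
                 "/tmp/atlas_trade_data.csv")

trade = pd.read_csv("/tmp/atlas_trade_data.csv")
countries = pd.read_csv("/tmp/country_lookup.csv")  # 3-digit code -> name

# Attach human-readable country identifiers via the 3-digit country code,
# mirroring what scripts/data_aggregation_task.R does in R.
trade = trade.merge(countries, on="country_code", how="left")
```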

References

  1. Csárdi, G., Nepusz, T. and Airoldi, E.M., 2016. Statistical network analysis with igraph. https://sites.fas.harvard.edu/~airoldi/pub/books/BookDraft-CsardiNepuszAiroldi2016.pdf

  2. Pedersen, T.L., 2017. Introduction to ggraph: Edges. Data Imaginist. https://www.data-imaginist.com/2017/ggraph-introduction-edges/

  3. Graves, J. Defining Markets for Health Care Services. https://github.com/graveja0/health-care-markets