Alternative indicators of ESG performance with Google Trends. Capstone project of the Data Engineering Nanodegree at Udacity.
```
├── LICENSE
├── README.md          <- The top-level README
├── data
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── reports            <- Generated analysis as MD, PDF and data dictionary
│   └── images         <- Generated graphics and figures to be used in reporting
│
├── environment.yml    <- List of packages for conda to reproduce the analysis environment
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   └── data           <- Scripts to download or generate data
```
Clone repository: `git clone git@github.com:`
Conda environment: The `environment.yml` lists all dependencies, which can be loaded into a virtual environment via conda. The project runs on Python 3.7.9 and Ubuntu 20.04.1 on WSL.
Configure AWS: Create an AWS account to use the cloud services that are part of this project (S3, EC2, EMR). Furthermore, boto3, the AWS SDK for Python, enables us to create, access and use S3 buckets. It provides an easy-to-use, object-oriented API as well as low-level access to AWS services. Ensure the following configuration is in place to run this project (a short boto3 sketch follows below):
- set AWS credentials for boto3 in `~\users\yourname\.aws\credentials.cfg`
- set the S3 configuration, such as bucket name and region, in `~\src\data\s3_config.cfg`
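A minimal sketch of how the S3 config and boto3 fit together is shown below. The section and option names inside `s3_config.cfg` are assumptions for illustration, not taken from the repository.

```python
import configparser

import boto3

# Read the S3 settings from the project config
# (section/option names below are assumptions).
config = configparser.ConfigParser()
config.read("src/data/s3_config.cfg")
bucket_name = config.get("S3", "bucket_name")
region = config.get("S3", "region")

# boto3 picks up credentials from the shared AWS credentials file automatically.
s3 = boto3.resource("s3", region_name=region)

# List a few objects to verify access to the bucket.
for obj in s3.Bucket(bucket_name).objects.limit(5):
    print(obj.key)
```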
Airflow on Windows 10: Airflow only runs on Linux, which requires additional steps on Windows. You can choose from two options: either install the Windows Subsystem for Linux (WSL), configure it and call `airflow ...` commands from WSL, or rely on Docker as explained here. A useful tutorial about Airflow on WSL can be found here. My Airflow instance runs on WSL, which I launch from the command line with `wsl`. Make sure you have `pip3` and install Airflow with `pip3 install apache-airflow`.
After you have successfully installed Airflow, open a cmd window, type `wsl` to switch to the Linux shell and run the following commands:
- Navigate to the main folder of Airflow, where you placed the `dags` folder that contains your DAG. In my case, this is `cd /mnt/c/users/philipp/airflowhome`.
- Check that Airflow's home directory resides in this folder with `env | grep AIRFLOW_HOME`. To change it to the current working directory, run `export AIRFLOW_HOME=./`.
- Initialize the database: `airflow initdb`
- Start the scheduler: `airflow scheduler` (Note: unfortunately it cannot be run as a background process with `--daemon`)
- Open a new cmd window and start the webserver: `wsl`, `airflow webserver`
- Access the GUI at `localhost:8080` in your browser
- Define the connection `aws_credentials` to connect to your AWS account (ensure sufficient access rights to read and write from S3)
- Configure Airflow variables in `./dags/config/vars.json`
In summary, these are the commands in bash:
```bash
# switch to WSL
wsl
# install airflow with pip3
pip3 install apache-airflow
# see where Airflow has its home directory
env | grep AIRFLOW_HOME
# set airflow home to a folder of your choice (where you have your DAGs folder)
export AIRFLOW_HOME=[INSERT PATH WHERE YOU PUT YOUR DAGS]
cd [your AIRFLOW_HOME path]
# initialize the database
airflow initdb
# start the scheduler
airflow scheduler
# switch to a new cmd window to start the web server
wsl
airflow webserver -p 8080
```
The Airflow UI should be available at `localhost:8080` in your browser. Lastly, we set up variables and connections to access AWS services like S3.
- Choose Admin/Connections/Create
- Conn Id = `aws_credentials`
- Conn Type = Amazon Web Services
- Login = (your AWS Access Key ID)
- Password = (your AWS Secret Access Key)
- Save
- Configure the Airflow variables that reside in `./airflowhome/dags/config/vars.json` (see the sketch after this list):
  - Airflow_dir: where your airflowhome is
  - Bucket_name: S3 bucket name ("esg-analytics")
  - Project_path: main directory of your project folder
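As a rough illustration of how a task in the DAG could consume the connection and variables defined above (the lowercase variable keys and the task logic are assumptions, not taken from the project code):

```python
from airflow.hooks.S3_hook import S3Hook
from airflow.models import Variable


def list_bucket_contents():
    """Read project settings from Airflow Variables and list the S3 bucket."""
    bucket_name = Variable.get("bucket_name")    # e.g. "esg-analytics"
    project_path = Variable.get("project_path")  # main project directory

    # Uses the aws_credentials connection configured in the Airflow UI.
    hook = S3Hook(aws_conn_id="aws_credentials")
    keys = hook.list_keys(bucket_name=bucket_name)
    print(f"{len(keys or [])} objects in {bucket_name}, project at {project_path}")
```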
./airflow/deploy.sh: copies all Airflow-related files from the project folder to Airflow's home directory. You need to adapt the paths in it to your own setup.
./airflow/test_tasks.sh: shows the commands to test individual tasks from the terminal before launching the whole DAG (a minimal example DAG is sketched below).
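For reference, the kind of DAG these test commands target could look like the sketch below; the DAG and task ids are hypothetical. With Airflow 1.x, a single task is exercised via `airflow test <dag_id> <task_id> <execution_date>`.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract_keywords():
    # Placeholder for a real task, e.g. building the Google Trends keyword list.
    print("extracting keywords ...")


dag = DAG(
    dag_id="esg_example_dag",  # hypothetical id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
)

extract_task = PythonOperator(
    task_id="extract_keywords",
    python_callable=extract_keywords,
    dag=dag,
)

# Test just this task before running the whole DAG, e.g.:
#   airflow test esg_example_dag extract_keywords 2021-01-01
```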
You can access the raw data directly on Google Drive or run the API queries yourself to populate the `./data/raw` folder. I ensured that data collection is self-contained and can be triggered through a set of scripts, run in a particular order:
- `0get_firm_names`
- `0define_esg_topics`
- `0construct_keywords`
- `1query_google_trends`
- `1query_yahoofinance`
- `2preprocess_gtrends`
- `3process_yahoofinance`
The number prefix from 0 to 3 indicates which stage the data is in. `0[...]` sets the foundation for the API query input by obtaining firm names and ESG criteria and constructing the keywords. `1[...]` runs the API queries, whereas `2[...]` preprocesses the results and `3[...]` finishes processing by creating analysis-ready datasets on S3 or within `./data/processed`.
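A minimal way to trigger the collection scripts in the intended order is sketched below. It assumes the scripts live in `./src/data/` as plain Python files, which matches the project tree above but is not confirmed by this README.

```python
import subprocess

# Ordered stages 0-3 as described above; paths and .py extensions are assumptions.
SCRIPTS = [
    "0get_firm_names", "0define_esg_topics", "0construct_keywords",
    "1query_google_trends", "1query_yahoofinance",
    "2preprocess_gtrends", "3process_yahoofinance",
]

for name in SCRIPTS:
    # Stop the pipeline as soon as one stage fails.
    subprocess.run(["python", f"src/data/{name}.py"], check=True)
```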
Note: I could have managed data collection with Airflow as well, but I focus on running Spark on EMR clusters instead to stay concise. Data collection itself is a good candidate for a DAG since its tasks need to be launched and monitored frequently. However, I benefit more from learning Spark and handling EMR clusters. Hence, I leave this improvement to future versions of the project.
I rely on Great Expectations to validate, document, and profile the data to ensure integrity and quality. It centers around the data docs suite, which summarizes checks and tests of data properties. Make sure to have it installed via `pip install great_expectations`. To validate the data with checkpoints or to get to the data docs suites for this project, open a cmd window and follow these steps (a Python-level spot check is sketched after the commands):
```bash
# navigate to the project dir
cd ./great_expectations/
# see available suites
great_expectations suite list
# run validation checkpoints
great_expectations checkpoint run preprocess.chk
great_expectations checkpoint run processed.chk
great_expectations checkpoint run processed_meta.chk
# get to ge data docs
great_expectations suite edit esg
```
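Beyond the CLI checkpoints, individual files can also be spot-checked from Python with Great Expectations' pandas interface; the file path and column name below are hypothetical.

```python
import great_expectations as ge

# Load a processed file as a GE-wrapped pandas DataFrame (path is hypothetical).
df = ge.read_csv("data/processed/gtrends_esg.csv")

# Example expectation: the keyword column should never be null.
result = df.expect_column_values_to_not_be_null("keyword")
print(result.success)
```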
Project based on the cookiecutter data science project template. #cookiecutterdatascience