project ARK

Mini serverless(?) pipeline to extract daily ARK holdings data, load it into BigQuery, and check data quality with dbt, all run on GitHub Actions

❓Purpose

This is my side project about the famous ARK funds https://ark-funds.com/ πŸ˜— ARK is transparent enough to publish its funds' holdings daily.

But I am more curious about the CHANGES, for example:

  • which companies were added recently?
  • which companies were removed recently?
  • how have the weights changed?
  • etc etc

Seeing the current snapshot is easy: just go to their site! But seeing the changes, and building richer analyses than a single snapshot allows, is more challenging. So this project exists 😁

I also wrote a Medium post about this repo: https://medium.com/@geoffreyhung/from-curl-to-automated-workflow-d79c7c108450

πŸ’ͺProject plans

  1. [Extract] Scraping script to extract and clean the daily snapshot from the site using GitHub Actions [DONE]
  2. [Load] Load the data to a data warehouse, such as BigQuery [DONE]
  3. [Quality] Write some data tests using dbt [DONE]
  4. [Analysis] Add some analytic queries in dbt, plus maybe some visualizations

Maybe, maybe not:

  1. [Visualization] an interactive data visualization site
  2. [Subscription] subscribe to the changes and send myself a summary email

🌈High level flow

At 00:30, 08:30, and 16:30 daily [GitHub Actions schedules run in UTC]:

  1. GitHub Actions kicks off the workflow, sets up the environment, and installs packages
  2. The workflow then downloads the csv data from the ARK site and stores it in the data folder
  3. The script pushes the data to the defined BigQuery table
  4. The script runs dbt test to check data quality
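
As a rough sketch, the scheduled workflow boils down to something like the shell session below [a hedged outline, not the exact workflow; script arguments and setup details are assumptions]. GitHub Actions cron uses standard five-field cron syntax, so this schedule corresponds to 30 0,8,16 * * *.

# 1. environment setup [done by the workflow's setup steps]
pip install -r requirements.txt

# 2 + 3. download, clean, and load today's csv files [the script calls bq_load.py internally]
bash scripts/collect_and_load.sh

# 4. run the dbt data quality checks
cd ark-track-dbt
dbt test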


Data extraction and cleaning

scripts/collect_and_load.sh fetches the csv data from https://ark-funds.com/wp-content/fundsiteliterature

Then we apply some simple transformations:

  1. remove some useless rows, eg: the disclaimer lines in the csv
  2. transform the date format [since BigQuery requires YYYY-MM-DD for csv uploads]
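
For a feel of these transformations, here is a hedged sketch [file names, column positions, and the raw date format are assumptions, not the repo's actual logic]:

# 1. drop non-data rows, keeping only lines that contain commas
#    [assuming the disclaimer rows don't]
grep ',' raw/ARKK.csv > tmp/ARKK.csv

# 2. rewrite column 1 from MM/DD/YYYY to YYYY-MM-DD for BigQuery,
#    passing the header row through unchanged
awk -F',' 'BEGIN{OFS=","} NR>1{split($1,d,"/"); $1=d[3]"-"d[1]"-"d[2]} {print}' \
  tmp/ARKK.csv > data/2021-02-19/ARKK.csv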

Then it uses scripts/bq_load.py [called from scripts/collect_and_load.sh] to upload the data to BigQuery
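
The repo does this with a Python script; as an illustration of the same operation, a one-off load with the bq command-line tool would look roughly like this [the dataset, table, and inline schema are made-up examples, not the repo's actual names]:

# load one cleaned csv into BigQuery, skipping the header row
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  my_dataset.ark_holdings \
  data/2021-02-19/ARKF.csv \
  date:DATE,fund:STRING,company:STRING,ticker:STRING,cusip:STRING,shares:INTEGER,market_value:FLOAT,weight:FLOAT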

βœ… dbt test

Here we use dbt to do some data quality checks, eg: uniqueness and value-range tests, to ensure the loaded data makes sense. Plus, dbt test is part of the CI workflow [ie: every time after we load the data to BigQuery, it runs and checks]
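
For illustration only [the repo's actual tests live in ark-track-dbt/tests and models/source.yml]: a custom dbt test is just a SQL file whose returned rows count as failures, so a hypothetical value-range check could be added like this [the source and column names here are made up]:

cat > ark-track-dbt/tests/assert_weight_in_range.sql <<'SQL'
-- dbt test fails if this query returns any rows
select *
from {{ source('ark', 'holdings') }}  -- hypothetical source from models/source.yml
where weight < 0 or weight > 100
SQL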

To use dbt test, in terminal:

cd ark-track-dbt # we need to be at the same level as dbt_project.yml
dbt test

❗❗❗ But you cannot fork and run this, because you don't have the credentials to access my BigQuery. To learn more about dbt: https://docs.getdbt.com/tutorial/setting-up/

Below is a screenshot of the dbt test output:

[screenshot]

🚧Code structure

project-ark
β”œβ”€β”€ README.md               # we are here~
β”œβ”€β”€ ark-track-dbt           # store dbt project
β”‚   β”œβ”€β”€ README.md
β”‚   β”œβ”€β”€ analysis
β”‚   β”œβ”€β”€ data
β”‚   β”œβ”€β”€ dbt_modules
β”‚   β”œβ”€β”€ dbt_project.yml     # define project and which profile to use
β”‚   β”œβ”€β”€ macros
β”‚   β”œβ”€β”€ models
β”‚   β”‚   └── source.yml      # define source data and schema test
β”‚   β”œβ”€β”€ profiles.yml        # define BigQuery data source setting
β”‚   β”œβ”€β”€ snapshots
β”‚   └── tests               # contain tests for data quality check
β”œβ”€β”€ data                    # contain processed data
β”‚   β”œβ”€β”€ 2021-02-19
β”‚   β”‚   β”œβ”€β”€ ARKF.csv
β”‚   β”‚   β”œβ”€β”€ ...
β”‚   └── tmp
β”œβ”€β”€ misc
β”‚   └── bq_schema.txt       # Table schema in BigQuery
β”œβ”€β”€ requirements.txt        # Python packages required by the GitHub Actions workflow
β”œβ”€β”€ scripts
β”‚   β”œβ”€β”€ bq_load.py          # script to load csv to BigQuery
β”‚   └── collect_and_load.sh # script to download & process & load data
└── .github/workflows       # GitHub Actions workflows

πŸ“šLessons learned

Secrets exposed by GitHub Actions!?!? <- spoiler: don't panic, it was my own fault

When I tried to use GitHub Actions to push the data to GCP BigQuery, one accident made me panic hard! My Google Service Account JSON was committed to my repo despite my storing it in GitHub Secrets! I literally shared the credentials with everyone who could read the repo! 😱 WTF is going on!?!?

❗❗❗ Don't attempt to check my commits, I have disabled the old key already❗❗❗


  1. My first guess was that I had copied the plain JSON into GitHub Secrets and it somehow leaked along the way? πŸ€”

  2. Hmmm, I remember people saying the key should be base64-encoded. OK, let me try cat key.json | base64 and copy that into GitHub Secrets. But GitHub Actions still created a commit exposing all my secrets........😩

  3. I noticed that the leaked file's name looked like a hash [meaningless random string] and that the secrets were pushed by GitHub Actions, which is responsible for pushing the new data to my repo! Somehow, besides the data, it found a file containing my secrets??? πŸ€”

  4. In the end, my solution was to change git add -A to git add data 🧠 My guess: GitHub Actions creates a file somewhere for secret access; normally we never learn the file name because it is randomly generated. BUT they may not have expected some people to have a step that git pushes ALL the files, including the secrets!!!


  5. The reason I used git add -A is that I literally copied it from another repo. It was fine for their case because they don't use Google Cloud and never handle credentials. The big lesson: be specific about everything; don't apply actions to ALL files without really knowing the consequences! 😐
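
A minimal sketch of the hardened commit-and-push step [the real workflow step may differ in commit message and error handling]:

# stage ONLY the data folder, never the whole working tree
git add data
git commit -m "Add holdings for $(date +%F)" || echo "nothing to commit"
git push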

BTW, to decode base64, simply run: cat encoded_secret | base64 -d

GitHub Actions as scheduler

In the future, for regular simple tasks [eg: scraping something], instead of setting up a cron job on my own computer, I may simply use GitHub Actions directly. It is quite hard to use up the free quota

reference: https://docs.github.com/en/github/setting-up-and-managing-billing-and-payments-on-github/about-billing-for-github-actions
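
For comparison, the local crontab entry this replaces would look like the line below [the path is illustrative]; GitHub Actions' on.schedule accepts the same five-field cron syntax, evaluated in UTC:

# min hour day-of-month month day-of-week  command
30 0,8,16 * * * /path/to/project-ark/scripts/collect_and_load.sh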

Bash commands are important, period

For simple data transformations [removing some rows, replacing some strings], having another Python script seems like overkill; many terminal commands are quite helpful, for example [see the sketch after this list]:

  • awk
  • cut
  • rev
  • .......
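
A few hedged one-liners of the kind meant here [the file name and column numbers are made up]:

# keep only the first three csv columns
cut -d',' -f1-3 holdings.csv

# strip the last character of every line [eg: a trailing '%']: reverse, cut, reverse back
rev holdings.csv | cut -c2- | rev

# sum a numeric column [column 8 here] across data rows
awk -F',' 'NR>1 {s += $8} END {print s}' holdings.csv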

Thank you section: materials that helped a lot during development