
An end-to-end data engineering pipeline to create a dashboard for the latest content on the r/Stocks subreddit

Data Pipeline for Reddit data (r/Stocks)

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Help
  5. Roadmap for Future Development
  6. Contributing
  7. License
  8. Contact
  9. Acknowledgements

About The Project


Interested in exploring Reddit data for trends, analytics, or just for the fun of it?

This project builds a data pipeline (from data ingestion to visualisation) that stores and preprocesses data over any time period you choose.

Built With

Cloud infrastructure is set up with Terraform.

Airflow runs in a local Docker container. It orchestrates the following tasks on a weekly schedule:

  • Download data (JSON)
  • Parquetize the data and store it in a bucket on Google Cloud Storage
  • Write data to a table on BigQuery
  • Create a cluster on Dataproc and submit a PySpark job to preprocess the Parquet files from Google Cloud Storage
  • Write preprocessed data to a table on BigQuery
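
The order of the steps above can be sketched in plain Python. This is only an illustration of the dependency chain, not the Airflow DAG itself: the task names are hypothetical, and it assumes each step simply waits for the one listed before it.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; the task ids in stocks_dag.py may differ.
# Each key maps to the set of tasks that must finish before it starts.
# Assumption: the steps run as a simple chain, in the order listed above.
PIPELINE = {
    "download_json": set(),
    "parquetize_to_gcs": {"download_json"},
    "load_raw_to_bigquery": {"parquetize_to_gcs"},
    "preprocess_on_dataproc": {"load_raw_to_bigquery"},
    "load_preprocessed_to_bigquery": {"preprocess_on_dataproc"},
}


def run_order(pipeline):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())
```

In Airflow the same chain would be expressed with operators and dependencies between them; the mapping here only makes the ordering explicit.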

Getting Started

I created this project in WSL 2 (Windows Subsystem for Linux) on Windows 10.


To get a local copy up and running in the same environment, you'll need to:

Create a Google Cloud Project

  1. Go to Google Cloud and create a new project. I set the ID to 'de-r-stocks'.
  2. Go to IAM and create a Service Account with these roles:
    • BigQuery Admin
    • Storage Admin
    • Storage Object Admin
    • Viewer
  3. Download the Service Account credentials, rename the file to de-r-stocks.json, and store it in $HOME/.google/credentials/.
  4. On the Google console, enable the following APIs:
    • IAM API
    • IAM Service Account Credentials API
    • Cloud Dataproc API
    • Compute Engine API
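
Before moving on, it can help to confirm that the credentials file from step 3 is where the later steps expect it. The following is a minimal sketch using only the standard library. The function name check_key_file is hypothetical, and the field list covers only a few of the fields a service account key file normally contains.

```python
import json
from pathlib import Path

# Location from step 3 above ($HOME/.google/credentials/de-r-stocks.json).
DEFAULT_KEY_PATH = Path.home() / ".google" / "credentials" / "de-r-stocks.json"

# A few of the fields a service account key file normally contains.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_key_file(path=DEFAULT_KEY_PATH):
    """Return a list of problems; an empty list means the file looks usable."""
    path = Path(path)
    if not path.is_file():
        return [f"key file not found: {path}"]
    try:
        key = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(key, dict):
        return ["key file does not contain a JSON object"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in key]
    if "type" in key and key["type"] != "service_account":
        problems.append(f"unexpected type: {key['type']!r}")
    return problems
```

Calling check_key_file() with no arguments checks the default location. The check only inspects the file locally and does not verify the roles or APIs listed above.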

Set up the infrastructure on Google Cloud with Terraform

I recommend running the following steps in VSCode.

  1. Using VSCode + WSL, open the project folder de_r-stocks.

  2. Open variables.tf and modify:

    • variable "project" to your own project id (this may not be necessary)
    • variable "region" to your project region
    • variable "credentials" to your credentials path
  3. Open the VSCode terminal and change directory to the terraform folder, e.g. cd terraform.

  4. Initialise Terraform: terraform init

  5. Plan the infrastructure: terraform plan

  6. Apply the changes: terraform apply

If everything goes right, you now have a bucket on Google Cloud Storage called 'datalake_de-r-stocks' and a dataset on BigQuery called 'stocks_data'.
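
If you prefer to script the three Terraform commands rather than type them one by one, a small wrapper such as the following could work. It is a sketch, not part of the project: run_terraform and its runner parameter are hypothetical, and it assumes terraform is on your PATH and that you call it from the project root.

```python
import subprocess

# The same three commands as in steps 4 to 6 above, in order.
TERRAFORM_STEPS = (
    ["terraform", "init"],
    ["terraform", "plan"],
    ["terraform", "apply"],
)


def run_terraform(workdir="terraform", runner=subprocess.run):
    """Run init, plan and apply inside workdir, stopping at the first failure."""
    for command in TERRAFORM_STEPS:
        # check=True raises CalledProcessError if a command exits non-zero,
        # so apply never runs after a failed init or plan.
        runner(command, cwd=workdir, check=True)
```

Because terraform apply is run without extra flags, it still asks for confirmation in the terminal before making changes.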

Set up Airflow

  1. Using VSCode, open docker-compose.yaml and look for the #self-defined block. Modify the variables to match your setup.

  2. Open stocks_dag.py. You may need to change the following:

    • Parameters in default_args
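
For orientation, default_args in an Airflow DAG file is an ordinary Python dictionary. The sketch below shows the kind of parameters you would typically adjust. All values are hypothetical and are not copied from stocks_dag.py.

```python
from datetime import datetime, timedelta

# Hypothetical values; the actual default_args in stocks_dag.py may differ.
default_args = {
    "owner": "airflow",
    "start_date": datetime(2022, 1, 1),   # earliest date the DAG should cover
    "depends_on_past": False,             # do not wait for the previous run
    "retries": 1,                         # retry a failed task once
    "retry_delay": timedelta(minutes=5),  # wait before retrying
}

# The pipeline runs weekly; "@weekly" is Airflow's preset for that schedule.
SCHEDULE_INTERVAL = "@weekly"
```

The start_date is the parameter most likely to need changing, since it determines how far back the scheduled runs reach.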

Start Airflow

  1. Using the terminal, change the directory to the airflow folder, e.g. cd airflow.
  2. Build the custom Airflow docker image: docker-compose build
  3. Initialise the Airflow configs: docker-compose up airflow-init
  4. Run Airflow: docker-compose up

If the setup was done correctly, you can access the Airflow interface at localhost:8080 in your browser.

Username and password are both airflow.
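
To check from a script that the services are up, you can query the webserver's health endpoint. This is a sketch under the assumption that the image runs Airflow 2, whose webserver serves /health with a status entry for the metadatabase and the scheduler. The function names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumes the default port mapping described above.
HEALTH_URL = "http://localhost:8080/health"


def is_healthy(payload):
    """True if both the metadatabase and the scheduler report 'healthy'."""
    return all(
        payload.get(component, {}).get("status") == "healthy"
        for component in ("metadatabase", "scheduler")
    )


def fetch_health(url=HEALTH_URL, timeout=5):
    """Fetch and decode the health document from the Airflow webserver."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Once docker-compose up has finished starting the containers, is_healthy(fetch_health()) should return True. A connection error usually means the webserver is not yet listening.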

Prepare for Spark jobs on Dataproc

  1. Open wordcount_by_date.py and set the string value of BUCKET to your bucket's ID.

  2. Store the initialisation and PySpark scripts in your bucket. Both are required to create the cluster that runs the Spark job.

    Run in the terminal (using the correct bucket name and region):

    • gsutil cp gs://goog-dataproc-initialization-actions-asia-southeast1/python/pip-install.sh gs://datalake_de-r-stocks/scripts
    • gsutil cp spark/wordcount_by_date.py gs://datalake_de-r-stocks/scripts
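
Since both commands depend on your bucket name and region, a small helper can build them for you. This is a sketch. The function gsutil_commands is hypothetical, and it assumes the regional initialisation-actions bucket follows the same naming pattern as the asia-southeast1 one shown above.

```python
def gsutil_commands(bucket, region, script="spark/wordcount_by_date.py"):
    """Build the two gsutil copy commands for the given bucket and region."""
    # Regional bucket that hosts the pip-install.sh initialisation action.
    init_action = (
        f"gs://goog-dataproc-initialization-actions-{region}/python/pip-install.sh"
    )
    target = f"gs://{bucket}/scripts"
    return [
        ["gsutil", "cp", init_action, target],
        ["gsutil", "cp", script, target],
    ]
```

Each returned list can be printed with " ".join(...) to paste into the terminal, or passed to subprocess.run to execute it directly.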

Now, you are ready to enable the DAG on Airflow and let it do its magic!


When you are done, stop the Airflow services by opening a terminal in the airflow directory and running docker-compose down.


Help

Authorisation error while trying to create a Dataproc cluster from Airflow

  1. Go to IAM on Google Cloud Platform.
  2. Under the Compute Engine default service account, add the roles 'Editor' and 'Dataproc Worker'.

Roadmap for Future Development

  • Refactor the code so that the subreddit and mode can be changed easily.
  • Use Terraform to set up tables on BigQuery instead of creating tables as part of the DAG.
  • Unit tests
  • Data quality checks
  • CI/CD

