Stock Market Data Pipeline

Data pipeline that extracts historical daily prices for all tickers currently trading on NASDAQ from the Yahoo Finance API (via the yfinance library), built with Python, Airflow, MinIO, Spark, PostgreSQL, dbt, and Metabase, all running in Docker.

Objective

This project creates a pipeline that extracts all symbols trading on NASDAQ, downloads each symbol's daily historical prices from the Yahoo Finance API, and stores them in a local S3-compatible data lake (MinIO). Spark then loads the data from the data lake into Postgres, and dbt builds models on top of it. All services run in Docker containers, so the pipeline can run on any machine and environment.
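
For illustration, here is a minimal sketch of the extract-and-store step, using boto3 against MinIO's S3 API. The endpoint, credentials, bucket name, and key layout are placeholders, not the project's actual configuration:

import io

import boto3
import yfinance as yf

# MinIO is S3-compatible, so boto3 works against it with a custom
# endpoint. Endpoint, credentials, bucket, and key layout below are
# illustrative placeholders, not the project's actual settings.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

symbols = ["AAPL", "AMZN", "GOOG"]  # the real pipeline uses every NASDAQ symbol

for symbol in symbols:
    # Daily historical prices for the full available range.
    history = yf.Ticker(symbol).history(period="max")
    buffer = io.BytesIO()
    history.to_parquet(buffer)
    s3.put_object(
        Bucket="stock-market",
        Key=f"historical/{symbol}.parquet",
        Body=buffer.getvalue(),
    )

In the pipeline itself, these steps run as Airflow tasks over the full NASDAQ symbol list rather than as a standalone script.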

Tools & Technologies

  • Python
  • Apache Airflow
  • MinIO
  • Apache Spark
  • PostgreSQL
  • dbt
  • Metabase
  • Docker

Architecture

flowchart TD
    subgraph Airflow["Airflow"]
      NASDAQ["Extract NASDAQ Symbols"]
      YahooAPI["Extract Symbole's Historical Data"]
      MinIO[("MinIO")]
      Spark["Apache Spark"]
      Postgres[("PostgreSQL Database")]
      DBT["DBT Modeling"]
      Metabase["Metabase Dashboard"]
    end

    NASDAQ -.-> YahooAPI
    YahooAPI -.-> MinIO
    MinIO -->|Historical data| Spark
    Spark -->|Transformed data| Postgres
    Postgres -->|Data for modeling| DBT
    DBT --> Metabase

Data flow

  1. Extract NASDAQ symbols to MinIO
  2. Download each symbol's historical data to MinIO
  3. Submit a Spark job to the Spark cluster to load and transform the data from MinIO into Postgres (see the PySpark sketch after this list)
  4. Create data models through dbt
  5. Create the Metabase dashboard
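
Step 3 can be pictured with a small PySpark sketch. The S3A settings, credentials, and target table name are assumptions based on common Docker defaults, and the PostgreSQL JDBC driver is assumed to be on the Spark classpath:

from pyspark.sql import SparkSession, functions as F

# S3A endpoint, credentials, bucket, and target table are assumed
# placeholders; real values come from the project's Docker environment.
spark = (
    SparkSession.builder.appName("load_stock_market")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

prices = spark.read.parquet("s3a://stock-market/historical/*.parquet")

# Example transformation: normalize column names for Postgres.
prices = prices.select(
    [F.col(c).alias(c.lower().replace(" ", "_")) for c in prices.columns]
)

(
    prices.write.format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/stock_market")
    .option("dbtable", "stock_prices")
    .option("user", "postgres")
    .option("password", "postgres")
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save()
)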

Setup

Prerequisites

  • Docker
  • Docker Compose

Install

Every tool is installed through Docker Compose; you can find them all in docker-compose.yml.

There are also separate Docker Compose files for Airflow, MinIO, Spark, PostgreSQL, and dbt. You can run each one with these commands (note that the -f flag goes before the subcommand):

docker compose -f airflow-docker-compose.yml up --build
docker compose -f minio-docker-compose.yml up
docker compose -f spark-docker-compose.yml up
docker compose -f postgres-docker-compose.yml up
docker compose -f dbt-docker-compose.yml up --build

Before starting any of the services, create the shared Docker network:

docker network create stock-market-network

With the network in place, bring everything up:

docker compose up --build

Set up the tables in PostgreSQL:

psql -h localhost -U postgres -d stock_market < setup.sql
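
If psql isn't available on the host, the same step can be done from Python with psycopg2. This is just a sketch; the password is an assumed default and should match the one in postgres-docker-compose.yml:

import psycopg2

# Mirrors the psql command above; the password is an assumed default
# and should match postgres-docker-compose.yml.
conn = psycopg2.connect(
    host="localhost",
    dbname="stock_market",
    user="postgres",
    password="postgres",
)
with conn, conn.cursor() as cur:
    cur.execute(open("setup.sql").read())
conn.close()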

To submit Spark jobs from Airflow, you'll first need to create a Spark connection in Airflow:

docker compose run airflow-webserver airflow connections add 'spark_stock_market' --conn-type 'spark' --conn-host 'spark://spark-master-1:7077'
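
Inside a DAG, a task can then reference that connection. Here is a minimal sketch using the SparkSubmitOperator from the Apache Spark provider; the DAG id and application path are illustrative placeholders, while conn_id matches the connection created above:

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# DAG id and application path are illustrative; conn_id matches the
# connection created by the command above.
with DAG("spark_submit_example", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    load_to_postgres = SparkSubmitOperator(
        task_id="load_to_postgres",
        conn_id="spark_stock_market",
        application="/opt/airflow/dags/spark/load_job.py",
    )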

Running

1. Open Airflow UI: http://localhost:8080

2. Run the extracting_finanace_data DAG in Airflow

3. Open Metabase: http://localhost:3000

Set up the Postgres connection in Metabase using the environment variables from postgres-docker-compose.yml.

Create these questions (using the public_stock_market schema); a sketch of the calculations behind them follows the list:

Top 100 trading volume symbols

Moving average of AMZN stock

Daily change percentage of MAANG stocks
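
For reference, the last two questions boil down to simple rolling calculations. Here is an illustrative pandas sketch over yfinance data; in the project these are computed as dbt models over the public_stock_market schema in Postgres:

import yfinance as yf

# Illustrative stand-ins for the Metabase questions; the window size
# is an example, not the project's actual choice.
prices = yf.Ticker("AMZN").history(period="1y")

prices["moving_avg_30d"] = prices["Close"].rolling(window=30).mean()
prices["daily_change_pct"] = prices["Close"].pct_change() * 100
print(prices[["Close", "moving_avg_30d", "daily_change_pct"]].tail())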

Dashboard

(Metabase dashboard screenshot)