This repository contains the code to implement my Airflow/PostgreSQL solution for managing a data pipeline that collects options data. I have written a series of posts detailing the setup:
- Database Design
- Foundational ETL Code
- Getting Started with Airflow
- Building the DAG
- Containerising the Pipeline
To run the app, you'll need:

- Operating system: Linux (I'm using Ubuntu 20.04)
- Docker
- Docker Compose
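You can quickly confirm that Docker and Docker Compose are installed with the standard version commands:

```bash
# Verify that Docker and Docker Compose are available
docker --version
docker-compose --version
```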
If you have Git installed, clone the repository to your local machine:

```bash
git clone https://github.com/chrischow/open_options_chains.git
```

Otherwise, you may download the code and unzip the contents into a folder of your choice. Then, `cd` into that directory. You'll see:
- `dags`: Contains the DAGs you'll need for the pipeline
- `db`: Contains a script to initialise the Postgres database
- `logs`: Contains logs from your Airflow instance
- `pgdata`: Contains data from Postgres
- `scripts`: Contains a script to initialise Airflow
- `docker-compose.yml`: File that defines the services, networks, and volumes for our app
Then, change the required variables in the `.env` file. Note that if you edit the `POSTGRES_*` variables, you must update the corresponding values in `AIRFLOW__CORE__SQL_ALCHEMY_CONN` as well. Do remember to input your TD Ameritrade (TDA) API key.
```
# Postgres
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow
APP_DB_USER=openoptions
APP_DB_PASS=openoptions
APP_DB_NAME=optionsdata

# Airflow
APP_AIRFLOW_USERNAME=admin
APP_AIRFLOW_PASSWORD=password
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
AIRFLOW__CORE__EXECUTOR=LocalExecutor

# TDA API
API_KEY=<your TDA API key>
```
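As a hypothetical example of keeping the two in sync: if you changed the Postgres credentials, the connection string would need to mirror the new values (the `postgres` host is the Compose service name and stays the same):

```
# Hypothetical custom credentials for the Airflow metadata database
POSTGRES_USER=myuser
POSTGRES_PASSWORD=mysecret
POSTGRES_DB=airflowdb

# The connection string must match: user:password@host/db
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://myuser:mysecret@postgres/airflowdb
```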
Next, we'll initialise Airflow with the command below. This will:
- Start a PostgreSQL container
- Initialise the Airflow Metadata Database and create an Airflow account
```bash
docker-compose up airflow-init
```
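If you'd like to watch the initialisation progress in another terminal, one option is to follow the logs for that service (a minimal sketch; `airflow-init` is the service name used in the command above):

```bash
# Follow the initialisation logs until the service exits
docker-compose logs -f airflow-init
```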
The `airflow-init` container will shut down after the initialisation is complete. The `postgres` container will continue running. You can stop it with `docker-compose stop` or `docker stop <container id>` if you don't need it.
Launch Airflow with the command below. This will:
- Start a PostgreSQL container if it's not already running
- Start the Airflow Scheduler container
- Start the Airflow Webserver container
```bash
docker-compose up
```
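If you'd rather not keep a terminal attached to the containers, you can start them in detached mode instead (a standard Docker Compose flag, not specific to this project):

```bash
# Run the containers in the background
docker-compose up -d
```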
In a browser, navigate to http://localhost:8088 to access the Airflow UI. For the first login, use the credentials `APP_AIRFLOW_USERNAME` and `APP_AIRFLOW_PASSWORD` that you set in the `.env` file.
To stop the app, use the following command:

```bash
docker-compose stop
```
You can extract the data in CSV format via the command line using the commands below:
```bash
# Check Postgres container ID
docker ps

# Go to folder to store files
cd <directory-to-store-csv-files>

# Extract CSV file
docker exec -t <first-few-characters-of-docker-container> psql -U airflow -d optionsdata -c "COPY table_name to STDOUT WITH CSV HEADER" > "filename.csv"
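```

If you're unsure which table to export, you can list the tables in the database first. Below is a small sketch using the same credentials; `\dt` is a standard psql meta-command:

```bash
# List the tables in the optionsdata database
docker exec -t <first-few-characters-of-docker-container> psql -U airflow -d optionsdata -c "\dt"
```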
Alternatively, you can connect to the Postgres database running in the container. Simply use the database name and login credentials set in the `.env` file.

```python
import psycopg2 as pg2

# Connect to the database
conn = pg2.connect(host='localhost', database='optionsdata', user='airflow', password='airflow', port='5432')
```
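From there, you can query the data as usual. Below is a minimal sketch, assuming a table named `table_name` as in the CSV example above (substitute your actual table):

```python
# Run a simple query and print a few rows
with conn.cursor() as cur:
    cur.execute("SELECT * FROM table_name LIMIT 5;")
    for row in cur.fetchall():
        print(row)

conn.close()
```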
To restart the environment:
```bash
# Completely shut down the app
docker-compose down

# Remove contents of mounted volumes
rm -rf logs/*
rm -rf pgdata/*
```
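Note that clearing `pgdata` wipes both the Airflow metadata database and the options data, so you'll likely need to re-run the initialisation step before launching the app again:

```bash
# Re-initialise Airflow, then launch the app
docker-compose up airflow-init
docker-compose up
```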