This is the primary repository for the data pipelines of the Application Engineering (AE) team at the NYC Department of City Planning (DCP).
These pipelines, called "data flows", populate the databases used by our APIs.
For all AE data flows, there is one database cluster with a `staging` and a `prod` database. There are also `dev` databases. These are called data flow databases.
For each API, there is a database cluster with a `staging` and a `prod` database. The only tables in those databases are the ones that API uses. These are called API databases.
For each API and its databases, this is the approach to updating data:
- Load source data into the data flow database
- Create tables that are identical in structure to the API database tables
- Replace the rows in the API database tables
These steps are first performed on the `staging` set of databases. When that process has succeeded and the API's use of it has passed QA, the same process is performed on the `prod` set of databases.
This is a more granular description of those steps:
- Download CSV files from Digital Ocean file storage
- Copy CSV files into source data tables
- Test source data tables
- Create API tables in the data flow database
- Populate the API tables in the data flow database
- Replace rows in API tables in the API database
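As a rough illustration of that final step, here is a minimal sketch. The table name, file name, and connection variables are hypothetical; the real logic lives in the scripts described below.

```bash
# Export the prepared API table from the data flow database to a CSV
psql "$DATA_FLOW_DB_URL" -c "\copy api_example_table TO 'api_example_table.csv' CSV HEADER"

# Replace the rows in the corresponding API database table within one transaction
psql "$API_DB_URL" <<'SQL'
BEGIN;
TRUNCATE api_example_table;
\copy api_example_table FROM 'api_example_table.csv' CSV HEADER
COMMIT;
SQL
```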
We use a GitHub Action to perform API database updates.
We have three environments to configure the databases and credentials used for an API database update. The `dev` environment can be used on any branch. The `staging` and `production` environments can only be used on the `main` branch.
When an action attempts to use the `production` environment, specific people or teams specified in this repo's settings must approve the action run's access to the environment.
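For reference, a workflow job targets an environment like this. This is a hypothetical excerpt, not the actual workflow in this repo:

```yaml
jobs:
  update-api-db:
    runs-on: ubuntu-latest
    environment: production # the run pauses here until a configured reviewer approves
    steps:
      - uses: actions/checkout@v4
      - run: ./bash/update_api_db.sh
```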
> [!NOTE]
> These instructions are for local setup on macOS.
For non-public files like our CSVs in `/edm/distribution/`, we can use minio for authenticated file transfers.
```bash
brew install minio/stable/mc
mc alias set spaces $DO_SPACES_ENDPOINT $DO_SPACES_ACCESS_KEY $DO_SPACES_SECRET_KEY
```
We use `spaces` here, but you can name the alias anything. When you run `mc config host list`, you should see the newly added host with credentials from your `.env`.
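As a quick check, you can browse files through the alias. The file name below is illustrative:

```bash
# List non-public files via the authenticated alias
mc ls spaces/edm/distribution/

# Download a single file to the current directory (hypothetical file name)
mc cp spaces/edm/distribution/example.csv .
```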
> [!NOTE]
> These instructions are for using pyenv to manage python virtual environments. See these instructions to install it.
> If you are using a different approach like venv or virtualenv, follow comparable instructions in the relevant docs.
The `.python-version` file defines which version of python this project uses.
```bash
# Install pyenv and its virtualenv plugin
brew install pyenv
brew install pyenv-virtualenv

# Create and activate a virtual environment for this project
pyenv virtualenv venv_ae_data_flow
pyenv virtualenvs
pyenv activate venv_ae_data_flow
pyenv version

# Install python packages and dbt dependencies
python3 -m pip install --force-reinstall -r requirements.txt
pip list
dbt deps
```
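As an optional sanity check, assuming the virtual environment is active and your dbt profile is configured:

```bash
python --version   # should match the version pinned in .python-version
dbt debug          # checks that dbt can load its project, profile, and connection
```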
We use `postgres` version 15 in order to use the `psql` CLI.
```bash
brew install postgresql@15
# Restart the terminal
psql --version
```
Create a file called `.env` in the root folder of the project, copy the contents of `sample.env` into that new file, and fill in the blank values.
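For example, to create the file from the template before filling in values:

```bash
cp sample.env .env
```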
> [!IMPORTANT]
> To use a local database, `sample_local.env` likely has the environment variable values you need.
> To use a deployed database in Digital Ocean, the values you need can be found in the AE 1Password vault.
Next, use docker compose to stand up a local PostGIS database.
```bash
./bash/utils/setup_local_db.sh
```
If you need to install docker compose, follow these instructions.
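Once the script finishes, you can confirm the database is up. The connection variable below is hypothetical; use whichever variable your `.env` defines:

```bash
# The PostGIS container should appear as running
docker ps

# Hypothetical connection check; substitute your .env's connection variable
psql "$DATABASE_URL" -c "SELECT postgis_version();"
```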
Run the data flow scripts in order:

```bash
./bash/download.sh
./bash/import.sh
./bash/transform.sh
./bash/export.sh
./bash/update_api_db.sh
```
If you receive an error, make sure the script has the correct permissions:

```bash
chmod 755 ./bash/import.sh
```