The Natural Florida History Museum HAAG project. An ML-backed search engine of ecological data.
- Local Setup
- Seeding Mongo with Raw Data
- Generate Embeddings
- Jupyter Notebooks
- Accessing the Postgres Database
- Accessing the Mongo Database
- Accessing Redis
- Contributing
- License
Docker is a prerequisite.
- Download project:
git clone git@github.com:Human-Augment-Analytics/NFHM.git NFHM
- Open and run project in dev container with VSCode
- Set up postgres db with initial data:
bin/import_vector_db
- Run the backend API (from within the dev container):
bin/dev
- Navigate to http://localhost:3000 in your browser
If the Super Quick Start above doesn't work (for example, you're not using a Mac), then the following steps capture the essential idea. Modify as necessary for your local computing environment.
- Open and run project in dev container with VSCode
- Download the sample vector database:
- (With Mac's unzip): curl "https://drive.usercontent.google.com/download?id={17QGJ3o7rx88A51KjUije6RX_j4kV0WXr}&confirm=xxx" -o tmp.pgsql.zip && unzip tmp.pgsql.zip
- Copy that file to the postgres docker container. Run docker ps | grep 'nfhm' | grep 'postgres' to get the container name, then:
  docker cp vector_embedder_data_only.pgsql nfhm_devcontainer-postgres-1:/tmp/import.pgsql
  (Replace nfhm_devcontainer-postgres-1 and vector_embedder_data_only.pgsql with the container name and filename, respectively, as appropriate.)
- Run the import:
  docker exec -it nfhm_devcontainer-postgres-1 bash
  psql -U postgres -d nfhm -f /tmp/import.pgsql
- Navigate to http://localhost:3000 in your browser
For optimal portability, this app uses Dev Containers to configure and manage the development environment. This means any developer with Docker installed and an appropriate IDE (e.g., VSCode, GitHub Codespaces, a JetBrains IDE if you like debugging) or the Dev Container CLI should be able to get this project running locally in just a few steps.
To run locally:
- Open the repository in a devcontainer. Here's an example with VSCode using the VSCode Dev Container extension: from the command palette (CMD+Shift+P on MacBooks), type Dev Containers: Reopen in Container.
- (SUBJECT TO CHANGE): run $ bin/dev to start the Python backend.
- Visit http://localhost:8000/ (a quick Python smoke test is sketched after this list).
- (SUBJECT TO CHANGE) Here is a mock screenshot of how you can expect the website to look:
- Next you'll need to import data.
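Once the backend is up, here is a minimal smoke test using only the Python standard library. It assumes the backend is serving on port 8000 as described above and checks nothing beyond the HTTP status, since the response body is subject to change:

```python
# Minimal smoke test: confirm the (subject-to-change) backend answers
# on http://localhost:8000/. Only the status code is checked.
from urllib.request import urlopen

with urlopen("http://localhost:8000/", timeout=5) as resp:
    print("Backend responded with HTTP status", resp.status)
```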
This project's dev container runs a Docker image of Jupyter notebooks at http://localhost:8888. The /work/ directory of this container (full path: /home/jovyan/work/) is mounted to this repository on your local filesystem at ./NFHM/jupyter-workpad so you can check your notebooks into version control.
Alternatively, you can use a local installation of Jupyter if you prefer. Regardless, by convention, check your work into the ./jupyter-workpad subdirectory.
We use Mongo to house the raw data we import from iDigBio, GBIF, and any other external sources. We use Redis as our queueing backend. To seed your local environment with a sample of data to work with, you'll need to first follow the instructions above for local setup.
- Activate the ingestor_worker conda environment:
  $ conda activate ingestor_worker
- Start by spinning up the iDigBio worker.
  - The worker pulls in environment variables to determine which queue to pull from and which worker functions to call. Consequently, you can either set those variables in .devcontainer/devcontainer.json -- which will require a rebuild and restart of the dev container -- or you can set them via the command line. We'll do the latter (from within the dev container):
    - Open a new tab (or reload the terminal) to make sure conda can init:
      conda activate ingestor_worker
    - Set env vars, e.g.:
      export SOURCE_QUEUE="idigbio"          # Indicates which queue to read from
      export INPUT="inputs.idigbio_search"   # Indicates which input function to run for the job. Input functions can be found under ./ingestor/inputs
      export QUEUE="ingest_queue.RedisQueue" # Indicates which queueing backend to use. Currently, the only option is redis.
      export OUTPUT="outputs.dump_to_mongo"  # Indicates the output function to run. Output functions can be found under ./ingestor/outputs/
    - Run the job:
      python ingestor/ingestor.py
- Navigate in a browser to the Redis server via Redis Insight at http://localhost:8001, or connect to port 6379 via your preferred Redis client.
- Decide what sample of data you want to query from iDigBio. For this example, we'll limit ourselves to records of the order lepidoptera (butterflies and related winged insects) with associated image data from the Yale Peabody Museum.
- We'll LPUSH that query onto the idigbio queue from the Redis Insight workbench (a Python alternative is sketched after this list):
  LPUSH idigbio '{"search_dict":{"order":"lepidoptera","hasImage":true,"data.dwc:institutionCode":"YPM"},"import_all":true}'
  search_dict is the verbatim query passed to the iDigBio API. Consult the wiki and the GitHub wiki for search options. import_all is an optional param (default: False) that iterates through all pages of results and imports the raw data into Mongo. Otherwise, only the first page of results is fetched. Please be mindful when setting this param, as there are a lot (~200 GB, not including media data) of records in iDigBio.
- Navigate to Mongo Express (or use your preferred Mongo client) at http://localhost:8081 and navigate to the idigbio collection inside the NFHM database to see the imported data.
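If you prefer to enqueue the job programmatically rather than from the Redis Insight workbench, here is a minimal sketch using the redis Python package (an assumption: install it with pip install redis if it isn't already available in your environment):

```python
# Push the same iDigBio ingest job onto the "idigbio" queue from Python.
# Assumes the `redis` package is installed and the dev container's Redis
# is reachable on localhost:6379.
import json
import redis

query = {
    "search_dict": {
        "order": "lepidoptera",
        "hasImage": True,
        "data.dwc:institutionCode": "YPM",
    },
    "import_all": True,  # fetch every page of results; use with care
}

r = redis.Redis(host="localhost", port=6379)
r.lpush("idigbio", json.dumps(query))
```

The payload is the same JSON document shown above; import_all is True here, so the same caution about data volume applies.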
The basic process of seeding Mongo with raw GBIF data is essentially the same as with iDigBio. However, you'll need to make sure you have the GBIF worker up-and-running in your dev container with the correct environment inputs:
conda activate ingestor_worker
export SOURCE_QUEUE="gbif"
export INPUT="inputs.gbif_search"
export QUEUE="ingest_queue.RedisQueue"
export OUTPUT="outputs.dump_to_mongo"
python ingestor/ingestor.py  # Run the job
- From the workbench of Redis Insight, pass a simple search string to the gbif queue:
  LPUSH gbif "puma concolor"
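The same push can be done from Python (same redis package assumption as the iDigBio sketch above); note that GBIF jobs take a plain search string rather than a JSON document:

```python
# Push a plain search string onto the "gbif" queue on localhost:6379.
import redis

r = redis.Redis(host="localhost", port=6379)
r.lpush("gbif", "puma concolor")
```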
Once we've imported raw-form data into Mongo, we'll want to generate vector embeddings for the data and store them in Postgres. This is where the web API serves query results from.
The process is very similar to importing data into Mongo. Again, if you've just started up the dev container, make sure to open a new terminal tab (assuming you're using VSCode) so that conda will init.
conda activate ingestor_worker
export SOURCE_QUEUE="embedder"
export INPUT="inputs.vector_embedder"
export QUEUE="ingest_queue.RedisQueue"
export OUTPUT="outputs.index_to_postgres"
python ingestor/ingestor.py
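To get a feel for how the stored embeddings can be queried (roughly what the web API does when serving results), here is an illustrative sketch. The table name search_records and the columns id and embedding are hypothetical placeholders, so check the actual schema in the repo; it also assumes psycopg2 is installed and that the pgvector extension's <-> distance operator backs the vector column.

```python
# Illustrative only: "search_records", "id", and "embedding" are
# hypothetical names; substitute the project's real table/columns.
# Assumes psycopg2 (pip install psycopg2-binary) and pgvector.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432, dbname="nfhm",
    user="postgres", password="postgres",
)
with conn.cursor() as cur:
    # Rank rows by vector distance to the first stored embedding.
    cur.execute("""
        SELECT id
        FROM search_records
        ORDER BY embedding <-> (SELECT embedding FROM search_records LIMIT 1)
        LIMIT 5;
    """)
    print("nearest records:", [row[0] for row in cur.fetchall()])
conn.close()
```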
Postgres serves as the primary backend database for vector/embedding storage, as well as other backend storage critical to running and serving the app.
You can directly access the Postgres database from your local machine by connecting to port 5432 on localhost using username postgres and password postgres. For example, with Postico, you would enter those connection details directly.
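If you'd rather connect programmatically, here is a minimal sketch using psycopg2 (an assumption: install psycopg2-binary if it isn't already available) against the same connection details:

```python
# Connect to the local Postgres with the credentials above and list the
# public tables, without assuming anything about the schema.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432, dbname="nfhm",
    user="postgres", password="postgres",
)
with conn.cursor() as cur:
    cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public';"
    )
    print("public tables:", [row[0] for row in cur.fetchall()])
conn.close()
```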
This project uses Mongo to store raw data from iDigBio, GBIF, etc. This allows us to more readily run experiments with re-indexing, re-vectorizing/embedding, etc. without having to reach out across the internet to the canonical data sources every time we want to re-access the same raw data.
Once you have your development environment running, you can access MongoDB locally by going to http://localhost:8081/. Alternatively, you can connect to port 27018 on localhost with your preferred Mongo client (e.g., mongosh). The local database is, unoriginally, named NFHM.
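Programmatic access works the same way; here is a minimal sketch with pymongo (an assumption: it may need to be installed, and this assumes the local instance requires no authentication):

```python
# Connect to the local Mongo on the forwarded port 27018 and peek at the
# raw iDigBio documents imported earlier.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27018/")
db = client["NFHM"]
print("collections:", db.list_collection_names())
print("idigbio records:", db["idigbio"].count_documents({}))
```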
Redis -- as of this writing -- is used as a queueing backend during data ingestion and processing. In the future, we may use redis for other things, too.
To access the local Redis server with Redis Insight during development, navigate to http://localhost:8001/. You can also connect your preferred Redis client (e.g., redis-cli) directly to localhost at the default Redis port 6379.
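Or, from Python, a minimal sketch with the redis package (same installation assumption as in the seeding section) that checks how deep each ingest queue is:

```python
# Report the depth of the ingest queues on the local Redis (port 6379).
import redis

r = redis.Redis(host="localhost", port=6379)
for queue in ("idigbio", "gbif", "embedder"):
    print(queue, "queue length:", r.llen(queue))
```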
Instructions on how to use your project and any relevant examples.
Guidelines on how others can contribute to your project.
Information about the license for your project.
We expect to do a lot of experiments that vary the content of the DB. Consequently, it's imperative to be able to share our exact data with each other, so we can avoid repeating the lengthy process of importing data from external sources and generating embeddings. You can use pg_dump to do so:
docker exec nfhm_devcontainer-postgres-1 bash -c "pg_dump -U postgres nfhm" > <FILE_NAME>.pgsql
And then zip it up and send it.