NFHM

Natural Florida History Museum Specimen DB

The Natural Florida History Museum HAAG project: an ML-backed search engine for ecological data.

BioCosmos Image Search of Lepidoptera iridescence

Table of Contents

  • Local Setup
  • Jupyter Notebooks
  • Seeding Mongo with Raw Data
  • Generate Embeddings
  • Accessing the Postgres Database
  • Accessing the Mongo Database
  • Accessing Redis
  • Usage
  • Contributing
  • License
  • Useful Commands

Local Setup

Docker is a prerequisite.

  1. Download project: git clone git@github.com:Human-Augment-Analytics/NFHM.git NFHM

Super Quick Start

  1. Open and run project in dev container with VSCode
  2. Set up the Postgres DB with initial data: bin/import_vector_db
  3. Run the backend API (from within the dev container): bin/dev
  4. Navigate to http://localhost:3000 in your browser

Less Quick Start

If the Super Quick Start above doesn't work (for example, you're not using a Mac), then the following steps capture the essential idea; a consolidated sketch of the import appears after the list. Modify as necessary for your local computing environment.

  1. Open and run project in dev container with VSCode
  2. Download the sample vector database:
    • curl "https://drive.usercontent.google.com/download?id=17QGJ3o7rx88A51KjUije6RX_j4kV0WXr&confirm=xxx" -o tmp.pgsql.zip && unzip tmp.pgsql.zip (works with macOS's built-in curl and unzip; adjust for your platform)
  3. Copy that file to the postgres docker container
    • docker ps | grep 'nfhm' | grep 'postgres' to get container name
    • docker cp vector_embedder_data_only.pgsql nfhm_devcontainer-postgres-1:/tmp/import.pgsql (Replace nfhm_devcontainer-postgres-1 and vector_embedder_data_only.pgsql with container and filename, respectively, as appropriate.)
  4. Run import.
    • docker exec -it nfhm_devcontainer-postgres-1 bash
    • psql -U postgres -d nfhm -f /tmp/import.pgsql
  5. Navigate to http://localhost:3000 in your browser
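
Put together, the manual import looks roughly like the sketch below. This is only a sketch of the steps above; replace the container and file names as appropriate for your environment.

# Download and unpack the sample vector database (step 2 above).
curl "https://drive.usercontent.google.com/download?id=17QGJ3o7rx88A51KjUije6RX_j4kV0WXr&confirm=xxx" -o tmp.pgsql.zip
unzip tmp.pgsql.zip

# Find the Postgres container, then copy the dump into it (step 3 above).
docker ps | grep 'nfhm' | grep 'postgres'
docker cp vector_embedder_data_only.pgsql nfhm_devcontainer-postgres-1:/tmp/import.pgsql

# Run the import inside the container (step 4 above).
docker exec -it nfhm_devcontainer-postgres-1 psql -U postgres -d nfhm -f /tmp/import.pgsql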

Least Quick Start

For optimal portability, this app uses Dev Containers to configure and manage the development environment. This means any developer with Docker installed and an appropriate IDE (e.g., VSCode, GitHub Codespaces, a JetBrains IDE if you like debugging) or the Dev Container CLI should be able to get this project running locally in just a few steps.

To run locally:

  1. Open the repository in a dev container. Here's an example with VSCode using the Dev Containers extension: from the command palette (Cmd+Shift+P on macOS), run Dev Containers: Reopen in Container.

  2. (SUBJECT TO CHANGE): run $ bin/dev to start the python backend.

  3. Visit http://localhost:8000/

  4. (SUBJECT TO CHANGE) Mock screenshot of how you can expect the website to look.

  5. Next you'll need to import data.

Jupyter Notebooks

This project's dev container runs a Jupyter Notebook Docker image at http://localhost:8888. The /work/ directory of that container (full path: /home/jovyan/work/) is mounted to ./jupyter-workpad in this repository on your local filesystem, so you can check your notebooks into version control.

Alternatively, you can use a local installation of Jupyter if you prefer. Regardless, by convention, check your work into the ./jupyter-workpad subdirectory.

Seeding Mongo with Raw Data

We use Mongo to house the raw data we import from iDigBio, GBIF, and any other external sources. We use Redis as our queueing backend. To seed your local environment with a sample of data to work with, you'll need to first follow the instructions above for local setup.

Seeding Mongo with a sample of iDigBio data:

  1. Activate the ingestor_worker conda environment: $ conda activate ingestor_worker

  2. Start by spinning up the iDigBio worker.

    • The worker pulls in environment variables to determine which queue to pull from and which worker functions to call. Consequently, you can either set those variables in .devcontainer/devcontainer.json -- which will require a rebuild and restart of the dev container -- or you can set them via the command line. We'll do the latter:
      • (from within the dev container):
        • Open new tab (or reload terminal) to make sure conda can init:
          • conda activate ingestor_worker
        • Set env vars, e.g.,:
          •  export SOURCE_QUEUE="idigbio"         # Indicates which queue to read from
             export INPUT="inputs.idigbio_search"   # Indicates which input function to run for the job. Input functions can be found under ./ingestor/inputs
             export QUEUE="ingest_queue.RedisQueue" # Indicates which queueing backend to use. Currently, the only option is Redis.
             export OUTPUT="outputs.dump_to_mongo"  # Indicates the output function to run. Output functions can be found under ./ingestor/outputs/
        • Run the job
          • python ingestor/ingestor.py
  3. Navigate in a browser to the Redis server via Redis Insight at http://localhost:8001, or connect to port 6379 via your preferred Redis client.

  4. Decide what sample of data you want to query from iDigBio. For this example, we'll limit ourselves to records of the order Lepidoptera (butterflies and moths) with associated image data from the Yale Peabody Museum.

  5. We'll LPUSH that query onto the idigbio queue from the Redis Insight workbench:

    LPUSH idigbio '{"search_dict":{"order":"lepidoptera","hasImage":true,"data.dwc:institutionCode":"YPM"},"import_all":true}'
    
    • search_dict is the verbatim query passed to the iDigBio API. Consult the iDigBio wiki and the iDigBio GitHub wiki for search options.
    • import_all is an optional param (default: False) that will iterate through all pages of results and import the raw data into Mongo; otherwise, only the first page of results is fetched. Be mindful when setting this param, as iDigBio holds a lot of records (~200 GB, not including media data).
    • If you'd rather enqueue the job from a terminal than through Redis Insight, see the redis-cli sketch after this list.
  6. Open Mongo Express (or use your preferred Mongo client) at http://localhost:8081 and navigate to the idigbio collection inside the NFHM database to see the imported data.
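
If you prefer a terminal to the Redis Insight workbench, here is a minimal sketch that enqueues the same job with redis-cli. It assumes Redis is reachable on localhost at the default port 6379, as described under Accessing Redis below.

# Push the example iDigBio query onto the idigbio queue from the command line.
redis-cli -h localhost -p 6379 LPUSH idigbio '{"search_dict":{"order":"lepidoptera","hasImage":true,"data.dwc:institutionCode":"YPM"},"import_all":true}'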

Seeding Mongo with a sample of GBIF data:

The basic process of seeding Mongo with raw GBIF data is essentially the same as with iDigBio. However, you'll need to make sure the GBIF worker is up and running in your dev container with the correct environment variables; a small queue-monitoring sketch follows these steps:

  •  conda activate ingestor_worker
    
     export SOURCE_QUEUE="gbif" 
     export INPUT="inputs.gbif_search" 
     export QUEUE="ingest_queue.RedisQueue"
     export OUTPUT="outputs.dump_to_mongo"
    
     python ingestor/ingestor.py  # Run the job
  • From the workbench of Redis Insight, pass a simple search string to the gbif queue:
    • LPUSH gbif "puma concolor"
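
To keep an eye on either queue while a worker drains it, you can check queue lengths with redis-cli. This is a small sketch that assumes Redis is exposed on localhost:6379, as described under Accessing Redis below.

# Count the jobs currently waiting on each ingestion queue.
redis-cli -h localhost -p 6379 LLEN idigbio
redis-cli -h localhost -p 6379 LLEN gbif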

Generate Embeddings

Once we've imported raw-form data into Mongo, we'll want to generate vector embeddings for the data and store them in Postgres. Postgres is where the web API serves query results from.

The process is very similar to importing data into Mongo. Again, if you've just started up the dev container, make sure to open a new terminal tab (assuming you're using VSCode) so that conda will init.

conda activate ingestor_worker

export SOURCE_QUEUE="embedder"
export INPUT="inputs.vector_embedder"
export QUEUE="ingest_queue.RedisQueue"
export OUTPUT="outputs.index_to_postgres"

python ingestor/ingestor.py
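
Once the embedder has run, you can sanity-check that it wrote tables to Postgres. The following is just a sketch; replace the container name as appropriate, exactly as in the import steps above.

# List the tables in the nfhm database to confirm the embedder produced output.
# Find your container name with: docker ps | grep 'nfhm' | grep 'postgres'
docker exec -it nfhm_devcontainer-postgres-1 psql -U postgres -d nfhm -c '\dt'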

Accessing the Postgres Database

Postgres serves as the primary backend database for vector/embedding storage, as well as other backend storage critical to running and serving the app.

You can directly access the Postgres database from your local machine by connecting to port 5432 on localhost using username postgres and password postgres. For example, you can point Postico (or any other Postgres client) at those settings.
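
For a quick command-line check from the host, here is a minimal psql example using those same settings (this assumes psql is installed on your local machine):

# Connect from your local machine; enter "postgres" when prompted for the password.
psql -h localhost -p 5432 -U postgres -d nfhm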

Accessing the Mongo Database

This project uses Mongo to store raw data from iDigBio, GBIF, etc. This allows us to more readily run experiments with re-indexing, re-vectorizing/embedding, etc. without having to reach out across the internet to the canonical data sources every time we want to re-access the same raw data.

Once you have your development environment running, you can access MongoDB locally by going to http://localhost:8081/. Alternatively, you can connect to port 27018 on localhost with your preferred Mongo client (e.g., mongosh). The local database is, unoriginally, named NFHM.
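
For example, a minimal mongosh connection from the host (this assumes mongosh is installed on your local machine):

# Connect to the dev environment's Mongo instance and land in the NFHM database.
mongosh "mongodb://localhost:27018/NFHM"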


Accessing Redis

Redis -- as of this writing -- is used as a queueing backend during data ingestion and processing. In the future, we may use Redis for other things, too.

To access the local Redis server with Redis Insight during development, navigate to http://localhost:8001/. You can also connect your preferred Redis client (e.g., redis-cli) directly to localhost at the default Redis port 6379.
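
For example, a quick connectivity check with redis-cli (assuming it's installed on your local machine):

# Ping the dev environment's Redis instance; a healthy server replies with PONG.
redis-cli -h localhost -p 6379 ping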


Usage

Instructions on how to use your project and any relevant examples.

Contributing

Guidelines on how others can contribute to your project.

License

Information about the license for your project.

TMP

Useful Commands

We expect to run a lot of experiments that vary the content of the DB. Consequently, it's imperative to be able to share our exact data with each other, so we can avoid repeating the lengthy process of importing data from external sources and generating embeddings. You can use pg_dump to do so:

docker exec nfhm_devcontainer-postgres-1 bash -c "pg_dump -U postgres nfhm" > <FILE_NAME>.pgsql

Then zip it up and share it.
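
To restore a shared dump, a teammate can mirror the Less Quick Start steps above. A sketch (replace the container name and <FILE_NAME> as appropriate):

# Copy the dump into the Postgres container and import it.
docker cp <FILE_NAME>.pgsql nfhm_devcontainer-postgres-1:/tmp/import.pgsql
docker exec -it nfhm_devcontainer-postgres-1 psql -U postgres -d nfhm -f /tmp/import.pgsql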