This repository contains the attention-visualization component from exBERT together with a minimal server that does not support corpus indexing or search by embedding.
On a slow internet connection this app will outperform full exBERT, since significantly less information (such as embeddings and results from FAISS searches) needs to be sent over the REST API.
You can install the environment needed to run the server with conda:

```bash
conda env create -f environment.yml
```

This will create an environment named `exformer`.
To start the server for development, run:

```bash
uvicorn server.main:app --reload
```

This will auto-reload the server whenever you make changes. You can deploy with:

```bash
uvicorn server.main:app
```
Unfortunately, because of the sheer number of models that must be loaded into memory, we cannot use gunicorn to start many separate Python processes. Instead, we needed a solution that shares the memory holding these loaded models while still providing production-like capabilities across multiple processes. To this end, we switched from `connexion` to `fastapi`.
A few gotchas:
- Do not put any characters that would need to be URL-escaped into the main path (i.e., not query) parameters. FastAPI doesn't handle these well.
- FastAPI tries to be smart about paths: it will not match the `{path}` in `@app.get("/serve/{path}")` to any string containing the character `/`. This edge case is documented here.
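To decide whether a value belongs in a path or a query parameter, a quick standard-library check can help. The `needs_escaping` helper below is a hypothetical illustration, not part of this repo:

```python
from urllib.parse import quote

def needs_escaping(value: str) -> bool:
    # True if `value` would require percent-escaping inside a URL path
    # segment (spaces, "/", "?", "#", non-ASCII, ...). quote() with
    # safe="" escapes everything outside the always-safe character set.
    return quote(value, safe="") != value

print(needs_escaping("bert-base"))       # False: fine as a path parameter
print(needs_escaping("weights/layer1"))  # True: contains "/", pass it as a query parameter instead
```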
Load balancing inside Python across multiple CPUs is difficult because of the GIL. To work around this, we tried `ray` to load the models onto separate CPUs while still sharing the memory in RAM.
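The memory-sharing idea can be sketched with the standard library alone: fork-based worker processes see a table that was loaded once in the parent via copy-on-write memory, without each process re-loading it. This is only an illustration of the concept, not the repo's actual approach (which tried ray rather than raw multiprocessing); the `MODELS` table and `query_models` helper are made up:

```python
import multiprocessing as mp

# Hypothetical stand-in for models loaded once at startup; in the real
# server these would be large transformer checkpoints held in RAM.
MODELS = {"bert-base": {"n_params": 110_000_000}}

def _worker(name: str, out) -> None:
    # A forked child sees MODELS without reloading it: fork gives the
    # child a copy-on-write view of the parent's memory.
    out.put(MODELS[name]["n_params"])

def query_models(n_workers: int = 2):
    ctx = mp.get_context("fork")  # fork (not spawn) is what makes this cheap
    out = ctx.Queue()
    procs = [ctx.Process(target=_worker, args=("bert-base", out))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    results = [out.get() for _ in procs]
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    print(query_models())  # each worker reads the shared table
```

ray generalizes this pattern: actors hold the loaded models and other processes query them, instead of every web worker paying the full load cost.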
The compiled versions of the frontend are already included in the `client/dist` folder. You can get set up to develop on the frontend as follows:

```bash
cd client/src
npm install
npm run watch
```
This will allow you to change the TypeScript files and see the changes in your browser on refresh.