👁️👄👁️ Natural language document search. Given a topic query, find the n
most similar documents.
A DistilBERT or SciBERT model is used to embed the query and the documents.
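A minimal sketch of the ranking step, assuming the query and documents have already been embedded. Cosine similarity is an assumption here (the distance metric is still an open experiment in the list below), and the function names `cosine_similarity` and `top_n` are hypothetical, not taken from app.py. The toy 3-dimensional vectors stand in for real model embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_n(query_vec, doc_vecs, n=3):
    """Return (index, score) pairs for the n most similar documents."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Toy "embeddings"; real ones would come from DistilBERT or SciBERT.
docs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.1, 0.0]
results = top_n(query, docs, n=2)
```

With the real embedding file, the same ranking would more likely be done as a single batched tensor operation rather than a Python loop.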
To run, first install the requirements in your virtual environment:
pip install -r requirements.txt
Then start the app:
streamlit run app.py
Type in your query and hit Cmd/Ctrl+Enter.
Alternatively, you can use the manifest and Procfile to push to your PaaS platform.
You'll need the metadata (metadata.json) and embedding (doctensor.pt) files. Ask me :)
DistilBERT model taken from this 🤗 Hugging Face repo.
SciBERT model taken from this 🤗 Hugging Face repo.
Improve retrieval performance
- Try using a different embedding layer (2nd to last?)
- Try with Sentence Transformers
- Try out full BERT, Doc2Vec, GPT-Neo, Word2Vec, and an ensemble.
- Experiment with mixture of experts model, e.g. bio & CS papers handled by SciBERT or BioBERT.
- Sort out why BERT models prefer small/short abstracts. (padding?)
- Experiment: find an optimal distance metric.
- Experiment: break abstracts into 75-word chunks. Take maximally related chunks.
- Fine-tune on grant data.
- Add an option to use SciBERT instead of DistilBERT.
- Experiment: use full abstract embeddings instead of sentence embeddings.
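The 75-word chunking experiment above could look roughly like this: split each abstract into fixed-size word chunks, score every chunk against the query, and keep the best chunk score for the document. This is a sketch under assumptions. `chunk_words`, `best_chunk_score`, and `make_overlap_scorer` are hypothetical names, and the word-overlap scorer is only a stand-in for embedding the chunk and comparing it with the query embedding.

```python
def chunk_words(text, size=75):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def best_chunk_score(abstract, score_fn, size=75):
    """Score each chunk with score_fn and return the maximum."""
    chunks = chunk_words(abstract, size)
    if not chunks:
        return 0.0  # empty abstract: nothing to score
    return max(score_fn(chunk) for chunk in chunks)


def make_overlap_scorer(query):
    """Placeholder scorer: fraction of query words present in the chunk."""
    query_words = set(query.lower().split())

    def score(chunk):
        found = query_words & set(chunk.lower().split())
        return len(found) / max(len(query_words), 1)

    return score


# 100 filler words followed by the relevant phrase, so the match
# lands in the second chunk only.
abstract = "filler " * 100 + "protein folding"
scorer = make_overlap_scorer("protein folding")
score = best_chunk_score(abstract, scorer)
```

Taking the maximum means one highly relevant passage can surface a long abstract. Whether that helps or hurts with the short-abstract bias noted above is exactly what the experiment would need to show.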
Classify into categories
- Add active learning classification step. (Important)
Other
- Add year filters.
- Deploy to prod.
- Add minimum word count (~100 = 85% of abstracts).
- Add sparklines.
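The year filter and minimum word count could be applied to the metadata before ranking. A rough sketch, assuming each metadata record is a dict with `abstract` and `year` keys. Those field names are guesses, since the actual layout of metadata.json is not described here, and `filter_records` is a hypothetical helper.

```python
def filter_records(records, min_words=100, year_from=None, year_to=None):
    """Keep records with a long enough abstract inside the year range."""
    kept = []
    for rec in records:
        # Drop abstracts shorter than the minimum word count.
        if len(rec.get("abstract", "").split()) < min_words:
            continue
        year = rec.get("year")
        # A record with no year is dropped only when a year bound is set.
        if year_from is not None and (year is None or year < year_from):
            continue
        if year_to is not None and (year is None or year > year_to):
            continue
        kept.append(rec)
    return kept


# Toy records; the keys are assumptions, not the real metadata schema.
records = [
    {"title": "A", "abstract": "word " * 120, "year": 2019},
    {"title": "B", "abstract": "word " * 40, "year": 2020},
    {"title": "C", "abstract": "word " * 150, "year": 2015},
]
recent_long = filter_records(records, min_words=100, year_from=2018)
```

The indices of the kept records would also need to be used to select the matching rows of the embedding tensor, so that metadata and embeddings stay aligned.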