Natural language document search. Get a clear view into things.

Insight

👁️👄👁️ Natural language document search. Given a topic query, find the n most similar documents.
A DistilBERT or SciBERT model is used to embed the query and the documents.
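
The ranking step can be sketched roughly as follows. This is a minimal illustration, not the code in app.py: it assumes the query and documents have already been embedded into plain lists of floats, and it assumes cosine similarity as the metric (the README does not say which metric is used). The names cosine_similarity and top_n_similar are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)

def top_n_similar(query_vec, doc_vecs, n=3):
    """Return (doc_index, score) pairs for the n highest-scoring documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Toy 3-dimensional vectors standing in for real model embeddings (assumption).
docs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.1, 0.0]
results = top_n_similar(query, docs, n=2)
```

With real embeddings the same ranking would normally be done as a single batched tensor operation rather than a Python loop; the loop is used here only to keep the sketch dependency-free.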

🏡 Getting Started

To run, first install the requirements in your virtual environment:

pip install -r requirements.txt

Then start the app with streamlit run app.py, type in your query, and press Cmd/Ctrl+Enter.

Alternatively, you can use the manifest and Procfile to push to your PaaS platform.

You'll need the metadata (metadata.json) and embedding (doctensor.pt) files. Ask me :)

🔗 Links

DistilBERT model taken from this 🤗 Hugging Face repo.

SciBERT model taken from this 🤗 Hugging Face repo.

✔️ TODO

Improve retrieval performance

  • Try using a different embedding layer (2nd to last?)
  • Try with Sentence Transformers
  • Try out full BERT, Doc2Vec, GPT-neo, Word2Vec, and ensemble
  • Experiment with mixture of experts model, e.g. bio & CS papers handled by SciBERT or BioBERT.
  • Sort out why BERT models prefer small/short abstracts. (padding?)
  • Experiment: find an optimal distance metric.
  • Experiment: break abstracts into 75-word chunks. Take maximally related chunks.
  • Fine-tune on grant data.
  • Add an option to use SciBERT instead of DistilBERT.
  • Experiment: use full abstract embeddings instead of sentence embeddings.
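
One way the 75-word chunk experiment could look is sketched below. It splits an abstract into fixed-size word chunks and scores the document by its best-matching chunk. The functions chunk_words, best_chunk_score, and word_overlap are hypothetical, and word_overlap is only a stand-in scorer so the sketch runs without a model; in practice the scorer would compare query and chunk embeddings.

```python
def chunk_words(text, chunk_size=75):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def best_chunk_score(query, abstract, score_fn, chunk_size=75):
    """Return (score, chunk) for the chunk that scores highest against the query."""
    chunks = chunk_words(abstract, chunk_size)
    if not chunks:
        return 0.0, ""
    scored = [(score_fn(query, chunk), chunk) for chunk in chunks]
    return max(scored, key=lambda pair: pair[0])

def word_overlap(query, chunk):
    """Placeholder scorer: fraction of query words that appear in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

# A 160-word toy abstract: the relevant words sit in the second chunk.
abstract = " ".join(["filler"] * 80 + ["protein", "folding"] + ["filler"] * 78)
chunks = chunk_words(abstract)
score, best = best_chunk_score("protein folding", abstract, word_overlap)
```

Scoring by the maximum chunk rather than the whole abstract is one reading of "take maximally related chunks"; averaging the top few chunks would be another option worth comparing.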

Classify into categories

  • Add active learning classification step. (Important)

Other

  • Add year filters.
  • Deploy to prod.
  • Add minimum word count (~100 = 85% of abstracts).
  • Add spark lines.
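
The year filter and minimum word count could be applied to the metadata before ranking, roughly as below. This is a sketch under assumptions: the layout of metadata.json is not documented here, so the "abstract" and "year" keys and the passes_filters helper are hypothetical.

```python
def passes_filters(record, min_words=100, year_from=None, year_to=None):
    """Return True if a metadata record meets the word-count and year limits."""
    # Assumed record shape: {"abstract": str, "year": int or missing}.
    if len(record["abstract"].split()) < min_words:
        return False
    year = record.get("year")
    if year_from is not None and (year is None or year < year_from):
        return False
    if year_to is not None and (year is None or year > year_to):
        return False
    return True

# Toy records standing in for entries from metadata.json (assumption).
records = [
    {"abstract": "too short", "year": 2020},
    {"abstract": " ".join(["word"] * 120), "year": 2010},
    {"abstract": " ".join(["word"] * 120), "year": 2019},
]
kept = [r for r in records if passes_filters(r, min_words=100, year_from=2015)]
```

Records with no year are dropped whenever a year bound is set; whether that is the right default depends on how complete the metadata is.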