This is a closed-domain question-answering bot that answers questions about medical research using PubMed articles.
I am developing this as part of my mentorship with SharpestMinds.
Article information is extracted from the NIH baseline files. These are XML files, from which we extract the article ID, PubMed URL, and abstract and store them in SQLite for later use.
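The extraction step can be sketched as below. This is a minimal illustration, not the project's actual code: the table name `articles`, the helper names, and the tiny inline `SAMPLE_XML` are assumptions made for the example. The element paths (`PubmedArticle`, `MedlineCitation/PMID`, `AbstractText`) follow the public PubMed XML layout. Real baseline files are gzipped, so in practice they would likely be opened with `gzip.open` and parsed with `ET.parse`.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Tiny stand-in for a baseline file (illustrative data only)
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">12345</PMID>
      <Article>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second part.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">67890</PMID>
      <Article></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_articles(root):
    """Yield (pmid, url, abstract) for every article that has an abstract."""
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        parts = article.findall("MedlineCitation/Article/Abstract/AbstractText")
        # Structured abstracts have several AbstractText sections; join them
        abstract = " ".join("".join(p.itertext()).strip() for p in parts)
        if pmid and abstract:
            yield pmid, f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/", abstract


def store_articles(conn, rows):
    """Create the (hypothetical) articles table if needed and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles "
        "(id TEXT PRIMARY KEY, url TEXT, abstract TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO articles VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
store_articles(conn, parse_articles(ET.fromstring(SAMPLE_XML)))
rows = conn.execute("SELECT id, url, abstract FROM articles").fetchall()
print(rows)
```

Articles without an abstract are skipped here, since they would contribute nothing to question answering; whether the project does the same is an assumption.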
PubMed articles are paywalled and need institutional credentials for access, so we use SciHub to scrape the articles. We download the PDF of each article, use PyMuPDF to extract the text, and index it in an ElasticSearch server.
We host a simple ElasticSearch server on a GCP Compute Engine machine. It serves two purposes: first, to store the article texts, and second, to serve as the document store for HayStack.
To host an ElasticSearch server and index it, follow the steps below. These are for an Ubuntu 20.04 instance:
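As a rough sketch of the text-extraction step, the snippet below reads every page of a downloaded PDF with PyMuPDF and collapses the layout whitespace. The function names `clean_text` and `pdf_to_text` are made up for this example, and the whitespace cleanup is an assumption rather than something the project is known to do. `page.get_text()` is the method name in recent PyMuPDF releases; older releases call it `getText()`.

```python
import re


def clean_text(raw):
    """Collapse runs of whitespace left over from PDF layout into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()


def pdf_to_text(pdf_path):
    """Return the cleaned text of every page in the PDF at pdf_path."""
    import fitz  # PyMuPDF; imported lazily so clean_text works without it

    doc = fitz.open(pdf_path)
    try:
        # page.get_text() returns the plain text of a single page
        return clean_text(" ".join(page.get_text() for page in doc))
    finally:
        doc.close()
```

Calling `pdf_to_text` on a downloaded article returns a single string that can be written to the CSV file or indexed directly.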
curl -fsSL | sudo apt-key add -
echo "deb stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list
sudo apt update
sudo apt install elasticsearch python3 python3-pip
sudo pip3 install elasticsearch pandas
After that, run the commands below:
sudo systemctl start elasticsearch
sudo systemctl enable elasticsearch
sudo ufw allow 9200
sudo ufw allow 22
sudo ufw enable
After this, read the article texts from a CSV file and index them. For example, here I download 10,000 articles from a GCS bucket:
gsutil cp gs://pubmedbot/10k_articles.csv .
and index them in a Python shell:
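from elasticsearch import Elasticsearch
from elasticsearch.exceptions import RequestError
import pandas as pd

num_rows = 10000
chunk_size = 50
es = Elasticsearch(timeout=1000)  # connects to localhost:9200 by default
doc_id = 1

# Read the CSV in chunks of 50 rows to keep memory use low
for i in range(0, num_rows, chunk_size):
    df = pd.read_csv('10k_articles.csv', names=['id', 'article'], nrows=chunk_size, skiprows=i)
    # Skip empty or near-empty article texts
    df = df[df['article'].str.len() > 4]
    for c in df['article'].values:
        doc = {'text': c}
        try:
            es.index(index='main', id=doc_id, body=doc)
            doc_id += 1
        except RequestError as e:
            # Log the failure and continue with the next article
            with open('error.log', 'a+') as f:
                f.write(f'{doc_id}: {e}\n')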
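Indexing one document per request is slow for larger collections. As an alternative sketch (not what the project itself does), the official Python client ships a bulk helper, `elasticsearch.helpers.bulk`, that sends many documents per request. The function names `make_actions` and `bulk_index` are made up for this example; the index name `main` and the length filter mirror the loop above.

```python
def make_actions(texts, index="main", start_id=1, min_len=5):
    """Yield one bulk action per article text, skipping very short texts."""
    doc_id = start_id
    for text in texts:
        # Same filter as the loop above: keep texts longer than 4 characters
        if isinstance(text, str) and len(text) >= min_len:
            yield {"_index": index, "_id": doc_id, "_source": {"text": text}}
            doc_id += 1


def bulk_index(es, texts):
    """Send all actions through the bulk helper; returns (successes, errors)."""
    from elasticsearch import helpers  # imported lazily; needs the elasticsearch package

    return helpers.bulk(es, make_actions(texts), raise_on_error=False)


# Preview the actions without contacting a server
actions = list(make_actions(["first article text", "abc", None, "second article text"]))
print(actions)
```

With a running server, `bulk_index(es, df['article'].values)` would replace the inner loop. With `raise_on_error=False`, failed documents are returned in the error list instead of raising.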
After indexing ElasticSearch, we need to modify the config file:
sudo nano /etc/elasticsearch/elasticsearch.yml
Change the following parameters and restart the server:
discovery.seed_hosts: []
HayStack is used to choose the best model for the application.
Customers are presented with an optional field to give feedback and mark the correct answer. This feedback is sent to the server to fine-tune the model.
We created about 50 question-answer-context pairs from different articles for validation.
- USE Similarity: We use spaCy and the Universal Sentence Encoder to measure the similarity between the labels and the predictions.
- BLEU score: We also use the Bilingual Evaluation Understudy (BLEU) score to measure correctness.
- Manual Validation: For some iterations, we check the answers manually and mark the accuracy by hand.
We have included a Dockerfile to deploy the Dash app. Set the following environment variables when running docker run. These are necessary to identify the ElasticSearch server in GCP.
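To make the two automatic metrics concrete, here is a small pure-Python sketch. It is not the project's evaluation code, which may rely on a library implementation: `bleu` is a simplified sentence-level BLEU (single reference, uniform weights, no smoothing), and `cosine_similarity` shows how two embedding vectors, such as those produced by the Universal Sentence Encoder, are typically compared. Both function names are chosen for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference, prediction, max_n=4):
    """Sentence-level BLEU with one reference, uniform weights and no smoothing."""
    ref, pred = reference.lower().split(), prediction.lower().split()
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        total = sum(pred_counts.values())
        # Clip each predicted n-gram count by its count in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes predictions shorter than the reference
    bp = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return bp * math.exp(sum(log_precisions) / max_n)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


print(bleu("aspirin reduces fever", "aspirin reduces pain", max_n=2))
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

Short answers often contain fewer than four tokens, in which case unsmoothed 4-gram BLEU is zero; lowering `max_n` or adding smoothing is a common workaround.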