Health disparity research studies the health outcomes of people from disadvantaged identities and backgrounds. One problem is that many researchers who study health disparities eventually move on to other areas of health research because of a lack of NIH funding. Additionally, many of these researchers are members of disadvantaged communities themselves, which could indicate some form of discrimination in the funding selection process.
In this project, we want to see if the data reflects the trend of researchers moving away from studying health disparities by analyzing the articles about health disparity on PubMed. Our role in this project is to procure the data needed for this analysis. One way to retrieve these articles is to take articles matching the search term "disparity" on PubMed. However, this does not always return results relevant to health disparity, and may include articles about other kinds of disparity. To get around this, we manually annotate the "relevance" of each article and train a model to predict whether or not a given article is about health disparity.
Main folder:
sirius.bc.edu:/data/yangael/pubmed
Articles by relevant authors:
sirius.bc.edu:/data/yangael/pubmed/data/relevant_authors_articles/saved_articles
- Get PubMed articles matching the keywords "disparity", "inequity", and "inequality"
- Manually annotate 2,000 relevant articles and 2,000 irrelevant articles
- Train a model on our annotated data to predict the relevance of a given article
- Retrieve all PubMed articles matching our three search terms
- Run the model on these articles to predict their relevance
- Get the authors for the articles predicted to be relevant
- Search PubMed for all other articles by these authors
- Run the model on all these articles to predict their relevance
- Setup
- Datasets
- ML Classification Models
- BERT Classification Models
- Searching for Relevant Articles
- Submitting Jobs to the Cluster
First, we want to clone this repository on the cluster. Run the following commands on Terminal:
ssh -p 22022 BC_USERNAME@sirius.bc.edu
git clone https://github.com/yingyangle/pubmed.git
cd pubmed
To set up the conda environment for our scripts, run the following commands on the cluster from the main pubmed folder after cloning it. I named the environment tf, but you can name it something else if you want.
module load anaconda/3-2018.12-P3.7
conda create -n tf python=3.9.6
conda activate tf
Runtime: 2 min.
To install the necessary packages, make sure your conda environment is activated and run the following script.
pip install -r tools/requirements.txt
Runtime: 5 min.
You can use the following script to double check that all the package downloads went smoothly.
python tools/import.py
Runtime: 5 min.
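The contents of tools/import.py aren't shown here, but a check of this kind could look like the sketch below. The function name and the package names passed to it are assumptions for illustration; in practice you would pass the packages listed in tools/requirements.txt.

```python
import importlib


def find_missing(packages):
    """Return the names in `packages` that cannot be imported."""
    missing = []
    for name in packages:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Hypothetical list: two standard-library modules and one made-up name.
print(find_missing(["json", "csv", "not_a_real_package_for_demo"]))
# ['not_a_real_package_for_demo']
```

An empty list means every package imported cleanly.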
Finally, we want to copy over the files that were too large to be uploaded to GitHub. I changed my folder permissions so you should be able to access my files as long as you're in the prudhome user group, but please let me know if you have trouble accessing anything! Make sure to run the following commands from your cloned pubmed folder.
cp /data/yangael/pubmed/PubMed-and-PMC-w2v.bin .
cp /data/yangael/pubmed/results/predictions* results/
cp -r -n /data/yangael/pubmed/data/* data/
cp -r -n /data/yangael/pubmed/bertdata/* bertdata/
cp -r -n /data/yangael/pubmed/saved_models/* saved_models/
Runtime: ~2.5 hours
Any files that already exist in your directory won't be overwritten by the ones in mine. If you want your files to be overwritten by mine, you can remove the -n flag.
If you've already set up the environment and files, you can skip most of the previous steps and just make sure to activate the environment before running anything. Also make sure to include this line in your .pbs files.
conda activate tf
First, here's an overview of our annotated datasets. All of our annotated data is located in the /data folder. You'll find our annotated data saved as two types of files:
annotations*.csv
- These files contain only the relevance annotations for each article as well as its PubMed ID and article title.
article_info*.csv
- These files contain all the metadata for each article, including PubMed ID, article title, abstract, authors, and publication date.
You'll also see numbers such as 1, 2, and 1+2 in the filenames for the files mentioned above. These indicate our different batches of annotations.
- Dataset 1 was annotated by Prof. Prud'hommeaux's colleagues.
- Dataset 2 was annotated by me (Christine).
- Dataset 1+2 just combines all the annotations from Dataset 1 and Dataset 2.
Now let's look at the process for annotating articles.
First, we need to retrieve a list of articles to annotate. To do this, we query PubMed using the following search terms: disparity, inequity, and inequality. Since our goal is just to annotate 2,000 examples of relevant articles and 2,000 examples of irrelevant articles, we can limit our search to the top 10,000 results for each search term for now. To do this, we can set the MAX_RESULTS variable to 10000.
python get_articles.py 'disparity'
python get_articles.py 'inequity'
python get_articles.py 'inequality'
The data for the retrieved articles will be saved as pubmed_articles_*.csv and authors_*.json.
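As a rough illustration of what a query like this involves (not necessarily how get_articles.py does it), the sketch below builds search URLs for NCBI's public E-utilities ESearch endpoint. The function name build_search_url and the choice of JSON output are assumptions; the sketch only builds the URLs and does not contact PubMed.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (public API).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
MAX_RESULTS = 10000  # same cap as the annotation step uses


def build_search_url(term, max_results=MAX_RESULTS):
    """Build an ESearch URL asking PubMed for up to `max_results` article IDs."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": max_results,
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urlencode(params)


for term in ["disparity", "inequity", "inequality"]:
    print(build_search_url(term))
```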
Then, we can combine all the results from our search terms into one .csv file for easy annotating.
python combine_articles.py
This script combines all the pubmed_articles_*.csv files into a single file, to_annotate.csv, and combines all the authors_*.json files into authors.json. It also excludes any articles that have already been annotated in a given existing annotations file (e.g. annotations1.csv). This existing annotations file can be set with the EXISTING_ANNOTATIONS_FILE variable.
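The merge-and-exclude logic can be sketched as follows. This is a minimal illustration rather than the actual combine_articles.py: the column name pmid, the function name, and the toy CSV contents are all assumptions.

```python
import csv
import io


def combine_articles(result_files, existing_ids):
    """Merge search results, dropping duplicates and already-annotated IDs."""
    seen = set(existing_ids)
    combined = []
    for handle in result_files:
        for row in csv.DictReader(handle):
            pmid = row["pmid"]  # hypothetical column name
            if pmid in seen:
                continue  # duplicate or already annotated
            seen.add(pmid)
            combined.append(row)
    return combined


# Toy stand-ins for two pubmed_articles_*.csv files.
disparity = io.StringIO("pmid,title\n1,Article A\n2,Article B\n")
inequity = io.StringIO("pmid,title\n2,Article B\n3,Article C\n")

# Article 1 is treated as already annotated, so it is left out.
to_annotate = combine_articles([disparity, inequity], existing_ids={"1"})
print([row["pmid"] for row in to_annotate])  # ['2', '3']
```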
Here are the labels we use in our annotation:
- 0 = irrelevant
- 1 = relevant
- 3 = unsure
To annotate the data, go through the articles in to_annotate.csv and mark each one with one of the labels above.
After annotating the data, we want to format our data into annotations*.csv and article_info*.csv files.
python split_annotations_info.py
Then we can also combine the annotations from all annotation batches.
python combine_annotations.py
This will combine annotations1.csv and annotations2.csv into annotations1+2.csv, and do the same thing for the article_info*.csv files.
After annotating our data, we can try training a model on our annotations to predict the relevance of a given article based on its title or abstract, or both concatenated together. We can first start with trying some classical machine learning algorithms.
The classification models we test include:
- Logistic Regression
- K Neighbors (k=3)
- Linear SVM
- RBF SVM
- Gaussian Naive Bayes
- Gaussian Process
- Decision Tree
- Random Forest
- Ada Boost
- MLP Neural Net
- Quadratic Discriminant Analysis (QDA)
We can use existing word2vec embeddings trained specifically on PubMed and PMC to convert our text input to vectors. These embeddings are saved in the root folder as PubMed-and-PMC-w2v.bin. Our input and output for the models will look something like this:
Input: w2v embedding of the article's title, abstract, or both
Output: whether or not the article is relevant to health disparities
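One common way to turn a whole title or abstract into a single vector is to average the vectors of its words. The sketch below shows that idea with toy 2-dimensional vectors; averaging is an assumption here, and ml.py may pool the word vectors differently. The function name embed_text and the toy vectors are made up for illustration.

```python
def embed_text(text, vectors, dim):
    """Average the vectors of the known words in `text`."""
    words = [w for w in text.lower().split() if w in vectors]
    if not words:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


# Toy 2-dimensional "embeddings". The real ones would be loaded from
# PubMed-and-PMC-w2v.bin, for example with gensim's
# KeyedVectors.load_word2vec_format(path, binary=True).
toy_vectors = {"health": [1.0, 0.0], "disparity": [0.0, 1.0]}

title = "Health disparity in rural clinics"
print(embed_text(title, toy_vectors, dim=2))  # [0.5, 0.5]
```

Words missing from the vocabulary are skipped, so only "health" and "disparity" contribute to the average in this example.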
To train and evaluate our models, we can run the following script:
python ml.py INPUT_TYPE DATASET
Runtime: 1-7 min.
The evaluation results will be saved in PubMed_ML_Models.csv and graphed as results/w2v_classification*.png.
The files for training and evaluating BERT classification models trained using keras are located in the root pubmed folder.
In order to run the scripts to train and evaluate BERT models, we first need to correctly format the data directory for the train and test data by running format_bert_data.py. This will take each of the data/annotations*.csv and data/article_info*.csv files and format them as sublists as described in the sample table above. It will automatically create formatted datasets for all the datasets.
python format_bert_data.py DATASET_TYPE
Runtime: 2-15 min. (depending on the script arguments)
Available DATASET_TYPE options include (you can also run the script with no arguments to see a list of options):
- split - prepares data for an 80/20 train/test split
- cv - prepares data for 5-fold cross validation
- full - prepares data for using all data as training data
- unannotated - prepares unannotated data for being predicted
If you run the script with DATASET_TYPE='unannotated', you'll need to add the following two arguments, or run it on the command line rather than submitting a job so that you can be prompted to fill in these variables.
python format_bert_data.py 'unannotated' ARTICLE_INFO NICKNAME
where ARTICLE_INFO is the path for a file containing the article info for the articles you want to include in the data set. The file must contain the PubMed ID, title, and abstract for each article. The NICKNAME variable is what you want to name this unannotated dataset (the result will look like bertdata/bertdata_NICKNAME).
The resulting data directory will look something like this:
/bertdata/bertdata_*
/train
/relevant
291.txt
...
/irrelevant
72.txt
...
/test
/relevant
6.txt
...
/irrelevant
103.txt
...
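To make the layout concrete, here is a small sketch that writes toy examples into the same train/test and relevant/irrelevant folders using an 80/20 split. It is not the actual format_bert_data.py; the function name, the shuffling seed, and the toy examples are assumptions. A one-file-per-article layout like this is the kind that Keras's text_dataset_from_directory utility can read, though whether our scripts load it that way is also an assumption.

```python
import random
import tempfile
from pathlib import Path


def write_split(examples, out_dir, test_fraction=0.2, seed=0):
    """Write (article_id, text, label) examples into train/test label folders."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_test = int(len(examples) * test_fraction)
    for i, (article_id, text, label) in enumerate(examples):
        split = "test" if i < n_test else "train"
        folder = Path(out_dir) / split / label
        folder.mkdir(parents=True, exist_ok=True)
        (folder / f"{article_id}.txt").write_text(text, encoding="utf-8")


# Ten toy examples with alternating labels.
toy = [(i, f"title {i}", "relevant" if i % 2 else "irrelevant") for i in range(10)]
out_dir = tempfile.mkdtemp()
write_split(toy, out_dir)
print(sorted(p.relative_to(out_dir).as_posix() for p in Path(out_dir).rglob("*.txt")))
```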
Before running the following scripts, make sure you've created the correctly formatted dataset directories as described in this step.
In the example commands below, I use a number of placeholder variables which you can adjust to be what you want. Here's a summary of what most of the placeholder variables can be set to:
DATASET = [1, 2, 1+2]
INPUT_TYPE = [title, abstract, title+abstract]
BERT_MODEL = [bert, smallbert, albert, electra, talkingheads, experts_pubmed]
To train a model using an 80/20 train test split:
python bert.py BERT_MODEL INPUT_TYPE DATASET
To train a model using 5-fold cross validation:
python bert_CV.py BERT_MODEL INPUT_TYPE DATASET NUM_FOLDS
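For reference, k-fold cross validation splits the data into NUM_FOLDS parts and uses each part once as the test set while training on the rest. The sketch below shows the index bookkeeping only; it is not taken from bert_CV.py, and a real run would usually shuffle (and often stratify by label) before splitting.

```python
def k_fold_indices(n_items, num_folds=5):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_items))
    for fold in range(num_folds):
        test = indices[fold::num_folds]  # every num_folds-th item, offset by fold
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


for train, test in k_fold_indices(10, num_folds=5):
    print(len(train), len(test), test)
```

Every item lands in exactly one test fold, so the per-fold scores can be averaged into a single estimate.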
Another way to evaluate how well our model might perform on a different set of data is to train it on one dataset and evaluate it on another. This is useful since the datasets we have are annotated by different people, so we can see how our model handles a slightly different set of data.
python bert_eval.py BERT_MODEL INPUT_TYPE TRAIN_DATASET TEST_DATASET
You can also run this script with no arguments to try all the different BERT_MODEL and INPUT_TYPE combinations with all the different TRAIN_DATASET and TEST_DATASET combinations.
python bert_eval.py
The results for this script will be saved in PubMed_BERT_Models_Eval.csv.
Once we've trained some models, we can also use one of these fine-tuned models saved in the /saved_models folder to predict the relevance of some more articles. Before running this, make sure to create the unannotated bertdata/bertdata* folder as described above.
python bert_predict.py UNANNOTATED_DATASET_NICKNAME
After training and evaluating to find our best predictive model, we want to move on to steps (4) through (8) of our procedure and use our model to find relevant authors and articles.
To get a list of all relevant articles, first we want to retrieve all the articles on PubMed matching our search terms (disparity, inequity, inequality), setting the MAX_RESULTS limit to be very high (e.g. 1,000,000) so we can get as many articles as PubMed allows.
python get_articles.py "disparity"
python get_articles.py "inequity"
python get_articles.py "inequality"
This script will save the retrieved articles as data/articles_*.csv.
After retrieving these articles, we want to predict the relevance of each of these articles using the best model we trained.
python bert_predict.py BERT_MODEL INPUT_TYPE TRAIN_DATASET UNANNOTATED_DATASET_NICKNAME
The results will be saved as results/predictions_unannotated_*.csv.
After getting a list of relevant articles, we want to get a list of the authors of these relevant articles so that we can analyze the trajectory of their research.
python get_relevant_authors.py PREDICTIONS_FILE
where PREDICTIONS_FILE is the file containing the prediction results (e.g. 'results/predictions_unannotated_*.csv').
This script uses the author info in data/article_authors.json to get the info for each author, and gets the authors of the relevant articles in data/annotations1+2.csv and PREDICTIONS_FILE. The resulting list of relevant authors and their metadata will be saved in a subfolder as data/relevant_authors_articles/authors_relevant.json.
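The core of this step is a lookup from relevant articles to their authors. The sketch below illustrates it with toy data; the dictionary shapes, the function name, and the author names are made up and will not match the real file formats exactly.

```python
def collect_relevant_authors(predictions, article_authors):
    """Return the sorted set of authors of articles predicted relevant."""
    authors = set()
    for pmid, label in predictions.items():
        if label == 1:  # 1 = relevant, matching the annotation labels
            authors.update(article_authors.get(pmid, []))
    return sorted(authors)


# Toy stand-ins for a predictions file and data/article_authors.json.
predictions = {"101": 1, "102": 0, "103": 1}
article_authors = {
    "101": ["Lee J", "Garcia M"],
    "102": ["Smith A"],
    "103": ["Garcia M", "Chen W"],
}

print(collect_relevant_authors(predictions, article_authors))
# ['Chen W', 'Garcia M', 'Lee J']
```

Using a set means an author who appears on several relevant articles is only listed once.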
Once we've gotten our list of relevant authors, we want to search for all other articles written by these authors.
python search_authors.py
This script will take the list of relevant authors in data/relevant_authors_articles/authors_relevant.json and save the article information for all articles written by each author. The article data will be saved in the folders data/relevant_authors_articles/saved_articles/json and data/relevant_authors_articles/saved_articles/csv, with both folders containing the same data but in different file formats. Each folder contains a .csv or .json file for each author's articles.
The script will also keep a list of authors it has searched in data/relevant_authors_articles/authors_already_searched.json so that you can pick up where you left off if needed. There's also a list of articles that have already been saved in data/relevant_authors_articles/articles_already_saved.json and a list of authors for which the article search failed saved in data/relevant_authors_articles/failed_authors.json, just for reference.
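The "pick up where you left off" behavior can be sketched as a checkpoint file that is rewritten after each author. This is an illustration of the pattern, not the code in search_authors.py; the function names, the fake search function, and the toy author names are assumptions.

```python
import json
import tempfile
from pathlib import Path


def search_all(authors, checkpoint_path, search_fn):
    """Search each author once, recording progress so a rerun can resume."""
    path = Path(checkpoint_path)
    done = set(json.loads(path.read_text())) if path.exists() else set()
    failed = []
    for author in authors:
        if author in done:
            continue  # already handled in an earlier run
        try:
            search_fn(author)
        except Exception:
            failed.append(author)
            continue
        done.add(author)
        path.write_text(json.dumps(sorted(done)))  # save after every author
    return failed


def fake_search(author):
    """Stand-in for a real PubMed author query; fails for one name."""
    if author == "Smith A":
        raise RuntimeError("search failed")


checkpoint = Path(tempfile.mkdtemp()) / "authors_already_searched.json"
print(search_all(["Lee J", "Smith A", "Chen W"], checkpoint, fake_search))  # ['Smith A']
print(json.loads(checkpoint.read_text()))  # ['Chen W', 'Lee J']
```

Running search_all again with the same checkpoint skips the authors already recorded and retries only the rest.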
Finally, once we have a full list of all the articles written by our relevant authors, we can run our best predictive model again to predict the relevance of all these articles we found. Before running this, make sure to create the unannotated bertdata/bertdata* folder as described above.
python bert_predict.py BERT_MODEL INPUT_TYPE TRAIN_DATASET UNANNOTATED_DATASET
The results will be saved as results/predictions_unannotated_*.csv.
To submit a job to the cluster, you can edit the .pbs files in the pubmed folder for convenience (so you don't have to make a bunch of new ones). It's fine to submit multiple jobs to the queue using the same .pbs filename, even if the contents of the files are different.
The go.pbs and misc.pbs files contain example commands for running the different scripts, so you can uncomment whichever script you want to run and change the arguments to what you want. Just make sure to update the walltime and mem settings to be appropriate for the script you're running.
Also make sure to change the email in the second line of the .pbs file to be your email, so that you'll receive email notifications instead of me when the job starts and finishes. Or, you can delete the line entirely if you don't want any notifications.
To make it easier to submit a bunch of jobs to the cluster, I've included a script tools/write_pbs.py so that you can mass-submit jobs for a certain script, running through all the dataset and model combinations you want to try. Make sure to run this script in the main pubmed folder like the previous scripts.
python tools/write_pbs.py ACTION_TO_RUN
Available options for ACTION_TO_RUN include (you can also run the script with no arguments to see a list of options):
- bert_split - runs bert.py for all models for all datasets and input types
- bert_cv - runs bert_CV.py for all models for all datasets and input types
- ml_split - runs ml_models.py for all datasets and input types
Before running this script, you'll also want to make sure to update the walltime and mem settings to be what you want. You can do this by editing the tools/template.pbs file (e.g. nano tools/template.pbs). It's usually better to be safe than sorry, so try to set a walltime that you're pretty sure won't time out (although higher walltimes usually also take longer to get to the front of the queue).
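To give a feel for what mass-generating jobs from a template involves, here is a sketch that fills in a PBS script for every model, input type, and dataset combination. The template text, the walltime and mem values, and the shortened model list are all hypothetical; the real tools/template.pbs and tools/write_pbs.py may look quite different.

```python
from itertools import product
from string import Template

# Hypothetical template; the real tools/template.pbs may differ.
PBS_TEMPLATE = Template(
    "#!/bin/bash\n"
    "#PBS -l walltime=$walltime\n"
    "#PBS -l mem=$mem\n"
    "conda activate tf\n"
    "$command\n"
)

MODELS = ["bert", "smallbert"]  # shortened list for the example
INPUT_TYPES = ["title", "abstract"]
DATASETS = ["1", "2", "1+2"]


def make_jobs(script, walltime="24:00:00", mem="16gb"):
    """Return one filled-in PBS script per model/input/dataset combination."""
    jobs = []
    for model, input_type, dataset in product(MODELS, INPUT_TYPES, DATASETS):
        command = f"python {script} {model} {input_type} {dataset}"
        jobs.append(PBS_TEMPLATE.substitute(walltime=walltime, mem=mem, command=command))
    return jobs


jobs = make_jobs("bert.py")
print(len(jobs))  # 12
print(jobs[0])
```

Keeping walltime and mem as parameters of one template means a single edit applies to every generated job.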
Okay, sometimes you might realize you accidentally submitted a bunch of jobs using the previous script and there was a typo somewhere in your script. Instead of deleting all these jobs one by one, you can use the following script to delete a bunch of consecutive jobs at once:
python tools/delete_jobs.py FIRST_JOB_ID LAST_JOB_ID
Just put the job ID of the first and last job you want to delete, and the script will delete those and all the jobs with ID number in between them.
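The range logic is simple enough to sketch. The version below only builds the commands and prints them, and it assumes that jobs are removed with the PBS qdel command and that job IDs are plain integers (some clusters append a server suffix to the ID). It is not the actual tools/delete_jobs.py, and the job IDs shown are made up.

```python
def build_delete_commands(first_job_id, last_job_id):
    """Return one qdel command per job ID in the inclusive range."""
    if last_job_id < first_job_id:
        raise ValueError("LAST_JOB_ID must not be smaller than FIRST_JOB_ID")
    return [["qdel", str(job_id)] for job_id in range(first_job_id, last_job_id + 1)]


for command in build_delete_commands(1203, 1206):
    # To actually delete, pass each command to subprocess.run(command, check=False).
    print(" ".join(command))
```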
After the jobs complete, you'll get a bunch of output logs like *.pbs.e* and *.pbs.o*. To make it easier to check the success of these jobs, you can run the following script which will print out the names of jobs that were unsuccessful:
python tools/check_jobs.py
Most of our scripts print out RUNTIME: #### at the very end, so this script just checks the *.pbs.o* file to see if it contains this line. Not all scripts have this line though, so double check that the script you're checking prints it.
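The check described above can be sketched like this. It is a minimal illustration rather than the actual tools/check_jobs.py; the function name and the toy log files are made up.

```python
import tempfile
from pathlib import Path


def find_unfinished(log_dir):
    """Return names of *.pbs.o* logs that lack a 'RUNTIME:' line."""
    unfinished = []
    for log in sorted(Path(log_dir).glob("*.pbs.o*")):
        text = log.read_text(encoding="utf-8", errors="replace")
        if "RUNTIME:" not in text:
            unfinished.append(log.name)
    return unfinished


# Two toy logs: one that finished and one that was cut short.
log_dir = Path(tempfile.mkdtemp())
(log_dir / "go.pbs.o101").write_text("training...\nRUNTIME: 312\n")
(log_dir / "go.pbs.o102").write_text("training...\n")
print(find_unfinished(log_dir))  # ['go.pbs.o102']
```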