The goal of this project is to build systems for segmenting clinical questions to facilitate data mining. We are building a proprietary dataset (with protected health information) but we will also run on datasets that are publicly available if they contain the annotations we need.
Dependencies are listed in pip. python3.6 and pytorch 0.4.1 are minimum requirements for the flair library we depend on. To install flair, you should checkout that repo (https://github.com/zalandoresearch/flair) somewhere else and run:
pip install -e <flair directory>
Download the CHQA dataset: https://lhncbc.nlm.nih.gov/project/consumer-health-question-answering (Question type data) and:
mkdir chqa
python chqa2conll.py <question type xml file> chqa
This will create a file called focus.conll in the chqa directory. To train the flair models we need to split this into a train, test, and dev set. Here is one way to do so:
tail -5651 focus.conll > test.txt
head -16895 focus.conll > train.txt
head -22499 focus.conll | tail -5604 > dev.txt
Download the QDE dataset: https://lhncbc.nlm.nih.gov/publication/pub9272 (Question decomposition data) and run:
unzip <qde zip file download>
mkdir qde
python qdecomp2conll.py <unzipped qde directory> qde
This will create a file called qde.conll in the qde directory. To split into train, test, and dev set:
head -37981 qde.conll > train.txt
tail -10891 qde.conll > test.txt
tail -16280 qde.conll | head -5389 > dev.txt
Without changes, the default script runs a sequence model with glove embeddings. See the flair repo for how to modify the model to run with more sophisticated pre-trained/contextualized embeddings. The first run with any new model may take a while to download the model resources.
python flair_train_chqa_focus.py chqa
If you want to run on the chqa dataset, otherwise replace chqa with qde (or whatever directory contains the question decomposition data).
This script will save its best model in resources/taggers/temp/best-model.pt
, as hard-coded in the script. Modify the script to change the file save location.
This simple script will read lines from standard input and print out tag predictions for each token:
cat <file-with-lines> | python flair_tag_lines.py <model file>