Unsupervised clinical notes extraction
Implementation of methods used in the "Unsupervised extraction, labelling and clustering of segments from clinical notes" paper (preprint). Published at IEEE BIBM 2022.
How to run
- The `make_segments.py` script segments the clinical notes.
  - inputs: as we cannot share our dataset, you need to implement your own loading function. The script expects a pandas DataFrame with a MultiIndex `("record_id", "patient_id", "record_number")` and a single column called `"text"`.
  - outputs: it writes two files into the `dataset` folder:
    - `dataset/parts.feather`, which contains the individual note segments
    - `dataset/titles.feather`, which contains normalized titles and their frequencies
- You can now train the vector-based methods (LSA and doc2vec) using the `train_vectors.py` script. It should automatically read the files from the `dataset` folder. It writes predictions into the `predictions` folder.
- To train the Bi-LSTM and RobeCzech models, first create a Huggingface dataset using the `make_hf_dataset.py` script. It should automatically read the files from the `dataset` folder. It creates two folders in the `dataset` folder: `train.hf` and `test.hf`.
- You can now train the Bi-LSTM and RobeCzech models (`train_bilstm.py` and `train_robeczech.py`). They should automatically load the HF dataset. They write predictions into the `predictions` folder.
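Since the loading function is left to the user, here is a minimal sketch of one way to produce the input shape that `make_segments.py` is described as expecting. The function name `load_notes`, its `rows` parameter, and the toy records are hypothetical and not part of the repository; only the index names and the `"text"` column come from the description above.

```python
import pandas as pd


def load_notes(rows):
    """Hypothetical loader: build a DataFrame in the shape described above.

    `rows` is an iterable of (record_id, patient_id, record_number, text)
    tuples; where they come from (CSV, database, ...) is up to you.
    """
    df = pd.DataFrame(
        rows, columns=["record_id", "patient_id", "record_number", "text"]
    )
    # Three-level MultiIndex, leaving "text" as the only column.
    return df.set_index(["record_id", "patient_id", "record_number"])


# Toy, made-up records for illustration only.
example = load_notes([
    (1, 10, 0, "first note text"),
    (2, 10, 1, "second note text"),
])
```

How the resulting DataFrame is handed to the segmentation code depends on how the script is organised, so check `make_segments.py` before wiring this in.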
Adapting for different datasets / languages
- the segmentation function probably needs to be tweaked to fit your dataset formatting (`cut_record` function inside `make_segments.py`)
- the title normalization function might need to be tweaked for some languages (`normalize_title` function inside `make_segments.py`)
- you may want to use a different tokenization function for the vector methods (`tokenize_doc` function in `train_vectors.py`)
- you may want to use a different Huggingface transformer model and tokenizer. Make sure that the model is compatible with the tokenizer.
  - change the `AutoTokenizer` in `make_hf_dataset.py`
  - change the `AutoModelForSequenceClassification` in `train_robeczech.py`
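To illustrate the kind of language-specific tweak a title normalizer may need, the sketch below lowercases a title, optionally strips diacritics, and collapses punctuation. It is not the repository's `normalize_title`. The name `normalize_title_sketch` and the `strip_diacritics` flag are hypothetical, and the right rules depend on your language and note format.

```python
import re
import unicodedata


def normalize_title_sketch(title: str, strip_diacritics: bool = True) -> str:
    """Hypothetical normalizer; NOT the repository's normalize_title."""
    text = title.strip().lower()
    if strip_diacritics:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Replace runs of characters that are not letters, digits or
    # underscores with a single space (\w is Unicode-aware for str).
    text = re.sub(r"[^\w]+", " ", text)
    return text.strip()
```

Stripping diacritics merges spelling variants of the same title, but in some languages it can also merge distinct words, which is why it is left as a switch here.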
Interactive visualisation
We use the Tensorboard embedding projector to visualise the vector space of the 2078 extracted titles. It is available here.
The bookmarks (bottom right) contain 3 presets:
- results from clustering
- neighbours of the comorbidities title
- neighbours of the medication title
All of them use 2D T-SNE dimensionality reduction.