/ICD-BERT

ICD-BERT: Multi-label Classification of ICD-10 Codes with BERT (CLEF 2019)

Primary LanguagePythonMIT LicenseMIT

MLT-DFKI at CLEF eHealth Task 1: Multi-label Classification with BERT

Code for our submission at CLEF eHealth Task 1: Multilingual Information Extraction. For details, check here.

Requirements

If you're using new trasnformers library, then it is recommended to create virtual environment as this code was written with the older version (note there will be no issues even if both versions co-exist):

pip install pytorch-pretrained-bert

For migration to new library, look here. For baseline experiments, install scikit-learn as well.

Data

Raw data can be found under exps-data/data/*.txt (this was provided by task organizers).

Pre-preprocessed data can be found under exps-data/data/{train, dev, test}_data.pkl as pickled files. English translations are also provided for reproducibility (Google Translate API was used to get translations).

ICD-10 Metadata can be found under exps-data/codes_and_titles_{de, en}.txt, where each line is tab delimited as [ICD Code Description] \t [ICD Code].

Pre-trained Models

For static word embeddings, we used English and German vectors provided by fastText. For domain specific vectors, we used PubMed word2vec (only for English).

For contextualized word embeddings, BERT-base-cased and BioBERT for English and Multilingual-BERT-base-cased for German.

Store all the models under a directory MODELS.

Running BERT Models

Set the path export BERT_MODEL=$MODELS/pubmed_pmc_470k (e.g. BioBERT).

Convert TF checkpoint to PyTorch model

This script is provided by transformers library, but there might be some changes with new version so it is recommended to use the one installed with pytorch-pretrained-bert:

python convert_tf_checkpoint_to_pytorch.py \
    --tf_checkpoint_path $BERT_MODEL/biobert_model.ckpt \
    --bert_config_file $BERT_MODEL/bert_config.json \
    --pytorch_dump_path $BERT_MODEL/pytorch_model.bin
Fine-tune the model

Configure the paths:

export DATA_DIR=exps-data/data
export BERT_EXPS_DIR=tmp/bert-exps-dir
export CUDA_VISIBLE_DEVICES=0,1,2,3

Run the model:

python bert_multilabel_run_classifier.py \
    --data_dir $DATA_DIR \
    --use_data en \
    --bert_model $BERT_MODEL \
    --task_name clef \
    --output_dir $BERT_EXPS_DIR/output \
    --cache_dir $BERT_EXPS_DIR/cache \
    --max_seq_length 256 \
    --num_train_epochs 20.0 \
    --do_train \
    --do_eval \
    --train_batch_size 64

BERT English models (BioBERT, BERT-base-cased) results can be reproduced by 20 epochs and for multilingual BERT, with 25 epochs.

Inference

Run predictions (change files to test/dev manually in processor):

python bert_multilabel_run_classifier.py \
    --data_dir $DATA_DIR \
    --use_data en \
    --bert_model $BERT_EXPS_DIR/output \
    --task_name clef \
    --output_dir $BERT_EXPS_DIR/output \
    --cache_dir $BERT_EXPS_DIR/cache \
    --max_seq_length 256 \
    --do_eval 
Evaluate

Use official evaluation.py script to evaluate:

python evaluation.py --ids_file=$DATA_DIR/ids_development.txt \
                     --anns_file=$DATA_DIR/anns_train_dev.txt \
                     --dev_file=$BERT_EXPS_DIR/output/preds_development.txt \
                     --out_file=$BERT_EXPS_DIR/output/eval_output.txt

Running Other Models

Change configurations here (no CLI yet). Main parameters are:

lang: can be one of {en, de}

load_pretrain_ft: whether to use fastText pre-trained embeddings, works for both languages.

load_pretrain_pubmed: whether to use PubMed embeddings, works for English only.

pretrain_file: path to pre-trained vectors, should be one of path/to/cc.{en, de}.300.vec when load_pretrain_ft=True and path/to/pubmed2018_w2v_400D.bin when load_pretrain_pubmed=True.

model_name: name of the model; can be one of {cnn, han, slstm, clstm}.

For other hyperparameters, check here.

After all the models have been tested and results placed under one directory (one has to manually check the folder names), use predict.py to reproduce the numbers found in Results.txt.

Citation

If you find our work useful, please consider citing:

@inproceedings{amin2019mlt,
  title={MLT-DFKI at CLEF eHealth 2019: Multi-label Classification of ICD-10 Codes with BERT},
  author={Amin, Saadullah and Neumann, G{\"u}nter and Dunfield, Katherine Ann and Vechkaeva, Anna and Chapman, Kathryn Annette and Wixted, Morgan Kelly},
  booktitle={Proceedings of the 20th Conference and Labs of the Evaluation Forum (Working Notes)},
  url = {http://ceur-ws.org/Vol-2380/paper_67.pdf},
  pages = {1--15},
  year = {2019}
}