Medical-Domain-Specific-Language-Models

This repository contains the code used in the following PhD thesis and publications:

  1. Yogarajan, V (2021). Domain-specific Language Models for Multi-label Classification of Medical Text. The University of Waikato. PhD Thesis. (examination process)
  2. Yogarajan, V., Montiel, J., Smith, T., & Pfahringer, B. (2021). Transformers for Multi-label Classification of Medical Text: An Empirical Comparison. In: Tucker A., Henriques Abreu P., Cardoso J., Pereira Rodrigues P., Riaño D. (eds) Artificial Intelligence in Medicine. AIME 2021. Lecture Notes in Computer Science, vol 12721. Springer, Cham.
  3. Yogarajan, V., Gouk, H., Smith, T., Mayo, M., & Pfahringer, B. (2020). Comparing High Dimensional Word Embeddings Trained on Medical Text to Bag-of-Words for Predicting Medical Codes. In Asian Conference on Intelligent Information and Database Systems. Springer, Cham, pp. 97-108.
  4. Yogarajan, V., Montiel, J., Smith, T., & Pfahringer, B. (2020). Seeing The Whole Patient: Using Multi-Label Medical Text Classification Techniques to Enhance Predictions of Medical Codes. arXiv preprint arXiv:2004.00430.
  5. Yogarajan, V., Montiel, J., Smith, T., & Pfahringer, B. (2021). Concatenated BioMed-Transformers for Multi-label Classification of Medical Text. (under submission)
  6. Yogarajan, V., Montiel, J., Smith, T., & Pfahringer, B. (2021). Predicting COVID-19 Patient Shielding: A Comprehensive Study. (under submission).

Multi-label Problems

Classification Problem  Data       L    Inst    Data  L    Inst
ICD-9 Level 1           MIMIC-III  18   52,722  eICU  18   154,808
ICD-9 Level 2           MIMIC-III  158  52,722  eICU  93   154,808
ICD-9 Level 3           MIMIC-III  923  52,722  eICU  316  154,808
Cardiovascular          MIMIC-III  30   28,154  eICU  15   53,477
COVID-19                MIMIC-III  42   35,458  eICU  25   34,387
Fungal or bacterial     MIMIC-III  73   30,814  eICU  42   54,193

Examples

Binary Classification

  1. Binary classification using GRU and pre-trained embeddings
  2. Binary classification using PubMedBERT

Multi-label Classification

  1. Multi-label classification using CNNText and pre-trained embeddings
  2. Multi-label classification using HAN(GRU) or HAN(LSTM) and pre-trained embeddings
  3. Multi-label classification using Longformer
  4. Multi-label classification using BioMed-RoBERTa
  5. Concatenated PubMedBERT (Triple-PubMedBERT) for multi-label classification of Cardiology

Others

  1. CD-plot and Nemenyi test using Python
  2. Visualising similar words using pre-trained embeddings

FastText pre-trained Embeddings - Downloads

Compressed files with both model bin and token vectors:

  • CBOW Models

    MIMIC50 (download zip file size: 1GB)

    T300 (download zip file size: 16GB)

  • Skip-gram Models

    T300SG (download zip file size: 16GB)

    T600SG (download zip file size: 34GB)
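
As a rough sketch of how the token vectors might be used once a zip file is unpacked, the snippet below parses vectors in fastText's plain-text .vec layout (a header line with the vocabulary size and dimension, then one token per line) and ranks nearest neighbours by cosine similarity. It is an assumption that the token vector files here follow that layout. The tiny in-memory sample, its values, and the helper names load_vec, cosine, and most_similar are hypothetical and are not part of this repository.

```python
import io
import math


def load_vec(handle):
    """Parse fastText .vec text: header 'count dim', then 'token v1 ... vd' per line."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(vectors, word, topn=2):
    """Return the topn tokens closest to `word`, best match first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical 3-dimensional sample standing in for a real token vector file.
sample = io.StringIO(
    "4 3\n"
    "fever 0.9 0.1 0.0\n"
    "pyrexia 0.8 0.2 0.1\n"
    "fracture 0.0 0.1 0.9\n"
    "cough 0.5 0.5 0.1\n"
)
vectors = load_vec(sample)
neighbours = most_similar(vectors, "fever")
print(neighbours)
```

For a real file, the same function could be called as load_vec(open(path, encoding="utf-8")), where path points at the unpacked token vector file. Files of the sizes listed above will need considerably more memory than this sample suggests.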

Open-source Frameworks

Transformer implementations are based on two open-source PyTorch transformer libraries: Huggingface Transformers and Simple Transformers.

Transformer models used include: BERT-base, Clinical BERT, BioMed-RoBERTa, PubMedBERT, MeDAL-Electra, Longformer, and TransformerXL.

The neural network models presented are implemented using PyTorch and Keras/TensorFlow.

Traditional classifiers such as logistic regression, random forest, and classifier chains use the implementations provided by the Waikato Environment for Knowledge Analysis (WEKA) framework for binary classification and by MEKA for multi-label classification.

Evaluations were done using sklearn metrics.
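
To illustrate what a multi-label evaluation with sklearn metrics can look like, here is a minimal sketch that scores a binary label-indicator matrix. The choice of micro-F1, macro-F1, and Hamming loss is an assumption made for illustration, since the text above does not name the metrics used, and the y_true and y_pred arrays are made-up toy values.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

# Toy data (hypothetical): 4 documents, 3 labels, one row per document.
y_true = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
])
y_pred = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
])

# Micro-averaging pools true/false positives over all labels before computing F1.
micro_f1 = f1_score(y_true, y_pred, average="micro")
# Macro-averaging computes F1 per label, then takes the unweighted mean.
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Hamming loss is the fraction of individual label assignments that are wrong.
h_loss = hamming_loss(y_true, y_pred)

print(f"micro-F1={micro_f1:.4f} macro-F1={macro_f1:.4f} hamming={h_loss:.4f}")
```

Micro-averaging is dominated by frequent labels, while macro-averaging weights every label equally, so the two can diverge sharply on problems with many rare labels.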