This repository contains the code used for the following PhD thesis and publications:
- Yogarajan, V. (2021). Domain-specific Language Models for Multi-label Classification of Medical Text. PhD Thesis, The University of Waikato. (under examination)
- Yogarajan, V., Montiel, J., Smith, T., & Pfahringer, B. (2021). Transformers for Multi-label Classification of Medical Text: An Empirical Comparison. In: Tucker, A., Henriques Abreu, P., Cardoso, J., Pereira Rodrigues, P., Riaño, D. (eds) Artificial Intelligence in Medicine. AIME 2021. Lecture Notes in Computer Science, vol 12721. Springer, Cham. link
- Yogarajan, V., Gouk, H., Smith, T., Mayo, M., & Pfahringer, B. (2020). Comparing High Dimensional Word Embeddings Trained on Medical Text to Bag-of-Words for Predicting Medical Codes. In Asian Conference on Intelligent Information and Database Systems. Springer, Cham, pp. 97-108. pdf
- Yogarajan, V., Montiel, J., Smith, T., & Pfahringer, B. (2020). Seeing The Whole Patient: Using Multi-Label Medical Text Classification Techniques to Enhance Predictions of Medical Codes. arXiv preprint arXiv:2004.00430.
- Yogarajan, V., Montiel, J., Smith, T., & Pfahringer, B. (2021). Concatenated BioMed-Transformers for Multi-label Classification of Medical Text. (under submission)
- Yogarajan, V., Montiel, J., Smith, T., & Pfahringer, B. (2021). Predicting COVID-19 Patient Shielding: A Comprehensive Study. (under submission).
Classification Problem | Data | Labels | Instances | Data | Labels | Instances |
---|---|---|---|---|---|---|
ICD-9 Level 1 | MIMIC-III | 18 | 52,722 | eICU | 18 | 154,808 |
ICD-9 Level 2 | MIMIC-III | 158 | 52,722 | eICU | 93 | 154,808 |
ICD-9 Level 3 | MIMIC-III | 923 | 52,722 | eICU | 316 | 154,808 |
Cardiovascular | MIMIC-III | 30 | 28,154 | eICU | 15 | 53,477 |
COVID-19 | MIMIC-III | 42 | 35,458 | eICU | 25 | 34,387 |
Fungal or bacterial | MIMIC-III | 73 | 30,814 | eICU | 42 | 54,193 |
- Multi-label classification using CNNText and pre-trained embeddings
- Multi-label classification using HAN(GRU) or HAN(LSTM) and pre-trained embeddings
- Multi-label classification using Longformer
- Multi-label classification using BioMed-RoBERTa
- Multi-label classification of the cardiovascular data using concatenated PubMedBERT (Triple-PubMedBERT)
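All of the models above share the same multi-label output setup: one sigmoid-activated output per label, trained with binary cross-entropy. Below is a minimal PyTorch sketch of that setup for a CNNText-style model; the filter sizes, filter counts, and the `pretrained_embeddings` matrix are illustrative placeholders, not the exact configuration used in the thesis.

```python
import torch
import torch.nn as nn

class CNNText(nn.Module):
    """Minimal CNNText-style classifier for multi-label prediction.
    Hyperparameters are illustrative, not the thesis settings."""
    def __init__(self, pretrained_embeddings, num_labels,
                 filter_sizes=(3, 4, 5), num_filters=100):
        super().__init__()
        # Initialise the embedding layer from a pre-trained matrix
        # (e.g. one of the CBOW / skip-gram models listed below).
        self.embedding = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=False)
        emb_dim = pretrained_embeddings.size(1)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, k) for k in filter_sizes])
        self.classifier = nn.Linear(num_filters * len(filter_sizes), num_labels)

    def forward(self, token_ids):
        x = self.embedding(token_ids).transpose(1, 2)    # (batch, emb_dim, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        logits = self.classifier(torch.cat(pooled, dim=1))
        return logits                                    # one logit per label

# Multi-label training applies a sigmoid per label via BCEWithLogitsLoss:
# loss = nn.BCEWithLogitsLoss()(logits, multi_hot_targets.float())
```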
Compressed files containing both the model binary (`.bin`) and the token vectors:
- CBOW Models
  - MIMIC50 (download zip file size: 1GB)
  - T300 (download zip file size: 16GB)
- Skip-gram Models
  - T300SG (download zip file size: 16GB)
  - T600SG (download zip file size: 34GB)
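A minimal sketch of loading one of these embedding files with Gensim, assuming the `.bin` file is a word2vec-style binary and the archive has already been downloaded and extracted; the path below is a placeholder.

```python
from gensim.models import KeyedVectors

# Placeholder path: point this at the extracted .bin file from one of the
# archives above (e.g. T300 or T300SG).
vectors = KeyedVectors.load_word2vec_format("embeddings/T300.bin", binary=True)

print(vectors.vector_size)          # embedding dimensionality
print(vectors["sepsis"][:5])        # first few components of a token vector
```

If the files are fastText binaries rather than plain word2vec format, `gensim.models.fasttext.load_facebook_vectors` would be the appropriate loader instead.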
Transformer implementations are based on the open-source PyTorch transformer libraries Hugging Face Transformers and Simple Transformers.
Transformer models used include: BERT-base, Clinical BERT, BioMed-RoBERTa, PubMedBERT, MeDAL-Electra, Longformer, and Transformer-XL.
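As an illustration of this setup, the sketch below fine-tunes one of these models for multi-label classification with Simple Transformers; the checkpoint name, label count, training arguments, and example rows are placeholders rather than the exact configurations reported in the papers.

```python
import pandas as pd
from simpletransformers.classification import MultiLabelClassificationModel

# Each row: raw clinical text plus a multi-hot label vector
# (here 18 ICD-9 level-1 labels); the two rows are synthetic placeholders.
train_df = pd.DataFrame({
    "text": ["discharge summary text ...", "nursing note text ..."],
    "labels": [[1, 0, 1] + [0] * 15, [0, 1, 0] + [0] * 15],
})

model = MultiLabelClassificationModel(
    "roberta",                         # model type
    "allenai/biomed_roberta_base",     # assumed Hugging Face checkpoint name
    num_labels=18,
    args={"num_train_epochs": 1, "max_seq_length": 512},
)

model.train_model(train_df)
predictions, raw_outputs = model.predict(["new clinical note ..."])
```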
The neural network models presented are implemented using PyTorch and Keras/TensorFlow.
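For the Keras/TensorFlow models, the multi-label output layer is again a sigmoid-activated dense layer trained with binary cross-entropy; a minimal sketch, with illustrative layer sizes rather than the thesis configurations:

```python
import tensorflow as tf

num_labels = 18       # e.g. ICD-9 level-1 codes
vocab_size = 50000    # placeholder vocabulary size
emb_dim = 300         # placeholder embedding dimensionality

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, emb_dim),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(256, activation="relu"),
    # One sigmoid output per label: labels are predicted independently.
    tf.keras.layers.Dense(num_labels, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```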
Traditional classifiers such as logistic regression, random forest, and classifier chains use the implementations provided by the Waikato Environment for Knowledge Analysis (WEKA) framework for binary classification and by MEKA for multi-label classification.
Evaluations were performed using scikit-learn metrics.
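A minimal sketch of the kind of evaluation this refers to, using scikit-learn's multi-label metrics on binary indicator arrays; the toy labels and predictions are placeholders.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

# Toy multi-hot ground truth and predictions for 3 instances and 4 labels.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 1],
                   [1, 1, 0, 1]])

print("micro F1:", f1_score(y_true, y_pred, average="micro"))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
print("hamming loss:", hamming_loss(y_true, y_pred))
```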