Medical-Domain-Specific-Language-Models

This repository contains the code used for the following PhD thesis and publications:

Yogarajan, V. (2021). Domain-specific Language Models for Multi-label Classification of Medical Text. The University of Waikato. PhD Thesis. (under examination)
Yogarajan, V., Montiel, J., Smith, T., & Pfahringer, B. (2021). Transformers for Multi-label Classification of Medical Text: An Empirical Comparison. In: Tucker, A., Henriques Abreu, P., Cardoso, J., Pereira Rodrigues, P., Riaño, D. (eds) Artificial Intelligence in Medicine. AIME 2021. Lecture Notes in Computer Science, vol 12721. Springer, Cham.
Yogarajan, V., Gouk, H., Smith, T., Mayo, M., & Pfahringer, B. (2020). Comparing High Dimensional Word Embeddings Trained on Medical Text to Bag-of-Words for Predicting Medical Codes. In Asian Conference on Intelligent Information and Database Systems. Springer, Cham, pp. 97-108.
Yogarajan, V., Montiel, J., Smith, T., & Pfahringer, B. (2020). Seeing The Whole Patient: Using Multi-Label Medical Text Classification Techniques to Enhance Predictions of Medical Codes. arXiv preprint arXiv:2004.00430.
Yogarajan, V., Montiel, J., Smith, T., & Pfahringer, B. (2021). Concatenated BioMed-Transformers for Multi-label Classification of Medical Text. (under submission)
Yogarajan, V., Montiel, J., Smith, T., & Pfahringer, B. (2021). Predicting COVID-19 Patient Shielding: A Comprehensive Study. (under submission).

Multi-label Problems

Classification Problem	Dataset	Labels (L)	Instances	Dataset	Labels (L)	Instances
ICD-9 Level 1	MIMIC-III	18	52,722	eICU	18	154,808
ICD-9 Level 2	MIMIC-III	158	52,722	eICU	93	154,808
ICD-9 Level 3	MIMIC-III	923	52,722	eICU	316	154,808
Cardiovascular	MIMIC-III	30	28,154	eICU	15	53,477
COVID-19	MIMIC-III	42	35,458	eICU	25	34,387
Fungal or bacterial	MIMIC-III	73	30,814	eICU	42	54,193

Examples

Binary Classification

Multi-label Classification

Others

FastText pre-trained Embeddings - Downloads

Compressed files with both model bin and token vectors:

CBOW Models

MIMIC50 (download zip file size: 1GB)

T300 (download zip file size: 16GB)

Skip-gram Models

T300SG (download zip file size: 16GB)

T600SG (download zip file size: 34GB)
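
Once an archive is unpacked, the token vectors can be read without any extra dependencies. The sketch below is a minimal example and rests on one assumption: that the token-vector file uses the standard fastText `.vec` text layout (a header line with vocabulary size and dimension, then one token per line followed by its values). The function name `load_vec` and the demo file are hypothetical and are not part of this repository; the actual file names inside the archives are not stated here.

```python
import os
import tempfile


def load_vec(path, limit=None):
    """Read a fastText-style .vec text file into {token: [floats]}.

    Assumes the standard layout: a header line "<vocab_size> <dim>",
    then one line per token: "<token> <v1> <v2> ... <v_dim>".
    """
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        # Header gives the number of tokens and the vector dimension.
        _n_tokens, dim = map(int, fh.readline().split())
        for i, line in enumerate(fh):
            if limit is not None and i >= limit:
                break  # stop early; useful for very large files
            parts = line.rstrip().split(" ")
            token, values = parts[0], parts[1:]
            if len(values) != dim:
                continue  # skip lines that do not match the header
            vectors[token] = [float(v) for v in values]
    return vectors, dim


# Tiny made-up file so the sketch runs without downloading anything.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.vec")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write("2 3\nfever 0.1 0.2 0.3\ncough 0.4 0.5 0.6\n")
    vectors, dim = load_vec(demo_path)

print(dim, sorted(vectors))
```

For the multi-gigabyte archives listed above, passing `limit` keeps memory use manageable while inspecting the first few tokens. The model bin files need the fastText library itself and are not covered by this sketch.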

Open-source Frameworks

Transformer implementations are based on the open-source PyTorch-transformer repositories Huggingface and Simple Transformers.

Transformer models used include: BERT-base, Clinical BERT, BioMed-RoBERTa, PubMedBERT, MeDAL-Electra, Longformer, and TransformerXL.

Neural network models presented are implemented using PyTorch and Keras/TensorFlow.

Traditional classifiers such as logistic regression, random forest, and classifier chains use the implementations in the Waikato Environment for Knowledge Analysis (WEKA) framework for binary classification and in MEKA for multi-label classification.

Evaluations were done using sklearn metrics.
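
As an illustration of multi-label evaluation with sklearn metrics, the sketch below scores a toy prediction matrix. Which metrics the experiments actually report is not stated here; micro- and macro-averaged F1 are used only as common choices for multi-label problems, and the label matrices are made up for the example.

```python
import numpy as np
from sklearn.metrics import f1_score

# Rows are instances, columns are labels (multi-label indicator format).
# Toy data for illustration only, not taken from MIMIC-III or eICU.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 1]])

# Micro-averaging pools true/false positives over all labels,
# so frequent labels dominate the score.
micro_f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)

# Macro-averaging computes F1 per label and takes the unweighted mean,
# so rare labels count as much as frequent ones.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"micro-F1 = {micro_f1:.4f}, macro-F1 = {macro_f1:.4f}")
```

With many labels and a skewed label distribution, as in the ICD-9 Level 3 problem, the two averages can differ substantially, which is why both are often reported together.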

vithyayogarajan/Medical-Domain-Specific-Language-Models