Implementation of our paper "Prognostically Relevant Subtypes and Survival Prediction for Breast Cancer Based on Multimodal Genomics Data", submitted to the IEEE Access journal, August 2019. In this implementation, a multimodal autoencoder (MAE) is used to predict the clinical status of breast cancer patients from multi-platform genomics data. The MAE is trained on DNA methylation, gene expression, and miRNA expression data, together with clinical outcomes, from The Cancer Genome Atlas (TCGA). Two prediction targets are supported:
- Breast cancer subtype, which is determined by the estrogen receptor (ER), progesterone receptor (PGR), and HER2/neu status
- Survival rate (0-1, with 1 being the best chance of survival).
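The exact architecture is described in the paper; purely as an illustration, a minimal multimodal autoencoder in Keras could look like the sketch below. The layer sizes, feature dimensions, and fusion strategy are assumptions, not the repository's actual settings.

```python
from tensorflow.keras.layers import Input, Dense, Concatenate
from tensorflow.keras.models import Model

# Hypothetical feature dimensions for the three omics modalities.
DIM_METHYLATION, DIM_GENE_EXPR, DIM_MIRNA = 20000, 15000, 500

def build_multimodal_autoencoder(latent_dim=128):
    """Encode each modality separately, fuse them into a shared latent
    code, and reconstruct every modality from that shared code."""
    in_meth = Input(shape=(DIM_METHYLATION,), name="methylation")
    in_expr = Input(shape=(DIM_GENE_EXPR,), name="gene_expression")
    in_mirna = Input(shape=(DIM_MIRNA,), name="mirna_expression")

    # Modality-specific encoders (sizes are illustrative).
    h_meth = Dense(512, activation="relu")(in_meth)
    h_expr = Dense(512, activation="relu")(in_expr)
    h_mirna = Dense(64, activation="relu")(in_mirna)

    # Shared latent representation used for the downstream classifier/regressor.
    shared = Dense(latent_dim, activation="relu", name="shared_code")(
        Concatenate()([h_meth, h_expr, h_mirna]))

    # Modality-specific decoders reconstruct the original inputs.
    out_meth = Dense(DIM_METHYLATION, activation="sigmoid", name="rec_methylation")(shared)
    out_expr = Dense(DIM_GENE_EXPR, activation="relu", name="rec_gene_expression")(shared)
    out_mirna = Dense(DIM_MIRNA, activation="relu", name="rec_mirna_expression")(shared)

    model = Model([in_meth, in_expr, in_mirna], [out_meth, out_expr, out_mirna])
    model.compile(optimizer="adam", loss="mse")
    return model
```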
Requirements:
- Python 3
- TensorFlow
- Keras
- Clone the repo using
git clone https://github.com/rezacsedu/MultimodalAE-BreastCancer.git
- Run the dataset creation program
python3 main_download.py -d DATASET_IDX
with DATASET_IDX being one of the indices below (a small download driver sketch follows the table):
| DATASET_IDX | Data Types | Data size (GB) |
|---|---|---|
| 1 | DNA Methylation | 148 |
| 2 | Gene Expression | 9 |
| 3 | miRNA Expression | 0.24 |
| 4 | Gene Expression + miRNA Expression | 10 |
| 5 | DNA Methylation + Gene Expression + miRNA Expression | 162 |
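If you want to fetch several datasets in one go, the command above can be wrapped in a small driver script. The sketch below only shells out to main_download.py; the index-to-modality mapping mirrors the table, everything else is illustrative.

```python
import subprocess

# Mapping from DATASET_IDX to the modalities it downloads (from the table above).
DATASETS = {
    1: "DNA Methylation",
    2: "Gene Expression",
    3: "miRNA Expression",
    4: "Gene Expression + miRNA Expression",
    5: "DNA Methylation + Gene Expression + miRNA Expression",
}

def download(dataset_idx: int) -> None:
    """Invoke the repository's download script for one dataset index."""
    print(f"Downloading dataset {dataset_idx}: {DATASETS[dataset_idx]}")
    subprocess.run(["python3", "main_download.py", "-d", str(dataset_idx)], check=True)

if __name__ == "__main__":
    # The miRNA expression data (~0.24 GB) is the quickest way to test the pipeline.
    download(3)
```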
- Run the neural network program
python3 main_run.py <options>
with the supported options listed below (a parsing sketch follows the table):
| Option | Values | Details | Required |
|---|---|---|---|
| -p PLATFORM, --platform PLATFORM | int [1-2] | [1] TensorFlow, [2] Theano | yes |
| -t TYPE, --type TYPE | int [1-2] | [1] Breast cancer type classification, [2] Survival rate regression | yes |
| -d DATASET, --dataset DATASET | int [1-15] | [1] DNA Methylation GPL8490, [2] DNA Methylation GPL16304, [3] Gene Expression Count, [4] Gene Expression FPKM, [5] Gene Expression FPKM-UQ, [6] miRNA Expression, [7] Gene Expression Count + miRNA Expression, [8] Gene Expression FPKM + miRNA Expression, [9] Gene Expression FPKM-UQ + miRNA Expression, [10] DNA Met GPL8490 + Gene Count + miRNA, [11] DNA Met GPL16304 + Gene Count + miRNA, [12] DNA Met GPL8490 + Gene FPKM + miRNA, [13] DNA Met GPL16304 + Gene FPKM + miRNA, [14] DNA Met GPL8490 + Gene FPKM-UQ + miRNA, [15] DNA Met GPL16304 + Gene FPKM-UQ + miRNA | yes |
| --pretrain_epoch PRE_EPOCH | int | Number of pre-training epochs. Default = 100 | no |
| --train_epoch TRAIN_EPOCH | int | Number of training epochs. Default = 100 | no |
| --batch BATCH | int | Batch size for pre-training and training. Default = 10 | no |
| --pre_lr PRE_LR | float | Pre-training learning rate. Default = 0.01 | no |
| --train_lr TRAIN_LR | float | Training learning rate. Default = 0.1 | no |
| --dropout DROPOUT | float | Dropout rate. Default = 0.2 | no |
| --pca PCA | int [1-2] | [1] Use PCA, [2] Don't use PCA. Default = [2] Don't use | no |
| --optimizer OPTIMIZER | int [1-3] | [1] Stochastic gradient descent, [2] RMSProp, [3] Adam. Default = [1] Stochastic gradient descent | no |
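For reference, the interface described in the table could be parsed with `argparse` roughly as in the following sketch. This is not the actual contents of main_run.py; the types, choices, and defaults are taken from the table, everything else is assumed.

```python
import argparse

def parse_args():
    """Parse the command-line interface described in the options table."""
    p = argparse.ArgumentParser(description="Multimodal AE for TCGA-BRCA")
    p.add_argument("-p", "--platform", type=int, choices=[1, 2], required=True,
                   help="1 = TensorFlow, 2 = Theano")
    p.add_argument("-t", "--type", type=int, choices=[1, 2], required=True,
                   help="1 = subtype classification, 2 = survival rate regression")
    p.add_argument("-d", "--dataset", type=int, choices=range(1, 16), required=True,
                   help="dataset index 1-15 (see the table above)")
    p.add_argument("--pretrain_epoch", type=int, default=100)
    p.add_argument("--train_epoch", type=int, default=100)
    p.add_argument("--batch", type=int, default=10)
    p.add_argument("--pre_lr", type=float, default=0.01)
    p.add_argument("--train_lr", type=float, default=0.1)
    p.add_argument("--dropout", type=float, default=0.2)
    p.add_argument("--pca", type=int, choices=[1, 2], default=2,
                   help="1 = use PCA, 2 = don't use PCA")
    p.add_argument("--optimizer", type=int, choices=[1, 2, 3], default=1,
                   help="1 = SGD, 2 = RMSProp, 3 = Adam")
    return p.parse_args()
```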
To perform breast cancer subtype classification on the dimension-reduced (PCA) DNA methylation dataset with the TensorFlow platform, issue the following command from the terminal:
python3 main_run.py --platform 1 --type 1 --dataset 1 --batch 10 --pretrain_epoch 5 --train_epoch 5 --pca 1 --optimizer 3
In the preceding command, 10 is the batch size, 5 is the number of pre-training epochs, 5 is the number of fine-tuning epochs, and 3 is the index of the Adam optimizer.
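As the sample output below suggests, training proceeds in two stages: each modality's autoencoder is first pre-trained on reconstruction, and the fused representation is then fine-tuned on the supervised target. A minimal sketch of that two-stage schedule is shown here; `autoencoders`, `classifier`, and the data arrays are placeholders for Keras models and NumPy arrays, not objects defined by this repository.

```python
def train(autoencoders, classifier, modality_data, labels,
          pretrain_epochs=5, train_epochs=5, batch_size=10):
    # Stage 1: unsupervised pre-training, one autoencoder per omics modality.
    for ae, x in zip(autoencoders, modality_data):
        for epoch in range(1, pretrain_epochs + 1):
            history = ae.fit(x, x, batch_size=batch_size, epochs=1, verbose=0)
            print(f">> Epoch {epoch} finished AE Reconstruction error "
                  f"{history.history['loss'][0]:.6f}")

    # Stage 2: supervised fine-tuning of the classifier/regressor head
    # built on top of the pre-trained encoders.
    for epoch in range(train_epochs):
        history = classifier.fit(modality_data, labels,
                                 batch_size=batch_size, epochs=1, verbose=0)
        print(f">> Epoch {epoch} finished Training loss "
              f"{history.history['loss'][0]:.6f}")
```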
Sample output for cancer type classification with DNA methylation platform GPL8490 on TensorFlow:
ER status prediction
-----------------------------
[START] Pre-training step:
>> Epoch 1 finished AE Reconstruction error 522.190925
>> Epoch 2 finished AE Reconstruction error 497.765570
>> Epoch 3 finished AE Reconstruction error 492.680869
>> Epoch 4 finished AE Reconstruction error 494.515497
>> Epoch 5 finished AE Reconstruction error 468.050771
>> Epoch 1 finished AE Reconstruction error 2680.144531
>> Epoch 2 finished AE Reconstruction error 2672.767578
>> Epoch 3 finished AE Reconstruction error 2691.162842
>> Epoch 4 finished AE Reconstruction error 2597.989502
>> Epoch 5 finished AE Reconstruction error 2758.419678
[END] Pre-training step
[START] Fine tuning step:
>> Epoch 0 finished Training loss 0.610027
>> Epoch 1 finished Training loss 0.594821
>> Epoch 2 finished Training loss 0.568818
>> Epoch 3 finished Training loss 0.564796
>> Epoch 4 finished Training loss 0.558171
[END] Fine tuning step
Accuracy: 0.8786260
Precision: 0.861820
Recall: 0.878625954
F1-score: 0.8692177
PGR status prediction
---------------------------------
[START] Pre-training step:
>> Epoch 1 finished AE Reconstruction error 422.876587
>> Epoch 2 finished AE Reconstruction error 393.641800
>> Epoch 3 finished AE Reconstruction error 377.866021
>> Epoch 4 finished AE Reconstruction error 368.311999
>> Epoch 5 finished AE Reconstruction error 380.356941
>> Epoch 1 finished AE Reconstruction error 2793.383789
>> Epoch 2 finished AE Reconstruction error 2742.516602
>> Epoch 3 finished AE Reconstruction error 2704.654785
>> Epoch 4 finished AE Reconstruction error 2839.105469
>> Epoch 5 finished AE Reconstruction error 2749.048584
[END] Pre-training step
[START] Fine tuning step:
>> Epoch 0 finished Training loss 0.921267
>> Epoch 1 finished Training loss 0.662474
>> Epoch 2 finished Training loss 0.674687
>> Epoch 3 finished Training loss 0.669110
>> Epoch 4 finished Training loss 0.739354
[END] Fine tuning step
Accuracy: 0.8694656
Precision: 0.848254
Recall: 0.869465648
F1-score: 0.8569493
HER2 status prediction
------------------------------
[START] Pre-training step:
>> Epoch 1 finished AE Reconstruction error 309.675462
>> Epoch 2 finished AE Reconstruction error 302.142036
>> Epoch 3 finished AE Reconstruction error 294.692107
>> Epoch 4 finished AE Reconstruction error 290.237393
>> Epoch 5 finished AE Reconstruction error 289.501104
>> Epoch 1 finished AE Reconstruction error 1846.207275
>> Epoch 2 finished AE Reconstruction error 1806.483032
>> Epoch 3 finished AE Reconstruction error 1898.162720
>> Epoch 4 finished AE Reconstruction error 1902.564453
>> Epoch 5 finished AE Reconstruction error 1867.702637
[END] Pre-training step
[START] Fine tuning step:
>> Epoch 0 finished Training loss 1.010514
>> Epoch 1 finished Training loss 0.988286
>> Epoch 2 finished Training loss 0.995581
>> Epoch 3 finished Training loss 0.987776
>> Epoch 4 finished Training loss 0.986907
[END] Fine tuning step
Accuracy: 0.8613043
Accuracy: 0.8612809
Precision: 0.875822
Recall: 0.861304347
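The output above reports accuracy, precision, recall, and F1-score for each receptor status. If you want to reproduce such metrics from a model's predictions, one common way (an assumption, not necessarily how this repository computes them) is scikit-learn with weighted averaging, which also explains why recall can coincide with accuracy:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def report(y_true, y_pred):
    """Print the same four metrics that appear in the sample output."""
    print(f"Accuracy:  {accuracy_score(y_true, y_pred):.6f}")
    print(f"Precision: {precision_score(y_true, y_pred, average='weighted'):.6f}")
    print(f"Recall:    {recall_score(y_true, y_pred, average='weighted'):.6f}")
    print(f"F1-score:  {f1_score(y_true, y_pred, average='weighted'):.6f}")

# Tiny usage example with dummy labels (positive/negative receptor status).
report([1, 0, 1, 1, 0], [1, 0, 1, 0, 0])
```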
If you already have the processed datasets and do not want to run main_download.py, add
MAIN_MDBN_TCGA_BRCA = "main_datasets_folder"
as the first line of these two files:
- /mdbn_tcga_brca/Tensorflow/dataset_location.py
- /mdbn_tcga_brca/Theano/dataset_location.py
with main_datasets_folder being the main folder of your datasets.
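For example, the first line of each file would then look like the following, where the path is a placeholder for your own dataset folder:

```python
# First line of mdbn_tcga_brca/Tensorflow/dataset_location.py (and the Theano copy).
# Replace the placeholder path with the folder that contains your processed datasets.
MAIN_MDBN_TCGA_BRCA = "/path/to/main_datasets_folder"
```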
If you use the code of this repository in your research, please consider citing the following paper:
@inproceedings{karim2019MAE,
title={Prognostically Relevant Subtypes and Survival Prediction for Breast Cancer Based on Multimodal Genomics Data},
author={Karim, Md Rezaul and Beyan, Deniz and Decker, Stefan},
booktitle={submitted to IEEE Access journal},
year={2019}
}
For any questions, feel free to open an issue or contact rezaul.karim@rwth-aachen.de.