DeepMicro

Deep representation learning for disease prediction based on microbiome data

DeepMicro is a deep representation learning framework exploiting various autoencoders to learn robust low-dimensional representations from high-dimensional data and training classification models based on the learned representation.

Quick Setup Guide

~$ conda install deepmicro
  • For GPU usage, install the GPU version of TensorFlow:
~$ conda install tensorflow-gpu==1.13.1

Run DeepMicro, printing out its usage.

~$ python DM.py -h

Quick Start Guide

Make sure you have already gone through the Quick Setup Guide above.

Learning representation with your own data

1. Copy your data into the /data directory. Your data should be a comma-separated values (CSV) file without a header or index, where each row represents a sample and each column represents a microbe. We will assume that your file is named UserDataExample.csv, which is already provided.
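The expected input format can be sketched by generating a dummy data file with NumPy. The 80 x 200 shape mirrors the provided UserDataExample.csv; the random values are placeholders for real abundance data.

```python
# Sketch: write a dummy data file in the format DeepMicro expects
# (80 samples x 200 features, comma-separated, no header, no index).
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((80, 200))          # placeholder values; real data would be microbial abundances
np.savetxt("UserDataExample.csv", X, delimiter=",", fmt="%.6f")

# Verify the file loads back with the expected shape
loaded = np.loadtxt("UserDataExample.csv", delimiter=",")
print(loaded.shape)                # (80, 200)
```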

2. Check that your data can be loaded successfully and verify its shape with the following command.

~$ python DM.py -r 1 --no_clf -cd UserDataExample.csv

The output will show the number of rows and columns right next to X_train.shape. Our data UserDataExample.csv contains 80 rows and 200 columns.

Using TensorFlow backend.
Namespace(act='relu', ae=False, ae_lact=False, ae_oact=False, aeloss='mse', cae=False, custom_data='UserDataExample.csv', custom_data_labels=None, data=None, dataType='float64', data_dir='', dims='50', max_epochs=2000, method='all', no_clf=True, numFolds=5, numJobs=-2, patience=20, pca=False, repeat=1, rf_rate=0.1, rp=False, save_rep=False, scoring='roc_auc', seed=0, st_rate=0.25, svm_cache=1000, vae=False, vae_beta=1.0, vae_warmup=False, vae_warmup_rate=0.01)
X_train.shape:  (80, 200)
Classification task has been skipped.

3. Suppose that we want to reduce the number of dimensions of our data from 200 to 20 using a shallow autoencoder. Note that the --save_rep argument will save the learned representation (for the complete data set, not just the training set) under the /results folder.

~$ python DM.py -r 1 --no_clf -cd UserDataExample.csv --ae -dm 20 --save_rep
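Conceptually, this step maps each 200-feature sample to a 20-dimensional vector. The idea can be sketched with PCA as a linear stand-in for the shallow autoencoder (PCA is also one of DeepMicro's own options, via --pca); the random input below is a placeholder.

```python
# Conceptual sketch: reducing 200 features to 20 dimensions.
# PCA stands in for the shallow autoencoder here - a linear analogue,
# not DeepMicro's actual model.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((80, 200))                       # placeholder for the input data
rep = PCA(n_components=20, random_state=0).fit_transform(X)
print(rep.shape)                                # (80, 20)
```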

4. Suppose that we want to use a deep autoencoder with two hidden layers of 100 and 40 units, respectively, and a latent layer of size 20. We will inspect the structure of the deep autoencoder first.

~$ python DM.py -r 1 --no_clf -cd UserDataExample.csv --ae -dm 100,40,20 --no_trn

It looks fine. Now, run the model and get the learned representation.

~$ python DM.py -r 1 --no_clf -cd UserDataExample.csv --ae -dm 100,40,20 --save_rep

5. We can try a variational autoencoder and a convolutional autoencoder as well. Note that you can see a detailed description of each argument by using the -h argument.

~$ python DM.py -r 1 --no_clf -cd UserDataExample.csv --vae -dm 100,20 --save_rep
~$ python DM.py -r 1 --no_clf -cd UserDataExample.csv --cae -dm 100,50,1 --save_rep

Conducting binary classification after learning representation with your own data

1. Copy your data file and label file into the /data directory. Your data file should be in comma-separated values (CSV) format without a header or index, where each row represents a sample and each column represents a microbe. Your label file should contain a binary value (0 or 1) on each line, and the number of lines should be equal to the number of samples in your data file. We will assume that your data file is named UserDataExample.csv and your label file UserLabelExample.csv, both of which are already provided.
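A matching dummy label file can be sketched the same way: one 0/1 value per line, with as many lines as there are samples in the data file.

```python
# Sketch: write a dummy binary label file matching an 80-sample data file
# (one 0/1 label per line, 80 lines total).
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=80)                 # placeholder labels
np.savetxt("UserLabelExample.csv", y, fmt="%d")

# Verify: one label per sample, values restricted to 0 and 1
labels = np.loadtxt("UserLabelExample.csv")
print(labels.shape)                             # (80,)
```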

2. Check that your data can be loaded successfully and verify its shape with the following command.

~$ python DM.py -r 1 --no_clf -cd UserDataExample.csv -cl UserLabelExample.csv

Our data UserDataExample.csv consists of 80 samples, each of which has 200 features. The data will be split into a training set and a test set (in an 8:2 ratio). The output will show the number of rows and columns for each set.

Namespace(act='relu', ae=False, ae_lact=False, ae_oact=False, aeloss='mse', cae=False, custom_data='UserDataExample.csv', custom_data_labels='UserLabelExample.csv', data=None, dataType='float64', data_dir='', dims='50', max_epochs=2000, method='all', no_clf=True, no_trn=False, numFolds=5, numJobs=-2, patience=20, pca=False, repeat=1, rf_rate=0.1, rp=False, save_rep=False, scoring='roc_auc', seed=0, st_rate=0.25, svm_cache=1000, vae=False, vae_beta=1.0, vae_warmup=False, vae_warmup_rate=0.01)
X_train.shape:  (64, 200)
y_train.shape:  (64,)
X_test.shape:  (16, 200)
y_test.shape:  (16,)
Classification task has been skipped.
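The 8:2 split above behaves like scikit-learn's train_test_split, sketched here on stand-in arrays (DeepMicro's internal call may differ in details such as stratification):

```python
# Sketch: an 8:2 train/test split of 80 samples yields 64 training
# and 16 test samples, matching the shapes printed above.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.default_rng(0).random((80, 200))        # stand-in data
y = np.random.default_rng(1).integers(0, 2, size=80)  # stand-in labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)                    # (64, 200) (16, 200)
```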

3. Suppose that we want to directly apply the SVM algorithm to our data without representation learning. Remove the --no_clf flag and specify the classification method with the -m svm argument (if you don't specify a classification algorithm, all three algorithms will be run).

~$ python DM.py -r 1 -cd UserDataExample.csv -cl UserLabelExample.csv -m svm

The result will be saved under the /results folder as UserDataExample_result.txt. This file grows as you conduct more experiments, as new results are appended to it.

4. You can learn a representation first, and then apply the SVM algorithm to the learned representation.

~$ python DM.py -r 1 -cd UserDataExample.csv -cl UserLabelExample.csv --ae -dm 20 -m svm

4.1. You can reload a stored representation, and then apply the SVM algorithm to it.

~$ python DM.py -r 1 -cd UserDataExample.csv -cl UserLabelExample.csv --load_rep results/PCA_UserDataExample_rep.csv -m svm
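The classification stage run here can be sketched with scikit-learn: a grid-searched SVM scored by cross-validated ROC AUC, which is DeepMicro's default scoring metric (-m svm, scoring='roc_auc'). Synthetic arrays stand in for the stored representation and labels, and the hyperparameter grid below is illustrative, not DeepMicro's exact grid.

```python
# Sketch: grid-searched SVM on a learned representation, scored by ROC AUC.
# Synthetic data stands in for a loaded results/..._rep.csv file.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((80, 20))                 # stand-in for a 20-dim representation
y = rng.integers(0, 2, size=80)          # stand-in for binary labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}  # illustrative values only
clf = GridSearchCV(SVC(probability=True), grid, cv=5, scoring="roc_auc")
clf.fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 3))   # held-out ROC AUC
```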

5. You can repeat the same experiment by changing the seed used for random partitioning of the training and test sets. Suppose we want to repeat the classification task five times. You can do this by putting 5 into the -r argument.

~$ python DM.py -r 5 -cd UserDataExample.csv -cl UserLabelExample.csv --ae -dm 20 -m svm
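The repetition idea can be sketched as a loop that re-partitions the data with a different seed on each run, analogous to what -r 5 does (the seed handling here is illustrative, not DeepMicro's exact scheme):

```python
# Sketch: five repetitions, each with a freshly seeded 8:2 partition.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.default_rng(0).random((80, 200))        # stand-in data
y = np.random.default_rng(1).integers(0, 2, size=80)  # stand-in labels

splits = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=seed)
    splits.append((X_tr.shape, X_te.shape))
print(splits[0])                                      # ((64, 200), (16, 200))
```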