To learn more about SUPREME, read our paper at: https://www.biorxiv.org/content/10.1101/2022.08.03.502682v1
SUPREME
SUPREME (a subtype prediction methodology) is an integrative node classification framework that utilizes graph convolutions on multiple datatype-specific networks annotated with multiomics datasets as node features. The framework is model-agnostic and can be applied to any classification problem with properly processed datatypes and networks. In our work, SUPREME was applied specifically to the breast cancer subtype prediction problem by applying convolution on patient similarity networks constructed from multiple biological datasets of breast tumor samples.
First, SUPREME generates network-specific patient embeddings from each datatype separately. Then, using those embeddings, it performs cancer subtype prediction on every combination of the embeddings and reports the evaluation results.
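As a rough illustration of this integration step (a minimal sketch, not the SUPREME source code), the network-specific embeddings could be concatenated column-wise for every datatype combination and fed to a downstream classifier. The `Emb_*.csv` names follow the result files documented below; the label loading and the single train/test split are simplified for this sketch.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Sketch only: evaluate a classifier on every combination of the
# network-specific patient embeddings produced by SUPREME.
datatypes = ["clinical", "cna", "exp"]
emb = {d: pd.read_csv("SUPREME_sample_data_results/Emb_%s.csv" % d, index_col=0)
       for d in datatypes}
y = np.asarray(pd.read_pickle("data/sample_data/labels.pkl")).ravel()  # one label per sample

for r in range(1, len(datatypes) + 1):
    for combo in combinations(datatypes, r):
        X = pd.concat([emb[d] for d in combo], axis=1).values  # concatenated embeddings
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=0, stratify=y)
        clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=1000).fit(X_tr, y_tr)
        macro_f1 = f1_score(y_te, clf.predict(X_te), average="macro")
        print(list(combo), "test macro-f1 = %.3f" % macro_f1)
```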
Adjust the following hyperparameters (lines 10-13, 23), e.g. as in the snippet after this list:
- Reduce `max_epochs`, `xtimes`, and `xtimes2` to a small number such as 2.
- Keep only one element in `learning_rates`, such as [0.01].
- Keep only one element in `hidden_sizes`, such as [32].
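For instance, a quick-test configuration could look like this (illustrative values following the suggestions above; the exact line each variable sits on may differ between versions):

```python
# quick-test settings for SUPREME.py (illustrative)
max_epochs = 2
xtimes = 2
xtimes2 = 2
learning_rates = [0.01]
hidden_sizes = [32]
```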
Use `SUPREME.py` to run SUPREME.
Parameter:
- `-data`: specifies the data location to use under the `data` folder (default is `sample_data`).
Example runs:
- `python SUPREME.py`: runs SUPREME under the 'data/sample_data' folder
- `python SUPREME.py -data user_defined_data`: runs SUPREME under the 'data/user_defined_data' folder
Sample console output:
SUPREME is setting up!
SUPREME is running..
It took 488.5 seconds for node embedding generation (12 trials for 3 separate GCNs).
SUPREME is integrating the embeddings..
Combination 0 ['clinical'] > selected parameters = {'hidden_layer_sizes': (256,)}, train accuracy = 0.948+-0.088, train weighted-f1 = 0.946+-0.092, train macro-f1 = 0.929+-0.157, test accuracy = 0.808+-0.047, test weighted-f1 = 0.782+-0.052, test macro-f1 = 0.571+-0.09
Combination 1 ['cna'] > selected parameters = {'hidden_layer_sizes': (64, 32)}, train accuracy = 0.95+-0.084, train weighted-f1 = 0.949+-0.09, train macro-f1 = 0.917+-0.153, test accuracy = 0.815+-0.074, test weighted-f1 = 0.793+-0.091, test macro-f1 = 0.603+-0.143
Combination 2 ['exp'] > selected parameters = {'hidden_layer_sizes': (256,)}, train accuracy = 0.935+-0.067, train weighted-f1 = 0.934+-0.073, train macro-f1 = 0.9+-0.136, test accuracy = 0.815+-0.026, test weighted-f1 = 0.798+-0.032, test macro-f1 = 0.609+-0.06
Combination 3 ['clinical', 'cna'] > selected parameters = {'hidden_layer_sizes': (512, 32)}, train accuracy = 0.987+-0.052, train weighted-f1 = 0.986+-0.056, train macro-f1 = 0.986+-0.107, test accuracy = 0.846+-0.035, test weighted-f1 = 0.829+-0.039, test macro-f1 = 0.674+-0.09
Combination 4 ['clinical', 'exp'] > selected parameters = {'hidden_layer_sizes': (128,)}, train accuracy = 0.93+-0.099, train weighted-f1 = 0.927+-0.11, train macro-f1 = 0.883+-0.187, test accuracy = 0.808+-0.051, test weighted-f1 = 0.784+-0.051, test macro-f1 = 0.598+-0.099
Combination 5 ['cna', 'exp'] > selected parameters = {'hidden_layer_sizes': (32,)}, train accuracy = 0.943+-0.066, train weighted-f1 = 0.943+-0.074, train macro-f1 = 0.912+-0.133, test accuracy = 0.831+-0.031, test weighted-f1 = 0.825+-0.037, test macro-f1 = 0.702+-0.095
Combination 6 ['clinical', 'cna', 'exp'] > selected parameters = {'hidden_layer_sizes': (256,)}, train accuracy = 0.979+-0.034, train weighted-f1 = 0.979+-0.035, train macro-f1 = 0.974+-0.06, test accuracy = 0.846+-0.033, test weighted-f1 = 0.836+-0.041, test macro-f1 = 0.718+-0.094
It took 532.6 seconds in total.
SUPREME is done.
Files under the `sample_data` folder (under the `data` folder):
- `labels.pkl`: Labels of the ordered samples (the ith row has the label of the sample with index i). The first column is the label, ranging from 0 to {number of subtypes}-1. The first row contains the column name.
- Input features: the ith row has the feature values of the sample with index i. (Column names and row names are present, even though they are not used.)
  - `clinical.pkl`: 257 samples (rows) x 10 normalized clinical features (columns)
  - `cna.pkl`: 257 samples (rows) x 250 normalized copy number aberration features (columns)
  - `exp.pkl`: 257 samples (rows) x 250 normalized gene expression features (columns)
- Input networks: the first column contains row names; the second and third columns contain the sample indexes of the sample-sample pairs having interactions, and the fourth column contains the weight of the interaction. (See the snippet after this list for a quick way to inspect these files.)
  - `edges_clinical.pkl`: Clinical-based patient similarity network
  - `edges_cna.pkl`: Copy number aberration-based patient similarity network
  - `edges_exp.pkl`: Gene expression-based patient similarity network
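The sample files can be inspected directly with pandas, for example (a minimal sketch assuming the repository layout above):

```python
import pandas as pd

clinical = pd.read_pickle("data/sample_data/clinical.pkl")               # 257 samples x 10 features
edges_clinical = pd.read_pickle("data/sample_data/edges_clinical.pkl")   # patient-patient edges

print(clinical.shape)         # (257, 10)
print(edges_clinical.head())  # sample index pairs and interaction weights
```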
Files under the `SUPREME_sample_data_results` folder:
- `Emb_clinical.csv`: Clinical-based patient embedding
- `Emb_cna.csv`: Copy number aberration-based patient embedding
- `Emb_exp.csv`: Gene expression-based patient embedding
- `module.py`: Graph Convolutional Neural Network-related module.
- `SUPREME_simNetworkGen.R`: R code to generate similarity networks from the given datatypes (a conceptual Python sketch follows).
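For intuition only, one common way to build such a patient similarity network is to compute pairwise similarities between patients and keep the top-k neighbors of each patient as weighted edges. The sketch below illustrates that idea with absolute Pearson correlation; the repository's networks are generated by `SUPREME_simNetworkGen.R`, which may use different similarity measures and cutoffs.

```python
import numpy as np
import pandas as pd

def topk_similarity_edges(features, k=3):
    """Conceptual sketch: weighted top-k patient similarity edges from a
    samples-by-features DataFrame (not the SUPREME_simNetworkGen.R method)."""
    sim = np.abs(np.corrcoef(features.values))   # patient-by-patient similarity
    np.fill_diagonal(sim, -np.inf)               # exclude self-similarity
    rows = []
    for i in range(sim.shape[0]):
        for j in np.argsort(sim[i])[::-1][:k]:   # k most similar patients
            rows.append((i, int(j), float(sim[i, j])))
    return pd.DataFrame(rows, columns=["patient1", "patient2", "weight"])

# Example: edges = topk_similarity_edges(pd.read_pickle("data/sample_data/exp.pkl"))
```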
- Adjust the following variables (lines 2-7); an example configuration block follows this list:
  - `addRawFeat`: True or False. If True, raw features from the datatypes listed in `features_to_integrate` will be integrated during prediction; if False, no raw features will be integrated (default is True).
  - `base_path`: the path to the SUPREME GitHub folder.
  - `dataset_name`: the data folder name in `base_path` containing the required input data to run SUPREME.
  - `feature_networks_integration`: list of the datatypes to integrate as raw features.
  - `node_networks`: list of the datatypes to use (should have at least one datatype).
  - `int_method`: integration method used during subtype prediction. Options are 'MLP' for Multi-layer Perceptron, 'XGBoost' for XGBoost, 'RF' for Random Forest, and 'SVM' for Support Vector Machine (default is 'MLP').
  - `feature_selection_per_network`: a list of True or False values. If True, the corresponding `top_features_per_network` features are selected by the feature selection algorithm; if False, all features are used for integration. (The order of `feature_selection_per_network` and `top_features_per_network` matches the order of `node_networks`.)
  - `top_features_per_network`: list of numbers. If the corresponding `feature_selection_per_network` is True and the corresponding `top_features_per_network` is less than the number of input features, the feature selection algorithm will be applied for that network. (The order of `feature_selection_per_network` and `top_features_per_network` matches the order of `node_networks`.)
  - `boruta_top_features`: the number of top raw features to be integrated as raw features if `optional_feat_selection` and `addRawFeat` are True; otherwise ignored.
  - `optional_feat_selection`: True or False. If True, the top `boruta_top_features` features from each combination of integrated networks are added as raw features; if False, all the raw features are added to the embedding. (Considered only if `addRawFeat` is True.)
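As an illustration, a configuration block using these variables might look like the following for the sample data (the values are examples only, not required settings):

```python
addRawFeat = True
base_path = "/path/to/SUPREME/"                     # path to the SUPREME GitHub folder
dataset_name = "sample_data"                        # data folder under base_path/data
feature_networks_integration = ["clinical", "cna", "exp"]
node_networks = ["clinical", "cna", "exp"]          # at least one datatype
int_method = "MLP"                                  # 'MLP', 'XGBoost', 'RF', or 'SVM'
feature_selection_per_network = [False, False, False]
top_features_per_network = [50, 50, 50]
boruta_top_features = 50
optional_feat_selection = False
```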
- Adjust the following hyperparameters (lines 10-13, 19, 23-25); a conceptual sketch of how they interact follows this list:
  - `max_epochs`: maximum number of epochs (default is 500).
  - `min_epochs`: minimum number of epochs (default is 200).
  - `patience`: patience for early stopping (default is 30).
  - `learning_rates`: list of learning rates for tuning the GCN hyperparameters (default is [0.01, 0.001, 0.0001]).
  - `hidden_sizes`: list of hidden sizes for tuning the GCN hyperparameters (default is [32, 64, 128, 256]).
  - `xtimes`: the number of SUPREME runs used to select the best hyperparameter combination during randomized-search hyperparameter tuning (default is 50; should be more than 1).
  - `xtimes2`: the number of SUPREME runs for the selected hyperparameter combination, used to generate the median statistics (default is 10).
  - `boruta_runs`: the number of times Boruta runs to determine feature significance (default is 100; should be more than 1). (Considered only if `addRawFeat` and `optional_feat_selection` are True, or if any value in `feature_selection_per_network` is True.)
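The sketch below is a self-contained, conceptual illustration (not the SUPREME implementation) of how these settings interact: `xtimes` random (learning rate, hidden size) combinations are tried, and each is trained for at most `max_epochs` epochs with early stopping once at least `min_epochs` epochs have passed and the validation loss has not improved for `patience` epochs. The `train_gcn` helper is a hypothetical stand-in that only simulates a loss curve.

```python
import random

max_epochs, min_epochs, patience = 500, 200, 30   # defaults listed above
learning_rates = [0.01, 0.001, 0.0001]
hidden_sizes = [32, 64, 128, 256]
xtimes = 50                                       # randomized-search trials

def train_gcn(lr, hidden_size):
    """Placeholder: yields one (noisy, decreasing) validation loss per epoch."""
    loss = 1.0
    for _ in range(max_epochs):
        loss = loss * 0.995 + random.uniform(0.0, 0.01)
        yield loss

best = None
for _ in range(xtimes):
    lr, hid = random.choice(learning_rates), random.choice(hidden_sizes)
    best_val, wait = float("inf"), 0
    for epoch, val_loss in enumerate(train_gcn(lr, hid), start=1):
        if val_loss < best_val:
            best_val, wait = val_loss, 0
        else:
            wait += 1
        if epoch >= min_epochs and wait >= patience:
            break                                  # early stopping
    if best is None or best_val < best[0]:
        best = (best_val, lr, hid)

print("selected hyperparameters:", {"learning_rate": best[1], "hidden_size": best[2]})
```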
- `base_path` should contain a folder named `dataset_name` (referred to as the data folder hereafter) under the `data` folder.
- `node_networks` holds the list of datatype names that will be used for the SUPREME run. These names are user-defined, but should be consistent across all file names.
- In the data folder, there should be one label file named `labels.pkl`.
  - `labels.pkl`: <class 'torch.Tensor'> with the shape of torch.Size([{sample size}])
- In addition, the data folder will contain two `.pkl` files per datatype:
  - `{datatype name}.pkl`: <class 'pandas.core.frame.DataFrame'> with the shape of ({sample size}, {selected feature size for that datatype})
  - `edges_{datatype name}.pkl`: <class 'pandas.core.frame.DataFrame'> with the shape of ({number of patient-patient pair interactions for this datatype}, 3). The first and second columns contain the patient indexes of the patient-patient pairs having interactions, and the third column contains the weight of the interaction.
- The data folder might have a file named `mask_values.pkl` (<class 'list'>) if the user wants to specify test samples. If `mask_values.pkl` does not exist in the data folder, SUPREME will generate the train and test splits. If added, `mask_values.pkl` needs to have two variables in it:
  - `train_valid_idx`: <class 'numpy.ndarray'> with the shape of ({number of samples for training and validation},) containing the sample indexes for training and validation.
  - `test_idx`: <class 'numpy.ndarray'> with the shape of ({number of samples for test},) containing the sample indexes for test.
!! Note that the sample size and the order of the samples should be the same across all variables. Sample indexes should run from 0 to {sample size}-1, consistent with the sample order.
- `labels.pkl` will have the labels of the ordered samples (the ith value has the label of the sample with index i).
- `{datatype name}.pkl` will have the values of the ordered samples in each datatype; the feature size can be datatype-specific (the ith row has the feature values of the sample with index i).
- `edges_{datatype name}.pkl` will have the matching sample indexes to represent interactions.
- `train_valid_idx` and `test_idx` will contain the matching sample indexes. (A preparation sketch follows this list.)
Relevant package versions in the environment:
# Name Version Build Channel
cpuonly 2.0 0 pytorch
numpy 1.19.2 py36hadc3359_0
pandas 1.1.5 pypi_0 pypi
pickle5 0.0.12 pypi_0 pypi
pip 21.3.1 pypi_0 pypi
python 3.6.13 h3758d61_0
python-dateutil 2.8.2 pypi_0 pypi
pytorch 1.10.2 py3.6_cpu_0 pytorch
pytorch-mutex 1.0 cpu pytorch
rpy2 3.4.5 pypi_0 pypi
scikit-learn 0.24.2 pypi_0 pypi
torch-geometric 2.0.3 pypi_0 pypi
torch-scatter 2.0.9 pypi_0 pypi
torch-sparse 0.6.12 pypi_0 pypi
torchaudio 0.10.2 py36_cpu [cpuonly] pytorch
torchvision 0.11.3 py36_cpu [cpuonly] pytorch
xgboost 1.5.2 pypi_0 pypi